Conversation

@pestopoppa

Summary

Add OpenMP parallelization to tensor repack functions to significantly speed up model loading on many-core CPUs.

Motivation

On systems with many cores (e.g., AMD EPYC), tensor repacking for AVX-512 optimization was single-threaded and became a significant bottleneck during model loading. The repack functions convert quantized tensors from storage layout to SIMD-optimized interleaved layout.
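To make the layout change concrete, here is a minimal sketch of an x4 interleave. The `blk` struct and its 18-byte size are hypothetical stand-ins for illustration only; the real functions operate on ggml's quantized block types such as block_q4_0 and their x4/x8 interleaved counterparts.

```cpp
#include <cstring>

// Hypothetical fixed-size block, standing in for a ggml quantized block.
struct blk { unsigned char d[18]; };

// Storage layout:     row0: b0 b1 b2 ...   row1: b0 b1 b2 ...   ...
// Interleaved layout: r0.b0 r1.b0 r2.b0 r3.b0  r0.b1 r1.b1 ...
// so one SIMD pass can consume the same block column of four rows at once.
void interleave_x4(blk *dst, const blk *src, int nrow, int nblk) {
    for (int r = 0; r + 4 <= nrow; r += 4) {       // one 4-row group at a time
        for (int b = 0; b < nblk; b++) {
            for (int k = 0; k < 4; k++) {
                dst[r * nblk + b * 4 + k] = src[(r + k) * nblk + b];
            }
        }
    }
}
```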

Benchmark Results

Measured on AMD EPYC 9655 "Turin" (96 cores, 192 threads):

| Model Size | Before | After | Speedup |
|------------|--------|-------|---------|
| 6.8GB Q4_K | 5.0s   | 3.3s  | 1.5x    |
| 19GB Q4_K  | 11.9s  | 5.3s  | 2.2x    |

Speedup increases with model size, as repack time increasingly dominates I/O time.

Changes

  • Convert pointer-increment loops to explicit indexing (parallelizable)
  • Add #pragma omp parallel for to outer loops
  • Move thread-local dst_tmp arrays inside parallel region
  • Each thread processes independent row groups with no synchronization needed (see the sketch after this list)
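A sketch of the loop transformation, under the same simplified assumptions as above (hypothetical blk/blk4 types; the real code packs ggml's quantized blocks rather than copying raw bytes):

```cpp
#include <cstring>

// Hypothetical block types for illustration; the real functions use
// ggml's quantized blocks (block_q4_0, block_q4_0x4, ...).
struct blk  { unsigned char d[18]; };
struct blk4 { unsigned char d[4 * 18]; };

// Before: a single running output pointer carries state between
// iterations, so the outer loop cannot be handed to multiple threads.
void repack_x4_serial(blk4 *dst, const blk *src, int nrow, int nblk) {
    blk4 *out = dst;
    for (int r = 0; r < nrow; r += 4) {
        for (int b = 0; b < nblk; b++) {
            blk4 dst_tmp;
            for (int k = 0; k < 4; k++)
                std::memcpy(dst_tmp.d + 18 * k, src[(r + k) * nblk + b].d, 18);
            *out++ = dst_tmp;                      // loop-carried pointer
        }
    }
}

// After: the output index is computed from the loop counters, every
// 4-row group is independent, and the scratch block is declared inside
// the parallel region so each thread gets its own copy.
void repack_x4_parallel(blk4 *dst, const blk *src, int nrow, int nblk) {
    #pragma omp parallel for
    for (int r = 0; r < nrow; r += 4) {
        for (int b = 0; b < nblk; b++) {
            blk4 dst_tmp;                          // thread-local scratch
            for (int k = 0; k < 4; k++)
                std::memcpy(dst_tmp.d + 18 * k, src[(r + k) * nblk + b].d, 18);
            dst[(r / 4) * nblk + b] = dst_tmp;     // explicit indexing
        }
    }
}
```

Compiled with -fopenmp, the outer loop is split across threads; without it the pragma is ignored and the function remains a correct serial implementation.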

Functions Parallelized

  • repack_q4_0_to_q4_0_4_bl - Q4_0 x4 interleave
  • repack_q4_K_to_q4_K_8_bl - Q4_K models (most common)
  • repack_q2_K_to_q2_K_8_bl - Q2_K models
  • repack_q4_0_to_q4_0_8_bl - Q4_0 x8 interleave
  • repack_iq4_nl_to_iq4_nl_4_bl - IQ4_NL x4
  • repack_iq4_nl_to_iq4_nl_8_bl - IQ4_NL x8

Testing

  • Verified outputs match the original implementation (see the sketch after this list)
  • Tested on multiple Q4_K_M models
  • Build verified with GCC 13.3.0 + OpenMP
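A sketch of the kind of byte-for-byte equivalence check this implies, assuming the hypothetical repack_x4_serial/repack_x4_parallel variants from the sketch above (the real tests compare outputs of the actual repack functions):

```cpp
#include <cassert>
#include <cstring>
#include <random>
#include <vector>

// Assumes blk, blk4, repack_x4_serial and repack_x4_parallel
// from the sketch in the "Changes" section above.

int main() {
    const int nrow = 64, nblk = 32;                // arbitrary test shape
    std::vector<blk> src(nrow * nblk);
    std::mt19937 rng(42);
    for (auto &b : src)
        for (auto &byte : b.d) byte = (unsigned char)rng();

    std::vector<blk4> ref(nrow / 4 * nblk), par(nrow / 4 * nblk);
    repack_x4_serial(ref.data(), src.data(), nrow, nblk);
    repack_x4_parallel(par.data(), src.data(), nrow, nblk);

    // The parallel output must be byte-identical to the serial reference.
    assert(std::memcmp(ref.data(), par.data(), ref.size() * sizeof(blk4)) == 0);
    return 0;
}
```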

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Dec 21, 2025
pestopoppa added a commit to pestopoppa/amd-epyc-inference that referenced this pull request Dec 21, 2025
Key changes:
- patches/: OpenMP parallelization of tensor repack functions
  - PR submitted: ggml-org/llama.cpp#18239
  - Measured: 19GB model loads in 5.3s vs 11.9s (2.2x faster)

- scripts/lib/executor.py: Remove OMP_NUM_THREADS=1
  - Enables parallel repack in benchmark scripts
  - Also improves prompt processing 2.4x (49 → 119 t/s)

- README.md: Add modded llama.cpp fork reference
  - Fork: https://github.com/pestopoppa/llama.cpp
  - Instructions for reproducing the setup

- CLAUDE.md: Document fork and patches directory
- research/RESULTS_SUMMARY.md: Add parallel repack section
- orchestration/progress/PROGRESS_2025-12-21.md: Progress report

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>