ggml-cpu: parallelize tensor repacking with OpenMP #18239
+89
−55
Summary
Add OpenMP parallelization to tensor repack functions to significantly speed up model loading on many-core CPUs.
Motivation
On systems with many cores (e.g., AMD EPYC), tensor repacking for the AVX-512 path was single-threaded and became a significant bottleneck during model loading. The repack functions convert quantized tensors from their storage layout to a SIMD-optimized interleaved layout.
Benchmark Results
Measured on an AMD EPYC 9655 "Turin" (96 cores, 192 threads). Speedup increases with model size, as repack time comes to dominate over I/O.
Changes
- Added `#pragma omp parallel for` to the outer loops
- Moved the `dst_tmp` arrays inside the parallel region so each thread gets private scratch storage
Functions Parallelized
- `repack_q4_0_to_q4_0_4_bl` - Q4_0 x4 interleave
- `repack_q4_K_to_q4_K_8_bl` - Q4_K models (most common)
- `repack_q2_K_to_q2_K_8_bl` - Q2_K models
- `repack_q4_0_to_q4_0_8_bl` - Q4_0 x8 interleave
- `repack_iq4_nl_to_iq4_nl_4_bl` - IQ4_NL x4
- `repack_iq4_nl_to_iq4_nl_8_bl` - IQ4_NL x8
Testing