Conversation

@pestopoppa

Summary

Add OpenMP parallelization to tensor repack functions to significantly speed up model loading on many-core CPUs.

Motivation

On systems with many cores (e.g., AMD EPYC), tensor repacking for AVX-512 optimization was single-threaded and became a significant bottleneck during model loading. The repack functions convert quantized tensors from storage layout to SIMD-optimized interleaved layout.
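To make the layout change concrete, here is a minimal sketch of an x4 interleave. The `blk` struct and its 18-byte size are hypothetical stand-ins for illustration only; the real functions operate on ggml's quantized block types such as block_q4_0 and their x4/x8 interleaved counterparts.

```cpp
#include <cstring>

// Hypothetical fixed-size block, standing in for a ggml quantized block.
struct blk { unsigned char d[18]; };

// Storage layout:     row0: b0 b1 b2 ...   row1: b0 b1 b2 ...   ...
// Interleaved layout: r0.b0 r1.b0 r2.b0 r3.b0  r0.b1 r1.b1 ...
// so one SIMD pass can consume the same block column of four rows at once.
void interleave_x4(blk *dst, const blk *src, int nrow, int nblk) {
    for (int r = 0; r + 4 <= nrow; r += 4) {       // one 4-row group at a time
        for (int b = 0; b < nblk; b++) {
            for (int k = 0; k < 4; k++) {
                dst[r * nblk + b * 4 + k] = src[(r + k) * nblk + b];
            }
        }
    }
}
```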

Benchmark Results

Measured on AMD EPYC 9655 "Turin" (96 cores, 192 threads):

| Model Size | Before | After | Speedup |
|------------|--------|-------|---------|
| 6.8GB Q4_K | 5.0s   | 3.3s  | 1.5x    |
| 19GB Q4_K  | 11.9s  | 5.3s  | 2.2x    |

Speedup increases with model size, as repack time increasingly dominates I/O time.

Changes

  • Convert pointer-increment loops to explicit indexing (parallelizable)
  • Add #pragma omp parallel for to outer loops
  • Move thread-local dst_tmp arrays inside parallel region
  • Each thread processes independent row groups with no synchronization needed (see the sketch after this list)
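A sketch of the loop transformation, under the same simplified assumptions as above (hypothetical blk/blk4 types; the real code packs ggml's quantized blocks rather than copying raw bytes):

```cpp
#include <cstring>

// Hypothetical block types for illustration; the real functions use
// ggml's quantized blocks (block_q4_0, block_q4_0x4, ...).
struct blk  { unsigned char d[18]; };
struct blk4 { unsigned char d[4 * 18]; };

// Before: a single running output pointer carries state between
// iterations, so the outer loop cannot be handed to multiple threads.
void repack_x4_serial(blk4 *dst, const blk *src, int nrow, int nblk) {
    blk4 *out = dst;
    for (int r = 0; r < nrow; r += 4) {
        for (int b = 0; b < nblk; b++) {
            blk4 dst_tmp;
            for (int k = 0; k < 4; k++)
                std::memcpy(dst_tmp.d + 18 * k, src[(r + k) * nblk + b].d, 18);
            *out++ = dst_tmp;                      // loop-carried pointer
        }
    }
}

// After: the output index is computed from the loop counters, every
// 4-row group is independent, and the scratch block is declared inside
// the parallel region so each thread gets its own copy.
void repack_x4_parallel(blk4 *dst, const blk *src, int nrow, int nblk) {
    #pragma omp parallel for
    for (int r = 0; r < nrow; r += 4) {
        for (int b = 0; b < nblk; b++) {
            blk4 dst_tmp;                          // thread-local scratch
            for (int k = 0; k < 4; k++)
                std::memcpy(dst_tmp.d + 18 * k, src[(r + k) * nblk + b].d, 18);
            dst[(r / 4) * nblk + b] = dst_tmp;     // explicit indexing
        }
    }
}
```

Compiled with -fopenmp, the outer loop is split across threads; without it the pragma is ignored and the function remains a correct serial implementation.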

Functions Parallelized

  • repack_q4_0_to_q4_0_4_bl - Q4_0 x4 interleave
  • repack_q4_K_to_q4_K_8_bl - Q4_K models (most common)
  • repack_q2_K_to_q2_K_8_bl - Q2_K models
  • repack_q4_0_to_q4_0_8_bl - Q4_0 x8 interleave
  • repack_iq4_nl_to_iq4_nl_4_bl - IQ4_NL x4
  • repack_iq4_nl_to_iq4_nl_8_bl - IQ4_NL x8

Testing

  • Verified outputs match the original implementation (see the sketch after this list)
  • Tested on multiple Q4_K_M models
  • Build verified with GCC 13.3.0 + OpenMP
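A sketch of the kind of byte-for-byte equivalence check this implies, assuming the hypothetical repack_x4_serial/repack_x4_parallel variants from the sketch above (the real tests compare outputs of the actual repack functions):

```cpp
#include <cassert>
#include <cstring>
#include <random>
#include <vector>

// Assumes blk, blk4, repack_x4_serial and repack_x4_parallel
// from the sketch in the "Changes" section above.

int main() {
    const int nrow = 64, nblk = 32;                // arbitrary test shape
    std::vector<blk> src(nrow * nblk);
    std::mt19937 rng(42);
    for (auto &b : src)
        for (auto &byte : b.d) byte = (unsigned char)rng();

    std::vector<blk4> ref(nrow / 4 * nblk), par(nrow / 4 * nblk);
    repack_x4_serial(ref.data(), src.data(), nrow, nblk);
    repack_x4_parallel(par.data(), src.data(), nrow, nblk);

    // The parallel output must be byte-identical to the serial reference.
    assert(std::memcmp(ref.data(), par.data(), ref.size() * sizeof(blk4)) == 0);
    return 0;
}
```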

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Dec 21, 2025
pestopoppa added a commit to pestopoppa/amd-epyc-inference that referenced this pull request Dec 21, 2025
Key changes:
- patches/: OpenMP parallelization of tensor repack functions
  - PR submitted: ggml-org/llama.cpp#18239
  - Measured: 19GB model loads in 5.3s vs 11.9s (2.2x faster)

- scripts/lib/executor.py: Remove OMP_NUM_THREADS=1
  - Enables parallel repack in benchmark scripts
  - Also improves prompt processing 2.4x (49 → 119 t/s)

- README.md: Add modded llama.cpp fork reference
  - Fork: https://github.com/pestopoppa/llama.cpp
  - Instructions for reproducing the setup

- CLAUDE.md: Document fork and patches directory
- research/RESULTS_SUMMARY.md: Add parallel repack section
- orchestration/progress/PROGRESS_2025-12-21.md: Progress report

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>