@gatbontonpc gatbontonpc commented Dec 23, 2025

Add metal count equal op

This PR extends the CPU implementation of count_equal to Metal.

The current implementation uses a single threadgroup, but it would support multiple threadgroups if that changes. It matches the CPU and CUDA implementations, which accept only int32 for src0 and src1. The kernel uses atomic_fetch_add_explicit, which only supports int32 adds here, similar to CUDA. This limits the size of the buffers we can take in to 2^31 - 1.
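The limit follows from the accumulator width: in the worst case every element matches, so the count can be as large as the number of elements. Below is a minimal sketch of a host-side guard, assuming the limit is measured in elements. `count_fits_in_i32` is a hypothetical name used for illustration and is not part of ggml.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not a ggml function): returns true when a match count
// over n_elements elements is guaranteed to fit in a signed 32-bit accumulator.
// Assumption: the worst case is that every element pair compares equal.
bool count_fits_in_i32(int64_t n_elements) {
    assert(n_elements >= 0);
    return n_elements <= std::numeric_limits<int32_t>::max();
}
```

A backend could use a check of this kind to reject inputs that are too large, rather than letting the counter overflow.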

The docs have been updated.

Codex-generated summary:

Summary

This PR introduces a Metal implementation for COUNT_EQUAL on int32 tensors that uses SIMD-group reduction to efficiently compute per-threadgroup partial counts and accumulate the result into the destination buffer using atomic operations.

The change improves parallel efficiency over a naïve per-element atomic approach by:

  • Performing the equality comparison per thread
  • Reducing results within a SIMD group via simd_sum
  • Emitting a single atomic update per SIMD group
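The three steps above can be sketched on the CPU as follows. This is not the Metal kernel from the PR. It is a plain C++ stand-in in which an inner loop plays the role of `simd_sum`, the group width of 32 is an assumption, and `count_equal_grouped` is a hypothetical name.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed SIMD-group width, for illustration only.
constexpr std::size_t kSimdWidth = 32;

// CPU stand-in for the reduction scheme: every "lane" compares one element
// pair, the lanes of a group are summed (the role simd_sum plays on the GPU),
// and each group issues exactly one atomic add into the shared total.
// If n_atomics is non-null, it receives the number of atomic adds issued.
int32_t count_equal_grouped(const std::vector<int32_t>& a,
                            const std::vector<int32_t>& b,
                            std::size_t* n_atomics = nullptr) {
    assert(a.size() == b.size());

    // The accumulator must start at zero, otherwise stale data is counted.
    std::atomic<int32_t> total{0};
    std::size_t atomics = 0;

    for (std::size_t base = 0; base < a.size(); base += kSimdWidth) {
        const std::size_t end = std::min(base + kSimdWidth, a.size());

        // Per-lane comparison followed by the in-group reduction.
        int32_t group_sum = 0;
        for (std::size_t i = base; i < end; ++i) {
            group_sum += (a[i] == b[i]) ? 1 : 0;
        }

        // One atomic update per group instead of one per element.
        total.fetch_add(group_sum, std::memory_order_relaxed);
        ++atomics;
    }

    if (n_atomics != nullptr) {
        *n_atomics = atomics;
    }
    return total.load(std::memory_order_relaxed);
}
```

For 100 elements this sketch issues 4 atomic adds rather than 100, which is the saving the summary describes.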

Key Changes

  • Added a templated Metal kernel kernel_count_equal<int32_t>
  • Uses shared memory (shmem_i32) and SIMD intrinsics (simd_sum) to aggregate counts
  • Emits a single atomic_fetch_add_explicit per SIMD group
  • Registers kernel under the exported symbol:
    kernel_count_equal_i32

@github-actions github-actions bot added the documentation, ggml, and Apple Metal labels Dec 23, 2025
Comment on lines 4137 to 4140
const size_t smem = pipeline.smem;
int64_t z = 0;
ggml_backend_tensor_set(op, &z, 0, sizeof(z));

Member

This does not work; you need to call a separate kernel that fills the buffer with zeros.

Author

@gatbontonpc gatbontonpc Dec 23, 2025

Added a new kernel that memsets a buffer to a value. It is similar to fill, but it has a simpler pipeline and takes only the buffer and the value.

gatbontonpc and others added 2 commits December 22, 2025 23:18
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>