[Mirror] Support Youtu-VL Model #67
base: master
Conversation
📝 Walkthrough
Adds end-to-end UTUVL vision support (conversion, GGUF metadata, tensor mappings, projector, graph, preprocessing, tokenization), updates QwenVL/wa-layer handling, DeepseekV2 tie-word-embeddings handling, Unicode \p{...} parsing, and build/CMake entries for UTUVL.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant Registry as ModelRegistry
    participant Converter as HF→GGUF
    participant GGUF as GGUFWriter
    participant Loader as mtmdLoader
    participant Projector as clip_graph_utuvl
    participant Preproc as ImagePreproc
    User->>Registry: request "utu-vl" model
    Registry->>Converter: instantiate UtuVLVisionModel
    Converter->>GGUF: add_vision_wa_layers / add_vision_window_size -> write GGUF metadata
    Converter->>Loader: produce tensors + mapped names
    Loader->>Projector: select PROJECTOR_TYPE_UTUVL
    Projector->>Preproc: request patches / window config
    Preproc->>Projector: deliver image tensor (merged/reshaped)
    Projector->>Projector: transformer pass (Q/K/V, window mask, FFN, VLPatchMerger)
    Projector->>User: return vision embeddings
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
convert_hf_to_gguf.py (1)
7247-7344: OUTPUT tensor bypass of quantization is intentional but undocumented; embedding naming assumption is reasonable for DeepseekV2.

The tied-embeddings implementation in `DeepseekV2Model` is functionally correct but has two points worth clarifying:

- Quantization bypass for OUTPUT is real: The code calls `self.gguf_writer.add_tensor(output_name, self._token_embd.numpy())` directly, which bypasses the quantization pipeline in `ModelBase.prepare_tensors()`. This means OUTPUT will be stored at the original tensor precision (likely F16) rather than following the model's `ftype` setting (e.g., MOSTLY_Q8_0). This appears intentional for tied-embedding models to preserve logit precision, but should be documented if that's the design goal.
- Embedding naming assumption is reasonable: The check for `"model.embed_tokens.weight"` is the correct naming for DeepseekV2 architecture variants. If future variants use different naming (e.g., without the `model.` prefix), the OUTPUT tensor would not be written, but this is unlikely for the DeepseekV2/UTUVL family.

If high-precision OUTPUT tensors are intended for quality, consider adding a comment explaining this trade-off. If OUTPUT should follow quantization settings instead, capture the quantized tensor from the parent `prepare_tensors()` call rather than the raw `_token_embd`.
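To make the trade-off concrete, here is a small self-contained Python sketch, not the actual converter code: the stub writer stands in for `gguf_writer`, and `write_tied_output` is a hypothetical helper mirroring the `add_tensor` call quoted above. Writing the raw embedding this way keeps OUTPUT at its source precision instead of following the model's `ftype`.

```python
import numpy as np

# Stub stand-in for the GGUF writer: add_tensor() stores data verbatim,
# which is exactly why the tied OUTPUT tensor skips the ftype quantization pipeline.
class StubWriter:
    def __init__(self) -> None:
        self.tensors: dict[str, np.ndarray] = {}

    def add_tensor(self, name: str, data: np.ndarray) -> None:
        self.tensors[name] = data  # no quantization applied here

def write_tied_output(writer: StubWriter, hf_tensors: dict[str, np.ndarray]) -> None:
    # naming assumption discussed above: DeepseekV2 variants use this key
    token_embd = hf_tensors.get("model.embed_tokens.weight")
    if token_embd is not None:
        writer.add_tensor("output.weight", token_embd)  # written as-is

w = StubWriter()
write_tied_output(w, {"model.embed_tokens.weight": np.zeros((8, 4), dtype=np.float16)})
print(w.tensors["output.weight"].dtype)  # float16 -> precision preserved
```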
🧹 Nitpick comments (8)
src/llama-vocab.cpp (1)
1869-1873: Minor: Trailing whitespace on line 1873.

The tokenizer configuration looks correct with `clean_spaces = false` and `ignore_merges = true`, consistent with similar pre-tokenizers like llama3 and tekken.

🔎 Proposed fix

```diff
     } else if (tokenizer_pre == "utu-vl") {
         pre_type = LLAMA_VOCAB_PRE_TYPE_UTU_VL;
         clean_spaces = false;
-        ignore_merges = true; 
+        ignore_merges = true;
```

tools/mtmd/models/utuvl.cpp (2)
160-160: Inconsistent epsilon value - uses hardcoded 1e-6 instead of member `eps`.

The RMS norm for the merger uses a hardcoded `1e-6` epsilon, while other norms in this file use the member variable `eps`. If this is intentional (perhaps the merger requires a different epsilon), consider adding a comment to clarify.

🔎 Proposed fix (if same eps is intended)

```diff
-    embeddings = build_norm(embeddings, model.mm_input_norm_w, nullptr, NORM_TYPE_RMS, 1e-6, -1);
+    embeddings = build_norm(embeddings, model.mm_input_norm_w, nullptr, NORM_TYPE_RMS, eps, -1);
```
173-185: Remove commented-out dead code.

This block was replaced by the `build_ffn` call above. Commented-out code should be removed to improve maintainability.

🔎 Proposed fix

```diff
     embeddings = build_ffn(embeddings, model.mm_0_w, model.mm_0_b, nullptr, nullptr, model.mm_1_w, model.mm_1_b, FFN_GELU, -1);

-    // // 3. First linear layer
-    // embeddings = ggml_mul_mat(ctx0, model.mm_0_w, embeddings);
-    // embeddings = ggml_add(ctx0, embeddings, model.mm_0_b);
-    // cb(embeddings, "merger_fc1", -1);
-
-    // // 4. GELU activation
-    // embeddings = ggml_gelu(ctx0, embeddings);
-    // cb(embeddings, "merger_gelu", -1);
-
-    // // 5. Second linear layer
-    // embeddings = ggml_mul_mat(ctx0, model.mm_1_w, embeddings);
-    // embeddings = ggml_add(ctx0, embeddings, model.mm_1_b);

     // build the graph
```

tools/mtmd/clip.cpp (3)
1165-1175: Consider aligning warning messages with QWEN25VL for consistency.

The warning for UTUVL is more terse compared to QWEN25VL, which provides additional context about the `--image-min-tokens` flag and a reference link. Also, QWEN25VL loads `n_wa_pattern` from config while UTUVL hardcodes window attention behavior in the batch encoder. If window attention pattern should be configurable, consider loading it from the model config similar to QWEN25VL.

🔎 Suggested improvement for warning message

```diff
     const int warn_min_pixels = 1024 * hparams.n_merge * hparams.n_merge * hparams.patch_size * hparams.patch_size;
     if (hparams.image_min_pixels < warn_min_pixels) {
         LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
+        LOG_WRN("%s: if you encounter problems with accuracy, try adding --image-min-tokens 1024\n", __func__);
     }
```
2710-2760: Potential edge cases in scaling loop.

A few observations about the preprocessing logic:

- Loop termination: The `while (scale > 0.0f)` loop could theoretically reach very small or zero scale values for extremely large images. Consider adding a minimum scale threshold to prevent potential issues.
- Fallback value: When `params.image_max_pixels` is 0, `max_num_patches` becomes 0, which would always trigger the scale reduction.
- Comment accuracy: The comment mentions "binary search" but the implementation is a linear search with 0.02 decrements.

🔎 Suggested defensive bounds

```diff
-    // Binary search for optimal scale to fit within max_num_patches
+    // Linear search for optimal scale to fit within max_num_patches
     float scale = 1.0f;
     int target_height = original_size.height;
     int target_width  = original_size.width;

     auto get_scaled_image_size = [align_size](float scale, int size) -> int {
         float scaled_size = size * scale;
         // Round up to nearest multiple of align_size
         int aligned = static_cast<int>(std::ceil(scaled_size / align_size)) * align_size;
         // Ensure at least one patch
         return std::max(align_size, aligned);
     };

-    // Binary search with 0.02 step size
-    while (scale > 0.0f) {
+    // Linear search with 0.02 step size
+    const float min_scale = 0.02f;
+    while (scale >= min_scale) {
         target_height = get_scaled_image_size(scale, original_size.height);
         target_width  = get_scaled_image_size(scale, original_size.width);
         int num_patches_h = target_height / patch_size;
         int num_patches_w = target_width / patch_size;
         int num_patches   = num_patches_h * num_patches_w;

         if (num_patches > max_num_patches) {
             scale -= 0.02f;
         } else {
             break;
         }
     }
```
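For readers who want to poke at the loop behaviour outside the C++ code, here is a pure-Python restatement of the suggested version above; the function name and the example values are illustrative only, not taken from clip.cpp.

```python
import math

# Shrink `scale` in 0.02 steps until the align_size-rounded dimensions fit
# within max_num_patches; the floor keeps the loop from running to zero.
def fit_image(width: int, height: int, patch_size: int, align_size: int,
              max_num_patches: int) -> tuple[int, int]:
    def scaled(scale: float, size: int) -> int:
        aligned = math.ceil(size * scale / align_size) * align_size
        return max(align_size, aligned)  # keep at least one aligned unit

    scale = 1.0
    min_scale = 0.02
    while scale >= min_scale:
        w, h = scaled(scale, width), scaled(scale, height)
        if (w // patch_size) * (h // patch_size) <= max_num_patches:
            return w, h
        scale -= 0.02  # linear, not binary, search
    return scaled(min_scale, width), scaled(min_scale, height)

print(fit_image(4096, 3072, patch_size=14, align_size=28, max_num_patches=1024))
```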
3426-3476: Dead code and hardcoded window attention.

- Dead code: Since `use_window_attn` is hardcoded to `true` (line 3428), the `else` branch at lines 3472-3475 will never execute.
- Consider making window attention configurable: Unlike QWEN25VL, which reads `n_wa_pattern` from model config to determine window attention behavior, UTUVL hardcodes it. If this is intentional (UTUVL always requires window attention), consider removing the dead else branch or adding a comment explaining this design choice.

🔎 Remove dead code

```diff
     case PROJECTOR_TYPE_UTUVL:
         {
-            const bool use_window_attn = true;
             const int merge_ratio = 2;
             const int pw  = image_size_width  / patch_size / merge_ratio; // patches after merger
             const int ph  = image_size_height / patch_size / merge_ratio;
             const int ipw = image_size_width  / patch_size; // internal patches in ViT
             const int iph = image_size_height / patch_size;

             std::vector<int> idx    (ph * pw);
             std::vector<int> inv_idx(ph * pw);

-            if (use_window_attn) {
+            // UTUVL always uses window attention
+            {
                 const int attn_window_size = patch_size * 2 * 8;
                 // ... rest of window attention code ...
-            } else {
-                for (int i = 0; i < ph * pw; i++) {
-                    idx[i] = i;
-                }
             }
```

convert_hf_to_gguf.py (2)
7138-7237: DeepseekV2 MoE GGUF metadata wiring for UTUVL is reasonable but a bit implicit.

- Registering `"UTUVLForCausalLM"` on `DeepseekV2Model` matches how other multimodal families register both text and mmproj variants under the same HF arch name; no issue there.
- In `set_gguf_parameters`, the new MoE section:
  - Falls back from `moe_intermediate_size` to `intermediate_size` (or `0`) for `expert_feed_forward_length`, which keeps GGUF consistent for both MoE and dense-only DeepseekV2-style configs.
  - Writes `expert_count`, `expert_shared_count` (default `0`), `expert_weights_scale` (default `1.0`), and `expert_weights_norm` when `norm_topk_prob` is truthy. This mirrors how other MoE models in this file are handled and should be safe.

Two minor nits you might want to consider (not blockers):

- If a model is not MoE but happens to have an `intermediate_size`, you'll still emit `expert_feed_forward_length` with a non-zero value; if that's unintended, consider guarding all MoE metadata behind a single `if hparams.get("n_routed_experts") is not None:` block.
- `add_expert_weights_norm` is called with `hparams["norm_topk_prob"]`, which appears to be boolean in Deepseek configs; if any future variant uses this field differently, this call might need to be revisited.

Given this mirrors upstream behavior, I'd leave it as-is unless you see non-MoE DeepseekV2 models hitting this path.
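As a plain-Python illustration of the fallback order being discussed (the real code calls the corresponding `gguf_writer.add_expert_*` helpers; the `n_shared_experts` and `routed_scaling_factor` key names below are assumptions, only the fallback structure is the point):

```python
# Sketch of the MoE metadata fallbacks; returns a dict instead of calling the
# GGUF writer so it runs standalone.
def moe_metadata(hparams: dict) -> dict:
    meta = {}
    # emitted even for dense-only configs -- the nit raised above
    meta["expert_feed_forward_length"] = hparams.get(
        "moe_intermediate_size", hparams.get("intermediate_size", 0))
    if (n_experts := hparams.get("n_routed_experts")) is not None:
        meta["expert_count"] = n_experts
        meta["expert_shared_count"] = hparams.get("n_shared_experts", 0)          # assumed key
        meta["expert_weights_scale"] = hparams.get("routed_scaling_factor", 1.0)  # assumed key
        if hparams.get("norm_topk_prob"):
            meta["expert_weights_norm"] = True
    return meta

print(moe_metadata({"intermediate_size": 11008}))  # dense config still gets a value
```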
10501-10541: UtuVLVisionModel implementation looks sound; small polish opportunities only.

Behavior:

- Registration under both `"UtuVLForConditionalGeneration"` and `"UTUVLForCausalLM"` on an `MmprojModel` mirrors how other VL families (e.g. Qwen2VL) expose both text and vision variants; this is consistent.
- `set_gguf_parameters`:
  - Calls `super().set_gguf_parameters()` so all generic vision/audio metadata is preserved.
  - Sets projector type to `UTUVL`.
  - Uses `self.hparams["layer_norm_eps"]` (inherited from `MmprojModel`, which rebinds `self.hparams` to the vision config) and a tolerant activation check:
    - Treats several GELU aliases (`gelu_pytorch_tanh`, etc.) as `vision_use_gelu=True`.
    - Supports `silu` via `vision_use_silu=True`.
    - Raises `ValueError` for anything else, which is an appropriate hard-fail given unknown activations would break runtime.
  - Writes `vision_spatial_merge_size` with a default of `2` when missing, which is at least explicit and will not crash.
- `modify_tensors`:
  - Cleanly drops obvious language-model tensors (`lm_head.`, `model.layers.`, `model.embed_tokens.`, `model.norm.`).
  - For everything else, delegates to `self.map_tensor_name(name)` and, on failure, logs a warning and silently drops the tensor. That's in line with other vision subclasses that prefer "best-effort + warning" over hard failure.

Optional polish (not required):

- Ruff's TRY003/TRY300 hints here are stylistic only:
  - The `ValueError(f"Unsupported activation function for UTUVL: {hidden_act}")` could be shortened slightly, but it's already clear and localized.
  - The `try/except ValueError` in `modify_tensors` is fine as written; splitting the `return []` into an `else` block would be marginally clearer but not necessary.

From a functional standpoint, this class fits well with the existing multimodal patterns in this file.
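For reference, a stand-alone restatement of the metadata/activation mapping described above, returning a plain dict; in the converter these values go through the writer helpers listed in the code-graph section (`add_clip_projector_type`, `add_vision_use_gelu`, `add_vision_use_silu`, `add_vision_spatial_merge_size`, plus `add_vision_wa_layers`/`add_vision_window_size` when present). The exact set of GELU aliases and the config key names other than `layer_norm_eps` are assumptions.

```python
# Illustrative mapping from a vision config dict to the GGUF-side flags.
GELU_ALIASES = {"gelu", "gelu_new", "gelu_fast", "gelu_pytorch_tanh", "quick_gelu"}

def utuvl_vision_metadata(vision_cfg: dict) -> dict:
    meta = {
        "projector_type": "utuvl",
        "attention_layer_norm_eps": vision_cfg["layer_norm_eps"],
        "spatial_merge_size": vision_cfg.get("spatial_merge_size", 2),  # default 2
    }
    act = vision_cfg["hidden_act"]
    if act in GELU_ALIASES:
        meta["use_gelu"] = True
    elif act == "silu":
        meta["use_silu"] = True
    else:
        # hard-fail on unknown activations, as the class does
        raise ValueError(f"Unsupported activation for UTUVL: {act}")
    return meta

print(utuvl_vision_metadata({"layer_norm_eps": 1e-6, "hidden_act": "gelu_pytorch_tanh"}))
```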
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- src/llama-vocab.cpp
- src/llama-vocab.h
- src/unicode.cpp
- tools/mtmd/CMakeLists.txt
- tools/mtmd/clip-impl.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/models.h
- tools/mtmd/models/utuvl.cpp
- tools/mtmd/mtmd.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- src/llama-vocab.h
- tools/mtmd/models/models.h
- src/llama-vocab.cpp
- tools/mtmd/models/utuvl.cpp
- tools/mtmd/mtmd.cpp
- src/unicode.cpp
- tools/mtmd/clip.cpp
- tools/mtmd/clip-impl.h
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- src/llama-vocab.h
- gguf-py/gguf/constants.py
- tools/mtmd/models/models.h
- src/llama-vocab.cpp
- tools/mtmd/models/utuvl.cpp
- tools/mtmd/mtmd.cpp
- src/unicode.cpp
- tools/mtmd/clip.cpp
- gguf-py/gguf/tensor_mapping.py
- tools/mtmd/clip-impl.h
- convert_hf_to_gguf.py
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Always activate the Python virtual environment in `.venv` and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in `.flake8`
Ensure Python code passes pyright type checking as configured in `pyrightconfig.json`
Files:
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- convert_hf_to_gguf.py
src/**/*.cpp
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Prioritize performance optimization in core library implementations in `src/`, as this is a performance-critical inference library
Files:
- src/llama-vocab.cpp
- src/unicode.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
- tools/mtmd/CMakeLists.txt
- tools/mtmd/mtmd.cpp
- tools/mtmd/clip.cpp
- tools/mtmd/clip-impl.h
🧬 Code graph analysis (3)
tools/mtmd/models/models.h (1)
tools/mtmd/models/utuvl.cpp (2)
build (3-190), build (3-3)
tools/mtmd/models/utuvl.cpp (1)
ggml/src/ggml.c (9)
ggml_reshape_4d (3583-3601), ggml_permute (3700-3752), ggml_cont_4d (3492-3508), ggml_mul_mat (3174-3189), ggml_add (1969-1974), ggml_reshape_2d (3546-3562), ggml_set_name (1841-1848), ggml_cast (3433-3445), ggml_build_forward_expand (6793-6795)
convert_hf_to_gguf.py (3)
gguf-py/gguf/gguf_writer.py (10)
add_expert_feed_forward_length (714-715), add_expert_count (800-801), add_expert_shared_count (806-807), add_expert_weights_scale (815-816), add_expert_weights_norm (818-819), add_clip_projector_type (1083-1084), add_vision_attention_layernorm_eps (1104-1105), add_vision_use_gelu (1122-1123), add_vision_use_silu (1125-1126), add_vision_spatial_merge_size (1119-1120)
convert_lora_to_gguf.py (1)
modify_tensors (448-473)
gguf-py/gguf/constants.py (1)
VisionProjectorType (3414-3435)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
10521-10521: Avoid specifying long messages outside the exception class
(TRY003)
10536-10536: Consider moving this statement to an else block
(TRY300)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
- GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
- GitHub Check: build-cmake-pkg / linux
- GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
- GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
- GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
- GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
- GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: macOS-latest-cmake-visionos
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: ios-xcode-build
- GitHub Check: macOS-latest-cmake-ios
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: macOS-latest-cmake-arm64
🔇 Additional comments (19)
src/unicode.cpp (2)
967-971: LGTM! Unicode letter subcategories correctly added. The addition of Lu, Ll, Lt, Lm, and Lo subcategories mapping to the LETTER flag is correct and follows the existing pattern. The inline comments clearly document each subcategory type.
1082-1103: LGTM! Dynamic \p{...} pattern resolution is well-implemented. The replacement of fixed-length pattern matching with dynamic resolution correctly:
- Searches for the closing brace dynamically (Line 1087)
- Validates the pattern length with a reasonable limit (Line 1088)
- Conditionally inserts brackets to avoid nesting when not inside a character class (Lines 1091-1098)
- Advances the loop index correctly past the matched pattern (Line 1099)
The boundary checks are sound, and the logic handles varying-length Unicode property patterns correctly.
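A Python restatement of those steps, for readability only; the real logic lives in src/unicode.cpp, and the 32-character cap, the placeholder expansion, and the exact bracket-wrapping rule below are illustrative assumptions.

```python
# Toy version of the \p{...} handling outlined above: find the closing brace,
# validate the property length, wrap in brackets unless already inside a
# character class, and advance past the matched pattern.
def expand_unicode_property(pattern: str, i: int, inside_class: bool) -> tuple[str, int]:
    """Translate a \\p{...} occurrence starting at index i; return (text, next_i)."""
    assert pattern.startswith("\\p{", i)
    close = pattern.find("}", i + 3)              # search for the closing brace
    if close == -1 or close - (i + 3) > 32:       # validate the property length
        return pattern[i], i + 1                  # leave it untouched
    prop = pattern[i + 3:close]
    body = f"PROPERTY({prop})"                    # placeholder for the real char set
    out = body if inside_class else f"[{body}]"   # avoid nested brackets
    return out, close + 1                         # advance past the matched pattern

print(expand_unicode_property(r"\p{L}+\p{N}", 0, inside_class=False))
```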
gguf-py/gguf/constants.py (1)
3414-3435: UTUVL projector string is consistent with existing naming
`VisionProjectorType.UTUVL = "utuvl"` aligns with the new C++ projector type name and other projector strings; no issues here.

gguf-py/gguf/tensor_mapping.py (1)
1218-1458: UTUVL/Siglip2 tensor mappings look consistent; validate against real checkpoints

The added `"merger.*"` and `"siglip2.vision_model.*"` source keys for V_MMPROJ, V_ENC_* and related tensors are consistent with existing patterns for other vision models and correctly target the corresponding `MODEL_TENSOR` entries. Behavior remains purely additive. It would still be good to run at least one UTUVL HF → GGUF conversion and load in mtmd to confirm all expected tensors are picked up and no additional Siglip2 variants are missing.
tools/mtmd/CMakeLists.txt (1)
5-31: UTUVL model source correctly wired into mtmd target

Adding `models/utuvl.cpp` to the `mtmd` library matches the pattern of other model sources and is sufficient to include the new graph implementation in the build.

tools/mtmd/clip-impl.h (1)
163-221: Enum and string mapping for UTUVL are correctly aligned

`PROJECTOR_TYPE_UTUVL` is added to `projector_type` and mapped to `"utuvl"` in `PROJECTOR_TYPE_NAMES`, matching the Python-side constant and gguf metadata string. No issues seen.

src/llama-vocab.h (1)
9-55: New UTU_VL pre-tokenizer enum value is fine

Appending `LLAMA_VOCAB_PRE_TYPE_UTU_VL = 43` to `llama_vocab_pre_type` is consistent with existing entries and keeps ordinal values stable. Assuming the corresponding handling is implemented in `llama-vocab.cpp`, this looks good.

tools/mtmd/mtmd.cpp (1)
268-290: UTUVL reuses Qwen-style vision delimiters correctly

Including `PROJECTOR_TYPE_UTUVL` in the Qwen2VL/Qwen25VL/Qwen3VL branch makes UTUVL use `<|vision_start|>` / `<|vision_end|>` markers, which is consistent with how other Qwen-based VL projectors are handled. Please double-check that the UTUVL text tokenizer indeed defines these exact special tokens.
tools/mtmd/models/models.h (1)
25-28: UTUVL clip graph declaration matches existing pattern

`clip_graph_utuvl` follows the same inheritance and constructor pattern as other `clip_graph_*` types and declares a virtual `build()` override, which fits the documented implementation in `utuvl.cpp`.
317-322: LGTM - UTU_VL regex patterns are well-structured.The regex expressions correctly handle Korean Hangul, CJK punctuation, Bopomofo, and Chinese/Japanese characters, followed by a comprehensive pattern for mixed-case words with contractions. This follows the established pattern for other multilingual tokenizers in the codebase.
tools/mtmd/models/utuvl.cpp (3)
3-14: LGTM - Clean initialization with appropriate assertions.The setup correctly initializes window attention parameters. The
batch_size = 1constraint is properly enforced viaGGML_ASSERTstatements later in the code (lines 79, 146).
18-49: LGTM - Window partitioning logic for patch embeddings.The sequence of reshape/permute/cont operations correctly reorganizes patches into a windowed format for efficient attention computation. The math appears consistent: patches are organized into 2×2 windows (m=2) and the final shape matches the expected
(3*patch_size*patch_size, Hm*Wm*m*m, 1)before the patch embedding projection.
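A numpy toy model of that regrouping may help visualize it: ggml lists the fastest dimension first, so its `(3*patch_size*patch_size, Hm*Wm*m*m, 1)` corresponds to the `(rows, cols)` layout below; the per-patch element order here is illustrative, not the exact ggml memory layout.

```python
import numpy as np

patch_size, m = 14, 2        # m = spatial merge factor
Hp, Wp = 8, 12               # patch grid; must be divisible by m
Hm, Wm = Hp // m, Wp // m

img = np.random.rand(Hp * patch_size, Wp * patch_size, 3).astype(np.float32)

# one row per patch: (Hp, Wp, 3 * patch_size * patch_size)
patches = (img
           .reshape(Hp, patch_size, Wp, patch_size, 3)
           .transpose(0, 2, 1, 3, 4)
           .reshape(Hp, Wp, -1))

# group into m x m merge windows so the 4 patches of each window are contiguous
windows = (patches
           .reshape(Hm, m, Wm, m, -1)
           .transpose(0, 2, 1, 3, 4)
           .reshape(Hm * Wm * m * m, -1))

assert windows.shape == (Hm * Wm * m * m, 3 * patch_size * patch_size)
```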
86-138: LGTM - Standard transformer layer implementation with windowed attention.The layer loop correctly implements:
- Pre-norm architecture with residual connections
- Q/K/V projections with RoPE positional encoding
- Windowed attention with full attention every 8th layer and on the last layer (
(il + 1) % 8 == 0 || il == n_layer - 1)- FFN with configurable activation
tools/mtmd/clip.cpp (5)
848-851: LGTM!The UTUVL case follows the established pattern for other projector types in the graph builder switch statement.
1511-1518: LGTM!Tensor loading correctly loads UTUVL-specific tensors including the RMS normalization weight for the merger.
2993-2994: LGTM!Token counting functions correctly include UTUVL alongside related projector types (QWEN2VL, QWEN25VL, QWEN3VL, GLM4V) that share the same merge-based output token calculation.
Also applies to: 3009-3010, 3070-3076
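In rough Python terms, the shared merge-based count works out as below; the default values are illustrative, not read from a real checkpoint.

```python
# Each output token covers a (patch_size * n_merge) x (patch_size * n_merge)
# pixel block, so the count is the merged patch grid size.
def n_output_tokens(img_w: int, img_h: int, patch_size: int = 14, n_merge: int = 2) -> int:
    ntok_x = img_w // patch_size // n_merge
    ntok_y = img_h // patch_size // n_merge
    return ntok_x * ntok_y

print(n_output_tokens(1148, 812))  # 41 * 29 = 1189 tokens
```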
3477-3496: LGTM!Position calculation for UTUVL correctly remaps positions based on window grouping, following the same pattern as QWEN25VL but with parameterized
merge_ratiowhich is cleaner.
3668-3669: LGTM!The embedding dimension correctly uses
mm_1_b->ne[0], consistent with tensor loading and similar projector types.convert_hf_to_gguf.py (1)
1176-1177: UTUVL tokenizer hash entry looks consistent; just ensure runtime support exists.The new
chkhsh == "9d7013..."→res = "utu-vl"branch follows the same pattern as other models and should work as long as the C++ side andconvert_hf_to_gguf_update.pyhave been updated to recognize"utu-vl"as a validtokenizer.ggml.prevalue. Nothing functionally wrong here from this script’s perspective; just double‑check the upstream/runtime side is in sync so this identifier isn’t treated as “unknown pre‑tokenizer”.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tools/mtmd/models/utuvl.cpp (1)
41-41: Minor: Inconsistent spacing in dimension calculation.

Line 41 has inconsistent spacing: `3*patch_size* patch_size` should be `3 * patch_size * patch_size` for consistency with the rest of the codebase.

Proposed fix

```diff
-        3*patch_size* patch_size, Hm * Wm * m * m, 1);
+        3 * patch_size * patch_size, Hm * Wm * m * m, 1);
```

tools/mtmd/clip.cpp (1)
1165-1175: Consider adding guidance to the warning message.

The warning at line 1173-1174 only states the minimum requirement but doesn't provide guidance on how to fix it, unlike the Qwen-VL warning (lines 1160-1162) which suggests `--image-min-tokens 1024`.

Proposed enhancement

```diff
     if (hparams.image_min_pixels < warn_min_pixels) {
         LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
+        LOG_WRN("%s: if you encounter problems with accuracy, try adding --image-min-tokens 1024\n", __func__);
     }
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tools/mtmd/clip.cpp
- tools/mtmd/models/utuvl.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- tools/mtmd/clip.cpp
- tools/mtmd/models/utuvl.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- tools/mtmd/clip.cpp
- tools/mtmd/models/utuvl.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/clip.cpp
🔇 Additional comments (8)
tools/mtmd/models/utuvl.cpp (3)
64-83: LGTM!The window attention setup correctly handles index remapping and mask creation, with appropriate F16 casting for flash attention compatibility.
86-138: LGTM!The transformer layer implementation correctly handles:
- Window/full attention alternation pattern (every 8th layer + final layer uses full attention)
- Multi-head attention with separate Q/K/V projections
- RoPE positional encoding via
`ggml_rope_multi`
- Proper residual connections and FFN processing
160-160: Verify the hardcoded epsilon value for merger RMS norm.Line 160 uses a hardcoded
`1e-6` for the RMS normalization epsilon, while other `build_norm` calls in this function use the `eps` member variable. This may be intentional if the VLPatchMerger requires a different epsilon, but it's worth confirming this matches the original model specification.

tools/mtmd/clip.cpp (5)
848-851: LGTM!The graph builder selection for UTUVL follows the established pattern for other projector types.
1511-1518: LGTM!The tensor loading correctly loads the merger normalization weight and MLP projection weights, matching the graph construction in
utuvl.cpp.
2993-2994: LGTM!The token counting logic for UTUVL correctly shares the same calculation as Qwen and GLM4V models, consistent with the 2x spatial merge configuration.
Also applies to: 3009-3010, 3070-3071
3597-3598: LGTM!The embedding size resolution correctly uses
mm_1_b->ne[0], consistent with the merger MLP output tensor loaded for UTUVL.
3345-3360:and
Actionable comments posted: 0
♻️ Duplicate comments (1)
tools/mtmd/clip.cpp (1)
2751-2765: Misleading comment: this is linear search, not binary search

The comment on line 2751 states "Binary search with 0.02 step size" but the implementation performs a linear decrement (`scale -= 0.02f`) until the patch count fits within limits. While the logic works correctly, the comment should be updated to accurately describe the approach (e.g., "Linear search: decrement scale by 0.02 until patches fit"). Based on learnings, this is a mirrored PR from upstream, so ngxson should be informed to relay this feedback to the contributor.
🧹 Nitpick comments (4)
tools/mtmd/models/utuvl.cpp (2)
9-13: Patch reshaping pipeline is algebraically sound but could use explicit invariants and fewer magic numbers

The reshape/permute/cont chain from
build_inp_raw()down to the finalggml_cont_3d(..., 3*patch_size*patch_size, Hm * Wm * m * m, 1)is consistent in terms of element count and yields the expected[3 * patch_size^2, n_patches, 1]layout as long asn_patches_xandn_patches_yare multiples ofm(2 here). Right now that’s only implicit; a pair of assertions likeGGML_ASSERT(Wp % m == 0);andGGML_ASSERT(Hp % m == 0);would make the preconditions obvious and fail fast if a future UTUVL variant uses different shapes. Likewise,const int spatial_merge_unit = 4;later in the file is logicallym * m; deriving it fromminstead of hard‑coding4would keep the encoder and merger logic in sync if the window size ever changes.Also applies to: 20-45
5-8: Encoder/window-attention/merger wiring matches the QwenVL pattern but encodes several implicit contractsThe UTUVL graph largely mirrors
clip_graph_qwen2vl:
use_window_attn = !hparams.wa_layers.empty()andfull_attn = use_window_attn ? hparams.wa_layers.count(il) > 0 : truegive clear per-layer control, assumingwa_layersuses 0‑based layer indices.- The
inv_window_idxwindowing (Lines 67–86) and subsequentwindow_idxun-windowing (Lines 143–154) both assume a fixed 4‑token spatial unit; if the converter ever changes its grouping scheme, both indices must stay perfectly inverse over those 4‑token blocks or the restored sequence will be scrambled.- The merger section (Lines 156–175) applies RMS norm with
mm_input_norm_wand then a 2‑layer FFN over a[n_embd * 4, n_pos / 4, 1]view, consistent with the SigLIP‑style VLPatchMerger described in the PR.Given how tightly this all depends on converter‑side metadata (WA_LAYERS contents, window indices, and patch grid layout), it would be helpful to document these assumptions in comments and to verify with a small UTUVL sample that round‑tripping through windowing/unwindowing truly preserves token order.
Also applies to: 63-86, 88-141, 143-176
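The round-trip property asked about can be checked in isolation with a toy permutation like the one below; this is a simplified stand-in for the idx/inv_idx computation in clip.cpp, with an arbitrary 4-token-wide window, and only demonstrates that gathering into window order and scattering back must be exact inverses.

```python
# Build a "window order" permutation over a ph x pw token grid, its inverse,
# and verify that windowing followed by un-windowing restores the original order.
def window_permutation(ph: int, pw: int, win: int) -> list[int]:
    order = []
    for wy in range(0, ph, win):
        for wx in range(0, pw, win):
            for y in range(wy, min(wy + win, ph)):
                for x in range(wx, min(wx + win, pw)):
                    order.append(y * pw + x)
    return order

ph, pw, win = 6, 8, 4
idx = window_permutation(ph, pw, win)
inv_idx = [0] * len(idx)
for new_pos, old_pos in enumerate(idx):
    inv_idx[old_pos] = new_pos

tokens   = list(range(ph * pw))
windowed = [tokens[i] for i in idx]                       # gather into window order
restored = [windowed[inv_idx[i]] for i in range(len(idx))]  # scatter back
assert restored == tokens
```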
convert_hf_to_gguf.py (2)
3812-3815: Qwen2VL window‑attention layer export is correct but assert is a bit brittleUsing
fullatt_block_indexesand wiring it directly intoadd_vision_wa_layersmatches how UTUVL is handled and gives llama.cpp the exact full‑attention block layout it expects. One minor robustness nit: relying onassert fullatt_block_indexes is not Nonemeans running Python with-Owould silently skip the check; if you ever touch this outside a mirror, consider a normalif ...: raise ValueError(...)for clearer failures.
10489-10535: UTUVL mmproj integration is functionally sound; minor style nits onlyFunctionally,
UtuVLVisionModellooks correct:
- Reuses
MmprojModelsoself.hparamsbecomes the vision config, then normalizesimage_sizeand emits projector metadata: projector typeUTUVL, attention eps, spatial merge size, optionalwindow_size, and window‑attention layers viaadd_vision_wa_layers.- Activation handling groups all GELU‑style variants (
gelu_pytorch_tanh,gelu_fast, etc.) undervision_use_gelu, with SILU mapped tovision_use_silu, which matches how other vision projectors in this file are encoded.modify_tensorscleanly skips the text side and maps only tensors that have a corresponding entry intensor_map, logging and ignoring anything unmapped instead of raising.Two small, non‑blocking nits (you may want to leave them as‑is to stay aligned with upstream):
- The
ValueError(f"Unsupported activation function for UTUVL: {hidden_act}")triggers Ruff TRY003; if you ever refactor, you could move this into a custom exception type or shorten the message, but it’s harmless.- Ruff TRY300 would prefer a
try/except/elseshape inmodify_tensorsinstead of returning directly from thetryblock, again purely stylistic.Given this is a mirror PR, I’d keep the implementation as‑is and surface any style tweaks upstream if they decide to change their patterns.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- convert_hf_to_gguf.py
- convert_hf_to_gguf_update.py
- gguf-py/gguf/constants.py
- gguf-py/gguf/gguf_writer.py
- src/llama-model.cpp
- tools/mtmd/clip-impl.h
- tools/mtmd/clip-model.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/qwen2vl.cpp
- tools/mtmd/models/utuvl.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- tools/mtmd/clip-model.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/qwen2vl.cpp
- tools/mtmd/clip-impl.h
- tools/mtmd/models/utuvl.cpp
- src/llama-model.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- tools/mtmd/clip-model.h
- gguf-py/gguf/constants.py
- tools/mtmd/clip.cpp
- tools/mtmd/models/qwen2vl.cpp
- tools/mtmd/clip-impl.h
- tools/mtmd/models/utuvl.cpp
- convert_hf_to_gguf_update.py
- src/llama-model.cpp
- convert_hf_to_gguf.py
- gguf-py/gguf/gguf_writer.py
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Always activate the Python virtual environment in.venvand use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in.flake8
Ensure Python code passes pyright type checking as configured inpyrightconfig.json
Files:
- gguf-py/gguf/constants.py
- convert_hf_to_gguf_update.py
- convert_hf_to_gguf.py
- gguf-py/gguf/gguf_writer.py
src/**/*.cpp
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Prioritize performance optimization in core library implementations in
src/, as this is a performance-critical inference library
Files:
src/llama-model.cpp
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/clip.cpptools/mtmd/clip-impl.h
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
src/llama-model.cpp
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to include/llama.h : Exercise careful consideration when making changes to the public API in `include/llama.h`, as API stability is critical
Applied to files:
src/llama-model.cpp
🧬 Code graph analysis (2)
tools/mtmd/models/utuvl.cpp (3)
tools/mtmd/models/qwen2vl.cpp (2)
build(3-182)build(3-3)tools/mtmd/clip.cpp (18)
model(217-219)model(935-1283)model(935-935)model(1971-1984)model(1971-1971)build_inp_raw(469-474)build_inp_raw(469-469)positions(3299-3299)positions(3467-3467)positions(3479-3479)build_norm(476-499)build_norm(476-482)build_attn(583-646)build_attn(583-591)build_ffn(501-581)build_ffn(501-510)cb(255-264)cb(255-255)ggml/src/ggml.c (16)
ggml_reshape_4d(3583-3601)ggml_permute(3700-3752)ggml_cont_4d(3492-3508)ggml_cont_3d(3483-3490)ggml_mul_mat(3174-3189)ggml_add(1969-1974)ggml_reshape_2d(3546-3562)ggml_new_tensor_1d(1747-1752)ggml_set_name(1841-1848)ggml_set_input(7435-7437)ggml_new_tensor_2d(1754-1761)ggml_cast(3433-3445)ggml_get_rows(3776-3797)ggml_reshape_3d(3564-3581)ggml_rope_multi(4121-4140)ggml_build_forward_expand(6793-6795)
src/llama-model.cpp (1)
src/llama-model-loader.cpp (2)
create_tensor(817-838)create_tensor(817-817)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
10509-10509: Avoid specifying long messages outside the exception class
(TRY003)
10531-10531: Consider moving this statement to an else block
(TRY300)
🔇 Additional comments (19)
convert_hf_to_gguf_update.py (1)
148-148: LGTM — local repo path is properly supported.The new
utu-vlentry correctly follows the existing pattern. The local directory path./Youtu-VLis a valid approach sincedownload_model()already handles local paths via theos.path.isdir(repo)branch (lines 201-213), copying files instead of downloading.Note for contributors: running this script requires the
./Youtu-VLdirectory to be present locally with the necessary tokenizer files (config.json,tokenizer.json,tokenizer_config.json).src/llama-model.cpp (2)
4702-4706: DeepSeek: Optionaloutput.weightwith tied-embedding fallback is consistent and safeMaking
output.weightnon-required and falling back totoken_embdwithTENSOR_DUPLICATEDmatches the existing pattern used for other arches (e.g., LLaMA, GPT2, BLOOM) and correctly handles GGUFs that rely on tied word embeddings instead of a separate output head. Legacy DeepSeek GGUFs that includeoutput.weightwill continue to load unchanged, while new tied-embedding variants avoid a hard “missing tensor” failure.I don’t see functional or performance issues here; just recommend exercising both a legacy DeepSeek model (with explicit
output.weight) and a new tied-embedding GGUF to confirm the fallback behaves as intended. Based on learnings, this keeps the mirror PR aligned with upstream while being robust to model variations.
4769-4773: DeepSeek2: Mirrored tied-embedding fallback matches DeepSeek and other archesThe DeepSeek2 block now mirrors the DeepSeek logic:
output.weightis optional (TENSOR_NOT_REQUIRED), and when absent,outputis created as a duplicatedtoken_embdtensor. This is consistent with how tied embeddings are handled elsewhere in this file and should be compatible with DeepSeek V2 GGUFs that omit a separate output head.Behavior for existing DeepSeek2 GGUFs that already define
output.weightis preserved, since the fallback only triggers when that tensor is missing. As with the previous hunk, it would be good to load at least one existing DeepSeek V2 GGUF and one new Youtu‑VL/DeepSeek2 variant to validate that both the direct and fallback paths work correctly. Based on learnings, this looks like a faithful mirror of upstream tied-output handling.tools/mtmd/models/qwen2vl.cpp (2)
7-8: Window-attention enable/disable and full-attn gating look consistent withwa_layerssemantics
use_window_attn = !hparams.wa_layers.empty()combined withfull_attn = use_window_attn ? hparams.wa_layers.count(il) > 0 : truemakes behavior explicit and avoids the previous pattern-based heuristic. This should keep old models (no WA_LAYERS key) on full attention while allowing per-layer overrides when WA_LAYERS is present, assumingwa_layersuses 0‑based layer indices consistent withil. You may want to double-check that the converter always emits 0‑based indices here so UTUVL/Qwen2.5VL don’t accidentally shift the intended full-attn layers.Also applies to: 79-82
156-176: Projection/window reorder block assumes a specific relation betweenprojection_dimandn_embdThe reshape at Line 173 requires
hparams.projection_dim * (n_patches_x * n_patches_y / 4)to equal the element count ofembeddingscoming out of the FFN (n_embd * n_patches_x * n_patches_y). That implieshparams.projection_dim == 4 * n_embdwhenn_pos == n_patches. If UTUVL/Qwen2.5VL models ever use a different projection dimension, this reshape will assert at runtime. Please confirm that the converter always setsprojection_dimaccordingly, or, if not, adjust the reshape to derive its first dimension from the actual tensor shape instead ofprojection_dim.tools/mtmd/clip-model.h (1)
59-64:wa_layerswiring inclip_hparamsis consistent and idiomaticAdding
std::unordered_set<int32_t> wa_layersalongsidevision_feature_layeris a straightforward way to represent sparse full-attention layers. This aligns well with thewa_layers.count(il) > 0checks in the graphs and keeps the default “no window attention” behavior when the set is empty.tools/mtmd/clip-impl.h (1)
51-52: WA_LAYERS key andPROJECTOR_TYPE_UTUVLmapping are coherent across C++ and Python
KEY_WIN_ATTN_LAYERSuses the same"clip.vision.wa_layers"string as the Python constants, andPROJECTOR_TYPE_UTUVLis correctly registered inPROJECTOR_TYPE_NAMESas"utuvl", matchingVisionProjectorType.UTUVL. This should make the new Youtu‑VL projector discoverable and correctly parsed from GGUF without extra glue.Also applies to: 163-221
gguf-py/gguf/constants.py (1)
283-299: New vision metadata keys andUTUVLprojector string are aligned with the C++ side
Keys.ClipVision.WA_LAYERS = "clip.vision.wa_layers"andWINDOW_SIZE = "clip.vision.window_size"line up with the C++ macros and the newhparams.wa_layers/attn_window_sizefields, andVisionProjectorType.UTUVL = "utuvl"matches the C++PROJECTOR_TYPE_UTUVLname. The only behavioral change to be aware of is that legacyclip.vision.n_wa_patternmetadata will no longer be recognized; that’s expected for this refactor, but worth keeping in mind when converting older QwenVL models.Also applies to: 3415-3437
gguf-py/gguf/gguf_writer.py (1)
1131-1138: New WA_LAYERS/WINDOW_SIZE writer helpers are correctly implemented
add_vision_wa_layers()andadd_vision_window_size()correctly emit an INT32 array and UINT32 scalar for the newclip.vision.wa_layersandclip.vision.window_sizekeys, and the empty‑sequence early return for arrays is consistent with the runtime behavior. Both methods are already being used inconvert_hf_to_gguf.py.Likely an incorrect or invalid review comment.
tools/mtmd/clip.cpp (6)
848-851: LGTM: UTUVL graph builder integrationThe UTUVL case addition follows the same pattern as other projector types and correctly instantiates the clip_graph_utuvl builder.
1154-1161: LGTM: Window attention layer loading for QWEN25VLThe window attention layer loading logic is correct and properly populates the
wa_layersset.
1172-1188: LGTM: UTUVL hyperparameter initializationThe UTUVL hyperparameter loading follows the established pattern from QWEN2VL/QWEN25VL models. The configuration properly requires
attn_window_sizeandwa_layers, sets appropriate token limits, and includes a helpful warning about minimum pixel requirements for grounding tasks.
1256-1262: LGTM: Window attention layers loggingThe logging addition provides useful debug information about which layers use window attention.
1530-1537: LGTM: UTUVL projection tensor loadingThe tensor loading correctly maps to the UTUVL merger architecture (ln_q normalization and MLP layers). The inline comments helpfully document the tensor mappings.
3012-3034: LGTM: UTUVL integration in token counting and batch encodingThe UTUVL additions to
clip_n_output_tokens_x,clip_n_output_tokens_y,clip_n_output_tokens, batch encoding logic, andclip_n_mmproj_embdare consistent with the QWEN2VL/QWEN25VL/GLM4V architecture. The window attention handling at lines 3364-3445 correctly uses the defaultattn_window_sizeof 112 for QWEN25VL (where it's optional) while UTUVL mandates it during loading.Also applies to: 3089-3095, 3364-3445, 3616-3617
convert_hf_to_gguf.py (4)
1233-1235: Tokenizer hash entry for Youtu‑VL looks consistentThe new
chkhsh→"utu-vl"mapping is in line with the existing pattern inget_vocab_base_preand should safely identify the Youtu‑VL pre‑tokenizer variant. No further changes needed here.
7131-7137: DeepseekV2Model registration ofUTUVLForCausalLMis coherentAdding
"UTUVLForCausalLM"to theDeepseekV2Modelregistry plugs UTUVL’s text side into the existing DeepSeek‑V2 path while the newUtuVLVisionModelhandles the mmproj side via the MMPROJ registry. The dual registration under differentModelTypes should not conflict and aligns with how other multimodal families are wired.
7213-7233: MoE metadata export for DeepseekV2/UTUVL is reasonableThe added block that sets
expert_feed_forward_length,expert_count,expert_shared_count, and expert weight scale/norm fromhparamsmirrors patterns used in other MoE implementations (e.g. Qwen2Moe/Dots1). The fallbacks (intermediate_sizeand defaults of 0/1.0) give sane behavior when some MoE fields are absent, including non‑MoE models reusing this class. Just be aware that you will now always emit anexpert_feed_forward_length(possibly 0) andexpert_shared_count(0) for such models; if upstream ever cares about omitting these entirely, that would need a follow‑up there rather than in this mirror.
7247-7257: DeepseekV2 tensor‑filtering and MoE guard logic look solid
- Skipping
vision_tower/multi_modal_projector/siglip2./merger.tensors is consistent with deferring UTUVL/KimiVL vision handling to the mmproj path instead of the text model.- The
tie_word_embeddingscheck that dropslm_head.weightwhen tied prevents redundant output‑head storage and matches patterns used elsewhere in this file.- Gating the experts merge block on
n_routed_expertsbeing present avoids crashes for non‑MoE configs while keeping behavior unchanged for actual MoE setups.All of these changes improve robustness without altering expected conversions for supported Deepseek/UTUVL variants.
Also applies to: 7271-7299
Mirror from upstream PR: ggml-org#18315
Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.
Summary by CodeRabbit
New Features
Enhancements