[Mirror] Support Youtu-VL Model #67
base: master
Conversation
📝 Walkthrough
Adds end-to-end UTUVL vision support (conversion, GGUF metadata, tensor mappings, projector, graph, preprocessing, tokenization), updates QwenVL/wa-layer handling, DeepseekV2 tie-word-embeddings handling, Unicode \p{...} parsing, and build/CMake entries for UTUVL.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant Registry as ModelRegistry
    participant Converter as HF→GGUF
    participant GGUF as GGUFWriter
    participant Loader as mtmdLoader
    participant Projector as clip_graph_utuvl
    participant Preproc as ImagePreproc
    User->>Registry: request "utu-vl" model
    Registry->>Converter: instantiate UtuVLVisionModel
    Converter->>GGUF: add_vision_wa_layers / add_vision_window_size -> write GGUF metadata
    Converter->>Loader: produce tensors + mapped names
    Loader->>Projector: select PROJECTOR_TYPE_UTUVL
    Projector->>Preproc: request patches / window config
    Preproc->>Projector: deliver image tensor (merged/reshaped)
    Projector->>Projector: transformer pass (Q/K/V, window mask, FFN, VLPatchMerger)
    Projector->>User: return vision embeddings
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
convert_hf_to_gguf.py (1)
7247-7344: OUTPUT tensor bypass of quantization is intentional but undocumented; embedding naming assumption is reasonable for DeepseekV2.

The tied-embeddings implementation in `DeepseekV2Model` is functionally correct but has two points worth clarifying:

- Quantization bypass for OUTPUT is real: The code calls `self.gguf_writer.add_tensor(output_name, self._token_embd.numpy())` directly, which bypasses the quantization pipeline in `ModelBase.prepare_tensors()`. This means OUTPUT will be stored at the original tensor precision (likely F16) rather than following the model's `ftype` setting (e.g., MOSTLY_Q8_0). This appears intentional for tied-embedding models to preserve logit precision, but should be documented if that's the design goal.
- Embedding naming assumption is reasonable: The check for `"model.embed_tokens.weight"` is the correct naming for DeepseekV2 architecture variants. If future variants use different naming (e.g., without the `model.` prefix), the OUTPUT tensor would not be written, but this is unlikely for the DeepseekV2/UTUVL family.

If high-precision OUTPUT tensors are intended for quality, consider adding a comment explaining this trade-off. If OUTPUT should follow quantization settings instead, capture the quantized tensor from the parent `prepare_tensors()` call rather than the raw `_token_embd`.
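To make the trade-off concrete, here is a small self-contained Python sketch, not the actual converter code: the stub writer stands in for `gguf_writer`, and `write_tied_output` is a hypothetical helper mirroring the `add_tensor` call quoted above. Writing the raw embedding this way keeps OUTPUT at its source precision instead of following the model's `ftype`.

```python
import numpy as np

# Stub stand-in for the GGUF writer: add_tensor() stores data verbatim,
# which is exactly why the tied OUTPUT tensor skips the ftype quantization pipeline.
class StubWriter:
    def __init__(self) -> None:
        self.tensors: dict[str, np.ndarray] = {}

    def add_tensor(self, name: str, data: np.ndarray) -> None:
        self.tensors[name] = data  # no quantization applied here

def write_tied_output(writer: StubWriter, hf_tensors: dict[str, np.ndarray]) -> None:
    # naming assumption discussed above: DeepseekV2 variants use this key
    token_embd = hf_tensors.get("model.embed_tokens.weight")
    if token_embd is not None:
        writer.add_tensor("output.weight", token_embd)  # written as-is

w = StubWriter()
write_tied_output(w, {"model.embed_tokens.weight": np.zeros((8, 4), dtype=np.float16)})
print(w.tensors["output.weight"].dtype)  # float16 -> precision preserved
```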
🧹 Nitpick comments (8)
src/llama-vocab.cpp (1)
1869-1873: Minor: Trailing whitespace on line 1873.

The tokenizer configuration looks correct with `clean_spaces = false` and `ignore_merges = true`, consistent with similar pre-tokenizers like llama3 and tekken.

🔎 Proposed fix

```diff
     } else if (tokenizer_pre == "utu-vl") {
         pre_type = LLAMA_VOCAB_PRE_TYPE_UTU_VL;
         clean_spaces = false;
-        ignore_merges = true; 
+        ignore_merges = true;
```

tools/mtmd/models/utuvl.cpp (2)
160-160: Inconsistent epsilon value - uses hardcoded 1e-6 instead of member `eps`.

The RMS norm for the merger uses a hardcoded `1e-6` epsilon, while other norms in this file use the member variable `eps`. If this is intentional (perhaps the merger requires a different epsilon), consider adding a comment to clarify.

🔎 Proposed fix (if same eps is intended)

```diff
-    embeddings = build_norm(embeddings, model.mm_input_norm_w, nullptr, NORM_TYPE_RMS, 1e-6, -1);
+    embeddings = build_norm(embeddings, model.mm_input_norm_w, nullptr, NORM_TYPE_RMS, eps, -1);
```
173-185: Remove commented-out dead code.

This block was replaced by the `build_ffn` call above. Commented-out code should be removed to improve maintainability.

🔎 Proposed fix

```diff
     embeddings = build_ffn(embeddings, model.mm_0_w, model.mm_0_b, nullptr, nullptr, model.mm_1_w, model.mm_1_b, FFN_GELU, -1);

-    // // 3. First linear layer
-    // embeddings = ggml_mul_mat(ctx0, model.mm_0_w, embeddings);
-    // embeddings = ggml_add(ctx0, embeddings, model.mm_0_b);
-    // cb(embeddings, "merger_fc1", -1);
-
-    // // 4. GELU activation
-    // embeddings = ggml_gelu(ctx0, embeddings);
-    // cb(embeddings, "merger_gelu", -1);
-
-    // // 5. Second linear layer
-    // embeddings = ggml_mul_mat(ctx0, model.mm_1_w, embeddings);
-    // embeddings = ggml_add(ctx0, embeddings, model.mm_1_b);

     // build the graph
```

tools/mtmd/clip.cpp (3)
1165-1175: Consider aligning warning messages with QWEN25VL for consistency.

The warning for UTUVL is more terse compared to QWEN25VL, which provides additional context about the `--image-min-tokens` flag and a reference link. Also, QWEN25VL loads `n_wa_pattern` from config while UTUVL hardcodes window attention behavior in the batch encoder. If window attention pattern should be configurable, consider loading it from the model config similar to QWEN25VL.

🔎 Suggested improvement for warning message

```diff
     const int warn_min_pixels = 1024 * hparams.n_merge * hparams.n_merge * hparams.patch_size * hparams.patch_size;
     if (hparams.image_min_pixels < warn_min_pixels) {
         LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
+        LOG_WRN("%s: if you encounter problems with accuracy, try adding --image-min-tokens 1024\n", __func__);
     }
```
2710-2760: Potential edge cases in scaling loop.

A few observations about the preprocessing logic:

- Loop termination: The `while (scale > 0.0f)` loop could theoretically reach very small or zero scale values for extremely large images. Consider adding a minimum scale threshold to prevent potential issues.
- Fallback value: When `params.image_max_pixels` is 0, `max_num_patches` becomes 0, which would always trigger the scale reduction.
- Comment accuracy: The comment mentions "binary search" but the implementation is a linear search with 0.02 decrements.

🔎 Suggested defensive bounds

```diff
-    // Binary search for optimal scale to fit within max_num_patches
+    // Linear search for optimal scale to fit within max_num_patches
     float scale = 1.0f;
     int target_height = original_size.height;
     int target_width  = original_size.width;

     auto get_scaled_image_size = [align_size](float scale, int size) -> int {
         float scaled_size = size * scale;
         // Round up to nearest multiple of align_size
         int aligned = static_cast<int>(std::ceil(scaled_size / align_size)) * align_size;
         // Ensure at least one patch
         return std::max(align_size, aligned);
     };

-    // Binary search with 0.02 step size
-    while (scale > 0.0f) {
+    // Linear search with 0.02 step size
+    const float min_scale = 0.02f;
+    while (scale >= min_scale) {
         target_height = get_scaled_image_size(scale, original_size.height);
         target_width  = get_scaled_image_size(scale, original_size.width);
         int num_patches_h = target_height / patch_size;
         int num_patches_w = target_width / patch_size;
         int num_patches   = num_patches_h * num_patches_w;

         if (num_patches > max_num_patches) {
             scale -= 0.02f;
         } else {
             break;
         }
     }
```
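For readers who want to poke at the loop behaviour outside the C++ code, here is a pure-Python restatement of the suggested version above; the function name and the example values are illustrative only, not taken from clip.cpp.

```python
import math

# Shrink `scale` in 0.02 steps until the align_size-rounded dimensions fit
# within max_num_patches; the floor keeps the loop from running to zero.
def fit_image(width: int, height: int, patch_size: int, align_size: int,
              max_num_patches: int) -> tuple[int, int]:
    def scaled(scale: float, size: int) -> int:
        aligned = math.ceil(size * scale / align_size) * align_size
        return max(align_size, aligned)  # keep at least one aligned unit

    scale = 1.0
    min_scale = 0.02
    while scale >= min_scale:
        w, h = scaled(scale, width), scaled(scale, height)
        if (w // patch_size) * (h // patch_size) <= max_num_patches:
            return w, h
        scale -= 0.02  # linear, not binary, search
    return scaled(min_scale, width), scaled(min_scale, height)

print(fit_image(4096, 3072, patch_size=14, align_size=28, max_num_patches=1024))
```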
3426-3476: Dead code and hardcoded window attention.

- Dead code: Since `use_window_attn` is hardcoded to `true` (line 3428), the `else` branch at lines 3472-3475 will never execute.
- Consider making window attention configurable: Unlike QWEN25VL, which reads `n_wa_pattern` from model config to determine window attention behavior, UTUVL hardcodes it. If this is intentional (UTUVL always requires window attention), consider removing the dead else branch or adding a comment explaining this design choice.

🔎 Remove dead code

```diff
     case PROJECTOR_TYPE_UTUVL:
         {
-            const bool use_window_attn = true;
             const int merge_ratio = 2;
             const int pw  = image_size_width  / patch_size / merge_ratio; // patches after merger
             const int ph  = image_size_height / patch_size / merge_ratio;
             const int ipw = image_size_width  / patch_size; // internal patches in ViT
             const int iph = image_size_height / patch_size;

             std::vector<int> idx    (ph * pw);
             std::vector<int> inv_idx(ph * pw);

-            if (use_window_attn) {
+            // UTUVL always uses window attention
+            {
                 const int attn_window_size = patch_size * 2 * 8;
                 // ... rest of window attention code ...
-            } else {
-                for (int i = 0; i < ph * pw; i++) {
-                    idx[i] = i;
-                }
             }
```

convert_hf_to_gguf.py (2)
7138-7237: DeepseekV2 MoE GGUF metadata wiring for UTUVL is reasonable but a bit implicit.

- Registering `"UTUVLForCausalLM"` on `DeepseekV2Model` matches how other multimodal families register both text and mmproj variants under the same HF arch name; no issue there.
- In `set_gguf_parameters`, the new MoE section:
  - Falls back from `moe_intermediate_size` to `intermediate_size` (or `0`) for `expert_feed_forward_length`, which keeps GGUF consistent for both MoE and dense-only DeepseekV2-style configs.
  - Writes `expert_count`, `expert_shared_count` (default `0`), `expert_weights_scale` (default `1.0`), and `expert_weights_norm` when `norm_topk_prob` is truthy. This mirrors how other MoE models in this file are handled and should be safe.

Two minor nits you might want to consider (not blockers):

- If a model is not MoE but happens to have an `intermediate_size`, you'll still emit `expert_feed_forward_length` with a non-zero value; if that's unintended, consider guarding all MoE metadata behind a single `if hparams.get("n_routed_experts") is not None:` block.
- `add_expert_weights_norm` is called with `hparams["norm_topk_prob"]`, which appears to be boolean in Deepseek configs; if any future variant uses this field differently, this call might need to be revisited.

Given this mirrors upstream behavior, I'd leave it as-is unless you see non-MoE DeepseekV2 models hitting this path.
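As a plain-Python illustration of the fallback order being discussed (the real code calls the corresponding `gguf_writer.add_expert_*` helpers; the `n_shared_experts` and `routed_scaling_factor` key names below are assumptions, only the fallback structure is the point):

```python
# Sketch of the MoE metadata fallbacks; returns a dict instead of calling the
# GGUF writer so it runs standalone.
def moe_metadata(hparams: dict) -> dict:
    meta = {}
    # emitted even for dense-only configs -- the nit raised above
    meta["expert_feed_forward_length"] = hparams.get(
        "moe_intermediate_size", hparams.get("intermediate_size", 0))
    if (n_experts := hparams.get("n_routed_experts")) is not None:
        meta["expert_count"] = n_experts
        meta["expert_shared_count"] = hparams.get("n_shared_experts", 0)          # assumed key
        meta["expert_weights_scale"] = hparams.get("routed_scaling_factor", 1.0)  # assumed key
        if hparams.get("norm_topk_prob"):
            meta["expert_weights_norm"] = True
    return meta

print(moe_metadata({"intermediate_size": 11008}))  # dense config still gets a value
```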
10501-10541: UtuVLVisionModel implementation looks sound; small polish opportunities only.

Behavior:

- Registration under both `"UtuVLForConditionalGeneration"` and `"UTUVLForCausalLM"` on an `MmprojModel` mirrors how other VL families (e.g. Qwen2VL) expose both text and vision variants; this is consistent.
- `set_gguf_parameters`:
  - Calls `super().set_gguf_parameters()` so all generic vision/audio metadata is preserved.
  - Sets projector type to `UTUVL`.
  - Uses `self.hparams["layer_norm_eps"]` (inherited from `MmprojModel`, which rebinds `self.hparams` to the vision config) and a tolerant activation check:
    - Treats several GELU aliases (`gelu_pytorch_tanh`, etc.) as `vision_use_gelu=True`.
    - Supports `silu` via `vision_use_silu=True`.
    - Raises `ValueError` for anything else, which is an appropriate hard-fail given unknown activations would break runtime.
  - Writes `vision_spatial_merge_size` with a default of `2` when missing, which is at least explicit and will not crash.
- `modify_tensors`:
  - Cleanly drops obvious language-model tensors (`lm_head.`, `model.layers.`, `model.embed_tokens.`, `model.norm.`).
  - For everything else, delegates to `self.map_tensor_name(name)` and, on failure, logs a warning and silently drops the tensor. That's in line with other vision subclasses that prefer "best-effort + warning" over hard failure.

Optional polish (not required):

- Ruff's TRY003/TRY300 hints here are stylistic only:
  - The `ValueError(f"Unsupported activation function for UTUVL: {hidden_act}")` could be shortened slightly, but it's already clear and localized.
  - The `try/except ValueError` in `modify_tensors` is fine as written; splitting the `return []` into an `else` block would be marginally clearer but not necessary.

From a functional standpoint, this class fits well with the existing multimodal patterns in this file.
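For reference, a stand-alone restatement of the metadata/activation mapping described above, returning a plain dict; in the converter these values go through the writer helpers listed in the code-graph section (`add_clip_projector_type`, `add_vision_use_gelu`, `add_vision_use_silu`, `add_vision_spatial_merge_size`, plus `add_vision_wa_layers`/`add_vision_window_size` when present). The exact set of GELU aliases and the config key names other than `layer_norm_eps` are assumptions.

```python
# Illustrative mapping from a vision config dict to the GGUF-side flags.
GELU_ALIASES = {"gelu", "gelu_new", "gelu_fast", "gelu_pytorch_tanh", "quick_gelu"}

def utuvl_vision_metadata(vision_cfg: dict) -> dict:
    meta = {
        "projector_type": "utuvl",
        "attention_layer_norm_eps": vision_cfg["layer_norm_eps"],
        "spatial_merge_size": vision_cfg.get("spatial_merge_size", 2),  # default 2
    }
    act = vision_cfg["hidden_act"]
    if act in GELU_ALIASES:
        meta["use_gelu"] = True
    elif act == "silu":
        meta["use_silu"] = True
    else:
        # hard-fail on unknown activations, as the class does
        raise ValueError(f"Unsupported activation for UTUVL: {act}")
    return meta

print(utuvl_vision_metadata({"layer_norm_eps": 1e-6, "hidden_act": "gelu_pytorch_tanh"}))
```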
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- src/llama-vocab.cpp
- src/llama-vocab.h
- src/unicode.cpp
- tools/mtmd/CMakeLists.txt
- tools/mtmd/clip-impl.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/models.h
- tools/mtmd/models/utuvl.cpp
- tools/mtmd/mtmd.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- src/llama-vocab.h
- tools/mtmd/models/models.h
- src/llama-vocab.cpp
- tools/mtmd/models/utuvl.cpp
- tools/mtmd/mtmd.cpp
- src/unicode.cpp
- tools/mtmd/clip.cpp
- tools/mtmd/clip-impl.h
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- src/llama-vocab.h
- gguf-py/gguf/constants.py
- tools/mtmd/models/models.h
- src/llama-vocab.cpp
- tools/mtmd/models/utuvl.cpp
- tools/mtmd/mtmd.cpp
- src/unicode.cpp
- tools/mtmd/clip.cpp
- gguf-py/gguf/tensor_mapping.py
- tools/mtmd/clip-impl.h
- convert_hf_to_gguf.py
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Always activate the Python virtual environment in `.venv` and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in `.flake8`
Ensure Python code passes pyright type checking as configured in `pyrightconfig.json`
Files:
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- convert_hf_to_gguf.py
src/**/*.cpp
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Prioritize performance optimization in core library implementations in `src/`, as this is a performance-critical inference library
Files:
- src/llama-vocab.cpp
- src/unicode.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
- tools/mtmd/CMakeLists.txt
- tools/mtmd/mtmd.cpp
- tools/mtmd/clip.cpp
- tools/mtmd/clip-impl.h
🧬 Code graph analysis (3)
tools/mtmd/models/models.h (1)
tools/mtmd/models/utuvl.cpp (2)
build (3-190), build (3-3)
tools/mtmd/models/utuvl.cpp (1)
ggml/src/ggml.c (9)
ggml_reshape_4d (3583-3601), ggml_permute (3700-3752), ggml_cont_4d (3492-3508), ggml_mul_mat (3174-3189), ggml_add (1969-1974), ggml_reshape_2d (3546-3562), ggml_set_name (1841-1848), ggml_cast (3433-3445), ggml_build_forward_expand (6793-6795)
convert_hf_to_gguf.py (3)
gguf-py/gguf/gguf_writer.py (10)
add_expert_feed_forward_length (714-715), add_expert_count (800-801), add_expert_shared_count (806-807), add_expert_weights_scale (815-816), add_expert_weights_norm (818-819), add_clip_projector_type (1083-1084), add_vision_attention_layernorm_eps (1104-1105), add_vision_use_gelu (1122-1123), add_vision_use_silu (1125-1126), add_vision_spatial_merge_size (1119-1120)
convert_lora_to_gguf.py (1)
modify_tensors (448-473)
gguf-py/gguf/constants.py (1)
VisionProjectorType (3414-3435)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
10521-10521: Avoid specifying long messages outside the exception class
(TRY003)
10536-10536: Consider moving this statement to an else block
(TRY300)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
- GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
- GitHub Check: build-cmake-pkg / linux
- GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
- GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
- GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
- GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
- GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: macOS-latest-cmake-visionos
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: ios-xcode-build
- GitHub Check: macOS-latest-cmake-ios
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: macOS-latest-cmake-arm64
🔇 Additional comments (19)
src/unicode.cpp (2)
967-971: LGTM! Unicode letter subcategories correctly added. The addition of Lu, Ll, Lt, Lm, and Lo subcategories mapping to the LETTER flag is correct and follows the existing pattern. The inline comments clearly document each subcategory type.
1082-1103: LGTM! Dynamic \p{...} pattern resolution is well-implemented. The replacement of fixed-length pattern matching with dynamic resolution correctly:
- Searches for the closing brace dynamically (Line 1087)
- Validates the pattern length with a reasonable limit (Line 1088)
- Conditionally inserts brackets to avoid nesting when not inside a character class (Lines 1091-1098)
- Advances the loop index correctly past the matched pattern (Line 1099)
The boundary checks are sound, and the logic handles varying-length Unicode property patterns correctly.
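A Python restatement of those steps, for readability only; the real logic lives in src/unicode.cpp, and the 32-character cap, the placeholder expansion, and the exact bracket-wrapping rule below are illustrative assumptions.

```python
# Toy version of the \p{...} handling outlined above: find the closing brace,
# validate the property length, wrap in brackets unless already inside a
# character class, and advance past the matched pattern.
def expand_unicode_property(pattern: str, i: int, inside_class: bool) -> tuple[str, int]:
    """Translate a \\p{...} occurrence starting at index i; return (text, next_i)."""
    assert pattern.startswith("\\p{", i)
    close = pattern.find("}", i + 3)              # search for the closing brace
    if close == -1 or close - (i + 3) > 32:       # validate the property length
        return pattern[i], i + 1                  # leave it untouched
    prop = pattern[i + 3:close]
    body = f"PROPERTY({prop})"                    # placeholder for the real char set
    out = body if inside_class else f"[{body}]"   # avoid nested brackets
    return out, close + 1                         # advance past the matched pattern

print(expand_unicode_property(r"\p{L}+\p{N}", 0, inside_class=False))
```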
gguf-py/gguf/constants.py (1)
3414-3435: UTUVL projector string is consistent with existing naming
`VisionProjectorType.UTUVL = "utuvl"` aligns with the new C++ projector type name and other projector strings; no issues here.

gguf-py/gguf/tensor_mapping.py (1)
1218-1458: UTUVL/Siglip2 tensor mappings look consistent; validate against real checkpoints

The added `"merger.*"` and `"siglip2.vision_model.*"` source keys for V_MMPROJ, V_ENC_* and related tensors are consistent with existing patterns for other vision models and correctly target the corresponding `MODEL_TENSOR` entries. Behavior remains purely additive. It would still be good to run at least one UTUVL HF → GGUF conversion and load in mtmd to confirm all expected tensors are picked up and no additional Siglip2 variants are missing.
tools/mtmd/CMakeLists.txt (1)
5-31: UTUVL model source correctly wired into mtmd target

Adding `models/utuvl.cpp` to the `mtmd` library matches the pattern of other model sources and is sufficient to include the new graph implementation in the build.

tools/mtmd/clip-impl.h (1)
163-221: Enum and string mapping for UTUVL are correctly aligned

`PROJECTOR_TYPE_UTUVL` is added to `projector_type` and mapped to `"utuvl"` in `PROJECTOR_TYPE_NAMES`, matching the Python-side constant and gguf metadata string. No issues seen.

src/llama-vocab.h (1)
9-55: New UTU_VL pre-tokenizer enum value is fine

Appending `LLAMA_VOCAB_PRE_TYPE_UTU_VL = 43` to `llama_vocab_pre_type` is consistent with existing entries and keeps ordinal values stable. Assuming the corresponding handling is implemented in `llama-vocab.cpp`, this looks good.

tools/mtmd/mtmd.cpp (1)
268-290: UTUVL reuses Qwen-style vision delimiters correctly

Including `PROJECTOR_TYPE_UTUVL` in the Qwen2VL/Qwen25VL/Qwen3VL branch makes UTUVL use `<|vision_start|>` / `<|vision_end|>` markers, which is consistent with how other Qwen-based VL projectors are handled. Please double-check that the UTUVL text tokenizer indeed defines these exact special tokens.
tools/mtmd/models/models.h (1)
25-28: UTUVL clip graph declaration matches existing pattern

`clip_graph_utuvl` follows the same inheritance and constructor pattern as other `clip_graph_*` types and declares a virtual `build()` override, which fits the documented implementation in `utuvl.cpp`.
317-322: LGTM - UTU_VL regex patterns are well-structured.The regex expressions correctly handle Korean Hangul, CJK punctuation, Bopomofo, and Chinese/Japanese characters, followed by a comprehensive pattern for mixed-case words with contractions. This follows the established pattern for other multilingual tokenizers in the codebase.
tools/mtmd/models/utuvl.cpp (3)
3-14: LGTM - Clean initialization with appropriate assertions.The setup correctly initializes window attention parameters. The
batch_size = 1constraint is properly enforced viaGGML_ASSERTstatements later in the code (lines 79, 146).
18-49: LGTM - Window partitioning logic for patch embeddings.The sequence of reshape/permute/cont operations correctly reorganizes patches into a windowed format for efficient attention computation. The math appears consistent: patches are organized into 2×2 windows (m=2) and the final shape matches the expected
(3*patch_size*patch_size, Hm*Wm*m*m, 1)before the patch embedding projection.
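A numpy toy model of that regrouping may help visualize it: ggml lists the fastest dimension first, so its `(3*patch_size*patch_size, Hm*Wm*m*m, 1)` corresponds to the `(rows, cols)` layout below; the per-patch element order here is illustrative, not the exact ggml memory layout.

```python
import numpy as np

patch_size, m = 14, 2        # m = spatial merge factor
Hp, Wp = 8, 12               # patch grid; must be divisible by m
Hm, Wm = Hp // m, Wp // m

img = np.random.rand(Hp * patch_size, Wp * patch_size, 3).astype(np.float32)

# one row per patch: (Hp, Wp, 3 * patch_size * patch_size)
patches = (img
           .reshape(Hp, patch_size, Wp, patch_size, 3)
           .transpose(0, 2, 1, 3, 4)
           .reshape(Hp, Wp, -1))

# group into m x m merge windows so the 4 patches of each window are contiguous
windows = (patches
           .reshape(Hm, m, Wm, m, -1)
           .transpose(0, 2, 1, 3, 4)
           .reshape(Hm * Wm * m * m, -1))

assert windows.shape == (Hm * Wm * m * m, 3 * patch_size * patch_size)
```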
86-138: LGTM - Standard transformer layer implementation with windowed attention.The layer loop correctly implements:
- Pre-norm architecture with residual connections
- Q/K/V projections with RoPE positional encoding
- Windowed attention with full attention every 8th layer and on the last layer (
(il + 1) % 8 == 0 || il == n_layer - 1)- FFN with configurable activation
tools/mtmd/clip.cpp (5)
848-851: LGTM!The UTUVL case follows the established pattern for other projector types in the graph builder switch statement.
1511-1518: LGTM!Tensor loading correctly loads UTUVL-specific tensors including the RMS normalization weight for the merger.
2993-2994: LGTM!Token counting functions correctly include UTUVL alongside related projector types (QWEN2VL, QWEN25VL, QWEN3VL, GLM4V) that share the same merge-based output token calculation.
Also applies to: 3009-3010, 3070-3076
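In rough Python terms, the shared merge-based count works out as below; the default values are illustrative, not read from a real checkpoint.

```python
# Each output token covers a (patch_size * n_merge) x (patch_size * n_merge)
# pixel block, so the count is the merged patch grid size.
def n_output_tokens(img_w: int, img_h: int, patch_size: int = 14, n_merge: int = 2) -> int:
    ntok_x = img_w // patch_size // n_merge
    ntok_y = img_h // patch_size // n_merge
    return ntok_x * ntok_y

print(n_output_tokens(1148, 812))  # 41 * 29 = 1189 tokens
```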
3477-3496: LGTM!Position calculation for UTUVL correctly remaps positions based on window grouping, following the same pattern as QWEN25VL but with parameterized
merge_ratiowhich is cleaner.
3668-3669: LGTM!The embedding dimension correctly uses
mm_1_b->ne[0], consistent with tensor loading and similar projector types.convert_hf_to_gguf.py (1)
1176-1177: UTUVL tokenizer hash entry looks consistent; just ensure runtime support exists.The new
chkhsh == "9d7013..."→res = "utu-vl"branch follows the same pattern as other models and should work as long as the C++ side andconvert_hf_to_gguf_update.pyhave been updated to recognize"utu-vl"as a validtokenizer.ggml.prevalue. Nothing functionally wrong here from this script’s perspective; just double‑check the upstream/runtime side is in sync so this identifier isn’t treated as “unknown pre‑tokenizer”.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tools/mtmd/models/utuvl.cpp (1)
41-41: Minor: Inconsistent spacing in dimension calculation.

Line 41 has inconsistent spacing: `3*patch_size* patch_size` should be `3 * patch_size * patch_size` for consistency with the rest of the codebase.

Proposed fix

```diff
-        3*patch_size* patch_size, Hm * Wm * m * m, 1);
+        3 * patch_size * patch_size, Hm * Wm * m * m, 1);
```

tools/mtmd/clip.cpp (1)
1165-1175: Consider adding guidance to the warning message.

The warning at line 1173-1174 only states the minimum requirement but doesn't provide guidance on how to fix it, unlike the Qwen-VL warning (lines 1160-1162) which suggests `--image-min-tokens 1024`.

Proposed enhancement

```diff
     if (hparams.image_min_pixels < warn_min_pixels) {
         LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
+        LOG_WRN("%s: if you encounter problems with accuracy, try adding --image-min-tokens 1024\n", __func__);
     }
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tools/mtmd/clip.cpp
- tools/mtmd/models/utuvl.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- tools/mtmd/clip.cpp
- tools/mtmd/models/utuvl.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- tools/mtmd/clip.cpp
- tools/mtmd/models/utuvl.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/clip.cpp
🔇 Additional comments (8)
tools/mtmd/models/utuvl.cpp (3)
64-83: LGTM!The window attention setup correctly handles index remapping and mask creation, with appropriate F16 casting for flash attention compatibility.
86-138: LGTM!The transformer layer implementation correctly handles:
- Window/full attention alternation pattern (every 8th layer + final layer uses full attention)
- Multi-head attention with separate Q/K/V projections
- RoPE positional encoding via
`ggml_rope_multi`
- Proper residual connections and FFN processing
160-160: Verify the hardcoded epsilon value for merger RMS norm.Line 160 uses a hardcoded
`1e-6` for the RMS normalization epsilon, while other `build_norm` calls in this function use the `eps` member variable. This may be intentional if the VLPatchMerger requires a different epsilon, but it's worth confirming this matches the original model specification.

tools/mtmd/clip.cpp (5)
848-851: LGTM!The graph builder selection for UTUVL follows the established pattern for other projector types.
1511-1518: LGTM!The tensor loading correctly loads the merger normalization weight and MLP projection weights, matching the graph construction in
utuvl.cpp.
2993-2994: LGTM!The token counting logic for UTUVL correctly shares the same calculation as Qwen and GLM4V models, consistent with the 2x spatial merge configuration.
Also applies to: 3009-3010, 3070-3071
3597-3598: LGTM!The embedding size resolution correctly uses
mm_1_b->ne[0], consistent with the merger MLP output tensor loaded for UTUVL.
3345-3360:and
Actionable comments posted: 0
♻️ Duplicate comments (1)
tools/mtmd/clip.cpp (1)
2751-2765: Misleading comment: this is linear search, not binary search

The comment on line 2751 states "Binary search with 0.02 step size" but the implementation performs a linear decrement (`scale -= 0.02f`) until the patch count fits within limits. While the logic works correctly, the comment should be updated to accurately describe the approach (e.g., "Linear search: decrement scale by 0.02 until patches fit"). Based on learnings, this is a mirrored PR from upstream, so ngxson should be informed to relay this feedback to the contributor.
🧹 Nitpick comments (4)
tools/mtmd/models/utuvl.cpp (2)
9-13: Patch reshaping pipeline is algebraically sound but could use explicit invariants and fewer magic numbers

The reshape/permute/cont chain from
build_inp_raw()down to the finalggml_cont_3d(..., 3*patch_size*patch_size, Hm * Wm * m * m, 1)is consistent in terms of element count and yields the expected[3 * patch_size^2, n_patches, 1]layout as long asn_patches_xandn_patches_yare multiples ofm(2 here). Right now that’s only implicit; a pair of assertions likeGGML_ASSERT(Wp % m == 0);andGGML_ASSERT(Hp % m == 0);would make the preconditions obvious and fail fast if a future UTUVL variant uses different shapes. Likewise,const int spatial_merge_unit = 4;later in the file is logicallym * m; deriving it fromminstead of hard‑coding4would keep the encoder and merger logic in sync if the window size ever changes.Also applies to: 20-45
5-8: Encoder/window-attention/merger wiring matches the QwenVL pattern but encodes several implicit contractsThe UTUVL graph largely mirrors
clip_graph_qwen2vl:
use_window_attn = !hparams.wa_layers.empty()andfull_attn = use_window_attn ? hparams.wa_layers.count(il) > 0 : truegive clear per-layer control, assumingwa_layersuses 0‑based layer indices.- The
inv_window_idxwindowing (Lines 67–86) and subsequentwindow_idxun-windowing (Lines 143–154) both assume a fixed 4‑token spatial unit; if the converter ever changes its grouping scheme, both indices must stay perfectly inverse over those 4‑token blocks or the restored sequence will be scrambled.- The merger section (Lines 156–175) applies RMS norm with
mm_input_norm_wand then a 2‑layer FFN over a[n_embd * 4, n_pos / 4, 1]view, consistent with the SigLIP‑style VLPatchMerger described in the PR.Given how tightly this all depends on converter‑side metadata (WA_LAYERS contents, window indices, and patch grid layout), it would be helpful to document these assumptions in comments and to verify with a small UTUVL sample that round‑tripping through windowing/unwindowing truly preserves token order.
Also applies to: 63-86, 88-141, 143-176
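The round-trip property asked about can be checked in isolation with a toy permutation like the one below; this is a simplified stand-in for the idx/inv_idx computation in clip.cpp, with an arbitrary 4-token-wide window, and only demonstrates that gathering into window order and scattering back must be exact inverses.

```python
# Build a "window order" permutation over a ph x pw token grid, its inverse,
# and verify that windowing followed by un-windowing restores the original order.
def window_permutation(ph: int, pw: int, win: int) -> list[int]:
    order = []
    for wy in range(0, ph, win):
        for wx in range(0, pw, win):
            for y in range(wy, min(wy + win, ph)):
                for x in range(wx, min(wx + win, pw)):
                    order.append(y * pw + x)
    return order

ph, pw, win = 6, 8, 4
idx = window_permutation(ph, pw, win)
inv_idx = [0] * len(idx)
for new_pos, old_pos in enumerate(idx):
    inv_idx[old_pos] = new_pos

tokens   = list(range(ph * pw))
windowed = [tokens[i] for i in idx]                       # gather into window order
restored = [windowed[inv_idx[i]] for i in range(len(idx))]  # scatter back
assert restored == tokens
```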
convert_hf_to_gguf.py (2)
3812-3815: Qwen2VL window‑attention layer export is correct but assert is a bit brittleUsing
fullatt_block_indexesand wiring it directly intoadd_vision_wa_layersmatches how UTUVL is handled and gives llama.cpp the exact full‑attention block layout it expects. One minor robustness nit: relying onassert fullatt_block_indexes is not Nonemeans running Python with-Owould silently skip the check; if you ever touch this outside a mirror, consider a normalif ...: raise ValueError(...)for clearer failures.
10489-10535: UTUVL mmproj integration is functionally sound; minor style nits onlyFunctionally,
UtuVLVisionModellooks correct:
- Reuses
MmprojModelsoself.hparamsbecomes the vision config, then normalizesimage_sizeand emits projector metadata: projector typeUTUVL, attention eps, spatial merge size, optionalwindow_size, and window‑attention layers viaadd_vision_wa_layers.- Activation handling groups all GELU‑style variants (
gelu_pytorch_tanh,gelu_fast, etc.) undervision_use_gelu, with SILU mapped tovision_use_silu, which matches how other vision projectors in this file are encoded.modify_tensorscleanly skips the text side and maps only tensors that have a corresponding entry intensor_map, logging and ignoring anything unmapped instead of raising.Two small, non‑blocking nits (you may want to leave them as‑is to stay aligned with upstream):
- The
ValueError(f"Unsupported activation function for UTUVL: {hidden_act}")triggers Ruff TRY003; if you ever refactor, you could move this into a custom exception type or shorten the message, but it’s harmless.- Ruff TRY300 would prefer a
try/except/elseshape inmodify_tensorsinstead of returning directly from thetryblock, again purely stylistic.Given this is a mirror PR, I’d keep the implementation as‑is and surface any style tweaks upstream if they decide to change their patterns.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- convert_hf_to_gguf.py
- convert_hf_to_gguf_update.py
- gguf-py/gguf/constants.py
- gguf-py/gguf/gguf_writer.py
- src/llama-model.cpp
- tools/mtmd/clip-impl.h
- tools/mtmd/clip-model.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/qwen2vl.cpp
- tools/mtmd/models/utuvl.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- tools/mtmd/clip-model.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/qwen2vl.cpp
- tools/mtmd/clip-impl.h
- tools/mtmd/models/utuvl.cpp
- src/llama-model.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- tools/mtmd/clip-model.h
- gguf-py/gguf/constants.py
- tools/mtmd/clip.cpp
- tools/mtmd/models/qwen2vl.cpp
- tools/mtmd/clip-impl.h
- tools/mtmd/models/utuvl.cpp
- convert_hf_to_gguf_update.py
- src/llama-model.cpp
- convert_hf_to_gguf.py
- gguf-py/gguf/gguf_writer.py
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Always activate the Python virtual environment in.venvand use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in.flake8
Ensure Python code passes pyright type checking as configured inpyrightconfig.json
Files:
- gguf-py/gguf/constants.py
- convert_hf_to_gguf_update.py
- convert_hf_to_gguf.py
- gguf-py/gguf/gguf_writer.py
src/**/*.cpp
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Prioritize performance optimization in core library implementations in
src/, as this is a performance-critical inference library
Files:
src/llama-model.cpp
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/clip.cpptools/mtmd/clip-impl.h
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
src/llama-model.cpp
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to include/llama.h : Exercise careful consideration when making changes to the public API in `include/llama.h`, as API stability is critical
Applied to files:
src/llama-model.cpp
🧬 Code graph analysis (2)
tools/mtmd/models/utuvl.cpp (3)
tools/mtmd/models/qwen2vl.cpp (2)
build(3-182)build(3-3)tools/mtmd/clip.cpp (18)
model(217-219)model(935-1283)model(935-935)model(1971-1984)model(1971-1971)build_inp_raw(469-474)build_inp_raw(469-469)positions(3299-3299)positions(3467-3467)positions(3479-3479)build_norm(476-499)build_norm(476-482)build_attn(583-646)build_attn(583-591)build_ffn(501-581)build_ffn(501-510)cb(255-264)cb(255-255)ggml/src/ggml.c (16)
ggml_reshape_4d(3583-3601)ggml_permute(3700-3752)ggml_cont_4d(3492-3508)ggml_cont_3d(3483-3490)ggml_mul_mat(3174-3189)ggml_add(1969-1974)ggml_reshape_2d(3546-3562)ggml_new_tensor_1d(1747-1752)ggml_set_name(1841-1848)ggml_set_input(7435-7437)ggml_new_tensor_2d(1754-1761)ggml_cast(3433-3445)ggml_get_rows(3776-3797)ggml_reshape_3d(3564-3581)ggml_rope_multi(4121-4140)ggml_build_forward_expand(6793-6795)
src/llama-model.cpp (1)
src/llama-model-loader.cpp (2)
create_tensor(817-838)create_tensor(817-817)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
10509-10509: Avoid specifying long messages outside the exception class
(TRY003)
10531-10531: Consider moving this statement to an else block
(TRY300)
🔇 Additional comments (19)
convert_hf_to_gguf_update.py (1)
148-148: LGTM — local repo path is properly supported.The new
utu-vlentry correctly follows the existing pattern. The local directory path./Youtu-VLis a valid approach sincedownload_model()already handles local paths via theos.path.isdir(repo)branch (lines 201-213), copying files instead of downloading.Note for contributors: running this script requires the
./Youtu-VLdirectory to be present locally with the necessary tokenizer files (config.json,tokenizer.json,tokenizer_config.json).src/llama-model.cpp (2)
4702-4706: DeepSeek: Optionaloutput.weightwith tied-embedding fallback is consistent and safeMaking
output.weightnon-required and falling back totoken_embdwithTENSOR_DUPLICATEDmatches the existing pattern used for other arches (e.g., LLaMA, GPT2, BLOOM) and correctly handles GGUFs that rely on tied word embeddings instead of a separate output head. Legacy DeepSeek GGUFs that includeoutput.weightwill continue to load unchanged, while new tied-embedding variants avoid a hard “missing tensor” failure.I don’t see functional or performance issues here; just recommend exercising both a legacy DeepSeek model (with explicit
output.weight) and a new tied-embedding GGUF to confirm the fallback behaves as intended. Based on learnings, this keeps the mirror PR aligned with upstream while being robust to model variations.
4769-4773: DeepSeek2: Mirrored tied-embedding fallback matches DeepSeek and other archesThe DeepSeek2 block now mirrors the DeepSeek logic:
output.weightis optional (TENSOR_NOT_REQUIRED), and when absent,outputis created as a duplicatedtoken_embdtensor. This is consistent with how tied embeddings are handled elsewhere in this file and should be compatible with DeepSeek V2 GGUFs that omit a separate output head.Behavior for existing DeepSeek2 GGUFs that already define
output.weightis preserved, since the fallback only triggers when that tensor is missing. As with the previous hunk, it would be good to load at least one existing DeepSeek V2 GGUF and one new Youtu‑VL/DeepSeek2 variant to validate that both the direct and fallback paths work correctly. Based on learnings, this looks like a faithful mirror of upstream tied-output handling.tools/mtmd/models/qwen2vl.cpp (2)
7-8: Window-attention enable/disable and full-attn gating look consistent withwa_layerssemantics
use_window_attn = !hparams.wa_layers.empty()combined withfull_attn = use_window_attn ? hparams.wa_layers.count(il) > 0 : truemakes behavior explicit and avoids the previous pattern-based heuristic. This should keep old models (no WA_LAYERS key) on full attention while allowing per-layer overrides when WA_LAYERS is present, assumingwa_layersuses 0‑based layer indices consistent withil. You may want to double-check that the converter always emits 0‑based indices here so UTUVL/Qwen2.5VL don’t accidentally shift the intended full-attn layers.Also applies to: 79-82
156-176: Projection/window reorder block assumes a specific relation betweenprojection_dimandn_embdThe reshape at Line 173 requires
hparams.projection_dim * (n_patches_x * n_patches_y / 4)to equal the element count ofembeddingscoming out of the FFN (n_embd * n_patches_x * n_patches_y). That implieshparams.projection_dim == 4 * n_embdwhenn_pos == n_patches. If UTUVL/Qwen2.5VL models ever use a different projection dimension, this reshape will assert at runtime. Please confirm that the converter always setsprojection_dimaccordingly, or, if not, adjust the reshape to derive its first dimension from the actual tensor shape instead ofprojection_dim.tools/mtmd/clip-model.h (1)
59-64:wa_layerswiring inclip_hparamsis consistent and idiomaticAdding
std::unordered_set<int32_t> wa_layersalongsidevision_feature_layeris a straightforward way to represent sparse full-attention layers. This aligns well with thewa_layers.count(il) > 0checks in the graphs and keeps the default “no window attention” behavior when the set is empty.tools/mtmd/clip-impl.h (1)
51-52: WA_LAYERS key andPROJECTOR_TYPE_UTUVLmapping are coherent across C++ and Python
KEY_WIN_ATTN_LAYERSuses the same"clip.vision.wa_layers"string as the Python constants, andPROJECTOR_TYPE_UTUVLis correctly registered inPROJECTOR_TYPE_NAMESas"utuvl", matchingVisionProjectorType.UTUVL. This should make the new Youtu‑VL projector discoverable and correctly parsed from GGUF without extra glue.Also applies to: 163-221
gguf-py/gguf/constants.py (1)
283-299: New vision metadata keys andUTUVLprojector string are aligned with the C++ side
Keys.ClipVision.WA_LAYERS = "clip.vision.wa_layers"andWINDOW_SIZE = "clip.vision.window_size"line up with the C++ macros and the newhparams.wa_layers/attn_window_sizefields, andVisionProjectorType.UTUVL = "utuvl"matches the C++PROJECTOR_TYPE_UTUVLname. The only behavioral change to be aware of is that legacyclip.vision.n_wa_patternmetadata will no longer be recognized; that’s expected for this refactor, but worth keeping in mind when converting older QwenVL models.Also applies to: 3415-3437
gguf-py/gguf/gguf_writer.py (1)
1131-1138: New WA_LAYERS/WINDOW_SIZE writer helpers are correctly implemented
add_vision_wa_layers()andadd_vision_window_size()correctly emit an INT32 array and UINT32 scalar for the newclip.vision.wa_layersandclip.vision.window_sizekeys, and the empty‑sequence early return for arrays is consistent with the runtime behavior. Both methods are already being used inconvert_hf_to_gguf.py.Likely an incorrect or invalid review comment.
tools/mtmd/clip.cpp (6)
848-851: LGTM: UTUVL graph builder integrationThe UTUVL case addition follows the same pattern as other projector types and correctly instantiates the clip_graph_utuvl builder.
1154-1161: LGTM: Window attention layer loading for QWEN25VLThe window attention layer loading logic is correct and properly populates the
wa_layersset.
1172-1188: LGTM: UTUVL hyperparameter initializationThe UTUVL hyperparameter loading follows the established pattern from QWEN2VL/QWEN25VL models. The configuration properly requires
attn_window_sizeandwa_layers, sets appropriate token limits, and includes a helpful warning about minimum pixel requirements for grounding tasks.
1256-1262: LGTM: Window attention layers loggingThe logging addition provides useful debug information about which layers use window attention.
1530-1537: LGTM: UTUVL projection tensor loadingThe tensor loading correctly maps to the UTUVL merger architecture (ln_q normalization and MLP layers). The inline comments helpfully document the tensor mappings.
3012-3034: LGTM: UTUVL integration in token counting and batch encodingThe UTUVL additions to
clip_n_output_tokens_x,clip_n_output_tokens_y,clip_n_output_tokens, batch encoding logic, andclip_n_mmproj_embdare consistent with the QWEN2VL/QWEN25VL/GLM4V architecture. The window attention handling at lines 3364-3445 correctly uses the defaultattn_window_sizeof 112 for QWEN25VL (where it's optional) while UTUVL mandates it during loading.Also applies to: 3089-3095, 3364-3445, 3616-3617
convert_hf_to_gguf.py (4)
1233-1235: Tokenizer hash entry for Youtu‑VL looks consistentThe new
chkhsh→"utu-vl"mapping is in line with the existing pattern inget_vocab_base_preand should safely identify the Youtu‑VL pre‑tokenizer variant. No further changes needed here.
7131-7137: DeepseekV2Model registration ofUTUVLForCausalLMis coherentAdding
"UTUVLForCausalLM"to theDeepseekV2Modelregistry plugs UTUVL’s text side into the existing DeepSeek‑V2 path while the newUtuVLVisionModelhandles the mmproj side via the MMPROJ registry. The dual registration under differentModelTypes should not conflict and aligns with how other multimodal families are wired.
7213-7233: MoE metadata export for DeepseekV2/UTUVL is reasonableThe added block that sets
expert_feed_forward_length,expert_count,expert_shared_count, and expert weight scale/norm fromhparamsmirrors patterns used in other MoE implementations (e.g. Qwen2Moe/Dots1). The fallbacks (intermediate_sizeand defaults of 0/1.0) give sane behavior when some MoE fields are absent, including non‑MoE models reusing this class. Just be aware that you will now always emit anexpert_feed_forward_length(possibly 0) andexpert_shared_count(0) for such models; if upstream ever cares about omitting these entirely, that would need a follow‑up there rather than in this mirror.
7247-7257: DeepseekV2 tensor‑filtering and MoE guard logic look solid
- Skipping
vision_tower/multi_modal_projector/siglip2./merger.tensors is consistent with deferring UTUVL/KimiVL vision handling to the mmproj path instead of the text model.- The
tie_word_embeddingscheck that dropslm_head.weightwhen tied prevents redundant output‑head storage and matches patterns used elsewhere in this file.- Gating the experts merge block on
n_routed_expertsbeing present avoids crashes for non‑MoE configs while keeping behavior unchanged for actual MoE setups.All of these changes improve robustness without altering expected conversions for supported Deepseek/UTUVL variants.
Also applies to: 7271-7299
Mirror from upstream PR: ggml-org#18315
Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.
Summary by CodeRabbit
New Features
Enhancements