qwen35-verify / qwen35-verify-qwen35
Purpose
Verify model structure and runtime behavior for Qwen3.5 workflows. Both commands accept local paths and HF Hub repo ids, with automatic mode/device detection.
When to use
- After conversion/strip, before upload or training.
- When you need sanity checks for BNB vs f16 checkpoints.
- When you need VLM-specific checks (thinking ON/OFF + image pipeline).
Syntax
qwen35-verify --model <repo_or_path> [--hf-token hf_...]
qwen35-verify-qwen35 --model <repo_or_path> [--hf-token hf_...]Options
| Flag | Default | Description |
|---|---|---|
--model | — | HF repo id or local path |
--hf-token | env HF_TOKEN | HF access token; falls back to env var |
Command selection
| Command | Module | Use for |
|---|---|---|
qwen35-verify | qwen35_toolkit.verify | Generic verification for AutoModelForCausalLM-compatible models (BNB/f16 auto-detected) |
qwen35-verify-qwen35 | qwen35_toolkit.verify_qwen35 | Qwen3.5-specific checks: thinking ON/OFF + image inference test |
Source detection behavior:
- Paths starting with
./,../,/,~or existing directories are treated as local. - Any other string containing
/is treated as HF Hub repo id.
Verification flow
Generic flow (verify.py)
1. Config:
- BNB: validate `quantization_config` and required keys.
- f16: report `architectures`, `model_type`, `torch_dtype`.
2. Load:
- Load model and tokenizer.
- Pick device strategy.
- Move model with `_move_to_cuda()` (OOM-safe fallback to CPU).
3. Precision:
- BNB: count `Linear4bit` layers and check visual precision.
- f16: report linear count and predominant weight dtype.
4. Image:
- Skipped in generic verifier (use `qwen35-verify-qwen35` for VLM image test).
5. Inference:
- Run text tasks and report latency/speed metrics.Qwen3.5-specific flow (verify_qwen35.py)
1. Config:
- Same config checks as generic flow.
2. Load:
- Auto-select loader by architecture + mode.
3. Precision:
- Same precision checks as generic flow.
4. Image:
- VLM only: run end-to-end image pipeline test.
5. Inference branches:
- thinking OFF
- thinking ONLoader matrix (verify_qwen35.py)
1. vlm + bnb:
- Qwen3_5ForConditionalGeneration
- `restore_visual_to_fp` fallback enabled
2. vlm + f16:
- Qwen3_5ForConditionalGeneration
- `torch_dtype=auto`
3. text + bnb:
- AutoModelForCausalLM
4. text + f16:
- AutoModelForCausalLM
- `torch_dtype=auto`Automatic device strategy
_pick_device() uses on-disk file sizes (no model load) to choose strategy before loading.
Decision rule:
- Compare full model size (
total) and text-only estimate (total - visual) to VRAM budget. - Threshold is
<= 95% VRAM.
| Strategy | Threshold | Behavior |
|---|---|---|
cuda_direct | total <= 95% VRAM | full model to GPU |
cuda_drop | (total - visual) <= 95% VRAM | image test on CPU -> drop visual -> text inference on GPU |
cpu | (total - visual) > 95% VRAM | CPU-only execution |
_move_to_cuda() is an OOM guard around model.cuda():
- catches
OutOfMemoryError, - clears cache,
- falls back to CPU,
- continues inference instead of aborting.
Accounting note:
- Full-size accounting intentionally includes non-quantized components (
lm_head, embeddings, visual tower). - This avoids underestimating memory on larger checkpoints.
Strategy mechanics
cuda_direct:
- load on CPU,
- move full model to GPU,
- run image test on GPU (if VLM),
- optionally drop visual tower and keep text inference on GPU.
cuda_drop:
- load on CPU,
- run image test on CPU,
- drop visual tower on CPU,
- move remaining text stack to GPU for text inference.
cpu:
- keep full flow on CPU,
- run image test (VLM) on CPU,
- drop visual tower,
- run text inference on CPU.
Implementation notes
Visual tower handling (BNB)
Primary mechanism:
llm_int8_skip_modulescontains exact paths (for examplemodel.visual) so transformers does not re-quantize visual layers.
Fallback mechanism:
restore_visual_to_fp()reloads original bf16 visual weights from safetensors for older/broken checkpoints.
lm_head.weight precision
lm_head.weight stays full precision for generation stability. On larger models this tensor is a significant memory component, so full-size accounting matters in device selection.
Image test
Qwen3.5-specific verifier uses an embedded red/blue image prompt. Expected left-half color answer: red (case-insensitive).
Module tree
qwen35_toolkit/verify.py (generic core)
is_hf_repo() -> source type detection
resolve_source() -> source label/normalization
check_config() -> config validation/reporting
model_size_gb() -> on-disk size accounting
detect_mode() -> bnb/f16
detect_arch() -> vlm/text
_pick_device() -> cuda_direct/cuda_drop/cpu
_move_to_cuda() -> OOM-safe cuda() move
restore_visual_to_fp() -> visual precision fallback
drop_visual_tower() -> free visual branch for text-only path
count_quantized_layers() -> quantization ratio diagnostics
run_inference_tasks() -> task probes + tok/s
verify() -> generic orchestrator
qwen35_toolkit/verify_qwen35.py (Qwen3.5 specializations)
load_qwen35() -> arch+mode loader selection
verify_qwen35() -> image + thinking on/off orchestratorConsole output examples
Example output (qwen35-verify-qwen35 on VLM checkpoint)
./test/Qwen3.5-0.8B-bnb-4bit [Qwen3.5 · local · vlm · bnb]
📦 Model weight size on disk: 0.90 GB
── 1. Config ──
✅ config.json OK | skip_modules: ['lm_head', 'model.visual', 'model.language_model.embed_tokens']
── 2. Load ──
✅ Model loaded (Qwen3_5ForConditionalGeneration) | strategy: cuda_direct
── 3. Precision ──
✅ 186/237 linear layers quantized (78%)
✅ Visual tower: 50 layer(s) at full precision (bfloat16)
── 4. Inference [IMAGE] ──
✅ [Image] response: 'red' (expected: 'red', case-insensitive)
── 5a. Inference [THINKING OFF] ──
✅ [Math] ...
── 5b. Inference [THINKING ON] ──
✅ [Math] ...
✅ ALL CHECKS PASSEDInterpretation: VLM checkpoint passes config/load/precision/image and both reasoning branches.
Stable fields: stage names, pass markers, image expected value, strategy label. Variable fields: tokens/time/tok-s values, full model responses, VRAM usage.
Example output (qwen35-verify-qwen35 on text-only checkpoint)
./test/Qwen3.5-text-0.8B-bnb-4bit [Qwen3.5 · local · text · bnb]
📦 Model weight size on disk: 0.71 GB
── 1. Config ──
✅ config.json OK | skip_modules: ['lm_head', 'model.language_model.embed_tokens']
── 2. Load ──
✅ Model loaded (Qwen3_5ForConditionalGeneration) | strategy: cuda_direct
── 3. Precision ──
✅ 186/187 linear layers quantized (99%)
ℹ️ No visual tower detected (text-only model)
── 4. Inference [IMAGE] ──
ℹ️ Skipped (text-only model)
✅ ALL CHECKS PASSEDInterpretation: text-only checkpoint skips image stage and still passes full verification.
Stable fields: text-only detection, image-skip behavior, pass markers. Variable fields: file size, quantized ratio, runtime metrics.
Examples
# Generic verify (text-only)
qwen35-verify --model ./Qwen3.5-text-0.8B-bnb-4bit# Generic verify from HF Hub
qwen35-verify --model <your-hf-username>/Qwen3.5-text-0.8B-bnb-4bit# Qwen3.5-specific verify (VLM)
qwen35-verify-qwen35 --model ./Qwen3.5-0.8B-bnb-4bit# 9B on small VRAM (expects auto fallback strategy)
qwen35-verify-qwen35 --model ./Qwen3.5-9B-bnb-4bitEdge cases / limitations
WARNING
Mixing visual weights from one model size with text weights from another size is unsupported and can produce invalid outputs.
- Generic verifier skips image test; use
qwen35-verify-qwen35for VLM image validation. - Private HF repos require
--hf-tokenorHF_TOKEN.