BNB 4-bit quantization
Converts full-precision FP16 checkpoints to bitsandbytes (BNB) NF4 4-bit on CPU, processing either layer by layer or all at once. A quantized 4B model fits in roughly 5 GB of VRAM; use `--low-vram` to fit a 9B model on a 7.7 GB card.
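To make the conversion concrete, here is a minimal pure-Python sketch of blockwise NF4 quantization, the scheme bitsandbytes applies per weight block. The 16-entry codebook is the NormalFloat4 table from the QLoRA paper; block handling and bit-packing are simplified for illustration, and the helper names are hypothetical, not this toolkit's API.

```python
# NF4 codebook (NormalFloat4 levels from the QLoRA paper).
NF4_CODES = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def quantize_block(values):
    """Quantize one block: absmax scaling, then nearest NF4 code per value."""
    absmax = max(abs(v) for v in values) or 1.0
    codes = [min(range(16), key=lambda i: abs(NF4_CODES[i] - v / absmax))
             for v in values]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Recover approximate weights from 4-bit code indices and the block scale."""
    return [NF4_CODES[c] * absmax for c in codes]

# Toy weight block standing in for one slice of a layer.
weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.0, 0.45]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
```

Each weight is stored as a 4-bit index plus one shared scale per block, which is where the roughly 4x memory saving over FP16 comes from.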
Qwen3.5 models are often distributed as full-precision VLM checkpoints. This toolkit prepares text-only variants for training and inference: BNB conversion, visual tower stripping, verification, and upload.
| Repo | Purpose |
|---|---|
| qwen35-toolkit (this repo) | Model prep — BNB quantization, visual tower strip, verify, upload |
| qwen-qlora-train | LoRA training, adapter inference, CPU merge |
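The visual-tower strip step above amounts to dropping the vision-encoder tensors from the checkpoint so only text-model weights remain. A hypothetical sketch, assuming the vision tensors share a `visual.` key prefix (the actual prefix depends on the checkpoint layout):

```python
# Hypothetical sketch: remove vision-encoder weights from a state dict.
# The "visual." prefix is an assumption about the checkpoint's key naming.

def strip_visual_tower(state_dict, visual_prefix="visual."):
    """Return a text-only state dict plus the names of the removed tensors."""
    removed = [k for k in state_dict if k.startswith(visual_prefix)]
    text_only = {k: v for k, v in state_dict.items() if k not in removed}
    return text_only, removed

# Toy state dict standing in for a real VLM checkpoint.
ckpt = {
    "model.embed_tokens.weight": "...",
    "model.layers.0.self_attn.q_proj.weight": "...",
    "visual.patch_embed.proj.weight": "...",
    "visual.blocks.0.attn.qkv.weight": "...",
    "lm_head.weight": "...",
}
text_ckpt, dropped = strip_visual_tower(ckpt)
```

A verification pass would then confirm that no vision keys survive and that the remaining tensors match the text model's expected shapes.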