Quickstart
NOTE
Recommended starting point: the 0.8B model. On smaller GPUs (for example an RTX 3070 with 7.7 GB of usable VRAM), the 9B steps often require fallback modes (--low-vram, or CPU fallback in verify).
Prerequisites
See Setup for full install details.
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install -e . # or: pip install git+https://github.com/techwithsergiu/qwen35-toolkit.git
hf auth login
Steps
Path A — BNB text-only (training target)
Step 1 — Quantize source model to BNB 4-bit
Create BNB VLM checkpoint.
qwen35-convert --model unsloth/Qwen3.5-0.8B --output ./Qwen3.5-0.8B-bnb-4bit
# repeat for 2B, 4B; use --low-vram for 9B
Success criteria: output directory contains BNB model files.
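The per-size repetition can be scripted. A minimal dry-run sketch, assuming the qwen35-convert flags shown above (it only prints the commands; pipe to sh to execute):

```shell
# Generate one conversion command per model size; 9B gets --low-vram
# per the note at the top of this Quickstart.
convert_cmds() {
  local size
  for size in 0.8B 2B 4B; do
    echo "qwen35-convert --model unsloth/Qwen3.5-$size --output ./Qwen3.5-$size-bnb-4bit"
  done
  # 9B typically needs the low-VRAM path on smaller GPUs:
  echo "qwen35-convert --model unsloth/Qwen3.5-9B --output ./Qwen3.5-9B-bnb-4bit --low-vram"
}
convert_cmds          # review the plan first
# convert_cmds | sh   # then execute it
```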
Step 2 — Strip visual tower
Create text-only BNB checkpoint.
qwen35-strip --model ./Qwen3.5-0.8B-bnb-4bit --output ./Qwen3.5-text-0.8B-bnb-4bit
# repeat for 2B, 4B, 9B
Success criteria: stripped model has no visual tower weights.
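One way to spot-check the success criteria without loading the model: grep the sharded-safetensors index for visual-tower tensors. This sketch assumes the checkpoint ships a model.safetensors.index.json and that visual-tower tensors use a "visual." name prefix (an assumption about Qwen3.5's tensor naming, not confirmed by this toolkit's docs):

```shell
# Fail if any tensor name in the safetensors index starts with "visual."
check_stripped() {
  local idx="$1/model.safetensors.index.json"
  [ -f "$idx" ] || { echo "SKIP: $idx not found (single-shard or missing checkpoint)"; return 0; }
  if grep -q '"visual\.' "$idx"; then
    echo "FAIL: visual tower weights still listed in $idx"
    return 1
  fi
  echo "OK: no visual tower weights listed in $idx"
}
check_stripped ./Qwen3.5-text-0.8B-bnb-4bit
```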
Step 3 — Verify both checkpoints
Run Qwen3.5-specific verification for VLM and generic verification for text-only.
qwen35-verify-qwen35 --model ./Qwen3.5-0.8B-bnb-4bit
qwen35-verify --model ./Qwen3.5-text-0.8B-bnb-4bit
Success criteria: verification steps complete without blocking errors.
Step 4 — Upload to Hub
Push both model variants.
qwen35-upload --local ./Qwen3.5-0.8B-bnb-4bit --repo <your-hf-username>/Qwen3.5-0.8B-bnb-4bit
qwen35-upload --local ./Qwen3.5-text-0.8B-bnb-4bit --repo <your-hf-username>/Qwen3.5-text-0.8B-bnb-4bit
Success criteria: remote repo contains expected files and metadata.
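Both pushes can be driven from one loop. A dry-run sketch assuming the qwen35-upload flags shown above (it prints the commands; substitute your actual Hub username):

```shell
# Build one upload command per Path A variant for the given Hub user.
upload_cmds() {
  local user="$1" variant
  for variant in Qwen3.5-0.8B-bnb-4bit Qwen3.5-text-0.8B-bnb-4bit; do
    echo "qwen35-upload --local ./$variant --repo $user/$variant"
  done
}
upload_cmds your-hf-username   # replace with your Hub username
# upload_cmds your-hf-username | sh
```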
Path B — f16 text-only + GGUF (inference / merge base)
Step 1 — Strip visual tower from f16 source
Create text-only f16/bf16 model.
qwen35-strip --model unsloth/Qwen3.5-0.8B --output ./Qwen3.5-text-0.8B --mode f16
# repeat for 2B, 4B, 9B
Success criteria: text-only model directory created.
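As with Path A, the repetition can be generated rather than typed. A dry-run sketch assuming the qwen35-strip flags shown above:

```shell
# Print one f16 strip command per model size; pipe to sh to execute.
strip_f16_cmds() {
  local size
  for size in 0.8B 2B 4B 9B; do
    echo "qwen35-strip --model unsloth/Qwen3.5-$size --output ./Qwen3.5-text-$size --mode f16"
  done
}
strip_f16_cmds
# strip_f16_cmds | sh
```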
Step 2 — Verify text-only model
Ensure converted text model is valid before GGUF export.
qwen35-verify --model ./Qwen3.5-text-0.8B
Success criteria: verification passes.
Step 3 — Convert to GGUF f16
Generate F16 GGUF file using llama.cpp converter.
python llama.cpp/convert_hf_to_gguf.py ./Qwen3.5-text-0.8B \
--outtype f16 \
--outfile ./Qwen3.5-text-0.8B-GGUF/Qwen3.5-text-0.8B-F16.gguf
Success criteria: Qwen3.5-text-0.8B-F16.gguf exists.
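A cheap sanity check on the exported file, relying on the GGUF format starting with the four ASCII bytes "GGUF" (the format's magic number):

```shell
# Check that the file exists and carries the GGUF magic in its first 4 bytes.
check_gguf() {
  local f="$1"
  [ -f "$f" ] || { echo "FAIL: $f missing"; return 1; }
  [ "$(head -c 4 "$f")" = "GGUF" ] || { echo "FAIL: $f has no GGUF magic"; return 1; }
  echo "OK: $f looks like a GGUF file"
}
# After Step 3 completes:
# check_gguf ./Qwen3.5-text-0.8B-GGUF/Qwen3.5-text-0.8B-F16.gguf
```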
Step 4 — Quantize GGUF
Create inference-friendly quant file.
./llama.cpp/build/bin/llama-quantize \
./Qwen3.5-text-0.8B-GGUF/Qwen3.5-text-0.8B-F16.gguf \
./Qwen3.5-text-0.8B-GGUF/Qwen3.5-text-0.8B-Q4_K_M.gguf \
Q4_K_M
Success criteria: quantized GGUF file exists (for example Q4_K_M).
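If you want several quant levels from the same F16 source, a dry-run sketch assuming the llama-quantize positional arguments shown above (Q4_K_M, Q5_K_M, and Q8_0 are standard llama.cpp quant types; it prints the commands, pipe to sh to run):

```shell
# Derive output names from the F16 GGUF path and print one quantize
# command per quant type.
quantize_all() {
  local src="$1" outdir base q
  outdir=$(dirname "$src")
  base=$(basename "$src" -F16.gguf)
  for q in Q4_K_M Q5_K_M Q8_0; do
    echo "./llama.cpp/build/bin/llama-quantize $src $outdir/$base-$q.gguf $q"
  done
}
quantize_all ./Qwen3.5-text-0.8B-GGUF/Qwen3.5-text-0.8B-F16.gguf
# quantize_all ./Qwen3.5-text-0.8B-GGUF/Qwen3.5-text-0.8B-F16.gguf | sh
```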
Step 5 — Upload GGUF artifacts
Push GGUF directory.
qwen35-upload --local ./Qwen3.5-text-0.8B-GGUF --repo <your-hf-username>/Qwen3.5-text-0.8B-GGUF
Success criteria: GGUF files available on Hub repo.
Expected result
- Path A: BNB VLM + BNB text-only models verified and published.
- Path B: text-only f16 model converted to GGUF and published.
Common failures
- OOM during quantization -> rerun with --low-vram.
- Verify fails on mixed artifacts -> ensure strip output and source sizes match.
- GGUF conversion fails on VLM input -> use text-only model from qwen35-strip --mode f16.
- Hub auth errors -> rerun hf auth login or pass --hf-token.