Assistant-only loss masking
Gradients only where they belong. Masking is computed in character space before tokenization — works correctly with any subword vocabulary.
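A minimal sketch of what character-space masking can look like. This is an illustrative assumption, not the repo's actual implementation: assistant spans are recorded as `(start, end)` character offsets while the chat text is assembled, then projected onto token offsets after tokenization, so the mask stays correct regardless of how the subword vocabulary splits the text. The `<|role|>` template and helper names are hypothetical.

```python
IGNORE_INDEX = -100  # standard label ignored by PyTorch cross-entropy

def build_text_and_spans(messages):
    """Concatenate chat messages; record (start, end) char spans of assistant content."""
    text, spans = "", []
    for msg in messages:
        chunk = f"<|{msg['role']}|>{msg['content']}\n"
        if msg["role"] == "assistant":
            # Only the assistant's content characters receive loss,
            # not the role tag or the trailing newline.
            start = len(text) + len(f"<|{msg['role']}|>")
            spans.append((start, start + len(msg["content"])))
        text += chunk
    return text, spans

def mask_labels(token_ids, offsets, spans):
    """Keep a token's label only if its char offsets fall inside an assistant span."""
    labels = []
    for tid, (tok_start, tok_end) in zip(token_ids, offsets):
        inside = any(tok_start >= s and tok_end <= e for s, e in spans)
        labels.append(tid if inside else IGNORE_INDEX)
    return labels
```

With a real tokenizer the `offsets` would come from its offset mapping (e.g. `return_offsets_mapping=True` in Hugging Face fast tokenizers); because the comparison happens in character space, the same span list works for any vocabulary.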
A practical QLoRA workflow for smaller setups, focused on clean dataset handling, controlled truncation, and reproducible training and inference steps.
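One reading of "controlled truncation", sketched under assumptions (the repo's actual policy may differ): truncate to a fixed length, but drop any example whose supervised tokens are all cut away, so no training step is spent on a sample with zero learnable labels. `MAX_LEN` and the function name are hypothetical.

```python
MAX_LEN = 2048  # hypothetical sequence cap

def truncate(input_ids, labels, max_len=MAX_LEN):
    """Right-truncate; discard the example if no supervised labels survive."""
    input_ids, labels = input_ids[:max_len], labels[:max_len]
    if all(l == -100 for l in labels):
        return None  # everything left is masked out; nothing to learn from
    return input_ids, labels
```

Filtering these degenerate examples out up front also keeps loss curves comparable across runs, which helps with the reproducibility goal.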
| Repo | Purpose |
|---|---|
| qwen35-toolkit | Model prep — BNB quantization, visual tower strip, verify, upload |
| qwen-qlora-train (this repo) | LoRA training, adapter inference, CPU merge |
⚠️ Validated training on RTX 3070 8 GB currently covers Qwen3 1.7B and 4B (see Quickstart). Qwen3 8B OOMs on unsloth 2026.3.4+, and sizes above 4B should be treated as experimental on this hardware class.