§01
Abstract
LumiChats v1.3 11B Vision is a domain-adapted vision-language model fine-tuned from Meta's Llama 3.2 11B Vision Instruct for radiological image interpretation. The model generates structured, clinically accurate descriptions of panoramic radiographs, X-rays, and CT scans using precise medical terminology. Domain adaptation substantially reduces hallucinations relative to the general-purpose base model, which tends to fabricate pathologies such as fractures and misalignments not present in the image, and instead produces focused, evidence-based pathology descriptions. The model is released in 4-bit quantised format, reducing GPU memory requirements by approximately 60% while maintaining high accuracy.
§02
Architecture & Configuration
LumiChats v1.3 11B Vision is built on unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit using Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning technique in which fewer than 1% of the model's parameters are updated.
| Property | Value |
|---|---|
| Architecture | Auto-regressive multimodal transformer with vision encoder (MLLaMA) |
| Total Parameters | ~11B |
| Trainable Parameters | LoRA-adapted subset (< 1%) |
| Context Length | Extended for image-text alignment |
| Quantization | 4-bit bitsandbytes NF4 |
| LoRA Rank (r) | 16 |
| LoRA Alpha (α) | 16 |
| LoRA Target Modules | Vision layers, language layers, attention modules, MLP layers |
| Languages | English (medical terminology) |
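As a rough illustration of why the LoRA-adapted subset stays below 1% of the ~11B total: each adapted weight matrix gains only two small low-rank factors. The hidden size of 4096 and the module count below are illustrative assumptions, not values published with this model.

```python
def lora_params_per_matrix(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix:
    a down-projection A (r x d_in) plus an up-projection B (d_out x r)."""
    return r * d_in + d_out * r

# Assumed illustrative shape: a 4096x4096 attention projection at rank r=16.
per_matrix = lora_params_per_matrix(4096, 4096, 16)  # 131072 parameters

# Even adapting several hundred such matrices across vision, language,
# attention, and MLP layers remains a tiny fraction of 11B parameters.
total_lora = 400 * per_matrix        # hypothetical count of adapted matrices
fraction = total_lora / 11e9         # well under 1%
```

Scaling the rank `r` scales the adapter size linearly, which is why small ranks like 16 keep fine-tuning cheap.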
§03
Training Details
| Property | Value |
|---|---|
| Dataset | Medical radiological image-captioning dataset |
| Dataset Size | Radiology-specific image-caption pairs |
| Objective | Supervised fine-tuning: image → structured clinical description |
| Framework | Unsloth FastVisionModel + TRL |
| Hardware | Tesla T4 / A100-class GPU |
| Training Time | Short run (30 max steps) on the radiological dataset |
| Peak Memory | Significantly reduced via 4-bit quantization |
| Max Steps | 30 |
Hyperparameters
| Hyperparameter | Value |
|---|---|
| Learning Rate | 2e-4 |
| Batch Size (per device) | 2 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 8 |
| Optimizer | AdamW (8-bit) |
| LR Scheduler | Linear |
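These hyperparameters correspond one-to-one with standard Hugging Face `TrainingArguments` / TRL `SFTConfig` keyword arguments. The sketch below maps the reported values onto those standard argument names; the exact training script for this model is not published, so the mapping is an assumption.

```python
# Reported hyperparameters expressed as the standard Hugging Face
# TrainingArguments / TRL SFTConfig keyword arguments they correspond to
# (assumed mapping -- not confirmed from this model's training script).
training_args = dict(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=30,
    learning_rate=2e-4,
    optim="adamw_8bit",          # AdamW 8-bit optimizer
    lr_scheduler_type="linear",  # linear LR decay
)

# Effective batch size = per-device batch x gradient accumulation steps
effective_batch = (training_args["per_device_train_batch_size"]
                   * training_args["gradient_accumulation_steps"])  # 8
```

With an effective batch of 8 over 30 steps, the run sees roughly 240 image-caption pairs, consistent with the short-run caveat in the limitations section.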
§04
Evaluation & Benchmarks
| Metric | Value | Baseline | Description |
|---|---|---|---|
| Hallucination rate on panoramic radiographs | Substantially reduced | Base: fabricates fractures, misalignments, absent findings | Fine-tuned model focuses only on findings present in the image |
| Medical terminology accuracy | Improved | Base: general terms, occasional inaccuracies | Uses precise clinical language (osteolytic, resorption, maxillary sinus floor) |
| Output conciseness | Concise, actionable | Base: long, speculative descriptions | Reports are focused and clinically useful |
| Min VRAM (4-bit) | ~4.2 GB | — | 4-bit model size on disk / approximate VRAM requirement |
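The model card does not specify how the hallucination metric was computed. As a purely hypothetical illustration, a per-image hallucination rate could be defined as the fraction of reported findings absent from a radiologist's ground-truth annotations; the function name and finding labels below are invented for this sketch.

```python
def hallucination_rate(predicted: list[str], ground_truth: list[str]) -> float:
    """Fraction of predicted findings with no support in the ground truth.
    0.0 means every reported finding appears in the annotations."""
    if not predicted:
        return 0.0
    supported = set(ground_truth)
    fabricated = [f for f in predicted if f not in supported]
    return len(fabricated) / len(predicted)

# Example: the base model fabricates a fracture absent from the annotations.
rate = hallucination_rate(
    ["periapical lesion", "mandibular fracture"],  # model output
    ["periapical lesion"],                         # ground truth
)  # 0.5
```

A real evaluation would additionally need to normalise terminology (synonyms, negations) before matching findings.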
§05
Base Model vs Fine-Tuned
Key improvements from fine-tuning on the medical radiological image-captioning dataset versus the Llama-3.2-11B-Vision-Instruct-bnb-4bit base model.
| Dimension | Base (Llama-3.2-11B-Vision-Instruct-bnb-4bit) | LumiChats v1.3 11B Vision |
|---|---|---|
| Image type identification | ✅ Identifies correctly | ✅ Exact + pathology focused |
| Hallucinated pathologies | ❌ Fractures, misalignments (fabricated) | ✅ None — ground-truth only |
| Medical terminology | ⚠️ General terms, some inaccuracies | ✅ Professional clinical language |
| Output length | 📝 Long and speculative | 📝 Concise, actionable reports |
| Clinical relevance | ❌ Includes irrelevant details | ✅ Pathology-focused analysis |
§06
Use Cases
Preliminary analysis of panoramic radiographs and dental X-rays
Medical education and radiology training
Clinical documentation draft generation for radiologist review
Teleradiology support for initial triage of imaging studies
Research in AI-assisted medical image interpretation
§07
Limitations & Disclaimers
LumiChats v1.3 11B Vision inherits limitations of its base architecture and training data.
Research tool only — not a diagnostic device; always consult qualified radiologists
Optimised for panoramic radiographs; performance on other imaging modalities varies
Trained for only 30 steps; extended training is expected to improve generalisation
Should not be used for final diagnostic or treatment decisions
Requires compliance with local healthcare AI regulations (HIPAA, GDPR, etc.)
§08
Citation
If you use LumiChats v1.3 11B Vision in research or products, please cite:
@misc{lumichats-llama32-vision-11b-4bit,
  author    = {LumiChats Team},
  title     = {LumiChats v1.3 11B: Radiology-Specialised Vision-Language Model},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/adityakum667388/lumichats_v1.3_11b_vision}
}

License: Llama 3.2 Community License (Meta) + Apache 2.0 (fine-tune). View full license on Hugging Face.