§01
Abstract
LumiChats TTS 1B is a fine-tuned conditional text-to-speech model based on the CSM-1B (Conversational Speech Model) architecture. Using LoRA fine-tuning with only 1.75% of parameters updated, the model specialises in generating natural-sounding speech with consistent speaker identity and style preservation across extended sequences. When provided with reference audio as context, the model conditions on the speaker's voice characteristics (pitch, rate, pronunciation, and prosodic patterns) to produce new utterances that closely match the target voice. The model was fine-tuned on the MrDragonFox/Elise dataset, which provides high-quality speech recordings with consistent speaker characteristics.
§02
Architecture & Configuration
LumiChats TTS 1B is built on CSM-1B (Conversational Speech Model) using Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning technique that updates only 1.75% of the model's parameters.
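To make the LoRA mechanics concrete, here is a minimal NumPy sketch of a LoRA-adapted linear layer. The dimensions are toy values for illustration, not the model's actual shapes: the frozen base weight `W` is augmented by a trainable low-rank product `B @ A`, scaled by α/r.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """Forward pass of a LoRA-adapted linear layer.

    W (out_dim x in_dim) stays frozen; only the low-rank factors
    A (r x in_dim) and B (out_dim x r) are trained.  The low-rank
    update is scaled by alpha / r.
    """
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
in_dim, out_dim, r, alpha = 8, 8, 2, 2  # toy sizes; the model uses r=32, alpha=32
W = rng.standard_normal((out_dim, in_dim))
A = rng.standard_normal((r, in_dim))
B = np.zeros((out_dim, r))  # B starts at zero, so the adapter is initially a no-op
x = rng.standard_normal(in_dim)

# With B = 0 the adapted layer reproduces the frozen layer exactly.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)
```

Because `B` is zero-initialised, fine-tuning starts from exactly the base model's behaviour and the low-rank update is learned from there.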
| Property | Value |
|---|---|
| Architecture | Transformer-based conditional TTS: text encoder + audio encoder + audio decoder |
| Total parameters | ~1 billion |
| Trainable parameters | ~17.5M (1.75%) |
| Context length | 2,048 tokens |
| Quantization | 16-bit (FP16) SafeTensors; 4-bit quantisation available |
| LoRA rank (r) | 32 |
| LoRA alpha (α) | 32 |
| LoRA target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Languages | English (primary) |
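The trainable-parameter figure follows directly from the LoRA rank: each adapted linear layer of shape d_out × d_in contributes r·(d_in + d_out) parameters via its two low-rank factors. A quick sketch (the 2048-wide projection below is a hypothetical example, not a confirmed model dimension):

```python
def lora_param_count(r, d_in, d_out):
    # A is (r x d_in) and B is (d_out x r), so the adapter adds
    # r * (d_in + d_out) trainable parameters per adapted layer.
    return r * (d_in + d_out)

# Hypothetical square 2048-wide projection adapted with r=32:
print(lora_param_count(32, 2048, 2048))  # 131072 trainable params for this layer
```

Summing this over all target modules across every layer yields the ~17.5M trainable parameters reported above.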
§03
Training Details
| Property | Value |
|---|---|
| Dataset | MrDragonFox/Elise (high-quality speech recordings with diverse linguistic contexts) |
| Objective | Conditional speech generation: text + optional reference audio → audio waveform |
| Framework | Hugging Face Transformers + TRL |
| Hardware | CUDA GPU |
| Training steps | 60 |
| Memory efficiency | Gradient checkpointing + 8-bit optimiser |
Hyperparameters
| Hyperparameter | Value |
|---|---|
| Learning rate | 2e-4 |
| Per-device batch size | 2 |
| Gradient accumulation steps | 4 |
| Effective batch size | 8 |
| Optimizer | AdamW (8-bit) |
| LR scheduler | Linear |
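The effective batch size in the table is derived from the per-device batch size and the gradient-accumulation steps, as this small check shows:

```python
def effective_batch(per_device, grad_accum, num_devices=1):
    # Gradients are accumulated over grad_accum micro-batches before each
    # optimiser step, multiplying the effective batch size per device.
    return per_device * grad_accum * num_devices

print(effective_batch(2, 4))  # 8, matching the table above
```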
§04
Evaluation & Benchmarks
| Metric | Value | Baseline | Description |
|---|---|---|---|
| Voice consistency (subjective) | Substantially improved over base | — | Maintains speaker characteristics across generated sentences |
| Naturalness | Improved on domain-specific content | — | More human-like prosody on target dataset characteristics |
| Style fidelity (reference audio) | High | — | Preserves speaking rate, pitch range, and emphasis patterns |
| Audio sample rate | 24,000 Hz | — | Output audio sampling rate |
| Min VRAM (inference) | 8 GB | — | Minimum for stable single-sample inference |
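The weight-only memory footprint behind the 16-bit and 4-bit options can be estimated from the parameter count. This is a rough back-of-envelope sketch; real VRAM use also includes activations and framework overhead, which is why 8 GB is the stated minimum:

```python
def weight_memory_gb(n_params, bits_per_param):
    # Weight storage only: params * bits / 8 bits-per-byte / 1e9 bytes-per-GB.
    # Runtime VRAM also holds activations and optimiser/framework state.
    return n_params * bits_per_param / 8 / 1e9

print(weight_memory_gb(1e9, 16))  # 2.0 GB of FP16 weights for ~1B params
print(weight_memory_gb(1e9, 4))   # 0.5 GB with 4-bit quantisation
```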
§05
Base Model vs Fine-Tuned
Key improvements from fine-tuning on the MrDragonFox/Elise dataset versus the CSM-1B base model.
| Dimension | Base (CSM-1B) | LumiChats TTS 1B |
|---|---|---|
| Speaker identity preservation | Moderate (general model) | ✅ High consistency across generation |
| Prosodic pattern capture | Generic prosody | ✅ Target speaker's pace and emphasis |
| Natural speech quality on target domain | Good | ✅ Improved naturalness |
| Reference audio conditioning | Supported (base behaviour) | ✅ Strongly conditioned |
§06
Use Cases
- Consistent voiceovers for videos, podcasts, and audiobooks
- Voice assistant personalisation with user-provided reference audio
- Educational narration tools
- Accessibility tools: screen readers and audio interfaces
- Gaming character voice generation
- Content creator productivity: re-recording edits in the original voice
§07
Limitations & Disclaimers
LumiChats TTS 1B inherits limitations of its base architecture and training data.
- Optimised for the Elise dataset characteristics; may not generalise to all voices
- Best results in English; multilingual capability is limited
- Single-speaker optimisation (`speaker_id=0` as the default configuration)
- Voice cloning requires reference audio and explicit user consent in production
- 4-bit quantisation may affect subtle audio qualities
- Trained for only 60 steps; extended training on additional data should improve generalisation
§08
Citation
If you use LumiChats TTS 1B in research or products, please cite:
```bibtex
@misc{lumichats_tts_1b_2025,
  author    = {adityakum667388},
  title     = {LumiChats TTS 1B: Fine-Tuned Voice Synthesis Model on CSM-1B},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/adityakum667388/lumichats_TTS_1B_finetune_16bit},
  note      = {Fine-tuned from csm-1b using LoRA on MrDragonFox/Elise}
}
```

License: Apache 2.0. View the full license on Hugging Face.