Text-to-Speech | v1.0 | July 1, 2025

LumiChats TTS 1B

Voice-consistent text-to-speech model with reference audio cloning

Parameters: ~1B
Trainable: 1.75%
Training time: 60 training steps
Dataset: High-quality speech recordings
Apache 2.0
Only 1.75% of parameters updated via LoRA: efficient speaker adaptation
Reference audio conditioning for voice cloning
Maintains speaker identity across multi-sentence generation
Captures prosodic patterns: pitch, rate, emphasis, and speaking style
Apache 2.0 license: full commercial use permitted
4-bit quantised deployment option for consumer hardware
§01

Abstract

LumiChats TTS 1B is a fine-tuned conditional text-to-speech model based on the CSM-1B (Context-Speech-Model) architecture. Using LoRA fine-tuning with only 1.75% of parameters updated, the model specialises in generating natural-sounding speech with consistent speaker identity and style preservation across extended sequences. When provided with a reference audio context, the model conditions on the speaker's voice characteristics — pitch, rate, pronunciation, and prosodic patterns — to produce new utterances that closely match the target voice. The model was fine-tuned on the MrDragonFox/Elise dataset, which provides high-quality speech recordings with consistent speaker characteristics.
§02

Architecture & Configuration

LumiChats TTS 1B is built on csm-1b (Context-Speech-Model) using Low-Rank Adaptation (LoRA) — a parameter-efficient fine-tuning technique. Only 1.75% of parameters are updated.

Architecture
Transformer-based conditional TTS: Text Encoder + Audio Encoder + Audio Decoder
Total Parameters
~1 billion
Trainable Parameters
~17.5M (1.75%)
Context Length
2,048 tokens
Quantization
16-bit (FP16) SafeTensors; 4-bit quantisation available
LoRA Rank (r)
32
LoRA Alpha (α)
32
LoRA Target Modules
q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Languages
English (primary)
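The trainable fraction follows from the LoRA configuration above: adapting a weight matrix W of shape (d_out, d_in) at rank r adds r × (d_in + d_out) trainable parameters. A minimal sketch of that arithmetic, using assumed hidden/MLP dimensions for illustration (2048 and 8192 are hypothetical, not the actual csm-1b sizes):

```python
def lora_param_count(layer_shapes, r):
    """Extra trainable parameters LoRA adds: for each adapted weight
    W of shape (d_out, d_in), LoRA trains A (r x d_in) and
    B (d_out x r), i.e. r * (d_in + d_out) parameters."""
    return sum(r * (d_in + d_out) for d_out, d_in in layer_shapes)

# Hypothetical shapes for the seven target projections of one
# transformer block (hidden size 2048, MLP size 8192 assumed):
h, m = 2048, 8192
block = [
    (h, h), (h, h), (h, h), (h, h),  # q_proj, k_proj, v_proj, o_proj
    (m, h), (m, h), (h, m),          # gate_proj, up_proj, down_proj
]
per_block = lora_param_count(block, r=32)
print(per_block)  # LoRA parameters added per block at rank 32
```

Summed over all blocks, counts of this form yield the ~17.5M trainable parameters (1.75% of ~1B) reported in the table.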
§03

Training Details

Dataset
MrDragonFox/Elise
Dataset Description
High-quality speech recordings with diverse linguistic contexts
Objective
Conditional speech generation: text + optional reference audio → audio waveform
Framework
Hugging Face Transformers + TRL
Hardware
CUDA GPU
Training Time
60 training steps
Memory Optimisation
Gradient checkpointing + 8-bit optimiser
Max Steps
60
Hyperparameters
Learning Rate
2e-4
Batch Size
2
Gradient Accum.
4
Effective Batch
8
Optimizer
AdamW 8-bit
LR Scheduler
Linear
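The schedule arithmetic implied by these hyperparameters can be sketched directly (the zero-warmup linear decay is an assumption; the card does not state a warmup setting):

```python
# Training schedule arithmetic for the hyperparameters above.
batch_size = 2
grad_accum = 4
max_steps = 60

effective_batch = batch_size * grad_accum   # samples per optimiser step
samples_seen = effective_batch * max_steps  # total samples processed

# Linear LR decay from 2e-4 to 0 over max_steps (no warmup assumed):
def lr_at(step, base_lr=2e-4, total=max_steps):
    return base_lr * max(0.0, 1.0 - step / total)

print(effective_batch, samples_seen, lr_at(30))
```

With gradient accumulation of 4 over per-device batches of 2, each optimiser step sees 8 samples, so the full 60-step run processes roughly 480 samples.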
§04

Evaluation & Benchmarks

Metric | Value | Description
Voice consistency (subjective) | Substantially improved over base | Maintains speaker characteristics across generated sentences
Naturalness | Improved on domain-specific content | More human-like prosody on target dataset characteristics
Style fidelity (reference audio) | High | Preserves speaking rate, pitch range, and emphasis patterns
Audio sample rate | 24,000 Hz | Output audio sampling rate
Min VRAM (inference) | 8 GB | Minimum for stable single-sample inference
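The 24,000 Hz output rate fixes the size of generated audio. A quick back-of-envelope helper, assuming uncompressed mono 16-bit PCM (the actual on-disk format may differ):

```python
SAMPLE_RATE = 24_000  # Hz, output rate from the table above

def pcm16_bytes(duration_s, sample_rate=SAMPLE_RATE, channels=1):
    """Raw size of uncompressed 16-bit PCM audio (2 bytes per sample)."""
    return int(duration_s * sample_rate) * channels * 2

ten_sec = pcm16_bytes(10.0)
print(ten_sec)  # 480000 bytes, i.e. ~469 KiB for 10 s of mono speech
```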
§05

Base Model vs Fine-Tuned

Key improvements from fine-tuning on the MrDragonFox/Elise dataset versus the csm-1b (Context-Speech-Model) base model.

Dimension | Base (csm-1b) | LumiChats TTS 1B
Speaker identity preservation | Moderate (general model) | ✅ High consistency across generation
Prosodic pattern capture | Generic prosody | ✅ Target speaker's pace and emphasis
Natural speech quality on target domain | Good | ✅ Improved naturalness
Reference audio conditioning | Supported (base behaviour) | ✅ Strongly conditioned
§06

Use Cases

Consistent voiceovers for videos, podcasts, and audiobooks
Voice assistant personalisation with user-provided reference audio
Educational narration tools
Accessibility tools: screen readers and audio interfaces
Gaming character voice generation
Content creator productivity — re-recording edits in original voice
§07

Limitations & Disclaimers

LumiChats TTS 1B inherits limitations of its base architecture and training data.

Optimised for the Elise dataset characteristics — may not generalise to all voices
Best results in English; multilingual capability is limited
Single-speaker optimisation (speaker_id=0 as default configuration)
Voice cloning requires reference audio and explicit user consent in production
4-bit quantisation may affect subtle audio qualities
Only 60 training steps; extended training on additional data would likely improve generalisation
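The 4-bit trade-off mentioned above can be put in rough numbers. A minimal sketch estimating weight memory alone for a ~1B-parameter model (this ignores activations, the KV cache, and quantisation overhead, so real VRAM needs, such as the 8 GB figure in §04, are higher):

```python
def weight_bytes(n_params, bits):
    """Approximate memory for model weights alone (ignores
    activations, KV cache, and quantisation overhead)."""
    return n_params * bits // 8

n = 1_000_000_000  # ~1B parameters
fp16 = weight_bytes(n, 16)  # 16-bit weights: ~2.0 GB
int4 = weight_bytes(n, 4)   # 4-bit weights:  ~0.5 GB
print(fp16 / 1e9, int4 / 1e9)
```

The roughly 4x reduction in weight memory is what makes the 4-bit deployment option viable on consumer hardware.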
§08

Citation

If you use LumiChats TTS 1B in research or products, please cite:

@misc{lumichats_tts_1b_2025,
  author    = {adityakum667388},
  title     = {LumiChats TTS 1B: Fine-Tuned Voice Synthesis Model on CSM-1B},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/adityakum667388/lumichats_TTS_1B_finetune_16bit},
  note      = {Fine-tuned from csm-1b using LoRA on MrDragonFox/Elise}
}
License: Apache 2.0. View the full license on Hugging Face.