Text-to-Speech | v1.0 | July 1, 2025

LumiChats TTS 1B

Voice-consistent text-to-speech model with reference audio cloning

Parameters: ~1B
Trainable: 1.75%
Training time: 60 training steps
Dataset: High-quality speech recordings
Apache 2.0
Only 1.75% of parameters updated via LoRA: efficient speaker adaptation
Reference audio conditioning for voice cloning
Maintains speaker identity across multi-sentence generation
Captures prosodic patterns: pitch, rate, emphasis, and speaking style
Apache 2.0 license: full commercial use permitted
4-bit quantised deployment option for consumer hardware
§01

Abstract

LumiChats TTS 1B is a fine-tuned conditional text-to-speech model based on the CSM-1B (Context-Speech-Model) architecture. Using LoRA fine-tuning with only 1.75% of parameters updated, the model specialises in generating natural-sounding speech with consistent speaker identity and style preservation across extended sequences. When provided with a reference audio context, the model conditions on the speaker's voice characteristics — pitch, rate, pronunciation, and prosodic patterns — to produce new utterances that closely match the target voice. The model was fine-tuned on the MrDragonFox/Elise dataset, which provides high-quality speech recordings with consistent speaker characteristics.
§02

Architecture & Configuration

LumiChats TTS 1B is built on csm-1b (Context-Speech-Model) using Low-Rank Adaptation (LoRA) — a parameter-efficient fine-tuning technique. Only 1.75% of parameters are updated.

Architecture
Transformer-based conditional TTS: Text Encoder + Audio Encoder + Audio Decoder
Total Parameters
~1 billion
Trainable Parameters
~17.5M (1.75%)
Context Length
2,048 tokens
Quantization
16-bit (FP16) SafeTensors; 4-bit quantisation available
LoRA Rank (r)
32
LoRA Alpha (α)
32
LoRA Target Modules
q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Languages
English (primary)
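The trainable fraction follows from the LoRA configuration above: adapting a weight matrix W of shape (d_out, d_in) at rank r adds r × (d_in + d_out) trainable parameters. A minimal sketch of that arithmetic, using assumed hidden/MLP dimensions for illustration (2048 and 8192 are hypothetical, not the actual csm-1b sizes):

```python
def lora_param_count(layer_shapes, r):
    """Extra trainable parameters LoRA adds: for each adapted weight
    W of shape (d_out, d_in), LoRA trains A (r x d_in) and
    B (d_out x r), i.e. r * (d_in + d_out) parameters."""
    return sum(r * (d_in + d_out) for d_out, d_in in layer_shapes)

# Hypothetical shapes for the seven target projections of one
# transformer block (hidden size 2048, MLP size 8192 assumed):
h, m = 2048, 8192
block = [
    (h, h), (h, h), (h, h), (h, h),  # q_proj, k_proj, v_proj, o_proj
    (m, h), (m, h), (h, m),          # gate_proj, up_proj, down_proj
]
per_block = lora_param_count(block, r=32)
print(per_block)  # LoRA parameters added per block at rank 32
```

Summed over all blocks, counts of this form yield the ~17.5M trainable parameters (1.75% of ~1B) reported in the table.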
§03

Training Details

Dataset
MrDragonFox/Elise
Dataset Description
High-quality speech recordings with diverse linguistic contexts
Objective
Conditional speech generation: text + optional reference audio → audio waveform
Framework
Hugging Face Transformers + TRL
Hardware
CUDA GPU
Training Time
60 training steps
Memory Optimisation
Gradient checkpointing + 8-bit optimiser
Max Steps
60
Hyperparameters
Learning Rate
2e-4
Batch Size
2
Gradient Accum.
4
Effective Batch
8
Optimizer
AdamW 8-bit
LR Scheduler
Linear
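The schedule arithmetic implied by these hyperparameters can be sketched directly (the zero-warmup linear decay is an assumption; the card does not state a warmup setting):

```python
# Training schedule arithmetic for the hyperparameters above.
batch_size = 2
grad_accum = 4
max_steps = 60

effective_batch = batch_size * grad_accum   # samples per optimiser step
samples_seen = effective_batch * max_steps  # total samples processed

# Linear LR decay from 2e-4 to 0 over max_steps (no warmup assumed):
def lr_at(step, base_lr=2e-4, total=max_steps):
    return base_lr * max(0.0, 1.0 - step / total)

print(effective_batch, samples_seen, lr_at(30))
```

With gradient accumulation of 4 over per-device batches of 2, each optimiser step sees 8 samples, so the full 60-step run processes roughly 480 samples.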
§04

Evaluation & Benchmarks

Metric | Value | Description
Voice consistency (subjective) | Substantially improved over base | Maintains speaker characteristics across generated sentences
Naturalness | Improved on domain-specific content | More human-like prosody on target dataset characteristics
Style fidelity (reference audio) | High | Preserves speaking rate, pitch range, and emphasis patterns
Audio sample rate | 24,000 Hz | Output audio sampling rate
Min VRAM (inference) | 8 GB | Minimum for stable single-sample inference
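The 24,000 Hz output rate fixes the size of generated audio. A quick back-of-envelope helper, assuming uncompressed mono 16-bit PCM (the actual on-disk format may differ):

```python
SAMPLE_RATE = 24_000  # Hz, output rate from the table above

def pcm16_bytes(duration_s, sample_rate=SAMPLE_RATE, channels=1):
    """Raw size of uncompressed 16-bit PCM audio (2 bytes per sample)."""
    return int(duration_s * sample_rate) * channels * 2

ten_sec = pcm16_bytes(10.0)
print(ten_sec)  # 480000 bytes, i.e. ~469 KiB for 10 s of mono speech
```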
§05

Base Model vs Fine-Tuned

Key improvements from fine-tuning on the MrDragonFox/Elise dataset versus the csm-1b (Context-Speech-Model) base model.

Dimension | Base (csm-1b) | LumiChats TTS 1B
Speaker identity preservation | Moderate (general model) | ✅ High consistency across generation
Prosodic pattern capture | Generic prosody | ✅ Target speaker's pace and emphasis
Natural speech quality on target domain | Good | ✅ Improved naturalness
Reference audio conditioning | Supported (base behaviour) | ✅ Strongly conditioned
§06

Use Cases

Consistent voiceovers for videos, podcasts, and audiobooks
Voice assistant personalisation with user-provided reference audio
Educational narration tools
Accessibility tools: screen readers and audio interfaces
Gaming character voice generation
Content creator productivity — re-recording edits in original voice
§07

Limitations & Disclaimers

LumiChats TTS 1B inherits limitations of its base architecture and training data.

Optimised for the Elise dataset characteristics — may not generalise to all voices
Best results in English; multilingual capability is limited
Single-speaker optimisation (speaker_id=0 as default configuration)
Voice cloning requires reference audio and explicit user consent in production
4-bit quantisation may affect subtle audio qualities
Only 60 training steps; extended training on additional data would likely improve generalisation
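The 4-bit trade-off mentioned above can be put in rough numbers. A minimal sketch estimating weight memory alone for a ~1B-parameter model (this ignores activations, the KV cache, and quantisation overhead, so real VRAM needs, such as the 8 GB figure in §04, are higher):

```python
def weight_bytes(n_params, bits):
    """Approximate memory for model weights alone (ignores
    activations, KV cache, and quantisation overhead)."""
    return n_params * bits // 8

n = 1_000_000_000  # ~1B parameters
fp16 = weight_bytes(n, 16)  # 16-bit weights: ~2.0 GB
int4 = weight_bytes(n, 4)   # 4-bit weights:  ~0.5 GB
print(fp16 / 1e9, int4 / 1e9)
```

The roughly 4x reduction in weight memory is what makes the 4-bit deployment option viable on consumer hardware.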
§08

Citation

If you use LumiChats TTS 1B in research or products, please cite:

@misc{lumichats_tts_1b_2025,
  author    = {adityakum667388},
  title     = {LumiChats TTS 1B: Fine-Tuned Voice Synthesis Model on CSM-1B},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/adityakum667388/lumichats_TTS_1B_finetune_16bit},
  note      = {Fine-tuned from csm-1b using LoRA on MrDragonFox/Elise}
}
License: Apache 2.0. View the full license on Hugging Face.