2025-03-09 Programming, Technology, Productivity

How to Download, Train, and Run Coqui TTS on a Mac (Text-to-Speech with Custom Voice)

By O. Wolfson

Coqui TTS is an open-source text-to-speech (TTS) system that allows you to generate speech from text. You can also train it with your own voice to create a personalized TTS model. This guide covers:

  • Installing Coqui TTS on a Mac (Apple Silicon and Intel)
  • Running pre-trained models
  • Training a custom voice model
  • Generating speech from text locally

1. Install Coqui TTS on a Mac

1.1 Prerequisites

Ensure you have the following installed:

  • Python 3.9 to 3.11 (3.10 recommended; recent TTS releases do not install on Python 3.12 or later)
  • Homebrew (for package management)
  • ffmpeg (for audio processing)
  • PyTorch with Metal support (for Apple Silicon GPUs)
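
The command-line prerequisites above can be checked from Python before going further. This is a minimal sketch using only the standard library; the function name check_prerequisites and the default minimum version of 3.9 are assumptions for illustration, so pass whichever version your TTS release requires.

```python
import shutil
import sys


def check_prerequisites(min_version=(3, 9)):
    """Return a dict mapping each prerequisite to True (found) or False."""
    return {
        # sys.version_info compares like a tuple, e.g. (3, 10, 4) >= (3, 9)
        "python": sys.version_info >= min_version,
        # shutil.which returns None when the executable is not on PATH
        "brew": shutil.which("brew") is not None,
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

Anything reported as missing is covered by the installation steps that follow.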

Step 1: Install Homebrew (if not installed)

sh
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Step 2: Install Dependencies

sh
brew install ffmpeg

Step 3: Set Up a Virtual Environment

Using a virtual environment ensures that all packages are installed in an isolated directory, preventing conflicts with system-wide dependencies.

  1. Create a virtual environment:

    sh
    python3 -m venv coqui-venv
    
  2. Activate the virtual environment:

    • On macOS/Linux:
      sh
      source coqui-venv/bin/activate
      
    • On Windows: (if applicable)
      sh
      coqui-venv\Scripts\activate
      

Once activated, your shell prompt may change, indicating you are inside the virtual environment.
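
If the prompt does not change, you can confirm from Python that the environment is active. The sketch below relies on the standard behavior that sys.prefix differs from sys.base_prefix inside a venv; the helper name in_virtualenv is made up for this example.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-style environment."""
    # Outside a venv, sys.prefix and sys.base_prefix are the same path.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

When the environment is active, the interpreter path should point inside the coqui-venv directory.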

Step 4: Install PyTorch (for Apple Silicon Macs)

sh
pip install torch torchvision torchaudio

Step 5: Verify GPU Support (Apple Silicon only)

Run the following in a Python interpreter to check whether the Metal (MPS) backend is available:

python
import torch
print(torch.backends.mps.is_available())  # Prints True when the MPS backend is usable (Apple Silicon)
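
The same check can be wrapped in a helper that falls back to the CPU, which is handy in scripts that should run on both Apple Silicon and Intel Macs. This is a sketch, not part of Coqui TTS; the name pick_device is an assumption.

```python
def pick_device():
    """Return "mps" when PyTorch reports Metal support, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed yet
    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print("Selected device:", pick_device())
```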

Step 6: Install Coqui TTS

sh
pip install TTS

Step 7: Verify Installation

Check available models:

sh
tts --list_models

To exit the virtual environment, run:

sh
deactivate

2. Running a Pre-Trained Model (Quick Test)

Ensure your virtual environment is activated before running:

sh
source coqui-venv/bin/activate

Run a basic TTS model to check if everything works:

sh
tts --text "Hello, this is a test of Coqui TTS." --model_name tts_models/en/ljspeech/tacotron2-DDC

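This downloads the pre-trained model on first use and writes the speech to tts_output.wav in the current directory (pass --out_path to choose a different file).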
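
Model names follow the pattern type/language/dataset/model, and the same command can be assembled from a script. The sketch below only builds and prints the command; the helper names parse_model_name and build_tts_command are made up for this example, and the resulting list could be passed to subprocess.run once the tts command is installed.

```python
import shlex


def parse_model_name(name):
    """Split "<type>/<language>/<dataset>/<model>" into labeled parts."""
    parts = name.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 slash-separated parts, got {name!r}")
    return dict(zip(("type", "language", "dataset", "model"), parts))


def build_tts_command(text, model_name, out_path):
    """Build the argument list for the tts CLI (suitable for subprocess.run)."""
    return ["tts", "--text", text, "--model_name", model_name, "--out_path", out_path]


cmd = build_tts_command(
    "Hello, this is a test of Coqui TTS.",
    "tts_models/en/ljspeech/tacotron2-DDC",
    "test.wav",
)
print(parse_model_name(cmd[4]))
print(shlex.join(cmd))  # shell-quoted form, for copy and paste
```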


3. Training Coqui TTS with Your Own Voice

3.1 Prepare Your Voice Dataset

You need:

  • 1–5 hours of high-quality recordings (WAV format, preferably sampled at 22.05 kHz or 44.1 kHz).
  • A transcript file (CSV or JSON) that maps each recording to the text spoken in it.
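
To see whether a set of recordings meets these targets, you can total their duration and collect their sample rates. This is a rough sketch using Python's built-in wave module, which reads only uncompressed PCM WAV files; the function names wav_info and summarize are assumptions for illustration.

```python
import wave
from pathlib import Path


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def summarize(wav_dir):
    """Return (total_hours, set_of_sample_rates) for every .wav file in wav_dir."""
    infos = [wav_info(p) for p in sorted(Path(wav_dir).glob("*.wav"))]
    total_hours = sum(duration for _, _, duration in infos) / 3600
    return total_hours, {rate for rate, _, _ in infos}


# Example usage (the path is a placeholder):
# hours, rates = summarize("/my-dataset/wavs")
# print(f"{hours:.2f} hours, sample rates: {sorted(rates)}")
```

More than one sample rate in the result usually means some files should be resampled (for example with ffmpeg) before training.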

Dataset Folder Structure

text
/my-dataset/
├── wavs/
│   ├── audio_001.wav
│   ├── audio_002.wav
│   ├── ...
├── metadata.csv

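Example metadata.csv format

Each line pairs a recording with its transcript, separated by a pipe (|). The exact columns must match the dataset formatter named in your training config; Coqui's built-in ljspeech formatter, for example, expects the file ID without the .wav extension, followed by the raw text and the normalized text (audio_001|text|normalized text).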

text
audio_001.wav|Hello, this is my voice.
audio_002.wav|I am training my own TTS model.
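
A mismatch between metadata.csv and the wavs folder is easier to catch before training starts. The sketch below checks the pipe-delimited layout shown above using only the standard library; validate_metadata is a hypothetical helper, and it accepts file names written with or without the .wav extension.

```python
from pathlib import Path


def validate_metadata(dataset_dir, meta_name="metadata.csv"):
    """Return a list of problems found in a pipe-delimited metadata file."""
    root = Path(dataset_dir)
    problems = []
    lines = (root / meta_name).read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("|")
        if len(fields) < 2 or not fields[1].strip():
            problems.append(f"line {number}: missing transcript")
            continue
        name = fields[0].strip()
        if not name.endswith(".wav"):
            name += ".wav"  # accept IDs written without the extension
        if not (root / "wavs" / name).is_file():
            problems.append(f"line {number}: wavs/{name} not found")
    return problems


# Example usage (the path is a placeholder):
# for problem in validate_metadata("/my-dataset"):
#     print(problem)
```

An empty list means every line has a transcript and a matching audio file.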

3.2 Train the Model

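Training is launched through Coqui's training script rather than the tts synthesis command. The dataset location is read from the datasets entry in the config file, so point it at /my-dataset/ before you start:

sh
python -m TTS.bin.train_tts --config_path configs/your_config.json

For fine-tuning an existing model:

sh
python -m TTS.bin.train_tts --config_path configs/your_config.json --restore_path path/to/pretrained/model.pth

For Apple Silicon Macs, enable GPU acceleration (MPS):

  1. Open your_config.json and change:
json
"device": "mps"
  2. Start training:
sh
python -m TTS.bin.train_tts --config_path configs/your_config.json

Whether this setting is honored depends on the TTS version you installed, so check the training log or the GPU history in Activity Monitor to confirm that the GPU is actually in use.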
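
Coqui training configs typically carry the dataset location themselves, so it can help to set it programmatically rather than by hand. The sketch below edits a JSON config in place; the key names datasets, path, and meta_file_train reflect my understanding of Coqui's dataset config and may differ between versions, and the helper assumes plain JSON without comments.

```python
import json
from pathlib import Path


def point_config_at_dataset(config_path, dataset_dir, meta_file="metadata.csv"):
    """Set the first dataset entry of a JSON training config and save it."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    datasets = config.setdefault("datasets", [{}])  # assumed key name
    if not datasets:
        datasets.append({})
    datasets[0]["path"] = str(dataset_dir)  # assumed key name
    datasets[0]["meta_file_train"] = meta_file  # assumed key name
    config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config


# Example usage (paths are placeholders):
# point_config_at_dataset("configs/your_config.json", "/my-dataset/")
```

Compare the result against a config that ships with your TTS version before relying on it.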

4. Generating Speech from a Trained Model

Ensure your virtual environment is activated before running:

sh
source coqui-venv/bin/activate

Once the model is trained, you can generate speech files:

sh
tts --text "This is my custom trained voice." \
    --model_path path/to/your/trained_model.pth \
    --config_path path/to/config.json \
    --out_path output.wav

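For batch processing multiple sentences, use Python:

python
from TTS.api import TTS

# Load the trained model; both the checkpoint and its config file are required
tts = TTS(model_path="path/to/your/trained_model.pth", config_path="path/to/config.json")

sentences = ["Hello, this is my voice.", "I am training my own TTS model."]

# Generate one WAV file per sentence
for i, sentence in enumerate(sentences, start=1):
    tts.tts_to_file(text=sentence, file_path=f"output_{i}.wav")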
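
For longer passages, it can be convenient to split the text into sentences and give each one a numbered output file. This is a naive sketch: the regular expression splits after ., ! or ? followed by whitespace and will mishandle abbreviations, and the names split_sentences and plan_batch are made up for this example.

```python
import re
from pathlib import Path


def split_sentences(text):
    """Split text after ., ! or ? followed by whitespace (a naive rule)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def plan_batch(text, out_dir="out"):
    """Pair each sentence with a numbered WAV path such as out/line_001.wav."""
    return [
        (sentence, Path(out_dir) / f"line_{i:03d}.wav")
        for i, sentence in enumerate(split_sentences(text), start=1)
    ]


for sentence, path in plan_batch("Hello, this is my voice. I am training my own TTS model."):
    print(path, "<-", sentence)
    # With a loaded model: tts.tts_to_file(text=sentence, file_path=str(path))
```

Create the output directory first (for example with Path("out").mkdir(exist_ok=True)) before writing files into it.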

To play the audio on Mac:

sh
afplay output.wav

5. Deploying a Local TTS API

You can turn Coqui TTS into an API to generate speech via HTTP requests.

Step 1: Install FastAPI & Uvicorn

sh
pip install fastapi uvicorn

Step 2: Create server.py

python
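from fastapi import FastAPI
from TTS.api import TTS

app = FastAPI()

# Load the model once at startup; the checkpoint and its config file are both required
tts = TTS(model_path="path/to/your/trained_model.pth", config_path="path/to/config.json")

@app.get("/synthesize/")
def synthesize(text: str):
    # A plain (non-async) handler runs in FastAPI's thread pool, so the
    # blocking synthesis call does not stall the event loop.
    output_file = "output.wav"  # overwritten on every request
    tts.tts_to_file(text=text, file_path=output_file)
    return {"message": "Speech generated", "file": output_file}

if __name__ == "__main__":
    import uvicorn
    # Bind to localhost only; use "0.0.0.0" to accept connections from other machines
    uvicorn.run(app, host="127.0.0.1", port=8000)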

Step 3: Run the API

sh
python server.py

Step 4: Test the API

sh
curl "http://localhost:8000/synthesize/?text=Hello%20world"

This generates output.wav in the server's working directory and returns a JSON response containing its path.
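
The same request can be made from Python without extra dependencies. The sketch below assumes the server from Step 2 is listening on localhost:8000; build_url and synthesize are hypothetical helper names, and only the URL construction runs when the script is executed as shown.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000/synthesize/"


def build_url(text, base_url=BASE_URL):
    """Percent-encode the text and append it as the "text" query parameter."""
    return f"{base_url}?{urlencode({'text': text})}"


def synthesize(text, base_url=BASE_URL, timeout=120):
    """Call the endpoint and return the decoded JSON response as a dict."""
    with urlopen(build_url(text, base_url), timeout=timeout) as response:
        return json.load(response)


print(build_url("Hello world"))  # http://localhost:8000/synthesize/?text=Hello+world
# With the server running: print(synthesize("Hello world"))
```

Letting urlencode handle the escaping avoids hand-writing sequences such as %20 for text that contains spaces or punctuation.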


6. Deploying to the Cloud

If you need cloud deployment, you can:

  • Use Google Colab for training (free GPU access)
  • Deploy on RunPod.io / Lambda Labs for cheap GPU rentals
  • Use AWS / GCP for production-grade hosting
  • Host a web app using Hugging Face Spaces

Conclusion

Coqui TTS allows you to train and run a text-to-speech model on a Mac, including custom voice training. Apple Silicon Macs can leverage MPS acceleration, but if training is too slow, cloud GPUs are an option.

With this setup, you can generate custom TTS audio files, deploy a local API, or even build your own AI voice assistant.