2024-09-30 Web Development

How to Convert Text to Speech Using Google Cloud Text-to-Speech API

By O Wolfson

Google Cloud Text-to-Speech API allows developers to synthesize natural-sounding speech from text. This guide will walk you through the process of setting up the API, obtaining the necessary credentials, and writing a Node.js script to convert text to speech.

Step 1: Set Up a Google Cloud Project

Create a Google Cloud Project:
- Go to the Google Cloud Console.
- Click on the project dropdown at the top of the page and select "New Project."
- Enter a name for your project and click "Create."
Enable the Text-to-Speech API:
- Once your project is created, navigate to the Text-to-Speech API page.
- Click "Enable" to enable the API for your project.

Step 2: Set Up Service Account Credentials

Create a Service Account:
- In the Google Cloud Console, go to the Service Accounts page.
- Click "Create Service Account."
- Enter a name and description for your service account, then click "Create."
Grant the Service Account Access:
- On the next screen, select the "Text-to-Speech API User" role from the dropdown.
- Click "Continue" and then "Done."
Create a Key for the Service Account:
- Click on the newly created service account to open its details.
- Go to the "Keys" tab and click "Add Key" -> "Create New Key."
- Choose the JSON key type and click "Create."
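- Save the JSON file to a secure location on your computer and keep it out of version control. The script in Step 3 loads the key from ./tts-key.json, so either copy the file into your project folder under that name or change the path in the script.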
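
Before wiring the key into the script, it can help to confirm that the downloaded file really is a service account key. The sketch below is one way to do that. The helper names findMissingKeyFields and checkKeyFile are hypothetical, and the field list is an assumption based on the usual layout of a service account key file, not something the client library asks you to check.

```javascript
const fs = require("node:fs");

// Fields a service account key file normally contains (assumed list for this check).
const EXPECTED_FIELDS = ["type", "project_id", "private_key", "client_email"];

// Hypothetical helper: returns the names of expected fields that are missing or empty.
function findMissingKeyFields(key) {
  return EXPECTED_FIELDS.filter(
    (field) => typeof key[field] !== "string" || key[field].length === 0
  );
}

// Hypothetical helper: loads the key file and throws a readable error if it looks wrong.
// Returns the service account's email address when the file passes the checks.
function checkKeyFile(path) {
  const key = JSON.parse(fs.readFileSync(path, "utf8"));
  const missing = findMissingKeyFields(key);
  if (missing.length > 0) {
    throw new Error(`Key file ${path} is missing: ${missing.join(", ")}`);
  }
  if (key.type !== "service_account") {
    throw new Error(`Expected type "service_account", found "${key.type}"`);
  }
  return key.client_email;
}
```

Calling checkKeyFile("./tts-key.json") at the top of a script would surface a wrong or truncated download before any API request is made.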

Step 3: Write the Node.js Script

Install the required Node.js package:

bash
npm install @google-cloud/text-to-speech

Create a script (synthesize.js) with the following content:

javascript
const textToSpeech = require("@google-cloud/text-to-speech");
const fs = require("node:fs/promises");

// Initialize the Text-to-Speech client with the service account key file.
// Adjust the path if you saved the key somewhere other than the project folder.
const client = new textToSpeech.TextToSpeechClient({
  keyFilename: "./tts-key.json",
});

// Synthesize speech from text and save it to an MP3 file
async function synthesizeSpeech(text, outputFile) {
  // Define the request payload
  const request = {
    input: { text },
    voice: {
      languageCode: "en-US",
      name: "en-US-Neural2-D",
    },
    audioConfig: { audioEncoding: "MP3" },
  };

  // Make the API request to synthesize speech
  const [response] = await client.synthesizeSpeech(request);
  // audioContent is binary audio data, so write it without a text encoding
  await fs.writeFile(outputFile, response.audioContent);
  console.log(`Audio content written to file: ${outputFile}`);
}

// Sample text to convert to speech
const text = "This is a generic sentence intended for testing text-to-speech.";
// Output file path
const outputFile = "output.mp3";
// Call the function and report failures (bad credentials, disabled API, and so on)
synthesizeSpeech(text, outputFile).catch((err) => {
  console.error("Speech synthesis failed:", err);
  process.exitCode = 1;
});

In this script:

- We initialize the Text-to-Speech client using the service account key file.
- We define a function, synthesizeSpeech, that takes text and an output file path as arguments.
- The function sends a request to the Text-to-Speech API and saves the returned audio content to an MP3 file.
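
The request object is where most customization happens. As a sketch of how the hard-coded payload could be parameterized, the hypothetical helper buildRequest below accepts either plain text or SSML and exposes the voice, encoding, speaking rate, and pitch. The option names on the helper are assumptions made for illustration; the fields it emits (input.text, input.ssml, voice.languageCode, voice.name, audioConfig.audioEncoding, audioConfig.speakingRate, audioConfig.pitch) are fields of the API's synthesis request.

```javascript
// Hypothetical helper: builds a synthesizeSpeech request from a few options.
function buildRequest(content, options = {}) {
  const {
    ssml = false,                  // treat content as SSML markup instead of plain text
    languageCode = "en-US",
    voiceName = "en-US-Neural2-D",
    audioEncoding = "MP3",         // other values include "LINEAR16" and "OGG_OPUS"
    speakingRate = 1.0,            // 1.0 is the voice's normal speed
    pitch = 0.0,                   // 0.0 leaves the pitch unchanged
  } = options;

  return {
    input: ssml ? { ssml: content } : { text: content },
    voice: { languageCode, name: voiceName },
    audioConfig: { audioEncoding, speakingRate, pitch },
  };
}

// Example: a slightly slower British voice reading SSML with a pause.
const exampleRequest = buildRequest(
  '<speak>Hello <break time="500ms"/> world.</speak>',
  { ssml: true, languageCode: "en-GB", voiceName: "en-GB-Neural2-A", speakingRate: 0.9 }
);
console.log(JSON.stringify(exampleRequest, null, 2));
```

The returned object can be passed to client.synthesizeSpeech in place of the literal request in synthesize.js. Not every voice type necessarily honors every audio setting, so test the combination you plan to use.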

Step 4: Run the Script

To run the script, execute the following command in your terminal:

bash
node synthesize.js

If everything is set up correctly, the script prints "Audio content written to file: output.mp3" and creates output.mp3, containing the synthesized speech, in the current directory.
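
For a quick check that the file holds real audio and not an empty or truncated write, a sketch like the following inspects the first bytes. The helper looksLikeMp3 is hypothetical and deliberately loose: it only looks for an ID3 tag or an MPEG frame sync at the start of the data, which is an assumption about how the file begins and not a full validation.

```javascript
const fs = require("node:fs");

// Hypothetical helper: true if the buffer starts the way an MP3 file usually does.
function looksLikeMp3(buffer) {
  if (buffer.length < 3) return false;
  // Some MP3 files begin with an ID3 metadata tag.
  const hasId3Tag = buffer.subarray(0, 3).toString("latin1") === "ID3";
  // Otherwise an MPEG audio frame starts with 11 set bits (0xFF, then 0b111xxxxx).
  const hasFrameSync = buffer[0] === 0xff && (buffer[1] & 0xe0) === 0xe0;
  return hasId3Tag || hasFrameSync;
}

// Report on output.mp3 if it exists in the current directory.
const outputPath = "output.mp3";
if (fs.existsSync(outputPath)) {
  const data = fs.readFileSync(outputPath);
  console.log(`${outputPath}: ${data.length} bytes, looks like MP3: ${looksLikeMp3(data)}`);
} else {
  console.log(`${outputPath} not found; run synthesize.js first.`);
}
```

Playing the file in any audio player remains the real test; this only catches obviously broken output.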

Voice and Language Options

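The Google Cloud Text-to-Speech API provides a variety of voices and languages to choose from. The lists below cover a sample of the English voices. Google adds and retires voices over time, so check the names and genders against the API's current voice list before relying on them: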

English (United States) Neural2 Voices

en-US-Neural2-A (Female)
en-US-Neural2-B (Male)
en-US-Neural2-C (Female)
en-US-Neural2-D (Male)
en-US-Neural2-E (Female)
en-US-Neural2-F (Male)
en-US-Neural2-G (Female)
en-US-Neural2-H (Male)
en-US-Neural2-I (Female)
en-US-Neural2-J (Male)

English (United States) WaveNet Voices

en-US-Wavenet-A (Female)
en-US-Wavenet-B (Male)
en-US-Wavenet-C (Female)
en-US-Wavenet-D (Male)
en-US-Wavenet-E (Male)
en-US-Wavenet-F (Female)
en-US-Wavenet-G (Male)
en-US-Wavenet-H (Female)

English (United Kingdom) Neural2 Voices

en-GB-Neural2-A (Female)
en-GB-Neural2-B (Male)
en-GB-Neural2-C (Female)
en-GB-Neural2-D (Male)

English (United Kingdom) WaveNet Voices

en-GB-Wavenet-A (Female)
en-GB-Wavenet-B (Male)
en-GB-Wavenet-C (Female)
en-GB-Wavenet-D (Male)

English (Australian) Neural2 Voices

en-AU-Neural2-A (Female)
en-AU-Neural2-B (Male)
en-AU-Neural2-C (Female)
en-AU-Neural2-D (Male)

English (Australian) WaveNet Voices

en-AU-Wavenet-A (Female)
en-AU-Wavenet-B (Male)
en-AU-Wavenet-C (Female)
en-AU-Wavenet-D (Male)

English (Indian) Neural2 Voices

en-IN-Neural2-A (Female)
en-IN-Neural2-B (Male)
en-IN-Neural2-C (Female)
en-IN-Neural2-D (Male)

English (Indian) WaveNet Voices

en-IN-Wavenet-A (Female)
en-IN-Wavenet-B (Male)
en-IN-Wavenet-C (Female)
en-IN-Wavenet-D (Male)
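
Because the catalogue changes, it is often safer to ask the API which voices exist than to rely on a static list. The sketch below uses the client's listVoices method together with a hypothetical helper, summarizeVoices, to print names and genders for one voice family. It assumes the same ./tts-key.json key file as the main script, and that each returned voice carries name and ssmlGender fields.

```javascript
// Hypothetical helper: keeps voices whose name contains `family` and formats them
// in the same "name (Gender)" style used in the lists above.
function summarizeVoices(voices, family) {
  return voices
    .filter((voice) => voice.name.includes(family))
    .map((voice) => {
      const gender = String(voice.ssmlGender).toLowerCase();
      return `${voice.name} (${gender.charAt(0).toUpperCase()}${gender.slice(1)})`;
    })
    .sort();
}

// Fetches the voices for a language code and prints one family, e.g. "Neural2".
async function printVoices(languageCode, family) {
  // Loaded here so the helper above can be used without the client installed.
  const textToSpeech = require("@google-cloud/text-to-speech");
  const client = new textToSpeech.TextToSpeechClient({
    keyFilename: "./tts-key.json", // assumed key location, as in synthesize.js
  });
  const [result] = await client.listVoices({ languageCode });
  for (const line of summarizeVoices(result.voices, family)) {
    console.log(line);
  }
}

// Example call (needs valid credentials and network access):
// printVoices("en-US", "Neural2").catch(console.error);
```

Whatever this prints for your project takes precedence over the lists in this article.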

Pricing

Google Cloud Text-to-Speech pricing is based on the number of characters synthesized per month, and the rate depends on the voice type. Here’s an overview of the costs:

Free Tier:
- First 1 million characters each month for WaveNet voices are free.
Paid Usage:
- Standard voices: $4.00 per 1 million characters.
- WaveNet voices: $16.00 per 1 million characters.
- Neural2 voices: $16.00 per 1 million characters.
- Studio voices: $160.00 per 1 million characters.
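
To see how these rates translate into a monthly bill, here is a rough estimator. The helper estimateMonthlyCost is hypothetical, the rates are copied from the list above, and the free allowance is applied only to WaveNet because that is the only one listed here. Treat the result as an approximation and check current pricing before budgeting.

```javascript
// Rates in USD per 1 million characters, copied from the list above.
const RATE_PER_MILLION = { standard: 4, wavenet: 16, neural2: 16, studio: 160 };

// Free characters per month; only the WaveNet allowance is listed in this article.
const FREE_CHARACTERS = { wavenet: 1_000_000 };

// Hypothetical helper: estimated monthly cost in USD for one voice type.
function estimateMonthlyCost(characters, voiceType) {
  const rate = RATE_PER_MILLION[voiceType];
  if (rate === undefined) {
    throw new Error(`Unknown voice type: ${voiceType}`);
  }
  const free = FREE_CHARACTERS[voiceType] ?? 0;
  const billable = Math.max(0, characters - free);
  return (billable / 1_000_000) * rate;
}

console.log(estimateMonthlyCost(2_500_000, "wavenet")); // 24
console.log(estimateMonthlyCost(500_000, "wavenet"));   // 0
console.log(estimateMonthlyCost(250_000, "studio"));    // 40
```

The character count passed in should be the total you expect to send to the API in a month for that voice type.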

New users also get $300 in free credits for the first 90 days to explore Google Cloud services.

For more details, visit the Google Cloud Pricing page.

Conclusion

Using the Google Cloud Text-to-Speech API, you can easily convert text to natural-sounding speech in various languages and voices. This guide walked you through the process of setting up the API, obtaining credentials, and writing a Node.js script to synthesize speech. You can now integrate this functionality into your applications for enhanced user interactions.
