OWolf

2024-09-30 Web Development

Speech Synthesis Markup Language for Google Text-to-Speech

By O Wolfson

Google Cloud Text-to-Speech (TTS) supports a subset of SSML (Speech Synthesis Markup Language) tags that let you control pronunciation, pitch, rate, volume, and other aspects of the synthesized speech. Below is an overview of the key SSML tags and attributes supported by Google TTS, with an example of each:

1. <speak>

  • The root element that encapsulates the entire SSML content.
xml
<speak>
   Welcome to our service!
</speak>
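Because SSML is XML, reserved characters such as & and < in the spoken text need to be escaped before the text goes inside <speak>. Here is a minimal Python sketch of that step using only the standard library; the helper name to_ssml is made up for this illustration.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str) -> str:
    """Wrap plain text in a <speak> root, escaping &, < and >."""
    return f"<speak>{escape(text)}</speak>"


print(to_ssml("Welcome to our Q&A service!"))
# <speak>Welcome to our Q&amp;A service!</speak>
```

Escaping only applies to plain text. If the string already contains SSML tags, escaping it again would turn those tags into literal text.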

2. <emphasis>

  • Used to apply emphasis to certain words or phrases.
  • level: Can be "strong", "moderate", "none", or "reduced".
xml
<speak>
   This is <emphasis level="strong">very important</emphasis> information.
</speak>

3. <break>

  • Introduces a pause in speech.
  • time: Specifies the duration of the pause in seconds or milliseconds (e.g., "2s" or "500ms").
  • strength: Specifies the strength of the pause ("none", "x-weak", "weak", "medium", "strong", "x-strong").
xml
<speak>
   Please wait <break time="500ms"/> before continuing.
</speak>
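When pauses are generated from code, a small helper keeps the attribute values consistent. The sketch below builds a <break> tag from a duration in milliseconds, a strength keyword, or both. The function name break_tag and its parameters are hypothetical; the strength keywords are the ones listed above.

```python
from typing import Optional

# Strength keywords listed in the <break> section above.
STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}


def break_tag(time_ms: Optional[int] = None, strength: Optional[str] = None) -> str:
    """Build a <break/> tag from a duration in milliseconds and/or a strength."""
    attrs = []
    if time_ms is not None:
        if time_ms < 0:
            raise ValueError("time_ms must not be negative")
        attrs.append(f'time="{time_ms}ms"')
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if not attrs:
        return "<break/>"
    return f"<break {' '.join(attrs)}/>"


print(f"<speak>Please wait {break_tag(time_ms=500)} before continuing.</speak>")
print(f"<speak>One moment {break_tag(strength='weak')} please.</speak>")
```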

4. <prosody>

  • Modifies the pitch, rate, and volume of the speech.
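  • pitch: Changes the pitch of the speech, either as a relative change (e.g., "+10%", "-2st") or as a label ("x-low", "low", "medium", "high", "x-high").
  • rate: Changes the speed of the speech ("x-slow", "slow", "medium", "fast", "x-fast").
  • volume: Adjusts the volume, either as a label ("silent", "x-soft", "soft", "medium", "loud", "x-loud") or as a change in decibels (e.g., "-10dB").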
xml
<speak>
   <prosody pitch="+10%" rate="slow" volume="loud">
      This text is spoken slowly, with a higher pitch and louder volume.
   </prosody>
</speak>
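The same markup can be generated programmatically. Below is a rough sketch using Python's built-in xml.etree.ElementTree, which also takes care of escaping the text. The helper name prosody is an assumption for this example. It only checks attribute names, not whether the values are accepted by the service.

```python
import xml.etree.ElementTree as ET


def prosody(text: str, **attrs: str) -> str:
    """Return <speak><prosody ...>text</prosody></speak> as a string."""
    allowed = {"pitch", "rate", "volume"}
    unknown = set(attrs) - allowed
    if unknown:
        raise ValueError(f"unsupported prosody attributes: {sorted(unknown)}")
    speak = ET.Element("speak")
    node = ET.SubElement(speak, "prosody", attrs)
    node.text = text  # ElementTree escapes &, < and > when serializing
    return ET.tostring(speak, encoding="unicode")


print(prosody("Slow & steady.", pitch="+10%", rate="slow", volume="loud"))
# <speak><prosody pitch="+10%" rate="slow" volume="loud">Slow &amp; steady.</prosody></speak>
```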

5. <say-as>

  • Defines how certain types of text should be interpreted (e.g., dates, times, numbers).
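  • interpret-as: Can be "date", "time", "telephone", "characters", "cardinal", "ordinal", "fraction", "unit", etc.
  • format: An optional hint about how the content is laid out (e.g., "yyyymmdd" for a date, "hms12" for a 12-hour time).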
xml
<speak>
   The meeting is scheduled for <say-as interpret-as="date" format="yyyymmdd">2024-08-15</say-as>.
</speak>
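To see several interpret-as values side by side, the sketch below renders a handful of <say-as> elements from Python. The helper name say_as and the sample values are illustrative. How each one is actually spoken depends on the voice and language, so treat the list as a starting point for listening tests rather than a specification.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Build a <say-as> element with an optional format attribute."""
    attrs = {"interpret-as": interpret_as}
    if fmt is not None:
        attrs["format"] = fmt
    node = ET.Element("say-as", attrs)
    node.text = text
    return ET.tostring(node, encoding="unicode")


# (text, interpret-as, format) triples chosen for illustration only.
samples = [
    ("2024-08-15", "date", "yyyymmdd"),
    ("2:30pm", "time", "hms12"),
    ("12345", "cardinal", None),
    ("1", "ordinal", None),
    ("5+1/2", "fraction", None),
    ("SSML", "characters", None),
]
for text, kind, fmt in samples:
    print(say_as(text, kind, fmt))
```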

6. <sub>

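  • Speaks an alternate string in place of the text inside the tag, which is useful for abbreviations.
  • alias: The text to speak instead of the element's content.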
xml
<speak>
   Read the abbreviation as <sub alias="National Aeronautics and Space Administration">NASA</sub>.
</speak>
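If a text contains many abbreviations, the <sub> tags can be added automatically from a lookup table. The following is a simple sketch of that idea. The function name expand_abbreviations is hypothetical, and it assumes the abbreviation keys are plain words with no XML-reserved characters.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def expand_abbreviations(text: str, aliases: dict) -> str:
    """Wrap each known abbreviation in <sub alias="...">, then add <speak>."""
    body = escape(text)  # escape first so the inserted tags stay intact
    if aliases:
        pattern = re.compile(r"\b(" + "|".join(map(re.escape, aliases)) + r")\b")
        body = pattern.sub(
            lambda m: f"<sub alias={quoteattr(aliases[m.group(1)])}>{m.group(1)}</sub>",
            body,
        )
    return f"<speak>{body}</speak>"


aliases = {"NASA": "National Aeronautics and Space Administration"}
print(expand_abbreviations("NASA launched the probe.", aliases))
# <speak><sub alias="National Aeronautics and Space Administration">NASA</sub> launched the probe.</speak>
```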

7. <audio>

  • Embeds an audio file within the speech synthesis.
  • src: URL of the audio file.
xml
<speak>
   Welcome to the tutorial. <audio src="https://www.example.com/audio/welcome.mp3" />
</speak>
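Text placed inside the <audio> element acts as a fallback that is spoken if the file cannot be played. The sketch below builds the tag with an optional fallback. The helper name audio_tag is made up, and the HTTPS check is an assumption based on my reading of the documentation, so verify it against the current requirements.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def audio_tag(src: str, fallback: str = "") -> str:
    """Build an <audio> element; `fallback` becomes the text inside it."""
    if urlparse(src).scheme != "https":
        # Assumption: the service expects audio to be served over HTTPS.
        raise ValueError("expected an https:// URL")
    node = ET.Element("audio", {"src": src})
    if fallback:
        node.text = fallback
    return ET.tostring(node, encoding="unicode")


print(audio_tag("https://www.example.com/audio/welcome.mp3", "Welcome!"))
# <audio src="https://www.example.com/audio/welcome.mp3">Welcome!</audio>
```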

8. <p> and <s>

  • <p> is used to define a paragraph, and <s> is used to define a sentence.
xml
<speak>
   <p>
      <s>This is the first sentence.</s>
      <s>This is the second sentence of the same paragraph.</s>
   </p>
   <p>This is the second paragraph.</p>
</speak>
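Plain text can be converted into this structure automatically. The sketch below treats blank lines as paragraph breaks and splits sentences on ., ! or ? followed by whitespace. That rule is deliberately naive (it will mis-split abbreviations such as "e.g."), and the function name is hypothetical.

```python
import re
from xml.sax.saxutils import escape


def paragraphs_to_ssml(text: str) -> str:
    """Wrap blank-line-separated paragraphs in <p> and their sentences in <s>."""
    parts = []
    for para in re.split(r"\n\s*\n", text.strip()):
        # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", para.strip())
        body = "".join(f"<s>{escape(s)}</s>" for s in sentences if s)
        parts.append(f"<p>{body}</p>")
    return "<speak>" + "".join(parts) + "</speak>"


print(paragraphs_to_ssml("Hi there. How are you?\n\nBye."))
# <speak><p><s>Hi there.</s><s>How are you?</s></p><p><s>Bye.</s></p></speak>
```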

9. <voice>

  • Allows you to select a specific voice for a portion of the text.
  • name: The name of the voice to use.
xml
<speak>
   <voice name="en-US-Wavenet-D">This part is spoken by a different voice.</voice>
</speak>

10. <lang>

  • Changes the language for a specific section of the text.
  • xml:lang: The language code (e.g., "en-US", "fr-FR").
xml
<speak>
   <lang xml:lang="fr-FR">Bonjour tout le monde.</lang>
</speak>

Practical Example Combining Tags:

xml
<speak>
   <p>
      <emphasis level="strong">Attention!</emphasis> Please note that the event is on
      <say-as interpret-as="date" format="yyyymmdd">2024-12-01</say-as>.
   </p>
   <p>
      <prosody pitch="+5%" rate="slow" volume="soft">
         Make sure you <emphasis level="moderate">arrive</emphasis> on time.
      </prosody>
      <break time="1s"/>
      The doors will close promptly at <say-as interpret-as="time" format="hms12">9:00am</say-as>.
   </p>
</speak>
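As documents like this grow, it is easy to leave a tag unclosed or mistype a tag name. The sketch below is a small local check in Python. It confirms the string is well-formed XML, that the root is <speak>, and that every tag is one of those covered in this post. The function name and the tag list are assumptions for this example, and passing the check does not guarantee the service will accept the input.

```python
import xml.etree.ElementTree as ET

# Tags covered in this post; not a complete list of what the service supports.
KNOWN_TAGS = {
    "speak", "emphasis", "break", "prosody", "say-as",
    "sub", "audio", "p", "s", "voice", "lang",
}


def check_ssml(ssml: str) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    if root.tag != "speak":
        problems.append(f"root element must be <speak>, found <{root.tag}>")
    for element in root.iter():
        if element.tag not in KNOWN_TAGS:
            problems.append(f"unknown tag <{element.tag}>")
    return problems


print(check_ssml('<speak><p>Hi <break time="1s"/> there.</p></speak>'))  # []
print(check_ssml("<speak><blink>Hi</blink></speak>"))  # ['unknown tag <blink>']
```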

Documentation Reference:

For the most up-to-date details on the supported SSML tags and their usage, refer to the Google Cloud Text-to-Speech SSML documentation.