Module 4: Building Speech-Enabled Applications
The Speech Service
The Speech service provides APIs that you can use to build speech-enabled applications. Specifically, the Speech service supports:
Speech-to-Text: An API that enables speech recognition in which your application can accept spoken input.
Text-to-Speech: An API that enables speech synthesis in which your application can provide spoken output.
Speech Translation: An API that you can use to translate spoken input into multiple languages.
Speaker Recognition: An API that enables your application to recognize individual speakers based on their voice.
Intent Recognition: An API that integrates with the Language Understanding service to determine the semantic meaning of spoken input.
You can provision Speech as a single-service resource, or you can use the Speech APIs in a multi-service Cognitive Services resource.
Speech-to-Text
The Speech service supports speech recognition through two REST APIs:
The Speech-to-text API, which is the primary way to perform speech recognition. The endpoint for this API is https://<LOCATION>.api.cognitive.microsoft.com/sts/v1.0
The Speech-to-text Short Audio API, which is optimized for short streams of audio (up to 60 seconds). The endpoint for this API is at https://<LOCATION>.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1
You can use either API for interactive speech recognition, depending on the expected length of the spoken input. You can also use the Speech-to-text API for batch transcription, transcribing multiple audio files to text as a batch operation.
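To make the REST approach concrete, the sketch below shows how a request to the Speech-to-text Short Audio endpoint might be assembled. The region and key are placeholders, and the helper function is illustrative rather than part of any SDK; the documented Ocp-Apim-Subscription-Key and Content-Type headers are used, and the resulting WAV payload would be sent as an HTTP POST.

```python
# Sketch: constructing a request to the Speech-to-text Short Audio API.
# The region and key below are placeholders, not real credentials, and
# build_short_audio_request is an illustrative helper, not an SDK function.

def build_short_audio_request(region: str, key: str, language: str = "en-US"):
    """Return the endpoint URL and headers for a short-audio recognition call."""
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1"
        f"?language={language}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # your Speech resource key
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return url, headers

url, headers = build_short_audio_request("westeurope", "<YOUR-KEY>")
# The WAV audio payload would then be POSTed to `url` with these headers.
```
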
In practice, most interactive speech-enabled applications use the Speech service through a (programming) language-specific SDK.
Using the Speech-to-text SDK
While the specific details vary depending on the SDK being used (Python, C#, and so on), there's a consistent pattern for using the Speech-to-text API:
Use a SpeechConfig object to encapsulate the information required to connect to your Speech resource; specifically, its location and key.
Optionally, use an AudioConfig to define the input source for the audio to be transcribed. By default, this is the default system microphone, but you can also specify an audio file.
Use the SpeechConfig and AudioConfig to create a SpeechRecognizer object. This object is a proxy client for the Speech-to-text API.
Use the methods of the SpeechRecognizer object to call the underlying API functions. For example, the RecognizeOnceAsync() method uses the Speech service to asynchronously transcribe a single spoken utterance.
Process the response from the Speech service. In the case of the RecognizeOnceAsync() method, the result is a SpeechRecognitionResult object that includes the following properties:
Duration
OffsetInTicks
Properties
Reason
ResultId
Text
If the operation was successful, the Reason property has the enumerated value RecognizedSpeech, and the Text property contains the transcription. Other possible values for Reason include NoMatch (indicating that the audio was successfully parsed but no speech was recognized) and Cancelled, indicating that an error occurred; in that case, you can check the Properties collection for the CancellationReason property to determine what went wrong.
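The response-handling logic described above can be sketched in Python. The classes below are simple stand-ins that mirror the shape of the SDK's SpeechRecognitionResult and ResultReason types so the control flow is runnable on its own; the real SDK supplies its own versions of these types.

```python
from dataclasses import dataclass, field
from enum import Enum

# Stand-in types mirroring the shape of the SDK's result objects;
# the real Speech SDK provides its own ResultReason and result class.
class ResultReason(Enum):
    RecognizedSpeech = 1
    NoMatch = 2
    Cancelled = 3

@dataclass
class SpeechRecognitionResult:
    reason: ResultReason
    text: str = ""
    properties: dict = field(default_factory=dict)

def handle_result(result: SpeechRecognitionResult) -> str:
    # Check Reason before trusting Text, as described above.
    if result.reason == ResultReason.RecognizedSpeech:
        return result.text
    if result.reason == ResultReason.NoMatch:
        return "(audio parsed, but no speech was recognized)"
    # Cancelled: inspect the Properties collection for the cancellation reason.
    return f"(cancelled: {result.properties.get('CancellationReason', 'unknown')})"

print(handle_result(SpeechRecognitionResult(ResultReason.RecognizedSpeech, "Hello")))
# prints: Hello
```
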
Text-to-Speech
As with Speech-to-text, the Speech service offers two REST APIs for speech synthesis:
The Text-to-speech API, which is the primary way to perform speech synthesis. The endpoint for this API is https://<LOCATION>.api.cognitive.microsoft.com/sts/v1.0
The Text-to-speech Long Audio API, which is designed to support batch operations that convert large volumes of text to audio - for example to generate an audio-book from the source text. The endpoint for this API is at https://<LOCATION>.customvoice.api.speech.microsoft.com/api/texttospeech/v3.0/longaudiosynthesis
As with speech recognition, in practice most interactive speech-enabled applications are built using the Speech SDK.
The pattern for implementing speech synthesis is similar to that of speech recognition:
Use a SpeechConfig object to encapsulate the information required to connect to your Speech resource; specifically, its location and key.
Optionally, use an AudioConfig to define the output device for the speech to be synthesized. By default, this is the default system speaker, but you can also specify an audio file; or, by explicitly setting this value to null, you can process the returned audio stream object directly.
Use the SpeechConfig and AudioConfig to create a SpeechSynthesizer object. This object is a proxy client for the Text-to-speech API.
Use the methods of the SpeechSynthesizer object to call the underlying API functions. For example, the SpeakTextAsync() method uses the Speech service to convert text to spoken audio.
Process the response from the Speech service. In the case of the SpeakTextAsync method, the result is a SpeechSynthesisResult object that contains the following properties:
AudioData
Properties
Reason
ResultId
When speech has been successfully synthesized, the Reason property is set to the enumerated value SynthesizingAudioComplete, and the AudioData property contains the audio stream (which, depending on the AudioConfig, may have been automatically sent to a speaker or file).
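A common task at this final step is saving the returned AudioData to a file. The sketch below uses stand-in types (mirroring the SpeechSynthesisResult shape described above, not the SDK's own classes) to show the check-the-Reason-then-use-the-bytes pattern:

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
import os, tempfile

# Stand-ins mirroring the shape of the SDK's synthesis result; the real
# Speech SDK provides its own ResultReason and SpeechSynthesisResult types.
class ResultReason(Enum):
    SynthesizingAudioComplete = 1
    Cancelled = 2

@dataclass
class SpeechSynthesisResult:
    reason: ResultReason
    audio_data: bytes = b""

def save_audio(result: SpeechSynthesisResult, path: str) -> int:
    """Write the synthesized audio to a file; return the number of bytes written."""
    if result.reason != ResultReason.SynthesizingAudioComplete:
        raise RuntimeError("Speech synthesis did not complete successfully")
    Path(path).write_bytes(result.audio_data)
    return len(result.audio_data)

# Demo with fake audio bytes written to a temporary file.
demo = SpeechSynthesisResult(ResultReason.SynthesizingAudioComplete, b"fake-pcm-bytes")
out_path = os.path.join(tempfile.mkdtemp(), "clip.wav")
written = save_audio(demo, out_path)
```
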
Audio Format and Voices
When synthesizing speech, you can use a SpeechConfig object to customize the audio that is returned by the Speech service.
Audio format
The Speech service supports multiple output formats for the audio stream that is generated by speech synthesis. Depending on your specific needs, you can choose a format based on the required:
Audio file type
Sample-rate
Bit-depth
The supported formats are indicated in the SDK using the SpeechSynthesisOutputFormat enumeration. For example, SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm.
To specify the required output format, use the SetSpeechSynthesisOutputFormat method of the SpeechConfig object:
speechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);
For a full list of supported formats and their enumeration values, see the Speech SDK documentation.
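The enumeration names themselves encode the container, sample rate, bit depth, and channel layout. The helper below simply decodes that naming convention for the PCM-style names such as Riff24Khz16BitMonoPcm; it is an illustration only, not part of the SDK, and does not cover every format name:

```python
import re

# Illustrative only: decode the naming convention used by PCM-style
# SpeechSynthesisOutputFormat names (e.g. Riff24Khz16BitMonoPcm).
# This helper is not part of the Speech SDK.
def parse_output_format(name: str) -> dict:
    m = re.fullmatch(r"([A-Za-z]+?)(\d+)Khz(\d+)Bit(Mono|Stereo)(\w+)", name)
    if not m:
        raise ValueError(f"Unrecognized format name: {name}")
    container, khz, bits, channels, codec = m.groups()
    return {
        "container": container,            # e.g. Riff (a WAV container) or Raw
        "sample_rate_hz": int(khz) * 1000, # sample rate in Hz
        "bit_depth": int(bits),            # bits per sample
        "channels": channels,              # Mono or Stereo
        "codec": codec,                    # e.g. Pcm
    }

print(parse_output_format("Riff24Khz16BitMonoPcm"))
```
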
Voices
The Speech service provides multiple voices that you can use to personalize your speech-enabled applications. There are two kinds of voice that you can use:
Standard voices - synthetic voices created from audio samples.
Neural voices - more natural sounding voices created using deep neural networks.
Voices are identified by names that indicate a locale and a person's name - for example en-GB-George.
To specify a voice for speech synthesis in the SpeechConfig, set its SpeechSynthesisVoiceName property to the voice you want to use:
speechConfig.SpeechSynthesisVoiceName = "en-GB-George";
Speech Synthesis Markup Language (SSML)
While the Speech SDK enables you to submit plain text to be synthesized into speech (for example, by using the SpeakTextAsync() method), the service also supports an XML-based syntax for describing characteristics of the speech you want to generate. This Speech Synthesis Markup Language (SSML) syntax offers greater control over how the spoken output sounds, enabling you to:
Specify a speaking style, such as "excited" or "cheerful" when using a neural voice.
Insert pauses or silence.
Specify phonemes (phonetic pronunciations), for example to pronounce the text "SQL" as "sequel".
Adjust the prosody of the voice (affecting the pitch, timbre, and speaking rate).
Use common "say-as" rules, for example to specify that a given string should be expressed as a date, time, telephone number, or other form.
Insert recorded speech or audio, for example to include a standard recorded message or simulate background noise.
For example, consider the following SSML:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        <mstts:express-as style="cheerful">
            I say tomato
        </mstts:express-as>
    </voice>
    <voice name="en-US-GuyNeural">
        I say <phoneme alphabet="sapi" ph="t ao m ae t ow">tomato</phoneme>.
        <break strength="weak"/>Let's call the whole thing off!
    </voice>
</speak>
This SSML specifies a spoken dialog between two different neural voices, like this:
Aria (cheerfully): "I say tomato"
Guy: "I say tomato (pronounced tom-ah-toe) ... Let's call the whole thing off!"
To submit an SSML description to the Speech service, you can use the SpeakSsmlAsync() method, like this:
speechSynthesizer.SpeakSsmlAsync(ssml_string);
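Because SSML is XML, a malformed document will be rejected by the service, so it can be worth validating the markup before submitting it. The sketch below assembles the two-voice SSML from the example above as a Python string and checks that it is well-formed using only the standard library:

```python
import xml.etree.ElementTree as ET

# Sketch: build the two-voice SSML from the example above, then confirm it
# is well-formed XML before submitting it to the Speech service.
ssml = """\
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        <mstts:express-as style="cheerful">
            I say tomato
        </mstts:express-as>
    </voice>
    <voice name="en-US-GuyNeural">
        I say <phoneme alphabet="sapi" ph="t ao m ae t ow">tomato</phoneme>.
        <break strength="weak"/>Let's call the whole thing off!
    </voice>
</speak>"""

root = ET.fromstring(ssml)  # raises ParseError if the SSML is malformed
voices = [v.get("name")
          for v in root.iter("{http://www.w3.org/2001/10/synthesis}voice")]
print(voices)
```

The validated string would then be passed to the synthesizer's SSML method (SpeakSsmlAsync() in the example above).
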
LAB
Recognize and Synthesize Speech
The Speech service is an Azure cognitive service that provides speech-related functionality, including:
- A speech-to-text API that enables you to implement speech recognition (converting audible spoken words into text).
- A text-to-speech API that enables you to implement speech synthesis (converting text into audible speech).
In this exercise, you'll use both of these APIs to implement a speaking clock application.
The Speech service includes a Speech translation API that you can use to translate spoken language. For example, suppose you want to develop a translator application that people can use when traveling in places where they don't speak the local language. They would be able to say phrases such as "Where is the station?" or "I need to find a pharmacy" in their own language, and have it translate them to the local language.