January 8, 2025

Using iFlytek's Speech Model in Unity for Text-to-Speech (TTS) Playback

Introduction

Integrating iFlytek’s Text-to-Speech (TTS) technology into Unity can elevate game immersion by enabling realistic and dynamic voice responses. This article will detail how to use iFlytek’s TTS in Unity, covering:

  1. How iFlytek TTS works.
  2. Step-by-step integration into Unity.
  3. Handling audio data for playback.
  4. Key considerations and troubleshooting.

How iFlytek TTS Works

iFlytek TTS synthesizes speech by:

  1. Logging into iFlytek’s cloud services.
  2. Sending text to the TTS API for synthesis.
  3. Receiving and processing PCM (Pulse Code Modulation) audio data.
  4. Playing the processed audio in Unity.

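Prerequisites

This guide assumes a Unity project, the msc_x64.dll library from iFlytek’s SDK, and an iFlytek AppID for the MSPLogin call.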

Step-by-Step Integration

1. Setting Up Unity

  1. Create a Unity project and add a GameObject to host the TTS script.
  2. Import the msc_x64.dll file into your project (place it in the Assets/Plugins folder).
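The script in the next section calls the DLL through a class named MSCDLL, whose declarations are not shown in this article. The following is a minimal sketch of what such a wrapper might look like. Every signature here is an assumption inferred from how the calls are used in this article, and the enum values are assumed from the SDK headers, so check them against the SDK version you downloaded. How the managed strings are encoded on their way into native code depends on the runtime's marshalling and is not verified here.

```csharp
using System;
using System.Runtime.InteropServices;

// Status flag reported by QTTSAudioGet (values are assumptions; verify against the SDK headers).
public enum SynthStatus
{
    MSP_TTS_FLAG_STILL_HAVE_DATA = 1, // more audio is still to come
    MSP_TTS_FLAG_DATA_END = 2         // synthesis has finished
}

// Hypothetical P/Invoke wrapper; signatures mirror the calls made in this article.
public static class MSCDLL
{
    private const string Library = "msc_x64"; // resolves to msc_x64.dll on Windows

    [DllImport(Library)]
    public static extern int MSPLogin(string usr, string pwd, string parameters);

    [DllImport(Library)]
    public static extern int MSPLogout();

    // The session handle is a native pointer, so it is kept as an opaque IntPtr.
    [DllImport(Library)]
    public static extern IntPtr QTTSSessionBegin(string parameters, ref int errorCode);

    [DllImport(Library)]
    public static extern int QTTSTextPut(IntPtr sessionId, string text, uint textLen, string parameters);

    // Returns a pointer to native audio memory, which the article copies out with Marshal.Copy.
    [DllImport(Library)]
    public static extern IntPtr QTTSAudioGet(IntPtr sessionId, ref uint audioLen, ref SynthStatus synthStatus, ref int errorCode);

    [DllImport(Library)]
    public static extern int QTTSSessionEnd(IntPtr sessionId, string hints);
}
```

Nothing is loaded from the DLL until the first call, so a wrong file name or folder only shows up at runtime as a DllNotFoundException.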

2. Implementing the TTS Script

Below is a complete breakdown of the XunfeiTTS class:

Initialization

The Awake method logs into the iFlytek service and sets up an AudioSource component for playback:

private void Awake()
{
    audioSource = gameObject.AddComponent<AudioSource>();
    Login();
}

Login and Logout

The Login and Logout methods handle authentication with iFlytek’s servers:

public void Login()
{
    int res = MSCDLL.MSPLogin(null, null, AppID);
    if (res == 0)
    {
        Debug.Log("iFlytek login successful!");
    }
    else
    {
        Debug.LogError($"iFlytek login failed, error code: {res}");
    }
}

public void Logout()
{
    int res = MSCDLL.MSPLogout();
    if (res == 0)
    {
        Debug.Log("iFlytek logout successful!");
    }
    else
    {
        Debug.LogError($"iFlytek logout failed, error code: {res}");
    }
}

Sending a TTS Request

The SendTTSRequest method initiates a TTS session, sends the text for synthesis, and fetches the resulting audio:

public void SendTTSRequest(string text)
{
    int res = 0;

    // Start TTS session
    ttsSession = MSCDLL.QTTSSessionBegin("engine_type=cloud,voice_name=aisjiuxu,speed=50,pitch=50,text_encoding=utf8,sample_rate=16000", ref res);
    if (res != 0 || ttsSession == IntPtr.Zero)
    {
        Debug.LogError($"TTS session start failed, error code: {res}");
        return;
    }

    // Send text for synthesis. The length is the UTF-8 byte count,
    // which is larger than text.Length for non-ASCII text such as Chinese.
    uint textLen = (uint)System.Text.Encoding.UTF8.GetByteCount(text);
    res = MSCDLL.QTTSTextPut(ttsSession, text, textLen, null);
    if (res != 0)
    {
        Debug.LogError($"Text synthesis failed, error code: {res}");
        EndSession();
        return;
    }

    // Fetch and play audio
    FetchAndPlayAudio();
}

Fetching and Playing Audio

FetchAndPlayAudio retrieves PCM audio data from the TTS service and converts it for playback:

private void FetchAndPlayAudio()
{
    MemoryStream audioStream = new MemoryStream();
    SynthStatus synthStatus = SynthStatus.MSP_TTS_FLAG_STILL_HAVE_DATA;

    int res = 0;
    // Keep polling until the service reports that no more data is coming.
    while (synthStatus == SynthStatus.MSP_TTS_FLAG_STILL_HAVE_DATA)
    {
        uint audioLen = 0;
        IntPtr audioData = MSCDLL.QTTSAudioGet(ttsSession, ref audioLen, ref synthStatus, ref res);

        if (res != 0)
        {
            Debug.LogError($"Audio retrieval failed, error code: {res}");
            EndSession();
            return;
        }

        // Copy the native chunk into managed memory and append it to the stream.
        if (audioData != IntPtr.Zero && audioLen > 0)
        {
            byte[] buffer = new byte[audioLen];
            Marshal.Copy(audioData, buffer, 0, (int)audioLen);
            audioStream.Write(buffer, 0, buffer.Length);
        }
    }

    PlayAudio(audioStream.ToArray());
    EndSession();
}
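
The loop in FetchAndPlayAudio runs on the thread that calls it and polls without pausing, so a long synthesis can stall the frame. One possible alternative, sketched below under stated assumptions, collects the chunks on a worker thread with a short delay between polls. The fetchChunk delegate is a hypothetical stand-in for one QTTSAudioGet call, and AudioCollector and pollDelayMs are illustrative names, not part of the SDK.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class AudioCollector
{
    // fetchChunk is a hypothetical stand-in for one QTTSAudioGet call:
    // it returns the bytes received (possibly empty) and whether more data is expected.
    public static Task<byte[]> CollectAsync(Func<(byte[] Chunk, bool More)> fetchChunk, int pollDelayMs = 50)
    {
        return Task.Run(() =>
        {
            using (var stream = new MemoryStream())
            {
                bool more = true;
                while (more)
                {
                    var (chunk, stillMore) = fetchChunk();
                    if (chunk != null && chunk.Length > 0)
                    {
                        stream.Write(chunk, 0, chunk.Length);
                    }
                    more = stillMore;
                    if (more)
                    {
                        Thread.Sleep(pollDelayMs); // pause between polls instead of spinning
                    }
                }
                return stream.ToArray();
            }
        });
    }
}
```

Unity objects such as AudioClip and AudioSource must still be used from the main thread, so the collected bytes would be handed back to PlayAudio there, for example from a coroutine that waits for the task to complete.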

The PlayAudio method converts PCM data into a Unity AudioClip:

private void PlayAudio(byte[] pcmData)
{
    int sampleRate = 16000; // must match sample_rate in the session parameters
    int channels = 1;
    float[] audioSamples = new float[pcmData.Length / 2]; // 2 bytes per 16-bit sample

    if (audioSamples.Length == 0)
    {
        Debug.LogWarning("No audio data received; nothing to play.");
        return;
    }

    for (int i = 0; i < audioSamples.Length; i++)
    {
        short sample = BitConverter.ToInt16(pcmData, i * 2);
        audioSamples[i] = sample / 32768f; // Normalize to [-1, 1]
    }

    AudioClip audioClip = AudioClip.Create("TTS_Audio", audioSamples.Length, channels, sampleRate, false);
    audioClip.SetData(audioSamples, 0);

    audioSource.clip = audioClip;
    audioSource.Play();
    Debug.Log("Audio playback started."); // Play() returns immediately
}

Ending the Session

The EndSession method ensures proper resource cleanup:

private void EndSession()
{
    int res = MSCDLL.QTTSSessionEnd(ttsSession, "end");
    if (res != 0)
    {
        Debug.LogError($"TTS session end failed, error code: {res}");
    }
    else
    {
        Debug.Log("TTS session ended successfully.");
    }
}

Key Considerations

Text Encoding

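The session parameters declare text_encoding=utf8, so the text sent to the API must be UTF-8 encoded. For non-ASCII text such as Chinese, the UTF-8 byte count is larger than the string’s character count.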
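
To illustrate the difference between character count and byte count, which matters wherever a length is passed alongside the text (as in the QTTSTextPut call), here is a small sketch. TtsText and Utf8ByteLength are hypothetical names used only for this example.

```csharp
using System;
using System.Text;

public static class TtsText
{
    // Number of bytes the text occupies once encoded as UTF-8.
    public static uint Utf8ByteLength(string text)
    {
        return (uint)Encoding.UTF8.GetByteCount(text);
    }
}
```

For "Hello" both text.Length and Utf8ByteLength give 5. For "你好", text.Length is 2 while Utf8ByteLength gives 6, because each of these characters takes three bytes in UTF-8.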

Audio Format

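With sample_rate=16000 in the session parameters, the code in this article treats the returned audio as 16 kHz, mono, 16-bit PCM. Unity’s AudioClip.SetData takes float samples in the range -1 to 1, so each 16-bit sample must be converted before playback.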
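
As a standalone sketch of that conversion, the helper below decodes 16-bit samples and estimates clip duration from the byte count. It assumes the PCM bytes are little-endian, which is what BitConverter produces on typical desktop hardware but is not confirmed by the article. Pcm16 and its method names are hypothetical.

```csharp
using System;

public static class Pcm16
{
    // Converts little-endian 16-bit signed PCM bytes to floats in [-1, 1).
    public static float[] ToFloats(byte[] pcm)
    {
        var samples = new float[pcm.Length / 2];
        for (int i = 0; i < samples.Length; i++)
        {
            // Low byte first, then high byte; the cast reinterprets the value as signed.
            short value = unchecked((short)(pcm[2 * i] | (pcm[2 * i + 1] << 8)));
            samples[i] = value / 32768f;
        }
        return samples;
    }

    // Duration in seconds of a raw 16-bit PCM buffer.
    public static double DurationSeconds(int byteCount, int sampleRate = 16000, int channels = 1)
    {
        const int bytesPerSample = 2;
        return (double)byteCount / (bytesPerSample * channels * sampleRate);
    }
}
```

At 16 kHz mono, one second of audio is 32,000 bytes, so a buffer of only a few hundred bytes usually means synthesis returned almost nothing.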

Error Handling

Check the return code of every SDK call. In the code in this article, 0 means success; any other value is an error code that should be logged, since it is the main clue for debugging.
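
One way to avoid repeating the same if/else around every call is a small helper that turns a non-zero result into an exception. This is only a sketch: MscException and Msc.Check are hypothetical names, and the only SDK behavior assumed is the 0-means-success convention used throughout this article.

```csharp
using System;

// Hypothetical exception type carrying the SDK's numeric error code.
public sealed class MscException : Exception
{
    public int ErrorCode { get; }

    public MscException(string step, int errorCode)
        : base($"{step} failed, error code: {errorCode}")
    {
        ErrorCode = errorCode;
    }
}

public static class Msc
{
    // Throws when an SDK call reports a non-zero result; 0 is treated as success.
    public static void Check(int result, string step)
    {
        if (result != 0)
        {
            throw new MscException(step, result);
        }
    }
}
```

A call site would then read Msc.Check(MSCDLL.MSPLogout(), "iFlytek logout"), and a try/finally around the session can make sure EndSession still runs when a step throws.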

Dependency Management

Include all required SDK files (msc_x64.dll) in the Unity project.

Troubleshooting

  1. No Audio Playback: Verify the AudioSource is correctly configured and the PCM data is properly converted.
  2. SDK Errors: Refer to iFlytek’s SDK documentation for error code definitions.
  3. DLL Not Found: Ensure the DLL is placed in the correct directory (Assets/Plugins).
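
When there is no audible output, it can help to rule out the synthesis side by saving the raw PCM as a WAV file and opening it in an audio player. The sketch below wraps 16-bit PCM in a standard 44-byte RIFF/WAVE header. WavWriter is a hypothetical helper, and the defaults assume the 16 kHz mono format used in this article.

```csharp
using System;
using System.IO;
using System.Text;

public static class WavWriter
{
    // Wraps raw 16-bit PCM in a 44-byte RIFF/WAVE header.
    public static byte[] Wrap(byte[] pcm, int sampleRate = 16000, short channels = 1)
    {
        const short bitsPerSample = 16;
        short blockAlign = (short)(channels * bitsPerSample / 8);
        int byteRate = sampleRate * blockAlign;

        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream, Encoding.ASCII))
        {
            writer.Write(Encoding.ASCII.GetBytes("RIFF"));
            writer.Write(36 + pcm.Length);   // size of everything after this field
            writer.Write(Encoding.ASCII.GetBytes("WAVE"));
            writer.Write(Encoding.ASCII.GetBytes("fmt "));
            writer.Write(16);                // fmt chunk size for plain PCM
            writer.Write((short)1);          // audio format 1 = PCM
            writer.Write(channels);
            writer.Write(sampleRate);
            writer.Write(byteRate);
            writer.Write(blockAlign);
            writer.Write(bitsPerSample);
            writer.Write(Encoding.ASCII.GetBytes("data"));
            writer.Write(pcm.Length);        // size of the sample data
            writer.Write(pcm);
            writer.Flush();
            return stream.ToArray();
        }
    }
}
```

Inside Unity, the result could be written with File.WriteAllBytes to a path under Application.persistentDataPath. If the saved file plays correctly, the problem lies in the AudioSource or the conversion step rather than in the TTS request.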

Conclusion

By integrating iFlytek’s TTS into Unity, developers can create dynamic and engaging voice interactions in their games. This guide provides the foundational steps to harness iFlytek’s robust speech synthesis capabilities, paving the way for richer storytelling and enhanced player immersion.

About this Post

This post is written by FFFeiya, licensed under CC BY-NC 4.0.

#CSharp #Unity #TTS