Text-to-Speech Wrapper

I’m jumping the gun and sharing a simple OOP-based wrapper for AudioTextToSpeech.

My original goal was to pre-load all of the generated speech by splitting the given message into an array of strings. This didn’t work because Roblox doesn’t expose the underlying API (Cloud_GenerateSpeechAsset) in-engine.
I would have to go through HttpService with something like roproxy.com, or make my own proxy. In the past, when I’ve tried open-source proxies such as roproxy, my requests were almost always declined.

The upsides of pre-loading would be less stress on the server and less waiting time for new messages.
But this is a new API. I’m sure they’ll find a way to make it work entirely within Roblox rather than through HttpService.
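
For reference, the splitting step of that original idea could look something like the sketch below. It is not part of the wrapper in this post: `splitMessage` and its `maxLength` default of 100 are names and values I made up for illustration, and nothing here touches the Roblox API.

```lua
-- Hypothetical helper: split a message into sentence-sized chunks so that
-- each chunk could be loaded separately.
local function splitMessage(message, maxLength)
	maxLength = maxLength or 100 -- assumed limit, purely illustrative
	local chunks = {}
	local current = ""

	-- Push the chunk being built (if any) and start a new one.
	local function flush()
		if current ~= "" then
			table.insert(chunks, current)
			current = ""
		end
	end

	-- Walk the message word by word so no word is cut in half.
	-- A single word longer than maxLength still becomes its own chunk.
	for word in string.gmatch(message, "%S+") do
		if current ~= "" and #current + 1 + #word > maxLength then
			flush()
		end

		if current == "" then
			current = word
		else
			current = current .. " " .. word
		end

		-- A word ending in ".", "!" or "?" closes the current sentence.
		if string.find(word, "[%.!?]$") then
			flush()
		end
	end

	flush()
	return chunks
end

-- Usage
local parts = splitMessage("Hello there. How are you? Fine")
print(#parts) --> 3
```

Each chunk could then be handed to its own load call, which is where the proxy problem above comes in.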

Instead, I went with a simpler approach by using the AudioTextToSpeech instance.
Given that there are already alternatives, I figured this approach has the most potential.

Messages that players send through the default RBXGeneral TextChannel are narrated from the sender’s Character.

Sound (utility—optional)
export type EmitterMap = {
	Destroy: (self: EmitterMap) -> (),
	Wire: (self: EmitterMap, Target: Instance) -> (),

	_Instance: AudioEmitter,
	_Wire: Wire,
}

local emitter = {}
emitter.__index = emitter

function emitter.Destroy(self: EmitterMap)
	self._Instance:Destroy()
end

function emitter.Wire(self: EmitterMap, Target: Instance)
	self._Wire.SourceInstance = Target
end

function emitter.new(Parent: Instance): EmitterMap
	local self = setmetatable({}, emitter)
	self._Instance = Instance.new("AudioEmitter")
	self._Wire = Instance.new("Wire")

	self._Wire.TargetInstance = self._Instance
	self._Wire.Parent = self._Instance

	self._Instance.Parent = Parent
	return self
end

return {
	Emitter = emitter.new,
}
TextToSpeech
local Sound = require(game:GetService("ReplicatedStorage").Sound)

export type SpeechMap = {
	_LoadTextAsync: (self: SpeechMap) -> boolean,
	Play: (self: SpeechMap, message: string) -> (),
	Stop: (self: SpeechMap) -> (),
	Destroy: (self: SpeechMap) -> (),

	_Instance: AudioTextToSpeech,
	_emitter: Sound.EmitterMap,
	_voice_id: number,
}

local module = {}
module.__index = module

-- Yields until the current Text has been loaded; returns true only on success.
function module._LoadTextAsync(self: SpeechMap): boolean
	local success, result = pcall(self._Instance.LoadAsync, self._Instance)

	return success and result == Enum.AssetFetchStatus.Success
end

function module.Play(self: SpeechMap, message: string)
	self:Stop()

	self._Instance.Text = message

	if self:_LoadTextAsync() then
		self._Instance:Play()
	else
		self._Instance.Text = ""
	end
end

function module.Stop(self: SpeechMap)
	self._Instance.TimePosition = 0

	self._Instance:Unload()
	self._Instance:Pause()
end

function module.Destroy(self: SpeechMap)
	self._Instance:Destroy()
	self._emitter:Destroy()
end

function module.new(Parent: Instance, voice_id: number, volume: number?, speed: number?, pitch: number?): SpeechMap
	local self = setmetatable({}, module)

	self._Instance = Instance.new("AudioTextToSpeech")
	self._Instance.Volume = volume or 1
	self._Instance.Speed = speed or 1
	self._Instance.Pitch = pitch or 0
	self._Instance.VoiceId = voice_id
	-- Keep the field declared in SpeechMap populated.
	self._voice_id = voice_id
	self._Instance.Parent = Parent

	self._emitter = Sound.Emitter(Parent)
	self._emitter:Wire(self._Instance)

	return self
end

return {
	new = module.new,
}
Server
local TextChatService = game:GetService("TextChatService")

local TextToSpeech = require("@self/TextToSpeech")

local TextChannels = TextChatService:WaitForChild("TextChannels", 5) :: Folder
local RBXGeneral = TextChannels:WaitForChild("RBXGeneral", 5) :: TextChannel

local playerTTS = {} :: { [number]: TextToSpeech.SpeechMap }

local Players = game:GetService("Players")

local function CleanPlayer(Player: Player)
	local speechMap = playerTTS[Player.UserId]

	if speechMap then
		speechMap:Destroy()
		-- Drop the reference so a destroyed wrapper is never reused.
		playerTTS[Player.UserId] = nil
	end
end

Players.PlayerRemoving:Connect(CleanPlayer)

Players.PlayerAdded:Connect(function(Player)
	Player.CharacterAdded:Connect(function(Character)
		CleanPlayer(Player)

		playerTTS[Player.UserId] = TextToSpeech.new(Character.PrimaryPart, 1)
	end)
end)

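RBXGeneral.ShouldDeliverCallback = function(TextMessage: TextChatMessage)
	local TextSource = TextMessage.TextSource

	if TextSource then
		local speechMap = playerTTS[TextSource.UserId]

		-- The sender may not have a character (and so no wrapper) yet.
		if speechMap then
			print(`Set "{TextMessage.Text}" to AudioTextToSpeech.Text`)
			-- Play yields while the speech loads, so run it in its own thread.
			task.spawn(speechMap.Play, speechMap, TextMessage.Text)
		end
	end

	-- Always deliver the message; this callback only adds narration.
	return true
end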

It doesn’t look too fancy.
It’s meant to be robust, easy to read, and quick to swap out as soon as Roblox updates the API.

Every time a character is added, the server gives it a new AudioTextToSpeech that’s wired to an AudioEmitter. The wrapper is stored in a table on the server, keyed by UserId, and looked up every time that player sends a chat message.

When the server gets the message, it tells the TextToSpeech module to Play the text. Play sets the Text property and then calls LoadAsync, which yields until loading has finished; playback starts only if the load succeeds.

I used a separate class (Sound) to keep the emitter management clean.

Since a player can send another message before the previous one has loaded, Play calls Stop first. Stop calls Unload, on the assumption that this cancels the pending LoadAsync. It also halts any sound that’s playing by calling Pause and resetting TimePosition to zero.
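
If Unload turns out not to cancel a pending LoadAsync, an older Play call could resume after a newer one and speak the wrong message. One way to guard against that, as a rough sketch only: give every request a token and check it once the load returns. `newRequestGuard`, `begin`, and `isCurrent` are hypothetical names, not part of the module above or of any Roblox API.

```lua
-- Hypothetical guard: remembers which request is the newest one.
local function newRequestGuard()
	local latest = 0
	local guard = {}

	-- Call at the start of each Play; returns a token for that request.
	function guard.begin()
		latest = latest + 1
		return latest
	end

	-- Call after the yielding load returns.
	-- Returns false if a newer request has started in the meantime.
	function guard.isCurrent(token)
		return token == latest
	end

	return guard
end

-- Usage
local guard = newRequestGuard()
local first = guard.begin()
local second = guard.begin() -- a newer message arrives while the first loads
print(guard.isCurrent(first))  --> false
print(guard.isCurrent(second)) --> true
```

Inside Play, the token would be taken before `_LoadTextAsync` and the instance would only be played if `isCurrent` still returns true afterwards.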

Again, very simple. This is a brand new API that was released just the other day, and it needs more features to be fully effective.

Methods to generate speech from text through a single service would be very helpful and would open the way to building this module the way I intended from the start: something that can be cached and retrieved easily, a library of sounds that never have to be regenerated because they already exist, all of which could be cached or inserted into the game on the fly.
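
To make the caching idea concrete, here is a small sketch of what such a library could look like. It is entirely hypothetical: there is currently no way that I know of to get generated speech back as a reusable value, so the `generate` callback stands in for whatever future API might provide one, and `newSpeechCache` and `maxEntries` are my own names.

```lua
-- Hypothetical cache: stores one generated value per (voiceId, text) pair
-- and evicts the oldest entry once it holds more than maxEntries.
local function newSpeechCache(maxEntries)
	local entries = {} -- key -> generated value
	local order = {}   -- keys, oldest first
	local cache = {}

	-- Returns the cached value and true on a hit, or generates it,
	-- stores it, and returns it with false on a miss.
	function cache.get(voiceId, text, generate)
		local key = tostring(voiceId) .. "\0" .. text
		local hit = entries[key]
		if hit ~= nil then
			return hit, true
		end

		-- `generate` is a stand-in for a speech-generation API.
		local value = generate(voiceId, text)
		entries[key] = value
		table.insert(order, key)

		if #order > maxEntries then
			local oldest = table.remove(order, 1)
			entries[oldest] = nil
		end

		return value, false
	end

	return cache
end
```

Common phrases like greetings would be generated once and then served from the table, which is the "less stress on the server" benefit mentioned earlier.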

One thing I found confusing was how LoadAsync works. In my opinion, the Text property should be read-only, and LoadAsync should take the string as an argument, set the property, and generate the speech.

Please add to the discussion!
How can we make real-time TTS a reality? Roblox would be the first to pioneer such a large project.
I can’t think of any other game that remotely offers this. It would be very helpful for players without a microphone, all while vastly improving immersion.