AudioAnalyzer and AudioPlayer: Phoneme, Speech, Sound, and Pitch Recognition Features

Overview

I have been pleasantly surprised by the new Audio API features, as well as their performance, presentation, behavior, and implementation. Even so, I have noticed a handful of pertinent issues that I cannot solve in Roblox due to performance concerns and/or limitations of the AudioAnalyzer and AudioPlayer objects. I will start by explaining the simpler use cases and work up to the more complicated scenarios.


Pitch Recognition

Pitch recognition is a fairly general concept used in a range of tools, such as tuners, karaoke machines, and music software, and it plays a key part when working with audio. Without this feature, developers will have a difficult time making sing-along games, offering vocal practice or lessons, creating music, organizing music files, recording instruments, and pitch-correcting audio within Roblox.

This process commonly involves basic frequency analysis. While that is theoretically possible through AudioAnalyzer:GetSpectrum() and AudioPlayer:GetWaveform(), the former cannot be used to analyze the output from an AudioInputDevice, and either would likely perform better with a more efficient native solution. User privacy is also not likely a concern, as pitch information can be determined by Roblox without exposing any of the internal waveforms of the audio to developers or additional cloud-based systems.
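To illustrate the kind of workaround currently required, here is a minimal sketch of naive fundamental-frequency estimation built on AudioAnalyzer:GetSpectrum(). It assumes the returned array is a list of magnitudes spaced linearly from 0 Hz up to an assumed maximum frequency (the real bin layout should be checked against the documentation), and it simply picks the loudest bin, so it inherits the overtone problems discussed later in this thread.

```lua
-- Naive fundamental-frequency estimate from an AudioAnalyzer's spectrum.
-- ASSUMPTION: GetSpectrum() returns an array of magnitudes whose bins are
-- spaced linearly from 0 Hz up to MAX_FREQUENCY; check the docs for the
-- actual bin layout before relying on this.
local MAX_FREQUENCY = 24000 -- assumed upper bound of the spectrum, in Hz

local function estimatePitch(analyzer: AudioAnalyzer): number?
	local spectrum = analyzer:GetSpectrum()
	if #spectrum == 0 then
		return nil
	end

	-- Pick the loudest bin; this is why single-pitch reporting can jump
	-- octaves when an overtone is stronger than the fundamental.
	local loudestBin, loudestMagnitude = 1, 0
	for bin, magnitude in spectrum do
		if magnitude > loudestMagnitude then
			loudestBin, loudestMagnitude = bin, magnitude
		end
	end

	if loudestMagnitude <= 0 then
		return nil -- effectively silent; no meaningful pitch
	end

	return (loudestBin - 1) / #spectrum * MAX_FREQUENCY
end
```

A native property or method could avoid copying the spectrum into Luau every frame and use a far more robust estimator than simple peak-picking.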

Phoneme Recognition

Phoneme recognition plays a massive role in lip-syncing, both in real time for live performances, such as from the microphones of players in an experience, and during development, so that NPC actors may speak their lines without requiring heavily articulated facial animation work from the developers. This is normally done through phoneme recognition or similar methods, which map incoming audio onto a set of individual “categories” of sound that are then used to animate facial expressions in order to “reproduce” the original sound.

As before, this requires advanced frequency analysis and cannot be performed on input devices, which inhibits developers’ ability to build a lip-sync system independent of the one provided by Roblox. The only way developers can currently work with this is to read FACS data directly from the FaceControls object, which is not ideal, especially when not using standard character structures due to implementing server-authoritative replication practices or fully custom character controllers.
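For context, a developer-side lip-sync layer built on exposed phoneme/viseme data might look roughly like the sketch below, which maps a small set of viseme labels onto FaceControls properties. The viseme names, the weights, and the exact FACS property names are illustrative assumptions rather than an official mapping.

```lua
-- Hypothetical viseme -> FACS pose table; the labels and weights are
-- illustrative assumptions, not an official mapping.
local VISEME_POSES = {
	AA = { JawDrop = 0.7 },               -- open vowel, as in "father"
	OO = { Pucker = 0.8, JawDrop = 0.2 }, -- rounded vowel, as in "boot"
	MM = { LipsTogether = 1 },            -- bilabial closure: "m", "b", "p"
	Rest = {},
}

local function applyViseme(faceControls: FaceControls, viseme: string)
	-- Reset the controls this sketch animates, then apply the new pose.
	faceControls.JawDrop = 0
	faceControls.Pucker = 0
	faceControls.LipsTogether = 0

	for property, weight in VISEME_POSES[viseme] or VISEME_POSES.Rest do
		faceControls[property] = weight
	end
end
```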

Speech Recognition

Speech recognition has a wide variety of purposes, from live captions to voice commands, and several of them may be incredibly useful for the development of experiences on Roblox. One key example is accessibility through voice commands. Players would be able to specify voice commands or phrases that trigger corresponding events in the game. For example, a player could assign a word or phrase to quickly toggle certain features during gameplay, or avoid navigating complex menus by activating or browsing options with voice search. These could also be used to produce live subtitles from speaking players and character dialog.
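As a rough sketch of how a voice command layer could sit on top of a speech-to-text API, the snippet below matches a transcribed phrase against developer-registered commands. The onTranscription function is a stand-in for whatever callback or event such an API would actually expose; the phrases and actions are just examples.

```lua
-- Developer-registered voice commands; the phrases and actions are examples.
local commands = {
	["open map"] = function()
		print("toggling the map UI")
	end,
	["cast fireball"] = function()
		print("casting fireball")
	end,
}

-- Stand-in for a hypothetical speech-to-text result callback.
local function onTranscription(transcript: string)
	local phrase = string.lower(transcript)
	for trigger, action in commands do
		-- Plain-text search; a real system would want fuzzier matching.
		if string.find(phrase, trigger, 1, true) then
			action()
		end
	end
end

onTranscription("please open map now") --> toggling the map UI
```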

For more experimental use cases, developers would have the opportunity to bind game actions to localized voice commands, such as shouting the names of spells to cast them or conversing audibly with NPCs or AI models. Speech-to-text will also greatly benefit creators directly, as they will be able to dictate large volumes of text if they find speaking information easier or faster than manually typing it.

The complexity involved in speech analysis makes it unlikely to be developed by independent creators on Roblox, and it will most likely require AI-based solutions, which would significantly delay development and increase the costs of gathering the appropriate data, training a model, and building a system to run the model in Roblox.

Sound Recognition

Sound recognition is another feature most commonly associated with accessibility, but it is also helpful when categorizing sounds and providing a more accurate description of the ambient environment. Normally this is not a necessary process, as developers tend to know in advance which sound assets they will play and where, but in the case of live audio, game development, and UGC, it is not always so simple.

For example, a musician may want to categorize their instrument samples as guitars, and then separate those further into distorted guitars, and they would be able to do so easily through detailed sound analysis. They would also be able to detect sound events in ambient tracks, such as vehicle horns and police sirens in city ambiences. Developers could also make use of automatic sound categorization to organize or sort their sound libraries, such as quickly separating impact sounds of various strengths from footsteps when uploading in bulk.

Such a system could be implemented either with a library of sounds used to categorize/tag sections of any audio within a particular buffer or window, or by accepting another audio asset as a sample to compare likeness against.
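One way to approximate the “compare against a sample” variant today is to compare spectra from two AudioAnalyzer instances with a simple cosine similarity, as sketched below. This only measures likeness over a single analysis window and assumes both spectra share the same bin layout; it is nowhere near a full classification system, which is exactly why native support would help.

```lua
-- Rough spectral similarity between two analyzers over one analysis window.
-- ASSUMPTION: both spectra use the same bin layout and length.
local function spectralSimilarity(a: AudioAnalyzer, b: AudioAnalyzer): number
	local specA, specB = a:GetSpectrum(), b:GetSpectrum()
	local dot, normA, normB = 0, 0, 0

	for i = 1, math.min(#specA, #specB) do
		dot += specA[i] * specB[i]
		normA += specA[i] * specA[i]
		normB += specB[i] * specB[i]
	end

	if normA == 0 or normB == 0 then
		return 0 -- at least one window is silent
	end

	-- 1 means the two window spectra are proportional; 0 means unrelated.
	return dot / (math.sqrt(normA) * math.sqrt(normB))
end
```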


Personal Anecdote (My Own Current Use Cases)

I have been working on a multitude of projects, big and small, over the years, and have especially been toying with the Audio API, trying to figure out whether it is even possible to accomplish many of the things I have described; however, I would prefer not to have hacky/loophole-style solutions in my final versions. I was able to perform some of the frequency analysis by using filters/EQs and analyzing the volume response across various frequency bands from the microphone, but this is relatively performance-intensive and not at all convenient.
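For reference, the workaround I am describing looks roughly like the sketch below: the source is wired through an AudioEqualizer acting as a crude band isolator, and an AudioAnalyzer reads the level of whatever passes through. The property names (LowGain/MidGain/HighGain, RmsLevel, Wire.SourceInstance/TargetInstance) reflect my understanding of the current API, and the whole chain has to be duplicated for every band you want to measure, which is where the cost adds up.

```lua
-- Crude "low band" level meter: source -> EQ (mids/highs cut) -> analyzer.
-- Duplicate this chain per frequency band and the cost grows quickly.
local function makeLowBandMeter(source: Instance): AudioAnalyzer
	local eq = Instance.new("AudioEqualizer")
	eq.LowGain = 0    -- leave lows untouched (assuming gains are in dB)
	eq.MidGain = -80  -- effectively mute the mids
	eq.HighGain = -80 -- effectively mute the highs
	eq.Parent = source

	local analyzer = Instance.new("AudioAnalyzer")
	analyzer.Parent = eq

	local toEq = Instance.new("Wire")
	toEq.SourceInstance = source
	toEq.TargetInstance = eq
	toEq.Parent = eq

	local toAnalyzer = Instance.new("Wire")
	toAnalyzer.SourceInstance = eq
	toAnalyzer.TargetInstance = analyzer
	toAnalyzer.Parent = analyzer

	return analyzer -- poll analyzer.RmsLevel to read the band's level
end
```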

While I don’t currently have a use for Pitch Detection in my larger projects, I have been excited to see the development of projects such as RoBeats, and would love to develop similar games in the future, more akin to titles such as Rock Band; however, I am aware there are many other developers more passionate about such games who will likely build them before I do once this technology becomes readily available.

I plan to use Phoneme Recognition to provide accurate lip-sync data for character performances in-game, both when a character is speaking pre-recorded voicelines and dialog interactions, and when the player controlling the character speaks through their microphone.

Speech Recognition is a feature I would appreciate, as it would allow me to provide automatic captioning with time markers for character dialog, which I could bake into the game without using external software, copies of various scripts, or manual transcription; this would be especially valuable for localized voicelines. I would also like to enable an additional accessibility feature where players’ speech may be transcribed directly into captions to facilitate gameplay for deaf or hard-of-hearing players.

The use of Sound Recognition will mostly be during the process of categorizing sound assets, particularly those bulk-imported from sound libraries or sound packs, for use in game development. Additionally, I would like to use it experimentally to easily provide closed-caption information for complex soundscapes, such as creaky, run-down metallic structures, and for cutscenes. While this can be performed manually through transcription, the process has to be redone any time there is a change in the audio direction, design, or timing.


Closing

While the new Audio API has a whole host of amazing features and concepts, especially when it comes to applying effects to live audio and manually parsing recorded audio, it seems to complicate real-time audio analysis and the deeper audio-processing concepts that can be found in other engines and plugins alike. These additions to the engine would greatly improve the range of possibilities of the Audio API and provide developers with simpler, more universal, and reliable solutions to several important problems. Privacy for players using voice chat would remain preserved, and the computational load would be minimal compared to stacking filters/EQs or converting to and from tables.

Hopefully I have explained my suggestions at length and provided thorough examples of where they can enhance the experience of both players and developers. I would love to learn more about the future of the Audio API as I continue to provide feedback from my experience on this platform and others.

7 Likes

Hello.

For one of your suggestions, you might be interested in [Beta] Introducing Text Generation API

1 Like

I very much am! Though I would ideally like to use my own APIs for text, due to having more configuration options when it comes to model selection and style (unless Roblox provides a similar level of control over flow, structure, rules, environment, and history, much like the recent announcements at the State of Play for Unreal and the Press the Button demo). I have also been keeping a close eye on SpeechToText, as it is currently disabled in Studio.

It is possible that I may have to use custom APIs for this as well and import the finalized audio stream into an AudioPlayer whenever that becomes available, similar to EditableImage and EditableMesh. However, this would introduce notable latency and would potentially be incompatible with data streaming, as Roblox doesn’t support data streaming protocols out of the box (i.e. through connection).

Hey @Wunder_Wulfe – Some of these requests should be addressed soon

For pitch recognition: we’ve talked about adding something like a read-only AudioAnalyzer.LoudestFrequency property. But reporting just one pitch is not super stable – even monophonic instruments (e.g. trumpets) that are playing one note at a time can have significant overtones, leading to unexpected octave-jumps. Multi-pitch or non-harmonic (e.g. metals) streams get really hard. There are techniques to overcome that if you have advance knowledge about what sort of signals you’ll be analyzing, but they’d necessarily make the API less general. In the year 2025 it is surprisingly difficult to implement robust, real-time, multi-pitch detection :sad:

For speech recognition: we are working on a new AudioSpeechToText instance that should solve this. It may also help with phoneme recognition, but I don’t know how real-time it would be.

For sound recognition: since the files are uploaded in advance, it probably makes sense to use an offline/metadata api for this – I’m not sure if AssetService has anything for getting category info yet, but we can look into exposing that.

2 Likes

I understand this, especially since I have only seen such features in Melodyne; however, I think detecting fundamental frequencies would definitely be one of the simpler and more convenient approaches that should serve a vast majority of circumstances, including static analysis (i.e. detecting the key of a song from a melody, or detecting notes in a vocal track for pitch-correction or auto-tune), and maybe very simple real-time analysis for something like karaoke games. I don’t expect to see something like detecting the exact notes in guitar chords anytime soon, but it would certainly be very interesting!

On an adjacent note, there are also some other things that might be pretty fun, like oscillators, FM synthesis, and vocoding, but those are more oriented toward the music aspects of Roblox development. Hopefully, tools such as AudioRecorder will become more robust in the future so that you can mix and master sounds without hitting the audio limit, and AudioPlayer will implement SetWaveform(), AudioInput will implement SetBuffer(), or similar enhancements will arrive for creating oscillators manually and simulating instruments by ‘playing’ sounds in a single sound object as opposed to multiple, or for speech via custom synthesis engines and algorithms.
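To make the oscillator idea concrete, here is roughly what generating one cycle of a sine wave for a hypothetical AudioPlayer:SetWaveform() (or an equivalent buffer-based API) could look like. SetWaveform itself, its sample-rate argument, and the expected sample format are all assumptions, since no such method exists today.

```lua
-- Build one cycle of a 440 Hz sine wave as a plain array of samples in [-1, 1].
-- The sample rate and format are assumptions for illustration.
local SAMPLE_RATE = 48000
local FREQUENCY = 440

local function buildSineCycle(): { number }
	local samplesPerCycle = math.floor(SAMPLE_RATE / FREQUENCY)
	local samples = table.create(samplesPerCycle)
	for i = 1, samplesPerCycle do
		samples[i] = math.sin(2 * math.pi * (i - 1) / samplesPerCycle)
	end
	return samples
end

-- Hypothetical usage, if such an API existed:
-- audioPlayer:SetWaveform(buildSineCycle(), SAMPLE_RATE)
-- audioPlayer:Play()
```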

Though I believe that one main feature which would be amazing in terms of the audio limit specifically is generic forms of ‘offline rendering’, i.e. rendering an FX chain asynchronously across the playback of an audio for mixing and mastering tracks together to optimize audio without having to run an AudioRecorder in real-time.

The concern here for me is detecting phonemes instead of phrases, as phrases will not be able to provide me any information on the timings of the sounds, nor the sounds themselves, only their text representation. VRChat and other systems tend to use ‘visemes’, for example, which would allow a player’s avatar to visually distinguish the American and British English pronunciations of a word such as “Tuesday” or “privacy,” while being general enough to work in most languages without requiring any translation tools or text-to-phoneme conversion. Producing text is also not efficient, as more information and larger buffers are needed to properly analyze the speech, which is not ideal for minimizing latency in visual effects. Subtitles are fine with a small amount of latency, but a character’s mouth should have as little latency as possible while they are speaking.

1 Like

I believe that one main feature which would be amazing in terms of the audio limit specifically is generic forms of ‘offline rendering’, i.e. rendering an FX chain asynchronously across the playback of an audio for mixing and mastering tracks together to optimize audio without having to run an AudioRecorder in real-time.

Faster-than-realtime baking/bouncing recordings would be super powerful.

This is tricky to support without good timing guarantees – atm our scripting engine gives you a window of time to change properties or call functions once per frame – but for sample-accurate manipulations, you’d kind of want to spell out a whole sequence (a “plan”) of property changes – in advance – before you do the baking.

We don’t have any API that lets you do that (yet), but it’s definitely something we’re talking about.

1 Like

This is something that is difficult to optimize without the use of finite state machines, as those are the most optimal in terms of action and step reduction, plan/flow optimization, and preserving state between computations. However, an approach worth considering (at least for development or music-production purposes) may be synchronous behaviors that operate on audio buffers and update them; these could run in parallel as long as the components are not wired serially/sequentially. Ideally, the desired buffer sizes should be specified in advance for the appropriate data size and latency during computation.
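To illustrate the kind of “plan” being discussed, here is one hypothetical shape for a pre-declared, sample-accurate sequence of property changes that an offline render could consume. None of this is a real API; it only sketches the data a developer would want to hand over in advance, instead of mutating properties frame by frame.

```lua
-- Hypothetical render plan: every entry would be applied at an exact sample
-- offset during offline rendering, rather than on the per-frame script
-- schedule. The BakeAsync call at the bottom does not exist.
type PlanStep = {
	sample: number,     -- offset into the render, in samples
	instance: Instance, -- a node in the FX chain, e.g. a fader or player
	property: string,
	value: any,
}

local player = Instance.new("AudioPlayer")
local fader = Instance.new("AudioFader")

local plan: { PlanStep } = {
	{ sample = 0, instance = fader, property = "Volume", value = 1 },
	{ sample = 48000, instance = fader, property = "Volume", value = 0.2 },
	{ sample = 96000, instance = player, property = "PlaybackSpeed", value = 0.5 },
}

-- Hypothetical usage:
-- AudioRenderer:BakeAsync(outputWire, plan, { sampleRate = 48000 })
```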

I would also love to see internal interpolation for more properties aside from just AudioEcho, as changing PlaybackSpeed, TimePosition, and Volume directly is undesirable and makes it impossible to smoothly slide audio without encountering jitters and stutters. Prominent examples include record scratching (see fig. 1), pitch sliding (i.e. when playing MIDI or procedural music), filter envelope simulation for sounds (attack, decay, sustain, release), etc. Running the game at higher framerates does help alleviate the issue somewhat, but 240 fps is not comparable to 44+ kHz, and PlaybackSpeed doesn’t currently support negative values either.


fig. 1
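For reference, the best a script can do right now is step the property once per frame, roughly as sketched below; each step lands on a frame boundary rather than a sample boundary, which is exactly where the audible stair-stepping comes from.

```lua
local RunService = game:GetService("RunService")

-- Per-frame ramp of PlaybackSpeed. Steps arrive roughly every 1/60 to 1/240
-- of a second, far coarser than the audio sample rate, so fast slides stutter.
local function rampPlaybackSpeed(player: AudioPlayer, target: number, duration: number)
	local start = player.PlaybackSpeed
	local elapsed = 0

	local connection
	connection = RunService.Heartbeat:Connect(function(deltaTime)
		elapsed = math.min(elapsed + deltaTime, duration)
		player.PlaybackSpeed = start + (target - start) * (elapsed / duration)
		if elapsed >= duration then
			connection:Disconnect()
		end
	end)
end
```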

1 Like

Aw heck yeah! I’ve been waiting for a speech recognition instance since the new Audio API came out, can’t wait!