Overview
I have been pleasantly surprised by the new Audio API features, as well as their performance, presentation, behavior, and implementation. Even so, I have noticed that there are still a handful of pertinent issues that I cannot manage to solve in Roblox due to performance concerns and/or limitations of the AudioAnalyzer and AudioPlayer objects. I will start by explaining the simpler use cases to develop and work up to the more complicated scenarios.
Pitch Recognition
Pitch recognition is a fairly general concept used in a range of tools, such as tuners, karaoke machines, and music software, and it plays a key part when working with audio. Without this feature, developers will have a difficult time making sing-along games, building vocal practice tools or lessons, creating music, organizing music files, recording instruments, and pitch-correcting audio within Roblox.
This process commonly involves basic frequency analysis. While this is theoretically possible through AudioAnalyzer:GetSpectrum() and AudioPlayer:GetWaveform(), the former cannot be used to analyze the output of an AudioDeviceInput, and the latter would likely be outperformed by a more efficient native solution. User privacy is also unlikely to be a concern, as pitch information can be determined by Roblox without exposing any of the internal waveforms of the audio to developers or additional cloud-based systems.
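To illustrate the gap, here is a minimal sketch of the kind of naive pitch estimation developers are currently left to write themselves. It assumes GetSpectrum() returns evenly spaced magnitude bins up to roughly 24 kHz; the bin layout, the analyzer's placement, and the peak-picking approach are all my own assumptions, not a documented recipe:

```lua
local RunService = game:GetService("RunService")

-- Assumed to already be wired to an AudioPlayer elsewhere in the place.
local analyzer: AudioAnalyzer = workspace.Analyzer
local ASSUMED_MAX_HZ = 24000 -- assumed upper bound of the spectrum

local function estimatePitch(): number?
	local spectrum = analyzer:GetSpectrum()
	if #spectrum == 0 then
		return nil
	end
	local binWidth = ASSUMED_MAX_HZ / #spectrum
	local peakIndex, peakMagnitude = 1, 0
	for index, magnitude in spectrum do
		if magnitude > peakMagnitude then
			peakIndex, peakMagnitude = index, magnitude
		end
	end
	-- The loudest bin is only a crude stand-in for the fundamental; a real
	-- tuner would use autocorrelation or a harmonic product spectrum.
	return (peakIndex - 1) * binWidth
end

RunService.Heartbeat:Connect(function()
	local pitch = estimatePitch()
	if pitch then
		print(`Estimated pitch: {math.round(pitch)} Hz`)
	end
end)
```

Even this naive version is limited by the spectrum's frequency resolution in the low registers, and it cannot run against microphone input at all, which is exactly why a native solution would be preferable.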
Phoneme Recognition
Phoneme recognition plays a massive role in lip-syncing, both in real time for live performances, such as from the microphones of players in an experience, and during development, so that NPC actors may speak their lines without requiring heavily articulated facial animation work from developers. This is normally done through phoneme recognition or similar methods, which allow input to flow smoothly into a set of individual "categories" of sound, which are then used to animate facial expressions in order to "reproduce" the original sound.
Similarly to before, this requires advanced frequency analysis and cannot be performed on input devices, which inhibits a developer's ability to build a lipsync system independent of the one provided by Roblox. The only way developers can currently work with this is to read FACS data directly from the FaceControls object, which is not ideal, especially when not using standard character structures due to server-authoritative replication practices or fully custom character controllers.
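For reference, the workaround looks roughly like the sketch below: sampling FACS properties off a character's FaceControls. JawDrop and LipsTogether are real FaceControls properties, but treating their difference as mouth "openness" is my own heuristic rather than anything the engine provides:

```lua
local function sampleMouthOpenness(character: Model): number?
	local head = character:FindFirstChild("Head")
	local faceControls = head and head:FindFirstChildOfClass("FaceControls")
	if not faceControls then
		return nil -- non-standard rigs often have no FaceControls at all
	end
	-- Crude heuristic: how far the jaw is dropped, minus how closed the lips are.
	return math.clamp(faceControls.JawDrop - faceControls.LipsTogether, 0, 1)
end
```

Note that this only works when Roblox's own lipsync pipeline is already driving FaceControls, which is precisely the dependency I would like to avoid.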
Speech Recognition
Speech recognition has a wide variety of purposes, from live captions to voice commands, and several of these may be incredibly useful for the development of experiences on Roblox. One key example is accessibility through voice commands. Players would be able to specify voice commands or phrases that trigger corresponding events in the game. For example, a player could assign a word or phrase that quickly toggles certain features during gameplay, or avoid navigating complex menus by activating or browsing options with voice search. These may also be used to produce live subtitles from speaking players and character dialog.
For more experimental use cases, developers would have the opportunity to bind game actions to localized voice commands, as in the case of shouting the names of spells to cast them or conversing audibly with NPCs or AI models. Speech-to-text would also greatly benefit creators directly, as they would be able to dictate volumes of text if they find it easier or faster to speak information than to type it manually.
The complexity involved in speech analysis makes it unlikely to be developed by independent creators on Roblox; it would most likely require AI-based solutions, which would significantly delay development and increase the cost of gathering the appropriate data, training a model, and building a system to run the model in Roblox.
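To make the request concrete, here is a purely hypothetical sketch of what a developer-facing voice-command API could look like; AudioSpeechRecognizer, its Phrases property, and its PhraseRecognized event do not exist and are illustrative only:

```lua
-- Hypothetical class and members; none of this exists in the engine today.
local recognizer = Instance.new("AudioSpeechRecognizer")
recognizer.Phrases = { "open map", "cast fireball" } -- hypothetical property
recognizer.Parent = workspace

recognizer.PhraseRecognized:Connect(function(player: Player, phrase: string)
	print(player.Name, "said:", phrase)
	if phrase == "cast fireball" then
		-- a real game would trigger the corresponding action here
	end
end)
```

An event-based shape like this would keep the raw audio inside the engine, preserving voice-chat privacy while still exposing the recognized phrases to developers.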
Sound Recognition
Sound recognition is another capability most commonly associated with accessibility, but it is also helpful for categorizing sounds and providing a more accurate description of the ambient environment. Normally, this is not a necessary process, as developers tend to know in advance which sound assets they will play and where, but in the case of live audio, game development, and UGC, it is not always so simple.
For example, a musician may want to categorize their instrument samples as guitars, and then separate those further into distorted guitars, and they would be able to do so easily through detailed sound analysis. They would also be able to detect sound events in ambient tracks, such as vehicle horns and police sirens in city ambiences. Developers may also make use of automatic sound categorization to organize or sort their sound libraries, such as quickly separating impact sounds of varying strength from footsteps when uploading in bulk.
Such a system could be implemented either with a library of sounds to pull from, categorizing/tagging sections of any audio within a particular buffer or window, or with another audio asset serving as a sample to compare likeness against, as sketched below.
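Here is a purely hypothetical sketch of those two modes, using invented method names; neither GetSoundTags nor GetSimilarity exists on AudioAnalyzer today:

```lua
local analyzer: AudioAnalyzer = workspace.Analyzer -- wired to some source

-- Mode 1: tag the last two seconds of audio against a built-in library.
local tags = analyzer:GetSoundTags(NumberRange.new(0, 2)) -- hypothetical
for _, tag in tags do
	print(tag.Name, tag.Confidence) -- e.g. "GuitarDistorted", 0.87
end

-- Mode 2: compare likeness against a reference sample.
local similarity = analyzer:GetSimilarity("rbxassetid://0") -- hypothetical; placeholder ID
print("Likeness to reference:", similarity)
```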
Personal Anecdote (My Own Current Use Cases)
I have been working on a multitude of projects, big and small, over the years, and have especially been toying with the Audio API to figure out whether it is even possible to accomplish many of the things I have described, but I would prefer not to have hacky, loophole-style solutions in my final versions. I was able to perform some of the frequency analysis through the use of filters/EQs and analysis of the volume response across various frequency bands from the microphone, but this is relatively performance-intensive and not at all convenient.
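For the curious, that workaround looks roughly like the sketch below: one bandpass AudioFilter and one AudioAnalyzer per band of interest, connected with Wires. These are all real Audio API instances, but the enum member name and the wiring layout should be verified against the current documentation, and the helper and band choices are my own:

```lua
local function wire(source: Instance, target: Instance, parent: Instance)
	local w = Instance.new("Wire")
	w.SourceInstance = source
	w.TargetInstance = target
	w.Parent = parent
end

local function makeBandMeter(source: Instance, centerHz: number): AudioAnalyzer
	local filter = Instance.new("AudioFilter")
	filter.FilterType = Enum.AudioFilterType.Bandpass -- assumed member name
	filter.Frequency = centerHz
	filter.Parent = source

	local analyzer = Instance.new("AudioAnalyzer")
	analyzer.Parent = filter

	wire(source, filter, filter)
	wire(filter, analyzer, analyzer)
	return analyzer -- poll analyzer.RmsLevel each frame for this band
end

-- One meter per band; polling several analyzers every frame is what makes
-- this approach comparatively expensive.
local input = workspace:FindFirstChildWhichIsA("AudioDeviceInput", true)
local meters = {}
if input then
	for _, hz in { 110, 220, 440, 880, 1760 } do
		meters[hz] = makeBandMeter(input, hz)
	end
end
```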
While I don't currently have a use for Pitch Detection in my larger projects, I have been excited to see the development of projects such as RoBeats, and would love to develop similar games in the future more akin to titles such as Rock Band, but I am aware that many other developers are likely more passionate about such games and will develop them before I do once this technology becomes readily available.
I plan to use Phoneme Recognition to provide accurate lipsync data for character performances in-game, both when a character speaks pre-recorded voicelines or dialog interactions, and when the player controlling the character speaks through their microphone.
Speech Recognition is a feature I would appreciate, as it would allow me to provide automatic captioning with time markers for character dialog, which I would be able to bake into the game without using external software, copies of a variety of scripts, or manual transcription; this is especially valuable for localized voicelines. I would also like to enable an additional accessibility feature where players' speech may be transcribed directly into captions to facilitate gameplay for the deaf or hard of hearing.
My use of Sound Recognition would mostly be during the process of categorizing sound assets, particularly those bulk-imported from sound libraries or sound packs, for use in game development. Additionally, I would like to use it experimentally to easily provide closed-caption information for complex soundscapes, such as creaky, run-down metallic structures, and cutscenes. While this can be performed manually through transcription, the process has to be redone any time there is a change in the audio direction, design, or timing.
Closing
While the new Audio API has a whole host of amazing features and concepts, especially when it comes to applying effects to live audio and manually parsing recorded audio, it seems to complicate real-time audio analysis and the deeper audio-processing concepts that can be found in other engines and plugins alike. These additions to the engine would greatly improve the range of possibilities of the Audio API and provide developers with simpler, more universal solutions to multiple important problems in a reliable manner. Privacy for players using voice chat would remain preserved, and the computational load would be minimal compared to stacking filters/EQs or doing conversions to and from tables.
Hopefully, I have explained my suggestions at length and provided thorough examples detailing where they can enhance the experience of both players and developers. I would love to learn more about the future of the Audio API as I continue to provide feedback from my experience on this platform and others.