Introduction
At Roblox, we’re harnessing the power of generative AI to support you in your creative process. We’re building AI tools designed to help you be more productive, so you can craft immersive experiences like never before.
We want to assure you that we handle your shared data for AI training with the utmost care and respect for your privacy and creativity. This post outlines how we process and utilize your script data for AI model training while safeguarding your privacy. For more information on how to protect your IP when creating with generative AI, you can read this post.
Our Approach
To ensure that shared datasets do not contain sensitive or harmful information, and to protect the originality of your unique experiences, Roblox follows a set of rigorous practices. The data is split into tokens (small snippets) for training, similar to paper shredding. Here, we give a brief overview of each step and illustrate how it helps protect the privacy of your data; the overall process is depicted in the image below.
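To make the "paper shredding" analogy concrete, here is a minimal sketch of splitting a script into small token snippets. The crude whitespace-and-symbol tokenizer and the snippet size are illustrative assumptions, not the production tokenizer:

```python
# Hypothetical sketch: split a script into small token "shreds",
# loosely analogous to paper shredding. The simple regex tokenizer
# below is an illustration only, not Roblox's actual tokenizer.
import re

def shred(script: str, snippet_len: int = 8) -> list:
    # Crude tokenization: identifiers, numbers, and single symbols.
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", script)
    # Group tokens into fixed-size snippets for training.
    return [tokens[i:i + snippet_len] for i in range(0, len(tokens), snippet_len)]

snippets = shred("local speed = 16\nprint(speed * 2)")
print(snippets[0])  # first snippet of tokens
```

Training on snippets like these, rather than whole scripts, is one way the pipeline avoids memorizing any single experience end to end.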
PII Filter
We adopted the PII filter developed by the BigCode StarCoder project. This classifier identifies sensitive information such as email addresses, API keys, and home addresses. In our AI training pipeline, we apply the PII filter at the script level: if PII is detected anywhere in a script, we discard the entire script.
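The script-level policy can be sketched as follows. Note that the regexes here are simplified illustrations; the actual filter is the BigCode StarCoder PII classifier, which is ML-based rather than regex-based:

```python
# Hypothetical sketch of script-level PII filtering: if any PII-like
# pattern is found anywhere in a script, the whole script is dropped.
# These regexes are illustrative stand-ins for the real classifier.
import re

PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),           # email-like strings
    re.compile(r"(?i)api[_-]?key\s*=\s*['\"]\w{16,}"),  # hard-coded API keys
]

def keep_script(script: str) -> bool:
    # Discard the entire script on any PII hit.
    return not any(p.search(script) for p in PII_PATTERNS)

scripts = ["print('hi')", "local owner = 'dev@example.com'"]
clean = [s for s in scripts if keep_script(s)]
print(len(clean))  # → 1
```

Dropping the whole script, rather than redacting the match, errs on the side of caution: no surrounding context that might re-identify the PII survives into training.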
Malicious Asset Filter
To ensure the safety and integrity of our platform, we have implemented robust measures to handle potentially harmful content.
Our system adapts advanced safety filters similar to those used in the Creator Store. By analyzing multiple signals, we can effectively identify and mitigate risks associated with malicious scripts and assets.
This multi-layered approach helps us uphold our Community Standards and protect our users from harmful content.
Additional Safety and Quality Steps
In addition, to further refine our dataset and maintain a high standard of data quality, we have implemented a series of safety measures:
As an additional filtering measure, we exclude scripts that contain rare tokens. This removes unique code, helps us train on common scripting patterns, and limits the likelihood of including something specific to just one experience.
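A minimal sketch of this idea is a document-frequency filter: drop any script containing a token that appears in very few scripts across the corpus. The whitespace tokenizer and the threshold are illustrative assumptions:

```python
# Hypothetical sketch: drop scripts containing tokens that appear in
# fewer than `min_df` scripts corpus-wide. The whitespace tokenizer
# and threshold are illustrative, not Roblox's actual criteria.
from collections import Counter

def filter_rare(scripts: list, min_df: int = 2) -> list:
    # Document frequency: in how many scripts does each token appear?
    df = Counter(tok for s in scripts for tok in set(s.split()))
    return [s for s in scripts
            if all(df[tok] >= min_df for tok in s.split())]

corpus = [
    "local hp = 100",
    "local hp = 100",
    "local secretBossPhrase = 100",  # unique token -> script excluded
]
print(filter_rare(corpus))
```

A token unique to one experience (like a distinctive variable name) pulls its whole script out of the training set, which keeps the model focused on common scripting patterns.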
Another filter performs a complexity analysis of the Luau scripts and discards those that are either too simplistic or too convoluted.
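One way to sketch such a band-pass filter is to keep only scripts whose rough complexity score falls within a middle range. The keyword-counting metric and thresholds below are illustrative assumptions, not Roblox's actual measure:

```python
# Hypothetical sketch of a complexity band: keep scripts whose rough
# complexity score (count of branch/loop keywords) falls in a middle
# range. Keyword list and thresholds are illustrative assumptions.
LUAU_BRANCH_KEYWORDS = {"if", "elseif", "for", "while", "repeat", "and", "or"}

def complexity(script: str) -> int:
    return sum(tok in LUAU_BRANCH_KEYWORDS for tok in script.split())

def in_band(script: str, lo: int = 1, hi: int = 20) -> bool:
    # Below `lo` is too simplistic; above `hi` is too convoluted.
    return lo <= complexity(script) <= hi

print(in_band("print('hi')"))                # too simple -> False
print(in_band("if hp > 0 then heal() end"))  # in range -> True
```

Real complexity metrics (e.g. cyclomatic complexity over a parsed AST) are more robust than token counting, but the band-pass principle is the same.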
Additionally, we conduct regular data quality checks and extensive evaluations to remove low-quality data and employ techniques to ensure our models generalize well across diverse inputs. Continuous monitoring of the data pipeline allows us to adapt and enhance our safety measures as needed.
By adhering to these practices, we strive to maintain the highest standards in data quality and user privacy, ensuring that your contributions help us build better, more effective AI models while protecting your personal information.
Evaluation Process
After processing your shared data and completing the model training pipeline, we run an extensive evaluation process to guard against any kind of regression.
First, we have an automated offline evaluation process that can effectively evaluate the trained model’s performance on a variety of tasks to mimic the actual user experience.
Next, if the model performs reasonably in offline evaluation compared to the previous iteration, we run an online A/B experiment to confirm that key metrics improve before rolling the model out to all users.
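The A/B gate can be sketched as a significance test on a key metric, such as a suggestion-acceptance rate. The metric, the one-sided z-test, and the threshold below are illustrative assumptions, not Roblox's actual experiment framework:

```python
# Hypothetical sketch of an A/B rollout gate: ship the new model only
# if a key rate metric improves with statistical significance.
from math import sqrt, erfc

def ab_improved(ctrl_hits, ctrl_n, treat_hits, treat_n, alpha=0.05):
    p1, p2 = ctrl_hits / ctrl_n, treat_hits / treat_n
    pooled = (ctrl_hits + treat_hits) / (ctrl_n + treat_n)
    se = sqrt(pooled * (1 - pooled) * (1 / ctrl_n + 1 / treat_n))
    z = (p2 - p1) / se
    p_value = erfc(z / sqrt(2)) / 2  # one-sided: treatment > control
    return p2 > p1 and p_value < alpha

# 10% vs 11% acceptance over 10k sessions each: a significant lift.
print(ab_improved(1000, 10000, 1100, 10000))
```

Requiring both offline and online wins before rollout means a model that merely matches the previous iteration never reaches all users.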
Anonymization
The scripts we train on are delinked from any personal information about the creator and are thus fully anonymized. As an example, you can look at the first Luau dataset we made publicly available. We continuously update our data pipeline and training strategies with our creators' privacy in mind.
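Conceptually, delinking means that identity fields never travel with the script text into training. The record shape and field names below are purely hypothetical illustrations:

```python
# Hypothetical sketch of delinking: strip creator-identifying fields
# from each record so only the script content enters training.
# All field names here are illustrative assumptions.
def delink(record: dict) -> dict:
    ALLOWED = {"script_text", "language"}
    return {k: v for k, v in record.items() if k in ALLOWED}

rec = {"script_text": "print('hi')", "language": "luau",
       "creator_id": 12345, "experience_id": 678}
print(delink(rec))  # → {'script_text': "print('hi')", 'language': 'luau'}
```

An allow-list (keep only known-safe fields) is generally safer than a deny-list here, since new identifying fields added upstream are excluded by default.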
Data Retention
We recognize the importance of data lifecycle management. Our data retention policy is designed to store creator data only as long as it is beneficial for model improvement. Beyond the retention period, data is purged from our systems automatically, ensuring that we do not retain any information longer than necessary. If you change your data-sharing preferences so that your data is no longer shared to improve AI services, downstream artifacts such as models derived from that data are also periodically purged.
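An automated purge of this kind can be sketched as a periodic sweep that drops anything older than the retention window. The 30-day window and field names are illustrative assumptions, not Roblox's actual policy parameters:

```python
# Hypothetical sketch of automated retention: periodically drop any
# record older than the retention window. Window length and field
# names are illustrative assumptions.
from datetime import datetime, timedelta, timezone

def purge_expired(records, retention_days=30, now=None):
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    # Keep only records ingested within the retention window.
    return [r for r in records if r["ingested_at"] >= cutoff]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "ingested_at": datetime(2024, 5, 20, tzinfo=timezone.utc)},
    {"id": 2, "ingested_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]
print([r["id"] for r in purge_expired(records, now=now)])  # → [1]
```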
Future Asset Types
Before the data policy update, we only used creator-contributed Luau scripts to power generative AI features such as Assistant and Code Assist. With this update, we welcome a wider range of asset types, which will fuel the development of multi-modal AI models for your benefit. Every asset type has unique requirements, and as we add new asset types, we will provide updates on how we plan to process them.
Conclusion
At Roblox, we believe that responsible technology use can coexist with the utmost respect for user privacy. By adhering to strict data processing protocols and constantly refining our practices, we ensure that creator data contributes to the advancement of AI in a secure and ethical manner.