Enhancing AI Responsibly: Our Approach to Creator Data Sharing

Introduction

At Roblox, we’re harnessing the power of generative AI to support you in your creative process. We’re building AI tools designed to help you be more productive, so you can craft immersive experiences like never before.

We want to assure you that we handle your shared data for AI training with the utmost care and respect for your privacy and creativity. This post outlines how we process and utilize your script data for AI model training while safeguarding your privacy. For more information on how to protect your IP when creating with generative AI, you can read this post.

Our Approach

To ensure that the shared datasets do not contain sensitive or harmful information, and to protect the originality of your unique experiences, Roblox follows a set of rigorous practices. For training, the data is split into tokens (small snippets), similar to paper shredding. Here, we give a brief overview of each step and explain how it helps protect the privacy of your data. The image below shows an overview of the pipeline:

[Image: data sharing pipeline]

PII Filter

We adopted a PII filter developed by the BigCode StarCoder project. This classifier identifies and filters sensitive information including email addresses, API keys, and home addresses. In our AI training pipeline, we apply the PII filter at the script level: if any PII is detected inside a script, we exclude the entire script from training.
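To make "script-level" filtering concrete, here is a minimal sketch. It is an illustration only, not the actual pipeline: the real filter is a learned classifier, while this stand-in uses two simple regular expressions, and the names `EMAIL_RE`, `SECRET_RE`, `contains_pii`, and `filter_scripts` are hypothetical. The point it shows is that one hit removes the whole script rather than only the matching line.

```python
import re

# Hypothetical stand-in patterns. The real filter is a learned classifier;
# these regexes only illustrate the script-level drop behavior.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumed shape of a hard-coded secret: a key-like name assigned a long quoted string.
SECRET_RE = re.compile(
    r"(api[_-]?key|secret|token)\s*=\s*[\"'][A-Za-z0-9_\-]{16,}[\"']",
    re.IGNORECASE,
)


def contains_pii(script: str) -> bool:
    """Return True if any stand-in PII pattern matches anywhere in the script."""
    return bool(EMAIL_RE.search(script) or SECRET_RE.search(script))


def filter_scripts(scripts: list[str]) -> list[str]:
    """Keep only scripts with no detected PII; a single hit drops the whole script."""
    return [s for s in scripts if not contains_pii(s)]


if __name__ == "__main__":
    sample = [
        'print("hello")',
        'local owner = "dev@example.com"',
        'local apiKey = "abcd1234abcd1234abcd"',
    ]
    print(filter_scripts(sample))
```

Dropping the whole script is the conservative choice: it gives up some training data in exchange for not having to trust that every sensitive fragment was redacted correctly.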

Malicious Asset Filter

To ensure the safety and integrity of our platform, we have implemented robust measures to handle potentially harmful content.

Our system uses advanced safety filters similar to those used in the Creator Store. By analyzing multiple signals, we can effectively identify and mitigate risks associated with malicious scripts and assets.

This multi-layered approach helps us uphold our Community Standards and protect our users from harmful content.

Additional Safety and Quality Steps

In addition, to further refine our dataset and maintain a high standard of data quality, we have implemented a series of safety measures:

We exclude scripts that contain rare tokens. This removes unique code, helps us train on common scripting patterns, and limits the likelihood of including something specific to just one experience.

Another filter performs a complexity analysis of the Luau scripts and discards those that are either too simplistic or too convoluted.

Additionally, we conduct regular data quality checks and extensive evaluations to remove low-quality data and employ techniques to ensure our models generalize well across diverse inputs. Continuous monitoring of the data pipeline allows us to adapt and enhance our safety measures as needed.

By adhering to these practices, we strive to maintain the highest standards in data quality and user privacy, ensuring that your contributions help us build better, more effective AI models while protecting your personal information.
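The rare-token and complexity filters described above can be sketched as follows. This is a rough illustration under stated assumptions, not the production logic: the tokenizer, the `min_doc_freq` threshold, the keyword-count complexity proxy, and the `low`/`high` bounds are all hypothetical, and each script is treated as if it came from a different experience.

```python
import re
from collections import Counter

# Hypothetical tokenizer: identifiers, integers, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|\S")

# Assumed proxy for complexity: Luau keywords that open a branch, loop, or function.
BRANCH_KEYWORDS = {"if", "elseif", "for", "while", "repeat", "function"}


def tokenize(script: str) -> list[str]:
    return TOKEN_RE.findall(script)


def drop_rare_token_scripts(scripts: list[str], min_doc_freq: int = 2) -> list[str]:
    """Drop every script containing a token seen in fewer than min_doc_freq scripts."""
    token_sets = [set(tokenize(s)) for s in scripts]
    doc_freq = Counter()
    for tokens in token_sets:
        doc_freq.update(tokens)  # each script counts a token at most once
    return [
        script
        for script, tokens in zip(scripts, token_sets)
        if all(doc_freq[t] >= min_doc_freq for t in tokens)
    ]


def complexity_score(script: str) -> int:
    """Count branch, loop, and function keywords as a crude complexity proxy."""
    return sum(1 for t in tokenize(script) if t in BRANCH_KEYWORDS)


def within_complexity_band(script: str, low: int = 1, high: int = 50) -> bool:
    """Keep scripts that are neither too simple (below low) nor too convoluted (above high)."""
    return low <= complexity_score(script) <= high


if __name__ == "__main__":
    corpus = [
        "local x = 1",
        "local x = 1 print(x)",
        "print(x) local secretThing = 1",
    ]
    print(drop_rare_token_scripts(corpus))
    print(within_complexity_band("if a then for i = 1, 3 do end end"))
```

In the sample corpus, `secretThing` appears in only one script, so that script is dropped even though the rest of its tokens are common. This matches the intent of keeping training focused on widely shared patterns.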

Evaluation Process

After processing your shared data and completing the model training pipeline, we run extensive evaluations to guard against regressions.

First, we have an automated offline evaluation process that can effectively evaluate the trained model’s performance on a variety of tasks to mimic the actual user experience.

Next, if the model shows reasonable performance in the offline evaluation phase compared to the last iteration, we conduct an online A/B experiment to make sure key metrics are improved before rolling it out to all users.
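The two-stage gate described above can be sketched like this. All names, the single aggregate offline score, and the `tolerance` value are assumptions made for illustration; the post does not specify which metrics are used or how "reasonable performance" is defined.

```python
def passes_offline_gate(candidate: float, previous: float, tolerance: float = 0.01) -> bool:
    """Assumed rule: the candidate may not fall more than `tolerance` below the last iteration."""
    return candidate >= previous - tolerance


def should_roll_out(
    candidate_offline: float,
    previous_offline: float,
    ab_metric_deltas: dict[str, float],
    tolerance: float = 0.01,
) -> bool:
    """Roll out only if the offline gate passes and every key A/B metric improved."""
    if not passes_offline_gate(candidate_offline, previous_offline, tolerance):
        return False  # never reaches the online experiment
    if not ab_metric_deltas:
        return False  # no online evidence, so no rollout
    return all(delta > 0 for delta in ab_metric_deltas.values())


if __name__ == "__main__":
    # Hypothetical metric name and values.
    print(should_roll_out(0.82, 0.80, {"acceptance_rate": 0.03}))
```

Running the cheap offline check first means a model that has clearly regressed is never exposed to users in an A/B experiment.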

Anonymization

The scripts we train on are delinked from any personal information of the creator and are thus fully anonymized. As an example of this, you can look at the first Luau dataset that we made publicly available. We constantly update our data pipeline and training strategies with our creators' privacy in mind.

Data Retention

We recognize the importance of data lifecycle management. Our data retention policy is designed to store creator data only as long as it is beneficial for model improvement. We automatically purge data from our systems once the retention period has passed, ensuring that we do not retain any information longer than necessary. If you change your data sharing preferences so that your data is no longer shared to improve AI services, any downstream artifacts derived from shared data, such as models, are also periodically purged.
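As a rough illustration of an automated retention purge, consider the sketch below. The record layout, the field names `ingested_at` and `sharing_enabled`, and the 365-day `RETENTION` window are all hypothetical; the post does not state the actual retention period. Treating an opt-out as making a record immediately eligible for purge is also an assumption.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window; the actual period is not stated in the post.
RETENTION = timedelta(days=365)


def purge_expired(records: list[dict], now: datetime, retention: timedelta = RETENTION) -> list[dict]:
    """Return only the records that may still be kept.

    A record is kept if its creator still shares data and it is within the
    retention window; everything else is dropped.
    """
    return [
        r
        for r in records
        if r["sharing_enabled"] and now - r["ingested_at"] <= retention
    ]


if __name__ == "__main__":
    now = datetime(2024, 7, 1, tzinfo=timezone.utc)
    records = [
        {"id": "a", "ingested_at": now - timedelta(days=30), "sharing_enabled": True},
        {"id": "b", "ingested_at": now - timedelta(days=400), "sharing_enabled": True},
        {"id": "c", "ingested_at": now - timedelta(days=30), "sharing_enabled": False},
    ]
    print([r["id"] for r in purge_expired(records, now)])
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the purge deterministic and easy to verify.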

Future Asset Types

Before the data policy update, we only used creator-contributed Luau scripts to help power generative AI features such as Assistant and Code Assist. With this new update, we welcome a wide range of asset types that will fuel the development of multi-modal AI models for your benefit. Every asset type will have unique requirements, and as we add new asset types, we will provide an update on how we plan to process them.

Conclusion

At Roblox, we believe that responsible technology use can coexist with the utmost respect for user privacy. By adhering to strict data processing protocols and constantly refining our practices, we ensure that creator data contributes to the advancement of AI in a secure and ethical manner.

Are we able to decide if the AI can learn from our stuff? For instance, I wouldn’t like AI just taking all the code I spent months on, just for it to give it to someone who then proceeds to rip off my whole game.

What are these rare tokens? They sound exciting lol

We have just rolled out our generative AI data sharing preferences where you can decide which of your creations can be used for generative AI training.

On "rare tokens":

You can think of a token as a word of text.

Rare tokens are tokens that occur very infrequently across different experiences.
If a file contains a very rare token, it indicates that the content might be very specialized or contain private information.

We have multiple layers of guards against training on PII such as auth tokens, and this is one of them.

Ahh yes, I saw the post after I replied here. It is very nice! But by any chance, could you add an account-wide setting for whether the AI can use your experiences, so that I don't have to set it every time myself? I think it would be very nice, but since I can tell what you are going for, I doubt you’d implement it.

There is a way to set up data sharing settings for all your creations in Creator Hub, so you don’t have to set it every time yourself: https://create.roblox.com/settings/data-collection?tab=experiences

[Image: screenshot]

Stolen scripts too, right?
