Empower Luau creation: Public Luau Dataset initiative needs your help

[Update] December 5, 2023


Hi Creators,

As we announced at RDC 2023, Roblox is building a public Luau dataset for training AI models. Our goal is to ensure that existing and future AI tools understand Luau and Roblox development. AI tools that deeply understand Luau can help you to work more efficiently, thanks to a new wave of integrations, AI assistants, and tools.

To contribute your Luau code to the public Luau dataset, please visit: https://create.roblox.com/data-collection

Thanks to the community, we now have enough data to improve the performance of fine-tuned, code-focused large language models by 15%-20%. Our next goal is to grow the corpus by an order of magnitude so that it can contribute to models at their earliest training stages. Those stages require significantly more training data, and that data can substantially improve a model's ability to understand and generate Roblox Luau.

We are asking you to opt in so that anyone can build tools that boost your productivity. Please note that you can pick which experiences you want to contribute, and you can adjust your selection or opt out at any time. Data is collected anonymously, and we will filter out any sensitive data, such as your API keys.

We want to say a BIG thank you to those who have contributed their scripts. Your input has already significantly enriched the public dataset and enhanced our models. In fact, we think our internal code-generation models now exceed the quality of other leading LLMs when generating Roblox scripts. We will share our first version of the public Luau dataset with StarCoder and HuggingFace soon.

For more information on everything discussed and to opt in, please visit our webpage, and let us know if you have any questions.

Thank you!


FAQ

Share with Roblox

Why do we want scripts from our community for our AI models?

  • Roblox is building its own AI models to make AI-powered creation tools. Today, we use code from a subset of free, publicly available marketplace assets to train these models. Using additional data contributed by our community helps our AI offer you accurate and up-to-date suggestions.

How can I contribute to the Roblox AI models?

  • To help improve Roblox tools with AI, you can opt in and select the experiences you want to share.

    • You are in control: You can opt out at any time from your Account Settings. No data will be used for training models until one week after you opt in. After you opt out, we will not use your data for any new training beyond 30 days. However, we reserve the right to keep the existing models trained on your data active for up to 120 days.

    • From and for the community: We respect your privacy. Data is collected anonymously, and we will filter out any sensitive data, such as your API keys.

    • Owner-governed sharing: Group-owned experiences can only be shared by owners. For experiences that have multiple owners, sharing with Roblox or the wider public is only possible if all owners opt in.

Can I still use Roblox AI tools if I don’t opt in?

  • Everyone can use our AI products like Code Assist. Those who share their data with our AI models will get access to more comprehensive and performant models that include community contributions.

If I opt in to share my data, does this only include scripts? What about other types of data, like my models or images?

  • Opting in to share your data with Roblox and Luau currently only includes scripts. In the future, if we decide to add other data types, you will still have to opt in to sharing these types, and we won’t automatically include your data without your opt-in.

Contribute to the Luau Dataset

What does contributing to the Luau Dataset mean?

  • Roblox is building a public Luau Dataset for anyone training AI models.

    • Our goal is to make Luau a first-class programming language and ensure that existing and future AI tools understand Luau.

    • AI tool support for Luau: This means more accurate suggestions and more integrations when using third-party AI models, so creators like you can use the tool of your choice in your creation process.

How can I contribute to the public Luau dataset?

  • Much like contributing to Roblox, privacy and compliance are our top priorities. Experiences won’t be shared unless all owners consent.

    However, opting out of the Luau Dataset won’t be possible after 30 days, as the dataset will be publicly available. Please only contribute code from experiences that you are comfortable sharing. Only a small fraction of your contributed code, less than 30%, will be compiled and aggregated into an open source-available dataset for AI model training.

62 Likes


A way to get more people to opt in would be to give them something in return. For example, a free avatar item.

20 Likes

Does this mean that all code that is opted in will be manually reviewed by someone? I imagine it’s nontrivial to automatically remove such keys.

14 Likes

Are there any benefits for us in contributing code that we worked hard to create to Roblox’s datasets? I don’t give away my work for free to others. Why should I do it for Roblox?

12 Likes

A bribe to opt in would not sit right with me. There should be no reward for opting in other than knowing that your code is helping improve Roblox’s dataset.

36 Likes

Is there more information on how this works? How can developers be sure that their sensitive data is not accidentally used during training if they happen to store it in some unconventional manner?

Ensuring that all sensitive data is safely filtered out sounds like a bold promise to make. I think if developers were given some more details on this process, they would be more inclined to share their code for training.

11 Likes

I agree with this. This is why I think a free avatar item or some Robux would be nice. That way, more people would contribute.

5 Likes

Where will this public Luau dataset be available? I would like to train an AI model of my own using this dataset for uses that can help people getting into Roblox game development.

6 Likes

Who in their right mind would voluntarily send free code to Roblox or public AI datasets for no compensation at all? This is basically asking us to give more data for telemetry.

7 Likes

Dear Mauio, you are not doing it for Roblox; you are doing it for the community of Roblox Creators. The public Luau Dataset will provide data to anyone in the world who wants to train AI on Luau/Roblox experiences, and that will enable a new generation of AI tools for Roblox Creators. We are putting in the effort on our side to manage the data, clean it, and prepare it to be used for training by others who make tools for Roblox Creators. Does that make sense?

13 Likes

That is great to hear! And very exciting. We are making it available on Huggingface. We will provide a link very soon. Thanks!

11 Likes

It’s up to you whether you want to share your code or not. If you are not comfortable giving your code for free, don’t contribute.

5 Likes

I would like to know what specifically the cleaned data shared from my experiences would look like. This would increase my confidence that what’s being shared is reasonable and does not contain secrets.

7 Likes

For cleanup, we are using an ML model trained to detect PII in source code, and we run it with aggressive thresholds. If a file has any PII, we exclude the entire file.
Additionally, we have a second layer of filtering where we tokenize the Lua file, and if any token occurs in fewer than 5 opted-in experiences, we exclude the entire file. I hope this helps!
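
To make the two filtering layers described above more concrete, here is a rough Python sketch. It is only an illustration under stated assumptions, not Roblox's actual pipeline: the regex tokenizer, the `looks_like_pii` stub (an email check standing in for the ML model), the `filter_files` function, and the choice to count tokens only over files that passed the first layer are all hypothetical. The only detail taken from the post is the threshold of 5 opted-in experiences.

```python
import re
from collections import defaultdict

MIN_EXPERIENCES = 5  # threshold quoted in the post

# Assumption: a very rough tokenizer (identifiers, integers, single symbols).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
# Assumption: a trivial email pattern stands in for the ML-based PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def tokenize(source):
    """Split source text into coarse tokens."""
    return TOKEN_RE.findall(source)


def looks_like_pii(source):
    """Hypothetical placeholder for the PII model described in the post."""
    return EMAIL_RE.search(source) is not None


def filter_files(files):
    """files: list of (experience_id, source) pairs. Returns the pairs kept."""
    # Layer 1: drop the entire file if any PII is detected.
    clean = [(exp, src) for exp, src in files if not looks_like_pii(src)]

    # Layer 2: record which distinct experiences each token appears in.
    experiences_per_token = defaultdict(set)
    tokenized = []
    for exp, src in clean:
        tokens = set(tokenize(src))
        tokenized.append((exp, src, tokens))
        for token in tokens:
            experiences_per_token[token].add(exp)

    # Drop the entire file if any of its tokens is too rare.
    return [
        (exp, src)
        for exp, src, tokens in tokenized
        if all(len(experiences_per_token[t]) >= MIN_EXPERIENCES for t in tokens)
    ]
```

The idea behind the second layer is that a secret such as an API key is almost always unique to one experience, so its token fails the frequency check and the whole file is excluded, even if the PII model missed it.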

10 Likes

Strong agree. Maybe the ability to export what you’re able to contribute to this dataset?

6 Likes

We will share the link to the Luau public Dataset very soon. Thanks!

8 Likes

Definitely useful information that I’m sure others would appreciate too. Thanks for sharing!

2 Likes

I would not give out my code unless I was given something in return.

6 Likes

It seems like only public experiences can be used to contribute (or my group’s experiences aren’t showing up even though I’m the group owner).

I’m working on a few experiences that aren’t ready for public debut but they have scripts that I’m willing to contribute to AI training.

3 Likes