In-Engine Compression Algorithms & Binary Format Processing

Hexcede · May 6, 2021, 10:26am

As a Roblox developer, it is currently too hard to process certain formats or utilize certain algorithms, such as the rbxm format.

If Roblox is able to address this issue, it would improve my development experience because it would give us the benefits of compression algorithms, like LZ4, that we may otherwise never see ported.

Below is a very detailed explanation of just how big an impact native compression support and/or support for processing rbxms or other formats might have. I really don’t know if many will read it, but I wrote it regardless. It goes very thoroughly through a lot of use cases for compression and access to the rbxm format. Reading it all isn’t really necessary; I’m just passionate about it, being a Node.js dev and someone who likes Rojo a lot.

Summary: Rojo, node, and all kinds of modern web services, programming languages, and libraries utilize formats that are impossible to use on Roblox due to the resources needed to port some of them. Being able to use some common compression algorithms (and especially having native support for rbxms) would unlock enormous performance benefits compared to some of the most performant options out there on Roblox, and lots of new potential for devs at almost no cost to Roblox. The work has been done by those who wrote the libraries.

Compression

I think native support for compression is long overdue, if you want to interact with web services, or use most file types on the web, compression is an absolute must, and, that’s a severely limiting factor for people who aren’t doing stuff on Roblox.

Compression should be something in the engine, and it’s something that was touched on in the past by Roblox staff, but afaik it’s never been publicly concluded whether or not this could be put in the engine. I haven’t been able to find a place where this was explained.

The use cases for compression algorithms or the ability to load rbxm/rbxmx content directly are endless, and if there are security concerns, I think they need to be stated again, if they haven’t been, somewhere they’ll be easier to find.

`rbxm`s

With the sunsetting of :SaveInstance there will now be no more native way of saving and loading instances for real from developer code. What reason is there not to expose the ability to go from a string containing rbxm data to an Instance to developers? What reason is there not to at least expose LZ4 to developers so they can utilize the rbxm format with a Lua port?

While developers can technically do all of the instance data storage themselves already, concluding it at that contradicts the fact that there is a format for it in the first place. The rbxm format is by very definition designed to store instances. It’s faster than what developers will ever create when used natively, it’s more compact than anything else that’s available with all of the effort that’s been put into processing the rbxm format, and it’s portable.

There are also a lot of things you can’t do from lua code that access to the rbxm format could open up, for example, if developers enable loadstring on ServerScriptService they could be allowed to load code too. This is something currently impossible in the engine.

There are even more benefits to the rbxm format that as developers we just completely lack, especially around script stuff. Being able to patch scripts before the game loads is not something we can do, but it’s something necessary for some modern programming languages and libraries. Take TypeScript for example. A lot of npm modules too. A lot of things modify code through code; it’s a really useful thing to be able to do because it allows you to transcend the language.

Currently, it’s impossible to get 100% of the data 100% perfectly out of Roblox in a form that can be read and modified. Rojo is a perfect example of this. Rojo still has lots of issues and bugs around syncing, and there are so many things it just can’t sync. Rojo would be a perfect example of a tool that would benefit from being able to utilize the rbxm format in engine.

For me, I would use the rbxm format for performance reasons. I can’t feasibly not use the rbxm format because I need to create and store so many instances that even if I take all of the best compression algorithms under my tool belt and all of the most perfect, well optimized Lua code I can come up with, I don’t even come close to creating something as compact, as accurate, or as fast as what I could with the rbxm format. All of my efforts in the area are effectively useless, and I feel taunted by the format because it’s already in the engine, utilized by CoreScripts, and utilized by so many things, but I just can’t have access to it, and that’s somewhat of a frustrating thing.

`Zip` files

Being able to use the zip format would allow developers to import and export lots and lots of data from outside sources, either over the web, or just in general. Being able to use zip files would allow for me to simulate a file system in cases where I want to emulate, for example, vanilla lua because I am using a vanilla lua module that needs it. I have to resort to hacky solutions and add messy, pointless code that hard codes things and makes it hard to swap out.
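
To make the "simulated file system" idea concrete, here is a minimal sketch of treating a zip archive held in memory as a tiny read-only file tree. It is written in Python with the standard `zipfile` module purely for illustration, since no equivalent exists in the engine today; the helper names `build_archive`, `read_file`, and `list_files` are hypothetical.

```python
import io
import zipfile


def build_archive(files):
    """Pack a {path: bytes} mapping into zip bytes, entirely in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path, data in files.items():
            archive.writestr(path, data)
    return buffer.getvalue()


def read_file(archive_bytes, path):
    """Return the contents of one entry, like reading a file by path."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(path)


def list_files(archive_bytes):
    """Return every entry path in the archive, sorted."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return sorted(archive.namelist())


if __name__ == "__main__":
    blob = build_archive({"src/init.lua": b"return {}", "README.txt": b"hello"})
    print(list_files(blob))
    print(read_file(blob, "src/init.lua"))
```

A module that expects to open files by path could be pointed at something like `read_file` instead of a hard-coded table of sources.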

Some web services can be used to download or generate lots of data that would be useful in the engine, and, being able to use the zip format to access this data is currently not possible.

The zip format is designed to store lots of files in a compact way, and, I think it would be useful for Roblox, because it would allow for much, much larger, more complex games, and would open the door for many more new projects along the lines of Rojo.

`rbxm`s + `Zip`s

Being able to utilize both rbxms and zips would be the holy grail of Roblox features because it would allow for a perfect way to generate a massive amount of instance data from outside of Roblox, and then take all of those instances and performantly and quickly load them all in with other bits and pieces of data and metadata that could be included. Again, Rojo is a prime example of something that could benefit from this. Externally aided Roblox development could be expanded even further and Roblox could see so many new things.

Conclusion

Really all it takes is access to a few compression algorithms, and, developers will do the rest of the porting. The rbxm format is well known, and, lots of tools for it have already been made in JS, C#, etc. If Roblox doesn’t end up adding native support for rbxms, adding support for common compression algorithms would ultimately fill a lot of use cases.

Being able to process zips, rbxms, and all kinds of other formats that use compression would unlock such a wide variety of things that it’s strange to me that not much has been said on the topic of natively supporting compression for years.

Scarious · May 10, 2021, 5:42am

This is needed. I think @zeuxcg even mentioned something like this (source, it’s also his most liked reply).

Although we can port numerous compression algorithms (see my port of zlib/deflate), it is nowhere near as efficient as it would be in C++. Based on what zeuxcg said, I think they would just have to make these algorithms accessible, since they already exist?

In the meantime for purely datastore uses, I would recommend using 1waffle1’s text compression, since it’s pretty good.

Anything implemented by Roblox, especially for compression algorithms, will be way faster and just as efficient as, if not more efficient than, anything we can implement. For something that takes as long as text compression, along with how parallel Lua is coming out, this would be a fantastic addition, and a great example of what we could use with parallel Lua.
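
As a rough illustration of what compressing text for storage looks like when the algorithm is native, here is a minimal Python sketch using the standard `zlib` (deflate) module. It assumes the storage layer only accepts text, so the compressed bytes are base64-encoded; `pack_text` and `unpack_text` are hypothetical names, not an existing API.

```python
import base64
import zlib


def pack_text(text):
    """Deflate a string and return it as ASCII-safe base64 text."""
    compressed = zlib.compress(text.encode("utf-8"), 9)
    return base64.b64encode(compressed).decode("ascii")


def unpack_text(packed):
    """Reverse pack_text: base64-decode, inflate, and decode as UTF-8."""
    return zlib.decompress(base64.b64decode(packed)).decode("utf-8")


if __name__ == "__main__":
    sample = '{"item":"sword","count":1}' * 200
    packed = pack_text(sample)
    print(len(sample), len(packed))
    assert unpack_text(packed) == sample
```

Base64 adds roughly a third on top of the compressed size, so this only pays off for data that compresses well, such as repetitive JSON.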

Dekkonot · June 2, 2021, 2:48pm

The Roblox binary format is bad and inconsistent and I hate it. You don’t want to deal with it if you can avoid it.

Source: I wrote most of the spec file Rojo has on hand and implemented a significant portion of it in Rust and wrote my own implementation in Lua (you can find that here but it’s really only good for reading files, and isn’t designed to work in Roblox).

I’m not sure what benefit you think you’d get from being able to manipulate them in-engine but I almost guarantee it’s not a real one. Hypotheticals are nice but they don’t explain why you need it. If you need ways to serialize instances, there are more reasonable alternatives before you start using rbxm.

As for compression, sure, it would be nice to have, but it’s not exactly necessary for most people. Most people wouldn’t even get many benefits from something like LZ4.

As an aside, you should really only make one feature request per thread. No product manager is going to read this and pick apart your requests; they’ll just move on.

Hexcede · June 3, 2021, 5:05am

I get this feature request is a giant mess, but, you sound like you’re about to rip me limb from limb haha

Summarizing this feature request:
Expose code to us that’s already in the engine, and maybe give us a few extra compression algorithms.

Personally I have no issue with the RBXM format besides its complexity. It’s really just very overcomplicated and has way too many blocks and types of blocks and data types, etc. It’s just very over-engineered imo, so assuming you don’t already have code to process the format, it’s a pain to work around. What you’re saying is the exact reason why I think there should be a native API.

I don’t know of any alternatives to instance serialization for any of the cases where I want to store some kind of instance data. The only alternative I can really think of is implementing your own serialization. But that’s easier said than done. Can you serialize instances decently easily? Yes. Can you serialize them well without much trouble? Not at all; you need hours and hours of designing to get fast, space-efficient storage.

If there really are better alternatives to this, then there should be better alternatives for Roblox too, but, the thing is, the code is already there in the engine. There isn’t really much work to be done, other than expose existing code and lock it down with a few broad security features.

Why did you work with the RBXM format if it wasn’t necessary for you to work with it? You can’t deny that there are cases where it is genuinely necessary. If you want to work with a Roblox model, if you want to process assets from the website, if you want to do anything with something someone has saved or uploaded, there is simply no way to do so natively. This is really backwards to me, because the code is already there for it, sitting out of reach.

I don’t see why you are as opposed as you are to the idea of having a better, native way to work with the RBXM format considering how complex it is to work with in lua.

A lack of necessity should not dictate the importance of something. There are countless engine features which are not necessary but make something incredibly convenient that would otherwise take hours of work. LZ4 has no pure Lua ports other than one or two I’ve seen, and those can’t compress. What if I want to use LZ4, and not some other algorithm? Why should I, the developer, be responsible for spending hours learning how the LZ4 algorithm works so I can reimplement it when it’s already in the engine in a much faster form?

The feature request is a mess, I know. I had no idea how to consolidate everything at the time, so this feature request is pretty pointless and there’s no way it’s going to be looked at, but it’s not devoid of meaningful content.

The last thing I will bring up is, why would natively exposing any of this be harmful? I don’t feel like it’s a waste of resources. I don’t feel like it’d be exploitable for the most part. I feel like this would be incredibly useful to some people even if it’s not to everyone, it would open the doors for new things, and importantly it should cost Roblox hardly anything with the exception of security.

loadstring is a great analogy. loadstring loads Roblox’s luau format. Do we have a way to do this in lua? No. Do we have a way to do this anywhere else? Not really. Is it a potentially dangerous feature? Yes. Is it hard to implement? No. Does it have a lot of broad use cases that suit the everyday developer? Not at all! Do you still see games and projects use it? Yes, quite a few.

Exposing the RBXM format is quite similar to exposing luau code execution. Similar dangers exist. It’s quite complex. It doesn’t have good ports or lua code for it. And, importantly, it’s relatively easy to expose.

Dekkonot · June 3, 2021, 6:17am

Don’t tempt me!

Like I said, the problem with the binary model format is that it’s incredibly inconsistent. Some data types are interleaved. Others of the same shape aren’t. Some are transformed, others aren’t.
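
For readers unfamiliar with the terms, here is a minimal Python sketch of what "interleaved" and "transformed" can mean for an array of 32-bit integers: a zigzag transform that keeps small negative numbers small, followed by byte interleaving of the big-endian values. The exact rules the binary format applies per data type are not spelled out in this thread, so treat the specifics (big-endian, zigzag, 4-byte width) as assumptions drawn from community write-ups; all function names are hypothetical.

```python
import struct


def zigzag(n):
    """Map a signed 32-bit int to unsigned so small magnitudes stay small."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def unzigzag(z):
    """Inverse of zigzag."""
    return (z >> 1) ^ -(z & 1)


def interleave(values):
    """Write all first bytes, then all second bytes, and so on."""
    raw = [struct.pack(">I", v) for v in values]
    return bytes(raw[i][b] for b in range(4) for i in range(len(raw)))


def deinterleave(data):
    """Rebuild the 32-bit values from their byte columns."""
    count = len(data) // 4
    return [
        struct.unpack(">I", bytes(data[b * count + i] for b in range(4)))[0]
        for i in range(count)
    ]


if __name__ == "__main__":
    originals = [3, -2, 0, 100]
    encoded = interleave([zigzag(n) for n in originals])
    decoded = [unzigzag(z) for z in deinterleave(encoded)]
    assert decoded == originals
```

Grouping the high bytes together tends to produce long runs of zeros for small values, which a general-purpose compressor can then shrink more easily. A type of the same shape that skips either step needs its own code path, which is the inconsistency being described.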

It makes it a huge pain to work with even if you know what you’re doing.

I worked with the binary format because it was a necessity for Rojo; people were running into walls because of how inefficient the xml format was, and the binary format was a relatively easy fix for that issue.

Your argument seems to hinge on the idea that we should be asking “why not” when it comes to features but the opposite is true: why do we need these features?

There are legitimate use cases for serializing instances and compressing text… but you haven’t provided any. You’ve just indicated that since it will be useful to somebody, it should be added.

It’s not remotely equivalent to loadstring because that function exists in normal Lua 5.1; a similar function exists in most scripting languages, and removing it would be a breaking change.

Anaminus · June 3, 2021, 11:19am

You really shouldn’t be using the rbxm format for things like Data Stores, because it’s meant to serialize everything. It’s rare that everything is actually needed, or with perfect precision. Only the data should be stored, rather than the Instance-representation of the data. For example, a building game might only need to store bricks with position, rotation, and size, and each with reduced precision. A brick might be represented by a Part, but it would be wasteful to store data for the entire Part.
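
As a sketch of the building-game example, here is one way a brick could be packed into 12 bytes with reduced precision, written in Python with the standard `struct` module. The quantization steps (0.1 studs for position, 360/256 degrees for rotation, 0.5 studs for size) and the function names are assumptions chosen for illustration, not a recommendation for any particular game.

```python
import struct

POSITION_STEP = 0.1      # studs per unit, stored as signed 16-bit
ANGLE_STEP = 360 / 256   # degrees per unit, stored as unsigned 8-bit
SIZE_STEP = 0.5          # studs per unit, stored as unsigned 8-bit
BRICK_LAYOUT = "<3h3B3B"  # little-endian, no padding: 6 + 3 + 3 = 12 bytes


def encode_brick(position, rotation_degrees, size):
    """Quantize a brick's position, rotation, and size into 12 bytes."""
    pos = [round(c / POSITION_STEP) for c in position]
    rot = [round(a / ANGLE_STEP) % 256 for a in rotation_degrees]
    dims = [round(s / SIZE_STEP) for s in size]
    return struct.pack(BRICK_LAYOUT, *pos, *rot, *dims)


def decode_brick(data):
    """Recover approximate position, rotation, and size from 12 bytes."""
    fields = struct.unpack(BRICK_LAYOUT, data)
    position = tuple(v * POSITION_STEP for v in fields[0:3])
    rotation = tuple(v * ANGLE_STEP for v in fields[3:6])
    size = tuple(v * SIZE_STEP for v in fields[6:9])
    return position, rotation, size


if __name__ == "__main__":
    blob = encode_brick((10.0, 0.5, -20.0), (0, 90, 180), (4, 1, 2))
    print(len(blob), decode_brick(blob))
```

Three full-precision Vector3 values alone would take 36 bytes as 32-bit floats, before any of the other properties a Part carries.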

I have a module I’ve been experimenting with that is meant to address this issue. It uses tables to define the structure of binary data. With the inclusion of hooks and filters, it ended up becoming really powerful, capable of encoding and decoding instances directly. It’s also backed by a well-tested bit buffer module (not the fastest, because Roblox hasn’t optimized bit32 yet).

There’s a lot of work to be done beyond just exposing the code. Roblox would have to commit to supporting the format. That means a low-level specification. High-level documentation. Announcements and grace periods for substantial changes. Cooperation with 3rd-party implementations. By no longer being in control of the only official producer and consumer of this format, flexibility and agility are reduced significantly.

zeuxcg · July 10, 2021, 5:13pm

Uhh okay. The only reason why we don’t use interleaving or transforms on floats/ints in some cases is that it’s easier not to. I believe the initial format did this for all compound formats present at the time where it was beneficial size-wise, but for later additions we often hadn’t bothered. E.g. PhysicalProperties is basically always serialized as “default” so there’s no significant advantage to compressing the float data when present.

I designed and implemented the format in 2013 and 8 years later there’s only one big thing I would change:

I would include a way to have multiple PROP chunks for the same property (ditto for INST) with an id range. This allows encoding the files in a way that makes deserialization more cache efficient, something that wasn’t on my mind at the time.

To support this I would also merge strings used for property and type names into a separate chunk. There are reasons not to do it but there are benefits wrt compression and ease of parsing that outweigh it, and it’s especially necessary if the point above holds.

Everything else is just fine. Compression or deinterleaving could be more generic but honestly it’s not a huge deal - it would slightly simplify the implementation so if we were revising the format we’d do that, but there isn’t a significant incentive to do so by itself.

I can’t comment on the actual FR since there are too many things in one here. I would recommend separate requests with detailed use cases; for example, it’s surprising that compression is mentioned hand in hand with third party (web) services since our HttpService implementation supports gzip transparently.

zeuxcg · July 10, 2021, 5:34pm

Since I bumped the thread anyway…

Can you expand on that? We have a pretty efficient bit32 implementation - anything specific you feel is amiss? Would be great to know if we can improve further.

Dekkonot · July 13, 2021, 3:22am

To be clear, when I say “I hate it” I’m using the typical internet flair; I don’t really feel very strongly about it, the inconsistencies just made us wring our hands a few times while implementing it for Rojo.

Specifically things like Vector2 vs NumberRange and float vs double. I get that NumberRange and double aren’t exactly common (and they didn’t even exist in 2013) but it’s still upsetting from an implementor’s POV that data types that are identical memory-wise don’t get stored the same way. Ditto with types of floats.

We ended up not being able to make many assumptions, which made implementing a lot more tedious than it could’ve been.

It is overall a rather well designed format and even if the Why of something like byte interleaving or float rotations weren’t immediately obvious, they became obvious with a bit of thought. I just wish it was more universally applied.
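
As an illustration of the "float rotation" idea, here is a minimal Python sketch that moves the sign bit of a 32-bit float from the top to the bottom by rotating the bit pattern left by one. That this is exactly what the format does is an assumption based on community descriptions rather than an official spec, and the function names are hypothetical.

```python
import struct


def rotate_float(value):
    """Return the float's 32-bit pattern with the sign bit moved to the end."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return ((bits << 1) & 0xFFFFFFFF) | (bits >> 31)


def unrotate_float(rotated):
    """Inverse of rotate_float: move the sign bit back to the top."""
    bits = (rotated >> 1) | ((rotated & 1) << 31)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


if __name__ == "__main__":
    for sample in (1.0, -1.0, 0.5):
        rotated = rotate_float(sample)
        print(sample, hex(rotated))
        assert unrotate_float(rotated) == sample
```

After the rotation the exponent fills the whole first byte, so values of similar magnitude share that byte regardless of sign. Combined with byte interleaving, that gives a compressor longer repeated runs, which is presumably the "why" that becomes obvious with a bit of thought.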

Well, I also wish that there was a publicly available spec file so we didn’t have to reverse engineer new types as they came, but that’s starting to get off topic. It would also be nice to get warning before changes to the format, e.g. serializing a property but not any data for said property, but again, that’s off topic.

As an aside, a link in my above post was broken due to some changes in how we’re formatting the project; it should be fixed now. Sorry to anyone who clicked on it and expected great things.

Anaminus · July 13, 2021, 4:39am

To me, it’s not so much the format that’s bad, but the reverse-engineered spec. It turns out that there are cases where reverse-engineered data can be interpreted in multiple ways. For example, I came up with a different way to do byte interleaving that still meets the requirements while also being way more elegant. The spec, while correct, encourages a more complicated implementation in the way it is written.

Dekkonot · July 13, 2021, 6:25am

I’d be curious to know what you mean – feel free to open an issue on rbx_dom about it (we should probably not derail this thread).

zeuxcg · July 13, 2021, 3:41pm

Yeah - I see what you mean here. Interleaving and float reorganization could be more generic but the thought that it could be only came a few years after the initial design.

FWIW we resolved this in a private message and the issue wasn’t related to bit32 operations.
