Release Notes for 441

+= has been life changing.

do
    local z = 1 + (a or 256)
    freq[z] = freq[z] + 1
end
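
For comparison, here's the same counting snippet written with one of the new compound operators (just a sketch; a and freq are assumed to be defined elsewhere):

do
    local z = 1 + (a or 256)
    freq[z] += 1 -- compound assignment instead of freq[z] = freq[z] + 1
end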

I wrote a script to convert my entire codebase to the new compound operators. The script is under 100 lines, but it depends on a lot of other modules that handle tokenization/lexing (so it's operator-priority-safe), which makes it hard to share. It found 519 scripts and removed 22 KB of source.

Anyways, here’s a quick multicore bench:

With batches of 64 the performance is pretty much identical. I had decided to skip \0 because it’s what we’re trying to escape and usually has a high frequency in uncompressed data.

It actually concatenates in batches of 64 because it takes over 2x longer to add each character to an array #data long and concatenate it (even when the table is reused).

It’s interesting that a + 1 takes 0.95x the time 1 + a does (not controlling for everything else that’s being done). In most languages it’s just preference, but here we get slightly different instructions.
Perhaps someday we’ll get an instruction that takes the add out of a[b + 1] = c. The addition probably isn’t much compared to converting from a double though :upside_down_face:

I’m just hoping that attributes will support \0 so I don’t need to use this at all. BinaryStringValue also seems like it would be a good alternative if its value was exposed.

Ah, yes, +1 instead of 1+ is a good idea. We currently don’t automatically reorder this because if the right hand side has an __add metamethod that is implementing a non-commutative operation, the order becomes significant (although whether __add should be allowed to be non-commutative for numbers is an open question…)

It actually concatenates in batches of 64 because it takes over 2x longer to add each character to an array #data long and concatenate it (even when the table is reused).

Yeah, concats need to be batched for optimal performance. This can likely be faster for large sequences if implemented via table.create and table.concat. I’ve been thinking about a buffer data type that allows you to efficiently build up large strings (similar to StringBuilder in Java/C#), which would address this problem more cleanly (this data type is kinda necessary in some internal functions that might need to be optimized in the future).
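
A minimal sketch of what that looks like today with table.create and table.concat, assuming data is an array of byte values (the names here are illustrative):

local pieces = table.create(#data)    -- preallocate the array part up front
for i = 1, #data do
    pieces[i] = string.char(data[i])  -- one small string per element
end
local result = table.concat(pieces)   -- single join at the end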

But yeah, ideally we need to support binary data better in various places. The issue with DataStores is that they use JSON for actual data storage, which isn’t very friendly to binary data.

4 Likes

LJ (LuaJIT) solves the ordering problem, I think, by just having separate variants of the instructions: ADDVV, ADDVN, and ADDNV, for example. It’s a bit more bloat-y in terms of instruction set size, but it fixes the issue.

I know :wink: Instruction design is a bit of an art; it’s a judgment call as to whether a particular instruction helps noticeably. We usually prioritize based on the performance of code we see often, and it’s comparatively rare to see this being important (there are some other instructions we are likely to prioritize before these ones).

1 Like

That would be great to see! Some way to add string.pack’s result to a buffer without creating the string would be really useful too.

To build long data strings I usually add bytes to an array, then use string.char(unpack(array, i, j)) in batches of LUAI_MAXCSTACK - 3 (7997).
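
A minimal sketch of that batching loop, assuming array holds byte values 0-255 (the names are illustrative):

local BATCH = 7997 -- LUAI_MAXCSTACK - 3
local parts = {}
for i = 1, #array, BATCH do
    local j = math.min(i + BATCH - 1, #array)
    parts[#parts + 1] = string.char(unpack(array, i, j))
end
local data = table.concat(parts)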

Support for combining buffers would be nice; for my game’s save system I often write binary data to bufferB, then store bufferB preceded by its data length in bufferA, so I can potentially skip over that data without processing it when loading saves.
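
In the meantime, string.pack’s length-prefixed string format can approximate that layout. Here’s a rough sketch using plain strings in place of the hypothetical buffers (bufferA, bufferB, and recordPos are illustrative names; recordPos is wherever the length prefix starts inside bufferA):

-- Write: store bufferB preceded by its length as a 4-byte unsigned integer.
local record = string.pack("<s4", bufferB)
bufferA = bufferA .. record

-- Read: unpack returns the blob plus the position right after it...
local blob, nextPos = string.unpack("<s4", bufferA, recordPos)
-- ...or read only the 4-byte length and jump past the blob without touching it.
local len, dataPos = string.unpack("<I4", bufferA, recordPos)
local afterBlob = dataPos + len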

Would it make sense to be able to mutate a byte at a previously reserved position in a buffer? A lot of data can be expressed compactly as discrete bytes, but it’s possible to save a lot of space by bit packing across different objects. An easy way to do this without compromising on byte/string performance is to allocate one byte when writing the first bool, then finalize and set that byte once 8 bits have been added, before starting again. When deserializing, it just needs to get one byte when reading the first bool, then extract the bits one at a time before starting again (a reader sketch follows the writer code below):

local bytes = {}
local bytesLen = 0

local writeByte = function(v)--$inline
	bytesLen += 1
	bytes[bytesLen] = v
end

local bitLevel = 128
local bitValue = 0
local bitBytePosition = 0 -- Position in 'bytes' where 'bitValue' will be stored

local writeBool = function(v)--$inline
	if v then
		bitValue += bitLevel
	end
	
	if bitLevel == 128 then -- First bit in byte
		bytesLen += 1
		bytes[bytesLen] = 0 -- Allocate
		bitBytePosition = bytesLen -- We will set this once 8 bits are written, or when done serializing.
		
		bitLevel = 64 -- Next less-significant bit
	elseif bitLevel == 1 then -- Last bit in byte
		print("Byte", bitValue)
		bytes[bitBytePosition] = bitValue -- Set byte in buffer
		
		-- Reset for next byte
		bitLevel = 128
		bitValue = 0
	else
		bitLevel /= 2 -- Next less-significant bit
	end
end

do -- Write data
	local rng = Random.new()
	for i = 1, 256 do
		local v = rng:NextNumber() < 0.5
		print("Bit", i, v)
		writeBool(v)
	end
end

if bitLevel < 128 then
	bytes[bitBytePosition] = bitValue -- Don't want to forget to add this byte.
end

local data = string.char(unpack(bytes, 1, bytesLen))

print(string.format("Result: %q", data))
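
For completeness, a matching reader for the bit-packed bytes might look something like this (rough sketch; data is the string built above):

local pos = 0 -- current read position in 'data'
local bitLevel = 128
local bitValue = 0

local readBool = function()
	if bitLevel == 128 then -- First bit in byte: fetch the packed byte
		pos += 1
		bitValue = string.byte(data, pos)
	end

	local v = bitValue >= bitLevel
	if v then
		bitValue -= bitLevel
	end

	if bitLevel == 1 then -- Last bit in byte: start a new byte next time
		bitLevel = 128
	else
		bitLevel /= 2 -- Next less-significant bit
	end

	return v
end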

Oh, I’ve always just used a string. JSON is harder to mess up and is great for simple readable data, but binary data is really necessary for huge games to scale up and support huge user creations (like houses) and highly persistent worlds (like quests).

A quick question, as I’m curious: what was the primary reason for choosing JSON as the storage format? Personally I tend to write my own formats when I store large data, since it means I can optimize for space efficiency and computation time; JSON is really only useful to me when a custom format would be overkill or JSON is just plain easier.

I originally thought Roblox would reuse pieces of its existing formats, since I’ve been seeing that a lot. For example, I think the HTTP cache uses its own format that stores header info and can hold RBXM files. Obviously no escaping is necessary in the cache file (and I don’t think any is used), so that’s not a great example, but RBXM files are also decent at storing binary content. I’ve noticed an increasing amount of reuse of formats like RBXM for alternative kinds of data (I think it recently went as far as storing precompiled core scripts, which shows RBXM actually has a storage type for Luau bytecode too, which I thought was weird).

But yeah, I have a suspicion that if DataStores weren’t JSON they could be designed to store full bytes, special characters, and even entire RBXM files like the old Player:SaveInstance function (I miss him, he was so neat :cry:), and thus instances at a near 1:1 ratio (or less, but that’d take additional work), probably more computationally efficiently than JSON. So I’d think there’s likely an important technical reason JSON was chosen.

The compression algorithm I’ve been slowly working on (towards some enormous optimizations) suffers enormously, since there are a full 126 values I can’t use for DataStores (and I have to account for JSON escaping, which I’m too lazy to worry about yet, so that reduces my usable space even further). This even affects actual computation time, because I effectively double the amount of data past the decoding stage for compression; for decompression I likewise run into problems where it ends up affecting my pristine (iirc) 100x faster performance ratio.

Speaking of which, the new string.pack and string.unpack functions are going to enormously improve performance in my code, because the biggest problem I face is performantly converting 5-10 MB strings into bytes without exploding the game server.
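
As a rough illustration (a sketch; blob and CHUNK are placeholder names): instead of pulling one character at a time, a repeated "B" format extracts a whole chunk of bytes per string.unpack call:

local CHUNK = 4096 -- stays well under the unpack limit mentioned above
local fmt = string.rep("B", CHUNK)
local bytes = table.create(#blob)
local pos = 1
while pos + CHUNK - 1 <= #blob do
    local chunk = { string.unpack(fmt, blob, pos) }
    pos = table.remove(chunk) -- the last return value is the next read position
    table.move(chunk, 1, CHUNK, #bytes + 1, bytes)
end
for i = pos, #blob do -- tail that doesn't fill a whole chunk
    bytes[#bytes + 1] = string.byte(blob, i)
end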

1 Like

I think this is what bothers me most. The storage endpoint must already work with regular binary data. If SetAsync’s argument is a string, it could precede it with a single byte not used by JSON and send it on its way. Who wouldn’t want a more performant API that supports compact binary data for huge user creations, with a pretty nice increase in usable storage space? Not to mention reduced storage overhead, because it’s easier for devs to keep save data for highly persistent experiences compact.

1 Like

JSON is easy to use for users, human-readable without effort, easy to version, and highly portable. Low commitment and easy to recover if they need to switch to a different back-end solution.

Totally makes sense as the default format IMO. Most developers only store a small blob of player data where it brings a lot of value to store it in a human-readable way with non-strict shapes.

1 Like

I suppose it maybe makes sense on its own, but I’d personally think that something better than JSON would have been used in Roblox’s case. The benefits of JSON don’t really apply here, in my view, because we can never actually see or manipulate any of that JSON ourselves. I still really believe there is (or was) some technical reason, but who knows.

1 Like

I believe the reason why we use JSON as the DataStore format is because:

a) This is the transport format of choice for REST APIs, and we need to send the data through a web API
b) DynamoDB, which was (and is) hosting DataStore data, used to only support JSON well back in the day; I think they have options for binary storage now

So we didn’t pick this format specifically because of efficiency. We don’t use JSON in the engine and try to use binary formats in general when performance or memory is vital (see rbxm, shader packs, http cache, etc.), but here we were working with a system where JSON was a more natural fit, and data size issues only started surfacing way later. Worth noting: it was just announced at RDC that the DataStore limit is going to be 4 MB.

4 Likes

What is the difference between Automatic and Performance? I’m not seeing any difference and it isn’t explained anywhere.

So does render fidelity "Performance" force low-poly meshes regardless of a user's graphics settings, whereas "Automatic" now changes the mesh's level of detail based on the graphics settings?

2 Likes

Is there going to be anything soon that allows us to save Models without the need to write our own custom Model serialiser? (E.g. being able to pass the model into the :SetAsync value parameter directly.)
I’m referring to something very similar to the old “Data Persistence” :SaveInstance method.

When I first saw this on the Roadmap:
[Roadmap screenshot]
I was expecting that we’d finally get something like that, but now that I hear the DataStore limit is increasing significantly, I’m wondering if that is what was being referred to instead.

(If you cannot confirm I’ll put together a proper feature request for something like this as it would save a lot of development time not having to write and manage a custom model serialiser)

1 Like

The 5 people who will use these functions can now PACK ALL THE THINGS fwiw.

5 Likes

Here’s a quick primer for string.pack for people who think “how can this possibly be of any use to me”.

Let’s say you want to transmit the character state over the wire. A character has position (where they are), walk direction (where they are going), and health.

Position requires three numbers but they can be 32-bit floating point because that’s as much precision as Vector3 gives you anyhow.
Direction requires three numbers but they are from -1…1 and you don’t need to be as precise.
Health requires one number that’s 0-100.

Let’s pack this!

Create a format string that describes the message, with each component using a format specification per the Lua documentation (Lua 5.3 Reference Manual). Just ignore the alignment and endianness options and focus on the types:

local characterStateFormat = "fffbbbB"

How many bytes will the packed message take? Let’s ask string.packsize:

> print(string.packsize(characterStateFormat))
16

Nice, 16 bytes! That’s pretty compact, you can send a lot of these if you want to.

Then let’s pack the data! We’re going to store direction as a byte -127…127, and health as a 0…100 byte (using unsigned byte for clarity, no difference in this case):

local characterState = string.pack(characterStateFormat,
    posx, posy, posz, dirx * 127, diry * 127, dirz * 127, health)

For astute readers: yes, this is losing ~0.5 bits per directional component because of imprecise rounding; if you want, you can correct it with math.round, e.g. math.round(dirx * 127).

Then let’s unpack the data on the other side! (after sending the string across the wire through a remote event):

local posx, posy, posz, dirx, diry, dirz, health =
    string.unpack(characterStateFormat, characterState)
dirx /= 127
diry /= 127
dirz /= 127

Note that the direction you reconstruct will have a bit of error (usually not a big problem) and health will drop the fractional part, but considering we just used 4 bytes to store both of these, it’s not that big of a deal.

31 Likes

Is this compatible with data stores? I remember data stores used to have some problems storing some characters, not sure if that’s still the case.

2 Likes

You cannot currently store non-UTF-8 data in DataStores, so right now it isn’t - this only works in contexts where binary data is safe to use.

2 Likes

Alright, thanks! In that case, is saving binary data to data stores something that you guys would like to support in the future?

4 Likes
