Table encoding function, can it be improved?

Halalaluyafail3 · July 31, 2020, 4:02am

Recently the functions string.pack, string.unpack, and string.packsize were backported. I thought about using them to cut down on the size of saved data. Even if you can’t save binary data in datastores it could help when sending the information to clients (presumably sending a table across the network uses JSON?).

The data I am storing could be represented with the type

type T = number|boolean|string|{[number]:T}

This is what I used to convert the table to a string:

-- types:
-- 0 is table start
-- 1 is table end
-- 2 is boolean
-- 3 is string
-- 4 is 32 bit float
-- 5 is 64 bit float
-- >= 6 are for integers
local EncodingFunctions
local function EncodeBool(Value)
	return string.char(2,Value and 1 or 0)
end
local function EncodeString(Value)
	return "\x03"..string.pack("z",Value)
end
local function EncodeNumber(Value)
	if Value%1 == 0 then
		if Value < 0 then
			local NValue = -Value
			if NValue <= 0xFF then
				return "\x0E"..string.pack("I1",NValue)
			elseif NValue <= 0xFFFF then
				return "\x0F"..string.pack("I2",NValue)
			elseif NValue <= 0xFFFFFF then
				return "\x10"..string.pack("I3",NValue)
			elseif NValue <= 0xFFFFFFFF then
				return "\x11"..string.pack("I4",NValue)
			elseif NValue <= 0xFFFFFFFFFF then
				return "\x12"..string.pack("I5",NValue)
			elseif NValue <= 0xFFFFFFFFFFFF then
				return "\x13"..string.pack("I6",NValue)
			elseif NValue <= 0xFFFFFFFFFFFFF8 then
				return "\x14"..string.pack("I7",NValue)
			elseif NValue <= 0xFFFFFFFFFFFFF800 then
				return "\x15"..string.pack("I8",NValue)
			end
		else
			if Value <= 0xFF then
				return "\x06"..string.pack("I1",Value)
			elseif Value <= 0xFFFF then
				return "\x07"..string.pack("I2",Value)
			elseif Value <= 0xFFFFFF then
				return "\x08"..string.pack("I3",Value)
			elseif Value <= 0xFFFFFFFF then
				return "\x09"..string.pack("I4",Value)
			elseif Value <= 0xFFFFFFFFFF then
				return "\x0A"..string.pack("I5",Value)
			elseif Value <= 0xFFFFFFFFFFFF then
				return "\x0B"..string.pack("I6",Value)
			elseif Value <= 0xFFFFFFFFFFFFF8 then
				return "\x0C"..string.pack("I7",Value)
			elseif Value <= 0xFFFFFFFFFFFFF800 then
				return "\x0D"..string.pack("I8",Value)
			end
		end
	end
	local Result = string.pack("f",Value)
	if string.unpack("f",Result) == Value or Value ~= Value then
		return "\x04"..Result
	end
	return "\x05"..string.pack("d",Value)
end
local function EncodeTable(Value)
	local List = {"\x00"}
	for _,v in ipairs(Value) do
		table.insert(List,EncodingFunctions[type(v)](v))
	end
	table.insert(List,"\x01")
	return table.concat(List)
end
EncodingFunctions = {boolean=EncodeBool,number=EncodeNumber,string=EncodeString,table=EncodeTable}
local function Encode(Value)
	return EncodingFunctions[type(Value)](Value)
end
local function _Decode(Value,Position)
	local Type = string.byte(Value,Position)
	if Type == 0 then
		local Starting = Position
		local Table = {}
		Position += 1
		Type = string.byte(Value,Position)
		while Type ~= 1 do
			local Result,DataSize = _Decode(Value,Position)
			table.insert(Table,Result)
			Position += DataSize+1
			Type = string.byte(Value,Position)
		end
		return Table,Position-Starting
	elseif Type >= 6 then
		if Type >= 14 then
			local Length = Type-13
			return -string.unpack(string.format("I%i",Length),Value,Position+1),Length
		else
			local Length = Type-5
			return string.unpack(string.format("I%i",Length),Value,Position+1),Length
		end
	elseif Type == 4 then
		return string.unpack("f",Value,Position+1),4
	elseif Type == 5 then
		return string.unpack("d",Value,Position+1),8
	elseif Type == 3 then
		local String = string.unpack("z",Value,Position+1)
		return String,#String+1
	else
		return string.byte(Value,Position+1) ~= 0,1
	end
end
local function Decode(Value)
	return (_Decode(Value,1))
end

Is there a better way to encode a value of the type described above?
Currently the size of every element when encoding:

Type	Size
Boolean	2 bytes
Number	2 to 9 bytes
String	Length+2 bytes
Table	2 bytes
Strings are null terminated which means no nulls can be saved in them (which is fine for what I’m doing)
Numbers can vary in size, if the number is an integer it will try to save it as an integer instead of a float. If it can’t save it as an integer or the number is not an integer, it will first try to save as a float, and if that loses some precision it will save as a double.

I could use a format specific to the data that I’m using, which would probably be fine assuming I’m never saving the data (would have to keep a decoder for every version around).

ThanksRoBama · July 31, 2020, 9:32am

It looks like you’re not saving dictionaries in any way? If the indices of values in the decoded table don’t matter to you, then you could avoid most of the “type information” by having all bools come right after each other, then all numbers, then all strings, then all tables. That way, you don’t have to do

"bool0bool1bool0bool1bool0bool1", you could do

"(some bytes of type count) 6 bools010101"

It does for datastores, so probably?

Since keys, names, and scopes are strings, their length can be checked with string.len() . Data is also saved as a string in data stores, regardless of its initial type. The size of data can be checked with the JSONEncode() function that converts Lua data into a serialized JSON table.

Documentation - Roblox Creator Hub

Kinda.

Halalaluyafail3 · July 31, 2020, 5:20pm

The indices do matter, so this isn’t an option.

sircfenner · August 5, 2020, 8:18pm

This code is not easy to read! If there’s a good reason not to, for example, add empty lines in between functions, I’d love to hear it. I’d probably also split the encoder and decoder into separate files, or at least make a more obvious division between the logic.

As for the code itself, I noticed you’re discarding the second return value from calls to string.unpack, which is the position of the next byte to read. Looks like you could fairly easily make _Decode return (value, next position) instead of (value, data size).

In your current format, you have room for 256 different value type indicators, so you have plenty of room to give a unique value type to each of “true” and “false”. This way, they each only take up a single byte rather than two.

There are a lot of similar chunks of code / similar calls, generally in this format:

return string.unpack(fmt, Value, Position + 1), hardCodedLength
...
Position += DataSize ...

This strikes me as something that would be better handled more generically. I would probably represent the data being decoded as a stream: some very simple object holding (1) the raw data and (2) an index pointing to the next part of the data to be read from.

This would probably help tidy up the code and avoid having to pass around length / data size values in a repetitive way - reading and incrementing the index could instead be encapsulated in a function that operates on the stream.

To better explain it, here is a quick implementation to compare with yours:

encoding.rbxm (2.4 KB)

Halalaluyafail3 · August 6, 2020, 12:20am

Is there a point to adding empty lines? I think the distinction between encoding/decoding is obvious, the encoding functions have Encode in the name, and the decoding functions have Decode in the name.

Returning data size was what I thought of first, although either way could work.

Where I would use this currently only stores 1 boolean, so saving 1 byte there isn’t that important.

Some notes about your implementation:

All the strings I store are user inputted, and I make sure they only have characters in the range [20,7E], so I don’t need to handle for the case where the string has a null character.

I don’t use dictionary style tables, so there is no need to be able to encode them.

Length of arrays don’t need to be stored, a terminator can be used instead (saves 3 bytes per table).

32 bit floats aren’t enough for what I’m storing, some of the numbers I’m storing are in the range [0,1e14]. When 1e14 is packed as a 32 bit float and then unpacked it’s 376832 above 1e14.

sircfenner · August 6, 2020, 12:44am

Definitely! I had to reformat your code with a blank line between each function before I could get a sense of it. Most people can’t parse solid walls of text like that. Grouping code is also a really important factor in readability - forgoing all vertical whitespace means you can’t do that.

Your use case for this code sounds increasingly specific. Code review probably won’t be helpful at all unless you give all of the relevant details.

Either way, reducing booleans to the suggested format does encode them more compactly - even if only used once - which addresses a central question in your post.

The point of providing a sample implementation was to demonstrate a different way to handle reading the data back out.

I already know that you only need to support certain data types and want to store numbers differently! Obviously those are trivial changes to make