Lua LZW Compression

aemino · February 2, 2016, 12:40am

Hey there everyone! I recently wrote a Lua LZW compressor while working on a project.

Lempel-Ziv-Welch compression, otherwise known as LZW compression is a method of compressing data. The GIF file format uses LZW compression to minimize the file size.

On top of the LZW compression algorithm, I added my own layer to encode the otherwise numerical compressed data into characters, further compressing the data.

ModuleScript code

local LZW = {}

local encodeDict = {}
local decodeDict = {}

local numericEncodingChars = {}
 
do
    local c = 33
    for i = 0, 99 do
        if (c == string.byte("-")) then
            c = c + 1 -- skip "-", it is allocated as a lzw encoding delimiter
        end
       
        numericEncodingChars[i] = string.char(c)
       
        c = c + 1
    end
end

for i, c in pairs(numericEncodingChars) do
    encodeDict[i] = c
    decodeDict[c] = i
end
 
 
local function getdict(isEncode)
    local dict = {}
   
    local s = " !#$%&'\"()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
    local len = string.len(s)
   
    for i = 1, len do
        if isEncode then
            dict[string.sub(s, i, i)] = i      
        else
            dict[i] = string.sub(s, i, i)
        end
    end
   
    return dict, len
end
 
 
local function getEncodedDictCode(code)
    local encodedDictCode = {}
   
    local nums = ""
    for n in string.gmatch(tostring(code), "%d") do
        local temp = nums .. n
       
        if ((string.sub(temp, 1, 1) ~= "0") and encodeDict[tonumber(temp)]) then
            nums = temp
        else
            encodedDictCode[#encodedDictCode + 1] = encodeDict[tonumber(nums)]
            nums = n
        end
    end
    encodedDictCode[#encodedDictCode + 1] = encodeDict[tonumber(nums)]
   
    return table.concat(encodedDictCode)
end
 
local function encodeDictCodes(codes)
    local translated = {}
   
    for i, code in pairs(codes) do
        translated[i] = getEncodedDictCode(code)
    end
   
    return translated
end

local function decodeDictCodes(codes)
    local translated = {}
   
    for i, code in pairs(codes) do
        translated[i] = ""
       
        for c in string.gmatch(code, ".") do
            translated[i] = translated[i] .. decodeDict[c]
        end
       
        translated[i] = tonumber(translated[i])
    end

    return translated
end
 
 
function LZW:Compress(text, disableExtraEncoding)
    local s = ""
    local ch
   
    local data = text
   
    local dlen = string.len(data)
    local result = {}
   
    local dict, len = getdict(true)
    local temp
       
    for i = 1, dlen do
        ch = string.sub(data, i, i)
        temp = s .. ch
        if dict[temp] then
            s = temp
        else
            result[#result + 1] = dict[s]
            len = len + 1
            dict[temp] = len
            s = ch
        end
    end
   
    result[#result + 1] = dict[s]
   
    if (not disableExtraEncoding) then
        result = encodeDictCodes(result)
    end
   
    return table.concat(result, "-")
end
 
function LZW:Decompress(text, disableExtraEncoding)
    local dict, len = getdict(false)
   
    local entry
    local ch
    local prevCode, currCode
   
    local result = {}

    local data = {}
    for c in string.gmatch(text, '([^%-]+)') do
        data[#data + 1] = c
    end
   
    if (not disableExtraEncoding) then
        data = decodeDictCodes(data)
    end

    prevCode = data[1]
    result[#result + 1] = dict[prevCode]
   
    for i = 2, #data do
        currCode = data[i]
        entry = dict[currCode]
       
        if entry then
            ch = string.sub(entry, 1, 1)   
            result[#result + 1] = entry
        else   
            ch = string.sub(dict[prevCode], 1, 1)
            result[#result + 1] = dict[prevCode] .. ch
        end
   
        dict[#dict + 1] = dict[prevCode] .. ch
       
        prevCode = currCode
    end
   
    return table.concat(result)
end
 
 
return LZW

Pastebin URL

Short documentation:

LZW:Compress(string text [, bool disableExtraEncoding])

LZW:Decompress(string text [, bool disableExtraEncoding])

Here’s an example of how this module could be used.

I put a bunch of random free models into the workspace, made a table of all parts, their name, size, and position, and saved it as a JSON string.

The original text length was approximately 300,000 characters long.
The compressed version of this text was approximately 100,000 characters long.

Original text
Compressed text

Hopefully this is useful in some way.

I’d like to thank the authors of this MIT handout detailing how LZW works.

Enjoy.

Thelegender · February 2, 2016, 1:06am

Neat.

Maximum_ADHD · February 2, 2016, 1:44am

Ooooh this is handy.
I think @Maelstronomer needed something like this IIRC.

einsteinK · February 2, 2016, 12:21pm

I expected the encoded text to contain “Size” and stuff.
Oh, what was I disappointed.

I like very much, though.
I might actually use it, depending if it isn’t too resource intensive.

aemino · February 2, 2016, 3:05pm

I haven’t run a benchmark test on it, but it seems quite fast. The compression test I mentioned in the OP (300,000 characters) took what seemed like less than 2-3 seconds. I’ll run a real benchmark and post the results here as soon as I have the time.

EDIT: I have yet to run more tests, but I had the time to run a benchmark on the example I showed. Compressing the 300,000 characters took approximately 2228ms (2.228s) and decompression of the compressed 100,000 characters took approximately 683ms (0.683s).

1waffle1 · February 4, 2016, 7:00am

I liked this and happened to be playing with compression at the same time you posted this, so I made my own LZW compression script. I benchmarked mine by compressing your example text and then decompressing the text it generated, and it took 17.049 seconds to do that 100 times. The compressed text is also a little shorter than yours.

Instead of using - as a delimiter, I convert each dictionary index number into a high base (base 95) and mark the longest index, which is 3 characters long from your example text, and put spaces in front of everything that is shorter, which is okay because a space represents 0 in the base it’s in. It puts 3| at the beginning so it knows every group is 3 characters, and converts the string back into numbers in 3-character increments.

It prints the original length of the string, the length of the compressed string, the difference in those lengths and the percentage compressed on the first line, then the first 200 characters of the compressed string, followed by checking if the decompressed string is equal to the original string (and it is) and then how long it took (~0.171 seconds.)

Sorry about readability if anyone has trouble reading this.
edit: optimized a bit. compresses and decompresses the example text 100 times in ~14.5 seconds now.