I see that there is a lot of confusion here about integers and floats.
Integer
First of all, unlike decimal numbers, integers as stored by a computer are radix 2 (power of 2) numbers because each position in an integer can only be 0 or 1. For decimal numbers (radix 10), each position in a number can be 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. So if you have the number 10^6, that is equivalent to 1,000,000. For an integer, 2^16 is 65,536 and 2^32 is 4,294,967,296. These are the counts of distinct values for unsigned 16-bit and 32-bit numbers. For signed numbers, the maximum positive value is 2^(x-1) - 1 and the minimum is -2^(x-1), where x is the number of bits. So for a signed 32-bit integer, the value range is -2,147,483,648 to 2,147,483,647. The maximum positive value for any unsigned integer is 2^x - 1 because you still have to represent 0.
Note that the exponent (the number of bits) sets a hard limit on the maximum value that an integer can hold. Unpredictable results can occur if the limit is exceeded.
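To make the formulas above concrete, here is a minimal sketch that computes the ranges for a given bit width. The function names (unsignedMax, signedRange, wrapUnsigned) are made up for illustration. The wraparound helper only models one common outcome of exceeding the limit (keeping the low bits); what actually happens depends on the operation and the language.

```lua
-- Largest value an unsigned integer of the given width can hold: 2^bits - 1.
local function unsignedMax(bits)
    return 2 ^ bits - 1
end

-- Smallest and largest values of a signed (two's complement) integer:
-- -2^(bits-1) and 2^(bits-1) - 1.
local function signedRange(bits)
    return -(2 ^ (bits - 1)), 2 ^ (bits - 1) - 1
end

-- Models modular wraparound: only the low `bits` bits of the value survive.
-- This is an assumption about how the overflow behaves, not a guarantee.
local function wrapUnsigned(value, bits)
    return value % 2 ^ bits
end

local min32, max32 = signedRange(32)
local wrapped = wrapUnsigned(unsignedMax(16) + 1, 16) -- one past the 16-bit limit
```

With these definitions, min32 and max32 match the signed 32-bit range quoted above, and going one past the 16-bit maximum wraps back around to 0.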
As for the OP’s question, you can split and combine integers if you can guarantee that they will fit within 8, 16, or 32 bits. Roblox does not support 64-bit integers at this time. The way to do this is as follows:
-- Splits a 32-bit integer into two 16-bit integers.
local function split32to16(x)
    local low = bit32.band(0x0000FFFF, x)
    local high = bit32.band(0x0000FFFF, bit32.rshift(x, 16))
    return low, high
end
-- Combines two 16-bit integers into a 32-bit integer.
local function comb16to32(low, high)
    return bit32.bor(bit32.band(0x0000FFFF, low), bit32.band(0xFFFF0000, bit32.lshift(high, 16)))
end
-- Splits a 16-bit integer into two 8-bit integers.
local function split16to8(x)
    local low = bit32.band(0x000000FF, x)
    local high = bit32.band(0x000000FF, bit32.rshift(x, 8))
    return low, high
end
-- Combines two 8-bit integers into a single 16-bit integer.
local function comb8to16(low, high)
    return bit32.bor(bit32.band(0x000000FF, low), bit32.band(0x0000FF00, bit32.lshift(high, 8)))
end
Disclaimer: There are a few things that you need to keep in mind when using these.
- There might be some errors in these since I did this from memory. I wrote these routines and quite a few others some time ago in C/C++.
- If you try to combine numbers wider than what the function expects, the extra bits will be masked off, so you may get a number you weren’t expecting.
- No error checking is done.
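As a way to sanity-check the routines, here is a rough sketch of the same 32/16 split and combine done with plain arithmetic instead of bit32. The names split32to16Arith and comb16to32Arith are made up for this example. It assumes non-negative inputs and should give the same results as the bit32 versions for values that fit in 32 bits.

```lua
-- Arithmetic stand-in for the 32-to-16 split (hypothetical name).
local function split32to16Arith(x)
    local low = x % 0x10000                          -- same as masking with 0xFFFF
    local high = math.floor(x / 0x10000) % 0x10000   -- same as shifting right 16, then masking
    return low, high
end

-- Arithmetic stand-in for the 16-to-32 combine (hypothetical name).
local function comb16to32Arith(low, high)
    -- Each half is reduced to 16 bits first, which mirrors the masking.
    return (high % 0x10000) * 0x10000 + (low % 0x10000)
end

-- Round trip: splitting and recombining gives back the original value.
local low, high = split32to16Arith(0x12345678)
local whole = comb16to32Arith(low, high)

-- An oversized input silently loses its extra bits: 0x12345 is 17 bits wide.
local masked = comb16to32Arith(0x12345, 0)
```

Here low comes out as 0x5678, high as 0x1234, and whole as 0x12345678 again. The last line shows the masking caveat from the list above: 0x12345 goes in, but only 0x2345 comes out.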
Another thing to consider is endianness, or byte order. Although Lua insulates us from this, in other languages it can be a concern when dealing with CPUs that are not Intel/AMD/Cyrix (little endian). ARM CPUs (most, if not all, mobile devices) can be set to either byte order: 1234 (big endian) or 4321 (little endian). Other CPUs such as MIPS, SPARC, and IBM’s Z processor are traditionally big endian devices. Furthermore, byte order on the network is also big endian. Endianness is the order in which the bytes of a multi-byte integer are stored in memory with respect to increasing memory addresses. For instance, the 32-bit number 0x12345678 is stored as 0x12, 0x34, 0x56, 0x78 in memory on big endian machines. On little endian machines, it’s backwards: 0x78, 0x56, 0x34, 0x12. So make sure you get your byte order right.
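To illustrate the two byte orders, here is a small sketch that lays out a 32-bit value as a table of four bytes, where index 1 stands for the lowest memory address. The function names are made up for this example, and it only models the layout; it does not read real memory.

```lua
-- Big endian: most significant byte at the lowest address (index 1).
local function toBytesBigEndian(x)
    local bytes = {}
    for i = 4, 1, -1 do
        bytes[i] = x % 256        -- take the current lowest byte
        x = math.floor(x / 256)   -- move on to the next byte up
    end
    return bytes
end

-- Little endian: least significant byte at the lowest address (index 1).
local function toBytesLittleEndian(x)
    local bytes = {}
    for i = 1, 4 do
        bytes[i] = x % 256
        x = math.floor(x / 256)
    end
    return bytes
end

local big = toBytesBigEndian(0x12345678)       -- 0x12, 0x34, 0x56, 0x78
local little = toBytesLittleEndian(0x12345678) -- 0x78, 0x56, 0x34, 0x12
```

If you write bytes out in one order and read them back in the other, 0x12345678 turns into 0x78563412, which is the classic symptom of a byte order mix-up.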
Floating Point
Floating point numbers are specified by the IEEE-754 standard. It specifies the layout of floating point numbers in 16, 32, 64, 128, and 256 bit formats, also known as precision (someone did mention that). For all the formats, the basic layout is the same; only the widths of the fields differ.
- The sign bit. When it’s a 1, the number is negative.
- The exponent. The exponent is encoded using an offset (bias), so an exponent of 0 is not stored as 0 but as another value. For a double, it’s 0b01111111111 (0x3FF). The stored values 0 and 0x7FF have special meanings (zero and subnormal numbers, and infinity and NaN, respectively), which are covered in the double-precision article on Wikipedia.
- The mantissa or fraction. A leading 1 is always assumed for normal numbers and is not stored, so the first bit of the mantissa is worth 1/2, the second 1/4, the third 1/8, and on down the line for however many bits the mantissa has.
A word of warning though. Lua does not support direct manipulation of floating point types at the bit level. I have written code in C/C++ that does this for a big number library (numbers that are so big they do not fit into a native CPU register). It can get quite complicated depending on what you are trying to do.
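You can still look at the three fields indirectly by serializing a number to bytes and picking them apart with arithmetic. The sketch below assumes string.pack is available (it is in Lua 5.3 and later, and in Luau) and that numbers are 64-bit doubles. The name decodeDouble is made up for this example, and it only reads the fields, it does not modify the float.

```lua
-- Returns the sign bit, the 11-bit biased exponent, and the 52-bit fraction
-- of a double. Assumes string.pack exists and numbers are IEEE-754 doubles.
local function decodeDouble(x)
    -- ">d" packs a double with the most significant byte first.
    local bytes = { string.byte(string.pack(">d", x), 1, 8) }

    local sign = math.floor(bytes[1] / 128)   -- top bit of the first byte
    -- Exponent: low 7 bits of byte 1 followed by the high 4 bits of byte 2.
    local exponent = (bytes[1] % 128) * 16 + math.floor(bytes[2] / 16)
    -- Fraction: low 4 bits of byte 2 followed by bytes 3 through 8.
    local fraction = bytes[2] % 16
    for i = 3, 8 do
        fraction = fraction * 256 + bytes[i]
    end
    return sign, exponent, fraction
end

local sign, exponent, fraction = decodeDouble(1.0)
-- For a normal number the value can be rebuilt from the fields:
local rebuilt = (-1) ^ sign * (1 + fraction / 2 ^ 52) * 2 ^ (exponent - 1023)
```

For 1.0 this gives a sign of 0, a stored exponent of 0x3FF (so an actual exponent of 0), and a fraction of 0, which lines up with the layout described above.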
Another way you can shoot yourself in the foot with floats is comparison. It is not recommended to directly compare two floats using == or ~= (!= in C/C++). In fact, C/C++ compilers can warn you about this (for example, GCC with -Wfloat-equal). A common way to handle this is to check whether the difference is smaller than a small tolerance:
local x = 0.33298575
local y = 0.33298243
if math.abs(x - y) < 0.0000000001 then
    -- Do something
else
    -- Do something else
end
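If you do this in more than one place, it can be wrapped in a helper. Here is a minimal sketch; the name nearlyEqual and the default tolerance of 1e-9 are my own choices for illustration, not anything built into Lua. It scales the tolerance by the size of the inputs, which is a variation on the fixed tolerance used above, so that large numbers are not compared more strictly than small ones.

```lua
-- Returns true when a and b differ by no more than epsilon, scaled by the
-- larger magnitude of the two (but never scaled below 1).
local function nearlyEqual(a, b, epsilon)
    epsilon = epsilon or 1e-9 -- hypothetical default; tune it for your data
    local difference = math.abs(a - b)
    local scale = math.max(math.abs(a), math.abs(b), 1)
    return difference <= epsilon * scale
end

local sum = 0.1 + 0.2
local exact = (sum == 0.3)          -- false: the two doubles differ in the last bits
local close = nearlyEqual(sum, 0.3) -- true: they are within tolerance
```

The 0.1 + 0.2 case is the usual demonstration of why == is risky: the sum is not exactly the same double as 0.3, even though it prints almost identically.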
Hopefully this helps people.