I am making a word game like Word Bomb, and I am curious about how they store all the words. I have a file of 200,000+ English words, but I’m not sure what the best and most efficient way to implement it in the game is.
What do you want to achieve?
An array of more than 200,000 strings. I don’t mind it being split up, but the most common operation will be checking whether a word is within my list of 200,000+ words. I feel like checking against that many words over and over again would be laggy.
What is the issue?
I have a JSON file with 200,000+ words in it as an array, and I want to be able to import them into the game as an array.
Note that the file will not be updated very often, so for the most part I’ll be fetching it once on the server and won’t need to send any data back. I may even add more files with thousands more words, so I need something robust.
Is it better to simply store it in a script, or outside of Studio? If outside, what data store should I use so that it stays private and convenient?
What solutions have you tried so far?
I’m very new to this, and I’ve tried copy-pasting the words into a ModuleScript, which might theoretically work, though I haven’t properly tested it because my computer began to lag. I’ve heard about APIs, but I don’t know what’s best for this situation. I’ve tried using Google Drive and GitHub with a PRIVATE repo (I don’t want it public), but I can’t keep it consistent.
One thing you could do which would keep the solution in Studio is to use a datastore. Use the word as a key and set its value to true, and when you want to know if a word is valid or not, just call GetAsync with the word. I know it wouldn’t be the most efficient solution, but it’s less of a hassle than setting up a Python server with a database. There are games out there with over a million datastore rows; I think you’d be fine.
I am not sure if there are any drawbacks to this method, as I have not tested it.
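For illustration, a minimal sketch of that idea (server-side only; the store name "ValidWords" is just an example, and GetAsync can error, so it’s wrapped in pcall):

-- Sketch: one datastore key per word, with true as the value.
local DataStoreService = game:GetService("DataStoreService")
local wordStore = DataStoreService:GetDataStore("ValidWords") -- example store name

local function isValidWord(word)
    local ok, result = pcall(function()
        return wordStore:GetAsync(word)
    end)
    return ok and result == true
end

Keep in mind that each check is a web request counted against the datastore request budget, which is part of why this is less efficient than an in-memory table.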
You can store the words in a dictionary for checking, and store them again in an array for picking a random word.
Let’s say you have 200k words and every word is 10 characters; then 200,000 × 10 = 2,000,000 bytes, or about 2 MB, which isn’t much.
-- This table is used for checking whether a word is valid.
local WordsDictionary = {
    ["Hello"] = true,
    ["Bye"] = true,
    ["Lua"] = true,
    ["More Words"] = true
}

-- This one is used to get a random word.
local WordsArray = {
    "Hello",
    "Bye",
    "Lua",
    "More Words"
}

local function getRandomWord()
    return WordsArray[math.random(#WordsArray)]
end

local function isValidWord(word)
    return WordsDictionary[word] == true
end
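To avoid maintaining the same 200,000 words in two places, you could also keep just the array and build the dictionary from it once at load time. A minimal sketch, assuming WordsArray is already populated:

-- Build the lookup dictionary from the array once at startup,
-- so the word list only has to be written out a single time.
local WordsDictionary = {}
for _, word in ipairs(WordsArray) do
    WordsDictionary[word] = true
end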
Yes, it’s better to put the tables in two ModuleScripts, because the tables would be so long that they’d make the script hard to navigate.
Use ServerStorage or ServerScriptService if you don’t want the client to access the words, and ReplicatedStorage if you want both the server and the client to access them.
So ServerStorage/ServerScriptService would be better if you don’t want the client to have access to all the words.
That is the biggest drawback of this method, since adding new words to the table would take a really long time.
After a bit of research, according to the Roblox assistant, it’s faster to find a key in a dictionary than to find a value in an array, because dictionary lookups are implemented as hash table lookups (whatever that means lol), which take the same amount of time to find any word (a time complexity of O(1)).
For an array, however, finding a value means looping through the whole array, which can significantly increase the time it takes to check a word if it sits further down the array (a time complexity of O(n)).
This wouldn’t be a problem if the list were small, but I’m using thousands of words, so it all adds up.
If anyone can confirm or correct me on this, that would be appreciated.
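One rough way to check this yourself is a micro-benchmark. A sketch only: exact numbers will vary by machine and Luau version, and the generated "wordN" strings are placeholders for a real word list:

-- Compare a dictionary lookup against a linear array search.
local N = 200000
local array = table.create(N)
local dict = {}
for i = 1, N do
    local word = "word" .. i
    array[i] = word
    dict[word] = true
end

local target = "word" .. N -- worst case for the linear search

local t0 = os.clock()
for _ = 1, 1000 do
    local _ = dict[target]
end
print("dictionary:", os.clock() - t0)

local t1 = os.clock()
for _ = 1, 1000 do
    local _ = table.find(array, target)
end
print("table.find:", os.clock() - t1)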
I was saying that you should switch the data to a more compact serialization format. JSON is verbose compared to CSV or even a simple newline-separated format.
JSON: ["apple","banana","carrot"] → CSV: apple,banana,carrot.
Smaller file sizes mean faster parsing and loading times, especially at scale. If you’re just reading the words into an array or a set and don’t need nested structures (which is what JSON is designed for), using CSV or a newline-separated file could make loading more efficient and much simpler.
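For example, if the words arrive as one newline-separated string (however you end up loading the file; `raw` here is assumed to hold the full contents), splitting it into both structures is a single loop:

-- Sketch: turn a newline-separated word list into a lookup set and an array.
local WordsDictionary = {}
local WordsArray = {}
for word in string.gmatch(raw, "[^\r\n]+") do
    WordsArray[#WordsArray + 1] = word
    WordsDictionary[word] = true
end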
That’s exactly right. However, Luau has optimized linear searching via table.find to the point that it is nearly indistinguishable from the constant-time lookups of a dictionary. You really do need a large number of elements in the array before table.find starts falling behind. There are some additional optimizations you could make with a binary search, exploiting the lexicographical ordering of the words, but I can’t speak to how impactful that would be given the existing performance of table.find. The throughput OP is expecting is low, too.
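For reference, a binary search over a pre-sorted array might look like the sketch below; it assumes WordsArray was sorted once beforehand (e.g. with table.sort) and gives O(log n) lookups:

-- Binary search over a sorted array of words.
local function binarySearch(sortedWords, word)
    local low, high = 1, #sortedWords
    while low <= high do
        local mid = math.floor((low + high) / 2)
        local candidate = sortedWords[mid]
        if candidate == word then
            return true
        elseif candidate < word then
            low = mid + 1
        else
            high = mid - 1
        end
    end
    return false
end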
On another note, dictionaries are formally called “hash tables”. When you index an array, you target a value directly, so its retrieval time is constant (O(1)). A hash table works by using a hashing algorithm to associate keys with indices in an internal array, so when you access a key-value pair in a dictionary, that fixed hashing algorithm reproduces the index of the target value immediately, giving O(1) lookup. There are some universal nuances in implementing hash tables that can cause this lookup performance to degrade to O(n), but I won’t expand on that here; if you’d like to know more, research “hash table collisions”. Luau uses a combination of chaining and open addressing to mitigate the problem.