How and where to store large JSON files for fetching?

Hello people,

I am making a word game like Word Bomb, and I am curious about how games like that store all their words. I have a file of 200,000+ English words, but I’m not sure what the best and most efficient way to implement it into the game is.

  1. What do you want to achieve?

An array of more than 200,000 strings. I don’t mind it being split up, but the most common operation I’ll be doing is checking whether a word is within my list of 200,000+ words. I feel like repeatedly checking against that many words would be laggy.

  2. What is the issue?

I have a JSON file containing 200,000+ words as an array, and I want to be able to import them into the game as an array.

Note that the file will not be updated very often, so for the most part, I’ll be fetching it once per server and don’t need to send any data back. I may even add more files with thousands more words, so I need something robust.

Is it better to simply store it in a script, or outside of Studio? If outside, what data store or service should I use so that it stays private and convenient?

  3. What solutions have you tried so far?

I’m very new to this. I’ve tried copy-pasting the words into a ModuleScript, which might theoretically work, though I haven’t properly tested it because my computer began to lag. I’ve heard about APIs, but I don’t know what’s best for this situation. I’ve also tried using Google Drive and GitHub with a PRIVATE repo (I don’t want it public), but I can’t keep it consistent.

Thank you in advance.

2 Likes

One thing you could do which would keep the solution in Studio is to use a data store. Use each word as a key and set its value to true; when you want to know whether a word is valid, just GetAsync the word. I know it wouldn’t be the most efficient solution, but it’s less of a hassle than setting up a Python server with a database. There are games out there with a million+ data store rows, so I think you’d be fine.
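
A minimal sketch of what that could look like, assuming the words were saved beforehand with SetAsync(word, true) (the store name "WordList" here is made up):

local DataStoreService = game:GetService("DataStoreService")
local wordStore = DataStoreService:GetDataStore("WordList") -- hypothetical store name

-- Returns true only if the word exists as a key in the data store
local function isValidWord(word)
	local success, result = pcall(function()
		return wordStore:GetAsync(string.lower(word))
	end)
	return success and result == true
end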

I would just create a Cloudflare Worker that has an endpoint like this:

/isvalid/{word}

The Worker would then check whether the word exists inside a JSON file and return the result.
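
On the Roblox side, the game could query that endpoint with HttpService. A rough sketch, assuming the Worker responds with the plain text "true" or "false" (the URL is a placeholder, and HTTP requests must be enabled in Game Settings):

local HttpService = game:GetService("HttpService")

-- Placeholder URL for the hypothetical Worker
local BASE_URL = "https://words.example.workers.dev/isvalid/"

local function isValidWord(word)
	local success, response = pcall(function()
		return HttpService:GetAsync(BASE_URL .. HttpService:UrlEncode(word))
	end)
	return success and response == "true"
end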

2 Likes

I’m not sure if there are any drawbacks to this method, as I haven’t tested it.

You can store the words in a dictionary for membership checks, and store them again in an array for picking a random word.

Say you have 200k words and every word averages 10 characters: 200,000 × 10 = 2,000,000 characters, or about 2 megabytes, which isn’t much.

-- this table is used for checking
local WordsDictionary = {
	["Hello"] = true,
	["Bye"] = true,
	["Lua"] = true,
	["More Words"] = true
}

--this one is used to get a random word
local WordsArray = {
	"Hello",
	"Bye",
	"Lua",
	"More Words"
}

local function getRandomWord()
	return WordsArray[math.random(#WordsArray)]
end

local function isValidWord(word)
	return WordsDictionary[word] == true
end
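
One tweak to avoid maintaining both tables by hand (just a sketch, I haven’t tested it at 200k entries): keep only the array and build the dictionary from it once when the script loads.

-- Build the lookup dictionary from the array once at load time
local WordsDictionary = {}
for _, word in ipairs(WordsArray) do
	WordsDictionary[word] = true
end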

2 Likes

Thanks for the suggestion!

I’m trying this out, and after an hour of pasting and 10 minutes of saving, it does seem to work. I have a few other questions though:

  1. Would this be best in a ModuleScript?
  2. Should I split the list across multiple modules or scripts (for example, moduleA would have all A words and moduleB, all B words)?
  3. Where in the Explorer should the script go for security (ServerStorage, ReplicatedStorage, etc.)?

Thanks in advance

2 Likes

Could you store it as comma-separated values, or newline-separated values?

And upload it as a package.

1 Like

Yes, it’s better to put the tables in two ModuleScripts, because the tables would otherwise be so long that the script becomes hard to navigate.

Use ServerStorage or ServerScriptService if you don’t want the client to access the words, and ReplicatedStorage if you want both the server and the client to access them.

So ServerStorage or ServerScriptService would be better if you don’t want the client to have access to all the words; see the sketch below for one possible layout.
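
For example (the name "WordList" and the layout are just placeholders), a ModuleScript in ServerStorage could build and return both tables, and a server script would require it:

-- ModuleScript named "WordList" in ServerStorage
local WordList = {}

WordList.Array = {
	"Hello",
	"Bye",
	-- ...and the rest of the words
}

-- Build the lookup dictionary from the array
WordList.Dictionary = {}
for _, word in ipairs(WordList.Array) do
	WordList.Dictionary[word] = true
end

return WordList

-- Script in ServerScriptService
local ServerStorage = game:GetService("ServerStorage")
local WordList = require(ServerStorage.WordList)

print(WordList.Dictionary["Hello"]) -- true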

That is the biggest drawback of this method: adding new words to the table takes a really long time.

1 Like

After a bit of research, according to the Roblox assistant, it’s faster to find a key in a dictionary than to find a value in an array, because dictionary lookups are implemented as hash table lookups (whatever that means lol), which take the same amount of time to find any word (a time complexity of O(1)).

For an array, however, finding a value means looping through the array, which can significantly increase the time it takes to check a word if it sits further down the list (a time complexity of O(n)).

This wouldn’t be a problem if the list was small, but I’m using hundreds of thousands of words, so it all adds up.

If anyone can confirm or correct me on this, that would be appreciated.
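
One quick way to check this in Studio is a micro-benchmark along these lines (a rough sketch; exact numbers will vary by machine):

-- Build 200k fake words in both structures
local array = {}
local dictionary = {}
for i = 1, 200000 do
	local word = "word" .. i
	array[i] = word
	dictionary[word] = true
end

local target = "word199999" -- near the end: close to the worst case for the array
local hits = 0

local start = os.clock()
for _ = 1, 1000 do
	if dictionary[target] then hits += 1 end
end
print("dictionary:", os.clock() - start)

start = os.clock()
for _ = 1, 1000 do
	if table.find(array, target) then hits += 1 end
end
print("table.find:", os.clock() - start)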

1 Like

I was saying that you should switch the data to a more compact serialization format. JSON is more verbose than CSV or even a simple newline-separated format:
JSON: ["apple","banana","carrot"] → CSV: apple,banana,carrot

Smaller file sizes mean faster parsing and loading times, especially at scale. If you’re just reading the words into an array or a set and don’t need nested structures (which is what JSON is designed for), using CSV or a newline-separated file could make loading more efficient and much simpler.
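
Loading a newline-separated file is also trivial with string.split. A small sketch, with rawText standing in for the fetched file contents:

-- Stand-in for the real file contents
local rawText = "apple\nbanana\ncarrot"

local WordsArray = string.split(rawText, "\n")

local WordsDictionary = {}
for _, word in ipairs(WordsArray) do
	WordsDictionary[word] = true
end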

That’s exactly right. However, Luau has optimized linear searching via table.find to the point that, at moderate sizes, it’s nearly indistinguishable from the constant-time lookups of a dictionary. You really do need a large number of elements in the array before table.find starts falling behind. There are additional optimizations you could make with a binary search, exploiting the lexicographical (sorted) ordering of the words, but I can’t speak to how impactful that would be given the existing performance of table.find. The throughput OP is expecting is low, too.
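
For reference, a binary search over a pre-sorted word array could look something like this (my sketch; O(log n) comparisons, and the array must be sorted first):

-- Assumes `words` is sorted ascending, e.g. via table.sort(words)
local function binarySearch(words, target)
	local low, high = 1, #words
	while low <= high do
		local mid = math.floor((low + high) / 2)
		if words[mid] == target then
			return mid
		elseif words[mid] < target then
			low = mid + 1
		else
			high = mid - 1
		end
	end
	return nil
end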

On another note, dictionaries are formally called “hash tables”. When you index an array by position, you target a value directly, so its retrieval time is constant (O(1)). A hash table works by using a hashing algorithm to associate keys with indices in an internal array, so when you access a key-value pair in a dictionary, that fixed hashing algorithm reproduces the index of the target value immediately, also giving O(1) lookup.

There are some universal nuances in hash table implementations that can degrade this lookup performance to O(n), but I won’t expand on that here; if you’d like to know more, research “hash table collisions”. Luau uses a combination of chaining and open addressing to mitigate the problem.

“Hash uses a mix of chained scatter table with Brent’s variation.”
https://github.com/luau-lang/luau/blob/master/VM/src/ltable.cpp

TL;DR: The best-case lookup time is Ω(1), the worst case is O(n), and the average is Θ(1).

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.