I’ll try to explain how IR systems work.
I am currently also working on implementing this into Roblox because the issue in Roblox is that Roblox does not have this feature integrated nor do they even have many of the core tools to lookup data efficiently especially from datastores. You can’t build up custom search queries but that is fortunately further down the Roblox roadmap.
Search engines usually use something known as FTS “Full-Text Search” which is a complicated but heavily optimized IR system. I strongly advice reading all about it from other sources and papers.
Typically very primitive search engines use an ILIKE search, which in SQL looks like this:
SELECT ... WHERE some_column ILIKE '%some_string%'
This will match keys that contains the search term and returns a queryset containing all those results.
This method easily slows down if your database contains millions of indexes since you’re iterating over that data per key due to O(n) complexity.
This sadly isn’t possible out of the box when it comes to Roblox because the issue is that you can’t retrieve multiple datastore keys with 1 request. These search functions would constantly have to poll new data from the datastore and Roblox just doesn’t have the tools for that. You’d have to spam GetAsync over and over which would quickly halt the datastore. Even something like ListKeysAsync which uses prefix search is currently broken (I’ve sent Roblox a bug report about it since its pagination doesn’t function as intended and colbert has already sent a report mentioned the faulty documentation flaws).
For now, you just have to store your data in scripts or get them from external sources. If you do decide to store them within scripts I’d advice looking into how to compress that data. Storing large datasets of strings will eat up a lot of memory though.
If you absolutely have to do this I’d not use any datastore requests (again due to what I explained earlier) for this and rather write your data into a modulescript or more realistically just store the data in an external database and now you can just apply the filter querysets as the API parameters. If you stick to a pure Roblox implementation, generate a lookup string where each key is positioned in the sentence by their index. Now you can just apply regex and return the indexes where those matches were found and with these return the actual data per index from the modulescript.
How do more complicated IR systems work?
Primitive FTS implementations usually work like this:
-
You tokenize your input sentence:
"Big brown fox!" -> {"Big", "brown", "fox!"}
-
Remove common stop words such as: in, and, or, to… They occur in almost every sentences and would make the lookup slower.
-
Remove special characters and lowercase everything.
-
Apply a stemmer onto your tokens that is used in text normalization. It removes inflexional endings from words and stems them to their root form. I’ve recreated that algorithm onto Roblox: StringStem - Porter Stemmer Algorithm for IR systems - Resources / Community Resources - DevForum | Roblox The stemmer part is very important in this process and you can read all about it in the documentation.
-
You go through your ‘big-data’ and create a key for each word with its values being which document IDs they are found in. i.e:
["home"] = {38, 24, 50, 235 ...}
-
You can use the same process with the input search string and look up which words are found in which documents and apply a set intersection on them:
"Big brown fox!" -> ( "big" (23, 47, 35 ...), "brown" (59, 23, 45), "fox" (40, 23) )
results in document 23 being the one with highest matches found and is returned. This way you can also rank them and return other perhaps less relevant results (i’ll get to relevancy soon). -
Additionally you can perform an OR search instead of an AND search with unions of sets instead of getting the intersection of sets. This would return more results since you’re only comparing whether 1 of the tokens exist in the documents. Might not be good since with millions of documents you’d likely returns tens of thousands.
-
Term frequency. When you generate the keys in step 5. It’s good to store the frequency of those terms in each document for ranking. This way when we perform our search, those documents with higher term frequency get returned highest even if they share the same intersection count.
-
Inverse document frequency. Our scoring system is quite good now but there are some short comings with our document retrieval scoring system. Just because we have a lot of similar words in our document does not necessarily mean it’s the relevant data that we look for. The whole reason why we also remove stop words is for documents that contain a lot of the same common words don’t get returned because they affect our relevancy scoring. We can use log10 to find how often the term occurs across all documents by dividing how many documents we have with the term frequency.
-
Optionally, you can generate something known as search vectors in step 5, which are lexemes that look like this:
'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
Basically the number 1, 6, 10 in a refer to the position where that word was found in the document. -
Weights. These positions can further be expanded with weights like
A, B, C or D.
which are typically used to reflect document structure:'a':1A 'cat':5 'fat':2B,4C
Weights are typically marked as such:{D-weight = 0.1, C-weight = 0.2, B-weight = 0.4, A-weight = 1.0}.
Your ranking functions can use these to prioritize which documents to retrieve. So this form of search would search for the terms only in those specific areas / structures of documents.
This probably isn’t necessary unless you specifically have an abstract and a body section explicitly set per document but you can totally do this with specific words as well. -
TrigramSimilarity and TrigramDistance. This breaks up words into sets of max 3 letters which is used in the query to work out a similarity metric between the input sentence to each documents. I have a diffing algorithm that doesn’t work quite like TrigramSearch but can totally be used for this application which uses my very own invented similarity metric algorithm: StringDiff - String Similarity Metric Diffing Algorithm - Resources / Community Resources - DevForum | Roblox So now finally if you apply this similarity metric ratio between the returned results you’ll get a more desired document. I’d advice doing this process in the actual search functions due to possible term typos. Not an easy task by any means since diffing can become quite slow in larger datasets.
Again, I’d advice looking up actual papers on Full-Text search but the core tools like stemming and diffing already exist in Roblox thanks to my hard work.
Here’s a good reference video that explain this process better than I can:
Django Search - YouTube