Hi guys.
I apologize for the lack of updates. I’ve been extremely busy with other things, and on top of that I took on rewriting the site’s entire codebase, so work has been extremely slow and there isn’t much to show for it.
I’ve fixed one bug that displayed post counts (and usernames) incorrectly if a user had changed their username before. It was due to Roblox not updating post counts on old posts if a user changed their username, something I was not aware of when I initially constructed the archive.
Despite the very generous donations from many of you, and me putting 100% of those donations toward server costs, money ran out a couple months ago. Server costs are extremely high because of the large database and I have been paying (quite a lot) out of pocket recently.
The site is still running at the time I’m writing this, but without donations I cannot see it being kept up much longer. I would really like to keep the database somehow, because it is very tedious and even expensive to reconstruct from the raw files. I will continue paying out of pocket to keep the raw files because they are cheaper to store.
Downloads Available
The one update I have is that the raw files are once again publicly available (the snapshot downloads went down quite a while ago). This time they are in a public S3 bucket named “rbx-archive” located in the US West (Oregon) region. The bucket contains 154964 gzipped folders. The name of each gzipped folder describes which threads it contains. For example, the folder “100000229-1.gz~100001184-1.gz” contains all threads between page 1 of thread 100000229 and page 1 of thread 100001184 inclusive. Each folder typically contains up to 200 gzipped files that each represent one page of one thread. For example, the file “100000229-1.gz” has page 1 of the thread with the postID 100000229. Note that this does not allow you to find a post directly; you have to know which thread it belongs to and which page of the thread it is on. An index exists but is currently not available publicly.
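As a rough illustration of the naming scheme described above, here is a minimal Python sketch that parses a folder name and checks whether a given thread page could fall inside it. The function names are made up for this example, and it assumes pages are ordered by thread ID first and page number second, which the naming suggests but which has not been verified against the bucket.

```python
import re
from typing import NamedTuple, Tuple

# One page file name looks like "100000229-1.gz": thread ID, dash, page number.
_PAGE_RE = re.compile(r"^(\d+)-(\d+)\.gz$")


class PageRange(NamedTuple):
    first: Tuple[int, int]  # (thread ID, page) of the first page in the folder
    last: Tuple[int, int]   # (thread ID, page) of the last page in the folder


def parse_page_name(name: str) -> Tuple[int, int]:
    """Split a page file name such as "100000229-1.gz" into (thread ID, page)."""
    match = _PAGE_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected page file name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def parse_folder_name(key: str) -> PageRange:
    """Split a folder name such as "A-1.gz~B-1.gz" into its two endpoints."""
    first, sep, last = key.partition("~")
    if not sep:
        raise ValueError(f"unexpected folder name: {key!r}")
    return PageRange(parse_page_name(first), parse_page_name(last))


def folder_may_contain(key: str, thread_id: int, page: int = 1) -> bool:
    """Return True if (thread_id, page) lies within the folder's inclusive range.

    Assumption: pages are sorted by (thread ID, page number).
    """
    page_range = parse_folder_name(key)
    return page_range.first <= (thread_id, page) <= page_range.last
```

Being inside the range only means the page could be in that folder; a thread ID that never existed would still pass the check.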
You will need knowledge of Amazon S3 to extract these files. To download the entire archive you will likely need to do it programmatically. Please refer to Amazon’s documentation if you are interested in doing this.
The S3 bucket is “requester pays”, which means that the person downloading pays for the data transfer costs associated with the download. On a small scale this is nothing. If you want to download the entire 482.8 GB, however, it will cost somewhere around $50 (there may be a loophole using Lightsail; if you are very serious about downloading the archive and you know how, you can contact me for a cheaper method).
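For anyone who wants to try the programmatic route, here is a minimal Python sketch using the boto3 library. It is only one possible approach, not a tested recipe for this bucket: the helper names are made up for this example, and it assumes you have boto3 installed and your own AWS credentials configured, since requester-pays requests are billed to the account making them. The `us-west-2` region code corresponds to US West (Oregon).

```python
import os

BUCKET = "rbx-archive"
REGION = "us-west-2"  # US West (Oregon)

# Every request to a requester-pays bucket must acknowledge the charge.
REQUESTER_PAYS = {"RequestPayer": "requester"}


def make_client():
    """Create an S3 client; boto3 is imported lazily so the helpers below
    can be read and tested without it installed."""
    import boto3
    return boto3.client("s3", region_name=REGION)


def iter_keys(client, prefix=""):
    """Yield every object key in the bucket, optionally limited to a prefix."""
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=BUCKET, Prefix=prefix, **REQUESTER_PAYS)
    for page in pages:
        for obj in page.get("Contents", []):
            yield obj["Key"]


def download_key(client, key, dest_dir):
    """Download one object into dest_dir and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    # Replace any slashes so the key maps to a single local file name.
    dest = os.path.join(dest_dir, key.replace("/", "_"))
    client.download_file(BUCKET, key, dest, ExtraArgs=dict(REQUESTER_PAYS))
    return dest
```

A small trial run could look like `client = make_client()` followed by `download_key(client, next(iter_keys(client)), "raw")`, which fetches a single gzipped folder so you can check the cost and format before committing to the full download.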
Again, because the cost of hosting the files on S3 is not significant, I will continue to pay it for the time being. It is really the database, which allows for searching and immediate retrieval, that is very expensive.