We use MemoryStores for a lot of our game’s functionality. We have always been well under our assigned quota, until about 3 hours ago, when our quota dropped out of nowhere:
You can see that it drops extremely suddenly, then continues to fall slowly over the course of 3 hours until we run out of memory. Our quota had been completely stable before this:
We have never experienced anything like this before, and our quota has always been roughly double what we actually use. I’ve rapidly implemented additional cleanup for our memory usage, but because of the way we use MemoryStores it is very difficult to lower our usage any further.
This has caused several issues for our game, mostly for our ranked game mode, which uses MemoryStores to temporarily store match data, but also for our custom private server system.
The relatively large pieces of data (tables and the like) had 30-minute TTLs (since dropped to 15 minutes, which is slightly longer than the average lifetime of the servers they’re relevant to). Private server access codes (one per private server ID) were stored for 10 days at most, but should have been cleaned up properly before then.
I have added more cleanup code and checks to try and reduce memory usage further.
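Roughly, the write pattern looks like this (a simplified sketch; the store names, keys, and values are illustrative rather than copied from our code, with the TTLs described above passed as the expiration argument):

```lua
local MemoryStoreService = game:GetService("MemoryStoreService")

-- Illustrative store names
local matchData = MemoryStoreService:GetHashMap("MatchData")
local accessCodes = MemoryStoreService:GetHashMap("PrivateServerAccessCodes")

local MATCH_DATA_TTL = 15 * 60          -- seconds; reduced from the original 30 minutes
local ACCESS_CODE_TTL = 10 * 24 * 3600  -- the 10-day maximum mentioned above

-- Placeholder values for the sake of a runnable example
local matchId = "match-1234"
local matchTable = { map = "Arena", players = {}, startedAt = os.time() }
local privateServerId = "ps-5678"
local accessCode = "reserved-server-access-code"

-- The third argument is the expiration (TTL) in seconds
matchData:SetAsync(matchId, matchTable, MATCH_DATA_TTL)
accessCodes:SetAsync(privateServerId, accessCode, ACCESS_CODE_TTL)
```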
Can you share more about what happened around 11:30 AM Pacific Time today? Did you release an update? Did this update change your MemoryStore usage? If so, how?
It was a minor patch update (so no migrations) for an unrelated issue. No Memory Store usage changes.
Never mind, I did the timezone math wrong.
I released an update, still without migrations since it targets reserved servers, to add more RemoveAsync calls in places where MemoryStore data may have leaked beyond its intended lifetime, and to reduce the TTLs, mostly to try to remedy the situation. However, since this only affected a tiny number of servers, it should have had no large-scale impact.
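The added cleanup is roughly of this shape (a simplified sketch, not the actual diff; the store name and key tracking are illustrative):

```lua
local MemoryStoreService = game:GetService("MemoryStoreService")
local matchData = MemoryStoreService:GetHashMap("MatchData") -- illustrative name

-- Keys this server has written and is responsible for cleaning up
local ownedKeys = { "match-1234" }

game:BindToClose(function()
    for _, key in ipairs(ownedKeys) do
        -- pcall so one failed removal doesn't abort the rest of the cleanup
        local ok, err = pcall(function()
            matchData:RemoveAsync(key)
        end)
        if not ok then
            warn("Failed to remove MemoryStore key", key, err)
        end
    end
end)
```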
Following up from the other thread, can you share a bit more about your MemoryStore usage?
- Does your usage of MemoryStores scale with the number of players?
- Is your matchmaking algorithm designed in a way that MemoryStore usage scales with the number of servers instead of players?
- Do you have any aggressive caps on the number of players per server?
It appears that certain entries had very large TTLs, and these entries weren’t cleaned up (at least the majority of them weren’t). The memory quota has an 8-day lookback, so as your player count dropped from its high 8 days ago, the memory quota also dropped steadily to match it.
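To illustrate why a lookback produces a gradual decline rather than a single cliff, here is a rough sketch; the base and per-player constants are placeholders, not the actual quota formula:

```lua
local LOOKBACK_DAYS = 8
local BASE_QUOTA_KB = 64  -- placeholder constant
local KB_PER_PLAYER = 1   -- placeholder constant

-- dailyPeakPlayers[1] is today, [2] is yesterday, and so on
local function currentQuotaKB(dailyPeakPlayers)
    local peak = 0
    for day = 1, math.min(LOOKBACK_DAYS, #dailyPeakPlayers) do
        peak = math.max(peak, dailyPeakPlayers[day])
    end
    return BASE_QUOTA_KB + KB_PER_PLAYER * peak
end

-- A ~30k peak keeps the quota high until that day ages out of the 8-day
-- window; as it does, the quota steps down to match the recent player counts.
print(currentQuotaKB({ 5000, 6000, 5500, 7000, 8000, 9000, 12000, 30000 }))
```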
Your experience had ~30k players on 9/7. From 11:30 AM to 3 PM on 9/15, our telemetry shows that HashMap.Remove calls decreased to 0 requests per minute, while requests to HashMap.Set and HashMap.Update increased significantly.
I don’t believe the quota here is the issue. There was likely some code that was part of the 11:30 AM PT update on 9/15 that:
- Significantly increased requests to HashMap.Set and HashMap.Update
- These requests don’t scale with the number of players (maybe with the number of servers, peak players over the last X days, etc.)
Your MemoryStore usage was consuming 1.5M request units per minute, just for HashMap Set/Update requests. Can you look into how your algorithm scales?
Yes. Each matchmaking server has a fixed cap on the number of players that can play in that match.
In retrospect, this explains the quota drop a lot better (since our quota has always been very high), but based on my understanding of the documentation I would have expected a single sudden drop, not a gradual one, which is why I filed a bug report. If this is intended behavior, I think the documentation should say so.
After review with DevRel we found the following regarding your other points:
We did a review of the code and found a leak that kept a lot of keys with extremely long TTLs permanently in memory, because their TTLs were consistently being refreshed. We added cleanup code to fix this.
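Simplified, the leak was of this shape (names are illustrative, not our actual stores): every periodic write passed a fresh expiration, so keys that were touched regularly never aged out.

```lua
local MemoryStoreService = game:GetService("MemoryStoreService")
local accessCodes = MemoryStoreService:GetHashMap("PrivateServerAccessCodes") -- illustrative name

local TEN_DAYS = 10 * 24 * 3600

-- Leaky pattern: every refresh also resets the 10-day expiration,
-- so a regularly refreshed key effectively never expires.
local function refreshAccessCode(privateServerId, code)
    accessCodes:SetAsync(privateServerId, code, TEN_DAYS)
end

-- The added cleanup, in spirit: remove the key explicitly once the private
-- server it belongs to is gone, instead of waiting on the ever-resetting TTL.
local function cleanupAccessCode(privateServerId)
    pcall(function()
        accessCodes:RemoveAsync(privateServerId)
    end)
end
```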
Because of a retry-on-fail loop that didn’t have a breakout condition, all servers were repeatedly trying to write to their memory stores and failing. After flushing the stores and letting them write, the loop finished and the request rate dropped back to normal. Removes dropped because the servers that would have cleaned up that data never came into existence.
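The missing breakout condition amounts to something like this (a simplified sketch; the store name, retry count, and backoff are illustrative):

```lua
local MemoryStoreService = game:GetService("MemoryStoreService")
local matchData = MemoryStoreService:GetHashMap("MatchData") -- illustrative name

local MAX_ATTEMPTS = 5

local function setWithRetry(key, value, ttl)
    for attempt = 1, MAX_ATTEMPTS do
        local ok, err = pcall(function()
            matchData:SetAsync(key, value, ttl)
        end)
        if ok then
            return true
        end
        warn(("SetAsync failed (attempt %d/%d): %s"):format(attempt, MAX_ATTEMPTS, tostring(err)))
        task.wait(2 ^ attempt) -- back off before retrying
    end
    return false -- give up instead of hammering the store forever
end
```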
I know it’s probably not the most appropriate place to ask for these things, but the following would have been really good features to have, to remedy the situation more quickly:
- Show which stores are being written to and read from in the request log
- Show which stores are taking up memory on the usage graph
- The ability to flush MemoryStores from the dashboard instead of exclusively through Open Cloud
- Graph annotations that explain quota drops
Besides that, though, I think this can be closed as not a bug. It was just bad code rearing its ugly head rather than a problem with the platform.