We use MemoryStores for a lot of our game’s functionality. We have always been well under our assigned quota, until about 3 hours ago, when our quota dropped out of nowhere:
You can see that it drops extremely suddenly, then continues to fall slowly over the course of 3 hours until we run out of memory. Our quota had been completely stable before this:
We have never experienced anything like this before, and our quota has always been roughly double what we actually use. I’ve rapidly implemented additional cleanup for our memory usage, but because of the way we use MemoryStores it is very difficult to lower our usage any further.
This has caused several issues for our game, mostly for our ranked game mode, which uses MemoryStores to temporarily store match data, but also for our custom private server system.
The relatively large pieces of data (tables and the like) had 30-minute TTLs (since dropped to 15 minutes, which is slightly longer than the average lifetime of the servers they’re relevant to). Private server access codes (one per private server ID) were stored for 10 days at most, but should have been cleaned up properly before then.
I have added more cleanup code and checks to try and reduce memory usage further.
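Roughly, the write pattern looks like this (a simplified sketch; the store names, keys, and values are illustrative rather than copied from our code, with the TTLs described above passed as the expiration argument):

```lua
local MemoryStoreService = game:GetService("MemoryStoreService")

-- Illustrative store names
local matchData = MemoryStoreService:GetHashMap("MatchData")
local accessCodes = MemoryStoreService:GetHashMap("PrivateServerAccessCodes")

local MATCH_DATA_TTL = 15 * 60          -- seconds; reduced from the original 30 minutes
local ACCESS_CODE_TTL = 10 * 24 * 3600  -- the 10-day maximum mentioned above

-- Placeholder values for the sake of a runnable example
local matchId = "match-1234"
local matchTable = { map = "Arena", players = {}, startedAt = os.time() }
local privateServerId = "ps-5678"
local accessCode = "reserved-server-access-code"

-- The third argument is the expiration (TTL) in seconds
matchData:SetAsync(matchId, matchTable, MATCH_DATA_TTL)
accessCodes:SetAsync(privateServerId, accessCode, ACCESS_CODE_TTL)
```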
Can you share more about what happened around 11:30 AM Pacific Time today? Did you release an update? Did this update change your MemoryStore usage? If so, how?
It was a minor patch update (so no migrations) for an unrelated issue. No Memory Store usage changes.
Never mind, I did the timezone math wrong.
I released an update, still without migrations since it targets reserved servers, to add more RemoveAsync calls in places where MemoryStore data may have leaked beyond its intended lifetime, and to reduce the TTLs, mostly to try to remedy the situation. However, since this only affected a tiny number of servers, it should have had no large-scale impact.
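The added cleanup is roughly of this shape (a simplified sketch, not the actual diff; the store name and key tracking are illustrative):

```lua
local MemoryStoreService = game:GetService("MemoryStoreService")
local matchData = MemoryStoreService:GetHashMap("MatchData") -- illustrative name

-- Keys this server has written and is responsible for cleaning up
local ownedKeys = { "match-1234" }

game:BindToClose(function()
    for _, key in ipairs(ownedKeys) do
        -- pcall so one failed removal doesn't abort the rest of the cleanup
        local ok, err = pcall(function()
            matchData:RemoveAsync(key)
        end)
        if not ok then
            warn("Failed to remove MemoryStore key", key, err)
        end
    end
end)
```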
Following up from the other thread, can you share a bit more about your MemoryStore usage?
- Does your usage of MemoryStores scale with the number of players?
- Is your matchmaking algorithm designed in a way that MemoryStore usage scales with the number of servers instead of players?
- Do you have any aggressive caps on the number of players per server?
It appears that certain entries had very large TTLs, and these entries weren’t cleaned up (at least the majority of them weren’t). The memory quota has an 8-day lookback, so as your player count dropped from its high 8 days ago, the memory quota also dropped steadily to match it.
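To illustrate why a lookback produces a gradual decline rather than a single cliff, here is a rough sketch; the base and per-player constants are placeholders, not the actual quota formula:

```lua
local LOOKBACK_DAYS = 8
local BASE_QUOTA_KB = 64  -- placeholder constant
local KB_PER_PLAYER = 1   -- placeholder constant

-- dailyPeakPlayers[1] is today, [2] is yesterday, and so on
local function currentQuotaKB(dailyPeakPlayers)
    local peak = 0
    for day = 1, math.min(LOOKBACK_DAYS, #dailyPeakPlayers) do
        peak = math.max(peak, dailyPeakPlayers[day])
    end
    return BASE_QUOTA_KB + KB_PER_PLAYER * peak
end

-- A ~30k peak keeps the quota high until that day ages out of the 8-day
-- window; as it does, the quota steps down to match the recent player counts.
print(currentQuotaKB({ 5000, 6000, 5500, 7000, 8000, 9000, 12000, 30000 }))
```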
Your experience had ~30k players on 9/7. From 11:30 AM to 3 PM on 9/15, our telemetry shows that HashMap.Remove calls decreased to 0 requests per minute, while requests to HashMap.Set and HashMap.Update increased significantly.
I don’t believe the quota here is the issue. There was likely some code that was part of the 11:30 AM PT update on 9/15 that:
- Significantly increased requests to HashMap.Set and HashMap.Update
- These requests don’t scale with the number of players (maybe with the number of servers, peak players over the last X days, etc.)
Your MemoryStore usage was consuming 1.5M request units per minute, just for HashMap Set/Update requests. Can you look into how your algorithm scales?
Yes. Each matchmaking server has a fixed cap on the number of players that can play in that match.
In retrospect, this explains the quota drop a lot better (since our quota has always been very high), but based on my understanding of the documentation I would have expected a single sudden drop, not a gradual one, which is why I filed a bug report. If this is intended behavior, I think the documentation should say so.
After review with DevRel we found the following regarding your other points:
We did a review of the code and found a leak that kept a lot of keys with extremely long TTLs permanently in memory, because their TTLs were consistently being refreshed. We added cleanup code to fix this.
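Simplified, the leak was of this shape (names are illustrative, not our actual stores): every periodic write passed a fresh expiration, so keys that were touched regularly never aged out.

```lua
local MemoryStoreService = game:GetService("MemoryStoreService")
local accessCodes = MemoryStoreService:GetHashMap("PrivateServerAccessCodes") -- illustrative name

local TEN_DAYS = 10 * 24 * 3600

-- Leaky pattern: every refresh also resets the 10-day expiration,
-- so a regularly refreshed key effectively never expires.
local function refreshAccessCode(privateServerId, code)
    accessCodes:SetAsync(privateServerId, code, TEN_DAYS)
end

-- The added cleanup, in spirit: remove the key explicitly once the private
-- server it belongs to is gone, instead of waiting on the ever-resetting TTL.
local function cleanupAccessCode(privateServerId)
    pcall(function()
        accessCodes:RemoveAsync(privateServerId)
    end)
end
```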
Because of a retry-on-fail loop that didn’t have a breakout condition, all servers were repeatedly trying to write to their memory stores and failing. After flushing the stores and letting them write, the loop finished and the request rate dropped back to normal. Removes dropped because the servers that would have cleaned up that data never came into existence.
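The missing breakout condition amounts to something like this (a simplified sketch; the store name, retry count, and backoff are illustrative):

```lua
local MemoryStoreService = game:GetService("MemoryStoreService")
local matchData = MemoryStoreService:GetHashMap("MatchData") -- illustrative name

local MAX_ATTEMPTS = 5

local function setWithRetry(key, value, ttl)
    for attempt = 1, MAX_ATTEMPTS do
        local ok, err = pcall(function()
            matchData:SetAsync(key, value, ttl)
        end)
        if ok then
            return true
        end
        warn(("SetAsync failed (attempt %d/%d): %s"):format(attempt, MAX_ATTEMPTS, tostring(err)))
        task.wait(2 ^ attempt) -- back off before retrying
    end
    return false -- give up instead of hammering the store forever
end
```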
I know it’s probably not the most appropriate place to ask for these things, but the following would have been really good features to have, to remedy the situation more quickly:
- Show which stores are being written to and read from in the request log
- Show which stores are taking up memory on the usage graph
- The ability to flush MemoryStores from the dashboard instead of exclusively through Open Cloud
- Graph annotations that explain quota drops
Besides that, though, I think this can be closed as not a bug. It was just bad code rearing its ugly head rather than a problem with the platform.