MemoryStore InternalError failure rate has significantly increased after flushing

We’re unsure whether it’s directly correlated with flushing, but the events seem to line up. Through our own recent logging, we’re also seeing higher InternalError rates than Roblox’s analytics reports - this is very preliminary, since we only recently noticed this specifically and added additional logging for it.

Experience: [SSJG1] Dragon Soul | Anime Adventure 🐲 - Roblox

Thanks for the report; we’ll follow up when we have an update for you.

Seconding this. Memory stores are randomly throwing internal errors.

Hi there! Looking into this now, could you give me the following details please?

  • When exactly did you flush the data for your universe?
  • Any usage pattern changes after flushing? E.g. are you using any different APIs now?
  • Does this have any impact within your experience?

Are you still seeing these errors?

So far, it appears there were two spikes of what look like transient errors in your sorted maps (they have since subsided). I noticed that your overall memory usage dropped right around the new year. If you are migrating data or rehydrating your MemoryStore maps after flushing, more details on that would be useful for diagnosing further.

Thanks for cooperating - we appreciate you using the MemoryStores service and filing this bug report 🙂

  1. January 1st around 8:50PM CST
  2. We disabled the uses of MemoryStore that led to the need to flush - some leaderboards that were persisting longer than necessary. Otherwise, usage was the same as it had been historically before that feature was introduced.
  3. We thought it did, but so far there has not been any majorly noticeable impact.
  4. We stopped tracking these errors because there were too many. Roblox’s analytics is not picking all of them up properly.
  5. We did not migrate or rehydrate any data. Our primary use of MemoryStores is to provide a session lock on the player so other servers can’t save to their DataStore, akin to how ProfileService works.
  1. Sounds good, that lines up with our charts
  2. Makes sense
  3. Aligns with what I expected (I see no error logs in our telemetry for your experience for this service)
  4. Do you have any snippets from your telemetry? Would be interested in inspecting it in case something missed my eye
  5. I see, that also makes sense

It does appear that the InternalError rate is now back to normal levels after those two spikes. I will follow up internally on both spikes and why they may have happened. After the flush, when you were populating your leaderboards/session locks again, did you set the same expiration time for all of the entries that were inserted around then? That could lead to some contention, and I’d be happy to investigate this path further.

For session locks, yes. We have yet to restore the leaderboards portion and would prefer to implement it differently to avoid risking this problem again.

We wrapped our MemoryStore API calls in pcall, and they would fail simply with “Internal Error: Internal error” more frequently than we saw reported in Roblox’s analytics. This happened in small bursts: we would suddenly see multiple failures come in at once every 3-15 minutes, and it was consistent.

The API call was simply an UpdateAsync that stored a string representing which server the player was in, with an expiration time of around 60 seconds, re-applied every 45 seconds while the player was in the server.
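
For reference, a minimal Luau sketch of that pattern might look like the following. Only the pcall-wrapped UpdateAsync storing a server identifier, the ~60-second expiration, and the 45-second refresh cadence come from the description above; the map name, the choice of a sorted map, the key format, and the contention check inside the transform function are illustrative assumptions.

```lua
local MemoryStoreService = game:GetService("MemoryStoreService")
local Players = game:GetService("Players")

-- Hypothetical map name; the thread doesn't state which map type is used,
-- a sorted map is assumed here since the staff replies mention sorted maps.
local sessionLocks = MemoryStoreService:GetSortedMap("PlayerSessionLocks")

local LOCK_EXPIRATION = 60 -- seconds, per the description above
local REFRESH_INTERVAL = 45 -- seconds between re-applications

local function refreshSessionLock(player: Player)
	local key = "Player_" .. player.UserId -- hypothetical key format
	-- pcall-wrapped so an InternalError surfaces as a logged failure
	-- rather than a thrown error, matching the telemetry described above.
	local ok, err = pcall(function()
		sessionLocks:UpdateAsync(key, function(oldValue, oldSortKey)
			-- Only (re)claim the lock if it is free or already ours.
			if oldValue ~= nil and oldValue ~= game.JobId then
				return nil -- returning nil cancels the update
			end
			return game.JobId, oldSortKey
		end, LOCK_EXPIRATION)
	end)
	if not ok then
		warn(("Session lock refresh failed for %s: %s"):format(key, tostring(err)))
	end
end

-- Re-apply the lock every 45 seconds for every player in the server.
task.spawn(function()
	while true do
		for _, player in Players:GetPlayers() do
			refreshSessionLock(player)
		end
		task.wait(REFRESH_INTERVAL)
	end
end)
```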


Can you try adding a randomized offset on top of the 60s expiration time? For example, an expiration of 60s plus a random value between 1 and 5s. That should help with these ‘bursts’ of errors that you see. I will also continue to investigate on my end.
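
Concretely (a sketch reusing the hypothetical names from the snippet above), the suggestion amounts to jittering the expiration so that entries written at the same moment do not all expire, and get rewritten, at the same instant:

```lua
-- Spread expirations by a random 1-5 seconds so locks repopulated at the
-- same moment don't all expire and get rewritten together.
local BASE_EXPIRATION = 60

local function jitteredExpiration(): number
	return BASE_EXPIRATION + math.random(1, 5)
end

-- e.g. sessionLocks:UpdateAsync(key, transform, jitteredExpiration())
```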


Hey there, checking in to see if you’re still observing the issue. According to the telemetry on our end, your experience shouldn’t be seeing many Internal Errors from the MemoryStores API right now. If all looks good on your end, I can close out this ticket. (We are working on some general improvements to reduce the frequency of these errors, though nothing specific to this report.)

We’re not seeing any problems on our end, though we stopped closely monitoring InternalError rates ourselves a while ago. Unfortunately I’ve been sick for the past two weeks and never got a chance to follow up properly.


No worries and thanks for the reply. Get well soon 🙂

