MessagingService:PublishAsync yields forever

Reproduction Steps
I can’t reproduce it on demand, but I have proof that it happens. I increment a counter before pcall-ing PublishAsync and decrement it afterwards. I have a server where the counter has been stuck at 2 (the maximum number of requests my MessagingService wrapper is allowed to run simultaneously) for 15+ minutes, and probably longer based on reports I received from players a few hours ago.
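
For context, the counting pattern looks roughly like this (a simplified sketch, not my actual code; `activePublishes` and the `publish` helper are just illustrative names):

```lua
-- Minimal sketch of the counter pattern described above.
local MessagingService = game:GetService("MessagingService")

local activePublishes = 0

local function publish(topic, message)
	activePublishes += 1
	local ok, err = pcall(function()
		MessagingService:PublishAsync(topic, message)
	end)
	activePublishes -= 1
	if not ok then
		warn("PublishAsync failed:", err)
	end
	return ok
end
```

If PublishAsync yields forever inside the pcall, the decrement never runs, which is exactly what the stuck counter shows.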

Proof

MessageManager script:
[screenshot: top of the MessageManager script]
[screenshot: debug code that lets me see the values]
[screenshot: pcall code]

[screenshot: what I see in the affected server]
All of the other “reference jobs” are constantly updating and changing, so I know the GUI isn’t frozen. Players are also stuck in the lobby server, unable to join the game.


Expected Behavior
MessagingService:PublishAsync calls should eventually time out or error.

Actual Behavior
MessagingService:PublishAsync yields forever. Players are unable to join a server from the lobby because the server is waiting for old requests to go through before sending new ones. My MessagingService wrapper queues multiple requests meant for a specific server and sends them in one message so it can recover more elegantly from temporary service outages and deliver higher throughput. It’s robust, but it can’t recover from infinite yields, and people are unable to play the game.
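
A rough sketch of the batching idea (not my actual wrapper; the topic name and flush delay are made up for illustration):

```lua
-- Requests for the same target server are collected briefly,
-- then published together as one MessagingService message.
local MessagingService = game:GetService("MessagingService")

local pending = {} -- targetServerId -> array of queued request payloads

local function queueRequest(targetServerId, payload)
	local batch = pending[targetServerId]
	if not batch then
		batch = {}
		pending[targetServerId] = batch
		-- Flush this server's batch shortly after the first request arrives.
		task.delay(0.5, function()
			pending[targetServerId] = nil
			local ok, err = pcall(function()
				MessagingService:PublishAsync("Requests_" .. targetServerId, batch)
			end)
			if not ok then
				warn("Batch publish failed:", err)
			end
		end)
	end
	table.insert(batch, payload)
end
```

The problem is that the flush itself never returns when PublishAsync yields forever, so the whole pipeline for that server stalls.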

Workaround
Restarting servers is the only way to fix it.

I will increase the maximum number of simultaneous MessagingService requests in my code from 2 (I had it set to 2 for testing). This will decrease the likelihood of me noticing this bug again but isn’t a real fix (players will still get stuck waiting for their request, and eventually the server will run out of budget in my system). These methods should never yield for this long.

Issue Area: Engine
Issue Type: Other
Impact: High
Frequency: Very Rarely
Date First Experienced: 2022-03-03 20:03:00 (-08:00)


We’ve run into a similar issue on our game ER:LC with our private server queue system. For us, it happened randomly throughout the week and for periods of up to 30 minutes. Personally, if you’re able, I’d recommend using MemoryStoreService for your queue/teleporting uses. I’ve found it to be WAY more reliable and have not had an issue with it since our implementation of it a few weeks ago. I do hope the reliability of MessagingService will improve, but I understand stuff like that can take time, and MemoryStore is a great alternative since it’s built for this.
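
Something along these lines (a rough sketch, not our production code; the queue name, payload shape, and timings are just examples):

```lua
-- MemoryStoreQueue-based join queue as an alternative to MessagingService.
local MemoryStoreService = game:GetService("MemoryStoreService")
local joinQueue = MemoryStoreService:GetQueue("JoinQueue")

-- Lobby side: enqueue a join request (expires after 30 seconds if unread).
local function requestJoin(userId)
	local ok, err = pcall(function()
		joinQueue:AddAsync({ UserId = userId }, 30)
	end)
	if not ok then
		warn("AddAsync failed:", err)
	end
end

-- Target server side: poll the queue, reserve spots, then acknowledge the read.
task.spawn(function()
	while true do
		local ok, items, readId = pcall(function()
			return joinQueue:ReadAsync(10, false, 15)
		end)
		if ok and items and #items > 0 then
			-- ...reserve a spot for each request here...
			pcall(function()
				joinQueue:RemoveAsync(readId)
			end)
		end
		task.wait(1)
	end
end)
```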


I use MessagingService to reserve a spot for the player in a server while they teleport, to prevent over-filling, as well as to request to follow a friend into a specific server. The target server can respond with things like ServerFull, UserNotFound, UserNotSpawned, etc., or respond with success along with the server’s access code. I developed this a month before MemoryStoreService was released, so it uses ordered data stores for finding servers, but I plan to convert it to MemoryStoreService eventually. Servers keep track of other servers that they deem unresponsive (in case a server crashed) and will remove them from the OrderedDataStore after a while (and try to send an “I removed you” message to the dead server in case it’s still alive and needs to immediately re-add itself).
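
Roughly, the target server’s side of that request/response flow looks like this (heavily simplified; the topic names, payload shape, capacity check, and how the access code is stored are all placeholders, not my real code):

```lua
local MessagingService = game:GetService("MessagingService")
local Players = game:GetService("Players")

local accessCode = nil -- assumed to be set elsewhere to this server's access code

MessagingService:SubscribeAsync("JoinRequest_" .. game.JobId, function(message)
	local request = message.Data -- assumed shape: { TargetUserId = ..., ReplyTo = ... }
	local response

	-- Real code would also count spots already reserved for in-flight teleports.
	if #Players:GetPlayers() >= Players.MaxPlayers then
		response = { Status = "ServerFull" }
	else
		local target = Players:GetPlayerByUserId(request.TargetUserId)
		if not target then
			response = { Status = "UserNotFound" }
		elseif not target.Character then
			response = { Status = "UserNotSpawned" }
		else
			response = { Status = "Success", AccessCode = accessCode }
		end
	end

	pcall(function()
		MessagingService:PublishAsync("JoinResponse_" .. request.ReplyTo, response)
	end)
end)
```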

My MessagingService wrapper’s requests will auto-retry and are disconnectable (so old unneeded requests won’t build up in the case of a temporary outage.) Server management is complex and difficult to test, but I haven’t had many problems and the APIs have been generally reliable except for this infinite yield issue.
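
The retry/disconnect idea, heavily simplified (names and the 5-second backoff are placeholders, not my actual wrapper):

```lua
-- Each request retries until it succeeds or is disconnected, so stale
-- requests stop consuming budget during a temporary outage.
local MessagingService = game:GetService("MessagingService")

local function sendRequest(topic, payload)
	local connected = true

	task.spawn(function()
		while connected do
			local ok = pcall(function()
				MessagingService:PublishAsync(topic, payload)
			end)
			if ok then
				return
			end
			task.wait(5) -- back off before retrying
		end
	end)

	-- The returned handle lets the caller abandon a request it no longer needs.
	return {
		Disconnect = function()
			connected = false
		end,
	}
end
```

Retries only help when PublishAsync errors, though; a call that never returns defeats this entirely.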


Ah, I got ya. Yeah, for MemoryStore I used both the map and the queue parts to coordinate joining a server, cancelling your own spot in the queue, and so forth. I feel you’ll be able to do the same to get the outcome you want, with the added reliability that, in my experience, MemoryStore offers.


Huge thanks for the report. This was a one-time server outage. Going to close out the thread.
