Nightmare on Actor Subtree Shutdown

Limor Stotland
skai engineering blog
May 1, 2019

Back when we were young, pretty, and very new to both Scala and Akka, we somehow managed to cause a NullPointerException (NPE) in what should have been a simple Actor Shutdown use case. Not one of our finest moments. While investigating the issue we got to know a bit more about the Akka infrastructure. This is the story of how we got there, and how we got out of there.

The setup is simple. We have a trait called SuicideActor and, as its name implies, it’s an actor that runs a block of code and then kills itself. One of our SuicideActors creates another SuicideActor as its child.

https://gist.github.com/LimorStotland/cb7dd8f495444a34607c10732b28250e

The NPE was thrown as the result of a bad combination of Actors and Futures:

The SuicideActor’s block was wrapped in a Future, inside of which we called the context stop self.

That led to a situation where the Future created by the child Actor was still running after the child was already dead. It died of not-so-natural causes: the similar Future created by its parent had already completed and had therefore already called context stop self, and child Actors are stopped when their parent is stopped. This meant that the child’s context was null. The first rule of Actors and Futures: never touch an Actor’s internal state from inside a Future.
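One common way to respect that rule is to let the Future report back with an ordinary message, so the stop call runs on the Actor's own thread. The following is a minimal sketch under assumptions, not the code from the gist: SafeSuicideActor and BlockDone are hypothetical names, and the block is passed in as a plain function.

```scala
import akka.actor.{Actor, ActorSystem, Props, Status}
import akka.pattern.pipe
import scala.concurrent.Future

// Hypothetical completion signal; not part of the original gist.
case object BlockDone

// Hypothetical variant of a SuicideActor that never touches `context`
// from inside the Future.
class SafeSuicideActor(block: () => Unit) extends Actor {
  import context.dispatcher // resolved here, on the actor's own thread

  // Run the block asynchronously, then deliver the outcome as a message.
  // pipeTo sends BlockDone on success and Status.Failure on an exception.
  Future { block(); BlockDone }.pipeTo(self)

  def receive: Receive = {
    case BlockDone         => context.stop(self) // safe: actor thread
    case Status.Failure(_) => context.stop(self) // the block threw
  }
}
```

If the Actor has already been stopped by the time the Future completes, the piped message simply goes to dead letters instead of dereferencing a cleared context.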

First Attempt: PoisonPill

Our first attempt to solve this issue was to replace context stop self with a PoisonPill. The PoisonPill is considered more graceful than context stop self because it is just another message added to the Actor’s queue, so all messages enqueued before it are guaranteed to be processed before the Actor shuts down.
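The ordering guarantee can be illustrated with a small sketch. Recorder and PoisonPillOrdering are hypothetical names introduced only for this example; the sketch assumes classic Akka actors and a single sender.

```scala
import akka.actor.{Actor, ActorSystem, PoisonPill, Props}
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical helper actor that records, in order, what it processes.
class Recorder(log: ConcurrentLinkedQueue[String]) extends Actor {
  def receive: Receive = { case msg: String => log.add(msg) }
  override def postStop(): Unit = log.add("stopped")
}

object PoisonPillOrdering {
  def run(system: ActorSystem, log: ConcurrentLinkedQueue[String]): Unit = {
    val recorder = system.actorOf(Props(new Recorder(log)))
    recorder ! "first"
    recorder ! "second"
    recorder ! PoisonPill // queued behind "first" and "second"
    recorder ! "late"     // sent after the pill, so it is not processed
  }
}
```

Once the actor has stopped, the log holds "first", "second", and "stopped", in that order. The guarantee covers only the actor that receives the pill; it says nothing about the mailboxes of its children.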

This solution did solve the NPE (yay!). Alas, sometimes the child Actor did not run its block (d’oh!). Not the side-effect we were expecting…

After digging a bit deeper into Akka code, we learned that when an actor receives a PoisonPill message, it actually calls self.stop(). So while the parent gets to process all of its pending messages, its children might not — they are brutally murdered with context stop.

We couldn’t override the behavior of the PoisonPill in the receive method because PoisonPill is an AutoReceivedMessage, and an AutoReceivedMessage never reaches the Actor’s receive method. Instead, it is processed by an autoReceiveMessage(msg: Envelope) method implemented in Akka’s internal ActorCell, which backs every Actor.

Second Attempt: Custom Messages to Children

After almost taking a poison pill ourselves, we realized this “graceful cascading shutdown” requires some custom implementation. The new mechanism we came up with included 2 new messages:

  1. case class PleaseKillYourself() — sent from parent to children, instructing them to kill themselves
  2. case class IKilledMyself() — sent from a child to its parent letting it know it killed itself

The flow here would be:

  • Parent sends PleaseKillYourself to children
  • Children kill themselves gracefully (with this same flow) and reply with an IKilledMyself message
  • Parent kills itself only once all children replied with IKilledMyself

(You can only imagine how lovely the office conversations about this sounded).

The challenge here is to implement this once, in a trait that can be easily extended by any Actor. We didn’t want every actor to implement the handling of these 2 messages in its receive method, so we needed to find the right hook to process those messages outside of the receive method implemented by each Actor. We decided to override the unhandled method of the Akka actor. By default, unhandled publishes an UnhandledMessage event to the ActorSystem’s event stream for every message it captures (except for a Terminated message, for which it throws a DeathPactException). The event eventually triggers a push to the dead letter queue.

In our implementation, the unhandled method handles the PleaseKillYourself message before falling back to the default implementation. Once an Actor receives a PleaseKillYourself message, it would send a PleaseKillYourself message to all of its children, and then wait for the IKilledMyself messages using the become pattern. The Actor would thus ignore any other message from that point on (as it should!).
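The mechanism described above could be sketched roughly as follows. This is an illustration under assumptions, not the original trait: GracefulSuicide, waitingFor, and die are hypothetical names, and the sketch assumes no child stops on its own before replying (otherwise the parent would wait forever).

```scala
import akka.actor.{Actor, ActorRef}

case class PleaseKillYourself()
case class IKilledMyself()

// Hypothetical mixin: handles the shutdown protocol in `unhandled`,
// so concrete actors do not need to mention it in `receive`.
trait GracefulSuicide extends Actor {

  override def unhandled(message: Any): Unit = message match {
    case PleaseKillYourself() =>
      val pending = context.children.toSet
      if (pending.isEmpty) die()
      else {
        pending.foreach(_ ! PleaseKillYourself())
        context.become(waitingFor(pending)) // ignore everything else from now on
      }
    case other => super.unhandled(other) // default: publish UnhandledMessage
  }

  // Behavior while waiting for every child to confirm its own shutdown.
  private def waitingFor(pending: Set[ActorRef]): Receive = {
    case IKilledMyself() =>
      val remaining = pending - sender()
      if (remaining.isEmpty) die() else context.become(waitingFor(remaining))
    case _ => // shutting down: drop any other message
  }

  // Tell the parent we are done, then stop.
  private def die(): Unit = {
    context.parent ! IKilledMyself()
    context.stop(self)
  }
}
```

Note that this hook only fires when the concrete Actor's receive does not match the message, so an Actor with a catch-all case in receive would silently swallow PleaseKillYourself.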

This solution worked, but it was messy: overriding base methods in your infrastructure might expose you to bugs and failures that the authors of the library didn’t expect.

Third Attempt: The Reaper Pattern

After a lot of research, we finally discovered the Reaper pattern. The Reaper pattern solves the problem of shutting down the ActorSystem only once all Actors have finished handling their messages (AKA graceful shutdown for the ActorSystem). In this pattern we create a Reaper directly under the user Guardian and, as its name implies, it “reaps” the souls of other Actors by watching them. Once the reaping is finished, it signals the user Guardian that it’s safe to shut down.

In our solution, we decided to treat every subtree of Actors (with a SuicideActor at its root) as a mini “ActorSystem”. Once the SuicideActor finishes its block, it creates a dedicated Reaper under the ActorSystem. This Reaper watches over all the SuicideActor's children, and once all of them are dead, it sends the SuicideActor a PoisonPill and kills itself.

The new SuicideActor code is now pretty straightforward:

The Reaper implementation is slightly more involved:
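A minimal sketch of how the two pieces could fit together, assuming classic Akka actors. The names BlockDone, Reaper.props, and block() are illustrative assumptions rather than the original code. The sketch also assumes that children are created in the Actor's constructor and eventually stop by themselves (for example because they are SuicideActors too).

```scala
import akka.actor.{Actor, ActorRef, PoisonPill, Props, Status, Terminated}
import akka.pattern.pipe
import scala.concurrent.Future

case object BlockDone // hypothetical "block finished" signal

object Reaper {
  def props(owner: ActorRef, souls: Set[ActorRef]): Props =
    Props(new Reaper(owner, souls))
}

// Watches `souls`; once all are dead, poisons `owner` and stops itself.
class Reaper(owner: ActorRef, souls: Set[ActorRef]) extends Actor {
  private var remaining = souls

  override def preStart(): Unit = {
    // Watching an already-dead actor still produces a Terminated message.
    remaining.foreach(context.watch)
    reapIfDone()
  }

  def receive: Receive = {
    case Terminated(ref) =>
      remaining -= ref
      reapIfDone()
  }

  private def reapIfDone(): Unit =
    if (remaining.isEmpty) {
      owner ! PoisonPill
      context.stop(self)
    }
}

trait SuicideActor extends Actor {
  import context.dispatcher

  /** The work to run. It executes in a Future, so it must not touch actor state. */
  def block(): Unit

  override def preStart(): Unit =
    Future { block(); BlockDone }.pipeTo(self)

  def receive: Receive = {
    // Whether the block succeeded or failed, hand the subtree to a Reaper
    // created under the user guardian, outside this actor's own subtree.
    case BlockDone | Status.Failure(_) =>
      context.system.actorOf(Reaper.props(self, context.children.toSet))
  }
}
```

Because the Reaper lives outside the subtree it watches, stopping the SuicideActor cannot take the Reaper down with it. A leaf SuicideActor has no children, so its Reaper poisons it immediately.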

A nice side-effect of this design is that the SuicideActor scope is now split in half: The SuicideActor is now responsible only for running the block, while the Reaper does the heavy-lifting of handling the subtree of children.

A Happy Ending

With the Reaper pattern, we finally achieved graceful shutdown without abusing the Akka infrastructure. It’s also easy to test — both Reaper and SuicideActor can be unit-tested like any other Actor, while the entire flow can be validated with an integration test.

To this day, Reapers live (and kill) happily in our production systems.
