We recently experienced an outage here at AppNexus involving an ephemeral user session store. This is a story of how visibly correct code can still fail in unexpected ways.
We have a Scala service based on the Play framework that is a simple wrapper around Redis storage of user sessions. The service is completely separate from authentication and exists merely to abstract the Redis KV store so that other services in our infrastructure can talk to it in a standard, signed, and RESTful fashion. It runs on the JVM inside an OS-level virtual machine and uses the scredis driver to communicate with Redis.
The code can essentially be boiled down to: call get on the scredis driver with some String key and return the result, or call set with a value, optionally setting an expiration if one is provided, and then return the outcome.
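In outline, that wrapper looks something like the following. This is a hedged sketch with my own names and an in-memory stand-in for Redis, not the actual service code:

```scala
// Hedged sketch of what the service boils down to; the trait, the
// names, and the in-memory stand-in are illustrative, not the real code.
trait SessionStore {
  def get(key: String): Option[String]
  def set(key: String, value: String, ttlSeconds: Option[Int]): Unit
}

// In-memory stand-in for the Redis-backed store (the TTL is accepted
// but ignored here).
class InMemoryStore extends SessionStore {
  private val data = scala.collection.mutable.Map.empty[String, String]
  def get(key: String): Option[String] = data.get(key)
  def set(key: String, value: String, ttlSeconds: Option[Int]): Unit =
    data(key) = value
}

object SessionWrapper {
  // Read path: look the key up in the store and return the result
  def fetch(store: SessionStore, key: String): Option[String] =
    store.get(key)

  // Write path: store the value, optionally with an expiration in
  // seconds, then return
  def save(store: SessionStore, key: String, value: String,
           ttl: Option[Int] = None): Unit =
    store.set(key, value, ttl)
}
```

In the real service the SessionStore role is played by the scredis client talking to Redis over the network, which is where the rest of this story happens.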
Recently, the host OS of the machine on which the VM was deployed came under memory pressure. As a result, the scheduling of each individual VM on the host experienced delays. The box itself was in heavy enough distress that it was dropping network packets and suspending execution of individual VMs for seconds at a time.
(Exception output omitted; somekey and someindex are obfuscations of the actual data.)
Understandably, our simple wrapper application was experiencing timeouts in communication with the Redis server; the OS was not giving the VM enough scheduler priority to execute things. However, we also saw other strange behavior: errors while trying to parse what responses we did get back, implying corruption or packet loss; and, strangest of all, completely mismatched data, where a request for key A would return the data for key B! For those of you unfamiliar with KV storage, that is akin to executing a SQL statement for some primary key in a MySQL database and getting back a totally different row. The last behavior is the scary part, since it appeared with no accompanying processing error.
We initially thought the problem must be in our code: maybe a buffer we were holding onto and not clearing on an error, something off by one in our handling of requests and responses, or some assumption about Actor execution order exposing a race condition. We were also seeing RedisProtocolExceptions about unexpected responses for some requests.
This was telling us that a command which should return a simple string reply of 'OK' was instead returning something that was not a simple string reply. This could still have been packet corruption somewhere lower in the network stack, but coupled with the mismatched-data behavior, we were pretty sure it pointed to the scredis driver.
The frustrating part is that we could not reproduce this issue in our testing environments. The working theory was that packet loss and/or lack of process priority was exposing some race condition in the driver resulting in the weird behavior we were seeing. So we constructed a test.
On CentOS 5, we configured the test instance's network interface to drop packets.
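The original commands have not survived in this post, but a typical way to introduce packet loss on a Linux box is tc with the netem queueing discipline; something along these lines (the interface name and loss rate are illustrative):

```shell
# Drop 10% of outgoing packets on eth0 (interface and rate are
# illustrative; they are not the original values from the test).
tc qdisc add dev eth0 root netem loss 10%

# Inspect the active queueing discipline
tc qdisc show dev eth0

# Remove the rule when the test is finished
tc qdisc del dev eth0 root netem
```

These commands require root and alter live network behavior, so they should only be run on a disposable test instance.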
Then we wrote a test program (note: I am not a Scala dude, so forgive any non-Scalaisms) that would:
- Loop forever
- Go through our 1 to 100 key set
- Get the value for each key from Redis
- If the key does not match the value, print that fact
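The loop above can be sketched as follows. The real test used the scredis driver against a live Redis instance; here a plain mutable map stands in for the store so the key-mismatch check is self-contained:

```scala
import scala.collection.mutable

// Sketch of the mismatch-hunting test. The real program talked to
// Redis through scredis; the mutable.Map here is a stand-in so the
// checking logic can be shown on its own.
object MismatchTest {
  def run(store: mutable.Map[String, String], iterations: Int): Int = {
    // Seed keys 1..100 with values equal to their keys
    for (i <- 1 to 100) store(i.toString) = i.toString

    var mismatches = 0
    for (_ <- 1 to iterations; i <- 1 to 100) {
      val key = i.toString
      store.get(key) match {
        case Some(v) if v != key =>
          println(s"MISMATCH: key $key returned value $v")
          mismatches += 1
        case None =>
          println(s"MISSING: key $key")
          mismatches += 1
        case _ => // value matches key, as expected
      }
    }
    mismatches
  }
}
```

Against a healthy store this prints nothing; any output at all means the store handed back data for the wrong key.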
We then ran this for a long time with the packet loss on the test instance and still could not reproduce it. Despair.
Eventually someone suggested suspending and resuming the redis-server process with SIGSTOP and SIGCONT to simulate the lack of responses or timeouts from the Redis server. And then, glorious output was glorious.
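The suspend/resume itself is just signals. A sketch, with a throwaway `sleep` process standing in for redis-server so it can run anywhere:

```shell
# On the test box the target was the redis-server pid; a throwaway
# `sleep` stands in here so the sketch is self-contained.
sleep 30 &
target=$!

# Freeze the process: it stops answering anything, much like a
# redis-server whose VM has been descheduled by the host.
kill -STOP "$target"
ps -o stat= -p "$target"    # shows 'T' (stopped)

# Resume it later, as we did with SIGCONT between test runs
kill -CONT "$target"
kill "$target"
```

Unlike packet loss, SIGSTOP makes the server go completely silent while keeping the TCP connection open, which turned out to be the trigger we were missing.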
We had reproduced the weird behavior of this driver! We were asking for key 5 and getting back the value for key 99. It was possible that Redis itself was malfunctioning, but we quickly discarded that idea: with so many users worldwide, we were pretty sure we would have heard of such a bug at some point, and Google searches revealed nothing.
At this point we set a breakpoint in the debugger in handleReceiveTimeout and added a condition so that we would only break if remainingByteStringOpt was set. After trying a few more times with SIGSTOP and SIGCONT, we hit the condition again and now had it in the debugger.
The first thing the driver does is send a message to the current ioActor reference telling it to shut down. Given how the actor model works, this is really just a message enqueued into that actor's mailbox, to be handled the next time the actor gets a turn. Next, the driver calls become to move its state machine into the next state. The become call is just passed up to the ActorCell implementation in Akka, which replaces the top item in the actor's behavior stack. Eventually the shutdown path executes: when a certain condition holds, it simply dumps everything in the requests queue (including the request that matches the data in remainingByteStringOpt) and then reconnects.
This overwrites our ioActor reference with a new one, registers the new actor with the context, and switches the state machine yet again. Note that remainingByteStringOpt still contains remnant data at this point. While connecting, we re-establish the connection to the Redis server and carry on sending requests.
We then swing all the way back around to receive, where we prepend this remnant data to the incoming buffer to be processed. This remnant data, which belonged to an older, terminated request, was never finished being processed by the driver.
All of this results in data that mismatches our original request.
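To see why the remnant data produces silently mismatched results rather than an error, remember that a Redis client pairs responses with requests purely by order. A toy model of that FIFO matching (my own illustration, not scredis code):

```scala
// Toy model (not scredis code) of why remnant bytes cause mismatched
// data: a Redis client matches responses to requests purely by FIFO
// order, so one stale response at the front of the buffer shifts
// every live request onto the wrong reply.
object FifoMismatch {
  def matchResponses(requests: List[String],
                     responses: List[String]): List[(String, String)] =
    requests.zip(responses)

  def demo(): (String, String) = {
    // After the faulty reconnect, a remnant reply for old key 99 is
    // still at the front of the buffer...
    val buffer = List("value-for-99", "value-for-5")
    // ...while the requests queue was dumped, so the first live
    // request is a GET for key 5. FIFO matching hands it remnant data.
    val requests = List("GET 5", "GET 6")
    matchResponses(requests, buffer).head  // ("GET 5", "value-for-99")
  }
}
```

No parse error occurs because the stale bytes are a perfectly well-formed reply; they simply belong to a request that no longer exists.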
If you are having trouble following the execution path of the code
above, do not despair. I have trouble following it as well. I
believe this is a case where the complexity in execution path
introduced by using the actor model for a synchronous protocol makes
it very difficult to reason about correct behavior. The state machine
in an Actor is hard to get right. When you split the state management of a database connection across multiple actors (ListenerActor, IOActor), you run the risk of getting the state machine wrong in one of them. It might simplify this driver's execution path to use a single Actor for both connection state and network IO, centralising the state management in one place, and then fire off Actors to deal with each complete response.
I believe the fix for the above is simple: when abandoning the connection, set remainingByteStringOpt to None and discard any unprocessed bytes. But this code path is so complex that there might actually be a valid case for not discarding them, making the fix more involved.
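In sketch form (the variable name follows the driver's; the surrounding code is my own illustration, not the actual patch):

```scala
// Illustration of the proposed fix, not the actual scredis patch:
// when the connection is abandoned and the requests queue is dumped,
// any partially parsed response bytes must be dropped with it.
object ConnectionState {
  // Stands in for the driver's partially consumed network buffer.
  var remainingByteStringOpt: Option[Array[Byte]] = Some(Array[Byte](1, 2, 3))

  def abandonConnection(): Unit = {
    // ...the driver dumps every queued request here...
    // The remnant bytes belong to those dead requests, so they must
    // not survive into the new connection:
    remainingByteStringOpt = None
  }
}
```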
The fix for the above has been submitted as a pull request to the
scredis maintainers. I have also pre-shared this write-up with them.