I suppose every software engineer accumulates a few stories about unusually elusive bugs. These stories are fun to reminisce about because, often, a perfect set of circumstances led to the perfect bug. One of my projects while interning on the data team has been the Job Management Framework (JMF), an internal web service used by the data team to manage and monitor various data pipeline jobs such as aggregations, syncs, and purges. This bug-chasing story began a few weeks ago, when I was preparing JMF for its weekly deployment after implementing some routine bug fixes.
JMF: A few weeks ago
I had just fixed a minor bug in the way one JMF endpoint handled timestamps retrieved from the MySQL database. My code was unit tested, peer reviewed, and merged. Jenkins deemed it worthy. I was ready to deploy! It was during one last smoke test of the UI component (called DMF-UI), just before deployment, that things went awry. One of the core features of the UI is the ability to see recent and currently running jobs and retrieve information about them. The leftmost column of the currently running jobs table holds a clickable job id. Normally, clicking it opens a new window displaying information such as the curl request that triggered the job and the job’s execution summary. With the newest version of JMF that I was about to deploy, that feature degenerated into a no-op. Who was to blame?
On the hunt
Using Chrome’s network developer tools, I inspected JMF’s response:
[listing omitted: JMF’s JSON response, the same nested object repeating until truncation]
The response repeated the same pattern, seemingly recursing infinitely. In actuality, the response was long but eventually truncated (the screen buffer is finite, after all), resulting in invalid JSON. I then inspected JMF’s logs, which contained no trace of an error except for this strange assertion failure, printed a number of times.
Typically, an assertion failure inside the Java Virtual Machine (JVM) is accompanied by a stack trace and core dump (i.e., after a JVM crash), likely indicating an internal JVM bug. The error in JMF was different. Every click on a job id in the UI was followed by a number of the above assertion failures in the *same* JVM (i.e., no crash). With no available stack trace, my only clues were the invalid JSON with a recurring pattern and the assertion failure message. I delved into the latter first. (In retrospect, the former would have led directly to the answer, had I known where to look.)
Googling the assertion failure did not reveal anything other than the actual JVM code around line 806, inside the function transformClassFile:
[listing omitted: the code around the failing assertion in transformClassFile, JPLISAgent.c]
I ignored this for a while. I noticed that there were other features of DMF-UI that were broken in a similar fashion, and I tried to explore the similarities between the affected endpoints, but there didn’t appear to be anything wrong with them (I would later note that the broken endpoints all used the Ebean ORM). After trying different combinations of reverting my changes, testing the working JMF that was deployed in prod at the time, adding print statements, and using an older version of DMF-UI, I was left without an answer.
I eventually enlisted the help of Stan Rosenberg, the data team Java guru who, conveniently, wrote much of JMF. With Stan’s help I began to inspect the JVM code above. We found that errorOutstanding is set if an exception is raised as a result of invoking the JNI method CallObjectMethod. The latter in turn invokes a Java method, given by agent->mTransform, on the object given by agent->mInstrumentationImpl. The above code (in JPLISAgent.c) is part of Java Programming Language Instrumentation Services (JPLIS). A JPLIS agent is typically registered via the JVM option -javaagent; it provides an entry point via the method premain wherein a custom implementation of ClassFileTransformer is registered. The above JNI code invokes the transform method of the registered ClassFileTransformer. So, why would the invocation of the transform method raise an exception, and why are we using JPLIS in the first place?
Our backend service does indeed install an agent via -javaagent because we use AspectJ runtime instrumentation of the Ebean ORM framework. (Note: compile-time instrumentation is not possible because Ebean is a third-party dependency; moreover, the instrumentation inserts some necessary hooks for reconciling the relational model with the object model and for auditing executed SQL statements, none of which is available directly in Ebean.) Why did agent->mTransform raise an exception? There could be a host of reasons. Thus, we needed to isolate the cause-effect chain to narrow down the possible causes.
Indeed, one of our initial debugging attempts was to disable -javaagent. The result was rather underwhelming. We lost our only clue in the server logs, i.e., the JPLIS assertion failures disappeared, yet the rest of the behavior was virtually unaltered. We were also suspicious that the problem might lie within Spring’s dynamic instrumentation (we use certain Spring method interceptors to perform user authorization, etc.). Disabling the Spring interceptors also made no difference.
Meanwhile, I suggested diffing the “good” jar, i.e., the previously deployed version in prod, against the bad jar. (This step could have been avoided had we been able to produce a diff at the source level. One complication is that part of the source code resides in a separate repository.) We unpacked both jars and ran diff -r on the corresponding directories. Fortunately, only a few .class files differed. Again applying the principle of isolation, we replaced each differing file in the “good” jar with its “bad” version until we could reproduce the same error. We eventually ended up with a single .class file which had to contain the breaking change. A quick examination of the byte-code differences (with javap -c) left us with a single-line change:
[listing omitted: the single-line difference in the disassembled byte-code]
git log showed us that this single line change was committed sometime after the previous deployment.
In the above, mapper is an instance of org.codehaus.jackson.map.ObjectMapper, from the JSON library Jackson. Basically, this single line gives Jackson permission to read and write private fields within objects (for the purpose of JSON serialization/deserialization). Incidentally, JMF relies on Ebean’s runtime enhancement of model classes: runtime instrumentation which adds private fields (e.g., ebean_intercept), and such enhancement is needed to provide lazy loading and dirty checking within the ORM. Because this single-line change was in a separate repository (dataWebServices), it didn’t manifest itself as a code change when I ran git log -p. Instead, it appeared as a version change inside our pom.xml.
[listing omitted: the dataWebServices version bump in pom.xml]
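Mechanically, giving Jackson field-level visibility means it enumerates every declared field and suppresses Java’s access checks. The JDK primitive underneath is Field.setAccessible; here is a minimal sketch of that mechanism (the Job class and readField helper are illustrative, not JMF code):

```java
import java.lang.reflect.Field;

// Hypothetical model class; 'id' stands in for a private, getter-less field
// such as those added by runtime enhancement.
class Job {
    private final String id = "job-42";
}

public class FieldAccessDemo {
    // Field visibility boils down to this: look up a declared field by name and
    // suppress access checks so that even private fields become readable.
    static Object readField(Object target, String name) throws Exception {
        Field f = target.getClass().getDeclaredField(name);
        f.setAccessible(true);   // the "permission" the one-line change granted
        return f.get(target);
    }
}
```

Once every field is reachable this way, a serializer will happily follow fields the author never meant to expose, including ones injected by instrumentation.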
Let’s summarize: at this point, we have a single line of source code that is responsible for bringing down one of the core functions of JMF, yet we still don’t know exactly why. A stack trace would have been immensely useful. Note: checkForAndClearThrowable, as the name suggests, clears any raised exception, as required by the JNI contract. Most JNI methods cannot be invoked until the exception is cleared (the exception is essentially forgotten; i.e., it will not be re-thrown when control is transferred back to the caller). Furthermore, we had not ruled out AspectJ’s agent; it is possible that agent->mTransform yields an exception indirectly related to the code change above.
To rule out AspectJ, we implemented the identity class transformer:
[listing omitted: the identity ClassFileTransformer]
We ran it in conjunction with -javaagent. Same result! Thus, AspectJ was ruled out.
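For reference, an identity transformer needs only a transform method that returns null, which tells the JVM to keep the original class bytes unchanged. Something along these lines (a sketch in the spirit of what we wrote, not the original listing):

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// A do-nothing transformer: per the java.lang.instrument contract, returning
// null means "no transformation performed".
public class IdentityTransformer implements ClassFileTransformer {
    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        return null;   // keep the original bytes
    }

    // Agent entry point, installed with -javaagent:identity-agent.jar
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new IdentityTransformer());
    }
}
```

Because even this no-op agent reproduced the failure, the exception had to originate in the machinery that invokes the transformer, not in AspectJ’s weaving.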
JVMTI to the rescue
So the code in JPLISAgent clears the exception, as required by JNI. Fortunately, we can use the Java Virtual Machine Tool Interface (JVMTI) to examine the stack trace immediately before the invocation. To do so, we pulled the source code for OpenJDK and modified JPLISAgent.c to print out the stack trace. A few short minutes later we had OpenJDK built with our change and ready to run JMF.
Here is what the stack trace revealed:
[listing omitted: the stack trace]
The repeating portion of the trace consisted of BeanPropertyWriter.serializeAsField, BeanSerializerBase.serializeFields, and BeanSerializer.serialize, all of which occur in the Jackson library. Also note that the stack was unusually deep: 7384 stack frames. Upon inspecting the code in BeanSerializerBase.serializeFields, we found a rather interesting exception handler:
[listing omitted: the serializeFields exception handler in BeanSerializerBase]
An unusually deep stack, coupled with the recurring pattern and the exception handler for StackOverflowError, led us to the source of the problem: stack overflow. Generally, it is not a good idea to catch StackOverflowError, since not a whole lot can be done to recover from it. Indeed, the “tricky” nature is acknowledged in the comment. Evidently, the handler is used for wrapping the error inside a JsonMappingException. However, this begs the question: why didn’t JsonMappingException appear in our server logs? It happens that JsonMappingException derives from IOException. We run jetty, which by default relegates all IOExceptions to the DEBUG log level, which of course is disabled in most production environments. Below is the relevant code which handles IOException inside jetty, taken from org.eclipse.jetty.servlet.ServletHandler:
[listing omitted: jetty’s ServletHandler code demoting IOException to DEBUG]
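Because JsonMappingException is-an IOException, it lands in the same catch block as routine network hiccups. The effect can be sketched with a hypothetical handler (MappingFailure stands in for JsonMappingException, and the handler models jetty’s behavior rather than quoting its source):

```java
import java.io.IOException;

// Stand-in for org.codehaus.jackson.map.JsonMappingException, which extends IOException.
class MappingFailure extends IOException {
    MappingFailure(String msg) { super(msg); }
}

public class SwallowDemo {
    static final boolean DEBUG = false;   // DEBUG logging disabled, as in production

    // Container-style handling: IOExceptions are presumed to be routine client
    // aborts, so they are logged at DEBUG, i.e., invisible in production logs.
    static String serve() {
        try {
            throw new MappingFailure("infinite recursion while serializing");
        } catch (IOException e) {
            return DEBUG ? "LOG: " + e.getMessage() : "";  // silently swallowed
        }
    }
}
```

The catch clause never distinguishes a genuine serialization failure from a dropped connection; both vanish unless DEBUG logging is on.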
Piecing the puzzle together
We now had all the pieces of the puzzle: the single line change resulted in a stack overflow when a model was being serialized in response to the user click. Stack overflow occurred because the private (instrumented) field owner in the requested Job object is a self-reference. (This explains the infinite recursion, since JSON is not suitable for serializing cyclic graphs due to the lack of reference types.)
Because of the logic above, the StackOverflowError essentially became an IOException, which was ignored by jetty. Furthermore, the assertion failure originated in the exception handler above: instantiating JsonMappingException required loading that class’s byte-codes, which in turn resulted in the call to agent->mTransform. That call must have raised an internal exception because the stack had already overflowed!
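The whole failure mode fits in a few lines: a naive field-walking serializer with no cycle detection, pointed at an object whose private field refers back to itself. (Node and its owner field below are illustrative stand-ins for the enhanced Job model, not the actual JMF classes.)

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Stand-in for an enhanced model class: a private self-referential field.
class Node {
    private Node owner = this;   // the cycle: every Node points back to itself
    private int value = 1;
}

public class CycleDemo {
    // Naive reflective serializer: walks declared fields with no cycle
    // detection, mimicking field-visibility serialization of a cyclic graph.
    static String toJson(Object o) throws Exception {
        if (o == null) return "null";
        if (o instanceof Number || o instanceof Boolean) return o.toString();
        if (o instanceof String) return "\"" + o + "\"";
        StringBuilder sb = new StringBuilder("{");
        for (Field f : o.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers())) continue;
            f.setAccessible(true);
            sb.append('"').append(f.getName()).append("\":").append(toJson(f.get(o)));
        }
        return sb.append('}').toString();
    }

    static String trySerialize(Object o) {
        try {
            return toJson(o);                // recurses through 'owner' forever
        } catch (StackOverflowError e) {
            return "StackOverflowError";     // the very error Jackson's handler wraps
        } catch (Exception e) {
            return e.toString();
        }
    }
}
```

Run against a Node, trySerialize never terminates normally; the recursion through owner exhausts the stack exactly as it did in JMF.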
The sheer number of abstractions we use in our jobs can be a double-edged sword. Without a doubt, software and hardware abstractions are responsible for the rapid acceleration of software development. A multitude of frameworks and libraries raise the abstraction boundaries to the point where a non-specialist can implement software by merely writing what is typically known as the business logic. On the other hand, too much abstraction can hurt in unpredictable ways, such as the bug we encountered. (There are many examples: automatic memory management is a great abstraction, yet sometimes it leads to difficult-to-diagnose memory problems; highly expressive languages are powerful abstractions, but they typically yield poor runtime performance due to suboptimal code generation; etc.) The trick is to find a fine balance, but until we do, we must hone our skills at navigating these abstraction boundaries.
An astute reader may ask, “Why didn’t (unit/integration) tests catch this bug?” The sad fact is that they probably could have. However, in a fast-paced environment, the prevailing mentality is “deliver or perish.” Furthermore, achieving complete test coverage is nearly impossible, and even complete coverage is no guarantee of correctness. Thus, every line of code is guilty until proven correct.