Yep, another bug fixing blog. But this one has a twist, an open-source hero swoops in and saves the day.
Debugging concurrency bugs is no picnic, but we're going to get into it. Enter Fray, a deterministic concurrency testing framework from CMU’s PASTA Lab, that turns flaky failures into reliably reproducible ones. Thanks to Fray’s clever shadow lock design and precise thread control, we tracked down a tricky Lucene bug and finally squashed it. This post explores how open-source heroes and tools are making concurrency debugging less painful—and the software world a whole lot better.
Software Engineer's Bane
Concurrency bugs are the worst. Not only are they difficult to fix, simply getting them to fail reliably is the hardest part. Take this test failure, TestIDVersionPostingsFormat#testGlobalVersions
, as an example. It spawns multiple document writing and updating threads, challenging Lucene’s optimistic concurrency model. This test exposed a race condition in the optimistic concurrency control. Meaning, a document operation may falsely claim to be the latest in a sequence of operations 😱. Meaning, in certain conditions, an update or delete operation might actually succeed when it should have failed given optimistic concurrency constraints.
Apologies for those who hate Java stack traces. Note, delete doesn’t necessarily mean “delete”. It can also indicate a document “update”, as Lucene’s segments are read-only.
Apache Lucene manages each thread that is writing documents through the DocumentsWriter
class. This class will create or reuse threads for document writing and each write action controls its information within the DocumentsWriterPerThread
(DWPT) class. Additionally, the writer keeps track of what documents are deleted in the DocumentsWriterDeleteQueue
(DWDQ). These structures keep all document mutation actions in memory and will periodically flush, freeing up in-memory resources and persisting structures to disk.
In an effort to prevent blocking threads and ensuring high throughput in concurrent systems, Apache Lucene tries to only synchronize in very critical sections. While this can be good in practice, like in any concurrent systems, there are dragons.
A False Hope
My initial investigation pointed me to a couple of critical sections that were not appropriately synchronized. All interactions to a given DocumentsWriterDeleteQueue
are controlled by its enclosing DocumentsWriter
. So while individual methods may not be appropriately synchronized in the DocumentsWriterDeleteQueue
, their access to the world is (or should be). (Let’s not delve into how this muddles ownership and access—it’s a long-lived project written by many contributors. Cut it some slack.)
However, I found one place during a flush that was not synchronized.
These actions aren’t synchronized into a single atomic operation. Meaning, between newQueue
being created, and calling getMaxSeqNo
, other code could have executed incrementing the sequence number in the documentsWriter
class. I found the bug!
data:image/s3,"s3://crabby-images/fe2dc/fe2dce028199f96373972add7bf38759f4554133" alt=""
If only it were that easy.
But, as with most complex bugs, finding the root cause wasn't simple. That's when a hero stepped in.
A hero in the fray
Enter our hero: Ao Li and his colleagues at the PASTA Lab. I will let him explain how they saved the day with Fray.
Fray is a deterministic concurrency testing framework developed by researchers at the PASTA Lab, Carnegie Mellon University. The motivation behind building Fray stems from a noticeable gap between academia and industry: while deterministic concurrency testing has been extensively studied in academic research for over 20 years, practitioners continue to rely on stress testing—a method widely acknowledged as unreliable and flaky—to test their concurrent programs. Thus, we wanted to design and implement a deterministic concurrency testing framework with generality and practical applicability as the primary goal.
The Core Idea
At its heart, Fray leverages a straightforward yet powerful principle: sequential execution. Java’s concurrency model provides a key property—if a program is free of data races, all executions will appear sequentially consistent. This means the program’s behavior can be represented as a sequence of program statements.
Fray operates by running the target program in a sequential manner: at each step, it pauses all threads except one, allowing Fray to precisely control thread scheduling. Threads are selected randomly to simulate concurrency, but the choices are recorded for subsequent deterministic replay. To optimize execution, Fray only performs context-switches when a thread is about to execute a synchronizing instruction such as locking or atomic/volatile access. A nice property about data-race freedom is that this limited context switching is sufficient to explore all observable behaviors due to any thread interleaving (our paper has a proof sketch).
The Challenge: Controlling Thread Scheduling
While the core idea seems simple, implementing Fray presented significant challenges. To control thread scheduling, Fray must manage the execution of each application thread. At first glance, this might seem straightforward—replacing concurrency primitives with customized implementations. However, concurrency control in the JVM is intricate, involving a mix of bytecode instructions, high-level libraries, and native methods.
This turned out to be a rabbit hole:
- For example, every
MONITORENTER
instruction must have a correspondingMONITOREXIT
in the same method. If Fray replacesMONITORENTER
with a method call to a stub/mock, it also needs to replaceMONITOREXIT
. - In code that makes use of
object.wait/notify
, IfMONITORENTER
is replaced, the correspondingobject.wait
must also be replaced. This replacement chain extends toobject.notify
and beyond. - JVM invokes certain concurrency-related methods (e.g.,
object.notify
when a thread ends) within native code. Replacing these operations would require modifying the JVM itself. - JVM functions, such as class loaders and garbage collection (GC) threads, also use concurrency primitives. Modifying these primitives can create mismatches with those JVM functions.
- Replacing concurrency primitives in the JDK often results in JVM crashes during its initialization phase.
These challenges made it clear that a comprehensive replacement of concurrency primitives was not feasible.
Our Solution: Shadow Lock Design
To address these challenges, Fray uses a novel shadow lock mechanism to orchestrate thread execution without replacing concurrency primitives. Shadow locks act as intermediaries that guide thread execution. For example, before acquiring a lock, an application thread must interact with its corresponding shadow lock. The shadow lock determines whether the thread can acquire the lock. If the thread cannot proceed, the shadow lock blocks it and allows other threads to execute, avoiding deadlocks and allowing controlled concurrency. This design enables Fray to control thread interleaving transparently while preserving the correctness of concurrency semantics. Each concurrency primitive is carefully modeled within the shadow lock framework to ensure soundness and completeness. More technical details can be found in our paper.
Moreover, this design is intended to be future-proof. By requiring only the instrumentation of shadow locks around concurrency primitives, it ensures compatibility with newer versions of JVM. This is feasible because the interfaces of concurrency primitives in the JVM are relatively stable and have remained unchanged for years.
Testing Fray
After building Fray, the next step was evaluation. Fortunately, many applications, such as Apache Lucene, already include concurrency tests. Such concurrency tests are regular JUnit tests that spawn multiple threads, do some work, then (usually) wait for those threads to finish, and then assert some property. Most of the time, these tests pass because they exercise only one interleaving. Worse yet, some tests only fail occasionally in the CI/CD environment, as described earlier, making these failures extremely difficult to debug. When we executed the same tests with Fray, we uncovered numerous bugs. Notably, Fray rediscovered previously reported bugs that had remained unfixed due to the lack of a reliable reproduction, including this blog’s focus: TestIDVersionPostingsFormat.testGlobalVersions
. Luckily, with Fray, we can deterministically replay them and provide developers with detailed information, enabling them to reliably reproduce and fix the issue.
Next Steps for Fray
We are thrilled to hear from developers at Elastic that Fray has been helpful in debugging concurrency bugs. We will continue to work on Fray to make it available to more developers.
Our short-term goals include enhancing Fray’s ability to deterministically replay the schedule, even in the presence of other non-deterministic operations such as a random-value generator or the use of object.hashcode
. We also aim to improve the usability of Fray, enabling developers to analyze and debug existing concurrency tests without any manual intervention. Most importantly, if you are facing challenges debugging or testing concurrency issues in your program, we’d love to hear from you. Please don’t hesitate to create an issue in the Fray Github repository.
Time to fix the danged thing
Thanks to Ao Li and the PASTA lab, we now have a reliably failing instance of this test! We can finally fix this thing. The key issue resided in how DocumentsWriterPerThreadPool
allowed for thread and resource reuse.
Here we can see each thread being created, referencing the initial delete queue at generation 0.
Then the queue advance will occur on flush, correctly seeing the previous 7 actions in the queue.
But, before all the threads can finish flushing, two are reused for an additional document:
These will then increment the seqNo
above the assumed maximum, which was calculated during the flush as 7. Note the additional numDocsInRAM
for segments _3
and _0
Thus causing Lucene to incorrectly account for the sequence of document actions during a flush and tripping this test failure.
Like all good bug fixes, the actual fix is about 10 lines of code. But took two engineers multiple days to actually figure out:
data:image/s3,"s3://crabby-images/b2889/b2889ed3a6d67025ab813ef337bfdd322ff6ce50" alt=""
Some lines of code take longer to write than others. And even require the help of some new friends.
Not all heroes wear capes
Yes, it's cliche – but it's true.
Concurrent program debugging is incredibly important. These tricky concurrency bugs take an inordinate amount of time to debug and work through. While new languages like Rust have built in mechanisms to help prevent race conditions like this, the majority of software in the world is already written, and written in something other than Rust. Java, even after all these years, is still one of the most used languages. Improving debugging on JVM based languages makes the software engineering world better. And given how some folks think that code will be written by Large Language Models, maybe our jobs as engineers will eventually just be debugging bad LLM code instead of just our own bad code. But, no matter the future of software engineering, concurrent program debugging will remain critical for maintaining and building software.
Thank you Ao Li and his colleagues from the PASTA Lab for making it that much better.
Elasticsearch is packed with new features to help you build the best search solutions for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine now.