Why I Prefer Tracing Garbage Collection Over ARC

Brom Bresenham

19 Apr 2022 • 7 min read

I started rewriting my Rogue language compiler from scratch. Rogue is evolved from a series of earlier languages, so this is approximately the sixth full iteration of that compiler, with each version programmed using the compiler from the previous iteration. Rogue v2 is being programmed in Rogue v1, naturally.

I wasn't sure if I would ever rewrite the Rogue compiler again. I'd kind of hoped not 'cause it's a lot of work - v1 has been working well and has become very robust over time, but on the other hand rewriting it is fun and the end result is better architected every time.

I have three motivations for wanting to rewrite Rogue.

I wanted to use my compiler development tool Froley to generate and maintain the scanner and parser components.
Rogue v1 and its predecessors cross-compile (or transpile) to C++ but I've always thought it would be ideal to target straight C instead.
Rewriting the compiler gives me a good opportunity to refine the language. Now that I've been using Rogue v1 for eight years I've got a few pain points I'm going to address and a few unused features I'm going to trim out.

Those motivations have been building for a while, but the actual catalyst was the release of the Playdate SDK. I'd like to mess around with Playdate and Rogue is almost the ideal language for doing so, but Playdate only (?) works with C, not the C++ that Rogue v1 generates. I know I might be able to make it work regardless, but it was a good excuse to pull the trigger on my Rogue v2 rewrite.

When I first started R2 (as I'm calling it during development) I wanted to consider alternatives to the Mark & Sweep tracing garbage collection that I've always used.

Mark & Sweep
Keep a master list of all allocated objects. When you run out of memory or decide to collect garbage: Mark: Starting at root objects such as globals, trace through the program's object graph and mark all referenced objects as "in use". Sweep: Iterate over the master list and free any objects that aren't marked in use. The amount of tracking overhead varies with implementation and can be quite minimal. Neither the developer nor the runtime framework needs to track objects during regular operations - in other words, no need to retain and release references as they're stored. To ensure that all references are accurately traced, the program (including all threads) must pause while garbage collection takes place. This is known as "stop the world" garbage collection. Consequently the more memory that's used, the longer the GC takes to finish.

Mark & Sweep

Keep a master list of all allocated objects.
When you run out of memory or decide to collect garbage:
- Mark: Starting at root objects such as globals, trace through the program's object graph and mark all referenced objects as "in use".
- Sweep: Iterate over the master list and free any objects that aren't marked in use.
The amount of tracking overhead varies with implementation and can be quite minimal. Neither the developer nor the runtime framework needs to track objects during regular operations - in other words, no need to retain and release references as they're stored.
To ensure that all references are accurately traced, the program (including all threads) must pause while garbage collection takes place. This is known as "stop the world" garbage collection. Consequently the more memory that's used, the longer the GC takes to finish.

As my first possible alternative I investigated Rust and its Borrow Checker system. Here's a summary from my brief look into it.

Borrow Checker

Borrow Checker
Rust adds novel rules and restrictions that allow it to track data ownership and references at compile time. In constrast, garbage-collecting strategies all track data ownership at runtime. In Rust, a standard assignment such as `b = a` transfers ownership of the data from `a` to `b`, and `a` can no longer be used to reference the data. In addition Rust allows either any number of read-only local (stack-based) pointers to a variable or a single write-only local pointer, thus allowing values to be passed as function arguments while avoiding thread concurrency issues. The mechanism that validates and enforces these rules is called the Borrow Checker.

Rust adds novel rules and restrictions that allow it to track data ownership and references at compile time. In constrast, garbage-collecting strategies all track data ownership at runtime.
In Rust, a standard assignment such as b = a transfers ownership of the data from a to b, and a can no longer be used to reference the data.
In addition Rust allows either any number of read-only local (stack-based) pointers to a variable or a single write-only local pointer, thus allowing values to be passed as function arguments while avoiding thread concurrency issues.
The mechanism that validates and enforces these rules is called the Borrow Checker.

I'm very impressed with Rust's unique approach as it shifts the burden of memory management entirely onto the compiler (and to some extent the developer as they adhere to the required conventions) instead of requiring any tracking or tracing at runtime. However, while this system works well with Rust's procedural approach I don't believe it's possible to extend it to an object-oriented paradigm. Especially not with Rogue that is so tightly interleaved with C. Rogue passes data to inline C, C functions call Rogue methods, etc., and it's just not feasible to make those interconnections visible to a Borrow Checker-style system.

Finally I turned my thoughts towards Automatic Reference Counting, or ARC.

Automatic Reference Counting (ARC)
ARC is essentially an automatic implementation of the "smart pointer" pattern that can be implemented in C++. Whenever an object is assigned to a reference variable, the object is "retained" by incrementing a reference count or refcount property of the object. When the variable goes out of scope or is assigned a reference to a different object, the original object's refcount is decremented. If the refcount goes to zero then the object is deleted right then. By refcounting objects as the program executes, we always know when an object can be freed. Code runs slightly slower because the world-stop overhead of GC tracing has been "amortized" into an incremental tracking overhead.

Automatic Reference Counting (ARC)

ARC is essentially an automatic implementation of the "smart pointer" pattern that can be implemented in C++.
Whenever an object is assigned to a reference variable, the object is "retained" by incrementing a reference count or refcount property of the object. When the variable goes out of scope or is assigned a reference to a different object, the original object's refcount is decremented. If the refcount goes to zero then the object is deleted right then.
By refcounting objects as the program executes, we always know when an object can be freed. Code runs slightly slower because the world-stop overhead of GC tracing has been "amortized" into an incremental tracking overhead.

Choosing ARC

I had the following things in mind when I initially decided to use ARC for R2.

Garbage Collection (implicitly meaning "traced" GC) has a bit of a bad reputation, especially with the occasional small GC freezes that used to (and maybe still do) happen on Android.
Because GC takes longer the more objects there are, I tend to be uptight about creating lots of e.g. small objects in my games if I can accomplish the same things with non-reference data lists and such. It would be nice to not have to worry about that.
Conversely ARC has more of a "professional" or "robust" halo around it, and you can create as many objects as you want - the execution speed will effectively remain constant in terms of memory management.
It would be interesting to tackle the challenge of writing an ARC mechanism for R2.

Second Thoughts

I started to have some doubts as I developed my ARC system. These were, in both chronological order and in order of being increasingly troubling, as follows.

Reference counting every variable didn't sound that bad - just affects the assignments, right? - until I started implementing the underlying mechanism. Then I realized there's just a great deal of overhead involved. Locals need to be ref-counted as well, so the straightforward approach is to to refcount every parameter and every this object context for every call. Returned references need to be retained beforehand, and consequently unused return values must be released by the caller.
Rogue/C interoperability becomes more complicated and fragile. C calls from Rogue must remember to retain their return values because Rogue will release them. Likewise C callers must remember to release values returned from Rogue.
When I started implementing R2's list methods I realized that the ARC overhead was going to be even worse than I had already realized. For example, it's not uncommon for me to copy some or most objects from a main list to a work list to send to a method, clear the original list, copy the updated work list back to the original list, and clear the work list. Say the list has 100 objects. In a tracing GC that's about a hundred individual copies, a memset, a memcpy, and another memset. But in ARC that's a hundred copies with a hundred retains, a hundred individual releases to clear the original list, a hundred more individual copies and retains, and a hundred more releases.

Premature Optimization

At this point I started thinking about going back to good ol' Mark & Sweep. After all, there wasn't really a problem I was trying to solve. Rogue's GC runs very fast, under 2 ms in the average case, if I recall, and my Rogue games run the same speed even if I force GC 60 times a second (once per frame).

And then I had an epiphany that sealed the deal for a return to GC.

Viva GC

I realized I was completely overlooking an important aspect of the GC-vs-ARC argument.

I had been comparing, say, occasional 2 ms GC calls with 0.5 ms of extra ARC tracking overhead per frame, thinking that 0.5 ms every frame was better than 2 ms during a frame once in a while.

What I was missing was that allocation doesn't need to happen all the time! If I write my code well, GC doesn't ever need to happen during core gameplay. In my games, GC only happens when the system reaches a designated allocation threshold or when it's manually triggered. If my game engine is well-engineered, makes use of recycled entity pools and such, and if I trigger GC's at opportune moments between levels and such, then there's little chance that intensive gameplay could be interrupted by a poorly timed GC.

So the real choice is "very fast code with fast, appropriately-timed allocations and GC's" versus "ARC but a bit slow all the time". I choose fast GC.

Appendix: Rogue's GC Tricks

Here are some details on how R2 implements its tracing GC.

Referenced Object Marking Mechanism

Each new object is added to a master list of all objects. Each object has a signed 32-bit refcount which isn't directly used by the Rogue runtime but can be incremented by native code to signal that the object is referenced in C.

An object is marked as in use by complementing (or flipping the bits of) the refcount. 0 becomes -1, 5 becomes -6, etc. So checking the in use status is as simple as checking refcount < 0 versus refcount >= 0. The refcount is complemented back to normal again at the end of GC.

GC In-between Calls to Rogue

Rogue's default GC mode is auto, which means GC can happen any time an object is about to be allocated. This requires extra tracking overhead of making sure every local variable is accessible to the collector.

Rogue has a faster GC mode called manual, where GC can only be invoked by C in-between calls to Rogue. This means that no local variables need be tracked, so the code runs faster and GC happens faster. This is ideal for games.

Separate Lists for Special Kinds of Objects

Any Rogue class can define an on_cleanup() method that is automatically called when then object is about to be freed. Instances of that class are said to "require cleanup". Rogue handles these objects by placing them in a separate master list; we know that any unreferenced objects in the objects_requiring_cleanup list need to have their on_cleanup() method called.