
How a Golang Upgrade Led to Breaking and Fixing a Service

Ethos Life | Jul 21, 2022

We use a plethora of different technologies here at Ethos to make life insurance easy and protect the next million families. One of those technologies is Go (or golang), which powers multiple backend services. A couple of months ago, we decided to upgrade the version of golang used in one of these services and started experiencing an issue that one wouldn’t normally expect from such an upgrade.

Upgrading the Golang Version Caused a Service to Run Out of Memory

Upon upgrading the golang version, we started to see pods dying (and restarting) every couple of hours because they ran OoM (out of memory). As an initial brute-force solution, we tried increasing the amount of memory allocated to the service. This seemed to decrease the frequency of the restarts but didn’t prevent the pods from eventually running OoM and dying.

Memory usage before resource increase
Memory usage after resource increase

On inspecting the memory usage, we found that the memory utilization was going up every hour, coinciding with the start of a long, memory-intensive hourly job. Memory usage didn’t recover after the job finished and instead increased in visible steps until the pod hit the limit and was then automatically killed. This indicated a memory leak, but no major changes had been made to the job recently and it had been running in production without any memory issues for over a year!

Diagnosing the Issue and Resolving It

I decided to take a deeper dive into this issue to try and figure out why, all of a sudden, the pods were running up against their memory limits. As one would expect, I started by using pprof to collect and analyze the application’s memory allocation data. I collected data before, during, and after the job while using different versions of golang. It was clear from the memory profiles that memory usage wasn’t returning to normal levels after the job finished on the newer golang version. It did, however, return to its pre-job level when the older golang version was used.
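For context, collecting these profiles is straightforward once a service exposes the standard net/http/pprof endpoints. The snippet below is a minimal sketch (the port and setup are assumptions for illustration, not our actual service code):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose pprof on a side port so heap profiles can be pulled before,
	// during, and after the hourly job, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```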

Using pprof’s top command showed that the memory was being held by a method responsible for downloading a large resource from an S3 bucket. Digging further with the list command, I found that a buffer object used to download the S3 resource was the one clinging onto the memory. Some googling led me to discover that a few other people had run into issues using the AWS SDK’s aws.WriteAtBuffer to download large S3 objects. So, I started looking into that but soon realized that we hadn’t touched the downloading code for months.
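For illustration, downloading a large object into memory with the AWS SDK for Go (v1) typically looks something like the sketch below; the bucket and key names are hypothetical, not our actual code:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// downloadResource pulls an entire S3 object into an in-memory buffer.
// For a large object, this buffer is exactly the kind of allocation that
// shows up prominently in a heap profile.
func downloadResource(bucket, key string) ([]byte, error) {
	sess := session.Must(session.NewSession())
	downloader := s3manager.NewDownloader(sess)

	buf := aws.NewWriteAtBuffer([]byte{})
	_, err := downloader.Download(buf, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	return buf.Bytes(), err
}

func main() {
	data, err := downloadResource("example-bucket", "large-resource")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("downloaded %d bytes", len(data))
}
```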

Applying “first-principles thinking” and manifesting “Intellectual Rigor” (one of our core values, and the one that resonates the most with me as an engineer), I reasoned that since the issue appeared after a golang version update and the initial investigation showed a specific object holding on to memory, the issue had something to do with memory management. With this in mind, I changed the course of my investigation to focus on memory management and garbage collection, specifically on the garbage collection of the buffer object and any updates between versions that might have affected it.

Inspecting the code at the different places referencing the buffer object, I stumbled upon an odd-looking code construct that was indirectly interacting with the buffer object (referencing some other field on the struct containing it). A simplified simulation is shown below.

Simplified version of the odd-looking construct encountered in the code
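The original snippet appears only as an image in the post; a rough approximation of the construct, with invented names and based on the description that follows, looks roughly like this:

```go
// processor stands in for the struct that holds the large download buffer.
type processor struct {
	buf    []byte // the large S3 download buffer
	status string // some other field that the goroutine touches
}

// run returns an unbuffered channel and sends on it from a goroutine.
// If no one ever receives from the returned channel, the send blocks forever.
func (p *processor) run() <-chan struct{} {
	done := make(chan struct{})
	go func() {
		p.status = "processed" // references another field of the struct holding buf
		done <- struct{}{}     // unbuffered send: blocks until someone receives
	}()
	return done
}
```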

A method creates and returns an unbuffered channel to which we send a value from inside a goroutine, but the returned channel is not captured by the caller. After examining the git blame and asking around, I learned that a few months ago, a well-intentioned developer had wanted to make the method behave asynchronously (by ignoring the channel) for the HTTP request handler but synchronously (by waiting for a value on the channel) for use in integration tests.

Using the same method asynchronously and synchronously
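Continuing the hypothetical sketch above, the two call sites looked conceptually like this:

```go
import (
	"net/http"
	"testing"
)

// The HTTP handler ignores the returned channel (fire-and-forget).
func handleRequest(w http.ResponseWriter, r *http.Request) {
	p := &processor{}
	p.run() // returned channel discarded
	w.WriteHeader(http.StatusAccepted)
}

// The integration test receives on the channel to wait for completion.
func TestRun(t *testing.T) {
	p := &processor{}
	<-p.run()
}
```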

But, unknowingly, the developer introduced a goroutine leak in the handler flow (as is illustrated in this playground link).

Refactoring the method to not return a channel and re-running memory profiling on the updated code showed that the refactor solved the memory issue.

Code refactored to resolve goroutine leak
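The actual change isn’t reproduced here, but one way to do the refactor described above (reusing the hypothetical processor type from the earlier sketch) is to drop the channel entirely and let callers decide whether to run the method asynchronously:

```go
// run now simply does its work and returns; no channel is involved.
func (p *processor) run() {
	p.status = "processed"
}

// The handler wraps the call in a goroutine itself if it wants
// fire-and-forget behaviour, so nothing is left blocked on a channel send.
func handleRequest(w http.ResponseWriter, r *http.Request) {
	p := &processor{}
	go p.run()
	w.WriteHeader(http.StatusAccepted)
}

// The integration test just calls the method synchronously.
func TestRun(t *testing.T) {
	p := &processor{}
	p.run()
}
```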

So, problem solved, right? Memory usage now returned to normal levels after the job completed.

Hooray! 🎉

Going a little further!

But wait! This still doesn’t answer the question of why this was not an issue with the older golang version (the leaky goroutine code had been in production for several months).

Leaving the investigation here would solve the issue but would neither lead to substantial learning nor exemplify “Intellectual Rigor”!

So, I decided to spend some more time getting to the bottom of this. I started by running the job multiple times (using different golang versions and different versions of the code) and gathering multiple memory profiles to dig into. But the memory profiles weren’t revealing any new information at this point. So I collected other profiles (CPU, goroutine, etc.) to inspect, but to no avail!

Going back to “first principles” — I conjectured that, in the simplest terms, the issue can be summed up as follows:

something changed between golang versions that prevented memory from being garbage collected, even though it had been collected previously

I started looking into the specifics of golang’s garbage collection. Based on the release notes, nothing seemed to have changed garbage collection drastically. So, maybe something had changed about when the garbage collector was run. Even though it was unreasonable to think that the garbage collector might simply not run for several minutes after the job completed (and for almost an hour in production), just to be entirely sure I tried running the old code with garbage collection enforced right before exiting the job processing. That didn’t help either: memory was still leaking.

It was clear that something was keeping a reference to the buffer object and preventing it from being garbage collected, but I wasn’t sure what exactly that was.

So, I resorted to replicating the issue in a simpler setup and trying to reason about it with the simpler code! One thing to keep in mind here is that the GC cycle might not run at all in a short-lived program, so we’ll enforce garbage collection.

To accomplish this, I created a code snippet to simulate the scenario but with enforced garbage collection and tried running that with different versions of golang.

Simulation of the offending application code in a simpler code setup
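The snippet itself is only shown as an image in the original post; a rough reconstruction of the setup, with all of the specific details invented, might look like the following. It allocates a struct holding a large buffer, calls a leaky channel-returning method, drops the reference, forces a collection, and prints what is still live:

```go
package main

import (
	"fmt"
	"runtime"
)

type holder struct {
	buf  []byte // stand-in for the large S3 download buffer
	meta *meta  // a small side field that the goroutine touches
}

type meta struct{ status string }

// leakyRun mirrors the construct from the service: it returns an unbuffered
// channel and sends on it from a goroutine, and the caller ignores the channel.
func leakyRun(h *holder) <-chan struct{} {
	done := make(chan struct{})
	m := h.meta
	go func() {
		m.status = "done"
		done <- struct{}{} // blocks forever: nobody receives
	}()
	return done
}

func process() {
	h := &holder{buf: make([]byte, 512<<20), meta: &meta{}}
	leakyRun(h) // returned channel ignored, as in the request handler
}

func main() {
	process()

	// A short-lived program may never run a GC cycle on its own,
	// so force one before measuring.
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("goroutines: %d, heap in use: %d MB\n",
		runtime.NumGoroutine(), m.HeapInuse>>20)
}
```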

This successfully replicated the goroutine leak but didn’t cause the memory leak with the newer golang version, which meant it wasn’t simulating the exact scenario.

I was unable to make much progress and called it a day. But the issue stayed stuck in my mind even after work. Later that night, right before I was about to go to bed, it suddenly occurred to me that one construct missing from my replication was defer: the original code writes to the channel in a deferred closure, and that closure also indirectly interacts with the buffer object (same as the goroutine, by referencing a different field in the struct holding the buffer object).

Updating the code snippet to use defer in the same way resulted in a replication of the memory leak.

Simulation of the offending application code along with use of defer
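Again, this is only a rough structural reconstruction of the change described above, extending the previous snippet with invented details: the write to the side field and the channel send now happen inside a deferred closure that touches a field of the struct holding the buffer, just like the original handler code. The author’s real snippet differs in the details that matter for exactly what gets retained.

```go
// leakyRunWithDefer is a variant of leakyRun in which the write to the side
// field and the channel send happen inside a deferred closure, mirroring the
// structure of the original application code.
func leakyRunWithDefer(h *holder) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer func() {
			h.meta.status = "done" // the closure references a field of the struct holding buf
			done <- struct{}{}     // blocks forever inside the deferred call
		}()
		_ = len(h.buf) // stand-in for the work on the downloaded data
	}()
	return done
}
```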

So, I went through the release notes again, this time to see if any updates had been made to how defer affects garbage collection. I couldn’t find anything explicitly about defer affecting garbage collection, but I did discover that some optimizations were made to defer in v1.14.

To see if those could be directly affecting our use case, I tried running the original (non-refactored) code with v1.14, and that resulted in the memory leak too!

Thereafter, I started looking further into these optimizations made to defer. At this point, I’d like to point out that if you search for “defer changes in golang v1.14”, most of the first-page results you get are release notes (different people writing different articles just copying the official release notes).

Thankfully, it’s just most and not all. Among the results, I found this podcast where the engineer behind the defer optimizations himself explains the changes made (the optimizations are discussed at ~35:30).

Vishal Sood

Vishal Sood joined Ethos in March 2021 as a Software Engineer. When Vishal isn’t working on empowering and developing the Agent ecosystem at Ethos, he enjoys keeping up to speed on all things MCU, cooking, and learning new things. Interested in joining Vishal’s team? Learn more about our career opportunities here.