Hacker News — vinext + Cloudflare Workers

new
past
show
ask
show
jobs
submit

▲Fixing a kubelet memory leak in Kubernetes 1.36 (heyoncall.com)

65 points by compumike 20 hours ago | 13 comments

compumike 20 hours ago [-]

Author here! If you're running a Kubernetes cluster, I recommend you check `kubectl version` and see if you're running "Server Version: v1.36.[0,1,2]". If so, you may want to use the one-liner at the end of the article to check your "process_resident_memory_bytes" on each node, and consider restarting kubelet as a temporary workaround to tame the memory leak until v1.36.3 is released.

CamouflagedKiwi 5 hours ago [-]

Nice find.

Can't help but feel this is one of the subtle traps hidden beneath the advice that contexts aren't supposed to be stored. I know it's not always that easy, of course.

compumike 4 hours ago [-]

Thanks. I know there's a `go vet` tool that's run as part of Kubernetes CI, and one of its checks is:

  lostcancel: check cancel func returned by context.WithCancel is called

I'm not 100% sure why `go vet` didn't catch this issue, but storing the cancelFn in the struct is probably part of the reason. Any Go experts know if that's the case?

cyberax 4 hours ago [-]

The cancel function escapes the function body, so static analysis can't detect it. There's another lint for that (containedctx), but I think it's off in K8s.

This is a serious tripping point with Go. There's no way to express: "this is a root context that I _want_ to store and only use to create derived contexts". Goroutines are also a source of problems, you can't easily say "I'm passing the ownership of this context to a goroutine".

compumike 3 hours ago [-]

It does seem like a serious tripping point.

I took a quick look at "containedctx" and it seems like for this case, it would almost be backwards: it would flag the (not-memory-leaking) struct-stored "status.ctx", but wouldn't flag when there is a stored "status.cancelFn" only (which resulted in the memory leak).

Could Go implement a runtime leaked-context detector, like the data race detector? https://go.dev/doc/articles/race_detector

cyberax 3 hours ago [-]

This actually is possible now. Contexts are now garbage-collectible, even if cancel() is not called.

In this case, the cancel() function was preventing the collection. But I think it can be changed to hold a weak reference instead. The overhead is too large to run it normally, but it should be OK for something like the race detector.

keynha 3 hours ago [-]

[flagged]

rirze 5 hours ago [-]

Very cool. It's often daunting to contribute to such a well-established and recognizable project, but this is exactly how it should work.

jeffbee 2 hours ago [-]

Given that this happens on the main loop, one wonders how this escaped release quality checks. Is there another contributing cause that narrows the impact?

compumike 26 minutes ago [-]

I don’t see any other impact-narrowing criteria.

The default kubelet `syncFrequency` is 1 minute (per Pod). (There can be additional event-driven ones, but this is a floor.) Back of the envelope: 30 Pods * 1440 * 7 = 300k calls to startPodSync in a week.

I’d guess a lot of production clusters don’t run the latest release of Kubernetes, and 1.36 was only released in late April, less than two months before I reported the issue.

Given that the issue shows up as linear memory growth on the order of maybe 1 GiB/month per node, there really is a time-based discovery process here. (Could be artificially accelerated for a CI test, I imagine!) And any restarted nodes or clusters that upgraded point releases would restart that clock.

The only unusual contributing cause that made the issue more visible to me was trying to run everything on a tiny 2 GiB RAM node!

fsuts 5 hours ago [-]

Not all heroes wear capes! Well done

__turbobrew__ 4 hours ago [-]

A good reason to health check the kubelet process and restart it when the checks fail.

compumike 4 hours ago [-]

What kind of health checks? In my case, the kubelet process was staying alive and responsive to queries, I believe due to:

  # cat /proc/$(pgrep kubelet)/oom_score_adj
  -999
  
  (from OOMScoreAdjust=-999 in /etc/systemd/system/kubelet.service)

With this score, the Linux OOM killer wouldn't touch it, but any of my Pods were fair game.

nijave 3 hours ago [-]

At the metrics level, you can compare old vs new release. Have been bitten before by resource requirements dramatically change (regardless of whether it's a bug or functionality change)

Rendered at 22:07:12 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.