Fixing the 503 errors that occur during a rollout restart on GKE

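Recently I noticed that running rollout restart against a Deployment on GKE produces 503 errors for roughly 10 seconds. The caller handles these errors, so it was not a serious problem, but the same thing does not happen on our on-prem cluster, so I investigated the cause.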

Environment

GKE: 1.34.1-gke.1829001
Networking: Gateway API

Reproduction

Run the following script and, while it is running, execute kubectl rollout restart deployment/your-app.

# Poll the endpoint every 0.1 s and print the time elapsed since the script started.
START=$(date +%s.%N)
while true; do
  curl -s https://your-app.example.com/
  END=$(date +%s.%N)
  ELAPSED=$(echo "$END - $START" | bc)
  echo -e "\nElapsed time: ${ELAPSED}s"
  sleep 0.1
done

Around the moment of the rollout restart, 503 errors appear for roughly 10 seconds.

Elapsed time: 21.341049000s
OK
Elapsed time: 21.530590000s
OK
Elapsed time: 21.730951000s
upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused
Elapsed time: 21.938160000s
upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused
Elapsed time: 22.135801000s
upstream connect error or disconnect/reset before headers. reset reason: connection timeout
Elapsed time: 26.838277000s
upstream connect error or disconnect/reset before headers. reset reason: connection timeout
Elapsed time: 31.532648000s
OK
Elapsed time: 31.740313000s

`503 upstream connect error or disconnect/reset before headers`

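This error occurs when a backend is removed while requests are still in flight. Adding a preStop hook resolved it: the hook delays container shutdown, so the old Pod keeps serving while the load balancer stops routing traffic to it.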

        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 20"]

Note that this assumes /bin/sh and sleep exist in the image. If you use distroless, you need a workaround such as switching the base image or implementing an endpoint that sleeps.
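
For images without a shell, one possible alternative is the built-in sleep action for lifecycle hooks, which the kubelet performs itself, so nothing has to exist inside the image. A minimal sketch, assuming the cluster version supports the sleep action (to my knowledge it has been enabled by default since Kubernetes 1.30); the 20-second value simply mirrors the exec example above and was not tested in this post:

```yaml
# Container-level fragment; it goes under spec.template.spec.containers[].
lifecycle:
  preStop:
    sleep:
      seconds: 20   # same delay as the exec-based hook, but no /bin/sh or sleep binary is needed
```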

I1113 11:00:23.813868 53249 warnings.go:110] "Warning: autopilot-workload-defaulter:Autopilot added tolerations matching: cloud.google.com/gke-spot"
I1113 11:00:23.813935 53249 warnings.go:110] "Warning: autopilot-default-resources-mutator:The max supported TerminationGracePeriodSeconds is 25 seconds when using toleration of cloud.google.com/gke-spot=true:NoSchedule. Defaulting down from configured 30 seconds to 25 seconds."

As the warning above shows, on spot nodes TerminationGracePeriodSeconds can only be set up to 25s in the first place, so configuring a larger value has no effect.
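
Because the preStop hook runs inside the termination grace period, the sleep has to fit within that 25-second cap. A sketch of how the two settings relate, using the values from this post; the container name and image are placeholders, not the real workload:

```yaml
# Fragment of a Deployment; only the termination-related fields are shown.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 25   # the cap Autopilot applies on spot nodes
      containers:
        - name: your-app                  # placeholder
          image: your-app:latest          # placeholder
          lifecycle:
            preStop:
              exec:
                # 20 s of sleep leaves about 5 s for the process itself to shut down
                command: ["/bin/sh", "-c", "sleep 20"]
```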

As an aside, preStop hook events cannot be seen directly on the Pod, so you need to use the kubectl events command. Not knowing this cost me some debugging time.

`503 no healthy upstream`

With the error above resolved, a different one showed up next: no healthy upstream.

Looking through the documentation, I found the following statement.

This error message indicates that the health check prober cannot find healthy backend services

The Pod is running, but the health check from the Load Balancer to the Backend Service is failing. (This is separate from the Kubernetes readiness probe.)

Checking the Pod's events, I found a warning saying that the NEG was not attached.

NEG is not attached to any BackendService with health checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.

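As described in the known issues section of the Ingress API documentation, the NEG controller cannot tell whether a health check does not exist or whether it has succeeded, so it marks the Pod as ready. The issue is documented for the Ingress API, but the same issue exists with the Gateway API.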

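The Resolutions listed in that document, such as "prepare multiple backends" and "change the rolling update strategy", do not really answer the problem. Even with multiple backends in the NEG, you still have to wait until the NEG attachment completes.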

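I looked for a way to hold back the Pod's Ready state until the NEG is attached and found minReadySeconds. Strictly speaking, it does not delay Ready itself: it makes the Deployment wait that many seconds after a Pod becomes Ready before counting it as available, which keeps the old Pods in rotation longer. Setting minReadySeconds: 60 eliminated the 503 no healthy upstream error.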

Another case, where minReadySeconds helps is when using LoadBalancer Services with cloud providers. Since minReadySeconds adds latency after a Pod is Ready, it provides buffer time to prevent killing pods in rotation before new pods show up

I had not had much occasion to use it until now, but it is a feature built for exactly this kind of use case.
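
For reference, minReadySeconds is a field of the Deployment spec itself, not of the Pod template. A minimal sketch, assuming a Deployment named your-app; the replica count, labels, and image are placeholders, and only the value 60 comes from the fix described above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-app
spec:
  replicas: 2                 # placeholder
  minReadySeconds: 60         # a new Pod must stay Ready this long before it counts as available
  selector:
    matchLabels:
      app: your-app           # placeholder label
  template:
    metadata:
      labels:
        app: your-app
    spec:
      containers:
        - name: your-app
          image: your-app:latest   # placeholder
```

The trade-off is a slower rollout, since each new Pod has to stay Ready for 60 seconds before the rollout moves on.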
