
Receive: Endless loop of retried replication with capnproto and distributors #8254


Open
TheReal1604 opened this issue May 19, 2025 · 6 comments


@TheReal1604

Thanos, Prometheus and Golang version used: 0.38.0

bitnami/thanos:0.38.0-debian-12-r3

Object Storage Provider:
Openstack-s3

What happened:
We use a thanos setup with 3-5 receivers and dedicated thanos-receive routing instances, which use capnproto as a replication protocol. The replication_factor is set to 3.

Currently we only have a static hashring configuration.

```yaml
data:
  hashrings.json: |-
    [
      {
        "endpoints": [
          "thanos-receive-0.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-3.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-4.thanos-receive-headless.customer1.svc.cluster.local:19391"
        ]
      }
    ]
```

Unfortunately, we can trigger something like an endless replication retry loop if the receive instances restart in a chaotic way (this is triggered by k8s node rollovers in our custom clusters).

When this happens, the distributor pods log the following error very often:

```
Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: 2 errors: forwarding request to endpoint {thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391 thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391 }: failed writing to peer: pkg/receive/writecapnp/write_request.capnp:Writer.write: rpc: bootstrap: send message: context deadline exceeded; forwarding request to endpoint {thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391 thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391 }: failed writing to peer: pkg/receive/writecapnp/write_request.capnp:Writer.write: rpc: bootstrap: send message: context deadline exceeded
```

The receive pods restart at the same time, but the distributor cannot recover from that error state. I ultimately have to kill the distributor pods for everything to work correctly again. The normal receive pods are working fine (they don't need a restart or anything).

The interesting thing: we reverted back to gRPC with protobuf and couldn't reproduce this issue, so it seems to be something in the capnproto implementation.

What you expected to happen:
Recover successfully from the above error when receive pods and router/distributor pods are restarted.

How to reproduce it (as minimally and precisely as possible):

  1. Create a Thanos setup (3-5 receive pods, 2 distributor/router pods)
  2. Set up replication with replication factor 3 and use capnproto as the replication protocol
  3. Set up a fixed hashring.json replication ConfigMap
  4. Send test data to the distributor pods
  5. Restart the distributor and receive pods in a chaotic way
  6. -> Stuck in this error loop

Anything else we need to know:
  • We are using the Bitnami Thanos Helm chart.

dosubot bot commented May 19, 2025

The issue you're experiencing with the endless replication retry loop when using Cap'n Proto in Thanos is likely related to how Thanos handles retries and error recovery in the Cap'n Proto replication protocol. When a disconnection error occurs, Thanos attempts to reconnect and retry the write operation up to a specified number of times. However, if the receive instances restart in a chaotic manner, it can lead to repeated disconnection errors and thus trigger the endless retry loop [1].
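
For illustration only, the reconnect-and-retry pattern described above could look roughly like the sketch below. None of this is Thanos source code: `sendBatch` is a hypothetical stand-in for the generated `Writer.write` call, and the attempt count and backoff are assumptions. The point is that a retry only helps if the dead `rpc.Conn` is discarded and a fresh connection is dialed; retrying on the same broken session keeps producing the same `context deadline exceeded` error.

```go
// Illustrative sketch, not Thanos code: retry a Cap'n Proto write by
// re-dialing the peer on every failed attempt instead of reusing a
// session whose underlying TCP connection may already be dead.
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"capnproto.org/go/capnp/v3"
	"capnproto.org/go/capnp/v3/rpc"
)

// sendBatch is a hypothetical stand-in for the generated Writer.write call.
func sendBatch(ctx context.Context, client capnp.Client) error {
	_, _ = ctx, client
	return nil
}

// replicateWithReconnect dials addr, performs one write, and tears the
// connection down again; on error the next attempt starts from a fresh dial.
func replicateWithReconnect(ctx context.Context, addr string, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		tcpConn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err != nil {
			lastErr = err
			time.Sleep(time.Second) // assumed fixed backoff
			continue
		}
		conn := rpc.NewConn(rpc.NewStreamTransport(tcpConn), nil)
		client := conn.Bootstrap(ctx)

		err = sendBatch(ctx, client)
		client.Release()
		_ = conn.Close()
		if err == nil {
			return nil
		}
		// The failed session is dropped here; nothing stale is kept for reuse.
		lastErr = err
	}
	return fmt.Errorf("replication to %s failed after %d attempts: %w", addr, attempts, lastErr)
}

func main() {
	// The address is a placeholder; resolving it will fail outside the cluster.
	err := replicateWithReconnect(context.Background(), "thanos-receive-0:19391", 3)
	fmt.Println(err)
}
```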

The fact that reverting to gRPC with Protobuf resolves the issue suggests that the Cap'n Proto implementation may not handle these chaotic restarts as gracefully as gRPC does. This could be due to differences in how each protocol manages connections and retries.

To mitigate this issue, you might consider the following approaches:

  1. Increase the Reconnect Attempts: Adjust the number of reconnect attempts in the Cap'n Proto configuration to allow for more retries before giving up. This might help the system recover without manual intervention.

  2. Stagger Restarts: If possible, avoid restarting all receive and distributor pods simultaneously. Staggering the restarts can reduce the likelihood of encountering simultaneous disconnection errors.

  3. Monitor and Automate Recovery: Implement monitoring to detect when the system enters this error state and automate the restart of distributor pods to recover from the loop without manual intervention.

These steps might help improve the resilience of your setup when using Cap'n Proto.


@MichaHoffmann
Contributor

Do you use an ingester/router split? If you do, then you cannot create a loop in the resulting bipartite directed graph! I think that's the recommended way to not end up with replication loops.
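
To make the graph argument concrete, here is a toy sketch (hypothetical node names, not a Thanos data structure): when routers forward only to ingesters and ingesters forward to nobody, the directed forwarding graph is bipartite with edges in one direction only, so a depth-first search finds no cycle and no replication loop can form at the routing layer.

```go
// Toy illustration of "a bipartite forwarding graph has no cycles".
// The forwarding map and node names are hypothetical.
package main

import "fmt"

// hasCycle runs a depth-first search over a directed forwarding graph.
func hasCycle(graph map[string][]string) bool {
	// state values: 0 = unvisited (map zero value), 1 = in progress, 2 = done
	const (
		inProgress = 1
		done       = 2
	)
	state := map[string]int{}
	var visit func(node string) bool
	visit = func(node string) bool {
		switch state[node] {
		case inProgress:
			return true // a back edge means a cycle
		case done:
			return false
		}
		state[node] = inProgress
		for _, next := range graph[node] {
			if visit(next) {
				return true
			}
		}
		state[node] = done
		return false
	}
	for node := range graph {
		if visit(node) {
			return true
		}
	}
	return false
}

func main() {
	// Routers forward only to ingesters; ingesters forward to nobody.
	split := map[string][]string{
		"router-0":   {"ingester-0", "ingester-1", "ingester-2"},
		"router-1":   {"ingester-0", "ingester-1", "ingester-2"},
		"ingester-0": {},
		"ingester-1": {},
		"ingester-2": {},
	}
	fmt.Println(hasCycle(split)) // prints false: no replication loop is possible
}
```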

@TheReal1604
Author

TheReal1604 commented May 22, 2025

@MichaHoffmann Can you elaborate and clarify a bit further? Yes, we use an ingester/router split. But I think you are writing about a logical replication loop (in the underlying architecture).

I mean the distributor/router is kept in an endless error loop, trying to replicate the data that was ingested when the receive-only nodes got restarted.

As we cannot reproduce that issue with protobuf/gRPC, my assumption is that this might be a bug in the error handling loop of the capnproto implementation.

Thanks!

@pvlltvk

pvlltvk commented May 25, 2025

@TheReal1604
Hi!
Did you start noticing this after upgrading to 0.38.0? I've been using the same scheme as you on 0.37.2 for a while and haven't noticed this behaviour (there were some chaotic restarts during that period)

@GiedriusS
Member

I see this too in our setup, and it's some bug in the capnp implementation. There have been some fixes in https://github.com/capnproto/go-capnp/releases. I will update the Go module for the next RC and let's see if it still happens.

@TheReal1604
Author

> @TheReal1604 Hi! Did you start noticing this after upgrading to 0.38.0? I've been using the same scheme as you on 0.37.2 for a while and haven't noticed this behaviour (there were some chaotic restarts during that period)

We unfortunately experienced this on version 0.37.2 as well.

Good to see it confirmed and fixed @GiedriusS thanks! 😍
