
Receive: Endless loop of retried replication with capnproto and distributors #8254


Open
TheReal1604 opened this issue May 19, 2025 · 6 comments


@TheReal1604

Thanos, Prometheus and Golang version used: 0.38.0

bitnami/thanos:0.38.0-debian-12-r3

Object Storage Provider:
Openstack-s3

What happened:
We use a thanos setup with 3-5 receivers and dedicated thanos-receive routing instances, which use capnproto as a replication protocol. The replication_factor is set to 3.

Currently we only have a static hashring configuration.

```yaml
data:
  hashrings.json: |-
    [
      {
        "endpoints": [
          "thanos-receive-0.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-3.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-4.thanos-receive-headless.customer1.svc.cluster.local:19391"
        ]
      }
    ]
```

Unfortunately, we can trigger something like an endless replication retry loop if the receive instances restart in a chaotic way (this is triggered by k8s node rollovers in our custom clusters).

When this happens, the distributor pods log the following error very often:

```
Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: 2 errors: forwarding request to endpoint {thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391 thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391 }: failed writing to peer: pkg/receive/writecapnp/write_request.capnp:Writer.write: rpc: bootstrap: send message: context deadline exceeded; forwarding request to endpoint {thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391 thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391 }: failed writing to peer: pkg/receive/writecapnp/write_request.capnp:Writer.write: rpc: bootstrap: send message: context deadline exceeded
```

The receive pods restart at the same time, but the distributor cannot recover from that error state. I ultimately have to kill the distributor pods for everything to work correctly again. The normal receive pods are working fine (they don't need a restart or anything).

The interesting thing: we reverted back to gRPC with protobuf and couldn't reproduce this issue, so it seems to be something in the capnproto implementation.

What you expected to happen:
Recover successfully from the above error when receive pods and router/distributor pods are restarted.

How to reproduce it (as minimally and precisely as possible):

  1. Create a Thanos setup (3-5 receive pods, 2 distributor/router pods)
  2. Set up replication with replication factor 3 and use capnproto as the replication protocol
  3. Set up a fixed hashring.json replication ConfigMap
  4. Send test data to the distributor pods
  5. Restart the distributor and receive pods in a chaotic way
  6. -> Stuck in this error loop

Anything else we need to know:
  • We are using the Bitnami Thanos Helm chart.

dosubot bot commented May 19, 2025

The issue you're experiencing with the endless replication retry loop when using Cap'n Proto in Thanos is likely related to how Thanos handles retries and error recovery in the Cap'n Proto replication protocol. When a disconnection error occurs, Thanos attempts to reconnect and retry the write operation up to a specified number of times. However, if the receive instances restart in a chaotic manner, it can lead to repeated disconnection errors and thus trigger the endless retry loop [1].
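
For illustration only, the reconnect-and-retry pattern described above could look roughly like the sketch below. None of this is Thanos source code: `sendBatch` is a hypothetical stand-in for the generated `Writer.write` call, and the attempt count and backoff are assumptions. The point is that a retry only helps if the dead `rpc.Conn` is discarded and a fresh connection is dialed; retrying on the same broken session keeps producing the same `context deadline exceeded` error.

```go
// Illustrative sketch, not Thanos code: retry a Cap'n Proto write by
// re-dialing the peer on every failed attempt instead of reusing a
// session whose underlying TCP connection may already be dead.
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"capnproto.org/go/capnp/v3"
	"capnproto.org/go/capnp/v3/rpc"
)

// sendBatch is a hypothetical stand-in for the generated Writer.write call.
func sendBatch(ctx context.Context, client capnp.Client) error {
	_, _ = ctx, client
	return nil
}

// replicateWithReconnect dials addr, performs one write, and tears the
// connection down again; on error the next attempt starts from a fresh dial.
func replicateWithReconnect(ctx context.Context, addr string, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		tcpConn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err != nil {
			lastErr = err
			time.Sleep(time.Second) // assumed fixed backoff
			continue
		}
		conn := rpc.NewConn(rpc.NewStreamTransport(tcpConn), nil)
		client := conn.Bootstrap(ctx)

		err = sendBatch(ctx, client)
		client.Release()
		_ = conn.Close()
		if err == nil {
			return nil
		}
		// The failed session is dropped here; nothing stale is kept for reuse.
		lastErr = err
	}
	return fmt.Errorf("replication to %s failed after %d attempts: %w", addr, attempts, lastErr)
}

func main() {
	// The address is a placeholder; resolving it will fail outside the cluster.
	err := replicateWithReconnect(context.Background(), "thanos-receive-0:19391", 3)
	fmt.Println(err)
}
```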

The fact that reverting to gRPC with Protobuf resolves the issue suggests that the Cap'n Proto implementation may not handle these chaotic restarts as gracefully as gRPC does. This could be due to differences in how each protocol manages connections and retries.

To mitigate this issue, you might consider the following approaches:

  1. Increase the Reconnect Attempts: Adjust the number of reconnect attempts in the Cap'n Proto configuration to allow for more retries before giving up. This might help the system recover without manual intervention.

  2. Stagger Restarts: If possible, avoid restarting all receive and distributor pods simultaneously. Staggering the restarts can reduce the likelihood of encountering simultaneous disconnection errors.

  3. Monitor and Automate Recovery: Implement monitoring to detect when the system enters this error state and automate the restart of distributor pods to recover from the loop without manual intervention.

These steps might help improve the resilience of your setup when using Cap'n Proto.


@MichaHoffmann
Contributor

Do you use an ingester/router split? If you do, then you cannot create a loop in the resulting bipartite directed graph! I think that's the recommended way to not end up with replication loops.
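
To make the graph argument concrete, here is a toy sketch (hypothetical node names, not a Thanos data structure): when routers forward only to ingesters and ingesters forward to nobody, the directed forwarding graph is bipartite with edges in one direction only, so a depth-first search finds no cycle and no replication loop can form at the routing layer.

```go
// Toy illustration of "a bipartite forwarding graph has no cycles".
// The forwarding map and node names are hypothetical.
package main

import "fmt"

// hasCycle runs a depth-first search over a directed forwarding graph.
func hasCycle(graph map[string][]string) bool {
	// state values: 0 = unvisited (map zero value), 1 = in progress, 2 = done
	const (
		inProgress = 1
		done       = 2
	)
	state := map[string]int{}
	var visit func(node string) bool
	visit = func(node string) bool {
		switch state[node] {
		case inProgress:
			return true // a back edge means a cycle
		case done:
			return false
		}
		state[node] = inProgress
		for _, next := range graph[node] {
			if visit(next) {
				return true
			}
		}
		state[node] = done
		return false
	}
	for node := range graph {
		if visit(node) {
			return true
		}
	}
	return false
}

func main() {
	// Routers forward only to ingesters; ingesters forward to nobody.
	split := map[string][]string{
		"router-0":   {"ingester-0", "ingester-1", "ingester-2"},
		"router-1":   {"ingester-0", "ingester-1", "ingester-2"},
		"ingester-0": {},
		"ingester-1": {},
		"ingester-2": {},
	}
	fmt.Println(hasCycle(split)) // prints false: no replication loop is possible
}
```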

@TheReal1604
Author

TheReal1604 commented May 22, 2025

@MichaHoffmann Can you elaborate and clarify a bit further? Yes, we use an ingester/router split. But I think you are writing about a logical replication loop (in the underlying architecture).

I mean the distributor/router is kept in an endless error loop, trying to replicate the data that was ingested when the receive-only nodes got restarted.

As we cannot reproduce that issue with protobuf/gRPC, my assumption is that this might be a bug in the error handling loop of the capnproto implementation.

Thanks!

@pvlltvk

pvlltvk commented May 25, 2025

@TheReal1604
Hi!
Did you start noticing this after upgrading to 0.38.0? I've been using the same scheme as you on 0.37.2 for a while and haven't noticed this behaviour (there were some chaotic restarts during that period)

@GiedriusS
Member

I see this too in our setup, and it's some bug in the capnp implementation. There have been some fixes in https://github.com/capnproto/go-capnp/releases. I will update the Go module for the next RC and let's see if it still happens.

@TheReal1604
Author

> @TheReal1604 Hi! Did you start noticing this after upgrading to 0.38.0? I've been using the same scheme as you on 0.37.2 for a while and haven't noticed this behaviour (there were some chaotic restarts during that period)

We unfortunately experienced this on version 0.37.2 as well.

Good to see it confirmed and fixed @GiedriusS thanks! 😍
