Receive: Endless loop of retried replication with capnproto and distributors #8254
The endless replication retry loop you're experiencing with Cap'n Proto is likely related to how Thanos handles retries and error recovery in the Cap'n Proto replication path. When a disconnection error occurs, Thanos attempts to reconnect and retry the write operation up to a specified number of times. If the receive instances restart in a chaotic manner, this can lead to repeated disconnection errors and thus keep triggering the retry loop [1]. The fact that reverting to gRPC with Protobuf resolves the issue suggests that the Cap'n Proto implementation may not handle these chaotic restarts as gracefully as gRPC does, possibly because of differences in how each protocol manages connections and retries. To mitigate this issue, you might consider the following approaches:
These steps might help improve the resilience of your setup when using Cap'n Proto.
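For illustration only, here is a minimal Go sketch of the kind of bounded reconnect-and-retry loop described above; this is not the actual Thanos code, and the function names and error value are hypothetical. It shows how a write path that treats every disconnection as retryable can appear to loop endlessly when the connection to a restarted peer never becomes usable again, because each incoming request re-enters the same failing retry cycle:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDisconnected stands in for a disconnection-style error; the name and
// message are illustrative and not taken from Thanos or go-capnp.
var errDisconnected = errors.New("capnp: connection closed")

// replicateWithRetry sketches a bounded retry loop: it redials and retries
// the write a fixed number of times with linear backoff. If every attempt
// fails with a disconnection error (for example because a cached connection
// to a restarted peer is never replaced), each incoming request re-enters
// this loop, which looks like an endless retry loop from the outside.
func replicateWithRetry(dial func() error, write func() error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = dial(); err != nil {
			time.Sleep(time.Duration(i+1) * 100 * time.Millisecond)
			continue
		}
		if err = write(); err == nil {
			return nil
		}
		if !errors.Is(err, errDisconnected) {
			return err // non-retryable error: give up immediately
		}
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond)
	}
	return fmt.Errorf("replication failed after %d attempts: %w", attempts, err)
}

func main() {
	// Simulate a peer that never becomes reachable again after a restart.
	err := replicateWithRetry(
		func() error { return nil },            // dial "succeeds"
		func() error { return errDisconnected }, // but every write fails
		3,
	)
	fmt.Println(err)
}
```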
Do you use an ingester/router split? If you do, then you cannot create a loop in the resulting bipartite directed graph! I think that's the recommended way to avoid ending up with replication loops.
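To make the bipartite-graph argument concrete, here is a small self-contained Go sketch (not Thanos code; the topology and helper name are hypothetical) that detects cycles in a forwarding topology. With a strict router-to-ingester split, routers forward only to ingesters and ingesters forward to nobody, so no cycle can exist:

```go
package main

import "fmt"

// hasForwardingCycle reports whether the forwarding graph contains a cycle.
// forwardsTo maps each node to the nodes it replicates/forwards to.
func hasForwardingCycle(forwardsTo map[string][]string) bool {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := map[string]int{}
	var visit func(n string) bool
	visit = func(n string) bool {
		switch state[n] {
		case inProgress:
			return true // back-edge found: cycle
		case done:
			return false
		}
		state[n] = inProgress
		for _, next := range forwardsTo[n] {
			if visit(next) {
				return true
			}
		}
		state[n] = done
		return false
	}
	for n := range forwardsTo {
		if visit(n) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical split topology: two routers fan out to three ingesters,
	// and the ingesters never forward, so the graph is bipartite and acyclic.
	split := map[string][]string{
		"router-0":   {"ingester-0", "ingester-1", "ingester-2"},
		"router-1":   {"ingester-0", "ingester-1", "ingester-2"},
		"ingester-0": nil, "ingester-1": nil, "ingester-2": nil,
	}
	fmt.Println(hasForwardingCycle(split)) // false
}
```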
@MichaHoffmann Can you elaborate and clarify a bit further? Yes, we use an ingester/router split. But I think you are describing a logical replication loop (in the underlying architecture). What I mean is that the distributor/router is stuck in an endless error loop, trying to replicate the data that was ingested while the receive-only nodes were being restarted. Since we cannot reproduce the issue with Protobuf/gRPC, my assumption is that this might be a bug in the error-handling loop of the Cap'n Proto implementation. Thanks!
@TheReal1604 I see this too in our setup, and it's some bug in the capnp implementation. There have been some fixes (https://github.com/capnproto/go-capnp/releases). I will update the Go module for the next RC and we'll see if it still happens.
We unfortunately experienced this on version 0.37.2 as well. Good to see it confirmed and fixed, @GiedriusS, thanks! 😍
Thanos, Prometheus and Golang version used: 0.38.0
bitnami/thanos:0.38.0-debian-12-r3
Object Storage Provider:
Openstack-s3
What happened:
We use a Thanos setup with 3-5 receivers and dedicated thanos-receive routing instances, which use Cap'n Proto as the replication protocol. The replication_factor is set to 3.
Currently we only have a static hashring configuration.
Unfortunately, we can trigger something like an endless replication retry loop if the receive instances restart in a chaotic way (triggered in our case by Kubernetes node rollovers in our custom clusters).
When this happens, the distributor pods log the following errors very frequently:
The receive pods restart at the same time, but the distributor cannot recover from that error state. I ultimately have to kill the distributor pods for them to work correctly again. The regular receive pods keep working fine (they don't need a restart or anything).
The interesting thing: we reverted back to gRPC with Protobuf and couldn't reproduce this issue, so it seems to be something in the Cap'n Proto implementation.
What you expected to happen:
Successful recovery from the above error state when the receive pods and the router/distributor pods are restarted.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know: