kad exceeds substream limit due to outbound timeout, but no inbound timeout #5981

Open
teor2345 opened this issue Apr 9, 2025 · 1 comment · May be fixed by #6009
Comments

@teor2345
teor2345 commented Apr 9, 2025

Summary

When an outbound kad substream times out (10s), it is removed from the substream list, and a new outbound substream can be opened.

But on the inbound side, there appears to be no timeout, so the node only drops new inbound substreams when they are over its substream limit. (If all other substreams are waiting for the first message, or in another state, those substreams can't be re-used. So the new substream gets dropped.)

This causes thousands of "substream limit exceeded" warnings on the inbound side. It can also slow down syncing a lot, in some cases making it impossible.

This bug is self-triggering, because the dropped inbound substreams also time out on the outbound side.

Edit: this is not a duplicate of #3236. The cause is different, and it only happens under specific load conditions.
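
To make the feedback loop in the summary concrete, here is a minimal tick-based model. It is a sketch, not the handler's code: the limit of 32, the sender that opens one substream per tick, and the receiver that never gets to read are all assumptions for illustration. Only the 10 second outbound timeout comes from this issue.

```rust
/// Assumed per-connection substream limit (illustrative value).
const MAX_NUM_STREAMS: usize = 32;
/// Outbound timeout from this issue, with one tick standing in for one second.
const OUTBOUND_TIMEOUT_TICKS: u64 = 10;

#[derive(Debug)]
struct Stats {
    /// Highest number of outbound substreams counted at once.
    max_outbound: usize,
    /// Inbound substreams still held when the run ends.
    inbound_held: usize,
    /// New inbound substreams dropped because the limit was reached.
    dropped: u64,
}

/// Hypothetical model: the sender opens one substream per tick and the
/// receiver is too busy to read any of them.
fn simulate(ticks: u64) -> Stats {
    let mut outbound: Vec<u64> = Vec::new(); // tick each outbound substream was opened
    let mut inbound: Vec<u64> = Vec::new(); // tick each inbound substream was accepted
    let mut dropped = 0;
    let mut max_outbound = 0;

    for now in 0..ticks {
        // Outbound side: timed-out substreams stop counting towards the limit.
        outbound.retain(|&opened| now - opened < OUTBOUND_TIMEOUT_TICKS);
        // Inbound side: nothing expires, so `inbound` is never pruned.

        if outbound.len() < MAX_NUM_STREAMS {
            outbound.push(now);
            if inbound.len() < MAX_NUM_STREAMS {
                inbound.push(now);
            } else {
                // The sender will only notice this via its own timeout.
                dropped += 1;
            }
        }
        max_outbound = max_outbound.max(outbound.len());
    }

    Stats { max_outbound, inbound_held: inbound.len(), dropped }
}
```

In this model, `simulate(100)` never counts more than 10 outbound substreams at once, while the inbound side fills all 32 slots and drops the remaining 68 substreams.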

Expected behavior

Inbound substreams time out after approximately 10 seconds.

Ideally the inbound timeout is slightly shorter, because the timeout starts on the outbound side immediately, but only starts on the inbound side after the network transmission delay. If there is a long network delay for earlier substreams, but a short network delay for later substreams, this warning can still happen occasionally.
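
The timing argument above can be checked with a little arithmetic. This is a sketch under assumed numbers: `overlap` is a hypothetical helper, and the network delays in the usage note are made up.

```rust
use std::time::Duration;

/// How long the old inbound substream still holds its slot after the
/// replacement substream arrives. Times are measured from the moment the
/// sender opened the old substream.
fn overlap(
    outbound_timeout: Duration,
    inbound_timeout: Duration,
    old_delay: Duration,
    new_delay: Duration,
) -> Duration {
    // The old substream reaches the receiver after `old_delay`, and the
    // inbound timer only starts then.
    let old_expires = old_delay + inbound_timeout;
    // The sender opens the replacement as soon as its own timer fires.
    let new_arrives = outbound_timeout + new_delay;
    // Zero means the old slot was already free when the replacement arrived.
    old_expires.saturating_sub(new_arrives)
}
```

With equal 10s timeouts, a 500ms delay on the old substream and a 50ms delay on the new one, the old substream still holds its slot for 450ms after the replacement arrives. Shortening the inbound timeout to 9s brings the overlap to zero for these delays.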

Actual behavior

Inbound substreams which have been timed out on the outbound side seem to hang around for much longer than 10s. Maybe they are only removed when a read fails on them? Or some other error happens?

Relevant log output

2025-04-08T06:24:27.293722Z WARN Consensus: libp2p_kad::handler: New inbound substream to peer exceeds inbound substream limit. No older substream waiting to be reused. Dropping new substream. peer=PeerId("12D3KooWN6kFp2Ev181UGq3BUDfk1jfjaNu6sDTqxCZUBpmp8kRQ")

Possible Solution

On the sending side, outbound substreams only count towards the limit until they time out:

StreamUpgradeError::Timeout => io::ErrorKind::TimedOut.into(),

if self.outbound_substreams.len() < MAX_NUM_STREAMS {

And the outbound timeout is 10 seconds:

Duration::from_secs(10),

But on the receiving side, inbound substreams count towards the limit until they've received a message:
    if let Poll::Ready(Some(event)) = self.inbound_substreams.poll_next_unpin(cx) {

    } => match substream.poll_next_unpin(cx) {

    Poll::Ready(Some(Err(e))) => {

and can't be re-used if the sender times out on the first message:

InboundSubstreamState::WaitingMessage { first: false, .. }
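
A simplified model of the inbound admission rule described above might look like this. `InboundState`, `Admission`, and `admit` are hypothetical names for this sketch. The real handler tracks more states and handles the replaced substream differently, so treat this as an illustration of the rule only.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum InboundState {
    /// `first: true` means no message has been received yet.
    /// `first: false` means the substream is idle and can be reused.
    WaitingMessage { first: bool },
    /// Stands in for every other state, none of which can be reused.
    Busy,
}

#[derive(Debug, PartialEq)]
enum Admission {
    Accepted,
    ReplacedIdle,
    Dropped,
}

/// Decide what happens to a new inbound substream.
fn admit(substreams: &mut Vec<InboundState>, limit: usize) -> Admission {
    if substreams.len() < limit {
        substreams.push(InboundState::WaitingMessage { first: true });
        return Admission::Accepted;
    }
    // At the limit: only an idle substream can make room.
    let idle = substreams
        .iter_mut()
        .find(|s| matches!(s, InboundState::WaitingMessage { first: false }));
    match idle {
        Some(slot) => {
            // Simplification: overwrite the idle entry with the new substream.
            *slot = InboundState::WaitingMessage { first: true };
            Admission::ReplacedIdle
        }
        // Every slot is `first: true` or busy, so the new substream is dropped.
        None => Admission::Dropped,
    }
}
```

A substream whose sender timed out before sending anything stays in `WaitingMessage { first: true }`, so it never becomes a candidate for replacement. Once every slot is in that state, each new substream ends up as `Admission::Dropped`.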

There is no inbound timeout:
https://github.com/libp2p/rust-libp2p/blob/b56b47aa6510ab4af0ae797a7f036364d414ae3e/protocols/kad/src/handler.rs#L75C5-L75C23

Here is how other protocols implement matching inbound and outbound timeouts:

inbound_workers: futures_bounded::FuturesSet::new(
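
The idea behind a bounded set with a per-entry timeout can be sketched with only the standard library. This is not the `futures_bounded` API and not the downstream fix. `BoundedInbound` and its methods are hypothetical, and the caller passes in `now` so the behaviour is easy to test.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the inbound substream set: it stores only
/// the deadline of each entry.
struct BoundedInbound {
    capacity: usize,
    timeout: Duration,
    deadlines: Vec<Instant>,
}

impl BoundedInbound {
    fn new(timeout: Duration, capacity: usize) -> Self {
        Self { capacity, timeout, deadlines: Vec::new() }
    }

    /// Remove entries whose deadline has passed and return how many were removed.
    fn expire(&mut self, now: Instant) -> usize {
        let before = self.deadlines.len();
        self.deadlines.retain(|&deadline| deadline > now);
        before - self.deadlines.len()
    }

    /// Try to admit a new entry at `now`. Returns `false` if the set is
    /// still full after expired entries have been removed.
    fn try_push(&mut self, now: Instant) -> bool {
        self.expire(now);
        if self.deadlines.len() >= self.capacity {
            return false;
        }
        self.deadlines.push(now + self.timeout);
        true
    }

    fn len(&self) -> usize {
        self.deadlines.len()
    }
}
```

Because expired entries are removed before the capacity check, a substream abandoned by the sender frees its slot after `timeout` instead of holding it until a read fails.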

Version

Latest main back to at least 0.54.2

Would you like to work on fixing this bug?

Yes

@teor2345
Author

Ping: we now have a confirmed fix for this bug downstream in Subspace. I'm happy to open a PR for it here as well:
autonomys#2
