kad exceeds substream limit due to outbound timeout, but no inbound timeout #5981

teor2345 · 2025-04-09T06:10:49Z

Summary

When an outbound kad substream times out (10s), it is removed from the substream list, and a new outbound substream can be opened.

But on the inbound side, there appears to be no timeout, so stale inbound substreams keep counting towards the substream limit, and the node drops any new inbound substream that would exceed it. (If all other substreams are waiting for their first message, or are in another state, they can't be re-used, so the new substream gets dropped.)

This causes thousands of "substream limit exceeded" warnings on the inbound side. It can also slow down syncing a lot, in some cases making it impossible.

This bug is self-triggering, because the dropped inbound substreams also time out on the outbound side.
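
To make the asymmetry concrete, here is a toy model of one connection. It is not the libp2p code: `Model`, `simulate`, `MAX_INBOUND` and the limit of 4 are names and values made up for this sketch, and time is counted in whole seconds. The sender prunes its outbound list after 10 seconds, while the receiver never prunes, so every batch after the first is dropped.

```rust
/// Hypothetical inbound limit for this sketch; the real handler has its own constant.
const MAX_INBOUND: usize = 4;
/// Outbound timeout in seconds, matching the 10s mentioned in this issue.
const OUTBOUND_TIMEOUT_SECS: u64 = 10;

/// Toy model: the sender's outbound list and the receiver's inbound slots.
#[derive(Default)]
struct Model {
    /// Open times (seconds) of outbound substreams on the sender.
    outbound_opened_at: Vec<u64>,
    /// Inbound substreams on the receiver still waiting for a first message.
    inbound_waiting_first: usize,
    /// Inbound substreams the receiver dropped because it was at its limit.
    dropped: usize,
}

impl Model {
    /// The sender opens a substream at `now`; the receiver accepts or drops it.
    fn open(&mut self, now: u64) {
        self.outbound_opened_at.push(now);
        if self.inbound_waiting_first < MAX_INBOUND {
            self.inbound_waiting_first += 1;
        } else {
            self.dropped += 1;
        }
    }

    /// The sender removes timed-out outbound substreams.
    /// The receiver has no equivalent step, which is the bug being modelled.
    fn expire_outbound(&mut self, now: u64) {
        self.outbound_opened_at
            .retain(|&opened| now - opened < OUTBOUND_TIMEOUT_SECS);
    }
}

/// Every 10 seconds the sender expires old substreams and opens a full batch.
/// No message is ever sent, so inbound substreams stay in the waiting state.
/// Returns (outbound count, inbound count, dropped count).
fn simulate(rounds: u64) -> (usize, usize, usize) {
    let mut model = Model::default();
    for round in 0..rounds {
        let now = round * OUTBOUND_TIMEOUT_SECS;
        model.expire_outbound(now);
        for _ in 0..MAX_INBOUND {
            model.open(now);
        }
    }
    (
        model.outbound_opened_at.len(),
        model.inbound_waiting_first,
        model.dropped,
    )
}
```

In this model the outbound count stays at the limit, while the dropped count grows by a full batch in every round after the first, which matches the stream of warnings described above.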

Edit: this is not a duplicate of #3236. The cause is different, and this bug only happens under specific load conditions.

Expected behavior

Inbound substreams time out after approximately 10 seconds.

Ideally the inbound timeout is slightly shorter, because the timeout starts on the outbound side immediately, but only starts on the inbound side after the network transmission delay. If there is a long network delay for earlier substreams, but a short network delay for later substreams, this warning can still happen occasionally.
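
The timing argument can be checked with a little arithmetic. This is only a sketch: `slot_free_for_replacement` is a hypothetical helper, and it assumes the sender opens the replacement substream at the exact moment its outbound timeout fires.

```rust
/// Returns true if the inbound slot held by an earlier substream has already
/// timed out when the replacement substream reaches the receiver.
///
/// All values are in milliseconds, measured from the moment the sender
/// opened the earlier substream (t = 0).
fn slot_free_for_replacement(
    outbound_timeout_ms: u64,
    inbound_timeout_ms: u64,
    earlier_delay_ms: u64,
    later_delay_ms: u64,
) -> bool {
    // The receiver starts its timer only once the earlier substream arrives.
    let inbound_expires_at = earlier_delay_ms + inbound_timeout_ms;
    // Assumption: the replacement is opened exactly when the outbound timeout fires.
    let replacement_arrives_at = outbound_timeout_ms + later_delay_ms;
    inbound_expires_at <= replacement_arrives_at
}
```

With equal 10s timeouts, an earlier delay of 500ms and a later delay of 50ms leaves the slot occupied (10500 > 10050). Shortening the inbound timeout to 9s frees it (9500 <= 10050), but an earlier delay of 2s would still lose the race (11000 > 10050), which is the occasional warning mentioned above.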

Actual behavior

Inbound substreams that have timed out on the outbound side seem to hang around for much longer than 10s. Maybe they are only removed when a read fails on them? Or when some other error happens?

Relevant log output

2025-04-08T06:24:27.293722Z WARN Consensus: libp2p_kad::handler: New inbound substream to peer exceeds inbound substream limit. No older substream waiting to be reused. Dropping new substream. peer=PeerId("12D3KooWN6kFp2Ev181UGq3BUDfk1jfjaNu6sDTqxCZUBpmp8kRQ")

Possible Solution

On the sending side, outbound substreams only count towards the limit until they time out:

rust-libp2p/protocols/kad/src/handler.rs

Line 614 in b56b47a

StreamUpgradeError::Timeout => io::ErrorKind::TimedOut.into(),

rust-libp2p/protocols/kad/src/handler.rs

Line 819 in b56b47a

if self.outbound_substreams.len() < MAX_NUM_STREAMS {

And the outbound timeout is 10 seconds:

rust-libp2p/protocols/kad/src/handler.rs

Line 476 in b56b47a

Duration::from_secs(10),

But on the receiving side, inbound substreams count towards the limit until they've received a message:

rust-libp2p/protocols/kad/src/handler.rs

Line 815 in b56b47a

if let Poll::Ready(Some(event)) = self.inbound_substreams.poll_next_unpin(cx) {

rust-libp2p/protocols/kad/src/handler.rs

Line 938 in b56b47a

} => match substream.poll_next_unpin(cx) {

rust-libp2p/protocols/kad/src/handler.rs

Line 1013 in b56b47a

Poll::Ready(Some(Err(e))) => {

and can't be re-used if the sender times out on the first message:

rust-libp2p/protocols/kad/src/handler.rs

Line 542 in b56b47a

InboundSubstreamState::WaitingMessage { first: false, .. }

rust-libp2p/protocols/kad/src/handler.rs

Line 573 in b56b47a

first: true,

There is no inbound timeout:
https://github.com/libp2p/rust-libp2p/blob/b56b47aa6510ab4af0ae797a7f036364d414ae3e/protocols/kad/src/handler.rs#L75C5-L75C23

Here is how other protocols implement matching inbound and outbound timeouts:

rust-libp2p/protocols/relay/src/behaviour/handler.rs

Line 384 in 1206fef

inbound_workers: futures_bounded::FuturesSet::new(
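
As a rough illustration of the idea (a bounded set where each entry also has a deadline), here is a std-only sketch. It does not use the `futures_bounded` API and it is not the actual fix: `BoundedInbound`, `prune` and `try_accept` are hypothetical names, and it tracks acceptance times only rather than real substreams.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical bounded set of inbound substreams with a per-entry timeout.
struct BoundedInbound {
    capacity: usize,
    timeout: Duration,
    /// Acceptance times, oldest first.
    accepted_at: VecDeque<Instant>,
}

impl BoundedInbound {
    fn new(capacity: usize, timeout: Duration) -> Self {
        Self {
            capacity,
            timeout,
            accepted_at: VecDeque::new(),
        }
    }

    /// Removes every entry that has been held for at least `timeout`.
    /// Returns the number of entries removed.
    fn prune(&mut self, now: Instant) -> usize {
        let mut removed = 0;
        while let Some(&oldest) = self.accepted_at.front() {
            if now.duration_since(oldest) >= self.timeout {
                self.accepted_at.pop_front();
                removed += 1;
            } else {
                // Entries are ordered by time, so the rest are newer.
                break;
            }
        }
        removed
    }

    /// Prunes expired entries first, then accepts the new substream only if
    /// there is room. Returns false when the new substream must be dropped.
    fn try_accept(&mut self, now: Instant) -> bool {
        self.prune(now);
        if self.accepted_at.len() < self.capacity {
            self.accepted_at.push_back(now);
            true
        } else {
            false
        }
    }
}
```

The point of pruning before the capacity check is that a substream abandoned by the sender stops counting towards the limit after the timeout, instead of holding its slot until a read error happens to remove it.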

Version

Latest main, back to at least 0.54.2

Would you like to work on fixing this bug?

Yes


teor2345 · 2025-04-25T01:21:22Z

Ping - we now have a confirmed fix for this bug downstream in Subspace, I'm happy to open a PR for it here as well:
autonomys#2

This was referenced Apr 9, 2025

libp2p_kad::handler: New inbound substream to peer exceeds inbound substream limit autonomys/subspace#3450

Closed

Add a timeout to inbound kad substreams autonomys/rust-libp2p#2

Merged

This was referenced Apr 25, 2025

libp2p::kad exceeds substream limit due to outbound timeout, but no inbound timeout paritytech/polkadot-sdk#8333

Open

fix(kad): enforce a timeout for inbound substreams #6009

Open
