Kademlia full bucket node eviction logic #5970
Replies: 2 comments 1 reply
-
I found this after searching for "ping": rust-libp2p/protocols/kad/src/handler.rs, lines 883 to 887 in a898a04. The line was added in #3152.
-
rust-libp2p/protocols/kad/src/kbucket/bucket.rs, lines 310 to 365 in 9f85d5c. From glancing at the code, IMO seeing the buckets as an LRU cache may be misleading. It is important to prune peers from a bucket as soon as they go offline, because holding unreachable peers in the routing table hurts the network. However, it isn't practical to keep pinging peers too often. The best strategy is probably to periodically ping the peers in the buckets and remove them when they don't answer. And for buckets that are not full, run [truncated]. Hence the replacement strategy isn't exactly LRU, but rather "keep peers as long as they are online": no peer should be replaced unless it is offline.
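That policy could be sketched as a periodic maintenance pass over a bucket. This is a hypothetical helper for illustration, not the rust-libp2p API; the `ping` closure stands in for a real liveness probe:

```rust
/// Sketch of the proposal above: periodically probe every peer in a bucket
/// and drop the ones that don't answer, so a peer is only ever evicted
/// because it is offline. (Hypothetical helper, not rust-libp2p code.)
fn prune_offline(bucket: &mut Vec<u64>, ping: impl Fn(u64) -> bool) {
    // Keep exactly the peers that answered the liveness probe.
    bucket.retain(|&peer| ping(peer));
}

fn main() {
    let mut bucket = vec![1, 2, 3, 4];
    // Pretend only even-numbered peers answer the ping.
    prune_offline(&mut bucket, |peer| peer % 2 == 0);
    assert_eq!(bucket, vec![2, 4]);
    println!("{:?}", bucket);
}
```

Under this policy a full bucket never churns at all while its members stay reachable, which matches the "no peer should be replaced unless it is offline" rule.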
-
Hey everyone,
I've been reviewing the Kademlia implementation in rust-libp2p, specifically how it handles the eviction of nodes from full buckets when a new pending node is introduced.
According to the Kademlia specification, existing nodes in a bucket should always be favored. When a new node attempts to join a full bucket, the least recently used (LRU) node should only be replaced if it is found to be unresponsive. This is typically done via a PING check. The original Kademlia paper also suggests an optimization where nodes are not pinged immediately but only when the application actually needs to send a message to the LRU node.
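The spec behavior described above can be sketched as follows. The `KBucket` type and `NodeId` alias are hypothetical illustrations, and the `ping` closure stands in for the PING RPC:

```rust
use std::collections::VecDeque;

/// Hypothetical node ID type for illustration.
type NodeId = u64;

/// A k-bucket holding at most `k` peers; least recently seen at the front.
struct KBucket {
    k: usize,
    peers: VecDeque<NodeId>,
}

impl KBucket {
    fn new(k: usize) -> Self {
        KBucket { k, peers: VecDeque::new() }
    }

    /// Classic Kademlia insertion: existing peers are always favored.
    /// `ping` is a caller-supplied liveness probe (a stand-in for PING).
    fn insert(&mut self, new_peer: NodeId, ping: impl Fn(NodeId) -> bool) {
        // Already known: refresh to the most-recently-seen position.
        if let Some(pos) = self.peers.iter().position(|&p| p == new_peer) {
            let p = self.peers.remove(pos).unwrap();
            self.peers.push_back(p);
            return;
        }
        if self.peers.len() < self.k {
            self.peers.push_back(new_peer);
            return;
        }
        // Bucket full: ping the least-recently-seen peer before deciding.
        let lru = *self.peers.front().unwrap();
        self.peers.pop_front();
        if ping(lru) {
            // LRU is alive: keep it (now most recent), discard the new peer.
            self.peers.push_back(lru);
        } else {
            // LRU is unresponsive: evict it and admit the new peer.
            self.peers.push_back(new_peer);
        }
    }
}

fn main() {
    let mut bucket = KBucket::new(2);
    bucket.insert(1, |_| true);
    bucket.insert(2, |_| true);
    // Full bucket, LRU (peer 1) answers the ping: peer 3 is discarded.
    bucket.insert(3, |_| true);
    assert_eq!(bucket.peers, [2, 1]);
    // Full bucket, LRU (peer 2) fails the ping: it is evicted for peer 4.
    bucket.insert(4, |_| false);
    assert_eq!(bucket.peers, [1, 4]);
    println!("{:?}", bucket.peers);
}
```

The key property is that a responsive LRU node can never be displaced; the delayed-ping optimization from the paper only changes *when* the probe happens, not this invariant.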
However, looking at the implementation here,
it seems that a pending node will almost always replace the LRU node within a full bucket. Instead of immediately pinging the LRU node to check for liveness, libp2p defers this check until a natural reconnection occurs. Meanwhile, the pending node is assigned a timeout (defaults to 60s). When this timeout expires, the pending node will replace the current LRU node—unless that LRU node is still connected, in which case the pending node is discarded.
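The deferred behavior described above could be sketched like this. All types and field names here are hypothetical illustrations of the mechanism, not the actual rust-libp2p internals:

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq, Debug)]
enum Status { Connected, Disconnected }

struct Peer { id: u64, status: Status }

struct Pending { peer: Peer, deadline: Instant }

/// Hypothetical bucket modeling the pending-node mechanism.
struct Bucket {
    k: usize,
    peers: Vec<Peer>,      // index 0 = least recently seen
    pending: Option<Pending>,
    timeout: Duration,     // the pending timeout (60 s by default)
}

impl Bucket {
    /// Returns true if the peer was inserted directly.
    fn try_insert(&mut self, peer: Peer, now: Instant) -> bool {
        if self.peers.len() < self.k {
            self.peers.push(peer);
            return true;
        }
        // Full bucket: park the new node as pending; no ping is sent.
        if self.pending.is_none() {
            self.pending = Some(Pending { peer, deadline: now + self.timeout });
        }
        false
    }

    /// Called periodically: once the timeout expires, the pending node
    /// replaces the LRU peer unless that peer is currently connected.
    fn apply_pending(&mut self, now: Instant) {
        if let Some(p) = self.pending.take() {
            if now < p.deadline {
                self.pending = Some(p); // not due yet
            } else if self.peers[0].status == Status::Disconnected {
                self.peers.remove(0);    // evict the LRU peer
                self.peers.push(p.peer); // admit the pending node
            }
            // else: LRU still connected, the pending node is dropped.
        }
    }
}

fn main() {
    let now = Instant::now();
    let mut b = Bucket {
        k: 1,
        peers: vec![Peer { id: 1, status: Status::Disconnected }],
        pending: None,
        timeout: Duration::from_secs(60),
    };
    // Bucket is full, so peer 2 becomes pending instead of being inserted.
    assert!(!b.try_insert(Peer { id: 2, status: Status::Connected }, now));
    // Before the timeout, nothing changes.
    b.apply_pending(now);
    assert_eq!(b.peers[0].id, 1);
    // After the timeout, the disconnected LRU peer is replaced.
    b.apply_pending(now + Duration::from_secs(61));
    assert_eq!(b.peers[0].id, 2);
    println!("LRU replaced by pending node");
}
```

Note how the only thing protecting the LRU node in this model is its *current connection state* at the moment the timeout fires, rather than a ping-based liveness check.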
So does this mean that, unless all nodes in a full bucket stay connected at all times, a pending node will almost always replace the LRU node after the timeout?
I see two potential issues with this approach:
Does anyone have insights on the reasoning behind this design choice?
Thank you!