Insufficient Peers Error When Node Rejoins #5852

vinay10949 · 2025-02-06T19:06:42Z

When a node goes down and rejoins the network using Rust libp2p, it encounters an insufficient peers error. We would like to understand which parameters need to be adjusted to mitigate this issue.

Code Context:
The following code initializes the libp2p swarm:

pub fn new(key: Keypair, relay_behaviour: relay::client::Behaviour) -> Result<Self, Box<dyn std::error::Error>> {
    let peer_id = key.public().to_peer_id();
    let message_id_fn = |message: &gossipsub::Message| {
        let s = mishti_crypto::hash256(&message.data);
        gossipsub::MessageId::from(s)
    };

    let gossipsub_config = gossipsub::ConfigBuilder::default()
        .heartbeat_interval(Duration::from_secs(HEART_BEAT_INTERVAL))
        .validation_mode(gossipsub::ValidationMode::Strict)
        .duplicate_cache_time(Duration::from_secs(DUPLICATE_CACHE_TIME))
        .message_id_fn(message_id_fn)
        .max_messages_per_rpc(Some(MAX_MESSAGES_PER_RPC))
        .build()
        .map_err(|msg| io::Error::new(io::ErrorKind::Other, msg))?;

    let gossipsub = gossipsub::Behaviour::new(gossipsub::MessageAuthenticity::Signed(key.clone()), gossipsub_config)?;

    let mut kad_config = kad::Config::default();
    kad_config.set_query_timeout(Duration::from_secs(60));

    let store = kad::store::MemoryStore::new(peer_id);
    let kademlia = kad::Behaviour::with_config(peer_id, store, kad_config);

    Ok(Self {
        gossipsub,
        kademlia,
        relay_client: relay_behaviour,
        request_response_behaviour: cbor::Behaviour::new([(StreamProtocol::new("/String"), ProtocolSupport::Full)], Config::default()),
    })
}

const DUPLICATE_CACHE_TIME: u64 = 10;
const HEART_BEAT_INTERVAL: u64 = 5;
const MAX_MESSAGES_PER_RPC: usize = 10000;

The swarm is initialized as follows:

pub async fn init_swarm(keypair: Option<Keypair>, bootstrap_addresses: Option<Vec<(PeerId, Multiaddr)>>, port: String) -> Result<Swarm<MyBehaviour>, Box<dyn Error>> {
    let builder = if let Some(keypair) = keypair {
        SwarmBuilder::with_existing_identity(keypair)
    } else {
        SwarmBuilder::with_new_identity()
    };

    let mut swarm = builder
        .with_tokio()
        .with_tcp(tcp::Config::default().port_reuse(true), noise::Config::new, yamux::Config::default)?
        .with_quic()
        .with_relay_client(noise::Config::new, yamux::Config::default)?
        .with_behaviour(|keypair, relay_behaviour| {
            if bootstrap_addresses.is_none() {
                info!("Bootstrap Peer ID :{}", keypair.public().to_peer_id());
            }
            MyBehaviour::new(keypair.clone(), relay_behaviour).unwrap()
        })?
        .with_swarm_config(|c| c.with_idle_connection_timeout(Duration::from_secs(60)))
        .build();

    if let Some(ref bootstrap_addresses) = bootstrap_addresses {
        for (peer_id, multi_addr) in bootstrap_addresses {
            swarm.behaviour_mut().kademlia.add_address(peer_id, multi_addr.clone());
            swarm.behaviour_mut().kademlia.bootstrap()?;
        }
    }

    swarm.behaviour_mut().gossipsub.subscribe(&IdentTopic::new(NETWORK_TOPIC))?;
    let listen_address = format!("/ip4/0.0.0.0/udp/{}/quic-v1", port);
    swarm.listen_on(listen_address.parse()?)?;
    Ok(swarm)
}

Expected Behavior:
When a node rejoins the network, it should successfully reconnect to peers and resume normal operations.

Actual Behavior:
After rejoining, the node logs an insufficient peers error.

Questions:

Are there specific parameters in gossipsub, kademlia, or swarm that should be adjusted to handle node reconnection better?
Should additional bootstrap mechanisms be used when a node rejoins?
Would increasing query_timeout, heartbeat_interval, or duplicate_cache_time help in this scenario?

Any guidance on resolving this issue would be greatly appreciated!

dariusc93 · 2025-02-06T21:00:42Z

Hey! Could you provide some logs? I do know there is an issue with the QUIC transport: when a node disconnects and then reconnects on the same port before the old connection has actually timed out, that connection is not reused. See #5097. If that is the case, the workaround would be to set the timeout and keep-alive low enough that the connection times out quickly whenever the peer disconnects in any manner. See #5097 (comment) for the parameters I use. The side effects of using such low durations are hard to gauge. Another workaround would be to try TCP instead and see if the issue still happens.
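For the TCP workaround, here is a minimal sketch of the listen-address change. The QUIC form is copied from `init_swarm` in the issue. `tcp_listen_addr` and the port value are hypothetical, added only for illustration.

```rust
/// Builds the QUIC listen address in the same form the issue's `init_swarm` uses.
fn quic_listen_addr(port: &str) -> String {
    format!("/ip4/0.0.0.0/udp/{}/quic-v1", port)
}

/// Builds the equivalent TCP listen address (hypothetical helper, not from the issue).
fn tcp_listen_addr(port: &str) -> String {
    format!("/ip4/0.0.0.0/tcp/{}", port)
}

fn main() {
    let port = "4001"; // assumed port, for illustration only
    println!("{}", quic_listen_addr(port));
    println!("{}", tcp_listen_addr(port));
}
```

Either string can be passed to `swarm.listen_on(addr.parse()?)` as in the original code. The builder in the issue already calls `with_tcp`, so presumably only the listen address and the bootstrap multiaddrs would need to switch to TCP.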

vinay10949 · 2025-02-06T21:06:39Z

I tried this out, and it worked! Lowering only the timeout doesn't work.

let mut config = libp2p::quic::Config::new(&keypair.unwrap().clone());
// Idle timeout is given in milliseconds; the keep-alive interval must stay below it.
config.max_idle_timeout = 300;
config.keep_alive_interval = Duration::from_millis(100);

let mut kad_config = kad::Config::default();
kad_config.set_query_timeout(Duration::from_secs(30));
kad_config.set_replication_factor(std::num::NonZero::new(4).unwrap());

if let Some(ref bootstrap_addresses) = bootstrap_addresses {
    for (peer_id, multi_addr) in bootstrap_addresses {
        swarm.behaviour_mut().kademlia.add_address(peer_id, multi_addr.clone());
        swarm.dial(multi_addr.clone())?;
    }
    swarm.behaviour_mut().kademlia.bootstrap()?;
}
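The snippet above builds the QUIC config but does not show how it reaches the swarm. One way to wire it in, as a sketch rather than the author's actual code, is `SwarmBuilder::with_quic_config` in place of `with_quic()`. This assumes a recent libp2p release with the `tokio`, `tcp`, `noise`, `yamux`, `quic` and `ping` features enabled. `ping::Behaviour` is only a stand-in for `MyBehaviour`, and `build_swarm` is a hypothetical name.

```rust
use std::{error::Error, time::Duration};

use libp2p::{identity::Keypair, noise, ping, tcp, yamux, Swarm, SwarmBuilder};

// Hypothetical helper: builds a swarm whose QUIC transport uses the
// short idle timeout and keep-alive values reported to work in this thread.
fn build_swarm(keypair: Keypair) -> Result<Swarm<ping::Behaviour>, Box<dyn Error>> {
    let swarm = SwarmBuilder::with_existing_identity(keypair)
        .with_tokio()
        .with_tcp(tcp::Config::default(), noise::Config::new, yamux::Config::default)?
        // The closure receives the default QUIC config and returns the adjusted one.
        .with_quic_config(|mut cfg| {
            cfg.max_idle_timeout = 300; // milliseconds
            cfg.keep_alive_interval = Duration::from_millis(100);
            cfg
        })
        // Stand-in behaviour; the issue uses `MyBehaviour` here.
        .with_behaviour(|_key| ping::Behaviour::default())?
        .build();
    Ok(swarm)
}
```

As noted earlier in the thread, the side effects of such short durations are hard to gauge, so the values are worth revisiting for each deployment.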

vinay10949 closed this as completed Feb 7, 2025
