Insufficient Peers Error When Node Rejoins #5852

Closed
vinay10949 opened this issue Feb 6, 2025 · 2 comments

Comments

@vinay10949

When a node goes down and rejoins the network using Rust libp2p, it encounters an insufficient peers error. We would like to understand which parameters need to be adjusted to mitigate this issue.

Code Context:
The following code initializes the libp2p swarm:

pub fn new(key: Keypair, relay_behaviour: relay::client::Behaviour) -> Result<Self, Box<dyn std::error::Error>> {
    let peer_id = key.public().to_peer_id();
    let message_id_fn = |message: &gossipsub::Message| {
        let s = mishti_crypto::hash256(&message.data);
        gossipsub::MessageId::from(s)
    };

    let gossipsub_config = gossipsub::ConfigBuilder::default()
        .heartbeat_interval(Duration::from_secs(HEART_BEAT_INTERVAL))
        .validation_mode(gossipsub::ValidationMode::Strict)
        .duplicate_cache_time(Duration::from_secs(DUPLICATE_CACHE_TIME))
        .message_id_fn(message_id_fn)
        .max_messages_per_rpc(Some(MAX_MESSAGES_PER_RPC))
        .build()
        .map_err(|msg| io::Error::new(io::ErrorKind::Other, msg))?;

    let gossipsub = gossipsub::Behaviour::new(gossipsub::MessageAuthenticity::Signed(key.clone()), gossipsub_config)?;

    let mut kad_config = kad::Config::default();
    kad_config.set_query_timeout(Duration::from_secs(60));

    let store = kad::store::MemoryStore::new(peer_id);
    let kademlia = kad::Behaviour::with_config(peer_id, store, kad_config);

    Ok(Self {
        gossipsub,
        kademlia,
        relay_client: relay_behaviour,
        request_response_behaviour: cbor::Behaviour::new([(StreamProtocol::new("/String"), ProtocolSupport::Full)], Config::default()),
    })
}

const DUPLICATE_CACHE_TIME: u64 = 10;
const HEART_BEAT_INTERVAL: u64 = 5;
const MAX_MESSAGES_PER_RPC: usize = 10000;

The swarm is initialized as follows:

pub async fn init_swarm(keypair: Option<Keypair>, bootstrap_addresses: Option<Vec<(PeerId, Multiaddr)>>, port: String) -> Result<Swarm<MyBehaviour>, Box<dyn Error>> {
    let builder = if let Some(keypair) = keypair {
        SwarmBuilder::with_existing_identity(keypair)
    } else {
        SwarmBuilder::with_new_identity()
    };

    let mut swarm = builder
        .with_tokio()
        .with_tcp(tcp::Config::default().port_reuse(true), noise::Config::new, yamux::Config::default)?
        .with_quic()
        .with_relay_client(noise::Config::new, yamux::Config::default)?
        .with_behaviour(|keypair, relay_behaviour| {
            if bootstrap_addresses.is_none() {
                info!("Bootstrap Peer ID :{}", keypair.public().to_peer_id());
            }
            MyBehaviour::new(keypair.clone(), relay_behaviour).unwrap()
        })?
        .with_swarm_config(|c| c.with_idle_connection_timeout(Duration::from_secs(60)))
        .build();

    if let Some(ref bootstrap_addresses) = bootstrap_addresses {
        for (peer_id, multi_addr) in bootstrap_addresses {
            swarm.behaviour_mut().kademlia.add_address(peer_id, multi_addr.clone());
            swarm.behaviour_mut().kademlia.bootstrap()?;
        }
    }

    swarm.behaviour_mut().gossipsub.subscribe(&IdentTopic::new(NETWORK_TOPIC))?;
    let listen_address = format!("/ip4/0.0.0.0/udp/{}/quic-v1", port);
    swarm.listen_on(listen_address.parse()?)?;
    Ok(swarm)
}

Expected Behavior:
When a node rejoins the network, it should successfully reconnect to peers and resume normal operations.

Actual Behavior:
After rejoining, the node logs an insufficient peers error.

Questions:

  1. Are there specific parameters in gossipsub, kademlia, or swarm that should be adjusted to handle node reconnection better?
  2. Should additional bootstrap mechanisms be used when a node rejoins?
  3. Would increasing query_timeout, heartbeat_interval, or duplicate_cache_time help in this scenario?

Any guidance on resolving this issue would be greatly appreciated!

@dariusc93
Member

Hey! Could you provide some logs? I know there is an issue with the QUIC transport: when a node disconnects and then reconnects from the same port before the old connection has actually timed out, it will not reuse that connection. See #5097. If that is the case here, the workaround is to set the idle timeout and the keep-alive interval low enough that the connection times out quickly whenever the peer disconnects, for whatever reason. See #5097 (comment) for the parameters I use. The side effects of using such low durations are hard to gauge. Another workaround would be to try TCP instead and see whether the issue still happens.
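
To make the timing relationship a bit more concrete: a keep-alive interval generally has to be shorter than the idle timeout, otherwise keep-alives cannot stop a healthy but quiet connection from expiring. The helper below is only a sanity-check sketch. `keep_alives_per_idle_window` is a hypothetical name and not part of libp2p, and it assumes the idle timeout is given in milliseconds as a `u32` and the keep-alive interval as a `Duration`.

```rust
use std::time::Duration;

/// Hypothetical helper (not a libp2p API): returns how many keep-alive
/// intervals fit into one idle-timeout window, or `None` when the
/// keep-alive interval is zero or not shorter than the idle timeout.
/// In the `None` case keep-alives cannot keep a quiet connection open.
fn keep_alives_per_idle_window(
    max_idle_timeout_ms: u32,
    keep_alive_interval: Duration,
) -> Option<u64> {
    let idle = Duration::from_millis(u64::from(max_idle_timeout_ms));
    if keep_alive_interval.is_zero() || keep_alive_interval >= idle {
        return None;
    }
    // Divide in nanoseconds so sub-millisecond intervals never become zero.
    Some((idle.as_nanos() / keep_alive_interval.as_nanos()) as u64)
}
```

For example, `keep_alives_per_idle_window(300, Duration::from_millis(100))` returns `Some(3)`, while an interval of one second against the same 300 ms timeout returns `None`.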

@vinay10949
Author

vinay10949 commented Feb 6, 2025

I tried this out, and it worked! Only lowering the timeout doesn't work.

// QUIC transport config: short idle timeout plus frequent keep-alives
let mut config = libp2p::quic::Config::new(&keypair.unwrap().clone());
config.max_idle_timeout = 300; // milliseconds
config.keep_alive_interval = Duration::from_millis(100);

let mut kad_config = kad::Config::default();
kad_config.set_query_timeout(Duration::from_secs(30));
kad_config.set_replication_factor(std::num::NonZero::new(4).unwrap());

if let Some(ref bootstrap_addresses) = bootstrap_addresses {
    for (peer_id, multi_addr) in bootstrap_addresses {
        swarm.behaviour_mut().kademlia.add_address(peer_id, multi_addr.clone());
        swarm.dial(multi_addr.clone())?;
    }
    swarm.behaviour_mut().kademlia.bootstrap()?;
}
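
The snippets above do not show how the QUIC settings reach the swarm. One way this could be wired together is sketched below, assuming a libp2p version whose `SwarmBuilder` offers `with_quic_config` and the `tokio`, `tcp`, `quic`, `noise`, `yamux` and `kad` features. To stay short it uses `kad::Behaviour` alone instead of `MyBehaviour`, and `build_swarm` is a hypothetical function name. Treat it as an illustration of the approach from this thread, not as the exact code that was run.

```rust
use std::{error::Error, num::NonZeroUsize, time::Duration};

use libp2p::{
    kad::{self, store::MemoryStore},
    noise, tcp, yamux, Multiaddr, PeerId, Swarm, SwarmBuilder,
};

// Hypothetical helper: builds a swarm with the lowered QUIC timeouts and
// bootstraps Kademlia once after all bootstrap peers have been added.
fn build_swarm(
    bootstrap_addresses: &[(PeerId, Multiaddr)],
) -> Result<Swarm<kad::Behaviour<MemoryStore>>, Box<dyn Error>> {
    let mut swarm = SwarmBuilder::with_new_identity()
        .with_tokio()
        .with_tcp(tcp::Config::default(), noise::Config::new, yamux::Config::default)?
        // Replaces the plain `.with_quic()` call from the original code.
        .with_quic_config(|mut config| {
            config.max_idle_timeout = 300; // milliseconds
            config.keep_alive_interval = Duration::from_millis(100);
            config
        })
        .with_behaviour(|key| {
            let peer_id = key.public().to_peer_id();
            let mut kad_config = kad::Config::default();
            kad_config.set_query_timeout(Duration::from_secs(30));
            kad_config.set_replication_factor(NonZeroUsize::new(4).unwrap());
            kad::Behaviour::with_config(peer_id, MemoryStore::new(peer_id), kad_config)
        })?
        .with_swarm_config(|c| c.with_idle_connection_timeout(Duration::from_secs(60)))
        .build();

    // Register and dial every bootstrap peer first ...
    for (peer_id, multi_addr) in bootstrap_addresses {
        swarm.behaviour_mut().add_address(peer_id, multi_addr.clone());
        swarm.dial(multi_addr.clone())?;
    }
    // ... then start a single bootstrap query, outside the loop.
    if !bootstrap_addresses.is_empty() {
        swarm.behaviour_mut().bootstrap()?;
    }

    Ok(swarm)
}
```

Compared with the code in the original report, the two visible differences are the explicit `swarm.dial` for each bootstrap address and calling `bootstrap()` once after the loop instead of once per address.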
