
Intermittent Gossip Data Propagation Issues in libp2p Network (Rust-libp2p 0.54) #6035


Open
vinay10949 opened this issue May 25, 2025 · 4 comments

@vinay10949

vinay10949 commented May 25, 2025

Summary

We're experiencing intermittent issues with gossip data propagation in our libp2p network using Rust-libp2p 0.54. The problem occurs on both local development machines and test servers, where nodes sometimes fail to receive gossiped messages despite appearing to be connected.

Symptoms:

  1. Nodes sometimes fail to receive gossiped data
  2. Bootstrap connections occasionally fail with HandshakeTimedOut errors
  3. Gossipsub mesh reports needing more peers even when nodes are connected
  4. Kademlia bootstrap queries complete but don't always result in stable connections

Configuration:

// Network setup
let mut config = libp2p::quic::Config::new(&keypair.unwrap().clone());
config.max_idle_timeout = 10*1000; // 10 seconds
config.keep_alive_interval = Duration::from_secs(5);

// Gossipsub config
let gossipsub_config = gossipsub::ConfigBuilder::default()
    .heartbeat_interval(Duration::from_secs(HEARTBEAT_INTERVAL)) // 5 seconds
    .validation_mode(gossipsub::ValidationMode::Strict)
    .duplicate_cache_time(Duration::from_secs(DUPLICATE_CACHE_DURATION)) // 10 seconds
    .max_transmit_size(1_000_000)
    .message_id_fn(message_id_fn)
    .max_messages_per_rpc(Some(MAX_MESSAGES_PER_RPC)) // 100
    .mesh_n_low(4)
    .mesh_n_high(10)
    .mesh_n(8)
    .build()?;


// Swarm setup

#[tracing::instrument(skip(keypair))]
pub async fn setup_swarm_network(
    keypair: Option<Keypair>,
    bootstrap_addresses: Option<Vec<(PeerId, Multiaddr)>>,
    port: String,
) -> Result<Swarm<SwarmBehaviour>, Box<dyn Error>> {
    // Set up the SwarmBuilder based on whether a keypair is provided or not.
    let builder = if let Some(keypair) = keypair.clone() {
        // Use the provided keypair for the swarm identity.
        SwarmBuilder::with_existing_identity(keypair)
    } else {
        // Generate a new identity if no keypair is provided.
        SwarmBuilder::with_new_identity()
    };
    // Note: `unwrap()` panics when `keypair` is `None`, although the builder above handles that case.
    let mut config = libp2p::quic::Config::new(&keypair.unwrap().clone());
    // config.max_idle_timeout = 300;
    config.max_idle_timeout = 10 * 1000; // milliseconds (10 seconds)
    // config.keep_alive_interval = Duration::from_millis(100);
    config.keep_alive_interval = Duration::from_secs(5);
    // Build the libp2p swarm with the QUIC transport.
    let mut swarm = builder
        .with_tokio() // Use Tokio for asynchronous execution.
        .with_quic_config(|_| config)
        .with_behaviour(|keypair| {
            // If no bootstrap addresses are provided, print the peer ID for informational purposes.
            if bootstrap_addresses.is_none() {
                info!("Bootstrap Peer ID :{}", keypair.public().to_peer_id());
            }
            // Initialize the custom SwarmBehaviour, which includes the Gossipsub and Kademlia behaviours.
            SwarmBehaviour::new(keypair.clone()).unwrap()
        })?
        .with_swarm_config(|c| {
            // Configure idle connection timeout.
            c.with_idle_connection_timeout(Duration::from_secs(60))
        })
        .build();

    // If bootstrap nodes are provided, add them to the Kademlia behavior.
    if let Some(ref bootstrap_addresses) = bootstrap_addresses {
        for (peer_id, multi_addr) in bootstrap_addresses {
            // Add each bootstrap node's address to the Kademlia DHT.
            swarm
                .behaviour_mut()
                .kademlia
                .add_address(peer_id, multi_addr.clone());
            swarm.dial(multi_addr.clone())?;
        }
        // Trigger the Kademlia bootstrap process to find more peers.
        swarm.behaviour_mut().kademlia.bootstrap()?;
    }

    // Subscribe to the primary Gossipsub topic for network-wide communication.
    swarm
        .behaviour_mut()
        .gossipsub
        .subscribe(&IdentTopic::new(NETWORK_TOPIC))?;

    // Define the address to listen on for incoming connections (QUIC over UDP).
    let listen_address = format!("/ip4/0.0.0.0/udp/{}/quic-v1", port);
    swarm.listen_on(listen_address.parse()?)?;

    // Return the initialized swarm.
    Ok(swarm)
}
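
The two QUIC timeout fields above use different units: `max_idle_timeout` is a plain integer in milliseconds, while `keep_alive_interval` is a `Duration`. The sketch below is a std-only sanity check of how the two values relate. The helper `keep_alive_fits` is hypothetical (it is not part of libp2p), and the rule of "at least two keep-alives per idle window" is an assumption chosen for illustration, not a libp2p or QUIC requirement.

```rust
use std::time::Duration;

/// Hypothetical helper (not a libp2p API): returns true when keep-alives
/// are sent at least twice within one idle-timeout window.
fn keep_alive_fits(max_idle_timeout_ms: u32, keep_alive_interval: Duration) -> bool {
    // Convert the millisecond integer into a Duration so both values share a unit.
    let idle = Duration::from_millis(u64::from(max_idle_timeout_ms));
    // Assumed rule of thumb: two keep-alives per idle window.
    keep_alive_interval * 2 <= idle
}

fn main() {
    // Values from the configuration above: 10 * 1000 ms and 5 s.
    assert!(keep_alive_fits(10 * 1000, Duration::from_secs(5)));
    // The commented-out value 300 would mean 300 ms, not 300 s.
    assert!(!keep_alive_fits(300, Duration::from_secs(5)));
    println!("timeout settings checked");
}
```

Under this assumed rule the configured pair (10 s idle timeout, 5 s keep-alive) sits exactly on the boundary, so a shorter keep-alive interval or a longer idle timeout might be worth trying while debugging.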

Logs:
From Node 1 (working):

[TRACE] Sending message to peer 16Uiu2HAmR6ogo4eHfXuz28HNS2XJUGcB1R9Wf4UzHh7go18LQX3v
[TRACE] Sending message to peer 16Uiu2HAmGjjk8mDH5F1Y3FVW68tenMNWMNkTcZXceLMnXUEJDoSx
[TRACE] Sending message to peer 16Uiu2HAmMwshLKvkHnMsgJ5MPxcLeVkkSxRK8Rm6cFRaCCTkhhEd

From Node 2 (failing):

[ERROR] Failed to establish outgoing connection. Connection ID: ConnectionId(8), 
Peer ID: Some(PeerId("16Uiu2HAmT4FjyydhhSYgLoGjNJEFGHDexiaH6UxWM1VCW1LT5o1X")), 
Error: Transport([(/ip4/127.0.0.1/udp/7070/quic-v1/p2p/16Uiu2HAmT4FjyydhhSYgLoGjNJEFGHDexiaH6UxWM1VCW1LT5o1X, 
Other(Custom { kind: Other, error: Other(Right(HandshakeTimedOut)) }))]).

[DEBUG] HEARTBEAT: Mesh low. Topic contains: 0 needs: 4
[DEBUG] RANDOM PEERS: Got 0 peers

Expected behavior

  1. Stable connections between nodes
  2. Reliable gossip message propagation
  3. Healthy mesh network with sufficient peers

Actual behavior

  1. Intermittent connection failures
  2. Gossip messages sometimes not received
  3. Mesh peer count often below configured minimum

1.log

2.log

Relevant log output

Possible Solution

No response

Version

0.54

Would you like to work on fixing this bug?

Yes

@vinay10949
Author

Sometimes I get this:

peer_id: Some(PeerId("16Uiu2HAmT4FjyydhhSYgLoGjNJEFGHDexiaH6UxWM1VCW1LT5o1X"))
[2025-05-25T15:22:35.888Z] TRACE: NODE/31469 on abc.local: [RECV - EVENT] got frame ResetStream(ResetStream { id: StreamId(32), error_code: 0, final_offset: 24 }) (address=/ip4/127.0.0.1/udp/7070/quic-v1,id=0,line=2648,pn=380,space=Data,target=quinn_proto::connection)
    file: cargo/registry/src/index.crates.io-1949cf8c6b5b557f/quinn-proto-0.11.9/src/connection/mod.rs
    --

@jxs
Member

jxs commented May 29, 2025

Hi, have you tried enabling TCP? If so, do the symptoms persist?

@vinay10949
Author

@jxs Yes, we have currently disabled QUIC and enabled TCP. Also, our network is very small, about 4 nodes.
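
For a network of about 4 nodes, the mesh settings from the original configuration (`mesh_n_low(4)`, `mesh_n(8)`, `mesh_n_high(10)`) may be worth a second look: each node can have at most 3 other peers in its mesh, which is below `mesh_n_low`. The std-only sketch below shows the arithmetic. The helper names are hypothetical, and it assumes every node subscribes to the topic and connects to every other node. This could account for a "Mesh low" heartbeat message on a healthy 4-node network, but it would not by itself explain a mesh of 0 peers or the handshake timeouts.

```rust
/// Hypothetical helper: the most mesh peers a single node can have when
/// every other node in the network is connected and subscribed.
fn max_mesh_peers(network_size: usize) -> usize {
    network_size.saturating_sub(1)
}

/// True if a fully connected network of `network_size` nodes can reach `mesh_n_low`.
fn mesh_low_reachable(network_size: usize, mesh_n_low: usize) -> bool {
    max_mesh_peers(network_size) >= mesh_n_low
}

fn main() {
    // Values from the Gossipsub configuration in the original report.
    let (mesh_n_low, mesh_n, mesh_n_high): (usize, usize, usize) = (4, 8, 10);
    // The three values are ordered low <= target <= high.
    assert!(mesh_n_low <= mesh_n && mesh_n <= mesh_n_high);

    // With 4 nodes, each node sees at most 3 peers, below mesh_n_low = 4.
    assert_eq!(max_mesh_peers(4), 3);
    assert!(!mesh_low_reachable(4, mesh_n_low));
    // A fifth node would be the minimum needed to reach mesh_n_low.
    assert!(mesh_low_reachable(5, mesh_n_low));
}
```

If this applies, lowering the mesh parameters to fit the network size is one option to try; whether it changes the delivery problem would need to be confirmed from the logs.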

We also spotted this:

May 29 07:53:04 guardian-testnet-2 sh[25552]: {"v":0,"name":"DUCAT_NODE","msg":"[SWARM::POLL - EVENT] Request to peer in query failed with Io(Custom { kind: ConnectionRefused, error: \"protocol not supported\" })","level":20,"hostname":"guardian-testnet-2","pid":25552,"time":"2025-05-29T07:53:04.078781109Z","target":"libp2p_kad::behaviour","line":2358,"file":"/home/admin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-kad-0.46.2/src/behaviour.rs","peer":"16Uiu2HAm4QSE1Q6jvtbq4NZjmBdUb4k34KG17vS5bRRfv5ZYtZyE","query":"QueryId(0)"}

@rose2221
Contributor

I would like to work on this!
