[bug]: unable to read message from peer: EOF, disconnections and routing stops #8125

@KnockOnWoodNode

Background

After some amount of uptime, my node starts having channels to several peers deactivated, one after the other; each is reactivated within a few minutes at most. At the same time, the node becomes unable to route HTLCs: bos bot stops notifying me of forwards, and one of my peers who tracks these things reports a spike in pending outgoing HTLCs from their node to mine whenever this happens. Those HTLCs slowly resolve themselves by failing.
Restarting lnd solves the issue until the next time it happens.
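
To catch the moment channels start dropping, something along these lines could be run periodically. It is only a sketch: it assumes the JSON printed by `lncli listchannels` has a top-level `channels` array whose entries carry `active` and `remote_pubkey` fields, and the helper name `inactive_peers` is made up for illustration.

```python
import json


def inactive_peers(listchannels_json: str) -> list[str]:
    """Return the sorted pubkeys of peers with at least one inactive channel.

    Assumes the input is the JSON text printed by `lncli listchannels`.
    """
    channels = json.loads(listchannels_json).get("channels", [])
    # A channel without an "active" field is treated as inactive.
    return sorted(
        {c["remote_pubkey"] for c in channels if not c.get("active", False)}
    )
```

Feeding it the output of `lncli listchannels` every minute or so and logging the result with a timestamp would show whether the deactivations cluster around the time forwards stop.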

I couldn't form a solid hypothesis about why this happens, but here are all the details I can provide, in case they give you some ideas.
I run the sqlite backend and increased the timeout to 10m to avoid SQLITE_BUSY errors. I don't remember this problem happening before, and I am 90% sure it started after I began connecting to more peers beyond my direct channel peers, to get gossip updates from the network faster. (At that point I didn't know about active and passive sync peers, so the many extra peers I connected to were all passive. I later caught up and increased my active peers value, but none of this seems to have had any influence on the issue.)
Besides the problem appearing after I increased the number of peers I connect to, I also seem to notice that the more peers I have, the sooner it happens. Using persistent connections or not doesn't appear to change anything.
I attached a log for one node, picked from among those my node detected disconnections from this last time. I had raised the PEER log level to debug and zgrep'd the logs for its pubkey. I have since restored the info log level for everything.
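
For anyone who wants to pull the same kind of per-peer extract, the zgrep step can be approximated as below. This is a minimal sketch, not the command I actually ran: the function name `grep_logs` is hypothetical, and it assumes rotated logs end in `.gz` while the current log is plain text.

```python
import gzip
from pathlib import Path
from typing import Iterator


def grep_logs(log_paths: list[Path], needle: str) -> Iterator[str]:
    """Yield every line containing `needle`, reading .gz files transparently."""
    for path in log_paths:
        # Rotated logs are assumed to be gzip-compressed; anything else is plain text.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if needle in line:
                    yield line.rstrip("\n")
```

Passing the list of log files (oldest first) and the peer's pubkey as `needle` yields the matching lines in file order, which can then be written out to a single file for attaching.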

For the time being, I have disabled my script that connects to more peers, to be able to report what happens in the upcoming days.

rocket.log

Your environment

  • version of lnd: 0.16.4
  • which operating system (uname -a on *Nix): 5.10.0-26-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux
  • version of btcd, bitcoind, or other backend: 24.1
  • any other relevant environment details: sqlite backend, 12-core Xeon with 64GB of ECC RAM and a 6-SSD mirrored zpool

Steps to reproduce

Run the sqlite backend (no idea whether it is necessary), have an active routing node with forty-something channels, and connect to many peers (more than 300 to trigger the problem sooner) with lncli connect <pubkey>@<address>:<port>
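
The bulk-connect step can be sketched roughly as follows. This is not my actual script, just an illustration: the helper names `connect_command` and `connect_all` are hypothetical, the peer list is assumed to be already available as (pubkey, address, port) tuples, and IPv6 addresses (which need brackets) are not handled.

```python
import subprocess


def connect_command(pubkey: str, address: str, port: int) -> list[str]:
    """Build the argv for `lncli connect <pubkey>@<address>:<port>`."""
    return ["lncli", "connect", f"{pubkey}@{address}:{port}"]


def connect_all(
    peers: list[tuple[str, str, int]], dry_run: bool = True
) -> list[list[str]]:
    """Build one connect command per peer; run them only when dry_run is False."""
    commands = [connect_command(*peer) for peer in peers]
    if not dry_run:
        for argv in commands:
            # check=False: an unreachable peer should not abort the whole run.
            subprocess.run(argv, check=False)
    return commands
```

With `dry_run=True` (the default) nothing is executed and the commands are only returned, which makes it easy to review the list before pointing it at a live node.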

Expected behaviour

lnd continues operating normally, managing forwards like a champ

Actual behaviour

channels are disconnected at random, htlcs are not being processed
