net: UDPConn.WriteToUDPAddrPort sometimes blocks forever on darwin/arm64 #73919

Open
brad-defined opened this issue May 29, 2025 · 2 comments
Labels
BugReport: Issues describing a possible bug in the Go implementation. NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Comments

@brad-defined

Go version

go version go1.24.1 darwin/arm64

Output of go env in your module/workspace:

AR='ar'
CC='clang'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='clang++'
GCCGO='gccgo'
GO111MODULE=''
GOARCH='arm64'
GOARM64='v8.0'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/Users/bradleyhiggins/Library/Caches/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/Users/bradleyhiggins/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/_4/sn138dfd3lggdfjlrflywzdc0000gn/T/go-build3757900667=/tmp/go-build -gno-record-gcc-switches -fno-common'
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMOD='/Users/bradleyhiggins/Work/gotmp/udptest/go.mod'
GOMODCACHE='/Users/bradleyhiggins/go/pkg/mod'
GONOPROXY='github.com/DefinedNet'
GONOSUMDB='github.com/DefinedNet'
GOOS='darwin'
GOPATH='/Users/bradleyhiggins/go'
GOPRIVATE='github.com/DefinedNet'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/bradleyhiggins/Library/Application Support/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.24.1'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

We've had a few recent reports of our UDP-heavy application (github.com/Nebula) locking up on Darwin hosts. Users who hit the issue can reproduce it regularly, though some who previously hit it have since lost the ability to reproduce it. I've been unable to reproduce the issue directly myself.

What did you see happen?

When the process locks up, we see this goroutine in a stack trace:

goroutine 22 [IO wait, 5 minutes]:
internal/poll.runtime_pollWait(0x14a530bd0, 0x77)
	runtime/netpoll.go:351 +0xa0
internal/poll.(*pollDesc).wait(0x1400016d280?, 0x140004ca8c0?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x28
internal/poll.(*pollDesc).waitWrite(...)
	internal/poll/fd_poll_runtime.go:93
internal/poll.(*FD).WriteToInet6(0x1400016d280, {0x140004ca8c0, 0x122, 0x140}, 0x140003315d8)
	internal/poll/fd_unix.go:485 +0x1d8
net.(*netFD).writeToInet6(0x1400016d280, {0x140004ca8c0?, 0x6154a55d56fa2fab?, 0x456866b1?}, 0x0?)
	net/fd_posix.go:114 +0x28
net.(*UDPConn).writeToAddrPort(0x1400011c300, {0x140004ca8c0, 0x122, 0x140}, {{{0x14000331708?, 0x10262ee08?}, {0x1400000c648?}}, 0xea0?})
	net/udpsock_posix.go:159 +0xf8
net.(*UDPConn).WriteToUDPAddrPort(0x1400011c300, {0x140004ca8c0?, 0xc505ecc647b91bac?, 0x102de1448?}, {{{0x102e53201?, 0x102de3a60?}, {0x1400000c648?}}, 0x4?})
	net/udpsock.go:227 +0x44
github.com/slackhq/nebula/udp.(*GenericConn).WriteTo(0x102eaae60?, {0x140004ca8c0?, 0x102de1448?, 0x140001695e0?}, {{{0x102f4f980?, 0x102633d90?}, {0x1400000c648?}}, 0x19ac?})
	github.com/slackhq/[email protected]/udp/udp_generic.go:55 +0x28

Other goroutines attempting to write to the same UDP socket block forever on lock acquisition (as the lock is held by goroutine 22 above):

goroutine 33 [semacquire, 5 minutes]:
internal/poll.runtime_Semacquire(0x0?)
	runtime/sema.go:76 +0x2c
internal/poll.(*fdMutex).rwlock(0x1400016d280, 0xc0?)
	internal/poll/fd_mutex.go:154 +0xdc
internal/poll.(*FD).writeLock(...)
	internal/poll/fd_mutex.go:239
internal/poll.(*FD).WriteToInet6(0x1400016d280, {0x14000010000, 0x20, 0x2329}, 0x140000ee838)
	internal/poll/fd_unix.go:472 +0x4c
net.(*netFD).writeToInet6(0x1400016d280, {0x14000010000?, 0xffff000000000000?, 0xf2b636c6?}, 0x0?)
	net/fd_posix.go:114 +0x28
net.(*UDPConn).writeToAddrPort(0x1400011c300, {0x14000010000, 0x20, 0x2329}, {{{0x1400049e900?, 0x14000010000?}, {0x1400000c630?}}, 0x2329?})
	net/udpsock_posix.go:159 +0xf8
net.(*UDPConn).WriteToUDPAddrPort(0x1400011c300, {0x14000010000?, 0x140003a0080?, 0x0?}, {{{0x14000160900?, 0x103556c20?}, {0x1400000c630?}}, 0xf4d0?})
	net/udpsock.go:227 +0x44
github.com/slackhq/nebula/udp.(*GenericConn).WriteTo(0x140004ac3d0?, {0x14000010000?, 0x10?, 0x2329?}, {{{0x14000010000?, 0x10?}, {0x1400000c630?}}, 0x6c20?})
	github.com/slackhq/[email protected]/udp/udp_generic.go:55 +0x28

Some earlier, possibly related issues: #61555 and a comment on #45211.

The comment linked above indicated that the socket write returned EAGAIN, the goroutine was parked, and, for some reason, the WRITE event from the kevent() call never fired.
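
For context, my understanding of the blocking path is roughly the following (a paraphrase of internal/poll/fd_unix.go, not an exact copy of the source): on EAGAIN the writing goroutine parks in waitWrite and can only be woken by a write-readiness kevent for this fd.

// Rough paraphrase of internal/poll.(*FD).WriteToInet6 (not the exact source).
func (fd *FD) WriteToInet6(p []byte, sa *syscall.SockaddrInet6) (int, error) {
	if err := fd.writeLock(); err != nil { // the lock goroutine 33 is stuck on above
		return 0, err
	}
	defer fd.writeUnlock()
	if err := fd.pd.prepareWrite(fd.isFile); err != nil {
		return 0, err
	}
	for {
		err := sendtoInet6(fd.Sysfd, p, 0, sa) // the syscall-level call instrumented below
		if err == syscall.EAGAIN && fd.pd.pollable() {
			// Park until the netpoller reports the fd writable again.
			// If that WRITE kevent never arrives, this blocks forever
			// (goroutine 22 in the traces above).
			if err = fd.pd.waitWrite(fd.isFile); err == nil {
				continue
			}
		}
		if err != nil {
			return 0, err
		}
		return len(p), nil
	}
}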

I was suspicious that either the expected kevent wasn't firing, or there was a subtle race in the lock-free kevent handler.

To suss this out, I instrumented the Go toolchain to print some extra debug information around the sendto() call and the kevent() loop:
go version go1.25-devel_7f806c1052 Wed May 21 00:07:41 2025 -0700 darwin/arm64

bradleyhiggins@Haxx src % git diff
diff --git a/src/internal/poll/fd_poll_runtime.go b/src/internal/poll/fd_poll_runtime.go
index 2aef11243a..18db07a922 100644
--- a/src/internal/poll/fd_poll_runtime.go
+++ b/src/internal/poll/fd_poll_runtime.go
@@ -37,6 +37,7 @@ var serverInit sync.Once
 
 func (pd *pollDesc) init(fd *FD) error {
        serverInit.Do(runtime_pollServerInit)
+       println("BRAD func(pd *pollDesc) init(fd *FD) error", int(fd.Sysfd))
        ctx, errno := runtime_pollOpen(uintptr(fd.Sysfd))
        if errno != 0 {
                return errnoErr(syscall.Errno(errno))
diff --git a/src/runtime/netpoll_kqueue.go b/src/runtime/netpoll_kqueue.go
index db4dddc2fe..ec99a319b5 100644
--- a/src/runtime/netpoll_kqueue.go
+++ b/src/runtime/netpoll_kqueue.go
@@ -129,6 +129,7 @@ retry:
                ev := &events[i]
 
                if isWakeup(ev) {
+                       println("BRAD: ", i, "/", int(n), "(", ev.ident, ") isWakeup")
                        isBlocking := delay != 0
                        processWakeupEvent(kq, isBlocking)
                        if isBlocking {
@@ -142,6 +143,7 @@ retry:
                var mode int32
                switch ev.filter {
                case _EVFILT_READ:
+                       println("BRAD: ", i, "/", int(n), "(", ev.ident, ") _EVFILT_READ")
                        mode += 'r'
 
                        // On some systems when the read end of a pipe
@@ -157,6 +159,7 @@ retry:
                                mode += 'w'
                        }
                case _EVFILT_WRITE:
+                       println("BRAD: ", i, "/", int(n), "(", ev.ident, ") _EVFILT_WRITE")
                        mode += 'w'
                }
                if mode != 0 {
@@ -175,6 +178,7 @@ retry:
                                        continue
                                }
                        }
+                       println("BRAD: ", i, "/", int(n), "(", ev.ident, ") ev.flags==_EV_ERROR", ev.flags == _EV_ERROR)
                        pd.setEventErr(ev.flags == _EV_ERROR, tag)
                        delta += netpollready(&toRun, pd, mode)
                }
diff --git a/src/syscall/syscall_unix.go b/src/syscall/syscall_unix.go
index ecd5952975..936cf211ff 100644
--- a/src/syscall/syscall_unix.go
+++ b/src/syscall/syscall_unix.go
@@ -438,7 +438,31 @@ func sendtoInet6(fd int, p []byte, flags int, to *SockaddrInet6) (err error) {
        if err != nil {
                return err
        }
-       return sendto(fd, p, flags, ptr, n)
+
+       ret := sendto(fd, p, flags, ptr, n)
+       println(
+               "BRAD: sendto (", fd, ") returned", ret, "to addr",
+               to.Addr[0],
+               to.Addr[1],
+               to.Addr[2],
+               to.Addr[3],
+               to.Addr[4],
+               to.Addr[5],
+               to.Addr[6],
+               to.Addr[7],
+               to.Addr[8],
+               to.Addr[9],
+               to.Addr[10],
+               to.Addr[11],
+               to.Addr[12],
+               to.Addr[13],
+               to.Addr[14],
+               to.Addr[15])
+       if ret == EAGAIN {
+               println("BRAD: sendto returned EAGAIN")
+       }
+
+       return ret
 }
 
 func Sendto(fd int, p []byte, flags int, to Sockaddr) (err error) {

I asked a user who could reproduce the issue to run our program built with this amended Go toolchain, and got the following output (file descriptor 8 is the UDP socket):

INFO[0000] listening on [::]:42069                      
BRAD func(pd *pollDesc) init(fd *FD) error 8
BRAD:  0 / 1 ( 8 ) _EVFILT_WRITE
BRAD:  0 / 1 ( 8 ) ev.flags==_EV_ERROR false
<snip>
BRAD: sendto ( 8 ) returned (0x0,0x0) to addr x x x x x x x x x x x x x x x x
BRAD: sendto ( 8 ) returned (0x0,0x0) to addr x x x x x x x x x x x x x x x x
BRAD: sendto ( 8 ) returned (0x0,0x0) to addr x x x x x x x x x x x x x x x x
BRAD: sendto ( 8 ) returned (0x0,0x0) to addr x x x x x x x x x x x x x x x x
BRAD: sendto ( 8 ) returned (0x0,0x0) to addr x x x x x x x x x x x x x x x x
BRAD: sendto ( 8 ) returned (0x0,0x0) to addr x x x x x x x x x x x x x x x x
BRAD: sendto ( 8 ) returned (0x0,0x0) to addr x x x x x x x x x x x x x x x x
BRAD: sendto ( 8 ) returned (0x0,0x0) to addr x x x x x x x x x x x x x x x x
INFO[0010] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=144220064 localIndex=144220064 remoteIndex=0 udpAddrs="[[2600:<snip>]:58361 [2607:<snip>]:58361 [fd74:<snip>]:58361 [fd74:<snip>]:58361 [fdd7:<snip>]:58361 <ipv4snip>:58361 <ipv4snip>:58361 <ipv4snip>:58361]" vpnIp=<snip>
BRAD:  0 / 1 ( 8 ) _EVFILT_WRITE
BRAD:  0 / 1 ( 8 ) ev.flags==_EV_ERROR false
<snip>
ERRO[0011] Failed to send handshake message              error="write udp [::]:42069->[2600:<snip>]:58361: sendto: no route to host" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=144220064 localIndex=144220064 remoteIndex=0 udpAddr="[2600:<snip>]:58361" vpnIp=<snip>
BRAD: sendto ( 8 ) returned (0x10307efa0,0x102ec3fc8) to addr x x x x x x x x x x x x x x x x
ERRO[0011] Failed to send handshake message              error="write udp [::]:42069->[2607:<snip>]:58361: sendto: no route to host" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=144220064 localIndex=144220064 remoteIndex=0 udpAddr="[2607:<snip>]:58361" vpnIp=<snip>
BRAD: sendto ( 8 ) returned (0x0,0x0) to addr x x x x x x x x x x x x x x x x
BRAD: sendto ( 8 ) returned (0x10307efa0,0x1035a51a0) to addr x x x x x x x x x x x x x x x x
BRAD: sendto returned EAGAIN
BRAD:  0 / 1 ( 8 ) _EVFILT_READ
BRAD:  0 / 1 ( 8 ) ev.flags==_EV_ERROR false
BRAD:  0 / 1 ( 8 ) _EVFILT_READ
BRAD:  0 / 1 ( 8 ) ev.flags==_EV_ERROR false
BRAD:  0 / 1 ( 8 ) _EVFILT_READ
BRAD:  0 / 1 ( 8 ) ev.flags==_EV_ERROR false
BRAD:  0 / 1 ( 7 ) _EVFILT_READ
BRAD:  0 / 1 ( 7 ) ev.flags==_EV_ERROR false
BRAD:  0 / 1 ( 7 ) _EVFILT_READ
BRAD:  0 / 1 ( 7 ) ev.flags==_EV_ERROR false
BRAD:  0 / 1 ( 8 ) _EVFILT_READ
BRAD:  0 / 1 ( 8 ) ev.flags==_EV_ERROR false
BRAD:  0 / 1 ( 3994991092 ) isWakeup
BRAD:  0 / 1 ( 7 ) _EVFILT_READ
BRAD:  0 / 1 ( 7 ) ev.flags==_EV_ERROR false
BRAD:  0 / 1 ( 3994991092 ) isWakeup
BRAD:  0 / 1 ( 3994991092 ) isWakeup
BRAD:  0 / 1 ( 3994991092 ) isWakeup
^CINFO[0021] Cleanly shutting down Nebula network   

I've snipped some IP addresses out of the user's trace, but I believe all of the relevant information is included. Interpreting this trace: the UDP socket starts out fine, with kevents firing as expected. A WRITE kevent fires after the first batch of UDP packets is sent off; this is also the last WRITE event in the log for file descriptor 8. Later, the user sees sendto return EAGAIN ("BRAD: sendto returned EAGAIN"). After that, the log still shows kevents for file descriptor 8, but they are all _EVFILT_READ, never _EVFILT_WRITE.

With some further debugging, we found that (at least for this user) the issue did not manifest when we prevented the program from attempting any UDP writes to the ULA fc00::/7 range.

In an earlier incarnation of a similar issue (linked above), the program Little Snitch was installed on the failing machines, and those machines failed with the same stack. While attempting to root-cause this issue, I re-installed Little Snitch and successfully reproduced the lockup, with the same behavior shown in the instrumented logs above: a sendto call returns EAGAIN, but no WRITE kevent ever arrives to wake the goroutine up again.

In this case, the users encountering the issue do not have Little Snitch installed.

What did you expect to see?

I expect the UDP write to not block forever.

If kevent() is guaranteed to fire a WRITE event for a file descriptor in every scenario in which sendto returns EAGAIN, then the behavior I'm seeing would appear to be a Darwin bug. I've seen this behavior twice, on different OS versions: once with Little Snitch installed, and now in this new incarnation reported by a few users (apparently related to UDP writes to ULA IPv6 addresses).
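
To test that guarantee outside of Go's netpoller, something like the following standalone program could mimic what the runtime does: register the socket's EVFILT_WRITE filter with EV_CLEAR, drive sendto until it returns EAGAIN, then wait to see whether a WRITE kevent ever arrives. This is only a rough, untested sketch for darwin; the destination address, port, and packet size are placeholders.

package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	// A non-blocking IPv6 UDP socket, as the net package would create.
	fd, err := syscall.Socket(syscall.AF_INET6, syscall.SOCK_DGRAM, 0)
	if err != nil {
		panic(err)
	}
	if err := syscall.SetNonblock(fd, true); err != nil {
		panic(err)
	}

	// Register write-readiness edge-triggered (EV_CLEAR), like the runtime's
	// kqueue netpoller does.
	kq, err := syscall.Kqueue()
	if err != nil {
		panic(err)
	}
	changes := []syscall.Kevent_t{{
		Ident:  uint64(fd),
		Filter: syscall.EVFILT_WRITE,
		Flags:  syscall.EV_ADD | syscall.EV_CLEAR,
	}}
	if _, err := syscall.Kevent(kq, changes, nil, nil); err != nil {
		panic(err)
	}

	// Placeholder ULA destination; replace with an address that triggers the
	// problem in your environment.
	dst := &syscall.SockaddrInet6{Port: 58361}
	copy(dst.Addr[:], []byte{0xfd, 0x74, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01})

	// Send until the kernel pushes back with EAGAIN.
	buf := make([]byte, 1200)
	for {
		err := syscall.Sendto(fd, buf, 0, dst)
		if err == syscall.EAGAIN {
			fmt.Println("sendto returned EAGAIN; waiting for a WRITE kevent")
			break
		}
		if err != nil {
			fmt.Println("sendto error:", err)
			time.Sleep(10 * time.Millisecond)
		}
	}

	// If the WRITE kevent never arrives, this loop reports "no WRITE kevent yet"
	// forever -- the standalone analogue of the stuck goroutine.
	for {
		events := make([]syscall.Kevent_t, 1)
		ts := syscall.Timespec{Sec: 1}
		n, err := syscall.Kevent(kq, nil, events, &ts)
		if err != nil && err != syscall.EINTR {
			panic(err)
		}
		if n > 0 && events[0].Filter == syscall.EVFILT_WRITE {
			fmt.Println("got WRITE kevent; the fd is writable again")
			return
		}
		fmt.Println("no WRITE kevent yet")
	}
}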

Even if Darwin is to blame, I would love to have a way to work around this behavior. One thought is to make UDP writes non-blocking from the Go caller's perspective. Today the socket itself is non-blocking, but Go's net.(*UDPConn).WriteToUDPAddrPort call blocks until sendto returns something other than EINTR or EAGAIN. If I had a way to call WriteToUDPAddrPort (or another UDP send function) such that the EAGAIN is immediately bubbled up the stack instead of blocking on a kevent that never happens, I could work around this kevent behavior without locking up the process.
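
In the meantime, one workaround I'm considering (a hypothetical sketch, not something Nebula does today) is to perform the sendto ourselves through (*net.UDPConn).SyscallConn(), so an EAGAIN comes straight back to the caller instead of parking the goroutine on the netpoller:

// Hypothetical helper package; not part of Nebula today.
package udpworkaround

import (
	"net"
	"syscall"
)

// writeToNonBlocking attempts a single non-blocking UDP send and surfaces
// syscall.EAGAIN to the caller instead of waiting for a kevent WRITE
// notification.
func writeToNonBlocking(c *net.UDPConn, p []byte, to *syscall.SockaddrInet6) error {
	rc, err := c.SyscallConn()
	if err != nil {
		return err
	}
	var sendErr error
	// Returning true from the callback tells RawConn.Write not to wait for
	// the fd to become writable; the call completes after one attempt.
	if err := rc.Write(func(fd uintptr) bool {
		sendErr = syscall.Sendto(int(fd), p, 0, to)
		return true
	}); err != nil {
		return err
	}
	return sendErr // may be syscall.EAGAIN; the caller decides how and when to retry
}

A less invasive option would be to call SetWriteDeadline before each write, so that a wedged write at least fails with a timeout instead of hanging forever.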

@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label May 29, 2025
@mknyszek mknyszek changed the title from "net/ udpconn : WriteToUDPAddrPort sometimes blocks forever on darwin/MacOS" to "net/udpconn: WriteToUDPAddrPort sometimes blocks forever on darwin/arm64" on May 29, 2025
@mknyszek
Contributor

CC @neild

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 29, 2025
@mknyszek mknyszek added this to the Backlog milestone May 29, 2025
@seankhliao seankhliao changed the title from "net/udpconn: WriteToUDPAddrPort sometimes blocks forever on darwin/arm64" to "net: UDPConn.WriteToUDPAddrPort sometimes blocks forever on darwin/arm64" on May 29, 2025