
async sockets in linux (windows is fine) use 5-8x expected cpu vs synchronous sockets or comparables in other languages #72153

@andrewaggb

I'm finding that any form of async tcp or unix socket in c#/.net 6 uses 5-8x the cpu of synchronous sockets in c#, or of essentially the same thing done in python3, nodejs, or netcat in listen mode. Async sockets in c# on windows perform fine. I've tried several different forms of async with sockets in c# and they all behave the same.

In my test scenario I'm saturating a gigabit ethernet link via netcat because it's very easy to reproduce. My production scenario is a video surveillance application receiving many simultaneous hd video feeds.

I have experienced this issue in my production application with and without docker, on physical machines and in hyper-v virtual machines, on centos 7.9, rocky 8.6, and ubuntu 20.04.

My examples below will all be ubuntu 20.04 with dotnet 6 installed on the host.
Hyper-v virtual machine on windows 11 with 4 hyper-v cpus on a ryzen 5600g

I'm using cpu percentages from the top command and these are all processing the same gigabit traffic (no example here is processing more or less MB/s - they are all a full gigabit link verified with application counters and tools like iftop)

using a standard TCP Listener -> NetworkStream with sync reads and a ~1MB buffer
~6-8% cpu usage. I have no problem with this and as you'll see below this result is consistent with other tools and programming languages in my test environment.

  byte[] buffer = new byte[1048576];
  while ((nBytes = stream.Read(buffer, 0, buffer.Length)) > 0)
  {
      totalBytes += nBytes;
  }

using a standard TCP Listener -> NetworkStream with async reads and a ~1MB buffer. The only difference is await ReadAsync vs Read
~50-60% cpu usage - 5-8x as much cpu as the synchronous version.

  byte[] buffer = new byte[1048576];
  while ((nBytes = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
  {
      totalBytes += nBytes;
  }

I have also tried socket.BeginReceive with approximately the same 50-60% cpu usage
I have also tried socket.ReceiveAsync with SocketAsyncEventArgs with approximately the same 50-60% cpu usage

I have tried unix Sockets in c#/.net 6
With unix sockets the cpu usage for a synchronous read loop is around 6% of cpu (vs 6-8% for tcp sockets). I piped a netcat receiver into the unix socket to ensure the same amount of data was being processed.

With unix sockets the cpu usage for an asynchronous read loop is around 50-60% of cpu (~ the same as async tcp sockets). I piped a netcat receiver into the unix socket to ensure the same amount of data was being processed.

I ran some comparables

using netcat as a listener
nc -l 9989 >/dev/null
uses about 8% cpu (same as synchronous sockets in c#)

using nodejs and the socket.on data callback
uses about 10% cpu (slightly more than synchronous c# but 5-6x less than async c#, even though this is async and does not re-use a receive buffer, so it is less efficient than it could be)

using python3 with a synchronous tcp socket receiving into a buffer (very comparable to my c# synchronous example)
uses about 8% cpu (same as synchronous sockets in c#)

using golang with a goroutine per connection receiving into a buffer (comparable to my c# synchronous example)
uses about 14% cpu (approximately double synchronous sockets in c#)
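
The attached pyserver.py is not reproduced in this post. As a rough illustration only, a synchronous python3 receiver comparable to the c# synchronous loop might look like the sketch below. The names `drain` and `serve_once` are hypothetical, and the 1 MB buffer and port 9989 are carried over from the c# snippet and the netcat commands rather than taken from the attachment.

```python
import socket

BUFFER_SIZE = 1048576  # assumed: same ~1 MB buffer as the c# examples


def drain(conn: socket.socket, buffer_size: int = BUFFER_SIZE) -> int:
    """Blocking read loop: receive into one reused buffer until the peer closes."""
    buffer = bytearray(buffer_size)
    view = memoryview(buffer)
    total_bytes = 0
    while True:
        n_bytes = conn.recv_into(view)  # returns 0 once the peer has closed
        if n_bytes == 0:
            break
        total_bytes += n_bytes
    return total_bytes


def serve_once(host: str = "0.0.0.0", port: int = 9989) -> int:
    """Accept a single connection and return how many bytes it delivered."""
    with socket.create_server((host, port)) as listener:
        conn, _addr = listener.accept()
        with conn:
            return drain(conn)
```

Calling `serve_once()` would block until one client (for example the netcat sender) connects and disconnects. Reusing one preallocated buffer via `recv_into` mirrors the single `byte[]` buffer in the c# loop.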

My basic test methodology is
From another machine run netcat to generate a full gigabit/s of ethernet traffic
cat /dev/zero | nc 192.168.1.93 9989

On the destination machine 192.168.1.93 (the ubuntu 20.04 vm all the tests were run on)

High CPU usage
dotnet run --configuration Release PerfTestTCPListenerAsync

Low CPU usage
dotnet run --configuration Release PerfTestTCPListener
nc -l 9989 >/dev/null
nodejs jsserver.js
python3 pyserver.py
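
For a sender machine without netcat, the `cat /dev/zero | nc` step could be approximated in python3. This is a sketch under assumptions, not part of the original test setup: `send_zeros` and the 64 KiB chunk size are hypothetical choices, and the optional byte limit exists only so the function can terminate in a quick check.

```python
import socket
from typing import Optional

CHUNK = bytes(65536)  # 64 KiB of zero bytes, standing in for /dev/zero


def send_zeros(conn: socket.socket, total_bytes: Optional[int] = None) -> int:
    """Stream zero bytes on conn; stop after total_bytes, or never if None."""
    sent = 0
    while total_bytes is None or sent < total_bytes:
        if total_bytes is None:
            chunk = CHUNK
        else:
            chunk = CHUNK[: total_bytes - sent]  # trim the final chunk
        conn.sendall(chunk)
        sent += len(chunk)
    return sent


# Hypothetical usage against the receiver from the methodology above:
#   with socket.create_connection(("192.168.1.93", 9989)) as conn:
#       send_zeros(conn)
```

A python3 sender may not saturate the link as easily as netcat, so the throughput should still be confirmed with iftop before comparing cpu numbers.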

I've included the most basic c# example Program.cs using the exact same code with a Read vs ReadAsync, a very basic python example pyserver.py and a very basic nodejs example jsserver.js. All are configured to listen on port 9989 so the same netcat client command can be run against any of the 5 examples.

I've been monitoring the cpu usage with top and network traffic with iftop.

data.zip

Async version dotnet-counters monitor (I'm not sure if the cpu usage here is quoting all cores, but top shows 52% for this async socket process)
[System.Runtime]
% Time in GC since last GC (%) 0
Allocation Rate (B / 1 sec) 244,648
CPU Usage (%) 16
Exception Count (Count / 1 sec) 0
GC Committed Bytes (MB) 9
GC Fragmentation (%) 0.233
GC Heap Size (MB) 6
Gen 0 GC Count (Count / 1 sec) 0
Gen 0 Size (B) 24
Gen 1 GC Count (Count / 1 sec) 0
Gen 1 Size (B) 245,448
Gen 2 GC Count (Count / 1 sec) 0
Gen 2 Size (B) 24
IL Bytes Jitted (B) 48,860
LOH Size (B) 1,048,656
Monitor Lock Contention Count (Count / 1 sec) 0
Number of Active Timers 0
Number of Assemblies Loaded 18
Number of Methods Jitted 392
POH (Pinned Object Heap) Size (B) 39,976
ThreadPool Completed Work Item Count (Count / 1 sec) 6,132
ThreadPool Queue Length 0
ThreadPool Thread Count 5
Time spent in JIT (ms / 1 sec) 0
Working Set (MB) 47
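
One possible reading of the gap between the 16% reported here and the 52% reported by top, offered as an assumption rather than something confirmed in this issue: top by default reports per-process cpu relative to a single core, whereas a counter normalized across all cores would divide that figure by the core count. The helper name below is hypothetical and the arithmetic only illustrates that reading for the 4-cpu test vm.

```python
def machine_wide_percent(single_core_percent: float, cpu_count: int) -> float:
    """Convert a 'percent of one core' figure into 'percent of the whole machine'."""
    return single_core_percent / cpu_count


# top showed about 52% of one core; the test vm has 4 hyper-v cpus.
print(machine_wide_percent(52, 4))  # 13.0
```

13% is in the same range as the 16% shown by dotnet-counters, so the two tools may be describing the same load on different scales.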

Synchronous version (top shows cpu usage as around 6%)
[System.Runtime]
% Time in GC since last GC (%) 0
Allocation Rate (B / 1 sec) 32,672
CPU Usage (%) 0
Exception Count (Count / 1 sec) 0
GC Committed Bytes (MB) 26
GC Fragmentation (%) 1.105
GC Heap Size (MB) 25
Gen 0 GC Count (Count / 1 sec) 0
Gen 0 Size (B) 24
Gen 1 GC Count (Count / 1 sec) 0
Gen 1 Size (B) 7,806,528
Gen 2 GC Count (Count / 1 sec) 0
Gen 2 Size (B) 12,179,944
IL Bytes Jitted (B) 295,457
LOH Size (B) 287,848
Monitor Lock Contention Count (Count / 1 sec) 0
Number of Active Timers 4
Number of Assemblies Loaded 121
Number of Methods Jitted 3,737
POH (Pinned Object Heap) Size (B) 195,360
ThreadPool Completed Work Item Count (Count / 1 sec) 0
ThreadPool Queue Length 0
ThreadPool Thread Count 4
Time spent in JIT (ms / 1 sec) 0
Working Set (MB) 140

Linux Traces:
I've attached 20 second dotnet-trace results from both Synchronous and Asynchronous in Linux. It appears to me that all (or most) of the extra time is spent in LowLevelLifoSemaphore, but I'm not sure if that's expected.

Traces.zip

Versus Windows (0.1% cpu in async mode, 12700k)

TraceAsyncWindows.zip
Of course the windows trace looks very different as it uses IOCompletion

I ran some tests using a similarly spec'd windows 10 hyper-v VM and got pretty lousy numbers. Both Synchronous and Async used about 50% of the cpu (4 hyper-v cpus on a 5700g) and only got up to about 600 Mbps. While this doesn't invalidate the enormous synchronous vs asynchronous performance discrepancy I observed on linux, it may invalidate the windows vs linux argument. These aren't heavily loaded hypervisors, so I'm surprised to be honest. Python3 got 35% cpu usage and 800 Mbps on the same windows 10 machine.

Using a 5700g with windows 11 (not virtualized) shows a similar 0.3%ish cpu usage at full gigabit traffic for .net 6 async, and slightly higher cpu usage (0.4%ish) for python3 at full gigabit traffic.

On an intel 12400 (non-virtualized) running rocky linux 8.5, I had 40% cpu usage in async and 1% in synchronous on a 100mbit interface, and 47% async vs 1% synchronous on a gigabit interface on the same machine.
