Recall from Operating Systems a high-level overview of what fork(2) does:
- It returns twice, once in the original (parent) process, and once in a new (child) process.
- You can use the return value to distinguish in which copy your code is continuing.
- "Everything" gets copied from the parent to the child (these days, using CoW).
- A very typical use of fork(2) is to then use execve(2) or similar.
Among the things that get copied are memory maps (so that pointers remain valid), which includes globals and mutexes by virtue of them being stored in memory.
The file descriptor table is also copied, although with the explicit note that the fd copies (unlike memory) do not maintain distinct state (e.g. position: a seek in the parent post-fork will inadvertently seek in the child).
Only one thread (the one that calls fork(2)) will remain, all others "disappear."
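The return-twice behavior is easy to see in a few lines of Python (a minimal sketch, not one of this repo's examples):

```python
import os

pid = os.fork()  # returns twice: 0 in the child, the child's pid in the parent
if pid == 0:
    # child process: fork() returned 0
    os._exit(0)
else:
    # parent process: fork() returned the child's pid; reap it
    _, status = os.waitpid(pid, 0)
    exitcode = os.waitstatus_to_exitcode(status)
```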
Basically, because this is the way some old Unix vendors implemented it in the '70s. It got codified into POSIX, and Linux (as well as OS X) obeys that.
If you'd like to follow along, open a Terminal and run make in the root of
this repo. The docs for mentions like fork(2) can be loaded by running man 2 fork to make sure you get the right section, which is displayed in the
upper-left of the viewer. The Linux version of the man pages are also
available online.
Python libraries use multiprocessing to get CPU-bound workloads to parallelize -- they wouldn't use more than one core otherwise because of the GIL. That said, there are actually three backends that multiprocessing can use, best and most well-defined first:
- spawn, which uses the posix_spawn(2) syscall to do a fork+exec and carries basically no state from the parent. As an implementation detail, the command line is customizable, and file descriptors can be deliberately inherited, which is how Python communicates with a child over a pickle-based protocol.
- forkserver, which (when started early enough) first forks a boring child before global state (like using grpc in the parent) has been set up. When the parent needs another child, it talks to the forkserver process, which creates a grandchild that inherits the boring state. Python still communicates with the grandchild over a pickle-based protocol. As mentioned above, using grpc in either the parent or the child is safe-ish, and this is a way you can do both.
- fork, which has general compatibility problems when combined with threads or internal state like buffers or locks, in any language, not just Python. This is simply threads and processes not getting along, except in very narrow cases (where you immediately exec).

The remainder of this section gives simple Python examples, the following one talks about Java, and the rest is in straightforward C.
This code maybe works because there aren't other threads (that we know about):
$ python 0.py
About to fork, locked= False
got lock 84901
got lock 84902

While this code has a 10% chance of deadlock (fairly precisely, when it's locked as we go into the fork):
$ python 0b.py
About to fork, locked= True
got lock 84962
<hang>
Additionally, if you run that on 3.12+ you get a warning that threads are involved and it can deadlock. That's fully accurate.
$ python3.12 0b.py
About to fork, locked= True
/Users/timhatch/code/fork-is-problematic/0b.py:18: DeprecationWarning: This process (pid=85040) is multi-threaded, use of fork() may lead to deadlocks in the child.
rv = os.fork()
got lock 85040
<hang>
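The mechanism behind the hang can be demonstrated without actually deadlocking by using an acquire timeout. This is a sketch (not the repo's 0b.py): a lock that is held at fork time is copied into the child still locked, and nothing in the child will ever release it.

```python
import os
import threading

lock = threading.Lock()
lock.acquire()  # stand-in for "some thread holds the lock as we fork"

pid = os.fork()
if pid == 0:
    # The child's copy of the lock is still locked, and the thread that
    # "owns" it doesn't exist here, so this acquire can never succeed.
    got = lock.acquire(timeout=0.5)
    os._exit(0 if got else 42)

lock.release()  # releasing the parent's copy does not affect the child
_, status = os.waitpid(pid, 0)
exitcode = os.waitstatus_to_exitcode(status)
print("child acquire failed:", exitcode == 42)
```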
With spawn, no state is really inherited in the child, so connection pooling doesn't cause problems (each child has its own connection pool, note different ephemeral ports):
$ python 0c.py
Fetching with <socket.socket fd=3, family=2, type=1, proto=0, laddr=('172.24.10.37', 51984), raddr=('172.217.12.110', 80)>
Read b'HTTP/' 127
Fetching with <socket.socket fd=3, family=2, type=1, proto=0, laddr=('172.24.10.37', 51984), raddr=('172.217.12.110', 80)>
Read b'HTTP/' 127
Fetching with <socket.socket fd=7, family=2, type=1, proto=0, laddr=('172.24.10.37', 51985), raddr=('172.217.12.110', 80)>
Fetching with <socket.socket fd=7, family=2, type=1, proto=0, laddr=('172.24.10.37', 51986), raddr=('172.217.12.110', 80)>
Fetching with <socket.socket fd=7, family=2, type=1, proto=0, laddr=('172.24.10.37', 51987), raddr=('172.217.12.110', 80)>
Read b'HTTP/' 127
Read b'HTTP/' 127
Read b'HTTP/' 127
But with fork, the already-open file descriptor is shared, and the responses get interleaved/merged on this pipeline-enabled, pooled connection:
$ python 0d.py
Fetching with <socket.socket fd=3, family=2, type=1, proto=0, laddr=('172.24.10.37', 51981), raddr=('172.217.12.110', 80)>
Read b'HTTP/' 127
Fetching with <socket.socket fd=3, family=2, type=1, proto=0, laddr=('172.24.10.37', 51981), raddr=('172.217.12.110', 80)>
Read b'HTTP/' 127
Fetching with <socket.socket fd=3, family=2, type=1, proto=0, laddr=('172.24.10.37', 51981), raddr=('172.217.12.110', 80)>
Fetching with <socket.socket fd=3, family=2, type=1, proto=0, laddr=('172.24.10.37', 51981), raddr=('172.217.12.110', 80)>
Fetching with <socket.socket fd=3, family=2, type=1, proto=0, laddr=('172.24.10.37', 51981), raddr=('172.217.12.110', 80)>
Read b'HTTP/' 127
Read b'HTTP/' 130
Read b'P/1.1' 124
For a prefix- or line-oriented protocol (like websockets when not masking, or plaintext HTTP 1.1), you might just receive a different valid response not intended for you. With more binary protocols (like anything over TLS) you are likely to encounter low-level "data corruption" type exceptions that are unpredictable.
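The shared-offset behavior behind this is easy to reproduce with a plain file (a hypothetical sketch, not one of this repo's examples):

```python
import os
import tempfile

f = tempfile.TemporaryFile()
f.write(b"123456")
f.flush()
os.lseek(f.fileno(), 0, os.SEEK_SET)

pid = os.fork()
if pid == 0:
    os.read(f.fileno(), 3)  # child consumes b"123"
    os._exit(0)

os.waitpid(pid, 0)
# The child's read advanced the offset shared through the duplicated fd,
# so the parent resumes at byte 3 even though it never read anything.
rest = os.read(f.fileno(), 3)
print(rest)  # b'456'
```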
Because they don't use fork; they don't need to. The threading model is already concurrent, unlike Python's, and the runtime doesn't even expose a way to fork without using JNI (which should give you pause about whether it's a good idea for your Java code).
I can't find a single page on the Internet that says "here's how you would," because it's so weird; in the words of Paul Bakker, it's "so uncommon, I've never done it even once in my career."
I queried Google for "java jni fork" to try to find some references, and the summary does a pretty good job at capturing the scariness combined with "but, why" (incidentally, also using the word "problematic"):
Using fork() directly from JNI code can be problematic due to the way the JVM manages its internal state. However, if you absolutely need this functionality, here's a general approach:
Important Considerations:
- JVM State: Forking a process duplicates the entire memory space of the parent process, including the JVM's internal state. This can lead to unpredictable behavior and crashes.
- Synchronization: The child process will inherit ... from the parent process, which can lead to synchronization issues and deadlocks.
- Portability: Forking is not supported on all platforms (e.g., Windows).
...
Caution:
- This approach should be used with extreme caution. It is generally not recommended for production environments due to the potential for instability.
- Consider carefully whether you truly need to use fork() in this context.
Quite often, I/O is buffered. What this means is that your call to printf(3)
doesn't have to output immediately -- it can write to an internal memory
buffer, and only output when that is "full" or when the program is "exiting".
Look at the example 1.c which prints something and exits. This outputs the
same, regardless of whether it's being piped.
$ ./1
hi!
$ ./1 | cat
hi!

The reason for this is that there are hooks that get called when using
exit(3) and among those is to flush buffers. If you use the syscall
_exit(2) as 1b.c does, then the output differs.
$ ./1b
hi!
$ ./1b | cat

Notably, signals also interrupt normal exit, as in 1c.c. The output is a
little more verbose because the shell knows it got killed, but the important
part is that hi! is not printed in the second case.
$ ./1c
hi!
zsh: terminated ./1c
$ ./1c | cat
zsh: terminated ./1c |
zsh: done cat

In addition to there being a buffer, that buffer is necessarily in memory. Unless some hook is called, that (potentially incomplete) state is copied to the child in a way that might not be ~idempotent.
$ ./2
parent 58868
child 58868
parent 58868
child 58869

You probably expected the parent line to only output once. However, that
not-yet-output data was copied to the child, and it didn't know any better than
to output it (in both the parent, and the child) when exiting.
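The same duplication can be sketched in Python with an explicit userspace buffer standing in for stdio's (assumed setup, not the repo's 2.c):

```python
import io
import os

r, w = os.pipe()
buf = io.BufferedWriter(io.FileIO(w, "w"))  # userspace buffer, like stdio's
buf.write(b"hello\n")  # sits in the buffer; nothing is in the pipe yet

pid = os.fork()  # the non-empty buffer is copied into the child
if pid == 0:
    buf.flush()  # child flushes its copy: b"hello\n" hits the pipe
    os._exit(0)

os.waitpid(pid, 0)
buf.flush()  # parent flushes the same bytes a second time

data = b""
while len(data) < 12:
    data += os.read(r, 64)
print(data)  # b'hello\nhello\n'
```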
The way around this is once again hooks; if you call fflush(3) then the state
is "reset" and not carried to the child, and it works "correctly." I've used
syscalls (man section 2) here to show what libc is basically doing.
$ ./2b
parent 59022
child 59022
child 59023

While libc does have a concept of atexit hooks, it does not have one for
atfork and we have to use pthread_atfork(3) for that. At least it's part
of POSIX. Note that Python atfork hooks are not called if a C library calls fork(2) directly,
just as atexit hooks don't get called when you bypass the library function
and use _exit(2).
pthread_atfork(3) includes this gem:
In practice, [the task of setting up good state after a fork] is generally too difficult to be practicable.
A common workaround is to store the expected pid in a global variable, and
call getpid() periodically to recognize if you've forked. This is expensive,
and not actually a "fix" -- just "detection." You can't know whether the
parent will ever/timely flush its buffer to just reset in the child, for example.
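The detection idiom looks something like this (hypothetical names; detection only, not a fix):

```python
import os

_pid_at_init = os.getpid()

def forked_since_init() -> bool:
    # Detection only: this tells you a fork happened, but cannot tell you
    # whether the parent ever flushed its copy of any shared state.
    return os.getpid() != _pid_at_init

pid = os.fork()
if pid == 0:
    # in the child, the pid no longer matches
    os._exit(0 if forked_since_init() else 1)

_, status = os.waitpid(pid, 0)
exitcode = os.waitstatus_to_exitcode(status)
```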
OTEL uses both of these in the Python implementation -- it assumes that spans queued pre-fork will be handled by the parent (not guaranteed; it might get killed or interrupted by a signal), but the child starts fresh.
Only the thread that calls fork(2) remains in the child, all others disappear
(with no hook). First, demonstrating that the child doesn't inherit them, we
should get 7 lines (not 11, if they were) from running this:
$ ./3
60144 0 t1
a
60144 0 t2
60144 1 t1
60144 1 t2
b 60145
b 0
However, a bunch of the time (nondeterministically) this hangs. If you attach a
debugger, you'll see it's waiting for a lock -- it was locked in the parent and
because there aren't atfork hooks, that lock remains locked in the child.
If it hangs, trust me that one of the threads was outputting (and thus held the lock) when the fork happened.
Thus, the only safe state for a lock during a fork is either unlocked, or held directly by an atfork hook (as the Python C API does for its own locks). You can't just free all locks when you fork, and neither can you realistically acquire "all" locks in a real application, fork, then release them, even if you had a reliable hook to let you run that code.
If you use buffered I/O, the time spent under the lock is less (so the likelihood of deadlock is less), but it still outputs duplicate entries just as in problem 2.
$ ./3 | cat
a
60501 0 t2
60501 0 t1
b 0
a
60501 0 t2
60501 0 t1
60501 1 t2
60501 1 t1
b 60503

To avoid deadlock, many libraries include workarounds that reinitialize locks in the child regardless of their prior state. As in the previous OTEL example, this trusts the parent to be responsible for any state that existed before the fork. This leads to error-prone code: for example, OTEL also checks the pid, but that check internally uses a lock that doesn't get reinitialized, which I think means it doesn't always work and can deadlock if that lock is held when double-forking and there's another thread in the child.
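In Python, this kind of workaround is usually built on os.register_at_fork (Python 3.7+). A sketch of the reinitialize-in-the-child pattern (hypothetical names, not any particular library's code):

```python
import os
import threading

pool_lock = threading.Lock()

def _fresh_lock_in_child():
    # The hack described above: throw away the (possibly held) copied lock
    # and give the child a brand new, unlocked one.
    global pool_lock
    pool_lock = threading.Lock()

os.register_at_fork(after_in_child=_fresh_lock_in_child)

pool_lock.acquire()  # held at fork time, as in the deadlock examples
pid = os.fork()
if pid == 0:
    ok = pool_lock.acquire(timeout=0.5)  # succeeds: the child has a fresh lock
    os._exit(0 if ok else 1)

pool_lock.release()
_, status = os.waitpid(pid, 0)
exitcode = os.waitstatus_to_exitcode(status)
print("child got lock:", exitcode == 0)
```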
What I'm trying to get across is that these are hacks -- libraries certainly do them, because they benefit users -- but not proper solutions. Hang on, it gets worse.
This example is orthogonal to the lock issues -- file descriptors share state across forked processes. What this means is that reads can interleave.
You might get lucky -- here the two processes read this in some reasonable order, and if you can imagine this being a pipelined request-response, maybe the second one thought it was going to see 123 but saw the equally valid 456 and misassociated a response with a request. You might not even notice this in your application code, depending on what the library does with such things.
$ echo 123456 | ./4
123
456

However, you might not be lucky. Each process still does a read and gets a byte in the correct order, but because the pointer is unexpectedly advanced, sees a confusing set of data.
$ echo 123456 | ./4
124
356

If we were trying to parse a proto, or use encryption, this would likely result in some confusing low-level errors. This applies to files as well as sockets.
As long as you're reading a given fd only in the parent, or only in the child, the fork actually isn't that bad for this case. The problem arises when you do both.
I haven't written up a C example for this, but suffice it to say that many
network libraries store a set of open connections, and use select(2) on them.
This gets worse with connection pooling, because you might keep one around that
could get a (different) second request sent from two different children, with
different encryption state (resulting in confusing errors, rather than just
mismatched replies).
It totally is, although the language might not be scary enough to make this obvious:
fork(2) on OS X says:
The child process has its own copy of the parent's descriptors. These descriptors reference the same underlying objects, so that, for instance, file pointers in file objects are shared between the child and the parent, so that an lseek(2) on a descriptor in the child process can affect a subsequent read or write by the parent.
At the very end it also says
CAVEATS
There are limits to what you can do in the child process. To be totally safe you should restrict yourself to only executing async-signal safe operations until such time as one of the exec functions is called. All APIs, including global data symbols, in any framework or library should be assumed to be unsafe after a fork() unless explicitly documented to be safe or async-signal safe.
The term "async-signal safe" is kind of jargony, but almost no library function is
safe, not even printf(3). See
signal-safety(7)
for the list. The approved use of fork is to exec, which is what clears the
process state to something known-good. The modern posix_spawn(2) is just
that without being a footgun.
The Python os.fork() docs also now state:
We chose to surface [multiple threads existing when you call os.fork()] as a warning, when detectable, to better inform developers of a design problem that the POSIX platform specifically notes as not supported. Even in code that appears to work, it has never been safe to mix threading with os.fork() on POSIX platforms. The CPython runtime itself has always made API calls that are not safe for use in the child process when threads existed in the parent (such as malloc and free).
See this discussion on fork being incompatible with threads for technical details of why we’re surfacing this longstanding platform compatibility problem to developers.
Parallelism is properly supported in Python when using threads (as long as
you're I/O-bound, not CPU-bound), or mp.get_context("spawn") (the default on
OS X and the only option on Windows), and in a pinch,
mp.get_context("forkserver") (but only on Linux -- this isn't the future).
Those represent two extremes of what gets shared -- in threads, you can
directly refer to objects that can't (easily) be recreated, while in spawn or
forkserver they need to all be pickleable. Not all objects are pickleable, in
particular inner functions because they are closures. There are generally ways
to rewrite these to use top-level functions along with explicit currying
(functools.partial) which can be pickled.
Since 3.13 (Oct 2024) there is a functional nogil interpreter available
upstream, which allows you to get full concurrency using interpreter threads
without arbitrary limits.
This is an ABI change, so native code needs to be recompiled (and in some
cases, remove their usage of fork), so this will take some time to become
viable. Scientific libs like numpy and scipy already support it, but as of
Oct 2024 the popular cryptography project fails to even build on it.
As long as one fork(2) or os.fork() or mp.set_start_method("fork") or
mp.get_context("fork") remains, we can still have this problem. That's why
the default on OS X changed to spawn (to make using fork at least opt-in),
and I personally hope that the default is changed on Linux as well.
Well-behaved projects like trailrunner already use spawn
because it provides consistent cross-platform behavior, a lesson learned the hard way.