Subject: Network Stack Locking
This is a status e-mail; don't sweat it.
The high level view, for those less willing to wade through a greater
level of detail, is that we have a substantial work in progress with a
lot of our bases covered, and that we're looking for broader exposure
for the work. We've been merging smaller parts of the work (supporting
infrastructure, fine-grained locking for specific leaf dependencies),
and are starting to think about larger scale merging over the next
month or two. There are some known serious issues in the current work,
but we've also identified some areas that need attention outside of the
stack in order to make serious progress on merging. There are also some
important tasks that still require owners, so this e-mail also serves as
a solicitation for help in those areas. I don't attempt to capture
everything here, in particular details of the locking strategies. You
will find patch URLs and Perforce references below.
As many of you are aware, I've become the latest inheritor of the omnibus
"Network Stack Locking" task of SMPng. This work has a pretty long
history that I won't attempt to go into here, other than to observe that
the vast majority of work that will be discussed in this e-mail is the
product of significant contributions by others, including Jonathan Lemon,
Jennifer Yang, Jeffrey Hsu, Sam Leffler, and a large number of other
contributors (many of whom are named in recent status reports).
The goal of this e-mail is to provide a bit of high level information
about what is going on to increase awareness, solicit involvement in a
variety of areas, and throw around words like "merge schedule". Warning:
this is a work in progress, and you will find rough parts. It is being
worked on actively, but by bringing it up while still in progress, we can
improve the work. If you see things that scare you, that's a reasonable
response.
Now into the details:
Those following the last few status reports will know that recent work has
focused in the following areas:
- Introducing and refining data-based locking for the top levels of the
network stack (sockets, socket buffers, et al).
- Refining and testing locking for lower pieces of the stack that already
have locking.
- Locking for UNIX domain sockets, FIFOs, etc.
- Iterating through pseudo-interfaces and network interfaces to identify
and correct locking problems.
- Allowing Giant to be conditionally acquired across the entire stack using
a Giant Toggle Switch (a minimal sketch of this follows the list below).
- Addressing interactions with tightly coupled support infrastructure for
the stack, including the MAC Framework, kqueue, SIGIO, select(), and
other general signaling primitives.
- Investigating, and in many cases locking, less popular and less widely
used stack components that were previously unaddressed, such as IPv6,
netatalk, and netipx.
- Making some local changes to monitor and assert locks at a finer
granularity than in the main tree -- specifically, sampling of callouts
and timeouts to measure what we're grabbing Giant for, and, in certain
branches, the addition of a great many assertions.
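To make the Giant Toggle Switch above a little more concrete, here is a
minimal sketch of one way conditional Giant acquisition can be wrapped.
The sysctl name debug.mpsafenet appears later in this e-mail; the C-level
names below (debug_mpsafenet, NET_LOCK_GIANT(), NET_UNLOCK_GIANT()) are
illustrative assumptions, not a description of the exact code in the
patches:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    /* Illustrative only: toggle consulted by the wrappers below. */
    extern int debug_mpsafenet;     /* exposed as debug.mpsafenet */

    /* Acquire Giant around network entry points only when the stack
     * is not marked MPSAFE. */
    #define NET_LOCK_GIANT() do {                           \
            if (!debug_mpsafenet)                           \
                    mtx_lock(&Giant);                       \
    } while (0)

    #define NET_UNLOCK_GIANT() do {                         \
            if (!debug_mpsafenet)                           \
                    mtx_unlock(&Giant);                     \
    } while (0)

Code paths that already run without Giant can then call these wrappers at
the stack's entry points, and the wrappers become no-ops once the toggle
defaults to on.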
This work is occurring in a number of Perforce branches. The primary
branch that is actively worked on is "rwatson_netperf", which may be found
at the following path:
//depot/users/rwatson/netperf/...
Additional work is taking place to explore socket locking issues in:
//depot/users/rwatson/net2/...
A number of other developers have branches off of these branches to
explore locking for particular subsystems. There are also some larger
unintegrated patch sets for data-based NFS locking, fixing the user space
build, etc. You can find a non-Perforce version at:
http://www.watson.org/~robert/freebsd/netperf/
This includes a basic change log and incrementally generated patches, work
sets, etc. Perforce is the preferred way to get to the work as it
provides easier access to my working notes, the ability to maintain local
changes, get the most recent version, etc. I try to drop patches fairly
regularly -- several times a week against HEAD, but due to travel to
BSDCan, I'm about two weeks behind. I hope to make substantial headway
this weekend in updating the patch set and integrating a number of recent
socket locking changes from various work branches.
This is still very much a work in progress, and has a number of known
issues, including some lock order reversal problems, known deficiencies
in socket locking coverage of socket variables, etc. However, it has been
reviewed and worked on by an increasingly broad population of FreeBSD
developers, so I wanted to move to a more general patch posting process
and attempt to identify additional "hired hands" for areas that require
additional work. Here are the current known tasks and their owners:
Task                              Developer
----                              ---------
Sockets                           Robert Watson
Synthetic network interfaces      Robert Watson
Netinet6                          George Neville-Neil
Netatalk                          Robert Watson
Netipx                            Robert Watson
Interface Locking                 Max Laier, Luigi Rizzo,
                                  Maurycy Pawlowski-Wieronski,
                                  Brooks Davis
Routing Cleanup                   Luigi Rizzo
KQueue (subsystem lock)           Brian Feldman
KQueue (data locking)             John-Mark Gurney
NFS Server (subsystem lock)       Robert Watson
NFS Server (data locking)         Rick Macklem
SPPP                              Roman Kurakin
Userspace build                   Roman Kurakin
VFS/fifofs interactions           Don Lewis
Performance measurement           Pawel Jakub Dawidek
And of course, I can't neglect to mention the on-going work of Kris
Kennaway to test out these changes on high-load systems :-).
Some notable absences in the above, and areas where I'd like to see
additional people helping out, are:
- Reviewing Netgraph modules for correct interactions with locking in the
remainder of the system. I've started pushing some locking into
ng_ksocket.c and ng_socket.c, and some of the basic infrastructure that
needed it, but each module will need to be reviewed for correct locking.
- ATM -- Harti? :-)
- Network device drivers -- some have locking, some have correct locking,
some have potential interactions with other pieces of the system (such
as the USB stack). Note that for a driver to work correctly with a
Giant-free system, it must be safe to invoke ifp->if_start() without
holding Giant, and if_start() must be aware that it cannot acquire Giant
without generating a lock order issue. It's OK for if_input() to be
called with Giant held, although generally undesirable. Some drivers also
have locking that is commented out by default due to use of recursive
locks, but I'm not sure that is a serious enough problem to justify
leaving the locking turned off. (A sketch of the expected if_start()
structure follows this list.)
- Complete coverage of synthetic/pseudo-interfaces. In particular,
careful addressing of if_gif and other "cross-layer" and protocol aware
pieces.
- mbuma -- Bosko's work looks good to me; we need to make sure all the
pieces work with each other. Getting down to one large memory allocator
would be great. I'm interested in exploring uniprocessor optimizations
here -- I notice that a lot of the locks being acquired in profiling
are for memory allocation. Exploring critical sections, per-CPU
variables/caching, and pinning all seem like reasonable approaches to
reducing synchronization costs here.
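As promised in the device driver item above, here is a minimal sketch of
the structure expected of an if_start() routine in a Giant-free world.
The driver name "xx" and its softc layout are hypothetical, and real
drivers have more moving parts; the point is simply that the routine
relies only on its own mutex and never touches Giant:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/mbuf.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/if_var.h>

    /* Hypothetical driver softc: the driver's own mutex protects its
     * transmit state, so if_start() never needs Giant. */
    struct xx_softc {
            struct mtx      sc_mtx;
            struct ifnet    *sc_ifp;
    };

    static void
    xx_start(struct ifnet *ifp)
    {
            struct xx_softc *sc = ifp->if_softc;
            struct mbuf *m;

            mtx_lock(&sc->sc_mtx);          /* driver lock, not Giant */
            for (;;) {
                    IF_DEQUEUE(&ifp->if_snd, m);
                    if (m == NULL)
                            break;
                    /* ... hand 'm' to the hardware for transmission ... */
            }
            mtx_unlock(&sc->sc_mtx);
    }

Because callers may now invoke ifp->if_start() without holding Giant, any
attempt by the driver to acquire Giant inside this path risks the lock
order problems mentioned above.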
Note that there are some serious issues with the current locking changes:
- Socket locking is deficient in a number of ways -- primarily that there
are several important socket fields that are currently insufficiently or
inconsistently synchronized. I'm in the throes of correcting this, but
that requires a line-by-line review of all uses of sockets, which will
take me at least another week or two to complete. I'm also addressing
some races between listen sockets and the sockets hung off of them
during the new connection setup and accept process. Currently there is
no defined lock order between multiple sockets, and if possible I'd like
to keep it that way.
- Based on the BSD/OS strategy, there are two mutexes on a socket: each
socket buffer has a mutex (send, receive), and the basic socket fields
are locked using SOCK_LOCK(), which actually uses the receive socket
buffer mutex (a sketch of this arrangement follows the list of issues
below). This reduces the locking overhead while helping to address
ordering issues in the upward and downward paths. However, there are
also some issues of locking correctness and redundancy, and I'm looking
into these as part of an overall review of the strategy. It's worth
noting that the BSD/OS snapshot we have has substantially incomplete and
non-functional socket locking, so unlike some other pieces of the network
stack, it was not possible to adopt the strategy wholesale. In the long
term, the socket locking model may require substantial revision.
- Per some recent discussions on -CURRENT, I've been exploring mitigating
locking costs through coalescing activities on multiple packets. I.e.,
effectively passing in queues of packet chains across API boundaries, as
well as creating local work queues. It's a bit early to commit to this
approach because the performance numbers have not confirmed the benefit,
but it's important to keep that possible approach in mind across all
other locking work, as it trades work queue latency against
synchronization cost. My earlier experimentation occurred at the end of
2003, so I hope to revisit it now that more of the locking is in place
to offer us advantages in preemption and parallelism.
- The current patches enable net.isr.enable by default, which provides
inbound packet parallelism by running to completion in the ithread. This
has other downsides, and while we should provide the option, I think we
should continue to support forcing use of the netisr. One of the
problems with the netisr approach is how to accomplish inbound
processing parallelism without sacrificing the currently strong ordering
properties, which could cause bad TCP behavior, etc. We should seriously
consider at least some aspects of Jeffrey Hsu's work on DragonFly
to explore providing for multiple netisrs bound to CPUs, then directing
traffic based on protocol-aware hashing that permits us to maintain
sufficient ordering to meet higher level protocol requirements while
avoiding the cost of maintaining full ordering. This isn't something we
have to do immediately, but exploiting parallelism requires both
effective synchronization and effective balancing of load.
In the short term, I'm less interested in the avoidance of
synchronization of data adopted in the DragonFly approach, since I'd
like to see that approach validated on a larger chunk of the stack
(i.e., across the more incestuous pieces of the network stack), and also
to see performance numbers that confirm the claims. The approach we're
currently taking is tried and true across a broad array of systems
(almost every commercial UNIX vendor, for example), and offers many
benefits (such as a very strong assertion model). However, as aspects
of the DFBSD approach are validated (or not, as the case may be), we
should consider adopting things as they make sense. The approaches
offer quite a bit of promise, but are also very experimental and will
require a lot of validation, needless to say.
- There are still some serious issues in the timely processing and
scheduling of device driver interrupts, and these affect performance in
a number of ways. They also change the degree of effective coalescing
of interrupts, making it harder to evaluate strategies to lower costs.
These issues aren't limited to the network stack work, but I wanted to
make sure they were on the list of concerns.
- There are issues relating to upcalls from the socket layer: while many
consumers of sockets simply sleep for wakeups on socket pointers,
so_upcall() permits the network stack to "upcall" into other components
of the system. I believe this was introduced initially for the NFS
server to allow initial processing of RPCs to occur in the netisr rather
than waiting on a context switch to the NFS server threads. However,
it's now also used for accept sockets, and I'm aware of outstanding
changes that modify the NFS client to use it as well. We need to
establish what locks will be held over the upcall, if any, and what
expectations are in place for implementers of upcall functions. At the
very least, they have to be MPSAFE, but there are also potential lock
order issues.
- Locking for KQueue is critical to success. Without locking down the
event infrastructure, we can't remove Giant from the many interesting
pieces of the network stack. KQueue is an example of a high level of
incestuousness between levels, and will require careful handling.
Brian's approach adopts a "single subsystem" lock for KQueue and as such
offers a low-hanging-fruit approach, but comes at a number of costs, not
least of which are loss of parallelism and functionality. John-Mark's
approach appears to offer more granular locking with higher parallelism,
but at the cost of complexity. I've not yet had the opportunity to
review either in any detail, but I know Brian has integrated a work
branch in Perforce that combines his locking with the locking in
rwatson_netperf so that the two can be tested together. There's
obviously more work to do here, and it is required to get to "Giant-free
operation".
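To illustrate the socket locking arrangement mentioned earlier (two
socket buffer mutexes, with the general socket lock reusing the receive
buffer's mutex), here is a simplified sketch. The structure layout and
macro bodies are abbreviated for illustration and are not a verbatim copy
of what is in the patches:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    /* Simplified: each socket buffer carries its own mutex. */
    struct sockbuf {
            struct mtx      sb_mtx;         /* protects buffer fields */
            /* ... counters, mbuf chain, flags ... */
    };

    struct socket {
            struct sockbuf  so_rcv;         /* receive buffer */
            struct sockbuf  so_snd;         /* send buffer */
            /* ... other fields, protected by SOCK_LOCK() ... */
    };

    #define SOCKBUF_LOCK(sb)        mtx_lock(&(sb)->sb_mtx)
    #define SOCKBUF_UNLOCK(sb)      mtx_unlock(&(sb)->sb_mtx)

    /* The general socket lock is simply the receive buffer's mutex,
     * keeping the mutex count per socket at two. */
    #define SOCK_LOCK(so)           SOCKBUF_LOCK(&(so)->so_rcv)
    #define SOCK_UNLOCK(so)         SOCKBUF_UNLOCK(&(so)->so_rcv)

One attraction of this arrangement is that code already holding the
receive buffer lock implicitly holds the socket lock, reducing lock
operations on common paths; the correctness and redundancy questions
raised above are about exactly where that sharing is and is not safe.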
For more complete changes and history, I would refer you to the last few
FreeBSD Status Reports on network stack locking. I would also encourage
you to contact me if you would like to claim some section of the stack for
work so I can coordinate activities. These patch sets have been pounded
heavily in a wide variety of environments, but there are several known
issues, so I would recommend using them cautiously.
In terms of merging: I've been gradually merging a lot of the
infrastructure pieces as I went along. The next big chunks to consider
merging are:
- Socket locking. This needs to wait until I'm more happy with the
strategy.
- UNIX domain socket locking. This is probably an early candidate, but
because of potential interactions with socket locking changes, I've been
deferring the merge.
- NFS server locking. I had planned to merge the current subsystem lock
quickly, but when I asked Rick to review the subsystem lock, he turned up
with fine-grained data-based locking of the NFS server and NFSv4 server
code, so I've been holding off.
- Additional general infrastructure, such as more pseudo-interface
locking, fifofs stuff, etc. I'll continue on the gradual incremental
merge path as I have been for the past few months.
It's obviously desirable to get things merged as soon as they are ready,
even with Giant remaining over the stack, so that we can get broad
exercising of the locking assertions in INVARIANTS and WITNESS. As such,
over the next month I anticipate an increasing number of merges, and
increasing usability of "debug.mpsafenet" in the main tree. Turning off
Giant will likely lead to problems for some time to come, but the sooner
we get exposure, the better life will be. We've done a lot of heavy
testing of common code paths, but working out the edge cases will take
some time. We're prepared to live in a world with a dual-mode stack for
some period, but that has to be an interim measure.
So I guess the upshot is "Stuff is going on, be aware, volunteer to
help!".
Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org Senior Research Scientist, McAfee Research