Subject: Network Stack Locking
This is a status e-mail; don't sweat it.
The high level view, for those less willing to wade through a greater
level of detail, is that we have a substantial work in progress with a
lot of our bases covered, and that we're looking for broader exposure
for the work. We've been merging smaller parts of the work (supporting
infrastructure, fine-grained locking for specific leaf dependencies),
and are starting to think about larger scale merging over the next
month or two. There are some known serious issues in the current work,
but we've also identified some areas that need attention outside of the
stack in order to make serious progress on merging. There are also some
important tasks that still require owners, so this e-mail also serves as
a solicitation for help in those areas. I don't attempt to capture
everything here, in particular details of the locking strategies. You
will find patch URLs and Perforce references below.
As many of you are aware, I've become the latest inheritor of the omnibus
"Network Stack Locking" task of SMPng. This work has a pretty long
history that I won't attempt to go into here, other than to observe that
the vast majority of work that will be discussed in this e-mail is the
product of significant contributions by others, including Jonathan Lemon,
Jennifer Yang, Jeffrey Hsu, Sam Leffler, and a large number of other
contributors (many of whom are named in recent status reports).
The goal of this e-mail is to provide a bit of high level information
about what is going on to increase awareness, solicit involvement in a
variety of areas, and throw around words like "merge schedule". Warning:
this is a work in progress, and you will find rough parts. It is being
worked on actively, but by bringing it up while still in progress, we can
improve the work. If you see things that scare you, that's a reasonable
response.
Now into the details:
Those following the last few status reports will know that recent work has
focused in the following areas:
- Introducing and refining data-based locking for the top levels of the
network stack (sockets, socket buffers, et al).
- Refining and testing locking for lower pieces of the stack that already
have locking.
- Locking for UNIX domain sockets, FIFOs, etc.
- Iterating through pseudo-interfaces and network interfaces to identify
and correct locking problems.
- Allowing Giant to be conditionally acquired across the entire stack using
a Giant Toggle Switch (a minimal sketch of this follows the list below).
- Addressing interactions with tightly coupled support infrastructure for
the stack, including the MAC Framework, kqueue, SIGIO, select(), and
other general signaling primitives.
- Investigating, and in many cases locking, less popular and less widely
used stack components that were previously unaddressed, such as IPv6,
netatalk, and netipx.
- Making some local changes to monitor and assert locks at a finer
granularity than in the main tree -- specifically, sampling of callouts
and timeouts to measure what we're grabbing Giant for, and, in certain
branches, the addition of a great many assertions.
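To make the Giant Toggle Switch above a little more concrete, here is a
minimal sketch of one way conditional Giant acquisition can be wrapped.
The sysctl name debug.mpsafenet appears later in this e-mail; the C-level
names below (debug_mpsafenet, NET_LOCK_GIANT(), NET_UNLOCK_GIANT()) are
illustrative assumptions, not a description of the exact code in the
patches:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    /* Illustrative only: toggle consulted by the wrappers below. */
    extern int debug_mpsafenet;     /* exposed as debug.mpsafenet */

    /* Acquire Giant around network entry points only when the stack
     * is not marked MPSAFE. */
    #define NET_LOCK_GIANT() do {                           \
            if (!debug_mpsafenet)                           \
                    mtx_lock(&Giant);                       \
    } while (0)

    #define NET_UNLOCK_GIANT() do {                         \
            if (!debug_mpsafenet)                           \
                    mtx_unlock(&Giant);                     \
    } while (0)

Code paths that already run without Giant can then call these wrappers at
the stack's entry points, and the wrappers become no-ops once the toggle
defaults to on.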
This work is occurring in a number of Perforce branches. The primary
branch that is actively worked on is "rwatson_netperf", which may be found
at the following path:
//depot/users/rwatson/netperf/...
Additional work is taking place to explore socket locking issues in:
//depot/users/rwatson/net2/...
A number of other developers have branches off of these branches to
explore locking for particular subsystems. There are also some larger
unintegrated patch sets for data-based NFS locking, fixing the user space
build, etc. You can find a non-Perforce version at:
http://www.watson.org/~robert/freebsd/netperf/
This includes a basic change log and incrementally generated patches, work
sets, etc. Perforce is the preferred way to get to the work as it
provides easier access to my working notes, the ability to maintain local
changes, get the most recent version, etc. I try to drop patches fairly
regularly -- several times a week against HEAD, but due to travel to
BSDCan, I'm about two weeks behind. I hope to make substantial headway
this weekend in updating the patch set and integrating a number of recent
socket locking changes from various work branches.
This is still very much a work in progress, and has a number of known
issues, including some lock order reversal problems, known deficiencies
in socket locking coverage of socket variables, etc. However, it has been
reviewed and worked on by an increasingly broad population of FreeBSD
developers, so I wanted to move to a more general patch posting process
and attempt to identify additional "hired hands" for areas that require
additional work. Here are the current known tasks and their owners:
Task                              Developer
----                              ---------
Sockets                           Robert Watson
Synthetic network interfaces      Robert Watson
Netinet6                          George Neville-Neil
Netatalk                          Robert Watson
Netipx                            Robert Watson
Interface Locking                 Max Laier, Luigi Rizzo,
                                  Maurycy Pawlowski-Wieronski,
                                  Brooks Davis
Routing Cleanup                   Luigi Rizzo
KQueue (subsystem lock)           Brian Feldman
KQueue (data locking)             John-Mark Gurney
NFS Server (subsystem lock)       Robert Watson
NFS Server (data locking)         Rick Macklem
SPPP                              Roman Kurakin
Userspace build                   Roman Kurakin
VFS/fifofs interactions           Don Lewis
Performance measurement           Pawel Jakub Dawidek
And of course, I can't neglect to mention the on-going work of Kris
Kennaway to test out these changes on high-load systems :-).
Some notable absences in the above, and areas where I'd like to see
additional people helping out, are:
- Reviewing Netgraph modules for correct interactions with locking in the
remainder of the system. I've started pushing some locking into
ng_ksocket.c and ng_socket.c, and some of the basic infrastructure that
needed it, but each module will need to be reviewed for correct locking.
- ATM -- Harti? :-)
- Network device drivers -- some have locking, some have correct locking,
some have potential interactions with other pieces of the system (such
as the USB stack). Note that for a driver to work correctly with a
Giant-free system, it must be safe to invoke ifp->if_start() without
holding Giant, and if_start() must be aware that it cannot acquire Giant
without generating a lock order issue. It's OK for if_input() to be
called with Giant held, although generally undesirable. Some drivers also
have locking that is commented out by default due to use of recursive
locks, but I'm not sure that is a serious enough problem to justify
leaving the locking turned off. (A sketch of the expected if_start()
structure follows this list.)
- Complete coverage of synthetic/pseudo-interfaces. In particular,
careful addressing of if_gif and other "cross-layer" and protocol aware
pieces.
- mbuma -- Bosko's work looks good to me; we need to make sure all the
pieces work with each other. Getting down to one large memory allocator
would be great. I'm interested in exploring uniprocessor optimizations
here -- I notice that a lot of the locks being acquired in profiling
are for memory allocation. Exploring critical sections, per-CPU
variables/caching, and pinning all seem like reasonable approaches to
reducing synchronization costs here.
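As promised in the device driver item above, here is a minimal sketch of
the structure expected of an if_start() routine in a Giant-free world.
The driver name "xx" and its softc layout are hypothetical, and real
drivers have more moving parts; the point is simply that the routine
relies only on its own mutex and never touches Giant:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/mbuf.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/if_var.h>

    /* Hypothetical driver softc: the driver's own mutex protects its
     * transmit state, so if_start() never needs Giant. */
    struct xx_softc {
            struct mtx      sc_mtx;
            struct ifnet    *sc_ifp;
    };

    static void
    xx_start(struct ifnet *ifp)
    {
            struct xx_softc *sc = ifp->if_softc;
            struct mbuf *m;

            mtx_lock(&sc->sc_mtx);          /* driver lock, not Giant */
            for (;;) {
                    IF_DEQUEUE(&ifp->if_snd, m);
                    if (m == NULL)
                            break;
                    /* ... hand 'm' to the hardware for transmission ... */
            }
            mtx_unlock(&sc->sc_mtx);
    }

Because callers may now invoke ifp->if_start() without holding Giant, any
attempt by the driver to acquire Giant inside this path risks the lock
order problems mentioned above.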
Note that there are some serious issues with the current locking changes:
- Socket locking is deficient in a number of ways -- primarily that there
are several important socket fields that are currently insufficiently or
inconsistently synchronized. I'm in the throes of correcting this, but
that requires a line-by-line review of all uses of sockets, which will
take me at least another week or two to complete. I'm also addressing
some races between listen sockets and the sockets hung off of them
during the new connection setup and accept process. Currently there is
no defined lock order between multiple sockets, and if possible I'd like
to keep it that way.
- Based on the BSD/OS strategy, there are two mutexes on a socket: each
socket buffer has a mutex (send, receive), and the basic socket fields
are locked using SOCK_LOCK(), which actually uses the receive socket
buffer mutex (a sketch of this arrangement follows the list of issues
below). This reduces the locking overhead while helping to address
ordering issues in the upward and downward paths. However, there are
also some issues of locking correctness and redundancy, and I'm looking
into these as part of an overall review of the strategy. It's worth
noting that the BSD/OS snapshot we have has substantially incomplete and
non-functional socket locking, so unlike some other pieces of the network
stack, it was not possible to adopt the strategy wholesale. In the long
term, the socket locking model may require substantial revision.
- Per some recent discussions on -CURRENT, I've been exploring mitigating
locking costs through coalescing activities on multiple packets. I.e.,
effectively passing in queues of packet chains across API boundaries, as
well as creating local work queues. It's a bit early to commit to this
approach because the performance numbers have not confirmed the benefit,
but it's important to keep that possible approach in mind across all
other locking work, as it trades work queue latency against
synchronization cost. My earlier experimentation occurred at the end of
2003, so I hope to revisit it now that more of the locking is in place
to offer us advantages in preemption and parallelism.
- The current patches enable net.isr.enable by default, which provides
inbound packet parallelism by running to completion in the ithread. This
has other downsides, and while we should provide the option, I think we
should continue to support forcing use of the netisr. One of the
problems with the netisr approach is how to accomplish inbound
processing parallelism without sacrificing the currently strong ordering
properties, which could cause bad TCP behavior, etc. We should seriously
consider at least some aspects of Jeffrey Hsu's work on DragonFly
to explore providing for multiple netisrs bound to CPUs, then directing
traffic based on protocol-aware hashing that permits us to maintain
sufficient ordering to meet higher level protocol requirements while
avoiding the cost of maintaining full ordering. This isn't something we
have to do immediately, but exploiting parallelism requires both
effective synchronization and effective balancing of load.
In the short term, I'm less interested in the avoidance of
synchronization of data adopted in the DragonFly approach, since I'd
like to see that approach validated on a larger chunk of the stack
(i.e., across the more incestuous pieces of the network stack), and also
to see performance numbers that confirm the claims. The approach we're
currently taking is tried and true across a broad array of systems
(almost every commercial UNIX vendor, for example), and offers many
benefits (such as a very strong assertion model). However, as aspects
of the DFBSD approach are validated (or not, as the case may be), we
should consider adopting things as they make sense. The approaches
offer quite a bit of promise, but are also very experimental and will
require a lot of validation, needless to say.
- There are still some serious issues in the timely processing and
scheduling of device driver interrupts, and these affect performance in
a number of ways. They also change the degree of effective coalescing
of interrupts, making it harder to evaluate strategies to lower costs.
These issues aren't limited to the network stack work, but I wanted to
make sure they were on the list of concerns.
- There are issues relating to upcalls from the socket layer: while many
consumers of sockets simply sleep for wakeups on socket pointers,
so_upcall() permits the network stack to "upcall" into other components
of the system. I believe this was introduced initially for the NFS
server to allow initial processing of RPCs to occur in the netisr rather
than waiting on a context switch to the NFS server threads. However,
it's now also used for accept sockets, and I'm aware of outstanding
changes that modify the NFS client to use it as well. We need to
establish what locks will be held over the upcall, if any, and what
expectations are in place for implementers of upcall functions. At the
very least, they have to be MPSAFE, but there are also potential lock
order issues.
- Locking for KQueue is critical to success. Without locking down the
event infrastructure, we can't remove Giant from the many interesting
pieces of the network stack. KQueue is an example of a high level of
incestuousness between levels, and will require careful handling.
Brian's approach adopts a "single subsystem" lock for KQueue and as such
offers a low-hanging-fruit approach, but comes at a number of costs, not
least of which are loss of parallelism and functionality. John-Mark's
approach appears to offer more granular locking with higher parallelism,
but at the cost of complexity. I've not yet had the opportunity to
review either in any detail, but I know Brian has integrated a work
branch in Perforce that combines his locking with the locking in
rwatson_netperf so that the two can be tested together. There's
obviously more work to do here, and it is required to get to "Giant-free
operation".
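To illustrate the socket locking arrangement mentioned earlier (two
socket buffer mutexes, with the general socket lock reusing the receive
buffer's mutex), here is a simplified sketch. The structure layout and
macro bodies are abbreviated for illustration and are not a verbatim copy
of what is in the patches:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    /* Simplified: each socket buffer carries its own mutex. */
    struct sockbuf {
            struct mtx      sb_mtx;         /* protects buffer fields */
            /* ... counters, mbuf chain, flags ... */
    };

    struct socket {
            struct sockbuf  so_rcv;         /* receive buffer */
            struct sockbuf  so_snd;         /* send buffer */
            /* ... other fields, protected by SOCK_LOCK() ... */
    };

    #define SOCKBUF_LOCK(sb)        mtx_lock(&(sb)->sb_mtx)
    #define SOCKBUF_UNLOCK(sb)      mtx_unlock(&(sb)->sb_mtx)

    /* The general socket lock is simply the receive buffer's mutex,
     * keeping the mutex count per socket at two. */
    #define SOCK_LOCK(so)           SOCKBUF_LOCK(&(so)->so_rcv)
    #define SOCK_UNLOCK(so)         SOCKBUF_UNLOCK(&(so)->so_rcv)

One attraction of this arrangement is that code already holding the
receive buffer lock implicitly holds the socket lock, reducing lock
operations on common paths; the correctness and redundancy questions
raised above are about exactly where that sharing is and is not safe.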
For more complete changes and history, I would refer you to the last few
FreeBSD Status Reports on network stack locking. I would also encourage
you to contact me if you would like to claim some section of the stack for
work so I can coordinate activities. These patch sets have been pounded
heavily in a wide variety of environments, but there are several known
issues, so I would recommend using them cautiously.
In terms of merging: I've been gradually merging a lot of the
infrastructure pieces as I went along. The next big chunks to consider
merging are:
- Socket locking. This needs to wait until I'm more happy with the
strategy.
- UNIX domain socket locking. This is probably an early candidate, but
because of potential interactions with socket locking changes, I've been
deferring the merge.
- NFS server locking. I had planned to merge the current subsystem lock
quickly, but when I asked Rick to review the subsystem lock, he turned up
with fine-grained data-based locking of the NFS server and NFSv4 server
code, so I've been holding off.
- Additional general infrastructure, such as more pseudo-interface
locking, fifofs stuff, etc. I'll continue on the gradual incremental
merge path as I have been for the past few months.
It's obviously desirable to get things merged as soon as they are ready,
even with Giant remaining over the stack, so that we can get broad
exercising of the locking assertions in INVARIANTS and WITNESS. As such,
over the next month I anticipate an increasing number of merges, and
increasing usability of "debug.mpsafenet" in the main tree. Turning off
Giant will likely lead to problems for some time to come, but the sooner
we get exposure, the better life will be. We've done a lot of heavy
testing of common code paths, but working out the edge cases will take
some time. We're prepared to live in a world with a dual-mode stack for
some period, but that has to be an interim measure.
So I guess the upshot is "Stuff is going on, be aware, volunteer to
help!".
Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org Senior Research Scientist, McAfee Research