diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 152bb318ce..3f3ab4d5f9 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -3,19 +3,17 @@ Want to contribute? Great! First, read this page. ### Before you contribute Before we can use your code, you must sign the -[Google Individual Contributor License Agreement] -(https://cla.developers.google.com/about/google-individual) -(CLA), which you can do online. The CLA is necessary mainly because you own the -copyright to your changes, even after your contribution becomes part of our -codebase, so we need your permission to use and distribute your code. We also -need to be sure of various other things—for instance that you'll tell us if you -know that your code infringes on other people's patents. You don't have to sign -the CLA until after you've submitted your code for review and a member has -approved it, but you must do it before we can put your code into our codebase. -Before you start working on a larger contribution, you should get in touch with -us first through the issue tracker with your idea so that we can help out and -possibly guide you. Coordinating up front makes it much easier to avoid -frustration later on. +[Google Individual Contributor License Agreement][gcla] (CLA), which you can do +online. The CLA is necessary mainly because you own the copyright to your +changes, even after your contribution becomes part of our codebase, so we need +your permission to use and distribute your code. We also need to be sure of +various other things—for instance that you'll tell us if you know that your +code infringes on other people's patents. You don't have to sign the CLA until +after you've submitted your code for review and a member has approved it, but +you must do it before we can put your code into our codebase. Before you start +working on a larger contribution, you should get in touch with us first through +the issue tracker with your idea so that we can help out and possibly guide you. 
+Coordinating up front makes it much easier to avoid frustration later on. ### Coding Guidelines All code should conform to the [Go style guidelines][gostyle]. @@ -70,4 +68,5 @@ the one above, the [Software Grant and Corporate Contributor License Agreement] (https://cla.developers.google.com/about/google-corporate). +[gcla]: https://cla.developers.google.com/about/google-individual [gostyle]: https://github.com/golang/go/wiki/CodeReviewComments diff --git a/README.md b/README.md index b7e2c69021..65075de0e0 100644 --- a/README.md +++ b/README.md @@ -65,8 +65,6 @@ defense-in-depth. **gVisor** provides a third isolation mechanism, distinct from those mentioned above. -![gVisor](g3doc/Layers.png "gVisor") - gVisor intercepts application system calls and acts as the guest kernel, without the need for translation through virtualized hardware. gVisor may be thought of as either a merged guest kernel and VMM, or as seccomp on steroids. This @@ -75,6 +73,8 @@ on threads and memory mappings, not fixed guest physical resources) while also lowering the fixed costs of virtualization. However, this comes at the price of reduced application compatibility and higher per-system call overhead. +![gVisor](g3doc/Layers.png "gVisor") + On top of this, gVisor employs rule-based execution to provide defense-in-depth (details below). @@ -106,8 +106,6 @@ application to directly control the system calls it makes. ### File System Access -![Sentry](g3doc/Sentry-Gofer.png "Sentry and Gofer") - In order to provide defense-in-depth and limit the host system surface, the gVisor container runtime is normally split into two separate processes. First, the *Sentry* process includes the kernel and is responsible for executing user @@ -115,6 +113,8 @@ code and handling system calls. Second, file system operations that extend beyon the sandbox (not internal proc or tmp files, pipes, etc.) are sent to a proxy, called a *Gofer*, via a 9P connection. 
+![Sentry](g3doc/Sentry-Gofer.png "Sentry and Gofer") + The Gofer acts as a file system proxy by opening host files on behalf of the application, and passing them to the Sentry process, which has no host file access itself. Furthermore, the Sentry runs in an empty user namespace, and the @@ -142,12 +142,13 @@ mapping functionality. Today, gVisor supports two platforms: executing host system calls. This platform can run anywhere that `ptrace` works (even VMs without nested virtualization). -* The **KVM** platform allows the Sentry to act as both guest OS and VMM, - switching back and forth between the two worlds seamlessly. The KVM platform - can run on bare-metal or on a VM with nested virtualization enabled. While - there is no virtualized hardware layer -- the sandbox retains a process model - -- gVisor leverages virtualization extensions available on modern processors - in order to improve isolation and performance of address space switches. +* The **KVM** platform (experimental) allows the Sentry to act as both guest OS + and VMM, switching back and forth between the two worlds seamlessly. The KVM + platform can run on bare-metal or on a VM with nested virtualization enabled. + While there is no virtualized hardware layer -- the sandbox retains a process + model -- gVisor leverages virtualization extensions available on modern + processors in order to improve isolation and performance of address space + switches. ### Performance @@ -167,6 +168,8 @@ and Docker. * [git][git] * [Bazel][bazel] +* [Python 2.7][python] (See [bug #8](https://github.com/google/gvisor/issues/8) + for Python 3 support updates) * [Docker version 17.09.0 or greater][docker] ### Getting the source @@ -228,7 +231,6 @@ Terminal support works too: docker run --runtime=runsc -it ubuntu /bin/bash ``` - ### Kubernetes Support (Experimental) gVisor can run sandboxed containers in a Kubernetes cluster with cri-o, although @@ -394,7 +396,7 @@ here when available. 
## Community -Join the [gvisor-discuss mailing list][gvisor-discuss-list] to discuss all things +Join the [gvisor-users mailing list][gvisor-users-list] to discuss all things gVisor. Sensitive security-related questions and comments can be sent to the private @@ -412,11 +414,12 @@ See [Contributing.md](CONTRIBUTING.md). [docker]: https://www.docker.com [docker-storage-driver]: https://docs.docker.com/engine/reference/commandline/dockerd/#daemon-storage-driver [git]: https://git-scm.com -[gvisor-discuss-list]: https://groups.google.com/forum/#!forum/gvisor-users +[gvisor-users-list]: https://groups.google.com/forum/#!forum/gvisor-users [gvisor-security-list]: https://groups.google.com/forum/#!forum/gvisor-security [kvm]: https://www.linux-kvm.org [netstack]: https://github.com/google/netstack [oci]: https://www.opencontainers.org +[python]: https://python.org [sandbox]: https://en.wikipedia.org/wiki/Sandbox_(computer_security) [seccomp]: https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt [selinux]: https://selinuxproject.org diff --git a/pkg/dhcp/client.go b/pkg/dhcp/client.go index 9a4fd7ae4c..37deb69fff 100644 --- a/pkg/dhcp/client.go +++ b/pkg/dhcp/client.go @@ -162,7 +162,7 @@ func (c *Client) Request(ctx context.Context, requestedAddr tcpip.Address) error // DHCPOFFER for { var addr tcpip.FullAddress - v, err := epin.Read(&addr) + v, _, err := epin.Read(&addr) if err == tcpip.ErrWouldBlock { select { case <-ch: @@ -216,7 +216,7 @@ func (c *Client) Request(ctx context.Context, requestedAddr tcpip.Address) error // DHCPACK for { var addr tcpip.FullAddress - v, err := epin.Read(&addr) + v, _, err := epin.Read(&addr) if err == tcpip.ErrWouldBlock { select { case <-ch: diff --git a/pkg/dhcp/dhcp_test.go b/pkg/dhcp/dhcp_test.go index d56b939972..ed884fcb63 100644 --- a/pkg/dhcp/dhcp_test.go +++ b/pkg/dhcp/dhcp_test.go @@ -36,7 +36,7 @@ func TestDHCP(t *testing.T) { } }() - s := stack.New([]string{ipv4.ProtocolName}, []string{udp.ProtocolName}) + s := 
stack.New(&tcpip.StdClock{}, []string{ipv4.ProtocolName}, []string{udp.ProtocolName}) const nicid tcpip.NICID = 1 if err := s.CreateNIC(nicid, id); err != nil { diff --git a/pkg/dhcp/server.go b/pkg/dhcp/server.go index d132d90b48..8816203a8f 100644 --- a/pkg/dhcp/server.go +++ b/pkg/dhcp/server.go @@ -104,7 +104,7 @@ func (s *Server) reader(ctx context.Context) { for { var addr tcpip.FullAddress - v, err := s.ep.Read(&addr) + v, _, err := s.ep.Read(&addr) if err == tcpip.ErrWouldBlock { select { case <-ch: diff --git a/pkg/sentry/control/proc.go b/pkg/sentry/control/proc.go index 7d06a1d04b..d77b30c907 100644 --- a/pkg/sentry/control/proc.go +++ b/pkg/sentry/control/proc.go @@ -72,9 +72,6 @@ type ExecArgs struct { // Capabilities is the list of capabilities to give to the process. Capabilities *auth.TaskCapabilities - // Detach indicates whether Exec should detach once the process starts. - Detach bool - // FilePayload determines the files to give to the new process. urpc.FilePayload } @@ -135,12 +132,6 @@ func (proc *Proc) Exec(args *ExecArgs, waitStatus *uint32) error { return err } - // If we're supposed to detach, don't wait for the process to exit. - if args.Detach { - *waitStatus = 0 - return nil - } - // Wait for completion. 
newTG.WaitExited() *waitStatus = newTG.ExitStatus().Status() diff --git a/pkg/sentry/fs/host/socket_test.go b/pkg/sentry/fs/host/socket_test.go index 80c46dcfa6..9b73c51739 100644 --- a/pkg/sentry/fs/host/socket_test.go +++ b/pkg/sentry/fs/host/socket_test.go @@ -142,7 +142,7 @@ func TestSocketSendMsgLen0(t *testing.T) { defer sfile.DecRef() s := sfile.FileOperations.(socket.Socket) - n, terr := s.SendMsg(nil, usermem.BytesIOSequence(nil), []byte{}, 0, unix.ControlMessages{}) + n, terr := s.SendMsg(nil, usermem.BytesIOSequence(nil), []byte{}, 0, socket.ControlMessages{}) if n != 0 { t.Fatalf("socket sendmsg() failed: %v wrote: %d", terr, n) } diff --git a/pkg/sentry/kernel/README.md b/pkg/sentry/kernel/README.md index 3306780d61..88760a9bb5 100644 --- a/pkg/sentry/kernel/README.md +++ b/pkg/sentry/kernel/README.md @@ -87,7 +87,7 @@ kept separate from the main "app" state to reduce the size of the latter. 4. `SyscallReinvoke`, which does not correspond to anything in Linux, and exists solely to serve the autosave feature. -![dot -Tsvg -Goverlap=false -orun_states.svg run_states.dot](g3doc/run_states.dot "Task control flow graph") +![dot -Tpng -Goverlap=false -orun_states.png run_states.dot](g3doc/run_states.png "Task control flow graph") States before which a stop may occur are represented as implementations of the `taskRunState` interface named `run(state)`, allowing them to be saved and diff --git a/pkg/sentry/kernel/g3doc/run_states.png b/pkg/sentry/kernel/g3doc/run_states.png new file mode 100644 index 0000000000..b63b60f020 Binary files /dev/null and b/pkg/sentry/kernel/g3doc/run_states.png differ diff --git a/pkg/sentry/kernel/kernel.go b/pkg/sentry/kernel/kernel.go index 0932965e00..25c8dd8859 100644 --- a/pkg/sentry/kernel/kernel.go +++ b/pkg/sentry/kernel/kernel.go @@ -887,6 +887,15 @@ func (k *Kernel) SetExitError(err error) { } } +// NowNanoseconds implements tcpip.Clock.NowNanoseconds. 
+func (k *Kernel) NowNanoseconds() int64 { + now, err := k.timekeeper.GetTime(sentrytime.Realtime) + if err != nil { + panic("Kernel.NowNanoseconds: " + err.Error()) + } + return now +} + // SupervisorContext returns a Context with maximum privileges in k. It should // only be used by goroutines outside the control of the emulated kernel // defined by e. diff --git a/pkg/sentry/platform/ring0/kernel_amd64.go b/pkg/sentry/platform/ring0/kernel_amd64.go index c82613a9c3..76ba65b3f4 100644 --- a/pkg/sentry/platform/ring0/kernel_amd64.go +++ b/pkg/sentry/platform/ring0/kernel_amd64.go @@ -149,7 +149,7 @@ func (c *CPU) CR4() uint64 { // //go:nosplit func (c *CPU) EFER() uint64 { - return _EFER_LME | _EFER_SCE | _EFER_NX + return _EFER_LME | _EFER_LMA | _EFER_SCE | _EFER_NX } // IsCanonical indicates whether addr is canonical per the amd64 spec. diff --git a/pkg/sentry/platform/ring0/x86.go b/pkg/sentry/platform/ring0/x86.go index e16f6c5990..74b1400667 100644 --- a/pkg/sentry/platform/ring0/x86.go +++ b/pkg/sentry/platform/ring0/x86.go @@ -46,6 +46,7 @@ const ( _EFER_SCE = 0x001 _EFER_LME = 0x100 + _EFER_LMA = 0x400 _EFER_NX = 0x800 _MSR_STAR = 0xc0000081 diff --git a/pkg/sentry/socket/BUILD b/pkg/sentry/socket/BUILD index 87e32df374..5500a676ee 100644 --- a/pkg/sentry/socket/BUILD +++ b/pkg/sentry/socket/BUILD @@ -32,6 +32,7 @@ go_library( "//pkg/sentry/usermem", "//pkg/state", "//pkg/syserr", + "//pkg/tcpip", "//pkg/tcpip/transport/unix", ], ) diff --git a/pkg/sentry/socket/control/control.go b/pkg/sentry/socket/control/control.go index cb34cbc85c..17ecdd11c4 100644 --- a/pkg/sentry/socket/control/control.go +++ b/pkg/sentry/socket/control/control.go @@ -208,6 +208,31 @@ func putCmsg(buf []byte, msgType uint32, align uint, data []int32) []byte { return alignSlice(buf, align) } +func putCmsgStruct(buf []byte, msgType uint32, align uint, data interface{}) []byte { + if cap(buf)-len(buf) < linux.SizeOfControlMessageHeader { + return buf + } + ob := buf + + buf = 
putUint64(buf, uint64(linux.SizeOfControlMessageHeader)) + buf = putUint32(buf, linux.SOL_SOCKET) + buf = putUint32(buf, msgType) + + hdrBuf := buf + + buf = binary.Marshal(buf, usermem.ByteOrder, data) + + // Check if we went over. + if cap(buf) != cap(ob) { + return hdrBuf + } + + // Fix up length. + putUint64(ob, uint64(len(buf)-len(ob))) + + return alignSlice(buf, align) +} + // Credentials implements SCMCredentials.Credentials. func (c *scmCredentials) Credentials(t *kernel.Task) (kernel.ThreadID, auth.UID, auth.GID) { // "When a process's user and group IDs are passed over a UNIX domain @@ -261,6 +286,16 @@ func alignSlice(buf []byte, align uint) []byte { return buf[:aligned] } +// PackTimestamp packs a SO_TIMESTAMP socket control message. +func PackTimestamp(t *kernel.Task, timestamp int64, buf []byte) []byte { + return putCmsgStruct( + buf, + linux.SO_TIMESTAMP, + t.Arch().Width(), + linux.NsecToTimeval(timestamp), + ) +} + // Parse parses a raw socket control message into portable objects. func Parse(t *kernel.Task, socketOrEndpoint interface{}, buf []byte) (unix.ControlMessages, error) { var ( diff --git a/pkg/sentry/socket/epsocket/BUILD b/pkg/sentry/socket/epsocket/BUILD index 0e463a92a5..8430886cbe 100644 --- a/pkg/sentry/socket/epsocket/BUILD +++ b/pkg/sentry/socket/epsocket/BUILD @@ -50,6 +50,7 @@ go_library( "//pkg/syserror", "//pkg/tcpip", "//pkg/tcpip/buffer", + "//pkg/tcpip/header", "//pkg/tcpip/network/ipv4", "//pkg/tcpip/network/ipv6", "//pkg/tcpip/stack", diff --git a/pkg/sentry/socket/epsocket/epsocket.go b/pkg/sentry/socket/epsocket/epsocket.go index 3fc3ea58ff..5701ecfac0 100644 --- a/pkg/sentry/socket/epsocket/epsocket.go +++ b/pkg/sentry/socket/epsocket/epsocket.go @@ -109,6 +109,7 @@ type SocketOperations struct { // readMu protects access to readView, control, and sender. 
readMu sync.Mutex `state:"nosave"` readView buffer.View + readCM tcpip.ControlMessages sender tcpip.FullAddress } @@ -210,12 +211,13 @@ func (s *SocketOperations) fetchReadView() *syserr.Error { s.readView = nil s.sender = tcpip.FullAddress{} - v, err := s.Endpoint.Read(&s.sender) + v, cms, err := s.Endpoint.Read(&s.sender) if err != nil { return syserr.TranslateNetstackError(err) } s.readView = v + s.readCM = cms return nil } @@ -230,7 +232,7 @@ func (s *SocketOperations) Read(ctx context.Context, _ *fs.File, dst usermem.IOS if dst.NumBytes() == 0 { return 0, nil } - n, _, _, err := s.nonBlockingRead(ctx, dst, false, false, false) + n, _, _, _, err := s.nonBlockingRead(ctx, dst, false, false, false) if err == syserr.ErrWouldBlock { return int64(n), syserror.ErrWouldBlock } @@ -552,6 +554,18 @@ func GetSockOpt(t *kernel.Task, s socket.Socket, ep commonEndpoint, family int, } return linux.NsecToTimeval(s.RecvTimeout()), nil + + case linux.SO_TIMESTAMP: + if outLen < sizeOfInt32 { + return nil, syserr.ErrInvalidArgument + } + + var v tcpip.TimestampOption + if err := ep.GetSockOpt(&v); err != nil { + return nil, syserr.TranslateNetstackError(err) + } + + return int32(v), nil } case syscall.SOL_TCP: @@ -659,6 +673,14 @@ func SetSockOpt(t *kernel.Task, s socket.Socket, ep commonEndpoint, level int, n binary.Unmarshal(optVal[:linux.SizeOfTimeval], usermem.ByteOrder, &v) s.SetRecvTimeout(v.ToNsecCapped()) return nil + + case linux.SO_TIMESTAMP: + if len(optVal) < sizeOfInt32 { + return syserr.ErrInvalidArgument + } + + v := usermem.ByteOrder.Uint32(optVal) + return syserr.TranslateNetstackError(ep.SetSockOpt(tcpip.TimestampOption(v))) } case syscall.SOL_TCP: @@ -823,7 +845,9 @@ func (s *SocketOperations) coalescingRead(ctx context.Context, dst usermem.IOSeq } // nonBlockingRead issues a non-blocking read. 
-func (s *SocketOperations) nonBlockingRead(ctx context.Context, dst usermem.IOSequence, peek, trunc, senderRequested bool) (int, interface{}, uint32, *syserr.Error) { +// +// TODO: Support timestamps for stream sockets. +func (s *SocketOperations) nonBlockingRead(ctx context.Context, dst usermem.IOSequence, peek, trunc, senderRequested bool) (int, interface{}, uint32, socket.ControlMessages, *syserr.Error) { isPacket := s.isPacketBased() // Fast path for regular reads from stream (e.g., TCP) endpoints. Note @@ -839,14 +863,14 @@ func (s *SocketOperations) nonBlockingRead(ctx context.Context, dst usermem.IOSe s.readMu.Lock() n, err := s.coalescingRead(ctx, dst, trunc) s.readMu.Unlock() - return n, nil, 0, err + return n, nil, 0, socket.ControlMessages{}, err } s.readMu.Lock() defer s.readMu.Unlock() if err := s.fetchReadView(); err != nil { - return 0, nil, 0, err + return 0, nil, 0, socket.ControlMessages{}, err } if !isPacket && peek && trunc { @@ -854,14 +878,14 @@ func (s *SocketOperations) nonBlockingRead(ctx context.Context, dst usermem.IOSe // amount that could be read. var rql tcpip.ReceiveQueueSizeOption if err := s.Endpoint.GetSockOpt(&rql); err != nil { - return 0, nil, 0, syserr.TranslateNetstackError(err) + return 0, nil, 0, socket.ControlMessages{}, syserr.TranslateNetstackError(err) } available := len(s.readView) + int(rql) bufLen := int(dst.NumBytes()) if available < bufLen { - return available, nil, 0, nil + return available, nil, 0, socket.ControlMessages{}, nil } - return bufLen, nil, 0, nil + return bufLen, nil, 0, socket.ControlMessages{}, nil } n, err := dst.CopyOut(ctx, s.readView) @@ -874,17 +898,18 @@ func (s *SocketOperations) nonBlockingRead(ctx context.Context, dst usermem.IOSe if peek { if l := len(s.readView); trunc && l > n { // isPacket must be true. 
- return l, addr, addrLen, syserr.FromError(err) + return l, addr, addrLen, socket.ControlMessages{IP: s.readCM}, syserr.FromError(err) } if isPacket || err != nil { - return int(n), addr, addrLen, syserr.FromError(err) + return int(n), addr, addrLen, socket.ControlMessages{IP: s.readCM}, syserr.FromError(err) } // We need to peek beyond the first message. dst = dst.DropFirst(n) num, err := dst.CopyOutFrom(ctx, safemem.FromVecReaderFunc{func(dsts [][]byte) (int64, error) { - n, err := s.Endpoint.Peek(dsts) + n, _, err := s.Endpoint.Peek(dsts) + // TODO: Handle peek timestamp. if err != nil { return int64(n), syserr.TranslateNetstackError(err).ToError() } @@ -895,7 +920,7 @@ func (s *SocketOperations) nonBlockingRead(ctx context.Context, dst usermem.IOSe // We got some data, so no need to return an error. err = nil } - return int(n), nil, 0, syserr.FromError(err) + return int(n), nil, 0, socket.ControlMessages{IP: s.readCM}, syserr.FromError(err) } var msgLen int @@ -908,15 +933,15 @@ func (s *SocketOperations) nonBlockingRead(ctx context.Context, dst usermem.IOSe } if trunc { - return msgLen, addr, addrLen, syserr.FromError(err) + return msgLen, addr, addrLen, socket.ControlMessages{IP: s.readCM}, syserr.FromError(err) } - return int(n), addr, addrLen, syserr.FromError(err) + return int(n), addr, addrLen, socket.ControlMessages{IP: s.readCM}, syserr.FromError(err) } // RecvMsg implements the linux syscall recvmsg(2) for sockets backed by // tcpip.Endpoint. 
-func (s *SocketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, haveDeadline bool, deadline ktime.Time, senderRequested bool, controlDataLen uint64) (n int, senderAddr interface{}, senderAddrLen uint32, controlMessages unix.ControlMessages, err *syserr.Error) { +func (s *SocketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, haveDeadline bool, deadline ktime.Time, senderRequested bool, controlDataLen uint64) (n int, senderAddr interface{}, senderAddrLen uint32, controlMessages socket.ControlMessages, err *syserr.Error) { trunc := flags&linux.MSG_TRUNC != 0 peek := flags&linux.MSG_PEEK != 0 @@ -924,7 +949,7 @@ func (s *SocketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags // Stream sockets ignore the sender address. senderRequested = false } - n, senderAddr, senderAddrLen, err = s.nonBlockingRead(t, dst, peek, trunc, senderRequested) + n, senderAddr, senderAddrLen, controlMessages, err = s.nonBlockingRead(t, dst, peek, trunc, senderRequested) if err != syserr.ErrWouldBlock || flags&linux.MSG_DONTWAIT != 0 { return } @@ -936,25 +961,25 @@ func (s *SocketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags defer s.EventUnregister(&e) for { - n, senderAddr, senderAddrLen, err = s.nonBlockingRead(t, dst, peek, trunc, senderRequested) + n, senderAddr, senderAddrLen, controlMessages, err = s.nonBlockingRead(t, dst, peek, trunc, senderRequested) if err != syserr.ErrWouldBlock { return } if err := t.BlockWithDeadline(ch, haveDeadline, deadline); err != nil { if err == syserror.ETIMEDOUT { - return 0, nil, 0, unix.ControlMessages{}, syserr.ErrTryAgain + return 0, nil, 0, socket.ControlMessages{}, syserr.ErrTryAgain } - return 0, nil, 0, unix.ControlMessages{}, syserr.FromError(err) + return 0, nil, 0, socket.ControlMessages{}, syserr.FromError(err) } } } // SendMsg implements the linux syscall sendmsg(2) for sockets backed by // tcpip.Endpoint. 
-func (s *SocketOperations) SendMsg(t *kernel.Task, src usermem.IOSequence, to []byte, flags int, controlMessages unix.ControlMessages) (int, *syserr.Error) { - // Reject control messages. - if !controlMessages.Empty() { +func (s *SocketOperations) SendMsg(t *kernel.Task, src usermem.IOSequence, to []byte, flags int, controlMessages socket.ControlMessages) (int, *syserr.Error) { + // Reject Unix control messages. + if !controlMessages.Unix.Empty() { return 0, syserr.ErrInvalidArgument } diff --git a/pkg/sentry/socket/epsocket/provider.go b/pkg/sentry/socket/epsocket/provider.go index 5616435b3b..6c1e3b6b9d 100644 --- a/pkg/sentry/socket/epsocket/provider.go +++ b/pkg/sentry/socket/epsocket/provider.go @@ -23,6 +23,7 @@ import ( "gvisor.googlesource.com/gvisor/pkg/sentry/socket" "gvisor.googlesource.com/gvisor/pkg/syserr" "gvisor.googlesource.com/gvisor/pkg/tcpip" + "gvisor.googlesource.com/gvisor/pkg/tcpip/header" "gvisor.googlesource.com/gvisor/pkg/tcpip/network/ipv4" "gvisor.googlesource.com/gvisor/pkg/tcpip/network/ipv6" "gvisor.googlesource.com/gvisor/pkg/tcpip/transport/tcp" @@ -37,8 +38,8 @@ type provider struct { netProto tcpip.NetworkProtocolNumber } -// GetTransportProtocol figures out transport protocol. Currently only TCP and -// UDP are supported. +// GetTransportProtocol figures out transport protocol. Currently only TCP, +// UDP, and ICMP are supported. 
func GetTransportProtocol(stype unix.SockType, protocol int) (tcpip.TransportProtocolNumber, *syserr.Error) { switch stype { case linux.SOCK_STREAM: @@ -48,14 +49,16 @@ func GetTransportProtocol(stype unix.SockType, protocol int) (tcpip.TransportPro return tcp.ProtocolNumber, nil case linux.SOCK_DGRAM: - if protocol != 0 && protocol != syscall.IPPROTO_UDP { - return 0, syserr.ErrInvalidArgument + switch protocol { + case 0, syscall.IPPROTO_UDP: + return udp.ProtocolNumber, nil + case syscall.IPPROTO_ICMP: + return header.ICMPv4ProtocolNumber, nil + case syscall.IPPROTO_ICMPV6: + return header.ICMPv6ProtocolNumber, nil } - return udp.ProtocolNumber, nil - - default: - return 0, syserr.ErrInvalidArgument } + return 0, syserr.ErrInvalidArgument } // Socket creates a new socket object for the AF_INET or AF_INET6 family. diff --git a/pkg/sentry/socket/hostinet/socket.go b/pkg/sentry/socket/hostinet/socket.go index defa3db2cd..02fad1c60c 100644 --- a/pkg/sentry/socket/hostinet/socket.go +++ b/pkg/sentry/socket/hostinet/socket.go @@ -57,6 +57,8 @@ type socketOperations struct { queue waiter.Queue } +var _ = socket.Socket(&socketOperations{}) + func newSocketFile(ctx context.Context, fd int, nonblock bool) (*fs.File, *syserr.Error) { s := &socketOperations{fd: fd} if err := fdnotifier.AddFD(int32(fd), &s.queue); err != nil { @@ -339,14 +341,14 @@ func (s *socketOperations) SetSockOpt(t *kernel.Task, level int, name int, opt [ } // RecvMsg implements socket.Socket.RecvMsg. -func (s *socketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, haveDeadline bool, deadline ktime.Time, senderRequested bool, controlDataLen uint64) (int, interface{}, uint32, unix.ControlMessages, *syserr.Error) { +func (s *socketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, haveDeadline bool, deadline ktime.Time, senderRequested bool, controlDataLen uint64) (int, interface{}, uint32, socket.ControlMessages, *syserr.Error) { // Whitelist flags. 
// // FIXME: We can't support MSG_ERRQUEUE because it uses ancillary // messages that netstack/tcpip/transport/unix doesn't understand. Kill the // Socket interface's dependence on netstack. if flags&^(syscall.MSG_DONTWAIT|syscall.MSG_PEEK|syscall.MSG_TRUNC) != 0 { - return 0, nil, 0, unix.ControlMessages{}, syserr.ErrInvalidArgument + return 0, nil, 0, socket.ControlMessages{}, syserr.ErrInvalidArgument } var senderAddr []byte @@ -411,11 +413,11 @@ func (s *socketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags } } - return int(n), senderAddr, uint32(len(senderAddr)), unix.ControlMessages{}, syserr.FromError(err) + return int(n), senderAddr, uint32(len(senderAddr)), socket.ControlMessages{}, syserr.FromError(err) } // SendMsg implements socket.Socket.SendMsg. -func (s *socketOperations) SendMsg(t *kernel.Task, src usermem.IOSequence, to []byte, flags int, controlMessages unix.ControlMessages) (int, *syserr.Error) { +func (s *socketOperations) SendMsg(t *kernel.Task, src usermem.IOSequence, to []byte, flags int, controlMessages socket.ControlMessages) (int, *syserr.Error) { // Whitelist flags. if flags&^(syscall.MSG_DONTWAIT|syscall.MSG_EOR|syscall.MSG_FASTOPEN|syscall.MSG_MORE|syscall.MSG_NOSIGNAL) != 0 { return 0, syserr.ErrInvalidArgument diff --git a/pkg/sentry/socket/netlink/socket.go b/pkg/sentry/socket/netlink/socket.go index 2d0e59cebf..0b8f528d02 100644 --- a/pkg/sentry/socket/netlink/socket.go +++ b/pkg/sentry/socket/netlink/socket.go @@ -305,7 +305,7 @@ func (s *Socket) GetPeerName(t *kernel.Task) (interface{}, uint32, *syserr.Error } // RecvMsg implements socket.Socket.RecvMsg. 
-func (s *Socket) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, haveDeadline bool, deadline ktime.Time, senderRequested bool, controlDataLen uint64) (int, interface{}, uint32, unix.ControlMessages, *syserr.Error) { +func (s *Socket) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, haveDeadline bool, deadline ktime.Time, senderRequested bool, controlDataLen uint64) (int, interface{}, uint32, socket.ControlMessages, *syserr.Error) { from := linux.SockAddrNetlink{ Family: linux.AF_NETLINK, PortID: 0, @@ -323,7 +323,7 @@ func (s *Socket) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, have if trunc { n = int64(r.MsgSize) } - return int(n), from, fromLen, unix.ControlMessages{}, syserr.FromError(err) + return int(n), from, fromLen, socket.ControlMessages{}, syserr.FromError(err) } // We'll have to block. Register for notification and keep trying to @@ -337,14 +337,14 @@ func (s *Socket) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, have if trunc { n = int64(r.MsgSize) } - return int(n), from, fromLen, unix.ControlMessages{}, syserr.FromError(err) + return int(n), from, fromLen, socket.ControlMessages{}, syserr.FromError(err) } if err := t.BlockWithDeadline(ch, haveDeadline, deadline); err != nil { if err == syserror.ETIMEDOUT { - return 0, nil, 0, unix.ControlMessages{}, syserr.ErrTryAgain + return 0, nil, 0, socket.ControlMessages{}, syserr.ErrTryAgain } - return 0, nil, 0, unix.ControlMessages{}, syserr.FromError(err) + return 0, nil, 0, socket.ControlMessages{}, syserr.FromError(err) } } } @@ -459,7 +459,7 @@ func (s *Socket) processMessages(ctx context.Context, buf []byte) *syserr.Error } // sendMsg is the core of message send, used for SendMsg and Write. 
-func (s *Socket) sendMsg(ctx context.Context, src usermem.IOSequence, to []byte, flags int, controlMessages unix.ControlMessages) (int, *syserr.Error) { +func (s *Socket) sendMsg(ctx context.Context, src usermem.IOSequence, to []byte, flags int, controlMessages socket.ControlMessages) (int, *syserr.Error) { dstPort := int32(0) if len(to) != 0 { @@ -506,12 +506,12 @@ func (s *Socket) sendMsg(ctx context.Context, src usermem.IOSequence, to []byte, } // SendMsg implements socket.Socket.SendMsg. -func (s *Socket) SendMsg(t *kernel.Task, src usermem.IOSequence, to []byte, flags int, controlMessages unix.ControlMessages) (int, *syserr.Error) { +func (s *Socket) SendMsg(t *kernel.Task, src usermem.IOSequence, to []byte, flags int, controlMessages socket.ControlMessages) (int, *syserr.Error) { return s.sendMsg(t, src, to, flags, controlMessages) } // Write implements fs.FileOperations.Write. func (s *Socket) Write(ctx context.Context, _ *fs.File, src usermem.IOSequence, _ int64) (int64, error) { - n, err := s.sendMsg(ctx, src, nil, 0, unix.ControlMessages{}) + n, err := s.sendMsg(ctx, src, nil, 0, socket.ControlMessages{}) return int64(n), err.ToError() } diff --git a/pkg/sentry/socket/rpcinet/socket.go b/pkg/sentry/socket/rpcinet/socket.go index 574d99ba55..15047df01f 100644 --- a/pkg/sentry/socket/rpcinet/socket.go +++ b/pkg/sentry/socket/rpcinet/socket.go @@ -402,7 +402,7 @@ func rpcRecvMsg(t *kernel.Task, req *pb.SyscallRequest_Recvmsg) (*pb.RecvmsgResp } // RecvMsg implements socket.Socket.RecvMsg. 
-func (s *socketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, haveDeadline bool, deadline ktime.Time, senderRequested bool, controlDataLen uint64) (int, interface{}, uint32, unix.ControlMessages, *syserr.Error) { +func (s *socketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, haveDeadline bool, deadline ktime.Time, senderRequested bool, controlDataLen uint64) (int, interface{}, uint32, socket.ControlMessages, *syserr.Error) { req := &pb.SyscallRequest_Recvmsg{&pb.RecvmsgRequest{ Fd: s.fd, Length: uint32(dst.NumBytes()), @@ -414,10 +414,10 @@ func (s *socketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags res, err := rpcRecvMsg(t, req) if err == nil { n, e := dst.CopyOut(t, res.Data) - return int(n), res.Address.GetAddress(), res.Address.GetLength(), unix.ControlMessages{}, syserr.FromError(e) + return int(n), res.Address.GetAddress(), res.Address.GetLength(), socket.ControlMessages{}, syserr.FromError(e) } if err != syserr.ErrWouldBlock || flags&linux.MSG_DONTWAIT != 0 { - return 0, nil, 0, unix.ControlMessages{}, err + return 0, nil, 0, socket.ControlMessages{}, err } // We'll have to block. 
Register for notifications and keep trying to @@ -430,17 +430,17 @@ func (s *socketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags res, err := rpcRecvMsg(t, req) if err == nil { n, e := dst.CopyOut(t, res.Data) - return int(n), res.Address.GetAddress(), res.Address.GetLength(), unix.ControlMessages{}, syserr.FromError(e) + return int(n), res.Address.GetAddress(), res.Address.GetLength(), socket.ControlMessages{}, syserr.FromError(e) } if err != syserr.ErrWouldBlock { - return 0, nil, 0, unix.ControlMessages{}, err + return 0, nil, 0, socket.ControlMessages{}, err } if err := t.BlockWithDeadline(ch, haveDeadline, deadline); err != nil { if err == syserror.ETIMEDOUT { - return 0, nil, 0, unix.ControlMessages{}, syserr.ErrTryAgain + return 0, nil, 0, socket.ControlMessages{}, syserr.ErrTryAgain } - return 0, nil, 0, unix.ControlMessages{}, syserr.FromError(err) + return 0, nil, 0, socket.ControlMessages{}, syserr.FromError(err) } } } @@ -459,14 +459,14 @@ func rpcSendMsg(t *kernel.Task, req *pb.SyscallRequest_Sendmsg) (uint32, *syserr } // SendMsg implements socket.Socket.SendMsg. -func (s *socketOperations) SendMsg(t *kernel.Task, src usermem.IOSequence, to []byte, flags int, controlMessages unix.ControlMessages) (int, *syserr.Error) { +func (s *socketOperations) SendMsg(t *kernel.Task, src usermem.IOSequence, to []byte, flags int, controlMessages socket.ControlMessages) (int, *syserr.Error) { // Whitelist flags. if flags&^(syscall.MSG_DONTWAIT|syscall.MSG_EOR|syscall.MSG_FASTOPEN|syscall.MSG_MORE|syscall.MSG_NOSIGNAL) != 0 { return 0, syserr.ErrInvalidArgument } - // Reject control messages. - if !controlMessages.Empty() { + // Reject Unix control messages. 
+ if !controlMessages.Unix.Empty() { return 0, syserr.ErrInvalidArgument } diff --git a/pkg/sentry/socket/socket.go b/pkg/sentry/socket/socket.go index be3026bfaa..bd4858a341 100644 --- a/pkg/sentry/socket/socket.go +++ b/pkg/sentry/socket/socket.go @@ -31,9 +31,17 @@ import ( ktime "gvisor.googlesource.com/gvisor/pkg/sentry/kernel/time" "gvisor.googlesource.com/gvisor/pkg/sentry/usermem" "gvisor.googlesource.com/gvisor/pkg/syserr" + "gvisor.googlesource.com/gvisor/pkg/tcpip" "gvisor.googlesource.com/gvisor/pkg/tcpip/transport/unix" ) +// ControlMessages represents the union of unix control messages and tcpip +// control messages. +type ControlMessages struct { + Unix unix.ControlMessages + IP tcpip.ControlMessages +} + // Socket is the interface containing socket syscalls used by the syscall layer // to redirect them to the appropriate implementation. type Socket interface { @@ -78,11 +86,11 @@ type Socket interface { // // senderAddrLen is the address length to be returned to the application, // not necessarily the actual length of the address. - RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, haveDeadline bool, deadline ktime.Time, senderRequested bool, controlDataLen uint64) (n int, senderAddr interface{}, senderAddrLen uint32, controlMessages unix.ControlMessages, err *syserr.Error) + RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, haveDeadline bool, deadline ktime.Time, senderRequested bool, controlDataLen uint64) (n int, senderAddr interface{}, senderAddrLen uint32, controlMessages ControlMessages, err *syserr.Error) // SendMsg implements the sendmsg(2) linux syscall. SendMsg does not take // ownership of the ControlMessage on error. 
- SendMsg(t *kernel.Task, src usermem.IOSequence, to []byte, flags int, controlMessages unix.ControlMessages) (n int, err *syserr.Error) + SendMsg(t *kernel.Task, src usermem.IOSequence, to []byte, flags int, controlMessages ControlMessages) (n int, err *syserr.Error) // SetRecvTimeout sets the timeout (in ns) for recv operations. Zero means // no timeout. diff --git a/pkg/sentry/socket/unix/unix.go b/pkg/sentry/socket/unix/unix.go index a4b4148519..f83156c8ef 100644 --- a/pkg/sentry/socket/unix/unix.go +++ b/pkg/sentry/socket/unix/unix.go @@ -358,10 +358,10 @@ func (s *SocketOperations) Write(ctx context.Context, _ *fs.File, src usermem.IO // SendMsg implements the linux syscall sendmsg(2) for unix sockets backed by // a unix.Endpoint. -func (s *SocketOperations) SendMsg(t *kernel.Task, src usermem.IOSequence, to []byte, flags int, controlMessages unix.ControlMessages) (int, *syserr.Error) { +func (s *SocketOperations) SendMsg(t *kernel.Task, src usermem.IOSequence, to []byte, flags int, controlMessages socket.ControlMessages) (int, *syserr.Error) { w := EndpointWriter{ Endpoint: s.ep, - Control: controlMessages, + Control: controlMessages.Unix, To: nil, } if len(to) > 0 { @@ -452,7 +452,7 @@ func (s *SocketOperations) Read(ctx context.Context, _ *fs.File, dst usermem.IOS // RecvMsg implements the linux syscall recvmsg(2) for sockets backed by // a unix.Endpoint. 
-func (s *SocketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, haveDeadline bool, deadline ktime.Time, senderRequested bool, controlDataLen uint64) (n int, senderAddr interface{}, senderAddrLen uint32, controlMessages unix.ControlMessages, err *syserr.Error) { +func (s *SocketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags int, haveDeadline bool, deadline ktime.Time, senderRequested bool, controlDataLen uint64) (n int, senderAddr interface{}, senderAddrLen uint32, controlMessages socket.ControlMessages, err *syserr.Error) { trunc := flags&linux.MSG_TRUNC != 0 peek := flags&linux.MSG_PEEK != 0 @@ -490,7 +490,7 @@ func (s *SocketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags if trunc { n = int64(r.MsgSize) } - return int(n), from, fromLen, r.Control, syserr.FromError(err) + return int(n), from, fromLen, socket.ControlMessages{Unix: r.Control}, syserr.FromError(err) } // We'll have to block. Register for notification and keep trying to @@ -509,14 +509,14 @@ func (s *SocketOperations) RecvMsg(t *kernel.Task, dst usermem.IOSequence, flags if trunc { n = int64(r.MsgSize) } - return int(n), from, fromLen, r.Control, syserr.FromError(err) + return int(n), from, fromLen, socket.ControlMessages{Unix: r.Control}, syserr.FromError(err) } if err := t.BlockWithDeadline(ch, haveDeadline, deadline); err != nil { if err == syserror.ETIMEDOUT { - return 0, nil, 0, unix.ControlMessages{}, syserr.ErrTryAgain + return 0, nil, 0, socket.ControlMessages{}, syserr.ErrTryAgain } - return 0, nil, 0, unix.ControlMessages{}, syserr.FromError(err) + return 0, nil, 0, socket.ControlMessages{}, syserr.FromError(err) } } } diff --git a/pkg/sentry/strace/socket.go b/pkg/sentry/strace/socket.go index 48c072e96e..1a2e8573e8 100644 --- a/pkg/sentry/strace/socket.go +++ b/pkg/sentry/strace/socket.go @@ -440,6 +440,7 @@ var SocketProtocol = map[int32]abi.ValueSet{ var controlMessageType = map[int32]string{ linux.SCM_RIGHTS: "SCM_RIGHTS", 
linux.SCM_CREDENTIALS: "SCM_CREDENTIALS", + linux.SO_TIMESTAMP: "SO_TIMESTAMP", } func cmsghdr(t *kernel.Task, addr usermem.Addr, length uint64, maxBytes uint64) string { @@ -477,7 +478,7 @@ func cmsghdr(t *kernel.Task, addr usermem.Addr, length uint64, maxBytes uint64) typ = fmt.Sprint(h.Type) } - if h.Length > uint64(len(buf)-i) { + if h.Length > uint64(len(buf)-i+linux.SizeOfControlMessageHeader) { strs = append(strs, fmt.Sprintf( "{level=%s, type=%s, length=%d, content extends beyond buffer}", level, @@ -546,6 +547,32 @@ func cmsghdr(t *kernel.Task, addr usermem.Addr, length uint64, maxBytes uint64) i += control.AlignUp(length, width) + case linux.SO_TIMESTAMP: + if length < linux.SizeOfTimeval { + strs = append(strs, fmt.Sprintf( + "{level=%s, type=%s, length=%d, content too short}", + level, + typ, + h.Length, + )) + i += control.AlignUp(length, width) + break + } + + var tv linux.Timeval + binary.Unmarshal(buf[i:i+linux.SizeOfTimeval], usermem.ByteOrder, &tv) + + strs = append(strs, fmt.Sprintf( + "{level=%s, type=%s, length=%d, Sec: %d, Usec: %d}", + level, + typ, + h.Length, + tv.Sec, + tv.Usec, + )) + + i += control.AlignUp(length, width) + default: panic("unreachable") } diff --git a/pkg/sentry/syscalls/linux/sys_socket.go b/pkg/sentry/syscalls/linux/sys_socket.go index 3797c0a5dc..d6d5dba8a6 100644 --- a/pkg/sentry/syscalls/linux/sys_socket.go +++ b/pkg/sentry/syscalls/linux/sys_socket.go @@ -610,7 +610,14 @@ func RecvMsg(t *kernel.Task, args arch.SyscallArguments) (uintptr, *kernel.Sysca flags |= linux.MSG_DONTWAIT } - n, err := recvSingleMsg(t, s, msgPtr, flags, false, ktime.Time{}) + var haveDeadline bool + var deadline ktime.Time + if dl := s.RecvTimeout(); dl != 0 { + deadline = t.Kernel().MonotonicClock().Now().Add(time.Duration(dl) * time.Nanosecond) + haveDeadline = true + } + + n, err := recvSingleMsg(t, s, msgPtr, flags, haveDeadline, deadline) return n, nil, err } @@ -724,10 +731,11 @@ func recvSingleMsg(t *kernel.Task, s socket.Socket, 
msgPtr usermem.Addr, flags i // Fast path when no control message nor name buffers are provided. if msg.ControlLen == 0 && msg.NameLen == 0 { - n, _, _, _, err := s.RecvMsg(t, dst, int(flags), haveDeadline, deadline, false, 0) + n, _, _, cms, err := s.RecvMsg(t, dst, int(flags), haveDeadline, deadline, false, 0) if err != nil { return 0, syserror.ConvertIntr(err.ToError(), kernel.ERESTARTSYS) } + cms.Unix.Release() return uintptr(n), nil } @@ -738,17 +746,21 @@ func recvSingleMsg(t *kernel.Task, s socket.Socket, msgPtr usermem.Addr, flags i if e != nil { return 0, syserror.ConvertIntr(e.ToError(), kernel.ERESTARTSYS) } - defer cms.Release() + defer cms.Unix.Release() controlData := make([]byte, 0, msg.ControlLen) if cr, ok := s.(unix.Credentialer); ok && cr.Passcred() { - creds, _ := cms.Credentials.(control.SCMCredentials) + creds, _ := cms.Unix.Credentials.(control.SCMCredentials) controlData = control.PackCredentials(t, creds, controlData) } - if cms.Rights != nil { - controlData = control.PackRights(t, cms.Rights.(control.SCMRights), flags&linux.MSG_CMSG_CLOEXEC != 0, controlData) + if cms.IP.HasTimestamp { + controlData = control.PackTimestamp(t, cms.IP.Timestamp, controlData) + } + + if cms.Unix.Rights != nil { + controlData = control.PackRights(t, cms.Unix.Rights.(control.SCMRights), flags&linux.MSG_CMSG_CLOEXEC != 0, controlData) } // Copy the address to the caller. @@ -779,7 +791,7 @@ func recvFrom(t *kernel.Task, fd kdefs.FD, bufPtr usermem.Addr, bufLen uint64, f } // Reject flags that we don't handle yet. 
- if flags & ^(linux.MSG_DONTWAIT|linux.MSG_NOSIGNAL|linux.MSG_PEEK|linux.MSG_TRUNC) != 0 { + if flags & ^(linux.MSG_DONTWAIT|linux.MSG_NOSIGNAL|linux.MSG_PEEK|linux.MSG_TRUNC|linux.MSG_CONFIRM) != 0 { return 0, syscall.EINVAL } @@ -816,7 +828,7 @@ func recvFrom(t *kernel.Task, fd kdefs.FD, bufPtr usermem.Addr, bufLen uint64, f } n, sender, senderLen, cm, e := s.RecvMsg(t, dst, int(flags), haveDeadline, deadline, nameLenPtr != 0, 0) - cm.Release() + cm.Unix.Release() if e != nil { return 0, syserror.ConvertIntr(e.ToError(), kernel.ERESTARTSYS) } @@ -990,7 +1002,7 @@ func sendSingleMsg(t *kernel.Task, s socket.Socket, file *fs.File, msgPtr userme } // Call the syscall implementation. - n, e := s.SendMsg(t, src, to, int(flags), controlMessages) + n, e := s.SendMsg(t, src, to, int(flags), socket.ControlMessages{Unix: controlMessages}) err = handleIOError(t, n != 0, e.ToError(), kernel.ERESTARTSYS, "sendmsg", file) if err != nil { controlMessages.Release() @@ -1041,7 +1053,7 @@ func sendTo(t *kernel.Task, fd kdefs.FD, bufPtr usermem.Addr, bufLen uint64, fla } // Call the syscall implementation. - n, e := s.SendMsg(t, src, to, int(flags), control.New(t, s, nil)) + n, e := s.SendMsg(t, src, to, int(flags), socket.ControlMessages{Unix: control.New(t, s, nil)}) return uintptr(n), handleIOError(t, n != 0, e.ToError(), kernel.ERESTARTSYS, "sendto", file) } diff --git a/pkg/tcpip/adapters/gonet/gonet.go b/pkg/tcpip/adapters/gonet/gonet.go index 96a2d670d6..5aa6b1aa2d 100644 --- a/pkg/tcpip/adapters/gonet/gonet.go +++ b/pkg/tcpip/adapters/gonet/gonet.go @@ -268,7 +268,7 @@ type opErrorer interface { // commonRead implements the common logic between net.Conn.Read and // net.PacketConn.ReadFrom. func commonRead(ep tcpip.Endpoint, wq *waiter.Queue, deadline <-chan struct{}, addr *tcpip.FullAddress, errorer opErrorer) ([]byte, error) { - read, err := ep.Read(addr) + read, _, err := ep.Read(addr) if err == tcpip.ErrWouldBlock { // Create wait queue entry that notifies a channel. 
@@ -276,7 +276,7 @@ func commonRead(ep tcpip.Endpoint, wq *waiter.Queue, deadline <-chan struct{}, a wq.EventRegister(&waitEntry, waiter.EventIn) defer wq.EventUnregister(&waitEntry) for { - read, err = ep.Read(addr) + read, _, err = ep.Read(addr) if err != tcpip.ErrWouldBlock { break } diff --git a/pkg/tcpip/adapters/gonet/gonet_test.go b/pkg/tcpip/adapters/gonet/gonet_test.go index 2f86469ebb..e3d0c6c84f 100644 --- a/pkg/tcpip/adapters/gonet/gonet_test.go +++ b/pkg/tcpip/adapters/gonet/gonet_test.go @@ -47,7 +47,7 @@ func TestTimeouts(t *testing.T) { func newLoopbackStack() (*stack.Stack, *tcpip.Error) { // Create the stack and add a NIC. - s := stack.New([]string{ipv4.ProtocolName, ipv6.ProtocolName}, []string{tcp.ProtocolName, udp.ProtocolName}) + s := stack.New(&tcpip.StdClock{}, []string{ipv4.ProtocolName, ipv6.ProtocolName}, []string{tcp.ProtocolName, udp.ProtocolName}) if err := s.CreateNIC(NICID, loopback.New()); err != nil { return nil, err diff --git a/pkg/tcpip/link/sniffer/sniffer.go b/pkg/tcpip/link/sniffer/sniffer.go index da6969e940..72d9a0f1cf 100644 --- a/pkg/tcpip/link/sniffer/sniffer.go +++ b/pkg/tcpip/link/sniffer/sniffer.go @@ -86,8 +86,9 @@ func writePCAPHeader(w io.Writer, maxLen uint32) error { // NewWithFile creates a new sniffer link-layer endpoint. It wraps around // another endpoint and logs packets and they traverse the endpoint. // -// Packets can be logged to file in the pcap format in addition to the standard -// human-readable logs. +// Packets can be logged to file in the pcap format. A sniffer created +// with this function will not emit packets using the standard log +// package. // // snapLen is the maximum amount of a packet to be saved. Packets with a length // less than or equal too snapLen will be saved in their entirety. 
Longer @@ -107,7 +108,7 @@ func NewWithFile(lower tcpip.LinkEndpointID, file *os.File, snapLen uint32) (tcp // called by the link-layer endpoint being wrapped when a packet arrives, and // logs the packet before forwarding to the actual dispatcher. func (e *endpoint) DeliverNetworkPacket(linkEP stack.LinkEndpoint, remoteLinkAddr tcpip.LinkAddress, protocol tcpip.NetworkProtocolNumber, vv *buffer.VectorisedView) { - if atomic.LoadUint32(&LogPackets) == 1 { + if atomic.LoadUint32(&LogPackets) == 1 && e.file == nil { LogPacket("recv", protocol, vv.First(), nil) } if e.file != nil && atomic.LoadUint32(&LogPacketsToFile) == 1 { @@ -168,7 +169,7 @@ func (e *endpoint) LinkAddress() tcpip.LinkAddress { // higher-level protocols to write packets; it just logs the packet and forwards // the request to the lower endpoint. func (e *endpoint) WritePacket(r *stack.Route, hdr *buffer.Prependable, payload buffer.View, protocol tcpip.NetworkProtocolNumber) *tcpip.Error { - if atomic.LoadUint32(&LogPackets) == 1 { + if atomic.LoadUint32(&LogPackets) == 1 && e.file == nil { LogPacket("send", protocol, hdr.UsedBytes(), payload) } if e.file != nil && atomic.LoadUint32(&LogPacketsToFile) == 1 { diff --git a/pkg/tcpip/network/arp/BUILD b/pkg/tcpip/network/arp/BUILD index e6d0899a9c..58d174965d 100644 --- a/pkg/tcpip/network/arp/BUILD +++ b/pkg/tcpip/network/arp/BUILD @@ -30,5 +30,6 @@ go_test( "//pkg/tcpip/link/sniffer", "//pkg/tcpip/network/ipv4", "//pkg/tcpip/stack", + "//pkg/tcpip/transport/ping", ], ) diff --git a/pkg/tcpip/network/arp/arp_test.go b/pkg/tcpip/network/arp/arp_test.go index 91ffdce4b6..6d61ff1d71 100644 --- a/pkg/tcpip/network/arp/arp_test.go +++ b/pkg/tcpip/network/arp/arp_test.go @@ -16,6 +16,7 @@ import ( "gvisor.googlesource.com/gvisor/pkg/tcpip/network/arp" "gvisor.googlesource.com/gvisor/pkg/tcpip/network/ipv4" "gvisor.googlesource.com/gvisor/pkg/tcpip/stack" + "gvisor.googlesource.com/gvisor/pkg/tcpip/transport/ping" ) const ( @@ -32,7 +33,7 @@ type testContext 
struct { } func newTestContext(t *testing.T) *testContext { - s := stack.New([]string{ipv4.ProtocolName, arp.ProtocolName}, []string{ipv4.PingProtocolName}) + s := stack.New(&tcpip.StdClock{}, []string{ipv4.ProtocolName, arp.ProtocolName}, []string{ping.ProtocolName4}) const defaultMTU = 65536 id, linkEP := channel.New(256, defaultMTU, stackLinkAddr) diff --git a/pkg/tcpip/network/ipv4/BUILD b/pkg/tcpip/network/ipv4/BUILD index 9df113df19..02d55355c8 100644 --- a/pkg/tcpip/network/ipv4/BUILD +++ b/pkg/tcpip/network/ipv4/BUILD @@ -1,6 +1,6 @@ package(licenses = ["notice"]) # BSD -load("@io_bazel_rules_go//go:def.bzl", "go_library", "go_test") +load("@io_bazel_rules_go//go:def.bzl", "go_library") go_library( name = "ipv4", @@ -19,20 +19,5 @@ go_library( "//pkg/tcpip/network/fragmentation", "//pkg/tcpip/network/hash", "//pkg/tcpip/stack", - "//pkg/waiter", - ], -) - -go_test( - name = "ipv4_test", - size = "small", - srcs = ["icmp_test.go"], - deps = [ - ":ipv4", - "//pkg/tcpip", - "//pkg/tcpip/buffer", - "//pkg/tcpip/link/channel", - "//pkg/tcpip/link/sniffer", - "//pkg/tcpip/stack", ], ) diff --git a/pkg/tcpip/network/ipv4/icmp.go b/pkg/tcpip/network/ipv4/icmp.go index ffd7613504..3c382fdc2f 100644 --- a/pkg/tcpip/network/ipv4/icmp.go +++ b/pkg/tcpip/network/ipv4/icmp.go @@ -5,26 +5,14 @@ package ipv4 import ( - "context" "encoding/binary" - "time" "gvisor.googlesource.com/gvisor/pkg/tcpip" "gvisor.googlesource.com/gvisor/pkg/tcpip/buffer" "gvisor.googlesource.com/gvisor/pkg/tcpip/header" "gvisor.googlesource.com/gvisor/pkg/tcpip/stack" - "gvisor.googlesource.com/gvisor/pkg/waiter" ) -// PingProtocolName is a pseudo transport protocol used to handle ping replies. -// Use it when constructing a stack that intends to use ipv4.Ping. -const PingProtocolName = "icmpv4ping" - -// pingProtocolNumber is a fake transport protocol used to -// deliver incoming ICMP echo replies. The ICMP identifier -// number is used as a port number for multiplexing. 
-const pingProtocolNumber tcpip.TransportProtocolNumber = 256 + 11 - // handleControl handles the case when an ICMP packet contains the headers of // the original packet that caused the ICMP one to be sent. This information is // used to find out which transport endpoint must be notified about the ICMP @@ -78,7 +66,10 @@ func (e *endpoint) handleICMP(r *stack.Route, vv *buffer.VectorisedView) { } case header.ICMPv4EchoReply: - e.dispatcher.DeliverTransportPacket(r, pingProtocolNumber, vv) + if len(v) < header.ICMPv4EchoMinimumSize { + return + } + e.dispatcher.DeliverTransportPacket(r, header.ICMPv4ProtocolNumber, vv) case header.ICMPv4DstUnreachable: if len(v) < header.ICMPv4DstUnreachableMinimumSize { @@ -104,179 +95,20 @@ type echoRequest struct { func (e *endpoint) echoReplier() { for req := range e.echoRequests { - sendICMPv4(&req.r, header.ICMPv4EchoReply, 0, req.v) + sendPing4(&req.r, 0, req.v) req.r.Release() } } -func sendICMPv4(r *stack.Route, typ header.ICMPv4Type, code byte, data buffer.View) *tcpip.Error { - hdr := buffer.NewPrependable(header.ICMPv4MinimumSize + int(r.MaxHeaderLength())) +func sendPing4(r *stack.Route, code byte, data buffer.View) *tcpip.Error { + hdr := buffer.NewPrependable(header.ICMPv4EchoMinimumSize + int(r.MaxHeaderLength())) - icmpv4 := header.ICMPv4(hdr.Prepend(header.ICMPv4MinimumSize)) - icmpv4.SetType(typ) + icmpv4 := header.ICMPv4(hdr.Prepend(header.ICMPv4EchoMinimumSize)) + icmpv4.SetType(header.ICMPv4EchoReply) icmpv4.SetCode(code) + copy(icmpv4[header.ICMPv4MinimumSize:], data) + data = data[header.ICMPv4EchoMinimumSize-header.ICMPv4MinimumSize:] icmpv4.SetChecksum(^header.Checksum(icmpv4, header.Checksum(data, 0))) return r.WritePacket(&hdr, data, header.ICMPv4ProtocolNumber) } - -// A Pinger can send echo requests to an address. 
-type Pinger struct { - Stack *stack.Stack - NICID tcpip.NICID - Addr tcpip.Address - LocalAddr tcpip.Address // optional - Wait time.Duration // if zero, defaults to 1 second - Count uint16 // if zero, defaults to MaxUint16 -} - -// Ping sends echo requests to an ICMPv4 endpoint. -// Responses are streamed to the channel ch. -func (p *Pinger) Ping(ctx context.Context, ch chan<- PingReply) *tcpip.Error { - count := p.Count - if count == 0 { - count = 1<<16 - 1 - } - wait := p.Wait - if wait == 0 { - wait = 1 * time.Second - } - - r, err := p.Stack.FindRoute(p.NICID, p.LocalAddr, p.Addr, ProtocolNumber) - if err != nil { - return err - } - - netProtos := []tcpip.NetworkProtocolNumber{ProtocolNumber} - ep := &pingEndpoint{ - stack: p.Stack, - pktCh: make(chan buffer.View, 1), - } - id := stack.TransportEndpointID{ - LocalAddress: r.LocalAddress, - RemoteAddress: p.Addr, - } - - _, err = p.Stack.PickEphemeralPort(func(port uint16) (bool, *tcpip.Error) { - id.LocalPort = port - err := p.Stack.RegisterTransportEndpoint(p.NICID, netProtos, pingProtocolNumber, id, ep) - switch err { - case nil: - return true, nil - case tcpip.ErrPortInUse: - return false, nil - default: - return false, err - } - }) - if err != nil { - return err - } - defer p.Stack.UnregisterTransportEndpoint(p.NICID, netProtos, pingProtocolNumber, id) - - v := buffer.NewView(4) - binary.BigEndian.PutUint16(v[0:], id.LocalPort) - - start := time.Now() - - done := make(chan struct{}) - go func(count int) { - loop: - for ; count > 0; count-- { - select { - case v := <-ep.pktCh: - seq := binary.BigEndian.Uint16(v[header.ICMPv4MinimumSize+2:]) - ch <- PingReply{ - Duration: time.Since(start) - time.Duration(seq)*wait, - SeqNumber: seq, - } - case <-ctx.Done(): - break loop - } - } - close(done) - }(int(count)) - defer func() { <-done }() - - t := time.NewTicker(wait) - defer t.Stop() - for seq := uint16(0); seq < count; seq++ { - select { - case <-t.C: - case <-ctx.Done(): - return nil - } - 
binary.BigEndian.PutUint16(v[2:], seq) - sent := time.Now() - if err := sendICMPv4(&r, header.ICMPv4Echo, 0, v); err != nil { - ch <- PingReply{ - Error: err, - Duration: time.Since(sent), - SeqNumber: seq, - } - } - } - return nil -} - -// PingReply summarizes an ICMP echo reply. -type PingReply struct { - Error *tcpip.Error // reports any errors sending a ping request - Duration time.Duration - SeqNumber uint16 -} - -type pingProtocol struct{} - -func (*pingProtocol) NewEndpoint(stack *stack.Stack, netProto tcpip.NetworkProtocolNumber, waiterQueue *waiter.Queue) (tcpip.Endpoint, *tcpip.Error) { - return nil, tcpip.ErrNotSupported // endpoints are created directly -} - -func (*pingProtocol) Number() tcpip.TransportProtocolNumber { return pingProtocolNumber } - -func (*pingProtocol) MinimumPacketSize() int { return header.ICMPv4EchoMinimumSize } - -func (*pingProtocol) ParsePorts(v buffer.View) (src, dst uint16, err *tcpip.Error) { - ident := binary.BigEndian.Uint16(v[4:]) - return 0, ident, nil -} - -func (*pingProtocol) HandleUnknownDestinationPacket(*stack.Route, stack.TransportEndpointID, *buffer.VectorisedView) bool { - return true -} - -// SetOption implements TransportProtocol.SetOption. -func (p *pingProtocol) SetOption(option interface{}) *tcpip.Error { - return tcpip.ErrUnknownProtocolOption -} - -// Option implements TransportProtocol.Option. 
-func (p *pingProtocol) Option(option interface{}) *tcpip.Error { - return tcpip.ErrUnknownProtocolOption -} - -func init() { - stack.RegisterTransportProtocolFactory(PingProtocolName, func() stack.TransportProtocol { - return &pingProtocol{} - }) -} - -type pingEndpoint struct { - stack *stack.Stack - pktCh chan buffer.View -} - -func (e *pingEndpoint) Close() { - close(e.pktCh) -} - -func (e *pingEndpoint) HandlePacket(r *stack.Route, id stack.TransportEndpointID, vv *buffer.VectorisedView) { - select { - case e.pktCh <- vv.ToView(): - default: - } -} - -// HandleControlPacket implements stack.TransportEndpoint.HandleControlPacket. -func (e *pingEndpoint) HandleControlPacket(id stack.TransportEndpointID, typ stack.ControlType, extra uint32, vv *buffer.VectorisedView) { -} diff --git a/pkg/tcpip/network/ipv4/icmp_test.go b/pkg/tcpip/network/ipv4/icmp_test.go deleted file mode 100644 index 378fba74b1..0000000000 --- a/pkg/tcpip/network/ipv4/icmp_test.go +++ /dev/null @@ -1,124 +0,0 @@ -// Copyright 2016 The Netstack Authors. All rights reserved. -// Use of this source code is governed by a BSD-style -// license that can be found in the LICENSE file. 
- -package ipv4_test - -import ( - "context" - "testing" - "time" - - "gvisor.googlesource.com/gvisor/pkg/tcpip" - "gvisor.googlesource.com/gvisor/pkg/tcpip/buffer" - "gvisor.googlesource.com/gvisor/pkg/tcpip/link/channel" - "gvisor.googlesource.com/gvisor/pkg/tcpip/link/sniffer" - "gvisor.googlesource.com/gvisor/pkg/tcpip/network/ipv4" - "gvisor.googlesource.com/gvisor/pkg/tcpip/stack" -) - -const stackAddr = "\x0a\x00\x00\x01" - -type testContext struct { - t *testing.T - linkEP *channel.Endpoint - s *stack.Stack -} - -func newTestContext(t *testing.T) *testContext { - s := stack.New([]string{ipv4.ProtocolName}, []string{ipv4.PingProtocolName}) - - const defaultMTU = 65536 - id, linkEP := channel.New(256, defaultMTU, "") - if testing.Verbose() { - id = sniffer.New(id) - } - if err := s.CreateNIC(1, id); err != nil { - t.Fatalf("CreateNIC failed: %v", err) - } - - if err := s.AddAddress(1, ipv4.ProtocolNumber, stackAddr); err != nil { - t.Fatalf("AddAddress failed: %v", err) - } - - s.SetRouteTable([]tcpip.Route{{ - Destination: "\x00\x00\x00\x00", - Mask: "\x00\x00\x00\x00", - Gateway: "", - NIC: 1, - }}) - - return &testContext{ - t: t, - s: s, - linkEP: linkEP, - } -} - -func (c *testContext) cleanup() { - close(c.linkEP.C) -} - -func (c *testContext) loopback() { - go func() { - for pkt := range c.linkEP.C { - v := make(buffer.View, len(pkt.Header)+len(pkt.Payload)) - copy(v, pkt.Header) - copy(v[len(pkt.Header):], pkt.Payload) - vv := v.ToVectorisedView([1]buffer.View{}) - c.linkEP.Inject(pkt.Proto, &vv) - } - }() -} - -func TestEcho(t *testing.T) { - c := newTestContext(t) - defer c.cleanup() - c.loopback() - - ch := make(chan ipv4.PingReply, 1) - p := ipv4.Pinger{ - Stack: c.s, - NICID: 1, - Addr: stackAddr, - Wait: 10 * time.Millisecond, - Count: 1, // one ping only - } - if err := p.Ping(context.Background(), ch); err != nil { - t.Fatalf("icmp.Ping failed: %v", err) - } - - ping := <-ch - if ping.Error != nil { - t.Errorf("bad ping response: %v", 
ping.Error) - } -} - -func TestEchoSequence(t *testing.T) { - c := newTestContext(t) - defer c.cleanup() - c.loopback() - - const numPings = 3 - ch := make(chan ipv4.PingReply, numPings) - p := ipv4.Pinger{ - Stack: c.s, - NICID: 1, - Addr: stackAddr, - Wait: 10 * time.Millisecond, - Count: numPings, - } - if err := p.Ping(context.Background(), ch); err != nil { - t.Fatalf("icmp.Ping failed: %v", err) - } - - for i := uint16(0); i < numPings; i++ { - ping := <-ch - if ping.Error != nil { - t.Errorf("i=%d bad ping response: %v", i, ping.Error) - } - if ping.SeqNumber != i { - t.Errorf("SeqNumber=%d, want %d", ping.SeqNumber, i) - } - } -} diff --git a/pkg/tcpip/sample/tun_tcp_connect/main.go b/pkg/tcpip/sample/tun_tcp_connect/main.go index 332929c850..ef5c7ec607 100644 --- a/pkg/tcpip/sample/tun_tcp_connect/main.go +++ b/pkg/tcpip/sample/tun_tcp_connect/main.go @@ -113,7 +113,7 @@ func main() { // Create the stack with ipv4 and tcp protocols, then add a tun-based // NIC and ipv4 address. - s := stack.New([]string{ipv4.ProtocolName}, []string{tcp.ProtocolName}) + s := stack.New(&tcpip.StdClock{}, []string{ipv4.ProtocolName}, []string{tcp.ProtocolName}) mtu, err := rawfile.GetMTU(tunName) if err != nil { @@ -183,7 +183,7 @@ func main() { // connection from its side. 
wq.EventRegister(&waitEntry, waiter.EventIn) for { - v, err := ep.Read(nil) + v, _, err := ep.Read(nil) if err != nil { if err == tcpip.ErrClosedForReceive { break diff --git a/pkg/tcpip/sample/tun_tcp_echo/main.go b/pkg/tcpip/sample/tun_tcp_echo/main.go index 10cd701af1..8c166f643a 100644 --- a/pkg/tcpip/sample/tun_tcp_echo/main.go +++ b/pkg/tcpip/sample/tun_tcp_echo/main.go @@ -42,7 +42,7 @@ func echo(wq *waiter.Queue, ep tcpip.Endpoint) { defer wq.EventUnregister(&waitEntry) for { - v, err := ep.Read(nil) + v, _, err := ep.Read(nil) if err != nil { if err == tcpip.ErrWouldBlock { <-notifyCh @@ -99,7 +99,7 @@ func main() { // Create the stack with ip and tcp protocols, then add a tun-based // NIC and address. - s := stack.New([]string{ipv4.ProtocolName, ipv6.ProtocolName, arp.ProtocolName}, []string{tcp.ProtocolName}) + s := stack.New(&tcpip.StdClock{}, []string{ipv4.ProtocolName, ipv6.ProtocolName, arp.ProtocolName}, []string{tcp.ProtocolName}) mtu, err := rawfile.GetMTU(tunName) if err != nil { diff --git a/pkg/tcpip/stack/stack.go b/pkg/tcpip/stack/stack.go index 558ecdb720..b480bf8124 100644 --- a/pkg/tcpip/stack/stack.go +++ b/pkg/tcpip/stack/stack.go @@ -270,6 +270,9 @@ type Stack struct { // If not nil, then any new endpoints will have this probe function // invoked everytime they receive a TCP segment. tcpProbeFunc TCPProbeFunc + + // clock is used to generate user-visible times. + clock tcpip.Clock } // New allocates a new networking stack with only the requested networking and @@ -279,7 +282,7 @@ type Stack struct { // SetNetworkProtocolOption/SetTransportProtocolOption methods provided by the // stack. Please refer to individual protocol implementations as to what options // are supported. 
-func New(network []string, transport []string) *Stack { +func New(clock tcpip.Clock, network []string, transport []string) *Stack { s := &Stack{ transportProtocols: make(map[tcpip.TransportProtocolNumber]*transportProtocolState), networkProtocols: make(map[tcpip.NetworkProtocolNumber]NetworkProtocol), @@ -287,6 +290,7 @@ func New(network []string, transport []string) *Stack { nics: make(map[tcpip.NICID]*NIC), linkAddrCache: newLinkAddrCache(ageLimit, resolutionTimeout, resolutionAttempts), PortManager: ports.NewPortManager(), + clock: clock, } // Add specified network protocols. @@ -388,6 +392,11 @@ func (s *Stack) SetTransportProtocolHandler(p tcpip.TransportProtocolNumber, h f } } +// NowNanoseconds implements tcpip.Clock.NowNanoseconds. +func (s *Stack) NowNanoseconds() int64 { + return s.clock.NowNanoseconds() +} + // Stats returns a snapshot of the current stats. // // NOTE: The underlying stats are updated using atomic instructions as a result @@ -474,6 +483,12 @@ func (s *Stack) CreateDisabledNIC(id tcpip.NICID, linkEP tcpip.LinkEndpointID) * return s.createNIC(id, "", linkEP, false) } +// CreateDisabledNamedNIC is a combination of CreateNamedNIC and +// CreateDisabledNIC. +func (s *Stack) CreateDisabledNamedNIC(id tcpip.NICID, name string, linkEP tcpip.LinkEndpointID) *tcpip.Error { + return s.createNIC(id, name, linkEP, false) +} + // EnableNIC enables the given NIC so that the link-layer endpoint can start // delivering packets to it. func (s *Stack) EnableNIC(id tcpip.NICID) *tcpip.Error { diff --git a/pkg/tcpip/stack/stack_test.go b/pkg/tcpip/stack/stack_test.go index b416065d71..ea7dccdc2d 100644 --- a/pkg/tcpip/stack/stack_test.go +++ b/pkg/tcpip/stack/stack_test.go @@ -176,7 +176,7 @@ func TestNetworkReceive(t *testing.T) { // Create a stack with the fake network protocol, one nic, and two // addresses attached to it: 1 & 2. 
id, linkEP := channel.New(10, defaultMTU, "") - s := stack.New([]string{"fakeNet"}, nil) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, nil) if err := s.CreateNIC(1, id); err != nil { t.Fatalf("CreateNIC failed: %v", err) } @@ -270,7 +270,7 @@ func TestNetworkSend(t *testing.T) { // address: 1. The route table sends all packets through the only // existing nic. id, linkEP := channel.New(10, defaultMTU, "") - s := stack.New([]string{"fakeNet"}, nil) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, nil) if err := s.CreateNIC(1, id); err != nil { t.Fatalf("NewNIC failed: %v", err) } @@ -292,7 +292,7 @@ func TestNetworkSendMultiRoute(t *testing.T) { // Create a stack with the fake network protocol, two nics, and two // addresses per nic, the first nic has odd address, the second one has // even addresses. - s := stack.New([]string{"fakeNet"}, nil) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, nil) id1, linkEP1 := channel.New(10, defaultMTU, "") if err := s.CreateNIC(1, id1); err != nil { @@ -371,7 +371,7 @@ func TestRoutes(t *testing.T) { // Create a stack with the fake network protocol, two nics, and two // addresses per nic, the first nic has odd address, the second one has // even addresses. 
- s := stack.New([]string{"fakeNet"}, nil) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, nil) id1, _ := channel.New(10, defaultMTU, "") if err := s.CreateNIC(1, id1); err != nil { @@ -435,7 +435,7 @@ func TestRoutes(t *testing.T) { } func TestAddressRemoval(t *testing.T) { - s := stack.New([]string{"fakeNet"}, nil) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, nil) id, linkEP := channel.New(10, defaultMTU, "") if err := s.CreateNIC(1, id); err != nil { @@ -479,7 +479,7 @@ func TestAddressRemoval(t *testing.T) { } func TestDelayedRemovalDueToRoute(t *testing.T) { - s := stack.New([]string{"fakeNet"}, nil) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, nil) id, linkEP := channel.New(10, defaultMTU, "") if err := s.CreateNIC(1, id); err != nil { @@ -547,7 +547,7 @@ func TestDelayedRemovalDueToRoute(t *testing.T) { } func TestPromiscuousMode(t *testing.T) { - s := stack.New([]string{"fakeNet"}, nil) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, nil) id, linkEP := channel.New(10, defaultMTU, "") if err := s.CreateNIC(1, id); err != nil { @@ -607,7 +607,7 @@ func TestAddressSpoofing(t *testing.T) { srcAddr := tcpip.Address("\x01") dstAddr := tcpip.Address("\x02") - s := stack.New([]string{"fakeNet"}, nil) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, nil) id, _ := channel.New(10, defaultMTU, "") if err := s.CreateNIC(1, id); err != nil { @@ -648,7 +648,7 @@ func TestAddressSpoofing(t *testing.T) { // Set the subnet, then check that packet is delivered. func TestSubnetAcceptsMatchingPacket(t *testing.T) { - s := stack.New([]string{"fakeNet"}, nil) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, nil) id, linkEP := channel.New(10, defaultMTU, "") if err := s.CreateNIC(1, id); err != nil { @@ -682,7 +682,7 @@ func TestSubnetAcceptsMatchingPacket(t *testing.T) { // Set destination outside the subnet, then check it doesn't get delivered. 
func TestSubnetRejectsNonmatchingPacket(t *testing.T) { - s := stack.New([]string{"fakeNet"}, nil) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, nil) id, linkEP := channel.New(10, defaultMTU, "") if err := s.CreateNIC(1, id); err != nil { @@ -714,7 +714,7 @@ func TestSubnetRejectsNonmatchingPacket(t *testing.T) { } func TestNetworkOptions(t *testing.T) { - s := stack.New([]string{"fakeNet"}, []string{}) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, []string{}) // Try an unsupported network protocol. if err := s.SetNetworkProtocolOption(tcpip.NetworkProtocolNumber(99999), fakeNetGoodOption(false)); err != tcpip.ErrUnknownProtocol { diff --git a/pkg/tcpip/stack/transport_test.go b/pkg/tcpip/stack/transport_test.go index 7e072e96ef..b870ab375d 100644 --- a/pkg/tcpip/stack/transport_test.go +++ b/pkg/tcpip/stack/transport_test.go @@ -46,8 +46,8 @@ func (*fakeTransportEndpoint) Readiness(mask waiter.EventMask) waiter.EventMask return mask } -func (*fakeTransportEndpoint) Read(*tcpip.FullAddress) (buffer.View, *tcpip.Error) { - return buffer.View{}, nil +func (*fakeTransportEndpoint) Read(*tcpip.FullAddress) (buffer.View, tcpip.ControlMessages, *tcpip.Error) { + return buffer.View{}, tcpip.ControlMessages{}, nil } func (f *fakeTransportEndpoint) Write(p tcpip.Payload, opts tcpip.WriteOptions) (uintptr, *tcpip.Error) { @@ -67,8 +67,8 @@ func (f *fakeTransportEndpoint) Write(p tcpip.Payload, opts tcpip.WriteOptions) return uintptr(len(v)), nil } -func (f *fakeTransportEndpoint) Peek([][]byte) (uintptr, *tcpip.Error) { - return 0, nil +func (f *fakeTransportEndpoint) Peek([][]byte) (uintptr, tcpip.ControlMessages, *tcpip.Error) { + return 0, tcpip.ControlMessages{}, nil } // SetSockOpt sets a socket option. Currently not supported. 
@@ -210,7 +210,7 @@ func (f *fakeTransportProtocol) Option(option interface{}) *tcpip.Error { func TestTransportReceive(t *testing.T) { id, linkEP := channel.New(10, defaultMTU, "") - s := stack.New([]string{"fakeNet"}, []string{"fakeTrans"}) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, []string{"fakeTrans"}) if err := s.CreateNIC(1, id); err != nil { t.Fatalf("CreateNIC failed: %v", err) } @@ -270,7 +270,7 @@ func TestTransportReceive(t *testing.T) { func TestTransportControlReceive(t *testing.T) { id, linkEP := channel.New(10, defaultMTU, "") - s := stack.New([]string{"fakeNet"}, []string{"fakeTrans"}) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, []string{"fakeTrans"}) if err := s.CreateNIC(1, id); err != nil { t.Fatalf("CreateNIC failed: %v", err) } @@ -336,7 +336,7 @@ func TestTransportControlReceive(t *testing.T) { func TestTransportSend(t *testing.T) { id, _ := channel.New(10, defaultMTU, "") - s := stack.New([]string{"fakeNet"}, []string{"fakeTrans"}) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, []string{"fakeTrans"}) if err := s.CreateNIC(1, id); err != nil { t.Fatalf("CreateNIC failed: %v", err) } @@ -373,7 +373,7 @@ func TestTransportSend(t *testing.T) { } func TestTransportOptions(t *testing.T) { - s := stack.New([]string{"fakeNet"}, []string{"fakeTrans"}) + s := stack.New(&tcpip.StdClock{}, []string{"fakeNet"}, []string{"fakeTrans"}) // Try an unsupported transport protocol. 
if err := s.SetTransportProtocolOption(tcpip.TransportProtocolNumber(99999), fakeTransportGoodOption(false)); err != tcpip.ErrUnknownProtocol { diff --git a/pkg/tcpip/tcpip.go b/pkg/tcpip/tcpip.go index f3a94f353a..f9df1d9899 100644 --- a/pkg/tcpip/tcpip.go +++ b/pkg/tcpip/tcpip.go @@ -23,6 +23,7 @@ import ( "fmt" "strconv" "strings" + "time" "gvisor.googlesource.com/gvisor/pkg/tcpip/buffer" "gvisor.googlesource.com/gvisor/pkg/waiter" @@ -80,6 +81,24 @@ var ( errSubnetAddressMasked = errors.New("subnet address has bits set outside the mask") ) +// A Clock provides the current time. +// +// Times returned by a Clock should always be used for application-visible +// time, but never for netstack internal timekeeping. +type Clock interface { + // NowNanoseconds returns the current real time as a number of + // nanoseconds since some epoch. + NowNanoseconds() int64 +} + +// StdClock implements Clock with the time package. +type StdClock struct{} + +// NowNanoseconds implements Clock.NowNanoseconds. +func (*StdClock) NowNanoseconds() int64 { + return time.Now().UnixNano() +} + // Address is a byte slice cast as a string that represents the address of a // network node. Or, in the case of unix endpoints, it may represent a path. type Address string @@ -210,6 +229,16 @@ func (s SlicePayload) Size() int { return len(s) } +// A ControlMessages contains socket control messages for IP sockets. +type ControlMessages struct { + // HasTimestamp indicates whether Timestamp is valid/set. + HasTimestamp bool + + // Timestamp is the time (in ns) that the last packet used to create + // the read data was received. + Timestamp int64 +} + // Endpoint is the interface implemented by transport protocols (e.g., tcp, udp) // that exposes functionality like read, write, connect, etc. to users of the // networking stack. @@ -219,9 +248,13 @@ type Endpoint interface { Close() // Read reads data from the endpoint and optionally returns the sender.
- // This method does not block if there is no data pending. - // It will also either return an error or data, never both. - Read(*FullAddress) (buffer.View, *Error) + // + // This method does not block if there is no data pending. It will also + // either return an error or data, never both. + // + // A timestamp (in ns) is optionally returned. A zero value indicates + // that no timestamp was available. + Read(*FullAddress) (buffer.View, ControlMessages, *Error) // Write writes data to the endpoint's peer. This method does not block if // the data cannot be written. @@ -238,7 +271,10 @@ type Endpoint interface { // Peek reads data without consuming it from the endpoint. // // This method does not block if there is no data pending. - Peek([][]byte) (uintptr, *Error) + // + // A timestamp (in ns) is optionally returned. A zero value indicates + // that no timestamp was available. + Peek([][]byte) (uintptr, ControlMessages, *Error) // Connect connects the endpoint to its peer. Specifying a NIC is // optional. @@ -347,6 +383,10 @@ type ReuseAddressOption int // Only supported on Unix sockets. type PasscredOption int +// TimestampOption is used by SetSockOpt/GetSockOpt to specify whether +// SO_TIMESTAMP socket control messages are enabled. +type TimestampOption int + // TCPInfoOption is used by GetSockOpt to expose TCP statistics. // // TODO: Add and populate stat fields. 
diff --git a/pkg/tcpip/transport/ping/BUILD b/pkg/tcpip/transport/ping/BUILD new file mode 100644 index 0000000000..a39a887b6b --- /dev/null +++ b/pkg/tcpip/transport/ping/BUILD @@ -0,0 +1,50 @@ +package(licenses = ["notice"]) # BSD + +load("@io_bazel_rules_go//go:def.bzl", "go_library") +load("//tools/go_generics:defs.bzl", "go_template_instance") +load("//tools/go_stateify:defs.bzl", "go_stateify") + +go_stateify( + name = "ping_state", + srcs = [ + "endpoint.go", + "endpoint_state.go", + "ping_packet_list.go", + ], + out = "ping_state.go", + imports = ["gvisor.googlesource.com/gvisor/pkg/tcpip/buffer"], + package = "ping", +) + +go_template_instance( + name = "ping_packet_list", + out = "ping_packet_list.go", + package = "ping", + prefix = "pingPacket", + template = "//pkg/ilist:generic_list", + types = { + "Linker": "*pingPacket", + }, +) + +go_library( + name = "ping", + srcs = [ + "endpoint.go", + "endpoint_state.go", + "ping_packet_list.go", + "ping_state.go", + "protocol.go", + ], + importpath = "gvisor.googlesource.com/gvisor/pkg/tcpip/transport/ping", + visibility = ["//visibility:public"], + deps = [ + "//pkg/sleep", + "//pkg/state", + "//pkg/tcpip", + "//pkg/tcpip/buffer", + "//pkg/tcpip/header", + "//pkg/tcpip/stack", + "//pkg/waiter", + ], +) diff --git a/pkg/tcpip/transport/ping/endpoint.go b/pkg/tcpip/transport/ping/endpoint.go new file mode 100644 index 0000000000..609e7d9479 --- /dev/null +++ b/pkg/tcpip/transport/ping/endpoint.go @@ -0,0 +1,665 @@ +// Copyright 2016 The Netstack Authors. All rights reserved. +// Use of this source code is governed by a BSD-style +// license that can be found in the LICENSE file. 
+ +package ping + +import ( + "encoding/binary" + "sync" + + "gvisor.googlesource.com/gvisor/pkg/sleep" + "gvisor.googlesource.com/gvisor/pkg/tcpip" + "gvisor.googlesource.com/gvisor/pkg/tcpip/buffer" + "gvisor.googlesource.com/gvisor/pkg/tcpip/header" + "gvisor.googlesource.com/gvisor/pkg/tcpip/stack" + "gvisor.googlesource.com/gvisor/pkg/waiter" +) + +type pingPacket struct { + pingPacketEntry + senderAddress tcpip.FullAddress + data buffer.VectorisedView `state:".(buffer.VectorisedView)"` + timestamp int64 + hasTimestamp bool + // views is used as buffer for data when its length is large + // enough to store a VectorisedView. + views [8]buffer.View `state:"nosave"` +} + +type endpointState int + +const ( + stateInitial endpointState = iota + stateBound + stateConnected + stateClosed +) + +// endpoint represents a ping endpoint. This struct serves as the interface +// between users of the endpoint and the protocol implementation; it is legal to +// have concurrent goroutines make calls into the endpoint, they are properly +// synchronized. +type endpoint struct { + // The following fields are initialized at creation time and do not + // change throughout the lifetime of the endpoint. + stack *stack.Stack `state:"manual"` + netProto tcpip.NetworkProtocolNumber + waiterQueue *waiter.Queue + + // The following fields are used to manage the receive queue, and are + // protected by rcvMu. + rcvMu sync.Mutex `state:"nosave"` + rcvReady bool + rcvList pingPacketList + rcvBufSizeMax int + rcvBufSize int + rcvClosed bool + rcvTimestamp bool + + // The following fields are protected by the mu mutex. 
+ mu sync.RWMutex `state:"nosave"` + sndBufSize int + id stack.TransportEndpointID + state endpointState + bindNICID tcpip.NICID + bindAddr tcpip.Address + regNICID tcpip.NICID + route stack.Route `state:"manual"` +} + +func newEndpoint(stack *stack.Stack, netProto tcpip.NetworkProtocolNumber, waiterQueue *waiter.Queue) *endpoint { + return &endpoint{ + stack: stack, + netProto: netProto, + waiterQueue: waiterQueue, + rcvBufSizeMax: 32 * 1024, + sndBufSize: 32 * 1024, + } +} + +// Close puts the endpoint in a closed state and frees all resources +// associated with it. +func (e *endpoint) Close() { + e.mu.Lock() + defer e.mu.Unlock() + + switch e.state { + case stateBound, stateConnected: + e.stack.UnregisterTransportEndpoint(e.regNICID, []tcpip.NetworkProtocolNumber{e.netProto}, ProtocolNumber4, e.id) + } + + // Close the receive list and drain it. + e.rcvMu.Lock() + e.rcvClosed = true + e.rcvBufSize = 0 + for !e.rcvList.Empty() { + p := e.rcvList.Front() + e.rcvList.Remove(p) + } + e.rcvMu.Unlock() + + e.route.Release() + + // Update the state. + e.state = stateClosed +} + +// Read reads data from the endpoint. This method does not block if +// there is no data pending. +func (e *endpoint) Read(addr *tcpip.FullAddress) (buffer.View, tcpip.ControlMessages, *tcpip.Error) { + e.rcvMu.Lock() + + if e.rcvList.Empty() { + err := tcpip.ErrWouldBlock + if e.rcvClosed { + err = tcpip.ErrClosedForReceive + } + e.rcvMu.Unlock() + return buffer.View{}, tcpip.ControlMessages{}, err + } + + p := e.rcvList.Front() + e.rcvList.Remove(p) + e.rcvBufSize -= p.data.Size() + ts := e.rcvTimestamp + + e.rcvMu.Unlock() + + if addr != nil { + *addr = p.senderAddress + } + + if ts && !p.hasTimestamp { + // Linux uses the current time. + p.timestamp = e.stack.NowNanoseconds() + } + + return p.data.ToView(), tcpip.ControlMessages{HasTimestamp: ts, Timestamp: p.timestamp}, nil +} + +// prepareForWrite prepares the endpoint for sending data. 
In particular, it +// binds it if it's still in the initial state. To do so, it must first +// reacquire the mutex in exclusive mode. +// +// Returns true for retry if preparation should be retried. +func (e *endpoint) prepareForWrite(to *tcpip.FullAddress) (retry bool, err *tcpip.Error) { + switch e.state { + case stateInitial: + case stateConnected: + return false, nil + + case stateBound: + if to == nil { + return false, tcpip.ErrDestinationRequired + } + return false, nil + default: + return false, tcpip.ErrInvalidEndpointState + } + + e.mu.RUnlock() + defer e.mu.RLock() + + e.mu.Lock() + defer e.mu.Unlock() + + // The state changed when we released the shared lock and re-acquired + // it in exclusive mode. Try again. + if e.state != stateInitial { + return true, nil + } + + // The state is still 'initial', so try to bind the endpoint. + if err := e.bindLocked(tcpip.FullAddress{}, nil); err != nil { + return false, err + } + + return true, nil +} + +// Write writes data to the endpoint's peer. This method does not block +// if the data cannot be written. +func (e *endpoint) Write(p tcpip.Payload, opts tcpip.WriteOptions) (uintptr, *tcpip.Error) { + // MSG_MORE is unimplemented. (This also means that MSG_EOR is a no-op.) + if opts.More { + return 0, tcpip.ErrInvalidOptionValue + } + + to := opts.To + + e.mu.RLock() + defer e.mu.RUnlock() + + // Prepare for write. + for { + retry, err := e.prepareForWrite(to) + if err != nil { + return 0, err + } + + if !retry { + break + } + } + + var route *stack.Route + if to == nil { + route = &e.route + + if route.IsResolutionRequired() { + // Promote lock to exclusive if using a shared route, given that it may + // need to change in Route.Resolve() call below. + e.mu.RUnlock() + defer e.mu.RLock() + + e.mu.Lock() + defer e.mu.Unlock() + + // Recheck state after lock was re-acquired.
+ if e.state != stateConnected { + return 0, tcpip.ErrInvalidEndpointState + } + } + } else { + // Reject destination address if it goes through a different + // NIC than the endpoint was bound to. + nicid := to.NIC + if e.bindNICID != 0 { + if nicid != 0 && nicid != e.bindNICID { + return 0, tcpip.ErrNoRoute + } + + nicid = e.bindNICID + } + + toCopy := *to + to = &toCopy + netProto, err := e.checkV4Mapped(to, true) + if err != nil { + return 0, err + } + + // Find the route. + r, err := e.stack.FindRoute(nicid, e.bindAddr, to.Addr, netProto) + if err != nil { + return 0, err + } + defer r.Release() + + route = &r + } + + if route.IsResolutionRequired() { + waker := &sleep.Waker{} + if err := route.Resolve(waker); err != nil { + if err == tcpip.ErrWouldBlock { + // Link address needs to be resolved. Resolution was triggered in the + // background. Better luck next time. + // + // TODO: queue up the request and send after link address + // is resolved. + route.RemoveWaker(waker) + return 0, tcpip.ErrNoLinkAddress + } + return 0, err + } + } + + v, err := p.Get(p.Size()) + if err != nil { + return 0, err + } + + switch e.netProto { + case header.IPv4ProtocolNumber: + err = sendPing4(route, e.id.LocalPort, v) + + case header.IPv6ProtocolNumber: + // TODO: Support IPv6. + } + + return uintptr(len(v)), err +} + +// Peek only returns data from a single datagram, so do nothing here. +func (e *endpoint) Peek([][]byte) (uintptr, tcpip.ControlMessages, *tcpip.Error) { + return 0, tcpip.ControlMessages{}, nil +} + +// SetSockOpt sets a socket option. Only tcpip.TimestampOption is handled. +func (e *endpoint) SetSockOpt(opt interface{}) *tcpip.Error { + switch v := opt.(type) { + case tcpip.TimestampOption: + e.rcvMu.Lock() + e.rcvTimestamp = v != 0 + e.rcvMu.Unlock() + } + return nil +} + +// GetSockOpt implements tcpip.Endpoint.GetSockOpt.
+func (e *endpoint) GetSockOpt(opt interface{}) *tcpip.Error { + switch o := opt.(type) { + case tcpip.ErrorOption: + return nil + + case *tcpip.SendBufferSizeOption: + e.mu.Lock() + *o = tcpip.SendBufferSizeOption(e.sndBufSize) + e.mu.Unlock() + return nil + + case *tcpip.ReceiveBufferSizeOption: + e.rcvMu.Lock() + *o = tcpip.ReceiveBufferSizeOption(e.rcvBufSizeMax) + e.rcvMu.Unlock() + return nil + + case *tcpip.ReceiveQueueSizeOption: + e.rcvMu.Lock() + if e.rcvList.Empty() { + *o = 0 + } else { + p := e.rcvList.Front() + *o = tcpip.ReceiveQueueSizeOption(p.data.Size()) + } + e.rcvMu.Unlock() + return nil + + case *tcpip.TimestampOption: + e.rcvMu.Lock() + *o = 0 + if e.rcvTimestamp { + *o = 1 + } + e.rcvMu.Unlock() + } + + return tcpip.ErrUnknownProtocolOption +} + +func sendPing4(r *stack.Route, ident uint16, data buffer.View) *tcpip.Error { + if len(data) < header.ICMPv4EchoMinimumSize { + return tcpip.ErrInvalidEndpointState + } + + // Set the ident. Sequence number is provided by the user. + binary.BigEndian.PutUint16(data[header.ICMPv4MinimumSize:], ident) + + hdr := buffer.NewPrependable(header.ICMPv4EchoMinimumSize + int(r.MaxHeaderLength())) + + icmpv4 := header.ICMPv4(hdr.Prepend(header.ICMPv4EchoMinimumSize)) + copy(icmpv4, data) + data = data[header.ICMPv4EchoMinimumSize:] + + // Linux performs these basic checks. + if icmpv4.Type() != header.ICMPv4Echo || icmpv4.Code() != 0 { + return tcpip.ErrInvalidEndpointState + } + + icmpv4.SetChecksum(0) + icmpv4.SetChecksum(^header.Checksum(icmpv4, header.Checksum(data, 0))) + + return r.WritePacket(&hdr, data, header.ICMPv4ProtocolNumber) +} + +func (e *endpoint) checkV4Mapped(addr *tcpip.FullAddress, allowMismatch bool) (tcpip.NetworkProtocolNumber, *tcpip.Error) { + netProto := e.netProto + if header.IsV4MappedAddress(addr.Addr) { + return 0, tcpip.ErrNoRoute + } + + // Fail if we're bound to an address length different from the one we're + // checking. 
+ if l := len(e.id.LocalAddress); !allowMismatch && l != 0 && l != len(addr.Addr) { + return 0, tcpip.ErrInvalidEndpointState + } + + return netProto, nil +} + +// Connect connects the endpoint to its peer. Specifying a NIC is optional. +func (e *endpoint) Connect(addr tcpip.FullAddress) *tcpip.Error { + e.mu.Lock() + defer e.mu.Unlock() + + nicid := addr.NIC + localPort := uint16(0) + switch e.state { + case stateBound, stateConnected: + localPort = e.id.LocalPort + if e.bindNICID == 0 { + break + } + + if nicid != 0 && nicid != e.bindNICID { + return tcpip.ErrInvalidEndpointState + } + + nicid = e.bindNICID + default: + return tcpip.ErrInvalidEndpointState + } + + netProto, err := e.checkV4Mapped(&addr, false) + if err != nil { + return err + } + + // Find a route to the desired destination. + r, err := e.stack.FindRoute(nicid, e.bindAddr, addr.Addr, netProto) + if err != nil { + return err + } + defer r.Release() + + id := stack.TransportEndpointID{ + LocalAddress: r.LocalAddress, + LocalPort: localPort, + RemoteAddress: r.RemoteAddress, + } + + // Even if we're connected, this endpoint can still be used to send + // packets on a different network protocol, so we register both even if + // v6only is set to false and this is an ipv6 endpoint. + netProtos := []tcpip.NetworkProtocolNumber{netProto} + + id, err = e.registerWithStack(nicid, netProtos, id) + if err != nil { + return err + } + + e.id = id + e.route = r.Clone() + e.regNICID = nicid + + e.state = stateConnected + + e.rcvMu.Lock() + e.rcvReady = true + e.rcvMu.Unlock() + + return nil +} + +// ConnectEndpoint is not supported. +func (*endpoint) ConnectEndpoint(tcpip.Endpoint) *tcpip.Error { + return tcpip.ErrInvalidEndpointState +} + +// Shutdown closes the read and/or write end of the endpoint connection +// to its peer. 
+func (e *endpoint) Shutdown(flags tcpip.ShutdownFlags) *tcpip.Error { + e.mu.RLock() + defer e.mu.RUnlock() + + if e.state != stateConnected { + return tcpip.ErrNotConnected + } + + if flags&tcpip.ShutdownRead != 0 { + e.rcvMu.Lock() + wasClosed := e.rcvClosed + e.rcvClosed = true + e.rcvMu.Unlock() + + if !wasClosed { + e.waiterQueue.Notify(waiter.EventIn) + } + } + + return nil +} + +// Listen is not supported by ping endpoints, it just fails. +func (*endpoint) Listen(int) *tcpip.Error { + return tcpip.ErrNotSupported +} + +// Accept is not supported by ping endpoints, it just fails. +func (*endpoint) Accept() (tcpip.Endpoint, *waiter.Queue, *tcpip.Error) { + return nil, nil, tcpip.ErrNotSupported +} + +func (e *endpoint) registerWithStack(nicid tcpip.NICID, netProtos []tcpip.NetworkProtocolNumber, id stack.TransportEndpointID) (stack.TransportEndpointID, *tcpip.Error) { + if id.LocalPort != 0 { + // The endpoint already has a local port, just attempt to + // register it. + err := e.stack.RegisterTransportEndpoint(nicid, netProtos, ProtocolNumber4, id, e) + return id, err + } + + // We need to find a port for the endpoint. + _, err := e.stack.PickEphemeralPort(func(p uint16) (bool, *tcpip.Error) { + id.LocalPort = p + err := e.stack.RegisterTransportEndpoint(nicid, netProtos, ProtocolNumber4, id, e) + switch err { + case nil: + return true, nil + case tcpip.ErrPortInUse: + return false, nil + default: + return false, err + } + }) + + return id, err +} + +func (e *endpoint) bindLocked(addr tcpip.FullAddress, commit func() *tcpip.Error) *tcpip.Error { + // Don't allow binding once endpoint is not in the initial state + // anymore. + if e.state != stateInitial { + return tcpip.ErrInvalidEndpointState + } + + netProto, err := e.checkV4Mapped(&addr, false) + if err != nil { + return err + } + + // Expand netProtos to include v4 and v6 if the caller is binding to a + // wildcard (empty) address, and this is an IPv6 endpoint with v6only + // set to false.
+ netProtos := []tcpip.NetworkProtocolNumber{netProto} + + if len(addr.Addr) != 0 { + // A local address was specified, verify that it's valid. + if e.stack.CheckLocalAddress(addr.NIC, netProto, addr.Addr) == 0 { + return tcpip.ErrBadLocalAddress + } + } + + id := stack.TransportEndpointID{ + LocalPort: addr.Port, + LocalAddress: addr.Addr, + } + id, err = e.registerWithStack(addr.NIC, netProtos, id) + if err != nil { + return err + } + if commit != nil { + if err := commit(); err != nil { + // Unregister, the commit failed. + e.stack.UnregisterTransportEndpoint(addr.NIC, netProtos, ProtocolNumber4, id) + return err + } + } + + e.id = id + e.regNICID = addr.NIC + + // Mark endpoint as bound. + e.state = stateBound + + e.rcvMu.Lock() + e.rcvReady = true + e.rcvMu.Unlock() + + return nil +} + +// Bind binds the endpoint to a specific local address and port. +// Specifying a NIC is optional. +func (e *endpoint) Bind(addr tcpip.FullAddress, commit func() *tcpip.Error) *tcpip.Error { + e.mu.Lock() + defer e.mu.Unlock() + + err := e.bindLocked(addr, commit) + if err != nil { + return err + } + + e.bindNICID = addr.NIC + e.bindAddr = addr.Addr + + return nil +} + +// GetLocalAddress returns the address to which the endpoint is bound. +func (e *endpoint) GetLocalAddress() (tcpip.FullAddress, *tcpip.Error) { + e.mu.RLock() + defer e.mu.RUnlock() + + return tcpip.FullAddress{ + NIC: e.regNICID, + Addr: e.id.LocalAddress, + Port: e.id.LocalPort, + }, nil +} + +// GetRemoteAddress returns the address to which the endpoint is connected. +func (e *endpoint) GetRemoteAddress() (tcpip.FullAddress, *tcpip.Error) { + e.mu.RLock() + defer e.mu.RUnlock() + + if e.state != stateConnected { + return tcpip.FullAddress{}, tcpip.ErrNotConnected + } + + return tcpip.FullAddress{ + NIC: e.regNICID, + Addr: e.id.RemoteAddress, + Port: e.id.RemotePort, + }, nil +} + +// Readiness returns the current readiness of the endpoint. 
For example, if +// waiter.EventIn is set, the endpoint is immediately readable. +func (e *endpoint) Readiness(mask waiter.EventMask) waiter.EventMask { + // The endpoint is always writable. + result := waiter.EventOut & mask + + // Determine if the endpoint is readable if requested. + if (mask & waiter.EventIn) != 0 { + e.rcvMu.Lock() + if !e.rcvList.Empty() || e.rcvClosed { + result |= waiter.EventIn + } + e.rcvMu.Unlock() + } + + return result +} + +// HandlePacket is called by the stack when new packets arrive to this transport +// endpoint. +func (e *endpoint) HandlePacket(r *stack.Route, id stack.TransportEndpointID, vv *buffer.VectorisedView) { + e.rcvMu.Lock() + + // Drop the packet if our buffer is currently full. + if !e.rcvReady || e.rcvClosed || e.rcvBufSize >= e.rcvBufSizeMax { + e.rcvMu.Unlock() + return + } + + wasEmpty := e.rcvBufSize == 0 + + // Push new packet into receive list and increment the buffer size. + pkt := &pingPacket{ + senderAddress: tcpip.FullAddress{ + NIC: r.NICID(), + Addr: id.RemoteAddress, + }, + } + pkt.data = vv.Clone(pkt.views[:]) + e.rcvList.PushBack(pkt) + e.rcvBufSize += vv.Size() + + if e.rcvTimestamp { + pkt.timestamp = e.stack.NowNanoseconds() + pkt.hasTimestamp = true + } + + e.rcvMu.Unlock() + + // Notify any waiters that there's data to be read now. + if wasEmpty { + e.waiterQueue.Notify(waiter.EventIn) + } +} + +// HandleControlPacket implements stack.TransportEndpoint.HandleControlPacket. +func (e *endpoint) HandleControlPacket(id stack.TransportEndpointID, typ stack.ControlType, extra uint32, vv *buffer.VectorisedView) { +} diff --git a/pkg/tcpip/transport/ping/endpoint_state.go b/pkg/tcpip/transport/ping/endpoint_state.go new file mode 100644 index 0000000000..e1664f0499 --- /dev/null +++ b/pkg/tcpip/transport/ping/endpoint_state.go @@ -0,0 +1,61 @@ +// Copyright 2016 The Netstack Authors. All rights reserved. 
+// Use of this source code is governed by a BSD-style +// license that can be found in the LICENSE file. + +package ping + +import ( + "gvisor.googlesource.com/gvisor/pkg/tcpip" + "gvisor.googlesource.com/gvisor/pkg/tcpip/buffer" + "gvisor.googlesource.com/gvisor/pkg/tcpip/stack" +) + +// saveData saves pingPacket.data field. +func (p *pingPacket) saveData() buffer.VectorisedView { + // We cannot save p.data directly as p.data.views may alias to p.views, + // which is not allowed by state framework (in-struct pointer). + return p.data.Clone(nil) +} + +// loadData loads pingPacket.data field. +func (p *pingPacket) loadData(data buffer.VectorisedView) { + // NOTE: We cannot do the p.data = data.Clone(p.views[:]) optimization + // here because data.views is not guaranteed to be loaded by now. Plus, + // data.views will be allocated anyway so there really is little point + // of utilizing p.views for data.views. + p.data = data +} + +// beforeSave is invoked by stateify. +func (e *endpoint) beforeSave() { + // Stop incoming packets from being handled (and mutate endpoint state). + e.rcvMu.Lock() +} + +// afterLoad is invoked by stateify. 
+func (e *endpoint) afterLoad() { + e.stack = stack.StackFromEnv + + if e.state != stateBound && e.state != stateConnected { + return + } + + var err *tcpip.Error + if e.state == stateConnected { + e.route, err = e.stack.FindRoute(e.regNICID, e.bindAddr, e.id.RemoteAddress, e.netProto) + if err != nil { + panic(*err) + } + + e.id.LocalAddress = e.route.LocalAddress + } else if len(e.id.LocalAddress) != 0 { // stateBound + if e.stack.CheckLocalAddress(e.regNICID, e.netProto, e.id.LocalAddress) == 0 { + panic(tcpip.ErrBadLocalAddress) + } + } + + e.id, err = e.registerWithStack(e.regNICID, []tcpip.NetworkProtocolNumber{e.netProto}, e.id) + if err != nil { + panic(*err) + } +} diff --git a/pkg/tcpip/transport/ping/protocol.go b/pkg/tcpip/transport/ping/protocol.go new file mode 100644 index 0000000000..1459b4d60c --- /dev/null +++ b/pkg/tcpip/transport/ping/protocol.go @@ -0,0 +1,106 @@ +// Copyright 2016 The Netstack Authors. All rights reserved. +// Use of this source code is governed by a BSD-style +// license that can be found in the LICENSE file. + +// Package ping contains the implementation of the ICMP and IPv6-ICMP transport +// protocols for use in ping. To use it in the networking stack, this package +// must be added to the project, and +// activated on the stack by passing ping.ProtocolName4 (or "ping4") and/or +// ping.ProtocolName6 (or "ping6") as one of the transport protocols when +// calling stack.New(). Then endpoints can be created by passing +// ping.ProtocolNumber4 or ping.ProtocolNumber6 as the transport protocol number +// when calling Stack.NewEndpoint(). +package ping + +import ( + "encoding/binary" + "fmt" + + "gvisor.googlesource.com/gvisor/pkg/tcpip" + "gvisor.googlesource.com/gvisor/pkg/tcpip/buffer" + "gvisor.googlesource.com/gvisor/pkg/tcpip/header" + "gvisor.googlesource.com/gvisor/pkg/tcpip/stack" + "gvisor.googlesource.com/gvisor/pkg/waiter" +) + +const ( + // ProtocolName4 is the string representation of the ping protocol name.
+ ProtocolName4 = "ping4" + + // ProtocolNumber4 is the ICMP protocol number. + ProtocolNumber4 = header.ICMPv4ProtocolNumber + + // ProtocolName6 is the string representation of the ping protocol name. + ProtocolName6 = "ping6" + + // ProtocolNumber6 is the IPv6-ICMP protocol number. + ProtocolNumber6 = header.ICMPv6ProtocolNumber +) + +type protocol struct { + number tcpip.TransportProtocolNumber +} + +// Number returns the ICMP protocol number. +func (p *protocol) Number() tcpip.TransportProtocolNumber { + return p.number +} + +func (p *protocol) netProto() tcpip.NetworkProtocolNumber { + switch p.number { + case ProtocolNumber4: + return header.IPv4ProtocolNumber + case ProtocolNumber6: + return header.IPv6ProtocolNumber + } + panic(fmt.Sprint("unknown protocol number: ", p.number)) +} + +// NewEndpoint creates a new ping endpoint. +func (p *protocol) NewEndpoint(stack *stack.Stack, netProto tcpip.NetworkProtocolNumber, waiterQueue *waiter.Queue) (tcpip.Endpoint, *tcpip.Error) { + if netProto != p.netProto() { + return nil, tcpip.ErrUnknownProtocol + } + return newEndpoint(stack, netProto, waiterQueue), nil +} + +// MinimumPacketSize returns the minimum valid ping packet size. +func (p *protocol) MinimumPacketSize() int { + switch p.number { + case ProtocolNumber4: + return header.ICMPv4EchoMinimumSize + case ProtocolNumber6: + return header.ICMPv6EchoMinimumSize + } + panic(fmt.Sprint("unknown protocol number: ", p.number)) +} + +// ParsePorts returns 0 as the source port and the ICMP echo identifier of the +// given ping packet as the destination port. +func (*protocol) ParsePorts(v buffer.View) (src, dst uint16, err *tcpip.Error) { + return 0, binary.BigEndian.Uint16(v[header.ICMPv4MinimumSize:]), nil +} + +// HandleUnknownDestinationPacket handles packets targeted at this protocol but +// that don't match any existing endpoint.
+func (p *protocol) HandleUnknownDestinationPacket(*stack.Route, stack.TransportEndpointID, *buffer.VectorisedView) bool { + return true +} + +// SetOption implements TransportProtocol.SetOption. +func (p *protocol) SetOption(option interface{}) *tcpip.Error { + return tcpip.ErrUnknownProtocolOption +} + +// Option implements TransportProtocol.Option. +func (p *protocol) Option(option interface{}) *tcpip.Error { + return tcpip.ErrUnknownProtocolOption +} + +func init() { + stack.RegisterTransportProtocolFactory(ProtocolName4, func() stack.TransportProtocol { + return &protocol{ProtocolNumber4} + }) + + // TODO: Support IPv6. +} diff --git a/pkg/tcpip/transport/tcp/endpoint.go b/pkg/tcpip/transport/tcp/endpoint.go index 5d62589d88..d84171b0c4 100644 --- a/pkg/tcpip/transport/tcp/endpoint.go +++ b/pkg/tcpip/transport/tcp/endpoint.go @@ -374,7 +374,7 @@ func (e *endpoint) cleanup() { } // Read reads data from the endpoint. -func (e *endpoint) Read(*tcpip.FullAddress) (buffer.View, *tcpip.Error) { +func (e *endpoint) Read(*tcpip.FullAddress) (buffer.View, tcpip.ControlMessages, *tcpip.Error) { e.mu.RLock() // The endpoint can be read if it's connected, or if it's already closed // but has some pending unread data. 
Also note that a RST being received @@ -383,9 +383,9 @@ func (e *endpoint) Read(*tcpip.FullAddress) (buffer.View, *tcpip.Error) { if s := e.state; s != stateConnected && s != stateClosed && e.rcvBufUsed == 0 { e.mu.RUnlock() if s == stateError { - return buffer.View{}, e.hardError + return buffer.View{}, tcpip.ControlMessages{}, e.hardError } - return buffer.View{}, tcpip.ErrInvalidEndpointState + return buffer.View{}, tcpip.ControlMessages{}, tcpip.ErrInvalidEndpointState } e.rcvListMu.Lock() @@ -394,7 +394,7 @@ func (e *endpoint) Read(*tcpip.FullAddress) (buffer.View, *tcpip.Error) { e.mu.RUnlock() - return v, err + return v, tcpip.ControlMessages{}, err } func (e *endpoint) readLocked() (buffer.View, *tcpip.Error) { @@ -498,7 +498,7 @@ func (e *endpoint) Write(p tcpip.Payload, opts tcpip.WriteOptions) (uintptr, *tc // Peek reads data without consuming it from the endpoint. // // This method does not block if there is no data pending. -func (e *endpoint) Peek(vec [][]byte) (uintptr, *tcpip.Error) { +func (e *endpoint) Peek(vec [][]byte) (uintptr, tcpip.ControlMessages, *tcpip.Error) { e.mu.RLock() defer e.mu.RUnlock() @@ -506,9 +506,9 @@ func (e *endpoint) Peek(vec [][]byte) (uintptr, *tcpip.Error) { // but has some pending unread data. if s := e.state; s != stateConnected && s != stateClosed { if s == stateError { - return 0, e.hardError + return 0, tcpip.ControlMessages{}, e.hardError } - return 0, tcpip.ErrInvalidEndpointState + return 0, tcpip.ControlMessages{}, tcpip.ErrInvalidEndpointState } e.rcvListMu.Lock() @@ -516,9 +516,9 @@ func (e *endpoint) Peek(vec [][]byte) (uintptr, *tcpip.Error) { if e.rcvBufUsed == 0 { if e.rcvClosed || e.state != stateConnected { - return 0, tcpip.ErrClosedForReceive + return 0, tcpip.ControlMessages{}, tcpip.ErrClosedForReceive } - return 0, tcpip.ErrWouldBlock + return 0, tcpip.ControlMessages{}, tcpip.ErrWouldBlock } // Make a copy of vec so we can modify the slide headers. 
@@ -534,7 +534,7 @@ func (e *endpoint) Peek(vec [][]byte) (uintptr, *tcpip.Error) { for len(v) > 0 { if len(vec) == 0 { - return num, nil + return num, tcpip.ControlMessages{}, nil } if len(vec[0]) == 0 { vec = vec[1:] @@ -549,7 +549,7 @@ func (e *endpoint) Peek(vec [][]byte) (uintptr, *tcpip.Error) { } } - return num, nil + return num, tcpip.ControlMessages{}, nil } // zeroReceiveWindow checks if the receive window to be announced now would be diff --git a/pkg/tcpip/transport/tcp/tcp_test.go b/pkg/tcpip/transport/tcp/tcp_test.go index 118d861ba9..3c21a1ec32 100644 --- a/pkg/tcpip/transport/tcp/tcp_test.go +++ b/pkg/tcpip/transport/tcp/tcp_test.go @@ -147,7 +147,7 @@ func TestSimpleReceive(t *testing.T) { c.WQ.EventRegister(&we, waiter.EventIn) defer c.WQ.EventUnregister(&we) - if _, err := c.EP.Read(nil); err != tcpip.ErrWouldBlock { + if _, _, err := c.EP.Read(nil); err != tcpip.ErrWouldBlock { t.Fatalf("Unexpected error from Read: %v", err) } @@ -169,7 +169,7 @@ func TestSimpleReceive(t *testing.T) { } // Receive data. - v, err := c.EP.Read(nil) + v, _, err := c.EP.Read(nil) if err != nil { t.Fatalf("Unexpected error from Read: %v", err) } @@ -199,7 +199,7 @@ func TestOutOfOrderReceive(t *testing.T) { c.WQ.EventRegister(&we, waiter.EventIn) defer c.WQ.EventUnregister(&we) - if _, err := c.EP.Read(nil); err != tcpip.ErrWouldBlock { + if _, _, err := c.EP.Read(nil); err != tcpip.ErrWouldBlock { t.Fatalf("Unexpected error from Read: %v", err) } @@ -226,7 +226,7 @@ func TestOutOfOrderReceive(t *testing.T) { // Wait 200ms and check that no data has been received. time.Sleep(200 * time.Millisecond) - if _, err := c.EP.Read(nil); err != tcpip.ErrWouldBlock { + if _, _, err := c.EP.Read(nil); err != tcpip.ErrWouldBlock { t.Fatalf("Unexpected error from Read: %v", err) } @@ -243,7 +243,7 @@ func TestOutOfOrderReceive(t *testing.T) { // Receive data. 
read := make([]byte, 0, 6) for len(read) < len(data) { - v, err := c.EP.Read(nil) + v, _, err := c.EP.Read(nil) if err != nil { if err == tcpip.ErrWouldBlock { // Wait for receive to be notified. @@ -284,7 +284,7 @@ func TestOutOfOrderFlood(t *testing.T) { opt := tcpip.ReceiveBufferSizeOption(10) c.CreateConnected(789, 30000, &opt) - if _, err := c.EP.Read(nil); err != tcpip.ErrWouldBlock { + if _, _, err := c.EP.Read(nil); err != tcpip.ErrWouldBlock { t.Fatalf("Unexpected error from Read: %v", err) } @@ -361,7 +361,7 @@ func TestRstOnCloseWithUnreadData(t *testing.T) { c.WQ.EventRegister(&we, waiter.EventIn) defer c.WQ.EventUnregister(&we) - if _, err := c.EP.Read(nil); err != tcpip.ErrWouldBlock { + if _, _, err := c.EP.Read(nil); err != tcpip.ErrWouldBlock { t.Fatalf("Unexpected error from Read: %v", err) } @@ -414,7 +414,7 @@ func TestFullWindowReceive(t *testing.T) { c.WQ.EventRegister(&we, waiter.EventIn) defer c.WQ.EventUnregister(&we) - _, err := c.EP.Read(nil) + _, _, err := c.EP.Read(nil) if err != tcpip.ErrWouldBlock { t.Fatalf("Unexpected error from Read: %v", err) } @@ -449,7 +449,7 @@ func TestFullWindowReceive(t *testing.T) { ) // Receive data and check it. - v, err := c.EP.Read(nil) + v, _, err := c.EP.Read(nil) if err != nil { t.Fatalf("Unexpected error from Read: %v", err) } @@ -487,7 +487,7 @@ func TestNoWindowShrinking(t *testing.T) { c.WQ.EventRegister(&we, waiter.EventIn) defer c.WQ.EventUnregister(&we) - _, err := c.EP.Read(nil) + _, _, err := c.EP.Read(nil) if err != tcpip.ErrWouldBlock { t.Fatalf("Unexpected error from Read: %v", err) } @@ -551,7 +551,7 @@ func TestNoWindowShrinking(t *testing.T) { // Receive data and check it. read := make([]byte, 0, 10) for len(read) < len(data) { - v, err := c.EP.Read(nil) + v, _, err := c.EP.Read(nil) if err != nil { t.Fatalf("Unexpected error from Read: %v", err) } @@ -954,7 +954,7 @@ func TestZeroScaledWindowReceive(t *testing.T) { } // Read some data. An ack should be sent in response to that. 
- v, err := c.EP.Read(nil) + v, _, err := c.EP.Read(nil) if err != nil { t.Fatalf("Unexpected error from Read: %v", err) } @@ -1337,7 +1337,7 @@ func TestReceiveOnResetConnection(t *testing.T) { loop: for { - switch _, err := c.EP.Read(nil); err { + switch _, _, err := c.EP.Read(nil); err { case nil: t.Fatalf("Unexpected success.") case tcpip.ErrWouldBlock: @@ -2293,7 +2293,7 @@ func TestReadAfterClosedState(t *testing.T) { c.WQ.EventRegister(&we, waiter.EventIn) defer c.WQ.EventUnregister(&we) - if _, err := c.EP.Read(nil); err != tcpip.ErrWouldBlock { + if _, _, err := c.EP.Read(nil); err != tcpip.ErrWouldBlock { t.Fatalf("Unexpected error from Read: %v", err) } @@ -2345,7 +2345,7 @@ func TestReadAfterClosedState(t *testing.T) { // Check that peek works. peekBuf := make([]byte, 10) - n, err := c.EP.Peek([][]byte{peekBuf}) + n, _, err := c.EP.Peek([][]byte{peekBuf}) if err != nil { t.Fatalf("Unexpected error from Peek: %v", err) } @@ -2356,7 +2356,7 @@ func TestReadAfterClosedState(t *testing.T) { } // Receive data. - v, err := c.EP.Read(nil) + v, _, err := c.EP.Read(nil) if err != nil { t.Fatalf("Unexpected error from Read: %v", err) } @@ -2367,11 +2367,11 @@ func TestReadAfterClosedState(t *testing.T) { // Now that we drained the queue, check that functions fail with the // right error code. 
- if _, err := c.EP.Read(nil); err != tcpip.ErrClosedForReceive { + if _, _, err := c.EP.Read(nil); err != tcpip.ErrClosedForReceive { t.Fatalf("Unexpected return from Read: got %v, want %v", err, tcpip.ErrClosedForReceive) } - if _, err := c.EP.Peek([][]byte{peekBuf}); err != tcpip.ErrClosedForReceive { + if _, _, err := c.EP.Peek([][]byte{peekBuf}); err != tcpip.ErrClosedForReceive { t.Fatalf("Unexpected return from Peek: got %v, want %v", err, tcpip.ErrClosedForReceive) } } @@ -2479,7 +2479,7 @@ func checkSendBufferSize(t *testing.T, ep tcpip.Endpoint, v int) { } func TestDefaultBufferSizes(t *testing.T) { - s := stack.New([]string{ipv4.ProtocolName}, []string{tcp.ProtocolName}) + s := stack.New(&tcpip.StdClock{}, []string{ipv4.ProtocolName}, []string{tcp.ProtocolName}) // Check the default values. ep, err := s.NewEndpoint(tcp.ProtocolNumber, ipv4.ProtocolNumber, &waiter.Queue{}) @@ -2525,7 +2525,7 @@ func TestDefaultBufferSizes(t *testing.T) { } func TestMinMaxBufferSizes(t *testing.T) { - s := stack.New([]string{ipv4.ProtocolName}, []string{tcp.ProtocolName}) + s := stack.New(&tcpip.StdClock{}, []string{ipv4.ProtocolName}, []string{tcp.ProtocolName}) // Check the default values. ep, err := s.NewEndpoint(tcp.ProtocolNumber, ipv4.ProtocolNumber, &waiter.Queue{}) @@ -2575,7 +2575,7 @@ func TestSelfConnect(t *testing.T) { // it checks that if an endpoint binds to say 127.0.0.1:1000 then // connects to 127.0.0.1:1000, then it will be connected to itself, and // is able to send and receive data through the same endpoint. - s := stack.New([]string{ipv4.ProtocolName}, []string{tcp.ProtocolName}) + s := stack.New(&tcpip.StdClock{}, []string{ipv4.ProtocolName}, []string{tcp.ProtocolName}) id := loopback.New() if testing.Verbose() { @@ -2637,13 +2637,13 @@ func TestSelfConnect(t *testing.T) { // Read back what was written. 
wq.EventUnregister(&waitEntry) wq.EventRegister(&waitEntry, waiter.EventIn) - rd, err := ep.Read(nil) + rd, _, err := ep.Read(nil) if err != nil { if err != tcpip.ErrWouldBlock { t.Fatalf("Read failed: %v", err) } <-notifyCh - rd, err = ep.Read(nil) + rd, _, err = ep.Read(nil) if err != nil { t.Fatalf("Read failed: %v", err) } diff --git a/pkg/tcpip/transport/tcp/tcp_timestamp_test.go b/pkg/tcpip/transport/tcp/tcp_timestamp_test.go index d12081bb7b..335262e434 100644 --- a/pkg/tcpip/transport/tcp/tcp_timestamp_test.go +++ b/pkg/tcpip/transport/tcp/tcp_timestamp_test.go @@ -95,7 +95,7 @@ func TestTimeStampEnabledConnect(t *testing.T) { // There should be 5 views to read and each of them should // contain the same data. for i := 0; i < 5; i++ { - got, err := c.EP.Read(nil) + got, _, err := c.EP.Read(nil) if err != nil { t.Fatalf("Unexpected error from Read: %v", err) } @@ -296,7 +296,7 @@ func TestSegmentDropWhenTimestampMissing(t *testing.T) { } // Issue a read and we should data. - got, err := c.EP.Read(nil) + got, _, err := c.EP.Read(nil) if err != nil { t.Fatalf("Unexpected error from Read: %v", err) } diff --git a/pkg/tcpip/transport/tcp/testing/context/context.go b/pkg/tcpip/transport/tcp/testing/context/context.go index 6a402d150e..eb928553fb 100644 --- a/pkg/tcpip/transport/tcp/testing/context/context.go +++ b/pkg/tcpip/transport/tcp/testing/context/context.go @@ -129,7 +129,7 @@ type Context struct { // New allocates and initializes a test context containing a new // stack and a link-layer endpoint. func New(t *testing.T, mtu uint32) *Context { - s := stack.New([]string{ipv4.ProtocolName, ipv6.ProtocolName}, []string{tcp.ProtocolName}) + s := stack.New(&tcpip.StdClock{}, []string{ipv4.ProtocolName, ipv6.ProtocolName}, []string{tcp.ProtocolName}) // Allow minimum send/receive buffer sizes to be 1 during tests. 
if err := s.SetTransportProtocolOption(tcp.ProtocolNumber, tcp.SendBufferSizeOption{1, tcp.DefaultBufferSize, tcp.DefaultBufferSize * 10}); err != nil { diff --git a/pkg/tcpip/transport/udp/endpoint.go b/pkg/tcpip/transport/udp/endpoint.go index 80fa88c4c5..f86fc6d5af 100644 --- a/pkg/tcpip/transport/udp/endpoint.go +++ b/pkg/tcpip/transport/udp/endpoint.go @@ -19,6 +19,8 @@ type udpPacket struct { udpPacketEntry senderAddress tcpip.FullAddress data buffer.VectorisedView `state:".(buffer.VectorisedView)"` + timestamp int64 + hasTimestamp bool // views is used as buffer for data when its length is large // enough to store a VectorisedView. views [8]buffer.View `state:"nosave"` @@ -52,6 +54,7 @@ type endpoint struct { rcvBufSizeMax int `state:".(int)"` rcvBufSize int rcvClosed bool + rcvTimestamp bool // The following fields are protected by the mu mutex. mu sync.RWMutex `state:"nosave"` @@ -134,7 +137,7 @@ func (e *endpoint) Close() { // Read reads data from the endpoint. This method does not block if // there is no data pending. -func (e *endpoint) Read(addr *tcpip.FullAddress) (buffer.View, *tcpip.Error) { +func (e *endpoint) Read(addr *tcpip.FullAddress) (buffer.View, tcpip.ControlMessages, *tcpip.Error) { e.rcvMu.Lock() if e.rcvList.Empty() { @@ -143,12 +146,13 @@ func (e *endpoint) Read(addr *tcpip.FullAddress) (buffer.View, *tcpip.Error) { err = tcpip.ErrClosedForReceive } e.rcvMu.Unlock() - return buffer.View{}, err + return buffer.View{}, tcpip.ControlMessages{}, err } p := e.rcvList.Front() e.rcvList.Remove(p) e.rcvBufSize -= p.data.Size() + ts := e.rcvTimestamp e.rcvMu.Unlock() @@ -156,7 +160,12 @@ func (e *endpoint) Read(addr *tcpip.FullAddress) (buffer.View, *tcpip.Error) { *addr = p.senderAddress } - return p.data.ToView(), nil + if ts && !p.hasTimestamp { + // Linux uses the current time. 
+ p.timestamp = e.stack.NowNanoseconds() + } + + return p.data.ToView(), tcpip.ControlMessages{HasTimestamp: ts, Timestamp: p.timestamp}, nil } // prepareForWrite prepares the endpoint for sending data. In particular, it @@ -299,8 +308,8 @@ func (e *endpoint) Write(p tcpip.Payload, opts tcpip.WriteOptions) (uintptr, *tc } // Peek only returns data from a single datagram, so do nothing here. -func (e *endpoint) Peek([][]byte) (uintptr, *tcpip.Error) { - return 0, nil +func (e *endpoint) Peek([][]byte) (uintptr, tcpip.ControlMessages, *tcpip.Error) { + return 0, tcpip.ControlMessages{}, nil } // SetSockOpt sets a socket option. Currently not supported. @@ -322,6 +331,11 @@ func (e *endpoint) SetSockOpt(opt interface{}) *tcpip.Error { } e.v6only = v != 0 + + case tcpip.TimestampOption: + e.rcvMu.Lock() + e.rcvTimestamp = v != 0 + e.rcvMu.Unlock() } return nil } @@ -370,6 +384,14 @@ func (e *endpoint) GetSockOpt(opt interface{}) *tcpip.Error { } e.rcvMu.Unlock() return nil + + case *tcpip.TimestampOption: + e.rcvMu.Lock() + *o = 0 + if e.rcvTimestamp { + *o = 1 + } + e.rcvMu.Unlock() } return tcpip.ErrUnknownProtocolOption @@ -733,6 +755,11 @@ func (e *endpoint) HandlePacket(r *stack.Route, id stack.TransportEndpointID, vv e.rcvList.PushBack(pkt) e.rcvBufSize += vv.Size() + if e.rcvTimestamp { + pkt.timestamp = e.stack.NowNanoseconds() + pkt.hasTimestamp = true + } + e.rcvMu.Unlock() // Notify any waiters that there's data to be read now. diff --git a/pkg/tcpip/transport/udp/endpoint_state.go b/pkg/tcpip/transport/udp/endpoint_state.go index 41b98424a8..e20d59ca30 100644 --- a/pkg/tcpip/transport/udp/endpoint_state.go +++ b/pkg/tcpip/transport/udp/endpoint_state.go @@ -13,7 +13,7 @@ import ( // saveData saves udpPacket.data field. 
func (u *udpPacket) saveData() buffer.VectorisedView { - // We canoot save u.data directly as u.data.views may alias to u.views, + // We cannot save u.data directly as u.data.views may alias to u.views, // which is not allowed by state framework (in-struct pointer). return u.data.Clone(nil) } diff --git a/pkg/tcpip/transport/udp/udp_test.go b/pkg/tcpip/transport/udp/udp_test.go index 65c5679529..1eb9ecb800 100644 --- a/pkg/tcpip/transport/udp/udp_test.go +++ b/pkg/tcpip/transport/udp/udp_test.go @@ -56,7 +56,7 @@ type headers struct { } func newDualTestContext(t *testing.T, mtu uint32) *testContext { - s := stack.New([]string{ipv4.ProtocolName, ipv6.ProtocolName}, []string{udp.ProtocolName}) + s := stack.New(&tcpip.StdClock{}, []string{ipv4.ProtocolName, ipv6.ProtocolName}, []string{udp.ProtocolName}) id, linkEP := channel.New(256, mtu, "") if testing.Verbose() { @@ -260,12 +260,12 @@ func testV4Read(c *testContext) { defer c.wq.EventUnregister(&we) var addr tcpip.FullAddress - v, err := c.ep.Read(&addr) + v, _, err := c.ep.Read(&addr) if err == tcpip.ErrWouldBlock { // Wait for data to become available. select { case <-ch: - v, err = c.ep.Read(&addr) + v, _, err = c.ep.Read(&addr) if err != nil { c.t.Fatalf("Read failed: %v", err) } @@ -355,12 +355,12 @@ func TestV6ReadOnV6(t *testing.T) { defer c.wq.EventUnregister(&we) var addr tcpip.FullAddress - v, err := c.ep.Read(&addr) + v, _, err := c.ep.Read(&addr) if err == tcpip.ErrWouldBlock { // Wait for data to become available. 
select { case <-ch: - v, err = c.ep.Read(&addr) + v, _, err = c.ep.Read(&addr) if err != nil { c.t.Fatalf("Read failed: %v", err) } diff --git a/runsc/boot/BUILD b/runsc/boot/BUILD index 88736cfa49..16522c668f 100644 --- a/runsc/boot/BUILD +++ b/runsc/boot/BUILD @@ -64,6 +64,7 @@ go_library( "//pkg/tcpip/network/ipv4", "//pkg/tcpip/network/ipv6", "//pkg/tcpip/stack", + "//pkg/tcpip/transport/ping", "//pkg/tcpip/transport/tcp", "//pkg/tcpip/transport/udp", "//pkg/urpc", diff --git a/runsc/boot/loader.go b/runsc/boot/loader.go index a470cb054b..af577f5714 100644 --- a/runsc/boot/loader.go +++ b/runsc/boot/loader.go @@ -37,11 +37,13 @@ import ( slinux "gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux" "gvisor.googlesource.com/gvisor/pkg/sentry/time" "gvisor.googlesource.com/gvisor/pkg/sentry/watchdog" + "gvisor.googlesource.com/gvisor/pkg/tcpip" "gvisor.googlesource.com/gvisor/pkg/tcpip/link/sniffer" "gvisor.googlesource.com/gvisor/pkg/tcpip/network/arp" "gvisor.googlesource.com/gvisor/pkg/tcpip/network/ipv4" "gvisor.googlesource.com/gvisor/pkg/tcpip/network/ipv6" "gvisor.googlesource.com/gvisor/pkg/tcpip/stack" + "gvisor.googlesource.com/gvisor/pkg/tcpip/transport/ping" "gvisor.googlesource.com/gvisor/pkg/tcpip/transport/tcp" "gvisor.googlesource.com/gvisor/pkg/tcpip/transport/udp" "gvisor.googlesource.com/gvisor/runsc/boot/filter" @@ -177,7 +179,7 @@ func New(spec *specs.Spec, conf *Config, controllerFD int, ioFDs []int, console // this point. Netns is configured before Run() is called. Netstack is // configured using a control uRPC message. Host network is configured inside // Run(). - networkStack := newEmptyNetworkStack(conf) + networkStack := newEmptyNetworkStack(conf, k) // Initiate the Kernel object, which is required by the Context passed // to createVFS in order to mount (among other things) procfs. 
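Review note: `stack.New` now takes a clock as its first argument (tests pass `&tcpip.StdClock{}`, the loader passes the kernel `k`), and the UDP endpoint uses `e.stack.NowNanoseconds()` to stamp packets. The sketch below models only that timestamp bookkeeping with an injected clock. It assumes the clock interface exposes a single `NowNanoseconds` method, which is inferred from the call sites in this diff rather than confirmed. The names `clock`, `stdClock`, `fixedClock`, `datagram`, and `receiver` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// clock is a stand-in for the clock passed to stack.New. A single
// NowNanoseconds method is assumed here.
type clock interface {
	NowNanoseconds() int64
}

// stdClock reads the wall clock.
type stdClock struct{}

func (stdClock) NowNanoseconds() int64 { return time.Now().UnixNano() }

// fixedClock returns a settable time, which keeps tests deterministic.
type fixedClock struct{ now int64 }

func (c *fixedClock) NowNanoseconds() int64 { return c.now }

// Compile-time checks that both types satisfy the interface.
var (
	_ clock = stdClock{}
	_ clock = (*fixedClock)(nil)
)

type datagram struct {
	payload      string
	timestamp    int64
	hasTimestamp bool
}

// receiver models only the timestamp handling of the UDP endpoint.
type receiver struct {
	clk          clock
	rcvTimestamp bool
	queue        []datagram
}

// deliver queues a datagram, stamping it on arrival if the option is on.
func (r *receiver) deliver(payload string) {
	d := datagram{payload: payload}
	if r.rcvTimestamp {
		d.timestamp = r.clk.NowNanoseconds()
		d.hasTimestamp = true
	}
	r.queue = append(r.queue, d)
}

// read pops the oldest datagram. If the option is on but the datagram
// arrived before it was enabled, the current time is used instead.
func (r *receiver) read() (payload string, ts int64, hasTS bool, ok bool) {
	if len(r.queue) == 0 {
		return "", 0, false, false
	}
	d := r.queue[0]
	r.queue = r.queue[1:]
	if r.rcvTimestamp && !d.hasTimestamp {
		d.timestamp = r.clk.NowNanoseconds()
	}
	return d.payload, d.timestamp, r.rcvTimestamp, true
}

func main() {
	clk := &fixedClock{now: 100}
	r := &receiver{clk: clk}

	r.deliver("early") // option off: no arrival stamp
	r.rcvTimestamp = true
	clk.now = 200
	r.deliver("late") // stamped on arrival

	clk.now = 300
	fmt.Println(r.read()) // "early" falls back to the read time
	fmt.Println(r.read()) // "late" keeps its arrival time
}
```

Passing the clock in, rather than calling `time.Now` directly, lets a test substitute something like `fixedClock` and lets the sandbox supply its own notion of time.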
@@ -337,7 +339,7 @@ func (l *Loader) WaitExit() kernel.ExitStatus { return l.k.GlobalInit().ExitStatus() } -func newEmptyNetworkStack(conf *Config) inet.Stack { +func newEmptyNetworkStack(conf *Config, clock tcpip.Clock) inet.Stack { switch conf.Network { case NetworkHost: return hostinet.NewStack() @@ -345,8 +347,8 @@ func newEmptyNetworkStack(conf *Config) inet.Stack { case NetworkNone, NetworkSandbox: // NetworkNone sets up loopback using netstack. netProtos := []string{ipv4.ProtocolName, ipv6.ProtocolName, arp.ProtocolName} - protoNames := []string{tcp.ProtocolName, udp.ProtocolName} - return &epsocket.Stack{stack.New(netProtos, protoNames)} + protoNames := []string{tcp.ProtocolName, udp.ProtocolName, ping.ProtocolName4} + return &epsocket.Stack{stack.New(clock, netProtos, protoNames)} default: panic(fmt.Sprintf("invalid network configuration: %v", conf.Network)) diff --git a/runsc/cmd/exec.go b/runsc/cmd/exec.go index 8379f552d9..576031b5b5 100644 --- a/runsc/cmd/exec.go +++ b/runsc/cmd/exec.go @@ -99,7 +99,6 @@ func (ex *Exec) Execute(_ context.Context, f *flag.FlagSet, args ...interface{}) if err != nil { Fatalf("error parsing process spec: %v", err) } - e.Detach = ex.detach conf := args[0].(*boot.Config) waitStatus := args[1].(*syscall.WaitStatus) @@ -123,7 +122,7 @@ func (ex *Exec) Execute(_ context.Context, f *flag.FlagSet, args ...interface{}) // executed. If detach was specified, starts a child in non-detach mode, // write the child's PID to the pid file. So when the container returns, the // child process will also return and signal containerd. 
- if e.Detach { + if ex.detach { binPath, err := specutils.BinPath() if err != nil { Fatalf("error getting bin path: %v", err) diff --git a/runsc/sandbox/sandbox.go b/runsc/sandbox/sandbox.go index b2fa1d58ef..64810b4ea0 100644 --- a/runsc/sandbox/sandbox.go +++ b/runsc/sandbox/sandbox.go @@ -535,8 +535,6 @@ func (s *Sandbox) createSandboxProcess(conf *boot.Config, binPath string, common nss = append(nss, userns) setUIDGIDMappings(cmd, s.Spec) } else { - // TODO: Retrict capabilities since it's using current user - // namespace, i.e. root. log.Infof("Sandbox will be started in the current user namespace") } // When running in the caller's defined user namespace, apply the same diff --git a/runsc/sandbox/sandbox_test.go b/runsc/sandbox/sandbox_test.go index 6c71cac303..6e3125b7b8 100644 --- a/runsc/sandbox/sandbox_test.go +++ b/runsc/sandbox/sandbox_test.go @@ -365,7 +365,6 @@ func TestExec(t *testing.T) { Envv: []string{"PATH=" + os.Getenv("PATH")}, WorkingDirectory: "/", KUID: uid, - Detach: false, } // Verify that "sleep 100" and "sleep 5" are running after exec. @@ -472,7 +471,6 @@ func TestCapabilities(t *testing.T) { KUID: uid, KGID: gid, Capabilities: &auth.TaskCapabilities{}, - Detach: true, } // "exe" should fail because we don't have the necessary permissions. @@ -484,14 +482,10 @@ func TestCapabilities(t *testing.T) { execArgs.Capabilities = &auth.TaskCapabilities{ EffectiveCaps: auth.CapabilitySetOf(linux.CAP_DAC_OVERRIDE), } - // First, start running exec. + // "exe" should not fail this time. if _, err := s.Execute(&execArgs); err != nil { t.Fatalf("sandbox failed to exec %v: %v", execArgs, err) } - - if err := waitForProcessList(s, expectedPL); err != nil { - t.Error(err) - } } // Test that an tty FD is sent over the console socket if one is provided.