DESIGN.md


Stenographer/Stenotype Design

Introduction

This document is meant to give an overview of the design of stenographer and stenotype at a medium/high level. For low-level stuff, look at the code :). The architecture described in this document has changed relatively little over the course of the project, and we doubt it will change much in the future.

High-Level Design

Stenographer consists of a stenographer server, which serves user requests and manages disk, and which runs a stenotype child process. stenotype sniffs packet data and writes it to disk, communicating with stenographer simply by un-hiding files when they're ready for consumption. The user scripts stenocurl and stenoread provide simple wrappers around curl, which allow analysts to request packet data from the stenographer server simply and easily.

Detailed Design

Stenographer is actually composed of a few separate processes.

Stenographer

Stenographer is a long-running server, the binary that you start up if you want to "run stenographer" on your system. It manages the stenotype binary as a child process, watches disk usage and cleans up old files, and serves data to analysts based on their queries.

Running Stenotype

First off, stenographer is in charge of making sure that stenotype (discussed momentarily) starts and keeps running. It starts stenotype as a subprocess, watching for failures and restarting as necessary. It also watches stenotype's output (the files it creates) and may kill/restart stenotype itself if it feels it is misbehaving or not generating files fast enough.

Managing Disk(s)

Stenographer watches the disks that stenotype uses and tries to keep them tidy and usable. This includes deleting old files when disk space decreases below a threshold, and deleting old temporary files that stenotype creates, if stenotype crashes before it can clean up after itself.

Stenographer handles disk management in two ways. First, whenever it starts up a new stenotype instance, it checks that files from an old, possibly crashed instance are no longer around and causing issues. Second, it periodically (currently every 15 seconds) checks disk state for out-of-disk issues. During that periodic check, it also looks for new files stenotype may have generated that it can use to serve analyst requests (described momentarily).
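The threshold-based cleanup can be sketched as follows. This is a minimal Python illustration, not the actual Go implementation; the function name, the free_bytes_fn hook, and the simplified policy are assumptions for the example. It relies on the fact (described below under "Packet File Format") that files are named by microsecond creation timestamp, so numeric order on the names is chronological order.

```python
import os

def cleanup_oldest(pkt_dir, idx_dir, min_free_bytes, free_bytes_fn=None):
    """Delete oldest packet/index file pairs until free space recovers.

    Files are named by microsecond creation timestamp, so numeric order
    on the names is chronological order. Hidden (in-progress) files are
    left alone here; stale temporaries are handled at stenotype startup.
    """
    if free_bytes_fn is None:
        def free_bytes_fn():
            st = os.statvfs(pkt_dir)
            return st.f_bavail * st.f_frsize
    names = sorted((n for n in os.listdir(pkt_dir) if not n.startswith('.')),
                   key=int)  # oldest first
    deleted = []
    for name in names:
        if free_bytes_fn() >= min_free_bytes:
            break
        os.remove(os.path.join(pkt_dir, name))
        idx = os.path.join(idx_dir, name)
        if os.path.exists(idx):
            os.remove(idx)  # drop the matching index file too
        deleted.append(name)
    return deleted
```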

Serving Data

Stenographer is also in charge of serving any analyst requests for packet data. It watches the data generated by stenotype, and when analysts request packets it looks up their requests in the generated data and returns them.

Stenographer provides data to analysts over TLS. Queries are POST'd to the /query HTTP handler, and responses are streamed back as PCAP files (MIME type application/octet-stream).

Currently, stenographer only binds to localhost, so it doesn't accept remote user requests.

Access Control

Access to the server is controlled with client certificates. On install, a script, stenokeys.sh, is run to generate a CA certificate and use it to create/sign a client and server certificate. The client and server authenticate each other on every request using the CA certificate as a source of truth. POSIX permissions are used locally to control access to the certs... the stenographer user which runs steno has read access to the server key (steno:root -r--------). The stenographer group has read access to the client key (root:steno ----r-----). Key usage extensions specify that the server key must be used as a TLS server, and the client key must be used as a TLS client.

Due to the file permissions mentioned above, giving steno access to a local user simply requires adding that user to the local stenographer group, thus giving them access to client_key.pem.

Once keys are created on install, they're currently NEVER REVOKED. Thus, if someone gets access to a client cert, they'll have access to the server ad infinitum. Should you have problems with a key being released, the current best way to handle this is by deleting all data in the /etc/stenographer/certs directory and rerunning stenokeys.sh to generate an entirely new set of keys rooted to a new CA.

stenokeys.sh will not modify keys/certs that already exist in /etc/stenographer/certs. Thus, if you have more complex topologies, you can overwrite these values and they'll happily be used by Stenographer. If, for example, you already have a CA in your organization, you can copy its cert into the ca_cert.pem file, then create {client,server}_{key,cert}.pem files rooted in that CA and copy them in. This also allows folks to use a single CA cert over multiple stenographer instances, allowing a single client cert to access multiple servers over the network.

Stenotype

Stenotype's sole purpose is to read packet data off the wire, index it, and write it to disk. It uses a multi-threaded architecture, while trying to limit context switching by having most processing on a single core stay within a single thread.

Packet Sniffing/Writing

Stenotype tries to be as performant as possible by allowing the kernel to do the vast majority of the work. It uses AF_PACKET, which asks the kernel to place packets into blocks in a shared memory region, then notify stenotype when blocks are available. After indexing the packets in each block, it passes the block directly back to the kernel as an O_DIRECT asynchronous write operation.

Besides indexing, then, stenotype's main job is to wait for the kernel to put packets in a memory region, then immediately ask the kernel to take that region back and write it. An important benefit of this design is that packets are never copied out of the kernel's shared memory space. The kernel writes them from the NIC to shared memory, then the kernel uses that same shared memory for O_DIRECT writes to disk. The packets transit the bus twice and are never copied from RAM to RAM.

Packet File Format

As detailed above, the "file format" used by stenotype is actually to directly dump data as it's presented by AF_PACKET. Thus, data is written as blocks, with each block containing a small header followed by a linked list of packets. Blocks are large (1MB), and are dumped regularly (every 10s), so there's a good chance that for slow networks we use far more disk than we need. However, as network speed increases past 1MB/minute/thread, this format becomes quite efficient. There will always be some overhead.

Stenotype guarantees that a packet file will not exceed 4GB, by rotating files if they reach that size. It also rotates files older than 1 minute. Files are named for the microsecond timestamp they were created at. While a file is being written, it will be hidden (.1422693160230282). When rotating, the file will be renamed to no longer be hidden (.1422693160230282 -> 1422693160230282). This rename only occurs after all data has been successfully flushed to disk, so external processes which see this rename happen (like stenographer) can immediately start to use the newly renamed file.
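The hidden-file/rename handshake can be sketched in a few lines. This is an illustrative Python version of the convention (the helper name and the fsync-then-rename details are assumptions for the example; stenotype itself does this in C++ with O_DIRECT async writes):

```python
import os
import time

def write_file_atomically(directory, data):
    """Write data under a hidden microsecond-timestamp name, flush it,
    then rename it to the visible name. Readers that see the visible
    name can assume the contents are complete and on disk."""
    ts = str(int(time.time() * 1_000_000))   # microsecond timestamp name
    hidden = os.path.join(directory, '.' + ts)
    visible = os.path.join(directory, ts)
    with open(hidden, 'wb') as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())                 # data durable before publishing
    os.rename(hidden, visible)               # atomic "publish" step
    return visible
```

Because stenographer only ever watches for non-hidden names, it never reads a half-written file.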

Packet Load Balancing

Stenotype takes advantage of AF_PACKET's excellent load-balancing options to split up the work of processing packets across many CPUs. It uses AF_PACKET's PACKET_FANOUT to create a separate memory region for each of N different threads, then requests that the kernel split up incoming packets across these regions. One stenotype packet reading/writing thread is created for each of these regions. Within that single thread, block processing (reading in a block, indexing it, starting an async write, reading the next block, etc...) happens serially.
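As a rough illustration of the fanout setup: the kernel's PACKET_FANOUT option takes a single 32-bit word that packs the fanout group id and the fanout mode together. A hedged Python sketch (constant values are from <linux/if_packet.h>; actually opening AF_PACKET sockets requires Linux and CAP_NET_RAW, so only the option-word packing is shown as runnable):

```python
import struct

# Constants from <linux/if_packet.h>; Python's socket module doesn't
# export all of these, so they're spelled out here.
SOL_PACKET = 263
PACKET_FANOUT = 18
PACKET_FANOUT_HASH = 0   # balance by flow hash (keeps flows thread-sticky)

def fanout_arg(group_id, mode=PACKET_FANOUT_HASH):
    """Pack the 32-bit PACKET_FANOUT option word: group id in the low
    16 bits, fanout mode in the high 16 bits."""
    return (mode << 16) | (group_id & 0xFFFF)

def join_fanout_group(sock, group_id):
    # Each of the N threads opens its own AF_PACKET socket and joins the
    # same fanout group; the kernel then spreads packets across them.
    sock.setsockopt(SOL_PACKET, PACKET_FANOUT,
                    struct.pack('I', fanout_arg(group_id)))
```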

Indexing

After getting a block of packets from the kernel but before passing them back to be written out, stenotype reads through each packet and creates a small number of indexes in memory. These indexes are very simple, mapping a packet attribute to a file seek offset. Attributes we use include ports (src and dst), protocols (udp/tcp/etc) and IPs (v4 and v6). Indexes are dumped to disk when file rotation happens, with a corresponding index file created for each packet file, of the same name but in a different directory. Given the example above, when the .1422693160230282 -> 1422693160230282 file rotation happens, an index also named .1422693160230282 will be created and written, then renamed to 1422693160230282 when the index has been fully flushed to disk. Once both the packets directory and index directory have a 1422693160230282 file, stenographer can read both in and use the index to look up packets.
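The in-memory index can be pictured as a map from (attribute type, value) to the list of seek offsets at which matching packets start. A simplified Python sketch, using pre-parsed packet tuples in place of real packet parsing (the tuple layout and function name are assumptions for illustration):

```python
from collections import defaultdict

# Attribute type tags used by the index (see "Index File Format" below).
TYPE_PROTOCOL, TYPE_PORT, TYPE_IPV4, TYPE_IPV6 = 1, 2, 4, 6

def index_packets(packets):
    """Build the in-memory index: (type, value) -> list of file offsets.

    Each `packet` here is a pre-parsed (offset, protocol, src_port,
    dst_port, src_ip, dst_ip) tuple, standing in for real packet
    parsing out of an AF_PACKET block."""
    index = defaultdict(list)
    for offset, proto, sport, dport, src, dst in packets:
        index[(TYPE_PROTOCOL, proto)].append(offset)
        index[(TYPE_PORT, sport)].append(offset)
        index[(TYPE_PORT, dport)].append(offset)
        index[(TYPE_IPV4, src)].append(offset)
        index[(TYPE_IPV4, dst)].append(offset)
    return index
```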

Index File Format

Indexes are leveldb SSTables, a simple, compressed file format that stores key-value pairs sorted by key and provides simple, efficient mechanisms to query individual keys or key ranges. Among other things, leveldb tables give us great compression capabilities, keeping our indexes small while still providing fast reads.

We store each attribute (port number, protocol number, IP, etc) and its associated packet positions in the blockfile using the format:

Key:   [type (1 byte)][value (? bytes)]
Value: [position 0 (4 bytes)][position 1 (4 bytes)] ...

The type specifies the type of attribute being indexed (1 == protocol, 2 == port, 4 == IPv4, 6 == IPv6). The value is 1 byte for protocol, 2 for ports, and 4 and 16 respectively for IPv4 and IPv6 addresses. Each position is a seek offset into a packet file (which is guaranteed to not exceed 4GB) and is always exactly 4 bytes long. All values (ports, protocols, positions) are big-endian. Looking up packets involves reading the key for a specific attribute to get all positions for that value, then seeking into the packet files to find the packets in question and returning them. For example, to find all packets with port 80, you'd read in the positions for key:

[\x02 (type=port) \x00\x50 (value=80)]
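The key/value encoding can be illustrated directly (a hedged Python sketch; the helper names are invented for the example, but the byte layout follows the format described above):

```python
import struct

def port_key(port):
    # [type=0x02][2-byte big-endian port number]
    return struct.pack('>BH', 2, port)

def protocol_key(proto):
    # [type=0x01][1-byte IP protocol number]
    return struct.pack('>BB', 1, proto)

def positions_value(offsets):
    # Each position is a 4-byte big-endian seek offset into the
    # (guaranteed <4GB) packet file.
    return b''.join(struct.pack('>I', off) for off in offsets)
```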

Index Writing

The main stenotype packet sniffing thread tries to very quickly read in packet blocks, index them, then pass them back to the kernel. It does all disk operations asynchronously, in order to keep its CPU busy with indexing, by far the most time-intensive part of the whole operation. It would be extremely detrimental to performance to have this thread block on each file rotation to convert in-memory indexes to on-disk indexes and write out index files. Because of this, index writing is relegated to a separate thread. For each reading/writing thread, an index-writing thread is created, with a thread-safe producer-consumer queue linking them up. When the reader/writer wants to rotate a file, it simply passes a pointer to its in-memory index over the queue, then creates a new empty index and starts populating it with packet data for its new file.

The index-writing thread sits in an endless loop, watching the queue for new indexes. When it gets a new index, it creates a leveldb table, iterates through the index to populate that table, and flushes that table to disk. Since index writing takes (in our experience) far less time/energy than packet writing, the index-writing thread does all of its operations serially, blocking while the index is flushed to disk, then moving that index into its usable (non-hidden) location.
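The handoff can be sketched with a standard producer-consumer queue (illustrative Python; the real implementation is C++ threads, and the function names and the None-sentinel shutdown are assumptions for the example):

```python
import queue
import threading

def start_index_writer(write_index_fn):
    """Returns (queue, thread). The packet thread puts (name, index)
    pairs on the queue and immediately moves on to its next block; this
    writer thread does the (slower) on-disk table building serially."""
    q = queue.Queue()
    def loop():
        while True:
            item = q.get()
            if item is None:          # sentinel: shut down cleanly
                break
            name, index = item
            write_index_fn(name, index)   # build + flush the on-disk table
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return q, t
```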

Stenoread/Stenocurl

As detailed above in Stenographer's "Access Control" section, we require TLS handshakes in order to verify that clients are indeed allowed access to packet data. To aid in this, the simple shell script stenocurl wraps the curl utility, adding the various flags necessary to use the correct client certificate and verify against the correct server certificate. stenoread is a simple addition to stenocurl, which takes in a query string, passes the query to stenocurl as a POST request, then passes the resulting PCAP file through tcpdump in order to allow for additional filtering, writing to disk, printing in a human-readable format, etc.

How Queries Work

An analyst that wants to query stenographer calls the stenoread script, passing in a query string (see README.md for the query language format). This string is then POST'd (via stenocurl, using TLS certs/keys) to stenographer. Stenographer parses the query into a Query object, which allows it to decide:

  • which index files it should read
  • which keys it should read from each index file
  • how it should combine packet file positions it gets from each key

To illustrate, for the query string

(port 1 or ip proto 2) and after 3h ago

Stenographer would translate:

  • after 3h ago -> only read index files with microsecond names greater than (now() - 3h)
  • within these files, compute the union (because of the or) of position sets from
    • key \x02\x00\x01 (port == 1)
    • key \x01\x02 (protocol == 2)
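The combination step can be sketched as a small recursive evaluator over position sets (an illustrative Python sketch, not stenographer's actual Go code; the query-tree representation here is invented for the example):

```python
def eval_query(index, node):
    """Evaluate a parsed query tree against one index file's contents.

    `index` maps key bytes -> set of packet-file positions; `node` is
    ('key', key_bytes), ('or', a, b) or ('and', a, b). `or` takes the
    union of position sets, `and` the intersection."""
    op = node[0]
    if op == 'key':
        return index.get(node[1], set())
    left, right = eval_query(index, node[1]), eval_query(index, node[2])
    return left | right if op == 'or' else left & right
```

For the example query, "(port 1 or ip proto 2)" becomes the union of the position sets stored under keys \x02\x00\x01 and \x01\x02; the "after 3h ago" clause is handled earlier by filtering which index files are read at all.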

Once it has computed a set of packet positions for each index file, it then seeks in the corresponding packet files, reads the packets out, and merges them into a single PCAP file which it serves back to the analyst.

This PCAP file comes back via stenocurl as a stream to STDOUT, where stenoread passes it through tcpdump. With no additional options, tcpdump just prints the packet data out in a nice format. With various options, tcpdump could do further filtering (by TCP flags, etc), write its input to disk (-w out.pcap), or do all the other things tcpdump is so good at.

gRPC

Stenographer has gRPC support that enables secure, remote interactions with the program. Given the sensitive nature of packet data and the requirements of many users to manage a fleet of servers running Stenographer, the gRPC channel only supports encryption with client authentication and expects the administrator to use certificates that are managed separately from those generated by stenokeys.sh (for easily generating certificates, take a look at Square's certstrap utility). The protobuf that defines Stenographer's gRPC service can be found in protobuf/steno.proto.

gRPC support is optional and can be enabled by adding an Rpc dictionary of settings to steno.conf. An example configuration is shown below:

, "Rpc": { "CaCert": "/path/to/rpc/ca/cert"
         , "ServerKey": "/path/to/rpc/key"
         , "ServerCert": "/path/to/rpc/cert"
         , "ServerPort": 8443
         , "ServerPcapPath": "/path/to/rpc/pcap/directory"
         , "ServerPcapMaxSize": 1000000000
         , "ClientPcapChunkSize": 1000
         , "ClientPcapMaxSize": 5000000
  }

RetrievePcap

This call allows clients to remotely retrieve PCAP via stenoread. To retrieve PCAP, clients send the service a unique identifier, the size of PCAP file chunks to stream in return, the maximum size of the PCAP file to return, and the stenoread query used to select packet data. In response, clients receive streams of messages containing the unique identifier and PCAP file chunks (which need to be reassembled client-side). Below is a minimalist example (shown in Python) of how a client can request PCAP and save it to local disk:

import os
import uuid

import grpc

import steno_pb2
import steno_pb2_grpc

with grpc.secure_channel(server, creds) as channel:
    stub = steno_pb2_grpc.StenographerStub(channel)
    pb = steno_pb2.PcapRequest()
    pb.uid = str(uuid.uuid4())
    pb.chunk_size = 1000
    pb.max_size = 500000
    pb.query = 'after 5m ago and tcp'
    pcap_file = os.path.join('.', '{}.pcap'.format(pb.uid))
    with open(pcap_file, 'wb') as fout:
        for response in stub.RetrievePcap(pb):
            fout.write(response.pcap)

RetrievePcap requires the gRPC server to be configured with the following fields (in addition to any fields required for the server to start up):

  • ServerPcapPath: local path to the directory where stenoread PCAP is temporarily stored
  • ServerPcapMaxSize: upper limit on how much PCAP a client is allowed to receive (used to restrict clients from receiving excessively large PCAPs)
  • ClientPcapChunkSize: size of the PCAP chunks to stream to the client (used if the client has not specified a size in the request)
  • ClientPcapMaxSize: upper limit on how much PCAP a client will receive (used if the client has not specified a size in the request)

Defense In Depth

Stenotype

We're pretty scared of stenotype, because:

  1. We're processing untrusted data: packets
  2. We've got very strong permissions: the ability to read packets
  3. It's written in a memory-unsafe language: C++
  4. We're not perfect.

Because of this, we've tried to use security best practices to minimize the risk of running these binaries with the following methods:

  • Running as an unprivileged user stenographer
    • We setcap the stenotype binary to just have the ability to read raw packets.
    • If you DON'T want to use setcap, we also offer the ability to drop privileges with setuid/setgid after starting stenotype... you can start it as root, then drop privs to an unprivileged user (that user must still be able to open/write files in the index/packet directories).
  • seccomp sandboxing: stenotype sandboxes itself after opening up sockets for packet reading. This sandbox isn't particularly granular, but it should stop us from doing anything too crazy if the stenotype binary is compromised.
  • Fuzzing: We've extracted the most concerning bit of code (the indexing code that processes packet data) and fuzzed it as best we can, using the excellent AFL fuzzer. If you'd like to run your own fuzzing, install AFL, then run make fuzz in the stenotype/ subdirectory, and watch your CPUs become forced-air heaters.
  • We're considering AppArmor, and may add some configs to use it for locking down stenotype as well.

Stenographer

We're slightly less concerned about stenographer, since it doesn't actually process packet information. It also has a smaller attack surface, especially when bound to localhost. Our major attack vector in stenographer is queries coming in over TLS. However, TLS certificate handling is all done with the Go standard library (which we trust pretty well ;), so our code only ever touches queries that come from a user in the stenographer group. Since we run it as user stenographer, if someone in the stenographer group does achieve a shell, they'll be able to... read packets. The big concern here is that they'll be able to read more packets than allowed by default (let's say that we've passed in a BPF filter to stenotype, for example). Our primary defenses, then, are:

  • Running as an unprivileged user stenographer
  • Using Go's standard library TLS to reject requests not coming from relatively trusted users
  • Using Go, which is much more memory-safe (runtime array bounds checks, etc)
  • We're considering AppArmor here, too, and will update this doc if we come up with good configs.

Design Limitations

Some of Stenographer's design decisions make it perform poorly in certain environments or give it strange performance characteristics. This section aims to point these out in advance, so folks have a better understanding of some of the idiosyncrasies they may see when deploying Stenographer.

Slow Links, Large Files

Stenographer is optimized for fast links, and some of those optimizations give it strange behavior on slow links. The first of these is file size. You may notice that on a network link that's REALLY slow, you'll still see 6MB files created every minute. This is because currently, Stenographer will:

  • Store packets in 1MB blocks
  • Flush one block every 10 seconds

Of course, if your link generates over 1MB every 10 seconds, this doesn't matter to you at all. If it does, though, you're going to waste disk space. We're considering flushing one block a minute or every thirty seconds.
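The 6MB figure follows directly from the block size and flush interval (a quick arithmetic sketch; the function name is invented for illustration):

```python
def min_bytes_per_minute(block_size=1 << 20, flush_interval_s=10, threads=1):
    """Lower bound on bytes written per minute, even on an idle link:
    one block per flush interval, per packet thread."""
    return (60 // flush_interval_s) * block_size * threads

# One thread flushing a 1MB block every 10 seconds writes at least
# 6 blocks/minute, i.e. the 6MB-per-minute files seen on slow links.
```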

Packets Don't Show Up Immediately

With stenotype writing files and stenographer reading them, a packet won't show up in a request's response until it's on disk, its index is on disk, and stenographer has noticed both of these things occurring. This means that packets are generally 1-2 minutes behind real-time, since:

  • Packets are stored by the kernel for up to 10 seconds before being written to disk
  • Packet files flush every minute
  • Index files are created/flushed when packet files are written
  • stenographer looks for new files on disk every 15 seconds

Altogether, this means that there's a maximum 100-120 second delay between stenotype seeing a packet and stenographer being able to serve that packet based on analyst requests.

Note that for fast links, this time is reduced slightly, since:

  • Stenotype flushes a block whenever it gets 1MB of packets, reducing the initial 10-second wait for the kernel.
  • stenotype flushes at 1 minute OR at 4GB, whichever comes first, so if you get over 4GB/min, you'll flush files/indexes faster than once a minute.