GopherCap

GopherCap is an open source project maintained by Stamus Networks for accurate, modular and scalable PCAP manipulation. The official repository can be found at https://github.com/StamusNetworks/gophercap

GopherCap is a lightweight tool for working with PCAP files. First and foremost, it can map and replay large asynchronous datasets that were written with concurrent writers. It was developed by Stamus Networks to solve one such complex replay problem where regular tcpreplay features were not sufficient.

GopherCap does not aim to be a generic traffic replay tool like tcpreplay. The latter is much more mature and widely adopted, and therefore also a better option in simple replay scenarios where the user simply needs to generate live traffic from a single sequential PCAP file.

By comparison, GopherCap was written from a research and data engineering perspective, relying on the strong concurrency primitives and pleasant IO interaction of the Go programming language to handle large PCAP sets that span hundreds of files and several terabytes of disk usage. Furthermore, it is written with modern CLI frameworks that encapsulate functionality into smaller subcommands and allow the user to configure it by passing a YAML file, using command line arguments, etc.

The following sections explain the currently implemented subcommands in more detail and provide some simple usage examples.

exampleConfig

As the name implies, this subcommand generates a YAML configuration file skeleton. The configuration file path must be defined via the global --config flag.

gopherCap --config /tmp/gopher.yml exampleConfig

The user should observe the following log message if all goes well.

INFO[0000] Writing config to /tmp/gopher.yml

The user can then inspect the default configuration parameters and modify them as needed. GopherCap uses a single configuration dictionary for all subcommands, with parameters organized by their respective command. Parameters used by multiple (but not necessarily all) subcommands are located in the global section.

cat /tmp/gopher.yml
global:
  dump:
    json: /tmp/dump.json
  file:
    regexp: ""
map:
  dir:
    src: ""
  file:
    suffix: pcap
    workers: 4
replay:
  disable_wait: false
  loop:
    count: 1
    infinite: false
  out:
    bpf: ""
    interface: eth0
  time:
    from: ""
    modifier: "1"
    scale:
      duration: 1h0m0s
      enabled: false
    to: ""
tarball:
  dryrun: false
  in:
    file: ""
  out:
    dir: ""
    gzip: false

Note that CLI flags override values in the configuration dictionary, meaning the user can define sensible defaults with --config and still override specific options for individual replay runs. For example, the user could use a different --file-regexp value for each replay run without reconfiguring.
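
For example, assuming the configuration file generated above, a run-specific pattern could be layered on top of it like so (the regexp value here is purely illustrative):

gopherCap --config /tmp/gopher.yml replay --file-regexp 'maldocs.+hancitor'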

The user is not limited to YAML configuration files. Other structured formats like TOML and JSON are also supported; the format is automatically detected from the file suffix.

gopherCap --config /tmp/gopher.json
gopherCap --config /tmp/gopher.toml

Finally, exampleConfig uses no flags other than the global --config. Not all configuration options are explained in the following sections. Each subcommand supports the --help flag to list up-to-date information on its individual flags.
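
For example, to inspect the flags of the map subcommand:

gopherCap map --help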

map

GopherCap handles large asynchronous PCAP sets by first doing a full pass over the entire dataset. This is mostly needed to map out the first and last timestamps of each PCAP file in order to deduce the global start and stop of the entire set. This information allows the replay command to decide how long individual PCAP readers should wait before reading, thus solving the async problem. Since it already needs to parse each PCAP file in full, it also collects other information during that pass: for example, maximum packet size for MTU configuration, number of bytes, number of packets, duration, etc.

Note that capinfos collects similar information and is likely a better solution for inspecting small individual PCAPs. However, GopherCap has several advantages:

  • It was much faster according to our tests, likely because it performs fewer operations, and is thus better suited for handling large files;

  • Gzip compression detection. It does a basic file magic check on each PCAP and can thus handle .pcap and .pcap.gz files with no interaction needed from the user. This saves a lot of disk space when dealing with multi-terabyte datasets, albeit at the cost of mapping speed, as GopherCap still needs to decompress each file. However, decompression happens on the fly over a raw byte stream in memory, so uncompressed data never touches the disk;

  • Output is structured JSON with the global interval and duration derived from all files in the set;

  • GopherCap recursively searches for all PCAP files under the root directory defined in map.dir.src that match map.file.suffix and global.file.regexp;

  • A worker pool, configured via map.file.workers, ensures that only N PCAP files are read at the same time, thus allowing users to balance CPU core utilization against IO limitations.

Mapping only needs to be done once, unless new files are added to the PCAP set!

Assuming the user has a folder structure under /home/snuser/malware-samples, the following command would find and map all files with the pcap suffix that match the maldoc and Hancitor naming pattern, while ensuring that at most 2 files are parsed concurrently.

gopherCap map \
  --dir-src /home/snuser/malware-samples \
  --file-suffix pcap \
  --dump-json /tmp/malware-samples.json \
  --file-regexp "maldocs.+hancitor" \
  --file-workers 2

This CLI command is equivalent to the following configuration-based invocation.

gopherCap --config map.yml map
global:
  dump.json: /tmp/malware-samples.json
  file.regexp: 'maldocs.+hancitor'
map:
  dir.src: /home/snuser/malware-samples
  file:
    suffix: pcap
    workers: 2

Then observe the metadata.

cat /tmp/malware-samples.json | jq .
{
  "beginning": "2020-07-13T21:47:09.005658Z",
  "end": "2021-01-30T16:11:29.98529Z",
  "files": [
    {
      "path": "/home/snuser/malware-samples/maldocs/hancitor/2020/July/e57d44fd470e7364a235353ded942f0f.pcap",
      "root": "/home/snuser/malware-samples",
      "err": null,
      "snaplen": 262144,
      "packets": 303,
      "size": 86230,
      "max_packet_size": 1033,
      "beginning": "2020-07-13T21:47:09.005658Z",
      "end": "2020-07-13T22:02:10.247467Z",
      "pps": 0.33620277818247557,
      "duration": 901241809000,
      "duration_human": "15m1.241809s",
      "delay": 0,
      "delay_human": "0s"
    },
    {
      "path": "/home/snuser/malware-samples/maldocs/hancitor/2021/January/e688ebdab6916fc89610c89ccb94ce16.pcap",
      "root": "/home/snuser/malware-samples",
      "err": null,
      "snaplen": 262144,
      "packets": 358,
      "size": 25376,
      "max_packet_size": 418,
      "beginning": "2021-01-30T16:06:30.926979Z",
      "end": "2021-01-30T16:11:29.98529Z",
      "pps": 1.1970909579570252,
      "duration": 299058311000,
      "duration_human": "4m59.058311s",
      "delay": 17345961921321000,
      "delay_human": "4818h19m21.921321s"
    }
  ]
}
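
Since the mapping is plain JSON, standard tooling applies to it. For instance, a quick per-file timing summary could be pulled out with jq; this invocation is illustrative and not part of GopherCap itself:

cat /tmp/malware-samples.json | jq '.files[] | {path, duration_human, delay_human}'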

Note that the delay value is very large for the second PCAP file. That's because the example PCAP set does not have the async problem at all. Rather, it's simply a collection of different malware samples, not the product of the same capture process where PCAPs rotated at different times. Note also that duration and delay are expressed in nanoseconds alongside their human-readable counterparts: 17345961921321000 nanoseconds is roughly 4818 hours, or about six and a half months. GopherCap can still replay this set without waiting 6 months for the second reader to start, but that requires a special flag. More on that in the next section.

replay

Suppose we have a multi-terabyte packet capture from a red-blue exercise that was created with the Suricata PCAP writer in multi mode. In other words, a separate PCAP file per worker, each rotating at a different time whenever the maximum file size is reached, with sessions properly balanced between workers. Assuming the user has already mapped this set as instructed in the previous section, and that the mapping is located in /home/snuser/exercise/gophercap.json, we can use the following configuration to replay all discovered PCAP files at the same rate as they were originally written. Furthermore, the configuration limits the replay to a specific file pattern, the writer will not emit any packet seen from the noisy or sensitive segment 10.0.10.0/24, and the replay will loop infinitely.

gopherCap --config gopher.yml replay
global:
  dump.json: /home/snuser/exercise/gophercap.json
  file.regexp: 'meerkat-20012\d+-\d+\.pcap'

replay:
  disable_wait: false
  loop.infinite: true
  out:
    bpf: "not net 10.0.10.0/24"
    interface: dummy0

Like map, replay is agnostic to Gzip compression. Compressed files are dynamically opened with a gzip reader, while uncompressed files are read as-is. No user interaction is needed.

Unlike tcpreplay, GopherCap does not currently support defining specific rates like --pps or --bps. Instead, it supports time scaling. In other words, the user can define --time-modifier to speed up or slow down the replay while packets still preserve the temporal properties between them. Furthermore, the combination of --time-scale-enabled and --time-scale-duration dynamically calculates an appropriate modifier to achieve the desired result. For example, consider the following configuration, which is based on the previous example:

global:
  dump.json: /home/snuser/exercise/gophercap.json
  file.regexp: 'meerkat-20012\d+-\d+\.pcap'

replay:
  disable_wait: false
  loop.infinite: true
  out:
    bpf: "not net 10.0.10.0/24"
    interface: dummy0

  time:
    from: ""
    modifier: 1
    scale:
      duration: 15m
      enabled: true
    to: ""

This ensures that each replay iteration completes in approximately 15 minutes. The packet rate is calculated dynamically to reach this goal; minor drift cannot be avoided due to the calculations performed between packets. Note that replay.time.scale.enabled will always override whatever value the user defines via the replay.time.modifier key or the --time-modifier flag. The user can also use replay.time.from and replay.time.to to only replay files from a specific period, for example daytime. However, this feature currently does not scan individual packets and simply relies on the PCAP file beginning and end values. Thus, a PCAP file is ignored even if the defined period begins inside that file, and GopherCap will simply start from the next file after it.
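
To illustrate how the scaled modifier relates to set duration, consider some back-of-the-envelope arithmetic (not output from the tool): a mapped set spanning 10 hours replayed with replay.time.scale.duration set to 15m implies an effective modifier of roughly

10h / 15m = 600m / 15m = 40

so packets would be emitted about 40 times faster than they were originally captured.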

But can the user replay multiple PCAPs that were not written by the same capture process? For example, the user might want to use multiple PCAPs written at different times to generate simulated real-time traffic. Consider the example JSON mapping in the map section: waiting 6 months for the second file replay to start would be very bad. This can be achieved with the --wait-disable flag or the replay.disable_wait option. When set to true, all files start replaying at the same time and are forced into the same interval.

replay.disable_wait: true
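
The same can be done on the command line with the flag mentioned above:

gopherCap --config gopher.yml replay --wait-disable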

Combined with time scaling, this feature is quite useful for traffic generation. The malware PCAP mapping example could easily be replayed with the following configuration, which can be useful when writing or validating Suricata signatures, testing post-processing tools, etc. The user can, for example, mix known malware C2 beacon PCAPs with normal traffic to set up lab and training environments.

global:
  dump.json: /tmp/malware-samples.json
  file.regexp: 'maldocs.+hancitor'

replay:
  disable_wait: true
  loop.infinite: true
  out.interface: dummy0

  time:
    scale:
      duration: 15m
      enabled: true
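
Assuming this configuration were saved as, say, malware-replay.yml (an illustrative file name), the traffic generator would then be started like the earlier replay example:

gopherCap --config malware-replay.yml replay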

tarExtract

GopherCap can replay individual gzip-compressed PCAP files, but not when those files are inside Tar archives. It likely never will, as the Tar format is sequential and supporting it would break the concurrency features. However, consider the following example scenario:

  • a 4 terabyte hard drive holds a 1 terabyte gzipped Tar archive of PCAP files;

  • uncompressed, those files amount to 4 terabytes of disk usage;

  • only 1/4 of those files are relevant for replay; everything else is noise;

  • compressed, that relevant 1 terabyte subset requires only 200 gigabytes;

  • the available disk has only 300 gigabytes of free space.

This problem motivated the creation of this subcommand. It scans a tar.gz file and only extracts files that match a user-defined regular expression pattern. Those files can be written directly to gzip-compressed output files. Thus, the total disk requirement when solving the problem above is only 200 gigabytes; no interim storage or temporary files are needed. The following example would extract files that match a specific date pattern from /mnt/big.tar.gz to separate gzipped output files in /mnt/small.

global:
  dump.json: /tmp/malware-samples.json
  file.regexp: 'meerkat-20012\d+-\d+\.pcap'
tarball:
  dryrun: false
  in:
    file: /mnt/big.tar.gz
  out:
    dir: /mnt/small
    gzip: true
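
Assuming the configuration above were saved as, say, tar.yml (an illustrative name), the extraction would be invoked like the other subcommands. Setting tarball.dryrun to true first should, presumably, report the matched files without writing anything:

gopherCap --config tar.yml tarExtract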

Note that while it was developed for extracting PCAPs, nothing really stops the user from using this subcommand in other contexts as well.