brimdata/zed-sample-data

Sample Data

To help you get started quickly with super, this repository contains small sample sets of Zeek data. Five log formats are available, all representing events from the same network traffic:

Directory           Format
zeek-default/       Zeek default output format
zeek-json/          JSON as output by the Zeek package for JSON Streaming Logs
bsup/               Super Binary, output with super's default LZ4-compressed format
bsup-uncompressed/  Super Binary, output with super's option -bsup.compress=false to disable compression
jsup/               Super JSON, a text output format that has the look and feel of JSON

This sample data is used frequently for a simple SuperDB performance test and to check for unexpected changes in the SuperDB output formats.

Downloading

Because prior changes to the Super Binary and Super JSON output formats have added some bulk to the revision history, you'll typically want to save time by just downloading the latest revision:

# git clone --depth=1 https://github.com/brimdata/zed-sample-data.git

Origin/License

This sample data set was generated from a subset of the packet capture archives (formerly at https://archive.wrccdc.org/pcaps, though the site has been down of late) that are distributed by the WRCCDC.

This sample data is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, as it is built upon the WRCCDC PCAP data that is distributed under the same license.

Acknowledgement

We would like to express our thanks to the WRCCDC for generously making their packet capture archives available to the public and for commercial use. The terabytes of "real world" data have been invaluable to us in testing the foundations of super at scale.

Creation

The data set was made from several of the PCAP files in the 2018 set. Zeek v6.2.0 was used in its default configuration, the only change being the addition and enabling of the JSON Streaming Logs package. The packet captures were then processed with the following commands:

# mergecap -w wrccdc.pcap wrccdc.2018-03-24.10*.pcap
# zeek -r wrccdc.pcap local "JSONStreaming::enable_log_rotation=F"

This produced the logs in Zeek default and JSON formats. As Super Binary and Super JSON are not output by Zeek, these logs were created by sending each Zeek default log through super, e.g.:

# mkdir -p bsup && \
for file in zeek-default/*
do
  super -f bsup "$file" \
      | gzip -n > bsup/"$(basename "$file" | sed 's/\.log\.gz//')".bsup.gz
done

# mkdir -p bsup-uncompressed && \
for file in zeek-default/*
do
  super -f bsup -bsup.compress=false "$file" \
      | gzip -n > bsup-uncompressed/"$(basename "$file" | sed 's/\.log\.gz//')".bsup.gz
done

# mkdir -p jsup && \
for file in zeek-default/*
do
  super -f jsup "$file" \
      | gzip -n > jsup/"$(basename "$file" | sed 's/\.log\.gz//')".jsup.gz
done
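
The three loops above differ only in the output format, the target directory, and any extra flags, so they could be folded into one helper. The following is a minimal sketch, not part of this repository: the function name convert_logs and the SUPER override variable are hypothetical, and the only super flags used are the ones already shown above.

```shell
# convert_logs FORMAT OUTDIR [extra super flags...]
# Hypothetical helper: converts every Zeek default log into FORMAT
# and writes gzipped results to OUTDIR.
# SUPER is an assumed override for the super binary (handy for testing).
convert_logs() {
  fmt=$1
  outdir=$2
  shift 2
  mkdir -p "$outdir" || return 1
  for file in zeek-default/*.log.gz
  do
    # Skip the unexpanded pattern if no logs are present.
    [ -e "$file" ] || continue
    # basename with a suffix argument strips the trailing ".log.gz".
    name=$(basename "$file" .log.gz)
    "${SUPER:-super}" -f "$fmt" "$@" "$file" \
        | gzip -n > "$outdir/$name.$fmt.gz"
  done
}
```

With this helper defined, the three conversions would be run as:

# convert_logs bsup bsup
# convert_logs bsup bsup-uncompressed -bsup.compress=false
# convert_logs jsup jsup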

Testing

Since the sample Super Binary and Super JSON logs are generated by super, regenerating these outputs is a useful super test. Assuming super is in your $PATH, a script is provided to regenerate the hash for each Super Binary and Super JSON log and compare it to a last known "good" hash stored in the md5sums/ directory.

Example output in which a format change has been flagged:

# scripts/check_md5sums.sh bsup
capture_loss:62949d22a0a557342d28ee5ee4b64d50
...
x509:10333d3d004c718b04cbedb8ee195cca

diff'ing current "super -f bsup" output hashes vs. committed hashes:
7c7
< ftp:c84824c8114df4db745399ff875b0d92
---
> ftp:2d8d90df3c4b84eb9e281a3f10767aa5

  ======> diffs detected! Check for a super bug or intentional Super Binary format change.
          Current hashes are in /var/folders/yn/jbkxxkpd4vg142pc3_bd_krc0000gn/T/tmp.9X7Gab9I
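
The kind of comparison described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of scripts/check_md5sums.sh: the function names and the SUPER override variable are hypothetical, the name:hash line layout is taken from the example output above, and it is assumed that the hash covers super's uncompressed output and that the known "good" hashes sit in a single file passed as the second argument.

```shell
# md5_of_stdin: print the MD5 hex digest of standard input.
# Uses md5sum where available (Linux) and falls back to md5 -q (macOS).
md5_of_stdin() {
  if command -v md5sum >/dev/null 2>&1
  then
    md5sum | cut -d' ' -f1
  else
    md5 -q
  fi
}

# current_hashes FORMAT: print one "name:hash" line per Zeek default log,
# hashing the output of "super -f FORMAT" for that log.
# SUPER is an assumed override for the super binary.
current_hashes() {
  fmt=$1
  for file in zeek-default/*.log.gz
  do
    [ -e "$file" ] || continue
    name=$(basename "$file" .log.gz)
    hash=$("${SUPER:-super}" -f "$fmt" "$file" | md5_of_stdin)
    printf '%s:%s\n' "$name" "$hash"
  done
}

# check_hashes FORMAT KNOWN_GOOD_FILE: diff current hashes against
# the committed ones; return non-zero if they differ.
check_hashes() {
  tmp=$(mktemp) || return 1
  current_hashes "$1" > "$tmp"
  if diff "$tmp" "$2"
  then
    rm -f "$tmp"
    echo "no diffs detected"
  else
    echo "diffs detected; current hashes are in $tmp"
    return 1
  fi
}
```

Keeping the temporary file when the hashes differ mirrors the example output, which reports where the current hashes were written so they can be inspected.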