Merge pull request #24 from turbot/tp_updates
Tp updates
judell authored Jan 23, 2025
2 parents 18820ee + 9357cdb commit 183f328
Showing 21 changed files with 71 additions and 98 deletions.
15 changes: 8 additions & 7 deletions docs/learn.md
@@ -41,13 +41,14 @@ connection "aws" "admin" {

Tailpipe can use the default AWS credentials from your credentials file and/or environment variables; if you can run `aws s3 ls`, for example, then you should be able to collect CloudTrail logs. The AWS plugin [documentation](https://hub.tailpipe.io/plugins/turbot/aws) describes other access patterns.
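For example, to confirm that the credentials Tailpipe will reuse are working before you collect (the profile name `admin` is illustrative):

```bash
# Optional: point at a named profile; omit this to use the default profile
export AWS_PROFILE=admin
# If this lists your buckets, Tailpipe can use the same credentials
aws s3 ls
```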

You will also need to define a [partition](/docs/manage/partition) which refers to a plugin-defined table (*aws_cloudtrail_log*) that describes the data found in each line of a Cloudtrail log, and a [source](/docs/manage/source) that governs how Tailpipe acquires the data that populates the partition. Tailpipe knows the structure of a bucket that contains Cloudtrail logs so you only need to specify the bucket name:
You will also need to define a [partition](/docs/manage/partition), which refers to a plugin-defined table (*aws_cloudtrail_log*) that describes the data found in each line of a CloudTrail log, and a [source](/docs/manage/source) that governs how Tailpipe acquires the data that populates the partition.

```
partition "aws_cloudtrail_log" "prod" {
source "aws_s3_bucket" {
connection = connection.aws.admin
bucket = "aws-cloudtrail-logs-6054...81-fe67"
connection = connection.aws.admin
bucket = "aws-cloudtrail-logs-6054...81-fe67"
file_layout = "AWSLogs/%{NUMBER:account_id}/%{DATA}.json.gz"
}
}
```
@@ -66,10 +67,10 @@ tar xvf flaws_cloudtrail_logs.tar
To source the log data from the `.gz` file extracted from the tar file, your `aws.tpc` file won't include a `connection` block. Its `partition` block will follow this format:

```
partition "aws_cloudtrail_log" "prod" {
source "file_system" {
partition "aws_cloudtrail_log" "flaws" {
source "file" {
paths = ["~/flaws"]
file_layout = ["%{DATA}.json.gz"]
file_layout = "%{DATA}.json.gz"
}
}
```
@@ -79,7 +80,7 @@ partition "aws_cloudtrail_log" "prod" {
Now let's collect the logs:

```bash
tailpipe collect aws_cloudtrail_log.prod
tailpipe collect aws_cloudtrail_log
```

This command will:
8 changes: 3 additions & 5 deletions docs/manage/collection.md
@@ -4,25 +4,23 @@ title: Collection

# Collection

The [tailpipe collect](/docs/reference/cli/collect) command runs a [plugin](/docs/manage/plugin) that reads from a [source](/docs/manage/source) and writes to the [hive](/docs/manage/hive). Every time you run `tailpipe collect`, Tailpipe refreshes its views over all collected parquet files. Those views are the tables you query with `tailpipe query` (or directly with DuckDB).
The [tailpipe collect](/docs/reference/cli/collect) command runs a [plugin](/docs/manage/plugin) that reads from a [source](/docs/manage/source) and writes to the [hive](/docs/manage/hive). Every time you run `tailpipe collect`, Tailpipe refreshes its views over all collected Parquet files. Those views are the tables you query with `tailpipe query` (or directly with DuckDB).

The collection process always writes to a local **workspace**, and does so on a per-partition basis. While you may specify multiple partitions on the command line, the `partition` is the unit of collection. A partition day is the atomic unit of work: collection succeeds or fails for all sources for a given day, and if it fails, everything for that day is rolled back.
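For example, you can name several partitions in a single run; each is still collected, and rolled back on failure, independently (the partition names here are illustrative):

```bash
# Collect two partitions of the same table in one invocation
tailpipe collect aws_cloudtrail_log.prod aws_cloudtrail_log.dev
```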

When a partition is collected, each source resumes from the last time it was collected. Source data is ingested, standardized, then written to parquet files in the **standard hive structure**.
When a partition is collected, each source resumes from the last time it was collected. Source data is ingested, standardized, then written to Parquet files in the **standard hive structure**.

### Initial collection

Often the source data to be ingested is large, and the first ingestion would take quite a long time. To improve the first-run experience, Tailpipe collects by day in reverse chronological order; in other words, it starts with the current day and moves backward. By default the initial collection does NOT collect all data; there is a 7-day lookback window. You can override that on the command line, e.g.:

```
tailpipe collect aws_cloudtrail_log.test --from T-180d
tailpipe collect aws_cloudtrail_log.test --from 2024-01-01 --to 2024-03-31
tailpipe collect aws_cloudtrail_log.test --from 2024-01-01
```

- Subsequent collection runs occur chronologically, resuming from the last collection by default, so there are no time gaps in the collected data.

- A user may specify a specific time range using `--from` and `--to`, but this is generally discouraged as it can leave gaps in the data.

- The data is available for querying even while partition collection is still occurring, as in the sketch below.
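A minimal sketch: while an initial collection runs in one terminal, you can query the rows collected so far from another (the query file name is illustrative):

```bash
# Terminal 1: a long-running initial collection
tailpipe collect aws_cloudtrail_log.prod --from 2024-01-01

# Terminal 2: query what has been collected so far
tailpipe query cloudtrail_event.sql
```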

### Tables, partitions, and indexes
13 changes: 2 additions & 11 deletions docs/manage/hive.md
@@ -6,26 +6,17 @@ title: Hive

Tailpipe uses [hive partitioning](https://duckdb.org/docs/data/partitioning/hive_partitioning.html) to leverage automatic [filter pushdown](https://duckdb.org/docs/data/partitioning/hive_partitioning.html#filter-pushdown), and it is opinionated about the layout:

- The data is written to parquet files in the workspace directory, with a prescribed directory and filename structure. Other than **index** the layout is dictated by the Tailpipe core.
- The data is written to Parquet files in the workspace directory, with a prescribed directory and filename structure. Other than **index**, the layout is dictated by the Tailpipe core.

- The *plugin* may choose the **index** value, but it is not *user*-definable.

- The metadata is also written to Parquet files in the workspace directory, with a prescribed directory and filename structure.

>[!NOTE]
> what's in the metadata? do users care?
The standard partitioning/hive structure enables efficient queries that only need to read subsets of the hive filtered by index or date.

Tailpipe [schemas](#schemas) also depend on this structure, as the filterable fields are aligned to the hive partitions.

>[!NOTE]
> what about schema is relevant to the user for lw7, vs later?


### Index: Custom Partition Key

Each plugin chooses what the **index** is for a given table. Because the data is laid out into partitions, performance is optimized when the partition appears in a `where` or `join` clause, or is used on a **schema** definition. The index provides a way to segment the data to optimize lookup performance in a way that is *optimal for the specific plugin*. For example, AWS tables index on account id, Azure tables on subscription, and GCP on project id.
Each plugin chooses what the **index** is for a given table. Because the data is laid out into partitions, performance is optimized when the partition appears in a `where` or `join` clause. The index provides a way to segment the data to optimize lookup performance in a way that is *optimal for the specific plugin*. For example, AWS tables index on account id, Azure tables on subscription, and GCP on project id.
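For example, a query that filters on the index only has to read the matching directories of the hive. This is a sketch; the `tp_index` and `tp_date` column names and the workspace database path are assumptions here:

```bash
# Open the workspace database directly with DuckDB and filter on the
# hive columns so only matching partitions are scanned
duckdb ~/.tailpipe/data/default/tailpipe.db \
  "select count(*)
   from aws_cloudtrail_log
   where tp_index = '123456789012'   -- the AWS account id
     and tp_date >= '2025-01-01'"
```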

```bash
tp_table=aws_cloudtrail_log
4 changes: 2 additions & 2 deletions docs/manage/index.md
@@ -4,9 +4,9 @@ title: Manage Tailpipe

# Manage Tailpipe

Tailpipe is simple to install, it's distributed as a single binary file that you can [download and run](/downloads). You use Tailpipe to collect logs, then query them.
Tailpipe is simple to install; it's distributed as a single binary file that you can [download and run](/downloads). Use Tailpipe to collect logs, then query them.

When you collect logs from a source, e.g. an S3 bucket containing Cloudtrail logs, you invoke a plugin that uses a connection to read the source and build a database table. The unit of collection is the partition: a table may comprise one one or more partitions. The collection process writes to a hierarchy of folders and files called the hive, over which Tailpipe builds the database table that you query. The workspace, by default `~/.tailpipe/data/default`, defines the location of the hive.
When you collect logs from a source, e.g. an S3 bucket containing CloudTrail logs, you invoke a plugin that uses a connection to read the source and build a database table. The unit of collection is the partition: a table may comprise one or more partitions. The collection process writes to a hierarchy of folders and files called the [hive](/docs/manage/hive), over which Tailpipe builds the database table that you query. The workspace, by default `~/.tailpipe/data/default`, defines the location of the hive.
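A minimal sketch of that flow, end to end (the partition name `prod` is illustrative, and the `plugin install` subcommand is assumed from the CLI's plugin-management command):

```bash
tailpipe plugin install aws               # install the AWS plugin
tailpipe collect aws_cloudtrail_log.prod  # collect a partition into the hive
tailpipe query                            # open the interactive query shell
```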



2 changes: 1 addition & 1 deletion docs/manage/partition.md
@@ -9,7 +9,7 @@ Partitions are defined in HCL and are required for collection.

The partition has two labels:

1. The table name. The table name is meaningful and must match a table name for an installed plugin. The table name implies the shape of resulting parquet file, and also makes assumptions about the source data format. For example, the `aws_cloudtrail_log` table is defined in the AWS plugin. The shape of that table (the structure of the data in the destination parquet file, which corresponds to table columns) is defined in the AWS plugin.
1. The table name. The table name is meaningful and must match a table name for an installed plugin. The table name implies the shape of the resulting Parquet file, and also makes assumptions about the source data format. For example, the `aws_cloudtrail_log` table is defined in the AWS plugin. The shape of that table (the structure of the data in the destination Parquet file, which corresponds to table columns) is defined in the AWS plugin.

2. A partition name. The partition name must be unique for all partitions in a given table (though different tables may use the same partition names).

31 changes: 14 additions & 17 deletions docs/manage/source.md
@@ -4,7 +4,7 @@ title: Sources

# Sources

A partition acquires data from one or more **sources**. Usually a source will connect to a resource via a **connection** which specifies the credentials and account scope. The source is typically more specific than the connection though. For example, a *connection* that provides the ability to interact with an AWS account may support several *sources* that use that connection but provide logs from different services and locations, for example two AWS S3 sources from two different S3 buckets.
A partition acquires data from one or more **sources**. Often a source will connect to a resource via a [connection](/docs/reference/config-files/connection), which specifies the credentials and account scope. The source is typically more specific than the connection, though. For example, a connection that provides access to an AWS account may support several sources that use that connection but provide logs from different services and locations, such as two AWS S3 sources reading from two different S3 buckets.

A plugin's source mechanism is responsible for:

@@ -24,28 +24,25 @@ Sources are defined as sub-blocks in a *partition*.

```hcl
partition "aws_cloudtrail_log" "test" {
source "aws_s3" {
connection = connection.aws.logs
bucket = "my-logs-bucket"
prefix = "optional/path"
source "aws_s3_bucket" {
connection = connection.aws.logs
bucket = "my-logs-bucket"
file_layout = "{%DATA}.json.gz"
}
source "file_system" {
source "file" {
path = "/path/to/files"
extensions = "[.gz]"
}
source "aws_sqs" {
connection = connection.aws.logs
queue_url = "https://my-queue"
}
source "aws_cloudwatch" {
connection = connection.aws.logs
log_group_arn = "arn:..."
file_layout = "{%DATA}.json.gz"
}
}
```

Standard source types, like `aws_s3_bucket` and `file_system`, have an HCL shape that is consistent across all the partitions, and a standard set of arguments. However the *interpretation* of the arguments, and the *behavior* of the source, is plugin-dependent. For instance, the `aws_s3_source` for the `aws_cloudtrail_log` table makes Cloudtrail-specific assumptions about the key prefix structure and file names.
Standard source types, like `aws_s3_bucket` and `file`, have an HCL shape that is consistent across all the partitions, and a standard set of arguments.

>[!NOTE]
> is this still true, given file_layout?
However, the *interpretation* of the arguments, and the *behavior* of the source, is plugin-dependent. For instance, the `aws_s3_bucket` source for the `aws_cloudtrail_log` table makes CloudTrail-specific assumptions about the key prefix structure and file names.



2 changes: 1 addition & 1 deletion docs/manage/table.md
@@ -7,6 +7,6 @@ title: Tables
>[!NOTES]
> Not much more here than in the glossary. But maybe we expand on schema definitions here. Unsure how much of that concept surfaces for lw7 vs later.
Tables are implemented as DuckDB views over the parquet files. Tailpipe creates tables (that is, creates views in the `tailpipe.db` database) based on the data and metadata that it discovers in the [workspace](#workspaces), along with the filter rules.
Tables are implemented as DuckDB views over the Parquet files. Tailpipe creates tables (that is, creates views in the `tailpipe.db` database) based on the data and metadata that it discovers in the [workspace](#workspaces), along with the filter rules.

When Tailpipe starts, it finds all the tables in the workspace according to the [hive directory layout](/docs/manage/hive). For each schema, it adds a view for the table. The view definitions will include qualifiers that implement the filter rules that are defined in the [schema definition](#schemas).
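For example, you can list those views directly with DuckDB (a sketch; the database path assumes the default workspace location):

```bash
# Show the views Tailpipe created over the collected Parquet files
duckdb ~/.tailpipe/data/default/tailpipe.db "show tables"
```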
2 changes: 1 addition & 1 deletion docs/manage/workspace.md
@@ -12,7 +12,7 @@ A Tailpipe workspace is a profile that defines the environment in which Tailpipe

Each workspace comprises:

- A single local workspace directory for Tailpipe data (parquet) and metadata files
- A single local workspace directory for Tailpipe data (Parquet) and metadata files

- Optionally, context-specific settings and options

8 changes: 1 addition & 7 deletions docs/pipes-ecosystem/flowpipe.md
@@ -6,10 +6,4 @@ title: Flowpipe

[Flowpipe](https://flowpipe.io/) is an automation and workflow engine designed for DevOps tasks. It allows you to define and execute complex workflows using code, making it ideal for automating cloud infrastructure management across platforms like AWS, Azure, and GCP.

Flowpipe enables you to take action on Tailpipe detections!

- Detect and correct misconfigurations leading to cost savings opportunities with Thrifty mods for [AWS](https://hub.flowpipe.io/mods/turbot/aws_thrifty), [Azure](https://hub.flowpipe.io/mods/turbot/azure_thrifty), or [GCP](https://hub.flowpipe.io/mods/turbot/gcp_thrifty).

- Automate your resource tagging standards with Tags mods for [AWS](https://hub.flowpipe.io/mods/turbot/aws_tags) or [Azure](https://hub.flowpipe.io/mods/turbot/azure_tags).

- [Build your own mods](https://flowpipe.io/docs/build) with simple HCL to create custom pipelines tailored to your specific needs.
Use it to take action on Tailpipe detections!
4 changes: 3 additions & 1 deletion docs/pipes-ecosystem/powerpipe.md
@@ -5,8 +5,9 @@ sidebar_label: Powerpipe

# Powerpipe

[Powerpipe](https://powerpipe.io/) is an open-source platform that enables you to visualize logs and cloud infrastructure using dashboards, log detections, and compliance controls. Powerpipe works seamlessly with Tailpipe; use it to evaluate your logs against a [library of benchmarks](#TBD) keyed to the MITRE ATT&CK framework.
[Powerpipe](https://powerpipe.io/) is an open-source platform that enables you to visualize logs and cloud infrastructure using dashboards, log detections, and compliance controls. Use it with [Tailpipe mods](https://hub.powerpipe.io?engines=tailpipe) that visualize logs on dashboards and benchmark them using the MITRE ATT&CK framework.

<!--
## Visualize your logs with Powerpipe Dashboards
@@ -21,3 +22,4 @@ sidebar_label: Powerpipe
![](/images/docs/pipes-ecosystem/benchmark_dashboard_view.png)
-->
2 changes: 1 addition & 1 deletion docs/query/batch-query.md
@@ -26,7 +26,7 @@ We can now run the query by passing the file name to `tailpipe query`
tailpipe query cloudtrail_event.sql
```

You can even run multiple sql files by passing a glob or a space separated list of file names to the command:
You can even run multiple sql files by passing a glob or a space-separated list of file names to the command:
```bash
tailpipe query *.sql
```
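The space-separated form works the same way (the second file name is illustrative):

```bash
tailpipe query cloudtrail_event.sql s3_access_errors.sql
```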
2 changes: 1 addition & 1 deletion docs/query/query-shell.md
@@ -68,7 +68,7 @@ The query shell supports standard emacs-style key bindings:

## Exploring Tables & Columns

Tailpipe **tables** provide an interface for querying log data using standard SQL. Tailpipe tables do not actually *store* data, they query the DuckDB views created over parquet files collected by `tailpipe collect`. The details are hidden from you though - *you just query them like any other table!*
Tailpipe **tables** provide an interface for querying log data using standard SQL. Tailpipe tables do not actually *store* data; they query the DuckDB views created over Parquet files collected by `tailpipe collect`. The details are hidden from you though - *you just query them like any other table!*

### Tables

3 changes: 1 addition & 2 deletions docs/reference/cli/collect.md
@@ -15,9 +15,8 @@ Run a **collection**.

| Flag | Description
|-|-
| `--compact` | Compact the parquet files after collection (default true)
| `--compact` | Compact the Parquet files after collection (default true)
| `--from string` | Collect days newer than a relative or absolute date.
| `--to string` | Collect days older than a relative or absolute date.
| `--output output` | Output format; one of: json, table (default text)
| `--help` | Help for collect

2 changes: 1 addition & 1 deletion docs/reference/cli/compact.md
@@ -2,7 +2,7 @@
title: tailpipe compact
---

Compact multiple parquet files per day to one per day.
Compact multiple Parquet files per day to one per day.

# tailpipe compact

12 changes: 8 additions & 4 deletions docs/reference/cli/index.md
@@ -8,11 +8,15 @@ title: Tailpipe CLI

| Command | Description
|-|-
| [tailpipe help](/docs/reference/cli/help) | Help about any command
| [tailpipe collect](/docs/reference/cli/collect) | Collect from log sources
| [tailpipe collect](/docs/reference/cli/collect) | Run a collection
| [tailpipe compact](/docs/reference/cli/compact) | Compact multiple Parquet files per day to one per day
| [tailpipe connect](/docs/reference/cli/connect) | Return a connection string for a database
| [tailpipe help](/docs/reference/cli/help) | Help about any command
| [tailpipe partition](/docs/reference/cli/partition) | List, show, and delete Tailpipe partitions
| [tailpipe plugin](/docs/reference/cli/plugin) | Tailpipe plugin management
| [tailpipe query](/docs/reference/cli/query) | Query log sources
| [tailpipe query](/docs/reference/cli/query) | Execute a query against the workspace database
| [tailpipe source](/docs/reference/cli/source) | List and show Tailpipe sources
| [tailpipe table](/docs/reference/cli/table) | List and show Tailpipe tables



@@ -31,7 +35,7 @@ title: Tailpipe CLI
<tr>
<td nowrap="true"> `--config-path` </td>
<td>
Sets the search path for <a href = "/docs/reference/config-files">configuration files</a>. This argument accepts a colon-separated list of directories. All configuration files (`*.fpc`) will be loaded from each path, with decreasing precedence. The default is `.:$TAILPIPE_INSTALL_DIR/config` (`.:~/.tailpipe/config`). This allows you to manage your <a href="/docs/reference/config-files/workspace"> workspaces </a> and <a href="/docs/reference/config-files/connection">connections</a> centrally in the `~/.tailpipe/config` directory, but override them in the working directory / mod location if desired.
Sets the search path for <a href = "/docs/reference/config-files">configuration files</a>. This argument accepts a colon-separated list of directories. All configuration files (`*.tpc`) will be loaded from each path, with decreasing precedence. The default is `.:$TAILPIPE_INSTALL_DIR/config` (`.:~/.tailpipe/config`). This allows you to manage your <a href="/docs/reference/config-files/workspace"> workspaces </a> and <a href="/docs/reference/config-files/connection">connections</a> centrally in the `~/.tailpipe/config` directory, but override them in the working directory / mod location if desired.
</td>
</tr>
