
update-learn #3

Merged 2 commits on Nov 5, 2024
302 changes: 100 additions & 202 deletions docs/learn.md

# Learn Tailpipe

Tailpipe is a high-performance data collection and querying tool that makes it easy to collect, store, and analyze log data. With Tailpipe, you can:

- Collect logs from various sources and store them efficiently in Parquet files
- Query your data using familiar SQL syntax through DuckDB
- Share collected data with your team using remote object storage
- Create filtered views of your data using schemas
- Join log data with other data sources for enriched analysis

## Install the NGINX Plugin

This tutorial uses the NGINX plugin to demonstrate collecting and analyzing web server access logs. First, [download and install Tailpipe](/downloads), and then install the plugin:

```bash
tailpipe plugin install nginx
```

## Configure Data Collection

Tailpipe uses HCL configuration files to define what data to collect. Create a file named `nginx.tpc` with the following content:

```hcl
pipeline "learn_tailpipe" {
step "http" "get_ipv4" {
url = "https://api.ipify.org?format=json"
}

output "ip_address" {
value = step.http.get_ipv4.response_body.ip
}
partition "nginx_access_log" "web_servers" {
plugin = "nginx"
source "nginx_access_log_file" {
log_path = "/var/log/nginx/access.log"
}
}
```

This configuration tells Tailpipe to collect NGINX access logs from the specified log file. The configuration defines:
- A partition named `web_servers` for the `nginx_access_log` table
- The source type `nginx_access_log_file`, which reads NGINX-formatted logs
- The path to the log file to collect from
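
You can define more partitions for the same table by reusing this source type. For example, here is a hypothetical second partition for a different set of logs (the partition name and log path below are illustrative):

```hcl
# Hypothetical second partition, reusing the source type shown above.
# The partition name and log path are illustrative; adjust them to your environment.
partition "nginx_access_log" "load_balancers" {
  plugin = "nginx"

  source "nginx_access_log_file" {
    log_path = "/var/log/nginx/lb-access.log"
  }
}
```

Data from every partition of `nginx_access_log` is collected into the same table, so both sets of logs can be queried together.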

## Collect Data

Now let's collect the logs:

```bash
tailpipe collect nginx_access_log.web_servers
```

This command will:
1. Read the NGINX access logs from the specified file
2. Parse and standardize the log entries
3. Store the data in Parquet files organized by date
4. Update the local database with table definitions
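
You can re-run `tailpipe collect` whenever you want to pick up new log entries. Depending on your CLI version, collection may also accept a time window; the sketch below assumes a `--from` flag (check `tailpipe collect --help` for what your version supports):

```bash
# Hypothetical: only collect entries from November 1, 2024 onward,
# assuming the CLI supports a --from flag.
tailpipe collect nginx_access_log.web_servers --from 2024-11-01
```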

## Query Your Data

Tailpipe provides an interactive SQL shell for analyzing your collected data. Let's look at some examples of what you can do.
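
To open the shell, start the CLI's query mode (this assumes the `tailpipe query` subcommand):

```bash
tailpipe query
```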

### Analyze Traffic by Server

This query shows a summary of traffic for each server for a specific date:

```sql
SELECT
  tp_index as server,
  count(*) as requests,
  count(distinct remote_addr) as unique_ips,
  round(avg(bytes_sent)) as avg_bytes,
  count(CASE WHEN status = 200 THEN 1 END) as success_count,
  count(CASE WHEN status >= 500 THEN 1 END) as error_count,
  round(avg(CASE WHEN method = 'GET' THEN bytes_sent END)) as avg_get_bytes
FROM nginx_access_log
WHERE tp_date = '2024-11-01'
GROUP BY tp_index
ORDER BY requests DESC;
```

```
┌──────────────────────────────────────────────────────────────────────────────────────┐
│ server requests unique_ips avg_bytes success_c… error_cou… avg_get_b… │
│──────────────────────────────────────────────────────────────────────────────────── │
│ web-01.ex… 349 346 7036 267 7 7158 │
│ web-02.ex… 327 327 6792 246 11 6815 │
│ web-03.ex… 324 322 7001 254 8 6855 │
└──────────────────────────────────────────────────────────────────────────────────────┘
```

This shows us:
- Number of requests per server
- Count of unique IP addresses
- Average response size
- Success and error counts
- Average size of GET requests
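
As a variation, you can slice the same data by hour rather than by server. This sketch assumes the table also exposes a `tp_timestamp` column alongside `tp_date` and `tp_index`:

```sql
SELECT
  date_trunc('hour', tp_timestamp) as hour,
  count(*) as requests,
  count(CASE WHEN status >= 500 THEN 1 END) as errors
FROM nginx_access_log
WHERE tp_date = '2024-11-01'
GROUP BY hour
ORDER BY hour;
```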

### Join with External Data

One of Tailpipe's powerful features is the ability to join log data with other tables. Here's an example that joins with an `ip_info` lookup table to add context about the traffic:

```sql
SELECT
  n.remote_addr as ip,
  i.description,
  count(*) as requests,
  count(distinct n.server_name) as servers_accessed,
  round(avg(n.bytes_sent)) as avg_bytes,
  string_agg(distinct n.method, ', ') as methods_used,
  count(CASE WHEN n.status >= 400 THEN 1 END) as errors
FROM nginx_access_log n
LEFT JOIN ip_info i ON n.remote_addr = i.ip_address
WHERE i.description IS NOT NULL
GROUP BY n.remote_addr, i.description
ORDER BY requests DESC;
```

```
┌──────────────────────────────────────────────────────────────────────────────────────┐
│ ip descripti… requests servers_a… avg_bytes methods_u… errors │
│──────────────────────────────────────────────────────────────────────────────────── │
│ 203.0.113… Test Netw… 1 1 1860 GET 0 │
└──────────────────────────────────────────────────────────────────────────────────────┘
```

This enriched query shows:
- IP addresses and their descriptions
- How many servers each IP accessed
- Average response sizes
- HTTP methods used
- Error counts
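
The `ip_info` table here is assumed to already exist in your workspace database. Since the shell is DuckDB, one hypothetical way to load such a lookup table is from a local CSV file:

```sql
-- Hypothetical: build the lookup table from a local CSV file.
-- The file name and its ip_address/description columns are illustrative.
CREATE TABLE ip_info AS
SELECT * FROM read_csv_auto('ip_info.csv');
```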


## What's Next?

We've demonstrated basic log collection and analysis with Tailpipe. Here's what to explore next:

- [Discover more plugins on the Hub →](https://hub.tailpipe.io/plugins)
- [Learn about data compaction and optimization →](https://tailpipe.io/docs/managing/compaction)
- [Share data with your team using remotes →](https://tailpipe.io/docs/sharing/remotes)
- [Create schemas for filtered views →](https://tailpipe.io/docs/schemas)
- [Join #tailpipe on Slack →](https://turbot.com/community/join)