Skip to content

Commit

Permalink
docs: Add how to parse guide (influxdata#14947)
Browse files Browse the repository at this point in the history
  • Loading branch information
powersj authored Mar 14, 2024
1 parent 6fa5a9b commit d0f505c
Showing 1 changed file with 243 additions and 0 deletions.
243 changes: 243 additions & 0 deletions docs/PARSING_DATA.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,243 @@
# Parsing Data

Telegraf has the ability to take data in a variety of formats. Telegraf requires
configuration from the user in order to correctly parse, store, and send the
original data. Telegraf does not take the raw data and maintain it internally.

Telegraf uses an internal metric representation consisting of the metric name,
tags, fields and a timestamp, very similar to [line protocol][]. This
means that data needs to be broken up into a metric name, tags, fields, and a
timestamp. While none of these options are required, they are available to
the user and might be necessary to ensure the data is represented correctly.

[line protocol]: https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/

## Parsers

The first step is to determine which parser to use. Look at the list of
[parsers][] and find one that will work with the user's data. This is generally
straightforward as the data-type will only have one parser that is actually
applicable to the data.

[parsers]: https://github.com/influxdata/telegraf/tree/master/plugins/parsers

### JSON parsers

There is an exception when it comes to JSON data. Instead of a single parser,
there are three different parsers capable of reading JSON data:

* `json`: This parser is great for flat JSON data. If the JSON is more complex
and for example, has other objects or nested arrays, then do not use this and
look at the other two options.
* `json_v2`: The v2 parser was created out of the need to parse JSON objects. It
can take on more advanced cases, at the cost of additional configuration.
* `xpath_json`: The xpath parser is the most capable of the three options. While
the xpath name may imply XML data, it can parse a variety of data types using
XPath expressions.

## Tags and fields

The next step is to look at the data and determine how the data needs to be
split up between tags and fields. Tags are generally strings or values that a
user will want to search on. While fields are the raw data values, numeric
types, etc. Generally, data is considered to be a field unless otherwise
specified as a tag.

## Timestamp

To parse a timestamp, at the very least the users needs to specify which field
has the timestamp and what the format of the timestamp is. The format can either
be a predefined Unix timestamp or parsed using a custom format based on Go
reference time.

For Unix timestamps Telegraf understands the following settings:

| Timestamp | Timestamp Format |
|-----------------------|------------------|
| `1709572232` | `unix` |
| `1709572232123` | `unix_ms` |
| `1709572232123456` | `unix_us` |
| `1709572232123456789` | `unix_ns` |

There are some named formats available as well:

| Timestamp | Named Format |
|---------------------------------------|---------------|
| `Mon Jan _2 15:04:05 2006` | `ANSIC` |
| `Mon Jan _2 15:04:05 MST 2006` | `UnixDate` |
| `Mon Jan 02 15:04:05 -0700 2006` | `RubyDate` |
| `02 Jan 06 15:04 MST` | `RFC822` |
| `02 Jan 06 15:04 -0700` | `RFC822Z` |
| `Monday, 02-Jan-06 15:04:05 MST` | `RFC850` |
| `Mon, 02 Jan 2006 15:04:05 MST` | `RFC1123` |
| `Mon, 02 Jan 2006 15:04:05 -0700` | `RFC1123Z` |
| `2006-01-02T15:04:05Z07:00` | `RFC3339` |
| `2006-01-02T15:04:05.999999999Z07:00` | `RFC3339Nano` |
| `Jan _2 15:04:05` | `Stamp` |
| `Jan _2 15:04:05.000` | `StampMilli` |
| `Jan _2 15:04:05.000000` | `StampMicro` |
| `Jan _2 15:04:05.000000000` | `StampNano` |

If the timestamp does not conform to any of the above, then the user can specify
a custom timestamp format, in which the user must provide the timestamp in
[Go reference time][] notation. Here are a few example timestamps and their Go
reference time equivalent:

| Timestamp | Go reference time |
|-------------------------------|-------------------------------|
| `2024-03-04T17:10:32` | `2006-01-02T15:04:05` |
| `04 Mar 24 10:10 -0700` | `02 Jan 06 15:04 -0700` |
| `2024-03-04T10:10:32Z07:00` | `2006-01-02T15:04:05Z07:00` |
| `2024-03-04 17:10:32.123+00` | `2006-01-02 15:04:05.999+00` |
| `2024-03-04T10:10:32.123456Z` | `2006-01-02T15:04:05.000000Z` |
| `2024-03-04T10:10:32.123456Z` | `2006-01-02T15:04:05.999999999Z` |

Note for fractional second values, the user can use either a `9` or `0`. Using a
`0` forces a certain length, but using `9`s do not.

Please note, that timezone abbreviations are ambiguous! For example `MST`, can
stand for either Mountain Standard Time (UTC-07) or Malaysia Standard Time
(UTC+08). As such, avoid abbreviated timezones if possible.

Unix timestamps use UTC, there is no concept of a timezone for a Unix timestamp.

[Go reference time]: https://pkg.go.dev/time#pkg-constants

## Examples

Below are a few basic examples to get users started.

### CSV

Given the following data:

```csv
node,temp,humidity,alarm,time
node1,32.3,23,false,2023-03-06T16:52:23Z
node2,22.6,44,false,2023-03-06T16:52:23Z
node3,17.9,56,true,2023-03-06T16:52:23Z
```

Here is corresponding parser configuration and result:

```toml
[[inputs.file]]
files = ["test.csv"]
data_format = "csv"

csv_header_row_count = 1
csv_column_names = ["node","temp","humidity","alarm","time"]
csv_tag_columns = ["node"]
csv_timestamp_column = "time"
csv_timestamp_format = "2006-01-02T15:04:05Z"
```

```text
file,node=node1 temp=32.3,humidity=23i,alarm=false 1678121543000000000
file,node=node2 temp=22.6,humidity=44i,alarm=false 1678121543000000000
file,node=node3 temp=17.9,humidity=56i,alarm=true 1678121543000000000
```

### JSON flat data

Given the following data:

```json
{ "node": "node", "temp": 32.3, "humidity": 23, "alarm": false, "time": "1709572232123456789"}
```

Here is corresponding parser configuration:

```toml
[[inputs.file]]
files = ["test.json"]
precision = "1ns"
data_format = "json"

tag_keys = ["node"]
json_time_key = "time"
json_time_format = "unix_ns"

```

```text
file,node=node temp=32.3,humidity=23 1709572232123456789
```

### JSON Objects

Given the following data:

```json
{
"metrics": [
{ "node": "node1", "temp": 32.3, "humidity": 23, "alarm": "false", "time": "1678121543"},
{ "node": "node2", "temp": 22.6, "humidity": 44, "alarm": "false", "time": "1678121543"},
{ "node": "node3", "temp": 17.9, "humidity": 56, "alarm": "true", "time": "1678121543"}
]
}
```

Here is corresponding parser configuration:

```toml
[[inputs.file]]
files = ["test.json"]
data_format = "json_v2"

[[inputs.file.json_v2]]
[[inputs.file.json_v2.object]]
path = "metrics"
timestamp_key = "time"
timestamp_format = "unix"
[[inputs.file.json_v2.object.tag]]
path = "#.node"
[[inputs.file.json_v2.object.field]]
path = "#.temp"
type = "float"
[[inputs.file.json_v2.object.field]]
path = "#.humidity"
type = "int"
[[inputs.file.json_v2.object.field]]
path = "#.alarm"
type = "bool"
```

```text
file,node=node1 temp=32.3,humidity=23i,alarm=false 1678121543000000000
file,node=node2 temp=22.6,humidity=44i,alarm=false 1678121543000000000
file,node=node3 temp=17.9,humidity=56i,alarm=true 1678121543000000000
```

### JSON Line Protocol

Given the following data:

```json
{
"fields": {"temp": 32.3, "humidity": 23, "alarm": false},
"name": "measurement",
"tags": {"node": "node1"},
"time": "2024-03-04T10:10:32.123456Z"
}
```

Here is corresponding parser configuration:

```toml
[[inputs.file]]
files = ["test.json"]
precision = "1us"
data_format = "xpath_json"

[[inputs.file.xpath]]
metric_name = "/name"
field_selection = "fields/*"
tag_selection = "tags/*"
timestamp = "/time"
timestamp_format = "2006-01-02T15:04:05.999999999Z"
```

```text
measurement,node=node1 alarm="false",humidity="23",temp="32.3" 1709547032123456000
```

0 comments on commit d0f505c

Please sign in to comment.