
Log raw events to a separate log file #4549

Merged
merged 32 commits into from
Jun 11, 2024

Conversation

belimawr
Contributor

@belimawr belimawr commented Apr 9, 2024

What does this PR do?

This commit introduces a new logger core, used when collecting logs from sub-processes, that can be configured through `logging.event_data`. It is used for any log message that contains a whole event or could contain sensitive data; such messages are marked by adding `log.type: event` to the log entry. The logger core is responsible for filtering the log entries and directing them to the correct files.

Why is it important?

Some Beats outputs will log raw event data on certain types of errors. Events can contain sensitive information that should not be present in the log files. This PR addresses the problem by logging event data to a separate log file.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Author's Checklist

The diagnostics command

The diagnostics collect command collects all events log files by default; there is a flag to opt out of this behaviour, which can be passed either via the CLI or via a Fleet action.

The logs command

The logs command can also read the events log files. It creates two streams for reading logs: one for the events logs and another for the normal logs. Both streams share the same settings, but the line count is independent, i.e., if you set -n 10, each stream will read 10 lines.

When reading a fixed number of lines (not in follow mode), the two streams get mixed together and the entries might not be fully ordered by time.

In follow mode, as new lines are added, the streams are correctly ordered.
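The independent per-stream line count described above can be sketched like this (illustrative only, not the agent's actual implementation): each stream gets its own tail of n lines, so `-n 2` yields up to 2 lines from the normal logs plus up to 2 from the events logs.

```go
package main

import (
	"fmt"
	"strings"
)

// lastN returns the final n lines of text (fewer if text is shorter).
func lastN(text string, n int) []string {
	lines := strings.Split(strings.TrimRight(text, "\n"), "\n")
	if len(lines) > n {
		lines = lines[len(lines)-n:]
	}
	return lines
}

func main() {
	normalLogs := "n1\nn2\nn3\n"
	eventLogs := "e1\ne2\ne3\ne4\n"
	// With -n 2, each stream independently yields its own last 2 lines.
	fmt.Println(lastN(normalLogs, 2)) // [n2 n3]
	fmt.Println(lastN(eventLogs, 2))  // [e3 e4]
}
```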

How to test this PR locally

  1. Package the Elastic-Agent
  2. Replace the Filebeat binary with the binary built from Log raw events to a separate log file beats#38767
  3. Create /tmp/flog.log with a few lines, the data is not important
  4. Start the Elastic-Agent with the following configuration (adjust if needed)
outputs:
  default:
    type: elasticsearch
    hosts:
      - https://localhost:9200
    username: elastic
    password: changeme
    preset: balanced
    ssl.verification_mode: none

inputs:
  - type: filestream
    id: your-input-id
    streams:
      - id: your-filestream-stream-id
        data_stream:
          dataset: generic
        paths:
          - /tmp/flog.log

# Disable monitoring so there are fewer Beats running and fewer logs being generated.
agent.monitoring:
  enabled: false
  logs: false
  metrics: false
  pprof.enabled: false
  use_output: default
  http: # Needed if you already have an Elastic-Agent running on your machine
    enabled: false
    port: 7002 

agent.grpc: # Needed if you already have an Elastic-Agent running on your machine
  address: localhost
  port: 7001

# This just reduces the amount of logs.
agent.logging.metrics.enabled: false

The easiest way to create ingest failures is to close the write index of the data stream. To do that, go to Kibana -> Dev Tools.

To get the backing index for a data stream:

GET /_data_stream/logs-generic-default

This will return something like:

{
  "data_streams": [
    {
      "name": "logs-generic-default",
      "timestamp_field": {
        "name": "@timestamp"
      },
      "indices": [
        {
          "index_name": ".ds-logs-generic-default-2024.01.22-000001",
          "index_uuid": "0pq-XIYfSjuUQhTxlJKJjQ",
          "prefer_ilm": true,
          "ilm_policy": "logs",
          "managed_by": "Index Lifecycle Management"
        }
      ]
    }
  ]
}

Take note of the index_name .ds-logs-generic-default-2024.01.22-000001.
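If you want to script this step, the backing index name can be extracted from the GET /_data_stream response shown above. This is a hedged stdlib sketch, not part of the PR:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dataStreamResp models just the fields we need from the
// GET /_data_stream/<name> response.
type dataStreamResp struct {
	DataStreams []struct {
		Name    string `json:"name"`
		Indices []struct {
			IndexName string `json:"index_name"`
		} `json:"indices"`
	} `json:"data_streams"`
}

// backingIndices returns the names of all backing indices in the response.
func backingIndices(body []byte) ([]string, error) {
	var resp dataStreamResp
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	var names []string
	for _, ds := range resp.DataStreams {
		for _, idx := range ds.Indices {
			names = append(names, idx.IndexName)
		}
	}
	return names, nil
}

func main() {
	body := []byte(`{"data_streams":[{"name":"logs-generic-default","indices":[{"index_name":".ds-logs-generic-default-2024.01.22-000001"}]}]}`)
	names, err := backingIndices(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(names[0]) // prints: .ds-logs-generic-default-2024.01.22-000001
}
```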
Close this index:

POST .ds-logs-generic-default-2024.01.22-000001/_close
  5. Add more data to the file /tmp/flog.log
  6. In the folder where you're running the Elastic-Agent, look for a log file in data/elastic-agent-<hash>/logs/events; the file name is something like elastic-agent-events-data-20240125.ndjson. You should see a log entry like this one:
{
  "log.level": "warn",
  "@timestamp": "2024-01-25T14:48:51.115+0100",
  "message": "Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2024, time.January, 25, 14, 48, 46, 614819591, time.Local), Meta:{\"input_id\":\"your-input-id\",\"raw_index\":\"logs-generic-default\",\"stream_id\":\"your-filestream-stream-id\"}, Fields:{\"agent\":{\"ephemeral_id\":\"a06806a9-f18d-4ffa-bee1-debcc15f7cf5\",\"id\":\"0ff4eb46-71e1-4c49-a921-3b984b303c0f\",\"name\":\"millennium-falcon\",\"type\":\"filebeat\",\"version\":\"8.13.0\"},\"data_stream\":{\"dataset\":\"generic\",\"namespace\":\"default\",\"type\":\"logs\"},\"ecs\":{\"version\":\"8.0.0\"},\"elastic_agent\":{\"id\":\"0ff4eb46-71e1-4c49-a921-3b984b303c0f\",\"snapshot\":false,\"version\":\"8.13.0\"},\"event\":{\"dataset\":\"generic\"},\"host\":{\"architecture\":\"x86_64\",\"containerized\":false,\"hostname\":\"millennium-falcon\",\"id\":\"851f339d77174301b29e417ecb2ec6a8\",\"ip\":[\"42.42.42.42\",,\"ec8a:fc90:d347:6316:116e:8a27:f731:08ff\"],\"mac\":[\"95-A2-37-0D-71-73\",],\"name\":\"millennium-falcon\",\"os\":{\"build\":\"rolling\",\"family\":\"arch\",\"kernel\":\"6.7.0-arch3-1\",\"name\":\"Arch Linux\",\"platform\":\"arch\",\"type\":\"linux\",\"version\":\"\"}},\"input\":{\"type\":\"filestream\"},\"log\":{\"file\":{\"device_id\":\"34\",\"inode\":\"172876\",\"path\":\"/tmp/flog.log\"},\"offset\":1061765},\"message\":\"154.68.172.7 - ritchie3302 [25/Jan/2024:14:10:52 +0100] \\\"HEAD /supply-chains/metrics/platforms HTTP/1.1\\\" 502 13383\"}, Private:(*input_logfile.updateOp)(0xc000fc6d20), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {\"type\":\"index_closed_exception\",\"reason\":\"closed\",\"index_uuid\":\"0pq-XIYfSjuUQhTxlJKJjQ\",\"index\":\".ds-logs-generic-default-2024.01.22-000001\"}, dropping event!",
  "component": {
    "binary": "filebeat",
    "dataset": "elastic_agent.filebeat",
    "id": "filestream-default",
    "type": "filestream"
  },
  "log": {
    "source": "filestream-default"
  },
  "log.origin": {
    "file.line": 461,
    "file.name": "elasticsearch/client.go",
    "function": "github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).bulkCollectPublishFails"
  },
  "log.type": "event",
  "ecs.version": "1.6.0",
  "log.logger": "elasticsearch"
}

Note the "log.type": "event" and that this log entry is not present in other log files or the logs that go to stdout/stderr.

Related issues


Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

Contributor

mergify bot commented Apr 9, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b event-logger-for-process upstream/event-logger-for-process
git merge upstream/main
git push upstream event-logger-for-process

Contributor

mergify bot commented Apr 9, 2024

This pull request does not have a backport label. Could you fix it @belimawr? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d./d is the label to automatically backport to the 8./d branch, where /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip label Apr 9, 2024
@belimawr belimawr added skipped-test This change disables one or multiple tests Team:Elastic-Agent Label for the Agent team labels Apr 9, 2024
@belimawr belimawr changed the title Use event logger from logp Log raw events to a separate log file Apr 12, 2024
Contributor

mergify bot commented Apr 15, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b event-logger-for-process upstream/event-logger-for-process
git merge upstream/main
git push upstream event-logger-for-process

Contributor

mergify bot commented Apr 23, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b event-logger-for-process upstream/event-logger-for-process
git merge upstream/main
git push upstream event-logger-for-process

@belimawr belimawr force-pushed the event-logger-for-process branch 2 times, most recently from 4141bab to 5a51e88 Compare April 25, 2024 13:04
@belimawr belimawr marked this pull request as ready for review April 26, 2024 20:13
@belimawr belimawr requested review from a team as code owners April 26, 2024 20:13
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@pierrehilbert
Contributor

@bturquet could we please have someone to review here?

@pierrehilbert pierrehilbert requested review from blakerouse and removed request for michel-laterman April 29, 2024 18:00
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Apr 29, 2024
belimawr and others added 7 commits June 5, 2024 14:59
ZipLogsWithPath needs to exclude the folder 'events', so we cannot
suffix it with '/'. Comments explaining the situation are also added.

zipLogsAndAssertFiles now calls t.Helper().
Stop calling `paths.SetTop()` in the diagnostics test. This refactoring passes arguments and fields instead of modifying the global state, starting from the test in `internal/pkg/diagnostics/diagnostics_test.go` up to the cobra command.
@cmacknz
Member

cmacknz commented Jun 5, 2024

They have their own, hardcoded log rotation strategy, currently it's set to 5Mb, 2 files: https://github.com/elastic/elastic-agent-libs/blob/d9dd0814b969adb584cc7b52e87cbe6c0eee605a/logp/config.go#L94-L114

👍 we just need to document this then

@belimawr
Contributor Author

belimawr commented Jun 5, 2024

@cmacknz, I checked and setting agent.logging.event_data.to_stderr: true in the elastic-agent.yml works as expected.

Passing -E agent.logging.event_data.to_stderr=true does not work because the Elastic-Agent does not support the -E flag like Beats do 🤯

I tested on a Fleet-managed agent (enrolled, not installed) and it did not respect the setting in elastic-agent.yml. I'll create an issue to make it configurable for Fleet-managed agents.

@belimawr
Contributor Author

belimawr commented Jun 5, 2024

I got a bit puzzled about why agent.logging.event_data.to_stderr: true is working, and I actually ended up making it fully configurable by adding it to the main agent config struct 🤦‍♂️.

This raises the question: do I leave it like this (fully configurable, like in Beats) and document it, or do I remove the configurability and just introduce a way to send events to stderr for use in Kubernetes/Docker/container environments?

Anyway, none of the logging settings, aside from the level, are configurable by Fleet-managed agents.

@cmacknz
Member

cmacknz commented Jun 6, 2024

This raises the question: Do I leave it like this (fully configurable like in Beats) and document it or do I remove the configuration ability and just introduce a way to send events to stderr for use in Kubernetes/Docker/container environments?

Having a configuration option in our config file is the starting point; you should test that the Fleet overrides API allows setting it and that the agent respects the setting. It can be exposed in the UI as a follow-up.

To make it easy to use on k8s or in containers in general, allowing the setting to be set via an environment variable would be nice. This will save users from having to deal with mounting a config map to override the default configuration file in the container.

Contributor

@blakerouse blakerouse left a comment


Overall this looks good. Just an issue with using the errCh.

internal/pkg/agent/cmd/logs.go (outdated review thread, resolved)
Only send an error or `nil` when reading logs in the logs command.
@belimawr
Contributor Author

belimawr commented Jun 6, 2024

I created an issue to track the container/Fleet-Managed case: #4874

@belimawr belimawr enabled auto-merge (squash) June 7, 2024 14:45
@belimawr
Contributor Author

belimawr commented Jun 7, 2024

I've enabled auto merge.

@belimawr
Contributor Author

belimawr commented Jun 7, 2024

Docs PR: elastic/ingest-docs#1053

@jlind23
Contributor

jlind23 commented Jun 10, 2024

@tetianakravchenko @gizas @MichaelKatsoulis as you are codeowners could you please review this?

@jlind23
Contributor

jlind23 commented Jun 10, 2024

@pierrehilbert @belimawr if not approved by o11y in the next couple days we should most probably bypass this protection.
Cc @cmacknz

Labels
backport-skip skipped-test This change disables one or multiple tests Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
9 participants