Per-chunk metrics #12758

Open
jancionear opened this issue Jan 20, 2025 · 2 comments
Labels
A-observability Area: observability C-enhancement Category: An issue proposing an enhancement or a PR with one.

Comments

@jancionear
Contributor

jancionear commented Jan 20, 2025

nearcore has a lot of metrics which are scraped by Prometheus and presented on Grafana dashboards. These metrics provide some insight into what's going on in the binary, but they are not very precise. All metrics are currently aggregated over 1-minute periods, which is around 60 blocks. Aggregating over this many blocks provides high-level information, but it doesn't give much insight into individual chunks.

It would be great to have detailed metrics for every chunk: how many transactions were processed, how much gas was burned at each stage of chunk application, how many receipts were forwarded, which limits were hit, and so on. I would love to be able to use a neard command or a debug-ui page to view detailed chunk application metrics for any chosen chunk.
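To make the idea more concrete, here is a minimal sketch of what a per-chunk record could look like. Every name in it (`ChunkMetrics`, `ApplyStage`, `LimitHit`, the field names, and the particular stages and limits) is an assumption made up for illustration, not an existing nearcore type, and the real breakdown would probably look different.

```rust
use std::collections::BTreeMap;

/// Hypothetical stages of chunk application (illustrative only, not
/// nearcore's actual breakdown).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ApplyStage {
    Transactions,
    LocalReceipts,
    DelayedReceipts,
    IncomingReceipts,
}

/// Hypothetical limits that chunk application could run into.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LimitHit {
    GasLimit,
    ComputeLimit,
    WitnessSizeLimit,
}

/// Hypothetical metrics collected while applying a single chunk.
#[derive(Debug, Clone, Default)]
pub struct ChunkMetrics {
    pub block_height: u64,
    pub shard_id: u64,
    pub transactions_processed: u64,
    pub receipts_forwarded: u64,
    /// Gas burned, broken down by the stage that burned it.
    pub gas_burnt_by_stage: BTreeMap<ApplyStage, u64>,
    pub limits_hit: Vec<LimitHit>,
}

impl ChunkMetrics {
    /// Creates an empty record for the chunk at the given height and shard.
    pub fn new(block_height: u64, shard_id: u64) -> Self {
        Self { block_height, shard_id, ..Default::default() }
    }

    /// Adds `gas` to the running total for `stage`.
    pub fn record_gas(&mut self, stage: ApplyStage, gas: u64) {
        *self.gas_burnt_by_stage.entry(stage).or_insert(0) += gas;
    }

    /// Total gas burned across all stages.
    pub fn total_gas_burnt(&self) -> u64 {
        self.gas_burnt_by_stage.values().sum()
    }
}

fn main() {
    let mut metrics = ChunkMetrics::new(100, 0);
    metrics.transactions_processed = 12;
    metrics.record_gas(ApplyStage::Transactions, 300);
    metrics.record_gas(ApplyStage::DelayedReceipts, 700);
    metrics.limits_hit.push(LimitHit::GasLimit);
    println!("{:#?}", metrics);
    println!("total gas burnt: {}", metrics.total_gas_burnt());
}
```

A plain struct like this would be easy to print from a command or a test. How it would be serialized and stored is left open here.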

Advantages:

  • Easier to analyse runtime performance and receipt flow: Currently I often find myself adding random logs throughout the runtime to understand which receipts are being processed and how, but this is pretty annoying. It would be much better to view generic chunk application metrics and get the information from there. There are a lot of performance and scalability improvements on the roadmap, and per-chunk metrics would be very helpful for that work.
  • Easier to debug issues: With aggregated metrics it's often hard to say what happened at the level of an individual block or chunk. Per-chunk metrics could provide additional insight into issues that happen for a block or two. For example, we could see that the witness size was huge for one chunk on an epoch boundary.
  • Availability in tests: Prometheus metrics are not available in integration tests, which makes the tests hard to debug. Per-chunk metrics could be stored in the local database and printed out during tests, providing more information about what's going on in the test.
  • Custom aggregators: Grafana works fine, but I personally don't like the PromQL language. It's hard to make it do what I want, and I'm never really sure if I'm aggregating things the way I wanted to. With per-chunk metrics it would be easy to write custom aggregation code in Rust and present the results in some way.
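As a rough illustration of the debugging and custom aggregator points, the sketch below aggregates a window of 60 per-chunk records in plain Rust. The `ChunkRecord` type, its fields, and the sample numbers are all hypothetical and invented for this example. It assumes the records have already been loaded from wherever they end up being stored.

```rust
/// Hypothetical per-chunk record; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct ChunkRecord {
    pub block_height: u64,
    pub shard_id: u64,
    pub witness_size_bytes: u64,
}

/// Mean witness size over the window, similar to what an aggregated
/// dashboard metric would show. Returns `None` for an empty window.
pub fn mean_witness_size(records: &[ChunkRecord]) -> Option<f64> {
    if records.is_empty() {
        return None;
    }
    let total: u64 = records.iter().map(|r| r.witness_size_bytes).sum();
    Some(total as f64 / records.len() as f64)
}

/// The chunk with the largest witness, which an average cannot identify.
pub fn largest_witness(records: &[ChunkRecord]) -> Option<&ChunkRecord> {
    records.iter().max_by_key(|r| r.witness_size_bytes)
}

/// Made-up sample data: 60 consecutive chunks on one shard, where a single
/// chunk (height 130) has a much larger witness than the rest.
pub fn sample_window() -> Vec<ChunkRecord> {
    (100..160)
        .map(|height| ChunkRecord {
            block_height: height,
            shard_id: 0,
            witness_size_bytes: if height == 130 { 16_000_000 } else { 1_000_000 },
        })
        .collect()
}

fn main() {
    let records = sample_window();
    // The average over the window only moves from 1.0 MB to 1.25 MB...
    println!("mean witness size: {:?}", mean_witness_size(&records));
    // ...while the per-chunk view points directly at the outlier.
    println!("largest witness: {:?}", largest_witness(&records));
}
```

With this sample data the one-minute average rises only from 1,000,000 to 1,250,000 bytes, while the per-chunk records show a 16,000,000-byte witness at height 130. The same functions could run inside an integration test, where Prometheus is not available.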
@jancionear jancionear added A-observability Area: observability C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jan 20, 2025
@jancionear jancionear self-assigned this Jan 20, 2025
@walnut-the-cat
Contributor

Could you add high level engineering cost?

@jancionear
Contributor Author

> Could you add high level engineering cost?

I'd say about a week of work to get the basic functionality going. Things like integration with debug-ui or custom aggregators can be added later.
