Support balancing amongst multiple Clickhouse shards (multiple DSNs in config) #189

Open
ELLIOTTCABLE opened this issue Oct 10, 2023 · 5 comments · May be fixed by #190

Comments

@ELLIOTTCABLE

We'd like to balance writes from the SigNoz ingestor to multiple Clickhouse shards.

Our preferred approach is to use the ClickHouse HTTP protocol behind nginx (see #188); an alternative would be to balance explicitly in the clickhousetracesexporter across multiple ClickHouse TCP connections.

@ankitnayan
Contributor

ankitnayan commented Oct 12, 2023

I didn't get the dataflow architecture. Does it look like nginx -> otel-collectors -> Clickhouse?

You should not need to load-balance ingestion to ClickHouse shards. The distributed ClickHouse cluster does that using a shard key. You should just need to declare multiple shards in the ClickHouse cluster config, and for k8s it is even simpler.

I don't see the need to load-balance via nginx. Are you trying to create a multi-region ClickHouse cluster and ingest data into the shards in the same region as the applications? Or do you have multiple ClickHouse clusters?

Please share more details on the use case and the need along with data flow to help us understand more. Also, share the infra where you are running SigNoz.

@makeavish
Member

@srikanthccv Do you have context on this?

@ELLIOTTCABLE
Author

ELLIOTTCABLE commented Oct 13, 2023

Hi! Happy to try and answer questions as I understand them; although I don't have the entire picture by myself …

I want the usual things: mostly resilience against failure of a single node and/or maintenance downtime; avoiding write amplification, especially in cascading failure-states (since this is telemetry data, I'll admit to a slight fear of backpressure causing the "problem solving tool" to turn into a "problem exacerbating tool" once everything's on fire); query perf (don't want some %age of reads hitting a node that's overloaded with the entire cluster's writes) … also, we're not really sure exactly how much data we're going to want to be ingesting yet, but we're sure it will be A Lot; so there might be throughput limitations on a single node.

To be fair, we're entirely new to storing telemetry in Clickhouse, this is still an experimental foray; our CH expertise lies in an entirely different dataset, with different constraints. Perhaps this is silly premature optimization; but it's also something worrying-enough to us (and with which we have enough experience in another domain) to be a bit careful.

Arch questions: currently one cluster of ~80 nodes dedicated to telemetry work, all in a single DC; I doubt we'll stand up another.

Data-flow: For the moment, due to #156, we're playing with 5 non-Signoz otel-collector nodes sitting in front of a single signoz-otel-collector.

  1. boxes, in whatever region, with roles we want producing telemetry have an OTLP-HTTP port opened on localhost and proxied-and-balanced amongst the 5 otel-collectors running in the telemetry datacenter;
  2. those are (jankily and incompletely, right now) theoretically going to run the actual pipelines, aggregating and filtering;
  3. which themselves front (again, temporarily, over OTLP-HTTP) the single signoz-otel-collector,
  4. … and that, I most recently had writing to the local TCP port of the clickhouse-shard it's sharing a box with. (Hopefully, once #190, "tracesexporter: Allow multiple datasources in configuration, and round-robin load-balance amongst them", lands, it will be writing to ~5 write-shards.)

So, all very messy, and somewhat temporary; we're still feeling this technology out right now. 😓 Hope that answers your questions!

@ELLIOTTCABLE
Author

Just following up on this; let me know y'all's thoughts. (=

@srikanthccv
Member

@ELLIOTTCABLE just to make sure I follow you: you have a big cluster with tens of nodes and want to balance the ingestion load across the nodes, as in, not overload one node with the work of receiving the data and distributing it to the other shards/servers. I am highlighting that to avoid any confusion with the data balancing that happens with the Distributed table.

I would prefer not to have the balancing logic in the exporter, mainly because

  1. it's not the primary goal of the exporter; it should just receive the telemetry data, encode it, and send it to the destination.
  2. introducing this adds maintenance burden
  3. there is a PR for round-robin, and eventually there will be requests to support other techniques; any effort in that direction will end up as a poor and buggy implementation
  4. it's not complicated to support HTTP

Adding HTTP protocol support should be fairly straightforward, and I think that is the best choice for supporting this use case.

4 participants