Support balancing amongst multiple Clickhouse shards (multiple DSNs in config) #189
Comments
I didn't get the data-flow architecture. Does it look like nginx -> otel-collectors -> ClickHouse? You should not need to load-balance ingestion to ClickHouse shards; a distributed ClickHouse cluster does that itself using a shard key. You should just need to declare multiple shards in the ClickHouse cluster config, and on k8s it is even simpler. I don't see the need to load-balance via nginx.

Are you trying to create a multi-region ClickHouse cluster and ingest data into the shards of the applications' region? Or do you have multiple ClickHouse clusters? Please share more details on the use case and the need, along with the data flow, to help us understand more. Also, share the infra where you are running SigNoz.
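For context on the "shard key" point above, this is roughly what declaring shards looks like on the ClickHouse side: a local table on every shard, plus a `Distributed` table that spreads inserts across shards by a sharding expression. This is a minimal hypothetical sketch, not SigNoz's actual schema; the cluster name `my_cluster`, database `telemetry`, and columns are all illustrative.

```sql
-- Local storage table, created on every node of the (hypothetical) cluster.
CREATE TABLE telemetry.spans_local ON CLUSTER my_cluster
(
    trace_id  String,
    ts        DateTime64(9),
    payload   String
)
ENGINE = MergeTree
ORDER BY (trace_id, ts);

-- "Router" table: the Distributed engine forwards each inserted row to a
-- shard chosen by the sharding expression (here, a hash of trace_id), so
-- writes sent to any single node get spread across the whole cluster.
CREATE TABLE telemetry.spans ON CLUSTER my_cluster
AS telemetry.spans_local
ENGINE = Distributed(my_cluster, telemetry, spans_local, cityHash64(trace_id));
```

With this in place, an ingestor only needs to write to `telemetry.spans` on any reachable node; ClickHouse handles the fan-out to shards.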
@srikanthccv Do you have context on this?
Hi! Happy to try and answer questions as I understand them, although I don't have the entire picture myself.

I want the usual things: mostly resilience against failure of a single node and/or maintenance downtime; avoiding write amplification, especially in cascading failure states (since this is telemetry data, I'll admit to a slight fear of backpressure causing the "problem-solving tool" to turn into a "problem-exacerbating tool" once everything's on fire); and query performance (we don't want some percentage of reads hitting a node that's overloaded with all of the entire cluster's writes). Also, we're not really sure exactly how much data we're going to want to ingest yet, but we're sure it will be A Lot, so there might be throughput limitations on a single node.

To be fair, we're entirely new to storing telemetry in ClickHouse; this is still an experimental foray. Our ClickHouse expertise lies in an entirely different dataset, with different constraints. Perhaps this is silly premature optimization, but it's also something worrying enough to us (and with which we have enough experience in another domain) to be a bit careful.

Arch questions: currently one cluster of ~80 nodes dedicated to telemetry work, all in a single DC; I doubt we'll stand up another.

Data-flow: For the moment, due to #156, we're playing with 5 non-Signoz
So, all very messy, and somewhat temporary; we're still feeling this technology out right now. 😓 Hope that answers your questions!
Just following up on this; let me know y'all's thoughts. (=
@ELLIOTTCABLE just to make sure I follow you: you have a big cluster with tens of nodes and want to balance the ingestion load across the nodes, as in not overload one node with the work of receiving and distributing the data to the other shards/servers. I am highlighting that to avoid any confusion with the data balancing that happens with the

I would prefer not to have the balancing logic in the exporter, mainly because
Adding HTTP protocol support should be fairly straightforward, and I think that is the best choice for supporting the use case.
We'd like to balance writes from the SigNoz ingestor to multiple Clickhouse shards.
Our preferred approach is to use the HTTP ClickHouse protocol behind NginX (see #188); but another alternative would be to explicitly balance in the clickhousetracesexporter across multiple ClickHouse-TCP connections.
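For the nginx-fronted HTTP approach described above, the load-balancing layer can be plain nginx config, since ClickHouse's HTTP interface listens on an ordinary port (8123 by default). The sketch below is hypothetical: the upstream name, shard hostnames, and tuning values are illustrative, not anything from the SigNoz deployment.

```nginx
# Hypothetical sketch: balance ClickHouse HTTP (default port 8123)
# across three shard hosts. Hostnames and timeouts are examples only.
upstream clickhouse_http {
    least_conn;                                        # prefer the least-busy shard
    server ch-shard-1:8123 max_fails=2 fail_timeout=10s;
    server ch-shard-2:8123 max_fails=2 fail_timeout=10s;
    server ch-shard-3:8123 max_fails=2 fail_timeout=10s;
}

server {
    listen 8123;
    location / {
        proxy_pass http://clickhouse_http;
        proxy_http_version 1.1;
    }
}
```

The exporter would then point its single HTTP DSN at the nginx listener, and a failed shard is taken out of rotation by `max_fails`/`fail_timeout` rather than by logic inside the exporter.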