diff --git a/_data/members.yml b/_data/members.yml index 0a4691c7..8e021cd1 100644 --- a/_data/members.yml +++ b/_data/members.yml @@ -925,6 +925,12 @@ waldemar.panas: github: wpanas role: Former editor +jakub.demianowski: + name: Jakub Demianowski + bio: Kuba works at Allegro crafting recommendation systems, leveraging microservices, big data, and MLOps. Outside of his technical explorations, he is an avid wall climbing enthusiast. + linkedin: jakub-demianowski + github: jdemianowski + +jakub.dropia: + name: Jakub Dropia + bio: > diff --git a/_posts/2024-06-20-cost-optimization-data-pipeline-gcp.md b/_posts/2024-06-20-cost-optimization-data-pipeline-gcp.md new file mode 100644 index 00000000..4d2648ed --- /dev/null +++ b/_posts/2024-06-20-cost-optimization-data-pipeline-gcp.md @@ -0,0 +1,359 @@ +--- +layout: post +title: "A Mission to Cost-Effectiveness: Reducing the Cost of a Single Google Cloud Dataflow Pipeline by Over 60%" +author: jakub.demianowski +tags: [ tech, big data ] +--- +In this article we’ll present methods for efficiently optimizing physical resources and fine-tuning the configuration of a Google Cloud Platform (GCP) +Dataflow pipeline in order to achieve cost reductions. +The optimization will be presented as a real-life scenario, carried out in stages. + +Before we start, let’s introduce several avenues through which the cost of Big Data pipelines can be significantly reduced. +These include: + +1. Careful optimization of consumed physical resources, like choosing VM types with an optimal CPU to memory ratio and a cost-effective CPU type. +2. Enhancing the configuration of the data processing engine to maximize its efficiency. +3. Optimizing input and output datasets. Not all data may need processing, or perhaps altering their structure could reduce the processing time. +4. Refining storage strategies for input and output datasets. This is particularly beneficial if reading or writing speeds are suboptimal and demand improvements. +5. Streamlining our pipeline code and utilizing built-in optimization features (for example, broadcast joins and repartitioning in Apache Spark). + +Throughout this article we will focus solely on optimizing consumed physical resources (1st point) and enhancing the configuration of the data processing engine (2nd point). + +## About the data pipeline being optimized + +The data pipeline which will serve as an example throughout this article is written in Apache Beam using the Python SDK. +The pipeline runs on the Google Cloud Dataflow processing engine. +The goal of the pipeline is to join a couple of tables (most of them a terabyte+ in size), apply some transformations and produce a unified output table. + +The overall processing cost of the full dataset is around $350 per day. +That amounts to roughly $10,500 per month and $127,000 per year. + +## Approach to cost optimization + +At the beginning of the cost optimization, let's draft a couple of hypotheses: + +- Physical resources are underutilized. +- Physical resources do not have the best price-to-performance ratio. +- The configuration of the Dataflow job is suboptimal and could be improved. + +My goal will be to check those hypotheses. + +During the testing phase I'll use a 3% subsample of the input datasets. As a result, I will be running tests with an input size at the ~100 GB level. +This way I'll limit the cost of the tests and significantly reduce their duration. Final tests will be performed on the full dataset, not on a limited subsample.
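+
+To keep test runs cheap and reproducible, one simple option is to subsample the input deterministically on the join key before the heavy processing starts, so the same entities are kept across all joined tables. Below is a minimal sketch of that idea using the Beam Python SDK; the table name, join key (`user_id`), project and bucket are hypothetical placeholders, not the actual pipeline code.
+
+```python
+import hashlib
+
+import apache_beam as beam
+from apache_beam.options.pipeline_options import PipelineOptions
+
+SAMPLE_PERCENT = 3  # test runs use roughly 3% of the input
+
+
+def keep_in_sample(row, percent):
+    # Hash the join key instead of sampling randomly, so the same entities
+    # survive in every input table and join selectivity is preserved.
+    digest = hashlib.md5(str(row["user_id"]).encode("utf-8")).hexdigest()
+    return int(digest, 16) % 100 < percent
+
+
+def run():
+    options = PipelineOptions(flags=[
+        "--runner=DataflowRunner",
+        "--project=my-project",                # placeholder
+        "--region=europe-west1",
+        "--temp_location=gs://my-bucket/tmp",  # placeholder
+    ])
+    with beam.Pipeline(options=options) as pipeline:
+        sampled = (
+            pipeline
+            | "ReadInput" >> beam.io.ReadFromBigQuery(table="my_dataset.big_table")  # placeholder
+            | "Sample3Percent" >> beam.Filter(keep_in_sample, SAMPLE_PERCENT)
+        )
+        # ... the joins and transformations under test would follow here,
+        # consuming `sampled` instead of the full input ...
+
+
+if __name__ == "__main__":
+    run()
+```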
+ +In order to save time and resources I've made some speculative choices regarding what I should test during optimization. +In addition, I’ve decided not to test all the possible combinations of machine families, disk types and configuration options to save time. +I will try to stick with the most promising choices and omit testing unpromising configurations. + +## Hypothesis testing: physical resources are underutilized + +In our initial configuration we used the following type of worker machines: + +- Machine type: [n2-standard-4](https://cloud.google.com/compute/docs/general-purpose-machines#n2_series) (4 vCPU, 16 GB of memory) +- Disk size: 100 GB +- Disk type: HDD +- Max. worker nodes: 500 +- Autoscaling algorithm: throughput based + +I made a decision to focus on CPU and memory utilization first. + +### CPU utilization +I checked if CPU utilization was on an acceptable level, and it was. + +The following diagram from Dataflow UI presents CPU utilization on All Workers in terms of the CPU utilization +for all cores on a single App Engine flexible instance. +So it gives us an idea of how the CPU is utilized on each virtual machine. + + + +We could also take a look at the same data presented in terms of statistical metrics. + + + +From the given graph I could see that mean utilization of the CPU was around 85%, which is a good score. +The result is affected by two shuffle stages, when we need to send data around the cluster (usually network is a small bottleneck here). +CPU tends to be idle while shuffling data using Dataflow [Shuffle Service](https://cloud.google.com/dataflow/docs/shuffle-for-batch). + +So CPU resources are not underutilized. We use almost all of what we pay for. + +### Memory utilization + +In the end I checked memory usage. I saw that we did not use all the memory which we were paying for. +Let’s take a look at the following two graphs. +The first one shows maximal memory utilization among all the workers. + + +The second one shows memory utilization statistics among all the worker nodes. + + +The first one presents average memory usage on a worker node, the second one presents overall memory usage among the whole cluster. +We clearly see that we only use around 50% of the memory. Bingo, we pay for memory that we do not use. + +### Improving memory utilization + +Usually there are two ways of improving memory utilization: + +- Change CPU to memory ratio on worker nodes. +- Decrease the amount of worker nodes. + +I’ve decided to change the CPU to memory ratio rather than decrease the number of worker nodes. +I did not want to compromise on scalability and time needed to perform a job. + +Test on a 3% subsample of input data has given the following cost of data processing: + +- n2-standard-4: $9.48 +- n2-highcpu-8: $8.52 (~ 10% less than original price) +- n2d-highcpu-8: $8.57 (~ 10% less than original price) + +We saved 10% on adjusting CPU and memory ratio. +It results in around $12,700 of estimated savings per year (10% of $127,000 annual cost). + + + + + + + + + + +
HypothesisSavings[^1]
[1] Physical resources are underutilized$12,700
+ +## Hypothesis testing: physical resources has not the best price-to-performance ratio + +I assumed that the current virtual machine type (n2-standard-4) has not the best price-to-performance ratio. +To check performance of different virtual machine types I used [CoreMark scores provided by Google Cloud itself](https://cloud.google.com/compute/docs/benchmarks-linux). + +Based on CoreMark scores and official Google Cloud VM pricing, I prepared a table which would help me choose the VM type with the best price-to-performance ratio. +The most important column is “price per 1 mln points” — how much I need to pay on average to score 1 mln points. +I used [official VM instance prices from Google Cloud site](https://cloud.google.com/compute/vm-instance-pricing) from region europe-west1. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Virtual Machine TypePoints in ScoreMark[^2]Price per hour[^3]Price per 1 mln points
n2-standard-466 833 pts$0.21$3.20
n2-standard-8138 657 pts$0.43$3.08
n2d-standard-8164 539 pts$0.37$2.26
e2-standard-8103 808 pts$0.29$2.84
t2d-standard-8237 411 pts$0.37$1.57
+ +As we see, another hypothesis proved to be true. We’re not using the virtual machine type with the best price-to-performance ratio - T2D. +We’re using N2 machine type. + +Unfortunately, at the time of writing, T2D machines do not provide CPU to memory ratios other than 3 GB per 1 vCPU. +It’s still better than 4 GB per 1 vCPU, but far from 1 or 2 GB per 1 vCPU. +We will check if T2D virtual machine type with 4 GB of memory per 1 CPU is cheaper than its counterparts. + +### Moving to a virtual machine type with better price-to-performance ratio + +I performed several tests on a small scale (3% subsample of input data) with T2D machine types. Let’s take a look at them. + +- n2-standard-4 + HDD: $9.48 +- n2-highcpu-8 + HDD: $8.52 (~ 10% less than original price) +- n2d-highcpu-8 + HDD: $8.57 (~ 10% less than original price) +- t2d-standard-8 + HDD: $6.65 (~ 32% less than original price) + +This way we decreased the estimated processing cost from $127,000 by $40,640 per year to $86,360 (by 32%)[^1]. +Unfortunately, we also introduced some possible underutilized resources (memory) by changing CPU to memory ratio. + + + + + + + + + + + + + + +
HypothesisSavings[^1]
[1] Physical resources are underutilized$12,700
[2] Moving to a more cost-effective VM type$27,940
+ +Total: $40,640 of estimated savings[^1] + +### Coming back to optimization of virtual machine storage type + +As I found the most suitable virtual machine type, I was able to focus on choosing between SSD and HDD disk types. +As we all know, HDDs are much slower than SSDs, especially in terms of random read/write. +For processes where we do not heavily use storage I/O operations there’s no need to move to more expensive SSDs. + +I decided to check if we should use cheaper and slower HDDs or more expensive and faster SSDs. +I run the pipeline (3% of input data size) with HDD and SSD disks. +Here are the results for different VM families: + +- n2-standard-4 + HDD: $9.48 +- n2-highcpu-8 + HDD: $8.52 (~ 10% less than original price) +- n2d-highcpu-8 + HDD: $8.57 (~ 10% less than original price) +- t2d-standard-8 + HDD: $6.65 (~ 32% less than original price) +- t2d-standard-8 + SSD: $5.64 (~ 41% less than original price) + +This way we decreased the estimated processing cost from $127,000 by $52,070 per year to $74,930 (by 41%)[^1]. + + + + + + + + + + + + + + + + + + +
HypothesisSavings[^1]
[1] Physical resources are underutilized$12,700
[2] Moving to a more cost-effective VM type$27,940
[3] Changing VM disk type to SSD$11,430
+ +Total: $52,070 of estimated savings[^1] + +## Hypothesis testing: configuration of the Dataflow job is not optimal + +Dataflow, in comparison to Apache Spark, leaves us with almost no configuration options to be changed. +It’s good because in Dataflow you get very decent out-of-the-box settings. +The single option which I wanted to tune was if we should use Shuffle Service. +Shuffle Service is a serverless tool that facilitates data shuffling around the cluster, thus relieving worker nodes from this task. +Also, node preemption is not so painful because Shuffle Service stores data in external storage independent of worker nodes. +But it comes at a price. + +Cost breakdown of processing 3% input dataset using virtual machine t2d-standard-8 with SSD disk is presented below[^3]: + +- Cost of CPU: $2.47 +- Cost of memory: $0.70 +- Cost of SSD disk: $0.16 +- Cost of shuffle service: $2.32 +- Overall cost: $5.64 + +Thus, we see that the cost of the shuffle service plays an important role - it’s more than 40% of the overall cost. Let’s do an experiment and turn Shuffle Service off. + +- n2-standard-4 + HDD: $9.48 (original configuration) +- t2d-standard-8 + SSD: $5.64 (~ 41% less than original configuration) +- t2d-standard-8 + SSD + no Shuffle Service: $3.95 (~ 58% less than original configuration) + +By turning off Shuffle Service we achieved a much lower cost. +As a bonus, our memory utilization increased to almost 100%, because we use worker nodes to perform a shuffle. +So we eliminated an underutilized T2D issue connected with a CPU to memory ratio. +Node preemption is not a problem since we’re not utilizing preemptible VMs. + +I must also add that turning off external shuffle service may not always result in lower cost. +It depends on many factors, and you should test it on your own data pipeline. +Also, you need to take into consideration that the job will usually require more resources (CPU, memory) once you turn off external shuffle service. + +This way we decreased the estimated processing cost from $127,000 by $73,660 per year to $53,340 (by 58%)[^1]. +So it’s now less than half of the initial cost[^1]. + + + + + + + + + + + + + + + + + + + + + + +
HypothesisSavings[^1]
[1] Physical resources are underutilized$12,700
[2] Moving to a more cost-effective VM type$27,940
[3] Changing VM disk type to SSD$11,430
[4] Turning off Shuffle Service$21,590
+ +Total: $73,660 of estimated savings[^1] + +Note: Why did we not use Dataflow [FlexRS](https://cloud.google.com/dataflow/docs/guides/flexrs) which could lower the processing price by combining preemptible and regular VMs? + +We did not test it due to how scheduling in FlexRS works. +When you schedule a Dataflow FlexRS job you do not know the exact start time, +the only one promise from FlexRS is that the job will start within 6 hours ([documentation notes from Google Cloud website on that](https://cloud.google.com/dataflow/docs/guides/flexrs)). +Our data pipeline must start at specified time and having a 6 hour delay is not acceptable. + +## Final test on a full dataset + +My last task was to test findings from subsampled input dataset (3%) tests on the full dataset (without subsampling). +Here are the costs of processing a full dataset for a single day: + + + + + + + + + + + + + + +
ConfigurationProcessing cost for one day on a full dataset
n2-standard-4 + HDD$350.02
t2d-standard-8 + SSD + shuffle service turned off$134.14 (~ 62% less than original price)
+ +As we see, the predicted gain from subsampling was achieved, and the savings are even 3 percentage points higher than estimated. +For reference: based on runs with 3% of the input size, we estimated that we would achieve a roughly 58% cost reduction. + +- Initial annual cost: $127,000 +- Estimated annual cost after optimization: $48,260 +- Total estimated annual savings: $78,740 + +The presented figures are only estimates based on a single run and extrapolated to the whole year. +To know the exact savings, we would need to run the processing pipeline for a full year, which hasn’t been done yet. + +## Summary + +We achieved an excellent outcome without even touching the processing code. +The speculative approach provided good results. +There may still be some room for optimization, but within the timeframe I was given, I consider these results first-rate and see no reason to further +optimize the environment and configuration of the Dataflow job. + +Also, the strategies described here do not have to lead to cost reductions in other pipelines. +As every data pipeline is different, some changes which brought cost reduction in this example may result in increased processing cost in other data pipelines. +What is most important in this article is how to approach cost optimization of a data pipeline, not which type of resources to choose. + +[^1]: The presented figures are only estimates based on a single run (with only 3% of input data) and extrapolated to the whole year, with the assumption that processing the whole dataset will result in the same relative savings as processing 3% of source data. +[^2]: CoreMark results from [CoreMark scores provided by Google Cloud itself](https://cloud.google.com/compute/docs/benchmarks-linux), retrieved on 04/05/2024. +[^3]: Official prices taken from the Google Cloud site, VM instance pricing in region europe-west1, retrieved on 04/05/2024.
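+
+For completeness, below is a minimal sketch of how the final worker configuration could be expressed as Dataflow pipeline options in the Beam Python SDK. The project, bucket and zone are placeholders; `shuffle_mode=appliance` is the experiments flag that opts a batch job out of the external Shuffle Service so that shuffling happens on the workers instead.
+
+```python
+from apache_beam.options.pipeline_options import PipelineOptions
+
+# Final worker configuration described in this article, expressed as
+# Dataflow options. Project, bucket and zone are placeholders.
+options = PipelineOptions(flags=[
+    "--runner=DataflowRunner",
+    "--project=my-project",
+    "--region=europe-west1",
+    "--temp_location=gs://my-bucket/tmp",
+    "--worker_machine_type=t2d-standard-8",
+    "--disk_size_gb=100",
+    # A full resource path is used to select SSD persistent disks.
+    "--worker_disk_type=compute.googleapis.com/projects/my-project/zones/europe-west1-b/diskTypes/pd-ssd",
+    "--max_num_workers=500",
+    "--autoscaling_algorithm=THROUGHPUT_BASED",
+    # Opt the batch job out of the external Shuffle Service.
+    "--experiments=shuffle_mode=appliance",
+])
+```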
diff --git a/authors/jakub.demianowski/index.md b/authors/jakub.demianowski/index.md new file mode 100644 index 00000000..5b20f629 --- /dev/null +++ b/authors/jakub.demianowski/index.md @@ -0,0 +1,4 @@ +--- +layout: author +author: jakub.demianowski +--- diff --git a/css/blog.scss b/css/blog.scss index 6cddd7bf..876c37de 100644 --- a/css/blog.scss +++ b/css/blog.scss @@ -356,6 +356,10 @@ figcaption { margin: auto; } +.post-content img.image-with-frame{ + box-shadow: 0 0 4px 0 #D7DBD6; +} + .post-content blockquote { border-left: 4px solid #e8e8e8; padding-left: 20px; diff --git a/img/articles/2024-06-20-cost-optimization-data-pipeline-gcp/01_cpu_utilization_all_workers.png b/img/articles/2024-06-20-cost-optimization-data-pipeline-gcp/01_cpu_utilization_all_workers.png new file mode 100644 index 00000000..a970fd76 Binary files /dev/null and b/img/articles/2024-06-20-cost-optimization-data-pipeline-gcp/01_cpu_utilization_all_workers.png differ diff --git a/img/articles/2024-06-20-cost-optimization-data-pipeline-gcp/02_cpu_utilization_stats.png b/img/articles/2024-06-20-cost-optimization-data-pipeline-gcp/02_cpu_utilization_stats.png new file mode 100644 index 00000000..3e3a074d Binary files /dev/null and b/img/articles/2024-06-20-cost-optimization-data-pipeline-gcp/02_cpu_utilization_stats.png differ diff --git a/img/articles/2024-06-20-cost-optimization-data-pipeline-gcp/03_memory_utilization_max_usage.png b/img/articles/2024-06-20-cost-optimization-data-pipeline-gcp/03_memory_utilization_max_usage.png new file mode 100644 index 00000000..9cdeb9de Binary files /dev/null and b/img/articles/2024-06-20-cost-optimization-data-pipeline-gcp/03_memory_utilization_max_usage.png differ diff --git a/img/articles/2024-06-20-cost-optimization-data-pipeline-gcp/04_memory_utilization_summary.png b/img/articles/2024-06-20-cost-optimization-data-pipeline-gcp/04_memory_utilization_summary.png new file mode 100644 index 00000000..41ba2259 Binary files /dev/null and b/img/articles/2024-06-20-cost-optimization-data-pipeline-gcp/04_memory_utilization_summary.png differ diff --git a/img/authors/jakub.demianowski.jpg b/img/authors/jakub.demianowski.jpg new file mode 100644 index 00000000..c7cf98f7 Binary files /dev/null and b/img/authors/jakub.demianowski.jpg differ