
Commit

Merge pull request #41 from allegro-internal/how-to-save-money-on-maintaining-a-large-database

Post: From 3TB to 100GB: A Cost-Saving Journey in Database Maintenance
bgalek authored Jul 10, 2023
2 parents da792c9 + c46ad1c commit 6a1debf
Showing 12 changed files with 311 additions and 5 deletions.
16 changes: 11 additions & 5 deletions _data/members.yml
@@ -831,22 +831,22 @@ lukasz.solniczek:
name: Łukasz Solniczek
bio: iOS Mobile Software Engineer. Works on MBox - the Server-driven UI tool.
linkedin: lukaszsolniczek

aleksandra.osowska-kurczab:
name: Aleksandra Osowska-Kurczab
bio: Senior Research Engineer in the Machine Learning Research team, working on large-scale recommendation systems. She's pursuing her PhD in Computer Science at Warsaw University of Technology on deep learning in medical image analysis. AI enthusiast and geek.
linkedin: aleksandra-osowska-kurczab

jacek.szczerbinski:
name: Jacek Szczerbiński
bio: Jacek Szczerbiński obtained his PhD in Chemistry from ETH Zurich. He then fell in love with ML and became a Research Engineer at Allegro. Currently he is studying robustness of text classifiers against mislabeled training data. His superpower is explaining ML to non-technical people.
linkedin: szcz

klaudia.nazarko:
name: Klaudia Nazarko
bio: Research Engineer in the Machine Learning Research team working on large-scale recommendation systems.
linkedin: klaudianazarko

kalina.kobus:
name: Kalina Kobus
bio: Senior Research Engineer in the Machine Learning Research team, working on NLP in the e-commerce domain. Kalina Kobus obtained her PhD in Machine Learning from Poznan University of Technology. Her research interests include large-scale text classification, deep metric learning and the robustness of classifiers against mislabeled training data.
@@ -862,3 +862,9 @@ lukasz.rokita:
name: Łukasz Rokita
bio: Software Engineer working on Cloud and Big Data. Always on the lookout for simplicity and patterns to translate into architecture. Dedicated reader of all things software engineering and JVM enthusiast. Avid skier and table tennis player.
linkedin: łukasz-rokita-85779413a

mateusz.stolecki:
name: Mateusz Stolecki
bio: Senior Software Engineer at Allegro Pay. .NET developer, software and hardware enthusiast. Especially interested in performance optimization. Overclocker and PC builder in spare time.
linkedin: mateusz-stolecki-534340116
twitter: wertonez
296 changes: 296 additions & 0 deletions _posts/2023-07-10-save-money-on-large-database.md
@@ -0,0 +1,296 @@
---
layout: post
title: "From 3TB to 100GB: A Cost-Saving Journey in Database Maintenance"
author: [mateusz.stolecki]
tags: [tech,azure,sql,saving,cloud]
---

In the era of ubiquitous cloud services and the growing popularity of PaaS and serverless approaches, performance
and resources seem to matter less and less.
After all, we can scale horizontally and vertically at any time, without worrying about potential performance challenges
that the business may introduce.

However, there is another side to the coin: rising costs. While it can be argued that in many situations it is simply
cheaper to add another instance of the service than to engage a developer who will work tirelessly to diagnose
and optimize performance problems, the problem will persist and intensify as the business and its requirements grow.

A similar situation arises with databases. We often store huge amounts of data for auditing or historical purposes.
While the cost of maintaining such databases is negligible at a small scale,
over time it can become a notable burden on our budget.

In this post, I want to describe such a case and how we managed to reduce the cost of maintaining a database nearly 30-fold.

## The problem
As the amount of data grows, the need for scaling arises. In the case of **Azure** services, scaling also has its [limitations](https://learn.microsoft.com/en-us/azure/azure-sql/database/purchasing-models?view=azuresql).
It is not always possible to infinitely increase the available disk space without scaling other resources (CPU, RAM, I/O).
In our case, this limit became apparent when we exceeded 1TB of data. Our database was based on the vCore model,
where we used **4 vCores**.

Unfortunately, this number of vCores limited the available disk space to **1TB**. Due to the increase in the number of users
and the demand for disk space, we needed more resources. We continued to scale up, adding not only more disk resources
but also computational resources (at this point we had reached **3TB** of data, which requires
at least **12 vCores**). At some point, the monthly cost of maintaining the database amounted to several thousand euros.
This prompted us to look for solutions.

Comparing the cost of storing substantial amounts of data within **Azure SQL** and a **Storage Account**
(especially blobs in the **archive** tier), we concluded that we could achieve significant cost reduction
by archiving old/unused data and placing it in a cost-optimized container.

### Monthly cost of storing 3TB of data

<table>
<tr>
<th>Azure SQL 12vCore 3TB</th>
<th>Storage Account Archive tier</th>
</tr>
<tr>
<td>$2,876.18</td>
<td>$31.12</td>
</tr>
</table>

## Analysis
After some investigation, it turned out that a significant part of the data could be safely archived,
which would provide substantial savings and eliminate the problem of an overgrown database.
Most of this data was historical.

We implemented a solution that allows for much more scalable data archiving
by asynchronously loading data into the warehouse.
However, data from before the implementation of the aforementioned solution was still generating considerable storage costs.

The idea seemed simple both in concept and execution. However, we immediately encountered several problems.
Exporting such massive amounts of data is a time-consuming process and puts a heavy load on the database,
causing responsiveness issues.

Since we were dealing with a production system, we could not compromise the reliability and availability of our services.
In addition, the export functionality offered by the Azure portal is limited to databases up to **200GB** in size,
which meant that we had to look for another solution.

## Action plan
### Concept
As it turned out, there are ways to export even huge databases. After some investigation,
we found the **SQL Package** tool.
It provides an **export** option and is great for solving the aforementioned problem. It can produce a `bacpac`
file that contains the highly compressed content of the database.
The tool also allows you to restore the data at any time using the **import** operation,
if there is ever a need to review it, for example for audit purposes.

The next step is to copy the file to the container in the Storage Account using the **AzCopy** tool and ensure
that it is stored in the **ARCHIVE** tier, which will massively reduce the cost of storing it.

The final stage is to delete the unnecessary data from the database and then **SHRINK** it, which will allow us to scale down the database resources.

### Script and tools
To export and archive the database, we used two tools provided by Microsoft: [SQL Package](https://learn.microsoft.com/en-us/sql/tools/sqlpackage/sqlpackage?view=sql-server-ver16)
and [AzCopy](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10).

After analyzing their documentation, we prepared the appropriate procedure, taking
into account performance and operation duration.

### Infrastructure
Because exporting and uploading this amount of data to the Storage Account container may take
a long time, we decided to set up a temporary **VM** with the accelerated networking option, which we used
to execute all the required scripts. A dedicated virtual machine was also necessary because it had to be located
in an internal network from which the database server is reachable. Meeting this condition
allowed us to connect to the database and perform the export operation.

The virtual machine turned out to be moderately priced, as none of the performed operations were computationally demanding
(both CPU and RAM usage were low), which allowed us to use a very resource-efficient machine. The only notable extension
of its functionality is **accelerated networking**, since the whole process revolves around data transfer over the network
and we needed good throughput.
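
For illustration, a minimal sketch of provisioning such a machine with the Azure CLI might look as follows. This is only an assumption about the setup, not the exact commands we ran; the resource group, network, image and size names are placeholders.

```powershell
# Hypothetical sketch: a small VM with accelerated networking, placed in the
# internal VNet/subnet that can reach the database server.
az vm create `
    --resource-group $ResourceGroup `
    --name db-export-worker `
    --image Win2022Datacenter `
    --size Standard_D4s_v3 `
    --vnet-name $InternalVnetName `
    --subnet $InternalSubnetName `
    --accelerated-networking true `
    --admin-username $VmAdminName `
    --admin-password $VmAdminPassword
```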

## Testing
### Optimization
Before we proceeded with the implementation in the production environment, we conducted a series of
tests using test environments. They mainly involved running all the steps of the process using
data packages of approximately **50GB** and **200GB** in size.
We spent the majority of time testing and optimizing the use of the SQL Package tool.

Our goal was to shorten the export time and obtain an optimal size for the resulting file,
so it would not generate excessive costs due to the need to store it. We tested several scenarios
(mostly by manipulating the **compression level** parameter).

Compression in **FAST** mode was on average 10-20% faster than **MAXIMUM**, with the resulting file
size differing by less than 10%.

### Performance testing
We also tested the load on the databases in each environment.
**Data IO** and **CPU** load were measured using a test environment running on DTU-based infrastructure with **100 DTU**
units.

Data IO
![Data IO](/img/articles/2023-07-10-save-money-on-large-database/perf-test-dev-iops.png)

CPU
![CPU](/img/articles/2023-07-10-save-money-on-large-database/perf-test-dev-cpu.png)

Notice that the export operation primarily consumes IO resources.

### Data Import
Due to the possible need to reuse archived data, we had to make sure that the exported data was suitable for re-import.

Initially, we attempted to import the data using the **SQL Server Management Studio** tool provided by Microsoft.
Unfortunately, this attempt failed due to errors related to file reading during the import operation.
We made an additional attempt to import the archive using the SQL Package tool, which, in addition to the export option,
also provides an import option.

The following command solved the problem:

```powershell
sqlpackage /Action:Import `
/tsn:$ServerName `
/tdn:$DatabaseName `
/tu:$SqlAdminName `
/tp:$SqlAdminPassword `
/tec:true `
/ttsc:false `
/d:true `
/sf:$SourceFile `
/p:CommandTimeout=999 `
/p:LongRunningCommandTimeout=0 `
/p:DatabaseLockTimeout=-1
```

## Deployment
### Exporting the database using the SQL Package tool
The following script was executed, successfully extracting data from the database and creating the appropriate `bacpac` file.
As a result, we received a compressed file of around 100GB.
It is worth pointing out that data in the database occupied about 3TB, so compression was very efficient.
The whole process took several hours.

```powershell
sqlpackage `
/Action:Export `
/ssn:$ServerName `
/sdn:$DatabaseName `
/su:$SqlAdminName `
/sp:$SqlAdminPassword `
/sec:true `
/stsc:false `
/tf:$TargetFile `
/p:CompressionOption=Fast `
/p:CommandTimeout=999 `
/p:LongRunningCommandTimeout=0 `
/p:DatabaseLockTimeout=-1 `
/p:TempDirectoryForTableData=$TempDirectory `
/d:true `
/df:$SqlLogs
```

Many parameters of this operation were evaluated during trials on test environments.
The particularly important ones are:
- **CommandTimeout, LongRunningCommandTimeout, DatabaseLockTimeout** - This set of
parameters ensures that the connection
is maintained throughout the entire duration of the export operation (assuming that it will be long-running).
- **CompressionOption** - The degree of data compression in the output file.
Two variants were tested:
**FAST** and **MAXIMUM**.
**FAST** allowed us to shorten the export time by about 2 hours while showing only slightly lower
data compression (in our case, the difference was around 10%).

```powershell
/p:TableData="dbo.TestTable"
```

The **TableData** parameter allows us to limit the data export to selected tables only, which significantly shortens
the overall operation time. It is also worth mentioning that the parameter can be set multiple times, as shown in the sketch below.
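
For example, a hedged sketch of an export limited to just a couple of archival tables could look like this (the table names are placeholders, not our actual schema):

```powershell
# Hypothetical sketch: export only the data of selected tables.
# /p:TableData is repeated once per table to include.
sqlpackage `
/Action:Export `
/ssn:$ServerName `
/sdn:$DatabaseName `
/su:$SqlAdminName `
/sp:$SqlAdminPassword `
/tf:$TargetFile `
/p:CompressionOption=Fast `
/p:TableData="dbo.OrdersHistory" `
/p:TableData="dbo.AuditLog"
```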

Since the export was launched at night, the procedure had no negative impact on users. The impact of the
export operation on the database load (Data I/O percentage) is presented in the graph below. It can be observed that
the resource load increased during this operation.

![Data IO](/img/articles/2023-07-10-save-money-on-large-database/perf-xyz-export-iops.png)

### Copying the archived database using AzCopy
The following script was executed to copy the exported file to the Storage Account:

```powershell
.\azcopy `
copy `
$TargetFile `
"https://$StorageAccountName.blob.core.windows.net/$StorageContainerName/$StorageBlobName$SAS" `
--recursive `
--overwrite=true `
--blob-type=BlockBlob `
--put-md5 `
--log-level=info `
--block-blob-tier=archive
```

The process went quickly. Copying the 100GB file took only a few minutes, thanks to the high network throughput.
It is worth noting that the archive tier is set immediately.
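
Keep in mind that blobs in the archive tier are offline: before the `bacpac` file can be downloaded and re-imported, it has to be rehydrated to an online tier, which can take up to several hours. A minimal sketch using the Azure CLI (assuming Azure AD authentication; the names are the same placeholders as above) might look like this:

```powershell
# Hypothetical sketch: rehydrate the archived bacpac to the Hot tier before re-import.
az storage blob set-tier `
    --account-name $StorageAccountName `
    --container-name $StorageContainerName `
    --name $StorageBlobName `
    --tier Hot `
    --rehydrate-priority Standard `
    --auth-mode login
```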

### Conducting a SHRINK operation
The SHRINK operation is, unfortunately, required to downscale the Azure SQL database. It took several hours to complete.
**WAIT_AT_LOW_PRIORITY** was used to reduce the impact of this rather resource-intensive operation
on the database users.

``` sql
DBCC SHRINKDATABASE ([DB_NAME]) WITH WAIT_AT_LOW_PRIORITY
```

The performance chart (Data IO) during the above operation looked as follows:

We observed an unusual increase in Data IO operations during the SHRINK operation.
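
Since the operation ran for hours, it was useful to check how far along it was. A hedged sketch of monitoring its progress via the `sys.dm_exec_requests` DMV (SQL Server reports shrink as a `Dbcc...` command and exposes `percent_complete`):

``` sql
-- Check the progress of a running DBCC SHRINKDATABASE operation.
SELECT session_id,
       command,
       percent_complete,
       estimated_completion_time / 1000 / 60 AS estimated_minutes_remaining
FROM sys.dm_exec_requests
WHERE command LIKE 'Dbcc%';
```
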
### Performance analysis and index rebuild
This step appeared quite unexpectedly in our procedure. After performing the SHRINK operation and successfully
scaling down the database tier, we began to observe
the impact of our operations on performance.

To our concern, we observed a noticeable performance regression.
Endpoints that use the database on which we performed the **SHRINK** operation showed abnormally increased response times.

![RPS](/img/articles/2023-07-10-save-money-on-large-database/perf-xyz-rps-before-index.png)

The database load chart also did not look encouraging, with frequent peaks during query execution.

![IOPS](/img/articles/2023-07-10-save-money-on-large-database/perf-xyz-iops-before-index.png)

Attempts to scale the machine did not bring spectacular results and only increased costs (considering that our goal was
to lower them, it was not an optimal solution).

As it turned out, the culprit was extraordinarily high index fragmentation. The SHRINK operation had increased
fragmentation to over 90% for practically all existing indexes.
This forced us to consider rebuilding all of them.
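
To confirm this, index fragmentation can be inspected with the `sys.dm_db_index_physical_stats` DMV; a minimal sketch (no project-specific names are assumed, everything comes from the system views):

``` sql
-- List indexes with their current fragmentation, worst first.
SELECT OBJECT_NAME(ps.object_id) AS table_name,
       i.name                    AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id AND i.index_id = ps.index_id
WHERE ps.page_count > 1000          -- ignore tiny indexes
ORDER BY ps.avg_fragmentation_in_percent DESC;
```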

Even Microsoft recommends rebuilding indexes in their documentation [here](https://learn.microsoft.com/en-us/sql/relational-databases/databases/shrink-a-database?view=sql-server-ver16):

> Data that is moved to shrink a file can be scattered to any available location in the file.
> This causes index fragmentation and can slow the performance of queries that search a range of the index.
> To eliminate the fragmentation, consider rebuilding the indexes on the file after shrinking.

We decided to proceed with the above-mentioned index rebuild process. Here, we also applied possible optimizations
to avoid negative consequences for the availability of our services. The **ONLINE** option is particularly noteworthy,
as it ensures that existing indexes and tables will not be blocked, which is important given
the continuous operation of our services.

``` sql
ALTER INDEX ALL ON dbo.TableName REBUILD WITH
(FILLFACTOR = 80, SORT_IN_TEMPDB = ON, STATISTICS_NORECOMPUTE = ON, ONLINE = ON);
```
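
Rebuilding every table one by one can be tedious. A hedged sketch of generating the rebuild statements for all user tables from the system catalog (not necessarily how we ran it; the options simply mirror the command above):

``` sql
-- Generate an ALTER INDEX ... REBUILD statement for every user table;
-- the output can then be executed or fed into a maintenance job.
SELECT 'ALTER INDEX ALL ON ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) +
       ' REBUILD WITH (FILLFACTOR = 80, SORT_IN_TEMPDB = ON, STATISTICS_NORECOMPUTE = ON, ONLINE = ON);'
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE t.is_ms_shipped = 0
ORDER BY s.name, t.name;
```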

It should also be noted that this can be a time-consuming operation, but as a result of its execution,
the indexes returned to good shape, with fragmentation close to 0%.
The response times and resource consumption of the database also returned to values close to the initial ones.

![RPS](/img/articles/2023-07-10-save-money-on-large-database/perf-xyz-rps-after-rebuild.png)

![IOPS](/img/articles/2023-07-10-save-money-on-large-database/perf-xyz-iops-after-index.png)

## Conclusion
After performing all of the described actions, we achieved a reduction
in the size of the database from over 3TB to slightly below 100GB.
By lowering the required disk space, we could also significantly reduce the computational resources of the database,
generating further substantial savings.

Before performing all the operations,
the monthly cost of maintaining the database was close to €3000.
By switching from a database based on the 12 vCore and 3TB model to the Standard tier with 100 DTUs and 150GB of storage,
we managed to cut our monthly spending to a mere €125.
All in all, our effort paid off.

![Cost reduction](/img/articles/2023-07-10-save-money-on-large-database/montly-cost-reduction.png)

The above example demonstrates how to greatly reduce infrastructure costs. Of course,
the described procedure applies to specific cases and data characteristics.
However, if you face a similar problem, I think this approach is worth considering.
4 changes: 4 additions & 0 deletions authors/mateusz.stolecki/index.md
@@ -0,0 +1,4 @@
---
layout: author
author: mateusz.stolecki
---
Binary file added img/authors/mateusz.stolecki.jpg
