Introduce the chunk deduplication process and algorithm
Signed-off-by: Zhao Yuan <[email protected]>
1 parent
6e07a4a
commit 7fc7008
Showing 1 changed file with 116 additions and 0 deletions.
@@ -0,0 +1,116 @@
# Problem introduction
Container images often contain many duplicate files or duplicated content, and these duplicates occupy a large amount of storage space, especially in high-density deployment scenarios. As the number of Nydus images grows, this leads to low storage utilization and excessive bandwidth consumption. An effective deduplication mechanism is needed to address this.

Unlike traditional OCI images, which are distributed at layer granularity, the smallest unit of a Nydus image is a chunk, so deduplication must operate on chunks. We also want to deduplicate along more than one axis: between different Nydus images and between different versions of the same Nydus image. Either way, the principle is the same: keep only one copy of each duplicated chunk and replace the other copies with references to it. This reduces storage usage and the amount of data transferred, and improves image access speed and efficiency.
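
The idea of keeping one copy of each chunk and replacing the duplicates with references can be sketched as follows. This is only an illustration, not the Nydus implementation: the function name `dedup_chunks`, the image names, and the short digests are all hypothetical.

```python
from typing import Dict, List, Tuple


def dedup_chunks(
    images: Dict[str, List[str]],
) -> Tuple[Dict[str, int], Dict[str, List[int]]]:
    """Keep one stored copy per chunk digest; images hold references to it."""
    store: Dict[str, int] = {}       # digest -> index of the single stored copy
    refs: Dict[str, List[int]] = {}  # image name -> references into the store
    for name, digests in images.items():
        refs[name] = []
        for digest in digests:
            if digest not in store:
                store[digest] = len(store)  # first occurrence: store the chunk
            refs[name].append(store[digest])  # every occurrence: keep a reference
    return store, refs


# Hypothetical chunk digests, for illustration only.
images = {
    "redis:7.0.1": ["a", "b", "c"],
    "redis:7.0.2": ["a", "b", "d"],
}
store, refs = dedup_chunks(images)
print(len(store), refs)
```

Six chunk occurrences across the two versions collapse to four stored chunks, because the second version only references the chunks it shares with the first.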
# General idea
The deduplication algorithm first selects the duplicate chunks in the images based on image information such as the number of occurrences of each chunk, the chunk size, the image the chunk belongs to, and the corresponding version. It then generates a chunkdict, which records the unique identifier (fingerprint) of each selected chunk. Only the chunkdict needs to be stored; other images can refer to the chunks in it by reference.

The deduplication algorithm has two parts. The first is the DBSCAN clustering algorithm, which deduplicates across different images. The second is the exponential smoothing algorithm, which deduplicates across different versions of the same image.

**The general process is as follows:**
1. Store the image information in the local database.
2. Extract the image information and call the DBSCAN clustering algorithm to deduplicate across different images.
3. Remove the chunks already covered by the dictionary from step 2, then call the exponential smoothing algorithm on each image separately to deduplicate across its versions.
4. Collect the deduplication dictionaries generated by the two algorithms and write them to disk.
# Detailed algorithm process
## Overall Input

```shell
nydusify chunkdict generate --sources \
localhost:5000:redis:nydus_7.0.1,\
localhost:5000:redis:nydus_7.0.2,\
localhost:5000:redis:nydus_7.0.3
```
***
`nydusify chunkdict generate` calls two commands, `nydus-image chunkdict save` and `nydus-image chunkdict generate`, to store image information in the database and to generate the list of chunks to be deduplicated.

Download several Nydus images in advance and put them into the repository as the dataset, for example 10 consecutive versions each of redis and alpine. Then run `nydus-image chunkdict save` to store the chunk and blob information in the chunk and blob tables of the database.

```shell
# Store multiple images in the database
nydus-image chunkdict save --bootstrap \
./output/localhost:5000:redis:nydus_7.0.1/nydus_bootstrap,\
./output/localhost:5000:redis:nydus_7.0.2/nydus_bootstrap,\
./output/localhost:5000:redis:nydus_7.0.3/nydus_bootstrap
```
Run `nydus-image chunkdict generate` to access the database and call the deduplication algorithm to generate the chunk list.
```shell
# Call the deduplication algorithm to generate the chunk list
nydus-image chunkdict generate --database \
sqlite:///path/imageservice/contrib/nydusify/chunkdict.db
```

***
### Deduplication algorithm
#### Algorithm 1 Deduplication between different images (DBSCAN clustering algorithm)
***
**Basic principle:** DBSCAN is a density-based clustering algorithm. It uses sample density to determine how samples are connected: samples of the same category are closely connected, meaning that any sample in a category has other samples of that category close by. It therefore groups objects that are densely packed and close together, can find clusters of arbitrary shape, and does not need the number of clusters specified in advance, which suits high-density deployment scenarios.

**Input:** The chunk information read from the database and stored in a chunk list. Chunk information includes image_name, version, chunk_blob_id, chunk_digest, chunk_compressed_size, and so on.

**Output:** The chunk dictionary corresponding to each image cluster.

**Basic steps:**
**1.** Split the versions of all images into a training set and a test set according to a given proportion.

**2.** Group all chunks in the training set by image_name into new lists, so that each list corresponds to one image and holds all the chunks of that image.

**3.** Cluster these images using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.

***
3.1 Initialize the core point set $\Omega$ as an empty set, and set the clustering radius $\gamma = 0.5$ and the sample number threshold $MinPts = 10$.

3.2 Loop through each image and its chunk list, and calculate its distance from every other image with the following formula:
$$ distance(x,y) = \frac{\lvert C(R_x) \cup C(R_y) \rvert - \lvert C(R_x) \cap C(R_y) \rvert}{\lvert C(R_x) \cup C(R_y) \rvert} $$
where $C(R_x)$ is the set of unique chunks across all training-set versions of image $x$. Count the images $y$ for which $distance(x,y) \leq \gamma$. If there are $M$ such images and $M \geq MinPts$, add image $x$ to the core point set $\Omega$; each such image $y$ is called an image in the neighborhood of the core image $x$.

3.3 Initialize the number of clusters $k=0$, then iterate over the core point set. For each core image, add all images in its neighborhood to a queue; if an image in the neighborhood is itself a core image, add all images in its neighborhood to the queue as well. Assign the images in the queue to one cluster, and continue until all core images have been traversed.

3.4 Calculate how often each chunk appears in the images of each cluster. Add the chunks that appear in more than $90\%$ of the cluster's training-set images to the dictionary for that cluster, producing a set of (cluster, dictionary) pairs.
***
**4.** Adjust the neighborhood radius and repeat step 3 to obtain multiple deduplication dictionaries.

**5.** Evaluate the deduplication dictionaries from step 4 on the test set, and select the chunk dictionary that gives the test set the smallest storage footprint.

**6.** Remove the chunks in the dictionary selected in step 5 from all images (training set and test set), then repeat steps 1-5 to generate further chunk dictionaries until the maximum number of iterations (7) is reached or the proportion of discrete images (images not assigned to any cluster) exceeds 80% of the total number of images.

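
Steps 3.1 to 3.3 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this commit: the function names `distance` and `cluster_images` and the example data are hypothetical, and an image is assumed not to count itself among its own neighbors. The example uses a small `min_pts` so that four toy images are enough to form a cluster.

```python
from collections import deque
from typing import Dict, List, Set


def distance(cx: Set[str], cy: Set[str]) -> float:
    """(|union| - |intersection|) / |union| of two chunk sets (Jaccard distance)."""
    union = cx | cy
    if not union:
        return 0.0  # assumption: two empty images are treated as identical
    return (len(union) - len(cx & cy)) / len(union)


def cluster_images(
    chunks: Dict[str, Set[str]], radius: float, min_pts: int
) -> List[Set[str]]:
    names = list(chunks)
    # Step 3.2: the neighborhood of x is every other image within `radius`.
    neighbours = {
        x: {y for y in names if y != x and distance(chunks[x], chunks[y]) <= radius}
        for x in names
    }
    core = {x for x in names if len(neighbours[x]) >= min_pts}

    # Step 3.3: grow one cluster per unvisited core image with a queue.
    clusters: List[Set[str]] = []
    visited: Set[str] = set()
    for x in names:
        if x not in core or x in visited:
            continue
        cluster = {x}
        visited.add(x)
        queue = deque(neighbours[x])
        while queue:
            y = queue.popleft()
            if y in visited:
                continue
            visited.add(y)
            cluster.add(y)
            if y in core:  # a core neighbor contributes its own neighborhood
                queue.extend(neighbours[y])
        clusters.append(cluster)
    return clusters


# Hypothetical images and chunk digests, for illustration only.
example_chunks = {
    "a": {"c1", "c2", "c3", "c4"},
    "b": {"c1", "c2", "c3", "c5"},
    "c": {"c1", "c2", "c3", "c6"},
    "d": {"c7", "c8"},
}
clusters = cluster_images(example_chunks, radius=0.5, min_pts=2)
print(clusters)
```

Images `a`, `b`, and `c` share three of five chunks pairwise, so their distance is 0.4 and they end up in one cluster; `d` shares nothing with the others and is left unclustered.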
The diagram below shows how the DBSCAN algorithm forms clusters:
![DBSCAN cluster illustration](https://img-blog.csdnimg.cn/5fba149720a34620873a5a2cb304d668.png#pic_center)
In this diagram, minPts = 4. Point A and the other red points are core points, because the area surrounding these points within an ε radius contains at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly reachable.

**Remark:** The picture and the accompanying DBSCAN description in this section are taken from [https://en.wikipedia.org/wiki/DBSCAN](https://en.wikipedia.org/wiki/DBSCAN)
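
Step 3.4 of Algorithm 1, building a dictionary for one cluster, might look like the sketch below. It reads "above 90%" as "the chunk appears in more than 90% of the images in the cluster", which is an interpretation of the text rather than something the source states precisely. The function name `cluster_chunkdict` and the example data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def cluster_chunkdict(
    cluster: Iterable[str], chunks: Dict[str, Set[str]], min_ratio: float = 0.9
) -> Set[str]:
    """Chunks that occur in more than `min_ratio` of the images in `cluster`."""
    members = list(cluster)
    # Each image contributes a chunk at most once because its chunks are a set.
    counts = Counter(digest for name in members for digest in chunks[name])
    return {digest for digest, n in counts.items() if n / len(members) > min_ratio}


# Hypothetical images and chunk digests, for illustration only.
cluster_chunks = {
    "a": {"c1", "c2", "c3", "c4"},
    "b": {"c1", "c2", "c3", "c5"},
    "c": {"c1", "c2", "c3", "c6"},
}
chunkdict = cluster_chunkdict(["a", "b", "c"], cluster_chunks)
print(sorted(chunkdict))
```

Only the chunks shared by all three images pass the threshold here; a chunk present in a single image has a frequency of one third and is left out.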
#### Algorithm 2 Deduplication between different versions of the image (exponential smoothing algorithm)
***
**Basic principle:** Exponential smoothing is a method for predicting and smoothing time series data. It computes a weighted average of the data, giving higher weight to chunks that were repeated more recently, and continuously updates the smoothed value. Newer chunks therefore have a greater influence on future predictions, while the influence of older data gradually fades.

**Input:** The training set and test set after deduplication by Algorithm 1.

**Output:** The chunk dictionary corresponding to each image.

**Basic steps:**
**1.** Group all chunks in the training set by image_name into new lists, so that each list corresponds to one image and holds all the chunks of that image.

**2.** Sort the versions of each image chronologically and score each chunk with the exponential smoothing formula:
$$ S_0 = 0, \quad S_t = \alpha Y_{t-1} + (1 - \alpha) S_{t-1} $$
where $\alpha = 0.5$ and $Y_{t-1}$ indicates whether the chunk appeared in the previous version of the image: 1 if it did, otherwise 0.

**3.** Compute the score of each chunk and select all chunks with a score greater than the threshold $THs$ as the chunk dictionary. Use it to deduplicate the image versions in the test set and calculate the storage space they occupy.

**4.** Vary $THs$ from 0.8 down to 0.5 in steps of 0.05, repeating steps 2 and 3 to generate multiple chunk dictionaries.

**5.** Choose the chunk dictionary that minimizes the test set's storage space.
***
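
The scoring in steps 2 to 4 can be sketched as below. This is an illustration only: it assumes the score of a chunk is the smoothed value after the last training version, and the names `smoothing_score`, `version_chunkdict`, and the example versions are hypothetical.

```python
from typing import List, Set


def smoothing_score(presence: List[int], alpha: float = 0.5) -> float:
    """Apply S_t = alpha * Y_{t-1} + (1 - alpha) * S_{t-1}, starting from S_0 = 0."""
    score = 0.0
    for y in presence:  # presence[t] is 1 if the chunk is in version t, else 0
        score = alpha * y + (1 - alpha) * score
    return score


def version_chunkdict(
    versions: List[Set[str]], threshold: float, alpha: float = 0.5
) -> Set[str]:
    """Chunks whose score over the chronologically sorted versions exceeds `threshold`."""
    all_chunks = set().union(*versions)
    return {
        chunk
        for chunk in all_chunks
        if smoothing_score([int(chunk in v) for v in versions], alpha) > threshold
    }


# Thresholds from 0.8 down to 0.5 in steps of 0.05, as in step 4.
thresholds = [round(0.8 - 0.05 * i, 2) for i in range(7)]

# Hypothetical chunk digests for three consecutive versions of one image.
versions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
dicts = {ths: version_chunkdict(versions, ths) for ths in thresholds}
print(dicts)
```

In this example chunk `a` is present in every version and scores 0.875, chunk `c` appears in the two most recent versions and scores 0.75, and chunk `b` has dropped out of the latest version and scores 0.375. Lowering the threshold from 0.8 to 0.7 therefore adds `c` to the dictionary, while `b` stays out for every threshold in the range.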
### Exponential smoothing algorithm test table

| image_name | number of versions | total_size | train_size | test_size | test_size after deduplicating | chunkdict_size | deduplicating rate | threshold |
|------------|--------------------|------------|------------|-----------|-------------------------------|----------------|--------------------|-----------|
| redis      | 10                 | 382.03     | 266.7      | 115.33    | 31.56                         | 42.33          | 72.63%             | 0.8-0.5   |
| python     | 10                 | 3509.91    | 2095.37    | 1414.54   | 123.33                        | 588.61         | 91.28%             | 0.8-0.5   |
| ubuntu     | 10                 | 317.33     | 222.11     | 95.22     | 12.27                         | 39.61          | 87.11%             | 0.8-0.5   |
| nginx      | 10                 | 396.86     | 284.4      | 112.46    | 50.54                         | 83.54          | 55.06%             | 0.8-0.5   |
| postgres   | 10                 | 1360.31    | 956.42     | 403.89    | 381.54                        | 19.66          | 5.53%              | 0.8-0.5   |
| alpine     | 10                 | 27.23      | 19.04      | 8.19      | 5.62                          | 4.7            | 31.29%             | 0.8-0.5   |
| node       | 10                 | 3698.44    | 2598.59    | 1099.85   | 429.39                        | 649.42         | 60.96%             | 0.8-0.5   |
| httpd      | 10                 | 561.99     | 385.79     | 176.2     | 85.7                          | 54.15          | 51.36%             | 0.8-0.5   |
***
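
The document does not define the deduplicating rate, but the reported values appear consistent with one minus the ratio of the deduplicated test size to the original test size. The sketch below checks that reading against three rows of the table; the function name `dedup_rate` is hypothetical, and small differences from the reported figures are expected because the sizes in the table are rounded.

```python
def dedup_rate(test_size: float, test_size_after: float) -> float:
    """Percentage of test-set storage saved by deduplication (assumed definition)."""
    return (1 - test_size_after / test_size) * 100


# (test_size, test_size after deduplicating, reported rate) copied from the table.
rows = {
    "redis": (115.33, 31.56, 72.63),
    "python": (1414.54, 123.33, 91.28),
    "postgres": (403.89, 381.54, 5.53),
}
for name, (before, after, reported) in rows.items():
    print(f"{name}: computed {dedup_rate(before, after):.2f}%, reported {reported:.2f}%")
```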