Avoid performance peaks and drops for s3:ListObjectsV2 on nasxl #21
A test with an AWS EC2 t2.medium instance using EBS storage type gp2 did not produce the same behaviour as described.
@tristanessquare: Please try to reproduce this behaviour on the provided test system (full stack).
@tristanessquare: Can you identify the root cause (at the MinIO level) for this?
Not yet, we are working on it. I just wanted to update this issue to reflect our current state of work.
We have also looked into the nas mode and found the same behaviour: a sudden peak depending on the number of objects, and more and more peaks in the middle. Comparing nas-legacy and nas, we found that the duration of a list-objects call has nearly doubled from nas to nas-legacy, but the pattern is nonetheless the same for nas, nas-legacy and nasxl. Our suspicion was that the root cause is some kind of GlusterFS limit that is reached when reading all of the metafiles in the over bucket. To verify this suspicion we tried to access the files directly on the filesystem with find, grep and cat, but the access times scaled linearly and did not show the expected peaks. As we could not explain the root cause of the peaks at certain numbers of objects by looking at the profiles from the previous post, we have used the madmin tool to create tracing profiles. These profiles were taken while calling mc ls 3 times on each of the two nasxl buckets (over and under). The mc ls call to over took about 15 seconds and the call to under took about 5 seconds. icas-nasxl-under-profiling.zip
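For reference, this is roughly how such profiles can be captured with the mc client; the alias name nasxl and the profiler types are assumptions, and the original profiles were taken via the madmin API:

# Sketch: capture CPU and trace profiles around a slow listing (alias/bucket names assumed)
mc admin profile start --type cpu,trace nasxl
mc ls nasxl/over --insecure >/dev/null
mc admin profile stop nasxl   # downloads the collected profile data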
We have tried, unsuccessfully, to identify the causes for the slowdown with the CPU profiler, and the trace profiler has given us hardly any insights. The file access times in the over case are somewhat slower, but not enough to explain the difference between the runs. We have found an article about the shortcomings of the built-in profilers; perhaps we could try the proposed alternative to pinpoint the problem?
I found a 16k constant, although that one is the blockSize of the metacacheWriter. Still, this is something that could be adjusted. Another problem could be that the output writer is initialized with a writer concurrency of 2 (also a magic number in the code) instead of GOMAXPROCS. This was changed by Klaus Post in November 2020.
We have repeated the profiling with fgprof, but still could not find the root cause for the drop: profile-nasxl-over-fgprof.zip. We have also tried to directly access the FUSE mount and list/read all files, but could not reproduce the performance drop this way.
Today we started to manually track down the root cause for the peak by adding more logging. The attached file shows all of our runs in chronological order. We found out that the peak is caused by os.OpenFile(). This call is much slower for the over bucket than for the under bucket.
If you take a look at the implementation of os.OpenFile() (openFileNolog on Unix), you will see that it retries syscall.Open() in a loop whenever the call returns EINTR. So there are two possibilities why the open is slow: either a single syscall.Open() takes that long, or the call keeps getting interrupted with EINTR and loops many times.
@rafikc30 Can you pls. throw some GlusterFS wisdom at us!? When doing an open() on a GlusterFS FUSE mount, can the call return EINTR, and under which circumstances?
@gergan Can you pls. share the exact command, so that we can reproduce it!? Thanks.
Hi, it is a slightly modified version of the measure-ls.sh script. It can be found on the server under /home/l3support/client:

#!/bin/bash
# 1. Make sure that you've created an empty test bucket (e.g. ls-test) and an mc alias (e.g. nasxl)
# 2. Call the script like this: ./measure-ls.sh nasxl/ls-test 1 200 50 /mount/to/gluster/ls-test
# This will create a folder structure with one level (folders named 1 - 200) containing 50 objects in each folder.
# The script will generate two .csv files for PUT object and ListObjects latencies (in secs)
trap "exit" INT
TIMEFORMAT=%R
bucket=$1
min_folders=$2
max_folders=$3
files_per_folder=$4
bucket_mount=$5
outfile_prefix=`(echo $bucket | tr '/' '.')`
mc="mc"
touch file
object_count=0
for i in $(seq $min_folders $max_folders); do
for j in $(seq 1 $files_per_folder); do
echo `(time $mc cp file $bucket/$i/file$j --quiet --insecure >>/dev/null) 2>&1` >> $outfile_prefix.PUT.$min_folders.$max_folders.$files_per_folder.csv
done
((object_count=object_count+files_per_folder))
echo `(echo $object_count && time find $bucket_mount -type f -name 'xl.meta' -exec cat {} > /dev/null \;) 2>&1` | tr ' ' ';' >> $outfile_prefix.LIST.$min_folders.$max_folders.$files_per_folder.csv
done
@aweisser The Gluster open call is very unlikely to return EINTR on this occasion, although technically the open file system call can return EINTR when a blocked open call is interrupted by a signal handler.
@rafikc30, @gergan, @tristanessquare: Follow these steps:

Step 1: Create a small Go program that counts how often syscall.Open() has to be retried because of EINTR (the binary is called ./openFileNolog in the script below):

package main
import (
"fmt"
"os"
"syscall"
)
// Pass a file name as first argument
func main() {
name := os.Args[1]
eintr_cnt, _, _ := openFileNologSimplified(name, os.O_RDONLY, 0)
fmt.Println(eintr_cnt)
}
// Simplified version of the Unix implementation of os.openFileNolog.
// It mainly lacks support for the sticky bit and for platforms where "close on exec" is not supported.
func openFileNologSimplified(name string, flag int, perm uint32) (int, int, error) {
var r int
c := 0
for {
var e error
r, e = syscall.Open(name, flag|syscall.O_CLOEXEC, perm)
if e == nil {
break
}
// We have to check EINTR here, per issues 11180 and 39237.
if e == syscall.EINTR {
c++
continue
}
return c, r, e
}
return c, r, nil
}

Step 2: Build it with go build.

Step 3: Create a measure-openFileNolog.sh script that runs the binary over every xl.meta file:

#!/bin/bash
# 1. Make sure that you've created an empty test bucket (e.g. ls-test) and an mc alias (e.g. nasxl)
# 2. Call the script like this: ./measure-openFileNolog.sh nasxl/ls-test 1 200 50 /mount/to/gluster/ls-test
# This will create a folder structure with one level (folders named 1 - 200) containing 50 objects in each folder.
# The script will generate two .csv files for PUT object and ListObjects latencies (in secs)
trap "exit" INT
TIMEFORMAT=%R
bucket=$1
min_folders=$2
max_folders=$3
files_per_folder=$4
bucket_mount=$5
outfile_prefix=`(echo $bucket | tr '/' '.')`
mc="mc"
touch file
object_count=0
for i in $(seq $min_folders $max_folders); do
for j in $(seq 1 $files_per_folder); do
echo `(time $mc cp file $bucket/$i/file$j --quiet --insecure >>/dev/null) 2>&1` >> $outfile_prefix.PUT.$min_folders.$max_folders.$files_per_folder.csv
done
((object_count=object_count+files_per_folder))
echo `(echo $object_count && time find $bucket_mount -type f -name 'xl.meta' -exec ./openFileNolog {} \; | awk '{s+=$1} END {print s}' | tr '\n' ' ') 2>&1` | tr ' ' ';' >> $outfile_prefix.OPEN.$min_folders.$max_folders.$files_per_folder.csv
done

Step 4: Create an empty bucket on a nasxl target.

Step 5: Create 400 folders with 50 0-byte files each and measure the open() calls by running the script against that bucket.

Step 6: Wait for the .csv output to be created:

tail -f nasxl.test20000.OPEN.1.400.50.csv
...
6150;0;47.509
6200;0;42.505
6250;0;43.542
6300;0;41.995

The columns are: object count; accumulated EINTR retry count (summed over all xl.meta files); duration of the find run in seconds.
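A quick, hypothetical way to spot the jumps in such a .csv (the 5-second threshold and the file name are placeholders):

# Print consecutive rows where the measured duration (3rd column) jumps by more than 5 seconds
awk -F';' 'NR>1 && $3-prev>5 {print prevline " -> " $0} {prev=$3; prevline=$0}' nasxl.test20000.OPEN.1.400.50.csv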
I'm currently running the new test routine on one of my test VMs and the EINTR counter has stayed at 0 so far. However, the open() durations still need a closer look once the run is complete.
Here's a visualisation of my first test run. Contrary to what we suspected, the pattern is not that clear. I'll have to rerun the test to verify. In the first test run I also had some problems with my test system, so maybe this result is not representative.
I've modified the code of the openFileNolog test program so that it walks a whole directory tree itself and measures the accumulated syscall.Open() time as well as the total walk time:

package main
import (
"fmt"
"io/fs"
"os"
"path/filepath"
"syscall"
"time"
)
var totalOpenTime time.Duration = time.Duration(0)
var totalFiles int = 0
// Pass a file name as first argument
func main() {
start := time.Now()
name := os.Args[1]
filepath.WalkDir(name, visit)
totalTime := time.Since(start)
fmt.Printf("%d;%s;%s\n", totalFiles, totalOpenTime, totalTime)
}
func visit(path string, d fs.DirEntry, err error) error {
isDir, _ := isDirectory(path)
if !isDir {
duration := openFile(path)
totalOpenTime += duration
totalFiles++
}
return nil
}
// isDirectory determines if a file represented
// by `path` is a directory or not
func isDirectory(path string) (bool, error) {
fileInfo, err := os.Stat(path)
if err != nil {
return false, err
}
return fileInfo.IsDir(), err
}
func openFile(name string) time.Duration {
duration, r, _ := openFileNologSimplified(name, os.O_RDONLY, 0)
// Close is done here, to ensure that we measure ONLY the
// syscall.Open() duration in openFileNologSimplified().
// We should ensure that filehandles are closed to
// 1. prevent memory leaks and
// 2. because GlusterFS has a fixed upper limit of how many
// fds can be open.
defer syscall.Close(r)
return duration
}
// Simplified version of the Unix implementation of os.openFileNolog.
// It mainly lacks support for the sticky bit and for platforms where "close on exec" is not supported.
func openFileNologSimplified(name string, flag int, perm uint32) (time.Duration, int, error) {
start := time.Now()
var r int
c := 0
for {
var e error
r, e = syscall.Open(name, flag|syscall.O_CLOEXEC, perm)
if e == nil {
break
}
// We have to check EINTR here, per issues 11180 and 39237.
if e == syscall.EINTR {
c++
continue
}
return time.Since(start), r, e
}
return time.Since(start), r, nil
}
Here's my rerun on a new test cluster using v1.12.dev and an isolated measurement of the syscall.Open() durations. According to my new test run the EINTR counter stays at 0, so the retry loop is not where the time is spent. As you can see, the pattern of the isolated open() measurements matches the pattern of the listing times, peaks included.
Next steps:
I've ripped out the WalkDir algorithm and any related code from MinIO and put it into a small Go program. The code is available here: https://github.com/rluetzner/minio-list-debug.
I found an interesting Gluster setting: the inode-lru-limit. These curves are listing results from the tool I built from the extracted MinIO WalkDir function. However, if I understand the Gluster code correctly, this setting cannot be turned off and has a hard limit of 32,768 (32*1024, a.k.a. DEFAULT_INODE_MEMPOOL_ENTRIES):

// excerpt from libglusterfs/src/inode.c from GlusterFS:
if (!mem_pool_size || (mem_pool_size > DEFAULT_INODE_MEMPOOL_ENTRIES))
        mem_pool_size = DEFAULT_INODE_MEMPOOL_ENTRIES;

One could argue that the Gluster docs are outdated or the code is wrong, but the same code can be found in GlusterFS upstream and was last modified 4 years ago. Maybe it's worth asking some Gluster people about this mismatch of information.
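For reference, a sketch of how the server-side option can be inspected and changed on the volume; the volume name fsnoworm-s3 is an assumption based on the mount paths used above, and the full option name network.inode-lru-limit should be double-checked against the installed GlusterFS version:

# Read the current server-side inode lru limit and raise it (assumed volume name)
gluster volume get fsnoworm-s3 network.inode-lru-limit
gluster volume set fsnoworm-s3 network.inode-lru-limit 200000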
I ran the tests longer and with more files. Anything with (more files) in the graph was recorded with the following call:

# Create 400 folders with 500 files per folder, i.e. 200,000 files.
./measure-openFileNolog.sh nasxl/test8 1 400 500 /gluster/repositories/fsnoworm-s3/fsnoworm-s3/test8/

Surprisingly, changing the inode-lru-limit setting made a visible difference. I wanted to check if the curves converge for higher file counts, no matter what value is configured; that's why the pink line is there. The curves do converge in a way, but it looks like there are still better and worse values. Most of the time, the green curve would be what we're looking for. Additionally, for now it looks like there is no way to get rid of these jumps entirely. I suspect that this is due to the cache implementation: when the memory pool starts hitting its limit, we encounter something that looks very similar to garbage collection in other languages. Execution time slows down to allow data that is no longer needed to be removed from the mem pool. Next I want to re-run the test with a setting of 0 to see if the results are similar to the first execution. I'm still surprised that the curves for 0 and 200,000 differ from each other, and I want to check if this might be caused by there being a larger amount of files overall on the system. If my assumption is correct, then a second run with a value of 0 should show a closer resemblance to the orange curve rather than the green one.
Well... this is where expectations meet reality. The black and green graphs in this plot were recorded with the same inode-lru-limit setting, yet the green curve jumps around 32,000 instead of 60,000 where all the other curves did. Right now the only thing we can gather is that Gluster's default setting is not ideal for large amounts of small files. To see if this is caused by the increasing amount of files in the backend or if this is indeed arbitrary, I'll have to set up a new system and run the script multiple times for different buckets. I also want to extract the Gluster code I posted above and didn't fully understand into a small C file and compile it to see what it actually does.
I've written a short piece of C code to test my theory:

#include <stdio.h>
const int DEFAULT_INODE_MEMPOOL_ENTRIES = 32*1024;
int main()
{
int mem_pool_size = 0;
printf("Enter a Number: ");
scanf("%d", &mem_pool_size);
if (!mem_pool_size || (mem_pool_size > DEFAULT_INODE_MEMPOOL_ENTRIES))
{
mem_pool_size = DEFAULT_INODE_MEMPOOL_ENTRIES;
printf("Using the default size.");
}
else
printf("Not using the default size.");
return 0;
}

And for anyone that, like me, has to look up how to compile C code:
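A minimal sketch, assuming the snippet above is saved as mempool_check.c:

# Compile and run the test program
gcc -o mempool_check mempool_check.c
./mempool_check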
Results:
I've had a very interesting discussion with Rafi about this.

inode mem pool initialization

I learned that what I saw in the code was merely where the inode mem pool is initialized. Whenever the inode-lru-limit is set to 0 or to a value above 32,768, the mem pool is initialized with the default size. There is a separate function that takes care of the actual lru limit, and the struct containing the mem pool also contains a field lru_limit, which holds the correct value. It is therefore very much possible to set values higher than 32 * 1024. I've since closed the issue in the Gluster docs, because I merely misunderstood the code.

client-side vs. server-side caching

There is a separate inode cache on the client-side (the FUSE mount), with its own limit that is configured independently of the server-side setting. The next steps would therefore be to increase the server-side cache to values beyond 65k and adjust the client-side lru-cache accordingly. We have to keep in mind that the higher we go, the more memory we will allocate on the systems. According to Rafi, the client-side setting is a mount option called lru-limit.
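A sketch of how the client-side limit could be applied when mounting the volume; server, volume, and mount point are placeholders, and whether the installed mount helper accepts lru-limit as a -o option should be verified against the GlusterFS version in use:

# Mount the volume with a larger client-side inode lru limit (option name as discussed above)
mount -t glusterfs -o lru-limit=65536 <server>:/<volume> /mount/point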
The following graph was recorded with

./measure-openFileNolog.sh "nasxl/$bucket" 1 40 500 "/gluster/repositories/fsnoworm-s3/fsnoworm-s3/$bucket/"

so the points are 500 files apart (fewer measurements than before). They also stop at 20,000, so we don't see a long-term trend. I replaced the tool I built from the extracted listing algorithm with listing calls against an official MinIO binary running on an XL backend. The result surprised me quite a bit: even though we don't see the long-term trend, it is obvious that the official MinIO binary with an XL backend performs much better at listing files compared to our forked code. I'm also surprised how little the curves deviate from each other. Maybe this would be different with more points (i.e. fewer files per folder, but more folders to list), but I would expect the trend to be similar. Right now I'm recording a measurement up to 200,000 files to see how the official MinIO binary performs with higher file counts.
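For comparison, a listing against any MinIO alias can be timed the same way; the alias official-minio and the bucket name are placeholders:

# Time a full recursive listing via mc
time mc ls --recursive --insecure official-minio/ls-test >/dev/null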
With commits f8c2946 and bd82a93 we passed the ContinuationToken as a ForwardTo marker to the underlying WalkDir function, to speed up the s3:ListObjectsV2 operation. This fix improved the listing performance. But there are still peaks and performance drops when you add more and more objects, as you can see in the following plot.
The red line is AWS S3. You can see it scales linearly.
The blue line is the nasxl code without the ContinuationToken optimisation that was introduced in commits f8c2946 and bd82a93. It loses performance every 4500 objects.
The yellow line is nasxl with the ContinuationToken optimisation. The performance loss at each 4500 objects is gone, but there are still peaks and a huge performance drop at ~16000 objects.
The numbers are generated with the measure-ls.sh test script (a slightly modified version is shown in the comments above).
Expected Behavior
The listing operation should scale linearly, like AWS S3. The gradient, however, doesn't need to be the same as AWS S3's.