
Fix/slow cmu #19

Merged
3 commits merged into iternity-rb from fix/slow-cmu on Jun 1, 2022
Conversation

@rluetzner (Collaborator) commented Feb 25, 2022

Description

In combination with a GlusterFS backend, we realized that CompleteMultipartUpload requests become very slow. The larger the uploaded file, the longer the request takes.

Motivation and Context

We noticed that for larger files some clients would run into a timeout during CompleteMultipartUpload requests. The default timeout seems to be 1 hour, which is already generous; nevertheless, uploading 20 or 50 GiB files caused CMU requests that took far longer than that to complete.

Compared to AWS it is unintuitive that a CMU request takes this long at all, but we are willing to accept that, because it is a limitation of the current implementation, which has to assemble all chunks into one big file.

This fixes #13 and #11.

How to test this PR?

Tests can easily be performed manually. time is helpful for tracking the total duration on the client side, whereas mc admin trace -v -a will log all requests including their duration.
If large files are required, truncate -s 100G file can quickly create a file containing only zeros. As long as compression is not used, this is just as good as creating such a big file with actual content, but way faster.
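
A minimal sketch of such a manual test, assuming an mc alias named myminio and a bucket named test (both names are placeholders):

#!/bin/bash
# Create a sparse 100 GiB file of zeros to use as test data.
truncate -s 100G bigfile

# In a second terminal, log all requests including their duration:
#   mc admin trace -v -a myminio

# Upload the file and measure the total client-side duration.
# mc performs a multipart upload for a file this size, so the final
# CompleteMultipartUpload request is the part this PR is expected to speed up.
time mc cp bigfile myminio/test/bigfile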

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Optimization (provides speedup with no functional changes)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • Fixes a regression (If yes, please add commit-id or PR # here)
  • Documentation updated
  • Unit tests added/updated

@aweisser (Collaborator) left a comment

LGTM. But I would like @tristanessquare to take a second look and approve.
@tristanessquare: It's okay that this doesn't have the potential to be merged upstream, so please don't take that as a criterion.

@tristanexsquare (Collaborator) commented:

The windows 1.17.x and ubuntu 1.16 testing stages are currently failing. We should fix this before merging.

@rluetzner (Collaborator, Author) commented:

@tristanessquare that's fine by me.

@rluetzner (Collaborator, Author) commented May 18, 2022

This has become more important, so I started looking more closely into the failing unit tests. I'm immensely frustrated and confused as to why the tests seem to succeed for the upstream repository but fail on our fork.
I've used https://github.com/nektos/act to run the GitHub workflows locally and limited myself to go-lint.yml. To save some resources I've also limited the execution to ubuntu-latest and Go version 1.17.x. I've used the following script:

#!/bin/bash
set -o errexit
set -o nounset
set -o pipefail

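# Restrict the go-lint workflow matrix to Go 1.17.x on ubuntu-latest only;
# the sed patterns below assume the matrix lines in go-lint.yml match them exactly.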
sed -i 's/go-version: \[.*$/go-version: \[1.17.x\]/' .github/workflows/go-lint.yml
sed -i 's/os: \[ubuntu-latest, windows-latest\]/os: \[ubuntu-latest\]/' .github/workflows/go-lint.yml

RC=0
act -W .github/workflows/go-lint.yml pull_request || RC=1
git reset --hard
exit $RC

I initially wanted to run a git bisect, but I discovered that our tests never worked at all. I've run different official RELEASE tags from upstream, the upstream master, and also early commits from around when go-lint.yml was added. None of them succeeded.

I then fell back to running

~/bisect.sh | tee <branch>.log

and diffing the results. I'll attach the log files here.

fix-slow-cmu.log
iternity-rb.log
RELEASE.2021-10-13T00-23-17Z.log

Looking at the diff shows that iternity-rb introduced a few tests that fail in a CI environment (they don't fail when run separately). The same tests fail between iternity-rb and this pull request, which makes this reasonably safe to merge. We should address the failing tests ASAP, but I don't see any reason why this merge should be blocked by them.

I'm saying this mostly because the fix has gained importance due to a customer being blocked, and it might become necessary to merge this without the unit tests succeeding, @tristanessquare. At least it looks like I didn't break anything else.

I only looked at the diff for failed tests, i.e. diff iternity-rb.log fix-slow-cmu.log | grep 'FAIL:'.
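
For reference, a sketch of the comparison, assuming bisect.sh is the script from above and each branch is checked out in turn:

# Produce one log per branch.
git checkout iternity-rb && ~/bisect.sh | tee iternity-rb.log
git checkout fix/slow-cmu && ~/bisect.sh | tee fix-slow-cmu.log

# Show only the differing lines, filtered to failed tests.
diff iternity-rb.log fix-slow-cmu.log | grep 'FAIL:'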

@rluetzner (Collaborator, Author) commented:

I've removed some additional code to make clear that what was previously used as a fallback operation is now the default.

We have no choice but to merge this even with the failing unit tests.

rluetzner requested a review from aweisser on May 30, 2022 10:44
@aweisser (Collaborator) left a comment

LGTM

Robert Lützner added 2 commits June 1, 2022 08:34
There's no need to continuously stat all object parts on disk if most of
them have already been appended.
The backgroundAppend was meant as a performance optimization where
uploaded chunks would be aggregated to a complete file in the background
while more parts were uploaded at the same time.

This might be an optimization on actual filesystem paths, but on Gluster
it made the operation a lot slower because many redundant filesystem
calls were executed. The CompleteMultipartUpload request which requires
the uploaded chunks to be put together has to

1. wait for any ongoing backgroundAppend operations to complete,
2. enumerate all chunks and check if they were put together in the right
   order (otherwise start from scratch),
3. move the aggregated file.

Removing this piece of code started out as an experiment, because I
expected the chunks not to be aggregated at all. It turned out that
there is a fallback, which is also necessary in case the final object
uses a different part order or does not contain some of the uploaded parts.
This also ensures that a final aggregate is created. At least on
GlusterFS this makes any CMU request run almost twice as fast as when
using backgroundAppend.
@rluetzner (Collaborator, Author) commented:

The same tests failed as always, so I'd say this is safe to merge.

rluetzner merged commit 33a1155 into iternity-rb on Jun 1, 2022
rluetzner deleted the fix/slow-cmu branch on June 1, 2022 08:52
rluetzner mentioned this pull request on Jun 1, 2022
rluetzner restored the fix/slow-cmu branch on June 2, 2022 11:48
rluetzner deleted the fix/slow-cmu branch on June 2, 2022 11:51
rluetzner mentioned this pull request on Jun 2, 2022
Development

Successfully merging this pull request may close these issues.

Multipart Upload: CompleteMultipartUpload takes multiple hours to complete for very big files