Huge RAM usage on big file uploads #2856

Open
thierryba opened this issue Feb 13, 2024 · 7 comments
Labels
feature-request A feature should be added or improved. p2 This is a standard priority issue

Comments

@thierryba
Contributor

Describe the bug

I want to upload a big file, so I tried to increase the partSize in my S3CrtClient configuration.
But the RAM consumption of my process seems to be a direct multiple (around 20x) of that value. When I tried 50MB, my process was taking 1GB of RAM.

Expected Behavior

Uploading a file should be simple enough to consume far less RAM than this.

Current Behavior

It uses about 20x the part size in RAM. For a huge upload that is too much, and it means I cannot run more than one upload in parallel.

Reproduction Steps

see description

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

1.11.258

Compiler and Version used

Apple clang version 15.0.0 (clang-1500.1.0.2.5)

Operating System and version

macOS Sonoma 14.3

@thierryba thierryba added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Feb 13, 2024
@jmklix
Member

jmklix commented Feb 13, 2024

Can you provide a minimal code sample that reproduces this? What partSize are you using?

@jmklix jmklix added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Feb 13, 2024
@jmklix jmklix self-assigned this Feb 13, 2024
@thierryba
Contributor Author

thierryba commented Feb 14, 2024

my minimal example

#include <iostream>
#include <fstream>
#include <aws/core/Aws.h>
#include <aws/s3-crt/S3CrtClient.h>
#include <aws/s3-crt/model/PutObjectRequest.h>


int main()
{
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    Aws::S3Crt::ClientConfiguration conf;
    conf.partSize = 50 * 1024 * 1024; // that means using 1GB of RAM...
    Aws::Auth::AWSCredentials creds;
    creds.SetAWSAccessKeyId(Aws::String("your_Access_key"));
    creds.SetAWSSecretKey(Aws::String("your_secret_key"));
    const std::string fileName = "big file so that it takes a bit of time to upload";

    Aws::S3Crt::S3CrtClient client(creds, conf);

    Aws::S3Crt::Model::PutObjectRequest request;

    request.SetBucket("bucket name");
    request.SetKey("my_big_file_on_s3");

    std::shared_ptr<Aws::IOStream> inputData =
        Aws::MakeShared<Aws::FStream>("SampleAllocationTag",
                                      fileName.c_str(),
                                      std::ios_base::in | std::ios_base::binary);

    if (!*inputData) {
        std::cerr << "Error unable to read file " << fileName << std::endl;
        return 1;
    }

    request.SetBody(inputData);

    request.SetDataSentEventHandler([](const Aws::Http::HttpRequest*, long long) {
        std::cout << "callback" << std::endl;
    });

    Aws::S3Crt::Model::PutObjectOutcome outcome = client.PutObject(request);
    if (!outcome.IsSuccess()) {
        std::cerr << "Error: PutObject: " <<
            outcome.GetError().GetMessage() << std::endl;
    } else {
        std::cout << "DONE" << std::endl;
    }


    Aws::ShutdownAPI(options);
}

Note that the callback is also never called, but that is reported as a separate issue.

@DmitriyMusatkin
Contributor

The CRT S3 client automatically splits big uploads into multiple parts and uploads them in parallel. During an upload, CRT therefore holds several part-sized buffers in memory, and how many depends on the overall parallelism settings. Depending on how big the file is and how many parts are being uploaded at the same time, 1GB might be a reasonable number.

On top of that, CRT pools buffers to avoid reallocating them over and over again, so you might see CRT holding on to a larger chunk of memory than you would expect. Buffer pools are cleared after some period of inactivity.
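To make the part splitting concrete, here is a minimal sketch of the arithmetic, not taken from the CRT sources. It relies only on the published Amazon S3 multipart limits (at most 10,000 parts per upload, at least 5 MiB for every part except the last). The helper names `partCount` and `smallestPartSize` are hypothetical and do not exist in the SDK.

```cpp
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;
constexpr std::uint64_t kMaxParts = 10000;        // S3 limit: parts per multipart upload
constexpr std::uint64_t kMinPartSize = 5 * kMiB;  // S3 limit: every part except the last

// Hypothetical helper: number of parts needed for fileSize bytes.
inline std::uint64_t partCount(std::uint64_t fileSize, std::uint64_t partSize) {
    return (fileSize + partSize - 1) / partSize;  // ceiling division
}

// Hypothetical helper: smallest part size, rounded up to a whole MiB,
// that keeps an upload of fileSize bytes within kMaxParts parts.
inline std::uint64_t smallestPartSize(std::uint64_t fileSize) {
    const std::uint64_t bytes = (fileSize + kMaxParts - 1) / kMaxParts;
    const std::uint64_t size = ((bytes + kMiB - 1) / kMiB) * kMiB;
    return size < kMinPartSize ? kMinPartSize : size;
}
```

Under these limits a 10 GiB file fits within 10,000 parts even at the 5 MiB minimum, and a 100 GiB file needs roughly 11 MiB parts, so files of a few GB do not require 50MB parts just to stay inside the part-count limit.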

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Feb 15, 2024
@thierryba
Contributor Author

Well, to be frank, 1GB to upload a file, whatever its size, is a huge price to pay. In a restricted cloud environment this is a ridiculous amount of RAM, not to mention that we could have multiple uploads running simultaneously.
On top of this, the fact that it is not controllable makes S3CrtClient completely useless for us. I have no idea what the size of the upload will be, and I am not sure how the total memory usage is computed. It seems to be something like 20x the part size, but how do I know for sure? What does it depend on?

@DmitriyMusatkin
Contributor

S3 has fairly low per-connection throughput, so to reach decent throughput, CRT needs to run several connections in parallel and buffer a considerable portion of the data being uploaded. The amount of parallelism used by CRT can be controlled by the target throughput setting (https://github.com/aws/aws-sdk-cpp/blob/main/generated/src/aws-cpp-sdk-s3-crt/include/aws/s3-crt/ClientConfiguration.h#L58). Unfortunately, that setting already defaults to the lowest possible value in the CPP SDK, and setting it lower will not have an impact on memory usage.

Note that the overall max memory usage for the client has an upper bound that is derived from part size and number of connections (which in turn is derived from max throughput). So memory usage does not scale directly with the number of S3 requests queued up on the client, and once that upper bound is reached, memory usage will stay there.
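As a rough illustration of that upper bound, the sketch below multiplies the part size by the number of part-sized buffers held at once. It is an estimate under stated assumptions, not the formula CRT uses: the factor of 20 is only the multiple the reporter observed in this thread, and `estimatedPeakBytes` and `largestPartSizeForBudget` are hypothetical names.

```cpp
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// ASSUMPTION: 20 is the multiple observed by the reporter in this issue,
// not a documented CRT constant. The real value depends on the number of
// connections, which CRT derives from the throughput setting.
constexpr std::uint64_t kObservedBuffersInFlight = 20;

// Hypothetical helper: rough peak buffer memory for a given part size.
inline std::uint64_t estimatedPeakBytes(std::uint64_t partSize,
                                        std::uint64_t buffersInFlight) {
    return partSize * buffersInFlight;
}

// Hypothetical helper: largest part size that stays within a memory budget.
inline std::uint64_t largestPartSizeForBudget(std::uint64_t budgetBytes,
                                              std::uint64_t buffersInFlight) {
    return budgetBytes / buffersInFlight;
}
```

With the observed factor, 50 MiB parts give an estimate of 1000 MiB, which matches the roughly 1GB reported above, while 8 MiB parts would give about 160 MiB. If the estimate holds, lowering the part size is the one lever available from the client configuration.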

We've made several improvements to the underlying C CRT libs with regard to memory usage in the past couple of months that haven't made their way to the CPP SDK yet, so I would be interested in learning about your use cases. What kind of instances are you running the code on? What is the overall RAM on the system, and the NIC bandwidth? What are the typical file sizes you are trying to upload?

@thierryba
Contributor Author

Hi @DmitriyMusatkin, and thank you for the reply. I was actually wondering whether setting the throughput to a lower value would help. Heh, too bad for me. I suppose that if the changes to memory usage do not directly affect those buffers, they will not help me much. In essence, we are a SaaS provider and there are times when we need to push data, most likely in files of a few GB, but it can go to tens of GB (there is no actual limit), hence my questions.
The environments this can run in are actually pretty diverse, because it could be a SaaS instance on EC2 or on-prem.
In any case, we are trying to be careful with resources, and 1GB is ridiculously high just to upload a file.

That being said, for now we have switched to TransferManager, which gives better control over memory usage.
Also, TM lets you get the current upload progress, which S3CrtClient fails to do (the callbacks are never called).

@jmklix
Member

jmklix commented Feb 21, 2024

Thanks for bringing your use case to our attention. I'm sorry that S3CrtClient doesn't currently fit your needs. I'm changing this issue to a feature request: adding more options for configuring S3CrtClient. If you have any ideas about which settings you would like to configure, please let us know, but I can't guarantee that we will be able to implement them.

@jmklix jmklix added feature-request A feature should be added or improved. and removed bug This issue is a bug. labels Feb 21, 2024
@jmklix jmklix removed their assignment Feb 21, 2024

3 participants