Evaluate if we can optimize s3:ListObjectsV2 performance for nasxl by using prefetching and caching. #20

Open
aweisser opened this issue Mar 30, 2022 · 0 comments
aweisser commented Mar 30, 2022

Is your feature request related to a problem? Please describe.
The implementation of the s3:ListObjectsV2 operation suffers a performance drop every 4,500 objects (the "old" default for maxObjectList). For every 4,500 objects the client needs to request the next result page by passing the last ContinuationToken to another s3:ListObjectsV2 call. So for large result sets the client has to issue many paginated requests.
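To make the pagination cost concrete, here is a minimal Go sketch of the client-side loop. `listPage` is a hypothetical stand-in for one server-side s3:ListObjectsV2 call and uses an integer token in place of the real opaque ContinuationToken:

```go
package main

import "fmt"

const maxObjectList = 4500 // the "old" default page size mentioned above

// listPage is a hypothetical stand-in for one s3:ListObjectsV2 call: it
// returns the number of keys in the page starting at token, the token for
// the next page, and whether the result set is truncated.
func listPage(total, token int) (n, next int, truncated bool) {
	end := token + maxObjectList
	if end > total {
		end = total
	}
	return end - token, end, end < total
}

// countPages counts how many round trips a client needs to list all objects.
func countPages(total int) int {
	pages, token := 0, 0
	for {
		_, next, truncated := listPage(total, token)
		pages++
		if !truncated {
			return pages
		}
		token = next // client passes the ContinuationToken back to the server
	}
}

func main() {
	fmt.Println(countPages(4000000)) // 889
}
```

At the old default page size of 4,500 keys, listing 4,000,000 objects takes 889 round trips, and each of those round trips currently re-runs the directory walk on the server side.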

With commits f8c2946 and bd82a93 we passed the ContinuationToken as a ForwardTo marker to the underlying WalkDir function to speed up the s3:ListObjectsV2 operation. This fix improved the listing performance, but we think it can still be optimized.

In our use case we need to list (and sequentially delete) 4,000,000 objects. We observed that s3:ListObjectsV2 starts to drop in performance at some point, and we assume this is related to the func (s *fsv1Storage) WalkDir(3) function, because it is called for every result page and lists and sorts the underlying folder structure again and again.
We assume there is already some kind of caching, but it seems not to be sufficient in our case.

Describe the solution you'd like
We'd like to evaluate an approach where the actual list operation on the filesystem (that is, the scanDir = func(current string) error closure, which is embedded in func (s *fsv1Storage) WalkDir(3) and delegates an important part of the work to func (s *fsv1Storage) ListDir(4)) is not called again for every paginated s3:ListObjectsV2 request, but only once, or at least less often.
Maybe something like:

  1. Initiate the scanDir function on the first s3:ListObjectsV2 request (for a specific prefix) and populate a cache.
  2. Let any subsequent s3:ListObjectsV2 request use the cache first. Maybe we can identify "subsequent" requests by caching a list of (prefix, continuationToken) tuples.
  3. Implement the cache so that it can quickly forward to the given ContinuationToken (or ForwardTo marker, respectively).
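The three steps above could look roughly like the following single-node sketch. All names here (`listCache`, `page`, the `scan` callback) are hypothetical, and cache invalidation is deliberately ignored; the point is only that the expensive directory walk runs once per prefix, and forwarding to the ContinuationToken becomes a binary search over the cached, sorted key list:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// listCache is a hypothetical single-node cache for paginated listings,
// keyed by (bucket, prefix). It stores the fully sorted key list produced
// by the first WalkDir/scanDir pass for that prefix.
type listCache struct {
	mu      sync.Mutex
	entries map[string][]string
}

func newListCache() *listCache {
	return &listCache{entries: make(map[string][]string)}
}

// page returns up to max keys at or after forwardTo, plus the next
// ContinuationToken ("" if the listing is complete). scan is only invoked
// on a cache miss, i.e. the expensive walk runs once per prefix.
func (c *listCache) page(bucket, prefix, forwardTo string, max int, scan func() []string) (keys []string, next string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	cacheKey := bucket + "/" + prefix
	all, ok := c.entries[cacheKey]
	if !ok {
		all = scan()
		sort.Strings(all)
		c.entries[cacheKey] = all
	}
	// Forward quickly to the ContinuationToken / ForwardTo marker.
	start := sort.SearchStrings(all, forwardTo)
	end := start + max
	if end > len(all) {
		end = len(all)
	}
	if end < len(all) {
		next = all[end]
	}
	return all[start:end], next
}

func main() {
	cache := newListCache()
	scans := 0
	scan := func() []string { // stands in for the scanDir/ListDir walk
		scans++
		return []string{"data/a", "data/b", "data/c", "data/d", "data/e"}
	}
	// First request walks the "directory"; the second resumes from the token.
	page1, token := cache.page("bucket", "data/", "", 2, scan)
	page2, _ := cache.page("bucket", "data/", token, 2, scan)
	fmt.Println(len(page1), len(page2), scans) // 2 2 1
}
```

In this sketch the (prefix, continuationToken) bookkeeping from step 2 is implicit: any token that sorts into the cached list resumes from the right position, even if the client retries a page.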

It's just about keeping in mind what we have just returned to the client or computed in the backend. So actually it's about "caching". ;)

A first evaluation can be done on a single node system. So currently we don't need to think about the clustered case, where caching would become way more complex.

We know that this task comes with unsolved technical questions like cache invalidation, so we need to discuss them as part of this issue.

Maybe there are also other ways to speed things up by avoiding unnecessary operations on the file system...

Describe alternatives you've considered
An alternative could be to introduce an "offset" logic in the underlying file system API. But this would be a proprietary implementation.
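A rough Go sketch of what such an offset API might look like (`listDirAt` is a hypothetical name, not an existing function). Note that a numeric offset is only meaningful to this one implementation and goes stale when entries are created or deleted between pages, which is part of why it would be proprietary rather than standard S3 semantics:

```go
package main

import "fmt"

// listDirAt is a hypothetical offset-based variant of ListDir: instead of
// an opaque ContinuationToken, the caller passes a numeric offset and the
// backend skips that many sorted entries before reading the next page.
func listDirAt(entries []string, offset, count int) []string {
	if offset >= len(entries) {
		return nil
	}
	end := offset + count
	if end > len(entries) {
		end = len(entries)
	}
	return entries[offset:end]
}

func main() {
	entries := []string{"a", "b", "c", "d", "e"}
	fmt.Println(listDirAt(entries, 3, 2)) // [d e]
}
```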

Additional context
The evaluation should be branched from commit 9631eed.
