Add vertical scaling and SoftReference for snapshot repository data cache #16489

Merged
1 commit merged into opensearch-project:main on Nov 12, 2024

Conversation

inpink (Contributor) commented Oct 27, 2024

Description

Background

Currently, snapshot repository data is not cached if its compressed size exceeds 500KB.
This static limit causes repeated downloads in operations such as clone, restore, and status checks, increasing latency.
The limit has not been adjusted for larger heap sizes, which affects repositories with numerous snapshots, and it does not change with vertical or horizontal scaling, restricting caching efficiency even when system resources grow.


Changes

The new setting, snapshot.repository_data.cache.threshold, allows users to adjust the cache limit as an absolute size or as a percentage of the heap. To keep cache sizing consistent across OpenSearch, I referenced existing configurations such as indices.requests.cache.size and indices.fielddata.cache.size.
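As an illustration, the setting could be declared in opensearch.yml as shown below; the values are examples taken from the E2E tests later in this description, not recommendations:

```yaml
# opensearch.yml -- snapshot.repository_data.cache.threshold is a static setting,
# so changing it requires a node restart.

# Absolute size (must be at least 500kb and no larger than the computed maximum):
snapshot.repository_data.cache.threshold: 1024kb

# Alternatively, a percentage of the node's heap:
# snapshot.repository_data.cache.threshold: 1%
```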

  • snapshot.repository_data.cache.threshold (String): Defines the maximum threshold of the snapshot repository data cache. This value can be specified either as an absolute value (e.g., 2GB) or as a percentage of the node’s heap memory (e.g., 3%). It is a static setting, meaning it must be defined in the opensearch.yml file.

  • Based on the discussions, I set the user-selectable cache size range to 500kb ... min(max(500kb, 1% of heap memory), Integer.MAX_VALUE - 8), with a default of min(max(500kb, 1% of heap memory), Integer.MAX_VALUE - 8).

  • For implementing the maximum limit, I referred to the knn plugin and its knn.model.cache.size.limit setting.
    (It would be helpful if the Setting class could directly support maximum memory size limits in the future.)

  • I applied a [SoftReference](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/ref/SoftReference.html) to latestKnownRepositoryData, which is already managed with an AtomicReference for thread safety. The approach of combining both references was inspired by Gradle’s implementation (a minimal sketch of the combined pattern follows this list).

  • The previous warning threshold for cached data was 5MB; it has now been changed to issue a warning at min(configured limit * 10, Integer.MAX_VALUE - 8).

  • I believed a static setting, rather than a dynamic one, was more suitable here to avoid further amplifying the unpredictability of SoftReference.
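For readers unfamiliar with the combined pattern mentioned above, here is a minimal, self-contained sketch of wrapping a SoftReference inside an AtomicReference; the class and method names are illustrative only and are not the actual BlobStoreRepository code:

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: the AtomicReference gives thread-safe swaps of the cached value,
// while the inner SoftReference lets the GC reclaim the cached bytes under heap
// pressure instead of pinning them in memory.
class RepositoryDataCacheSketch<T> {

    // Initialized with a SoftReference around null so readers never see a null outer reference.
    private final AtomicReference<SoftReference<T>> cache =
        new AtomicReference<>(new SoftReference<T>(null));

    // Cache a freshly loaded value; a later reader may still find it cleared by the GC.
    void put(T value) {
        cache.set(new SoftReference<>(value));
    }

    // Returns the cached value, or null if it was never set or has been reclaimed.
    T getOrNull() {
        return cache.get().get();
    }
}
```

This mirrors the behavior exercised by the tests below: the outer reference stays non-null, while the inner referent may be cleared under memory pressure.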

(I plan to contribute documentation for this feature to the OpenSearch project. I have already opened an issue for documentation contribution.)


Testing

Setting Tests

I conducted tests in MemorySizeSettingsTests to verify that the snapshot.repository_data.cache.threshold option allocates memory as intended. I also confirmed that exceptions are triggered if the specified percentage exceeds the maximum limit or falls below the minimum limit.

E2E Test with Docker

I assembled the modified code and ran it in Docker for end-to-end testing to confirm that everything functions properly.
(If any additional data is needed for these E2E tests, I am happy to provide it at any time.)

[Screenshots from the E2E runs show the effective cache threshold for each configuration: the default (1% of heap), a value in range (1024KiB), a value over the limit (10%), and a value below the minimum (499KiB).]

Cache Testing

I wrote test code to confirm that the SoftReference works as expected. However, since latestKnownRepositoryData is private, I temporarily changed its access level to protected in my local environment, and confirmed that the tests passed.

  1. testCachingRepositoryData: Verifies that caching works as expected in repository.latestKnownRepositoryData.
  2. testSoftReferenceRepositoryDataCacheCleared: Confirms that the cached data is cleared under memory pressure when nothing else holds a strong reference to it.
    public void testCachingRepositoryData() throws Exception {
        final BlobStoreRepository repository = setupRepo();
        final long pendingGeneration = repository.metadata.pendingGeneration();

        // Check the initial state of repository data and cache status
        assertThat(OpenSearchBlobStoreRepositoryIntegTestCase.getRepositoryData(repository)
            .getSnapshotIds().size(), equalTo(0));
        final RepositoryData emptyData = RepositoryData.EMPTY;
        writeIndexGen(repository, emptyData, emptyData.getGenId());
        RepositoryData repoData = OpenSearchBlobStoreRepositoryIntegTestCase.getRepositoryData(
            repository);

        // Verify caching status
        SoftReference<Tuple<Long, BytesReference>> cachedRef = repository.latestKnownRepositoryData.get();
        assertNotNull(cachedRef);
        Tuple<Long, BytesReference> cachedData = cachedRef.get();
        assertNotNull(cachedData);

        // Verify data consistency
        assertEquals(repoData, emptyData);
        assertEquals(repoData.getIndices().size(), 0);
        assertEquals(repoData.getSnapshotIds().size(), 0);
        assertEquals(pendingGeneration + 1L, repoData.getGenId());
    }

    public void testSoftReferenceRepositoryDataCacheCleared() throws Exception {
        final BlobStoreRepository repository = setupRepo();

        // Check the initial state of repository data and cache status
        assertThat(OpenSearchBlobStoreRepositoryIntegTestCase.getRepositoryData(repository)
            .getSnapshotIds().size(), equalTo(0));
        final RepositoryData emptyData = RepositoryData.EMPTY;
        writeIndexGen(repository, emptyData, emptyData.getGenId());

        // Induce memory pressure to test SoftReference eviction
        try {
            final ArrayList<Object[]> allocations = new ArrayList<>();
            while (true) {
                int size = (int) Math.min(Runtime.getRuntime().freeMemory() / 2, Integer.MAX_VALUE);
                allocations.add(new Object[size]);
            }
        } catch (OutOfMemoryError e) {
            // Memory shortage situation occurred
            System.out.println("Memory pressure caused OutOfMemoryError as expected");
        }

        System.gc();

        // Check if cache has been evicted
        SoftReference<Tuple<Long, BytesReference>> cachedRef = repository.latestKnownRepositoryData.get();
        assertNotNull(cachedRef);
        Tuple<Long, BytesReference> cachedData = cachedRef.get();
        assertNull(cachedData);
    }

Closing Thoughts

I’m very pleased to have contributed to OpenSearch; it was an exciting feature to add! Participating in discussions with excellent maintainers to provide a robust caching feature for users has been an incredibly valuable experience for me. If there’s anything that needs adjustment, I’ll make the changes as quickly as possible.

Related Issues

Resolves #16298

Check List

  • Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot (Contributor) commented

❌ Gradle check result for e746f68: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

reta (Collaborator) commented Nov 8, 2024

Despite keeping everything up to date, the CI continues to fail. I’ll give it another try in a few hours! :)


> Task :server:compileTestJava
/var/jenkins/workspace/gradle-check/search/server/src/test/java/org/opensearch/repositories/blobstore/BlobStoreRepositoryTests.java:661: warning: [static] static method should be qualified by type name, BlobStoreRepository, instead of by an expression
        long maxThreshold = repository.calculateMaxSnapshotRepositoryDataCacheThreshold();
                                      ^
error: warnings found and -Werror specified

github-actions bot (Contributor) commented Nov 8, 2024

❌ Gradle check result for a607693: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

inpink force-pushed the 16298-solve branch 3 times, most recently from 2a3cbc5 to 46b46be on November 8, 2024
reta (Collaborator) commented Nov 8, 2024

Thanks @inpink, @andrross LGTY?

reta added the v3.0.0 and v2.19.0 labels on Nov 8, 2024
inpink (Contributor, Author) commented Nov 8, 2024

> Task :server:compileTestJava
> /var/jenkins/workspace/gradle-check/search/server/src/test/java/org/opensearch/repositories/blobstore/BlobStoreRepositoryTests.java:661: warning: [static] static method should be qualified by type name, BlobStoreRepository, instead of by an expression
> long maxThreshold = repository.calculateMaxSnapshotRepositoryDataCacheThreshold();
>                               ^
> error: warnings found and -Werror specified

Oh, wow, thank you so much. I hadn’t caught this error when running with JUnit. In my future contributions, I’ll be sure to verify tests with Gradle before pushing!

github-actions bot (Contributor) commented Nov 8, 2024

✅ Gradle check result for 46b46be: SUCCESS

…data cache

- Applies `SoftReference` to cached repository data for efficient memory management under heap pressure.
- Enables cache size configuration in `opensearch.yml`, adjustable within a range of 500KB to 1% of heap memory.
- Sets the default cache size to `Math.max(ByteSizeUnit.KB.toBytes(500), CACHE_MAX_THRESHOLD / 2)` so it’s generally proportional to heap size. In cases where 1% of the heap is less than 1000KB, indicating a low-memory environment, the default reverts to 500KB as before.
- Since `BytesReference` internally uses `byte[]`, the compressed array size is capped at `Integer.MAX_VALUE - 8` to ensure compatibility with JDK limitations on array sizes. Therefore, the maximum cache size cannot exceed this limit.

Signed-off-by: inpink <[email protected]>
github-actions bot (Contributor) commented
✅ Gradle check result for eeb0be0: SUCCESS

reta merged commit 53d41d3 into opensearch-project:main on Nov 12, 2024
37 of 38 checks passed
reta added the backport 2.x label and removed the skip-changelog label on Nov 12, 2024
opensearch-trigger-bot (Contributor) commented

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-16489-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 53d41d3fac28c9f72d5883467d5a6211ad09feac
# Push it to GitHub
git push --set-upstream origin backport/backport-16489-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-16489-to-2.x.

reta pushed a commit to reta/OpenSearch that referenced this pull request Nov 12, 2024
…data cache (opensearch-project#16489)

(cherry picked from commit 53d41d3)
reta pushed a commit to reta/OpenSearch that referenced this pull request Nov 12, 2024
…data cache (opensearch-project#16489)

(cherry picked from commit 53d41d3)
Signed-off-by: Andriy Redko <[email protected]>
injae-kim commented

Nice work @inpink!! 👍👍👍

inpink (Contributor, Author) commented Nov 12, 2024

Thank you so much, @injae-kim! 🎉
I really appreciate the encouragement and advice! 🙏 I’ll keep striving to grow and improve! 😊

dbwiddis pushed a commit that referenced this pull request Nov 12, 2024
…data cache (#16489) (#16624)

(cherry picked from commit 53d41d3)

Signed-off-by: inpink <[email protected]>
Signed-off-by: Andriy Redko <[email protected]>
Co-authored-by: inpink <[email protected]>
Labels
backport 2.x · backport-failed · enhancement · good first issue · Storage:Snapshots · v2.19.0 · v3.0.0