
Remove FileWatcher from KNN. #2182

Merged

Conversation

0ctopus13prime
Contributor

Description

Design doc: #1885 (comment)

This PR removes FileWatcher from KNN.
FileWatcher periodically monitors a vector index file and cleans up its resources once the file is deleted.
However, as outlined in the doc above, releasing allocated native memory is already handled in the 80DV Producer. Therefore, we can safely remove FileWatcher.
Since the newest version now uses the vector format, I've added the same cleanup logic there as well.
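
To illustrate the pattern (a minimal sketch, assuming hypothetical names; the actual k-NN producer classes differ): instead of a watcher polling for file deletion, the producer releases its native memory when Lucene closes it.

```java
import java.io.Closeable;

// Illustrative sketch only: all names below are assumptions, not the
// actual opensearch-project/k-NN API.
final class VectorIndexProducer implements Closeable {

    /** Stand-in for the plugin's native-memory bookkeeping. */
    interface NativeMemory {
        void free();
    }

    private final NativeMemory nativeIndexMemory;

    VectorIndexProducer(NativeMemory nativeIndexMemory) {
        this.nativeIndexMemory = nativeIndexMemory;
    }

    @Override
    public void close() {
        // When Lucene drops a segment (e.g., after a merge or deletion),
        // it closes the producer; the off-heap index is released here
        // instead of a watcher polling for the file to disappear.
        nativeIndexMemory.free();
    }
}
```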

Related Issues

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on the Developer Certificate of Origin and signing off your commits, please check here.

@0ctopus13prime
Contributor Author

@jmazanec15
Please note that this is the initial PR, posted for early code sharing.
Will add integration tests + unit tests shortly.

@0ctopus13prime
Contributor Author

Memory usage monitoring

During the benchmark, the memory usage stayed pretty much the same as the baseline.

Candidate

[Screenshot 2024-10-15 at 9:57:05 AM: candidate memory usage]

Baseline

[Screenshot kill-file-watcher-baseline: baseline memory usage]

Collaborator

@shatejas left a comment


Do verify the latencies as well, since you are already benchmarking.

@0ctopus13prime
Contributor Author

> Do verify the latencies as well, since you are already benchmarking.

Yeah, that's the plan.
I've been stuck fixing some subtle issues, and I think it's now good to go for benchmarking.
I'm not expecting a severe performance regression, but I will post the comparison in this issue.

@0ctopus13prime
Contributor Author

Please note that NativeIndexWriter still has an FSDirectory dependency.
This will be dealt with in the next PR, as part of introducing a writing layer in native engines.

@0ctopus13prime
Contributor Author

Found a few test cases that were not updated according to this change.
Will fix the test cases and update the PR.

@0ctopus13prime
Contributor Author

Performance Benchmark Comparison

Conclusion: I could not find any indication that this change degrades performance.

| Metric | Task | Baseline | Candidate | (Candidate - Baseline) / Baseline | Unit |
| --- | --- | --- | --- | --- | --- |
| Cumulative indexing time of primary shards | | 30.4969 | 30.3151 | -0.005961262 | min |
| Min cumulative indexing time across primary shards | | 0.00015 | 0.000133333 | -0.111113333 | min |
| Median cumulative indexing time across primary shards | | 15.2484 | 15.1575 | -0.005961281 | min |
| Max cumulative indexing time across primary shards | | 30.4967 | 30.315 | -0.005958022 | min |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | 0 | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | 0 | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | 0 | 0 | min |
| Cumulative merge time of primary shards | | 175.918 | 172.859 | -0.017388783 | min |
| Cumulative merge count of primary shards | | 117 | 116 | -0.008547009 | |
| Min cumulative merge time across primary shards | | 0 | 0 | 0 | min |
| Median cumulative merge time across primary shards | | 87.9589 | 86.4296 | -0.017386529 | min |
| Max cumulative merge time across primary shards | | 175.918 | 172.859 | -0.017388783 | min |
| Cumulative merge throttle time of primary shards | | 0.8007 | 0.89955 | 0.123454477 | min |
| Min cumulative merge throttle time across primary shards | | 0 | 0 | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0.40035 | 0.449775 | 0.123454477 | min |
| Max cumulative merge throttle time across primary shards | | 0.8007 | 0.89955 | 0.123454477 | min |
| Cumulative refresh time of primary shards | | 1.2548 | 1.44825 | 0.154167995 | min |
| Cumulative refresh count of primary shards | | 90 | 88 | -0.022222222 | |
| Min cumulative refresh time across primary shards | | 0.0003 | 0.0003 | 0 | min |
| Median cumulative refresh time across primary shards | | 0.6274 | 0.724125 | 0.154167995 | min |
| Max cumulative refresh time across primary shards | | 1.2545 | 1.44795 | 0.154204862 | min |
| Cumulative flush time of primary shards | | 12.2934 | 11.8517 | -0.035929849 | min |
| Cumulative flush count of primary shards | | 58 | 58 | 0 | |
| Min cumulative flush time across primary shards | | 0 | 0 | 0 | min |
| Median cumulative flush time across primary shards | | 6.14672 | 5.92586 | -0.035931359 | min |
| Max cumulative flush time across primary shards | | 12.2934 | 11.8517 | -0.035929849 | min |
| Total Young Gen GC time | | 0.332 | 0.413 | 0.243975904 | s |
| Total Young Gen GC count | | 18 | 18 | 0 | |
| Total Old Gen GC time | | 0 | 0 | 0 | s |
| Total Old Gen GC count | | 0 | 0 | 0 | |
| Store size | | 29.8184 | 29.8183 | -3.35363E-06 | GB |
| Translog size | | 5.82E-07 | 5.82E-07 | 0 | GB |
| Heap used for segments | | 0 | 0 | 0 | MB |
| Heap used for doc values | | 0 | 0 | 0 | MB |
| Heap used for terms | | 0 | 0 | 0 | MB |
| Heap used for norms | | 0 | 0 | 0 | MB |
| Heap used for points | | 0 | 0 | 0 | MB |
| Heap used for stored fields | | 0 | 0 | 0 | MB |
| Segment count | | 2 | 2 | 0 | |
| Min Throughput | custom-vector-bulk | 7290.25 | 3034.92 | -0.583701519 | docs/s |
| Mean Throughput | custom-vector-bulk | 11204.8 | 11096 | -0.009710124 | docs/s |
| Median Throughput | custom-vector-bulk | 10741.7 | 10421.3 | -0.029827681 | docs/s |
| Max Throughput | custom-vector-bulk | 21616.1 | 20570.6 | -0.048366727 | docs/s |
| 50th percentile latency | custom-vector-bulk | 80.3333 | 80.4247 | 0.00113776 | ms |
| 90th percentile latency | custom-vector-bulk | 166.626 | 164.697 | -0.011576825 | ms |
| 99th percentile latency | custom-vector-bulk | 315.914 | 319.73 | 0.012079237 | ms |
| 99.9th percentile latency | custom-vector-bulk | 1505.69 | 1567.09 | 0.040778646 | ms |
| 99.99th percentile latency | custom-vector-bulk | 4159.22 | 4244.81 | 0.020578378 | ms |
| 100th percentile latency | custom-vector-bulk | 6280.65 | 7164.91 | 0.14079116 | ms |
| 50th percentile service time | custom-vector-bulk | 80.3333 | 80.4247 | 0.00113776 | ms |
| 90th percentile service time | custom-vector-bulk | 166.626 | 164.697 | -0.011576825 | ms |
| 99th percentile service time | custom-vector-bulk | 315.914 | 319.73 | 0.012079237 | ms |
| 99.9th percentile service time | custom-vector-bulk | 1505.69 | 1567.09 | 0.040778646 | ms |
| 99.99th percentile service time | custom-vector-bulk | 4159.22 | 4244.81 | 0.020578378 | ms |
| 100th percentile service time | custom-vector-bulk | 6280.65 | 7164.91 | 0.14079116 | ms |
| error rate | custom-vector-bulk | 0 | 0 | 0 | % |
| Min Throughput | force-merge-segments | 0 | 0 | 0 | ops/s |
| Mean Throughput | force-merge-segments | 0 | 0 | 0 | ops/s |
| Median Throughput | force-merge-segments | 0 | 0 | 0 | ops/s |
| Max Throughput | force-merge-segments | 0 | 0 | 0 | ops/s |
| 100th percentile latency | force-merge-segments | 6.17E+06 | 6.10E+06 | -0.011397476 | ms |
| 100th percentile service time | force-merge-segments | 6.17E+06 | 6.10E+06 | -0.011397476 | ms |
| error rate | force-merge-segments | 0 | 0 | 0 | % |
| Min Throughput | warmup-indices | 0.23 | 0.23 | 0 | ops/s |
| Mean Throughput | warmup-indices | 0.23 | 0.23 | 0 | ops/s |
| Median Throughput | warmup-indices | 0.23 | 0.23 | 0 | ops/s |
| Max Throughput | warmup-indices | 0.23 | 0.23 | 0 | ops/s |
| 100th percentile latency | warmup-indices | 4423.45 | 4382.95 | -0.009155749 | ms |
| 100th percentile service time | warmup-indices | 4423.45 | 4382.95 | -0.009155749 | ms |
| error rate | warmup-indices | 0 | 0 | 0 | % |
| Min Throughput | prod-queries | 0.67 | 0.67 | 0 | ops/s |
| Mean Throughput | prod-queries | 0.67 | 1 | 0.492537313 | ops/s |
| Median Throughput | prod-queries | 0.67 | 0.67 | 0 | ops/s |
| Max Throughput | prod-queries | 0.67 | 1.66 | 1.47761194 | ops/s |
| 50th percentile latency | prod-queries | 8.23135 | 8.26458 | 0.004037005 | ms |
| 90th percentile latency | prod-queries | 9.44003 | 9.03913 | -0.042468085 | ms |
| 99th percentile latency | prod-queries | 26.661 | 26.0028 | -0.024687746 | ms |
| 100th percentile latency | prod-queries | 1499.85 | 1482.8 | -0.011367803 | ms |
| 50th percentile service time | prod-queries | 8.23135 | 8.26458 | 0.004037005 | ms |
| 90th percentile service time | prod-queries | 9.44003 | 9.03913 | -0.042468085 | ms |
| 99th percentile service time | prod-queries | 26.661 | 26.0028 | -0.024687746 | ms |
| 100th percentile service time | prod-queries | 1499.85 | 1482.8 | -0.011367803 | ms |
| error rate | prod-queries | 0 | 0 | 0 | % |
| Mean recall@k | prod-queries | 0.3 | 0.29 | -0.033333333 | |
| Mean recall@1 | prod-queries | 0.37 | 0.28 | -0.243243243 | |
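
For reference, the relative-difference column is computed as (Candidate - Baseline) / Baseline. For example, for cumulative merge time of primary shards: (172.859 - 175.918) / 175.918 = -3.059 / 175.918 ≈ -0.01739, i.e., about 1.7% less merge time in the candidate.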

CHANGELOG.md (outdated review thread, resolved)
```diff
@@ -562,8 +562,10 @@ public static boolean isFaissAVX2Disabled() {

 public static boolean isFaissAVX512Disabled() {
     return Booleans.parseBoolean(
         KNNSettings.state().getSettingValue(KNNSettings.KNN_FAISS_AVX512_DISABLED).toString(),
```
Contributor Author

KNNSettings.state().getSettingValue(KNNSettings.KNN_FAISS_AVX512_DISABLED) can return null, so we should not call toString() on it without a null check.
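
A minimal null-safe sketch of that call might look like the following; the `false` default when the setting is unset is an assumption for illustration, not necessarily the plugin's actual fallback:

```java
// Sketch only: mirrors the snippet above; the default of `false` when the
// setting has no value is an assumption for illustration.
public static boolean isFaissAVX512Disabled() {
    Object disabled = KNNSettings.state().getSettingValue(KNNSettings.KNN_FAISS_AVX512_DISABLED);
    if (disabled == null) {
        return false; // assumed default when the setting is unset
    }
    return Booleans.parseBoolean(disabled.toString(), false);
}
```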

@0ctopus13prime force-pushed the decomission-file-watcher branch 2 times, most recently from 1a52a1b to 6c4fe69 on October 21, 2024 18:11
Collaborator

@shatejas left a comment


Looks good, waiting for the CI to pass

Member

@jmazanec15 left a comment


LGTM

@jmazanec15 merged commit e5599aa into opensearch-project:main on Oct 21, 2024
29 of 30 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

```bash
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-2182-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 e5599aa151cf43700d0ea327ad536cc106e5bb61
# Push it to GitHub
git push --set-upstream origin backport/backport-2182-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x
```

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-2182-to-2.x.

@jmazanec15 added the Refactoring label (Improve the design, structure, and implementation while preserving its functionality) on Oct 21, 2024
@0ctopus13prime
Copy link
Contributor Author

Looks like auto backporting is failing. Will raise a backport PR shortly.

@0ctopus13prime deleted the decomission-file-watcher branch October 21, 2024 22:58
@0ctopus13prime restored the decomission-file-watcher branch October 21, 2024 22:58
0ctopus13prime added a commit to 0ctopus13prime/k-NN that referenced this pull request Oct 21, 2024
Signed-off-by: Dooyong Kim <[email protected]>

(cherry picked from commit e5599aa)
@0ctopus13prime
Contributor Author

Backport PR: #2225

0ctopus13prime added a commit to 0ctopus13prime/k-NN that referenced this pull request Oct 22, 2024
Signed-off-by: Dooyong Kim <[email protected]>

(cherry picked from commit e5599aa)
naveentatikonda pushed a commit that referenced this pull request Oct 22, 2024
Signed-off-by: Dooyong Kim <[email protected]>

(cherry picked from commit e5599aa)
Labels: backport 2.x, Refactoring