Fix OOM issue, prevent caching, and fix miner sync bug #99
Pull Request Description
This PR addresses the following problems:
The OOM Issue
This issue occurs on validators because the current code spawns a thread for each batch of validator requests sent to miners, which can result in over 1000 threads calling the embedding transformer. Although the model is small, each thread allocates its own memory and does not reuse memory from other threads until it is joined (which happens only after all threads have been spawned). This leads to OOM (Out of Memory) errors or pins the validator's GPU VRAM utilization at 99%, even though computational utilization is very low (around 5%).
This problem can be effectively solved by constraining the number of concurrent threads. The implementation in this PR proposes using a ThreadPool. Check out the details in the validator's forward method.
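The two mitigations (a bounded pool, plus a sleep between batches that grows as work queues up, detailed below) can be sketched as follows. The cap value, function names, and the linear sleep formula are illustrative assumptions, not the PR's exact code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # illustrative cap; tune to the GPU's memory budget


def time_to_sleep(batch_index: int, num_batches: int, base: float = 1.0) -> float:
    # Early batches sleep briefly; later batches sleep longer as more
    # work queues up behind the pool. The linear ramp is illustrative.
    return base * (batch_index + 1) / num_batches


def forward_batches(batches, embed_fn):
    # A bounded pool reuses MAX_WORKERS threads instead of spawning one
    # thread per batch, which previously piled up 1000+ live threads,
    # each holding its own transformer allocations.
    futures = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        for i, batch in enumerate(batches):
            futures.append(pool.submit(embed_fn, batch))
            time.sleep(time_to_sleep(i, len(batches), base=0.01))
        return [f.result() for f in futures]
```

With a pool, peak memory scales with `MAX_WORKERS` rather than with the number of outstanding batches.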
Additional details:
A constant `time_to_sleep` per batch does not improve the situation. Instead, this PR proposes a different formulation for `time_to_sleep`, where the sleep duration between early batches is shorter and grows as more batches wait to be processed by the thread pool.
The Caching Issue
This issue arises because, within a batch, there are multiple minibatches. Each minibatch sends the same synapse to multiple miners. Participants running multiple miners can exploit this by returning a cached result when one of their miners receives a synapse that another of their miners has already processed. This makes scoring unfair, as caching effectively reduces processing time to nearly zero, giving exploiting participants a significant advantage in terms of higher scores and lower serving costs. Additionally, it lowers the creativity of the subnet since the same results are returned multiple times, making them less meaningful and valuable.
To address this, this PR introduces a penalty mechanism. When scoring, we consider not only individual results that should be rewarded but also the entire minibatch's results. The idea is to check if a response is similar to another within the same minibatch using a similarity metric that combines embedding similarity and Levenshtein distance. High similarity suggests the result was likely copied/cached and returned multiple times within the minibatch.
Since responses are for the same problem, they may look alike. We set a high similarity threshold (default: 0.95) to minimize false positives. Check out the implementation here and how penalties are accounted for in the final score/reward here.
More details:
The similarity check is applied to the `logic_reasoning` field, as it is long enough for meaningful comparison and varies among different miners.
Check out this plot illustrating how the chosen threshold recognizes similarity in an experiment with a minibatch of 4 on 1000 requests using the default miner code:
This method still results in a false positive rate of ~2.85%, but manual verification shows that all responses with a penalty higher than 0.95 are indeed very similar to others in the same minibatch. Miners should avoid penalties by improving their code to be more creative.
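The penalty check described above can be sketched as below. The function names, the pure-Python Levenshtein implementation, and the equal weighting of the two similarity terms are assumptions; the PR combines embedding similarity with Levenshtein distance behind the 0.95 threshold:

```python
def levenshtein_ratio(a: str, b: str) -> float:
    # Classic dynamic-programming edit distance, normalized to a
    # similarity in [0, 1] (1.0 means the strings are identical).
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def cached_pairs(responses, embed_sims=None, threshold=0.95, alpha=0.5):
    # Flag response pairs within a minibatch whose combined similarity
    # exceeds the threshold. embed_sims[i][j] would come from the
    # embedding model; it defaults to the Levenshtein score here.
    flagged = []
    n = len(responses)
    for i in range(n):
        for j in range(i + 1, n):
            lev = levenshtein_ratio(responses[i], responses[j])
            emb = embed_sims[i][j] if embed_sims else lev
            if alpha * emb + (1 - alpha) * lev >= threshold:
                flagged.append((i, j))
    return flagged
```

Flagged pairs would then feed into the penalty term that lowers the final score/reward of the matching responses.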
Fixing Miners' Sync Issue
Currently, miners sync by comparing the current block with the `last_update` block (ref). This approach is incorrect because `last_update` records the last block at which a participant set weights, which miners never do. As a result, miners attempt to resync with the metagraph every second.
This PR resolves the issue by making miners sync once per epoch. See the implementation here.