Fix OOM issue, prevent caching and miner sync bugfix #99

Open · wants to merge 8 commits into main

Conversation

@haihp02 (Collaborator) commented Feb 23, 2025

Pull Request Description

This PR solves several problems, described below:

The OOM Issue

This issue occurs on validators because the current code spawns a thread for each batch of requests sent to miners, which can result in over 1,000 threads calling the embedding transformer. Although the model is small, each thread allocates its own memory and cannot reuse memory from other threads until it is joined (which only happens after all threads have been launched). This leads to OOM (Out of Memory) errors or keeps the validator's GPU VRAM utilization at 99% even though compute utilization is very low (around 5%).

This problem can be effectively solved by constraining the number of concurrent threads. This PR proposes using a ThreadPool; check out the details in the validator's forward method. A minimal sketch of the approach follows the list below.

Additional details:

  • Due to a memory leak issue, increasing time_to_sleep per batch does not improve the situation. Instead, this PR proposes a different formulation for time_to_sleep, where the sleep between early batches is short and grows as more batches are waiting to be processed by the thread pool.
  • Increasing the batch size can reduce the number of concurrent threads, but this may not scale in the future, and running too many threads should be avoided anyway. This PR sets the default number of workers in the thread pool to 32. This number was chosen considering that (due to Python's GIL) only one thread can execute Python code at a time; 32 workers are enough to overlap blocking IO and GPU computation without slowing down the forward process. Validators with better hardware may consider increasing this value.
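
The sketch below shows the general shape of the fix. It is not the PR's exact code: `query_miners`, `batches`, and `BASE_SLEEP` are hypothetical stand-ins for the validator's actual forward logic and tuning, and the PR's back-off formula may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 32   # PR default; validators with better hardware may raise this
BASE_SLEEP = 0.5   # hypothetical base sleep (seconds) between batch submissions


def forward_all_batches(batches, query_miners):
    """Submit every batch to a bounded pool instead of one thread per batch."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = []
        for batch in batches:
            futures.append(pool.submit(query_miners, batch))
            # Sleep grows with the number of batches still waiting on the
            # pool, so early batches go out quickly and later ones back off.
            pending = sum(not f.done() for f in futures)
            time.sleep(BASE_SLEEP * max(1.0, pending / MAX_WORKERS))
        return [f.result() for f in futures]
```

Bounding the pool at 32 workers caps the peak memory claimed by the embedding model while still overlapping network IO and GPU work across batches.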

The Caching Issue

This issue arises because a batch contains multiple minibatches, and each minibatch sends the same synapse to multiple miners. Participants running multiple miners can exploit this by returning a cached result when one of their miners receives a synapse another of their miners has already processed. This makes scoring unfair: caching reduces processing time to nearly zero, giving exploiting participants higher scores and lower serving costs. It also lowers the creativity of the subnet, since the same results are returned multiple times and become less meaningful and valuable.

To address this, this PR introduces a penalty mechanism. When scoring, we consider not only individual results that should be rewarded but also the entire minibatch's results: we check whether a response is similar to another within the same minibatch using a similarity metric that combines embedding similarity and Levenshtein distance. High similarity suggests the result was likely copied/cached and returned multiple times within the minibatch.

Since responses are for the same problem, they may naturally look alike, so we set a high similarity threshold (default: 0.95) to minimize false positives. Check out the implementation here and how penalties are accounted for in the final score/reward here. A minimal sketch of the check is shown below.
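
This sketch assumes the python-Levenshtein package and an `embed` callable (e.g. a sentence-transformers encoder); the equal weighting of the two similarities is illustrative, not necessarily the PR's exact formula.

```python
import numpy as np
import Levenshtein  # pip install python-Levenshtein

SIMILARITY_THRESHOLD = 0.95  # PR default


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def minibatch_penalties(texts, embed):
    """For each logic_reasoning text, return its highest combined similarity
    to any other response in the same minibatch."""
    vecs = [embed(t) for t in texts]
    penalties = []
    for i, (t_i, v_i) in enumerate(zip(texts, vecs)):
        best = 0.0
        for j, (t_j, v_j) in enumerate(zip(texts, vecs)):
            if i == j:
                continue
            combined = 0.5 * cosine(v_i, v_j) + 0.5 * Levenshtein.ratio(t_i, t_j)
            best = max(best, combined)
        penalties.append(best)
    return penalties


# Responses whose penalty crosses the threshold are treated as cached/copied,
# e.g. by zeroing their reward:
# rewards = [0.0 if p > SIMILARITY_THRESHOLD else r
#            for p, r in zip(penalties, rewards)]
```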

More details:

  • The field compared in the miner's response is logic_reasoning, as it is long enough for a meaningful comparison and varies among miners.
  • Embedding similarity and Levenshtein distance are used in combination because neither is sufficient alone: embedding similarity focuses on context and tends to score high, making it prone to false positives, while Levenshtein distance ignores context, making it easy to bypass.
  • We could make all requests to each miner completely different by setting the minibatch size to 1 without slowing down the forward process significantly, but that would result in much higher serving costs for validators.

Check out this plot illustrating how the chosen threshold separates similar responses, from an experiment with a minibatch size of 4 over 1,000 requests using the default miner code:

[Figure: penalty_histogram]

This method still has a false positive rate of ~2.85%, but manual verification shows that all responses with a penalty above 0.95 are indeed very similar to others in the same minibatch. Miners can avoid penalties by improving their code to produce more creative responses.

Fixing Miners' Sync Issue

Currently, miners sync by comparing the current block with the last_update block (ref). This is incorrect because last_update records the block at which a participant last set weights, which miners never do. As a result, miners attempt to sync the metagraph every second.

This PR resolves the issue by making miners sync once per epoch. See the implementation here; a minimal sketch follows.
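
This sketch assumes bittensor-style `subtensor`/`metagraph` objects and a hypothetical `EPOCH_LENGTH` equal to the subnet's tempo in blocks; the PR's actual implementation may differ.

```python
EPOCH_LENGTH = 360  # hypothetical: the subnet's tempo, in blocks


class MinerSync:
    def __init__(self, subtensor, metagraph):
        self.subtensor = subtensor
        self.metagraph = metagraph
        self.last_sync_block = 0

    def should_sync_metagraph(self) -> bool:
        # Re-sync only once per epoch, instead of comparing the current
        # block against last_update, which miners never advance because
        # they do not set weights.
        return (self.subtensor.get_current_block()
                - self.last_sync_block) >= EPOCH_LENGTH

    def maybe_sync(self):
        if self.should_sync_metagraph():
            self.metagraph.sync(subtensor=self.subtensor)
            self.last_sync_block = self.subtensor.get_current_block()
```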
