Fix OOM issue, prevent caching and miner sync bugfix #99

Open · wants to merge 8 commits into main

Conversation

@haihp02 (Collaborator) commented Feb 23, 2025

Pull Request Description

This PR solves several problems, described below:

The OOM Issue

This issue occurs on validators because the current code spawns a thread for each batch of requests sent to miners, which can result in over 1,000 threads calling the embedding transformer. Although the model is small, each thread allocates its own memory and cannot reuse memory from other threads until it is joined (which only happens after all threads have been launched). This leads to OOM (Out of Memory) errors or keeps the validator's GPU VRAM utilization at 99% even though compute utilization is very low (around 5%).

This problem can be effectively solved by constraining the number of concurrent threads. This PR proposes using a ThreadPool; check out the details in the validator's forward method. A minimal sketch of the approach follows the list below.

Additional details:

  • Due to a memory leak issue, increasing time_to_sleep per batch does not improve the situation. Instead, this PR proposes a different formulation for time_to_sleep, where the sleep between early batches is short and grows as more batches are waiting to be processed by the thread pool.
  • Increasing the batch size can reduce the number of concurrent threads, but this may not scale in the future, and running too many threads should be avoided anyway. This PR sets the default number of workers in the thread pool to 32. This number was chosen considering that (due to Python's GIL) only one thread can execute Python code at a time; 32 workers are enough to overlap blocking IO and GPU computation without slowing down the forward process. Validators with better hardware may consider increasing this value.
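
The sketch below shows the general shape of the fix. It is not the PR's exact code: `query_miners`, `batches`, and `BASE_SLEEP` are hypothetical stand-ins for the validator's actual forward logic and tuning, and the PR's back-off formula may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 32   # PR default; validators with better hardware may raise this
BASE_SLEEP = 0.5   # hypothetical base sleep (seconds) between batch submissions


def forward_all_batches(batches, query_miners):
    """Submit every batch to a bounded pool instead of one thread per batch."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = []
        for batch in batches:
            futures.append(pool.submit(query_miners, batch))
            # Sleep grows with the number of batches still waiting on the
            # pool, so early batches go out quickly and later ones back off.
            pending = sum(not f.done() for f in futures)
            time.sleep(BASE_SLEEP * max(1.0, pending / MAX_WORKERS))
        return [f.result() for f in futures]
```

Bounding the pool at 32 workers caps the peak memory claimed by the embedding model while still overlapping network IO and GPU work across batches.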

The Caching Issue

This issue arises because a batch contains multiple minibatches, and each minibatch sends the same synapse to multiple miners. Participants running multiple miners can exploit this by returning a cached result when one of their miners receives a synapse another of their miners has already processed. This makes scoring unfair: caching reduces processing time to nearly zero, giving exploiting participants higher scores and lower serving costs. It also lowers the creativity of the subnet, since the same results are returned multiple times and become less meaningful and valuable.

To address this, this PR introduces a penalty mechanism. When scoring, we consider not only individual results that should be rewarded but also the entire minibatch's results: we check whether a response is similar to another within the same minibatch using a similarity metric that combines embedding similarity and Levenshtein distance. High similarity suggests the result was likely copied/cached and returned multiple times within the minibatch.

Since responses are for the same problem, they may naturally look alike, so we set a high similarity threshold (default: 0.95) to minimize false positives. Check out the implementation here and how penalties are accounted for in the final score/reward here. A minimal sketch of the check is shown below.
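
This sketch assumes the python-Levenshtein package and an `embed` callable (e.g. a sentence-transformers encoder); the equal weighting of the two similarities is illustrative, not necessarily the PR's exact formula.

```python
import numpy as np
import Levenshtein  # pip install python-Levenshtein

SIMILARITY_THRESHOLD = 0.95  # PR default


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def minibatch_penalties(texts, embed):
    """For each logic_reasoning text, return its highest combined similarity
    to any other response in the same minibatch."""
    vecs = [embed(t) for t in texts]
    penalties = []
    for i, (t_i, v_i) in enumerate(zip(texts, vecs)):
        best = 0.0
        for j, (t_j, v_j) in enumerate(zip(texts, vecs)):
            if i == j:
                continue
            combined = 0.5 * cosine(v_i, v_j) + 0.5 * Levenshtein.ratio(t_i, t_j)
            best = max(best, combined)
        penalties.append(best)
    return penalties


# Responses whose penalty crosses the threshold are treated as cached/copied,
# e.g. by zeroing their reward:
# rewards = [0.0 if p > SIMILARITY_THRESHOLD else r
#            for p, r in zip(penalties, rewards)]
```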

More details:

  • The field compared in the miner's response is logic_reasoning, as it is long enough for a meaningful comparison and varies among miners.
  • Embedding similarity and Levenshtein distance are used in combination because neither is sufficient alone: embedding similarity focuses on context and tends to score high, making it prone to false positives, while Levenshtein distance ignores context, making it easy to bypass.
  • We could make all requests to each miner completely different by setting the minibatch size to 1 without slowing down the forward process significantly, but that would result in much higher serving costs for validators.

Check out this plot illustrating how the chosen threshold separates similar responses, from an experiment with a minibatch size of 4 over 1,000 requests using the default miner code:

[Figure: penalty_histogram]

This method still has a false positive rate of ~2.85%, but manual verification shows that all responses with a penalty above 0.95 are indeed very similar to others in the same minibatch. Miners can avoid penalties by improving their code to produce more creative responses.

Fixing Miners' Sync Issue

Currently, miners sync by comparing the current block with the last_update block (ref). This is incorrect because last_update records the block at which a participant last set weights, which miners never do. As a result, miners attempt to sync the metagraph every second.

This PR resolves the issue by making miners sync once per epoch. See the implementation here; a minimal sketch follows.
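
This sketch assumes bittensor-style `subtensor`/`metagraph` objects and a hypothetical `EPOCH_LENGTH` equal to the subnet's tempo in blocks; the PR's actual implementation may differ.

```python
EPOCH_LENGTH = 360  # hypothetical: the subnet's tempo, in blocks


class MinerSync:
    def __init__(self, subtensor, metagraph):
        self.subtensor = subtensor
        self.metagraph = metagraph
        self.last_sync_block = 0

    def should_sync_metagraph(self) -> bool:
        # Re-sync only once per epoch, instead of comparing the current
        # block against last_update, which miners never advance because
        # they do not set weights.
        return (self.subtensor.get_current_block()
                - self.last_sync_block) >= EPOCH_LENGTH

    def maybe_sync(self):
        if self.should_sync_metagraph():
            self.metagraph.sync(subtensor=self.subtensor)
            self.last_sync_block = self.subtensor.get_current_block()
```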
