
vine: enforcing transfer limits between two workers -- how to handle large numbers of temporary input files #3959

Open
colinthomas-z80 opened this issue Oct 16, 2024 · 5 comments


@colinthomas-z80
Contributor

colinthomas-z80 commented Oct 16, 2024

I encountered a problem, for which my naive solution is #3958

The problem situation is this: we have two workers A and B, a task which produces 100 output files, and a second task which consumes those 100 files as inputs. The first task runs and its output files are created at worker A. When we schedule the second task, the manager may choose to send it to worker B.

The manager will see that worker A possesses input files for the task, and will schedule peer transfers from A to B. Peer transfers are accounted for and managed in the current_transfer_table. However, the current_transfer_table is only updated in vine_put_url, after the task has been scheduled. Therefore the manager will have scheduled all 100 input files to be requested from worker A, since each time it checked the current_transfer_table during scheduling, the table had not yet been populated with the transfers already scheduled for the same task.
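The race can be seen in a small simulation. This is a hedged Python sketch with hypothetical names and an assumed per-source limit of 3; the real bookkeeping is the C current_transfer_table, which is updated in vine_put_url only after scheduling:

```python
# Simulation of the scheduling race described above (hypothetical names;
# the real state lives in the manager's current_transfer_table in C).

NFILES = 100
WORKER_SOURCE_MAX_TRANSFERS = 3  # assumed limit, for illustration only

recorded_transfers = 0  # transfers currently recorded against the source worker


def schedule_task_inputs():
    """Schedule peer transfers for one task's input files from one source."""
    global recorded_transfers
    scheduled = 0
    for _ in range(NFILES):
        # The per-file check consults the table, but the table has not yet
        # been updated with this same task's already-scheduled transfers.
        if recorded_transfers < WORKER_SOURCE_MAX_TRANSFERS:
            scheduled += 1
    # Analog of vine_put_url: transfers are recorded only after the
    # whole task has been scheduled.
    recorded_transfers += scheduled
    return scheduled


print(schedule_task_inputs())  # prints 100: every file passed the stale check
```

All 100 transfers pass the limit check, because the count the check reads is only incremented after the scheduling pass completes.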

The naive solution I implemented prevents this from happening, but also brings some other implications along. The worker_source_max_transfers policy has only been effective at limiting the number of workers concurrently requesting single files from a given worker. Tasks have still been free to request a larger number of files at once from a single worker, without limitation.

If the policy is extended to all peer transfers, then it will be impossible to move task data between workers when the number of outputs is greater than worker_source_max_transfers. The consequence will be another form of fixed_location, where tasks may run only where the data already exists.

This calls for a solution for transferring more than worker_source_max_transfers files between two workers.

If we revert to the previous policy, we are free to schedule as many transfers as necessary from a single worker for one task. The point of failure is then the socket connect timeout between workers: the requesting worker will fail to connect to the source and will declare a cache_invalid. One possibility would be to increase the connect timeout from 15 seconds to something larger, but this would be a detriment to identifying genuinely unreachable workers.

We may consider the idea of workers limiting their own connections: if the manager tells a worker to retrieve 100 files from a single host, instead of forking 100 transfer processes it would perform them sequentially, or limit itself to 3 or 5 transfers from one host at a given time. Alternatively, the source worker could serve only 3-5 connections at a time. However, if the worker starts queuing transfers and conserving bandwidth on its own, then the manager's policy might become redundant.
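The worker-side limiting idea can be sketched as a bounded-concurrency transfer queue. This is a simulation under stated assumptions (the real worker forks transfer processes in C, and the limit of 5 is just the value discussed here; fetch is a stand-in, not a real API):

```python
import queue
import threading

MAX_CONCURRENT = 5  # assumed per-worker concurrent transfer limit


def fetch(url):
    # Stand-in for a worker-to-worker file transfer.
    return "cached:" + url


def run_transfers(urls, max_concurrent=MAX_CONCURRENT):
    """Drain a queue of transfer requests with at most max_concurrent
    transfers in flight at any time, instead of one process per file."""
    q = queue.Queue()
    for u in urls:
        q.put(u)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                u = q.get_nowait()
            except queue.Empty:
                return  # queue drained, this transfer slot retires
            r = fetch(u)
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(max_concurrent)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# 100 files from one source still all complete, but with only 5 channels open.
out = run_transfers(["file%d" % i for i in range(100)])
print(len(out))  # prints 100
```

The point of the sketch is that all requested files eventually arrive even when the request count far exceeds the concurrency limit, which is what makes the manager-side limit safe to enforce strictly.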

Proposed solution:
I propose that I close #3958 and instead implement partial bandwidth consideration at the requesting worker: a receiving worker will queue transfer requests to a single source and only perform them in small parallel batches.

If the worker is made to be considerate of other hosts, then the transfers will eventually complete successfully, and the manager is free to keep enforcing the same policy. From the manager's perspective, it will see that transfers occurring from a particular worker greatly exceed worker_source_max_transfers, so it will avoid scheduling any further transfers from that source until they complete, which should be desirable.

@colinthomas-z80 colinthomas-z80 changed the title vine: enforcing transfer limits between two workers -- how to handle large numbers of input files vine: enforcing transfer limits between two workers -- how to handle large numbers of temporary input files Oct 16, 2024
@dthain
Member

dthain commented Oct 16, 2024

Ok, so the main change is that the worker will limit concurrent transfers from the same source?

@colinthomas-z80
Contributor Author

That is correct

@dthain
Member

dthain commented Oct 16, 2024

@BarrySlyDelgado what do you think?

@BarrySlyDelgado
Contributor

I'm in favor of a worker limiting its own transfers. Though, it will be interesting to see the performance in conjunction with worker_source_max_transfers. If I have this right: if we limit a single worker to receiving 5 files concurrently and have worker_source_max_transfers set to 3, the maximum number of files any single worker would be sending would be 15. I think there is room to explore the adequate number of total transfers and the resulting ratio of max sends to max receives for any single worker.
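Under one reading of the interaction above (the manager allows up to worker_source_max_transfers receivers to pull from a single source at once, and each receiver opens at most its own per-worker limit of connections to that source), the worst-case send load on a source works out as:

```python
# Worked example with the values from the discussion; the multiplication
# assumes the two limits compose in the worst case.
receive_limit = 5                 # proposed per-worker concurrent receive limit
worker_source_max_transfers = 3   # manager-side per-source scheduling limit

max_sends = receive_limit * worker_source_max_transfers
print(max_sends)  # prints 15
```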

@colinthomas-z80
Contributor Author

#3961 is open and needs some review. I ended up limiting all transfers at the worker, rather than just those from a specific source, since it would be more complex to keep tabs on every individual source. It may also be more appropriate to limit all transfers, since the requesting worker could likely get overwhelmed at some point.

The limit I have tested with is 5 concurrent transfers. It does not appear to slow things down for me, even when getting ~100 files from another worker. I think there is a benefit to opening only a few high-bandwidth channels at a time.
