[GraphBolt] Add offline sampling support #7679
base: master
Conversation
To trigger regression tests:
minibatch.seeds.cpu(),
minibatch.input_nodes.cpu(),
minibatch.labels.cpu(),
[block.cpu() for block in minibatch.blocks],
This will only work for DGL. I would recommend changing your code so that it works with any GNN framework. Otherwise, your methods should be called DGLMinibatchProvider or something similar.
For now this is just for alignment with DiskGNN: if I directly save all attributes of a minibatch, the size will be large and loading the minibatches will be time-consuming.
You don't have to save all attributes of the minibatch; you can save only minibatch.sampled_subgraphs, as it is the counterpart to DGL's blocks.
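A minimal sketch of what that could look like (save_minibatch is a hypothetical helper, not part of this PR, and it assumes sampled subgraphs expose a to(device) method, as GraphBolt's MiniBatch does):

import torch

def save_minibatch(minibatch, path):
    # Persist only the framework-agnostic pieces. sampled_subgraphs is
    # GraphBolt's counterpart to DGL's blocks, so conversion to blocks
    # (or to another framework's structures) can happen at load time.
    torch.save(
        {
            "seeds": minibatch.seeds.cpu(),
            "input_nodes": minibatch.input_nodes.cpu(),
            "labels": minibatch.labels.cpu(),
            "sampled_subgraphs": [
                subgraph.to("cpu") for subgraph in minibatch.sampled_subgraphs
            ],
        },
        path,
    )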
if args.cpu_cache_size_in_gigabytes > 0 and isinstance(
    features[("node", None, "feat")], gb.DiskBasedFeature
):
    features[("node", None, "feat")] = gb.CPUCachedFeature(
features[("node", None, "feat")] = gb.CPUCachedFeature( | |
features[("node", None, "feat")] = features[("node", None, "feat")].read_into_memory() | |
features[("node", None, "feat")] = gb.CPUCachedFeature( |
You can add this line while debugging; that way you won't be affected by any potential bugs inside DiskBasedFeature, which might not be handling exceptions correctly.
That way, the error will be inside TorchBasedFeature if the rest of your code has a bug.
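Put together, the suggested debugging pattern looks roughly like this (a sketch reusing args and features from the script excerpt above; the second CPUCachedFeature argument, a cache size in bytes, is an assumption):

import dgl.graphbolt as gb

feat_key = ("node", None, "feat")
if args.cpu_cache_size_in_gigabytes > 0 and isinstance(
    features[feat_key], gb.DiskBasedFeature
):
    # Materialize the feature in memory first: any later failure then
    # points at the surrounding code rather than at disk I/O.
    features[feat_key] = features[feat_key].read_into_memory()
    features[feat_key] = gb.CPUCachedFeature(
        features[feat_key],
        args.cpu_cache_size_in_gigabytes * 1024**3,  # assumed: bytes
    )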
I still encounter similar issues with the updated master. It indicates that a worker thread does not exit normally, and the error disappears when I read the whole feature into main memory. Here is the error message:
I couldn't reproduce the issue on my local machine. Trying on another machine.
@Liu-rj Can you provide more information on the machine you are getting this error on?
The machine is a g5.8xlarge with 32 cores, 128 GB RAM, and one 24 GB A10G GPU. The error occurs on an EBS io2 SSD; the same command runs normally on instance NVMe storage, and other runs that adjust the CPU and GPU caches also work on the EBS io2 SSD. I don't know whether it's an io_uring issue (possibly hardware-related) or a bug in the code.
Since I can't reproduce the issue "yet", I might ask you to run the code with a modification to see if it fixes it. I am currently running the thread sanitizer to see if it catches anything.
def main():
    start = time.time()
Suggested change:
-start = time.time()
+torch.ops.graphbolt.set_num_io_uring_threads(4)
+start = time.time()
#7698 fixes the issue. |
Description
Sample minibatches in advance and use MinibatchLoader to load them during online training (a sketch of this workflow appears at the end).
Checklist
Please feel free to remove inapplicable items for your PR.
Changes
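A minimal sketch of the offline-then-online workflow described above. The MinibatchLoader here is a hypothetical stand-in (its real constructor and on-disk format are not shown in this excerpt); only the overall pattern of sampling ahead of time and replaying during training comes from the PR description.

import torch

# Offline phase: run an existing GraphBolt dataloader once and persist
# each minibatch (assumes MiniBatch supports .to("cpu")).
for step, minibatch in enumerate(dataloader):
    torch.save(minibatch.to("cpu"), f"minibatches/{step}.pt")

# Online phase: replay the saved minibatches instead of resampling.
class MinibatchLoader:
    """Hypothetical stand-in for this PR's MinibatchLoader."""

    def __init__(self, directory, num_batches):
        self.directory = directory
        self.num_batches = num_batches

    def __iter__(self):
        for step in range(self.num_batches):
            yield torch.load(f"{self.directory}/{step}.pt")

for minibatch in MinibatchLoader("minibatches", num_batches=100):
    train_step(minibatch)  # train_step: your per-iteration training function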