train.py doesn't use GPU #14

Open · agilebean opened this issue Dec 28, 2020 · 9 comments
@agilebean

I am running train.py successfully on my local machine (MacBook Pro 16).
Yet on Google Colab, it seems to take forever to start the first epoch (only the empty progress bar is shown).

I verified that CUDA is available:

import torch
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
print('Device:', torch.device('cuda:0'))

Torch 1.5.1 CUDA 10.2
Device: cuda:0

The script starts as:

!python train.py /content/sync/data \
--dataset miniimagenet \
--num-ways 5 \
--num-shots 1 \
--step-size 0.2 \
--batch-size 2 \
--num-batches 8 \
--num-epochs 50 \
--num-workers 2 \
--output-folder /content/sync/output \
--use-cuda \
--verbose

and gives:

2020-12-28 14:36:17.261821: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
DEBUG:root:Creating folder `/content/sync/output/2020-12-28_143618`
INFO:root:Saving configuration file in `/content/sync/output/2020-12-28_143618/config.json`
Training:  25% 2/8 [05:55<24:10, 241.79s/it, accuracy=0.1867, loss=35.6492]

So that's 242s, or about 4 min, per iteration, whereas the same configuration and identical code on my MacBook Pro without a GPU takes only about 2.4s per iteration, a factor of 100:

INFO:root:Saving configuration file in `/Users/chaehan/Google Drive/04 Publishing/18 Metalearning >ICAIIC/pytorch-maml/output/2020-12-28_144912/config.json`
Training:  25%|██▌       | 2/8 [00:05<00:14,  2.43s/it, accuracy=0.1867, loss=38.5514]

Why is this the case?

A hint is that the GPU is not used: after some minutes, Colab shows a popup saying:
Warning: you are connected to a GPU runtime, but not utilizing the GPU. Change to a standard runtime
Another difference is 16 CPU cores on the MacBook vs. 2 on Google Colab. That doesn't account for the factor of 100 between them, but it might be a hint.

I am convinced that this code must be super fast when running on an NVIDIA P100.
So I would be very grateful for any hints!

@tristandeleu
Owner

The script checks whether torch.cuda.is_available() is True (in addition to --use-cuda), and falls back to the CPU otherwise. It possibly means that Colab doesn't forward CUDA availability to the script: it is available in your notebook (since you can read torch.version.cuda and construct torch.device there, I imagine), but isn't available in the script.
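For reference, the device selection follows the usual PyTorch pattern; this is a rough sketch of that check, not the exact code from train.py (use_cuda stands in for the parsed --use-cuda flag):

import torch

use_cuda = True  # stand-in for the --use-cuda flag
# Falls back to CPU when CUDA isn't visible, even if --use-cuda is passed
device = torch.device('cuda' if use_cuda and torch.cuda.is_available() else 'cpu')
print(device)  # would print 'cpu' inside the Colab-launched script in this case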

I don't know much about how Colab works with external scripts. But something you can try, to validate this, is to create a small script like

import torch

print(f'CUDA available: {torch.cuda.is_available()}')

and call it with the !python magic command, to check whether the script has access to the GPU.
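For example, saving the snippet as check_cuda.py (the filename here is just for illustration) and running it from a notebook cell:

!python check_cuda.py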

@agilebean
Author

agilebean commented Dec 30, 2020

You were right!
But very strange indeed!

I didn't execute it as an external script, since Google Colab works with Jupyter notebooks in interactive mode only. Instead, I ran the check in a notebook cell with

import torch
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
print('Device:', torch.device('cuda:0'))
print(f'CUDA available: {torch.cuda.is_available()}')

I got

Torch 1.5.1 CUDA 10.2
Device: cuda:0
CUDA available: False

It still doesn't solve my problem, but you found the root cause.
It has nothing to do with your code; it's about torch not making CUDA available.
It is strange, since the CUDA version and the CUDA device are found. (Apparently torch.version.cuda only reports the CUDA version PyTorch was built against, and torch.device('cuda:0') merely constructs a device object; neither actually checks that a working GPU driver is present.)

Will keep you updated if I find a solution.

@agilebean
Author

agilebean commented Dec 30, 2020

Update:
The problem was that torch 1.5.1 ships with CUDA 10.2, yet the NVIDIA driver on Google Colab currently provides CUDA 10.1.
For anyone who wants to run the code on Google Colab:
You must install torch 1.6 so that torch's CUDA version matches the CUDA driver installed on Colab, with
pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Verify the CUDA version that PyTorch sees with:
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
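After reinstalling, a quick sanity check (the same commands as above, plus the availability flag that started this thread):

import torch
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
print('CUDA available:', torch.cuda.is_available())  # should now print True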

@tristandeleu
Now that I have train.py running with CUDA, I still don't see much speedup compared to my MacBook, even with an NVIDIA V100. Even with batch-size 2 and num-batches 2, training takes 128s (MacBook: 10s).

Do you have any idea why this is still so slow?
Are there any time benchmarks you can share?
I could share the Colabo notebook if you want.

@tristandeleu
Owner

I don't have any formal benchmark for how fast this should be, unfortunately. But looking at some logs, I was getting one epoch done in 92s on a Titan Xp (this includes 100 batches of training and 250 batches of validation per epoch; of these 92s, validation took about 35s) for MiniImagenet 5-way 5-shot with the settings from the paper (--num-steps 5 --step-size 0.01 --batch-size 4 --num-batches 100 --hidden-size 32) and 8 workers for data loading.

Hope this helps!


@agilebean
Author

It seems that the problem is solved, although I cannot reproduce the reason.
All I did was restart Google Colab and install a few extensions. The availability of the GPU on Colab seems to fluctuate, as throughput varied between 1.2 and 8.2 it/s within a single session!

The benchmark for your run:

# reproduce the settings from the Deleu paper

!python train.py /content/sync/data \
--dataset miniimagenet \
--num-ways 5 \
--num-shots 5 \
--num-steps 5 \
--step-size 0.1 \
--batch-size 4 \
--num-batches 100 \
--num-epochs 1 \
--hidden-size 32 \
--num-workers 8 \
--output-folder /content/sync/output \
--use-cuda \
--verbose

ran smoothly in 80s per epoch using an NVIDIA V100 with

  • 65s for training
  • 15s for testing

Only one detail was missing to reproduce your experiment:
Can you tell me how to specify your configuration of 250 batches of validation per epoch?

@agilebean
Author

Good news!
I finally found the root cause of the huge time delay:
My script read and wrote to folders that were synced to Google Drive. "Synced" actually meant "being synced", and that caused the bottleneck. When I changed the folders to ones living directly on the Google Colab VM, the time per epoch went down from 240s to 0.2s in the previous test!
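For anyone hitting the same bottleneck, a minimal sketch of the workaround (the paths are illustrative, not the exact ones from my setup):

# copy the dataset from the Drive-synced folder onto the Colab VM's local disk
!cp -r /content/sync/data /content/data
# then pass /content/data to train.py and use a local --output-folder as well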
And as mentioned before, running the benchmark from your paper took about 82 seconds per epoch.
To be more precise about this, I would still be grateful if you could tell me how to configure the 250 batches of validation per epoch!

@tristandeleu
Owner

That is good news! I/O can be a really big factor.

The 250 batches of validation per epoch is something I have in an internal version of the code: it comes from the fact that I am using a fixed subset of 1000 tasks from the meta-validation split for the evaluation at each epoch. The code in this repo instead uses num_batches * batch_size random tasks from the meta-validation split at each epoch (they are different at every epoch). Using a fixed subset of tasks for evaluation is the better option, much closer to the way validation is done in standard supervised learning.

I won't be able to push the corresponding code because it has a number of internal dependencies, unfortunately. However, I can give you some steps to reproduce it yourself if you want:

  • Create the subset of indices for a specific dataset; I'm taking MiniImagenet 5-way 5-shot as an example. I have a utility function to do that (it's not perfect, because it can return fewer tasks than requested, but it did the trick for me):
import random
from torchmeta.datasets.helpers import miniimagenet

def create_indices(dataset, num_tasks):
    # Each task is a combination of `num_classes_per_task` class indices.
    # Collecting tuples in a set deduplicates combinations, which is why
    # the result can contain fewer than `num_tasks` entries.
    indices = set()
    for _ in range(num_tasks):
        indices.add(tuple(random.sample(range(len(dataset.dataset)),
                                        dataset.num_classes_per_task)))
    indices = [list(x) for x in indices]
    return indices

dataset = miniimagenet('data', shots=5, ways=5, meta_val=True)
indices = create_indices(dataset, 1000)  # Sample (up to) 1000 tasks
  • Save the indices in a separate file to freeze this subset, and reload it later:
import json
from torch.utils.data import Subset  # assuming torch's Subset here

# Freeze the subset by writing the indices to disk
# (json.dump returns None, so there is nothing to assign)
with open('path/to/val_indices/miniimagenet_5way_5shot.json', 'w') as f:
    json.dump(indices, f)

# Load the indices later (e.g. using json)
with open('path/to/val_indices/miniimagenet_5way_5shot.json', 'r') as f:
    meta_val_indices = json.load(f)

meta_val_dataset = Subset(dataset, meta_val_indices)

This is missing a lot of the logic (e.g. how to fetch the correct json file for a specific dataset, and the shots and ways arguments of get_benchmark_by_name), but I hope this helps!
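To close the loop, a hedged sketch of feeding the frozen subset to the evaluation loop, assuming torchmeta's BatchMetaDataLoader accepts the subset as-is; with 1000 fixed tasks and a batch size of 4, this yields the 250 validation batches per epoch mentioned above:

from torchmeta.utils.data import BatchMetaDataLoader

# 1000 fixed tasks / batch size 4 = 250 validation batches per epoch
meta_val_dataloader = BatchMetaDataLoader(meta_val_dataset,
                                          batch_size=4,
                                          num_workers=8)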

