train.py doesn't use GPU #14

Open · agilebean opened this issue Dec 28, 2020 · 9 comments
@agilebean

I am running train.py successfully on my local machine (MacBook Pro 16).
Yet on Google Colab, it seems to take forever to start the first epoch (only the empty progress bar is shown).

I verified that CUDA is available:

import torch
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
print('Device:', torch.device('cuda:0'))

Torch 1.5.1 CUDA 10.2
Device: cuda:0

The script starts as:

!python train.py /content/sync/data \
--dataset miniimagenet \
--num-ways 5 \
--num-shots 1 \
--step-size 0.2 \
--batch-size 2 \
--num-batches 8 \
--num-epochs 50 \
--num-workers 2 \
--output-folder /content/sync/output \
--use-cuda \
--verbose

and gives:

2020-12-28 14:36:17.261821: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
DEBUG:root:Creating folder `/content/sync/output/2020-12-28_143618`
INFO:root:Saving configuration file in `/content/sync/output/2020-12-28_143618/config.json`
Training:  25% 2/8 [05:55<24:10, 241.79s/it, accuracy=0.1867, loss=35.6492]

So that's 242s, or about 4 min, per iteration, whereas the same configuration and identical code on my MacBook Pro without a GPU takes only about 2.4s per iteration, a factor of 100:

INFO:root:Saving configuration file in `/Users/chaehan/Google Drive/04 Publishing/18 Metalearning >ICAIIC/pytorch-maml/output/2020-12-28_144912/config.json`
Training:  25%|██▌       | 2/8 [00:05<00:14,  2.43s/it, accuracy=0.1867, loss=38.5514]

Why is this the case?

A hint is that the GPU is not used: after some minutes, Colab shows a popup saying:
Warning: you are connected to a GPU runtime, but not utilizing the GPU. Change to a standard runtime
Another difference is 16 CPU cores on the MacBook vs. 2 on Google Colab. That doesn't account for the factor of 100 between them, but it might be a hint.

I am convinced that this code must be super fast when running on an NVIDIA P100.
So I would be very grateful for any hints!

@tristandeleu
Owner

The script checks whether torch.cuda.is_available() is True (in addition to --use-cuda), and falls back to the CPU otherwise. It possibly means that Colab doesn't forward CUDA availability to the script: it is available in your notebook (since you can read torch.version.cuda and construct torch.device there, I imagine), but isn't available in the script.
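For reference, the device selection follows the usual PyTorch pattern; this is a rough sketch of that check, not the exact code from train.py (use_cuda stands in for the parsed --use-cuda flag):

import torch

use_cuda = True  # stand-in for the --use-cuda flag
# Falls back to CPU when CUDA isn't visible, even if --use-cuda is passed
device = torch.device('cuda' if use_cuda and torch.cuda.is_available() else 'cpu')
print(device)  # would print 'cpu' inside the Colab-launched script in this case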

I don't know much about how Colab works with external scripts. But something you can try, to validate this, is to create a small script like

import torch

print(f'CUDA available: {torch.cuda.is_available()}')

and call it with the !python magic command, to check whether the script has access to the GPU.
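For example, saving the snippet as check_cuda.py (the filename here is just for illustration) and running it from a notebook cell:

!python check_cuda.py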

@agilebean
Author

agilebean commented Dec 30, 2020

You were right!
But very strange indeed!

I didn't execute it as an external script, since Google Colab works with Jupyter notebooks in interactive mode only. Instead, I ran the check in a notebook cell with

import torch
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
print('Device:', torch.device('cuda:0'))
print(f'CUDA available: {torch.cuda.is_available()}')

I got

Torch 1.5.1 CUDA 10.2
Device: cuda:0
CUDA available: False

It still doesn't solve my problem, but you found the root cause.
It has nothing to do with your code; it's about torch not making CUDA available.
It is strange, since the CUDA version and the CUDA device are found. (Apparently torch.version.cuda only reports the CUDA version PyTorch was built against, and torch.device('cuda:0') merely constructs a device object; neither actually checks that a working GPU driver is present.)

Will keep you updated if I find a solution.

@agilebean
Author

agilebean commented Dec 30, 2020

Update:
The problem was that torch 1.5.1 ships with CUDA 10.2, yet the NVIDIA driver on Google Colab currently provides CUDA 10.1.
For anyone who wants to run the code on Google Colab:
You must install torch 1.6 so that torch's CUDA version matches the CUDA driver installed on Colab, with
pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Verify the CUDA version that PyTorch sees with:
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
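After reinstalling, a quick sanity check (the same commands as above, plus the availability flag that started this thread):

import torch
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
print('CUDA available:', torch.cuda.is_available())  # should now print True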

@tristandeleu
Now that I have train.py running with CUDA, I still don't see much speedup compared to my MacBook, even with an NVIDIA V100. Even with batch-size 2 and num-batches 2, training takes 128s (MacBook: 10s).

Do you have any idea why this is still so slow?
Are there any time benchmarks you can share?
I could share the Colabo notebook if you want.

@tristandeleu
Owner

I don't have any formal benchmark for how fast this should be, unfortunately. But looking at some logs, I was getting one epoch done in 92s on a Titan Xp (this includes 100 batches of training and 250 batches of validation per epoch; of these 92s, validation took about 35s) for MiniImagenet 5-way 5-shot with the settings from the paper (--num-steps 5 --step-size 0.01 --batch-size 4 --num-batches 100 --hidden-size 32) and 8 workers for data loading.

Hope this helps!


@agilebean
Author

It seems that the problem is solved, although I cannot reproduce the reason.
All I did was restart Google Colab and install a few extensions. The availability of the GPU on Colab seems to fluctuate, as throughput varied between 1.2 and 8.2 it/s within a single session!

The benchmark for your run:

# reproduce the settings from the Deleu paper

!python train.py /content/sync/data \
--dataset miniimagenet \
--num-ways 5 \
--num-shots 5 \
--num-steps 5 \
--step-size 0.1 \
--batch-size 4 \
--num-batches 100 \
--num-epochs 1 \
--hidden-size 32 \
--num-workers 8 \
--output-folder /content/sync/output \
--use-cuda \
--verbose

ran smoothly in 80s per epoch using an NVIDIA V100 with

  • 65s for training
  • 15s for testing

Only one detail was missing to reproduce your experiment:
Can you tell me how to specify your configuration of 250 batches of validation per epoch?

@agilebean
Author

Good news!
I finally found the root cause of the huge time delay:
My script read and wrote to folders that were synced to Google Drive. "Synced" actually meant "being synced", and that caused the bottleneck. When I changed the folders to ones living directly on the Google Colab VM, the time per epoch went down from 240s to 0.2s in the previous test!
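For anyone hitting the same bottleneck, a minimal sketch of the workaround (the paths are illustrative, not the exact ones from my setup):

# copy the dataset from the Drive-synced folder onto the Colab VM's local disk
!cp -r /content/sync/data /content/data
# then pass /content/data to train.py and use a local --output-folder as well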
And as mentioned before, running the benchmark from your paper took about 82 seconds per epoch.
To be more precise about this, I would still be grateful if you could tell me how to configure the 250 batches of validation per epoch!

@tristandeleu
Owner

That is good news! I/O can be a really big factor.

The 250 batches of validation per epoch is something I have in an internal version of the code: it comes from the fact that I am using a fixed subset of 1000 tasks from the meta-validation split for the evaluation at each epoch. The code in this repo instead uses num_batches * batch_size random tasks from the meta-validation split at each epoch (they are different at every epoch). Using a fixed subset of tasks for evaluation is the better option, much closer to the way validation is done in standard supervised learning.

I won't be able to push the corresponding code because it has a number of internal dependencies, unfortunately. However, I can give you some steps to reproduce it yourself if you want:

  • Create the subset of indices for a specific dataset; I'm taking MiniImagenet 5-way 5-shot as an example. I have a utility function to do that (it's not perfect, because it can return fewer tasks than requested, but it did the trick for me):
import random
from torchmeta.datasets.helpers import miniimagenet

def create_indices(dataset, num_tasks):
    # Each task is a combination of `num_classes_per_task` class indices.
    # Collecting tuples in a set deduplicates combinations, which is why
    # the result can contain fewer than `num_tasks` entries.
    indices = set()
    for _ in range(num_tasks):
        indices.add(tuple(random.sample(range(len(dataset.dataset)),
                                        dataset.num_classes_per_task)))
    indices = [list(x) for x in indices]
    return indices

dataset = miniimagenet('data', shots=5, ways=5, meta_val=True)
indices = create_indices(dataset, 1000)  # Sample (up to) 1000 tasks
  • Save the indices in a separate file to freeze this subset, and reload it later:
import json
from torch.utils.data import Subset  # assuming torch's Subset here

# Freeze the subset by writing the indices to disk
# (json.dump returns None, so there is nothing to assign)
with open('path/to/val_indices/miniimagenet_5way_5shot.json', 'w') as f:
    json.dump(indices, f)

# Load the indices later (e.g. using json)
with open('path/to/val_indices/miniimagenet_5way_5shot.json', 'r') as f:
    meta_val_indices = json.load(f)

meta_val_dataset = Subset(dataset, meta_val_indices)

This is missing a lot of the logic (e.g. how to fetch the correct json file for a specific dataset, and the shots and ways arguments of get_benchmark_by_name), but I hope this helps!
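To close the loop, a hedged sketch of feeding the frozen subset to the evaluation loop, assuming torchmeta's BatchMetaDataLoader accepts the subset as-is; with 1000 fixed tasks and a batch size of 4, this yields the 250 validation batches per epoch mentioned above:

from torchmeta.utils.data import BatchMetaDataLoader

# 1000 fixed tasks / batch size 4 = 250 validation batches per epoch
meta_val_dataloader = BatchMetaDataLoader(meta_val_dataset,
                                          batch_size=4,
                                          num_workers=8)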

