Performance seems to be very low #33

Open
scribblepad opened this issue Oct 22, 2021 · 4 comments

@scribblepad

I'm trying out AutoAlbument for a semantic segmentation task with the default generated search.yaml.
The custom dataset has around 29,000 RGB images and corresponding masks (height x width: 512 x 512). I'm running it on a single A100 GPU with a batch size of 8, the largest I could fit in memory without OOM errors. The GPU is being utilized, though utilization fluctuates (35% -> 75% -> 99% -> 100%).

Issue

Based on the output below, autoalbument-search needs close to 5 days to complete 20 epochs, which seems too high. Is there a more optimized way to obtain augmentation policies from AutoAlbument?
Running it for 5 continuous days is too expensive.

Current output of autoalbument-search: [image]

Segments from search.yaml:

architecture: Unet
encoder_architecture: resnet18
pretrained: true

dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: 8
  shuffle: true
  num_workers: 16
  pin_memory: true
  drop_last: true
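
As a rough sanity check on the 5-day figure, the numbers in this post imply a per-iteration time. This is only a back-of-the-envelope sketch: it assumes the progress estimate covers all 20 epochs and that one iteration processes one batch of 8, with the last partial batch dropped.

```python
# Back-of-the-envelope estimate from the numbers quoted in this post.
num_images = 29_000
batch_size = 8
epochs = 20
estimated_days = 5  # approximate total taken from the progress output

iters_per_epoch = num_images // batch_size  # drop_last: true discards the remainder
total_iters = iters_per_epoch * epochs
seconds_per_iter = estimated_days * 24 * 3600 / total_iters

print(iters_per_epoch, total_iters, round(seconds_per_iter, 2))
```

Under those assumptions the search is running at roughly 6 seconds per iteration.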

@ihamdi

ihamdi commented Nov 17, 2021

Same here. I'm trying the cifar10 example and it's taking at least 5 s/iteration. At 390 iterations/epoch, that is about half an hour per epoch. I can only imagine how slow it will be if I try to use it on my x-ray classification task with 6000 high-res images.

I'm going to wait just to see the result out of curiosity, but otherwise it's not usable. I'm going to look into Faster AutoAugment, which this is based on, or even the older RandAugment or AutoAugment.

@ihamdi

ihamdi commented Nov 17, 2021

Running the same cifar10 example with a batch size of 128 on an RTX 2070 (6 GB) takes the same amount of time per iteration as running it on a 24 GB RTX 3090 with a batch size of 640. I think something in their code is limiting how quickly the iterations run.

@siddagra

I think not being able to use AMP (trying to enable it raises an error on the Albumentations side) might be hurting both performance and the batch size that fits in memory. Also, perhaps PyTorch Lightning's manual optimization mode is just a lot slower; I need to test that. It also seems to run a generative step and a discriminative step, which might slow it down a bit, but it shouldn't slow it down too much, since inference is much faster than training because gradients are not required. Lastly, the size of the model does not seem to matter; speed is similar, which suggests a CPU bottleneck. I'm unsure why one would get such a bottleneck with pytorch-lightning.
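
One way to test the CPU-bottleneck guess is to measure how long each iteration waits for the next batch versus how long the step itself takes. Below is a minimal, standard-library-only sketch; `time_loop`, `slow_loader`, and the `step` callable are hypothetical placeholders, not part of AutoAlbument or PyTorch Lightning. If the step launches GPU work, it would also need to synchronize (for example with `torch.cuda.synchronize()`) before returning, because CUDA kernels run asynchronously and the timings would otherwise be misleading.

```python
import time

def time_loop(batches, step, max_iters=20):
    """Split wall time into waiting for data vs. running the step."""
    wait_s = 0.0
    step_s = 0.0
    n = 0
    it = iter(batches)
    while n < max_iters:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent waiting on the loader
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)  # time spent in the training/search step
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
        n += 1
    return {"iters": n, "data_wait_s": wait_s, "step_s": step_s}

# Toy stand-ins: a loader that sleeps (slow CPU-side work) and a no-op step.
def slow_loader(num_batches, delay_s=0.01):
    for i in range(num_batches):
        time.sleep(delay_s)
        yield i

stats = time_loop(slow_loader(5), step=lambda batch: None)
print(stats)
```

If `data_wait_s` dominates with the real dataloader, the loader (decoding, CPU-side preprocessing, worker count) is the likely limit; if `step_s` dominates, the time is going into the search step itself.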

I will similarly explore a bit and see if I can pin down the exact cause, and otherwise switch to one of the better, more recent methods (DADA, Adv AA, the official Faster AutoAugment, RandAugment search).

Maybe I can gain insight into how to train or search policies in general using Albumentations. Until now I have had to edit each augmentation by hand to try something like RandAugment (varying the magnitude); it would be nice to have a policy parser whose input I can generate programmatically to vary the magnitude.
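
A programmatic policy of that kind could be sketched as a function that maps a single magnitude value onto the parameters of several transforms. Everything here is an assumption for illustration: `MAX_PARAMS`, `randaugment_specs`, and `build` are hypothetical names, the maximum ranges are arbitrary, and this is not AutoAlbument's policy format. The transform names and keyword arguments are intended to match Albumentations' `Rotate`, `RandomBrightnessContrast`, and `HueSaturationValue`.

```python
import random

# Hypothetical maximum strengths at magnitude 1.0 (illustrative values only).
MAX_PARAMS = {
    "Rotate": {"limit": 45},
    "RandomBrightnessContrast": {"brightness_limit": 0.4, "contrast_limit": 0.4},
    "HueSaturationValue": {
        "hue_shift_limit": 20,
        "sat_shift_limit": 30,
        "val_shift_limit": 20,
    },
}

def randaugment_specs(magnitude, num_ops=2, p=1.0, seed=None):
    """Pick num_ops transforms and scale their limits by magnitude in [0, 1]."""
    if not 0.0 <= magnitude <= 1.0:
        raise ValueError("magnitude must be in [0, 1]")
    rng = random.Random(seed)
    names = rng.sample(sorted(MAX_PARAMS), num_ops)
    return [
        {
            "name": name,
            "params": {k: v * magnitude for k, v in MAX_PARAMS[name].items()},
            "p": p,
        }
        for name in names
    ]

def build(specs):
    """Turn specs into an Albumentations pipeline (needs albumentations installed)."""
    import albumentations as A  # imported lazily so the spec code runs without it
    return A.Compose([getattr(A, s["name"])(p=s["p"], **s["params"]) for s in specs])

specs = randaugment_specs(magnitude=0.5, num_ops=2, seed=0)
print(specs)
```

Note that this fixes the chosen operations when the specs are generated, whereas RandAugment samples operations per image, so it is only a starting point for sweeping the magnitude.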

@saigontrade88

You can reduce the training set size to 4,000 as the authors of Faster AA show in their paper.
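
A minimal sketch of that subsampling, using only the standard library: `subsample_indices` is a hypothetical helper, and the dataset size and batch size are taken from the original post.

```python
import random

def subsample_indices(dataset_size, subset_size=4_000, seed=0):
    """Return a reproducible random subset of dataset indices."""
    rng = random.Random(seed)
    k = min(subset_size, dataset_size)
    return sorted(rng.sample(range(dataset_size), k))

indices = subsample_indices(29_000)  # dataset size from the original post
iters_per_epoch = len(indices) // 8  # batch_size: 8, drop_last: true
print(len(indices), iters_per_epoch)

# With PyTorch, the indices could then wrap the full dataset, for example:
#   subset = torch.utils.data.Subset(full_dataset, indices)
```

At a batch size of 8 this gives 500 iterations per epoch instead of 3,625, so the search should take about a seventh of the time if the per-iteration cost stays the same.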
