Error during training: RuntimeError: CUDA error: too many resources requested for launch #11

Open
Strontia opened this issue Jun 24, 2022 · 3 comments

Comments

@Strontia

Full error message:

ReResNet Orientation: 8 Fix Params: False
2022-06-25 00:20:44,437 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.7.r11.7/compiler.31294372_0
GPU 0: NVIDIA GeForce RTX 2080 Ti
GCC: gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.5.0
OpenCV: 4.6.0
MMCV: 0.6.2
MMDetection: 1.1.0+258d792
MMDetection Compiler: GCC 11.2
MMDetection CUDA Compiler: 11.7
------------------------------------------------------------

2022-06-25 00:20:44,437 - mmdet - INFO - Distributed training: False
2022-06-25 00:20:44,437 - mmdet - INFO - Config:
/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/configs/dota/r50_dotav1.py
work_dir = 'work_dirs/r50_dotav1/'

# model settings
norm_cfg = dict(type='GN', num_groups=32, requires_grad=True)

model = dict(
    type='OrientedRepPointsDetector',
    pretrained='torchvision://resnet50', 
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        style='pytorch',
    ),
    neck=
        dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs=True,
        num_outs=5,
        norm_cfg=norm_cfg
        ),
    bbox_head=dict(
        type='OrientedRepPointsHead',
        num_classes=16,
        in_channels=256,
        feat_channels=256,
        point_feat_channels=256,
        stacked_convs=3,
        num_points=9,
        gradient_mul=0.3,
        point_strides=[8, 16, 32, 64, 128],
        point_base_scale=2,
        norm_cfg=norm_cfg,
        loss_cls=dict(type='FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25, loss_weight=1.0),
        loss_rbox_init=dict(type='GIoULoss', loss_weight=0.375),
        loss_rbox_refine=dict(type='GIoULoss', loss_weight=1.0),
        loss_spatial_init=dict(type='SpatialBorderLoss', loss_weight=0.05),
        loss_spatial_refine=dict(type='SpatialBorderLoss', loss_weight=0.1),
        top_ratio=0.4,))
# training and testing settings
train_cfg = dict(
    init=dict(
        assigner=dict(type='PointAssigner', scale=4, pos_num=1),  # select only one positive sample per gt box
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    refine=dict(
        assigner=dict(
            type='MaxIoUAssigner',  # pre-assign to select more samples for sample selection
            pos_iou_thr=0.1,
            neg_iou_thr=0.1,
            min_pos_iou=0,
            ignore_iof_thr=-1),
        allowed_border=-1,
        pos_weight=-1,
        debug=False))

test_cfg = dict(
    nms_pre=2000,
    min_bbox_size=0,
    score_thr=0.05,
    nms=dict(type='rnms', iou_thr=0.4),
    max_per_img=2000)

# dataset settings
dataset_type = 'DotaDatasetv1'
data_root = '/home/r/文档/WPW/Remote/DataSets/Dota-v1.5/' #'data/dataset_demo_split/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='CorrectRBBox', correct_rbbox=True, refine_rbbox=True),
    dict(type='PolyResize',
        img_scale=[(1333, 768), (1333, 1280)],
        keep_ratio=True,
        multiscale_mode='range',
        clamp_rbbox=False),
    dict(type='PolyRandomFlip', flip_ratio=0.5),
    #dict(type='HSVAugment', hgain=0.015, sgain=0.7, vgain=0.4),
    #dict(type='PolyRandomRotate', rotate_ratio=0.5, angles_range=180, auto_bound=False),
    dict(type='Pad', size_divisor=32),
    #dict(type='Poly_Mosaic_RandomPerspective', mosaic_ratio=0.5, ifcrop=True, degrees=0, translate=0.1, scale=0.2, shear=0, perspective=0.0),
    #dict(type='MixUp', mixup_ratio=0.5),
    dict(type='PolyImgPlot', img_save_path=work_dir, save_img_num=16, class_num=15, thickness=2),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])]

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1024, 1024),
        flip=False,
        transforms=[
            dict(type='PolyResize', keep_ratio=True),
            dict(type='PolyRandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']), 
            dict(type='Collect', keys=['img']),
        ])
]

data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'trainval_split/' + 'trainval.json',
        img_prefix=data_root + 'trainval_split/' + 'images/',
        pipeline=train_pipeline,
        Mosaic4=False,
        Mosaic9=False,
        Mixup=False),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'trainval_split/' + 'trainval.json',
        img_prefix=data_root + 'trainval_split/' + 'images/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'test_split/' + 'test.json',
        img_prefix=data_root + 'test_split/' + 'images/',
        pipeline=test_pipeline))

evaluation = dict(interval=1, metric='bbox')
# optimizer
optimizer = dict(type='AdamW', lr=0.0001, betas=(0.9, 0.999), weight_decay=0.05,
                paramwise_cfg=dict(custom_keys={'absolute_pos_embed': dict(decay_mult=0.),
                                                 'relative_position_bias_table': dict(decay_mult=0.),
                                                 'norm': dict(decay_mult=0.)}))
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
    step=[24, 32, 38])
checkpoint_config = dict(interval=20)
# yapf:disable
log_config = dict(
    interval=1,          # print once every n iterations
    hooks=[
        dict(type='TextLoggerHook')
    ])
# yapf:enable
# runtime settings
total_epochs = 40
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None#'work_dirs/orientedreppoints_r50_demo/latest.pth'
workflow = [('train', 1)]


2022-06-25 00:20:44,666 - mmdet - INFO - load model from: torchvision://resnet50
2022-06-25 00:20:44,779 - mmdet - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

loading annotations into memory...
Done (t=4.07s)
creating index...
index created!
2022-06-25 00:20:50,462 - mmdet - INFO - Start running, host: r@4508, work_dir: /home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/work_dirs/r50_dotav1
2022-06-25 00:20:50,462 - mmdet - INFO - workflow: [('train', 1)], max: 40 epochs
Traceback (most recent call last):
  File "tools/train.py", line 154, in <module>
    main()
  File "tools/train.py", line 143, in main
    train_detector(
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/apis/train.py", line 105, in train_detector
    _non_dist_train(
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/apis/train.py", line 244, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/r/miniconda3/envs/orientedreppoints/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/r/miniconda3/envs/orientedreppoints/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 34, in train
    outputs = self.batch_processor(
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/apis/train.py", line 75, in batch_processor
    losses = model(**data)
  File "/home/r/miniconda3/envs/orientedreppoints/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/r/miniconda3/envs/orientedreppoints/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/r/miniconda3/envs/orientedreppoints/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/core/fp16/decorators.py", line 49, in new_func
    return old_func(*args, **kwargs)
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/models/detectors/base.py", line 147, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/models/detectors/orientedreppoints_detector.py", line 31, in forward_train
    losses = self.bbox_head.loss(
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/models/anchor_heads/orientedreppoints_head.py", line 388, in loss
    cls_reg_targets_refine = refine_pointset_target(
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/core/bbox/pointset_target.py", line 148, in refine_pointset_target
    all_proposal_weights, pos_inds_list, neg_inds_list, all_gt_inds) = multi_apply(
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/core/utils/misc.py", line 24, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/core/bbox/pointset_target.py", line 190, in refine_pointset_target_single
    assign_result = bbox_assigner.assign(proposals, gt_rbboxes,
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/core/bbox/assigners/max_iou_assigner.py", line 80, in assign
    assign_result = self.assign_wrt_overlaps(overlaps, gt_labels)
  File "/home/r/文档/WPW/Remote/Projects/OrientedRepPoints_DOTA/mmdet/core/bbox/assigners/max_iou_assigner.py", line 92, in assign_wrt_overlaps
    assigned_gt_inds = overlaps.new_full((num_bboxes,),
RuntimeError: CUDA error: too many resources requested for launch
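
One detail worth flagging from the environment info above: PyTorch 1.4.0 was built against CUDA Runtime 10.1, while the MMDetection CUDA extensions were compiled with CUDA 11.7 (and GCC 11.2 versus PyTorch's GCC 7.3). Also note that CUDA errors are reported asynchronously, so the traceback line (overlaps.new_full in max_iou_assigner.py) is where the error surfaced, not necessarily where the offending kernel was launched; re-running with CUDA_LAUNCH_BLOCKING=1 makes the report synchronous. A minimal, read-only sketch to print the relevant versions in the same conda env (it assumes nothing beyond the packages already shown in the log):

# Version-check sketch; read-only, run inside the orientedreppoints env from the log.
import torch
import mmcv
import mmdet

print("torch:", torch.__version__)                   # 1.4.0 in the log
print("torch built with CUDA:", torch.version.cuda)  # 10.1 in the log
print("cuDNN:", torch.backends.cudnn.version())
print("mmcv:", mmcv.__version__)                     # 0.6.2 in the log
print("mmdet:", mmdet.__version__)                   # 1.1.0 in the log
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))                       # RTX 2080 Ti
    print("compute capability:", torch.cuda.get_device_capability(0))  # expected (7, 5)

If torch.version.cuda (10.1 here) and the toolkit used to compile the mmdet ops (11.7 here) disagree, rebuilding the ops against a matching toolkit, or installing a PyTorch build that matches the system CUDA, is a plausible first step rather than a confirmed fix.
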
@MyJerry1996

Has this been resolved? I'm having the same problem.

@ld-xy

ld-xy commented Jul 12, 2024

> Has this been resolved? I'm having the same problem.

Did you resolve it? I've run into the same issue.

@MyJerry1996

> Has this been resolved? I'm having the same problem.

> Did you resolve it? I've run into the same issue.

As I recall, I fixed it at the time by reinstalling the environment from scratch; it's been a while, so I may not remember the details exactly.
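
If you do rebuild the environment, a small smoke test that mirrors the call pattern at the failing line (overlaps.new_full inside assign_wrt_overlaps) can confirm that the new stack launches kernels on this GPU at all; the tensor shapes below are made up for illustration and are not taken from the dataset.

# Hedged smoke test mirroring the failing call; shapes are illustrative only.
import torch

overlaps = torch.rand(4, 1000, device="cuda")   # stand-in (num_gts, num_bboxes) IoU matrix
num_bboxes = overlaps.size(1)
assigned_gt_inds = overlaps.new_full((num_bboxes,), -1, dtype=torch.long)
torch.cuda.synchronize()                        # force any asynchronous CUDA error to surface here
print(assigned_gt_inds.shape, assigned_gt_inds.device)  # expect torch.Size([1000]) cuda:0

If this tiny script runs cleanly but training still fails, the problem is more likely in the custom CUDA ops compiled for this repo than in PyTorch itself.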
