Resume hangs #3671

zdk258 · 2024-05-17T03:10:15Z

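Resuming the model hangs without reporting any error, while starting training from scratch works fine. Setting num_workers to 1 does not help either.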

05/17 11:07:53 - mmengine - INFO - resumed epoch: 0, iter: 32500
05/17 11:07:53 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
05/17 11:07:53 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
05/17 11:07:53 - mmengine - WARNING - Advance dataloader 32500 steps to skip data that has already been trained

ShenZheng2000 · 2024-05-19T19:50:51Z

Same question.

chtzs · 2024-05-22T02:43:53Z

After checking the mmengine source code, I found that it simply calls next() to skip the training data that has already been consumed.
In mmengine\runner\loops.py, IterBasedTrainLoop:

        if self._iter > 0:
            print_log(
                f'Advance dataloader {self._iter} steps to skip data '
                'that has already been trained',
                logger='current',
                level=logging.WARNING)
            for _ in range(self._iter):
                next(self.dataloader_iterator)

In other words, "--resume" loads data just as regular training does, but discards every batch until it reaches the saved iteration. Reaching that iteration therefore still pays the full data-loading cost of every skipped step, so resuming is not much faster than starting a new training session.
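To illustrate the point, here is a minimal sketch in plain Python with no mmengine dependency. `CountingLoader` and `skip_batches` are hypothetical names made up for this example; the counter stands in for the real work (disk reads, decoding, augmentation) that a dataloader performs for every batch, including the ones that get thrown away.

```python
class CountingLoader:
    """Hypothetical stand-in for a dataloader that counts produced batches."""

    def __init__(self, num_batches):
        self.num_batches = num_batches
        self.loaded = 0  # how many batches were actually produced

    def __iter__(self):
        for index in range(self.num_batches):
            # In a real dataloader this is where the expensive loading happens.
            self.loaded += 1
            yield index


def skip_batches(iterator, steps):
    """Advance the iterator the same way the quoted loop does: call next()."""
    for _ in range(steps):
        next(iterator)


loader = CountingLoader(num_batches=10)
iterator = iter(loader)

skip_batches(iterator, 7)     # pretend training stopped at iteration 7
first_batch = next(iterator)  # first batch that is actually trained on

print(first_batch, loader.loaded)
```

Every skipped batch is still produced before it is discarded, so with a checkpoint at iteration 32500 the loader has to produce 32500 batches before training continues. That would look like a hang if each batch is slow to load.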

ShenZheng2000 · 2024-05-22T02:47:13Z

I discovered that using a lower version of mmengine helps resolve the issue. For example:

mim install mmengine==0.10.2

chtzs · 2024-05-22T03:04:42Z

I think this is the cause of the problem. Here's the PR: open-mmlab/mmengine#1471

ShenZheng2000 · 2024-05-22T03:35:38Z

@chtzs Thanks!

Saillxl · 2024-05-28T13:47:58Z

@chtzs Thanks!

I don't understand how to solve it. Can you tell me? Thanks a lot!

Outlying3720 · 2024-05-30T10:12:24Z

> @chtzs Thanks!
>
> I don't understand how to solve it. Can you tell me? Thanks a lot!

Just comment out these lines.

chtzs · 2024-06-04T07:07:56Z

> @chtzs Thanks!
>
> I don't understand how to solve it. Can you tell me? Thanks a lot!

@Saillxl The solution can be found in this issue: open-mmlab/mmengine#1520

sunye23 · 2025-02-10T16:44:46Z

> I discovered that using a lower version of mmengine helps resolve the issue. For example:
>
> mim install mmengine==0.10.2

Hi, will downgrading the mmengine version affect the model's training performance?

zdk258 closed this as completed Dec 11, 2024
