Resume hangs #3671

Closed
zdk258 opened this issue May 17, 2024 · 9 comments

zdk258 commented May 17, 2024

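When resuming a model, training hangs without reporting any error, while starting training from scratch works fine. Setting num_workers to 1 does not help either.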

05/17 11:07:53 - mmengine - INFO - resumed epoch: 0, iter: 32500
05/17 11:07:53 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
05/17 11:07:53 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
05/17 11:07:53 - mmengine - WARNING - Advance dataloader 32500 steps to skip data that has already been trained
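When a run stalls silently like this, one generic way to see where it is stuck is Python's standard faulthandler module. The sketch below is not taken from this thread; the helper names and the 300-second timeout are assumptions chosen for illustration. It periodically dumps the traceback of every thread to stderr until cancelled:

```python
import faulthandler


def enable_hang_dump(timeout_seconds=300):
    """Dump all thread tracebacks to stderr every timeout_seconds until cancelled."""
    # repeat=True keeps dumping at each interval instead of only once.
    faulthandler.dump_traceback_later(timeout_seconds, repeat=True)


def disable_hang_dump():
    """Cancel the pending periodic dump, e.g. once training iterations are running."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical placement: near the top of the training script.
enable_hang_dump(timeout_seconds=300)
# ... build the runner and start the resumed training here ...
disable_hang_dump()
```

If the dumped traceback keeps pointing at the same line, that line is the likely place where the process is spending its time.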

@ShenZheng2000

Same question here.

chtzs commented May 22, 2024

After checking the source code of mmengine, I found that it simply calls next repeatedly to skip the training data that has already been consumed.
In mmengine/runner/loops.py, class IterBasedTrainLoop:

        if self._iter > 0:
            print_log(
                f'Advance dataloader {self._iter} steps to skip data '
                'that has already been trained',
                logger='current',
                level=logging.WARNING)
            for _ in range(self._iter):
                next(self.dataloader_iterator)

In other words, "--resume" loads data just as regular training does, but discards every batch until it reaches the saved iteration. The run is therefore not stuck, only busy loading and throwing away data, and getting back to the saved iteration will not be much faster than starting a new training session.
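A minimal sketch of why the skip is expensive, using a plain Python generator as a stand-in for the dataloader. The CountingLoader class and the skip_steps helper are hypothetical and not part of mmengine. Every call to next still performs the per-sample loading work, so skipping N steps costs N loads:

```python
class CountingLoader:
    """Hypothetical stand-in for a dataloader; counts how many samples it loads."""

    def __init__(self, num_samples):
        self.num_samples = num_samples
        self.loads = 0

    def __iter__(self):
        for index in range(self.num_samples):
            self.loads += 1  # stands in for disk reads, decoding, augmentation
            yield index


def skip_steps(iterator, steps):
    """Advance the iterator as the quoted loop does: one next() call per step."""
    for _ in range(steps):
        next(iterator)


loader = CountingLoader(num_samples=100)
iterator = iter(loader)
skip_steps(iterator, 30)            # "resume" at iteration 30
first_batch_after_resume = next(iterator)
print(loader.loads, first_batch_after_resume)  # prints "31 30"
```

With a real dataloader, each of those discarded loads involves reading and preprocessing a full batch, which is why a resume point of 32500 iterations can look like a hang.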

@ShenZheng2000

I discovered that using a lower version of mmengine helps resolve the issue. For example:

mim install mmengine==0.10.2
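After downgrading, it may be worth confirming which version the environment actually picked up. The following is a small sketch using the standard importlib.metadata module; the installed_version helper and the PINNED constant are names introduced here for illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


PINNED = "0.10.2"  # the version suggested in this thread
found = installed_version("mmengine")
if found is None:
    print("mmengine is not installed in this environment")
elif found == PINNED:
    print(f"mmengine {found} matches the pinned version")
else:
    print(f"mmengine {found} is installed, expected {PINNED}")
```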

chtzs commented May 22, 2024

I think this PR is the cause of the problem: open-mmlab/mmengine#1471

@ShenZheng2000

@chtzs Thanks!

Saillxl commented May 28, 2024

@chtzs Thanks!

I don't understand how to solve it. Can you tell me? Thanks a lot!

@Outlying3720

> @chtzs Thanks!
>
> I don't understand how to solve it. Can you tell me? Thanks a lot!

Just comment out these lines.

chtzs commented Jun 4, 2024

> @chtzs Thanks!
>
> I don't understand how to solve it. Can you tell me? Thanks a lot!

@Saillxl The solution can be found in this issue: open-mmlab/mmengine#1520

zdk258 closed this as completed Dec 11, 2024

sunye23 commented Feb 10, 2025

> I discovered that using a lower version of mmengine helps resolve the issue. For example:
>
> mim install mmengine==0.10.2

Hi, will downgrading the mmengine version affect the model's training performance?
