
Pretraining works fine, but RL training stays at 0 accuracy #1

Open
BenjaminWinter opened this issue Nov 30, 2018 · 10 comments


@BenjaminWinter

Running:

  • PyTorch 0.3.1
  • Python 3.5.2

The RL training doesn't work for me on the NYT10 dataset (I haven't checked the others yet).
I first ran pretraining for 10 epochs with:

python main.py --epochPre 10 --numprocess 8 --datapath ../data/NYT10/ --pretrain True

which gets roughly 58 F1 on the test set. Afterwards I try the RL training with:

python main.py --epochRL 10 --numprocess 8 --start checkpoints/model_HRL_10 --datapath ../data/NYT10/

I stopped RL training after 3 epochs because not only were dev and test set F1 at 0, even training accuracy was 0.
The loss started at around 30, dropped to about -20 after only 60 batches, then slowly increased again and ended up hovering around -0.00005.
Checking the optimize() method, all reward arrays contain either all zeros or negative numbers.
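
A minimal sketch of that kind of reward check (numpy is assumed; top_rewards and bottom_rewards below are placeholder names for the reward arrays inside optimize(), not this repo's actual variable names):

```python
import numpy as np

def summarize_rewards(rewards, tag=""):
    # Print basic statistics of a reward array so a collapse
    # (all zeros or all negatives) is easy to spot in the log.
    r = np.asarray(rewards, dtype=np.float64)
    print("{}: min={:.4f} max={:.4f} mean={:.4f} zeros={:.1%} neg={:.1%}".format(
        tag, r.min(), r.max(), r.mean(),
        float(np.mean(r == 0)), float(np.mean(r < 0))))

# e.g. inside optimize(), before the policy-gradient update:
# summarize_rewards(top_rewards, "top")
# summarize_rewards(bottom_rewards, "bottom")
```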

@keavil

keavil commented Dec 1, 2018

Please try reducing the learning rate. I'm not sure what's going wrong, but it's worth a try.

@BenjaminWinter (Author)

Thank you for your quick reply. I tried that over the weekend, and a lower learning rate (0.00002) indeed helped a little.
The accuracy is no longer pinned at 0, but it still stays in the single digits. Shouldn't it already start higher because of the pretraining?

Would it be possible for you to share a pretraining model and a set of hyperparameters that work for you?

@truthless11 (Owner)

I have just rerun the pretraining for the NYT10 dataset with

python3 main.py --datapath ../data/NYT10/ --pretrain True

and got about 62 F1 on the test set. Here's the log output:

epoch 0: dev F1: 0.5301069217782779, test F1: 0.46435134141859613
epoch 1: dev F1: 0.6318181818181818, test F1: 0.5483187471211423
epoch 2: dev F1: 0.6612162616194424, test F1: 0.5576974804985205
epoch 3: dev F1: 0.715510522213562, test F1: 0.6136810144668691
epoch 4: dev F1: 0.7223641817575825, test F1: 0.6080997979443029
epoch 5: dev F1: 0.7285908473040326, test F1: 0.6140748120641246
epoch 6: dev F1: 0.7290673172895641, test F1: 0.6133403731080604
epoch 7: dev F1: 0.7419283010465375, test F1: 0.62323850039883
epoch 8: dev F1: 0.7316315205327415, test F1: 0.6132978723404255
epoch 9: dev F1: 0.7432131731197152, test F1: 0.6178214317317052
epoch 10: dev F1: 0.7410535674594355, test F1: 0.6209842484648929
epoch 11: dev F1: 0.7460694491573352, test F1: 0.6244008320520936
epoch 12: dev F1: 0.7455326849129156, test F1: 0.6203670385030586
epoch 13: dev F1: 0.7360536612632756, test F1: 0.6146158650843222
epoch 14: dev F1: 0.75188138829608, test F1: 0.6181299072091364

Then I train the model using RL with

python3 main.py --lr 2e-5 --datapath ../data/NYT10/ --start checkpoints/model_HRL_10

and the F1 score continues to rise:

epoch 0: dev F1: 0.7637886897835698, test F1: 0.6370854740775339
epoch 1: dev F1: 0.7609631266720949, test F1: 0.6375350140056023
epoch 2: dev F1: 0.7648014859530996, test F1: 0.6340052258305339

The model is still training; only the logs of the first 3 epochs are quoted here.

Environment:

  • Python 3.5.2
  • PyTorch 0.3.1

@misaki-sysu

misaki-sysu commented Mar 14, 2019

Similar question. @BenjaminWinter

Train the model with
python main.py --lr 2e-5 --datapath ../data/NYT10/

Dev and test set F1 are 0 and training accuracy is 0 in every training epoch, while the loss keeps declining.
After training for 2 epochs, every sentence's top_actions are [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0] (a quick histogram check is sketched at the end of this comment).

Environment:

  • Python 3.5.2
  • PyTorch 1.0.1 (I have ported the code to PyTorch 1.0 and no errors have come up yet)
@truthless11
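
A quick way to check for this kind of policy collapse (a sketch; all_top_actions is assumed to be the list of per-sentence top_actions collected at evaluation time):

```python
from collections import Counter

def action_histogram(all_top_actions):
    # Count how often each top-level action is predicted across all
    # sentences; a healthy policy should not emit almost only 0s.
    counts = Counter(a for actions in all_top_actions for a in actions)
    total = float(sum(counts.values()))
    for action in sorted(counts):
        print("action {}: {} ({:.1%})".format(
            action, counts[action], counts[action] / total))
```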

@WJYw

WJYw commented Jun 4, 2019

@misaki-sysu
I'm running into the same issue; test set F1 is 0. Have you solved this problem?

@WJYw

WJYw commented Jun 5, 2019

I pre-trained first with a learning rate of 0.00002, but the accuracy was 0 when I evaluated on the test set.
python main.py test --test True checkpoints/model_HRL_10
The result is:
0 148107 5628 0 150425 5713 0 152828 5803 0 154250 5859 test P: 0.0 test R: 0.0 test F1: 0
What could be causing this?

@yin-hong

yin-hong commented Dec 6, 2019

> (Quoting @WJYw's comment above.)

I'm hitting the same problem: F1 during training is good, but test set F1 is 0. Have you solved it?

@Yangzhenping520

> (Quoting @misaki-sysu's comment above.)

What did you change to port the code from PyTorch 0.3 to 1.0.1? Could you share your modified code? Thank you very much!
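
For reference, the usual API changes between PyTorch 0.3 and 1.0 look roughly like this (a general sketch of the migration, not the actual diff for this repo):

```python
import torch

# PyTorch 0.3 style:
#   from torch.autograd import Variable
#   x = Variable(torch.zeros(3), volatile=True)
#   loss_value = loss.data[0]

# PyTorch 1.0 style (Tensor and Variable are merged):
x = torch.zeros(3)             # no Variable wrapper needed
with torch.no_grad():          # replaces volatile=True for inference
    y = x * 2

loss = (y - 1).pow(2).sum()
loss_value = loss.item()       # replaces loss.data[0]
```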

@YiYingsheng

YiYingsheng commented Jul 7, 2020

@truthless11 I have encountered this problem. What is the reason, and how can I solve it?

THCudaCheck FAIL file=C:\w\1\s\tmp_conda_3.7_055306\conda\conda-bld\pytorch_1556690124416\work\torch/csrc/generic/StorageSharing.cpp line=245 error=63 : OS call failed or operation not supported on this OS
Traceback (most recent call last):
File "E://HRL-RE-master/code/main.py", line 103, in
p.start()
File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\popen_spawn_win32.py", line 89, in init
reduction.dump(process_obj, to_child)
File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "F:\SoftWare\Anaconda\envs\PyTorch\lib\site-packages\torch\multiprocessing\reductions.py", line 231, in reduce_tensor
event_sync_required) = storage._share_cuda_()
RuntimeError: cuda runtime error (63) : OS call failed or operation not supported on this OS at C:\w\1\s\tmp_conda_3.7_055306\conda\conda-bld\pytorch_1556690124416\work\torch/csrc/generic/StorageSharing.cpp:245
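
This traceback means the parent process is trying to hand a CUDA tensor to a child process, which Windows does not support (CUDA storages cannot be shared across processes with the spawn start method). A common workaround is to keep the shared model on the CPU and move it to the GPU only inside each worker, with the spawning code guarded by an entry-point check. A sketch under those assumptions (build_model and worker are hypothetical names, not this repo's code):

```python
import torch
import torch.multiprocessing as mp

def worker(rank, model):
    # Move the model to the GPU only inside the child process;
    # CUDA tensors cannot be pickled across processes on Windows.
    if torch.cuda.is_available():
        model = model.cuda()
    # ... run this worker's share of the training loop ...

if __name__ == "__main__":       # required on Windows (spawn start method)
    model = build_model()        # hypothetical constructor; keep the model on the CPU here
    model.share_memory()         # share CPU parameters across workers
    processes = []
    for rank in range(8):
        p = mp.Process(target=worker, args=(rank, model))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

Running with --numprocess 1 may also sidestep the sharing path entirely.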

@xxxxxi-gg

> (Quoting @YiYingsheng's comment and traceback above.)

@YiYingsheng Excuse me, have you solved it?
