
Pretraining works fine, but RL training stays at 0 accuracy #1

Open
BenjaminWinter opened this issue Nov 30, 2018 · 10 comments


@BenjaminWinter

Running:

  • PyTorch 0.3.1
  • Python 3.5.2

The RL training doesn't work for me on the NYT10 dataset (I haven't checked the others yet).
I first ran pretraining for 10 epochs with:

python main.py --epochPre 10 --numprocess 8 --datapath ../data/NYT10/ --pretrain True

which gets roughly 58 F1 on the test set. Afterwards I try the RL training with:

python main.py --epochRL 10 --numprocess 8 --start checkpoints/model_HRL_10 --datapath ../data/NYT10/

I stopped RL training after 3 epochs because not only were dev and test set F1 at 0, even training accuracy was 0.
The loss started at around 30, dropped to about -20 after only 60 batches, then slowly increased again and ended up hovering around -0.00005.
Checking the optimize() method, all reward arrays contain either all zeros or negative numbers.
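
A minimal sketch of that kind of reward check (numpy is assumed; top_rewards and bottom_rewards below are placeholder names for the reward arrays inside optimize(), not this repo's actual variable names):

```python
import numpy as np

def summarize_rewards(rewards, tag=""):
    # Print basic statistics of a reward array so a collapse
    # (all zeros or all negatives) is easy to spot in the log.
    r = np.asarray(rewards, dtype=np.float64)
    print("{}: min={:.4f} max={:.4f} mean={:.4f} zeros={:.1%} neg={:.1%}".format(
        tag, r.min(), r.max(), r.mean(),
        float(np.mean(r == 0)), float(np.mean(r < 0))))

# e.g. inside optimize(), before the policy-gradient update:
# summarize_rewards(top_rewards, "top")
# summarize_rewards(bottom_rewards, "bottom")
```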

@keavil

keavil commented Dec 1, 2018

Please try reducing the learning rate. I'm not sure what's going wrong, but it's worth a try.

@BenjaminWinter (Author)

Thank you for your quick reply. I tried that over the weekend, and a lower learning rate (0.00002) indeed helped a little.
The accuracy is no longer pinned at 0, but it still stays in the single digits. Shouldn't it already start higher because of the pretraining?

Would it be possible for you to share a pretraining model and a set of hyperparameters that work for you?

@truthless11 (Owner)

I have just rerun the pretraining for the NYT10 dataset with

python3 main.py --datapath ../data/NYT10/ --pretrain True

and got about 62 F1 on the test set. Here's the log output:

epoch 0: dev F1: 0.5301069217782779, test F1: 0.46435134141859613
epoch 1: dev F1: 0.6318181818181818, test F1: 0.5483187471211423
epoch 2: dev F1: 0.6612162616194424, test F1: 0.5576974804985205
epoch 3: dev F1: 0.715510522213562, test F1: 0.6136810144668691
epoch 4: dev F1: 0.7223641817575825, test F1: 0.6080997979443029
epoch 5: dev F1: 0.7285908473040326, test F1: 0.6140748120641246
epoch 6: dev F1: 0.7290673172895641, test F1: 0.6133403731080604
epoch 7: dev F1: 0.7419283010465375, test F1: 0.62323850039883
epoch 8: dev F1: 0.7316315205327415, test F1: 0.6132978723404255
epoch 9: dev F1: 0.7432131731197152, test F1: 0.6178214317317052
epoch 10: dev F1: 0.7410535674594355, test F1: 0.6209842484648929
epoch 11: dev F1: 0.7460694491573352, test F1: 0.6244008320520936
epoch 12: dev F1: 0.7455326849129156, test F1: 0.6203670385030586
epoch 13: dev F1: 0.7360536612632756, test F1: 0.6146158650843222
epoch 14: dev F1: 0.75188138829608, test F1: 0.6181299072091364

Then I train the model using RL with

python3 main.py --lr 2e-5 --datapath ../data/NYT10/ --start checkpoints/model_HRL_10

and the F1 score continues to rise:

epoch 0: dev F1: 0.7637886897835698, test F1: 0.6370854740775339
epoch 1: dev F1: 0.7609631266720949, test F1: 0.6375350140056023
epoch 2: dev F1: 0.7648014859530996, test F1: 0.6340052258305339

The model is still training; only the logs of the first 3 epochs are quoted here.

Environment:

  • Python 3.5.2
  • PyTorch 0.3.1

@misaki-sysu

misaki-sysu commented Mar 14, 2019

Similar question. @BenjaminWinter

Train the model with
python main.py --lr 2e-5 --datapath ../data/NYT10/

Dev and test set F1 are 0 and training accuracy is 0 in every training epoch, while the loss keeps declining.
After training for 2 epochs, every sentence's top_actions are [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0] (a quick histogram check is sketched at the end of this comment).

Environment:

  • Python 3.5.2
  • PyTorch 1.0.1 (I have ported the code to PyTorch 1.0 and no errors have come up yet)
@truthless11
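
A quick way to check for this kind of policy collapse (a sketch; all_top_actions is assumed to be the list of per-sentence top_actions collected at evaluation time):

```python
from collections import Counter

def action_histogram(all_top_actions):
    # Count how often each top-level action is predicted across all
    # sentences; a healthy policy should not emit almost only 0s.
    counts = Counter(a for actions in all_top_actions for a in actions)
    total = float(sum(counts.values()))
    for action in sorted(counts):
        print("action {}: {} ({:.1%})".format(
            action, counts[action], counts[action] / total))
```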

@WJYw

WJYw commented Jun 4, 2019

@misaki-sysu
I'm running into the same issue; test set F1 is 0. Have you solved this problem?

@WJYw

WJYw commented Jun 5, 2019

I pre-trained first with a learning rate of 0.00002, but the accuracy was 0 when I evaluated on the test set.
python main.py test --test True checkpoints/model_HRL_10
The result is:
0 148107 5628 0 150425 5713 0 152828 5803 0 154250 5859 test P: 0.0 test R: 0.0 test F1: 0
What could be causing this?

@yin-hong

yin-hong commented Dec 6, 2019

> (Quoting @WJYw's comment above.)

I'm hitting the same problem: F1 during training is good, but test set F1 is 0. Have you solved it?

@Yangzhenping520

> (Quoting @misaki-sysu's comment above.)

What did you change to port the code from PyTorch 0.3 to 1.0.1? Could you share your modified code? Thank you very much!
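
For reference, the usual API changes between PyTorch 0.3 and 1.0 look roughly like this (a general sketch of the migration, not the actual diff for this repo):

```python
import torch

# PyTorch 0.3 style:
#   from torch.autograd import Variable
#   x = Variable(torch.zeros(3), volatile=True)
#   loss_value = loss.data[0]

# PyTorch 1.0 style (Tensor and Variable are merged):
x = torch.zeros(3)             # no Variable wrapper needed
with torch.no_grad():          # replaces volatile=True for inference
    y = x * 2

loss = (y - 1).pow(2).sum()
loss_value = loss.item()       # replaces loss.data[0]
```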

@YiYingsheng

YiYingsheng commented Jul 7, 2020

@truthless11 I have encountered this problem. What is the reason, and how can I solve it?

THCudaCheck FAIL file=C:\w\1\s\tmp_conda_3.7_055306\conda\conda-bld\pytorch_1556690124416\work\torch/csrc/generic/StorageSharing.cpp line=245 error=63 : OS call failed or operation not supported on this OS
Traceback (most recent call last):
File "E://HRL-RE-master/code/main.py", line 103, in
p.start()
File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\popen_spawn_win32.py", line 89, in init
reduction.dump(process_obj, to_child)
File "F:\SoftWare\Anaconda\envs\PyTorch\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "F:\SoftWare\Anaconda\envs\PyTorch\lib\site-packages\torch\multiprocessing\reductions.py", line 231, in reduce_tensor
event_sync_required) = storage._share_cuda_()
RuntimeError: cuda runtime error (63) : OS call failed or operation not supported on this OS at C:\w\1\s\tmp_conda_3.7_055306\conda\conda-bld\pytorch_1556690124416\work\torch/csrc/generic/StorageSharing.cpp:245
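
This traceback means the parent process is trying to hand a CUDA tensor to a child process, which Windows does not support (CUDA storages cannot be shared across processes with the spawn start method). A common workaround is to keep the shared model on the CPU and move it to the GPU only inside each worker, with the spawning code guarded by an entry-point check. A sketch under those assumptions (build_model and worker are hypothetical names, not this repo's code):

```python
import torch
import torch.multiprocessing as mp

def worker(rank, model):
    # Move the model to the GPU only inside the child process;
    # CUDA tensors cannot be pickled across processes on Windows.
    if torch.cuda.is_available():
        model = model.cuda()
    # ... run this worker's share of the training loop ...

if __name__ == "__main__":       # required on Windows (spawn start method)
    model = build_model()        # hypothetical constructor; keep the model on the CPU here
    model.share_memory()         # share CPU parameters across workers
    processes = []
    for rank in range(8):
        p = mp.Process(target=worker, args=(rank, model))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

Running with --numprocess 1 may also sidestep the sharing path entirely.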

@xxxxxi-gg

> (Quoting @YiYingsheng's comment and traceback above.)

@YiYingsheng Excuse me, have you solved it?
