
DeBERTa models support #1

Open · KoichiYasuoka opened this issue Jun 9, 2022 · 10 comments
Labels: enhancement (New feature or request)

@KoichiYasuoka commented Jun 9, 2022

Thank you for releasing JGLUE, but I could not evaluate my deberta-base-japanese-aozora. There seem to be two problems:

  • DebertaV2ForMultipleChoice requires transformers v4.19.0 or later, but JGLUE requires v4.9.2
  • Fast tokenizers (including DebertaV2TokenizerFast) are not supported for JSQuAD, even with --use_fast_tokenizer

I tried forcing v4.19.2 to work around these problems, but I could not resolve the latter. Please see the details in my diary (written in Japanese). Do you have any idea?
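
For reference, the first constraint can be checked directly (a minimal sketch; DebertaV2ForMultipleChoice only exists in transformers v4.19.0 and later):

import transformers
print(transformers.__version__)  # JGLUE's patch targets v4.9.2

# DebertaV2ForMultipleChoice was added in transformers v4.19.0,
# so this import raises ImportError on earlier versions
from transformers import DebertaV2ForMultipleChoice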

@KoichiYasuoka (Author) commented Jun 10, 2022

# Clone transformers v4.19.2 and apply the JGLUE patch (written against v4.9.2) on top of it
!test -d transformers-4.19.2 || git clone -b v4.19.2 --depth=1 https://github.com/huggingface/transformers transformers-4.19.2
!test -d JGLUE || ( git clone --depth=1 https://github.com/yahoojapan/JGLUE && cat JGLUE/fine-tuning/patch/transformers-4.9.2_jglue-1.0.0.patch | ( cd transformers-4.19.2 && patch -p1 ) )
!cd transformers-4.19.2 && pip install .
!pip install -r transformers-4.19.2/examples/pytorch/text-classification/requirements.txt
!pip install protobuf==3.19.1 tensorboard
# Flatten the nested JSQuAD JSON (data -> paragraphs -> qas) into the flat
# id/title/context/question/answers records that run_qa.py expects
import json
for f in ["train-v1.0.json","valid-v1.0.json"]:
  with open("JGLUE/datasets/jsquad-v1.0/"+f,"r",encoding="utf-8") as r:
    j=json.load(r)
  u=[]
  for d in j["data"]:
    for p in d["paragraphs"]:
      for q in p["qas"]:
        u.append({"id":q["id"],"title":d["title"],"context":p["context"],"question":q["question"],"answers":{"text":[x["text"] for x in q["answers"]],"answer_start":[x["answer_start"] for x in q["answers"]]}})
  with open(f,"w",encoding="utf-8") as w:
    json.dump({"data":u},w,ensure_ascii=False,indent=2)
# Fine-tune and evaluate on the flattened files
!python transformers-4.19.2/examples/pytorch/question-answering/run_qa.py --model_name_or_path KoichiYasuoka/deberta-base-japanese-aozora --do_train --do_eval --max_seq_length 384 --learning_rate 5e-05 --num_train_epochs 3 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --output_dir ./output_jsquad2 --overwrite_output_dir --train_file train-v1.0.json --validation_file valid-v1.0.json --save_steps 5000 --warmup_ratio 0.1

I've just been trying the program above on Google Colaboratory, but I'm not sure whether the conversion is really suitable for JSQuAD. @tomohideshibata -san, does [SEP] in the jsquad-v1.0 files mean sep_token or not?

@tomohideshibata (Contributor)

Thank you for trying JGLUE.

For the first comment, the latest version, v4.19.2, works. (We have updated the explanation of the supported Hugging Face transformers versions via 53e5ecd.)

For the second comment, we used examples/legacy/question-answering/run_squad.py because examples/pytorch/question-answering/run_qa.py supports only fast tokenizers (BertJapaneseTokenizer does not have a fast version). We will check if run_qa.py works with JSQuAD.
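
To see which tokenizers have a fast version, one can check the is_fast attribute (a minimal sketch; the cl-tohoku checkpoint is only an example of a model using BertJapaneseTokenizer, and loading it needs fugashi and its dictionaries installed):

from transformers import AutoTokenizer

# DeBERTa-V2 checkpoints can load a fast tokenizer, which run_qa.py accepts
tokenizer = AutoTokenizer.from_pretrained("KoichiYasuoka/deberta-base-japanese-aozora")
print(tokenizer.is_fast)  # True

# BertJapaneseTokenizer has no fast version, so run_qa.py rejects such models
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
print(tokenizer.is_fast)  # False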

Does [SEP] in the jsquad-v1.0 files mean sep_token or not?

Yes.

@KoichiYasuoka (Author) commented Jun 11, 2022

Thank you @tomohideshibata -san for confirming transformers v4.19.2. Here I realize that I need to replace [SEP] with another sep_token when I evaluate a model whose sep_token is not [SEP]. But... well... unless that sep_token also consists of 5 characters, I should change answer_start, shouldn't I? Umm...

@tomohideshibata (Contributor)

I should change answer_start, shouldn't I?

Yes. In the current version, sep_token is hard-coded in the dataset.
One way to solve this problem is to calculate answer_start in the evaluation script, given the sep_token of the tokenizer being used.
We will try this in the next version.
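
A minimal sketch of that idea (replace_sep is a hypothetical helper; it assumes the context contains the literal string [SEP] once, between title and passage, and that no answer spans it):

# Hypothetical helper: adapt one flattened JSQuAD record to a tokenizer
# whose sep_token is not "[SEP]", shifting answer_start by the length difference.
def replace_sep(example, sep_token):
    old = "[SEP]"
    pos = example["context"].find(old)  # JSQuAD contexts look like "title [SEP] passage"
    example["context"] = example["context"].replace(old, sep_token)
    shift = len(sep_token) - len(old)
    example["answers"]["answer_start"] = [
        s + shift if s > pos else s  # answers after the separator shift by the delta
        for s in example["answers"]["answer_start"]
    ]
    return example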

@KoichiYasuoka (Author) commented Jun 11, 2022

Thank you @tomohideshibata -san for the information about [SEP]. Well, I've just made a tentative patch, https://github.com/KoichiYasuoka/JGLUE/blob/main/fine-tuning/patch/transformers-4.19.2_jglue-1.0.0.patch, for transformers v4.19.2, where I included jsquad_metrics.py instead of changing the original squad_metrics.py. But I couldn't include jsquad.py, since I couldn't find a proper way to force [SEP] as sep_token in squad_convert_example_to_features() and its neighbors...

@tomohideshibata added the enhancement (New feature or request) label on Jun 18, 2022
@conan1024hao

We encountered a similar problem. examples/legacy/question-answering/run_squad.py does not work well with fast tokenizers; our model cannot run on that script even with use_fast=False. So we tested examples/pytorch/question-answering/run_qa.py: multilingual models and Waseda RoBERTa run on it well, but the Tohoku BERT tokenizer does not support it. The result for nlp-waseda/roberta-base-japanese is below (without parameter tuning); it seems to work fine as long as we can solve the tokenizer problem.

EM     F1
0.855  0.910

@tomohideshibata (Contributor)

Thanks for reporting your results. We are also going to test run_qa.py.

@kaisugi commented Nov 15, 2022

I also tried run_qa.py (w/ trainer_qa.py & utils_qa.py) in transformers v4.19.2, but somehow an error occurred like this:

  File "run_qa.py", line 661, in <module>
    main()
  File "run_qa.py", line 337, in main
    answer_column_name = "answers" if "answers" in column_names else column_names[2]
IndexError: list index out of range

@KoichiYasuoka (Author) commented Nov 15, 2022

Hi @kaisugi -san, I needed some kind of conversion for run_qa.py. Here is my tentative script on Google Colaboratory:

# Same procedure as the v1.0 script above, now for JGLUE v1.1 (jglue-1.1.0 patch, jsquad-v1.1 files)
!test -d transformers-4.19.2 || git clone -b v4.19.2 --depth=1 https://github.com/huggingface/transformers transformers-4.19.2
!test -d JGLUE || ( git clone --depth=1 https://github.com/yahoojapan/JGLUE && cat JGLUE/fine-tuning/patch/transformers-4.9.2_jglue-1.1.0.patch | ( cd transformers-4.19.2 && patch -p1 ) )
!cd transformers-4.19.2 && pip install .
!pip install -r transformers-4.19.2/examples/pytorch/text-classification/requirements.txt
!pip install protobuf==3.19.1 tensorboard
import json
for f in ["train-v1.1.json","valid-v1.1.json"]:
  with open("JGLUE/datasets/jsquad-v1.1/"+f,"r",encoding="utf-8") as r:
    j=json.load(r)
  u=[]
  for d in j["data"]:
    for p in d["paragraphs"]:
      for q in p["qas"]:
        u.append({"id":q["id"],"title":d["title"],"context":p["context"],"question":q["question"],"answers":{"text":[x["text"] for x in q["answers"]],"answer_start":[x["answer_start"] for x in q["answers"]]}})
  with open(f,"w",encoding="utf-8") as w:
    json.dump({"data":u},w,ensure_ascii=False,indent=2)
!python transformers-4.19.2/examples/pytorch/question-answering/run_qa.py --model_name_or_path KoichiYasuoka/deberta-base-japanese-aozora --do_train --do_eval --max_seq_length 384 --learning_rate 5e-05 --num_train_epochs 3 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --output_dir ./output_jsquad2 --overwrite_output_dir --train_file train-v1.1.json --validation_file valid-v1.1.json --save_steps 5000 --warmup_ratio 0.1
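
The conversion matters because run_qa.py loads JSON files with field="data" and then expects flat per-example columns; with the original nested JSQuAD layout (title/paragraphs/qas) it sees too few columns, and column_names[2] raises the IndexError above. A quick way to check the flattened output (a minimal sketch using the datasets library):

from datasets import load_dataset

# run_qa.py does roughly load_dataset("json", data_files=..., field="data"),
# so after the flattening it sees the columns it expects
ds = load_dataset("json", data_files={"train": "train-v1.1.json"}, field="data")
print(ds["train"].column_names)  # e.g. ['id', 'title', 'context', 'question', 'answers']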

@kaisugi commented Nov 16, 2022

@KoichiYasuoka

I confirmed that your patch script works properly. Thanks!
