
For a custom RE dataset with entities marked in advance #166

Open
KennyNg-19 opened this issue Aug 17, 2021 · 1 comment

Comments

KennyNg-19 commented Aug 17, 2021

Hi, as a newcomer, I would like to ask some basic questions about fine-tuning on a custom RE dataset with entities marked in advance:

  1. Do we need to constrain which entity markers or dummy words are used for the BioNLP tasks when marking the entities (e.g. @disease$, [e] some disease [/e])?

  2. When preprocessing, do we need to add some code to help the tokenizer handle the entity markers? E.g., if we use [E1] to mark an entity, do we need to let the tokenizer know about it:

tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]'])
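For illustration, the kind of entity marking described in question 1 might look like the following minimal sketch; the marker strings, sentence, and character offsets here are assumptions, not part of the BioBERT preprocessing code:

```python
# Sketch: wrap pre-annotated entity spans in marker tokens.
# Spans are (start, end, open_tag, close_tag) with character offsets
# (end exclusive). All names here are illustrative.

def mark_entities(text, spans):
    """Return text with each span wrapped in its open/close markers."""
    out, prev = [], 0
    for start, end, open_tag, close_tag in sorted(spans):
        out.append(text[prev:start])
        out.append(f"{open_tag} {text[start:end]} {close_tag}")
        prev = end
    out.append(text[prev:])
    return "".join(out)

sentence = "Aspirin reduces the risk of stroke."
spans = [(0, 7, "[E1]", "[/E1]"), (28, 34, "[E2]", "[/E2]")]
print(mark_entities(sentence, spans))
# [E1] Aspirin [/E1] reduces the risk of [E2] stroke [/E2].
```

The same helper works for the @disease$-style replacement scheme by using an empty close tag and discarding the original mention instead of keeping it.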

Hi Chloe,

Yes, you need to input task_name. If your dataset is a binary classification task, you can use either of them. Basically, euadr and gad are processed in the same way (using BioBERTProcessor).

biobert/run_re.py

Lines 914 to 917 in 37599fb

"gad": BioBERTProcessor,
"polysearch": BioBERTProcessor,
"mirnadisease": BioBERTProcessor,
"euadr": BioBERTProcessor,

Please note, however, that the chemprot dataset is a multi-class classification task. Hence it is processed differently, and the same holds for the evaluation script.
Thank you for your interest in our work!
Best,
WonJin

@wonjininfo (Member)

Hi Kenny,
Thank you for your interest in our paper, and my apologies for the delayed response.

You can use any entity markers or dummy words, but please refrain from using common words.
In my case, I used synthetic words like ENToGENEoMK. You need to register these words in the vocab (please see the next paragraph).

In order to add a custom token to the tokenizer (for this repo, i.e. the TensorFlow version), you need to modify vocab.txt. If you open the vocab.txt file, you will see reserved [unused] tokens near the beginning. You can replace these tokens with your custom tokens.
I think your code tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]']) is for the HuggingFace framework. Please check https://github.com/dmis-lab/biobert-pytorch for the PyTorch/HuggingFace version of the code.
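The vocab.txt edit described above can be scripted; here is a minimal sketch that overwrites the first reserved [unused] slots with custom markers. The file contents and marker names are illustrative assumptions:

```python
# Sketch: replace reserved [unusedN] slots in a BERT-style vocab.txt with
# custom entity-marker tokens, so the TensorFlow tokenizer treats them as
# known words. Marker names and the toy vocab below are assumptions.
import os
import tempfile

def register_markers(vocab_path, markers):
    """Overwrite the first len(markers) [unusedN] lines with custom tokens."""
    with open(vocab_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    it = iter(markers)
    for i, tok in enumerate(lines):
        if tok.startswith("[unused"):
            try:
                lines[i] = next(it)
            except StopIteration:
                break
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

# Demo on a toy vocab file:
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                  encoding="utf-8")
tmp.write("[PAD]\n[unused1]\n[unused2]\n[CLS]\n[SEP]\n")
tmp.close()
register_markers(tmp.name, ["[E1]", "[/E1]"])
with open(tmp.name, encoding="utf-8") as f:
    print(f.read().splitlines())
# ['[PAD]', '[E1]', '[/E1]', '[CLS]', '[SEP]']
os.remove(tmp.name)
```

Note that the replacement only changes the vocab file; since the [unused] slots already have embedding rows in the pretrained checkpoint, no resizing is needed on the TensorFlow side.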

Thank you, and once again, sorry for the delayed response.
