Hi, as a beginner, I would like to ask some basic questions about fine-tuning on a custom RE dataset with entities marked in advance:
Do we need to constrain which entity markers or dummy words are used for BioNLP when marking the entities (e.g. @disease$, [e] some disease [/e])?
When preprocessing, do we need to add code that helps the model tokenize the entity markers? E.g., if we use [E1] to mark an entity, do we need to let the tokenizer know about it with tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]'])?
Yes, you need to provide task_name. If your dataset is a binary classification task, you can use either euadr or gad. Both are processed in the same way (using BioBERTProcessor).
Please note, however, that the chemprot dataset is a multi-class classification task. It is therefore processed differently, and the same holds for the evaluation script.
Thank you for your interest in our work!
Best,
WonJin
Hi Kenny,
Thank you for your interest in our paper and my apologies for the delay in response.
You can use any entity marker or dummy word, but please avoid words that commonly occur in ordinary text.
In my case, I used synthetic words like ENToGENEoMK. You need to register these words in the vocab. (Please see the next paragraph.)
To add a custom token to the tokenizer (for this repo, i.e. the TensorFlow version), you need to modify vocab.txt. If you open vocab.txt, you will see reserved tokens such as [unused1] at the beginning. You can replace these tokens with your custom tokens.
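The vocab.txt edit described above can also be scripted. The following is only a minimal sketch, not code from this repo: the function names replace_unused_tokens and patch_vocab_file, the file paths, and the marker ENToDISEASEoMK are hypothetical, and it assumes vocab.txt holds one token per line with reserved entries of the form [unusedN]. Because the reserved lines are overwritten rather than appended to, the vocabulary size is unchanged.

```python
import re
from typing import List

# Matches reserved placeholder entries such as "[unused1]" (assumed format).
UNUSED_PATTERN = re.compile(r"^\[unused\d+\]$")


def replace_unused_tokens(vocab: List[str], custom_tokens: List[str]) -> List[str]:
    """Return a copy of the vocab with the first [unusedN] slots overwritten.

    Tokens that are already in the vocab are skipped, so the function can be
    run repeatedly without consuming extra slots.
    """
    existing = set(vocab)
    pending = [tok for tok in custom_tokens if tok not in existing]
    patched = []
    for token in vocab:
        if pending and UNUSED_PATTERN.match(token):
            patched.append(pending.pop(0))  # reuse the reserved slot
        else:
            patched.append(token)
    if pending:
        raise ValueError(f"Not enough [unusedN] slots for: {pending}")
    return patched


def patch_vocab_file(src_path: str, dst_path: str, custom_tokens: List[str]) -> List[str]:
    """Read a one-token-per-line vocab file and write a patched copy."""
    with open(src_path, encoding="utf-8") as f:
        vocab = [line.rstrip("\n") for line in f]
    patched = replace_unused_tokens(vocab, custom_tokens)
    with open(dst_path, "w", encoding="utf-8") as f:
        f.write("\n".join(patched) + "\n")
    return patched


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real vocab.txt.
    demo_vocab = ["[PAD]", "[unused1]", "[unused2]", "[CLS]", "the"]
    # ENToDISEASEoMK is a made-up marker in the same style as ENToGENEoMK.
    print(replace_unused_tokens(demo_vocab, ["ENToGENEoMK", "ENToDISEASEoMK"]))
```

One thing to check before settling on markers: BERT's basic tokenizer splits on punctuation, so bracketed markers such as [E1] may be broken apart before the vocab lookup. Purely alphanumeric words like ENToGENEoMK are likely the safer choice with this approach.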
I think your code tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]']) is for the HuggingFace framework. Please check https://github.com/dmis-lab/biobert-pytorch for the PyTorch (HuggingFace) version of the code.
Thank you and once again, sorry for the delay in response.