
Discussion about Data Labelling in the Vietnamese language #19

Open
lacls opened this issue May 29, 2021 · 1 comment

Comments


lacls commented May 29, 2021

Thank you very much for this great project.

I am a bit confused about the following:

For instance

  • Input sentence: "Giao tôi lê_lai phường hai tân_bình hcm"
  • Value after the tokenizer:
  • {'input_ids': [0, 64003, 64003, 17489, 6115, 64139, 64151, 64003, 6446, 64313, 1340, 74780, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
  • This is because "lê_lai" is tokenized to ['lê@@', 'l@@', 'ai'], "tân_bình" to ['tân@@', 'bình'], and "hcm" to ['h@@', 'cm']
  • The result I got in the end: ['O', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'O', 'I-LOC', 'I-LOC', 'O']

In fact, the prediction should have only 7 tags, one per input token, but I got more than that. Does this project have any strategy for handling this?

According to Hugging Face's documentation:

Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in the W-NUT corpus are not in DistilBert’s vocabulary. Bert and many models like it use a method called WordPiece Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the vocabulary. For example, DistilBert’s tokenizer would split the Twitter handle @huggingface into the tokens ['@', 'hugging', '##face']. This is a problem for us because we have exactly one tag per token. If the tokenizer splits a token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.

One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in 🤗 Transformers by setting the labels we wish to ignore to -100. In the example above, if the label for @huggingface is 3 (indexing B-corporation), we would set the labels of ['@', 'hugging', '##face'] to [3, -100, -100].

Let’s write a function to do this. This is where we will use the offset_mapping from the tokenizer as mentioned above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token’s start position and end position relative to the original token it was split from. That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100. While we’re at it, we can also set labels to -100 if the second position of the offset mapping is 0, since this means it must be a special token like [PAD] or [CLS].
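The alignment rule described above can be sketched in plain Python. This is a minimal illustration of the strategy from the quoted documentation, not code from this project: `offset_mapping` is assumed to hold per-word (start, end) offsets for each sub-token, as the documentation describes, with (0, 0) for special tokens.

```python
def align_labels(word_labels, offset_mapping, ignore_index=-100):
    """Map word-level labels to sub-token labels.

    Only the first sub-token of each word keeps its label;
    continuation sub-tokens and special tokens get `ignore_index`
    so the loss function skips them.
    """
    labels = []
    word_idx = -1  # index into word_labels
    for start, end in offset_mapping:
        if start == 0 and end == 0:
            labels.append(ignore_index)      # special token, e.g. [CLS]/[PAD]
        elif start == 0:
            word_idx += 1                    # first sub-token of a new word
            labels.append(word_labels[word_idx])
        else:
            labels.append(ignore_index)      # continuation sub-token
    return labels

# Example from the docs: '@huggingface' -> ['@', 'hugging', '##face'],
# word label 3 (B-corporation), wrapped in [CLS] ... [SEP]:
offsets = [(0, 0), (0, 1), (1, 8), (8, 12), (0, 0)]
print(align_labels([3], offsets))  # -> [-100, 3, -100, -100, -100]
```

At evaluation time the same mask lets you recover exactly one prediction per original word, which would give the expected 7 tags for the 7 input tokens in the example above.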

I do appreciate your time and sharing.

@smaakage85
Contributor

Hi @lacls

Sorry for the late reply.

I am a little unsure what exactly you want me to do. Do you have an idea? I am very open to a Pull Request on this matter.

Best,
Lars
