Merge ner tags in ANERCORP and LDC ontonotes #10

Open
YanLiang1102 opened this issue Jun 20, 2018 · 4 comments

Comments

@YanLiang1102
Collaborator

YanLiang1102 commented Jun 20, 2018

@ahalterman @cegme
Andy and Dr. Grant, any idea? I'm seeing weird tags like 'B-DATE" E_OFF="5', for example:
'head': 0,
'id': 18,
'ner': 'B-DATE" E_OFF="5',
'orth': '2001',
'tag': ''},
{'dep': '',
'head': 0,
'id': 19,
'ner': 'I-DATE" E_OFF="5',
'orth': '-',
'tag': ''},
{'dep': '',
'head': 0,
'id': 20,
'ner': 'L-DATE" E_OFF="5',
'orth': '2002 ',
'tag': ''},

I have no idea what this "5" means. The span only has 3 tokens, so why is it 5?
Here are all the NER tags in LDC OntoNotes, with counts:

{'-': 4,
 'B-CARDINAL': 149,
 'B-CARDINAL" E_OFF="1': 3,
 'B-CARDINAL" E_OFF="3': 1,
 'B-DATE': 1186,
 'B-DATE" E_OFF="1': 38,
 'B-DATE" E_OFF="5': 3,
 'B-DATE" S_OFF="1': 1,
 'B-EVENT': 422,
 'B-FAC': 470,
 'B-FAC" S_OFF="1': 4,
 'B-GPE': 575,
 'B-GPE" E_OFF="4': 1,
 'B-GPE" S_OFF="1': 3,
 'B-LANGUAGE': 6,
 'B-LAW': 127,
 'B-LOC': 293,
 'B-LOC" S_OFF="1': 1,
 'B-MONEY': 233,
 'B-NORP': 61,
 'B-ORDINAL': 34,
 'B-ORG': 2848,
 'B-ORG" E_OFF="1': 1,
 'B-ORG" S_OFF="1': 13,
 'B-PERCENT': 129,
 'B-PERSON': 3869,
 'B-PERSON" E_OFF="2': 1,
 'B-PERSON" S_OFF="1': 46,
 'B-PRODUCT': 55,
 'B-QUANTITY': 222,
 'B-QUANTITY" E_OFF="2': 6,
 'B-QUANTITY" E_OFF="3': 1,
 'B-QUANTITY" S_OFF="2': 2,
 'B-TIME': 340,
 'B-TIME" E_OFF="1': 7,
 'B-WORK_OF_ART': 159,
 'I-CARDINAL': 136,
 'I-CARDINAL" E_OFF="1': 9,
 'I-CARDINAL" E_OFF="3': 1,
 'I-DATE': 860,
 'I-DATE" E_OFF="1': 14,
 'I-DATE" E_OFF="5': 3,
 'I-DATE" S_OFF="1': 1,
 'I-EVENT': 973,
 'I-FAC': 494,
 'I-GPE': 98,
 'I-LAW': 233,
 'I-LOC': 97,
 'I-MONEY': 177,
 'I-NORP': 8,
 'I-ORDINAL': 4,
 'I-ORG': 3290,
 'I-ORG" E_OFF="1': 1,
 'I-ORG" S_OFF="1': 3,
 'I-PERCENT': 156,
 'I-PERSON': 917,
 'I-PERSON" S_OFF="1': 1,
 'I-PRODUCT': 96,
 'I-QUANTITY': 73,
 'I-QUANTITY" E_OFF="2': 6,
 'I-QUANTITY" E_OFF="3': 1,
 'I-QUANTITY" S_OFF="2': 2,
 'I-TIME': 260,
 'I-WORK_OF_ART': 579,
 'L-CARDINAL': 148,
 'L-CARDINAL" E_OFF="1': 3,
 'L-CARDINAL" E_OFF="3': 1,
 'L-DATE': 1187,
 'L-DATE" E_OFF="1': 38,
 'L-DATE" E_OFF="5': 3,
 'L-DATE" S_OFF="1': 1,
 'L-EVENT': 420,
 'L-FAC': 462,
 'L-FAC" S_OFF="1': 4,
 'L-GPE': 573,
 'L-GPE" E_OFF="4': 1,
 'L-GPE" S_OFF="1': 3,
 'L-LANGUAGE': 6,
 'L-LAW': 127,
 'L-LOC': 289,
 'L-LOC" S_OFF="1': 1,
 'L-MONEY': 233,
 'L-NORP': 61,
 'L-ORDINAL': 34,
 'L-ORG': 2827,
 'L-ORG" E_OFF="1': 1,
 'L-ORG" S_OFF="1': 13,
 'L-PERCENT': 129,
 'L-PERSON': 3837,
 'L-PERSON" E_OFF="2': 1,
 'L-PERSON" S_OFF="1': 46,
 'L-PRODUCT': 55,
 'L-QUANTITY': 222,
 'L-QUANTITY" E_OFF="2': 6,
 'L-QUANTITY" E_OFF="3': 1,
 'L-QUANTITY" S_OFF="2': 2,
 'L-TIME': 339,
 'L-TIME" E_OFF="1': 7,
 'L-WORK_OF_ART': 159,
 'O': 225156,
 'U-CARDINAL': 670,
 'U-DATE': 1149,
 'U-DATE" E_OFF="1': 12,
 'U-DATE" S_OFF="1': 1,
 'U-EVENT': 33,
 'U-FAC': 42,
 'U-FAC" E_OFF="2': 1,
 'U-FAC" S_OFF="1': 1,
 'U-GPE': 3228,
 'U-GPE" E_OFF="1': 1,
 'U-GPE" E_OFF="2': 1,
 'U-GPE" S_OFF="1': 76,
 'U-GPE" S_OFF="1" E_OFF="1': 7,
 'U-LANGUAGE': 41,
 'U-LAW': 26,
 'U-LOC': 71,
 'U-LOC" S_OFF="1': 1,
 'U-MONEY': 4,
 'U-NORP': 3386,
 'U-NORP" E_OFF="2': 1,
 'U-NORP" S_OFF="1': 2,
 'U-NORP" S_OFF="2': 1,
 'U-ORDINAL': 843,
 'U-ORDINAL" E_OFF="1': 2,
 'U-ORDINAL" E_OFF="3': 1,
 'U-ORG': 1552,
 'U-ORG" E_OFF="1': 7,
 'U-ORG" E_OFF="2': 3,
 'U-ORG" S_OFF="1': 36,
 'U-ORG" S_OFF="1" E_OFF="1': 5,
 'U-PERCENT': 8,
 'U-PERSON': 1554,
 'U-PERSON" S_OFF="1': 85,
 'U-PERSON" S_OFF="1" E_OFF="1': 1,
 'U-PERSON" S_OFF="1" E_OFF="2': 1,
 'U-PRODUCT': 21,
 'U-PRODUCT" E_OFF="1': 1,
 'U-PRODUCT" S_OFF="1': 1,
 'U-QUANTITY': 121,
 'U-QUANTITY" E_OFF="1': 5,
 'U-TIME': 100,
 'U-TIME" E_OFF="1': 3,
 'U-WORK_OF_ART': 50}

And here are the NER tag classes in ANERCorp, with counts:

{'B-LOC': 542,
 'B-MISC': 336,
 'B-ORG': 1050,
 'B-PERS': 2098,
 'I-LOC': 55,
 'I-MISC': 220,
 'I-ORG': 336,
 'I-PERS': 747,
 'L-LOC': 542,
 'L-MISC': 336,
 'L-ORG': 1050,
 'L-PERS': 2098,
 'O': 133705,
 'U-LOC': 3894,
 'U-MISC': 782,
 'U-ORG': 973,
 'U-PERS': 1508,
 'U-ts': 1}
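
Tallies like the two above could be reproduced with `collections.Counter`. This is only a minimal sketch: it assumes the corpus has already been loaded as a list of token dicts with a `ner` key, as in the dump at the top of this comment. The `tokens` list here is a hypothetical three-token sample trimmed to the fields that matter, not the actual loader output.

```python
from collections import Counter
from pprint import pprint

# Hypothetical sample: the three tokens from the dump above, trimmed to two fields.
tokens = [
    {"id": 18, "orth": "2001", "ner": 'B-DATE" E_OFF="5'},
    {"id": 19, "orth": "-", "ner": 'I-DATE" E_OFF="5'},
    {"id": 20, "orth": "2002 ", "ner": 'L-DATE" E_OFF="5'},
]

# Count each distinct NER tag string exactly as it appears in the data.
raw_counts = Counter(tok["ner"] for tok in tokens)

# Count again after cutting each tag at the first stray double quote,
# which folds the S_OFF/E_OFF variants into their base tag.
base_counts = Counter(tok["ner"].split('"', 1)[0] for tok in tokens)

pprint(dict(raw_counts))
pprint(dict(base_counts))
```

Comparing the two counters shows how many tokens carry offset residue for each base tag.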
YanLiang1102 changed the title from "weird ner tag in LDContoNotes" to "Merge ner tags in ANERCORP and LDC ontonotes" on Jun 20, 2018
@ahalterman
Collaborator

Where do the E_OFF things come from? They aren't in the original ANER corpus.

@YanLiang1102
Collaborator Author

YanLiang1102 commented Jun 20, 2018

They are in LDC OntoNotes; ANERCorp does not have them.
So what we need to do is work out a set of NER tag classes that works for both LDC OntoNotes and ANERCorp. For example, U-CARDINAL in LDC should be mapped to MISC, I guess.

@ahalterman
Collaborator

The OntoNotes docs say that E_OFF is the offset number of the end (i.e. the token number of the final word). They sound like they're mostly for coreference, so we can probably safely ignore them.

[Screenshot, 2018-06-20 2:40:42 pm]
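
If the offsets can be ignored as suggested here, one way to normalize the tags is to cut each one at the first stray double quote. A minimal sketch, assuming every malformed tag has the shape shown in the counts above (base tag, then `"`, then attribute residue); `strip_offsets` is a hypothetical helper name, not something from the project:

```python
def strip_offsets(ner_tag):
    """Return the tag with any trailing S_OFF/E_OFF residue removed.

    Assumes the residue always starts at the first double quote in the tag,
    as in 'B-DATE" E_OFF="5'. Clean tags contain no quote and pass through.
    """
    return ner_tag.split('"', 1)[0]


tags = ['B-DATE" E_OFF="5', 'U-GPE" S_OFF="1" E_OFF="1', 'I-ORG', 'O']
cleaned = [strip_offsets(t) for t in tags]
print(cleaned)  # ['B-DATE', 'U-GPE', 'I-ORG', 'O']
```

This discards the offset values entirely, so it only fits if nothing downstream needs them.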

@YanLiang1102
Collaborator Author

gpe+loc+fac->loc
quantity+date+time..->misc.
per->per
org->org
o->o
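
The mapping above could be sketched roughly as follows. Several things here are assumptions rather than anything settled in this thread: the helper name `merge_tag`, the use of ANERCorp's `PERS` spelling as the target for person names, and the `fallback` parameter that sends every entity type not listed explicitly to `MISC` (the list above trails off after quantity, date and time, so the full MISC set is not spelled out).

```python
# Hypothetical mapping from OntoNotes entity types to ANERCorp classes,
# following the list above. Only types named in the thread are listed.
ONTONOTES_TO_ANER = {
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "PERSON": "PERS",  # assumption: use ANERCorp's spelling as the target
    "ORG": "ORG",
    "QUANTITY": "MISC",
    "DATE": "MISC",
    "TIME": "MISC",
    "CARDINAL": "MISC",
}


def merge_tag(ner_tag, fallback="MISC"):
    """Map one OntoNotes BILUO tag onto the ANERCorp label set.

    Offset residue such as '" E_OFF="5' is dropped first. Entity types
    missing from the table go to `fallback` (an assumption, see above).
    """
    tag = ner_tag.split('"', 1)[0]
    prefix, sep, etype = tag.partition("-")
    if not sep or not etype:
        return tag  # 'O' and the bare '-' tag pass through unchanged
    return prefix + "-" + ONTONOTES_TO_ANER.get(etype, fallback)


samples = ['U-GPE" S_OFF="1', "B-PERSON", 'L-DATE" E_OFF="5', "U-CARDINAL", "O"]
print([merge_tag(t) for t in samples])
# ['U-LOC', 'B-PERS', 'L-MISC', 'U-MISC', 'O']
```

The BILUO prefix is kept as-is, so span boundaries are unchanged and only the entity type is rewritten.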
