
Develop 9. Next Steps 2: Named Entity Recognition #13

Open · 7 of 9 tasks
drjwbaker opened this issue May 27, 2020 · 17 comments
Assignees: drjwbaker
Labels: development (development of episodes before writing), final check needed

drjwbaker commented May 27, 2020

https://github.com/CatalogueLegacies/antconc.github.io/blob/gh-pages/_episodes/10-named-entity-recognition.md

Based on the rough outline at #13 (comment), the tasks are:

drjwbaker added the development label May 27, 2020
drjwbaker self-assigned this May 27, 2020

drjwbaker commented Sep 25, 2020

Based on #34 (comment), moved the review project priority from 'Would like to have' to 'Dev task'.

drjwbaker changed the title from 'Develop 10. Next Steps 3: Named Entity Recognition' to 'Develop 9. Next Steps 2: Named Entity Recognition' Sep 25, 2020

drjwbaker commented Sep 29, 2020

Tested Stanford NER, forked @brandontlocke's batch script, and added a version that produces line-separated text files for person markup and location markup respectively: https://github.com/CatalogueLegacies/batchner

This should be sufficient for the NER module: perhaps showing how the data is made (since the shell is beyond the scope of the lesson), then putting the data back into AntConc to examine word use around marked-up entities.
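
For reference, a minimal sketch of the kind of command the batch script wraps, assuming the stock Stanford NER distribution is unpacked in the working directory and the catalogue text files sit in a hypothetical corpus/ folder:

    # Tag every plain-text file in corpus/ with Stanford NER's 3-class model,
    # writing word/TAG ("slashTags") output alongside each input file.
    for f in corpus/*.txt; do
      java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
        -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
        -textFile "$f" \
        -outputFormat slashTags > "${f%.txt}_ner.txt"
    done

The linked batchner fork then boils output like this down to the line-separated person and location files mentioned above.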


drjwbaker commented Sep 29, 2020

(NB: I ran out of talent when writing batchner_markup.sh, as it seems to produce extra white space somewhere around the sed command, so it needs a little fettling before use in the module.)
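
Until that is fixed, an untested sketch of one way to squeeze out the stray whitespace after the fact (the *_markup.txt filenames are an assumption; adjust to whatever the script actually writes):

    # Collapse runs of spaces/tabs to a single space and strip trailing
    # whitespace, editing the extracted markup files in place (GNU sed).
    for f in *_markup.txt; do
      sed -i 's/[[:space:]]\+/ /g; s/ $//' "$f"
    done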

@drjwbaker

Rough outline:

  1. Intro to NER. What it does: adds tags to a corpus (show an example; see the sketch after this list). We can put tags and corpus linguistics together in interesting ways. For example, if we want to know the language used around all places, we can do that, as the tags give us an anchor to search around. In AntConc, we'd search for the tag and then look at the concordance.

  2. To generate NER tags you need a little shell (like Episode 8; have an advanced callout for running NER over BL-IAMS using the shell). Then provide the NERd BL-IAMS dataset to put into AntConc, to search on a tag (place), then look at the results in a few different views. Task here: create a search to find '-ly' words around person tags.

  3. Uses of NER+AntConc. a) Checking the variety of place names in a catalogue, e.g. using the Concordance Plot to see if there are suspiciously un-NERd parts of the catalogue, to look for place names off the computational grid that might be missed in a semi-automated record improvement initiative. b) Finding discrepancies in how types of people are discussed, by looking at adjectives around persons and sorting a concordance by name: cleaning/repairing legacy data.
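
For the 'show an example' in 1., something along these lines could work. The record text below is invented for illustration, and it assumes the /O tags Stanford NER puts on non-entity words have been stripped, so only persons and locations carry tags (which matches the searches discussed later in this thread):

    Letters from Warren/PERSON Hastings/PERSON written at Calcutta/LOCATION
    to correspondents in London/LOCATION, 1773-1785.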

drjwbaker pushed a commit that referenced this issue Nov 12, 2020
add keyness and NER files for #12 and #13
@drjwbaker

On 3., @rossi-uk try:

  • Upload the IAMS_people txt file.
  • Global Settings: set the 'Token Definition' to 'letter' only.
  • Tool Preferences: for Wordlist, untick 'treat all as lower case' as usual.
  • Then in the Concordance tab search for: by */PERSON, with 'Words' and 'Case' selected, and Level 1 set to 1L.
  • This is a good start on the kind of contexts in which people appear (and is an easy search for AntConc!).

@drjwbaker

This also works nicely in the 'Collocates' tab with the 'window span' set to the left of the search term.
[Screenshot from 2020-11-20 16-09-20]


drjwbaker commented Nov 20, 2020

But then searches like the attached (warning: takes ages) show the limits of the approach; that is, it doesn't pick up initials before names, or titles, so the 'by' search above will give only an impression.
[Screenshot from 2020-11-20 16-22-26]

@drjwbaker

With the _places file, we can do things like this:
[Screenshot from 2020-11-20 16-53-17]

@rossi-uk

In terms of numbers, there are 45,574 instances of /PERSON and, yes, AntConc does take a while and appears to crash. Initials appear to be tagged separately.
[Screenshot: NER1]

@rossi-uk

There are 21,082 instances of /LOCATION. Below is an example of person initials mistaken for a location.
[Screenshot: NER2]
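
For anyone wanting to sanity-check those counts outside AntConc, a quick shell sketch (the IAMS_people / IAMS_places filenames are assumptions; substitute whatever the tagged files are actually called):

    # Count tagged entities in the marked-up text files.
    grep -o '/PERSON' IAMS_people.txt | wc -l
    grep -o '/LOCATION' IAMS_places.txt | wc -l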

@rossi-uk

Re: 'With the _places file, we can do things like this.' Other examples: outside */LOCATION (26 instances), the village of */LOCATION (), towards */LOCATION (89 instances).
[Screenshot: NER3]
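
A rough, shell-based approximation of that 1L view of locations, offered only as a sketch (again the filename is an assumption): pull out each 'word Placename/LOCATION' pair and rank the left-hand words.

    # List the words that most often appear immediately before a tagged place.
    # The extra grep drops cases where the "left word" is itself part of a
    # multi-word place (e.g. New/LOCATION Delhi/LOCATION).
    grep -oE "[[:alpha:]']+ [[:alpha:].'-]+/LOCATION" IAMS_places.txt \
      | awk '{print tolower($1)}' \
      | grep -vw 'location' \
      | sort | uniq -c | sort -rn | head -20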

@drjwbaker

  1. Positive thing: 'the village of */LOCATION'
  2. Negatives about NER: mistakes around initials; AntConc is good for looking at transformations of data made using tools we don't fully understand. Need for specialist gazetteers for locations (point people to the PH lesson on this).

@drjwbaker

The point of the episode is using AntConc to work with marked-up text and/or third-party tools.


drjwbaker commented Jan 11, 2021

@rossi-uk I had a run through this afternoon and we are more or less there with this episode! It just needs a second task (which I think you said you'd make a start on, but do correct me if I'm wrong!) and a run-through (because you know my ability to introduce stupid errors!).

@drjwbaker

@rossi-uk We have an action to add a second task. Do we still want to do that, or shall we close without it?


rossi-uk commented Jan 21, 2021

I was thinking of a task around identifying mentions of women: for example, searching for Lady xxx and the percentage of total references to women recognised by the NER, considering the omissions. AntConc is struggling to process queries on my machine.
Some observations:

  • NER does not tag all women's names, so a search query for titles can help find names and note the omissions by the NER tool.
  • A search for Lady|Dame|Queen|Princess|Marquess|Baroness|Empress|Maharana|Mrs|Miss brings up over 2,000 hits (knowing the data helps us use a wider range of titles; a large part of the corpus is about nineteenth-century British India, hence Maharana and references to Queen Victoria). A rough shell equivalent of this title search is sketched at the end of this comment.

[screenshot]

[screenshot]

Omissions by NER:

[screenshot]

  • Note the NER patterns: it recognises names when a title is followed by an initial: Miss C./PERSON Frere/PERSON; Mrs H.C./PERSON Noyce/PERSON.

  • Over a quarter of the women seem to have the title Lady (see notes in task 2; again, some omissions are noted).
  • To find Lady or Mrs Richardson you need to search for Lady/PERSON Richardson or Mrs/PERSON Richardson; perhaps Lady|Mrs/PERSON */PERSON?

Scarcity of adjectives with names?
Punctuation around names; some are lists, and/or with Mrs */PERSON.
Collocates with Lady */PERSON ...
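
As a rough shell equivalent of the title search above (the corpus filename is an assumption), a sketch that counts hits per title:

    # Count how often each title appears, to set the NER omissions in context.
    grep -oEw 'Lady|Dame|Queen|Princess|Marquess|Baroness|Empress|Maharana|Mrs|Miss' \
      IAMS_corpus.txt | sort | uniq -c | sort -rn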
