
Develop 9. Next Steps 2: Named Entity Recognition #13

Open · 7 of 9 tasks
drjwbaker opened this issue May 27, 2020 · 17 comments
Assignees: drjwbaker
Labels: development (development of episodes before writing), final check needed

drjwbaker commented May 27, 2020

https://github.com/CatalogueLegacies/antconc.github.io/blob/gh-pages/_episodes/10-named-entity-recognition.md

Based on the rough outline at #13 (comment), the tasks are:

drjwbaker added the development label May 27, 2020
drjwbaker self-assigned this May 27, 2020

drjwbaker commented Sep 25, 2020

Based on #34 (comment), moved the review project priority from 'Would like to have' to 'Dev task'.

drjwbaker changed the title from 'Develop 10. Next Steps 3: Named Entity Recognition' to 'Develop 9. Next Steps 2: Named Entity Recognition' Sep 25, 2020

drjwbaker commented Sep 29, 2020

Tested Stanford NER, forked @brandontlocke's batch script, and added a version that produces line-separated text files for person markup and location markup respectively: https://github.com/CatalogueLegacies/batchner

This should be sufficient for the NER module: perhaps showing how the data is made (since the shell is beyond the scope of the lesson), then putting the data back into AntConc to examine word use around marked-up entities.
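
For reference, a minimal sketch of the kind of command the batch script wraps, assuming the stock Stanford NER distribution is unpacked in the working directory and the catalogue text files sit in a hypothetical corpus/ folder:

    # Tag every plain-text file in corpus/ with Stanford NER's 3-class model,
    # writing word/TAG ("slashTags") output alongside each input file.
    for f in corpus/*.txt; do
      java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
        -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
        -textFile "$f" \
        -outputFormat slashTags > "${f%.txt}_ner.txt"
    done

The linked batchner fork then boils output like this down to the line-separated person and location files mentioned above.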


drjwbaker commented Sep 29, 2020

(NB: I ran out of talent when writing batchner_markup.sh, as it seems to produce extra white space somewhere around the sed command, so it needs a little fettling before use in the module.)
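
Until that is fixed, an untested sketch of one way to squeeze out the stray whitespace after the fact (the *_markup.txt filenames are an assumption; adjust to whatever the script actually writes):

    # Collapse runs of spaces/tabs to a single space and strip trailing
    # whitespace, editing the extracted markup files in place (GNU sed).
    for f in *_markup.txt; do
      sed -i 's/[[:space:]]\+/ /g; s/ $//' "$f"
    done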

@drjwbaker

Rough outline:

  1. Intro to NER. What it does: adds tags to a corpus (show an example; see the sketch after this list). We can put tags and corpus linguistics together in interesting ways. For example, if we want to know the language used around all places, we can do that, as the tags give us an anchor to search around. In AntConc, we'd search for the tag and then look at the concordance.

  2. To generate NER tags you need a little shell (like Episode 8; have an advanced callout for running NER over BL-IAMS using the shell). Then provide the NERd BL-IAMS dataset to put into AntConc, to search on a tag (place), then look at the results in a few different views. Task here: create a search to find '-ly' words around person tags.

  3. Uses of NER+AntConc. a) Checking the variety of place names in a catalogue, e.g. using the Concordance Plot to see if there are suspiciously un-NERd parts of the catalogue, to look for place names off the computational grid that might be missed in a semi-automated record improvement initiative. b) Finding discrepancies in how types of people are discussed, by looking at adjectives around persons and sorting a concordance by name: cleaning/repairing legacy data.
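
For the 'show an example' in 1., something along these lines could work. The record text below is invented for illustration, and it assumes the /O tags Stanford NER puts on non-entity words have been stripped, so only persons and locations carry tags (which matches the searches discussed later in this thread):

    Letters from Warren/PERSON Hastings/PERSON written at Calcutta/LOCATION
    to correspondents in London/LOCATION, 1773-1785.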

drjwbaker pushed a commit that referenced this issue Nov 12, 2020
add keyness and NER files for #12 and #13
@drjwbaker

On 3., @rossi-uk try:

  • Upload the IAMS_people txt file.
  • Global Settings: set the 'Token Definition' to 'letter' only.
  • Tool Preferences: for Wordlist, untick 'treat all as lower case' as usual.
  • Then in the Concordance tab search for: by */PERSON, with 'Words' and 'Case' selected, and Level 1 set to 1L.
  • This is a good start on the kind of contexts in which people appear (and is an easy search for AntConc!).

@drjwbaker

This also works nicely in the 'Collocates' tab with the 'window span' set to the left of the search term.
[Screenshot from 2020-11-20 16-09-20]


drjwbaker commented Nov 20, 2020

But then searches like the attached (warning: takes ages) show the limits of the approach; that is, it doesn't pick up initials before names, or titles, so the 'by' search above will give only an impression.
[Screenshot from 2020-11-20 16-22-26]

@drjwbaker

With the _places file, we can do things like this:
[Screenshot from 2020-11-20 16-53-17]

@rossi-uk

In terms of numbers, there are 45,574 instances of /PERSON and, yes, AntConc does take a while and appears to crash. Initials appear to be tagged separately.
[Screenshot: NER1]

@rossi-uk

There are 21,082 instances of /LOCATION. Below is an example of person initials mistaken for a location.
[Screenshot: NER2]
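
For anyone wanting to sanity-check those counts outside AntConc, a quick shell sketch (the IAMS_people / IAMS_places filenames are assumptions; substitute whatever the tagged files are actually called):

    # Count tagged entities in the marked-up text files.
    grep -o '/PERSON' IAMS_people.txt | wc -l
    grep -o '/LOCATION' IAMS_places.txt | wc -l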

@rossi-uk

Re: 'With the _places file, we can do things like this.' Other examples: outside */LOCATION (26 instances), the village of */LOCATION (), towards */LOCATION (89 instances).
[Screenshot: NER3]
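
A rough, shell-based approximation of that 1L view of locations, offered only as a sketch (again the filename is an assumption): pull out each 'word Placename/LOCATION' pair and rank the left-hand words.

    # List the words that most often appear immediately before a tagged place.
    # The extra grep drops cases where the "left word" is itself part of a
    # multi-word place (e.g. New/LOCATION Delhi/LOCATION).
    grep -oE "[[:alpha:]']+ [[:alpha:].'-]+/LOCATION" IAMS_places.txt \
      | awk '{print tolower($1)}' \
      | grep -vw 'location' \
      | sort | uniq -c | sort -rn | head -20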

@drjwbaker

  1. Positive thing: 'the village of */LOCATION'
  2. Negatives about NER: mistakes around initials; AntConc is good for looking at transformations of data made using tools we don't fully understand. Need for specialist gazetteers for locations (point people to the PH lesson on this).

@drjwbaker

The point of the episode is using AntConc to work with marked-up text and/or third-party tools.


drjwbaker commented Jan 11, 2021

@rossi-uk I had a run through this afternoon and we are more or less there with this episode! It just needs a second task (which I think you said you'd make a start on, but do correct me if I'm wrong!) and a run-through (because you know my ability to introduce stupid errors!).

@drjwbaker

@rossi-uk We have an action to add a second task. Do we still want to do that, or shall we close without it?


rossi-uk commented Jan 21, 2021

I was thinking of a task around identifying mentions of women: for example, searching for Lady xxx and the percentage of total references to women recognised by the NER, considering the omissions. AntConc is struggling to process queries on my machine.
Some observations:

  • NER does not tag all women's names, so a search query for titles can help find names and note the omissions by the NER tool.
  • A search for Lady|Dame|Queen|Princess|Marquess|Baroness|Empress|Maharana|Mrs|Miss brings up over 2,000 hits (knowing the data helps us use a wider range of titles; a large part of the corpus is about nineteenth-century British India, hence Maharana and references to Queen Victoria). A rough shell equivalent of this title search is sketched at the end of this comment.

[screenshot]

[screenshot]

Omissions by NER:

[screenshot]

  • Note the NER patterns: it recognises names when a title is followed by an initial: Miss C./PERSON Frere/PERSON; Mrs H.C./PERSON Noyce/PERSON.

  • Over a quarter of the women seem to have the title Lady (see notes in task 2; again, some omissions are noted).
  • To find Lady or Mrs Richardson you need to search for Lady/PERSON Richardson or Mrs/PERSON Richardson; perhaps Lady|Mrs/PERSON */PERSON?

Scarcity of adjectives with names?
Punctuation around names; some are lists, and/or with Mrs */PERSON.
Collocates with Lady */PERSON ...
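
As a rough shell equivalent of the title search above (the corpus filename is an assumption), a sketch that counts hits per title:

    # Count how often each title appears, to set the NER omissions in context.
    grep -oEw 'Lady|Dame|Queen|Princess|Marquess|Baroness|Empress|Maharana|Mrs|Miss' \
      IAMS_corpus.txt | sort | uniq -c | sort -rn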
