Develop 8. Next Steps 1: comparing corpora #12

Open
11 of 12 tasks
drjwbaker opened this issue May 27, 2020 · 11 comments

@drjwbaker

drjwbaker commented May 27, 2020

https://github.com/CatalogueLegacies/antconc.github.io/blob/gh-pages/_episodes/09-comparing.md

Draft at #12 (comment) To do:

@drjwbaker drjwbaker added the development development of episodes before writing label May 27, 2020
@drjwbaker drjwbaker self-assigned this May 27, 2020
@drjwbaker
Author

Based on #34 (comment) move to 'Dev Tasks'.

@drjwbaker drjwbaker changed the title Develop 9. Next Steps 2: comparing corpora Develop 9. Next Steps 1: comparing corpora Sep 25, 2020
@drjwbaker drjwbaker changed the title Develop 9. Next Steps 1: comparing corpora Develop 8. Next Steps 1: comparing corpora Sep 25, 2020
@drjwbaker
Author

Based on #34 (comment) scope here to:

  • look at unusually frequent vocab
  • look at two sets of word lists (e.g. MDG + IAMS) for sense of difference in structure of language, emphasis, et cetera.
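A minimal sketch of the second point, done outside AntConc for illustration. It builds a word list for each of two texts and splits the top-N words into shared and unique sets. The function names and the tokenising regex are assumptions made here, and the tokenisation will not exactly match AntConc's token definition.

```python
import re
from collections import Counter


def word_list(text, lowercase=True):
    """Count word tokens (runs of letters), roughly like a word list."""
    if lowercase:
        text = text.lower()
    return Counter(re.findall(r"[^\W\d_]+", text))


def compare_top(list_a, list_b, n=30):
    """Split the top-n words of two word lists into shared and unique sets."""
    top_a = {word for word, _ in list_a.most_common(n)}
    top_b = {word for word, _ in list_b.most_common(n)}
    return top_a & top_b, top_a - top_b, top_b - top_a


# Invented sample sentences, not taken from either dataset.
sample_a = "A man stands behind a table. The man holds a letter."
sample_b = "Letter from the governor to the council. A letter of thanks."

shared, only_a, only_b = compare_top(word_list(sample_a), word_list(sample_b))
print(sorted(shared), sorted(only_a), sorted(only_b))
```

With real data, each sample string would be replaced by the full text of one corpus.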

@drjwbaker
Author

drjwbaker commented Nov 12, 2020

rough outline

intro

This episode introduces potential next steps for comparing corpora. In the context of catalogue data, this is important because it:

  • provides an alternative point of analysis for recognising the features of the catalogue data under analysis;
  • can be used to compare subsets of catalogue data, e.g. using an exemplar subset to understand which linguistic features of the comparison subset need adjusting or repairing;
  • allows comparison of catalogue data to everyday speech, in order to tease out, in an evidential way, the special language that should be used in guides to cataloguing at your institution (because you will probably have a style you want, based on some exemplar cataloguing).

main body

Three parts:

  1. Comparing wordlists. Generate word lists for BMCSat and BL-IAMS datasets in AntConc. Rename the files to something sensible for comparison. Task: make a list of the words that appear in both top 30s, and a list of the words unique to each, then use knowledge from previous episodes to say what is different about BMSat.

  2. Keyness. What it is. What it can be used for. Use AntConc to create a keyness file. Explain negative keyness. Task: a) something basic on reading the results ("what are the five most unusually frequent verbs?") + b) one on interpreting negative keyness results: what do they tell us BL-IAMS cataloguing is not about, and given what we know about the collection, is that a surprise? (That is, press the idea that these results cannot be a function of frequency effects in the objects being catalogued alone.) Finally, link to the AS/JB paper section on negative keyness for more info on practical uses.

  • include 'it' and 'or' from part 1
  3. Comparing concordances in AntConc. Put both BMCSat and BL-IAMS in. Note the need to think about the relative sizes of the corpora (File View tab) and the spread across corpora (Concordance Plot view, which can open up the need to order records logically, e.g. to show change over time). Read concordance lists. Task: compare use of an n-gram (something 'special language' like "towards the rear" that both have).
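To make the keyness part concrete, here is a rough sketch of one commonly used keyness statistic, log-likelihood, with a sign attached so that negative values mark words that are unusually infrequent in the target corpus. This assumes the two-term form of the statistic; the measure and thresholds actually selected in AntConc's settings may differ, and the function names are made up for this example.

```python
import math
from collections import Counter


def log_likelihood(freq_target, freq_ref, size_target, size_ref):
    """Signed two-term log-likelihood for one word.

    Positive: relatively more frequent in the target corpus.
    Negative: relatively less frequent (negative keyness).
    """
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    score = 0.0
    if freq_target:  # a zero frequency contributes nothing
        score += freq_target * math.log(freq_target / expected_target)
    if freq_ref:
        score += freq_ref * math.log(freq_ref / expected_ref)
    score *= 2
    if freq_target / size_target < freq_ref / size_ref:
        score = -score
    return score


def keyness(target, reference):
    """Rank every word in two non-empty word lists, most 'key' first."""
    size_target = sum(target.values())
    size_ref = sum(reference.values())
    scores = {
        word: log_likelihood(target[word], reference[word], size_target, size_ref)
        for word in set(target) | set(reference)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Here `target` and `reference` are `Counter` word lists, one per corpus. The top of the ranked output corresponds to unusually frequent words, and the bottom to the negative keyness results discussed in the task.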

drjwbaker pushed a commit that referenced this issue Nov 12, 2020
add keyness and NER files for #12 and #13
@drjwbaker
Author

Potentially, use comparing with Photo db subjects as a way of thinking about comparing between parts of the catalogue entry (so, 'description' is not in isolation)

@drjwbaker
Author

@rossi-uk Made a big update today! Are you able to work on the four remaining points at the top of the ticket? #12 (comment)

@rossi-uk

rossi-uk commented Dec 2, 2020

@drjwbaker Thanks, yes, will do before our meeting.

@rossi-uk

rossi-uk commented Dec 4, 2020

For the numbers in line 31, what settings should be used for the wordlist? I had unticked 'treat all as lowercase' in Tool Preferences and got 73295 vs 63100; once ticked, it gives the numbers in the lesson.
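If the figures in question are counts of distinct word types, a difference like this is what case folding would produce: with the lowercase option off, "Letter" and "letter" count as two types. A small illustration outside AntConc, using an invented sentence and an assumed letters-only tokeniser:

```python
import re


def count_types(text, lowercase):
    """Count distinct word types, with or without case folding."""
    tokens = re.findall(r"[^\W\d_]+", text)
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return len(set(tokens))


sample = "The man holds the letter. A Letter lies on the Table."
print(count_types(sample, lowercase=False), count_types(sample, lowercase=True))
```

The case-sensitive count is always at least as large as the lowercased one, so stating the setting in the episode would make the reported numbers reproducible.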

@rossi-uk

rossi-uk commented Dec 4, 2020

Keyness section: should we mention that the selection3 file needs to be open and a word list created? After the previous section I was working with the wordlist that I had generated to compare the corpora, and that confused me. Also clarify the settings for the word list: Tool Preferences, untick 'treat all data as lower case'.

@rossi-uk

rossi-uk commented Dec 4, 2020

Re task 2: are we suggesting that people run this with both corpora, then export an IAMS keyness txt file and a BMC keyness txt file, and open them side by side in Notepad to compare there?

@rossi-uk

rossi-uk commented Dec 4, 2020

Comparing concordances section: again I get a different number of results for both corpora. What settings are we using in Tool Preferences for the wordlists? I got 3103 results for "behind" across both corpora.
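One way to check a combined total like this is to count hits per corpus and normalise by corpus size, since the two corpora differ in length. A rough sketch outside AntConc follows; the function names, the case-insensitive whole-word matching, and the per-10,000-token rate are choices made for this example, not settings taken from the lesson.

```python
import re


def concordance(text, term, width=30):
    """Return simple keyword-in-context lines for whole-word matches."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


def hits_per_10k(text, term):
    """Return (raw hits, hits per 10,000 tokens) for one corpus."""
    tokens = re.findall(r"[^\W\d_]+", text)
    hits = len(concordance(text, term))
    rate = 10000 * hits / len(tokens) if tokens else 0.0
    return hits, rate


# Invented sample text, standing in for one corpus.
sample = "A man stands behind a table. Behind him is a door."
for line in concordance(sample, "behind"):
    print(line)
print(hits_per_10k(sample, "behind"))
```

Running `hits_per_10k` once per corpus gives two raw counts that should sum to the combined figure, plus rates that can be compared directly.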

@drjwbaker
Author

drjwbaker commented Dec 4, 2020

From meeting 4/12:

  • suggest a tool for opening text files
  • clarify what the screen should look like in keyness section (so no confusion about what is being compared)
  • put back in the default global/tool settings for this episode
  • put in sub-headings
  • add in brief explainer on adding own dataset
  • update the Part 3 task, looking at *ly as a potential task
  • roughly 45 minutes to complete

drjwbaker pushed a commit that referenced this issue Dec 8, 2020
implement first 5 changes from #12 (comment)
@drjwbaker drjwbaker added final check needed and removed development development of episodes before writing labels Jan 20, 2021