This repository has been archived by the owner on Dec 21, 2023. It is now read-only.

Add additional linguistic information to saved queries #74

Open
melvinwevers opened this issue Oct 1, 2015 · 8 comments

Comments

@melvinwevers

It would be useful to provide the user with some additional linguistic information.

  • The number of words (tokens) in a query (corpus)
  • The option to count the instances of one particular keyword within the corpus (this would perhaps require an additional search window)

This enables the user to calculate (or have the computer calculate) the normalized frequency of a word within the sub-collection (or the entire collection).

@jgonggrijp
Member

For my understanding: are you trying to solve the same problem as @PimHuijnen in #69?
If not, what is the difference?

@melvinwevers
Author

I think @PimHuijnen wants a different way to generate wordclouds.

I would like to have some linguistic information on the saved queries. So just a number that says how many words there are within a query.

Still, I think both require the same calculation, namely how many words there are in a query.

Second part of this reply extracted to #79 by @jgonggrijp

@jgonggrijp
Member

When you say "how many words there are within a query", do you mean

  • the total length (raw word count) of all matches combined, or
  • the total number of unique words (repetitions not counted) across all the matches, or
  • the total number of occurrences of the search terms in the matches, or
  • something else?

And when you say "normalized frequency of a word within a collection", I presume that you divide one number by another. What would be the numerator and what would be the denominator?
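
The three candidate counts listed above give different numbers for the same set of matches. A minimal sketch of the difference, assuming a naive lowercase regex tokenizer (the search backend's analyzer would tokenize differently) and hypothetical names `tokenize` and `count_summary`:

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def count_summary(documents, search_terms):
    """Return the three candidate counts for a list of matched documents."""
    tokens = [token for doc in documents for token in tokenize(doc)]
    counts = Counter(tokens)
    terms = {term.lower() for term in search_terms}
    return {
        # total length of all matches combined
        "raw_word_count": len(tokens),
        # distinct words, repetitions not counted
        "unique_words": len(counts),
        # how often the search terms themselves occur
        "search_term_occurrences": sum(counts[term] for term in terms),
    }


# Two made-up documents standing in for the matches of a query.
matches = ["Vietnam and the Soviet Union", "The Soviet reply on Vietnam"]
summary = count_summary(matches, ["Vietnam", "Soviet"])
print(summary)
# {'raw_word_count': 10, 'unique_words': 7, 'search_term_occurrences': 4}
```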

@melvinwevers
Author

The total number of words (raw word count) found within the documents belonging to a saved query.

I would like to know the relative occurrence of a word within a collection based on a query.

So, if I query Vietnam AND Soviet and this yields 300 documents, I would like to know how many words there are in these 300 documents. Let's say 3000.

Then I would like to know how often America appeared in this subset of 300 documents. Let's say 15 times.

Then the frequency would be 15/3000 = 0.005.

This allows me to compare the relative frequency of words within particular corpora.
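
The calculation described here (keyword occurrences divided by the raw word count of the sub-collection) could be sketched as follows. This is an illustration only: the function name `relative_frequency`, the naive regex tokenizer, and the sample documents are assumptions, not part of the application.

```python
import re


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def relative_frequency(documents, keyword):
    """Return (hits, total_words, hits / total_words) for one keyword."""
    keyword = keyword.lower()
    hits = 0
    total_words = 0
    for doc in documents:
        tokens = tokenize(doc)
        total_words += len(tokens)
        hits += tokens.count(keyword)
    if total_words == 0:
        # Avoid dividing by zero for an empty sub-collection.
        return hits, total_words, 0.0
    return hits, total_words, hits / total_words


# Made-up stand-in for the documents belonging to a saved query.
subset = [
    "America sent aid to Vietnam",
    "The Soviet press criticised America",
]
hits, total_words, frequency = relative_frequency(subset, "America")
print(hits, total_words, frequency)  # 2 10 0.2
```

With the numbers from the example above, the same division gives 15 / 3000 = 0.005, which can then be compared with the value obtained for another saved query.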

@jgonggrijp
Member

Ok, clear!

@mhkuu
Contributor

mhkuu commented Oct 20, 2015

ElasticSearch can provide the number of words (i.e. tokens) per document, I think: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#token_count (see this issue as well).

Total number of occurrences of a term can be found in the word cloud, or am I missing something?
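
A rough sketch of how the `token_count` field type could be used for this, written as Python dictionaries that mirror the JSON request bodies. The field name `text`, the sub-field name `length`, and both helper functions are hypothetical; the parent type `string` matches the 1.x/2.x documentation linked in this thread, whereas later ElasticSearch versions use `text`. Nothing here is taken from the project's actual mapping.

```python
import json


def token_count_mapping(text_field="text", parent_type="string"):
    """Mapping fragment that indexes a token count next to a text field.

    `text_field` and the sub-field name "length" are assumed names.
    """
    return {
        "properties": {
            text_field: {
                "type": parent_type,
                "fields": {
                    "length": {"type": "token_count", "analyzer": "standard"}
                },
            }
        }
    }


def total_tokens_request(query, text_field="text"):
    """Search body that sums the per-document token counts of all matches."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": query,
        "aggs": {
            "total_tokens": {"sum": {"field": text_field + ".length"}}
        },
    }


mapping = token_count_mapping()
request = total_tokens_request(
    {"query_string": {"query": "Vietnam AND Soviet"}}
)
print(json.dumps(mapping, indent=2))
print(json.dumps(request, indent=2))
```

The sum aggregation would yield the raw word count of the documents matching a saved query, assuming the collection has been reindexed with such a sub-field.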

@melvinwevers
Author

It gives the occurrences per term within the sub-collection/saved query. But only if the word also appears in the word-cloud. You cannot query for a particular word.

@mhkuu mhkuu added this to the Milestone 6 milestone Oct 22, 2015
@mhkuu
Contributor

mhkuu commented Nov 19, 2015

Note to self: the link above is broken; in ElasticSearch 2.0 this is the correct URL: https://www.elastic.co/guide/en/elasticsearch/reference/current/token-count.html.

In the 1.7 branch this is the correct URL: https://www.elastic.co/guide/en/elasticsearch/reference/1.7/mapping-core-types.html#token_count

@mhkuu mhkuu modified the milestones: Milestone 6, Milestone 7 Dec 3, 2015
@mhkuu mhkuu removed the priority-3 label Dec 3, 2015