Add additional linguistic information to saved queries #74
Comments
For my understanding: are you trying to solve the same problem as @PimHuijnen in #69? |
I think @PimHuijnen wants a different way to generate word clouds. I would like to have some linguistic information on the saved queries: just a number that says how many words there are within a query. Still, I think both require the same calculation, namely how many words there are in a query. Second part of this reply extracted to #79 by @jgonggrijp |
When you say "how many words there are within a query", do you mean
And when you say "normalized frequency of a word within a collection", I presume that you divide one number by another. What would be the numerator and what would be the denominator? |
The total number of words (raw word count) found within the documents belonging to a saved query. I would like to know the relative occurrence of a word within a collection, based on a query. So, if I query Vietnam AND Soviet and this yields 300 documents, I would like to know how many words there are in these 300 documents, let's say 3000. Then I would like to know how often America appears in this subset of 300 documents, let's say 15 times. The frequency would then be 15/3000. This allows me to compare the relative frequency of words within particular corpora. |
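To make the arithmetic above concrete, here is a minimal Python sketch of that calculation. The function names and the simple regex tokenizer are assumptions made for illustration; they are not part of the application, and the real token counts would depend on how the index analyzes text.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplistic, assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def normalized_frequency(term_count, total_tokens):
    """Occurrences of a term divided by the total number of tokens."""
    if total_tokens == 0:
        return 0.0
    return term_count / total_tokens


def relative_frequency(term, documents):
    """Relative frequency of `term` over all tokens in `documents`."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    total = sum(counts.values())
    return normalized_frequency(counts[term.lower()], total)


# The numbers from the comment above: 15 occurrences in 3000 words.
example = normalized_frequency(15, 3000)
```

With the numbers from the comment, `normalized_frequency(15, 3000)` is the fraction 15/3000. `relative_frequency` does the same starting from raw document texts, which is only practical for small sub-collections.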
Ok, clear! |
ElasticSearch can provide the number of words (i.e. tokens) per document, I think: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#token_count (see this issue as well). Total number of occurrences of a term can be found in the word cloud, or am I missing something? |
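As a rough sketch of how the `token_count` suggestion could be used, the Python below only builds the request bodies as plain dictionaries; it does not talk to a server. The field names `text` and `length`, the aggregation name `total_tokens`, and the `"string"` field type (as used in the 1.7/2.0 era; newer versions use `"text"`) are assumptions, not the project's actual mapping.

```python
import json


def token_count_mapping(text_field="text", count_subfield="length",
                        analyzer="standard"):
    """Mapping with a token_count sub-field next to the text field.

    All field names here are hypothetical.
    """
    return {
        "properties": {
            text_field: {
                "type": "string",  # assumed; "text" in newer ElasticSearch versions
                "fields": {
                    count_subfield: {
                        "type": "token_count",
                        "analyzer": analyzer,
                    }
                },
            }
        }
    }


def total_tokens_request(query, text_field="text", count_subfield="length"):
    """Search body that sums the per-document token counts for a query."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": query,
        "aggs": {
            "total_tokens": {
                "sum": {"field": "{}.{}".format(text_field, count_subfield)}
            }
        },
    }


# Hypothetical saved query, matching the example in this thread.
saved_query = {"query_string": {"query": "Vietnam AND Soviet"}}
body = total_tokens_request(saved_query)
print(json.dumps(body, indent=2))
```

The sum aggregation would give the denominator (total words in the sub-collection). The numerator, the count of one particular word, would still need a separate lookup, which is the gap described in the next comment.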
It gives the occurrences per term within the sub-collection/saved query. But only if the word also appears in the word-cloud. You cannot query for a particular word. |
Note to self: the link above is broken; in ElasticSearch 2.0 this is the correct URL: https://www.elastic.co/guide/en/elasticsearch/reference/current/token-count.html. In the 1.7 branch this is the correct URL: https://www.elastic.co/guide/en/elasticsearch/reference/1.7/mapping-core-types.html#token_count |
It would be useful to provide the user with some additional linguistic information.
This enables the user to calculate (or have the computer calculate) the normalized frequency of a word within the sub-collection (or the entire collection).