Releases: gederajeg/collogetr
Bug fixes to comply with tidyr's `nest` and `unnest` new behaviour
Bug fix and update
Bug fixes
- The bug involved an error when pulling out column `nn`, the result of `tally()` in the previous version of `dplyr` (i.e. v0.7.8). The bug was identified in the AppVeyor and Travis builds (cf. here and here respectively), where column `nn` was not found in `.data`. There was one line of code in `colloc_leipzig()` that used column `nn`; it has now been changed to `n`, and the builds for this release succeed with the updated `dplyr` version (v0.8.0.1) (cf. here and here for the AppVeyor and Travis builds respectively).
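For future robustness, one option is to name the count column explicitly so the code no longer depends on `dplyr`'s default naming. This is a hypothetical sketch, not the package's actual code, and it assumes the `name` argument available in recent `dplyr` releases:

```r
library(dplyr)

# Hypothetical sketch of the fix (not the actual colloc_leipzig() code):
# instead of relying on dplyr's default count-column name (which, for the
# call in question, changed from `nn` under v0.7.8 to `n` under v0.8.0),
# the count column can be named explicitly so downstream code stays stable.
colloc_df <- tibble(node      = c("membeli", "membeli", "menjual"),
                    collocate = c("buku", "buku", "rumah"))

colloc_df %>%
  count(node, collocate, name = "coll_freq") %>%  # explicit column name
  pull(coll_freq)
```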
Development
- Add a new function called `collex_llr()` to compute the association measure based on the log-likelihood ratio (a generic sketch of the statistic follows below).
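For orientation, the log-likelihood ratio (G-squared) for a standard 2-by-2 collocation table can be computed as below. This is a generic sketch of the statistic, not the actual implementation of `collex_llr()`, and the counts are made up:

```r
# G2 for a 2-by-2 table: node (+/-) by collocate (+/-); illustrative counts only.
llr_2x2 <- function(o11, o12, o21, o22) {
  obs <- matrix(c(o11, o12, o21, o22), nrow = 2, byrow = TRUE)
  exp <- outer(rowSums(obs), colSums(obs)) / sum(obs)   # expected counts
  2 * sum(ifelse(obs > 0, obs * log(obs / exp), 0))     # G2 = 2 * sum(O * log(O/E))
}

llr_2x2(o11 = 40, o12 = 160, o21 = 60, o22 = 9740)
```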
Bug fixes and updates
Bug fixes
- Fix a bug in the search procedure. In this version, the corpus is first tokenised and the node word is then matched against its exact word form (see the illustration after this list).
- Fix a bug in the output column names and the number of output columns when the `save_interim` argument is `TRUE`.
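A minimal illustration of the revised search procedure (not the package's actual code): tokenise first, then match the node word against exact word forms rather than substrings.

```r
sent   <- "Dia membeli buku itu"
tokens <- unlist(strsplit(tolower(sent), "\\s+"))  # tokenise the sentence

node <- "beli"
tokens[tokens == node]   # exact word-form match: no hit for "beli"
grepl(node, tokens)      # a substring match would wrongly hit "membeli"
```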
Development
- Increase the test coverage for the code
- Add lifecycle and repo status badges, including the AppVeyor build badge
Next release
- Add the log-likelihood ratio as an alternative association measure
- Add Multiple Distinctive Collexeme Analysis (MDCA) as an association measure for contrasting more than two near-synonymous node words. MDCA uses a one-tailed, exact binomial test to determine the distinctive collocates of a node word in comparison to its near-synonyms (see the sketch below).
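As a generic illustration of that logic (not the planned implementation; all counts are made up): for a given collocate, test whether its frequency with node word A exceeds what A's overall share of the node-word tokens would predict.

```r
freq_with_A <- 30    # collocate tokens co-occurring with node word A
freq_total  <- 50    # collocate tokens across all competing node words
share_of_A  <- 0.4   # node A's proportion of all node-word tokens

# One-tailed exact binomial test for distinctiveness with node A
binom.test(freq_with_A, freq_total, p = share_of_A, alternative = "greater")
```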
Minor update on LICENSE and Website
This is a minor update involving a change of license from GPL-2 to MIT. The update also includes setting up a GitHub webpage for the package. There are no additional functions, but there is more test coverage for the existing functions.
collogetr 1.0.0
Breaking changes
Existing functions
colloc_leipzig()
- A feature to search collocates for multiple node words in one go. These words have to be combined into a character vector (e.g., `c("membeli", "menjual")`); see the usage sketch after this list.
- Additional output of (i) the sentence match in which the collocates and the node word(s) are found, and (ii) the window-span information of the collocates in relation to the node word(s) (e.g., `r1` for collocates occurring one word to the right of the node).
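A hedged usage sketch of the multi-word search: `my_leipzig_corpus` is a placeholder, and the argument names (`leipzig_corpus_list`, `pattern`, `span`) are assumptions based on the notes above rather than the documented interface.

```r
library(collogetr)

# Hypothetical call; argument names are assumptions, my_leipzig_corpus is a placeholder.
out <- colloc_leipzig(leipzig_corpus_list = my_leipzig_corpus,
                      pattern = c("membeli", "menjual"),  # multiple node words at once
                      span = 2)                           # collocate window size

# The output is described as including the matched sentence and a window-span
# label (e.g., "r1" = one word to the right of the node).
```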
assoc_prepare()
- Allows processing the input frequency data per corpus or combined across all corpus files.
- Allows selecting a given collocate `span` to focus on for the association measure (illustrated below).
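Conceptually, the span selection amounts to restricting the collocate table to particular window positions before the measures are computed. The dplyr sketch below illustrates this on a toy table; the column names are assumptions, and the actual argument `assoc_prepare()` uses for this is not shown here.

```r
library(dplyr)

# Toy collocate table with window-span labels ("r1" = one word to the right, etc.)
colloc_tb <- tibble(node      = c("membeli", "membeli", "menjual"),
                    collocate = c("buku", "baru", "rumah"),
                    span      = c("r1", "r2", "r1"))

# Focus the association measure on immediately right-adjacent collocates only
filter(colloc_tb, span == "r1")
```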
New functions
assoc_prepare_dca()
- The function to generate the required input data for performing Distinctive Collexeme/Collocates Analysis (DCA). It takes the output of `assoc_prepare()`, which in turn is fed with the output of `colloc_leipzig()`.
collex_fye_dca()
- The function to perform DCA using the one-tailed Fisher-Yates Exact (FYE) test. It requires the output of `assoc_prepare_dca()`.
dca_top_collex()
- The function to extract the top-n distinctive collocates/collexemes for a given word/construction.
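Taken together, these functions form a pipeline from raw collocate retrieval to the ranked distinctive collocates. The sketch below is a hedged outline of that workflow; `my_leipzig_corpus` is a placeholder and the argument names are assumptions rather than documented defaults.

```r
library(collogetr)

# Hedged outline of the DCA workflow; argument names are assumptions.
colloc_out <- colloc_leipzig(leipzig_corpus_list = my_leipzig_corpus,
                             pattern = c("membeli", "menjual"))
assoc_tb <- assoc_prepare(colloc_out)      # collocate frequency table
dca_tb   <- assoc_prepare_dca(assoc_tb)    # input reshaped for DCA
dca_res  <- collex_fye_dca(dca_tb)         # one-tailed FYE test per collocate
dca_top_collex(dca_res)                    # top distinctive collocates per node word
```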
collex_chisq()
- The function to compute the association measure using the chi-square statistic.
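For reference, the chi-square statistic for a standard 2-by-2 collocation table can be obtained in base R as shown below. This illustrates the statistic only, not the internals of `collex_chisq()`, and the counts are made up.

```r
# Rows: node word vs. other words; columns: collocate vs. other collocates.
obs <- matrix(c(40, 160,
                60, 9740), nrow = 2, byrow = TRUE)

# correct = FALSE gives the plain (uncorrected) chi-square statistic.
chisq.test(obs, correct = FALSE)$statistic
```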
Future developments
- The next iteration of the package will include:
- Other kinds of association measures commonly used in collocational studies, such as Mutual Information and log-likelihood, and the inclusion of the odds ratio from the FYE test (a generic sketch follows this list).
- Another function to retrieve collocates from different corpus types (e.g. from a corpus that is not parsed/split according to sentences as in the Indonesian Leipzig Corpora).
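For orientation only, the sketch below shows one common way two of the planned measures are defined for a 2-by-2 collocation table (pointwise Mutual Information and the odds ratio). It is a generic illustration with made-up counts, not the forthcoming implementation.

```r
# 2-by-2 cells: o11 = node + collocate, o12 = node + other collocates,
# o21 = other words + collocate, o22 = other words + other collocates.
o11 <- 40; o12 <- 160; o21 <- 60; o22 <- 9740
n <- o11 + o12 + o21 + o22

mi         <- log2(o11 / ((o11 + o12) * (o11 + o21) / n))  # pointwise MI: observed vs. expected
odds_ratio <- (o11 * o22) / (o12 * o21)                    # sample (cross-product) odds ratio
c(MI = mi, odds_ratio = odds_ratio)
```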
First release
The package contains one function called `colloc_leipzig()` to retrieve window-span collocates from the Indonesian Leipzig Corpora. The function currently can only search for one word at a time. This makes it slow, since the function performs tokenisation in the process: to search for words X and Y in corpus C, two search calls are required, and corpus C needs to be tokenised in each of these calls.
The package also contains a function to prepare an input table (`assoc_prepare()`) for the association measure used in collocational analysis, the Fisher-Yates Exact test (`collex_fye()`); a generic illustration of the test is given below.
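As a generic illustration of the test behind `collex_fye()` (not the function's internals; the counts are made up), a one-tailed Fisher-Yates Exact test on a 2-by-2 collocation table can be run in base R:

```r
# Rows: node word vs. other words; columns: collocate vs. other collocates.
obs <- matrix(c(40, 160,
                60, 9740), nrow = 2, byrow = TRUE)

# One-tailed test for attraction between the node word and the collocate.
fisher.test(obs, alternative = "greater")$p.value
```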
The next release will extend the `colloc_leipzig()` function to support multiple-pattern search and a more efficient search procedure.