Update references in documentation, plus minor docs changes

- Format all references in APA
- Add references section to all relevant documentation
- Include links and DOIs for references where possible
- Rework some wording
- Add @see annotations to distance and similarity interfaces

Signed-off-by: solonovamax <[email protected]>
solo-studios · Sep 29, 2023 · da9bb56
1 parent 9a7f035
commit da9bb56
Showing 25 changed files with 270 additions and 115 deletions.
diff --git a/kt-string-similarity/dokka/includes/kt-string-similarity.md b/kt-string-similarity/dokka/includes/kt-string-similarity.md
@@ -4,14 +4,15 @@ Kotlin String Similarity is a Kotlin Multiplatform library for measuring and com
 
 Kotlin String Similarity implements various string similarity and distance measures.
 It contains over a dozen algorithms, including, but not limited to,
-[Levenshtein][ca.solostudios.stringsimilarity.Levenshtein] distance (and siblings),
+[Levenshtein][ca.solostudios.stringsimilarity.edit.Levenshtein] distance (and siblings),
 [Jaro-Winkler][ca.solostudios.stringsimilarity.JaroWinkler],
-[Longest Common Subsequence][ca.solostudios.stringsimilarity.LongestCommonSubsequence],
+[Longest Common Subsequence][ca.solostudios.stringsimilarity.edit.LCS],
 [Cosine similarity][ca.solostudios.stringsimilarity.Cosine], and many others.
 Check the summary table below for the complete list.
 
-This is project contains a port of tdebatty's
-[java-string-similarity](https://github.com/tdebatty/java-string-similarity) to Kotlin Multiplatform.
+This project was initially a port of tdebatty's
+[java-string-similarity](https://github.com/tdebatty/java-string-similarity) to Kotlin Multiplatform,
+but is now expanding upon it.
 
 ## Including
 
@@ -20,28 +21,35 @@ You can include ${project.module} in your project by adding the following:
 ### Maven
 
 ```xml
-<dependency>
-  <groupId>${project.group}</groupId>
-  <artifactId>${project.module}</artifactId>
-  <version>${project.version}</version>
-</dependency>
+<dependencies>
+    <dependency>
+        <groupId>${project.group}</groupId>
+        <artifactId>${project.module}</artifactId>
+        <version>${project.version}</version>
+    </dependency>
+</dependencies>
 ```
 
 ### Gradle Groovy DSL
 
-```groovy
-implementation '${project.group}:${project.module}:${project.version}'
+```gradle
+dependencies {
+    implementation '${project.group}:${project.module}:${project.version}'
+}
 ```
 
 ### Gradle Kotlin DSL
 
 ```kotlin
-implementation("${project.group}:${project.module}:${project.version}")
+dependencies {
+    implementation("${project.group}:${project.module}:${project.version}")
+}
 ```
 
 ### Gradle Version Catalog
 
 ```toml
+[libraries]
 ${project.module} = { group = "${project.group}", name = "${project.module}", version = "${project.version}" }
 ```
 
@@ -51,42 +59,87 @@ The main characteristics of each implemented algorithm are presented below.
 The "cost" column gives an estimation of the computational cost to compute the similarity between two strings of length
 \\(m\\) and \\(n\\) respectively.
 
-| Name                                 | Similarity support | Normalized | Metric | Type    | Cost                                | Typical usage                    |
-|--------------------------------------|--------------------|------------|--------|---------|-------------------------------------|----------------------------------|
-| Levenshtein                          | ☐                  | ☐          | ☒      |         | \\(O(m \\times n)\\) <sup>1</sup>   |                                  |
-| Normalized Levenshtein               | ☒                  | ☒          | ☐      |         | \\(O(m \\times n)\\) <sup>1</sup>   |                                  |
-| Weighted Levenshtein                 | ☐                  | ☐          | ☐      |         | \\(O(m \\times n)\\) <sup>1</sup>   | OCR                              |
-| Damerau-Levenshtein<sup>3</sup>      | ☐                  | ☐          | ☒      |         | \\(O(m \\times n)\\) <sup>1</sup>   |                                  |
-| Optimal String Alignment<sup>3</sup> | ☐                  | ☐          | ☐      |         | \\(O(m \\times n)\\) <sup>1</sup>   |                                  |
-| Jaro-Winkler                         | ☒                  | ☒          | ☐      |         | \\(O(m \\times n)\\)                | typo correction                  |
-| Longest Common Subsequence           | ☐                  | ☐          | ☐      |         | \\(O(m \\times n)\\) <sup>1,2</sup> | diff utility, GIT reconciliation |
-| Metric Longest Common Subsequence    | ☐                  | ☒          | ☒      |         | \\(O(m \\times n)\\) <sup>1,2</sup> |                                  |
-| N-Gram                               | ☐                  | ☒          | ☐      |         | \\(O(m \\times n)\\)                |                                  |
-| Q-Gram                               | ☐                  | ☐          | ☐      | Profile | \\(O(m+n)\\)                        |                                  |
-| Cosine similarity                    | ☒                  | ☒          | ☐      | Profile | \\(O(m+n)\\)                        |                                  |
-| Jaccard index                        | ☒                  | ☒          | ☒      | Set     | \\(O(m+n)\\)                        |                                  |
-| Sorensen-Dice coefficient            | ☒                  | ☒          | ☐      | Set     | \\(O(m+n)\\)                        |                                  |
-| Ratcliff-Obershelp                   | ☒                  | ☒          | ☐      |         | ?                                   |                                  |
-
-1. In this library, Levenshtein edit distance, LCS distance and their sibblings are computed using the dynamic programming method, which
-   has a cost \\(O(m \\times n)\\).
-   For Levenshtein distance, the algorithm is sometimes called Wagner-Fischer algorithm ("The string-to-string correction problem", 1974).
-   The original algorithm uses a matrix of size m x n to store the Levenshtein distance between string prefixes.
-
-   If the alphabet is finite, it is possible to use the method of four russians (Arlazarov et al. "On economic construction of the
-   transitive
-   closure of a directed graph", 1970) to speedup computation.
-   This was published by Masek in 1980 ("A Faster Algorithm Computing String Edit Distances").
-   This method splits the matrix in blocks of size \\(t \\times t\\).
-   Each possible block is precomputed to produce a lookup table.
-   This lookup table can then be used to compute the string similarity (or distance) in \\(O(\\frac{nm}{t})\\).
-   Usually, \\(t\\) is chosen as \\(log(m)\\) if \\(m > n\\).
-   The resulting computation cost is thus \\(O(\\frac{mn}{log(m)})\\).
-   This method has not been implemented (yet).
-
-2. In "Length of Maximal Common Subsequences", K.S. Larsen proposed an algorithm that computes the length of LCS in time
-   \\(O(log(m) \\times log(n))\\). But the algorithm has a memory requirement \\(O(m \\times n^2)\\) and was thus not implemented here.
-
-3. There are two variants of Damerau-Levenshtein string distance: Damerau-Levenshtein with adjacent transpositions (also sometimes called
-   unrestricted Damerau–Levenshtein distance) and Optimal String Alignment (also sometimes called restricted edit distance).
-   For Optimal String Alignment, no substring can be edited more than once.
+| Name                                       | Distance | Similarity | Normalized | Metric | Memory cost          | Execution cost                     | Typical usage   |
+|--------------------------------------------|:--------:|:----------:|:----------:|:------:|----------------------|------------------------------------|-----------------|
+| Levenshtein                                |    ☒     |     ☐      |     ☐      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |                 |
+| Damerau-Levenshtein[@ft-c]                 |    ☒     |     ☐      |     ☐      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |                 |
+| Optimal String Alignment[@ft-c]            |    ☒     |     ☐      |     ☐      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |                 |
+| Longest Common Subsequence                 |    ☒     |     ☐      |     ☐      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] | diff, git       |
+| Normalized Levenshtein                     |    ☒     |     ☒      |     ☒      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |                 |
+| Normalized Damerau-Levenshtein[@ft-c]      |    ☒     |     ☐      |     ☒      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |                 |
+| Normalized Optimal String Alignment[@ft-c] |    ☒     |     ☐      |     ☒      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |                 |
+| Normalized Longest Common Subsequence      |    ☒     |     ☐      |     ☒      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] |                 |
+| Cosine similarity                          |    ☒     |     ☒      |     ☒      |   ☐    | \\(O(m + n)\\)       | \\(O(m + n)\\)                     |                 |
+| Jaccard index                              |    ☒     |     ☒      |     ☒      |   ☒    | \\(O(m + n)\\)       | \\(O(m + n)\\)                     |                 |
+| Jaro-Winkler                               |    ☒     |     ☒      |     ☒      |   ☐    | \\(O(m + n)\\)       | \\(O(m \\times n)\\)               | typo correction |
+| N-Gram                                     |    ☒     |     ☐      |     ☒      |   ☐    |                      | \\(O(m \\times n)\\)               |                 |
+| Q-Gram                                     |    ☒     |     ☐      |     ☐      |   ☐    |                      | \\(O(m + n)\\)                     |                 |
+| Ratcliff-Obershelp                         |    ☒     |     ☒      |     ☒      |   ☐    | \\(O(m + n)\\)       | \\(O(n^3)\\)                       |                 |
+| Sorensen-Dice coefficient                  |    ☒     |     ☒      |     ☒      |   ☐    |                      | \\(O(m + n)\\)                     |                 |
+| Sift 4                                     |    ☒     |     ☐      |     ☐      |   ☐    | \\(O(m + n)\\)       | \\(O(m + n)\\)                     |                 |
+
+<h2 class="footnotes-header">Notes</h2>
+<div class="footnotes">
+<ol>
+<li id="footnote-a">
+
+In this library, Levenshtein edit distance, LCS distance and their siblings are computed using the dynamic
+programming method, which has a cost \\(O(m \\times n)\\).
+For Levenshtein distance, the algorithm is sometimes called the Wagner-Fischer algorithm.[@ref-1]
+The original algorithm uses a matrix of size \\(m \\times n\\) to store the Levenshtein distance between string
+prefixes.
+
+If the alphabet is finite, it is possible to use the "Four-Russians" technique[@ref-2] to speed up computation,
+as shown by Masek and Paterson.[@ref-3]
+This method splits the matrix into blocks of size \\(t \\times t\\).
+Each possible block is precomputed to produce a lookup table.
+This lookup table can then be used to compute the string similarity (or distance) in \\(O(\\frac{n \\times m}{t})\\).
+Usually, \\(t\\) is chosen as \\(log(m)\\) if \\(m > n\\).
+The resulting computation cost is thus \\(O(\\frac{m \\times n}{\\text{log}(m)})\\).
+This method has not been implemented (yet).
+</li>
+<li id="footnote-b">
+
+K.S. Larsen proposed an algorithm that computes the length of the LCS in time
+\\(O(log(m) \\times log(n))\\).[@ref-4] However, the algorithm has a memory requirement of \\(O(m \\times n^2)\\) and was thus not
+implemented here.
+</li>
+<li id="footnote-c">
+
+There are two variants of Damerau-Levenshtein string distance: Damerau-Levenshtein with adjacent transpositions
+(also sometimes called unrestricted Damerau–Levenshtein distance) and Optimal String Alignment (also sometimes called
+restricted edit distance). For Optimal String Alignment, no substring can be edited more than once.
+</li>
+</ol>
+</div>
+
+<h2 class="references-header">References</h2>
+<div class="references">
+<ol>
+<li id="reference-1">
+
+Wagner, R. A., & Fischer, M. J. (1974-01). The string-to-string correction problem.
+*Journal of the ACM*, *21*(1), 168–173.
+<https://doi.org/10.1145/321796.321811><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.1145/321796.321811)</sup>
+</li>
+<li id="reference-2">
+
+Arlazarov, V. L., Dinitz, Y. A., Kronrod, M. A., & Faradzhev, I. (1970).
+On economical construction of the transitive closure of a directed graph.
+*Soviet Mathematics Doklady*, *194*(3), 487-488.
+</li>
+<li id="reference-3">
+
+Masek, W. J., & Paterson, M. S. (1980-02). A faster algorithm computing string
+edit distances. *Journal of Computer and System Sciences*, *20*(1), 18-31.
+<https://doi.org/10.1016/0022-0000(80)90002-1><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.1016/0022-0000(80)90002-1)</sup>
+</li>
+<li id="reference-4">
+
+Larsen, K. S. (1992-10). Length of maximal common subsequences. *DAIMI Report
+Series*, *21*(426).
+<https://doi.org/10.7146/dpb.v21i426.6740><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.7146/dpb.v21i426.6740)</sup>
+</li>
+</ol>
+</div>
+
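The dynamic programming method described in the first note can be sketched as follows. This is a minimal illustration of the Wagner-Fischer recurrence, not the library's implementation; the function name `levenshteinSketch` is hypothetical and unit edit costs are assumed.

```kotlin
// Hypothetical helper for illustration only; not part of the library's API.
// Assumes unit cost for insertion, deletion, and substitution.
fun levenshteinSketch(a: String, b: String): Int {
    // dist[i][j] holds the edit distance between the first i characters of a
    // and the first j characters of b, so the table is (m + 1) x (n + 1).
    val dist = Array(a.length + 1) { IntArray(b.length + 1) }
    for (i in 0..a.length) dist[i][0] = i // delete all i characters
    for (j in 0..b.length) dist[0][j] = j // insert all j characters
    for (i in 1..a.length) {
        for (j in 1..b.length) {
            val substitution = if (a[i - 1] == b[j - 1]) 0 else 1
            dist[i][j] = minOf(
                dist[i - 1][j] + 1,               // deletion
                dist[i][j - 1] + 1,               // insertion
                dist[i - 1][j - 1] + substitution // match or substitution
            )
        }
    }
    return dist[a.length][b.length]
}
```

For example, `levenshteinSketch("kitten", "sitting")` returns 3. Every cell is filled exactly once, which is where the \\(O(m \\times n)\\) time and memory costs in the table come from.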
diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Cosine.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Cosine.kt
@@ -33,10 +33,11 @@ import ca.solostudios.stringsimilarity.util.minMaxOf
 import kotlin.math.sqrt
 
 /**
- * Implements Soft Cosine Similarity between strings. The strings are first
- * transformed in vectors of occurrences of k-shingles (sequences of k
- * characters). In this n-dimensional space, the similarity between the two
- * strings is the Cosine of their respective vectors.
+ * Implements Soft Cosine Similarity between strings.
+ *
+ * The strings are first transformed into vectors of occurrences of k-shingles
+ * (sequences of k characters). In this n-dimensional space, the similarity
+ * between the two strings is the Cosine of their respective vectors.
  *
  * The Cosine similarity between strings \(X\) and \(Y\) is
  * the Cosine of the angle between the two strings as vectors. It is computed as:

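As a rough illustration of the k-shingle approach described in this KDoc, the sketch below computes a plain cosine between shingle-count vectors. It is an assumption-laden example rather than the library's code: the names `shingleProfile` and `cosineSketch` are hypothetical, and the default `k = 2` is chosen only for the example.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper: counts every k-character shingle in the string.
fun shingleProfile(s: String, k: Int): Map<String, Int> =
    s.windowed(k).groupingBy { it }.eachCount()

// Plain cosine of the angle between the two shingle-count vectors.
// Returning 0.0 when either profile is empty is an assumption of this sketch.
fun cosineSketch(x: String, y: String, k: Int = 2): Double {
    val px = shingleProfile(x, k)
    val py = shingleProfile(y, k)
    // Dot product: only shingles present in both profiles contribute.
    val dot = px.entries.sumOf { (shingle, count) -> count * (py[shingle] ?: 0) }
    val normX = sqrt(px.values.sumOf { it * it }.toDouble())
    val normY = sqrt(py.values.sumOf { it * it }.toDouble())
    return if (normX == 0.0 || normY == 0.0) 0.0 else dot / (normX * normY)
}
```

Building each profile is linear in the string length, which matches the \\(O(m + n)\\) cost listed for profile-based measures.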
diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Jaccard.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Jaccard.kt
@@ -32,6 +32,8 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringDistance
 import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity
 
 /**
+ * Implements the Jaccard index, also known as the Jaccard similarity coefficient (Jaccard, 1912).
+ *
  * Each input string is converted into a set of n-grams; the Jaccard index is
  * then computed as \(\frac{\lVert V_1 \cap V_2 \rVert}{\lVert V_1 \cup V_2 \rVert}\).
  * Like Q-Gram distance, the input strings \(X\) and \(Y\) are first converted into sets of
@@ -41,6 +43,11 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity
  * The distance is computed as
  * \(1 - similarity(X, Y)\).
  *
+ * #### References
+ * Jaccard, P. (1912-02). The distribution of the flora in the alpine zone.
+ * *New Phytologist*, *11*(2), 37–50.
+ * <https://doi.org/10.1111/j.1469-8137.1912.tb05611.x><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.1111/j.1469-8137.1912.tb05611.x)</sup>
+ *
  * @see MetricStringDistance
  * @see NormalizedStringDistance
  * @see NormalizedStringSimilarity
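
The set-based computation described above can be sketched as follows. This is an illustrative example under stated assumptions, not the library's implementation: `jaccardSketch` is a hypothetical name, the default `n = 2` is arbitrary, and treating two empty n-gram sets as identical is a choice made for this sketch.

```kotlin
// Hypothetical helper for illustration only; not part of the library's API.
fun jaccardSketch(x: String, y: String, n: Int = 2): Double {
    // Convert each string into its set of n-grams (duplicates collapse).
    val sx = x.windowed(n).toSet()
    val sy = y.windowed(n).toSet()
    // Assumption: two empty sets are considered identical.
    if (sx.isEmpty() && sy.isEmpty()) return 1.0
    // |V1 ∩ V2| / |V1 ∪ V2|
    return (sx intersect sy).size.toDouble() / (sx union sy).size
}
```

For "night" and "nacht" with bigrams, the sets share only "ht" out of seven distinct bigrams, so the similarity is 1/7 and the distance is 6/7.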

diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/JaroWinkler.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/JaroWinkler.kt
@@ -35,6 +35,8 @@ import kotlin.math.max
 import kotlin.math.min
 
 /**
+ * Implements the Jaro-Winkler distance (Winkler, 1990) between strings.
+ *
  * The Jaro–Winkler distance is designed and best suited for short
  * strings such as person names, and to detect typos; it is (roughly) a
  * variation of Damerau-Levenshtein, where the substitution of 2 close
@@ -47,6 +49,11 @@ import kotlin.math.min
  * The distance is computed as
  * \(1 - similarity(X, Y)\).
  *
+ * #### References
+ * Winkler, W. E. (1990). String comparator metrics and enhanced decision rules
+ * in the Fellegi-Sunter model of record linkage. *Proceedings of the Survey
+ * Research Methods Section*, 354-359. <https://eric.ed.gov/?id=ED325505>
+ *
  * @param threshold The threshold value used for adding the Winkler bonus.
  *
  * @see NormalizedStringDistance

diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/NGram.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/NGram.kt
@@ -35,15 +35,20 @@ import ca.solostudios.stringsimilarity.util.min
 import ca.solostudios.stringsimilarity.util.minMaxByLength
 
 /**
- * N-Gram Similarity as defined by Kondrak, "N-Gram Similarity and Distance",
- * String Processing and Information Retrieval, Lecture Notes in Computer
- * Science Volume 3772, 2005, pp 115-126.
+ * Implements the N-Gram Similarity (Kondrak, 2005) between strings.
  *
- * The algorithm uses affixing with special character '\0' to increase the
+ * The algorithm uses affixing with special character `'\0'` to increase the
  * weight of first characters. The normalization is achieved by dividing the
  * total similarity score by the original length of the longest word.
  *
- * [N-Gram Similarity and Distance](http://webdocs.cs.ualberta.ca/~kondrak/papers/spire05.pdf)
+ * The similarity is computed as
+ * \(1 - distance(X, Y)\).
+ *
+ * #### References
+ * Kondrak, G. (2005-11-02). N-gram similarity and distance. In String processing
+ * and information retrieval, Lecture Notes in Computer Science (pp. 115-126).
+ * Springer Berlin Heidelberg.
+ * <https://doi.org/10.1007/11575832_13><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.1007/11575832_13)</sup>
  *
  * @see NormalizedStringDistance
  * @see NormalizedStringSimilarity

diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/QGram.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/QGram.kt
@@ -32,10 +32,8 @@ import ca.solostudios.stringsimilarity.interfaces.StringDistance
 import kotlin.math.abs
 
 /**
- * Q-gram distance, as defined by
- * Esko Ukkonen. Bo, "Approximate string-matching with q-grams and maximal matches", in Theoretical Computer Science,
- * vol. 92, no. 1, pp. 191-211, Elsevier BV, Jan. 1992, pp. 191–211, doi: 10.1016/0304-3975(92)90143-4.
- * <sup>[&#91;sci-hub&#93;](https://sci-hub.st/https://doi.org/10.1016/0304-3975(92)90143-4)</sup>
+ * Implements the Q-gram distance (Ukkonen, 1992) between strings.
+ *
  * The distance between two strings is defined as
  * the number of occurrences of different q-grams in each string:
  * \(\sum_{i=1}^n \lVert \vec{v1_i} - \vec{v2_i} \rVert\).
@@ -47,9 +45,14 @@ import kotlin.math.abs
  * resulting in \(distance(X, Y) = 0\) where \(X \neq Y\).
  * However, it does respect the other 3 axioms.
  *
+ * #### References
+ * Ukkonen, E. (1992-01). Approximate string matching with q-grams and maximal
+ * matches. *Theoretical Computer Science*, *92*(1), 191–211.
+ * <https://doi.org/10.1016/0304-3975(92)90143-4><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.1016/0304-3975(92)90143-4)</sup>
+ *
  * @param q The length of each q-gram.
  *
- * @throws IllegalArgumentException if \(k \leqslant 0\)
+ * @throws IllegalArgumentException if \(q \leqslant 0\)
  *
  * @author Thibault Debatty, solonovamax
  */
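
The sum in this KDoc can be sketched directly from q-gram counts. The example below is a minimal illustration, assuming each string is profiled by counting its overlapping q-grams; `qGramDistanceSketch` is a hypothetical name and not the library's API.

```kotlin
import kotlin.math.abs

// Hypothetical helper for illustration only; not part of the library's API.
fun qGramDistanceSketch(x: String, y: String, q: Int = 2): Int {
    require(q > 0) { "q must be positive" }
    // Count how often each q-gram occurs in each string.
    val px = x.windowed(q).groupingBy { it }.eachCount()
    val py = y.windowed(q).groupingBy { it }.eachCount()
    // Sum the absolute count differences over every q-gram seen in either string.
    return (px.keys + py.keys).sumOf { gram -> abs((px[gram] ?: 0) - (py[gram] ?: 0)) }
}
```

With `q = 2`, "ABCD" and "ABCE" differ in the q-grams "CD" and "CE", giving a distance of 2. With `q = 1`, "ab" and "ba" have identical profiles, so the distance is 0 even though the strings differ, which illustrates why the identity axiom does not hold.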

diff --git a/...ing-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/RatcliffObershelp.kt b/...ing-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/RatcliffObershelp.kt
@@ -31,7 +31,7 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringDistance
 import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity
 
 /**
- * Implements Ratcliff/Obershelp pattern recognition, also known as Gestalt pattern matching,
+ * Implements Ratcliff/Obershelp pattern recognition (Ratcliff & Metzener, 1988), also known as Gestalt pattern matching,
  * similarity between strings.
  *
  * The similarity is defined as
@@ -41,6 +41,11 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity
  * The distance is computed as
  * \(1 - similarity(X, Y)\).
  *
+ * #### References
+ * Ratcliff, J., & Metzener, D. E. (1988-07-01). Pattern matching: The gestalt
+ * approach. *Dr. Dobb’s Journal*, *13*(7), 46.
+ * <https://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970?pgno=5>
+ *
  * @author [Ligi](https://github.com/dxpux), solonovamax, Ported to java from .net by denmase
  */
 public class RatcliffObershelp : NormalizedStringSimilarity, NormalizedStringDistance {
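
For readers unfamiliar with Gestalt pattern matching, the sketch below shows the commonly cited formulation: twice the number of matching characters divided by the combined length, where matches are found by recursing around the longest common substring. It is a naive illustration, not the library's implementation; `matchingChars` and `ratcliffObershelpSketch` are hypothetical names, and treating two empty strings as identical is an assumption of this sketch.

```kotlin
// Hypothetical helper: counts matching characters by taking the longest common
// substring, then recursing on the pieces to its left and to its right.
fun matchingChars(a: String, b: String): Int {
    var bestLen = 0
    var bestA = 0
    var bestB = 0
    // Naive longest-common-substring search over every pair of start indices.
    for (i in a.indices) {
        for (j in b.indices) {
            var k = 0
            while (i + k < a.length && j + k < b.length && a[i + k] == b[j + k]) k++
            if (k > bestLen) {
                bestLen = k
                bestA = i
                bestB = j
            }
        }
    }
    if (bestLen == 0) return 0
    return bestLen +
        matchingChars(a.substring(0, bestA), b.substring(0, bestB)) +
        matchingChars(a.substring(bestA + bestLen), b.substring(bestB + bestLen))
}

// Similarity = 2 * matches / (|a| + |b|).
// Assumption: two empty strings are considered identical.
fun ratcliffObershelpSketch(a: String, b: String): Double =
    if (a.isEmpty() && b.isEmpty()) 1.0
    else 2.0 * matchingChars(a, b) / (a.length + b.length)
```

For "WIKIMEDIA" and "WIKIMANIA", the matches are "WIKIM" (5 characters) and then "IA" (2 characters), so the similarity is 14/18.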