Large docs refactor

Signed-off-by: solonovamax <[email protected]>
solo-studios · Oct 3, 2023 · 45f8471
1 parent 8615617
commit 45f8471

Showing 7 changed files with 407 additions and 318 deletions.
diff --git a/kt-string-similarity/dokka/includes/edit.md b/kt-string-similarity/dokka/includes/edit.md
@@ -0,0 +1,236 @@
+# Package ca.solostudios.stringsimilarity.edit
+
+This package contains the edit-based string measure implementations.
+
+## Algorithms
+
+### [Levenshtein][ca.solostudios.stringsimilarity.edit.Levenshtein]
+
+The [Levenshtein][ca.solostudios.stringsimilarity.edit.Levenshtein] distance between two words is the minimum number of
+single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
+
+It is a metric string distance. This class implements the dynamic programming approach,
+which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\).
+
+#### Example
+
+```kotlin
+val levenshtein = Levenshtein()
+
+println(levenshtein.distance("My string", "My \$tring")) // prints 1.0
+```
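+
+The dynamic programming approach mentioned above can be sketched as follows. This is a minimal, standalone illustration
+and not the library's implementation: `levenshteinSketch` is a hypothetical helper, and it returns an `Int` rather than
+the `Double` printed in the example above.
+
+```kotlin
+// Hypothetical helper for illustration only; not part of the library's API.
+fun levenshteinSketch(a: String, b: String): Int {
+    // dp[i][j] = edits needed to turn the first i chars of a into the first j chars of b
+    val dp = Array(a.length + 1) { IntArray(b.length + 1) }
+    for (i in 0..a.length) dp[i][0] = i // delete all i characters
+    for (j in 0..b.length) dp[0][j] = j // insert all j characters
+    for (i in 1..a.length) {
+        for (j in 1..b.length) {
+            val substitution = dp[i - 1][j - 1] + if (a[i - 1] == b[j - 1]) 0 else 1
+            val deletion = dp[i - 1][j] + 1
+            val insertion = dp[i][j - 1] + 1
+            dp[i][j] = minOf(substitution, deletion, insertion)
+        }
+    }
+    return dp[a.length][b.length]
+}
+
+fun main() {
+    println(levenshteinSketch("My string", "My \$tring")) // prints 1
+    println(levenshteinSketch("kitten", "sitting")) // prints 3
+}
+```
+
+The full \\((m + 1) \\times (n + 1)\\) table is what gives the \\(O(m \\times n)\\) space requirement.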
+
+### [Normalized Levenshtein][ca.solostudios.stringsimilarity.edit.NormalizedLevenshtein]
+
+This is computed as the [Levenshtein distance][ca.solostudios.stringsimilarity.edit.Levenshtein]
+normalized to be in the range \\(&#91;0.0, 1.0&#93;\\).
+
+It is a metric string distance. This class implements the dynamic programming approach,
+which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\).
+
+#### Example
+
+```kotlin
+val normLevenshtein = NormalizedLevenshtein()
+
+println(normLevenshtein.distance("My string", "My \$tring")) // prints 0.10526315789473684
+```
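+
+The value printed above is consistent with the normalization \\(2d / (m + n + d)\\), where \\(d\\) is the unnormalized
+distance and \\(m\\), \\(n\\) are the string lengths. This formula is an assumption inferred from the example outputs on
+this page, not something the documentation states, and `normalizedSketch` is a hypothetical helper.
+
+```kotlin
+// Hypothetical helper; the formula is inferred from the documented outputs,
+// not taken from the library's source.
+fun normalizedSketch(distance: Double, lengthA: Int, lengthB: Int): Double {
+    // Two empty strings are identical, and this also avoids dividing by zero.
+    if (lengthA == 0 && lengthB == 0) return 0.0
+    return 2 * distance / (lengthA + lengthB + distance)
+}
+
+fun main() {
+    // "My string" and "My $tring" are both 9 characters long and 1 edit apart.
+    println(normalizedSketch(1.0, 9, 9)) // 2 / 19, roughly 0.105
+}
+```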
+
+### [Damerau-Levenshtein][ca.solostudios.stringsimilarity.edit.DamerauLevenshtein]
+
+Similar to the [Levenshtein distance][ca.solostudios.stringsimilarity.edit.Levenshtein],
+the [Damerau-Levenshtein distance][ca.solostudios.stringsimilarity.edit.DamerauLevenshtein] with transposition
+(also sometimes called the unrestricted Damerau-Levenshtein distance) is the minimum number of operations needed to transform
+one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character,
+or a **transposition of two adjacent characters**.
+
+It is a metric string distance. This class implements the dynamic programming approach,
+which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\).
+
+This is not to be confused with the optimal string alignment distance, which is an extension where no substring can be
+edited more than once.
+
+#### Example
+
+```kotlin
+val damerau = DamerauLevenshtein()
+
+println(damerau.distance("ABCDEF", "ABDCEF")) // prints 1.0
+
+// 2 transpositions
+println(damerau.distance("ABCDEF", "BACDFE")) // prints 2.0
+
+// 1 deletion or insertion
+println(damerau.distance("ABCDEF", "ABCDE")) // prints 1.0
+println(damerau.distance("ABCDEF", "BCDEF")) // prints 1.0
+println(damerau.distance("ABCDEF", "ABCGDEF")) // prints 1.0
+
+// All different
+println(damerau.distance("ABCDEF", "POIU")) // prints 6.0
+
+// Transpose
+println(damerau.distance("CA", "ABC")) // prints 2.0
+```
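+
+The unrestricted distance with adjacent transpositions can be sketched with the well-known textbook dynamic programming
+formulation below. It is a standalone illustration under stated assumptions, not the library's code:
+`damerauLevenshteinSketch` is a hypothetical helper that returns an `Int`.
+
+```kotlin
+// Hypothetical helper for illustration only; not part of the library's API.
+fun damerauLevenshteinSketch(a: String, b: String): Int {
+    val maxDistance = a.length + b.length
+    // The table is shifted by one in both dimensions so that row 0 and column 0 hold a sentinel value.
+    val d = Array(a.length + 2) { IntArray(b.length + 2) }
+    d[0][0] = maxDistance
+    for (i in 0..a.length) {
+        d[i + 1][0] = maxDistance
+        d[i + 1][1] = i
+    }
+    for (j in 0..b.length) {
+        d[0][j + 1] = maxDistance
+        d[1][j + 1] = j
+    }
+    // Last row (1-based position in a) at which each character was seen.
+    val lastRow = HashMap<Char, Int>()
+    for (i in 1..a.length) {
+        var lastMatchColumn = 0 // last column in this row where the characters matched
+        for (j in 1..b.length) {
+            val k = lastRow[b[j - 1]] ?: 0
+            val l = lastMatchColumn
+            val cost = if (a[i - 1] == b[j - 1]) 0 else 1
+            if (cost == 0) lastMatchColumn = j
+            val plainEdit = minOf(d[i][j] + cost, d[i + 1][j] + 1, d[i][j + 1] + 1)
+            // Transposition, plus the edits needed between the two swapped characters.
+            val transposition = d[k][l] + (i - k - 1) + 1 + (j - l - 1)
+            d[i + 1][j + 1] = minOf(plainEdit, transposition)
+        }
+        lastRow[a[i - 1]] = i
+    }
+    return d[a.length + 1][b.length + 1]
+}
+
+fun main() {
+    println(damerauLevenshteinSketch("ABCDEF", "ABDCEF")) // prints 1
+    println(damerauLevenshteinSketch("CA", "ABC")) // prints 2
+}
+```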
+
+### [Normalized Damerau-Levenshtein][ca.solostudios.stringsimilarity.edit.NormalizedDamerauLevenshtein]
+
+This is computed as the [Damerau-Levenshtein distance][ca.solostudios.stringsimilarity.edit.DamerauLevenshtein]
+normalized to be in the range \\(&#91;0.0, 1.0&#93;\\).
+
+It is a metric string distance. This class implements the dynamic programming approach,
+which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\).
+
+#### Example
+
+```kotlin
+val damerau = NormalizedDamerauLevenshtein()
+
+println(damerau.distance("ABCDEF", "ABDCEF")) // prints 0.15384615384615385
+
+// 2 transpositions
+println(damerau.distance("ABCDEF", "BACDFE")) // prints 0.2857142857142857
+
+// 1 deletion or insertion
+println(damerau.distance("ABCDEF", "ABCDE")) // prints 0.16666666666666666
+println(damerau.distance("ABCDEF", "BCDEF")) // prints 0.16666666666666666
+println(damerau.distance("ABCDEF", "ABCGDEF")) // prints 0.14285714285714285
+
+// All different
+println(damerau.distance("ABCDEF", "POIU")) // prints 0.75
+
+// Transpose
+println(damerau.distance("CA", "ABC")) // prints 0.5714285714285714
+```
+
+### [Optimal String Alignment][ca.solostudios.stringsimilarity.edit.OptimalStringAlignment]
+
+The [Optimal String Alignment distance][ca.solostudios.stringsimilarity.edit.OptimalStringAlignment] variant
+of [Damerau-Levenshtein distance][ca.solostudios.stringsimilarity.edit.DamerauLevenshtein]
+(sometimes called the restricted edit distance) computes the number of edit operations needed
+to make the strings equal under the condition that **no substring is edited more than once**,
+whereas the true [Damerau-Levenshtein distance][ca.solostudios.stringsimilarity.edit.DamerauLevenshtein]
+presents no such restriction.
+The difference from the algorithm for the [Levenshtein distance][ca.solostudios.stringsimilarity.edit.Levenshtein] is the
+addition of one recurrence for the transposition operations.
+
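+Unlike the true Damerau-Levenshtein distance, it is not a metric, because it does not satisfy the triangle inequality:
+the distance from `CA` to `AC` is 1 and from `AC` to `ABC` is 1, yet the distance from `CA` to `ABC` is 3.
+This class implements the dynamic programming approach,
+which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\).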
+
+#### Example
+
+```kotlin
+val osa = OptimalStringAlignment()
+
+println(osa.distance("ABCDEF", "ABDCEF")) // prints 1.0
+
+// 2 transpositions
+println(osa.distance("ABCDEF", "BACDFE")) // prints 2.0
+
+// 1 deletion or insertion
+println(osa.distance("ABCDEF", "ABCDE")) // prints 1.0
+println(osa.distance("ABCDEF", "BCDEF")) // prints 1.0
+println(osa.distance("ABCDEF", "ABCGDEF")) // prints 1.0
+
+// All different
+println(osa.distance("ABCDEF", "POIU")) // prints 6.0
+
+println(osa.distance("CA", "ABC")) // prints 3.0
+```
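+
+The extra recurrence for transpositions described above can be sketched as follows. This is a minimal standalone
+illustration, not the library's code: `osaSketch` is a hypothetical helper that returns an `Int`.
+
+```kotlin
+// Hypothetical helper for illustration only; not part of the library's API.
+fun osaSketch(a: String, b: String): Int {
+    val d = Array(a.length + 1) { IntArray(b.length + 1) }
+    for (i in 0..a.length) d[i][0] = i
+    for (j in 0..b.length) d[0][j] = j
+    for (i in 1..a.length) {
+        for (j in 1..b.length) {
+            val cost = if (a[i - 1] == b[j - 1]) 0 else 1
+            // Same three cases as the Levenshtein recurrence.
+            d[i][j] = minOf(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
+            // The one additional recurrence: swap two adjacent characters.
+            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
+                d[i][j] = minOf(d[i][j], d[i - 2][j - 2] + 1)
+            }
+        }
+    }
+    return d[a.length][b.length]
+}
+
+fun main() {
+    println(osaSketch("CA", "AC")) // prints 1
+    // Swapping CA to AC and then inserting B would edit the same substring twice, so it is not allowed.
+    println(osaSketch("CA", "ABC")) // prints 3
+}
+```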
+
+### [Normalized Optimal String Alignment][ca.solostudios.stringsimilarity.edit.NormalizedOptimalStringAlignment]
+
+This is computed as the [Optimal String Alignment][ca.solostudios.stringsimilarity.edit.OptimalStringAlignment]
+normalized to be in the range \\(&#91;0.0, 1.0&#93;\\).
+
+It is a metric string distance. This class implements the dynamic programming approach,
+which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\).
+
+#### Example
+
+```kotlin
+val osa = NormalizedOptimalStringAlignment()
+
+println(osa.distance("ABCDEF", "ABDCEF")) // prints 0.15384615384615385
+
+// 2 transpositions
+println(osa.distance("ABCDEF", "BACDFE")) // prints 0.2857142857142857
+
+// 1 deletion or insertion
+println(osa.distance("ABCDEF", "ABCDE")) // prints 0.16666666666666666
+println(osa.distance("ABCDEF", "BCDEF")) // prints 0.16666666666666666
+println(osa.distance("ABCDEF", "ABCGDEF")) // prints 0.14285714285714285
+
+// All different
+println(osa.distance("ABCDEF", "POIU")) // prints 0.75
+
+// Transpose
+println(osa.distance("CA", "ABC")) // prints 0.75
+```
+
+### [Longest Common Subsequence][ca.solostudios.stringsimilarity.edit.LCS]
+
+The [Longest Common Subsequence][ca.solostudios.stringsimilarity.edit.LCS] (LCS) problem consists in finding the longest
+subsequence common to two (or more) sequences.
+It differs from problems of finding common substrings: unlike substrings, subsequences are not required to
+occupy consecutive positions within the original sequences.
+
+It is used by the diff utility, by Git for reconciling multiple changes, etc.
+
+The [LCS distance][ca.solostudios.stringsimilarity.edit.LCS] is equivalent
+to the [Levenshtein distance][ca.solostudios.stringsimilarity.edit.Levenshtein] when only insertions and deletions are
+allowed (no substitutions), or when the cost of a substitution is double the cost of an insertion or deletion.
+
+It is a metric string distance. This class implements the dynamic programming approach,
+which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\)[@ft-a].
+
+#### Example
+
+```kotlin
+val lcs = LCS()
+
+println(lcs.distance("AGCAT", "GAC")) // prints 4.0
+
+println(lcs.distance("AGCAT", "AGCT")) // prints 1.0
+```
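+
+Because only insertions and deletions are allowed, the distance can be derived from the LCS length as
+\\(m + n - 2 \\times |LCS|\\). The sketch below illustrates this with hypothetical helpers (`lcsLengthSketch` and
+`lcsDistanceSketch`); it is not the library's implementation and returns an `Int`.
+
+```kotlin
+// Hypothetical helpers for illustration only; not part of the library's API.
+fun lcsLengthSketch(a: String, b: String): Int {
+    // dp[i][j] = length of the LCS of the first i chars of a and the first j chars of b
+    val dp = Array(a.length + 1) { IntArray(b.length + 1) }
+    for (i in 1..a.length) {
+        for (j in 1..b.length) {
+            dp[i][j] = if (a[i - 1] == b[j - 1]) {
+                dp[i - 1][j - 1] + 1
+            } else {
+                maxOf(dp[i - 1][j], dp[i][j - 1])
+            }
+        }
+    }
+    return dp[a.length][b.length]
+}
+
+// Every character outside the LCS must be deleted from a or inserted from b.
+fun lcsDistanceSketch(a: String, b: String): Int =
+    a.length + b.length - 2 * lcsLengthSketch(a, b)
+
+fun main() {
+    println(lcsDistanceSketch("AGCAT", "GAC")) // prints 4
+    println(lcsDistanceSketch("AGCAT", "AGCT")) // prints 1
+}
+```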
+
+### [Normalized Longest Common Subsequence][ca.solostudios.stringsimilarity.edit.NormalizedLCS]
+
+This is computed as the [Longest Common Subsequence][ca.solostudios.stringsimilarity.edit.LCS]
+normalized to be in the range \\(&#91;0.0, 1.0&#93;\\).
+
+It is a metric string distance. This class implements the dynamic programming approach,
+which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\)[@ft-a].
+
+#### Example
+
+```kotlin
+val normalizedLCS = NormalizedLCS()
+
+println(normalizedLCS.distance("ABCDEFG", "ABCDEFHJKL")) // prints 0.45454545454545453
+
+println(normalizedLCS.distance("ABDEF", "ABDIF")) // prints 0.3333333333333333
+```
+
+<h2 class="footnotes-header">Notes</h2>
+<div class="footnotes">
+<ol>
+<li id="footnote-a">
+
+K.S. Larsen proposed an algorithm that computes the length of the LCS in time
+\\(O(\\log(m) \\times \\log(n))\\).[@ref-1] However, the algorithm has a memory requirement of \\(O(m \\times n^2)\\) and was thus not
+implemented here.
+</li>
+</ol>
+</div>
+
+<h2 class="references-header">References</h2>
+<div class="references">
+<ol>
+<li id="reference-1">
+
+Larsen, K. S. (1992-10). Length of maximal common subsequences. DAIMI Report
+Series, 21(426).
+<https://doi.org/10.7146/dpb.v21i426.6740><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.7146/dpb.v21i426.6740)</sup>
+</li>
+</ol>
+</div>
diff --git a/kt-string-similarity/dokka/includes/interfaces.md b/kt-string-similarity/dokka/includes/interfaces.md
@@ -0,0 +1,51 @@
+# Package ca.solostudios.stringsimilarity.interfaces
+
+This package contains all the interfaces for string measures.
+
+## Normalized, metric, similarity and distance
+
+Although the topic might seem simple, a lot of different algorithms exist to measure text similarity or distance.
+Therefore, the library defines some interfaces to categorize them.
+
+### (Normalized) Similarity and Distance
+
+- [StringSimilarity][ca.solostudios.stringsimilarity.interfaces.StringSimilarity]: Implementing algorithms define a
+  similarity between strings (0 means strings are completely different).
+- [NormalizedStringSimilarity][ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity]: This interface
+  extends [StringSimilarity][ca.solostudios.stringsimilarity.interfaces.StringSimilarity].
+  Implementing algorithms compute a similarity that has been normalized based on the number of operations performed.
+  This means that for non-weighted implementations, the result will always be between 0 and 1.
+  [Jaro-Winkler][ca.solostudios.stringsimilarity.JaroWinkler] is an example of this.
+- [StringDistance][ca.solostudios.stringsimilarity.interfaces.StringDistance]: Implementing algorithms define a distance
+  between strings (0 means strings are identical), like [Levenshtein][ca.solostudios.stringsimilarity.edit.Levenshtein] for example.
+  The maximum distance value depends on the algorithm.
+- [NormalizedStringDistance][ca.solostudios.stringsimilarity.interfaces.NormalizedStringDistance]: This interface
+  extends [StringDistance][ca.solostudios.stringsimilarity.interfaces.StringDistance].
+  Implementing algorithms compute a distance that has been normalized based on the number of operations performed.
+  This means that for non-weighted implementations, the result will always be in the range \\(&#91;0, 1&#93;\\).
+  [NormalizedLevenshtein][ca.solostudios.stringsimilarity.edit.NormalizedLevenshtein] is an example of this.
+
+Generally, algorithms that
+implement [NormalizedStringSimilarity][ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity]
+also implement [NormalizedStringDistance][ca.solostudios.stringsimilarity.interfaces.NormalizedStringDistance].
+This is because the similarity can be computed as \\(1 - \\text{distance}\\),
+and the distance can be computed as \\(1 - \\text{similarity}\\).
+
+> Note: This is only applicable if the result is *always* between 0 and 1.
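+
+The relationship between normalized distance and similarity can be sketched as follows. The interface and function names
+here (`NormalizedDistanceSketch`, `similarityFrom`) are hypothetical and exist only for this illustration; the library's
+real interfaces may be shaped differently.
+
+```kotlin
+// Hypothetical interface for illustration; not the library's NormalizedStringDistance.
+fun interface NormalizedDistanceSketch {
+    // Assumed to always return a value between 0 and 1.
+    fun distance(a: String, b: String): Double
+}
+
+// Derives a similarity from any distance that is guaranteed to stay between 0 and 1.
+fun similarityFrom(measure: NormalizedDistanceSketch, a: String, b: String): Double =
+    1.0 - measure.distance(a, b)
+
+fun main() {
+    // A toy normalized distance: 0 for equal strings, 1 otherwise.
+    val discrete = NormalizedDistanceSketch { a, b -> if (a == b) 0.0 else 1.0 }
+
+    println(similarityFrom(discrete, "abc", "abc")) // prints 1.0
+    println(similarityFrom(discrete, "abc", "abd")) // prints 0.0
+}
+```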
+
+### Metric Distances
+
+The [MetricStringDistance][ca.solostudios.stringsimilarity.interfaces.MetricStringDistance]
+interface indicates that the implementing class is a metric distance,
+which means that it satisfies the axioms required to be considered a metric.
+Read [MetricStringDistance][ca.solostudios.stringsimilarity.interfaces.MetricStringDistance] for more information.
+
+A lot of nearest-neighbor search algorithms and indexing structures rely on the triangle inequality.
+You can check "Similarity Search, The Metric Space Approach" by Zezula et al. for a survey.
+These algorithms and structures cannot be used with non-metric similarity measures.
+
+### Edit Measures
+
+The edit measure interfaces indicate when a specific algorithm is edit-based.
+See the `edit` package for all implementors.
diff --git a/kt-string-similarity/dokka/includes/kt-string-similarity.md b/kt-string-similarity/dokka/includes/kt-string-similarity.md
@@ -26,14 +26,14 @@ The "cost" columns gives an estimation of the computational/memory costs to comp
 
 | Name                                       | Distance | Similarity | Normalized | Metric | Memory cost          | Execution cost                     |
 |--------------------------------------------|:--------:|:----------:|:----------:|:------:|----------------------|------------------------------------|
-| Levenshtein                                |    ☒     |     ☐      |     ☐      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |
-| Damerau-Levenshtein[@ft-c]                 |    ☒     |     ☐      |     ☐      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |
-| Optimal String Alignment[@ft-c]            |    ☒     |     ☐      |     ☐      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |
-| Longest Common Subsequence                 |    ☒     |     ☐      |     ☐      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] |
+| Levenshtein                                |    ☒     |     ☒      |     ☐      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |
+| Damerau-Levenshtein[@ft-c]                 |    ☒     |     ☒      |     ☐      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |
+| Optimal String Alignment[@ft-c]            |    ☒     |     ☒      |     ☐      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |
+| Longest Common Subsequence                 |    ☒     |     ☒      |     ☐      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] |
 | Normalized Levenshtein                     |    ☒     |     ☒      |     ☒      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |
-| Normalized Damerau-Levenshtein[@ft-c]      |    ☒     |     ☐      |     ☒      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |
-| Normalized Optimal String Alignment[@ft-c] |    ☒     |     ☐      |     ☒      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |
-| Normalized Longest Common Subsequence      |    ☒     |     ☐      |     ☒      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] |
+| Normalized Damerau-Levenshtein[@ft-c]      |    ☒     |     ☒      |     ☒      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |
+| Normalized Optimal String Alignment[@ft-c] |    ☒     |     ☒      |     ☒      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a]        |
+| Normalized Longest Common Subsequence      |    ☒     |     ☒      |     ☒      |   ☒    | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] |
 | Cosine similarity                          |    ☒     |     ☒      |     ☒      |   ☐    | \\(O(m + n)\\)       | \\(O(m + n)\\)                     |
 | Jaccard index                              |    ☒     |     ☒      |     ☒      |   ☒    | \\(O(m + n)\\)       | \\(O(m + n)\\)                     |
 | Jaro-Winkler                               |    ☒     |     ☒      |     ☒      |   ☐    | \\(O(m + n)\\)       | \\(O(m \\times n)\\)               |