Fix BWT Locate explanation. Typos and English.

bebop · Dec 22, 2023 · 6269764 · 6269764
1 parent e52ba57
commit 6269764
Showing 1 changed file with 38 additions and 32 deletions.
diff --git a/bwt/bwt.go b/bwt/bwt.go
@@ -11,13 +11,13 @@ import (
 
 /*
 
-For the BWT usage, please read the its
-method documentation. To understand what it is and how
+For the BWT usage, please read the BWT methods
+below. To understand what it is and how
 it works for either curiosity or maintenance, then read below.
 
 # BWT
 
-BWT Stand for (B)urrow (W)heeler (T)ransform. The BWT aids in
+BWT Stands for (B)urrows-(W)heeler (T)ransform. The BWT aids in
 text compression and acts as a search index for any arbitrary
 sequence of characters. With the BWT and some auxiliary data
 structures, we can analyze a sequence in a memory and run time
@@ -75,8 +75,7 @@ To best way to show this is to rebuild the original sequence
 using the F and L columns. We do this by rebuilding the original string
 in reverse order starting with the nullChar.
 
-Original string: ______$0
-
+Original string:  ______$0
 F($0) -> L(a0) -> _____a0$0
 F(a0) -> L(n0) -> ____n0a0$0
 F(n0) -> L(a1) -> ___a1n0a0$0
@@ -101,18 +100,18 @@ pattern "ana" in "banana". We can do this by:
 2. Find that range of a's, [1, 4)
 3. Take the next previous character in the pattern, n
 4. Find the rank of n before the range from 2. [0, 1) = 0
-5. Find the rank of n in the range from 2. [1, 4) = 1
+5. Find the rank of n in the range from 2. [1, 4) = 2
 6. Look up the start range of the n's in the F column, 5
 7. Add the result from 4 and 5 respectively to form the next
-   L search range: [5+0, 5+1) = [5, 6)
+   L search range: [5+0, 5+2) = [5, 7)
 8. Take next previous character in the pattern, a
-9. Take the rank of "a" before, 0
-10. Take the rank of "a" within, 1
+9. Take the rank of "a" before position 5, which is 1
+10. Take the rank of "a" before position 7, which is 3
 11. Lookup the a's in the F column again, but add the results
     from 9 and 10 to the start range to get the next search
-    range = [1+0, 1+1) = [1, 2)
-12. That is beginning of out pattern, we sub subtract the end and start
-    of the search range to get out count, 2-1=1
+    range = [1+1, 1+3) = [2, 4)
+12. That is beginning of our pattern, we sub subtract the end and start
+    of the search range to get our count, 4-2=2
 
 Another way to look at this is that we are constantly refining our search
 range for each character of the pattern we are searching for. Once we
@@ -123,19 +122,21 @@ has collapsed and we can conclude that there is no matching pattern.
 ## Suffix Array
 
 For other operations such as Locate and Extract, we need another auxiliary
-data structure, the suffix array. Since we could be at multiple points
-within the original sequence and at any point within that sequence, we need
-some kind of point of reference of where we are. We can do this by storing
-the position of each original character for each of the corresponding
-characters in the F column. With our banana example:
+data structure, the suffix array. Since rows of the BWT can map to any
+position within the original sequence, we need some kind of reference as to
+which BWT rows map to which positions in the original sequence. We can do this by storing
+the positions of each character from the original sequence to each of the corresponding
+rows in the BWT column. With our banana example:
 
 F:   $0 a0 a1 a2 b0 n0 n1
 SA: [6  5  3  1  0  4  2]
 
 If we take our count example for the pattern "ana" above, you'll remember
-that our final search range was [1, 2). If we look up 1 in the SA, we'll
-find that there is only one offset at position 3 in the original sequence
-"banana"
+that our final search range was [2, 4). You'll also remember that we counted
+2 occurrences of "ana" by subtracting the end of the range from the start, 4-2=2.
+If iterate from 2 to 4, we can lookup the corresponding SA entry for the BWT rows 2 and 3.
+If we look up 2 in the SA, we'll find that our first offset is at position 3 in the original sequence ban"ana"
+If we look up 3 in the SA, we'll find that our second offset is at position 1 in the original sequence b"ana"na
 
 ## Notes on Performance
 
@@ -145,12 +146,12 @@ int64, that would 8 times the amount of memory of the BWT in its plain text
 representation! In the implementation below, we may instead sample the SA
 and do additional look ups as needed to find the offsets we need.
 
-Similarly, storing the F and L column as plain text has just doubled the
-amount of memory from the original sequence... BWT is used for text
+Similarly, storing both the F and L column as plain text would take double the
+amount of memory to store the original sequence... BWT is used for text
 compression, not expansion! That's why in the below implementation, you
-will see other data structures to actually compress the amount of memory
+will see other data structures that lower the amount of memory
 needed. You will also notice that we can make huge improvements by
-compressing sequences like with the F column.
+compressing sequences by runs of characters like with the F column.
 
 Instead of:
 
@@ -160,22 +161,21 @@ Since F is lexicographically sorted, we can have:
 
 F: {$: [0, 1)}, {a: [1, 4)}, {b: [4, 5)} {n: [5, 7)}
 
-Although these performance enhancements may look different from what is
-described above, it is still just an FL mapping at the end of the day- just
-with more steps.
+Although these performance enhancements may lead to a different implementation to what is
+described above, any implementation will just be an LF mapping- just with a few more steps.
 
 
 NOTE: The above is just to explain what is happening at a high level. Please
 reference the implementation below to see how the BWT is actually currently
 working
 
-Many of the Idea's come from Ben Langmead.
+Many of the Ideas come from Ben Langmead.
 He has a whole YouTube playlist about BWT Indexing: https://www.youtube.com/watch?v=5G2Db41pSHE&list=PL2mpR0RYFQsADmYpW2YWBrXJZ_6EL_3nu
 */
 
 const nullChar = "$"
 
-// BWT Burrow Wheeler Transform
+// BWT Burrows-Wheeler Transform
 // Compresses and Indexes a given sequence so that it can be
 // be used for search, alignment, and text extraction. This is
 // useful for sequences so large that it would be beneficial
@@ -269,6 +269,8 @@ func (bwt BWT) Len() int {
 
 // getFCharPosFromOriginalSequenceCharPos looks up mapping from the original position
 // of the sequence to its corresponding position in the First Column of the BWT
+// NOTE: This clearly isn't ideal. Instead of improving this implementation, this will be replaced with
+// something like r-index in the near future.
 func (bwt BWT) getFCharPosFromOriginalSequenceCharPos(originalPos int) int {
 	for i := range bwt.suffixArray {
 		if bwt.suffixArray[i] == originalPos {
@@ -407,9 +409,13 @@ func buildSkipList(prefixArray []string) []skipEntry {
 	return skipList
 }
 
-// getBWTIndex returns the position of the character from the sequence used to build the BWT
-// that corresponds the last character that would exist in the entry of the prefixArray that
-// would be the last character if we were actually doing full rotations
+// getBWTIndex helps us calculate the corresponding character that would
+// be in the L column without having to rotate the full string.
+// For example:
+// Original string: banana$
+// Rotation:        ana$___
+// Position:        7-4-1= 2
+// Original[3]:     n
 func getBWTIndex(lenOfSequenceBeingBuilt, lenOfSuffixArrayVisited int) int {
 	bwtCharIndex := lenOfSequenceBeingBuilt - lenOfSuffixArrayVisited - 1
 	if bwtCharIndex == -1 {