Skip to content

Commit

Permalink
Fix BWT Locate explanation. Typos and English.
Browse files Browse the repository at this point in the history
  • Loading branch information
TwFlem committed Dec 22, 2023
1 parent e52ba57 commit 6269764
Showing 1 changed file with 38 additions and 32 deletions.
70 changes: 38 additions & 32 deletions bwt/bwt.go
Original file line number Diff line number Diff line change
Expand Up @@ -11,13 +11,13 @@ import (

/*
For the BWT usage, please read the its
method documentation. To understand what it is and how
For the BWT usage, please read the BWT methods
below. To understand what it is and how
it works for either curiosity or maintenance, then read below.
# BWT
BWT Stand for (B)urrow (W)heeler (T)ransform. The BWT aids in
BWT Stands for (B)urrows-(W)heeler (T)ransform. The BWT aids in
text compression and acts as a search index for any arbitrary
sequence of characters. With the BWT and some auxiliary data
structures, we can analyze a sequence in a memory and run time
Expand Down Expand Up @@ -75,8 +75,7 @@ To best way to show this is to rebuild the original sequence
using the F and L columns. We do this by rebuilding the original string
in reverse order starting with the nullChar.
Original string: ______$0
Original string: ______$0
F($0) -> L(a0) -> _____a0$0
F(a0) -> L(n0) -> ____n0a0$0
F(n0) -> L(a1) -> ___a1n0a0$0
Expand All @@ -101,18 +100,18 @@ pattern "ana" in "banana". We can do this by:
2. Find that range of a's, [1, 4)
3. Take the next previous character in the pattern, n
4. Find the rank of n before the range from 2. [0, 1) = 0
5. Find the rank of n in the range from 2. [1, 4) = 1
5. Find the rank of n in the range from 2. [1, 4) = 2
6. Look up the start range of the n's in the F column, 5
7. Add the result from 4 and 5 respectively to form the next
L search range: [5+0, 5+1) = [5, 6)
L search range: [5+0, 5+2) = [5, 7)
8. Take next previous character in the pattern, a
9. Take the rank of "a" before, 0
10. Take the rank of "a" within, 1
9. Take the rank of "a" before position 5, which is 1
10. Take the rank of "a" before position 7, which is 3
11. Lookup the a's in the F column again, but add the results
from 9 and 10 to the start range to get the next search
range = [1+0, 1+1) = [1, 2)
12. That is beginning of out pattern, we sub subtract the end and start
of the search range to get out count, 2-1=1
range = [1+1, 1+3) = [2, 4)
12. That is beginning of our pattern, we sub subtract the end and start
of the search range to get our count, 4-2=2
Another way to look at this is that we are constantly refining our search
range for each character of the pattern we are searching for. Once we
Expand All @@ -123,19 +122,21 @@ has collapsed and we can conclude that there is no matching pattern.
## Suffix Array
For other operations such as Locate and Extract, we need another auxiliary
data structure, the suffix array. Since we could be at multiple points
within the original sequence and at any point within that sequence, we need
some kind of point of reference of where we are. We can do this by storing
the position of each original character for each of the corresponding
characters in the F column. With our banana example:
data structure, the suffix array. Since rows of the BWT can map to any
position within the original sequence, we need some kind of reference as to
which BWT rows map to which positions in the original sequence. We can do this by storing
the positions of each character from the original sequence to each of the corresponding
rows in the BWT column. With our banana example:
F: $0 a0 a1 a2 b0 n0 n1
SA: [6 5 3 1 0 4 2]
If we take our count example for the pattern "ana" above, you'll remember
that our final search range was [1, 2). If we look up 1 in the SA, we'll
find that there is only one offset at position 3 in the original sequence
"banana"
that our final search range was [2, 4). You'll also remember that we counted
2 occurrences of "ana" by subtracting the end of the range from the start, 4-2=2.
If iterate from 2 to 4, we can lookup the corresponding SA entry for the BWT rows 2 and 3.
If we look up 2 in the SA, we'll find that our first offset is at position 3 in the original sequence ban"ana"
If we look up 3 in the SA, we'll find that our second offset is at position 1 in the original sequence b"ana"na
## Notes on Performance
Expand All @@ -145,12 +146,12 @@ int64, that would 8 times the amount of memory of the BWT in its plain text
representation! In the implementation below, we may instead sample the SA
and do additional look ups as needed to find the offsets we need.
Similarly, storing the F and L column as plain text has just doubled the
amount of memory from the original sequence... BWT is used for text
Similarly, storing both the F and L column as plain text would take double the
amount of memory to store the original sequence... BWT is used for text
compression, not expansion! That's why in the below implementation, you
will see other data structures to actually compress the amount of memory
will see other data structures that lower the amount of memory
needed. You will also notice that we can make huge improvements by
compressing sequences like with the F column.
compressing sequences by runs of characters like with the F column.
Instead of:
Expand All @@ -160,22 +161,21 @@ Since F is lexicographically sorted, we can have:
F: {$: [0, 1)}, {a: [1, 4)}, {b: [4, 5)} {n: [5, 7)}
Although these performance enhancements may look different from what is
described above, it is still just an FL mapping at the end of the day- just
with more steps.
Although these performance enhancements may lead to a different implementation to what is
described above, any implementation will just be an LF mapping- just with a few more steps.
NOTE: The above is just to explain what is happening at a high level. Please
reference the implementation below to see how the BWT is actually currently
working
Many of the Idea's come from Ben Langmead.
Many of the Ideas come from Ben Langmead.
He has a whole YouTube playlist about BWT Indexing: https://www.youtube.com/watch?v=5G2Db41pSHE&list=PL2mpR0RYFQsADmYpW2YWBrXJZ_6EL_3nu
*/

const nullChar = "$"

// BWT Burrow Wheeler Transform
// BWT Burrows-Wheeler Transform
// Compresses and Indexes a given sequence so that it can be
// be used for search, alignment, and text extraction. This is
// useful for sequences so large that it would be beneficial
Expand Down Expand Up @@ -269,6 +269,8 @@ func (bwt BWT) Len() int {

// getFCharPosFromOriginalSequenceCharPos looks up mapping from the original position
// of the sequence to its corresponding position in the First Column of the BWT
// NOTE: This clearly isn't ideal. Instead of improving this implementation, this will be replaced with
// something like r-index in the near future.
func (bwt BWT) getFCharPosFromOriginalSequenceCharPos(originalPos int) int {
for i := range bwt.suffixArray {
if bwt.suffixArray[i] == originalPos {
Expand Down Expand Up @@ -407,9 +409,13 @@ func buildSkipList(prefixArray []string) []skipEntry {
return skipList
}

// getBWTIndex returns the position of the character from the sequence used to build the BWT
// that corresponds the last character that would exist in the entry of the prefixArray that
// would be the last character if we were actually doing full rotations
// getBWTIndex helps us calculate the corresponding character that would
// be in the L column without having to rotate the full string.
// For example:
// Original string: banana$
// Rotation: ana$___
// Position: 7-4-1= 2
// Original[3]: n
func getBWTIndex(lenOfSequenceBeingBuilt, lenOfSuffixArrayVisited int) int {
bwtCharIndex := lenOfSequenceBeingBuilt - lenOfSuffixArrayVisited - 1
if bwtCharIndex == -1 {
Expand Down

0 comments on commit 6269764

Please sign in to comment.