Skip to content

Latest commit

 

History

History
271 lines (227 loc) · 23.5 KB

README.md

File metadata and controls

271 lines (227 loc) · 23.5 KB

Beating C with Dyalog APL: wc

Background

A recent blog post that bubbled up on r/programming entitled Beating C With 80 Lines Of Haskell: Wc inspired me to see how effective APL would be in solving this problem.

But Why APL?

Why not? I was told by older students during my university years to avoid APL and any internship or co-op opportunities that used it. Turns out that was bad advice, at least for me--I can see how it might not be other folks' cup of tea, but it is definitely something I like. A few years back I had the opportunity to finally play with an APL derivative K (I got a job at a small K shop). I liked K, but had to move on job wise. I recently downloaded a free personal copy of Dyalog and have been playing around with it while reading through Mastering Dyalog APL by Bernard Legrand; I'm only a few chapters in so far. I find APL friendlier to use than K and enjoy it a great deal, so this seemed like a good excuse to figure out how to do file I/O while searching through the book's PDF and flexing google-fu.

On with the code

Just like Chris Penner's original article, I'm benchmarking against the OSX version of wc that shipped with my machine. Just like in the original article, I admit that there are likely faster versions of wc--I'm just comparing what I got.

Counting Words

Counting characters is trivial. Counting lines is almost as trivial, and becomes as trivial if we count CRLF as two lines--I chose this solution out of laziness. CRLF could be counted as one end of line with an additional three lines (warning: guesstimate).

Just as in the original post, the meat of the problem is in accurately counting words. The first step is counting the number of words in a single string. First, let's make two definitions:

nl←⎕UCS¨10 11 12 13  ⍝ Newline characters
sp←nl,⎕UCS¨9 32  ⍝ Space characters

The above doesn't look like "normal" programming languages because APL doesn't use the standard ASCII character set for its symbols/terms. This means none of its names are stolen from you--if you name a variable new or this or function it won't clash with anything. Note that it does use ASCII symbols like + or = but there are no keywords using ASCII numbers/letters.

An explanation of the above:

  1. denotes a // style comment, it's called lamp and it illuminates.
  2. nl and sp are variables with nl being newline characters and sp whitespace characters. In APL, is assignment. Here we assign everything to the right of nl or sp to the variables nl and sp.
  3. ¨ will create a derived function which will map the function immediately to the left of ¨ across the value to the right. By immediately to the left I mean you read as little as possible to the left to figure out what the lefthand argument--that is, take the simple, immediate value to the left but do not compute anything to the left unless it's an expression in parentheses--but the righthand argument is the result of evaluating the entire expression to the right, and this is how APL works in general--take as little as possible from the left but everything from the right.
  4. ⎕UCS is a function in Dyalog for getting unicode characters from numeric values.
  5. The ravel function , will catenate its lefthand and righthand arguments, so sp contains the newlines in nl along with tab and space.

Now that we know what constitutes a newline or a space we can figure out how to get wc results. Assuming the input is stored in the variable s, we can even give it a concrete value for now: s←'I am the model of a modern major general'. To find where the spaces are we use the membership function with s∊sp, which yields

0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0

telling us the exact position of any whitespace characters. We can negate the vector (turn 1s into 0s and vice versa) and use the partition function to collect the words into an array of strings with (~s∊sp)⊆s (~ is the negation function) and we can count the result using ≢(~s∊sp)⊆s. The way the partition function works is it uses the boolean vector to strip out parts of its righthand argument based on runs of 1 in its lefthand argument (thanks to u/olzd for pointing out when I posted this article originally on r/programming).

We only have one piece of input in the above calculation and so it is easy to turn it into a function. In APL we can make a direct function by using braces (I won't go into the differences in the two function types here but know that the book referenced above goes into some detail). So we can store this calculation in a variable with:

words←{≢(~⍵∊sp)⊆⍵}

In APL (alpha) and (omega) are the names of the left and right arguments to a direct function. We apply a function just like we do the built-in primitive functions; namely, we give it parameters: words s. One thing we can do is make words a bit more general and have it take any boolean vector and count the partition of that vector (we don't need the characters, after all), and this then gives us:

words←{≢(~⍵)⊆⍵}

And so when we apply it to our string we will do it with words s∊sp. Just like with the built-in primitive functions, words will be evaluated on the entire expression to the right which is s∊sp. This yields 9. One thing to keep in mind is testing edge cases and, for the empty string we get 0.

WC via Partition (⊆)

Now that we can count words in a string, we can count words in a file. We could just do this by reading in the entire file all at once and applying the words function and then counting the number of characters and the number of newlines. But that would be a bad idea--what if the file is 123GB in size?! Clearly we don't want to read that in all at once; instead we want to read it in a few blocks at a time. We're going to need a loop, and for that we need another type of function: a procedural function (again, see the referenced book for details).

One computation issue, as mentioned in the original Haskell based blog post, is what happens if we split up the input in the middle of a word? Well, we just use the same technique used in that blog post. No, not monoids (though that is cool), just keeping track if the last character we saw was a nonspace or not. If it was and the first new character is also a nonspace then we subtract one word from the count as we accumulate.

With that in mind, here's the first attempt

res←wc fn;c;w;l;fd;data;blk;nl;sp;lnsp;words
words←{≢(~⍵)⊆⍵}
blk←256×1024
nl←⎕UCS¨10 11 12 13
sp←nl,⎕UCS¨9 32
c←w←l←0
fd←fn ⎕NTIE 0
lnsp←0
:Repeat
    data←⎕NREAD fd 80 blk
    :If 0=≢data
        :Leave
    :EndIf
    c←c+(≢data)
    l←l++/data∊nl
    data←data∊sp
    w←(w+words data)-(lnsp∧~1↑data)
    lnsp←~(¯1)↑data
:EndRepeat
⎕NUNTIE fd
res←l w c

We define the function wc with input fn (filename) and output res (result). The names after fn are the names of variables we want to be local, which happens to be every variable we are using. If we left a variable out, it would be assumed global and be able to change global state. An explanation:

  1. The first line, as just described, declares the function wc with input fn, output res, and a handful of local variables.
  2. words is the word counting function we derived above.
  3. blk is the block size we will use while we read from the file fn.
  4. nl and sp are the newline and whitespace vectors we defined originally.
  5. c, w, and l are character, word, and line counters initially set to 0.
  6. fd is the file descriptor for fn (next available since we passed 0).
  7. lnsp is 1 if the last character read during the previous iteration was not a space, 0 otherwise.
  8. Now we enter a loop with :Repeat, the body of which is everything between :Repeat and :EndRepeat. :Leave is like break in other languages.
  9. data is assigned the next block of data read from fd. 80 tells the ⎕NREAD function that we want to read characters (intuitive, right?). ≢data counts the number of items in data and if this is zero (i.e. data is empty) we exit the loop. Otherwise we add that count to c.
  10. We then use data∊nl to find the newlines in data and sum these up with +/ and add this sum to l.
  11. We no longer need the contents of data; we only need to know where whitespace is, so we store a boolean vector describing this in data. Passing this to words counts the number of whitespace terminated words in data, which we add to w. But then, to account for word splitting, we use 1↑data to get whether the first character we read this iteration was a space and then negate that with ~; we use to and that against whether the last character in the previous iteration was not a space, and subtract that from the count. As in C, 0 is false and so 1∧1 is 1 and we only subtract 1 if the previous iteration's data ended with a nonspace and this iteration's data started with one.
  12. We then store whether this iteration's data terminated with a nonspace character.
  13. Finally we close the file descriptor after the loop and return the line, word, and character counts in res.

Performance Measurements

Using a user defined timing operator we can time the wc function. On a 1.661GB file (which is just the big.txt file from the original Haskell post's repository repeated multiple times), I initially get (on my 2.2 GHz Intel Core i7, 8 GB 1600 MHz DDR3 11-inch early 2k15 MacBook Air) a run of 32.97 seconds and about the same on immediate subsequent runs. Using the time terminal utility to run wc against the same file I get user times ranging from 5.345s to 5.549s, so this first attempt is definitely slow compared to OSX's wc utility. If I comment out the line where w is calculated my initial running time on that same file is 4.32 seconds with subsequent runs hovering around 1.85 seconds, so clearly our word calculation is the culprit here (as was to be expected).

Counting Words with Array Arithmetic

Let's call the vector s∊sp vector f for found: f←s∊sp. We don't want to count only the ones in this vector, because consecutive whitespace characters would each count as a word; what we want to count is when we switch from non-whitespace to whitespace (or vice-versa), and we can accomplish this with the difference between adjacent values. This is easy in APL, it's just (1↓f)-(¯1)↓f:

  1. ¯ is "high minus" and is used to negate literals.
  2. is the drop function which drops a number of items from the value to the right. If the number on the left is positive, it drops that many from the front of the value, if it is negative then it drops the absolute value of that many from the back of the value. So 1↓f is f without the first item and (¯1)↓f is f without the last item.
  3. - is subtraction, so we are subtracting values in f from the values to their left. For this particular f we get the following:
1 ¯1 0 1 ¯1 0 0 1 ¯1 0 0 0 0 1 ¯1 0 1 ¯1 1 ¯1 0 0 0 0 0 1 ¯1 0 0 0 0 1 ¯1 0 0 0 0 0 0

This vector is one element shorter than f--if f were a vector of five items with values 0 1 0 0 1 then we would have calculated (1 0 0 1) - (0 1 0 0) after the drops, which would be 1 ¯1 0 1, a vector of four items. Back to the above result, we can find all the places where 1 appears simply with 1=(1↓f)-(¯1)↓f which compares every element in the vector to the value 1, yielding a vector of the same size as the shorter "difference" vector, and this vector has a 1 where a difference of 1 was found and 0 elsewhere:

1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0

This represents the positions where a space followed a nonspace (the first 1 corresponds to the space between I and am). Summing this up gives us the number of places a word terminated by being followed by a space, and we can get that sum with the function +/. Just as ¨ created a derived function so does / -- both are called operators. As ¨ was associated with the commonly named function map so is / associated with the commonly named function reduce/fold, and so +/ is a sum function.

For +/1=(1↓f)-(¯1)↓f we get 8, but 'I am the model of a modern major general' has 9 words, not 8. The problem is that we were only counting words terminated by a space, not by the end of the string. To count that, we must check if the last character is a space or not and if it's not then we can count the end of the string as terminating a word. We can get the whether the last character was a space or not with (¯1)↑f, which will be 1 if s ended with a space or 0 if not. Since we want to add exactly one word to the count when a string does not end with a space we negate this with ~ and add that. Thus the word count expression is:

(~(¯1)↑f)++/1=(1↓f)-(¯1)↓f

Turning this into a function yields:

words←{(~(¯1)↑⍵)++/1=(1↓⍵)-(¯1)↓⍵}

One problem is that the calculation is wrong for the empty string: words ''∊sp gives 1 not 0. This won't be an issue for us because we won't even be invoking it on empty strings.

WC via Array Arithmetic

Here's what our wc function looks like with this change (only the second line changed):

res←wc fn;c;w;l;fd;data;blk;nl;sp;lnsp;words
words←{(~(¯1)↑⍵)++/1=(1↓⍵)-(¯1)↓⍵}
blk←256×1024
nl←⎕UCS¨10 11 12 13
sp←nl,⎕UCS¨9 32
c←w←l←0
fd←fn ⎕NTIE 0
lnsp←0
:Repeat
    data←⎕NREAD fd 80 blk
    :If 0=≢data
        :Leave
    :EndIf
    c←c+(≢data)
    l←l++/data∊nl
    data←data∊sp
    w←(w+words data)-(lnsp∧~1↑data)
    lnsp←~(¯1)↑data
:EndRepeat
⎕NUNTIE fd
res←l w c

Performance Measurements

On the same 1.661GB file as before I initially get a run of 3.36 seconds, but then 2.75s, 2.34s, 2.36s on immediate subsequent runs now that the computer is pumped and primed. Using the time terminal utility to run wc against the same file I (again) get user times ranging from 5.345s to 5.549s, so this attempt is faster and we are done. I get similarly scaled differences for smaller files.

Counting Words with a Windowed Reduction

Eventually this article came to the attention of Marshall Lochbaum at Dyalog who pointed out that we can use a windowed reduction to more efficiently count words. A windowed reduction involves a reduction applied to a sliding window across a vector. If we have the expression ⍳9, which yields the vector 1 2 3 4 5 6 7 8 9, then +/⍳9 gives 45 as expected. If we do 1+/⍳9 however we just get back the same vector (as + was reduced across just each item by itself). 2+/⍳9 yields 3 5 7 9 11 13 15 17 and 6+/⍳9 yields 21 27 33 39.

WC via Windowed Reduction

Whereas before I was using 1=(1↓⍵)-(¯1)↓⍵ to try and find where we went from nonspace to space we could instead use 2</⍵, which will look at any two adjacent elements in and compare them with <. The only change we need in the above wc function is to have

words←{(~(¯1)↑⍵)++/2</⍵}

Performance Measurements

As far as performance, for this new version I get an initial run of 3.33 seconds followed by runs hovering around 1.85s, so a significant improvement that also taught me something (and maybe I'll even keep in touch with someone from Dyalog).

Splitting It All Up

We can split our procedural function up into a few direct functions and an operator, which might make it easier to understand and maintain (or maybe not). We start with a function computing the wc stats on a string together with whether the string's first and last characters are not spaces:

blockCount←{
    l←+/⍵∊⎕UCS¨10 11 12 13  ⍝ #lines
    s←⍵∊⎕UCS¨9 10 11 12 13 32  ⍝ is space
    w←(~(¯1)↑s)++/1=(1↓s)-(¯1)↓s  ⍝ words
    ⍝ first-nonspace, chars, words, lines, last-nonspace
    (~1↑s)(≢⍵)w l(~(¯1)↑s)
}
  1. We directly use the vector of newlines, passing ⎕UCS¨10 11 12 13 as the righthand argument to the membership function . The lefthand argument is the righthand argument passed to blockCount.
  2. s is the boolean vector telling us where spaces (⎕UCS¨9 10 11 12 13 32) occur in the righthand argument to blockCount.
  3. From s we calculate the words in blockCount's righthand argument directly, just as we did before.
  4. Finally we return whether the first character is not a space, the number of characters, the number of words, the number of lines, and whether the last character is not a space.

We can apply blockCount to multiple strings. Just like the original Haskell version, we need some way to combine the results of applying blockCount to sequential strings from a file, which we do with the following:

blockCollapse←{
    adjustment←0 0(-+/(1↓⍵[;1])∧(¯1)↓⍵[;5])0 0
    collapsed←(⍵[1;1]),(+⌿⍵)[2 3 4],(⍵[≢⍵;5])
    adjustment+collapsed
}

The assumption here is that the results from blockCount are stored in a matrix where each row is a result, with the values as described as above (whether the first character is a nonspace, etc.). When we collapse a sequence of blockCount results we want a result of the same form as blockCount. This will be the following:

  1. Whether the first result in the matrix represented a string starting with a nonspace--that value is ⍵[1;1].
  2. The sum of all the characters from all the results.
  3. The sum of all the words from the results minus the number of times a word was split.
  4. The sum of all the lines from all the results.
  5. Whether the last result in the matrix represented a string ending with a nonspace--that value is ⍵[≢⍵;5].

The sums in the middle (items 2, 3, and 4) can be calculated if we reduce addition down the input matrix. (+⌿⍵) will calculate the total sum, and (+⌿⍵)[2 3 4] extracts the middle three values (we don't need the sums of whether the strings started or ended with nonspace characters).

The value we have to subtract--the number of times a word was split--is calculated by taking the sum of whether a string started with a nonspace anded with whether the string before ended with one. That is the value +/(1↓⍵[;1])∧(¯1)↓⍵[;5]). (1↓⍵[;1]) is the first column of our results matrix (which represents whether strings started with a nonspace) with the value from the first result dropped; (¯1)↓⍵[;5] is the last column (representing whether strings ended with a nonspace) with the value from the last result dropped. We and these together with and sum that with +/ to get the number of times we split a word; prepending - negates the value and placing it in the middle of a vector of four zeroes lets us just add it to our calculated result, adjusting the word count appropriately.

Now we add a user defined operator to handle looping over reading a file. The below does this:

v←(f rc)fd;data
v←⍬
:Repeat
    data←⎕NREAD fd 80(1024×1024)
    :If 0<≢data
        v←f data
    :EndIf
:Until 0=≢data

v is the output of the operator; we need this for technical reasons that I won't go into (feel free to try it with it missing), but the value won't be used. Our operator accepts a computation to use on each string on its left and a file descriptor on its right and loops until it has consumed all the data in the file.

Finally, we can bring this all together in a (possibly) simpler implementation of wc which relies on the computations we've defined. Since we no longer need to use looping or conditionals in wc (all that work is done in rc), we can make wc a direct function:

wc←{
    res←1 5⍴0
    update←{
        val←res,[1]blockCount ⍵
        res∘←val
        val
    }
    fd←⍵ ⎕NTIE 0
    _←update rc fd
    _←⎕NUNTIE fd
    (blockCollapse res)[4 3 2]
}

We start by making res a zero-filled matrix of one row by five column. Then we define a helper function update which will update res using blockCount. ,[1] lets us use ravel along its lefthand argument's first axis so that val will be res with the result from blockCount ⍵ appended as the last row. ∘← lets us update res from within update and we return val.

Just like the built-in operators, we put our operated function on the left of rc, which creates a derived function that accepts the file descriptor fd. rc will iterate over the contents of fd and successively apply update. We have to assign the result (or just use it somehow) because the first unused computation of a direct function is its return value, and we don't want an early return.

Finally, we used blockCollapse to compress the result matrix down into the desired answer and pick out the values we want with [4 3 2].

Space Utilization

The above split up implementation has one drawback--we accumulate the results into a matrix whose size grows linearly as a function of the input file. Granted, it has a great constant factor since we collapse each megabyte down to only five numbers, but still if we run this on a terabyte file we will likely use at least 20MB just for storing the results. That doesn't sound bad, but our earlier version ran in constant space and we can reclaim that here, too.

The main change comes in the rc operator which now looks like this:

res←fd(m rc r)init;data
res←init
:Repeat
    data←⎕NREAD fd 80(1024×1024)
    :If 0<≢data
        res←res r m data
    :EndIf
:Until 0=≢data

We are now operating on two functions--m to map the file data and r to fold new data into an accumulated result--i.e., we are doing a reduce/fold across a transformation of the file's contents. We also now accept the file descriptor fd on the left so that we can easily accept an initial accumulation value on the right.

One thing we won't have to change is blockCount: it is the same as the above. We do have to change blockCollapse, though, and since we no longer need to collapse a whole matrix of results but instead need to add two results together, we can have a much simpler function blockAdd←{(⍺[1],(⍺+⍵)[2 3 4],⍵[5])-(0 0(⍺[5]∧⍵[1])0 0)}.

Since rc is now accumulating results, we don't really need to be updating a result in wc, which now has the following sleek form:

wc←{
    blockAdd←{(⍺[1],(⍺+⍵)[2 3 4],⍵[5])-(0 0(⍺[5]∧⍵[1])0 0)}
    fd←⍵ ⎕NTIE 0
    res←fd blockCount rc blockAdd 5⍴0
    _←⎕NUNTIE fd
    res[4 3 2]
}

We use blockCount to map data blocks in the file to statistics for that block then use blockAdd to combine them, all handled by our operator rc. We now have regained constant memory space (the size of which depends on how much data we read at once; here I chose one megabyte), and it's starting to have the the feel of the monoid solution in the original Haskell article.

Acknowledgements

  1. My long time friend Patrick Beh-Forrest was kind enough to read my post and suggest edits to my writing. I thank him deeply.
  2. My thanks to u/olzd for pointing out the beautiful partition (⊆) solution to counting words in a string.
  3. Thanks to Marshall Lochbaum from Dyalog for pointing out windowed reductions.