---
title: Some small numbers
categories: programming
---

From time to time, a performance challenge comes up, be it a workflow at work or an external one like the 1 Billion Row Challenge: https://www.youtube.com/watch?v=9-S_nZ5gzGE.

And then I remember that, the same way it's "notmuch" mail, it's probably notmuch data anyway. My first heuristic is always to ask whether Excel would be able to handle the amount of data, or whether grep/ripgrep would deal with it.

So I was doing some back-of-the-napkin numbers about CSVs, DuckDB, and table storage, comparing them to plain ripgrep.

Zsh helpers:

```zsh
# Global aliases that let me append a read-loop to any pipeline
alias -g FORI='| while read i ; do '
alias -g IROF='; done '

# 1 million rows of very predictable data: a counter plus a constant string
seq 1000000 FORI
  echo "$i, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" >>table.csv
IROF
```
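
For reference, once the global aliases expand, the snippet above is just an ordinary while-read pipeline (payload shortened here):

```zsh
# What FORI ... IROF expands to
seq 1000000 | while read i ; do
  echo "$i, aaaa" >>table.csv
done
```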
```
361312 -rw-r--r--@   1 rgrau  staff   165M Aug  2 10:11 table.csv
  1048 -rw-r--r--@   1 rgrau  staff   524K Aug  2 18:54 table.ddb
```
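
(The import for this first run isn't shown; presumably it was the same CREATE TABLE used for the second run below, e.g. via the duckdb CLI. A guess, not from the original:)

```zsh
# Hypothetical: how this first table.ddb was presumably produced
duckdb table.ddb -c "CREATE TABLE t1 AS SELECT * FROM 'table.csv';"
```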

So highly optimized backends compress 1 million rows of very predictable data down to basically nothing: 165 MB of CSV becomes 524 KB of DuckDB.
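
If you're curious what DuckDB actually did to the data, recent versions can show the compression chosen per column segment (the pragma and its output format may vary across versions):

```zsh
# Inspect how each column segment was compressed (e.g. RLE/dictionary)
duckdb table.ddb -c "PRAGMA storage_info('t1');"
```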

Again, with less predictable data:

```zsh
rm table.csv table.ddb

# 1 million rows of UUIDs: three $(uuidgen) substitutions per row
seq 1000000 FORI
  echo "$i,$(uuidgen), $(uuidgen), $(uuidgen) " >>table.csv
IROF
```

That took a few hours to generate (!!). No wonder: each row forks three uuidgen processes, so that's around 3 million process spawns.
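
If I were to redo it, a single process generating everything would avoid those forks. A minimal sketch (not what I originally ran), which should finish in well under a minute:

```zsh
# Hypothetical faster rewrite: one process, no per-row forks
python3 -c '
import uuid
with open("table.csv", "w") as f:
    for i in range(1, 1000001):
        f.write(f"{i},{uuid.uuid4()}, {uuid.uuid4()}, {uuid.uuid4()} \n")
'
```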

DuckDB import of the 115 MB file:

```
D .timer on
D create table t1 as SELECT * from 'table.csv';
Run Time (s): real 1.184 user 2.655083 sys 0.068267
```

```
ls -las table.*

262152 -rw-r--r--@   1 rgrau  staff   115M Aug  3 03:38 table.csv
131608 -rw-r--r--@   1 rgrau  staff    64M Aug  5 10:50 table.ddb
```

Not bad. UUIDs are essentially random, so there's much less to squeeze: 115 MB only goes down to 64 MB.

Anyway, how long would it take ripgrep to find the last line of the file?

```
time rg "$(tail -n1 table.csv | cut -f2 -d,)" table.csv

1000001:1e+06,CB834048-DB59-48C2-8F33-CEB377CC0C54, 95CC3936-812D-4AB1-8DB7-F7F3BED7973F, 1B5DD86C-777B-44BB-A404-33F540BE255D

rg "$(tail -n1 table.csv | cut -f2 -d,)" table.csv
0.03s user 0.02s system 67% cpu 0.079 total
```
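
For completeness, the analogous point lookup in DuckDB would look something like this. A sketch, assuming the auto-generated column names DuckDB gives a headerless CSV import (column0, column1, ...); check the actual schema with `DESCRIBE t1` first:

```zsh
# Hypothetical DuckDB equivalent of the rg lookup above;
# column1 is an assumption (default name for the second imported column)
duckdb table.ddb -c \
  "SELECT * FROM t1 WHERE column1 = 'CB834048-DB59-48C2-8F33-CEB377CC0C54';"
```

At a million rows that should also come back more or less instantly, even without an index.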