This is my attempt to solve the popular 1BRC challenge and push Clojure to its (my) limits.
It used to be a 100% Clojure solution, but gradually more and more Java code was introduced. I might come back in the future and go the other way around, rewriting some of the Java code in Clojure while trying to keep performance at its current level. However, for now I am more than satisfied with the project's current state.
I did my best to follow best practices, keeping the code clean, modular, and reasonably high-level while still preserving performance. Still, due to the nature of the challenge, there might be a few cryptic, hard-to-follow places.
Official webpage of the challenge: https://1brc.dev/
Official repo for Java solutions: https://github.com/gunnarmorling/1brc
- 1 Billion Row Challenge in Clojure
Please make sure that you have a JDK installed on your system.
During development, I used Java 21 from Adoptium. Feel free to install any Java distribution, version 21 or newer.
This project relies heavily on the Leiningen build and project management tool. All build processes, both for Clojure and Java, are done with it.
Despite that, you are not required to have Leiningen installed on your system. All commands given in this document use the portable Leiningen scripts included in this repo, which let you use Leiningen without installing it, much like the Maven Wrapper (`mvnw`).
- All coding, testing, and benchmarking were done solely on a Windows machine. Therefore, I have not properly tested the Linux commands.
- All commands provided here are expected to be run from the project's root directory.
For the sake of simplicity and convenience, I have pre-built and included a JAR from the official Java repo.
Hopefully there won't be any license issues with the original project.
The following command will generate a `measurements.txt` file containing 1 billion rows and place it in
the project's root. The generation process might take a few minutes to finish.
java --class-path vendor/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CreateMeasurements 1000000000
On Windows with PowerShell:
$env:LEIN_HOME = (Get-Location).Path + "\.lein"; ./lein.bat uberjar
On Linux with shell:
LEIN_HOME=.lein ./lein uberjar
java -jar target\onebrc-0.1.0-SNAPSHOT-standalone.jar
- Get GraalVM from its official website here: https://www.graalvm.org/
The repo uses GraalVM JDK 21.0.4+8.1, so you should be totally fine with any distribution starting from Java 21.
- Set the `GRAALVM_HOME` environment variable (example commands are shown below).
Here is a more detailed guide: https://graalvm.github.io/native-build-tools/0.9.6/graalvm-setup.html
However, keep in mind that this guide is for older GraalVM versions, when `native-image` required a manual installation. Nowadays, you can completely skip the manual `native-image` installation, as it already comes preinstalled.
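For example, to set the variable for your current shell session (the installation path below is just a placeholder; point it at wherever you unpacked GraalVM):

On Windows with PowerShell:
$env:GRAALVM_HOME = "C:\path\to\graalvm-jdk-21"

On Linux with shell:
export GRAALVM_HOME=/path/to/graalvm-jdk-21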
This amazing Leiningen plugin allows us to easily build a native image. Everything is already pre-configured; all you need to do is run a single one-line command and wait a bit for it to finish, as it takes longer than the uberjar build.
On Windows with PowerShell:
$env:LEIN_HOME = (Get-Location).Path + "\.lein"; ./lein.bat native-image
On Linux with shell:
LEIN_HOME=.lein ./lein native-image
On Windows with PowerShell:
./target/onebrc-0.1.0-SNAPSHOT.exe
On Linux with shell:
./target/onebrc-0.1.0-SNAPSHOT
- ThinkPad E14 Gen.4 (AMD)
- Windows 11 Home (23H2)
- AMD Ryzen 5 5625U (6 cores / 12 threads)
- 16 GB RAM (DDR4-3200)
- Micron 2450 512GB MTFDKCD512TFK (PCIe Gen4 NVMe SSD)
- OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
- GraalVM JDK 21.0.4+8.1
All final benchmarks were done on my personal everyday laptop (ThinkPad E14) with the `hyperfine` utility, in the same way as described in the official Java repo here: Evaluating Results.
> The `hyperfine` program is used for measuring execution times of the launch scripts of all entries, i.e. end-to-end times are measured. Each contender is run five times in a row. The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the results table above. The exact same measurements.txt file is used for evaluating all contenders. See the script evaluate.sh for the exact implementation of the evaluation steps.
In the official description it is said that the programs are run from a RAM disk, which completely removes disk latency from the equation. During all my benchmarks, I ran everything as is, keeping it simple. What is more, I don't have enough RAM to hold a 13 GB text file, so no RAM disk for me. Lastly, I completely ignored the number of cores they used for benchmarking and used all available cores of my machine.
> Programs are run from a RAM disk (i.o. the IO overhead for loading the file from disk is not relevant), using 8 cores of the machine.
All benchmark commands were derived and extracted from the `evaluate.sh` script that can be found in the official Java repo.
hyperfine --warmup 0 --runs 10 --export-json ./bench-baseline-timing.json "java --class-path vendor/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage_baseline"
hyperfine --warmup 0 --runs 10 --export-json ./bench-uberjar-timing.json "java -jar ./target/onebrc-0.1.0-SNAPSHOT-standalone.jar"
On Windows with PowerShell:
hyperfine --warmup 0 --runs 10 --export-json ./bench-graalvm-timing.json ".\target\onebrc-0.1.0-SNAPSHOT.exe"
On Linux with shell:
hyperfine --warmup 0 --runs 10 --export-json ./bench-graalvm-timing.json "./target/onebrc-0.1.0-SNAPSHOT"
| Run | Slowest (s) | Fastest (s) | Average (s) | Performance improvement (× baseline) |
|---|---|---|---|---|
| Baseline | 168.67 | 149.50 | 155.91 | 1 |
| Uberjar | 13.49 | 11.67 | 12.84 | 12 |
| GraalVM | 11.48 | 10.24 | 11.04 | 14 |
The GraalVM executable shows better performance than the uberjar. The difference might not seem significant, only around 1-2 seconds, but it is roughly a 15% performance boost for free.
Not bad, considering that all you have to do is recompile the code with GraalVM as-is and provide a few extra
compilation flags (they can be found in the `project.clj` file).
- Parallel processing with `pmap` (see the first sketch after this list)
- Split and process the file in small chunks
- Memory-mapped chunks (`MappedByteBuffer`)
- Mutable aggregator objects for each chunk
- Work with strings as raw bytes and delay string decoding
- Custom hashmap optimized for raw byte strings
- Cache hashmap key hashes
- Branchless programming techniques
- Unroll the temperature parsing loop
- Parse station names in chunks of 8 and 4 bytes
- Look for separators with bitwise operators and masks (see the second sketch after this list)
- Heavy utilization of `ByteBuffer`
- Reduce memory allocation by reusing the same `ByteBuffer` within a thread to parse station names
- Compilation with GraalVM
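To make some of the bullets above concrete, here is a heavily simplified Clojure sketch of the overall pipeline shape: the file is split into chunk descriptors, each chunk is mapped into memory and processed in parallel with `pmap`, and the per-chunk results are merged at the end. This is an illustrative sketch only, not the code from this repo; the actual parsing (raw bytes, the custom hashmap, mutable aggregators) is elided behind `process-chunk`, and the helper names are made up for the example.

```clojure
(ns sketch.pipeline
  (:import (java.io RandomAccessFile)
           (java.nio.channels FileChannel FileChannel$MapMode)))

(defn chunk-offsets
  "Split `size` bytes into n [offset length] pairs. The real code must also
   shift each boundary forward to the next newline so no line is split."
  [^long size ^long n]
  (let [step (quot size n)]
    (for [i (range n)]
      [(* i step)
       (if (= i (dec n)) (- size (* i step)) step)])))

(defn process-chunk
  "Memory-map one chunk and reduce it into {station stats}. The byte-level
   parsing and the custom hashmap are elided in this sketch."
  [^FileChannel ch [offset length]]
  (let [buf (.map ch FileChannel$MapMode/READ_ONLY offset length)]
    ;; ... walk `buf` byte by byte and aggregate here ...
    {}))

(defn merge-stats
  "Merge two {station [min max sum count]} maps."
  [a b]
  (merge-with (fn [[lo1 hi1 s1 n1] [lo2 hi2 s2 n2]]
                [(min lo1 lo2) (max hi1 hi2) (+ s1 s2) (+ n1 n2)])
              a b))

(defn run [path n-chunks]
  (with-open [raf (RandomAccessFile. ^String path "r")]
    (let [ch (.getChannel raf)]
      (->> (chunk-offsets (.size ch) n-chunks)
           (pmap #(process-chunk ch %))
           (reduce merge-stats {})))))
```

And here is a sketch of the bitwise separator search: read 8 input bytes into a `long` (little-endian), then use the classic SWAR "zero byte" trick to find which byte, if any, equals `;`. Again, this illustrates the general technique rather than this repo's `BitwiseHelpers` code.

```clojure
(def semicolons 0x3B3B3B3B3B3B3B3B)                  ;; ';' (0x3B) in every byte
(def low-ones   0x0101010101010101)
(def high-bits  (unchecked-long 0x8080808080808080)) ;; top bit of every byte

(defn first-semicolon
  "Index (0-7) of the first ';' byte in a little-endian `word`, or 8 if none.
   XOR turns ';' bytes into 0x00; (x - 0x01..) & ~x & 0x80.. then marks the
   lowest zero byte with its top bit."
  ^long [^long word]
  (let [x    (bit-xor word semicolons)
        hits (bit-and (unchecked-subtract x low-ones)
                      (bit-not x)
                      high-bits)]
    (quot (Long/numberOfTrailingZeros hits) 8)))
```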
- Overuse of branchless techniques
- Fully branchless temperature parser
- SIMD with Java Vector API
- Hashmap with double hashing to reduce collisions
- Any simple hashing algorithm other than `djb2`
- Calculate `djb2` hash in chunks of 8 and 4 bytes (a plain `djb2` loop is sketched after this list)
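For context, `djb2` itself is a tiny multiplicative hash: starting from 5381, every byte is folded in as `hash = hash * 33 + byte`. Below is a minimal Clojure sketch of the plain byte-at-a-time variant (the actual implementation in this repo lives in Java); the chunked variant mentioned above folds 8 or 4 bytes per iteration instead.

```clojure
(defn djb2
  "Plain djb2 over bs[from..to): hash = hash * 33 + byte, seeded with 5381."
  ^long [^bytes bs ^long from ^long to]
  (loop [i from
         h 5381]
    (if (< i to)
      (recur (inc i)
             (unchecked-add (unchecked-multiply h 33) (long (aget bs i))))
      h)))

;; Example: (djb2 (.getBytes "Hamburg") 0 7)
```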
- Reimplement `ChunkedFile` in Clojure
  - No noticeable performance hit expected
- Reimplement `BitwiseHelpers` in Clojure
  - No noticeable performance hit expected
  - Some Java classes like `ChunkReader` depend on it, so there is no point in reimplementing it in Clojure
- Reimplement `ChunkReader` in Clojure
  - Potential performance degradation, proceed carefully
  - As expected, it didn't work out well: a 2-3x performance hit, so `ChunkReader` stays as is
- [Optional] Reimplement `Result` in Clojure
  - Might be difficult or non-obvious because of the mutable nature of the `Result` class (a possible `deftype` shape is sketched after this list)
  - Won't do: `Result` is used only by the hashmap implemented in Java, and because the hashmap implementation in Clojure failed, it doesn't make much sense to reimplement `Result`
- [Optional] Challenge myself by implementing a hashmap in Clojure
  - Total fail: reimplementing even a simple `djb2` in Clojure resulted in a two-orders-of-magnitude (!) performance drop, 100-200 times slower. After that I dropped the idea of implementing a hashmap in Clojure completely
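For the record, here is one hypothetical shape a Clojure reimplementation of a mutable `Result`-like aggregate could take: a `deftype` with unsynchronized-mutable primitive fields behind a small interface. This is only a sketch of the approach; the field and method names are made up (it is not the `Result` class from this repo), and temperatures are assumed to be stored as tenths of a degree.

```clojure
;; Hypothetical sketch only; not the Result class from this repo.
(definterface IStationResult
  (^void add [^long temperature])
  (^String summary []))

(deftype StationResult [^:unsynchronized-mutable ^long min-t
                        ^:unsynchronized-mutable ^long max-t
                        ^:unsynchronized-mutable ^long sum
                        ^:unsynchronized-mutable ^long cnt]
  IStationResult
  (add [_ t]
    (set! min-t (min min-t t))
    (set! max-t (max max-t t))
    (set! sum (+ sum t))
    (set! cnt (inc cnt)))
  (summary [_]
    ;; min/mean/max, converting tenths of a degree back to degrees
    (format "%.1f/%.1f/%.1f" (/ min-t 10.0) (/ sum (* cnt 10.0)) (/ max-t 10.0))))

(comment
  (let [r (StationResult. Long/MAX_VALUE Long/MIN_VALUE 0 0)]
    (.add r -123)   ;; -12.3
    (.add r 57)     ;;   5.7
    (.summary r)))  ;; => "-12.3/-3.3/5.7"
```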