A list of frameworks, libraries, resources, and shiny things. Inspired by awesome-... stuff. The most frequently used or well-known items are not listed here; for those, refer to the awesome series: Awesome Big Data by Onur Akpolat and The Big-Data Ecosystem Table by Andrea Mostosi.
- Storage Design and Data Structures
- Distributed Infrastructure for Cloud---Database and Storage
- Distributed Infrastructure for Cloud---Application
- Distributed Infrastructure for Cloud---A(AI)B(BigData)C(Cloud)
- Concurrency
- System Performance and Profiling
- Search Engine and Information Retrieval
- Db-readings - Readings in Databases .
- Bitvector - A C++ container-like data structure for storing a vector of bits with fast appending on both sides and fast insertion in the middle, all in succinct space .
- BitSliceIndex - Experiments on bit-slice indexing .
- RoaringBitmap - Roaring Bitmap .
- Pilosa - High performance OLAP based on roaring bitmap .
- Cpp-btree - C++ in-memory containers based on a B-tree data structure.
- Graphillion - Fast, lightweight graphset operation library .
- Emphf - An efficient external-memory algorithm for the construction of minimal perfect hash functions .
- Skipgraph - Implementation of skipgraph on messagepack-rpc .
- Splay Map - STL map implemented with splay tree .
- Cedar - C++ implementation of efficiently-updatable double-array trie .
- WikiSort - Fast and stable sort algorithm that uses O(1) memory. Public domain .
- Annoy - Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk .
- Expgram - An ngram toolkit with succinct storage .
- Cuckoofilter - A Bloom filter replacement for approximated set-membership queries .
- DCF - Dynamic Cuckoo Filter .
- PackedArray - Random access array of tightly packed unsigned integers .
- FrameOfReference - C++ library to pack and unpack vectors of integers having a small range of values using a technique called Frame of Reference .
- FFBF - Feed-forward Bloom filters .
- Concurrent Trees - C++ implementation of concurrent Binary Search Trees .
- Concurrent B-Tree - A working project for High-concurrency B-tree source code in C .
- Palmtree - An implementation of Intel's concurrent B+Tree (Palm Tree) .
- BwTree - An open-source implementation of the Bw-Tree used in SQL Server Hekaton.
- W-TinyLFU - C++11 header-only implementation for Window-TinyLFU Cache .
- Block-graph - A succinct implementation of a block-graph data structure .
- RePair-WaveletTree-Graph - Graph Implementation with repair bitmap compressed WaveletTree .
- RLZ - Contains the RLZ compression and self-index source code .
- Serangequerying - Space-Efficient Structures for Range Querying .
- Succinct - Experimentation with various succinct data structures. Combines previous doc-counter and wavelet-tree repos .
- Sdsl-lite - Succinct Data Structure Library 2.0 .
- Relative-FMIndex - Relative FM-index which is smaller but slower than plain FMIndex.
- GCSA - Generalized Compressed Suffix Array.
- Succinct - A collection of succinct data structures .
- DYNAMIC - Dynamic succinct/compressed data structures .
- DPT - Distributed Patricia Trie .
- Rmq - Implementations of LCA and RMQ data structures from "The LCA Problem Revisited" .
- YuNomi - Compressed Array Library .
- DACs - Directly Addressable Codes (DACs) are a variable-length encoding scheme for integers that enables direct access to any element of the encoded sequence while using compact space.
- Cpi00 - The compressed permuterm index .
- Smbt - Succinct Multibit Tree for similarity search .
- Gwt - Graph-indexing wavelet tree for graph similarity search .
- Webgraphs - Fast and Compact Web Graph Representations .
- Erika-trie - Succinct trie library.
- Path_decomposed_tries - Implementation of the data structures described in the paper "Fast Compressed Tries using Path Decomposition" .
- Sumire-tries - A variety of succinct tries .
- Trie4j - (Succinct) trie implementation in Java .
- SuDS - Succinct Data Structures (SuDS) www.cs.helsinki.fi .
- Marisa-trie - Marisa succinct trie .
- LibCDS - Compact Data Structures Library .
- HSDS - Succinct Data Structure Library Collection including bit-vector/wavelet-matrix/trie .
- BWTIL - BWT Text Indexing Library: a set of tools to work with BWT-based text indexes .
- Bwt-Merge - A tool for merging large BWTs .
- PWT - Parallel Wavelet Tree and Wavelet Matrix Construction .
- PSAC - Parallel Suffix Array, LCP Array, and Suffix Tree Construction .
- R-Index - Optimal space run-length Burrows-Wheeler transform full-text index .
- Fbcsa - Fixed Block based Compact Suffix Array .
- Quantile-Index - Code for "The Quantile Index -- Succinct Self-Index for Top-k Document Retrieval" .
- Gonzalo Navarro - Publications of Gonzalo Navarro .
- Kvtx - Transactions over CAS; see https://docs.google.com/open?id=0B04zCRiCIQGGZDcyNTEwZGQtODk4Yy00NjEwLWI1MjQtYjc3NzJhN2RlNzk0 .
- MemC3 - An in-memory key-value cache based on concurrent cuckoo hashing.
- Libart - Adaptive Radix Trees implemented in C .
- Masstree - Masstree, a fast, multi-core key-value store .
- HyPer - A hybrid online transactional processing (OLTP) and online analytical processing (OLAP) high-performance main memory database system that is optimized for modern hardware .
- HERD - A Highly Efficient key-value system for RDMA .
- Nldb - Nanolat Database supporting 1M transactions per second .
- Sophia - Modern embeddable key-value database designed for a high load environment .
- FOEDUS - Transactional fast optimistic engine optimized for a large number of CPU cores and NVRAM storage (or fast SSD) .
- FastBit_UDF - MySQL UDF for creating, manipulating and querying FastBit indexes .
- Jump Consistent Hash - A Go implementation of the jump consistent hash .
- Content Defined Chunking - High Performance Content Defined Chunking .
- SSD optimizations - Optimizing SSD random IOPS: noop/tpps scheduler, rotational=0, add_random=0.
- Article-SSD - Coding for SSDs - What every programmer should know about solid-state drives .
- Article-Key-Value - Implementing a Key-Value Store .
- Article-MVCC - Implementation of MVCC Transactions for Key-Value Stores .
- Article-SSD - Solid-state revolution: in-depth on how SSDs really work .
- DB Redbook - Readings in Database Systems .
- Cockroach - A Scalable, Geo-Replicated, Transactional Datastore .
- TiDB - Distributed NewSQL database compatible with MySQL protocol .
- ElastiCell - Cloud native key-value store with strong consistency and reliability .
- Yugabyte - Cloud native database with strong consistency and reliability.
- FBase - Cloud native database with strong consistency and reliability by JD.
- Paxosstore - Cloud native key value store with strong consistency and reliability by WeChat.
- Phxqueue - A high-availability, high-throughput and highly reliable distributed queue based on the Paxos algorithm.
- Youzan-nsq - Youzan's modification of nsq to provide cloud native capability from reliability to auto rebalancing.
- Baidu-Elasticsearch - Baidu's modification of elasticsearch to provide strong data consistency and full SQL.
- ClickHouse - Yandex's distributed column store OLAP.
- Palo - Baidu's distributed OLAP based on Google's Mesa paper.
- MapD - MapD OLAP based on GPU.
- ContainerFS - Cloud native distributed filesystem for Kubernetes.
- OpenEBS - Cloud native filesystem for Kubernetes (non-distributed).
- Seaweed-FS - Distributed filesystem for small blob files.
- Ambry - Distributed filesystem for small and large blob files.
- DistributedLog - High performance replicated log service.
- Jepsen - A framework for testing distributed systems that occupies a particular niche of the correctness testing landscape.
- Namazu - Programmable fuzzy scheduler for testing distributed systems.
- GPaxos - Golang Paxos implementation based on Phxpaxos .
- Consensus-Yaraft - C++ Raft implementation based on Etcd's golang raft .
- NOPaxos - Network-Ordered Paxos .
- TAPIR - Building Consistent Transactions with Inconsistent Replication .
- Phat - An implementation of the Chubby lock service protocol in Msgpack RPC .
- Hydra - A distributed data processing and storage system originally developed at AddThis .
- Summingbird - Streaming MapReduce with Scalding and Storm https://twitter.com/summingbird .
- Hustle - A column oriented, embarrassingly distributed relational event database .
- MDCC - Multi-DataCenter Consistency protocol .
- URingPaxos - High throughput atomic multicast protocol .
- Course-CS6452 - Datacenter Networks and Services .
- Pinpoint - Non-intrusive Dapper-like APM solution .
- CAT - APM solution at Dianping Inc .
- Brave - Java version of OpenZipkin .
- Appdash - Golang version of Dapper .
- Jaeger - Golang version of Dapper from Uber.
- Cadence - Microservice workflow orchestrator .
- Zeebe - Microservice workflow orchestrator .
- F-Stack - Network framework with high performance based on DPDK .
- DPVS - High performance Layer-4 load balancer based on DPDK .
- Galaxy - Naive scheduler for Baidu search cluster .
- Cook - Fair job scheduler on Mesos for batch workloads and Spark .
- Kube-arbitrator - Cluster colocation scheduler for Kubernetes .
- BigFlow - Baidu dataflow operator .
- Pulsar - Business level monitor and analysis .
- Cubert - A fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop .
- Embulk - A plugin-based parallel bulk data loader that eases painful data integration work.
- Gobblin - Data ingestion as a service .
- Magpie - Deploying and managing a Hadoop Yarn cluster with Docker Swarm .
- Horovod - Uber's distributed training framework for TensorFlow, providing ring-allreduce based on MPI.
- Angel - Tencent's parameter server infrastructure to support machine learning.
- Ytk-Learn - Yuantiku's distributed machine learning platform.
- Libble - LIBBLE from NJU to provide faster convergence than SGD.
- Gloo - Facebook's communications library with various primitives for multi-machine training.
- xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package (C++, Python, R).
- LASER - A Scalable Response Prediction Platform For Online Advertising .
- Hivemall - Scalable machine learning library for Hive/Hadoop .
- Ml-ease - ADMM based large scale logistic regression .
- Jubatus - Distributed Online Machine Learning Framework .
- Concurrent Queue - A fast multi-producer, multi-consumer lock-free concurrent queue for C++11.
- CAF - An Open Source Implementation of the Actor Model in C++ .
- TAMER - C++ extensions for readable event-driven programming .
- C++React - A reactive programming library for C++11 .
- Libslock - Cross-platform atomic operations and lock algorithm library http://lpd.epfl.ch/site/ssync .
- CDS - Header only C++ Concurrent Data Structures library .
- Libcds - A C++ template library of lock-free and fine-grained algorithms .
- Locksmith - A library for debugging locking in C, C++, or Objective C programs .
- Concurrency-concepts - A guide to concurrency, multi-threading and parallel programming concepts. Explains the differences between every concept, their advantages and disadvantages in detail .
- Concurrency Kit - Concurrency primitives, safe memory reclamation mechanisms and non-blocking data structures for the research, design and implementation of high performance concurrent systems .
- Nanahan - An implementation of Hopscotch hashing for single thread .
- Scalex - Code snippets for the workshop on concurrent data structure implementation .
- CBB - Provides a set of concurrent building blocks (Java & C/C++) that can be used to develop parallel/multi-threaded applications .
- Thrust - A parallel algorithms library which resembles the C++ Standard Template Library (STL) .
- Varon-t - A C implementation of Disruptor queues http://varon-t.readthedocs.org/ .
- Lockfree Queue - Lock-free condition wait for a lock-free multi-producer multi-consumer queue; see http://natsys-lab.blogspot.ru/2013/08/lock-free-condition-wait-for-lock-free.html .
- Ssmem - A simple object-based memory allocator with epoch-based garbage collection; see the publication "Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures".
- CLHT - A very fast and scalable (lock-based and lock-free) hash table that uses cache-line sized buckets .
- Comsat - Comsat lets your application enjoy the scalability of asynchronous web-frameworks, serving many thousands of concurrent long-lived connections, or issuing hundreds of web-service calls for each request, all while maintaining the simple “thread per request” model .
- Quasar-thrift - Quasar fiber based Thrift RPC .
- Seastar - Concurrency library in user space .
- Article-TM - Transactional Memory: History and Development .
- Vmmlib - A templatized C++ vector and matrix math library .
- Blaze-lib - A high performance C++ math library .
- Light-matrix - A Light-weight and Fast Template Matrix Library .
- Light-simd - A light weight library for SIMD based computation .
- MathSimd - SIMD-optimized math library in C++ .
- Opti - Experiment of x86/x64 optimization .
- Fmath - Fast log and exp functions for x86/x64 SSE http://homepage1.nifty.com/herumi/soft/fmath.html .
- Mie - Fast string library with SSE4.2 .
- Libsimdpp - Header-only zero-overhead C++ wrapper for SIMD intrinsics of multiple instruction sets .
- Smart - SMT-aware Real-time scheduler for Linux from Yandex.
- Simple Binary Encoding - Serialization with ultra low latency .
- Farmhash - FarmHash is a successor to CityHash, and includes many of the same tricks and techniques, several of them taken from Austin Appleby’s MurmurHash .
- Proxygen - A collection of C++ HTTP libraries including an easy to use HTTP server .
- Yamail - YMail General Purpose Library .
- WDT - Warp speed Data Transfer (WDT) is an embeddable library (and command line tool) aiming to transfer data between 2 systems as fast as possible over multiple TCP paths.
- UNetStack - Userspace TCP/IP stack .
- CamIO - Userspace IO abstraction .
- Ktap - A lightweight script-based dynamic tracing tool for Linux http://ktap.org .
- Perfbook - Is Parallel Programming Hard, And, If So, What Can You Do About It?
- Article-GC-Java - Garbage Collection Optimization for High-Throughput and Low-Latency Java Applications | LinkedIn Engineering .
- Article-Memory Management - Optimizing Linux Memory Management for Low-latency / High-throughput Databases | LinkedIn Engineering .
- Article-Modern Microprocessors - Modern Microprocessors: A 90-Minute Guide!
- Article-Cache Oblivious Array - Cache oblivious array operations .
- Article-Understanding Memory - Understanding Memory .
- Article-1975 Programming - So what's wrong with 1975 programming? .
- Article-Database Research - Database Research on Modern Computing Architecture .
- Article-Linux Learn From Solaris - What Linux can learn from Solaris performance and vice-versa .
- Brendan D. Gregg - Blog of Brendan D. Gregg .
- Course-CMU 18-645 - How to Write Fast Code .
- ParallelismBook - A book about parallel computing & code optimization .
- Vespa - Production ready search engine to support web-scale data .
- SF1R - A distributed massive data engine for enterprise/vertical search written in C++ .
- BitFunnel - Signature file based search engine from Bing .
- Trinity - Trinity IR toolkit .
- IResearch - IR toolkit to be used for ArangoDB .
- Partitioned_elias_fano - Code used for the experiments in the paper "Partitioned Elias-Fano Indexes" .
- Clustered_Partitioned_elias_fano - Code used for the paper "Clustered Elias-Fano Indexes".
- Data Structures for Inverted Indexes - Optimal Space-Time Tradeoffs for Inverted Indexes .
- Surf - SUccinct Retrieval Framework .
- FastPFor - Fast integer compression .
- Indexing - Experimenting with indexing on GPUs .
- Genie - Generic Inverted Index on GPU .
- Simdcomp - A simple C library for compressing lists of integers .
- SIMDCompressionAndIntersection - A C++ library to compress and intersect sorted lists of integers using SIMD instructions .
- TurboPFor - Fastest Integer Compression .
- Pos-cmp - Comparison framework for positional inverted indexes and self-index supporting phrase queries .
- MaskedVByte - SIMD-accelerated VByte compression; see the publication "Vectorized VByte Decoding".
- Wavelet - Information Retrieval based on Wavelet Tree .
- Shuffla - Search engine using kd-tree .
- RoSA - Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays .
- Dualsorted - Dual sorted inverted index based on Wavelet Tree .
- Treap - Faster and Smaller Inverted Indices with Treaps .
- Gigablast - A distributed open source search engine and spider written in C/C++ for Linux .
- SIMD-Based-Posting-lists - Implementation of Alexander A. Stepanov's inverted index compression algorithms.
- Groonga - Open-source fulltext search engine and column store .
- Atire - A search engine built using the most effective recent research techniques discovered by Information Retrieval researchers around the world .
- Mg4j - Academic search engine with a succinct design (e.g., quasi-succinct indices).
- Argos - A structural data search engine .
- MFRetrieval - Tools for maximum inner product retrieval in recommender systems .
- Faiss - A library for efficient similarity search and clustering of dense vectors .
- Lopq - Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark .