AI+BigData+Cloud Made Easy

A list of frameworks, libraries, resources, and shiny things. Inspired by awesome-... stuff. The most frequently used or well-known items are not listed here; for those, refer to the awesome series: Awesome Big Data by Onur Akpolat and The Big-Data Ecosystem Table by Andrea Mostosi.

Projects

Storage Design and Data Structures
Distributed Infrastructure for Cloud---Database and Storage
Distributed Infrastructure for Cloud---Application
Distributed Infrastructure for Cloud---A(AI)B(BigData)C(Cloud)
Concurrency
System Performance and Profiling
Search Engine and Information Retrieval

Storage Design and Data Structures

Db-readings - Readings in Databases .
Bitvector - A C++ container-like data structure for storing a vector of bits with fast appending on both sides and fast insertion in the middle, all in succinct space .
BitSliceIndex - Experiments on bit-slice indexing .
RoaringBitmap - Roaring Bitmap .
Pilosa - High performance OLAP based on roaring bitmap .
Cpp-btree - C++ in-memory containers based on a B-tree data structure.
Graphillion - Fast, lightweight graphset operation library .
Emphf - An efficient external-memory algorithm for the construction of minimal perfect hash functions .
Skipgraph - Implementation of skipgraph on messagepack-rpc .
Splay Map - STL map implemented with splay tree .
Cedar - C++ implementation of efficiently-updatable double-array trie .
WikiSort - Fast and stable sort algorithm that uses O(1) memory. Public domain .
Annoy - Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk .
Expgram - An ngram toolkit with succinct storage .
Cuckoofilter - A Bloom filter replacement for approximated set-membership queries .
DCF - Dynamic Cuckoo Filter .
PackedArray - Random access array of tightly packed unsigned integers .
FrameOfReference - C++ library to pack and unpack vectors of integers having a small range of values using a technique called Frame of Reference .
FFBF - Feed-forward Bloom filters .
Concurrent Trees - C++ implementation of concurrent Binary Search Trees .
Concurrent B-Tree - A working project for High-concurrency B-tree source code in C .
Palmtree - An implementation of Intel's concurrent B+Tree (Palm Tree) .
BwTree - An open-source implementation of the Bw-Tree used in SQL Server Hekaton .
W-TinyLFU - C++11 header-only implementation for Window-TinyLFU Cache .
Block-graph - A succinct implementation of a block-graph data structure .
RePair-WaveletTree-Graph - Graph Implementation with repair bitmap compressed WaveletTree .
RLZ - Contains the RLZ compression and self-index source code .
Serangequerying - Space-Efficient Structures for Range Querying .
Succinct - Experimentation with various succinct data structures. Combines previous doc-counter and wavelet-tree repos .
Sdsl-lite - Succinct Data Structure Library 2.0 .
Relative-FMIndex - Relative FM-index which is smaller but slower than plain FMIndex.
GCSA - Generalized Compressed Suffix Array.
Succinct - A collection of succinct data structures .
DYNAMIC - Dynamic succinct/compressed data structures .
DPT - Distributed Patricia Trie .
Rmq - Implementations of LCA and RMQ data structures from "The LCA Problem Revisited" .
YuNomi - Compressed Array Library .
DACs - Directly Addressable Codes (DACs) consist in a variable-length encoding scheme for integers that enables direct access to any element of the encoded sequence and obtains compact spaces .
Cpi00 - The compressed permuterm index .
Smbt - Succinct Multibit Tree for similarity search .
Gwt - Graph-indexing wavelet tree for graph similarity search .
Webgraphs - Fast and Compact Web Graph Representations .
Erika-trie - Erika-trie: succinct trie library .
Path_decomposed_tries - Implementation of the data structures described in the paper "Fast Compressed Tries using Path Decomposition" .
Sumire-tries - A variety of succinct tries .
Trie4j - (Succinct) trie implementation in Java .
SuDS - Succinct Data Structures (SuDS) www.cs.helsinki.fi .
Marisa-trie - Marisa succinct trie .
LibCDS - Compact Data Structures Library .
HSDS - Succinct Data Structure Library Collection including bit-vector/wavelet-matrix/trie .
BWTIL - BWT Text Indexing Library: a set of tools to work with BWT-based text indexes .
Bwt-Merge - A tool for merging large BWTs .
PWT - Parallel Wavelet Tree and Wavelet Matrix Construction .
PSAC - Parallel Suffix Array, LCP Array, and Suffix Tree Construction .
R-Index - Optimal space run-length Burrows-Wheeler transform full-text index .
Fbcsa - Fixed Block based Compact Suffix Array .
Quantile-Index - Code for "The Quantile Index -- Succinct Self-Index for Top-k Document Retrieval" .
Gonzalo Navarro - Publications of Gonzalo Navarro .
Kvtx - Transactions over CAS; see https://docs.google.com/open?id=0B04zCRiCIQGGZDcyNTEwZGQtODk4Yy00NjEwLWI1MjQtYjc3NzJhN2RlNzk0 .
MemC3 - An in-memory key-value cache based on concurrent cuckoo hashing.
Libart - Adaptive Radix Trees implemented in C .
Masstree - Masstree, a fast, multi-core key-value store .
HyPer - A hybrid online transactional processing (OLTP) and online analytical processing (OLAP) high-performance main memory database system that is optimized for modern hardware .
HERD - A Highly Efficient key-value system for RDMA .
Nldb - Nanolat Database supporting 1M transactions per second .
Sophia - Modern embeddable key-value database designed for a high load environment .
FOEDUS - Transactional fast optimistic engine optimized for a large number of CPU cores and NVRAM storage (or fast SSD) .
FastBit_UDF - MySQL UDF for creating, manipulating and querying FastBit indexes .
Jump Consistent Hash - A Go implementation of the jump consistent hash .
Content Defined Chunking - High Performance Content Defined Chunking .
SSD optimizations - Optimizing SSD random IOPS: noop/tpps scheduler, rotational=0, add_random=0 .
Article-SSD - Coding for SSDs - What every programmer should know about solid-state drives .
Article-Key-Value - Implementing a Key-Value Store .
Article-MVCC - Implementation of MVCC Transactions for Key-Value Stores .
Article-SSD - Solid-state revolution: in-depth on how SSDs really work .
DB Redbook - Readings in Database Systems .

Distributed Infrastructure for Cloud---Database and Storage

Cockroach - A Scalable, Geo-Replicated, Transactional Datastore .
TiDB - Distributed NewSQL database compatible with MySQL protocol .
ElastiCell - Cloud native key-value store with strong consistency and reliability .
Yugabyte - Cloud native database store with strong consistency and reliability .
FBase - Cloud native database store with strong consistency and reliability by JD.
Paxosstore - Cloud native key value store with strong consistency and reliability by WeChat.
Phxqueue - A high-availability, high-throughput and highly reliable distributed queue based on the Paxos algorithm.
Youzan-nsq - Youzan's modification of nsq to provide cloud native capability from reliability to auto rebalancing.
Baidu-Elasticsearch - Baidu's modification of elasticsearch to provide strong data consistency and full SQL.
ClickHouse - Yandex's distributed column-oriented OLAP database.
Palo - Baidu's distributed OLAP based on Google's Mesa paper.
MapD - MapD OLAP based on GPU.
ContainerFS - Cloud native distributed filesystem for Kubernetes.
OpenEBS - Cloud native filesystem for Kubernetes (non-distributed).
Seaweed-FS - Distributed filesystem for small blob files.
Ambry - Distributed filesystem for small and large blob files.
DistributedLog - High performance replicated log service.
Jepsen - Techniques for testing distributed systems; Jepsen occupies a particular niche of the correctness testing landscape .
Namazu - Programmable fuzzy scheduler for testing distributed systems .
GPaxos - Golang Paxos implementation based on Phxpaxos .
Consensus-Yaraft - C++ Raft implementation based on Etcd's golang raft .
NOPaxos - Network-Ordered Paxos .
TAPIR - Building Consistent Transactions with Inconsistent Replication .
Phat - An implementation of the Chubby lock service protocol in Msgpack RPC .
Hydra - A distributed data processing and storage system originally developed at AddThis .
Summingbird - Streaming MapReduce with Scalding and Storm https://twitter.com/summingbird .
Hustle - A column oriented, embarrassingly distributed relational event database .
MDCC - Multi-DataCenter Consistency protocol .
URingPaxos - High throughput atomic multicast protocol .
Course-CS6452 - Datacenter Networks and Services .

Distributed Infrastructure for Cloud---Application

Pinpoint - Non-intrusive Dapper-like APM solution .
CAT - APM solution at Dianping Inc .
Brave - Java version of OpenZipkin .
Appdash - Golang version of Dapper .
Jaeger - Golang version of Dapper from Uber.
Cadence - Microservice workflow orchestrator .
Zeebe - Microservice workflow orchestrator .
F-Stack - Network framework with high performance based on DPDK .
DPVS - High performance Layer-4 load balancer based on DPDK .

Distributed Infrastructure for Cloud---A(AI)B(BigData)C(Cloud)

Galaxy - Naive scheduler for Baidu search cluster .
Cook - Fair job scheduler on Mesos for batch workloads and Spark .
Kube-arbitrator - Cluster colocation scheduler for Kubernetes .
BigFlow - Baidu dataflow operator .
Pulsar - Business level monitor and analysis .
Cubert - A fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop .
Embulk - A plugin-based parallel bulk data loader that makes painful data integration work easier .
Gobblin - Data ingestion as a service .
Magpie - Deploying and managing a Hadoop Yarn cluster with Docker Swarm .
Horovod - Uber's distributed training framework for TensorFlow, providing ring-allreduce based on MPI.
Angel - Tencent's parameter server infrastructure to support machine learning.
Ytk-Learn - Yuantiku's distributed machine learning platform.
Libble - LIBBLE from NJU to provide faster convergence than SGD.
Gloo - Facebook's communications library with various primitives for multi-machine training.
xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package (C++, Python, R).
LASER - A Scalable Response Prediction Platform For Online Advertising .
Hivemall - Scalable machine learning library for Hive/Hadoop .
Ml-ease - ADMM based large scale logistic regression .
Jubatus - Distributed Online Machine Learning Framework .

Concurrency

Concurrent Queue - A fast multi-producer, multi-consumer lock-free concurrent queue for C++11 .
CAF - An Open Source Implementation of the Actor Model in C++ .
TAMER - C++ extensions for readable event-driven programming .
C++React - A reactive programming library for C++11 .
Libslock - Cross-platform atomic operations and lock algorithm library http://lpd.epfl.ch/site/ssync .
CDS - Header only C++ Concurrent Data Structures library .
Libcds - A C++ template library of lock-free and fine-grained algorithms .
Locksmith - A library for debugging locking in C, C++, or Objective C programs .
Concurrency-concepts - A guide to concurrency, multi-threading and parallel programming concepts. Explains the differences between every concept, their advantages and disadvantages in detail .
Concurrency Kit - Concurrency primitives, safe memory reclamation mechanisms and non-blocking data structures for the research, design and implementation of high performance concurrent systems .
Nanahan - A single-threaded implementation of Hopscotch hashing .
Scalex - Code snippets for the workshop on concurrent data structure implementation .
CBB - Provides a set of concurrent building blocks (Java & C/C++) that can be used to develop parallel/multi-threaded applications .
Thrust - A parallel algorithms library which resembles the C++ Standard Template Library (STL) .
Varon-t - A C implementation of Disruptor queues http://varon-t.readthedocs.org/ .
Lockfree Queue - Lock-free Condition Wait for Lock-free Multi-producer Multi-consumer Queue, see http://natsys-lab.blogspot.ru/2013/08/lock-free-condition-wait-for-lock-free.html .
Ssmem - A simple object-based memory allocator with epoch-based garbage collection, from the publication "Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures" .
CLHT - A very fast and scalable (lock-based and lock-free) hash table that uses cache-line sized buckets .
Comsat - Comsat lets your application enjoy the scalability of asynchronous web-frameworks, serving many thousands of concurrent long-lived connections, or issuing hundreds of web-service calls for each request, all while maintaining the simple “thread per request” model .
Quasar-thrift - Quasar fiber based Thrift RPC .
Seastar - Concurrency library in user space .
Article-TM - Transactional Memory: History and Development .

System Performance and Profiling

Vmmlib - A templatized C++ vector and matrix math library .
Blaze-lib - A high performance C++ math library .
Light-matrix - A Light-weight and Fast Template Matrix Library .
Light-simd - A lightweight library for SIMD-based computation .
MathSimd - SIMD-optimized math library in C++ .
Opti - Experiments in x86/x64 optimization .
Fmath - Fast log and exp functions for x86/x64 SSE http://homepage1.nifty.com/herumi/soft/fmath.html .
Mie - Fast string library with SSE4.2 .
Libsimdpp - Header-only zero-overhead C++ wrapper for SIMD intrinsics of multiple instruction sets .
Smart - SMT-aware Real-time scheduler for Linux from Yandex.
Simple Binary Encoding - Serialization with ultra low latency .
Farmhash - FarmHash is a successor to CityHash, and includes many of the same tricks and techniques, several of them taken from Austin Appleby’s MurmurHash .
Proxygen - A collection of C++ HTTP libraries including an easy to use HTTP server .
Yamail - YMail General Purpose Library .
WDT - Warp speed Data Transfer (WDT) is an embeddable library (and command line tool) aiming to transfer data between 2 systems as fast as possible over multiple TCP paths .
UNetStack - Userspace TCP/IP stack .
CamIO - Userspace IO abstraction .
Ktap - A lightweight script-based dynamic tracing tool for Linux http://ktap.org .
Perfbook - Is Parallel Programming Hard, And, If So, What Can You Do About It?
Article-GC-Java - Garbage Collection Optimization for High-Throughput and Low-Latency Java Applications | LinkedIn Engineering .
Article-Memory Management - Optimizing Linux Memory Management for Low-latency / High-throughput Databases | LinkedIn Engineering .
Article-Modern Microprocessors - Modern Microprocessors: A 90-Minute Guide! .
Article-Cache Oblivious Array - Cache oblivious array operations .
Article-Understanding Memory - Understanding Memory .
Article-1975 Programming - So what's wrong with 1975 programming? .
Article-Database Research - Database Research on Modern Computing Architecture .
Article-Linux Learn From Solaris - What Linux can learn from Solaris performance and vice-versa .
Brendan D. Gregg - Blog of Brendan D. Gregg .
Course-CMU 18-645 - How to Write Fast Code .
ParallelismBook - A book about parallel computing & code optimization .

Search Engine and Information Retrieval

Vespa - Production ready search engine to support web-scale data .
SF1R - A distributed massive data engine for enterprise/vertical search written in C++ .
BitFunnel - Signature file based search engine from Bing .
Trinity - Trinity IR toolkit .
IResearch - IR toolkit to be used for ArangoDB .
Partitioned_elias_fano - Code used for the experiments in the paper "Partitioned Elias-Fano Indexes" .
Clustered_Partitioned_elias_fano - Code used for the paper "Clustered Elias-Fano Indexes" .
Data Structures for Inverted Indexes - Optimal Space-Time Tradeoffs for Inverted Indexes .
Surf - SUccinct Retrieval Framework .
FastPFor - Fast integer compression .
Indexing - Experimenting with indexing on GPUs .
Genie - Generic Inverted Index on GPU .
Simdcomp - A simple C library for compressing lists of integers .
SIMDCompressionAndIntersection - A C++ library to compress and intersect sorted lists of integers using SIMD instructions .
TurboPFor - Fastest Integer Compression .
Pos-cmp - Comparison framework for positional inverted indexes and self-index supporting phrase queries .
MaskedVByte - SIMD-accelerated VByte Compression, Publication "Vectorized VByte Decoding" .
Wavelet - Information Retrieval based on Wavelet Tree .
Shuffla - Search engine using kd-tree .
RoSA - Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays .
Dualsorted - Dual sorted inverted index based on Wavelet Tree .
Treap - Faster and Smaller Inverted Indices with Treaps .
Gigablast - A distributed open source search engine and spider written in C/C++ for Linux .
SIMD-Based-Posting-lists - Implementation of Alexander A. Stepanov's inverted index compression algorithms .
Groonga - Open-source fulltext search engine and column store .
Atire - A search engine built using the most effective recent research techniques discovered by Information Retrieval researchers around the world .
Mg4j - Academic search engine with a succinct design (e.g., quasi-succinct indices) .
Argos - A structural data search engine .
MFRetrieval - Tools for maximum inner product retrieval in recommender systems .
Faiss - A library for efficient similarity search and clustering of dense vectors .
Lopq - Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark .
