diff --git a/README.md b/README.md index 13b6c4469..b62c77df2 100644 --- a/README.md +++ b/README.md @@ -16,101 +16,35 @@ pgvecto.rs is a Postgres extension that provides vector similarity search functi - 🥅 **Filtering**: pgvecto.rs supports filtering. You can set conditions when searching or retrieving points. This is the missing feature of other postgres extensions. - 🚀 **High Performance**: pgvecto.rs is designed to provide significant improvements compared to existing Postgres extensions. Benchmarks have shown that its HNSW index can deliver search performance up to 20 times faster than other indexes like ivfflat. - 🔧 **Extensible**: pgvecto.rs is designed to be extensible. It is easy to add new index structures and search algorithms. This flexibility ensures that pgvecto.rs can adapt to emerging vector search algorithms and meet diverse performance needs. -- 🦀 **Rewrite in Rust**: Rust's strict compile-time checks ensure memory safety, reducing the risk of bugs and security issues commonly associated with C extensions. +- 🦀 **Rewrite in Rust**: Rust's strict compile-time checks ensure memory safety, reducing the risk of bugs and security issues commonly associated with C extensions. - 🙋 **Community Driven**: We encourage community involvement and contributions, fostering innovation and continuous improvement. -## Installation - -### Try with docker - -We have prebuild image at [tensorchord/pgvecto-rs](https://hub.docker.com/r/tensorchord/pgvecto-rs). You can try it with - -``` -docker run --name pgvecto-rs-demo -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 -d tensorchord/pgvecto-rs:latest -``` - -To acheive full performance, please mount the volume to pg data directory by adding the option like `-v $PWD/pgdata:/var/lib/postgresql/data` - -Reference: https://hub.docker.com/_/postgres/. - -
- Build from source - -### Install Rust and base dependency - -```sh -sudo apt install -y build-essential libpq-dev libssl-dev pkg-config gcc libreadline-dev flex bison libxml2-dev libxslt-dev libxml2-utils xsltproc zlib1g-dev ccache clang git -curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -``` - -### Clone the Repository - -```sh -git clone https://github.com/tensorchord/pgvecto.rs.git -cd pgvecto.rs -``` - -### Install Postgresql and pgrx - -```sh -sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list' -wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add - -sudo apt-get update -sudo apt-get -y install libpq-dev postgresql-15 postgresql-server-dev-15 -cargo install cargo-pgrx --git https://github.com/tensorchord/pgrx.git --rev $(cat Cargo.toml | grep "pgrx =" | awk -F'rev = "' '{print $2}' | cut -d'"' -f1) -cargo pgrx init --pg15=/usr/lib/postgresql/15/bin/pg_config -``` - -### Install pgvecto.rs - -```sh -cargo pgrx install --release -``` - -Configure your PostgreSQL by modifying the `shared_preload_libraries` to include `vectors.so`. - -```sh -psql -U postgres -c 'ALTER SYSTEM SET shared_preload_libraries = "vectors.so"' -``` - -You need restart the PostgreSQL cluster. - -```sh -sudo systemctl restart postgresql.service -``` - -
- -
- Install from release
+## Comparison with pgvector
-Download the deb package in the release page, and type `sudo apt install vectors-pg15-*.deb` to install the deb package.
+|                                             | pgvecto.rs                          | pgvector                  |
+| ------------------------------------------- | ----------------------------------- | ------------------------- |
+| Transaction support                         | ✅                                  | ⚠️                        |
+| Sufficient Result with Delete/Update/Filter | ✅                                  | ⚠️                        |
+| Vector Dimension Limit                      | 65535                               | 2000                      |
+| Prefilter on HNSW                           | ✅                                  | ❌                        |
+| Parallel Index build                        | ⚡️ Linearly faster with more cores | 🐌 Only single core used  |
+| Index Persistence                           | mmap file                           | Postgres internal storage |
+| WAL amplification                           | 2x 😃                               | 30x 🧐                    |
-Configure your PostgreSQL by modifying the `shared_preload_libraries` to include `vectors.so`.
+Based on our benchmark, pgvecto.rs can be up to 2x faster than pgvector on HNSW indexes with the same configuration. Read more about the comparison [here](./docs/comparison-pgvector.md).
-```sh
-psql -U postgres -c 'ALTER SYSTEM SET shared_preload_libraries = "vectors.so"'
-```
+## Installation
-You need restart the PostgreSQL cluster.
+We recommend trying pgvecto.rs with our pre-built Docker image by running
-```sh
-sudo systemctl restart postgresql.service
+```bash
+docker run --name pgvecto-rs-demo -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 -d tensorchord/pgvecto-rs:latest
```
-
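Once the container is running, you can connect (for example with `psql -h localhost -p 5432 -U postgres`, matching the password set above) and enable the extension. These statements mirror the ones in docs/install.md:

```sql
-- Enable the extension (run inside a psql session)
DROP EXTENSION IF EXISTS vectors;
CREATE EXTENSION vectors;
```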
-
-
-Connect to the database and enable the extension.
-
-```sql
-DROP EXTENSION IF EXISTS vectors;
-CREATE EXTENSION vectors;
-```
+For more installation methods (binary install or install from source), read [docs/install.md](./docs/install.md)

## Get started
-
Run the following SQL to ensure the extension is enabled

```sql
@@ -223,6 +157,10 @@ We planning to support more index types ([issue here](https://github.com/tensorc

Welcome to contribute if you are also interested!

+## Why not a specialized vector database?
+
+Read our blog at [modelz.ai/blog/pgvector](https://modelz.ai/blog/pgvector)
+
## Reference

### `vector` type
@@ -237,55 +175,35 @@ There is only one exception: indexes cannot be created on columns without specif

We utilize TOML syntax to express the index's configuration. Here's what each key in the configuration signifies:

-| Key                    | Type    | Description |
-| ---------------------- | ------- | --------------------------------------------------------------------------------------------------------------------- |
-| capacity               | integer | The index's capacity. The value should be greater than the number of rows in your table. |
-| vectors                | table   | Configuration of background process vector storage. |
-| vectors.memmap         | string  | (Optional) `ram` ensures that the vectors always stays in memory while `disk` suggests otherwise. |
-| algorithm.ivf          | table   | If this table is set, the IVF algorithm will be used for the index. |
-| algorithm.ivf.memmap   | string  | (Optional) `ram` ensures that the persisent part of algorithm always stays in memory while `disk` suggests otherwise. |
-| algorithm.ivf.nlist    | integer | Number of cluster units. |
-| algorithm.ivf.nprobe   | integer | Number of units to query. |
-| algorithm.hnsw         | table   | If this table is set, the HNSW algorithm will be used for the index. |
-| algorithm.hnsw.memmap  | string  | (Optional) `ram` ensures that the persisent part of algorithm always stays in memory while `disk` suggests otherwise. 
| -| algorithm.hnsw.m | integer | (Optional) Maximum degree of the node. | -| algorithm.hnsw.ef | integer | (Optional) Search scope in building. | - -## Limitations -- The index is constructed and persisted using a memory map file (mmap) instead of PostgreSQL's shared buffer. As a result, physical replication or logical replication may not function correctly. Additionally, vector indexes are not automatically loaded when PostgreSQL restarts. To load or unload the index, you can utilize the `vectors_load` and `vectors_unload` commands. -- The filtering process is not yet optimized. To achieve optimal performance, you may need to manually experiment with different strategies. For example, you can try searching without a vector index or implementing post-filtering techniques like the following query: `select * from (select * from items ORDER BY embedding <-> '[3,2,1]' LIMIT 100 ) where category = 1`. This involves using approximate nearest neighbor (ANN) search to obtain enough results and then applying filtering afterwards. - - -## Why not a specialty vector database? - -Imagine this, your existing data is stored in a Postgres database, and you want to use a vector database to do some vector similarity search. You have to move your data from Postgres to the vector database, and you have to maintain two databases at the same time. This is not a good idea. - -Why not just use Postgres to do the vector similarity search? This is the reason why we build pgvecto.rs. The user journey is like this: +| Key | Type | Description | +| --------------------- | ------- | --------------------------------------------------------------------------------------------------------------------- | +| capacity | integer | The index's capacity. The value should be greater than the number of rows in your table. | +| vectors | table | Configuration of background process vector storage. 
|
+| vectors.memmap        | string  | (Optional) `ram` ensures that the vectors always stay in memory while `disk` suggests otherwise. |
+| algorithm.ivf         | table   | If this table is set, the IVF algorithm will be used for the index. |
+| algorithm.ivf.memmap  | string  | (Optional) `ram` ensures that the persistent part of the algorithm always stays in memory while `disk` suggests otherwise. |
+| algorithm.ivf.nlist   | integer | Number of cluster units. |
+| algorithm.ivf.nprobe  | integer | Number of units to query. |
+| algorithm.hnsw        | table   | If this table is set, the HNSW algorithm will be used for the index. |
+| algorithm.hnsw.memmap | string  | (Optional) `ram` ensures that the persistent part of the algorithm always stays in memory while `disk` suggests otherwise. |
+| algorithm.hnsw.m      | integer | (Optional) Maximum degree of the node. |
+| algorithm.hnsw.ef     | integer | (Optional) Search scope in building. |
+
+You can change the expected number of results (such as `ef_search` in HNSW) using the following SQL.

```sql
--- Update the embedding column for the documents table
-UPDATE documents SET embedding = ai_embedding_vector(content) WHERE length(embedding) = 0;
-
--- Create an index on the embedding column
-CREATE INDEX ON documents USING vectors (embedding l2_ops)
-WITH (options = $$
-capacity = 2097152
-[vectors]
-memmap = "ram"
-[algorithm.hnsw]
-memmap = "ram"
-m = 32
-ef = 256
-$$);
-
--- Query the similar embeddings
-SELECT * FROM documents ORDER BY embedding <-> ai_embedding_vector('hello world') LIMIT 5;
+--- (Optional) Expected number of candidates returned by index
+SET vectors.k = 32;
+--- Or use local to set the value for the current session
+SET LOCAL vectors.k = 32;
```
-From [SingleStore DB Blog](https://www.singlestore.com/blog/why-your-vector-database-should-not-be-a-vector-database/):
-> Vectors and vector search are a data type and query processing approach, not a foundation for a new way of processing data. 
Using a specialty vector database (SVDB) will lead to the usual problems we see (and solve) again and again with our customers who use multiple specialty systems: redundant data, excessive data movement, lack of agreement on data values among distributed components, extra labor expense for specialized skills, extra licensing costs, limited query language power, programmability and extensibility, limited tool integration, and poor data integrity and availability compared with a true DBMS.
+## Limitations
+- The index is constructed and persisted using a memory map file (mmap) instead of PostgreSQL's shared buffer. As a result, physical replication or logical replication may not function correctly. Additionally, vector indexes are not automatically loaded when PostgreSQL restarts. To load or unload the index, you can utilize the `vectors_load` and `vectors_unload` commands.
+- The filtering process is not yet optimized. To achieve optimal performance, you may need to manually experiment with different strategies. For example, you can try searching without a vector index or implementing post-filtering techniques like the following query: `select * from (select * from items ORDER BY embedding <-> '[3,2,1]' LIMIT 100 ) where category = 1`. This involves using approximate nearest neighbor (ANN) search to obtain enough results and then applying filtering afterwards.

## Setting up the development environment
@@ -294,7 +212,7 @@ You could use [envd](https://github.com/tensorchord/envd) to set up the developm

```sh
pip install envd
envd up
```

## Contributing

diff --git a/docs/comparison-pgvector.md b/docs/comparison-pgvector.md
new file mode 100644
index 000000000..ae21317af
--- /dev/null
+++ b/docs/comparison-pgvector.md
@@ -0,0 +1,51 @@
+# Comparison with pgvector
+
+## Delete and Transaction Support (Dead tuple problem)
+
+In the HNSW index, the `ef_search` parameter controls the number of candidates returned by the index. 
However, pgvector may not provide sufficient results in cases where data is updated or deleted. This occurs because pgvector does not handle invisible tuples, which can arise when data has already been deleted or is part of an uncommitted transaction. Consequently, pgvector may fail to return the desired number of results specified by `ef_search`, resulting in poorer recall rates and potentially affecting application performance.
+
+In contrast, pgvecto.rs resolves this problem by checking tuple visibility during traversal and consistently returning the specified number of candidates. This means pgvecto.rs fully supports ACID transactions and allows users to utilize it in any scenario without sacrificing performance or incurring additional operational overhead.
+
+We've conducted a straightforward [experiment](https://gist.github.com/VoVAllen/a83d2ee4b56a2a152019d768926f1a40) involving the insertion of 5000 vectors, followed by another 5000 vectors in an uncommitted transaction. When querying with `ef_search` set to 32, it was expected to return 32 results. However, pgvector only returned 17 results because it improperly skipped invisible tuples during traversal. A related issue can be found at https://github.com/pgvector/pgvector/issues/244
+
+
+## Vector Dimension Limit
+
+Another key advantage of pgvecto.rs over pgvector is its almost unlimited vector dimension support (65535) versus pgvector's cap of 2000 dimensions. Workloads may occasionally need to exceed that limit to achieve better results. The cap also prevents us from conducting further tests with vector-db-benchmark on larger vectors, such as those with 2048 dimensions.
+
+## Prefilter support
+
+pgvecto.rs implements conditional filtering using a pre-filtering approach optimized for HNSW indexes. 
When a conditional filter is applied, a vector similarity search proceeds through the HNSW index to locate potential matching candidates based on the index topology. As candidates are identified, they are checked against the filter criteria by leveraging PostgreSQL's native indexing capabilities before being added to the result set. Only candidates that satisfy the filters are included. This allows pruning the raw HNSW results on-the-fly based on the specified filters. Candidates that don't meet the conditions are excluded without needing an explicit allow-list.
+
+The search still adheres to the normal HNSW exit conditions, concluding when limits are hit and results no longer improve. By evaluating filters in real time using PostgreSQL indexes during the HNSW traversal, pgvecto.rs combines the speed of approximate search with the precision of exact conditional logic.
+
+Without pre-filtering support, pgvector's HNSW search with filtering is executed in the post-filtering pattern. This pattern returns a certain number of candidates and then applies the filter condition to them. However, it is difficult to determine how many candidates should be selected in the first step, which can lead to lower precision in the final results.
+
+We ran experiments on the LAION dataset with vector-db-benchmark. pgvecto.rs shows up to a 2x speedup when precision > 90%, and reaches higher precision levels that pgvector cannot achieve due to the limit on HNSW's `ef_search` parameter.
+![Filter benchmark](./images/filter-benchmark.png)
+
+## Index build and persistence, and WAL amplification
+
+pgvector:
+- Utilizes PostgreSQL's buffer and page storage for the index.
+- The index build process cannot be parallelized because HNSW is originally designed as an in-memory structure. Inserting new points results in significant changes across memory, and using Postgres pages for this would require too much buffer locking and raise errors in Postgres.
+- There is also a WAL amplification issue for the same reason. 
When inserting 100k vectors with 100 dimensions, pgvector uses 45MB for the data and 279MB for the index, but generates 1216MB of write-ahead logs, about 30x write amplification on the index.
+- The advantage of this implementation is that it seamlessly integrates with the Postgres ecosystem, allowing for out-of-the-box compatibility with logical replication. However, it's important to note that pgvector's WAL amplification can also impact the logical replication process.
+
+pgvecto.rs:
+- We initially attempted to use the page storage feature of PostgreSQL, but encountered the parallel-build and WAL-amplification issues mentioned above. As a result, we decided to utilize mmap for storing the index outside of PostgreSQL's storage system. We believe this approach will provide users with better experiences and allow us to iterate more quickly with various algorithms.
+- pgvecto.rs doesn't have the WAL amplification problem. When inserting 100k vectors with 100 dimensions, it uses 42MB for the index data and generates 42MB of write-ahead logs, less than 2x write amplification on the index.
+- The drawback is that implementing logical replication in PostgreSQL requires additional effort. We have plans to implement it in the future.
+
+Based on our testing of the `gist-960-euclidean` dataset on an 8-core, 32GB instance, with 1 million 960-dimensional vectors, it took 11,640 seconds to build the index using pgvector. pgvecto.rs only took 1,500 seconds, about an 8x speedup. With a larger machine, pgvecto.rs can accelerate further by utilizing all cores, whereas pgvector cannot.
+
+
+## Performance
+
+We opted for vector-db-benchmark over ann-benchmark because the latter is primarily intended for testing various algorithms rather than real-world database scenarios. ann-benchmark also only allows the algorithm to run on a single core and does not test throughput.
+
+Our tests were conducted on an 8-core, 32GB instance. 
We installed PgBouncer in front of a Postgres instance, with separate installations of pgvector and pgvecto.rs. For this test, we used the official Docker image of pgvector version 0.5.0.
+
+Due to pgvector's vector size limit of 2000, we could only test on part of the datasets in vector-db-benchmark. pgvecto.rs shows better results in both precision and speed compared to pgvector, with up to a 2x speedup when precision > 90%. Here are the results on gist-960-euclidean data, with m=16 and ef_construction=40 for both extensions.
+
+![benchmark](./images/299de17f-edaa-43af-8353-c6d0785b643f.jpeg)
\ No newline at end of file
diff --git a/docs/comparison-with-specialized-vectordb.md b/docs/comparison-with-specialized-vectordb.md
new file mode 100644
index 000000000..51f35884f
--- /dev/null
+++ b/docs/comparison-with-specialized-vectordb.md
@@ -0,0 +1,32 @@
+
+# Why not a specialty vector database?
+
+Read our complete blog at [modelz.ai/blog/pgvector](https://modelz.ai/blog/pgvector)
+
+Imagine this: your existing data is stored in a Postgres database, and you want to use a vector database to do some vector similarity search. You have to move your data from Postgres to the vector database, and you have to maintain two databases at the same time. This is not a good idea.
+
+Why not just use Postgres to do the vector similarity search? This is the reason why we built pgvecto.rs. 
The user journey is like this: + +```sql +-- Update the embedding column for the documents table +UPDATE documents SET embedding = ai_embedding_vector(content) WHERE length(embedding) = 0; + +-- Create an index on the embedding column +CREATE INDEX ON documents USING vectors (embedding l2_ops) +WITH (options = $$ +capacity = 2097152 +[vectors] +memmap = "ram" +[algorithm.hnsw] +memmap = "ram" +m = 32 +ef = 256 +$$); + +-- Query the similar embeddings +SELECT * FROM documents ORDER BY embedding <-> ai_embedding_vector('hello world') LIMIT 5; +``` + +From [SingleStore DB Blog](https://www.singlestore.com/blog/why-your-vector-database-should-not-be-a-vector-database/): + +> Vectors and vector search are a data type and query processing approach, not a foundation for a new way of processing data. Using a specialty vector database (SVDB) will lead to the usual problems we see (and solve) again and again with our customers who use multiple specialty systems: redundant data, excessive data movement, lack of agreement on data values among distributed components, extra labor expense for specialized skills, extra licensing costs, limited query language power, programmability and extensibility, limited tool integration, and poor data integrity and availability compared with a true DBMS. 
diff --git a/docs/images/299de17f-edaa-43af-8353-c6d0785b643f.jpeg b/docs/images/299de17f-edaa-43af-8353-c6d0785b643f.jpeg
new file mode 100644
index 000000000..a0a57cd89
Binary files /dev/null and b/docs/images/299de17f-edaa-43af-8353-c6d0785b643f.jpeg differ
diff --git a/docs/images/filter-benchmark.png b/docs/images/filter-benchmark.png
new file mode 100644
index 000000000..c0d327e1a
Binary files /dev/null and b/docs/images/filter-benchmark.png differ
diff --git a/docs/install.md b/docs/install.md
new file mode 100644
index 000000000..031846fbc
--- /dev/null
+++ b/docs/install.md
@@ -0,0 +1,88 @@
+
+# Installation
+
+## Try with docker
+
+We have a prebuilt image at [tensorchord/pgvecto-rs](https://hub.docker.com/r/tensorchord/pgvecto-rs). You can try it with
+
+```
+docker run --name pgvecto-rs-demo -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 -d tensorchord/pgvecto-rs:latest
+```
+
+To achieve full performance, please mount a volume to the PG data directory by adding an option like `-v $PWD/pgdata:/var/lib/postgresql/data`
+
+Reference: https://hub.docker.com/_/postgres/.
+
+
- Build from source
+
+## Install Rust and base dependencies
+
+```sh
+sudo apt install -y build-essential libpq-dev libssl-dev pkg-config gcc libreadline-dev flex bison libxml2-dev libxslt-dev libxml2-utils xsltproc zlib1g-dev ccache clang git
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+```
+
+## Clone the Repository
+
+```sh
+git clone https://github.com/tensorchord/pgvecto.rs.git
+cd pgvecto.rs
+```
+
+## Install PostgreSQL and pgrx
+
+```sh
+sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
+wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
+sudo apt-get update
+sudo apt-get -y install libpq-dev postgresql-15 postgresql-server-dev-15
+cargo install cargo-pgrx --git https://github.com/tensorchord/pgrx.git --rev $(cat Cargo.toml | grep "pgrx =" | awk -F'rev = "' '{print $2}' | cut -d'"' -f1)
+cargo pgrx init --pg15=/usr/lib/postgresql/15/bin/pg_config
+```
+
+## Install pgvecto.rs
+
+```sh
+cargo pgrx install --release
+```
+
+Configure your PostgreSQL by modifying the `shared_preload_libraries` to include `vectors.so`.
+
+```sh
+psql -U postgres -c 'ALTER SYSTEM SET shared_preload_libraries = "vectors.so"'
+```
+
+You need to restart the PostgreSQL cluster.
+
+```sh
+sudo systemctl restart postgresql.service
+```
+
+
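After the restart, you can confirm that the library was picked up. `SHOW` is standard PostgreSQL, so this check should work as-is:

```sql
-- The output should include vectors.so
SHOW shared_preload_libraries;
```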
+ +
- Install from release
+
+Download the deb package from the release page, and run `sudo apt install vectors-pg15-*.deb` to install it.
+
+Configure your PostgreSQL by modifying the `shared_preload_libraries` to include `vectors.so`.
+
+```sh
+psql -U postgres -c 'ALTER SYSTEM SET shared_preload_libraries = "vectors.so"'
+```
+
+You need to restart the PostgreSQL cluster.
+
+```sh
+sudo systemctl restart postgresql.service
+```
+
+
+ +Connect to the database and enable the extension. + +```sql +DROP EXTENSION IF EXISTS vectors; +CREATE EXTENSION vectors; +```
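As a quick smoke test after enabling the extension, you can create a small table and run a nearest-neighbor query. The table name, dimension, and values below are illustrative, but the `<->` (L2 distance) operator matches its use elsewhere in these docs:

```sql
-- Illustrative smoke test: a table with a 3-dimensional vector column
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3) NOT NULL);
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
-- Nearest neighbors by L2 distance
SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 5;
```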