# pgvector Open-source vector similarity search for Postgres ```sql CREATE TABLE table (column vector(3)); CREATE INDEX ON table USING ivfflat (column vector_l2_ops); SELECT * FROM table ORDER BY column <-> '[1,2,3]' LIMIT 5; ``` Supports L2 distance, inner product, and cosine distance [![Build Status](https://github.com/pgvector/pgvector/workflows/build/badge.svg?branch=master)](https://github.com/pgvector/pgvector/actions) ## Installation Compile and install the extension (supports Postgres 9.6+) ```sh git clone --branch v0.2.3 https://github.com/pgvector/pgvector.git cd pgvector make make install # may need sudo ``` Then load it in databases where you want to use it ```sql CREATE EXTENSION vector; ``` You can also install it with [Docker](#docker), [Homebrew](#homebrew), or [PGXN](#pgxn) ## Getting Started Create a vector column with 3 dimensions (replace `table` and `column` with non-reserved names) ```sql CREATE TABLE table (column vector(3)); ``` Insert values ```sql INSERT INTO table VALUES ('[1,2,3]'), ('[4,5,6]'); ``` Get the nearest neighbor by L2 distance ```sql SELECT * FROM table ORDER BY column <-> '[3,1,2]' LIMIT 1; ``` Also supports inner product (`<#>`) and cosine distance (`<=>`) Note: `<#>` returns the negative inner product since Postgres only supports `ASC` order index scans on operators ## Indexing Speed up queries with an approximate index. Add an index for each distance function you want to use. L2 distance ```sql CREATE INDEX ON table USING ivfflat (column vector_l2_ops); ``` Inner product ```sql CREATE INDEX ON table USING ivfflat (column vector_ip_ops); ``` Cosine distance ```sql CREATE INDEX ON table USING ivfflat (column vector_cosine_ops); ``` Indexes should be created after the table has data for optimal clustering. If the distribution of data changes significantly, you can reindex without downtime: ```sql -- Postgres 12+ REINDEX INDEX CONCURRENTLY index_name; -- Postgres < 12 CREATE INDEX CONCURRENTLY temp_name ON table USING ivfflat (column opclass); DROP INDEX CONCURRENTLY index_name; ALTER INDEX temp_name RENAME TO index_name; ``` Also, unlike typical indexes which only affect performance, you may see different results for queries after adding an approximate index. ### Index Options Specify the number of inverted lists (100 by default) ```sql CREATE INDEX ON table USING ivfflat (column opclass) WITH (lists = 100); ``` A [good place to start](https://github.com/facebookresearch/faiss/issues/112) is `4 * sqrt(rows)` ### Query Options Specify the number of probes (1 by default) ```sql SET ivfflat.probes = 1; ``` A higher value improves recall at the cost of speed. Use `SET LOCAL` inside a transaction to set it for a single query ```sql BEGIN; SET LOCAL ivfflat.probes = 1; SELECT ... COMMIT; ``` ### Indexing Progress [unreleased] Check [indexing progress](https://www.postgresql.org/docs/current/progress-reporting.html#CREATE-INDEX-PROGRESS-REPORTING) with Postgres 12+ ```sql SELECT phase, tuples_done, tuples_total FROM pg_stat_progress_create_index; ``` The phases are: 1. `initializing` 2. `sampling table` 3. `performing k-means` 4. `sorting tuples` 5. `loading tuples` Note: `tuples_done` and `tuples_total` are only populated during the `loading tuples` phase ### Partial Indexes Consider [partial indexes](https://www.postgresql.org/docs/current/indexes-partial.html) for queries with a `WHERE` clause ```sql CREATE INDEX ON table USING ivfflat (column opclass) WHERE (other_column = 123); ``` To index many different values of `other_column`, consider [partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html) on `other_column`. ## Performance To speed up queries without an index, increase `max_parallel_workers_per_gather`. ```sql SET max_parallel_workers_per_gather = 4; ``` To speed up queries with an index, increase the number of inverted lists (at the expense of recall). ```sql CREATE INDEX ON table USING ivfflat (column opclass) WITH (lists = 1000); ``` ## Reference ### Vector Type Each vector takes `4 * dimensions + 8` bytes of storage. Each element is a float, and all elements must be finite (no `NaN`, `Infinity` or `-Infinity`). Vectors can have up to 1024 dimensions. ### Vector Operators Operator | Description --- | --- \+ | element-wise addition \- | element-wise subtraction <-> | Euclidean distance <#> | negative inner product <=> | cosine distance ### Vector Functions Function | Description --- | --- cosine_distance(vector, vector) | cosine distance inner_product(vector, vector) | inner product l2_distance(vector, vector) | Euclidean distance vector_dims(vector) | number of dimensions vector_norm(vector) | Euclidean norm ## Libraries Libraries that use pgvector: - [pgvector-python](https://github.com/pgvector/pgvector-python) (Python) - [Neighbor](https://github.com/ankane/neighbor) (Ruby) - [pgvector-node](https://github.com/pgvector/pgvector-node) (Node.js) - [pgvector-go](https://github.com/pgvector/pgvector-go) (Go) - [pgvector-rust](https://github.com/pgvector/pgvector-rust) (Rust) - [pgvector-cpp](https://github.com/pgvector/pgvector-cpp) (C++) ## Frequently Asked Questions #### How many vectors can be stored in a single table? A non-partitioned table has a limit of 32 TB by default in Postgres. A partitioned table can have thousands of partitions of that size. #### Is replication supported? Yes, pgvector uses the write-ahead log (WAL), which allows for replication and point-in-time recovery. #### What if my data has more than 1024 dimensions? Two things you can try are: 1. use dimensionality reduction 2. compile Postgres with a larger block size (`./configure --with-blocksize=32`) and edit the limit in `src/vector.h` ## Additional Installation Methods ### Docker Get the [Docker image](https://hub.docker.com/r/ankane/pgvector) with: ```sh docker pull ankane/pgvector ``` This adds pgvector to the [Postgres image](https://hub.docker.com/_/postgres). You can also build the image manually ```sh git clone --branch v0.2.3 https://github.com/pgvector/pgvector.git cd pgvector docker build -t pgvector . ``` ### Homebrew On Mac with Homebrew Postgres, you can use: ```sh brew install pgvector/brew/pgvector ``` ### PGXN Install from the [PostgreSQL Extension Network](https://pgxn.org/dist/vector) with: ```sh pgxn install vector ``` ## Hosted Postgres Some Postgres providers only support specific extensions. To request a new extension: - Amazon RDS - follow the instructions on [this page](https://aws.amazon.com/rds/postgresql/faqs/) - Google Cloud SQL - follow the instructions on [this page](https://cloud.google.com/sql/docs/postgres/extensions#requesting-support-for-a-new-extension) - DigitalOcean Managed Databases - vote or comment on [this page](https://ideas.digitalocean.com/app-framework-services/p/pgvector-extension-for-postgresql) - Azure Database for PostgreSQL - follow the instructions on [this page](https://docs.microsoft.com/en-us/azure/postgresql/concepts-extensions#next-steps) ## Upgrading Install the latest version and run: ```sql ALTER EXTENSION vector UPDATE; ``` ## Thanks Thanks to: - [PASE: PostgreSQL Ultra-High-Dimensional Approximate Nearest Neighbor Search Extension](https://dl.acm.org/doi/pdf/10.1145/3318464.3386131) - [Faiss: A Library for Efficient Similarity Search and Clustering of Dense Vectors](https://github.com/facebookresearch/faiss) - [Using the Triangle Inequality to Accelerate k-means](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) - [k-means++: The Advantage of Careful Seeding](https://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf) - [Concept Decompositions for Large Sparse Text Data using Clustering](https://www.cs.utexas.edu/users/inderjit/public_papers/concept_mlj.pdf) ## History View the [changelog](https://github.com/pgvector/pgvector/blob/master/CHANGELOG.md) ## Contributing Everyone is encouraged to help improve this project. Here are a few ways you can help: - [Report bugs](https://github.com/pgvector/pgvector/issues) - Fix bugs and [submit pull requests](https://github.com/pgvector/pgvector/pulls) - Write, clarify, or fix documentation - Suggest or add new features To get started with development: ```sh git clone https://github.com/pgvector/pgvector.git cd pgvector make make install ``` To run all tests: ```sh make installcheck # regression tests make prove_installcheck # TAP tests ``` To run single tests: ```sh make installcheck REGRESS=functions # regression test make prove_installcheck PROVE_TESTS=test/t/001_wal.pl # TAP test ``` Resources for contributors - [Extension Building Infrastructure](https://www.postgresql.org/docs/current/extend-pgxs.html) - [Index Access Method Interface Definition](https://www.postgresql.org/docs/current/indexam.html) - [Generic WAL Records](https://www.postgresql.org/docs/13/generic-wal.html)