This post has results for vector index support in MariaDB and Postgres. I am new to vector indexes so I will start small and over time add more results. This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.
I compare MariaDB with pgvector because I respect the work that the Postgres community has done to support vector search workloads. And I am happy to report that MariaDB has also done a great job on this. While I don't know the full story of the development effort, this feature came from the MariaDB Foundation and the community and it is wonderful to see that collaboration.
tl;dr
- Performance for MariaDB is excellent
- peak QPS for a given recall target is much better than pgvector
- time to create the index to achieve a given recall target is much better than pgvector
- MariaDB is easier to tune than pgvector
- MariaDB needs more documentation
- this blog post on benchmarks
- this overview
- this syntax description
- the algorithm is described as a variation of HNSW (MHNSW) and the community will benefit from more details on what has been changed. For example, I can't set ef_construction and it always uses float16 in the index. But with pgvector I can use either float32 (default) or float16 (via halfvec).
- What transaction isolation level is supported? I get that workloads will be read-heavy but I assume that some won't be read-only so I won't to know whether repeatable read and read committed are provided.
- What concurrent operations are supported? Obviously, reads can be concurrent with other reads. But can writes be concurrent with reads or with other writes on the same table?
While I already installed some dependencies on this server long ago, to run this benchmark I did:
sudo apt install libmariadb3 libmariadb-dev
pip3 install mariadb
pip3 install pgvector psycopg
For MariaDB I compiled version 11.7.1 from source. Files that I used include:
I am using the ann-benchmarks via my fork of a fork of a fork at this commit. These forks have changes to run the benchmark for MariaDB and pgvector without using Docker containers. A request I have for anyone writing a benchmark client is to limit dependencies, or at least make them optional. I just want to point the benchmark client at an existing installation.
In this post I use the fashion-mnist-784-euclidean dataset in non-batch mode where non-batch mode doesn't run concurrent queries and batch mode does. Note that I don't set --count when running the benchmark which means that all of the queries use LIMIT 10.
The command lines for non-batch mode are:
python3 -u run.py --algorithm pgvector --dataset fashion-mnist-784-euclidean --local
POSTGRES_CONN_ARGS=root:pw:127.0.0.1:5432 POSTGRES_DB_NAME=ib \
python3 -u run.py --algorithm pgvector_halfvec --dataset fashion-mnist-784-euclidean --local
MARIADB_CONN_ARGS=root:pw:127.0.0.1:3306 MARIADB_DB_NAME=test \
python3 -u run.py --algorithm mariadb --dataset fashion-mnist-784-euclidean --local
Results: QPS vs recall graphs
- Indexes have a similar size with MariaDB and Postgres with halfvec. The Postgres index without halfvec is about 2X larger.
- Time to create the index for Postgres is similar with and without halfvec
- Time to create the index for MariaDB is less than for pgvector. Perhaps the best way to compare this is the time to create the index for a similar point on the QPS/recall graph (see the last section of this blog post)
With ann-benchmarks the constraint might be recall where you seek the best QPS that satisfies a recall target. Below I share the best QPS for a given recall target along with the configuration parameters (M, ef_construction, ef_search) at which that occurs for each of the algorithms (MariaDB, pgvector with float32, pgvector with float16).
For all cases below the best QPS at a given recall target is much better, sometimes 2X better, with MariaDB than with pgvector.