This post has results for vector index support in MariaDB and Postgres. I am new to vector indexes so I will start small and over time add more results. This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.
I compare MariaDB with pgvector because I respect the work that the Postgres community has done to support vector search workloads. And I am happy to report that MariaDB has also done a great job on this. While I don't know the full story of the development effort, this feature came from the MariaDB Foundation and the community and it is wonderful to see that collaboration.
tl;dr
- Performance for MariaDB is excellent
- peak QPS for a given recall target is much better than pgvector
- time to create the index to achieve a given recall target is almost always much better than pgvector
- MariaDB is easier to tune than pgvector
- MariaDB needs more documentation
- this blog post on benchmarks
- this overview
- this syntax description
- the algorithm is described as a variation of HNSW (MHNSW), and the community would benefit from more details on what has been changed. For example, I can't set ef_construction, and the index always uses float16, while pgvector lets me use either float32 (the default) or float16 (via halfvec).
- What transaction isolation levels are supported? I get that these workloads will be read-heavy, but I assume some won't be read-only, so I want to know whether repeatable read and read committed are provided.
- What concurrent operations are supported? Obviously, reads can be concurrent with other reads. But can writes be concurrent with reads or with other writes on the same table?
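As a rough illustration of the float32 vs float16 tradeoff mentioned above, the sketch below round-trips a value through half precision using Python's stdlib struct module (the numbers are mine, not measured from either index):

```python
import struct

# A 784-dimension vector, as in the fashion-mnist-784-euclidean dataset.
dims = 784
bytes32 = dims * struct.calcsize('f')   # 4 bytes per float32 component
bytes16 = dims * struct.calcsize('e')   # 2 bytes per float16 component

# Round-trip one value through float16 to show the precision loss:
# float16 keeps a 10-bit mantissa, so ~3 decimal digits survive.
x = 0.123456789
x16 = struct.unpack('e', struct.pack('e', x))[0]

print(bytes32, bytes16, x16)
```

Halving the per-component size halves the vector payload of the index, which is consistent with the ~2X index-size difference between pgvector with and without halfvec reported below.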
While I already installed some dependencies on this server long ago, to run this benchmark I did:
sudo apt install libmariadb3 libmariadb-dev
pip3 install mariadb
pip3 install pgvector psycopg
For MariaDB I compiled version 11.7.1 from source. Files that I used include:
I am using the ann-benchmarks via my fork of a fork of a fork at this commit. These forks have changes to run the benchmark for MariaDB and pgvector without using Docker containers. A request I have for anyone writing a benchmark client is to limit dependencies, or at least make them optional. I just want to point the benchmark client at an existing installation.
In this post I use the fashion-mnist-784-euclidean dataset in non-batch mode, where non-batch mode runs queries serially and batch mode runs them concurrently. Note that I don't set --count when running the benchmark, which means that all of the queries use LIMIT 10.
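Because every query uses LIMIT 10, the reported metric is recall@10. A minimal sketch of how ann-benchmarks-style recall is computed (toy ids of my own, not the harness's code):

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Fraction of the true k nearest neighbors found in the result set."""
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k

# Toy example: the index returned 8 of the 10 true nearest neighbors.
truth = list(range(10))                    # ids of the true 10-NN
result = [0, 1, 2, 3, 4, 5, 6, 7, 98, 99]  # ids returned by the index
print(recall_at_k(result, truth))          # 0.8
```

Averaging this over all test queries gives the recall value plotted on the QPS/recall graphs.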
Files related to these tests are archived here.
The command lines for non-batch mode are:
POSTGRES_CONN_ARGS=root:pw:127.0.0.1:5432 POSTGRES_DB_NAME=ib \
python3 -u run.py --algorithm pgvector --dataset fashion-mnist-784-euclidean --local
POSTGRES_CONN_ARGS=root:pw:127.0.0.1:5432 POSTGRES_DB_NAME=ib \
python3 -u run.py --algorithm pgvector_halfvec --dataset fashion-mnist-784-euclidean --local
MARIADB_CONN_ARGS=root:pw:127.0.0.1:3306 MARIADB_DB_NAME=test \
python3 -u run.py --algorithm mariadb --dataset fashion-mnist-784-euclidean --local
Results: QPS vs recall graphs
- Index sizes are similar for MariaDB and Postgres with halfvec. The Postgres index without halfvec is about 2X larger.
- Time to create the index for Postgres is similar with and without halfvec.
- Time to create the index for MariaDB is less than for pgvector. Perhaps the best way to compare this is the time to create the index at a similar point on the QPS/recall graph (see the last section of this blog post).
With ann-benchmarks the constraint is usually recall: you seek the best QPS that satisfies a recall target. Below I share the best QPS for a given recall target along with the configuration parameters (M, ef_construction, ef_search) at which that occurs for each of the algorithms (MariaDB, pgvector with float32, pgvector with float16).
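The selection rule, best QPS among the runs that meet the recall target, can be sketched as follows (the run tuples here are hypothetical, not my measured results):

```python
def best_qps(runs, target):
    """Pick the run with the highest QPS whose recall meets the target.

    runs: iterable of (recall, qps, config) tuples, one per
    (M, ef_construction, ef_search) combination tried.
    """
    eligible = [r for r in runs if r[0] >= target]
    return max(eligible, key=lambda r: r[1], default=None)

# Hypothetical runs: (recall, qps, config)
runs = [
    (0.950, 4000, "M=8, ef_search=10"),
    (0.980, 2500, "M=16, ef_search=20"),
    (0.995, 1200, "M=16, ef_search=40"),
]
print(best_qps(runs, 0.98))   # (0.98, 2500, 'M=16, ef_search=20')
```

Note that a run with higher recall than the target is still eligible; the rule only filters out runs below the target, then takes the fastest of what remains.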
For all cases below the best QPS at a given recall target is much better, sometimes 3X better, with MariaDB than with pgvector.
A couple of comments:
* MariaDB is not using float16 for indexes, it uses int16, which gives 15 bits to store the value vs. 10 in float16.
* concurrent reads/writes and all transaction isolation levels are supported. Keep in mind though that the index is a graph so rows are heavily interconnected, and thus InnoDB row locks might play a greater role than usual.
Cool stuff! I think to make this be even more compelling, I'd like to see some specific real-world comparisons:
1) OpenAI embedding dimensions for text-embedding-3-small are 1536 dimensions, for 3-large they are 3072 dimensions. These will probably be very popular.
2) adding vectors to an index is important, but so is retrieving them. I'd like to see both benchmarked
3) Next to pgvector, people are using Pinecone and ChromaDB a lot; any comparison with those would be super useful.
4) in a real world scenario, people will want to store all kinds of metadata per vector. Some testing to see how this performs when it's in-table vs in a separate table would be very useful.
For 1) I will speak to the sponsor of this work to figure out which dataset to test next.
For 2) I don't understand. The results above have QPS/recall charts, more info on peak QPS for a given recall, and then time & size info on index creation.
For 3) I need someone to sponsor the work if I am to add more databases to the evaluation.
For 4) I agree but here I am limited by what the well known benchmarks use, and I am new to this space. Regardless, I expect a SQL DBMS that also supports vector search to be more successful than a DBMS limited to vector search.
I am looking forward to more such articles! Thanks Mark & MariaDB folks.
Is it possible to publish index build timings for MariaDB/pgvector with the different M's? I am going to try out with a different dataset and wanted to get an idea before I start.