Sunday, January 12, 2025

Evaluating vector indexes in MariaDB and pgvector: part 1

This post has results for vector index support in MariaDB and Postgres. I am new to vector indexes so I will start small and over time add more results.  This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.

I compare MariaDB with pgvector because I respect the work that the Postgres community has done to support vector search workloads. And I am happy to report that MariaDB has also done a great job on this. While I don't know the full story of the development effort, this feature came from the MariaDB Foundation and the community and it is wonderful to see that collaboration.

tl;dr

  • Performance for MariaDB is excellent
    • peak QPS for a given recall target is much better than pgvector
    • time to create the index to achieve a given recall target is almost always much better than pgvector
  • MariaDB is easier to tune than pgvector
  • MariaDB needs more documentation
The good and the less good

The good for MariaDB starts with performance. The results I get are great and match the results in this blog post from upstream. Another good thing about MariaDB is that it is easier to tune. With pgvector I need to set M and ef_construction while creating an index and then ef_search while running a query, With MariaDB there is no option to set ef_construction. And evaluations are easier when there are fewer options to tune. Note that the M with pgvector isn't the same thing as the M with MariaDB, but they are similar (at least in spirit).

The less good for MariaDB is the documentation. We need more, but that is easy to fix. I have been using:

Things that need more documentation
  • the algorithm is described as a variation of HNSW (MHNSW) and the community will benefit from more details on what has been changed. For example, I can't set ef_construction and it always uses float16 in the index. But with pgvector I can use either float32 (default) or float16 (via halfvec).
  • What transaction isolation level is supported? I get that workloads will be read-heavy but I assume that some won't be read-only so I won't to know whether repeatable read and read committed are provided.
  • What concurrent operations are supported? Obviously, reads can be concurrent with other reads. But can writes be concurrent with reads or with other writes on the same table?
Hardware

The hardware is a Beelink SER7 7840HS with a Ryzen 7 7840HS CPU, 32G of RAM and Ubuntu 22.04.

While I already installed some dependencies on this server long ago, to run this benchmark I did:

sudo apt install libmariadb3 libmariadb-dev

pip3 install mariadb

pip3 install pgvector psycopg


Database software

For Postgres I compiled version 17.2 and pgvector 0.8.0 from source. Files that I used include:
For MariaDB I compiled version 11.7.1 from source. Files that I used include:
Benchmark

I am using the ann-benchmarks via my fork of a fork of a fork at this commit. These forks have changes to run the benchmark for MariaDB and pgvector without using Docker containers. A request I have for anyone writing a benchmark client is to limit dependencies, or at least make them optional. I just want to point the benchmark client at an existing installation.

The ann-benchmarks configuration files are here for MariaDB and for pgvector. I am open to feedback that I should try different parameters. I added support to use float16 (halfvec) for the pgvector index (but only the index, not for the vector stored in the table).

In this post I use the fashion-mnist-784-euclidean dataset in non-batch mode where non-batch mode doesn't run concurrent queries and batch mode does. Note that I don't set --count when running the benchmark which means that all of the queries use LIMIT 10.

Files related to these tests are archived here.

The command lines for non-batch mode are:
POSTGRES_CONN_ARGS=root:pw:127.0.0.1:5432 POSTGRES_DB_NAME=ib \
    python3 -u run.py  --algorithm pgvector --dataset fashion-mnist-784-euclidean --local

POSTGRES_CONN_ARGS=root:pw:127.0.0.1:5432 POSTGRES_DB_NAME=ib \
    python3 -u run.py  --algorithm pgvector_halfvec --dataset fashion-mnist-784-euclidean --local

MARIADB_CONN_ARGS=root:pw:127.0.0.1:3306 MARIADB_DB_NAME=test \
    python3 -u run.py  --algorithm mariadb --dataset fashion-mnist-784-euclidean --local

Results: QPS vs recall graphs

The recall vs QPS graph is created by: python3 plot.py --dataset fashion-mnist-784-euclidean

The results below for MariaDB are excellent. It gets more QPS than pgvector at a given recall target.


Results: create index

I am still trying to figure out how to present this data. All of the numbers are here for the time to create an index and the size of the index. The database configs for Postgres and MariaDB are shared above, and parallel index create is disabled by the config for Postgres (and not supported yet by MariaDB). The summary is:
  • Indexes have a similar size with MariaDB and Postgres with halfvec. The Postgres index without halfvec is about 2X larger.
  • Time to create the index for Postgres is similar with and without halfvec
  • Time to create the index for MariaDB is less than for pgvector. Perhaps the best way to compare this is the time to create the index for a similar point on the QPS/recall graph (see the last section of this blog post)
Results: best QPS for a given recall

Many benchmark results are marketed via peak performance (max throughput or min response time) but these are usually constrained optimization problems -- determine peak performance that satisfies some SLA. And the SLA might be response time or efficiency (cost).

With ann-benchmarks the constraint might be recall where you seek the best QPS that satisfies a recall target. Below I share the best QPS for a given recall target along with the configuration parameters (M, ef_construction, ef_search) at which that occurs for each of the algorithms (MariaDB, pgvector with float32, pgvector with float16).

For all cases below except the first (recall = 1.000) the time to create the index is about 2X or more longer with pgvector than with MariaDB.

For all cases below the best QPS at a given recall target is much better, sometimes 3X better, with MariaDB than with pgvector.

Legend
* recall & QPS - results from the benchmark
* isecs - number of seconds to create the index for M and ef_construction (ef_cons)

Best QPS with recall >= 1.000
recall  QPS  isecs  algorithm
1.000   920  97.4   PGVector(m=16, ef_cons=192, ef_search=120)
1.000   991  89.8   PGVector_halfvec(m=16, ef_cons=256, ef_search=120)
1.000  3226 111.8   MariaDB(m=32, ef_search=10)

Best QPS with recall >= 0.99
recall  QPS  isecs  algorithm
0.990  2337  70.7   PGVector(m=16, ef_cons=96, ef_search=20)
0.991  2558  78.3   PGVector_halfvec(m=16, ef_cons=192, ef_search=20)
0.995  4745  27.4   MariaDB(m=12, ef_search=10)

Best QPS with recall >= 0.98
recall  QPS  isecs  algorithm
0.985  2383  50.4   PGVector(m=16, ef_cons=32, ef_search=20)
0.988  2608  52.8   PGVector_halfvec(m=16, ef_cons=64, ef_search=20)
0.984  5351  18.1   MariaDB(m=8, ef_search=10)

Best QPS with recall >= 0.97
recall  QPS  isecs  algorithm
0.972  3033  42.3   PGVector(m=8, ef_cons=96, ef_search=20)
0.973  3185  89.8   PGVector_halfvec(m=16, ef_cons=256, ef_search=10)
0.984  5351  18.1   MariaDB(m=8, ef_search=10)

Best QPS with recall >= 0.96
recall  QPS  isecs  algorithm
0.961  3044   50.4  PGVector(m=16, ef_cons=32, ef_search=10)
0.967  3250   52.8  PGVector_halfvec(m=16, ef_cons=64, ef_search=10)
0.961  5691   14.8  MariaDB(m=6, ef_search=10)

Best QPS with recall >= 0.95
recall  QPS  isecs  algorithm
0.961  3044  50.4   PGVector(m=16, ef_cons=32, ef_search=10)
0.967  3250  52.8   PGVector_halfvec(m=16, ef_cons=64, ef_search=10)
0.961  5691  14.8   MariaDB(m=6, ef_search=10)

4 comments:

  1. A couple of comments:

    * MariaDB is not using float16 for indexes, it uses int16, which gives 15 bits to store the value vs. 10 in float16.
    * concurrent reads/writes and all transaction isolation levels are supported. Keep in mind though that the index is a graph so rows are heavily interconnected, and thus InnoDB row locks might play a greater role than usual.

    ReplyDelete
  2. Cool stuff! I think to make this be even more compelling, I'd like to see some specific real-world comparisons:
    1) openai embedding dimensions for text-embedding-3-small are 1536 dimensions, for 3-large they are 3072 dimensions. These will probably be very popular
    2) adding vectors to an index is important, but so is retrieving them. I'd like to see both benchmarked
    3) next to pg_vector people are using pinecone and chromadb a lot, any comparison with those would be super useful.
    4) in a real world scenario, people will want to store all kinds of metadata per vector. Some testing to see how this performs when it's in-table vs in a separate table would be very useful.

    ReplyDelete
    Replies
    1. For 1) I will speak to the sponsor of this work to figure out which data set to test next

      For 2) I don't understand. The results above have QPS/recall charts, more info on peak QPS for a given recall, and then time & size info on index creation.

      For 3) I need someone to sponsor the work if I am to add more databases to the evaluation.

      For 4) I agree but here I am limited by what the well known benchmarks use, and I am new to this space. Regardless, I expect a SQL DBMS that also supports vector search to be more successful than a DBMS limited to vector search.

      Delete
  3. I am looking forward to more such articles! Thanks Mark & MariaDB folks.

    Is it possible to publish index build timings for MariaDB/pgvector with the different M's? I am going to try out with a different dataset and wanted to get an idea before I start.

    ReplyDelete

Evaluating vector indexes in MariaDB and pgvector: part 2

This post has results from the ann-benchmarks with the   fashion-mnist-784-euclidean  dataset for MariaDB and Postgres (pgvector) with conc...