Tuesday, January 28, 2025

Vector indexes, MariaDB & pgvector, large server, large dataset: part 1

This post has results from ann-benchmarks to compare MariaDB and Postgres with a larger dataset, gist-960-euclidean.  Previous posts (here and here) used fashion-mnist-784-euclidean which is a small dataset. By larger I mean by the standards of what is in ann-benchmarks. This dataset has 1M rows and 960 dimensions. The fashion-mnist-784-euclidean dataset has 60,000 rows and 784 dimensions. Both use Euclidean distance.

tl;dr

  • MariaDB gets between 2.5X and 3.9X more QPS than Postgres for recall >= 0.95

Benchmark

This post has much more detail about my approach in general. I ran the benchmark for 1 session. I use ann-benchmarks via my fork of a fork of a fork at this commit.  The ann-benchmarks config files are here for MariaDB and for Postgres.

I used a larger server (Hetzner ax162-s) with 48 cores, 128G of RAM, Ubuntu 22.04 and HW RAID 10 using 2 NVMe devices. The database configuration files are here for MariaDB and for Postgres.

I had ps and vmstat running during the benchmark and confirmed there weren't storage reads as the table and index were cached by MariaDB and Postgres.

The command line to run the benchmark using my helper scripts is:
    bash rall.batch.sh v1 gist-960-euclidean c32r128

Results: QPS vs recall

This chart shows the best QPS for a given recall. MariaDB gets ~1.5X more QPS than pgvector at low recall and between 2X and 4X more QPS at high recall.


Results: create index

The database configs for Postgres and MariaDB are shared above, and parallel index create is disabled by the config for Postgres and not supported yet by MariaDB. The summary is:
  • index sizes are similar between MariaDB and pgvector with halfvec
  • time to create the index varies a lot and it is better to consider this in the context of recall which is done in next section
Table size is 3906 MB for Postgres and 5292 MB for MariaDB.

Legend
  • M - value for M when creating the index
  • cons - value for ef_construction when creating the index
  • secs - time in seconds to create the index
  • size(MB) - index size in MB
                -- pgvector --          -- pgvector --
                -- float32  --          -- halfvec  --
M       cons    secs    size(MB)        secs    size(MB)
 8       32      323    3870             292    2568
16       32      610    7811             551    2603
 8       64      512    3865             466    2565
16       64      964    7684             869    2561
32       64     1958    7812            1773    2604
 8       96      717    3863             646    2564
16       96     1330    7681            1187    2560
32       96     2640    7679            2368    2559
48       96     3990    7812            3606    2606
 8      192     1265    3861            1142    2562
16      192     2295    7679            2036    2559
32      192     4361    7678            3880    2559
48      192     6281    7678            5581    2562
64      192     8589    7678            7612    3839
 8      256     1607    3861            1448    2562
16      256     2882    7678            2560    2559
32      256     5260    7678            4611    2559
48      256     7678    7678            6713    2561
64      256     9962    7678            8851    3839

mariadb
M       secs    size(MB)
 4        243   2316
 5        313   2320
 6        439   2316
 8        775   2316
12       1878   2316
16       3547   2348
24       8690   2696
32      16172   2696
48      38732   2756

Results: best QPS for a given recall

Many benchmark results are marketed via peak performance (max throughput or min response time) but these are usually constrained optimization problems -- determine peak performance that satisfies some SLA. And the SLA might be response time or efficiency (cost).

With ann-benchmarks the constraint is recall. Below I share the best QPS for a given recall target along with the configuration parameters (M, ef_construction, ef_search) at which that occurs for each of the algorithms (MariaDB, pgvector with float32, pgvector with float16/halfvec).

Summary
  • Postgres does not get recall=1.0 for the values of M, ef_construction and ef_search I used
  • Index create time was less for MariaDB in all cases except the result for recall >= 0.96. However, if you care more about index size than peak QPS then it might be better to look at more results per recall level, as in the best 3 results per DBMS rather than the best as I do here.
  • For a given recall target, MariaDB gets between 2.5X and 3.9X more QPS than Postgres
Legend:
  • recall, QPS - best QPS at that recall
  • isecs - time to create the index in seconds
  • m= - value for M when creating the index
  • ef_cons= - value for ef_construction when creating the index
  • ef_search= - value for ef_search when running queries
Best QPS with recall >= 1.000, Postgres did not reach the recall target
recall  QPS     isecs
1.000    87.1  38732   MariaDB(m=48, ef_search=10)

Best QPS with recall >= 0.99, MariaDB gets 3.7X more QPS than Postgres
recall  QPS     isecs
0.990    81.1   9962    PGVector(m=64, ef_cons=256, ef_search=120)
0.990    85.4   8851    PGVector_halfvec(m=64, ef_cons=256, ef_search=120)
0.991   311.8   8690    MariaDB(m=24, ef_search=10)

Best QPS with recall >= 0.98,MariaDB gets 3.5X more QPS than Postgres
recall  QPS     isecs
0.984   101.1   6281    PGVector(m=48, ef_cons=192, ef_search=120)
0.984   109.7   5581    PGVector_halfvec(m=48, ef_cons=192, ef_search=120)
0.985   384.9   3547    MariaDB(m=16, ef_search=20)

Best QPS with recall >= 0.97, MariaDB gets 2.5X more QPS than Postgres
recall  QPS     isecs
0.973   138.0   5260    PGVector(m=32, ef_cons=256, ef_search=120)
0.971   152.7   3880    PGVector_halfvec(m=32, ef_cons=192, ef_search=120)
0.985   384.9   3547    MariaDB(m=16, ef_search=20)

Best QPS with recall >= 0.96, MariaDB gets 3.9X more QPS than Postgres
recall  QPS     isecs
0.966   139.8   6281    PGVector(m=48, ef_cons=192, ef_search=80)
0.964   155.4   2368    PGVector_halfvec(m=32, ef_cons=96, ef_search=120)
0.960   610.1   3547    MariaDB(m=16, ef_search=10)

Best QPS with recall >= 0.95, MariaDB gets 2.9X more QPS than Postgres
recall  QPS     isecs
0.951   190.8   5260    PGVector(m=32, ef_cons=256, ef_search=80)
0.951   208.7   4611    PGVector_halfvec(m=32, ef_cons=256, ef_search=80)
0.960   610.1   3547    MariaDB(m=16, ef_search=10)

No comments:

Post a Comment

Vector indexes, MariaDB & pgvector, large server, large dataset: part 1

This post has results from ann-benchmarks to compare MariaDB and Postgres with a larger dataset, gist-960-euclidean.  Previous posts ( here...