Small Datum: February 2025

Wednesday, February 26, 2025

Feedback from the Postgres community about the vector index benchmarks

This is a response to some of the feedback I received from the Postgres community about my recent benchmark results for vector indexes using MariaDB and Postgres (pgvector). The work here isn't sponsored and required ~2 weeks of server time and a few more hours of my time (yes, I contribute to the PG community).

tl;dr

index create is ~4X faster when using ~4 parallel workers. I hope that parallel DDL comes to more open source DBMS.
parallel query does not help pgvector
increasing work_mem does not help pgvector
pgvector gets less QPS than MariaDB because it uses more CPU to compute the distance metric

The feedback

The feedback I received includes

benchmarks are useless, my production workload is the only relevant workload

I disagree, but don't expect consensus. This approach means you are unlikely to do comparative benchmarks because it costs too much to port your workload to some other DBMS just for the sake of a benchmark. And because only your team can run the workload, you will miss out on expertise from elsewhere.

this work was sponsored so I don't trust it

There isn't much I can do about that.

this is benchmarketing!

There is a marketing element to this work. Perhaps I was not the first to define benchmarketing, but by my definition a benchmark report is not benchmarketing when the results are explained. I have been transparent about how I ran the benchmark and shared some performance debugging results here. MariaDB gets more QPS than pgvector because pgvector uses more CPU to compute the distance metric.

you should try another Postgres extension for vector indexes

I hope to do that eventually. But pgvector is a great choice here because it implements HNSW and MariaDB implements modified HNSW. Regardless, time is finite, the number of servers I have is finite and my work here (both sponsored and volunteer) competes with my need to finish my taxes, play video games, sleep, etc.

this result is bogus because I didn't like some config setting you used

Claims like this are cheap to make and expensive to debunk. Sometimes the suggested changes make a big difference but that has been rare in my experience. Most of the time the changes at best make a small difference.

this result is bogus because you used Docker

I am not a Docker fan, but I only used Docker for Qdrant. I am not a fan because I suspect it is overused and I prefer benchmark setups that don't require it. While the ann-benchmarks page states that Docker is used for all algorithms, it is not used for MariaDB or Postgres in my fork. And it is trivial, but time consuming, to update most of ann-benchmarks to not use Docker.

Claims about the config that I used include:

huge pages were disabled

Yes they were. Enabling huge pages would have helped both MariaDB and Postgres. But I prefer to not use them because the feature is painful (unless you like OOM). Posts from me on this are here and here.

track_io_timing was enabled and is expensive

There wasn't IO when queries ran as the database was cached so this is irrelevant. There was some write IO when the index was created. While I won't repeat tests with track_io_timing disabled, I am skeptical that the latency of two calls to gettimeofday() per IO is significant on modern Linux.

autovacuum_vacuum_cost_limit is too high

I set it to 4000 because I often run write-intensive benchmarks on servers with large IOPs capacities. This is the first time anyone suggested it was too high and experts have reviewed my config files. I wish that Postgres vacuum didn't require so much tuning -- and all of that tuning means that someone can always claim your tuning is wrong. Regardless, the benchmark workload is load, create index, query and I only time the create index and query steps. There is little impact from vacuum on this benchmark. I have also used hundreds of server hours to search for good Postgres configs.

parallel operations were disabled

Parallel query isn't needed for the workloads I normally run but parallel index create is nice to have. I disable parallel index create for Postgres because my primary focus is efficiency -- how much CPU and IO is consumed per operation. But I haven't been clear on that in my blog posts. Regardless, below I show there is a big impact from parallel index create and no impact from parallel query.

work_mem is too low

I have been using the default which is fine for the workloads I normally run (sysbench, insert benchmark). Below I show there is no impact from increasing it to 8M.

Benchmark

This post has much more detail about my approach in general. I ran the benchmark for 1 session. I use ann-benchmarks via my fork of a fork of a fork at this commit. I used the dbpedia-openai dataset with 1M rows. It uses angular (cosine) for the distance metric. The ann-benchmarks config file is here for Postgres and in this case I only have results for pgvector with halfvec (float16).
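For readers less familiar with the metric, here is a minimal sketch of angular (cosine) distance in plain Python. It is illustrative only: it is not the code used by pgvector, MariaDB or ann-benchmarks, and the function name is made up for this example. The point is that every call touches every dimension of both vectors, so the cost of this computation adds up quickly for high-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # One multiply-add per dimension for the dot product and each norm.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

With this definition the distance is 0 for vectors that point in the same direction, 1 for orthogonal vectors and 2 for vectors that point in opposite directions.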

I used a large server (Hetzner ax162-s) with 48 cores, 128G of RAM, Ubuntu 22.04 and HW RAID 10 using 2 NVMe devices. I tested three configurations for Postgres and all of the settings are here:

def stands for default and is the config I used in all of my previous blog posts. Thus, it is the config for which I received feedback.

wm8 stands for work_mem increased to 8MB. The default (used by def) is 4MB.

pq4 stands for Parallel Query with ~4 workers. Here I changed a few settings from def to support that.
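As a rough illustration of what a parallel index build looks like with pgvector, the sketch below generates the SQL for one build. It is an assumption-laden example and not a copy of the pq4 config: the table and column names are placeholders, the helper name is hypothetical, and only max_parallel_maintenance_workers is shown even though the real config changes a few settings.

```python
def hnsw_build_statements(table, column, m, ef_construction, workers):
    """Return SQL statements for a parallel HNSW index build with pgvector.

    table and column are placeholders. halfvec_cosine_ops assumes the column
    has the halfvec type and that cosine distance is wanted.
    """
    if ef_construction < 2 * m:
        raise ValueError("pgvector requires ef_construction >= 2 * m")
    return [
        # Upper bound on parallel workers for maintenance commands
        # such as CREATE INDEX.
        f"SET max_parallel_maintenance_workers = {workers};",
        f"CREATE INDEX ON {table} USING hnsw ({column} halfvec_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});",
    ]


for statement in hnsw_build_statements("items", "embedding", 16, 64, 4):
    print(statement)
```

The number of workers that Postgres actually starts can be lower than the value requested, which is one reason to write ~4 rather than 4.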

Output from the benchmark is here.

The command lines to run the benchmark using my helper scripts are:

bash rall.batch.sh v1 dbpedia-openai-1000k-angular c32r128

Results: QPS vs recall

These charts show the best QPS for a given recall. The graphs appear to be the same but the differences are harder to see as recall approaches 1.0 so the next section has a table with numbers.

But from these graphs, QPS doesn't improve with the wm8 or pq4 configs.

The chart for the def config which is what I used in previous blog posts.

The chart for the wm8 config with work_mem=8M

The chart for the pq4 config that uses ~4 parallel workers

Results: best QPS for a given recall

Many benchmark results are marketed via peak performance (max throughput or min response time) but these are usually constrained optimization problems -- determine peak performance that satisfies some SLA. And the SLA might be response time or efficiency (cost).

With ann-benchmarks the constraint is recall. Below I share the best QPS for a given recall target along with the configuration parameters (M, ef_construction, ef_search) at which that occurs.
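The selection rule can be sketched in a few lines of Python: keep the results that meet the recall target, then take the one with the most QPS. This is a simplified illustration and not the ann-benchmarks code; the function name is hypothetical and the sample pairs are copied from the first row of each result group below.

```python
def best_qps_at_recall(results, target):
    """Return the (recall, qps) pair with the highest QPS among results
    whose recall is >= target, or None when nothing qualifies."""
    eligible = [r for r in results if r[0] >= target]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r[1])


# (recall, QPS) pairs copied from the first row of each result group below.
results = [
    (0.990, 317.9),
    (0.983, 412.0),
    (0.978, 487.3),
    (0.961, 621.1),
    (0.951, 768.7),
]

for target in (1.0, 0.99, 0.98, 0.97, 0.96, 0.95):
    print(target, best_qps_at_recall(results, target))
```

A lower recall target can only add candidates, so the best QPS never gets worse as the target is relaxed.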

Summary

pgvector does not get more QPS with parallel query
pgvector does not get more QPS with a larger value for work_mem

Legend:

recall, QPS - best QPS at that recall
isecs - time to create the index in seconds
m= - value for M when creating the index
ef_cons= - value for ef_construction when creating the index
ef_search= - value for ef_search when running queries

Best QPS with recall >= 1.000

no algorithms achieved this for the (M, ef_construction) settings I used

Best QPS with recall >= 0.99

recall QPS isecs config

0.990 317.9 7302 PGVector_halfvec(m=48, ef_cons=256, ef_search=40)

0.990 320.4 7285 PGVector_halfvec(m=48, ef_cons=256, ef_search=40)

0.990 316.9 1565 PGVector_halfvec(m=48, ef_cons=256, ef_search=40)

Best QPS with recall >= 0.98

0.983 412.0 4120 PGVector_halfvec(m=32, ef_cons=192, ef_search=40)

0.984 415.6 4168 PGVector_halfvec(m=32, ef_cons=192, ef_search=40)

0.984 411.4 903 PGVector_halfvec(m=32, ef_cons=192, ef_search=40)

Best QPS with recall >= 0.97

0.978 487.3 5070 PGVector_halfvec(m=32, ef_cons=256, ef_search=30)

0.970 508.3 2495 PGVector_halfvec(m=32, ef_cons=96, ef_search=30)

0.970 508.4 2495 PGVector_halfvec(m=32, ef_cons=96, ef_search=30)

Best QPS with recall >= 0.96

0.961 621.1 4120 PGVector_halfvec(m=32, ef_cons=192, ef_search=20)

0.962 632.3 4168 PGVector_halfvec(m=32, ef_cons=192, ef_search=20)

0.962 622.0 903 PGVector_halfvec(m=32, ef_cons=192, ef_search=20)

Best QPS with recall >= 0.95

0.951 768.7 2436 PGVector_halfvec(m=16, ef_cons=192, ef_search=30)

0.952 770.2 2442 PGVector_halfvec(m=16, ef_cons=192, ef_search=30)

0.953 753.0 547 PGVector_halfvec(m=16, ef_cons=192, ef_search=30)

Results: create index

The database configs for Postgres are shared above and parallel index create is disabled by default because my focus has not been on DDL performance. Regardless, it works great for Postgres with pgvector. The summary is:

index create is ~4X faster when using ~4 parallel workers
index sizes are similar with and without parallel create index

Sizes: table is ~8G and index is ~4G

Legend

M - value for M when creating the index
cons - value for ef_construction when creating the index
secs - time in seconds to create the index
size(MB) - index size in MB

             def     wm8     pq4
M    cons    secs    secs    secs
8    32       412     405     108
16   32       649     654     155
8    64       624     627     154
16   64      1029    1029     237
32   64      1901    1895     412
8    96       834     835     194
16   96      1387    1393     312
32   96      2497    2495     541
48   96      3731    3726     798
8    192     1409    1410     316
16   192     2436    2442     547
32   192     4120    4168     903
48   192     6117    6119    1309
64   192     7838    7815    1662
8    256     1767    1752     400
16   256     3146    3148     690
32   256     5070    5083    1102
48   256     7302    7285    1565
64   256     9959    9946    2117
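The ~4X claim can be checked from the table above by dividing the def time by the pq4 time for each row. The sketch below does that in Python with the numbers copied from the table; the variable and function names are just for this example.

```python
# (M, ef_construction, def_secs, pq4_secs) copied from the table above.
rows = [
    (8, 32, 412, 108), (16, 32, 649, 155),
    (8, 64, 624, 154), (16, 64, 1029, 237), (32, 64, 1901, 412),
    (8, 96, 834, 194), (16, 96, 1387, 312), (32, 96, 2497, 541),
    (48, 96, 3731, 798),
    (8, 192, 1409, 316), (16, 192, 2436, 547), (32, 192, 4120, 903),
    (48, 192, 6117, 1309), (64, 192, 7838, 1662),
    (8, 256, 1767, 400), (16, 256, 3146, 690), (32, 256, 5070, 1102),
    (48, 256, 7302, 1565), (64, 256, 9959, 2117),
]


def speedups(rows):
    """Return def_secs / pq4_secs for each row."""
    return [def_secs / pq4_secs for _, _, def_secs, pq4_secs in rows]


ratios = speedups(rows)
print(f"min {min(ratios):.2f}X, max {max(ratios):.2f}X")
```

The ratios fall between roughly 3.8X and 4.7X, with the smallest speedup on the cheapest build (M=8, ef_construction=32).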

Thursday, February 20, 2025

How to find Lua scripts for sysbench using LUA_PATH

sysbench is a great tool for benchmarks and I appreciate all of the work the maintainer (Alexey Kopytov) put into it as that is often a thankless task. Today I struggled to figure out how to load Lua scripts from something other than the default location that was determined when sysbench was compiled. It turns out that LUA_PATH is the thing to set, but the syntax isn't what I expected.

My first attempt was this, because the PATH in LUA_PATH implies directory names. But that failed.
LUA_PATH="/mnt/data/sysbench.lua/lua" sysbench ... oltp_insert run

It turns out that LUA_PATH uses special semantics and this worked:
LUA_PATH="/mnt/data/sysbench.lua/lua/?.lua" sysbench ... oltp_insert run

The usage above replaces the existing search path. The usage below prepends the new path to the existing (compiled in) path:

LUA_PATH="/mnt/data/sysbench.lua/lua/?.lua;;" sysbench ... oltp_insert run
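To make the semantics concrete, here is a small Python sketch of how a Lua-style search path expands into candidate file names. It models the standard Lua rules (entries are separated by `;`, each `?` is replaced by the script name, and `;;` stands for the default path) and is not sysbench code. The function name and the default path are hypothetical; the real default depends on how sysbench was built.

```python
def lua_candidates(lua_path, name, default_path):
    """List the file names a Lua-style search path would try for `name`.

    An empty entry (written as ';;') stands for the default path, and each
    '?' in a template is replaced by the script name.
    """
    expanded = lua_path.replace(";;", ";" + default_path + ";")
    templates = [t for t in expanded.split(";") if t]
    return [t.replace("?", name) for t in templates]


# Hypothetical compiled-in default; the real value depends on the build.
DEFAULT = "/usr/local/share/sysbench/?.lua"

for path in (
    "/mnt/data/sysbench.lua/lua",
    "/mnt/data/sysbench.lua/lua/?.lua",
    "/mnt/data/sysbench.lua/lua/?.lua;;",
):
    print(path, "->", lua_candidates(path, "oltp_insert", DEFAULT))
```

The first form has no `?` so the only candidate is the directory itself, which is not a script. The second form finds the script in the new directory only, and the third also falls back to the default location.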

Wednesday, February 19, 2025

My database communities

I have been working on databases since 1996. In some cases I just worked on the product (Oracle & Informix), in others I consider myself a member of the community (MySQL, Postgres & RocksDB). And for MongoDB I used to be in the community.

I worked on Informix XPS in 1996. I chose Informix because I could live in Portland OR and walk to work. I was fresh out of school, didn't know much about DBMS, but got a great starter project (star query optimization). The company wasn't in great shape so I left by 1997 for Oracle. I never used Informix in production and didn't consider myself as part of the Informix community.

I was at Oracle from 1997 to 2005. The first 3 years were in Portland implementing JMS for the app server team and the last 5 years at Oracle HQ working on query execution. I fixed many bugs, added support for ieee754 types, rewrote sort and maintained the sort and bitmap index row sources. The people there were great and I learned a lot but I did not enjoy the code base and left for a startup. I never used Oracle in production and don't consider myself as part of the Oracle community.

I led the MySQL engineering teams at Google for 4 years and at Facebook/Meta for 10 years. I was very much immersed in production and have been active in the community since 2006. The MySQL teams got much done at both Google (GTID, semi-sync, crash-safe replication, rewrote the InnoDB rw lock) and Facebook/Meta (MyRocks and too many other things to mention). Over the years at FB/Meta my job duties got in the way of programming so I used performance testing as a way to remain current. I also filed many bugs and might still be in the top-10 for bug reports. While Oracle has been a great steward for the MySQL project I have been critical about the performance regressions from older MySQL to newer MySQL. I hope that eventually stops because it will become a big problem.

I contributed some code to RocksDB, mostly for monitoring. I spent much more time doing performance QA for it, and filing a few bugs. I am definitely in the community.

I don't use Postgres in production but have spent much time doing performance QA for it over the past ~10 years. A small part of that was done while at Meta, I had a business case, and was able to use some of their HW and my time. But most of this has been a volunteer effort -- more than 100 hours of my time and 10,000+ hours of server time. Some of those server hours are in public clouds (Google, Hetzner) so I am also spending a bit on this. I found a few performance bugs. I have not found large performance regressions over time which is impressive. I have met many of the contributors working on the bits I care about, and that has been a nice benefit.

I used to be a member of the MongoDB community. Like Postgres, I never supported it in production but I spent much time doing performance QA with it. I wrote mostly positive blog posts, filed more than a few bugs and even won the William Zola Community Award. But I am busy enough with MySQL, Postgres and RocksDB so I haven't tried to use it for years. Regardless, I continue to be impressed by how fast they pay down tech debt, with one exception (no cost-based optimizer).

Saturday, February 15, 2025

Vector indexes, large server, dbpedia-openai dataset: MariaDB, Qdrant and pgvector

My previous post has results for MariaDB and pgvector on the dbpedia-openai dataset. This post adds results from Qdrant. This uses ann-benchmarks to compare MariaDB, Qdrant and Postgres (pgvector) with a larger dataset, dbpedia-openai at 500k rows. The dataset has 1536 dimensions and uses angular (cosine) as the distance metric. This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.

tl;dr

I am new to Qdrant so the chances that I made a mistake are larger than for MariaDB or Postgres
If you already run MariaDB or Postgres then I suggest you also use them for vector indexes
MariaDB usually gets ~2X more QPS than pgvector and ~1.5X more than Qdrant

Editorial

I have a bias. I am skeptical that you should deploy a new DBMS to support but one datatype (vectors) unless either you have no other DBMS in production or your production DBMS does not support vector indexing.

Production is expensive -- you have to worry about security, backups, operational support
A new DBMS is expensive -- you have to spend time to learn how to use it

My initial response to Qdrant is that the new developer experience isn't very good. This can be fixed, but right now the product is complicated (has many features), configuration is complicated (also true for the DBMS I know, but I already paid that price), and the cognitive load is large. Just one example of the cognitive load is the need to learn the names that Qdrant uses for things that already have well-known names in the SQL DBMS world.

Deploying Qdrant

The more DBMS you include in one benchmark, the more likely you are to make a mistake because you lack expertise in all of those DBMS. I will soon learn whether I made a mistake here but I made a good faith effort to get good results from Qdrant.

I first tried to compile from source. But that failed. The docs state that "The current list of required libraries can be found in the Dockerfile" and while I was able to figure that out, I prefer that they just list the dependencies. Alas, my attempts to compile from source failed with error messages about problems with (gRPC) protocol definitions.

So I decided to try the Docker container they provide. I ended up not changing the Qdrant configuration provided in the Docker container. I spent some time doing performance debugging and didn't see anything to indicate that a config change was needed. For example, I didn't see disk IO during queries. But the performance debugging was harder because that Docker container image doesn't come with my favorite debug tools installed. Some of the tools were easy to install, others (perf) were not.

Benchmark

This post has much more detail about my approach in general. I ran the benchmark for 1 session. I use ann-benchmarks via my fork of a fork of a fork at this commit.

The ann-benchmarks config files are here for MariaDB, Postgres and Qdrant. For Postgres I use the values for M and ef_construction. But MariaDB doesn't support ef_construction so I only specify the M values. While pgvector requires ef_construction to be >= 2*M, I do not know whether Qdrant has a similar requirement. Regardless I only test cases where that constraint is true.
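The constraint is easy to apply when generating the (M, ef_construction) combinations to test. The Python sketch below assumes the M and ef_construction values that appear in the pgvector create-index tables elsewhere on this page; whether the linked config files list exactly this grid is an assumption, and the names are hypothetical.

```python
# Values taken from the pgvector create-index tables on this page.
M_VALUES = [8, 16, 32, 48, 64]
EF_CONSTRUCTION_VALUES = [32, 64, 96, 192, 256]


def valid_pairs(m_values, ef_values):
    """Return (M, ef_construction) pairs that satisfy ef_construction >= 2*M."""
    return [(m, ef) for ef in ef_values for m in m_values if ef >= 2 * m]


pairs = valid_pairs(M_VALUES, EF_CONSTRUCTION_VALUES)
print(len(pairs), "combinations")
```

For example, M=64 with ef_construction=96 is dropped because 96 < 128, while M=48 with ef_construction=96 is kept.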

Some quantization was used

MariaDB uses 16-bit integers rather than float32
pgvector uses float32, pgvector halfvec uses float16
For Qdrant I used none (float32) and scalar (int8)
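To show what these choices mean for storage, the Python sketch below computes the raw bytes needed per 1536-dimension vector for each representation and shows the rounding that float16 introduces. It only counts the vector payload and ignores graph links, page headers and any other per-index overhead, so it is not a prediction of index size. The names are hypothetical and nothing here describes how a given DBMS does the conversion.

```python
import struct

DIMENSIONS = 1536  # dimensions in the dbpedia-openai embeddings

# struct format codes: f=float32, e=float16, h=int16, b=int8
FORMATS = {"float32": "f", "float16": "e", "int16": "h", "int8": "b"}


def bytes_per_vector(kind, dims=DIMENSIONS):
    """Raw payload size in bytes for one vector stored as `kind`."""
    return struct.calcsize("<" + FORMATS[kind]) * dims


def float16_round_trip(x):
    """Store x as float16 and read it back, to show the precision loss."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


for kind in FORMATS:
    print(kind, bytes_per_vector(kind))
print(0.1, "->", float16_round_trip(0.1))
```

The 16-bit formats halve the payload relative to float32 and int8 halves it again, at the cost of some precision per dimension.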

I used a larger server (Hetzner ax162-s) with 48 cores, 128G of RAM, Ubuntu 22.04 and HW RAID 10 using 2 NVMe devices. The database configuration files are here for MariaDB and for Postgres. Output from the benchmark is here.

I had ps and vmstat running during the benchmark and confirmed there weren't storage reads as the table and index were cached by MariaDB and Postgres.

The command lines to run the benchmark using my helper scripts are:

bash rall.batch.sh v1 dbpedia-openai-500k-angular c32r128

Results: QPS vs recall

These charts show the best QPS for a given recall. MariaDB gets more QPS than Qdrant and pgvector but that is harder to see as the recall approaches 1, so the next section has a table for best QPS per DBMS at a given recall.

Results: create index

The database configs for Postgres and MariaDB are shared above, and parallel index create is disabled by the config for Postgres and not supported yet by MariaDB. The summary is:

index sizes are similar between MariaDB and pgvector with halfvec
time to create the index varies a lot and it is better to consider this in the context of recall which is done in the next section. But Qdrant creates indexes a lot faster than MariaDB or pgvector.
I did not find an accurate way to determine index size for Qdrant. There is a default method in ann-benchmarks that a DBMS can override. The default just compares process RSS before and after creating an index which isn't accurate for small indexes. The MariaDB and Postgres code override the default and query the data dictionary to get a more accurate estimate.

The max time to create an index for MariaDB and Postgres exceeds 10,000 seconds on this dataset when M and/or ef_construction are large. The max time for Qdrant was <= 400 seconds for no quantization and <= 300 seconds for scalar quantization. This is excellent. But I wonder if things Qdrant does (or doesn't do) to save time during create index contribute to making queries slower because MariaDB has much better QPS.

More details on index size and index create time for MariaDB and Postgres are in my previous post.

Results: best QPS for a given recall

Summary

Qdrant with scalar quantization does not get a result for recall=1.0 for the values of M, ef_construction and ef_search I used
MariaDB usually gets ~2X more QPS than pgvector and ~1.5X more than Qdrant
Index create time was much less for Qdrant (described above)

Legend:

recall, QPS - best QPS at that recall
rel2ma - (QPS for the DBMS on that row / QPS for MariaDB)
m= is the value for M when creating the index
ef_cons= is the value for ef_construction when creating the index
ef_search= is the value for ef_search when running queries
quant= is the quantization used by Qdrant
dbms

MariaDB - MariaDB, there is no option for quantization
PGVector - Postgres with pgvector and float32
PGVector_halfvec - Postgres with pgvector and halfvec (float16)
Qdrant(..., quant=none) - Qdrant with no quantization
Qdrant(..., quant=scalar) - Qdrant with scalar quantization

MariaDB gets more QPS than a DBMS when rel2ma is less than 1.0 and when rel2ma is 0.5 then MariaDB gets 2X more QPS. Below, the rel2ma values are always much less than 1.0 except in the first group of results for recall = 1.0.
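As a quick check on how rel2ma is derived, the Python sketch below recomputes it for the recall >= 0.99 group using the QPS values from that table. The function and variable names are made up for this example.

```python
def rel2ma(qps, mariadb_qps):
    """QPS relative to MariaDB, rounded to two digits like the tables."""
    return round(qps / mariadb_qps, 2)


# QPS values copied from the "Best QPS with recall >= 0.99" group below.
mariadb_qps = 861
others = {
    "PGVector": 370,
    "PGVector_halfvec": 422,
    "Qdrant quant=none": 572,
    "Qdrant quant=scalar": 764,
}

for name, qps in others.items():
    print(name, rel2ma(qps, mariadb_qps))
```

A value of 0.43 for PGVector means that MariaDB gets a bit more than 2X the QPS at that recall target.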

Best QPS with recall = 1.000

recall QPS rel2ma

1.000 18.3 1.00 MariaDB(m=32, ef_search=200)

1.000 49.4 2.70 PGVector(m=64, ef_construct=256, ef_search=400)

1.000 56.4 3.08 PGVector_halfvec(m=64, ef_construct=256, ef_search=400)

1.000 153.9 8.41 Qdrant(m=32, ef_construct=256, quant=none, hnsw_ef=400)

Best QPS with recall >= 0.99

recall QPS rel2ma

0.993 861 1.00 MariaDB(m=24, ef_search=10)

0.991 370 0.43 PGVector(m=16, ef_construct=256, ef_search=80)

0.990 422 0.49 PGVector_halfvec(m=16, ef_construct=192, ef_search=80)

0.990 572 0.66 Qdrant(m=32, ef_construct=256, quant=none, hnsw_ef=40)

0.990 764 0.89 Qdrant(m=48, ef_construct=192, quant=scalar, hnsw_ef=40)

Best QPS with recall >= 0.98

recall QPS rel2ma

0.983 1273 1.00 MariaDB(m=16, ef_search=10)

0.981 492 0.39 PGVector(m=32, ef_construct=192, ef_search=30)

0.982 545 0.43 PGVector_halfvec(m=32, ef_construct=192, ef_search=30)

0.981 713 0.56 Qdrant(m=16, ef_construct=192, quant=none, hnsw_ef=40)

0.980 895 0.70 Qdrant(m=16, ef_construct=256, quant=scalar, hnsw_ef=40)

Best QPS with recall >= 0.97

recall QPS rel2ma

0.983 1273 1.00 MariaDB(m=16, ef_search=10)

0.971 635 0.50 PGVector(m=32, ef_construct=192, ef_search=20)

0.971 724 0.57 PGVector_halfvec(m=32, ef_construct=192, ef_search=20)

0.972 782 0.61 Qdrant(m=16, ef_construct=192, quant=none, hnsw_ef=30)

0.970 982 0.77 Qdrant(m=16, ef_construct=192, quant=scalar, hnsw_ef=30)

Best QPS with recall >= 0.96

recall QPS rel2ma

0.969 1602 1.00 MariaDB(m=12, ef_search=10)

0.965 762 0.48 PGVector(m=16, ef_construct=192, ef_search=30)

0.964 835 0.52 PGVector_halfvec(m=16, ef_construct=192, ef_search=30)

0.963 811 0.51 Qdrant(m=16, ef_construct=96, quant=none, hnsw_ef=30)

0.961 996 0.62 Qdrant(m=16, ef_construct=96, quant=scalar, hnsw_ef=30)

Best QPS with recall >= 0.95

recall QPS rel2ma

0.969 1602 1.00 MariaDB(m=12, ef_search=10)

0.954 802 0.50 PGVector(m=16, ef_construct=96, ef_search=30)

0.955 880 0.55 PGVector_halfvec(m=16, ef_construct=96, ef_search=30)

0.954 869 0.54 Qdrant(m=8, ef_construct=256, quant=none, hnsw_ef=40)

0.950 1060 0.66 Qdrant(m=16, ef_construct=192, quant=scalar, hnsw_ef=20)

Monday, February 10, 2025

Vector indexes, MariaDB & pgvector, large server, dbpedia-openai dataset

This post has results from ann-benchmarks to compare MariaDB and Postgres with a larger dataset, dbpedia-openai at 100k, 500k and 1M rows. It has 1536 dimensions and uses angular (cosine) as the distance metric. By larger I mean by the standards of what is in ann-benchmarks. This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.

tl;dr

Index create time was much less for MariaDB in all cases except the result for recall >= 0.95
For a given recall, MariaDB gets between 2.1X and 2.7X more QPS than Postgres

Benchmark

bash rall.batch.sh v1 dbpedia-openai-500k-angular c32r128

bash rall.batch.sh v1 dbpedia-openai-1000k-angular c32r128

Results: QPS vs recall

These charts show the best QPS for a given recall. MariaDB gets about 2X more QPS than Postgres for a specific recall level.

With 100k rows

With 500k rows

With 1M rows

Results: create index

The database configs for Postgres and MariaDB are shared above, and parallel index create is disabled by the config for Postgres and not supported yet by MariaDB. The summary is:

index sizes are similar between MariaDB and pgvector with halfvec
time to create the index varies a lot and it is better to consider this in the context of recall which is done in the next section

Legend

M - value for M when creating the index
cons - value for ef_construction when creating the index
secs - time in seconds to create the index
size(MB) - index size in MB

Table sizes:

* Postgres is 7734M

* MariaDB is 7856M

             -- pgvector --      -- pgvector --
             -- float32  --      -- halfvec  --
M    cons    secs    size(MB)    secs    size(MB)
8    32       458    7734         402    3867
16   32       720    7734         655    3867
8    64       699    7734         627    3867
16   64      1144    7734        1029    3867
32   64      2033    7734        1880    3867
8    96       934    7734         843    3867
16   96      1537    7734        1382    3867
32   96      2730    7734        2482    3867
48   96      4039    7734        3725    3867
8    192     1606    7734        1409    3867
16   192     2778    7734        2435    3867
32   192     4683    7734        4154    3867
48   192     6830    7734        6106    3867
64   192     8601    7734        7831    3958
8    256     2028    7734        1764    3867
16   256     3609    7734        3151    3867
32   256     5838    7734        5056    3867
48   256     8224    7734        7283    3867
64   256    11031    7734        9931    3957

mariadb

M     secs    size(MB)
4      318    3976
5      372    3976
6      465    3976
8      717    3976
12    1550    3976
16    2887    3976
24    7248    3976
32   14120    3976
48   36697    3980

Results: best QPS for a given recall

Summary

Postgres does not get recall=1.0 for the values of M, ef_construction and ef_search I used
Index create time was much less for MariaDB in all cases except the result for recall >= 0.95
For a given recall target, MariaDB gets between 2.1X and 2.7X more QPS than Postgres

Legend:

recall, QPS - best QPS at that recall
isecs - time to create the index in seconds
m= - value for M when creating the index
ef_cons= - value for ef_construction when creating the index
ef_search= - value for ef_search when running queries

Best QPS with recall >= 1.000, pgvector did not reach the recall target

recall QPS isecs

1.000 20 36697 MariaDB(m=48, ef_search=40)

Best QPS with recall >= 0.99, MariaDB gets >= 2.2X more QPS than Postgres

recall QPS isecs

0.990 287 8224 PGVector(m=48, ef_cons=256, ef_search=40)

0.990 321 7283 PGVector_halfvec(m=48, ef_cons=256, ef_search=40)

0.992 731 7248 MariaDB(m=24, ef_search=10)

Best QPS with recall >= 0.98, MariaDB gets >= 2.7X more QPS than Postgres

recall QPS isecs

0.984 375 4683 PGVector(m=32, ef_cons=192, ef_search=40)

0.984 418 4154 PGVector_halfvec(m=32, ef_cons=192, ef_search=40)

0.981 1130 2887 MariaDB(m=16, ef_search=10)

Best QPS with recall >= 0.97, MariaDB gets >= 2.3X more QPS than Postgres

recall QPS isecs

0.974 440 6830 PGVector(m=48, ef_cons=192, ef_search=20)

0.973 483 6106 PGVector_halfvec(m=48, ef_cons=192, ef_search=20)

0.981 1130 2887 MariaDB(m=16, ef_search=10)

Best QPS with recall >= 0.96, MariaDB gets >= 2.2X more QPS than Postgres

recall QPS isecs

0.962 568 4683 PGVector(m=32, ef_cons=192, ef_search=20)

0.961 635 4154 PGVector_halfvec(m=32, ef_cons=192, ef_search=20)

0.965 1433 1550 MariaDB(m=12, ef_search=10)

Best QPS with recall >= 0.95, MariaDB gets >= 2.1X more QPS

recall QPS isecs

0.953 588 2730 PGVector(m=32, ef_cons=96, ef_search=20)

0.957 662 1382 PGVector_halfvec(m=16, ef_cons=96, ef_search=40)

0.965 1433 1550 MariaDB(m=12, ef_search=10)
