Hacker News
Do we really need a specialized vector database? (modelz.ai)
147 points by gaocegege on Aug 12, 2023 | hide | past | favorite | 91 comments


It feels like Postgres is eating the world. In a thread a couple of days ago many argued that with the capability of efficiently storing and processing JSON we can do without MongoDB et al. Now we also replace vector databases. There are extensions that provide columnar storage and to be honest I never completely understood the need for specialized time series databases.

It is a bit as if Postgres has become what Oracle wanted to be - a one-stop solution for all your data storage needs.


Postgres adding JSON support destroyed so many marketing campaigns from non-technical leadership: "sick of all the money spent on SQL, we gotta try this nosql thing"


It's not only that postgres supports json (plenty of RDBMS do), it's that postgres has better nosql performance than the databases behind those campaigns.


> (...) and to be honest I never completely understood the need for specialized time series databases.

The need for specialized time series databases is explained by the basis of any software development process: reuse ready-made high-level abstractions instead of having to reinvent the wheel, poorly.

There are a few key features of time series databases that you either go without (e.g., data compression) or have to reinvent the wheel for (querying, resolution-based retention periods, statistical summaries of time periods, etc.).

You might live without those features, but in some applications there's a hefty price tag tied to not having them. If you adopt a time series database, you benefit from all those features without having to implement a single one of them.

Perhaps that has no value to you, but it does for the world.
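The "statistical summary of time periods" feature mentioned above is roughly what TSDBs call downsampling or continuous aggregates. A minimal, hypothetical sketch of what you'd otherwise hand-roll (bucketing raw points into fixed time windows with min/mean/max):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw readings: (unix_timestamp_seconds, value)
readings = [(0, 1.0), (30, 3.0), (65, 2.0), (90, 4.0), (200, 5.0)]

def downsample(points, bucket_seconds):
    """Summarize raw points into fixed-width time buckets (min/mean/max)."""
    buckets = defaultdict(list)
    for ts, value in points:
        # align each timestamp to the start of its bucket
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {"min": min(vs), "mean": mean(vs), "max": max(vs)}
        for start, vs in sorted(buckets.items())
    }

summary = downsample(readings, bucket_seconds=60)
# bucket 0 holds the 0s and 30s readings; bucket 60 holds 65s and 90s
```

A real TSDB does this incrementally on ingest, with retention rules deciding when raw points are dropped in favor of the summaries.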


I think what OP was getting at is why it has to be specialized. Time-series features can be implemented in a general purpose DB engine like postgres or mongo.


They kinda can't though, not without losing a lot of performance and capability.

Time series databases should be viewed in the context of being a fairly niche product however. The archetypal use case is in finance where you have a constant firehose of tick data and trades across a wide array of instruments you need to be able to index instantaneously and aggregate easily. The sheer volume of data can't be overstated here. To make matters worse, the records are often both wide and sparsely populated.

A non-specialized DBMS really doesn't cut it in this case. You need to build the DBMS from the ground up to cater to this sort of use case. Slapping a columnar backing store onto something existing does not cut it.


Time-series features can be implemented in a general purpose DB engine, but the architecture of the database will be limiting for lots of use cases requiring performance, especially for high cardinality datasets.

A columnar database engineered from the ground up for time-series should have better foundations to allow fast ingest (to the tune of millions of rows/sec per server) and also be very efficient for time-based queries via languages such as SQL.

Using QuestDB as an example: data is stored in chronological order and is optimized for sequential ingestion with a timestamp component (re-ordering data on the fly if it arrives out of order). The data is also partitioned by time. The InfluxDB Line Protocol is better suited to streaming-style ingest than transactional inserts via Postgres.
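The two ideas above — keeping rows sorted by timestamp even when they arrive out of order, and partitioning by time — can be sketched in a few lines. This is a toy illustration of the concept, not how QuestDB is actually implemented:

```python
import bisect
from collections import defaultdict

class TimePartitionedStore:
    """Toy ingest buffer: rows kept sorted by timestamp, partitioned by day."""

    def __init__(self):
        self.partitions = defaultdict(list)  # day -> sorted [(ts, row)]

    def ingest(self, ts, row):
        day = ts // 86_400  # partition key: whole days since the epoch
        bisect.insort(self.partitions[day], (ts, row))  # re-order on the fly

store = TimePartitionedStore()
store.ingest(100, "b")
store.ingest(50, "a")      # arrives out of order, lands before "b"
store.ingest(90_000, "c")  # next day -> lands in a separate partition
```

Time-based queries then only need to open the partitions that overlap the requested range, which is where much of the speedup comes from.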


How? I just tried to look for time series in Postgres and it looks like TimescaleDB is the plugin you'd want? At least web search seems to love it. Their landing page is just marketing fluff and says next to nothing about their feature set. In a database that isn't column-oriented, how can they do automatic downsampling? Or do they just not?


You might want to check out their columnar compression & their continuous aggregate blog posts:

https://www.timescale.com/blog/timescaledb-2-3-improving-col...

https://www.timescale.com/blog/massive-scale-for-time-series...


Have used it for finance, can recommend. No need for kdb, at least for now.

Their Slack is very responsive too, top service.


I'm curious whether a dedicated TSDB is worth it for my project. So as not to waste time, could you recommend one that in your opinion would be worth it over just Postgres? I used InfluxDB a couple of years ago but didn't really see the value, even though I loved the product. Should I give it another chance?

Application is storage of a diverse set of (irregular) events. Lifecycle management and archival is more important than compression. Volume is moderate.


I'll spare you my rant on why I don't like InfluxDB lol.

In the end it depends how much data you're dealing with. If you're only ingesting dozens to hundreds of records per second per writer, you might actually like AWS Timestream. If you're dealing with a million+ rows per second from a low number of writers (e.g. scientific data), then you probably want Influx or Timescale. If you want to host it yourself and you have less than 1TB of data then Influx is okay, and it may also be a good option if you want it cloud hosted.

If you need full control, scalable, TB or PB of data, etc. Then Timescale is great but it's definitely more work.


Thanks a lot, that is a really helpful reply. I'm curious about your concerns with InfluxDB. Not to bash them, but it really comes down to trade-offs, and it's always good to know the things they don't emphasize in their marketing.


We used standard Postgres for 200k rows/day (100M rows/year) time series. Worked just fine, even without partitioning. Maybe use partitioning (on time) from the start though, it was a bit tedious to migrate to it later. Now handling 10x these rates, probably good for another 10-100x. We have TimeScaleDB extension enabled now, but not using that many of the features yet, mostly stock Postgres.


There is no need for specialized time-series databases, because normal OLAP databases work better for time-series. I see it from my experience with ClickHouse.

It might sound paradoxical, but specialized time-series databases, such as TimescaleDB or InfluxDB, are not good at time-series, compared to normal (boring) OLAP databases.


I am glad open software is eating the world. GNU/Linux, Python, JS, Postgres


Me too. You know what's another good sign times are changing? A fortune 500 company gave me a Linux laptop to WFH. Even a decade ago this would be extremely improbable.


Surprisingly, even at a relatively big public company, my teams and eng managers get some pressure to get Macs instead of Linux machines. And at some point I thought we were past that...


There's a lot of developers who like macs because they do not value control, access or freedom to use their hardware the way they want.

If it weren't coming at the cost of everyone else in the industry, i wouldn't mind this preference, not everyone has to be a hacker. As is, apple's existence is a net negative.


People's preferences are not holding back the industry. If there were a linux laptop with the same hardware and software quality as the recent MBPs I'd get one in an instant. I run linux servers, I'm not unaware of how to set them up, but I don't want to deal with the quirks of most linux daily drivers, so I don't.

Many linux users seem very bitter that people value Apple's offerings over linux based solutions. That this preference is somehow taking something away from them. I don't think anyone should be looked down upon because they don't value the same things that someone with a hacker/tinkerer mentality values. I think that attitude is partly what puts a lot of people off who are half in/half out of the ecosystem. I personally find it a bit tiring, and in general I'm supportive of open source and related things.


There are very few people who think this way, but this opinion is often projected onto people. Bicyclists are bitter, environmentalists are bitter, Islamists are bitter, Luther was bitter. That kind of projection is just your and their resistance to change.

What is common though is having to use tools you are not used to or tools that are subpar e.g. a Mac. You as a Mac user might call that being bitter. It is like forcing an Excel jockey to only use OpenOffice. Those people turn bitter fast.


Alternative take as a dev that only uses Apple products for work, the local hardware doesn't matter that much. We've almost circled back to dumb terminals with everything running on a network connected server.


That's kind of an expensive dumb terminal ;).


Dumb terminals weren't exactly cheap. The vt220 released in 1983 was about $4000 USD (adjusted for inflation to today).

Source: https://en.wikipedia.org/wiki/VT220


To some extent I agree, but it really depends on what you do and for whom. When I asked why this company decided to swap windows for linux (for certain teams) I was told their tools (terraform, python, ssh via cygwin, helm, kubectl etc) would break very frequently with windows updates. So they had to do everything via ssh which isn't that convenient.

I've used windows machines for years doing the same thing elsewhere and I had none of these problems. I ran cygwin, WSL or a Linux VM and had everything I wanted running locally on a Windows laptop.

Now, having a Linux laptop (based on RHEL 9), many things are really nice. For example: I can run GUI apps via X11 over LAN on my desktop, and I can share many dotfiles from my desktop to have the laptop configured exactly how I like it. But there is a price to pay. Unsurprisingly, having to use Outlook and MS Teams via a browser is a pain.


...and this is why it shouldn't really matter.

In real life, though, having a proper Linux kernel at hand is useful.


For small companies the value of standardizing on MacBooks is high. But a big company should be able to handle two operating systems.


Obviously, IT people would like everybody to just use a single os. Windows, preferably, as they have the best enterprise integration suite.

Either way, it's usually much easier to negotiate hardware in smaller companies: they don't have stock, hardware policies, security software to support, b2b contracts to handle.

It's the bigger guys that want unification.


It could be the battery life aspect of things. Apple Silicon Macs seem to vastly outperform their x86_64 alternatives.

If those same mac-using developers were allowed to use Asahi Linux (or one of the other up-and-coming ones like Fedora on Apple Silicon), it'd be interesting to see how many make that choice.


Wow. Now that's a very special event indeed. Cheers on not having to run a docker or work exclusively from a VM because of office politics.


That's nice. Netflix does the same, and actually goes beyond what I would have expected. You can bring your own laptop (or desktop) or expense one from the company, install an off-the-shelf distro of your choice, and after going through a browser-based registration of your Netflix corp account, you can use it as your daily driver. All internal sites and tools work seamlessly.


is JS eating the world due to being "open" or because we are forced by browsers?

at least until wasm gets more and more traction


You mean, open collaboration from people who care about what they are doing often leads to better projects faster? Careful now, someone might call ya a commie or worse an anarchist.


I am surprised that the "Postgres supports JSON so we don't need NoSQL systems" talking point is so common on Hacker News. MongoDB's marketing department is partially to blame for this because they marketed the negative trade-offs of MongoDB like non-SQL querying and lack of schema as "intentional" features.

It was Google's BigTable paper (2006) and Amazon's Dynamo paper (2007) that led to the "NoSQL revolution" in the late 2000s. The goal was to make it easier to scale a lot of de-normalized data horizontally with the tradeoffs being lack of RDBMS-like querying and schema enforcement.

In hindsight it was bizarre to market a whole category of horizontally scalable data systems based on one of their main negative trade-offs: Lack of SQL querying.

Since then many of these data systems have added SQL-like querying.


JSON in Postgres is less performant than MongoDB, that is what I found. Indices over JSON in Postgres leave a lot to be desired - so much so that it required us to design around it. My rule now is: do not treat JSON in Postgres like MongoDB, otherwise you will have a bad time (e.g. slowness).


If you need to index over data inside JSON, why not pull out the pieces that need to be indexed into normal relational columns?

When I've used Postgres JSON fields in the past, I've used it for unstructured data that doesn't need to be indexed, as if there's anything that does need to be indexed, then it's important enough to extract out to a field for efficiency.
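The "pull indexed fields out into real columns" pattern above can be sketched quickly. This uses SQLite for portability (the same idea applies to Postgres); the table and field names are made up for illustration:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# keep the unstructured blob, but copy the queried field into its own column
conn.execute("CREATE TABLE events (user_id TEXT, payload TEXT)")
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

doc = {"user_id": "u42", "details": {"clicks": 3}}
conn.execute(
    "INSERT INTO events VALUES (?, ?)",
    (doc["user_id"], json.dumps(doc)),  # extract the indexed field at write time
)

# lookups now hit a plain b-tree index instead of scanning JSON
row = conn.execute(
    "SELECT payload FROM events WHERE user_id = ?", ("u42",)
).fetchone()
```

The blob stays available for the unstructured leftovers, while the hot field gets normal relational indexing.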


Yes. We are saying the same thing. Treat json in Postgres as a little blob of unstructured data and do not design things in Postgres as you would in mongodb. Minimize json in Postgres or you will have a bad time.


Mongo costs more than the “slowness” of Postgres’ JSON.


I am surprised there’s no postgres extension to solve this.


No it isn't. It's a transactions-focused DBMS that's no good for analytics. Orders of magnitude slower than available solutions and another order of magnitude or so slower than the academic state of the art, last time I checked (which admittedly was a few years back).


I guess I have never quite understood what is meant by analytics, and feel there is a different definition that is over my head.

I do some “analytics” in a postgres database, with data and metrics stored in the same db, but I guess it’s not a huge amount - amount of data swamps the metrics by orders of magnitude. Seems to work ok for me, queries are 5ms or less, and it is only one thing to learn/deploy/maintain.

It’s on a pretty powerful server, though.


> what is meant by analytics

Disk I/O. Most analytical workloads only need a subset of columns for the tables under query, but Postgres has to read in every column due to how data is organized. Columnar and hybrid columnar can significantly reduce disk I/O by orders of magnitude, which makes a big difference beyond a few dozen gigabytes.

Even if all your data can fit into memory, Postgres will still be slower because it needs to loop over values for aggregations. Columnar databases can use SIMD instructions and better utilize the CPU cache across multiple aggregations in the same query.

Postgres is also built for lookup queries and point updates, so every row needs a separate entry in b-tree indexes. Analytical databases can generally use sparse indexes that resolve to groups of rows, which means indexes take up less space both in memory and on disk, and range queries become more efficient.
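The row-vs-column distinction above can be shown with a toy example: an aggregation over one column only has to touch that column's contiguous buffer, while a row store drags every field of every row through memory (this is also what makes the SIMD vectorization mentioned above possible):

```python
from array import array

n = 1_000
# row-oriented: each row is a full record, all fields interleaved
row_store = [{"id": i, "price": float(i), "note": "x" * 20} for i in range(n)]
# column-oriented: each column is one packed, contiguous buffer
col_store = {
    "id": array("q", range(n)),
    "price": array("d", (float(i) for i in range(n))),
    # the "note" column simply never gets materialized for this query
}

row_total = sum(r["price"] for r in row_store)  # walks whole rows
col_total = sum(col_store["price"])             # walks one packed buffer
```

Same answer, but the columnar scan reads a small fraction of the bytes, which is where the orders-of-magnitude I/O difference on wide tables comes from.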


one of the biggest differences is whether you are doing more reads vs writes. reads want indices, writes don't. reads want denormalization, writes don't. reads often want column major, writes don't.


I think it’s the natural evolution of software development in an age where decoupled architectures make more and more sense. I’ve been around long enough to see a few pendulum swings, but I’m not sure we’re ever going back to a world where the monolith makes sense again.

I know some people will disagree with me, but I think there is a fundamental issue with the OOP line of thought, through no fault of the philosophy itself. It’s simply that people don’t handle complexity well in the real world. The way I see it, you have two modes of operating: the right way, and whatever the heck you’re doing when it’s Thursday afternoon and you’ve slept poorly all week. Sure there are ways around this, but in most cases your code reviewer and quality control aren’t going to be allocated enough time (and neither are you, for that matter) to actually deny things that work. Which is how you end up with 9000 different tables, views, stored procedures and what not in your giant mess of a database.

Which was sort of fine in a world where nobody cared about data security. But it’s almost impossible to build a “monolith” database that lives up to the modern legislation requirements, at least within the EU. So we’re basically living in a world where the way we’ve done databases for the past 50 years, doesn’t work for the business. Unless you separate everything into business related domains in small kingdoms with total sovereignty over who comes and goes. Which is basically what micro services are.

At the same time we’re also living in a world where it’s often cheaper (and performant enough) to put 95% of what you do in a container using volumes for persistence. If SQLite were ready for it, or maybe, if all the ORMs were ready for SQLite, you could probably cover 90% of your database needs with it. Since that’s not the case, Postgres reigns supreme because it’s the next best thing (yes, you’re allowed to think it’s better).

I view vector DBs similarly. We operate almost all of our databases in containers; the one exception is the one service everyone calls, where the performance loss of the container is too much. Well, that’s not entirely true, our data warehouse isn’t operated in containers, at least not by us, but I rarely interact with the BI side of things except for data architecture, so I’m actually not sure how their external partners do Ops. Anyway, I think Postgres will cover 90% of (y)our vector needs, and then we will need dedicated vector databases for those last bits that won’t fit into the “everything” box. I don’t think this is bad either; maybe Oracle wanted it to be theirs, but then Oracle should not have Oracle’s governance. I’m sure they didn’t foresee this future when they bought MySQL though, I know I didn’t. Because who could’ve foreseen that the way we used to do databases would become sort of obsolete because of containers and EU legislation… and other things?


>But it’s almost impossible to build a “monolith” database that lives up to the modern legislation requirements, at least within the EU. So we’re basically living in a world where the way we’ve done databases for the past 50 years, doesn’t work for the business.

Not sure I agree. It seems far simpler to comply with most regulations if you store your data in a centralised data store with good management features.

If each and every microservice handles its own data, how would you implement rules affecting all of them, such as retention periods, deletion requests, permissions, access logs, etc? It's not just that the data is stored all over the place. You may also have to use microservice specific APIs to access it.

If you're referring to laws that require data storage in a particular jurisdiction, it would still be easier to centralise data management in each jurisdiction rather than distribute it across scores of services within each of those jurisdictions.


> Not sure I agree. It seems far simpler to comply with most regulations if you store your data in a centralised data store with good management features.

I don't think we disagree, it's just that we soon won't be allowed to physically store our data on the same servers because the EU deems it too risky. A couple of years ago you could operate 600 energy plants from the same few servers, but by January 2024 that will no longer be legal.

Aside from that, we're now entering a world where you need to comply with regular audits at the data-field level, as well as an explosion of access roles. I'm not sure how you would manage that in a traditional centralized data store - you probably could - but we do it with OPA, and once you do that, there is no difference between using a monolith or microservices because it all goes through the central policies.

You could argue that the energy sector is a bit of a special case, but it's even stricter in anything related to the financial sectors.


I think the idea is, e.g., you can have opaque keys (e.g. UUIDs) referencing a person in all services except one, which holds their name. Destroying that record in one place pseudo-/anonymises the person across the system.
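A minimal sketch of that opaque-key idea, with made-up in-memory "services" standing in for the real data stores:

```python
import uuid

pii_store = {}  # the one service allowed to hold names
orders = []     # any other service: stores opaque keys only

def register(name):
    """Create a person in the PII service, return an opaque key for everyone else."""
    key = str(uuid.uuid4())
    pii_store[key] = {"name": name}
    return key

alice = register("Alice")
orders.append({"customer": alice, "total": 99})

# "right to be forgotten": a single deletion in one place
del pii_store[alice]

# the order record survives intact, but can no longer be linked to a person
```

The audit and deletion surface shrinks to one service, which is the whole point.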


This is a fairly standard approach to solving that issue. There are even some applications built for the purpose of housing that data, encrypting it, maintaining audit and use records, processing DSARs, and interacting with the data in useful ways via API.


> I’m not sure we’re ever going back to a world where the monolith makes sense again.

I agree, because we never left it. People who say otherwise have an extremely distorted view. Most companies are not Google or Meta.


> But it’s almost impossible to build a “monolith” database that lives up to the modern legislation requirements, at least within the EU.

This is an extremely narrow view. There are a zillion uses of databases that are not storing user data from web apps or whatever.


> This is an extremely narrow view. There are a zillion uses of databases that are not storing user data from web apps or whatever.

Who said anything about "user" data?

I work in the energy sector. The GDPR rules are tiny issues compared to the requirements we face to submit ourselves to audits of exactly who accessed exactly what. We're at the point where the role you get when you log on to a system needs to define whether you have access to each column in a row or not, and we need to log it too.

For authorization, policies and logging we have frameworks like OPA, but for the actual data protection... Well...


Does `container` mean `Docker/OCI container`?


Great article, but I'm saddened by their view that C is too hard to work with, so the 2-year-old extension must be rewritten in Rust.

C certainly has its faults, and while I have no real experience with Rust, I'm willing to believe that it's significantly better as a language.

But pgvector, at least from a quick scan, looks like a well-written, easily comprehensible C codebase with decent tests. There are undoubtedly lots of hard problems that the developers have solved in implementing it. Putting time and effort into reimplementing all of that in another language because of an aversion to C feels like a waste of effort that could be put into enhancing the existing extension.

Maybe there's something I'm missing? Is the C implementation not as solid as it looks at first glance?


Rust makes concurrency really easy, at least in comparison to C or C++. It has great cross-platform frameworks available, like Tokio which pgvecto.rs uses, and makes using them safe and straightforward.


C also makes concurrency very easy. So easy the uncareful regularly shoot off both feet. Too easy.


It's hard to get concurrency right in C


I wonder if having that sort of "safe concurrency" causes developers to overuse concurrency and introduce coordination costs.

Do we know that tokio's concurrency strategy is optimal for database access?


Let’s assume that this is true. Then is the solution to go back to C where it is hard to get concurrency running correctly? Seems backwards thinking.


No, the solution is to keep the thing that's running quite well in C unless you can prove the alternative is better?


Starting with a claimed 20x speedup seems like proving it's better? https://modelz.ai/blog/pgvecto-rs


We've gone with Postgres and pgvector, which works really well thus far, especially since we know and like the existing Postgres tooling, and vector functionality is just a (widely supported) pgvector extension away. Since it now also supports HNSW indices it should perform well on large amounts of vectorized data as well (though we haven't tested this yet).

From a consultant's perspective, Postgres as a requirement is a much easier sell than some new, hot but unknown dedicated vector DB.


> HNSW indices it should perform well on large amounts of vectorized data as well (though we haven't tested this yet)

here are a few benchmarks from this week's commit: https://jkatz05.com/post/postgres/pgvector-hnsw-performance/


I have been following the vector database trend since 2020 and I ended up with this conclusion: vector search is a nice-to-have feature that adds more value on top of an existing database (Postgres) or text search service (Elasticsearch) than an entirely new framework full of hidden bugs. You can get a far higher speedup by using the right embedding models and encoding than by using the vector database with the best underlying optimizations. And the bonus is that you are using a battle-tested stack (Postgres, Elasticsearch) vs the new kids (Pinecone, Milvus ...)


The Cassandra project recently[1] added vector search. Relative to a lot of other features, it was fairly simple. A new 'vector' type and an extension of the existing indexing system using the Lucene HNSW library. Now we'll be finding ways to optimize and improve performance with better algorithms and query schemes.

What we won't be doing is figuring out how to scale to petabytes of data distributed across multiple data centers in a massive active-active cluster. We've spent the last 14 years perfecting that, and still have work to do. With the benefit of hindsight, if you have a database that is less than 10 years old, all I have to say is good luck. You have some challenging days ahead.

1. https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30...


If you can do it in GPU memory, do it in GPU memory.

If it takes quantization + buying a 1 TB RAM server ($4k of RAM + parts), do that in memory with the raw tensors and shed a small tear -- both for cost and the joy of the pain that you are saving yourself, your team, and everyone around you.

If you need more, then tread lightly and extremely carefully. Very few mega LLM pretraining datasets are even on this order of magnitude, though some are a few terabytes IIRC. If you are exceeding this, then your business usecase is likely specialized indeed.

This message brought to you by the "cost reduction by not adding dumb complexity" group. I try to maintain a reputation for aggressively fighting unnecessary complexity, which is the true cost measure IMO of any system.


Is there a better article explaining how vector DBs are used with LLMs exactly? This 4 point description raises more questions than it answers for me.

For example what is the difference between a "chunk vector" and a "prompt vector"? Aren't they essentially the same thing (a vector representation of text)?

How do we "search the prompt vector for the most similar chunk vector"? A short code snippet is shown that queries the DB, but it's not shown what is done with what comes back. What format is the output in?

I suspect this works by essentially replacing chunks of input text by shorter chunks of "roughly equivalent" text found in the vector DB and sending that as the prompt instead, but based on this description I can't be sure.


The article is referring to the problem of having a limited context length in LLMs. That is, you can only pass X tokens in the prompt.

For example, let’s say you have a prompt that lets you answer questions about a book. If the book is long enough, you won’t be able to include it as is in the prompt, so you have to figure out what are the most relevant passages you must include to answer a given question. What you usually do is find the passages that are the most semantically similar to your question.

Chunk vectors are the vectorized passages of the book (i.e., a numerical vector that represents a passage), and the prompt vector is usually the vectorized question.

To find the most similar vectors you need a distance measure, cosine similarity being the most popular.

The output of finding the most similar vectors is the vectors plus their metadata (chunk, page, chapter, etc.)
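The retrieval step described above fits in a few lines. A minimal sketch with made-up 3-dimensional "embeddings" (real ones come from a model and have hundreds of dimensions):

```python
import math

# chunk vectors: vectorized passages of the book (values invented for illustration)
chunks = {
    "passage about the whale":   [0.9, 0.1, 0.0],
    "passage about the captain": [0.1, 0.9, 0.1],
}
# prompt vector: the vectorized question
prompt_vector = [0.85, 0.2, 0.05]

def cosine(a, b):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

best_chunk = max(chunks, key=lambda c: cosine(chunks[c], prompt_vector))
# best_chunk (plus its metadata) is then pasted into the LLM prompt as context
```

A vector database does the same comparison, but over millions of chunk vectors with an approximate index (e.g. HNSW) instead of a brute-force loop.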


Thank you for explaining it. I was aware of the problem of too short context length. I'd love to be able to pass an entire programming project with my prompt for example (or a book in your example).

I think I understand now how it works for the many kinds of prompts where the information to be extracted is contained in one (or more) of the chunks of a much bigger whole.

I'm not sure about prompts where it is required to "understand" the entire input to answer properly, for example summarising a book. Although even here vector search could perhaps help, by looking for things near not "please provide a summary" but certain hand-crafted values such as "important to the plot", etc.

I guess I need to do some experimenting with it. I found some open source alternatives to (not at all open) OpenAI in the form of Sentence Transformers to create embeddings.

However, what would be really neat is to have a large open source dataset of embeddings already created on some general purpose collection of texts, to try searches etc.


I built something like this at findsight.ai and gave a talk about it (including a discussion on promoting and chunking) here: https://youtu.be/elNrRU12xRc

There are also open datasets of this, eg https://huggingface.co/datasets/kannada_news for news, or https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/reviews?...


It's pretty trivial to calculate the embeddings if you have a reasonable GPU with CUDA support, even for decent volumes of data. The problem is that each model produces different embeddings, and as soon as a new one is released all your previous datasets aren't that useful.

Also worth mentioning that vector databases are really useful in Retrieval-Augmented Generation - aka find an answer from an existing corpus and utilize it to supplement the LLM.


Check Cohere's embeddings of Wikipedia: https://txt.cohere.com/embedding-archives-wikipedia/


> Search the prompt vector to find the most similar chunk vector.

Does this mean that a question is changed to a similar question? Doesn't it decrease the quality of answers significantly?

If someone asks "who is the best student in California?" will the question be changed to "what is the best school in California?".

This would explain the terrible drops in quality of answers that we saw. The underlying technology is changed for easier scaling, but is much worse.

It's like the current google (which has multiple problems), where it also sometimes doesn't search your keywords - even when they are in quotation marks - because it knows "better"..


Without exception, LLM frontends out there that use vector search simply do a query and hope for the best. The reality is that the answer may not be in the result. It is a hard problem to solve. We are at least 1-2 stepping stones away from having a solution that actually works!


You can do a lot with hybrid approaches, where you mix and match semantic search results with classic text based search. In addition, depending on what kind of knowledge you pull in, the LLM pretraining can often fill the gaps pretty well even if the retrieval augmentation isn't ideal. But yes, it's still pretty fuzzy, though it works more often than not.



>Does this mean that a question is changed to a similar question?

No. It is retrieving the most semantically similar document(s) to work with. For a question about students information about students will be more semantically similar than information about schools.

>It's like the current google

Google has used vector search for years. It's why you can just type questions in and Google will understand what you are talking about.

>doesnt search your keywords - even when they are in quotation marks

Quotation marks work. If there are no results it will show results without the quotes. The mobile site doesn't make this clear when it happens though.


Neon also made a new pg_embedding extension to add to the Postgres mix

https://neon.tech/blog/pg-embedding-extension-for-vector-sea...


Any first hand experience? We are using pg_vector and it works really well


Nope, just saw their blog post about it recently.


There isn't a best universal choice for all situations. If you're already using Postgres and all you want is to add vector search, pgvector might be good enough.
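
For reference, the pgvector workflow is just ordinary SQL; a minimal sketch (the table and column names here are illustrative):

```sql
CREATE EXTENSION vector;

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

-- Nearest neighbors by L2 distance ("<->"; "<=>" is cosine distance)
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```

Vector search is then just another ORDER BY clause, so it composes with normal WHERE filters and joins on your existing tables.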

txtai (https://github.com/neuml/txtai) sets out to be an all-in-one embeddings database. This is more than just being a vector database with semantic search. It can embed text into vectors, run LLM workflows, has components for sparse/keyword indexing and graph based search. It also has a relational layer built-in for metadata filtering.

txtai currently supports SQLite/DuckDB for relational data but can be extended. For example, relational data could be stored in Postgres, sparse/dense vectors in Elasticsearch/Opensearch and graph data in Neo4j.

I believe modular solutions like this, where internal components can be swapped in and out, are the best option, but given I'm the author of txtai, I'm a bit biased. This setup enables the scaling and reliability of existing solutions balanced with someone being able to get started quickly with a POC to evaluate the use case.


Am I reading this right? You’ve written a vector database in python? Are there any performance benchmark comparisons for me to look at?


txtai is in Python but much of the underlying code isn't. For example, PyTorch/Transformers uses C/C++/CUDA for most operations. The approximate nearest neighbor (ANN) libraries for vector indexes are Faiss, Hnswlib and Annoy. They are all C/C++. Much of the sparse vectors code is NumPy and uses Python's array module for storing term vectors.

There are no 3rd party benchmarks for txtai as of now. You would have to compare how it does on your own data to judge.


> For instance, the 7B model requires approximately 10 seconds for inference on 300 Chinese characters on an Nvidia A100 (40GB).

Is that right? That sounds like, 2 OOMs higher than I would've expected. Is he doing something wrong like loading the model from a cold start?


i was _very_ surprised by that number as well. 10s for inference on 300 chars? using an entire A100 GPU? that does not make any sense. OpenAI and other companies would NEVER be able to scale to 100s of millions of users if that was the real number, right?


The article makes the argument that it is easier to query vector spaces when your database already supports them: why use an external vector db when you can use pgvector in postgreSQL ?

That is a fine argument if you don't mind that pgvector is second-to-worst amongst all open-source vector search implementations, and two orders of magnitude slower than the state of the art [1].

The author also makes the argument that traditional DBs are better because they are battle-tested, and then goes and rewrites the pgvector plugin from C to rust.

[1] http://ann-benchmarks.com


Great article. I was recently wondering why vector databases are useful at all. There are cases (LLMs) where they are useful. But apart from that, most of the cases I have encountered require indexing on data types other than vectors. I might be biased due to my experience, but in very typical use cases, e.g. analyzing event streams or building a recommendation engine, you want to filter rows by timestamp and client id, or product category and availability. A vector index is not that useful in those cases.

That's why I think that outside some very specific use cases vector databases are not very useful.


It's also possible to use HNSW indices with Postgres using pg_embedding instead of pgvector: https://neon.tech/blog/pg-embedding-extension-for-vector-sea...

Edit:

Ah I see pgvector will soon support HNSW as well (from 0.5.0): https://github.com/pgvector/pgvector/issues/181#issuecomment...


Author makes a good argument. I wanted to experiment with local do-it-myself in-memory vector embeddings to use with OpenAI APIs in Common Lisp and Swift. Simple to do, and I added short examples to both my Common Lisp and Swift books. I was also experimenting with simply persisting the data stores to SQLite. Anyway, really simple stuff, and fun to play with.
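
The do-it-yourself version really is small. A sketch of an SQLite-backed store with brute-force search, using only the Python standard library (the schema and data are made up):

```python
import array, math, sqlite3

db = sqlite3.connect(":memory:")  # use a file path to persist to disk
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, vec BLOB)")

def put(doc_id, text, vec):
    # Pack the float vector into a blob for storage.
    db.execute("INSERT INTO docs VALUES (?, ?, ?)",
               (doc_id, text, array.array("f", vec).tobytes()))

def search(query, n=3):
    # Brute-force cosine similarity over every stored vector --
    # fine for small collections, no ANN index needed.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    rows = db.execute("SELECT text, vec FROM docs").fetchall()
    scored = [(cos(query, array.array("f", blob)), text) for text, blob in rows]
    return [t for _, t in sorted(scored, reverse=True)[:n]]

# Toy 2-d vectors stand in for real embeddings from an API.
put(1, "hello lisp", [1.0, 0.0])
put(2, "hello swift", [0.0, 1.0])
print(search([0.9, 0.1], n=1))  # ['hello lisp']
```

Swap the hand-made vectors for embeddings from the OpenAI API and this is a complete, if naive, vector store in ~25 lines.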


The work of deciding which words are closely related and which ones are not seems like a hugely difficult problem where there is no single "right" or "wrong". I would think it fits within linguistics.


What if your main data is in mysql?

Lack of consistency is not a big deal, and you often don't have it at scale anyway.


"there is a postgres plugin for that"



