New embedding models and API updates (openai.com)
220 points by Josely on Jan 25, 2024 | 79 comments


To compare with the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard), the new embedding models are on par with open-source embedding models like BAAI/bge-large-en-v1.5 — not a drastic improvement if you're already using those. Obviously, a cost/performance improvement is still good.

I've found evidence that the OpenAI 1536D embeddings are unnecessarily big for 99% of use cases (and now there's a 3072D model?!), so the ability to reduce dimensionality directly from the API is appreciated for the reasons given in this post. Just chopping off dimensions to an arbitrary dimensionality is not a typical dimensionality reduction technique, so this likely requires a special, novel training/alignment technique.

EDIT: Tested the API: it does support reducing to an arbitrary number of dimensions other than the ones noted in the post (even 2D for data viz, though that may not be as useful since the embeddings are normalized).

The embeddings aren't "chopped off": the first components of the embedding do change as dimensionality is reduced, but not by much.
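For comparison, the naive baseline would be plain truncation plus renormalization. A minimal sketch (assuming unit-norm vectors, as the API returns them — the actual API may apply a learned projection instead):

```python
import numpy as np

def shorten(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and renormalize to unit length."""
    v = embedding[:dims]
    return v / np.linalg.norm(v)

# A stand-in for a 1536-d API embedding, normalized to unit length.
rng = np.random.default_rng(0)
full = rng.standard_normal(1536)
full /= np.linalg.norm(full)
short = shorten(full, 256)
```

Under this scheme each kept component's value is unchanged up to the rescaling factor, which matches the "change, but not much" observation above.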


> dimensions to an arbitrary dimensionality is not a typical dimensionality reduction technique so that likely requires a special training/alignment technique that's novel

Very basic techniques (e.g. SuperBit random projection) have been extremely effective with OpenAI embeddings in the past. E.g. all embeddings on findsight.ai are OpenAI Ada embeddings stored as SuperBit signatures with a code length of 10,000 (i.e. 157 integers each), and there's almost no recall loss compared to the full vectors.
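For illustration, the core of such a signature scheme is sign quantization under random projections. A toy sketch (plain sign random projection shown for brevity; SuperBit proper additionally orthogonalizes the hyperplanes in batches, and the dimensions/bit counts here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def signatures(vectors: np.ndarray, n_bits: int = 256) -> np.ndarray:
    """Project onto random hyperplanes and keep only the signs as bits."""
    planes = rng.standard_normal((vectors.shape[-1], n_bits))
    return vectors @ planes > 0          # boolean bit signatures

def bit_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of matching bits; estimates angular similarity."""
    return float(np.mean(a == b))

x = rng.standard_normal(1536)
y = x + 0.1 * rng.standard_normal(1536)  # a near-duplicate of x
z = rng.standard_normal(1536)            # an unrelated vector
sx, sy, sz = signatures(np.stack([x, y, z]))
```

The probability that two vectors agree on a given bit is 1 - θ/π, where θ is the angle between them, so similar vectors share most bits while unrelated ones agree on roughly half.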



How low in dimensions have you been able to go without significant recall loss?


do you construct an index over the superbit signatures to perform approximate search or do you perform exact search?


Parallelised exact search, which is good enough because calculating the hamming similarity is just an XOR+POPCNT per vector component and an addition. But of course you could put this into an HNSW graph for approximate search for >10M vectors. Or do LSH first for even larger data sets.
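The exact-search loop described above can be sketched in a few lines. NumPy has no vectorized POPCNT, so bit counting is emulated with `unpackbits` here; a native implementation would use the XOR+POPCNT instructions directly:

```python
import numpy as np

def pack(bits: np.ndarray) -> np.ndarray:
    """Pack boolean signatures into bytes for compact storage."""
    return np.packbits(bits, axis=-1)

def hamming_distance(query: np.ndarray, db: np.ndarray) -> np.ndarray:
    """XOR the packed signatures, then count differing bits per row."""
    x = np.bitwise_xor(query, db)
    return np.unpackbits(x, axis=-1).sum(axis=-1)

def nearest(query: np.ndarray, db: np.ndarray) -> int:
    """Brute-force exact search: index of the closest signature."""
    return int(np.argmin(hamming_distance(query, db)))

rng = np.random.default_rng(0)
db_bits = rng.integers(0, 2, size=(100, 256)).astype(bool)
db = pack(db_bits)
query = pack(db_bits[42])   # query with a signature already in the db
```

Brute force over packed signatures is memory-bandwidth bound and trivially parallelisable, which is why it stays practical well into the millions of vectors before an HNSW index pays off.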


how does this compare with PCA?


>I've found evidence that the OpenAI 1536D embeddings are unnecessairly big for 99%

Agree. I tried a few 384-dim models and they perform 95%+ as well — definitely not worth the extra space.


Most of the leaderboard has much lower sequence length.


curious what the latency impact is to apply the dimensionality reduction... may hint at the specific technique they use


These benchmarks never tell the full story. Anyone that's using models for real production use cases with complex AI requirements knows OpenAI is still king. GPT-4 in practice is leagues ahead, regardless of what leaderboards show.


Subjectively, OpenAI hasn’t been a leader in the embedding model space the way they have been with text completions.


Your use case isn't complex/nuanced enough then


Could easily be true. It's harder to judge embeddings quality.


Hmm ... nah.

GPT4 is still king but text-embedding-ada-002 is quite bad.


what's been your experience with different open source embedding models vs ada?


Today, we are releasing an updated GPT-4 Turbo preview model, gpt-4-0125-preview. This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task.

The new GPT-4 Turbo is intended to reduce laziness. I'm updating aider's existing laziness benchmark now.

EDIT: Preliminary results are up.

Overall, the new `gpt-4-0125-preview` model does worse on the lazy coding benchmark as compared to the November `gpt-4-1106-preview` model.

https://aider.chat/docs/benchmarks-0125.html


I suspect that this has been underway a long time. GPT-4 today seems as good as GPT-3 was in 2022 for most text generation tasks. Lots of answers seem cached-in-memory in some form, then regurgitated and adjusted for different user queries - recent answers seem to follow a template, as compared to organic generation (which is also computationally expensive).

I have a feeling they're purposely reducing the quality of the models, and possibly even relaunching older models as SOTA to show "progress".


“To answer your question about chatgpts progress it’s necessary to…” (proceeds to list what it’s going to do instead of doing it).

Gosh that gets annoying


Reminds me of Whisper v3 performing far worse than v2 with hallucinations


Thanks! Would be great to get a comparison of the two models.


The ChatGPT-3.5 price reduction seems to be a direct response to Mixtral, which was cheaper (~$0.0019 vs $0.0020 per 1K tokens) and better (https://arena.lmsys.org/) until now.


Anecdotally (N~1) it does seem like the new GPT-4 Turbo is less lazy. I cut out a bunch of my system prompt that was designed to encourage full code gen and re-tried some previous examples: it now works completely fine without all the fluff about how I'll die if you don't complete this code, I'll tip you $200, I'll do anything you want, etc. etc.


My testing on the older `gpt-4-1106-preview` model seemed to show that these sort of "emotional appeals" actually hurt GPT's coding performance. GPT-4 Turbo did 4-12% worse on the benchmarks when similar concepts were added to the prompt.

https://aider.chat/docs/unified-diffs.html


Am I the only one who hasn't had issues with laziness? I use only the API, not ChatGPT. I can't recall a time when the result produced was seriously incomplete.


It's rare, but I have. I switched to using Copilot for code editing. It's never lazy, but it's a little less clever. It's perfectly fine for small edits though.


The embeddings with arbitrary dimensionality and lower cost sound very juicy! Never a word on latency though in any of these press releases, and if I'm building a chatbot or semantic search, it's kinda bad for the UX to be waiting > 2 seconds for something to happen...


Allow me to introduce you to… (40 seconds later) …GPT-4 Turbo.


gpt-4-turbo was supposed to be the speedy model, yet speed isn't mentioned at all. Go figure.


Nice! Just yesterday I was wondering when they were going to release an upgrade for text-embedding-ada-002 which is not that good anymore.

Btw, of all those I tried so far WhereIsAI/UAE-Large-V1 truly excels and is free/open to use, only downside is a small-ish context size.


I wrote HNResumeToJobs.com using PGVector + the OpenAI Ada embedding model, which outputs a vector of size 1536. The new large model outputs a vector of double the size, 3072. I believe PGVector only supports indexing up to 2,000 dimensions, so that will be a problem.

However, I see that they also support shortening embeddings. OpenAI says that text-embedding-3-large shortened to 1536 still outperforms the text-embedding-ada-002 model. So maybe I'll go that route first, and then hope that PGVector starts supporting indexes on vectors larger than 2,000 dimensions.


PGVector supports up to 16k dimensions although I imagine performance will be atrocious.


There’s a small model option too.


Curious what embedding compression technique they're using since it allows for dynamic dimensionality reduction. Or maybe they trained a different projection for each conceivable dim argument?


Probably something similar to https://arxiv.org/abs/2310.07707


Just PCA the thing on a couple billion embeddings, store those coefficients (only ~3000 floats) and in the API, report the n components with the highest variance/eigenvalue?
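That proposal is a few lines of linear algebra: fit once on a large sample, ship only the mean and component matrix, and serve the top-n components on request. A toy sketch with small synthetic data (dimensions chosen arbitrarily):

```python
import numpy as np

def fit_pca(embeddings: np.ndarray):
    """One-time fit: center the data, then SVD. Rows of vt are the
    principal directions, already sorted by explained variance."""
    mean = embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt

def reduce(embedding: np.ndarray, mean: np.ndarray,
           vt: np.ndarray, n: int) -> np.ndarray:
    """Serve time: project onto the top-n components."""
    return (embedding - mean) @ vt[:n].T

rng = np.random.default_rng(1)
data = rng.standard_normal((500, 64))   # stand-in for a big sample
mean, vt = fit_pca(data)
small = reduce(data[0], mean, vt, 8)
```

Keeping all components is lossless (the projection is orthonormal), and each smaller n just drops the lowest-variance directions — consistent with an API that accepts any output dimensionality.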


Probably as simple as training the smaller model to approximate the larger model. Well studied and done via tinylm and minilm.


Honestly the answers to these questions are usually the most obvious thing. It could just be some basic dimensionality reduction technique precomputed for each input/output size combination.

That’s still just a few thousand matrices. I’m sure they can handle the training and distribution of that set.


While the performance gains are exciting - I’m curious when they’ll release multi modal embeddings.

So much information is tied up in tables and images which are so difficult to work with right now. You can hack around it with GPT4V but it’s always going to underperform something that was trained end to end.

I’d also love to see the same support for fine tuning embeddings that they have for their LLM’s. I’m curious how that’d perform over their latest massive model.


> This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task

Well, that's good, because "lazy" literally describes what I've seen GPT-4 doing for the last couple of months. It was a battle to get it to actually complete a task. Let's see how it improves.


This was the biggest win for me. One time I asked ChatGPT if it could "write python to convert a docx to pdf using any python package of your choosing" and its response was a paragraph telling me doing so is difficult, followed by a python function with just an inline comment //implement conversion here


Probably better than hallucinating a wrong answer if it's stumped


No it's not.

It does (did?) this all the time to me generating tests.

I know it _couldn't_ know the right answer because I didn't tell it all of the model and serializer schema's and all that.

I don't care, because I know it is capable of generating very good guesses, and it is able to generate comprehensive tests, and I just need to tweak it to actually run. The problem is you have to shove it to even try.

Refusing to make any attempt makes it useless. If it's truly stumped then it'll be pretty obvious (assuming you aren't using it blindly).


I wonder if information workers are becoming more lazy in the mental context of hearing AI will take their jobs, and I wonder if this ironically ends up polluting the data set.


What do you use GPT4 for?

The reason I ask is that ChatGPT is helping me debug Dockerfiles, and a lot of the time I find it really hard to Google for answers there. Sometimes it waves me away, but usually with prompting for more information to go on first.


I wish the moderation API was available for use outside of sending text to one of the GPTs. It's surprisingly accurate.


Could you elaborate with an example, please?


OpenAI offers an API for checking strings to see if they are likely to contain violent, hate, or sexual content. https://platform.openai.com/docs/guides/moderation

It's free, but you're only allowed to use it for checking ChatGPT inputs and outputs. It would be very useful on internet forums, social media, and the like.


> By default, data sent to the OpenAI API will not be used to train or improve OpenAI models.

That’s some appreciated change from the previous policy, but still it just mentions the API, not the interactive web app.


This has always been like this for the API.


Finally, a reasonable output dim size. 1536 was just nuts, and added 4x the cost of HNSW in RAM compared to something like the e5 small models at 384 dims.


You can always do PCA on your end


PCA doesn’t work at all for embeddings. Maybe in some rare cases, but it’s throwing away knowledge and the loss in accuracy is usually quite drastic.


Really? I'm no specialist in the area I just use embeddings for search at work. The ML guy at work tells me our dinov2 embeddings should PCA quite nicely once we need that.


ML noob here: what's the migration path from a set of embeddings made with model A to model B, if the source isn't available?


Sadly, re-embed.

Or if you're really, really stuck, you could take the source embeddings, some of the sources, get destination new embeddings, then train a small model to update. It's unlikely this will do anything but lower your quality compared to re-embedding everything.

This lock in effect is one reason I've avoided using OpenAI's embedding models; at least if it's open source, you'll be able to embed everything on an open model you have control of. The idea of committing to a large datastore using embedding APIs makes me feel very uncomfortable.
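For illustration, the "small model" mentioned above can be as simple as a least-squares linear map fitted on documents embedded under both models. A toy sketch with synthetic data (real embedding pairs won't be exactly linearly related, which is why quality drops compared to re-embedding):

```python
import numpy as np

def fit_linear_map(old_embs: np.ndarray, new_embs: np.ndarray) -> np.ndarray:
    """Least-squares W such that old_embs @ W ≈ new_embs."""
    w, *_ = np.linalg.lstsq(old_embs, new_embs, rcond=None)
    return w

rng = np.random.default_rng(2)
old = rng.standard_normal((1000, 32))   # embeddings from model A
true_w = rng.standard_normal((32, 16))
new = old @ true_w                      # pretend paired re-embeddings (model B)

w = fit_linear_map(old, new)
approx = old @ w                        # mapped approximations of model B
```

You only need enough paired samples to fit W (here 1000 pairs for a 32x16 map); everything else in the store gets mapped without re-embedding.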


Same technique as for mixed password hashing: store the embedding model name alongside each embedding, and encode and search per stored model, until the cost of computing multiple query embeddings at search time becomes larger than re-embedding the old data in the new format.


Fully rebuild your vector store on the new embeddings.


Rebuild. You should set up your system to handle an arbitrary number of embeddings for a given piece of text. In production that may mean just one, or maybe two while rebuilding.
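A minimal sketch of that shape — keying each document's embeddings by model name so the old and new models coexist while a backfill job re-embeds (class and method names here are illustrative, not from any particular library):

```python
from collections import defaultdict

class EmbeddingStore:
    """Toy store keyed by (doc_id, model) so several embedding models
    can coexist while a re-embedding job backfills the new one."""

    def __init__(self):
        self._vectors = defaultdict(dict)  # doc_id -> {model: vector}

    def put(self, doc_id: str, model: str, vector: list) -> None:
        self._vectors[doc_id][model] = vector

    def get(self, doc_id: str, preferred: str, fallback: str):
        """Serve the new model's vector if backfilled, else the old one."""
        per_doc = self._vectors[doc_id]
        if preferred in per_doc:
            return per_doc[preferred]
        return per_doc.get(fallback)
```

Once the backfill completes, the fallback column can be dropped and the old index torn down.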


It seems the "gpt-3.5-turbo-0125" mentioned in the blog is not available yet through the API as of 01-26 01:18 UTC? Using it resulted in "The model `gpt-3.5-turbo-0125` does not exist or you do not have access to it." It is not mentioned in the API /models list either, although "gpt-4-0125-preview" is.


This is answered in the post you didn't read.

"Next week we are introducing..."


My bad!


Every time they push a new feature I start getting tons of "network error" aborts mid-answer. This is in Germany.

These not only count against the quota, but the app hangs for a minute or longer until it realizes it won't finish the answer.

Right now chat.openai.com won't even load.


I'm really excited to get 3.5 with JSON mode. Trying to get it to consistently generate JSON has been one of my biggest issues. I've been playing with GPT-4-Turbo's JSON mode, and it works so well.


3.5 has JSON mode as of the November release, but only in the November-dated model: gpt-3.5-turbo-1106.

I found it to reliably produce JSON correctly, but I've found 3.5 to be a poor performer at things like entity extraction and following directions compared to other fast models such as claude-instant (though that does not have function calling).


What kind of problems are you facing for entity extraction with 3.5? I am also currently working with 3.5 for entity extraction and entity linking. It is a fun pipeline but curious what issues you ran into?


Is it just me or is the new embeddings model (v3 small) insanely cheap? It's coming out to be ~$0.02/mil tokens (if I'm mathing right), whereas other "embeddings API" services are typically charging at around $0.1/mil tokens
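The arithmetic checks out at those prices (both figures are the thread's numbers, not authoritative pricing):

```python
# Back-of-the-envelope embedding cost comparison.
PRICE_V3_SMALL = 0.02   # $/1M tokens for text-embedding-3-small, per the thread
PRICE_TYPICAL = 0.10    # $/1M tokens for a typical competing embeddings API

def embed_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost to embed `tokens` tokens at a given per-million price."""
    return tokens / 1_000_000 * price_per_million

# Embedding a 500M-token corpus at each price point:
corpus_tokens = 500_000_000
cost_small = embed_cost(corpus_tokens, PRICE_V3_SMALL)
cost_typical = embed_cost(corpus_tokens, PRICE_TYPICAL)
```

At these numbers the new small model comes out 5x cheaper per token.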


cheaper prices -> more usage -> larger batch sizes / better gpu utilization -> lower cost of service


Competition driving lower prices is one of the good things about business microeconomics.


Once competition is minimized and they have monopolistic power, raise prices.


Given that comparable open-source models exist for free, it's near-impossible to exert a monopoly.


I've honestly been surprised by how fast these "AIaaS" companies are competing to bring performance up and prices down. It really feels good to wake up the next morning to find out your stuff is better and cheaper automatically.


Hi


> This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task.

How does one solve for this? Wrangling the prompt with "please don't be lazy", or are there inference tricks like running thru the weights differently/multiple times?


RLHF harder.


Maybe removing the lazy posts from the training data.


[flagged]


Please don't make the thread worse by crossing into off-topic attacks like this.

If you see a post or an account that ought to have been moderated but hasn't been, the likeliest explanation is that we didn't see it. If you want to help, emailing us at hn@ycombinator.com is best.


fyi, they got banned for another comment on this post.


[flagged]


You've repeatedly been using HN for nationalistic/ethnic/racial/religious/whatever battle. That's not allowed here, regardless of who or what you're battling, so I've banned the account.

https://news.ycombinator.com/newsguidelines.html



