New embedding models and API updates (openai.com)
220 points by Josely on Jan 25, 2024 | 79 comments


To compare with the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard), the new embedding models are on par with open-source embedding models like BAAI/bge-large-en-v1.5 — not a drastic improvement if you're already using those. Obviously, a cost/performance improvement is still good.

I've found evidence that the OpenAI 1536D embeddings are unnecessarily big for 99% of use cases (and now there's a 3072D model?!), so the ability to reduce dimensionality directly from the API is appreciated for the reasons given in this post. Just chopping off dimensions to an arbitrary dimensionality is not a typical dimensionality reduction technique, so this likely requires a special, novel training/alignment technique.

EDIT: Tested the API: it does support reducing to an arbitrary number of dimensions other than the ones noted in the post (even 2D for data viz, though that may not be as useful since the embeddings are normalized).

The embeddings aren't "chopped off": the first components of the embedding do change as dimensionality is reduced, but not by much.
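For comparison, the naive baseline would be plain truncation plus renormalization. A minimal sketch (assuming unit-norm vectors, as the API returns them — the actual API may apply a learned projection instead):

```python
import numpy as np

def shorten(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and renormalize to unit length."""
    v = embedding[:dims]
    return v / np.linalg.norm(v)

# A stand-in for a 1536-d API embedding, normalized to unit length.
rng = np.random.default_rng(0)
full = rng.standard_normal(1536)
full /= np.linalg.norm(full)
short = shorten(full, 256)
```

Under this scheme each kept component's value is unchanged up to the rescaling factor, which matches the "change, but not much" observation above.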


> dimensions to an arbitrary dimensionality is not a typical dimensionality reduction technique so that likely requires a special training/alignment technique that's novel

Very basic techniques (e.g. SuperBit random projection) have been extremely effective with OpenAI embeddings in the past. E.g. all embeddings on findsight.ai are OpenAI Ada embeddings stored as SuperBit signatures with a code length of 10,000 (i.e. 157 integers each), and there's almost no recall loss compared to the full vectors.
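For illustration, the core of such a signature scheme is sign quantization under random projections. A toy sketch (plain sign random projection shown for brevity; SuperBit proper additionally orthogonalizes the hyperplanes in batches, and the dimensions/bit counts here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def signatures(vectors: np.ndarray, n_bits: int = 256) -> np.ndarray:
    """Project onto random hyperplanes and keep only the signs as bits."""
    planes = rng.standard_normal((vectors.shape[-1], n_bits))
    return vectors @ planes > 0          # boolean bit signatures

def bit_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of matching bits; estimates angular similarity."""
    return float(np.mean(a == b))

x = rng.standard_normal(1536)
y = x + 0.1 * rng.standard_normal(1536)  # a near-duplicate of x
z = rng.standard_normal(1536)            # an unrelated vector
sx, sy, sz = signatures(np.stack([x, y, z]))
```

The probability that two vectors agree on a given bit is 1 - θ/π, where θ is the angle between them, so similar vectors share most bits while unrelated ones agree on roughly half.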



How low in dimensions have you been able to go without significant recall loss?


do you construct an index over the superbit signatures to perform approximate search or do you perform exact search?


Parallelised exact search, which is good enough because calculating the hamming similarity is just an XOR+POPCNT per vector component and an addition. But of course you could put this into an HNSW graph for approximate search for >10M vectors. Or do LSH first for even larger data sets.
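The exact-search loop described above can be sketched in a few lines. NumPy has no vectorized POPCNT, so bit counting is emulated with `unpackbits` here; a native implementation would use the XOR+POPCNT instructions directly:

```python
import numpy as np

def pack(bits: np.ndarray) -> np.ndarray:
    """Pack boolean signatures into bytes for compact storage."""
    return np.packbits(bits, axis=-1)

def hamming_distance(query: np.ndarray, db: np.ndarray) -> np.ndarray:
    """XOR the packed signatures, then count differing bits per row."""
    x = np.bitwise_xor(query, db)
    return np.unpackbits(x, axis=-1).sum(axis=-1)

def nearest(query: np.ndarray, db: np.ndarray) -> int:
    """Brute-force exact search: index of the closest signature."""
    return int(np.argmin(hamming_distance(query, db)))

rng = np.random.default_rng(0)
db_bits = rng.integers(0, 2, size=(100, 256)).astype(bool)
db = pack(db_bits)
query = pack(db_bits[42])   # query with a signature already in the db
```

Brute force over packed signatures is memory-bandwidth bound and trivially parallelisable, which is why it stays practical well into the millions of vectors before an HNSW index pays off.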


how does this compare with PCA?


>I've found evidence that the OpenAI 1536D embeddings are unnecessairly big for 99%

Agree. I tried a few 384-dim models and they perform 95%+ as well — definitely not worth the extra space.


Most of the leaderboard has much lower sequence length.


curious what the latency impact is to apply the dimensionality reduction... may hint at the specific technique they use


These benchmarks never tell the full story. Anyone that's using models for real production use cases with complex AI requirements knows OpenAI is still king. GPT-4 in practice is leagues ahead, regardless of what leaderboards show.


Subjectively, OpenAI hasn’t been a leader in the embedding model space the way they have been with text completions.


Your use case isn't complex/nuanced enough then


Could easily be true. It's harder to judge embeddings quality.


Hmm ... nah.

GPT4 is still king but text-embedding-ada-002 is quite bad.


what's been your experience with different open source embedding models vs ada?


Today, we are releasing an updated GPT-4 Turbo preview model, gpt-4-0125-preview. This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task.

The new GPT-4 Turbo is intended to reduce laziness. I'm updating aider's existing laziness benchmark now.

EDIT: Preliminary results are up.

Overall, the new `gpt-4-0125-preview` model does worse on the lazy coding benchmark as compared to the November `gpt-4-1106-preview` model.

https://aider.chat/docs/benchmarks-0125.html


I suspect that this has been underway a long time. GPT-4 today seems as good as GPT-3 was in 2022 for most text generation tasks. Lots of answers seem cached-in-memory in some form, then regurgitated and adjusted for different user queries - recent answers seem to follow a template, as compared to organic generation (which is also computationally expensive).

I have a feeling they're purposely reducing the quality of the models, and possibly even relaunching older models as SOTA to show "progress".


“To answer your question about chatgpts progress it’s necessary to…” (proceeds to list what it’s going to do instead of doing it).

Gosh that gets annoying


Reminds me of Whisper v3 performing far worse than v2 with hallucinations


Thanks! Would be great to get a comparison of the two models.


The ChatGPT-3.5 price reduction seems to be a direct response to Mixtral, which was cheaper (~$0.0019 vs $0.0020 per 1K tokens) and better (https://arena.lmsys.org/) until now.


Anecdotally (N~1) it does seem like the new GPT-4 Turbo is less lazy. I cut out a bunch of my system prompt that was designed to encourage full code gen and re-tried some previous examples: it now works completely fine without all the fluff about how I'll die if you don't complete this code, I'll tip you $200, I'll do anything you want, etc. etc.


My testing on the older `gpt-4-1106-preview` model seemed to show that these sort of "emotional appeals" actually hurt GPT's coding performance. GPT-4 Turbo did 4-12% worse on the benchmarks when similar concepts were added to the prompt.

https://aider.chat/docs/unified-diffs.html


Am I the only one who hasn't had issues with laziness? I use only the API, not ChatGPT. I can't recall a time when the result produced was seriously incomplete.


It's rare, but I have. I switched to using Copilot for code editing. It's never lazy, but it's a little less clever. It's perfectly fine for small edits though.


The embeddings with arbitrary dimensionality and lower cost sound very juicy! Never a word on latency though in any of these press releases, and if I'm building a chatbot or semantic search, it's kinda bad for the UX to be waiting > 2 seconds for something to happen...


Allow me to introduce you to… (40 seconds later) …GPT-4 Turbo.


gpt-4-turbo was supposed to be the speedy model, yet speed isn't mentioned at all. Go figure.


Nice! Just yesterday I was wondering when they were going to release an upgrade for text-embedding-ada-002 which is not that good anymore.

Btw, of all those I tried so far WhereIsAI/UAE-Large-V1 truly excels and is free/open to use, only downside is a small-ish context size.


I wrote HNResumeToJobs.com using PGVector + the OpenAI Ada embedding model, which outputs a vector of size 1536. The new large model outputs a vector of double the size, 3072. I believe PGVector only supports indexing up to 2,000 dimensions, so that will be a problem.

However, I see that they also support shortening embeddings. OpenAI says that text-embedding-3-large shortened to 1536 still outperforms the text-embedding-ada-002 model. So maybe I'll go that route first, and then hope that PGVector starts supporting indexes on vectors larger than 2,000 dimensions.


PGVector supports up to 16k dimensions although I imagine performance will be atrocious.


There’s a small model option too.


Curious what embedding compression technique they're using since it allows for dynamic dimensionality reduction. Or maybe they trained a different projection for each conceivable dim argument?


Probably something similar to https://arxiv.org/abs/2310.07707


Just PCA the thing on a couple billion embeddings, store those coefficients (only ~3000 floats) and in the API, report the n components with the highest variance/eigenvalue?
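That proposal is a few lines of linear algebra: fit once on a large sample, ship only the mean and component matrix, and serve the top-n components on request. A toy sketch with small synthetic data (dimensions chosen arbitrarily):

```python
import numpy as np

def fit_pca(embeddings: np.ndarray):
    """One-time fit: center the data, then SVD. Rows of vt are the
    principal directions, already sorted by explained variance."""
    mean = embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt

def reduce(embedding: np.ndarray, mean: np.ndarray,
           vt: np.ndarray, n: int) -> np.ndarray:
    """Serve time: project onto the top-n components."""
    return (embedding - mean) @ vt[:n].T

rng = np.random.default_rng(1)
data = rng.standard_normal((500, 64))   # stand-in for a big sample
mean, vt = fit_pca(data)
small = reduce(data[0], mean, vt, 8)
```

Keeping all components is lossless (the projection is orthonormal), and each smaller n just drops the lowest-variance directions — consistent with an API that accepts any output dimensionality.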


Probably as simple as training the smaller model to approximate the larger model. Well studied and done via tinylm and minilm.


Honestly the answers to these questions are usually the most obvious thing. It could just be some basic dimensionality reduction technique precomputed for each input/output size combination.

That’s still just a few thousand matrices. I’m sure they can handle the training and distribution of that set.


While the performance gains are exciting - I’m curious when they’ll release multi modal embeddings.

So much information is tied up in tables and images which are so difficult to work with right now. You can hack around it with GPT4V but it’s always going to underperform something that was trained end to end.

I’d also love to see the same support for fine tuning embeddings that they have for their LLM’s. I’m curious how that’d perform over their latest massive model.


> This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task

Well, that's good, because "lazy" literally describes what I've seen GPT-4 doing for the last couple of months. It was a battle to get it to actually complete a task. Let's see how it improves.


This was the biggest win for me. One time I asked ChatGPT if it could "write python to convert a docx to pdf using any python package of your choosing" and its response was a paragraph telling me doing so is difficult, followed by a python function with just an inline comment //implement conversion here


Probably better than hallucinating a wrong answer if it's stumped


No it's not.

It does (did?) this all the time to me generating tests.

I know it _couldn't_ know the right answer because I didn't tell it all of the model and serializer schema's and all that.

I don't care, because I know it is capable of generating very good guesses, and it is able to generate comprehensive tests, and I just need to tweak it to actually run. The problem is you have to shove it to even try.

Refusing to make any attempt makes it useless. If it's truly stumped then it'll be pretty obvious (assuming you aren't using it blindly).


I wonder if information workers are becoming more lazy in the mental context of hearing AI will take their jobs, and I wonder if this ironically ends up polluting the data set.


What do you use GPT4 for?

The reason I ask is that ChatGPT is helping me debug Dockerfiles, and a lot of the time I find it really hard to Google for answers there. Sometimes it waves me away, but usually with prompting for more information to go on first.


I wish the moderation API was available for use outside of sending text to one of the GPTs. It's surprisingly accurate.


Could you elaborate with an example, please?


OpenAI offers an API for checking strings to see if they are likely to contain violent, hate, or sexual content. https://platform.openai.com/docs/guides/moderation

It's free, but you're only allowed to use it for checking ChatGPT inputs and outputs. It would be very useful on internet forums, social media, and the like.


> By default, data sent to the OpenAI API will not be used to train or improve OpenAI models.

That’s some appreciated change from the previous policy, but still it just mentions the API, not the interactive web app.


This has always been like this for the API.


Finally, a reasonable output dim size. 1536 was just nuts, and added 4x the cost of HNSW in RAM compared to something like the e5 small models at 384 dims.


You can always do PCA on your end


PCA doesn’t work at all for embeddings. Maybe in some rare cases, but it’s throwing away knowledge and the loss in accuracy is usually quite drastic.


Really? I'm no specialist in the area I just use embeddings for search at work. The ML guy at work tells me our dinov2 embeddings should PCA quite nicely once we need that.


ML noob here: what's the migration path from a set of embeddings made with model A to model B, if the source isn't available?


Sadly, re-embed.

Or if you're really, really stuck, you could take the source embeddings, some of the sources, get destination new embeddings, then train a small model to update. It's unlikely this will do anything but lower your quality compared to re-embedding everything.

This lock in effect is one reason I've avoided using OpenAI's embedding models; at least if it's open source, you'll be able to embed everything on an open model you have control of. The idea of committing to a large datastore using embedding APIs makes me feel very uncomfortable.
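For illustration, the "small model" mentioned above can be as simple as a least-squares linear map fitted on documents embedded under both models. A toy sketch with synthetic data (real embedding pairs won't be exactly linearly related, which is why quality drops compared to re-embedding):

```python
import numpy as np

def fit_linear_map(old_embs: np.ndarray, new_embs: np.ndarray) -> np.ndarray:
    """Least-squares W such that old_embs @ W ≈ new_embs."""
    w, *_ = np.linalg.lstsq(old_embs, new_embs, rcond=None)
    return w

rng = np.random.default_rng(2)
old = rng.standard_normal((1000, 32))   # embeddings from model A
true_w = rng.standard_normal((32, 16))
new = old @ true_w                      # pretend paired re-embeddings (model B)

w = fit_linear_map(old, new)
approx = old @ w                        # mapped approximations of model B
```

You only need enough paired samples to fit W (here 1000 pairs for a 32x16 map); everything else in the store gets mapped without re-embedding.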


Same technique as for mixed password hashing: store the embedding model name alongside each embedding, and encode and search per stored model, until the cost of computing multiple query embeddings at search time becomes larger than re-embedding the old data in the new format.


Fully rebuild your vector store on the new embeddings.


Rebuild. You should set up your system to handle an arbitrary number of embeddings for a given piece of text. In production that may mean just one, or maybe two while rebuilding.
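A minimal sketch of that shape — keying each document's embeddings by model name so the old and new models coexist while a backfill job re-embeds (class and method names here are illustrative, not from any particular library):

```python
from collections import defaultdict

class EmbeddingStore:
    """Toy store keyed by (doc_id, model) so several embedding models
    can coexist while a re-embedding job backfills the new one."""

    def __init__(self):
        self._vectors = defaultdict(dict)  # doc_id -> {model: vector}

    def put(self, doc_id: str, model: str, vector: list) -> None:
        self._vectors[doc_id][model] = vector

    def get(self, doc_id: str, preferred: str, fallback: str):
        """Serve the new model's vector if backfilled, else the old one."""
        per_doc = self._vectors[doc_id]
        if preferred in per_doc:
            return per_doc[preferred]
        return per_doc.get(fallback)
```

Once the backfill completes, the fallback column can be dropped and the old index torn down.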


It seems the "gpt-3.5-turbo-0125" mentioned in the blog is not available yet through the API as of 01-26 01:18 UTC? Using it resulted in "The model `gpt-3.5-turbo-0125` does not exist or you do not have access to it." It is not mentioned in the API /models list either, although "gpt-4-0125-preview" is.


This is answered in the post you didn't read.

"Next week we are introducing..."


My bad!


Every time they push a new feature I start getting tons of "network error" aborts mid-answer. This is in Germany.

These not only count against the quota, but the app hangs for a minute or longer until it realizes it won't finish the answer.

Right now chat.openai.com won't even load.


I'm really excited to get 3.5 with JSON mode. Trying to get it to consistently generate JSON has been one of my biggest issues. I've been playing with GPT-4-Turbo's JSON mode, and it works so well.


3.5 has JSON mode as of the November release, but only in the November-dated model: gpt-3.5-turbo-1106.

I found it to reliably produce JSON correctly, but I've found 3.5 to be a poor performer at things like entity extraction and following directions compared to other fast models such as claude-instant (though that does not have function calling).


What kind of problems are you facing for entity extraction with 3.5? I am also currently working with 3.5 for entity extraction and entity linking. It is a fun pipeline but curious what issues you ran into?


Is it just me or is the new embeddings model (v3 small) insanely cheap? It's coming out to be ~$0.02/mil tokens (if I'm mathing right), whereas other "embeddings API" services are typically charging at around $0.1/mil tokens
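The arithmetic checks out at those prices (both figures are the thread's numbers, not authoritative pricing):

```python
# Back-of-the-envelope embedding cost comparison.
PRICE_V3_SMALL = 0.02   # $/1M tokens for text-embedding-3-small, per the thread
PRICE_TYPICAL = 0.10    # $/1M tokens for a typical competing embeddings API

def embed_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost to embed `tokens` tokens at a given per-million price."""
    return tokens / 1_000_000 * price_per_million

# Embedding a 500M-token corpus at each price point:
corpus_tokens = 500_000_000
cost_small = embed_cost(corpus_tokens, PRICE_V3_SMALL)
cost_typical = embed_cost(corpus_tokens, PRICE_TYPICAL)
```

At these numbers the new small model comes out 5x cheaper per token.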


cheaper prices -> more usage -> larger batch sizes / better gpu utilization -> lower cost of service


Competition driving lower prices is one of the good things about business microeconomics.


Once competition is minimized and they have monopolistic power, raise prices.


Given that comparable open-source models exist for free, it's near-impossible to exert a monopoly.


I've honestly been surprised by how fast these "AIaaS" companies are competing to bring performance up and prices down. It really feels good to wake up the next morning to find out your stuff is better and cheaper automatically.


Hi


> This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task.

How does one solve for this? Wrangling the prompt with "please don't be lazy", or are there inference tricks like running thru the weights differently/multiple times?


RLHF harder.


Maybe removing the lazy posts from the training data.


[flagged]


Please don't make the thread worse by crossing into off-topic attacks like this.

If you see a post or an account that ought to have been moderated but hasn't been, the likeliest explanation is that we didn't see it. If you want to help, emailing us at hn@ycombinator.com is best.


fyi, they got banned for another comment on this post.


[flagged]


You've repeatedly been using HN for nationalistic/ethnic/racial/religious/whatever battle. That's not allowed here, regardless of who or what you're battling, so I've banned the account.

https://news.ycombinator.com/newsguidelines.html



