To compare with the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard), the new embedding models are on par with open-source embedding models like BAAI/bge-large-en-v1.5, not a drastic improvement if already using them. Obviously, a cost/performance improvement is still good.
I've found evidence that the OpenAI 1536D embeddings are unnecessarily big for 99% of use cases (and now there's a 3072D model?!), so the ability to reduce dimensionality directly from the API is appreciated for the reasons given in this post. Just chopping off dimensions to an arbitrary dimensionality is not a typical dimensionality reduction technique, so it likely requires a special training/alignment technique that's novel.
EDIT: Tested the API: it does support reducing to an arbitrary number of dimensions other than the ones noted in the post (even 2D for data viz, though that may not be as useful since the embeddings are normalized).
The embeddings aren't "chopped off"; the first components of the embedding will change as dimensionality reduces, but not much.
> dimensions to an arbitrary dimensionality is not a typical dimensionality reduction technique so that likely requires a special training/alignment technique that's novel
Very basic techniques (e.g. SuperBit random projection) have been extremely effective with OpenAI embeddings in the past. E.g. all embeddings on findsight.ai are OpenAI Ada embeddings stored as SuperBit signatures with a code length of 10,000 (i.e. 157 integers each), and there's almost no recall loss compared to the full vectors.
Parallelised exact search, which is good enough because calculating the hamming similarity is just an XOR+POPCNT per vector component and an addition. But of course you could put this into an HNSW graph for approximate search for >10M vectors. Or do LSH first for even larger data sets.
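To make the XOR+POPCNT step concrete, here's a minimal sketch of hamming similarity over bit signatures packed into Python ints (the 8-bit toy signatures and function names are illustrative, not from findsight.ai):

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two bit signatures packed as ints."""
    return bin(a ^ b).count("1")  # XOR, then population count

def hamming_similarity(a: int, b: int, code_length: int) -> float:
    """Fraction of matching bits; 1.0 means identical signatures."""
    return 1.0 - hamming_distance(a, b) / code_length

# Two toy 8-bit signatures differing in 2 bit positions
sig_a = 0b10110100
sig_b = 0b10010110
```

With a 10,000-bit code you'd store the signature as ~157 64-bit words and sum the per-word popcounts, which compilers map to the POPCNT instruction.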
These benchmarks never tell the full story. Anyone that's using models for real production use cases with complex AI requirements knows OpenAI is still king. GPT-4 in practice is leagues ahead, regardless of what leaderboards show.
Today, we are releasing an updated GPT-4 Turbo preview model, gpt-4-0125-preview. This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task.
The new GPT-4 Turbo is intended to reduce laziness. I'm updating aider's existing laziness benchmark now.
EDIT: Preliminary results are up.
Overall, the new `gpt-4-0125-preview` model does worse on the lazy coding benchmark compared to the November `gpt-4-1106-preview` model.
I suspect that this has been underway a long time. GPT-4 today seems as good as GPT-3 was in 2022 for most text generation tasks. Lots of answers seem cached-in-memory in some form, then regurgitated and adjusted for different user queries - recent answers seem to follow a template, as compared to organic generation (which is also computationally expensive).
I have a feeling they're purposely reducing the quality of the models, and possibly even relaunching older models as SOTA to show "progress".
ChatGPT-3.5's price reduction seems to be a direct response to Mixtral, which was cheaper (~$0.0019 vs $0.0020 per 1K tokens) and better (https://arena.lmsys.org/) until now.
Anecdotally (N~1), it does seem like the new GPT-4 Turbo is less lazy. I cut out a bunch of my system prompt which was designed to encourage full code gen and re-tried some previous examples: it now works completely fine without all the fluff about how I'll die if you don't complete this code, I'll tip you $200, I'll do anything you want, etc.
My testing on the older `gpt-4-1106-preview` model seemed to show that these sort of "emotional appeals" actually hurt GPT's coding performance. GPT-4 Turbo did 4-12% worse on the benchmarks when similar concepts were added to the prompt.
Am I the only one who hasn't had issues with laziness? I use only the API, not ChatGPT. I can't recall a time when the result produced was seriously incomplete.
It's rare, but I have. I switched to using Copilot for code editing. It's never lazy, but it's a little less clever. It's perfectly fine for small edits though.
The embeddings with arbitrary dimensionality and lower cost sound very juicy! Never a word on latency though in any of these press releases, and if I'm building a chatbot or semantic search, it's kinda bad for the UX to be waiting > 2 seconds for something to happen...
I wrote HNResumeToJobs.com using PGVector + the OpenAI Ada embedding model, which outputs a vector of size 1536. The new large model outputs a vector of double that size: 3072. I believe PGVector only supports indexing vectors of up to 2,000 dimensions, so that will be a problem.
However, I see that they also support shortening embeddings. OpenAI says that text-embedding-3-large shortened to 1536 still outperforms text-embedding-ada-002. So maybe I'll go that route first, and then hope that PGVector begins supporting indexes on vectors larger than 2,000 dimensions.
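Shortening as described works out to truncate-then-renormalize: keep the leading components and rescale back to unit length. A minimal sketch (the function name is mine; the exact procedure OpenAI uses server-side is an assumption):

```python
import math

def shorten_embedding(vec, dims):
    """Keep the first `dims` components, then re-normalize to unit length.

    Assumes the model was trained so that leading components carry most of
    the signal (as OpenAI describes for the text-embedding-3 models).
    """
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy example: a 4D vector shortened to 2D stays unit-length
shortened = shorten_embedding([3.0, 4.0, 0.0, 0.0], 2)
```

Doing this client-side on stored 3072D vectors would let you fit under PGVector's indexing limit without re-calling the API.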
Curious what embedding compression technique they're using since it allows for dynamic dimensionality reduction. Or maybe they trained a different projection for each conceivable dim argument?
Just PCA the thing on a couple billion embeddings, store those coefficients (only ~3000 floats) and in the API, report the n components with the highest variance/eigenvalue?
Honestly the answers to these questions are usually the most obvious thing. It could just be some basic dimensionality reduction technique precomputed for each input/output size combination.
That’s still just a few thousand matrices. I’m sure they can handle the training and distribution of that set.
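The PCA idea above is easy to sketch. Here's a pure-Python power iteration that recovers the top principal component (toy scale; a real pipeline would fit all components once offline, e.g. with a proper eigensolver, and store the projection matrix):

```python
import random

def top_principal_component(data, iters=200):
    """Estimate the top PCA direction of `data` (rows are points) via
    power iteration, applying the covariance matrix implicitly."""
    n, d = len(data), len(data[0])
    # Center the data
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    # Start from a random direction and iterate v <- C v / ||C v||
    rng = random.Random(0)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        # w = C v, with C = X^T X / n, computed as X^T (X v) / n
        proj = [sum(row[j] * v[j] for j in range(d)) for row in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) / n
             for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Toy data whose variance lies almost entirely along the first axis
pts = [[2.0, 0.1], [-2.0, -0.1], [1.0, 0.05], [-1.0, -0.05]]
direction = top_principal_component(pts)
```

To serve an arbitrary `dimensions` argument you'd just keep the first k components of the precomputed projection, so one offline fit covers every output size.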
While the performance gains are exciting, I'm curious when they'll release multimodal embeddings.
So much information is tied up in tables and images which are so difficult to work with right now. You can hack around it with GPT4V but it’s always going to underperform something that was trained end to end.
I’d also love to see the same support for fine-tuning embeddings that they have for their LLMs. I’m curious how that’d perform over their latest massive model.
> This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task
Well, that's good, because "lazy" literally describes what I've seen GPT-4 doing for the last couple of months. It would be a battle to get it to actually complete a task. Let's see how it improves.
This was the biggest win for me. One time I asked ChatGPT if it could "write python to convert a docx to pdf using any python package of your choosing" and its response was a paragraph telling me doing so is difficult, followed by a python function with just an inline comment: # implement conversion here
It does (did?) this all the time to me generating tests.
I know it _couldn't_ know the right answer because I didn't tell it all of the model and serializer schemas and all that.
I don't care, because I know it is capable of generating very good guesses, and it is able to generate comprehensive tests that I just need to tweak to actually run. The problem is you have to prod it into even trying.
Refusing to make any attempt makes it useless. If it's truly stumped then it'll be pretty obvious (assuming you aren't using it blindly).
I wonder if information workers are becoming lazier in the context of hearing AI will take their jobs, and I wonder if this ironically ends up polluting the data set.
The reason I ask is that ChatGPT is helping me debug Dockerfiles, and a lot of the time I find it really hard to Google answers there. Sometimes it waves me away, but usually with prompting for more information to go on first.
It's free, but you're only allowed to use it for checking ChatGPT inputs and outputs. It would be very useful on internet forums, social media, and the like.
Finally, a reasonable output dim size. 1536 was just nuts, and added 4x the cost of HNSW in RAM compared to something like the e5 small models at 384 dims.
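The 4x figure is just linear scaling of raw vector storage; a back-of-envelope sketch (illustrative numbers; HNSW graph links add a roughly constant per-vector overhead on top of this):

```python
def vector_bytes(n_vectors: int, dims: int, bytes_per_float: int = 4) -> int:
    """Raw float32 storage for an in-RAM vector index,
    ignoring HNSW link overhead."""
    return n_vectors * dims * bytes_per_float

# One million vectors: 1536 dims vs a 384-dim e5-small-style model
big = vector_bytes(1_000_000, 1536)    # ~6.1 GB
small = vector_bytes(1_000_000, 384)   # ~1.5 GB
```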
Really? I'm no specialist in the area I just use embeddings for search at work. The ML guy at work tells me our dinov2 embeddings should PCA quite nicely once we need that.
Or if you're really, really stuck, you could take the source embeddings for a sample of your documents, get the new destination embeddings for the same sample, then train a small model to map between them. It's unlikely this will do anything but lower your quality compared to re-embedding everything.
This lock in effect is one reason I've avoided using OpenAI's embedding models; at least if it's open source, you'll be able to embed everything on an open model you have control of. The idea of committing to a large datastore using embedding APIs makes me feel very uncomfortable.
Same technique as for mixed password hashing: store the embedding model name alongside each vector, then encode the query and search once per stored model, until the cost of the multiple embeddings at search time becomes larger than re-embedding the old data with the new model.
Rebuild. You should set up your system to support an arbitrary number of embeddings for a given piece of text. In production that may mean just one, or maybe two while rebuilding.
It seems the "gpt-3.5-turbo-0125" mentioned in the blog is not available yet through the API as of 01-26 01:18 UTC? Using it results in "The model `gpt-3.5-turbo-0125` does not exist or you do not have access to it." It is not listed by the /models API endpoint either, although "gpt-4-0125-preview" is.
I'm really excited to get 3.5 with JSON mode. Trying to get it to consistently generate JSON has been one of my biggest issues. I've been playing with GPT-4-Turbo's JSON mode, and it works so well.
3.5 has JSON mode as of the November release, but only in the November-dated model: gpt-3.5-turbo-1106.
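Opting in is a single field on the chat completions request. A minimal sketch of the request body (the model name and prompts are illustrative; note the prompt must mention JSON somewhere or the API rejects the request):

```python
import json

# Request body for POST https://api.openai.com/v1/chat/completions
payload = {
    "model": "gpt-3.5-turbo-1106",
    "response_format": {"type": "json_object"},  # opts into JSON mode
    "messages": [
        {"role": "system", "content": "Reply with a JSON object."},
        {"role": "user", "content": "Extract the entities from: ..."},
    ],
}
body = json.dumps(payload)
```

JSON mode guarantees syntactically valid JSON, not that the object matches any particular schema, so you still want to validate the shape of what comes back.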
I found it reliably produces correct JSON, but I've found 3.5 to be a poor performer at things like entity extraction and following directions compared to other fast models such as claude-instant (though that doesn't have function calling).
What kind of problems are you facing for entity extraction with 3.5? I am also currently working with 3.5 for entity extraction and entity linking. It is a fun pipeline but curious what issues you ran into?
Is it just me, or is the new embeddings model (v3 small) insanely cheap? It comes out to ~$0.02/mil tokens (if I'm mathing right), whereas other embeddings-API services typically charge around $0.1/mil tokens.
I've honestly been surprised by how fast these "AIaaS" companies are competing to bring performance up and prices down. It really feels good to wake up the next morning to find out your stuff is better and cheaper automatically.
> This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task.
How does one solve for this? Wrangling the prompt with "please don't be lazy", or are there inference tricks like running through the weights differently/multiple times?
Please don't make the thread worse by crossing into off-topic attacks like this.
If you see a post or an account that ought to have been moderated but hasn't been, the likeliest explanation is that we didn't see it. If you want to help, emailing us at hn@ycombinator.com is best.
You've repeatedly been using HN for nationalistic/ethnic/racial/religious/whatever battle. That's not allowed here, regardless of who or what you're battling, so I've banned the account.