Hacker News | DeveloperErrata's comments

Not quite: most of the recent work on modern RNNs addresses this exact limitation. For instance, linear attention yields formulations that can be interpreted equivalently as either a parallel operation or a recurrent one. The consequence is that these parallelizable versions of RNNs are often "less expressive per parameter" than their old-school, non-parallelizable RNN counterparts, though you could argue that they make up for it in practice by being more powerful per unit of training compute, thanks to much better training efficiency.
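To make the parallel/recurrent duality concrete, here's a toy sketch in pure Python (no libraries, tiny nested lists standing in for tensors). It shows that causal linear attention can be computed either all at once (quadratic in sequence length, parallelizable like attention) or as an RNN-style recurrence over a fixed-size state matrix, with identical results. This is a simplified illustration of the idea, not any particular paper's formulation:

```python
# Causal linear attention two ways. q, k, v are [T][d] nested lists.

def parallel_linear_attention(q, k, v):
    # o_t = sum_{s<=t} (q_t . k_s) * v_s, computed all at once.
    T, d = len(q), len(q[0])
    out = []
    for t in range(T):
        o = [0.0] * d
        for s in range(t + 1):
            w = sum(q[t][i] * k[s][i] for i in range(d))
            for i in range(d):
                o[i] += w * v[s][i]
        out.append(o)
    return out

def recurrent_linear_attention(q, k, v):
    # Same computation as a recurrence: S_t = S_{t-1} + k_t v_t^T,
    # o_t = q_t S_t. The state S is a fixed d x d matrix, so this
    # runs left-to-right in O(T) like a classic RNN.
    T, d = len(q), len(q[0])
    S = [[0.0] * d for _ in range(d)]
    out = []
    for t in range(T):
        for i in range(d):
            for j in range(d):
                S[i][j] += k[t][i] * v[t][j]
        out.append([sum(q[t][i] * S[i][j] for i in range(d))
                    for j in range(d)])
    return out
```

The fixed-size state is also why these models can be "less expressive per param": everything the recurrence knows about the past must fit in that d x d matrix.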


This was really educational for me; it felt like the perfect level of abstraction for learning a lot about the specifics of LLM architecture without the difficulty of parsing the original papers.


Don't know how Grok is set up, but in earlier models the vision backbone was effectively a separate model trained to convert vision inputs into a tokenized output, where the outputs took the form of "soft tokens" that the main model would treat as input and attend to just like text token inputs. Because they're two separate components, you can modify each somewhat independently. Not sure how things are currently set up, though.
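A rough sketch of the soft-token idea, with made-up names and dimensions (real systems learn the projection end to end; this just shows the shape of the data flow): the vision backbone emits patch embeddings, a projection maps them into the LLM's embedding space, and the result is concatenated with the text-token embeddings so the LLM attends over both uniformly.

```python
# Hypothetical illustration of vision outputs becoming "soft tokens".
import random

D_VISION = 4   # vision encoder output dim (assumption)
D_MODEL = 6    # LLM embedding dim (assumption)

# Stand-in for a learned linear projection from vision space
# into the LLM's embedding space.
random.seed(0)
PROJ = [[random.uniform(-0.1, 0.1) for _ in range(D_MODEL)]
        for _ in range(D_VISION)]

def project_patches(patch_embeddings):
    """Map each vision patch embedding to a soft token in LLM space."""
    return [[sum(p[i] * PROJ[i][j] for i in range(D_VISION))
             for j in range(D_MODEL)]
            for p in patch_embeddings]

def build_input_sequence(text_token_embeddings, patch_embeddings):
    # The LLM attends over soft tokens exactly as it would over text
    # token embeddings: they are simply concatenated into one sequence.
    return project_patches(patch_embeddings) + text_token_embeddings
```

Because the backbone and the projection sit in front of the LLM, each piece can in principle be swapped or retrained somewhat independently, which is the point made above.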


Consider the difference between the requirements to simulate the universe and to simulate a person's experience of the universe. As people in the universe, we wouldn't be able to tell the difference, but the latter would have much lower requirements.


If you were simulating two beings' experiences of the universe and they met and compared notes, you might expect little incongruities where random chance was used in the decisions necessary to generate their perspectives.

Let's say a sparrow falls and a simulated being observes it, but so does a camera, which stores the footage in an unexamined hard drive. Years later another being observes the footage stored by that camera, and up until that point there had been no point in manifesting it in the simulation. Would the wings of the falling sparrow flutter exactly the same way for the two observers? What if they met and discussed it, as part of a deliberate experiment to test the consistency of the universe? Would they agree? Imagine the computational overhead necessary in storing details that you're not sure will ever be useful. The same logic could hold true for a photon generated by an exploding star 12 billion years ago that zipped across the expanding universe and eventually interacted with some silicon in a CCD in an astronomer's camera.


Humans already have differing memories of the same events. No low-resolution simulations needed for that to occur.


A properly set up experiment could distinguish signal from the noise of normal memory variation.


Indeed, you only have to simulate what one person perceives, where they explore, and so on. One person's conscious experience should, in theory, be fully simulable with very little information relative to the complexity of the underlying matter. You just have to simulate the senses and model the universe around a person.

And why not do that as an experiment? If science/experimentation is a useful thing - why not have the main reality and lots of individual experiments?


Trueish - for orgs that can't use API models for regulatory or security reasons, or that just need really efficient high throughput models, setting up your own infra for long context models can still be pretty complicated and expensive. Careful chunking and thoughtful design of the RAG system often still matters a lot in that context.


Increasingly so. Many other popular inference tools in this space also expose an OpenAI-compatible API: vLLM, llama.cpp, and LiteLLM all do.
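The practical upshot is that one client can target any of these backends by changing only the base URL (and perhaps the model name). A small sketch, with placeholder URLs and model names, that builds an OpenAI-style chat-completions request without sending it:

```python
# Because vLLM, llama.cpp's server, and LiteLLM all speak the OpenAI
# chat-completions wire format, the same request body works against
# any of them. URLs and model names below are placeholders.
import json

def chat_request(base_url, model, user_message):
    """Build the endpoint and JSON body for an OpenAI-style chat call."""
    endpoint = f"{base_url}/v1/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return endpoint, json.dumps(body)

# The same function targets different backends unchanged:
vllm_req = chat_request("http://localhost:8000", "meta-llama/Llama-3-8B", "hi")
llamacpp_req = chat_request("http://localhost:8080", "local-model", "hi")
```

Swapping backends then becomes a config change rather than a code change, which is much of the appeal of the shared API.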


Seems like this would (eventually) be big for VR applications, especially if the avatar could be animated using sensors installed on the headset so that the expressions match the headset user's. Reminds me of the Metaverse demo with Zuckerberg and Lex Fridman.


MacBook Pros with M3 chips and unified memory (shared RAM/VRAM) can run 70B models :)
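The back-of-envelope arithmetic for why this works (an assumption-laden sketch: it counts weights only and ignores KV cache and activation overhead) is that quantization shrinks the footprint from fp16's 2 bytes per parameter down to roughly half a byte:

```python
# Rough memory footprint of model weights at a given quantization.
def model_memory_gb(n_params_billion, bits_per_param):
    bytes_per_param = bits_per_param / 8
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

fp16_gb = model_memory_gb(70, 16)  # ~130 GB: too big for a laptop
q4_gb = model_memory_gb(70, 4)     # ~33 GB: fits in 64 GB unified memory
```

Because Apple Silicon's unified memory is addressable by the GPU, a 4-bit 70B model can sit entirely in it, which is what makes local inference feasible.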


I want to plug the Little Big Planet series of games, it's what got me into programming when I was young and I think it still has a lot of charm


Seems neat - I'm not sure if you do anything like this, but one thing that would be useful with RAG apps (especially at big scales) is vector-based search over cache contents. What I mean is that users can phrase the same question (which has the same answer) in tons of different ways. If I could pass a raw user query into your cache and get back the end result for a previously computed query (even if the current phrasing is a bit different from the cached phrasing), then not only would I avoid having to submit a new OpenAI call, but I could also avoid having to run my entire RAG pipeline. So kind of like a "meta-RAG" system that avoids having to run the actual RAG system for queries that are sufficiently similar to a cached query, or like an "approximate" cache.
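The idea above can be sketched as a cache keyed on embeddings rather than exact strings. The `embed()` here is a toy bag-of-words stand-in purely for illustration; a real system would use a sentence-embedding model and a vector index:

```python
# Minimal sketch of an "approximate" cache over query embeddings.
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words counts (stand-in for a real model).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ApproximateCache:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # (embedding, query, cached RAG result)

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[2]  # cache hit: skip the whole RAG pipeline
        return None        # miss: run RAG, then put() the result

    def put(self, query, result):
        self.entries.append((embed(query), query, result))
```

On a hit, both the OpenAI call and the retrieval pipeline are skipped, which is the "meta-RAG" saving described above; the threshold choice is exactly where the false-positive risk discussed downthread comes in.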


I was impressed by Upstash's approach to something similar with their "Semantic Cache".

https://github.com/upstash/semantic-cache

  "Semantic Cache is a tool for caching natural text based on semantic similarity. It's ideal for any task that involves querying or retrieving information based on meaning, such as natural language classification or caching AI responses. Two pieces of text can be similar but not identical (e.g., "great places to check out in Spain" vs. "best places to visit in Spain"). Traditional caching doesn't recognize this semantic similarity and misses opportunities for reuse."


I strongly advise not relying on embedding distance alone for it because it'll match these two:

1. great places to check out in Spain

2. great places to check out in northern Spain

Logically the two are not the same, and they could in fact be very different despite their semantic similarity. Your users will be frustrated and will hate you for it. If an LLM validates the two as being the same, then it's fine, but not otherwise.


I agree, a naive approach to approximate caching would probably not work for most use cases.

I'm speculating here, but I wonder if you could use a two-stage pipeline for cache retrieval (kind of like the distance search + reranker model technique used by lots of RAG pipelines). Maybe it would be possible to fine-tune a custom reranker model to only output True if two queries are semantically equivalent rather than just similar. So the hypothetical model would output True for "how to change the oil" vs. "how to replace the oil" but would output False in your Spain example. In this case you'd do distance-based retrieval first using the normal vector DB techniques, and then use your custom reranker to validate that the potential cache hits are actual hits.


Any LLM can output it, but yes, a tuned LLM can benefit from a shorter prompt.


A hybrid search approach might help, like combining vector similarity scores with e.g. BM25 scores.

Shameless plug (FOSS): https://github.com/jankovicsandras/plpgsql_bm25 Okapi BM25 search implemented in PL/pgSQL for Postgres.
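A toy illustration of the hybrid idea (not the linked PL/pgSQL implementation): a minimal Okapi BM25 scorer over a tiny in-memory corpus, blended with a vector score, so that exact lexical terms like "northern" still move the ranking. Constants are the usual defaults (k1=1.5, b=0.75), not tuned:

```python
# Minimal BM25 plus a naive score blend, for illustration only.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    N = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            s += idf * tf[term] * (k1 + 1) / denom
        scores.append(s)
    return scores

def hybrid_score(vector_score, bm25_score, alpha=0.5):
    # Simple weighted blend; real systems often prefer
    # reciprocal rank fusion over raw score mixing.
    return alpha * vector_score + (1 - alpha) * bm25_score
```

Because BM25 rewards the literal presence of "northern", the two Spain queries that embeddings conflate get pulled apart in the combined score.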


That would totally destroy the user experience. Users change their query so they can get a refined result, not so they get the same tired result.


Even across users it’s a terrible idea.

Even in the simplest of applications where all you're doing is passing "last user query" + "retrieved articles" into OpenAI (and nothing else that differs between users, like previous queries or user data that may be necessary to answer), this will be a bad experience in many cases.

Queries A and B may have similar embeddings (similar topic) and it may be correct to retrieve the same articles for context (which you could cache), but they can still be different questions with different correct answers.


Depends on the scenario. In a threaded query, or multiple queries from the same user - you’d want different outputs. If 20 different users are looking for the same result - a cache would return the right answer immediately for no marginal cost.


That's not the use case of the parent comment:

> for queries that are sufficiently similar


Thanks for the detail! This is a use case we plan to support, and it will be configurable (for when you don’t want it). Some of our customers run into this when different users ask a similar query - “NY-based consumer founders” vs “consumer founders in NY”.

