How do you have the “modify LLM state from within” working? I can have it modify my config but I don’t know how to get it to eval and improve arbitrary elisp.
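The shape I have in mind is something like the sketch below: expose a tool whose handler just reads and evals whatever elisp string the model sends back and returns the printed result (or the error) so it can iterate. The function name here is made up, and wiring it into a client's tool-calling support (gptel, llm, etc.) is exactly the part I haven't worked out.

```elisp
;; Hypothetical tool handler: take a string of elisp from the model, eval it,
;; and hand back the printed value (or the error) so the model can revise.
(defun my-eval-elisp-tool (code)
  "Evaluate CODE, a string of one or more elisp forms, and return the result."
  (condition-case err
      ;; Wrap in progn so multi-form snippets work; the second arg to `eval'
      ;; enables lexical binding.
      (format "%S" (eval (read (concat "(progn\n" code "\n)")) t))
    (error (format "error: %S" err))))
```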
Absolutely baffled too. I was expecting them to prefer the vim philosophy of small tools that do one thing well, but no. So you like modal editing? Well, you've got it right there in emacs. Why that, of all the potential gripes you might have with emacs?
I think there’s a weaker claim that holds true: we were able to ignore lots of content based on the superficial (and pay proper attention to work that passed this test) and now we are overwhelmed because everything meets the superficial criteria and we can’t pay proper attention to all of it.
That's what I had in mind! The whole post is a claim that evaluating knowledge work got more expensive because cheaper measures stopped correlating well with quality.
If someone was already evaluating the work output using a metric closer to the underlying quality, then it might not have been a big shift for them (other than having much more work to evaluate).
You could only do that, however, if you were fine with judging the quality of work unfairly, since you'd readily discard quality work based on superficial proxies. Which, admittedly, is done in a lot of cases.
Chess players learned to exploit chess computers’ weaknesses in the beginning too, but they can’t any longer. This version of the robot might not learn continuously, but the next will be better.
I believe there are still some echoes of the concept. Even top engines will play certain grandmaster draw lines unless told more or less explicitly not to. So if you were playing a match against Stockfish you'd want to play the Berlin draw as White every time, for example.
But chess is a turn-based game where there's no deception (in the sense that both players can see all legal moves for both themselves and their opposition at all times), whereas in table tennis, it's in real time, it's fast as hell, the table is small, and the ball can have 2 or 3 different spin types from the same arm/hand/wrist movement, and can land in a number of different spots.
We have confidence in the extra code a compiler generates because it's deterministic. We don't have that confidence with LLMs, whether they're the ones writing the code or the ones reading it.
Interesting that what you're talking about as ASI is "as capable of handling explicit requirements as a human, but faster". Which _is_ better than a human, so fair play, but it's striking that this requirement is less about creativity than we would have thought.
The work where I've done well in my life (smashing deadlines, rescuing projects) has so often come because I've been willing to push back on requirements, even explicitly stated ones. When clients have tried to replace me with a cheaper alternative (and failed), the main difference I notice is that the cheaper person is used to being told exactly what to do.
Maybe this is more anthropomorphising, but I think this pushing back is exactly what the LLMs are doing; we're just expecting a bit too much of them in terms of follow-up, like: "ok, I double-checked and I really am being paid to do things the hard way".
To be fair, there is likely not much training data on the difficult conversations you need to handle in a senior position, pushback being one of them. The trouble for the agents is that it's all post hoc: they explain themselves after the fact, rationalising, rather than asking "help me understand" beforehand.
> I asked an AI agent to solve a programming problem
You're not asking it to solve anything. You provide a prompt and it does autocomplete. The only reason it doesn't run forever is that one of the generated tokens is interpreted as 'done'.
By the same reasoning, human beings are only a bunch of atoms, and the only reason they don't pass through other humans is the electromagnetic repulsion between those atoms.
When your abstraction level is too low, it doesn't explain anything, because the system that is built on it is way too complex.
How the neural networks produce such surprisingly human characteristics is an open question with a ton of research going into it. Explaining this is a bit more than what one smart person can achieve.
I just don't think that's correct. When I ask Claude to solve something for me, it takes a number of actions on my computer which are neither writing text nor interpreting the done token. It executes the build, debugs tests, et cetera. Sometimes it spawns mini-mes when it thinks that would be helpful! I think saying this is all "autocomplete" is a category error, like saying that you shouldn't talk about clicking buttons or running programs because it's all just electrically charged silicon under the hood.
technically, it does all that by outputting text, like `run_shell_command("cargo build")` as part of its response. But you could easily say similar things about humans.
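Concretely, the loop a harness runs is something like the toy below (in elisp, since that's where this thread started; `my-llm-complete` is a made-up stand-in for the real completion API, stubbed so the sketch actually runs): the model only ever emits text, the harness scans the reply for a tool call, executes it, appends the output to the transcript, and loops until a reply arrives with no tool call in it.

```elisp
;; Made-up stand-in for the completion API; a real harness calls a model here.
(defun my-llm-complete (transcript)
  (if (string-match-p "cargo build" transcript)
      "Build looks clean, we're done."
    "Let me check the build first: run_shell_command(\"cargo build\")"))

;; The "autocomplete" loop: request text, run any tool call found in the
;; reply and feed its output back, and stop when a reply contains no tool call.
(defun my-agent-loop (prompt)
  (let ((transcript prompt)
        (done nil))
    (while (not done)
      (let ((reply (my-llm-complete transcript)))
        (if (string-match "run_shell_command(\"\\([^\"]*\\)\")" reply)
            (setq transcript
                  (concat transcript "\n" reply "\n"
                          (shell-command-to-string (match-string 1 reply))))
          (setq transcript (concat transcript "\n" reply)
                done t))))
    transcript))
```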
To me, "autocomplete" seems like it describes the purpose of a system more than how it functions, and these agents clearly aren't designed to autocomplete text to make typing on a phone keyboard a bit faster.
I feel like people compare it to "autocomplete" because autocomplete seems like a trivial, small, mundane thing, and they're trying to make the LLMs feel less impressive. It's a rhetorical trick that is very overused at this point.
When someone asks you a question, in what ways are you not an "autocomplete"?
You aren't aware of how you come up with the words you are saying; you just start talking and the next word somehow falls out of your mouth. Maybe you think before you start talking, but where do the thoughts come from? They just appear in your head. We are just as much predictive machines as LLMs are; the human brain is just fuzzier.
Human minds have the ability to reason and to weigh different sources by their authority. That is why some children are able to obey their parents while ignoring the scammers in TV commercials shouting at them to buy stuff.
We are also able to apply lived experience to our reasoning. That is why we can accurately answer a question about whether to drive or walk to the car wash, or why we can immediately see how many "r"s are in "strawberry".
LLMs, being "glorified autocomplete", don't have a real way to separate truth from lies, or to critically evaluate sources of information. Humans absorb information in various ways: through the "classic five senses" that inform our daily lives and motions, through reading, hearing, and seeing, or by inferring, reasoning, and being "guided by the Spirit" in a more metaphysical way, where LLMs would fail.
Thoughts are derivative of sensory processing. We have subjective experience and subjective feeling; our symbols are grounded in physical reality. LLM "thoughts" are simulacra; manipulating symbols according to rules does not imply understanding. One must be quite derealised to think we are predictive machinery, or that the human brain is just a fuzzier version of one; it is much more than that.
You had literally -zero- input in what your brain gave you as an answer. It just gave you something, you can make up whatever story you want to tell yourself, "it's my favourite movie", "I saw it last week", whatever you want. It doesn't change the fact that the words on your screen triggered some neural pathway in your brain that is totally out of your control and landed on "Titanic".
It's how literally everyone thinks. Your thoughts come unbidden via a process you do not understand and cannot observe and your consciousness follows them along. Your brain is not as special as you imagine.
It's like we have little thinking sub-agents auto-completing cognition tokens in the background that then surface findings to the main agent which then auto-completes some more cognition tokens in the foreground.
> Maybe we should just commit the signature change with a TODO
I'm fascinated that so many folks report this; I've literally never seen it in daily CC use. I can only guess that my habit of starting a new session and getting it to write a plan document before acting ("make a file listing all call sites"; "look at refactoring.md and implement") makes it clear when it's time for exploration and when it's time for action (i.e. when exploring instead of acting would count as failing).
I have only seen "go do X" result in CC adding "TODO: X" to the working file on one occasion. When it happened, I noticed that the file already contained a very similar TODO for a similar action. My guess is that having the whole file in context influenced the agent to produce output similar to what was already there.