I can walk and chew bubble gum at the same time: on one hand, yes, there's certainly a lot of Kool-Aid being drunk by the AI folks. Even on HN, I constantly argue with people who genuinely think LLMs are some kind of magical black box that contains "knowledge" or "intelligence" or "meaning" when in reality it's just a very fancy Markov chain. On the other hand, I think that language interfaces are probably the next big leap in how we interact with our computers. But more to the point of the article:
> To conclude: one must have different standards for developing systems than for testing, deploying, or using systems.
In my opinion, you unfortunately will never (and, in fact, could never) have reliable development and testing standards when designing purely stochastic systems such as large language models. Intuitively, the fact that these are stochastic systems is exactly why we need things like hyper-parameters, fiddling with seeds, and prompt engineering.
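To make that concrete, here's a minimal sketch of why seeds and temperature matter (plain Python; the logits are made up and `sample_next_token` is my own illustration, not any particular library's API):

    import math
    import random

    def sample_next_token(logits, temperature=1.0, seed=None):
        # Softmax over temperature-scaled logits, then one weighted draw.
        # Higher temperature flattens the distribution (more randomness);
        # fixing the seed makes the draw reproducible.
        if seed is not None:
            random.seed(seed)
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        return random.choices(range(len(logits)), weights=probs, k=1)[0]

    logits = [2.0, 1.5, 0.3, -1.0]  # made-up scores for four candidate tokens
    print(sample_next_token(logits, temperature=0.2, seed=42))  # near-greedy, reproducible
    print(sample_next_token(logits, temperature=1.5))           # varies run to run

Pin the seed and you get reproducibility for that one draw; touch the prompt, the temperature, or any hyper-parameter and everything downstream reshuffles, which is exactly why stable test standards are so elusive.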
The Microsoft Research "Sparks of AGI" paper spends 154 pages describing behaviors of GPT-4 that are inconsistent with the understanding of it being a "fancy Markov chain": https://arxiv.org/abs/2303.12712
I expect that the reason people are constantly arguing with you is that your analysis does not explain some easily testable observations, such as how GPT-4 can explain what some non-trivial and unique Python programs would output if they were run, despite GPT-4 not having access to a Python interpreter itself.
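(Not an example from the paper; `collatz_steps` is just my own toy stand-in for the flavor of program I mean, where getting the output right requires simulating the loop rather than pattern-matching:)

    # The answer is not plausibly sitting verbatim in any training set,
    # so a correct prediction means the model actually traced the loop.
    def collatz_steps(n):
        steps = 0
        while n != 1:
            n = n // 2 if n % 2 == 0 else 3 * n + 1
            steps += 1
        return steps

    print(sum(collatz_steps(n) for n in range(1, 20)))  # prints 189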
> non-trivial and unique Python programs would output if they were run, despite GPT-4 not having access to a Python interpreter itself
Trivially explained as "even a broken clock is right twice a day." I skimmed the paper when it was linked here on HN, IIRC. First, it was published by Microsoft, a company that absolutely has a horse in this race (what were they supposed to say? "The AI bot our search engine uses is dumb"?). Second, I was very interested in their methodology, so I fully read the first section, which is woefully hand-wavy, a fact even the authors would concede:
> We acknowledge that this approach is somewhat subjective and informal, and that it may not satisfy the rigorous standards of scientific evaluation.
The paper, for instance, is amazed that GPT knows how to draw a unicorn in TikZ, but we already know it was trained on the Pile, which includes all Stack Exchange websites, which happens to include answers like this one[1]. So to make the argument that it's being creative, when the answer (or, more charitably, something extremely close to it) is literally in the training set, is just disingenuous.
> Trivially explained as "even a broken clock is right twice a day."
Trivial, vacuous, and wrong. It is not plausible to correctly predict the output of a serious Python program by coincidence. Look at the detailed examples in the paper (one of which was pseudocode, not Python) to see how silly this claim sounds.
> First, it was published by Microsoft, a company that absolutely has a horse in this race
Firstly, this suggests that you're not very familiar with Microsoft, nor with academic or industrial research labs. Microsoft Research (the affiliation of the authors) is practically a different company, and is run more like a university lab. It is deeply insulting to their (often still primarily academic) researchers to suggest that they would publish a misleading puff piece to benefit the commercial arm.
Secondly, while the paper describes itself as qualitative, you can reproduce the major claims yourself (and I have).
> familiar with Microsoft, nor with academic or industrial research labs
The idea that Microsoft Research would publish anything remotely damaging to Microsoft is beyond naïve. I mean, one of their core tenets is "Ensure that Microsoft products have a future," but okay.
You're one of the people who, in some potential future, will be yelling to everyone else: "We're only seemingly oppressed! It's just a parlor trick that they've turned most of humanity into paperclips; sci-fi authors wrote about this already!"
I don't think GPT-4 is magic, but unless the unicorn is literally an exact replica of something from its training set, it clearly has "knowledge", and it's weird that you'd try to deny that.
Do you think the card catalog down at the local library is sentient?
Maybe that's not enough data, though!
Is the card catalog for NY Public Library sentient?
Maybe that's still too local.
Is Google sentient?
Everyone with a clue would admit these are all examples of "knowledge".
It's the same parlor trick, with fancier algos. It's not intelligence. It won't produce a human-level AI.
Period.
It's not the amount of data, it's what it does with that data. Try typing "Tell me a story about a unicorn arguing with people on Hacker News" into a card catalog and tell me how good it is at storytelling. Typing that into GPT-4 might not win any literary awards, but it obviously understands what you meant and does a passable job.
> but unless the unicorn is literally an exact replica of something from its training set, it clearly has "knowledge", and it's weird that you'd try to deny that
Speaking of weird, that's a very weird definition of knowledge. Because in that case, almost any data transformation operation implies knowledge. MS Word thesaurus? Knowledge. Search and replace? Knowledge. Markov chain[1]? Knowledge.
Knowledge is one of those vague words, like consciousness, that has people arguing past one another. However, yes, an interactive thesaurus has "knowledge" of a very limited sort. And LLMs have much more "knowledge", and are able to synthesize novel things from that knowledge.
You can argue "it's only seemingly got knowledge!" all you want, as everyone else enjoys increasingly capable AI.
Good point. While it seems obvious to me that LLMs can never be anything more than fancy Markov chains, in my experience the majority of human "logic" does not operate much differently. It's very rare to encounter someone who is able to think or speak critically. Most regurgitate canned responses based on keywords.
I'm gonna respond to you, because I think you like GPT-4 and I do too (even if the only use I trust for now is "Summarize this **lot of text/research article** in less than 200 words", which is already great for a knowledge hoarder like me).
You can think against yourself; an LLM has trouble doing so. Also, they fail spectacularly when asked to do real-life operations: "I have to buy two baguettes at one euro each, then five chocolatines, croissants, and raisin breads at 1.40, 1.20, and 1.60 respectively; how much should I take with me?" In my head, I know within seconds that it'll be between 20 and 25 (and in fact it's 23; I took random numbers, but they are quite easy to add).
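For reference, the arithmetic from my example, spelled out (prices in integer cents to keep floating point out of it):

    # Two baguettes at 1.00 each; five each of chocolatines, croissants,
    # and raisin breads at 1.40, 1.20, and 1.60 respectively.
    total_cents = 2 * 100 + 5 * (140 + 120 + 160)
    print(total_cents / 100)  # 23.0 euros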
> You should take 23 euros with you to purchase all the items.
Are you sure you're using GPT-4 and not 3.5? GPT-4 is incomparably more competent than GPT-3.5 on logical tasks like this (trust me, I've had it solve much more complicated questions than this), and you aren't using GPT-4 on chat.openai.com unless you're paying for it and deliberately picking it when creating a new chat.
Edit: Here's an example of a more complicated question that GPT-4 answered correctly on the first try: https://i.imgur.com/JMC7jsw.png
Funnily enough, this was also a problem that a friend posed to me while trying to challenge the reasoning ability of GPT-4. As you can see (cross-reference it if you like), it nailed the answer.
The rare humans who don't speak any language (or animals, for that matter) can still think, which shows that thought is more than manipulating language constructs.
Well, for one, humans are obviously at least more than a fancy Markov chain because we have genetically hard-wired instincts, so we are, in some sense, "hard-coded" if you forgive my programming metaphor. Hard-coded to breed, multiply, care for our young, seek shelter, among many other things.
Markov chains, like any algorithm, are hard-coded. And just as evolution hard-codes our genes, supervised learning (and in the future reinforcement learning) hard-codes LLMs and other AI models.
>contain "knowledge" or "intelligence" or "meaning" when in reality, it's just a very fancy Markov chain
These are not mutually exclusive. If you have a Markov chain that 100% of the time outputs "A cat is an animal", then it has knowledge that a cat is an animal.
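A minimal sketch of that degenerate chain (the transition table is purely my illustration): every word has exactly one successor with probability 1, so the chain reproduces the sentence every time.

    import random

    # Each state maps to its successors and their probabilities; here
    # every distribution is degenerate (a single successor at 1.0).
    chain = {
        "<start>": {"A": 1.0},
        "A": {"cat": 1.0},
        "cat": {"is": 1.0},
        "is": {"an": 1.0},
        "an": {"animal": 1.0},
        "animal": {"<end>": 1.0},
    }

    def generate(chain):
        word, out = "<start>", []
        while word != "<end>":
            nxt = random.choices(list(chain[word]), weights=list(chain[word].values()), k=1)[0]
            if nxt != "<end>":
                out.append(nxt)
            word = nxt
        return " ".join(out)

    print(generate(chain))  # always "A cat is an animal"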
Knowledge is awareness of information. "Awareness" is a quagmire because a lot of people believe that 'true' awareness requires possessing a sort of soul which machines can't possess.
I think the important part is information; the matter of 'awareness' can simply be ignored as a philosophical/religious disagreement which will never be resolved. What's important is: Does the system contain information? Can it reliably convey that information? In which ways can it manipulate that information?
"Awareness of information" describes belief. Knowledge is justified, true belief (you can believe things you don't actually know/don't have justification for, and you can be made aware of information you don't believe). If you're dismissive of philosophy and then ask epistemological questions, you'll miss out on a lot of good pondering people have done on the subject, and end up reinventing some of it without encountering the criticism of those ideas.
2. "Belief" isn't any less of a quagmire than "awareness." Materialists and dualists will never agree on whether machines can have 'belief' or 'awareness', so discussions between the two will always be fruitless.
If you're asking explicitly epistemological questions, knowledge as in "do you have knowledge of the events of last night" is probably not the definition you want. You probably want the definition as in, "what is knowledge, what do I know, and how do I know it?" (Note the definition I used is also there.)
You're asking questions and then declaring the answers impossible to determine; I don't really see the point. You don't really avoid the question of belief in the line of questioning you propose; it just gets implicitly shifted onto the observer.
Personally I don't care about whether this paradigm will ever reconcile with that one, I care about which I think is most appropriate to a given problem space.
> You're asking questions and then declaring the answers impossible to determine
If you mean to say that I've asked whether machines can know or believe, you're wrong. I have not asked whether it's possible for machines to 'know' or 'believe'. What I have asserted, not asked, is that these questions are a waste of your time, because the divide between materialists and dualists will never be bridged. The root of the disagreement is an irreparable philosophical divide, essentially a religious disagreement.
To reiterate for clarity, these are the questions which I said are relevant: "Does the system contain information? Can it reliably convey that information? In which ways can it manipulate that information?" I haven't declared these questions impossible to answer. On the contrary, these are questions for engineers, not philosophers or theologians. They are mundane, practical questions:
The system is a spreadsheet: Does it contain information? Yes, assuming it isn't blank. Even a fraudulent spreadsheet contains information, false as it may be. Can it convey that information? Yes, given appropriate spreadsheet software and a user who knows how to use it. Can it manipulate that information? Certainly, a spreadsheet can sort, sum, etc.
The system is an AI: Can it contain information? Yes, plenty of information is fed into them during training. Can it reliably convey that information? That depends on the degree of reliability you desire. Can it manipulate that information? Yes, numerous kinds of manipulations have been demonstrated. The reliability of information conveyance and the manner of manipulations which are possible are important questions for any engineer who is thinking about creating or employing such a system. The answers to these questions are not impossible to determine.
But can an AI "know" things? Pointless question, like asking if a submarine can "swim". Important questions about submarines include: How deep can it go? How fast can it go? How quiet is it? These are questions for which empirical answers can be determined. Whether a submarine can "swim" is a pointless question; all it does is interrogate how much anthropocentric baggage the word "swim" has. Maybe that's an interesting question to linguists, poets, or philosophers, but it isn't an important question to engineers trying to solve real problems.
I know swimming submarines are cliche so here's another: Can a seat-belt hug you? That's a stupid question for poets or linguists who want to interrogate the anthropocentric implications of the word 'hug'. Can a seat-belt restrain you? That's a useful question for automotive engineers who want to build a car.
The heart of the issue is always the same: is a perfect simulation actually the same thing as what it simulates? I would argue yes, especially when the definition of knowledge or intelligence is already so fuzzy, but some people will probably always disagree.
If it requires interpretation, then it is the "yes+human" system that has the knowledge.
What really happened here is that a human wrote down, "a cat is an animal," and then another human read it, understood it, and believed it. And so the knowledge moved from one human to another. `yes` was only a conduit for that information to travel through.
If something was a conduit for knowledge, wouldn't it make sense that it contained the knowledge at some point? The knowledge is stored in it by one human and extracted by another human.
Absolutely you can store knowledge in text. You can store quite a bit of knowledge in a book, for instance, but the book doesn't have any beliefs and doesn't know anything. Whether a sufficiently complex Markov chain or ANN can have beliefs, I don't know, but I'm skeptical that these ANNs in particular do.
Its ability to produce text containing true statements isn't sufficient evidence to conclude that it has beliefs, and it's easy to find cases where it contradicts itself (e.g., if you play around with the wording, you can find a prompt where it tells you that solving the trolley problem is a matter of harming the fewest people but then proposes a solution that harms the most people). I take that as an indication it's primarily regurgitating text and rearranging the prompt rather than applying knowledge (which, to be clear, is still useful for a number of tasks).