No blog post; my LLM expert friend told me this was kinda obvious when I shared it with him, so I didn't think it was worth it.
I can tell you how I got there: I did nanoGPT, then tried to be smart and train a model with a loss function that targets the next 2 tokens instead of one. Calculate the loss function and you'll see it's exactly the same during training.
Sibling commenter also mentions:
> the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b) and then with cross-entropy loss which optimizes for log likelihood this becomes a summation.
Unless I've misunderstood the math myself, I don't think GP's comment is quite right if taken literally, since "predict the next 2 tokens" would literally mean predicting indices t+1 and t+2 off of the same hidden state at index t, which is the much newer field of multi-token prediction and not classic LLM autoregressive training.
Instead, what GP likely means is the observation that the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b), and with cross-entropy loss, which optimizes for log likelihood, this becomes a summation. So training with teacher forcing to minimize "next token" loss simultaneously across every prefix of the ground truth is equivalent to maximizing the joint probability of that entire ground-truth sequence.
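To make that concrete, here's a tiny numeric sketch (the probabilities are made up) showing that the log of the joint probability is exactly the sum of the per-token log probabilities, which is why minimizing summed cross-entropy maximizes the joint likelihood:

    import math

    # Hypothetical conditionals for the sequence "a b c" (numbers are made up).
    p_a = 0.5           # P(a)
    p_b_given_a = 0.4   # P(b | a)
    p_c_given_ab = 0.2  # P(c | a, b)

    joint = p_a * p_b_given_a * p_c_given_ab                        # P(a, b, c)
    log_sum = sum(math.log(p) for p in (p_a, p_b_given_a, p_c_given_ab))

    assert abs(math.log(joint) - log_sum) < 1e-12
    # Minimizing the summed per-position cross-entropy is therefore the same
    # as maximizing the joint log-likelihood of the whole sequence.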
Practically, even though inference is done one token at a time, you don't do training "one position ahead" at a time. You can optimize the loss function for the entire sequence of predictions at once. This is due to the autoregressive nature of the attention computation: if you start with a chunk of text, as it passes through the layers you don't just end up with the prediction for the next word in the last token's final layer; _all_ of the final-layer residuals for previous tokens will encode predictions for their following index.
So attention on a block of text doesn't give you just the "next token prediction" but simultaneous predictions for each prefix, which makes training quite nice. You can just dump in a bunch of text and it's like you trained the "next token" objective on all its prefixes. (This is convenient for training, but wasted work for inference, which is what leads to KV caching.)
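A minimal sketch of that one-shot loss in the nanoGPT style (the function and names are mine, not from any particular codebase): one forward pass over the whole block yields logits at every position, and shifting the targets by one scores every prefix's next-token prediction at once:

    import torch
    import torch.nn.functional as F

    def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # logits: (batch, seq_len, vocab) from a single forward pass
        # tokens: (batch, seq_len) ground-truth token ids
        pred = logits[:, :-1, :]   # position t predicts token t+1
        target = tokens[:, 1:]     # so targets are shifted left by one
        return F.cross_entropy(
            pred.reshape(-1, pred.size(-1)),  # (batch * (seq_len-1), vocab)
            target.reshape(-1),
        )  # mean cross-entropy over every prefix simultaneously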
Many people also know by now that attention is "quadratic" in nature (the hidden state of token i attends to the states of tokens 1...i-1), but they don't fully grasp the implication: even though for forward inference you only predict the "next token", for backward training the error for token i can backpropagate to tokens 1...i-1. This is despite the causal masking, since token 1 doesn't attend to token i directly, but the hidden state of token 1 is involved in the computation of the residual stream for token i.
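A quick way to convince yourself of that gradient flow (a toy setup of my own, not production code): apply causally masked attention, backprop a loss from only the last position, and check that the first token's embedding still receives gradient:

    import torch

    torch.manual_seed(0)
    emb = torch.randn(1, 4, 8, requires_grad=True)   # (batch, seq, dim)
    attn = torch.nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)
    causal = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)  # mask future
    out, _ = attn(emb, emb, emb, attn_mask=causal)

    out[0, -1].sum().backward()        # "error" only at the last token
    print(emb.grad[0, 0].abs().sum())  # nonzero: token 1 still gets gradient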
When it comes to the statement
> it's not unreasonable to say LLMs are trained to predict the next book instead of a single token.
You have to be careful, since during training there is no actual sampling happening. We've optimized to maximize the joint probability of the ground-truth sequence, but this is not the same as maximizing the probability that the ground truth is generated during sampling. Consider that there could be many sampling strategies: greedy, beam search, etc. While the most likely next token is the "greedy" argmax of the logits, the most likely next N tokens are not always found by greedily sampling N times. It's thought that this is one reason why RL is so helpful, since rollouts do in fact involve sampling, so you provide rewards at the "sampled sequence" level, which mirrors how you do inference.
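A toy illustration of that last point, with made-up two-step distributions: the greedy first token does not lead to the highest-joint-probability pair:

    # Hypothetical distributions (numbers are made up).
    p_first = {"a": 0.6, "b": 0.4}         # greedy picks "a"
    p_second = {
        "a": {"x": 0.5, "y": 0.5},         # best joint via "a": 0.6 * 0.5 = 0.30
        "b": {"x": 0.9, "y": 0.1},         # best joint via "b": 0.4 * 0.9 = 0.36
    }

    best = max(
        (((t1, t2), p_first[t1] * p_second[t1][t2])
         for t1 in p_first for t2 in p_second[t1]),
        key=lambda pair: pair[1],
    )
    print(best)  # (('b', 'x'), 0.36) beats any greedy continuation of 'a' (0.30)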
It would be right to say that they're trained so that the most likely next book is assigned the highest joint probability (not just so that the most likely next token is assigned the highest probability).
The idea I tried to express was purely the loss function thing you mentioned, and how the tasks (1 vs 2 vs n) lead to identical training runs. At least with nanoGPT. I don't know if that extrapolates well to current LLM internals and current training.
My hot take is that as that percentage increases, salaries will go up asymptotically until you get to 100%, then they crash to 0. If 80% of your job can be done by AI, I'm going to give you the work of 5 people. When it's 99%, I will give you the work of 100 people.
If 80% is "done by the AI", who is responsible for the inevitable failures on the AI's behalf? Inference is wrong some nonzero percentage of the time; in a word… hmm.
How many 9s until you're comfortable? Even then, knowing that 1000 tasks will likely contain at least 1 foundational issue… how do you audit? "Pretty please do the needful", and then have another model "please ensure they do the needful"? Do you review all 1000 inputs/outputs processed? Don't get me wrong, I'm familiar with the "send it" ethos all too well, but at scale it seems like quite the pickle.
Genuinely curious how most people consider these angles… I was tasked with building a model once to perform what literally could've otherwise been a SQL query… when I brought this up, it was met with "well, we need to do it with AI." I don't think a human's gonna want to find that needle in a haystack when 100,000 significant documents are originated… but I don't have to worry about that one anymore, thank goodness.
If you're okay with the work being done poorly and without review, then sure. Otherwise, it'll take the same amount of time and be done worse. I would not trust solely 1 person to review 5 people's work let alone 100.
You’re arguing semantics. OP is hypothesising a future where the quality of work is comparable to that of a human. If you don’t believe that that’s on the cards, just say it, but you’re intentionally misrepresenting the hypothetical.
Or said another way: "Gold returned to the price it was at on a Tuesday." The world is too multivariate to conclude this conversation caused the price of gold to collapse.
Serious question: do you think people in Iran would prefer the status quo, or Return of the Shah (son)? My gut says Shah, but I don't know anyone from Iran, so that's just a guess.
About 30% would take the Shah as a first or second choice. This is higher than support for the current regime, but the country is deeply divided on what an alternative future would look like.
It’s a serious question but it’s not a relevant one. There isn’t a ballot with just “Pahlavi” and “Ayatollah” on it; and there probably never will be considering how much Iranians hate both.
Speaking of evil dark-patterned business practices, Epic just recently U-turned on lootboxes again. Fortnite did have them originally, but pivoted to the less egregious FOMO sales funnel around when it got really popular, except now they've backtracked by allowing pay-to-win and lootbox mechanics in user-created game modes.
Fortnite's user-created modes are essentially an attempt to compete with Roblox, and like Roblox their age demographics skew very young, so this reads as a deliberate attempt to exploit children specifically.
Maybe it has changed since I played it, but I honestly found Fortnite to be pretty non-predatory compared to most live-service games. I know that's a low bar, but at least you can just buy things outright when you see them in the store.
If we looked at the top 100 played steam games, I don't think Fortnite would crack the top 15 for most manipulative.
Weirdly I agree. After seeing the truly god-awful pay to win gambling-filled landscape of Roblox, Fortnite feels pretty tame and respectful. V bucks aren't shoved down your throat, the battle passes are pretty transparent about what you get, and the whole cosmetics store feels less lootbox-heavy than a lot of games.
> Now, third-party games can offer premium in-game items and effects, with developers pocketing 37% of the proceeds — temporarily doubled to 74% for 12 months.
37%? Developers get a 37% cut? Holy fucking hypocrisy from Tim Sweeney and the camp at Epic Games, with their own predation here.
Epic wasn't fined for putting things on sale. Instead, it was fined for putting pressure on children to buy things that were put on sale; e.g. through wording like "Get it now" and "Grab it", and through design.
This makes no sense. You can't buy anything in Fortnite with real money. The purchase you make is for blocks of V-Bucks, which can then be spent on items.
The actual financial transaction is completely divorced from any items that are on sale and, hopefully, that financial transaction is completely out of the direct control of children.
I play Fortnite -- I get the battle pass, I have a lot of skins and items, and I have never paid a cent. I earn enough v-bucks from playing the game to never have to pay.
That's actually pretty amazing, and so I question this fine situation. If you didn't give your kid a cent for Fortnite, they could still play and have a great time with their friends, and basically get the full experience including skins, items, and emotes.
I appreciate the validation that he doesn't customize it much. I see a lot of people creating really complex agents/workflows that I tried to replicate, and they always came across to me as more trouble than they're worth. Kinda like 10 years ago, when people would create complex workflows for storing their notes.
In America, the problem comes when the gain and the loss come in different years. If you make a big gain in 2024, but didn't pay taxes on that gain, then lose the money in 2025, they will come after you for failing to pay taxes in 2024 even though you no longer have the money in 2025. The lesson is to pay your taxes.
A bank will be happy to lend you the money to cover the spread since you have the collateral of a large tax refund in the future. It'll cost you a little bit of interest but it's generally not the catastrophe that people make it out to be.
Maybe if you are an ultra-high-net-worth individual. I don't see your average Joe walking into their neighborhood Chase bank, asking for a $500k loan using a potential tax refund as collateral, and getting it. That seems like an esoteric financial product.
Tax refund loans are offered in conjunction with tax filing services like TurboTax or H&R Block, because they already know what your refund amount is going to be, and it's relatively risk-free (small refund amounts) and easy to automate. They are similar to payday loans.
A crypto bro showing up with $1M in gains and losses from crypto transactions and asking for a refund loan at his neighborhood bank is probably not going to get anywhere (it's too large a risk because it's not just a few thousand dollars, but at the same time it's too small an amount for them to do custom due diligence to underwrite a loan).
Anyway, you can't erase gains in year 1 with losses in year 2, at least in the USA (you can only offset $3k/yr max in year 2 if you don't have any other gains).
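Rough arithmetic on why that stings (made-up numbers, and ignoring that carried-forward losses can still offset future capital gains in full):

    gain_2024 = 1_000_000    # realized gain, taxed in 2024
    loss_2025 = 1_000_000    # realized loss the following year
    offset_per_year = 3_000  # max net capital loss deductible per year in the US
    print(loss_2025 / offset_per_year)  # ~333 years to absorb it with no new gains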
I followed Rob's work on this in real time; it was a master class in calling out a company with no value. He just continually showed how the numbers didn't add up and laid out the inevitable conclusion. I had no idea about the threats, but I do know his wife had a baby while all this was going on.