im kind of wondering like what the ceiling would be on reasoning for something like the 1.5T models with the repeating technique, but they would take a long time to download. i think if you have them already it would take maybe an hour or so to check against a swath of prompts. whats the reasoningest open model at the moment?
my guess is that large models trained on large corpuses there is just some ceiling of "reasoning you can do" given the internal geometry implied by the training data, cause text is lossy and low-bandwidth anyway, and theres only really so much of it. past some point you just have to have models learning from real-world interactions and my guess is we're already kind of there.
I have Deepseek etc, but inferencing on DDR5 would take about 2-3 weeks for a simple scan. I think this works best with dense models, but it also seems ok with MoE.
@everyone: Can someone hook me up with Nvidia sponsorship?
oh neat ill check that one out. i dont get that much speedup from ssd/128gb unified vs vram if im doing like a predefined set of prompts, since i have it load it from disk anyway and im just doing one forward pass per prompt, and just like load part of it at a time. its a bit slower if im doing cpu inferencing but i only had to do that with one model so far.
but yeah on demand would be a lot of ssd churn so id just do it for testing or getting some hidden state vectors.
I do wish one of the big labs would sponsor with a rack of HGX Rubin NVL8's. I have lots of ideas to test, and I have probably hit the spending limit with the boss on hardware (she hasn't seen the new power bill yet...)
On the other papers, models like SOLAR, or training a model that reuses a single layer, are probably going to hit a wall, based on the heatmaps I found. The transformer stack starts with randomised weights (analogous to undifferentiated stem cells), and it seems they later form 'organs' during the trillions of pre-training tokens they undergo. My hypothesis is that you probably only want one copy of the 'token-to-thought' and 'thought-to-token' organs. It seems that you can make one layer do all three things (transform in, transform out, and do the 'thinking'), but I think specialisation will always win.
Cheers. I will go back through my other old projects (optogenetics, hacking CRISPR/Cas9 etc.), and put them on my blog.
On your questions:
1) A few other papers have been mentioned in the thread, like Solar10.7B. They duplicated the whole transformer stack, and it kinda helped. But as I found experimentally, that's probably not a great idea. You are duplicating 'organs' (i.e. input processing stuff) that should only have one copy. Also, that paper didn't see immediate improvements; they had to do continued pre-training to see benefits. At that point, I'm guessing the big labs stopped bothering. Limited by hardware, I had to find unusual angles to approach this topic.
2) Nah, no more wetware for me. I did a half decade of research at a big neurobiology institute, and while it was very enjoyable, I can truly say that grant writing and paper review are 'not my thing'. The reason this info was delayed so long is that I wanted a paper in the AI field to go along with my papers in other fields. But as a hobbyist with no official affiliation, and the attention span of a gnat, I gave up and started a blog instead. Maybe someone will cite it?
How about this: since repeating x-y was useful for locating the block of 7 layers in the first place, I'd be incredibly curious if, knowing that block of 7, you then iterated on repeating x-y within that block z times.
Like for those 7 layers 1,2,3,4,5,6,7, does efficiency increase if you run 1,2,3,3,4,4,4,5,6,7, or perhaps 1,2,3,3,4,5,6,6,7, etc.? If only GPUs grew on trees.
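A minimal sketch of what such a repetition schedule could look like, with plain functions standing in for transformer blocks (all names here are illustrative, not from any real codebase or checkpoint):

```python
# Hypothetical sketch: run a decoder stack in an arbitrary order given by a
# "schedule" of layer indices, so individual layers can be repeated in place.

def run_with_schedule(layers, schedule, hidden):
    """Apply layers[i] to the hidden state for each index i in `schedule`."""
    for i in schedule:
        hidden = layers[i](hidden)
    return hidden

# Toy "layers" standing in for 7 transformer blocks (block k just adds k,
# so we can see the effect of repetition in the output).
layers = [lambda h, k=k: h + k for k in range(7)]

baseline = run_with_schedule(layers, [0, 1, 2, 3, 4, 5, 6], 0)
repeated = run_with_schedule(layers, [0, 1, 2, 2, 3, 3, 3, 4, 5, 6], 0)
print(baseline, repeated)  # 21 29
```

The same schedule mechanism would also let you loop two disjoint blocks in a single forward pass, which touches on the localisation question raised below.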
If you found two disjoint sections that seemed positive on their own, did you try looping both separately in the same model? Wondering how localized the structures are.
Yes, I was using Base64 to 'jailbreak' LLMs back in the day (so similar), and that's what led me to the hypothesis, and months of GPU use to find optimal layer duplication!
For all we know, AI tech companies could theoretically have converted all of the "acquired" (ahem!) training set material into base64 and used it for training as well, just like you would encode say japanese romaji or hebrew written in the english alphabet.
> They almost certainly have never seen regular conversations in Base64 in their training set, so its weird that it 'just works'.
People use Base64 to store payloads of many arbitrary things, including web pages and screenshots, both deliberately and erroneously. So the models have almost certainly seen regular conversations in Base64 in their 10 TB+ text training sets scraped from billions of web pages, files, mangled emails, etc.
But that points again to the main idea: The model has learnt to transform Base64 into a form it can already use in the 'regular' thinking structures.
The alternative is that there is an entire parallel structure just for Base64, which based on my 'chats' with LLMs in that format seems implausible; it acts like the regular model.
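For anyone who wants to try 'chatting' in Base64 themselves, the round trip is a couple of lines of Python (the prompt here is just an example):

```python
import base64

prompt = "What is 2 + 2?"

# Encode the prompt before sending it to the model...
encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
print(encoded)  # V2hhdCBpcyAyICsgMj8=

# ...and decode the model's Base64 reply the same way in reverse.
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == prompt
```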
If there is a 'translation' organ in the model, why not math or emotion processing organs? That's what I set out to find, and it is illustrated in the heatmaps.
Also, any writing tips from the Master blogger himself? Huge fan (squeal!)