im kind of wondering like what the ceiling would be on reasoning for something like the 1.5T models with the repeating technique, but they would take a long time to download. i think if you have them already it would take maybe an hour or so to check against a swath of prompts. whats the reasoningest open model at the moment?
my guess is that large models trained on large corpuses there is just some ceiling of "reasoning you can do" given the internal geometry implied by the training data, cause text is lossy and low-bandwidth anyway, and theres only really so much of it. past some point you just have to have models learning from real-world interactions and my guess is we're already kind of there.
I have Deepseek etc, but inferencing on DDR5 would take about 2-3 weeks for a simple scan. I think this works best with dense models, but it also seems ok with MoE.
@everyone: Can someone hook me up with Nvidia sponsorship?
oh neat ill check that one out. i dont get that much speedup from ssd/128gb unified vs vram if im doing like a predefined set of prompts, since i have it load it from disk anyway and im just doing one forward pass per prompt, and just like load part of it at a time. its a bit slower if im doing cpu inferencing but i only had to do that with one model so far.
but yeah on demand would be a lot of ssd churn so id just do it for testing or getting some hidden state vectors.
I do wish one of the big labs would sponsor with a rack of HGX Rubin NVL8's. I have lots of ideas to test, and I have probably hit the spending limit with the boss on hardware (she hasn't seen the new power bill yet...)
On the other papers, models like SOLAR, or training a model that reuses a single layer, are probably going to hit a wall, based on the heatmaps I found. The transformer stack starts with randomised weights (analogous to undifferentiated stem cells), and it seems they later form 'organs' during the trillions of pre-training tokens they undergo. My hypothesis is that you probably only want one copy of the 'token-to-thought' and 'thought-to-token' organs. It seems that you can make one layer do all three things (transform in, transform out, and do the 'thinking'), but I think specialisation will always win.
Cheers. I will go back through my other old projects (optogenetics, hacking CRISPR/Cas9 etc.), and put them on my blog.
On your questions:
1) A few other papers have been mentioned in the thread, like Solar10.7B. They duplicated the whole transformer stack, and it kinda helped. But as I found experimentally, that's probably not a great idea. You are duplicating 'organs' (i.e. input processing stuff) that should only have one copy. Also, that paper didn't see immediate improvements; they had to do continued pre-training to see benefits. At that point, I'm guessing the big labs stopped bothering. Limited by hardware, I had to find unusual angles to approach this topic.
2) Nah, no more wetware for me. I did a half decade of research at a big neurobiology institute, and while it was very enjoyable, I can truly say that grant writing and paper review are 'not my thing'. The reason this info was delayed so long is that I wanted a paper in the AI field to go along with my papers in other fields. But as a hobbyist with no official affiliation, and the attention span of a gnat, I gave up and started a blog instead. Maybe someone will cite it?
How about this: since repeating x-y was useful for locating the block of 7 layers in the first place, I'd be incredibly curious if, knowing that block of 7, you then iterated on repeating x-y within that block z times.
Like for those 7 layers 1,2,3,4,5,6,7, does efficiency increase if you run 1,2,3,3,4,4,4,5,6,7, or perhaps 1,2,3,3,4,5,6,6,7, etc.? If only GPUs grew on trees.
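A minimal sketch of what such a repetition schedule could look like, with plain functions standing in for transformer blocks (all names here are illustrative, not from any real codebase or checkpoint):

```python
# Hypothetical sketch: run a decoder stack in an arbitrary order given by a
# "schedule" of layer indices, so individual layers can be repeated in place.

def run_with_schedule(layers, schedule, hidden):
    """Apply layers[i] to the hidden state for each index i in `schedule`."""
    for i in schedule:
        hidden = layers[i](hidden)
    return hidden

# Toy "layers" standing in for 7 transformer blocks (block k just adds k,
# so we can see the effect of repetition in the output).
layers = [lambda h, k=k: h + k for k in range(7)]

baseline = run_with_schedule(layers, [0, 1, 2, 3, 4, 5, 6], 0)
repeated = run_with_schedule(layers, [0, 1, 2, 2, 3, 3, 3, 4, 5, 6], 0)
print(baseline, repeated)  # 21 29
```

The same schedule mechanism would also let you loop two disjoint blocks in a single forward pass, which touches on the localisation question raised below.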
If you found two disjoint sections that seemed positive on their own, did you try looping both separately in the same model? Wondering how localized the structures are.
Yes, I was using Base64 to 'jailbreak' LLMs back in the day (so similar), and that's what led me to the hypothesis, and months of GPU use to find optimal layer duplication!
For all we know, AI tech companies could theoretically have converted all of the "acquired" (ahem!) training set material into base64 and used it for training as well, just like you would encode say japanese romaji or hebrew written in the english alphabet.
> They almost certainly have never seen regular conversations in Base64 in their training set, so its weird that it 'just works'.
People use Base64 to store payloads of many arbitrary things, including web pages and screenshots, both deliberately and erroneously. So the models have almost certainly seen regular conversations in Base64 in their 10 TB+ text training sets scraped from billions of web pages, files, mangled emails, etc.
But that points again to the main idea: The model has learnt to transform Base64 into a form it can already use in the 'regular' thinking structures.
The alternative is that there is an entire parallel structure just for Base64, which based on my 'chats' with LLMs in that format seems implausible; it acts like the regular model.
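For anyone who wants to try 'chatting' in Base64 themselves, the round trip is a couple of lines of Python (the prompt here is just an example):

```python
import base64

prompt = "What is 2 + 2?"

# Encode the prompt before sending it to the model...
encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
print(encoded)  # V2hhdCBpcyAyICsgMj8=

# ...and decode the model's Base64 reply the same way in reverse.
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == prompt
```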
If there is a 'translation' organ in the model, why not math or emotion processing organs? That's what I set out to find, and it is illustrated in the heatmaps.
Also, any writing tips from the Master blogger himself? Huge fan (squeal!)