How about, as you found repeating x-y was useful for locating the block of 7 layers in the first place; I'd be incredibly curious if, knowing that block of 7, if you then iterated from repeating x-y in that block z times.
Like for those 7 layers 1,2,3,4,5,6,7 does efficiency increase if you run 1,2,3,3,4,4,4,5,6,7 or perhaps 1,2,3,3,4,5,6,6,7 etc. If only GPUs grew on trees
If you found two disjoint sections that seemed positive on their own, did you try looping both separately in the same model? Wondering how localized the structures are.
I tried that pretty early on, the its basically never good. Its described in the the section: https://dnhkng.github.io/posts/rys/#the-beginning-of-llm-neu...