I used AI tools during development, same as most people writing code right now. The research direction, experiments, and conclusions are mine: I read the papers, designed the experiments, ran them, and documented where things broke. The repo includes 60+ experiment iterations, result logs showing failures, and documentation that corrects earlier optimistic claims. That's not a pattern you'd get from prompting a model to generate a project. I'm one person, so yes, AI helped with implementation. The research was mine.
A 14B model, even at Q4, isn't realistic for coding on a single 12GB RTX 3060: token speed is too slow, since these are dense models. You aren't getting a good MoE model under 30B. You can do OCR, STT, and TTS really well, and for sub-10B LLMs the good use cases are classification, summarization, and extraction.
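To put rough numbers on the memory side (illustrative figures, not measurements):

    # Back-of-envelope VRAM for a dense 14B model at ~4-bit quantization
    params = 14e9
    bytes_per_param = 0.5          # ~4 bits/weight, ignoring group-scale overhead
    weights_gb = params * bytes_per_param / 1e9
    print(f"weights alone: ~{weights_gb:.0f} GB")   # ~7 GB
    # Add KV cache, activations, and runtime overhead and a 12 GB card is tight.
    # Every generated token also has to read all 14B weights (no sparsity),
    # which is what keeps dense-model token speed down on this class of GPU.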
Is it possible for such a small model to outperform Gemini 3, or is this a case of benchmarks not showing the reality? I would love to be hopeful, but so far an open-source model has never been better than a closed one, even when the benchmarks said it was.
Off the top of my head: for a lot of OCR tasks, it's kind of worse for the model to be smart. I don't want my OCR to make stuff up or answer questions; I want it to recognize what is actually on the page.
Sometimes what is on the page is ambiguous. Imagine a scan where the dot over the i is missing in a word like "this". What's on the page is "thls" but to transcribe it that way would be an error outside of forensic contexts.
I am reminded that it's basically impossible to read cursive writing in a language you don't know, even if it uses the same alphabet.
Yes, but that's context-specific. If your goal with OCR is to make text indexable and searchable with regular text search, then transcribing "lesser" as "lesfer" is bad. And handwriting can often be so bad that you need context to make the call about what the scribbles are actually trying to say.
Evaluation methods are bad too, because people don't think critically about what the downstream task is. Word Error Rate and Character Error Rate are terrible metrics for most historical HTR, yet they're what people use out of habit.
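For anyone who hasn't looked under the hood, CER is just edit distance normalized by reference length, which is why it can't tell a defensible reading of degraded text from an outright error. A minimal sketch:

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def cer(hypothesis, reference):
        return levenshtein(hypothesis, reference) / len(reference)

    print(cer("thls", "this"))   # 0.25 -- one substitution, whatever the downstream task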
It's a bit like how, for a long time, BLEU was the metric for translation quality. BLEU is based on n-gram similarity to a reference translation, so naturally translation methods based on and targeting n-gram similarity (e.g. pre-NN Google Translate) did well, and looked much better than they actually were.
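Roughly what that overlap looks like (a simplified sketch with clipped n-gram counts and no brevity penalty, so not full BLEU):

    from collections import Counter
    import math

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def ngram_overlap_score(hyp, ref, max_n=4):
        precisions = []
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            matched = sum(min(count, r[g]) for g, count in h.items())
            precisions.append(matched / max(sum(h.values()), 1))
        # geometric mean of the per-order precisions
        return math.exp(sum(math.log(p) for p in precisions) / max_n) if all(precisions) else 0.0

    hyp = "the cat sat on the mat".split()
    ref = "the cat sat on a mat".split()
    print(ngram_overlap_score(hyp, ref))   # ~0.54: it rewards surface overlap, not meaning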
Interesting. Won't stuff like entity extraction suffer? Especially in multilingual use cases. My worry is that a smaller model might not realize some text is actually a person's name because it is very unusual.
The model does not need to be that smart to understand that a name it does not know, starting with a capital letter, is the name of a place or a person. It does not need to be aware of whom the name refers to; it just needs to transcribe it.
Also, there are generalist models that fit comfortably in 7B parameters and have enough of a grasp of a dozen or so languages. Like the older Mistral, which had the best multilingual support at the time; newer models around that size are probably good candidates too. I am not surprised that a specialised multilingual model can fit in 8B or so.
This is very interesting. Especially the last part, where gpt-5.2 and gpt-oss stand out with the very similar outcome of being 90%+ Serious.
I tested this locally and got the same result with gpt-oss 120b, but only at the default 'medium' reasoning effort. When I used 'low' I kept getting more playful responses with emojis, and when I used 'high' I kept getting more guessing responses.
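If anyone wants to reproduce this, here's roughly how I'd flip the setting against a local OpenAI-compatible endpoint. The exact knob depends on your serving stack; this sketch assumes the gpt-oss convention of stating the reasoning level in the system prompt, and the URL, model name, and prompt are placeholders:

    from openai import OpenAI

    # Assumed local OpenAI-compatible server (Ollama/llama.cpp/vLLM); adjust URL and model name.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    def ask(prompt, effort="medium"):
        # gpt-oss convention: "Reasoning: low|medium|high" in the system prompt.
        # Your server may expose a dedicated reasoning-effort option instead.
        resp = client.chat.completions.create(
            model="gpt-oss:120b",
            messages=[
                {"role": "system", "content": f"Reasoning: {effort}"},
                {"role": "user", "content": prompt},
            ],
        )
        return resp.choices[0].message.content

    for effort in ("low", "medium", "high"):
        print(effort, "->", ask("Your test prompt here", effort))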
I had a lot of fun with this and it provided me with more insight than I would have thought.