Hacker News | rdos's comments

The Tools detail page is wrong; it's the one from the last release, November 2025.

almost fell for it

when you put this stuff in perspective the lies really start to fall apart

I can't seem to change the colors of the pie chart, other than the predefined themes. But all of those are horrible for a pie chart.

Yeah, as far as I know, you need to define a custom theme to change pie chart colors. You can prepend the chart with an initialization directive like:

%%{init: {"theme": "base", "themeVariables": { "pie1": "#FF5733", "pie2": "#33FF57", "pie3": "#3357FF", "pieStrokeColor": "#000000", "pieStrokeWidth": 3, "pieOpacity": 0.8 }}}%%

This looks like it works on this site too.


To be fair, pie charts are horrible in general.

> This bug is categorically distinct from hallucinations or missing permission boundaries

I was expecting some kind of explanation of this distinction, but none follows.


Unless it's a bug in CC, which is as likely as not, the LLM is failing to keep the story straight. A human could do the same: who said what?


Was any text in the repo NOT written by AI?


I used AI tools during development, same as most people writing code right now. The research direction, experiments, and conclusions are mine: I read the papers, designed the experiments, ran them, and documented where things broke. The repo includes 60+ experiment iterations, result logs showing failures, and documentation that corrects earlier optimistic claims. That's not a pattern you'd get from prompting a model to generate a project. I'm one person, so yes, AI helped with implementation. The research was mine.


14B, even at Q4, isn't realistic for coding on a single 12GB RTX 3060; token speed is too slow. After all, these are dense models, and you aren't getting a good MoE model under 30B. You can do OCR, STT, and TTS really well, and for LLMs the good <10B use cases are classification, summarization, and extraction.
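A rough roofline sketch of why dense-model decode speed is the bottleneck on a single card. The 360 GB/s bandwidth figure is the 3060's spec-sheet number, and the 50% efficiency and ~4.5 bits/weight for Q4 K-quants are my assumptions, not measurements:

```python
def dense_tok_per_sec(params_b: float, bits_per_weight: float,
                      mem_bw_gbps: float, efficiency: float = 0.5) -> float:
    """Roofline estimate: decoding a dense model reads every weight once
    per token, so speed is bounded by memory bandwidth over weight size.
    `efficiency` is an assumed fraction of peak bandwidth achieved."""
    weight_gb = params_b * bits_per_weight / 8  # weights in GB
    return mem_bw_gbps * efficiency / weight_gb

# RTX 3060 12GB: ~360 GB/s peak bandwidth; 14B at ~4.5 bits/weight
# (typical llama.cpp Q4 K-quant) -> ~7.9 GB of weights
print(round(dense_tok_per_sec(14, 4.5, 360), 1))  # -> 22.9
```

That ~23 tok/s is an upper bound for short contexts; real decode speed drops as context grows or once layers spill to system RAM, which is where the "too slow for coding" complaint comes from.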


Dual 3060s run 24B Q6 and 32B Q4 at ~15 tok/sec. That's fast enough to be usable.

Add a third one and you can run Qwen 3.5 27B Q6 with 128k ctx. For less than the price of a 3090.


Sure, two 3060s can pull usable performance on a usable LLM, but a single one can't (yet).

> 3x RTX 3060 less than the price of a 3090

Interesting; here it's around the same: 200-250€ for a used 12GB 3060 and 600-800€ for a used 3090.


> llama.cpp (previously Ollama)

I almost fainted


Is it possible for such a small model to outperform Gemini 3, or is this a case of benchmarks not reflecting reality? I would love to be hopeful, but so far an open-source model has never been better than a closed one, even when benchmarks said it was.


Off the top of my head: for a lot of OCR tasks, it's kind of worse for the model to be smart. I don't want my OCR to make stuff up or answer questions; I want it to recognize what is actually on the page.


Sometimes what is on the page is ambiguous. Imagine a scan where the dot over the i is missing in a word like "this". What's on the page is "thls" but to transcribe it that way would be an error outside of forensic contexts.

I am reminded it's basically impossible to read cursive writing in a language you don't know even if it's the same alphabet.


Yes, but that's context-specific. If your goal with OCR is to make text indexable and searchable with regular text search, then transcribing "lesser" as "lesfer" is bad. And handwriting can often be so bad that you need context to make the call about what the scribbles are actually trying to say.

Evaluation methods, too, are bad because they don't think critically about what the downstream task is. Word Error Rate and Character Error Rate are terrible metrics for most historical HTR, yet they're what people use because of habit.

It's a bit like how for a long time BLEU was the metric for translation quality. BLEU is based on N-gram similarity to a reference translation, so naturally translation methods based on and targeting N-gram similarity (e.g. pre-neural-network Google Translate) did well, and looked much better than they actually were.
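As a concrete illustration of how blunt these metrics are, here is a minimal CER computation, just Levenshtein distance over characters, applied to the "lesser"/"lesfer" example from earlier in the thread:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits / reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

print(round(cer("lesser", "lesfer"), 3))  # 1 edit / 6 chars -> 0.167
```

The score is identical whether the single bad character breaks text search (as "lesfer" does) or is cosmetically harmless, which is exactly the downstream-task blindness complained about above.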


Interesting. Won't stuff like entity extraction suffer? Especially in multilingual use cases. My worry is that a smaller model might not realize some text is actually a person's name because it is very unusual.


The model does not need to be that smart to understand that an unknown word starting with a capital letter is the name of a place or a person. It does not need to know who it refers to; it just needs to transcribe it.

Also, there are generalist models with enough of a grasp of a dozen or so languages that fit comfortably in 7B parameters, like the older Mistral, which had the best multilingual support at the time; newer models around that size are probably good candidates too. I am not surprised that a specialised multilingual model can fit in 8B or so.


No. Gemini is clearly the leader across the board: https://www.ocrarena.ai/leaderboard


This is very interesting, especially the last part showing gpt-5.2 and gpt-oss with their very similar, unusual outcome of being 90%+ Serious.

I tested this locally and got the same result with gpt-oss 120b, but only at the default 'medium' reasoning effort. At 'low' I kept getting more playful responses with emojis, and at 'high' more guessing responses.

I had a lot of fun with this and it provided me with more insight than I would have thought.


I didn't read the blog yet because I clicked on cat pics and there weren't any!!!



