Hacker News | gdiamos's comments

I personally don't mind letting Claude write about work.

You could spend 80% doing the work and 20% writing about it, or 99% doing the work and 1% copy-pasting Claude's writeup about it into a blog.

There is nothing wrong with writing if you are into it, and yes you can probably do better than Claude, but I can relate to engineers who just want to build.


If you can’t be bothered to write it, why should I bother to read it?

Because it contains information of value to you ? I mean if it doesn’t, just don’t read it.


To quote another HN comment recently made:

> Using AI to write content is seen so harshly because it violates the previously held social contract that it takes more effort to write messages than to read messages. If a person goes through the trouble of thinking out and writing an argument or message, then reading is a sufficient donation of time.

> However, with the recent chat based AI models, this agreement has been turned around. It is now easier to get a written message than to read it. Reading it now takes more effort. If a person is not going to take the time to express messages based on their own thoughts, then they do not have sufficient respect for the reader, and their comments can be dismissed for that reason.


So to a large extent I appreciate that argument, but I feel it applies more to throwaway comments or sales outreach: writing with low information density. On this occasion, a lot of work went into it, it would otherwise be lost or inaccessible to me, and I am genuinely grateful someone stuck their work in an LLM, said "tidy this up to post," and hit enter.

I could spend 100% doing the work with my own Claude, and 0% reading yours. That's a negative-sum outcome. I do think that the 80%/20% split is better (though anything that is mostly human voice is fine for me).

Because the failures are so frequent, and so often load-bearing, that it becomes negative-sum to even attempt to read stuff that appears generated.

One of my lessons from using different accelerators, whether different NVIDIA generations or GPU->TPU, etc., is that someone needs to do the work of indexing, partitioning, mapping, scheduling, and benchmarking. That work is labor intensive.

In this case, Google has already done it, and that will stay true for highly resourced accelerator companies like Google working with the most popular operations, like attention.

As long as you use those operations, you are okay. But if you do something different, you need to be prepared to do all of this yourself.


Results as good as Qwen has been posting would seem to trigger a power struggle.

I think companies that don’t navigate these correctly eventually lose.


It was inevitable.


This is why I like Dario as a CEO - he has a system of ethics that is not just about who writes the largest check.

You may not agree with it, but I appreciate that it exists.


I know the frontier “labs” are holding back publications.

I don’t think it will last among researchers who think beyond production LLMs.


Most people don't appreciate how many dead end applications NVIDIA explored before finding deep learning. It took a very long time, and it wasn't luck.


It was luck that a viable non-graphics application like deep learning existed which was well-suited to the architecture NVIDIA already had on hand. I certainly don't mean to diminish the work NVIDIA did to build their CUDA ecosystem, but without the benefit of hindsight I think it would have been very plausible that GPU architectures would not have been amenable to any use cases that would end up dwarfing graphics itself. There are plenty of architectures in the history of computing which never found a killer application, let alone three or four.


Even that is arguably not lucky; it just followed a non-obvious trajectory. Graphics uses a fair amount of linear algebra, so people with large-scale physical modeling needs (among many others) became interested. To an extent, the deep learning craze kicked off because developments in GPU computation made training economical.


Nvidia started their GPGPU adventure by acquiring a physics engine and porting it over to run on their GPUs. Supporting linear algebra operations was pretty much the goal from the start.


They were also full of lies when they started their GPGPU adventure (as they are today).

For a few years they continuously repeated that GPGPU could provide about 100 times more speed than CPUs.

This has always been false. GPUs really are much faster, but their performance per watt has for most of that time hovered around 3 times, and sometimes up to 4 times, that of CPUs. This is impressive, but very far from the "100" factor originally claimed by NVIDIA.

Far more annoying than the exaggerated performance claims is how the NVIDIA CEO talked during the first GPGPU years about how their GPUs would cause a democratization of computing, giving everyone access to high-throughput computing.

After a few years, these optimistic prophecies stopped, and NVIDIA promptly removed FP64 support from their affordably priced GPUs.

A few years later, AMD followed NVIDIA's example.

Now only Intel has made an attempt to revive GPUs as "GPGPUs", but there seems to be little conviction behind this attempt, as they do not even advertise the capabilities of their GPUs. If Intel also abandons this market, then the "general-purpose" in GPGPU will really be dead.


GPGPU is doing better than ever.

Sure FP64 is a problem and not always available in the capacity people would like it to be, but there are a lot of things you can do just fine with FP32 and all of that research and engineering absolutely is done on GPU.

The AI craze also made all of it much more accessible. You don't need advanced C++ knowledge anymore to write and run a CUDA project. You can just take PyTorch, JAX, CuPy, or whatnot and accelerate your numpy code by an order of magnitude or two. Basically everyone in STEM is using Python these days, and the scientific stack works beautifully with NVIDIA GPUs. Guess which chip maker will benefit if any of that research turns out to be a breakout success in need of more compute?
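As a sketch of how low that barrier has become: with the common `xp` convention, the same array code runs on CPU via numpy or on GPU via CuPy just by swapping which module you pass in (the function name here is illustrative, not from any particular library):

```python
import numpy as np

def sum_of_squares(xp, n=1000):
    # xp is the array namespace: pass numpy for CPU, cupy for GPU.
    # The body is identical either way; that is the whole point.
    a = xp.linspace(0.0, 1.0, n)
    return float(xp.sum(a * a))

cpu_result = sum_of_squares(np)
# With CuPy installed on an NVIDIA machine, sum_of_squares(cupy)
# runs the same code on the GPU with no other changes.
```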


> GPGPU can provide about 100 times more speed than CPUs

Ok. You're talking about performance.

> their performance per watt has oscillated during most of the time around 3 times and sometimes up to 4 times greater in comparison with CPUs

Now you're talking about perf/W.

> This is impressive, but very far from the "100" factor originally claimed by NVIDIA.

That's because you're comparing apples to apples per apple cart.


For determining the maximum performance achievable, the performance per watt is what matters, as the power consumption will always be limited by cooling and by the available power supply.

Even if we interpret the NVIDIA claim as referring to the performance available in a desktop, the GPU cards had power consumptions at most double those of CPUs. Even with this extra factor, there has been more than an order of magnitude between reality and the NVIDIA claims.

Moreover, I am not sure whether around 2010 and before, when these NVIDIA claims were frequent, the power permissible for PCIe cards had already reached 300 W, or was still lower.

In any case the "100" factor claimed by NVIDIA was supported by flawed benchmarks, which compared an optimized parallel CUDA implementation of some algorithm with a naive sequential implementation on the CPU, instead of comparing it with an optimized multithreaded SIMD implementation on that CPU.
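The flaw is easy to reproduce in miniature: the "speedup" you report depends entirely on which CPU baseline you divide by. A hypothetical illustration (this is not NVIDIA's actual benchmark code):

```python
import time
import numpy as np

def naive_sum_squares(xs):
    # Naive scalar loop: the kind of unoptimized sequential CPU
    # baseline that inflates any accelerator's apparent speedup.
    total = 0.0
    for x in xs:
        total += x * x
    return total

data = np.random.rand(200_000)

t0 = time.perf_counter()
slow = naive_sum_squares(data)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
fast = float(np.dot(data, data))  # optimized, vectorized CPU baseline
t_vec = time.perf_counter() - t0

# Same answer either way; the "CPU time" you divide by is a choice.
speedup_vs_naive = t_naive / t_vec
```

An accelerator compared against `naive_sum_squares` looks dramatically faster than the same accelerator compared against `np.dot`, which is exactly the trick the comment describes.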


At the time, desktop power consumption was never a true limiter. Even for the notorious GTX 480, TDP was only 250 W.

That aside, it still didn't make sense to compare apples to apples per apple cart...


Well, power envelope IS the limit in many applications; anyone can build a LOBOS (Lots Of Boxes On Shelves) supercomputer, but data bandwidth and power will limit its usefulness and size. Everyone has a power budget. For me, it's my desk outlet capacity (1.5 kW); for a hyperscaler, it's the capacity of the power plant that feeds their datacenter (1.5 GW); neither of us can exceed Pmax * MIPS/W of computation.
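The constraint can be written out directly; the efficiency figure below is assumed purely for illustration:

```python
def max_ops_per_second(power_budget_watts, ops_per_watt):
    # Sustained throughput is capped at Pmax * efficiency,
    # no matter how many boxes you put on shelves.
    return power_budget_watts * ops_per_watt

EFFICIENCY = 1e9  # ops/W, an assumed figure for illustration

desk = max_ops_per_second(1_500, EFFICIENCY)    # 1.5 kW desk outlet
plant = max_ops_per_second(1.5e9, EFFICIENCY)   # 1.5 GW power plant
# At equal efficiency, the ceilings differ by exactly the power ratio.
```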


All of that may be true but it’s irrelevant.

If you’re dividing perf by perf/W, it makes no sense to yell “it’s not equal to 100!” You simply failed at the dimensional analysis taught in high school.


> A few years later, AMD has followed the NVIDIA example.

When bitcoin was still profitable to mine on GPUs, AMD's cards performed better because they were not segmented like NVIDIA's. It didn't help AMD, not that it matters. AMD started segmenting because they couldn't make a competitive card at a competitive price for the consumer market.


That physics engine is an example of a dead-end.


There's something of a feedback loop here, in that the reason that transformers and attention won over all the other forms of AI/ML is that they worked very well on the architecture that NVIDIA had already built, so you could scale your model size very dramatically just by throwing more commodity hardware at it.


It was luck, but that doesn't mean they didn't work very hard too.

Luck is when preparation meets opportunity.


It was definitely luck, Greg. And NVIDIA didn't invent deep learning; deep learning found NVIDIA's investment in CUDA.


I remember it differently. CUDA was built with the intention of finding/enabling something like deep learning. I thought it was unrealistic too and took it on faith in people more experienced than me, until I saw deep learning work.

Some of the near misses I remember included bitcoin. Many of the other attempts didn't ever see the light of day.

Luck in english often means success by chance rather than one's own efforts or abilities. I don't think that characterizes CUDA. I think it was eventual success in the face of extreme difficulty, many failures, and sacrifices. In hindsight, I'm still surprised that Jensen kept funding it as long as he did. I've never met a leader since who I think would have done that.


Nobody cared about deep learning back in 2007, when CUDA was released. It wasn't until the 2012 AlexNet milestone that deep neural nets started to become en vogue again.


I clearly remember CUDA being made for HPC and scientific applications. They added actual operations for neural nets years after the boom was already underway. Both instances were reactions: people were already using graphics shaders for scientific purposes and CUDA for neural nets, and in both cases NVIDIA was like, oh cool, money to be made.


Parallel computing goes back to the 1960s (at least). I've been involved in it since the 1980s. Generally you don't create an architecture and associated tooling for some specific application. The people creating the architecture have only a sketchy understanding of application areas and their needs.

What you do is have a bright idea/pet peeve. Then you get someone to fund building the thing you imagined. Then marketing people scratch their heads over who they might sell it to. It's at that point you observe "this thing was made for HPC, etc." because the marketing folks put out stories and material that said so. But really it wasn't. And as you note, it wasn't made for ML or AI either.

That said, in the 1980s we had "neural networks" as a potential target market for parallel processing chips, so it's always there as a possibility.


CUDA was profitable very early because of oil and gas code, like reverse time migration and the like. There was no act of incredible foresight from Jensen. In fact, I recall him threatening to kill the program if large projects that made it unprofitable failed, like the Titan supercomputer at Oak Ridge.


I remember it being less profitable than graphics for a long time.

It did make money that would be interesting to a startup, but not to a public company.


Again, it wasn't exactly a huge sink of resources. There was no genius gamble from Jensen like you are suggesting. I suspect your view here is intrinsically tied to your need to feel that you, and others in your position, are responsible for your own success, when in fact it's mostly about luck.


So it could just as easily have been Intel or AMD, despite them not having CUDA or any interest in that market? Pure luck that the one large company that invested to support a market reaped most of the benefits?


I'm not sure why the article dismisses cost.

Let's say X=10% of the GPU area (~75mm^2) is dedicated to FP32 SIMD units. Assume FP64 units are ~2-4x bigger. That would be 150-300mm^2, a huge amount of area that would increase the price per GPU. You may not agree with these assumptions. Feel free to change them. It is an overhead that is replicated per core. Why would gamers want to pay for any features they don't use?

Not to say there isn't market segmentation going on, but FP64 cost is higher for massively parallel processors than it was in the days of high frequency single core CPUs.
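Spelling out the comment's arithmetic (all figures are the comment's own assumptions, not measured die areas):

```python
DIE_AREA_MM2 = 750.0     # implied total die area, if 10% is ~75 mm^2
FP32_FRACTION = 0.10     # assumed share of die devoted to FP32 SIMD units
FP64_SCALE = (2.0, 4.0)  # assumed size of an FP64 unit relative to FP32

fp32_area = DIE_AREA_MM2 * FP32_FRACTION
fp64_low, fp64_high = (fp32_area * s for s in FP64_SCALE)
# Under these assumptions, full-rate FP64 costs 150-300 mm^2 of extra
# die area, replicated across the chip, which gamers would pay for.
```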


> Assume FP64 units are ~2-4x bigger.

I'm pretty sure that's not a remotely fair assumption to make. We've seen architectures that can, e.g., do two FP32 operations or one FP64 operation with the same unit, with relatively low overhead compared to a pure FP32 architecture. That's pretty much how all integer math units work, and it's not hard to pull off for floating point. FP64 units don't have to be—and seldom have been—implemented as massive single-purpose blocks of otherwise-dark silicon.

When the real hardware design choice is between having a reasonable 2:1 or 4:1 FP32:FP64 ratio vs having no FP64 whatsoever and designing a completely different core layout for consumer vs pro, the small overhead of having some FP64 capability has clearly been deemed worthwhile by the GPU makers for many generations. It's only now that NVIDIA is so massive that we're seeing them do five different physical implementations of "Blackwell" architecture variants.


> Assume FP64 units are ~2-4x bigger.

I'm not a hardware guy, but an explanation I've seen from someone who is says that it's not much extra hardware to add to a 2×f32 FMA unit the capability to do 1×f64. You already have all of the per-bit logic, you mostly just need to add an extra control line to make a few carries propagate. So the size overhead of adding FP64 to the SIMD units is more like 10-50%, not 100-300%.


Most of the logic can be reused, but the FP64 multiplier is up to 4 times larger. Also some shifters are up to 2 times larger (because they need more stages, even if they shift the same number of bits). Small size increases occur in other blocks.

Even so, the multipliers and shifters occupy only a small fraction of the total area, a fraction that is smaller than implied by their number of gates, because they have very regular layouts.

A reduction from the ideal 1:2 FP64/FP32 throughput to 1:4 or in the worst case to 1:8 should be enough to make negligible the additional cost of supporting FP64, while still keeping the throughput of a GPU competitive with a CPU.

The current NVIDIA and AMD GPUs cannot compete in FP64 performance per dollar or per watt with Zen 5 Ryzen 9 CPUs. Only Intel B580 is better in FP64 performance per dollar than any CPU, though its total performance is exceeded by CPUs like 9950X.


> Why would gamers want to pay for any features they don't use?

Obviously they don't want to. Now flip it around and ask why HPC people would want to force gamers to pay for something that benefits the HPC people... Suddenly the blog post makes perfect sense.


Similar to when NVIDIA released LHR GPUs that nerfed performance for Ethereum mining.

The NVIDIA GeForce RTX 3060 LHR tried to hinder mining at the BIOS level.

The point wasn't to make the average person lose out by preventing them from mining on their gaming GPU, but to make miners less inclined to buy gaming GPUs. They also released a series of crypto mining GPUs around the same time.

So fairly typical market segmentation.

https://videocardz.com/newz/nvidia-geforce-rtx-3060-anti-min...


NVIDIA could make 2 separate products, a GPU for gamers and an FP accelerator for HPC.

Thus everybody would pay for what they want.

The problem is that neither NVIDIA nor AMD wants to make an FP accelerator of reasonable size, sold at a profit margin similar to their consumer GPUs, as AMD did until a decade ago and as NVIDIA stopped doing a few years earlier.

Instead, they want to sell only very big FP accelerators at huge profit margins, preferably at 5-digit prices.

This makes it impossible for small businesses and individual users to use such FP accelerators.

Those are accessible only to big companies, who can buy them in bulk and negotiate prices lower than retail, and who can also keep them busy close to 24/7 in order to amortize the excessive profit margins of the "datacenter" GPU vendors.

A decade and a half ago, the market segmentation was not yet excessive, so I was happy to buy "professional" GPUs, with unlocked FP64 throughput, at about twice the price of consumer GPUs.

Nowadays I can no longer afford such a thing, because the comparable GPUs are no longer 2 times more expensive, but 20 to 50 times more expensive.

So during the last two decades, I first shifted much of my computation from CPUs to GPUs, but then had to shift it back to CPUs, because there are no upgrades for my old GPUs; any newer GPU is slower, not faster.


Throughout this article you have been voicing a desire for affordable, high-throughput FP64 processors, blaming vendors for not building the product you desire at a price you are willing to pay.

We hear you: your needs are not being met. Your use case is not profitable enough to justify paying the sky-high prices they now demand. In particular, because you don't need to run the workload 24/7.

What alternatives have you looked into? For example, Blackwell nodes are available from the likes of AWS.


I think that you might have confused me with the author of the article.

American companies have a pronounced preference for business-to-business products, where they can sell large quantities in bulk at very large profit margins that would not be accepted by small businesses or individual users, who spend their own money instead of the money of an anonymous employer.

If that is the only way for them to be profitable, good for them. However, such policies do not deserve respect. They demonstrate inefficiencies in the management of these companies, which prevent them from competing effectively in markets for low-margin commodity products.

From my experience, I am pretty certain that a smaller-die version of the AMD "datacenter" GPUs could be made and could be profitable, like such GPUs were a decade ago, when AMD was still making them. Today, however, they no longer have any incentive to do so, as they are content with selling fewer units at much higher margins, and they feel no pressure to tighten their costs.

Fortunately, at least in CPUs there has been steady progress, and AMD Zen 5 has been a great leap in floating-point throughput, exceeding the performance of older GPUs.

I am not blaming vendors for not building the product that I desire, but I am disappointed that years ago they fooled me into wasting time porting applications to their products, which I bought instead of spending the money on something else, only for them to discontinue those products with no upgrade path.

Because I am old enough to remember what happened 15 to 20 years ago, I am annoyed by the hypocrisy of some of the NVIDIA CEO's speeches, repeated for several years after the introduction of CUDA, which amounted to promises that NVIDIA's goal was to put a "supercomputer" on everyone's desk, only for him to pivot completely away from those claims and remove FP64 from "consumer" GPUs in order to sell "enterprise" GPUs at inflated prices. Soon afterwards, this prompted AMD to imitate the same strategy.


A FP64 unit can share most of two FP32 units.

Only the multiplier is significantly bigger, up to 4 times. Some shifters may also be up to twice bigger. The adders are slightly bigger, due to bigger carry-look-ahead networks.

So you must count mainly the area occupied by multipliers and shifters, which is likely to be much less than 10%.

There is an area increase, but certainly not of 50% (300 mm^2). Even an area increase of 10% (60-70 mm^2 for the biggest GPUs) seems incredibly large.

Reducing the FP64/FP32 throughput ratio from 1:2 to 1:4 or at most to 1:8 is guaranteed to make the excess area negligible. I am sure that the cheap Intel Battlemage with 1:8 does not suffer because of this.

Any further reduction, from 1:16 in old GPUs down to 1:64 in recent GPUs, can have no explanation other than the desire for market segmentation, which excludes small businesses and individual users from the customers who can afford the huge prices of the GPUs with FP64 support.


  > Assume FP64 units are ~2-4x bigger.
This is a wrong assumption. FP64 usually uses the same circuitry as two FP32 units, adding not that much ((de)normalization, mostly).

From the top of my head, overhead is around 10% or so.

  > Why would gamers want to pay for any features they don't use?
https://www.youtube.com/watch?v=lEBQveBCtKY

Apparently FP80, which is even wider than FP64, is beneficial for pathfinding algorithms in games.

Pathfinding for hundreds of units is a task worth putting on the GPU.


Has FP80 ever existed anywhere other than x87?


The Motorola 88k and 68k both supported (eventually) extended precision, and, of course, Itanium supported it for x87 compatibility.

https://en.wikipedia.org/wiki/Motorola_88000 (under "Registers": "32 80-bit (88110 only)")

https://en.wikipedia.org/wiki/Extended_precision (see section titled "IEEE 754 extended-precision formats")


10% sounds implausibly high. Even on GPUs, most of area are various memories and interconnect.


I feel like a lot of software engineering problems come from people who refuse to talk to each other except through comments in a VCS.

It makes sense if you are collaborating over IRC, but I feel the need to face palm when people sitting next to each other do it.

What is your preferred way to talk to your team?

No English, only code

Slack

Zoom

In a meeting room

Over lunch

On a walk

One thing I’ve learned over time is that the highest bandwidth way of talking is face to face because you can read body language in addition to words. Video chat is okay, but an artificial and often overly formal setting. Phone is faster than text. Text drops the audio/visual/emotional signal completely. Code is precise but requires reverse engineering intent.

I personally like a walk, and then pair programming a shared screen.


It sounds like lack of security is the biggest feature and risk of this clawd thing.

I also tried using Siri to tell me the weather forecast while I was driving to the park. It asked me to auth into my phone. Then it asked me to approve location access. I guess it was secure, but I never figured out what the weather forecast was.

Thankfully it didn't rain on my picnic. Some of the parents there asked me if their investors should be interested in clawd.


There are definitely people who should not be running this:

https://www.shodan.io/search?query=clawdbot-gw


I guess we don't need to worry about sneaky prompt injection when there are 299 people giving away the prompt interface for free!


Especially as root...

