Buying hardware is paying a "random corporation" too. Make the massive hardware purchase after finding out whether you have enough demand to justify buying rather than renting.
And even less than someone who wrote an interpreter for the script, less than someone who also chanted times tables while doing it.
More thinking isn’t simply a good thing. Given a limit to how much thought I can give any specific task, adding extra work may mean less thinking where it’s most useful.
It is a good-faith argument; my point is exactly that the actual scripting was not part of the relevant thought, any more than the interpreter would have been.
It depends on whether the interesting part of the solution, for you, is the website. Maybe it is, and that’s fine, but for others it isn’t. Maybe they’ve got a cool backend thing and the UI isn’t the key part.
If a comparison helps: you might genuinely want to manage a tricky server and all its various parts. It’d take the fun away to just put a site on GitHub Pages rather than hosting it on a PDP-11. But if you want to show off your demoscene work, you wouldn’t feel like you’d missed out on the fun by just putting things up on a regular site.
I had a look and new ones seemed to be about £15 here; I couldn’t easily find second-hand ones in the UK (though they’re not uncommon at shops). For £40 I can get a 7.5 inch black and white screen setup (TRMNL BYOD XIAO, https://www.aliexpress.com/item/1005009532501677.html).
Lots of the tags I see though do have Bluetooth or maybe WiFi for updating as well.
I do really like e-ink things. I want to set up a nice 13 inch one, which is now more like £160, so it’s becoming more realistic for me to buy for fun.
I’m going to have to look more into these tags because if there’s cheap second hand ones they’d be awesome.
The other explanation is that often these are just the mistakes that happen when a team of experts in their field, but not in data management, without a budget for building a more robust system, is manually doing a lot of things with data. It's so easy to copy and paste something into the wrong place, to sort by a field and get things out of order, all kinds of issues like that.
On the other hand, any time a hypothesis appears significant, the first reaction should be to verify that all the data going into the calculation is correct, rather than just assume it is. In my day-to-day industry experience, significant results come far more often from incorrect data than an actual discovery.
Even years after moving away from more raw data work, way too much of my brain is still dedicated to "ways of dealing with CSV from random places".
I can already hear the people who like CSV coming in now, so to get some of my bottled-up anger about CSV out, and to forestall the responses I've seen before:
* It's not standardised
* Yes, I know you found an RFC from long after many generators and parsers were written. It's not a standard, it's regularly not followed, and it doesn't specify allowing UTF-8 (lmao, in 2005 no less) or other character sets - they're just files. I have learned about many new character sets from data submitted by real users. I have had to split up files written in multiple different character sets because users concatenated files.
* "You can edit it in a text editor" which feels like a monkeys-paw wish "I want to edit the file easily" "Granted - your users can now edit the files easily". Users editing the files in text editors results in broken CSV files because your text editor isn't checking it's standards compliant or typed correctly, and couldn't even if it wanted to.
* Errors are not even detectable in many cases.
* Parsers are often either strict, and so fail to deal with real-world cases, or they deal with real-world cases but let broken files through.
* Literally no types. Nice date field you have there, shame if someone were to add a mixture of different dd/mm/yy and mm/dd/yy into it.
* You can blame excel for being excel, but at some point if that csv file leaves an automated data handling system and a user can do something to it, it's getting loaded into excel and rewritten out. Say goodbye to prefixed 0s, a variety of gene names, dates and more in a fully unrecoverable fashion.
* "ah just use tabs" no your users will put tabs in. "That's why I use pipes" yes pipes too. I have written code to use actual data separators and actual record separators that exist in ASCII and still users found some way of adding those in mid word in some arbitrary data. The only three places I've ever seen these characters are 1. lists of ascii characters where I found them, 2. my code, 3. this users data. It must have been crafted deliberately to break things.
This, Excel, and other things are enormous issues. The fact that there are any manual steps along the path introduces so many places for errors. People writing things down and then entering them into Excel/whatever. Moving data between files. You ran some analysis and got graphs: are those the ones in the paper? Are they based on the same datasets? You later updated something: are all the downstream things updated?
This occurs in all kinds of papers; I've seen clear and obvious issues in datasets covering many billions in spending, trillions in aggregate. I can only assume the same is true in many other fields, since those processes exist there too.
There is so much scope to improve things, and yet so much of this work is done by people who don't know what the options are, and who are often working late hours in personal time, so it's rarely addressed. My wife was still working on papers, unpaid, years after leaving a research position, because the whole research -> publication process is so slow. What time is there then for learning and designing a better way of tracking and recording data, and for teaching all the other people how to update it and generate stats? I built things which helped, but there's only so much of the workflow I could manage.
While I appreciate a good rant just as much as the next person, most of these points have nothing to do with CSV. They are a general problem with underspecifying data, which is exactly what happens when you move data between systems.
The number of hours I have wasted on unifying character sets across single database tables is horrifying to even think about. And the months it took to get an important national dataset, one that supposedly many people across several types of businesses use, into a usable state were staggering. The fact that that XML came with a DTD was apparently no hindrance to doing unspeakable horrors with both attributes and CDATA constructs.
Sure, you can specify MM/DD/YY in a table, but if people put DD/MM/YY in there, what are you going to do about it? And that's exactly what happens in the real world when people move data across systems. That's why mojibake is still a thing in 2026.
I disagree, they are absolutely related to CSV in that these are all problems CSV has. Other formats can have these problems, but CSV is almost uniquely bad because these issues compound and it has a lot of them.
> They are a general problem with underspecifying data,
Which CSV provides essentially no tools to solve, unlike many other formats.
Also, several of these problems are not even about underspecified data but the format itself - you can have totally fine data which gets utterly fucked to the point of not parsing as a csv file by minor changes.
It's not even a fully specified format! Someone adds a comma in a field and then one of the following happens (a quick sketch after this list shows the first case):
* Something generating the csv doesn't add quotes
* Something reading the csv doesn't understand quotes
And the classic
* Something sorted the file
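As a rough illustration of the first case (made-up data, not anything from a real file):

```
import csv, io

row = ["42", "Smith, John", "ok"]

# A naive generator that just joins fields with commas and never quotes:
naive_line = ",".join(row)          # '42,Smith, John,ok'

# A standards-ish reader now sees four fields where there were three.
parsed = next(csv.reader(io.StringIO(naive_line)))
print(parsed)                       # ['42', 'Smith', ' John', 'ok']
```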
> Sure, you can specify MM/DD/YY in a table, but if people put DD/MM/YY in there, what are you going to do about it?
If you've got something with actual date types, you can have interfaces show actual calendars, and in many formats you will at least get an error if a field is defined as DD/MM/YY and someone puts in 01/13/26. CSV gives you no ability to do this - all data is just strings. And string dates with no restrictions are why I have had to deal with mixtures of 01/13/26 and 13/01/26, meaning everything goes just fine until you try to parse it. Or, like some of my personal favourites, "Winter 2019".
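Roughly why the mixtures are so nasty (a made-up value, just to illustrate): the ambiguous rows parse happily under either convention, so nothing complains until the damage is done.

```
from datetime import datetime

value = "01/02/26"  # is this 1 Feb or 2 Jan? Both parses succeed.
print(datetime.strptime(value, "%d/%m/%y").date())  # 2026-02-01
print(datetime.strptime(value, "%m/%d/%y").date())  # 2026-01-02

# Only the unambiguous rows ever raise an error:
try:
    datetime.strptime("13/01/26", "%m/%d/%y")
except ValueError:
    print("month 13 doesn't parse as MM/DD/YY")
```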
CSV is not one format, lacks verification of any useful kind, is almost uniquely easy for users to completely fuck up, and the lack of types means that programs do their own type inference which adds to things getting messed up.
You're blaming a lot of normal ETL problems on DSVs.
Like, specifying date as a type for a field in JSON isn't going to ensure that people format it correctly and uniformly. You still have parsing issues, except now you're duplicating the ignored schema for every data point. The benefit you get for all of that overhead is more useful for network issues than ensuring a file is well formed before sending it. The people who send garbage will be more likely to send garbage when the format isn't tabular.
There are types and there is a spec WHEN YOU DEFINE IT.
You define a spec. You deal with garbage that doesn't match the spec. You adjust your tools if the garbage-sending account is big. You warn or fire them if they're small. You shit-talk the garbage senders after hours to blow off steam. That's what ETL is.
DSVs aren't the problem. Or maybe they are for you because you're unable to address problems in your process, so you need a heavy unreadable format that enforces things that could be handled elsewhere.
We are talking here in the context of scientific datasets. Of course ETL plays a part here. But here it is really more the interplay of Excel with CSV, which is often output by scientific instruments or scientific assistants.
You get your raw sensor data as a CSV and just want to take a look in Excel. It understandably mangles the data in an attempt to infer column types - because of course it does, it's CSV! Then you mistakenly hit save and boom, all your data on disk is now an unrecoverable mangled mess.
Of course this is also the fault of not having good clean data practices, but with CSV and Excel it is just so, so easy to hold it wrong, simply because there is no right way.
> so you need a heavy unreadable format
I prefer human unreadable if it means I get machine readable without any guesswork.
No, it's Excel trying to be too clever. It does the same thing with manual input if you don't proactively change the field type.
You can import a DSV into Excel without mangling datatypes in a few different ways. Probably the best way is using Power Query.
A DSV generally does have a schema. It's just not in the file format itself. Just because it isn't self-describing doesn't mean it isn't described. It just means the schema is communicated outside of the data interchange.
If you get an .xls which doesn't have very esoteric functions, I expect it to open about the same way in any Excel program and any other office suite.
With CSV I do not have that expectation. I know that for some random user-submitted CSVs, I will have to fiddle. Even if that means finding the one row in a thousand which has some null-value placeholder, messing up the whole automatic inference.
No. That's not at all what I'm saying. I am saying that a fixed CSV file will open differently depending on the program you open it with.
Don't even need to transfer it. Opening a CSV in pandas can be different from opening it with Polars, which can be different from DuckDB, which can be different from Excel.
You've got no guarantees. There's no spec, and how edge cases (if you want to call how to serialize and deserialize a float an edge case) are handled is left to the implementation.
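Even within one library the inference is fragile. A rough sketch (made-up data; exact behaviour varies by library and version):

```
import io
import pandas as pd

clean = "id,amount\n1,10.5\n2,11.0\n"
dirty = "id,amount\n1,10.5\n2,?\n"   # one stray placeholder pandas doesn't treat as NA

print(pd.read_csv(io.StringIO(clean))["amount"].dtype)  # float64
print(pd.read_csv(io.StringIO(dirty))["amount"].dtype)  # object: the whole column is now strings
```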
It's both of their faults. CSV is not blameless here - Excel is doing something that users broadly expect: dates as dates and numbers as numbers, not everything as strings. If CSV had types then Excel would not have to guess what they are.
It does have types if you define them in the schema. Not every format needs to be self-describing. It's often more efficient to share the schema once outside of the data feed than have the overhead of restating it for every data point.
It's completely Excel's fault for pushing their type-inference and making it difficult for users to define or supply their own.
Power Query does a better job of handling it, but you should be able to just supply a schema on import, like you can with Polars or DuckDB.
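Something along these lines, say with DuckDB (the file name and columns here are hypothetical):

```
import duckdb

# Declare the column types up front instead of letting the reader guess.
rel = duckdb.sql("""
    SELECT *
    FROM read_csv('expenses.csv',
                  header = true,
                  columns = {'invoice_id': 'VARCHAR',   -- keeps leading zeros
                             'issued_on':  'DATE',
                             'amount':     'DECIMAL(10,2)'})
""")
print(rel)
```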
It's another example of MS babying their userbase too much. Like how VBA is single threaded only because threads are hard. They're making their product less usable and making it harder for their users to learn how stuff works.
CSV doesn’t have a schema; it has a barely-adhered-to post-hoc “not a specification”, and everything is strings.
That you can solve some of these problems by using something alongside the CSV file is nowhere near as helpful, and it’s a clear problem with CSV files. There is no universally followed schema, for a start, so now we’re at unique solutions all over the place.
> It's often more efficient to share the schema once outside of the data feed than have the overhead of restating it for every data point.
You cannot be suggesting that CSV files are efficient, surely; they’re atrociously inefficient. Having the schema and the data tied together in the same format would solve a lot and add barely any overhead. If you want efficiency, do not use CSV.
Asking users to manually load in the right schema every time they open a file is asking for trouble. Why wouldn’t you combine them?
> It's completely Excel's fault for pushing their type-inference and making it difficult for users to define or supply their own.
It’s not entirely Excel’s fault that CSV doesn’t have types. They didn’t invent and promote a new standard, but then why would you? There are better formats out there. I’m sure they would argue that Excel files are a better format, for a start.
And people did make better formats. That’s why I think csv should be consigned to the bin of history.
> "You can edit it in a text editor" which feels like a monkeys-paw wish
Yes :) Although I will note that some editors are good enough to maintain the structure as the user edits. Consider Emacs with `csv-mode`, for example. Of course most users don’t have Emacs so they’ll just end up using notepad (or worse, Word).
Systems that review pull requests have been caught out; that’s a simple and clear example. The more obvious one, for most people, is anything you do that interacts with your email without an explicit allow-list of emails to read.
Yes, but none of this applies to the local codex agent that runs when I tell it to and has access to my computer. Like: "scan this folder of PDFs and create an Excel file with all expenses. Then enter them into my tax software." This needs access to very sensitive data and involves quite complex handling of data. But the only attack vector I see is someone injecting prompts into my invoice files.
Which applies if you were to do this with invoices submitted to you, rather than ones you created, or if there is any way user-supplied info can get into your invoices.
The overall speed rather than TTFT might start to be more relevant as the caller moves from being a human to another model.
However, quality is really important. I tried that site and clicked one of their examples, "create a javascript animation". Fast response, but while it starts like this
```
Below is a self‑contained HTML + CSS + JavaScript example that creates a simple, smooth animation: a colorful ball bounces around the browser window while leaving a fading trail behind it.
```
Weird; I clicked through out of curiosity and didn't get any corruption of the sort in the end result.
I also asked it some technical details about how diffusion LLMs could work and it provided grammatically-correct plausible answers in a very short time (I don't know the tech to say if it's correct or not).
I got the exact same thing. But trying out another few prompts I couldn't get it to happen again. I wonder if it's a bug with the caching/website? I can't imagine they actually run inference each time you use one of the sample prompts?