Partial updates could be useful for certain kinds of (mostly binary) files, but block storage is going to handle that much better in general than object storage. The concurrency and consistency guarantees are quite different between object and block storage. Making partial updates atomic would be quite difficult in general in S3, though simple preconditions like compare-and-swap (which is sorely needed anyway) might be sufficient to make it possible for certain use cases.
The paradigm can be flipped now for these distributed storage systems.
The blocks of a filesystem can now be objects, replicated or erasure-coded, like Ceph running filesystems on top of its low-level object store (RADOS), which writes to raw disks rather than sitting on a filesystem.
This can't be done with something like MinIO just running on top of your filesystem, but if you're building the storage system from the ground up, it can.
We will see more and more of these products appearing.
Vast Data is another interesting one. Global deduplication and compression for your data, with an S3, block, or NFS interface.
Storing differential backups for thousands or millions of VMs?
You'll only store the data from the base Ubuntu image once.
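Roughly the idea (not how VAST actually does it; fixed-size chunks and SHA-256 here purely for illustration): hash each block of the image and only store blocks you haven't seen before.

```python
import hashlib

# Illustrative block-level dedup: identical chunks (e.g. the base Ubuntu
# image shared by every VM) hash to the same key and are stored once.
# Fixed 4 MiB chunks for simplicity; real systems use their own chunking,
# hashing, and compression schemes.
CHUNK = 4 * 1024 * 1024
store = {}           # content hash -> chunk bytes (stands in for the backend)

def backup(image_path):
    """Return the list of chunk hashes that reconstructs this image."""
    recipe = []
    with open(image_path, "rb") as f:
        while chunk := f.read(CHUNK):
            key = hashlib.sha256(chunk).hexdigest()
            if key not in store:        # only new/changed blocks cost space
                store[key] = chunk
            recipe.append(key)
    return recipe

# Backing up a thousand VMs cloned from the same base image mostly produces
# recipes pointing at chunks that are already in `store`.
```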
I was referring more to the list of hypervisor features not implemented:
>> Because EC2's hypervisor was (when it was launched) lacking features (no hot swap, no shared block storage, no host movements, no online backup/clone, no live recovery, no highly available host failover), S3 had to step in to pick up some of the slack that proper block or file storage would have taken.
I don't want any of that nonsense in my compute layer, nor an application (at scale) that relies on shared block storage, host movements, or live recovery.
One of the important premises of object storage is that if your PutObject or multipart upload succeeds, the entire object is atomically replaced. It is eventually consistent, so you may not immediately retrieve the just-uploaded object with GetObject, but you should see the new version eventually, and never see part of one version mixed with part of another. This should natively support compare-and-swap: "hey, if the existing etag is what I expect, apply my change; otherwise ignore my change and tell me so". This has nothing to do with DynamoDB and is not reimplementing its feature set. It is just a natural extension of how the service already works (from an API consumer perspective, not necessarily an implementation perspective).
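As a consumer of the API, the sketch I have in mind looks something like this (boto3; the IfMatch precondition on PutObject is the hypothetical part, since that's exactly the bit S3 doesn't give you):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def compare_and_swap(bucket, key, expected_etag, new_body):
    """Replace the object only if it still has the ETag we last saw.

    The IfMatch precondition is hypothetical: if the ETag no longer matches,
    the service would reject the write (412) instead of clobbering someone
    else's update.
    """
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=new_body,
                      IfMatch=expected_etag)   # hypothetical CAS precondition
        return True
    except ClientError as e:
        if e.response["ResponseMetadata"]["HTTPStatusCode"] == 412:
            return False    # lost the race: re-read, re-apply, retry
        raise
```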
Transparent HA means that I can fail services over to other regions without having to get the programmers to think about it. Most of the busy work at scale is managing state or, more correctly, recovering state from broken machines.
If I can make something else do that reliably, rather than engineer it myself, that's a win.
So much of the work of standing up a cluster (be it k8s or something else) is getting to the point where you can arbitrarily kill a datastore and have it self-heal.
If you're talking about S3 partial updates, it's about cost and/or performance. If you're dealing with megabyte-sized objects and you want to flip a few bytes across hundreds of thousands of them, that's going to eat into transfer costs.
Sure, you could chunk the files up even smaller, but then you run into access latency (S3 ain't that fast).
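To make it concrete: without partial updates, every few-byte change is a full read-modify-write round trip, something like this (boto3, purely illustrative):

```python
import boto3

s3 = boto3.client("s3")

def flip_bytes(bucket, key, offset, patch):
    """Change a few bytes in an object the only way S3 lets you today:
    pull the whole object down, patch it in memory, push the whole thing
    back. For a few bytes in a multi-megabyte object, repeated across
    hundreds of thousands of objects, the transfer and request costs dwarf
    the size of the actual change.
    """
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    patched = body[:offset] + patch + body[offset + len(patch):]
    s3.put_object(Bucket=bucket, Key=key, Body=patched)
```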
I was referring to the notion that "failings" in the hypervisor layer like "hot swap, shared block storage, host movements, online backup/clone, live recovery, highly available host failover" are a problem. At scale, I don't want my application to rely on any of that magic.
Reliability is always your problem, not something to be punted to another layer of the stack that lets you pretend stuff doesn't go wrong.
Yup, which is why relying on devs to engineer it is a pain in the arse. Online migration is such a useful tool for avoiding accidental overloads when doing maintenance; it's also a great tool to have when testing config changes.
Currently I work at a place that has its own container spec and scheduler. This makes sense because we have literal millions of machines to manage, but that's an edge case.
For something like a global newspaper (where I used to work) it would be massive overkill. We spent far too long making K8s act like a mainframe, when we could have bought one 20 times over and still had change left for a good party every week. Or just used hosted databases and liberal caching.
Oh sure -- for piddly enterprise nonsense, having some VM-yeeting magic to HA a thing that's not HA is... yeah, I guess. Ideally in combination with tested backups for when the HA magic corrupts instead of protects, but such is life.
But that's not "at scale"; that's just some Great Plains accounting app that's been dragged from one pickle jar to another.
In 2016 we had a 36k cluster. There was something like 2 PB of fast online storage, 48 PB of nearline, and two massive tape libraries for backup/interchange.
The cluster was ephemeral, and could be reprovisioned automatically by netboot. However, the DNS/DHCP + auth servers were on the critical path, so we dumped them on a VMware cluster to make sure we could run them as close to 100% uptime as possible. Yes, they were replicated, but they were also running on separate HA clusters with mirrored storage. This meant that if we lost both of them, we could within a few minutes run them directly from a snapshot, or, if it was a catastrofuck, reload the config from git.
Now, we could have built our own DNS+DHCP servers, and/or Kerberos/LDAP/Active Directory, but that costs money and wasn't worth the time. Plus, the risk of running your own with a small crew (fewer than 10 infra people) was way too high.
VMware gave you almost mainframe levels of uptime, if you did it right.