Hacker News

Append would allow building a lot of other systems. I mean, the only functional difference between S3 and GFS is the append operation. Google built BigTable, Megastore, and who knows what else over GFS. You can't do the same with S3 (without implementing the append somewhere else yourself).


GFS? Google hasn’t used GFS for, like, fifteen years.

You can totally build stuff like Megastore or Bigtable (or Spanner) on top of S3. You use a log-structured merge tree. That’s how these systems work in the first place. In the log-structured merge tree, you have a set of files containing your data but you don’t modify them. Instead, you write new files containing the changes (the log). Eventually you compact them by writing a complete copy and deleting the old versions.
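To make the pattern above concrete, here is a minimal sketch of a log-structured merge layout over an object store. Everything here is hypothetical: `ObjectStore` is a dict standing in for S3-style put/delete, and the class and method names are illustrative, not any real client library.

```python
class ObjectStore:
    """Stand-in for an S3-style bucket: put/get/delete whole objects,
    no append, no in-place modification."""
    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def delete(self, key):
        del self.objects[key]


class LsmTable:
    """Toy LSM table: every write batch becomes a new immutable file;
    compaction writes one merged copy and deletes the old versions."""
    def __init__(self, store):
        self.store = store
        self.seq = 0
        self.segments = []  # object keys, oldest first

    def write_batch(self, kv):
        # Never modify existing objects: each batch is a new file (the log).
        key = f"seg-{self.seq:06d}"
        self.seq += 1
        self.store.put(key, dict(kv))
        self.segments.append(key)

    def get(self, k):
        # Newest segment wins, mimicking an LSM read path.
        for seg in reversed(self.segments):
            data = self.store.objects[seg]
            if k in data:
                return data[k]
        return None

    def compact(self):
        # Write a complete merged copy, then delete the old versions.
        merged = {}
        for seg in self.segments:
            merged.update(self.store.objects[seg])
        key = f"seg-{self.seq:06d}"
        self.seq += 1
        self.store.put(key, merged)
        for seg in self.segments:
            self.store.delete(seg)
        self.segments = [key]
```

The point of the sketch: reads and compaction only ever need whole-object put/get/delete, which is exactly what S3 offers.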

This works just fine on S3, and there are even some key-value stores built on top of S3 that work this way. Colossus is cheaper for short-lived data.


> you write new files containing the changes (the log)

People are asking for create-if-not-exists specifically to be able to add objects to an ordered log, without needing a separate service for coordination. S3 cannot be used for this. GCS, for example, can.
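A sketch of why create-if-not-exists is enough for an ordered log with no coordinator: writers race to create the object for the next log index, exactly one wins, and losers retry at the next index. In real GCS this maps to the `ifGenerationMatch=0` precondition; the `CasStore` below is a hypothetical local stand-in, not a real client.

```python
import threading


class CasStore:
    """Stand-in for a bucket supporting atomic create-if-not-exists
    (e.g. GCS's ifGenerationMatch=0 precondition). Hypothetical API."""
    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def create_if_absent(self, key, data):
        # Atomic "create only if the object does not already exist".
        with self._lock:
            if key in self._objects:
                return False
            self._objects[key] = data
            return True


def append_to_log(store, record, start=0):
    """Claim the next slot in an ordered log without a separate
    coordination service: race to create log/<n>, retry on a loss."""
    n = start
    while True:
        if store.create_if_absent(f"log/{n:06d}", record):
            return n
        n += 1
```

Two writers can attempt the same index; the precondition guarantees only one object is created, so the log stays a single total order.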


Yeah, I know I didn't mention that. You can build this type of system on top of S3 + CAS (compare-and-swap).

The “append” feature isn’t necessary for functionality, it just improves the cost / performance.

The idea of running your own database on top of S3 is, well, it’s gonna be janky. It’s not ideal. You do end up seeing databases running on top of S3 (I’ve seen some), and sometimes it even makes sense.


OLAP data in S3 makes perfect sense, and that use case is desperate for create-if-not-exist to enable simpler inserts.

https://www.databricks.com/wp-content/uploads/2020/08/p975-a...


I think you missed the point. GFS supported append operations and was created around 24 years ago; S3 still hasn't caught up with this particular feature. They clearly have the machinery for it, though, since you can do a long multipart upload and S3 will join the parts into one object for you.
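For reference, the real S3 flow is CreateMultipartUpload / UploadPart / CompleteMultipartUpload (in boto3: `create_multipart_upload`, `upload_part`, `complete_multipart_upload`). Since that can't run locally, here is a stand-in sketch of just the join semantics the comment describes; the class is hypothetical, not a real client.

```python
class MultipartUpload:
    """Local sketch of S3 multipart-upload semantics: parts are
    uploaded independently, in any order, and only become a single
    visible object when the upload is completed. (Real S3 also
    enforces a 5 MiB minimum part size, ignored here.)"""
    def __init__(self, bucket, key):
        self.bucket = bucket  # plain dict standing in for a bucket
        self.key = key
        self.parts = {}

    def upload_part(self, part_number, data):
        self.parts[part_number] = data

    def complete(self):
        # S3 concatenates parts in part-number order into one object.
        joined = b"".join(self.parts[n] for n in sorted(self.parts))
        self.bucket[self.key] = joined
        return joined
```

This is why the comment calls it an append in disguise: the server already knows how to stitch independently written pieces into one object.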

At a high level, yes, you can implement systems like Megastore or Bigtable over S3. However, there are many details you must take into account. You cannot simply wave away the complexity and potential failure scenarios.

For starters, how are you going to create the newest SST?

If you keep it in memory or on local disk, it must be replicated, or a machine failure loses the most recent changes. Additionally, you end up with a hybrid system that needs to read data from multiple sources, which adds complexity. If you essentially reimplement the system, why use S3 at all?

What if the data volume gets too low and you end up writing many small, expensive files?

Using something like Kinesis for batching might work, but the data won't be visible for N minutes.

Merging partial tables also requires maintaining an external index to track availability. Transactions would be helpful, but how do you handle failures?

And we haven't even mentioned managing garbage collection. It would require an external lock or reference count system.
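One way to frame the garbage-collection problem: immutable SST objects can only be deleted once no live table version references them, which is a reference count. The sketch below is a hypothetical illustration of that bookkeeping, not any particular system's design.

```python
from collections import Counter


class SstGc:
    """Hypothetical reference-count GC for immutable SST objects:
    each live table version pins the files it reads; a file may be
    physically deleted only when no version references it."""
    def __init__(self):
        self.refs = Counter()
        self.deleted = set()

    def pin(self, files):
        # A new table version (e.g. after compaction) pins its files.
        for f in files:
            self.refs[f] += 1

    def unpin(self, files):
        # Retiring a table version; collect anything that hit zero.
        for f in files:
            self.refs[f] -= 1
            if self.refs[f] == 0:
                del self.refs[f]
                self.deleted.add(f)  # now safe to issue the real DELETE
```

The hard part the comment is pointing at: this counter itself needs durable, coordinated storage, which S3 alone doesn't give you.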


> I think you missed the point.

Maybe I am thinking more broadly when I imagine what it means to implement something like Spanner on top of S3.

We know that Spanner on top of S3 is not going to give you the same price/performance as Spanner on top of Colossus while keeping the same semantics. You either relax the semantics a little bit, pay through the nose for a lot of little files, or find a durable place outside S3 to store the newest data.

> At a high level, yes, you can implement systems like Megastore or Bigtable over S3. However, there are many details you must take into account. You cannot simply wave away the complexity and potential failure scenarios.

Most of the complexity is the same whether you implement Megastore on top of S3 or on top of GFS. You can’t handwave it in either scenario.

> If you essentially reimplement the system, why use S3 at all?

It’s highly durable, highly available, and cheap (under certain usage scenarios).

GFS is not available for anyone to use, inside or outside Google. Its successor, Colossus, is not available outside Google. They’re just not available.


I haven't used S3: does compacting work within their system, like with an API call, or do you have to download all the chunks, upload the concatenated result, and then delete the chunks?


Compacting is a database operation. It would happen in your database, not at the underlying storage layer.

Your database may use multiple underlying storage layers anyway.


Ah, I guess I misunderstood the part about writing the complete copy. The data isn't really written, it's just abstracted access to multiple chunks.



