
I want to understand something: why use Spark or DataFusion when the way to interact with them is SQL?

Why not just use traditional DBMS like PostgreSQL?

Are there explainer articles on this topic?

See Quora: https://www.quora.com/unanswered/Whats-the-advantage-of-doin...



Because if you have many TBs of data it's cheaper to run something like Spark across a bunch of smaller machines than it is to try and set up a many-TB PostgreSQL instance.

A trick that many data warehousing tools use these days is to farm out computing to where the data is stored.

You might have a PB of data spread across 100 different instances. When a SQL query comes in you break that up into a query plan that can be run in parallel against the subset of data on each of those instances, then aggregate together the results.

It's cheaper to send the computation out to run next to the data than it is to copy the data back to the nodes that are executing the computation.

It's all variants of the classic map/reduce technique.

As a result, a data warehouse may be able to run a dumb SQL 'like' query against everything it is storing in a reasonable amount of time, since the query gets to run in parallel.
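The scatter/gather pattern described above can be sketched in miniature. This is a toy illustration, not how Spark actually schedules work: Python threads stand in for remote instances, and a substring match stands in for the SQL 'like':

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "instances": each holds its own shard of the data.
shards = [
    ["alpha", "beta", "gamma"],
    ["delta", "epsilon", "zeta"],
    ["eta", "theta", "iota"],
]

def run_like_query(shard, pattern):
    # Each instance scans only its local shard (the "map" step).
    return [row for row in shard if pattern in row]

def scatter_gather(pattern):
    # Fan the query out to every shard in parallel, then merge
    # the partial results (the "gather"/"reduce" step).
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: run_like_query(s, pattern), shards)
    return [row for partial in partials for row in partial]

print(scatter_gather("eta"))  # ['beta', 'zeta', 'eta', 'theta']
```

The key property is that each worker only ever touches its own shard; the coordinator sends out a tiny query and receives back only the matching rows, never the full dataset.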

The trade-off is that you give up consistency (ACID, etc.) and real-time results against data changes: generally your data warehouse will be repopulated on a schedule, so it won't be great at answering questions about changes that happened just a few seconds ago.


> Why not just use traditional DBMS like PostgreSQL

Serialization. It's usually a non-trivial fraction of the time spent in a lot of distributed systems, and for some workloads it can be the bulk of it.

If I want to grab 3GB of data from a remote host and process it locally, we have to agree on how that data is going to be transferred so I can use it. Could very well be SQL, so we have some sort of network-based tabular data stream. Maybe it's Parquet files, so we're using NFS/S3 to copy the files to local disks before reading them into a completely separate in-memory data structure. At the end of this workload, I have 1GB of data that I now want to write back. Maybe the data is stored in memory as an array of mixed-type structs, but I can't just send those bytes as-is to a SQL server, or mmap them to the filesystem and expect Parquet to know what they mean.

Apache Arrow and DataFusion aim to eliminate all that work of rewriting bytes between hosts. Imagine being able to create a cost-based optimized query plan on Host A, send it to Hosts M-P for processing, and even have that query plan trickle down to Parquet predicates when reading files from disk, before returning data to Host A, which can simply copy it from the network into local memory and start working with it right away.
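The zero-copy idea can be illustrated without Arrow itself: if sender and receiver agree on a fixed columnar byte layout, the receiver can interpret the wire bytes in place instead of parsing them into new objects. A minimal stdlib sketch (the layout here is a stand-in for illustration, not Arrow's actual format):

```python
import array

# Sender: an int64 column laid out as contiguous native-endian bytes.
column = array.array("q", [10, 20, 30, 40])
wire_bytes = column.tobytes()  # what would travel over the network

# Receiver: no parse/deserialize step. memoryview.cast reinterprets
# the same buffer as typed int64 values, without copying anything.
received = memoryview(wire_bytes).cast("q")
print(received[2])    # 30
print(sum(received))  # 100
```

Contrast this with a row-oriented text protocol, where every value would have to be tokenized and converted before you could do any arithmetic on it; with an agreed-on columnar layout, the bytes on the wire are already the in-memory representation.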


Data formats like Parquet/Arrow and engines like DataFusion are optimised for high-speed reading, writing, and processing of large amounts of data, which is generally what you're going to be using them for.

Additionally, as others have mentioned, clustered processing for larger-than-machine/RAM datasets is a bit easier to manage than setting up a database cluster.

Another benefit is ephemeral compute: we have a Kubernetes cluster, and a particular message in a Kafka topic can kick off a Spark job across several machines in the cluster (possibly causing autoscaling) which processes the x TBs of data it needs to, writes the results out, and then finishes. Faster, cheaper and better suited than keeping a multi-node DB cluster going.

It also lets us run non-SQL stages with fewer bottlenecks: bulk ML scoring, bulk data enrichment, etc.




