
Because if you have many TBs of data it's cheaper to run something like Spark across a bunch of smaller machines than it is to try to set up a many-TB PostgreSQL instance.

A trick that many data warehousing tools use these days is to farm out computing to where the data is stored.

You might have a PB of data spread across 100 different instances. When a SQL query comes in, you break it up into a query plan that can be run in parallel against the subset of data on each of those instances, then aggregate the results together.
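The scatter/gather flow above can be sketched in a few lines of Python. This is a toy model, assuming in-memory lists stand in for the per-instance data and a thread pool stands in for the network fan-out; a real warehouse ships the sub-query to each instance instead:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each list models the rows held by one instance.
partitions = [
    [{"user": "a", "amount": 10}, {"user": "b", "amount": 5}],
    [{"user": "a", "amount": 7}],
    [{"user": "c", "amount": 3}],
]

def partial_sum(rows):
    # The sub-query runs next to the data: SELECT SUM(amount) on one partition.
    return sum(r["amount"] for r in rows)

# Scatter: run the partial aggregate on every partition in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_sum, partitions))

# Gather: combine the small partial results into the final answer.
total = sum(partials)
print(total)  # 25
```

Only the tiny partial sums cross the "network", not the rows themselves, which is the whole point of pushing computation down to the storage nodes.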

It's cheaper to send the computation out to run next to the data than it is to copy the data back to the nodes that are executing the computation.

It's all variants of the classic map/reduce technique.
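For reference, the classic technique itself fits in a few lines. This is a toy word count, assuming a list of strings stands in for distributed file blocks:

```python
from collections import Counter
from functools import reduce

# Hypothetical chunks: each string models one block of a distributed file.
chunks = [
    "spark makes big data small",
    "small data fits in postgres",
    "big data wants parallel scans",
]

def map_phase(chunk):
    # Map: each node counts words in its local chunk.
    return Counter(chunk.split())

def reduce_phase(a, b):
    # Reduce: merge per-node counts into a single result.
    return a + b

counts = reduce(reduce_phase, map(map_phase, chunks), Counter())
print(counts["data"])  # 3
```

The per-chunk map step is embarrassingly parallel; only the reduce step needs to see the (much smaller) intermediate results.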

As a result, a data warehouse may be able to run a dumb SQL LIKE query against everything it stores in a reasonable amount of time, since the scan runs in parallel across every node.
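A minimal sketch of that parallel LIKE scan, again assuming in-memory partitions in place of real storage nodes; the filter is just the substring test a `LIKE '%eta%'` would perform:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of a text column.
partitions = [
    ["alpha", "beta", "gamma"],
    ["delta", "epsilon"],
    ["zeta", "eta", "theta"],
]

def scan(rows, needle):
    # Equivalent of WHERE col LIKE '%eta%': a brute-force substring scan.
    return [r for r in rows if needle in r]

# Every partition is scanned concurrently; each returns only its matches.
with ThreadPoolExecutor() as pool:
    results = pool.map(lambda rows: scan(rows, "eta"), partitions)

matches = [row for chunk in results for row in chunk]
print(matches)  # ['beta', 'zeta', 'eta', 'theta']
```

No index helps a leading-wildcard LIKE, so every row gets touched either way; parallelism just divides the wall-clock time by the number of nodes.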

The trade-off is that you give up consistency (ACID guarantees) and real-time results: the warehouse is generally repopulated on a schedule, so it won't be great at answering questions about changes that happened a few seconds ago.
