To the extent that "Big Data" originally meant, and is still often claimed to mean, "data beyond what fits on a single [process/RAM/disk/etc.]", it has always been strange to me how strongly it gets identified with analytics pipelines doing largely trivial transformations to produce ultra-expensive "BI" pablum.
Yes, thank goodness that part is dead. But meanwhile, we've still got more actual data than ever to store, and ever-tighter deadlines for finding and delivering it. If we can get back to that core problem and let the PySpark bootcampers fade away, maybe things can get a little better for once.
In other words:
Even when querying giant tables, you rarely end up needing to process very much data. Modern analytical databases can do column projection to read only a subset of fields, and partition pruning to read only a narrow date range. They can often go further still, using segment elimination to exploit locality in the data via clustering or automatic micro-partitioning. Other tricks, like computing directly over compressed data and pushing predicates down to the storage layer, cut IO even further at query time. And less IO means less computation, which means lower cost and lower latency.
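As a concrete sketch of what those features buy you (the directory layout, column names, and choice of engine here are my own hypothetical example, nothing from the original post), here's roughly how a DuckDB query over a hive-partitioned Parquet directory lets the engine skip most of the data before doing any work:

    # Hypothetical layout: events/event_date=YYYY-MM-DD/*.parquet
    import duckdb

    con = duckdb.connect()  # in-memory database is enough for this sketch

    # Only user_id and amount are decoded (column projection), only the
    # January 2024 directories are scanned (partition pruning), and the
    # date filter is pushed into the Parquet reader (predicate pushdown).
    rows = con.sql("""
        SELECT user_id, SUM(amount) AS total
        FROM read_parquet('events/event_date=*/*.parquet', hive_partitioning = true)
        WHERE event_date >= '2024-01-01' AND event_date < '2024-02-01'
        GROUP BY user_id
    """).fetchall()

The point isn't DuckDB specifically; it's that a query engine, given a sane storage layout, can skip the vast majority of the bytes on disk before any "big" compute ever happens.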
Big data is "dead" because data engineers (the programming kind, not the analysts-in-all-but-title) spent a ton of effort building databases with new techniques and new storage patterns that scale far better than before. Someone still has to write and maintain those! And it would be even better if those tools and techniques could escape the half-dozen major data cloud companies and become directly accessible to the average small team.