After playing around with this tool a bit in the past, I still can't see what problem it's trying to solve. It comes across as a side project that VCs happened to throw money at.
hey bserial, I'm part of the team working on Dagster.
While there are many things we're working on, there are 3 goals that got me excited about working on this system:
1. Local development: most modern workflow orchestration systems don't have a good local development story. We want to provide a seamless end-to-end dev experience from your laptop to CI to dev to prod for authoring data workflows.
2. Complexity: the Airflow deployments I've worked on or otherwise encountered have hundreds of DAGs and thousands of tasks scheduled on an hourly or daily cadence. We aim to provide abstractions that make it easier to manage that complexity.
3. Testability: most modern data platforms are poorly tested. Many orchestration systems, like Airflow, tend to hardcode deployment concerns into the business logic, e.g. EmrAddStepsOperator. With Dagster, we aim to separate the business logic from environmental concerns to make it easy to swap out an external resource implementation for a mock, a dev version, etc.
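The pattern behind point 3 is plain dependency injection. Here's a minimal sketch in ordinary Python — the names (`EmrClient`, `FakeEmrClient`, `run_step`) are hypothetical and this is not Dagster's actual API, just the idea of keeping business logic against an interface so the environment-specific piece can be swapped out in tests:

```python
# Sketch of separating business logic from deployment concerns.
# All names here are hypothetical, not Dagster's API.

class EmrClient:
    """Production resource: would talk to a real EMR cluster."""
    def add_step(self, step):
        raise NotImplementedError("calls AWS in production")

class FakeEmrClient:
    """Test/dev resource: records steps instead of launching them."""
    def __init__(self):
        self.steps = []
    def add_step(self, step):
        self.steps.append(step)
        return f"fake-step-{len(self.steps)}"

def run_step(emr, step):
    """Business logic depends only on the resource's interface,
    never on which environment it's running in."""
    return emr.add_step(step)

# In a unit test, inject the fake resource instead of the real one:
fake = FakeEmrClient()
step_id = run_step(fake, {"name": "transform"})
```

Because `run_step` never imports anything AWS-specific, the same function runs unchanged against the real client in prod and the fake one on a laptop or in CI.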
Kedro puts emphasis on a seamless transition to prod without jeopardizing work done in the experimentation stage:
- pipeline syntax is absolutely minimal (even supporting lambdas for simple transformations), inspired by the Clojure library Plumbing and its Graph abstraction https://github.com/plumatic/plumbing
- sequential and parallel runners are built-in (don't have to rely on Airflow)
- the IO layer provides wrappers for existing, familiar data sources, but borrows its arguments directly from the Pandas and Spark APIs, so there's no new API to learn
- flexibility in the sense that you can rip out anything, for example replacing the whole Data Catalog with another mechanism for data access like Haxl
- there's a project template which serves as a framework with built-in conventions from 50+ analytics engagements