In what way data.table trumps dplyr? Genuinely interested in knowing.
While data.table is faster than dplyr, data manipulations with data.table are difficult to read/understand/maintain.
dplyr also grew into a full-fledged suite of libraries for data-related projects (the tidyverse). These libraries are _very_ well thought out and enable productivity with a minimal learning curve [anecdotal]
the easiest way to think about it is data.table is for people who are doing a lot of exploratory data analysis every day. If you're doing the same thing over and over, it makes sense to create a DSL specific to that task and optimize the hell out of it. that's basically data.table.
dplyr is for everyone else, and it's great and important that it exists, because most people don't want to (and shouldn't need to) learn a DSL to do some basic filtering/sorting/grouping of 100mb of data.
I disagree. Doing data manipulation one action at a time in a piped sequence is easiest to reason about because the state right before you apply a new operation is always clear.
data.table, on the other hand, is a fancy clever gadget with many knobs and buttons you have to turn and press just so to get the desired result. It's only simple if all you do is filter, group by, and summarize.
To illustrate, let's look at what you have to do in data.table in order to achieve the equivalent of a grouped filter in dplyr (from the dtplyr translation vignette):
dplyr:
df %>%
group_by(a) %>%
filter(b < mean(b))
data.table:
DT[DT[, .I[b < mean(b)],
by = .(a)]$V1]
Compared to the simple, declarative feel of the dplyr version, there's a lot of weird stuff going on in the data.table one. You have to put DT inside itself? What is .I? Where did V1 come from? Janky stuff.
(And yes I know precisely what is going on in the data.table version, I just think it's ugly and illustrates my point about composability and legibility extremely well.)
The reason data.table has all these independent knobs is because it wants you to cram your entire query into a single command, so it can optimize the query more easily and squeeze every drop of performance. NOT because it's more understandable, because it isn't.
The best of both worlds -- an optimizable query and one-action-at-a-time syntax -- can be achieved with a lazy system like Apache Spark or dtplyr.
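To make that concrete: with dtplyr you write the ordinary one-step-at-a-time dplyr pipeline against a lazy wrapper, and it generates the data.table query for you. A minimal sketch (assuming the data.table, dtplyr, and dplyr packages are installed; the toy data frame is made up for illustration):

```r
library(data.table)
library(dtplyr)
library(dplyr)

DT <- data.table(a = c(1, 1, 2, 2), b = c(1, 3, 2, 8))

# Build the query lazily -- nothing is computed yet.
query <- lazy_dt(DT) %>%
  group_by(a) %>%
  filter(b < mean(b))

# Inspect the data.table code dtplyr generated
# (roughly the DT[DT[, .I[...], by = ...]$V1] form above).
show_query(query)

# Only collecting actually runs the optimized query.
as_tibble(query)
```

So you keep the legible piped syntax, and the single crammed-together data.table command becomes an implementation detail the translator worries about.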