In what way data.table trumps dplyr? Genuinely interested in knowing.
While data.table is faster than dplyr, data manipulations with data.table are difficult to read/understand/maintain.
dplyr also grew into a full-fledged suite of libraries for data-related projects (the tidyverse). These libraries are _very_ well thought out and enable productivity with a minimal learning curve [anecdotal]
the easiest way to think about it is data.table is for people who are doing a lot of exploratory data analysis every day. If you're doing the same thing over and over, it makes sense to create a DSL specific to that task and optimize the hell out of it. that's basically data.table.
dplyr is for everyone else, and it's great and important that it exists, because most people don't want to (and shouldn't need to) learn a DSL to do some basic filtering/sorting/grouping of 100mb of data.
I disagree. Doing data manipulation one action at a time in a piped sequence is easiest to reason about because the state right before you apply a new operation is always clear.
data.table, on the other hand, is a fancy clever gadget with many knobs and buttons you have to turn and press just so to get the desired result. It's only simple if all you do is filter, group by, and summarize.
To illustrate, let's look at what you have to do in data.table in order to achieve the equivalent of a grouped filter in dplyr (from the dtplyr translation vignette):
dplyr:
df %>%
group_by(a) %>%
filter(b < mean(b))
data.table:
DT[DT[, .I[b < mean(b)],
by = .(a)]$V1]
Compared to the simple, declarative feel of the dplyr version, there's a lot of weird stuff going on in the data.table one. You have to put DT inside itself? What is .I? Where did V1 come from? Janky stuff.
(And yes I know precisely what is going on in the data.table version, I just think it's ugly and illustrates my point about composability and legibility extremely well.)
The reason data.table has all these independent knobs is because it wants you to cram your entire query into a single command, so it can optimize the query more easily and squeeze every drop of performance. NOT because it's more understandable, because it isn't.
The best of both worlds -- an optimizable query and one-action-at-a-time syntax -- can be achieved with a lazy system like Apache Spark or dtplyr.
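To make that concrete: with dtplyr you write the ordinary one-step-at-a-time dplyr pipeline against a lazy wrapper, and it generates the data.table query for you. A minimal sketch (assuming the data.table, dtplyr, and dplyr packages are installed; the toy data frame is made up for illustration):

```r
library(data.table)
library(dtplyr)
library(dplyr)

DT <- data.table(a = c(1, 1, 2, 2), b = c(1, 3, 2, 8))

# Build the query lazily -- nothing is computed yet.
query <- lazy_dt(DT) %>%
  group_by(a) %>%
  filter(b < mean(b))

# Inspect the data.table code dtplyr generated
# (roughly the DT[DT[, .I[...], by = ...]$V1] form above).
show_query(query)

# Only collecting actually runs the optimized query.
as_tibble(query)
```

So you keep the legible piped syntax, and the single crammed-together data.table command becomes an implementation detail the translator worries about.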