Python + Jupyter are OK, but pandas actually reads everything into memory at once, doesn’t it? 100 MB is no problem, but bigger files could result in heavy swapping.
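To be fair, pandas doesn't have to read everything at once: `read_csv` accepts a `chunksize` argument that yields the file piece by piece. A minimal sketch (the tiny in-memory CSV just stands in for a big file on disk):

```python
import io
import pandas as pd

# Stand-in for a large CSV; in practice you'd pass a file path.
csv_data = io.StringIO("user,amount\na,10\nb,20\na,30\nc,40\n")

# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the whole file into memory at once.
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["amount"].sum()

print(total)  # 100
```

Anything you can express as a running aggregate (sums, counts, group totals) works this way with a roughly constant memory footprint.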
I definitely agree that with this amount of data, you should move to a more programmatic way to handle it... pandas or R.
Keep in mind that pandas (and probably also R?) internally uses optimized structures based on numpy. So a 10 GB CSV, depending on its content, might end up with a much smaller memory footprint inside pandas.
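You can check this yourself with `memory_usage(deep=True)`. A quick sketch comparing the same values stored as Python-object strings versus a packed numpy dtype:

```python
import pandas as pd

# Same values, stored as object-dtype strings vs. a numeric numpy dtype.
as_text = pd.Series(["1", "2", "3"] * 1000)   # object dtype: one Python str per cell
as_numbers = pd.to_numeric(as_text)           # int64 dtype: 8 bytes per cell

text_bytes = as_text.memory_usage(deep=True)
number_bytes = as_numbers.memory_usage(deep=True)

print(text_bytes, number_bytes)  # the numeric column is several times smaller
```

The flip side is that a CSV full of long, unique strings can end up *larger* in memory than on disk, so it cuts both ways depending on the content.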
If you have a 10 GB CSV, I think you will be happy working with pandas locally, even on a laptop. If you go to CSV files with tens of GB, a cloud VM with a corresponding amount of memory might serve you well. If you need to handle big-data-scale CSVs (hundreds of GB or even >TB), a scalable parallel solution like Spark will be your thing.
Before you scale up, however, maybe your task allows you to pre-filter the data and reduce its volume by orders of magnitude... often, thinking the problem through reduces the amount of metal you need to throw at the problem...
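Concretely, `read_csv` lets you combine `usecols` (skip columns you don't need) with chunked filtering, so the full file never sits in memory. A sketch with made-up column names (`region`, `value` etc. are just illustrative):

```python
import io
import pandas as pd

# Stand-in for a large CSV with columns we mostly don't need.
csv_data = io.StringIO(
    "timestamp,region,value,notes\n"
    "2024-01-01,eu,1,x\n"
    "2024-01-02,us,2,y\n"
    "2024-01-03,eu,3,z\n"
)

# Load only the columns we need, keep only the rows we care about,
# chunk by chunk.
kept = []
for chunk in pd.read_csv(csv_data, usecols=["region", "value"], chunksize=2):
    kept.append(chunk[chunk["region"] == "eu"])

filtered = pd.concat(kept, ignore_index=True)
print(len(filtered))  # 2
```

If the filtered result fits comfortably in RAM, you've just replaced a cluster with a for loop.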
Starting to see a lot of these frameworks pop up to simplify deployment of machine learning models. I’m really hoping one or two start to stand out... but this one doesn’t feel like it.
As a data scientist with a BS and an MBA, I can attest to having been disqualified from jobs specifically because of my lack of a PhD. What's troubling is that employers think they need PhDs. It often doesn't matter that I have 10 years of experience applying data science in industry; without that PhD, companies think I'm unqualified.
From my perspective, the best data scientists strike a balance between technical and business knowledge. And it's the business knowledge that PhDs coming straight from academia often lack.