The Hugging Face Datasets Server is now open-source (github.com/huggingface)
219 points by taubek on Oct 5, 2022 | 29 comments


It's worth noting that the reason this works well is that Datasets is backed by Arrow, which allows streaming remote data. More info in the base library docs: https://huggingface.co/docs/datasets/stream
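A minimal sketch of that streaming mode (the `load_dataset(..., streaming=True)` call is the real `datasets` API; the dataset name here is just an example, and the network-touching part is left as an uncalled function):

```python
from itertools import islice

def first_rows(iterable, n=5):
    # Take the first n rows of any (possibly very large) iterable,
    # e.g. a streamed dataset, without consuming the rest.
    return list(islice(iterable, n))

def preview_streamed(name="imdb", split="train", n=3):
    # Not called here: requires network access and `pip install datasets`.
    from datasets import load_dataset

    # streaming=True iterates over remote Arrow data on demand
    # instead of downloading the whole dataset to disk first.
    ds = load_dataset(name, split=split, streaming=True)
    return first_rows(ds, n)
```

This is presumably why serving "first rows" is cheap: the server never has to materialize the full dataset.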


So much progress since the tamagotchi!!!


LMAO welcome to the future!!


Why just the first few rows


what a time to be alive!


Surprised to see it's all Python.


Proves the point about over-optimisation (Rust, Kubernetes, etc.) versus coding in a language that's easy to pick up, understand, and perfect.


The only thing proved is that the authors chose python. It likely had more to do with necessary library/functionality availability than anything.


Exactly my point: you can achieve results with the language at hand, without premature over-optimisation. Hugging Face is not a small service, and if it can do this well with Python, the "Python is too slow for serious work" argument fails.


That's not exactly your point... "they chose Python", and that's it. I don't know how you drew that conclusion; this one is simpler: when you have 10 developers who know Python, you choose Python.


I don’t know what you understood from my comment, but I was explicitly talking about the many people who go with Rust (or any other fad of the day) for things that really don’t need it, just to be optimised from day one, despite their shop being primarily a Python/Ruby/PHP one.


I guess, sure? It doesn't mean there aren't (or are, for that matter) huge perf or efficiency gains left on the table. SD is certainly intense enough that small optimizations add up quite significantly.

Plus I'm pretty sure if I go peek at CUDA, it ain't Python anyway.


I’m aware that some (or many) libraries have C bindings to Python. But "Python is slow" is a major argument people bring up whenever someone builds a web program; I’m merely stating that it doesn’t have to be a con.


Python's already the main contender for AI/ML, so I don't think anyone's arguing that python is too slow for serious work. But it still makes you wonder if their AI/ML would be more performant in Lisp. And of course, you never know if something is "too slow" until your traffic increases exponentially and grinds your service to a halt, and now you have to figure out a solution in your language to address that with your source of income in peril.


> Python's already the main contender for AI/ML, so I don't think anyone's arguing that python is too slow for serious work.

I love me some Python, but for ML, the "serious work" is handled by C/C++ libraries on the GPU or CPU. Python is provably slower than C/C++, but being "too slow" or not depends on the domain and/or budget.


Because the heavy lifting and data processing is 99% done on the GPU, it doesn't matter whether you choose C/C++, Python, or even JavaScript as the binding language to move files around and set up models.

Python happens to be the best for this sort of job as it is fast and easy to write and read.


> Python is provably slower than C/C++

Eh. Is it though? For all possible operations? Are you sure? Which implementation are you using? CPython? PyPy? Are you comparing the exact same algorithms implemented in both languages?

Modern CPUs are pretty good at executing bytecode-based languages, as there's usually a main loop that fits in cache.

A C program can jump all over the place and easily blow caches. It's also easy to access memory in ways that are suboptimal. Because of the difficulty of managing memory, C code tends to free stuff way too quickly. GC languages have an advantage there, as they don't necessarily reclaim the memory immediately, saving malloc/free cycles (at the cost of RAM usage). That's even more of a problem for C++.

Once you have profiled your Python program and found there isn't anything else you can do to make it run faster – including changing your algorithm (something that's easier to do in Python) – then you can start thinking about rewriting that section in C or ASM.
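The "profile first" step above can be sketched with the stdlib profiler. The two functions here are made-up examples, not anything from the project; the point is that the profile immediately names the hotspot:

```python
import cProfile
import io
import pstats

def slow_concat(n):
    # Quadratic string building: the kind of hotspot a profile surfaces.
    s = ""
    for i in range(n):
        s += str(i)
    return s

def fast_concat(n):
    # Same result; ''.join does the copying in a single C-level pass.
    return "".join(str(i) for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_concat(5000)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Often the fix the profile points to is an algorithm or idiom change like `fast_concat`, and the C/ASM rewrite never becomes necessary.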


> Which implementation are you using

The only one in which all documented (and not) Python stuff works, and all Python packages can be used?

There is only one Python for practical purposes.

> Modern CPUs are pretty good at executing bytecode-based languages, as there's usually a main loop that fits in cache.

When everything fits in the cache, bytecode loses by the widest margin. The more you saddle code with uncached memory accesses, the less it matters whether it's bytecode or machine code. This is just Amdahl's Law.

The best performing Python code will be that which spends very few cycles in any byte code loop, and just lets the C routines in Python do all the work.

The interesting possibility is that a bytecode program (and its interpreter) together fit into the instruction and data caches, whereas the equivalent machine code program blows the instruction cache. I'm not sure this scenario is achievable with Python in a way that actually makes it faster.
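The earlier point – spend as few cycles as possible in the bytecode loop and let CPython's C routines do the work – shows up even in a toy stdlib benchmark (absolute timings will vary by machine):

```python
import timeit

data = list(range(100_000))

def bytecode_sum(xs):
    # Every iteration runs through the interpreter's bytecode dispatch loop.
    total = 0
    for x in xs:
        total += x
    return total

# Built-in sum() loops over the list inside CPython's C implementation.
t_bytecode = timeit.timeit(lambda: bytecode_sum(data), number=20)
t_builtin = timeit.timeit(lambda: sum(data), number=20)
print(f"bytecode loop: {t_bytecode:.4f}s, built-in sum: {t_builtin:.4f}s")
```

Same algorithm, same data; the only difference is where the loop executes.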


Yes, there are numerous micro-benchmarks that demonstrate this. Python has improved in speed, but outside specific use cases where a C-based library performs the bulk of the computational lift, it certainly lags behind. This is despite much duplicated effort to create JIT compilers and performance-tuned runtimes.


I wish Swift on TF had continued. It's got the nice syntax of Python with LLVM's speed. Julia tries to do the same, but it's still mostly an academic language with not-so-mature codebases. I like Python, but I think it's reached its peak usability nowadays. We need ever more computation and more complex abstractions that are simply impossible (or too difficult) to do in Python.


Or perhaps it proves that Python and its libraries are fast enough for a service like Huggingface.


The only thing it proves is that it’s good enough for this specific case.


Isn’t their specific case serving data over network? Which applies to many web services?


Yeah, I would be more surprised if it wasn't in Python, considering HuggingFace is an AI/ML company.


Note that you probably need DocArray for this as the core building block


I saw the DocArray post earlier, is it really used in this?


Nah, I'm just kidding. I don't know, probably not :)

I'd never heard of Hugging Face before this post.


Huggingface hosts the stable diffusion stuff, seems hard to miss over the last 2 months ;)


Ahh.. well yeah, SD is on here every few hours. I didn't look into it any further than the fun examples.



