I've never used Menhir so I can't compare how similar they are in practice, but I've enjoyed the times I played with LALRPOP much more than the many times I've battled various yacc derivatives.
There have been a lot of tuning systems that take a workload and try to optimize things like index structures. They are tools that get your database "tuned" rather than "self-tuning".
However, I think this (very recent) research paper is a much better peek at how self-tuning might work:
> The way in which we are safer is memory safety, nothing more.
I know you want to not overstate Rust's claims given recent articles, but I think you're actually underselling a little here. For example, a Rust `enum` makes it much easier for the compiler to enforce code correctness. It's hard to go back to similar code in C or Go once you've gotten used to `match`.
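As a quick (hypothetical) sketch of what I mean: once states live in an enum, every `match` over them has to be exhaustive, so forgetting a case is a compile error rather than a runtime surprise.

```rust
// Hypothetical example: the compiler forces every `match` on this
// enum to handle all three variants (or add an explicit `_` arm).
enum Connection {
    Idle,
    Active { bytes_sent: u64 },
    Closed,
}

fn describe(c: &Connection) -> String {
    match c {
        Connection::Idle => "idle".to_string(),
        Connection::Active { bytes_sent } => format!("active, {} bytes sent", bytes_sent),
        Connection::Closed => "closed".to_string(),
        // Deleting any arm above is a compile error, not a runtime surprise.
    }
}

fn main() {
    println!("{}", describe(&Connection::Active { bytes_sent: 42 }));
}
```

Add a fourth variant later and the compiler points at every `match` that needs updating, which is the refactoring win that's hard to give up.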
Yes, I guess in my mind, "correctness" and "memory safety" are two different things. I like that Rust can help you write more correct software, but it's not nearly as strong of a guarantee as our memory safety guarantees are.
> The only thing I can think of that Rust has that Go doesn't is, like, SIMD, and I'm sure your comment wasn't just referring to SIMD.
Just to clarify, are you referring to Rust using SIMD by way of LLVM, or by way of being able to use SIMD primitives / intrinsics directly in Rust code?
The former works much better than I had anticipated. I've been surprised by the extent that my iterator code ends up vectorized without me doing much work.
The latter does not give me warm Rust feelings today. There's a SIMD crate, but it doesn't look maintained and only works with the nightly compiler releases. I didn't think there was any stable way to do inline assembly, so I think linking C is my best bet here?
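To illustrate the former point, this is the kind of plain iterator chain I mean. (A sketch: whether LLVM actually vectorizes it depends on the target and optimization level, but loops like this often come out as SIMD without any explicit SIMD code.)

```rust
// A plain iterator chain; with optimizations on, LLVM will often
// auto-vectorize this into SIMD loads/multiplies/adds, with no
// explicit SIMD code on my part.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = vec![1.0f32; 1024];
    let b = vec![2.0f32; 1024];
    println!("{}", dot(&a, &b)); // 1024 * (1.0 * 2.0) = 2048
}
```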
> The latter does not give me warm Rust feelings today. There's a SIMD crate, but it doesn't look maintained and only works with the nightly compiler releases. I didn't think there was any stable way to do inline assembly, so I think linking C is my best bet here?
Yeah, the person formerly working on simd is no longer able to contribute to random open source projects. It's still something the Rust team wants to make stable, just not right now. SIMD is a bit complicated because you need to address target support in a straightforward way.
I agree with some of your points, but I think this is phrased a bit harshly. FWIW, I write both Go and Rust on a regular basis.
Here's where I would agree with you:
- Go makes it harder for someone to write overly abstract code (a common affliction!).
- Being able to occasionally do type assertions in an ergonomic way is surprisingly nice.
- I wish Rust had something in the stdlib like net/http.
- I like that go fmt is so unconfigurable and canonical.
Here's where I would disagree:
- I find ADTs (Rust's enums) super helpful for productivity.
- Removing nil pointer derefs is wonderful, particularly for refactoring.
- I spend too much time in Go rewriting bits of code that I would just use generics for in Rust or C++. Rust's iterators are wonderful and I end up using them over and over again.
- Maybe it's my C background, but I like being able to occasionally use macros. Even for tests it makes things much more readable.
- The borrow checker ends up moving many concurrency issues from runtime debugging to compile time debugging.
- I think cargo is more pleasant to use to manage code than using go + godeps/glide/etc.
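On the generics point, a small (made-up) example of what I mean: a helper written once over any ordered type, where in Go I'd end up rewriting it per concrete type or reaching for interface{} and type assertions.

```rust
// Written once, usable for any Ord + Copy type. In Go (which has no
// generics) this kind of helper tends to get rewritten per concrete
// type, or done with interface{} plus type assertions.
fn largest<T: Ord + Copy>(items: &[T]) -> Option<T> {
    items.iter().copied().max()
}

fn main() {
    println!("{:?}", largest(&[3, 9, 2]));       // Some(9)
    println!("{:?}", largest(&["a", "c", "b"])); // Some("c")
}
```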
I forgot to add - I partially agree with you about the productivity of GC. For non-performance sensitive code, GC simplifies a lot. Otherwise, I find Go much nicer to reason about than Java since it has real arrays (value types!).
But I find it a mixed bag of whether the naive Go version of something that shares memory by GC is simpler than the naive Rust version of something that shares memory. Sometimes ref-counting (Rc<T> in Rust) is fine, although that's more expensive than GC. Sometimes Rust's ownership model nudges you to make the code much simpler and makes it clear that something only has a single writer. Sometimes you wish you were in C and just did it yourself...
> Sometimes ref-counting (Rc<T> in Rust) is fine, although that's more expensive than GC.
To nitpick: the jury's still out on that one, because `Rc` in Rust isn't thread-safe reference counting (`Arc` is the thread-safe variant). I believe that non-thread-safe reference counting is quite competitive with global, cross-thread tracing GC.
When people (rightly) talk about how much slower reference counting is than tracing GC, they're almost always talking about either thread-safe tracing GC vs. thread-safe reference counting or single-threaded tracing GC vs. single-threaded reference counting. When you compare Rc to a typical multithreaded GC'd language, you're comparing multithreaded tracing GC to single-threaded reference counting, which is a much more interesting comparison.
> When people (rightly) talk about how much slower reference counting is than tracing GC, they're almost always talking about either thread-safe tracing GC vs. thread-safe reference counting or single-threaded tracing GC vs. single-threaded reference counting.
Reference counting has always been slower, even in single-threaded cases. This should be obvious: pure reference counting requires modifying counts whenever locals are assigned, which happens orders of magnitude more often than main memory updates, and now each local assignment requires touching main memory too.
As soon as you defer these updates somehow to recover that cost, you've introduced partial tracing. You can find papers from way back acknowledging this overhead, and suggesting optimizations [1].
> This should be obvious because pure reference counting requires modifying counts whenever locals are assigned, which happen orders of magnitude more often than main memory updates, and now each local assignment requires touching main memory too.
I guess so, but it's worth noting that this isn't the case in Rust, because Rc benefits from move semantics. Most assignments don't touch the reference counts at all.
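Concretely (a small sketch): moving an `Rc` transfers ownership without touching the count; only an explicit clone bumps it, which `Rc::strong_count` makes easy to observe.

```rust
use std::rc::Rc;

fn main() {
    let a = Rc::new(vec![1, 2, 3]);
    assert_eq!(Rc::strong_count(&a), 1);

    // A move: ownership transfers, the refcount is untouched.
    let b = a;
    assert_eq!(Rc::strong_count(&b), 1);

    // Only an explicit clone bumps the count.
    let c = Rc::clone(&b);
    assert_eq!(Rc::strong_count(&b), 2);

    drop(c);
    assert_eq!(Rc::strong_count(&b), 1);
}
```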
> Sometimes ref-counting (Rc<T> in Rust) is fine, although that's more expensive than GC.
This is a very interesting and hard comparison to make.
The marginal cost of GCing one more thing is much less than the marginal cost of RCing one more thing. But you need to pay the cost of the GC runtime to get there, and Rust doesn't, so it's a rather hard comparison to make.
However, the cost of RCing something itself is pretty small (and you don't RC very often in Rust anyway), so this rarely matters :)
I wish there was some sort of social agreement to name projects new words, or combined words.
We're overloading the English language so much. In 100 years it's going to be impossible to search for anything, as every word and phrase will have a million products and projects attached.
Didn't the original MIT hackers take pride in coming up with clever and unique names? What happened to that? I'd even settle for names like "elinks".
Search engines will just need to take context into account, I remember a lot of confusion between searches for Cisco IOS versus Apple IOS back when IOS first took on the name, but with a few keywords to provide context, now it's pretty easy to get relevant results.
This was my very first reaction as well. The Unicorn paper was published in a high-profile conference, so as an N=1 sample I'd say it's famous in the systems community.
Good point, but do they seriously do that? It's stupid.
If you overwrite data in place that's being concurrently read, you get garbled data. So you must guarantee nobody is reading it. One way is to lock the data for both readers and writers using a mutex of some form. Another way is Linux-RCU style[1]. Both make readers always pay a price for what should be an uncommon case.
It makes more sense to me to put your updates in a new place, and if need be copy them over the old data once nobody can see the old data anymore.
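A minimal std-only sketch of that shape, assuming a `Shared<T>` wrapper I'm making up here: the writer builds a complete new value elsewhere and swaps in a pointer; readers pin a snapshot and keep reading it without any lock held. (Real RCU-style schemes avoid even the brief reader lock this version takes to clone the `Arc`; crates exist that do this lock-free.)

```rust
use std::sync::{Arc, Mutex};

// Readers briefly lock to clone the Arc, then read that snapshot with
// no lock held. The writer builds a complete new value in a new place
// and swaps the pointer in; the old data is freed only once the last
// reader's Arc drops, so nothing is ever overwritten in place.
struct Shared<T> {
    current: Mutex<Arc<T>>,
}

impl<T> Shared<T> {
    fn new(value: T) -> Self {
        Shared { current: Mutex::new(Arc::new(value)) }
    }

    fn read(&self) -> Arc<T> {
        self.current.lock().unwrap().clone()
    }

    fn publish(&self, value: T) {
        *self.current.lock().unwrap() = Arc::new(value);
    }
}

fn main() {
    let shared = Shared::new(vec![1, 2, 3]);
    let snapshot = shared.read();           // reader pins the old version
    shared.publish(vec![4, 5, 6]);          // writer installs new data elsewhere
    assert_eq!(*snapshot, vec![1, 2, 3]);   // reader still sees a consistent view
    assert_eq!(*shared.read(), vec![4, 5, 6]);
}
```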
If you are using a more minimal hypervisor (see my other comment on the parent), then there do seem to be some measurable gains. I've seen a few papers in this style:
> We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing improvements of 2-5x in latency and 9x in throughput for a popular persistent NoSQL store relative to a well-tuned Linux implementation.
That said, a simple application like memcached might currently be latency-bound by the kernel's network stack, but a more complex application that reads from disk (even SSD) won't be.