
Cassandra supports reading/writing MapReduce jobs natively (http://wiki.apache.org/cassandra/HadoopSupport), and Voldemort supports read-only k/v views of Hadoop data (http://project-voldemort.com/blog/2009/06/building-a-1-tb-da...). Both are much more mature options.


I'll be fleshing all this out next week in a blog post.

ElephantDB is comparable to Voldemort's read-only mode. In addition, it supports incremental updates, and an ElephantDB datastore on the DFS can be consumed directly in a MapReduce job (without touching any EDB servers serving random reads).

ElephantDB has an extremely small code base (about 2K LOC), and its highly specialized nature makes it simpler to configure than Voldemort.

I don't consider ElephantDB and Cassandra to be comparable. ElephantDB disassociates the creation of a datastore from serving it, whereas with Cassandra you'd be writing to a live database. You can create, update, and read from ElephantDB datastores using only Hadoop. This disassociation prevents your batch jobs from affecting the performance of reads to the db.
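To make the separation of building and serving concrete, here is a minimal sketch in Python. It is not ElephantDB's actual layout or API: a local directory stands in for the DFS, and the names `publish_version`, `latest_version`, `load` and the `data.json` file are made up for illustration. The batch side builds each new version off to the side and renames it into place; the read side only ever opens the newest complete version.

```python
import json
import os
import tempfile


def publish_version(root, version, records):
    """Batch side: build a new version in a staging dir, then rename it into place.

    Hypothetical layout: one zero-padded directory per version under `root`.
    """
    staging = tempfile.mkdtemp(prefix=".staging-", dir=root)
    with open(os.path.join(staging, "data.json"), "w") as f:
        json.dump(records, f)
    final = os.path.join(root, f"{version:010d}")
    # The version becomes visible to readers only once the rename happens.
    os.rename(staging, final)
    return final


def latest_version(root):
    """Read side: pick the newest published version, ignoring staging dirs."""
    versions = [d for d in os.listdir(root) if d.isdigit()]
    return os.path.join(root, max(versions)) if versions else None


def load(root):
    """Read side: load the newest published version, or {} if none exists yet."""
    path = latest_version(root)
    if path is None:
        return {}
    with open(os.path.join(path, "data.json")) as f:
        return json.load(f)
```

The batch job never writes into a directory that a reader has open, so a slow or failed build only leaves a staging directory behind and does not disturb reads.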


Voldemort developer here. I actually spoke with Nathan a bit (hi Nathan!) and concluded ElephantDB is a different project, solving a somewhat different use case.

The core difference is that Voldemort read-only storage uses a custom storage engine based on memory-mapped files (see the blog link jbellis posted), which allows a very high data-to-memory ratio without having to worry about garbage collection (HotSpot's collector, while excellent, still has issues with heaps larger than 20-30 GB). The data and index files are both built as part of the Map/Reduce job in Hadoop. The files are fetched in parallel from HDFS to the Voldemort serving nodes and swapped "atomically" (atomic relative to a single node, not the cluster), and the process of fetching these files warms up the page cache as they are written to disk. However, that design makes implementing incremental updates tricky.
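For readers who haven't seen this pattern, here is a rough sketch of a read-only store backed by memory-mapped index and data files, in Python. This is not Voldemort's actual file format: the MD5-sorted index, the 24-byte entry layout, and the names `build_store` and `ReadOnlyStore` are assumptions made for the example. The point is that lookups are a binary search over an mmap'd index, so the data lives in the OS page cache rather than on the application heap.

```python
import hashlib
import mmap
import struct

ENTRY = struct.Struct(">16sQ")  # assumed layout: MD5 of key, offset into data file
LEN = struct.Struct(">I")       # assumed layout: 4-byte length prefix per value


def build_store(items, index_path, data_path):
    """Offline build: write values sequentially, then an index sorted by digest."""
    entries = []
    with open(data_path, "wb") as data:
        for key, value in items.items():
            entries.append((hashlib.md5(key).digest(), data.tell()))
            data.write(LEN.pack(len(value)))
            data.write(value)
    with open(index_path, "wb") as index:
        for digest, offset in sorted(entries):
            index.write(ENTRY.pack(digest, offset))


class ReadOnlyStore:
    """Serves lookups from mmap'd files; assumes both files are non-empty."""

    def __init__(self, index_path, data_path):
        self._index_file = open(index_path, "rb")
        self._data_file = open(data_path, "rb")
        self._index = mmap.mmap(self._index_file.fileno(), 0, access=mmap.ACCESS_READ)
        self._data = mmap.mmap(self._data_file.fileno(), 0, access=mmap.ACCESS_READ)
        self._count = len(self._index) // ENTRY.size

    def get(self, key):
        """Binary search the sorted index; return the value bytes or None."""
        target = hashlib.md5(key).digest()
        lo, hi = 0, self._count
        while lo < hi:
            mid = (lo + hi) // 2
            digest, offset = ENTRY.unpack_from(self._index, mid * ENTRY.size)
            if digest == target:
                (length,) = LEN.unpack_from(self._data, offset)
                start = offset + LEN.size
                return self._data[start:start + length]
            if digest < target:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._index.close()
        self._data.close()
        self._index_file.close()
        self._data_file.close()
```

Because the index is sorted at build time, adding keys later means rewriting it, which is one way to see why incremental updates are awkward with this kind of layout.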

Of course, being part of Voldemort brings additional features, e.g. being able to collocate it with read-write stores, Voldemort's pluggable serialization support, failure detection, existing clients (for Java, C++, Python, Ruby), etc.

We use the read-only storage engine heavily for such features as People You May Know (the Voldemort push comes after around 100 chained Map/Reduce jobs), Skills, parts of our recommendation engine (the more real-time features use a read/write store) and more. The average latency is consistently within a few milliseconds, the upper bound being the time it takes a disk to perform a single seek (if the data isn't in the cache). We're currently improving the ability to expand clusters with read-only stores without re-pushing the data: my coworker Roshan was even able to make use of sendfile() when shipping data from one node to another, to make this expansion fast with large volumes of data. We're also working on changes to the index format which will make it easier to implement incremental updates.

In addition to LinkedIn, the read-only storage functionality / Hadoop integration is also used by several interesting startups, namely DeepDyve and Mendeley, as well as at least one big company (not sure if I'm allowed to give out their name).

Bottom line: look at your requirements and see which solves your problem best. Distributed storage systems aren't a zero sum game.


Hey Alex, I was hoping you'd jump in. :)

So why would I use ElephantDB instead of Voldemort read-only storage?


> So why would I use ElephantDB instead of Voldemort read-only storage?

Incremental updates are the feature Voldemort currently lacks. If that's a must for you, use ElephantDB. We are hoping to add that feature at some point, but it requires changes to the index format (which are a recent feature).

That said, a complete push of data into Voldemort is quick: the whole process is network bound (data is read sequentially from HDFS and written sequentially to Voldemort).


Yeah, I realize one of the Clustrix guys got a lukewarm reception when he compared it to MongoDB a month or so ago, but I am genuinely curious about the advantages of this (ElephantDB) versus Voldemort or other alternatives.



