Lucene is quite fantastic and Elasticsearch makes it a joy to use.

Still, I wonder how much overhead Java is adding in this case. Even minor things like integer decoding can be done very fast with SIMD... but such approaches don't seem amenable to Java. I see that Elasticsearch exposes quite a few GC metrics, so GC must be a problem at times. And one of the Lucene devs wrote a post on how he replaced some parts with C++ and saw massive gains (but with a disclaimer that this was in no way indicative that Java wasn't fast).

I've considered trying to implement something like Lucene in, say, Rust, but then I see just how utterly massive Lucene is. The fuzzy search part alone required implementing code to generate code from a Russian PhD thesis they didn't fully understand.[1] So, no matter how many cycles the JVM is needlessly burning, Lucene just seems too advanced to rewrite without that overhead. (And maybe my intuition is just wrong and the overhead is only a few percent.)

1: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is...
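To make concrete what a fuzzy query has to decide for each term, here is a minimal sketch of a bounded edit-distance check. It is the naive dynamic-programming approach that tests every dictionary term, not the automaton-based technique discussed in this thread, and the names `within_edits` and `fuzzy_matches` are hypothetical.

```python
def within_edits(a, b, max_edits):
    """Return True if the Levenshtein distance between a and b is <= max_edits."""
    # A length gap larger than the edit budget can never be bridged.
    if abs(len(a) - len(b)) > max_edits:
        return False
    # prev[j] = distance between the processed prefix of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        # The row minimum never decreases, so we can give up early.
        if min(curr) > max_edits:
            return False
        prev = curr
    return prev[-1] <= max_edits


def fuzzy_matches(query, terms, max_edits=2):
    """Naive fuzzy lookup: run the check against every term in the dictionary."""
    return [t for t in terms if within_edits(query, t, max_edits)]
```

The cost of scanning the whole term dictionary like this is what makes a precomputed automaton attractive in the first place.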



There is a port of Lucene to C++, CLucene[1], which is compatible with version 2.3 of Java Lucene. The project stopped a long time ago, but it's very stable and works perfectly. Another port, compatible with version 3 of Java Lucene, is LucenePlusPlus[2], but it uses a lot of Boost smart pointers, and the port seems like it was automated. This port is why CLucene development stopped: the maintainers wanted to make the new port faster by avoiding smart pointers wherever possible, but that didn't happen.

1: http://sourceforge.net/projects/clucene

2: https://github.com/luceneplusplus/LucenePlusPlus


Oddly enough, I don't see anyone talking about benchmarks for those projects. I found one offhand comment saying it was 2-3x faster than Java for indexing, but only 10% better for search. No real benchmarks or such. I suppose that's not the only reason to want a non-JVM version, but it seems like a pretty major reason and something that'd warrant headline treatment in the readme...


Java overhead is not a huge issue unless you are embedding Lucene into some low-spec device. A consumer search system usually relies heavily on caching (just like any database), so even a 30-50% latency hit on cold queries is not that big a deal if > 90% of your queries are served from cache.

GC is a big problem when you don't know the expected query distribution, which is the case for Elasticsearch's analytics. There is a lot more to a search engine than packing, decoding and merging posting lists. I've never seen anything that compares with what Lucene's text analysis and scoring APIs support.
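For readers unfamiliar with the "packing, decoding and merging posting lists" part, a rough sketch follows. It uses a generic delta-plus-varint scheme and a two-pointer merge; it is an illustration under those assumptions, not Lucene's actual on-disk format, and the function names are hypothetical.

```python
def encode_postings(doc_ids):
    """Delta-encode sorted doc ids, then pack each gap as a variable-length integer."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        # 7 payload bits per byte; a set high bit means "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Reverse encode_postings: unpack the varints and undo the delta encoding."""
    doc_ids = []
    value = shift = prev = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            doc_ids.append(prev)
            value = shift = 0
    return doc_ids


def intersect(a, b):
    """Merge two sorted posting lists, keeping doc ids present in both (an AND query)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```

The byte-at-a-time loop in `decode_postings` is the kind of work that SIMD decoders speed up; everything around it (analysis, scoring) is where the rest of a search engine lives.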


That fuzzy search story is more worrying than anything else, really. What they did seems just crazy to me.

First, the paper really doesn't seem so difficult. Second, they don't even think about reaching out to the author/s?

I suppose I shouldn't talk until I've tried their task. But I've implemented a lot of algorithms from papers, and their story had me shaking my head.


I've also implemented a lot of algorithms from papers, and their story has me nodding my head in agreement. Some algorithms papers are just downright impenetrable.


I tried implementing the same paper's algorithm in the past and somewhat succeeded, but gave up in the end. Precomputing the automata was slow as hell, I came to a similar conclusion as the authors (N > 2 isn't really feasible, though that was something I was interested in), and my plumbing sucked.

I'm really not experienced at reading papers and this was the only one I ever tried, so I cannot compare it to others. It certainly was quite hard for me to follow and took some months of nightly dabbling before I reached the point above.


Is there a self-contained alternative to ElasticSearch specifically? If there was one written in Go or otherwise statically linkable that would be great from a deployment standpoint. I could deal with somewhat worse performance in exchange for that.


Search for "golang full text search database".

Lucene & Hadoop meant a big push for the Java ecosystem; it's like a lock-in. Native C++ libraries and other free-text search implementations have a smaller community and are usually less well known. With Go, C++11 and Rust the future looks bright, but it will take some time to catch up.


Agree, it's early days for non-Java based alternatives.

One of my colleagues, Marty Schoch, has been working on a full text search engine in golang, called bleve.[1]

1: http://www.blevesearch.com/


>Search "golang full text search database".

I have. The problem is picking one that is mature enough and will be supported for years as you can expect ElasticSearch to be, which is what I meant by "an alternative".

I agree with you about the future looking bright but I meant something you could use right now.


It's hard to say. For Go there is e.g. bleve FTS, and there are ports of Java Lucene to Go (e.g. https://github.com/balzaczyy/golucene). Such ports are either semi-automatic or fully automatic. It's hard for Lucene ports to keep up, as Lucene is moving fast, and most ports have stalled.

One could also use a service-oriented architecture and use e.g. the ElasticSearch REST API or the C++-based Sphinx Search; both need little configuration and no custom code.
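As a rough idea of what "no custom code" looks like from the client side, here is a sketch that builds a full-text search request for ElasticSearch's `_search` endpoint with a `match` query. The host `localhost:9200`, the index name `articles`, and the field name `body` are assumptions for illustration, and `build_search_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build (but do not send) a full-text search request for ElasticSearch."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed host, index, and field names; adjust to your own deployment.
req = build_search_request("http://localhost:9200", "articles", "body", "fuzzy search")

# To actually run it against a live node:
#   with urllib.request.urlopen(req) as resp:
#       hits = json.load(resp)["hits"]["hits"]
```

Since the search engine sits behind HTTP, the application itself can stay a single static binary in whatever language you like.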



