Lucene is quite fantastic and Elasticsearch makes it a joy to use.
Still, I wonder how much overhead Java is adding in this case. Even minor things like integer decoding can be done very fast with SIMD... but such approaches don't seem amenable to Java. I see that Elasticsearch exposes quite a few GC metrics, so GC must be a problem at times. And one of the Lucene devs wrote a post on how he replaced some parts with C++ and saw massive gains (but with a disclaimer that this was in no way indicative that Java wasn't fast).
I've considered trying to implement something like Lucene in, say, Rust, but then I see just how utterly massive Lucene is. The fuzzy search part alone required implementing code to generate code from a Russian PhD thesis they didn't fully understand.[1] So, no matter how many cycles the JVM is needlessly burning, Lucene just seems too advanced to rewrite without the overhead. (And maybe my intuition is just wrong and the overhead is only a few percent.)
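To make "integer decoding" concrete, here is a minimal sketch of a generic variable-byte scheme with delta-encoded doc IDs. It is picked for illustration only; I'm not claiming it matches Lucene's current postings format, and the function names are made up. The byte-at-a-time loop with a branch per byte is the kind of code that SIMD decoders try to avoid.

```python
def encode_vbyte(n):
    """Encode a non-negative int in 7-bit groups, low bits first.
    A set high bit means another byte follows."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_vbytes(data):
    """Decode a concatenation of variable-byte integers, one byte at a time."""
    values = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit: more bytes for this value
            shift += 7
        else:                    # last byte of this value
            values.append(current)
            current = 0
            shift = 0
    return values


def decode_doc_ids(data):
    """Undo delta encoding: each stored value is the gap to the previous doc ID."""
    doc_ids = []
    total = 0
    for gap in decode_vbytes(data):
        total += gap
        doc_ids.append(total)
    return doc_ids
```

Small gaps fit in one byte, which is why sorted doc IDs are stored as gaps in the first place.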
There is a port of Lucene to C++, CLucene[1]. It's compatible with version 2.3 of Java Lucene. The project stopped a long time ago, but it's very stable and works perfectly.
Another port, compatible with version 3 of Java Lucene, is LucenePlusPlus, but it uses a lot of Boost smart pointers, and the port looks like it was automated. This port is why CLucene development stopped: the maintainers wanted to make the new port faster by avoiding smart pointers wherever possible, but that didn't happen.
Oddly enough, I don't see anyone talking about benchmarks for those projects. I found one offhand comment saying it was 2-3x faster than Java for indexing, but only 10% better for search. No real benchmarks or such. I suppose speed is not the only reason to want a non-JVM version, but it seems like a pretty major one and something that'd warrant headline treatment in the readme...
Java overhead is not a huge issue unless you are embedding Lucene in low-spec devices. A consumer search system usually relies heavily on caching (just like any database), so even a 30-50% latency hit on cold queries is not that big a deal if more than 90% of your queries are served from cache.
GC is a big problem when you don't know the expected query distribution, which is the case for Elasticsearch's analytics. There is a lot more to a search engine than packing, decoding and merging posting lists. I've never seen anything that compares with what Lucene's text analysis and scoring APIs support.
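For a sense of scale, the "merging posting lists" part really is the small bit. A rough sketch of intersecting two sorted doc ID lists for an AND query (the function name is mine, and real engines add skip lists, compression and scoring on top):

```python
def intersect(a, b):
    """Return doc IDs present in both sorted posting lists (two-pointer merge)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1   # advance whichever list is behind
        else:
            j += 1
    return out
```

Everything around this loop (analysis, scoring, query parsing) is where most of the work goes.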
I've also implemented a lot of algorithms from papers, and their story has me nodding my head in agreement. Some algorithm papers are just downright impenetrable.
I tried implementing the same paper's algorithm in the past and somewhat succeeded - but gave up in the end.
Precomputing the automatons was slow as hell, I came to the same conclusion as the authors (N > 2 isn't really feasible, though it was something I was interested in), and my plumbing sucked.
I'm really not experienced at reading papers and this was the only one I ever tried, so I cannot compare it to others. It certainly was quite hard for me to follow and took some months of nightly dabbling before I reached the point above.
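For anyone wondering what the automaton is for: it answers "is this term within N edits of the query?" without comparing against every term one by one. Below is a sketch of the brute-force check it replaces. This is the plain dynamic-programming edit distance, not the algorithm from the paper, and the helper names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert into a
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def fuzzy_matches(term, dictionary, max_edits):
    """Scan every dictionary term; fine for tiny lists, hopeless for a big index."""
    return [w for w in dictionary if edit_distance(term, w) <= max_edits]
```

The scan is linear in the dictionary size, which is the cost the precomputed automaton is meant to avoid.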
Is there a self-contained alternative to ElasticSearch specifically? If there was one written in Go or otherwise statically linkable that would be great from a deployment standpoint. I could deal with somewhat worse performance in exchange for that.
Lucene and Hadoop gave the Java ecosystem a big push; it's like a lock-in. Native C++ libraries and other free-text search implementations have smaller communities and are usually less well known. With Go, C++11 and Rust the future looks bright, but it will take some time to catch up.
I have. The problem is picking one that is mature enough and will be supported for years as you can expect ElasticSearch to be, which is what I meant by "an alternative".
I agree with you about the future looking bright but I meant something you could use right now.
It's hard to say. For Go there is e.g. bleve FTS, and there are ports of Java Lucene to Go (e.g. https://github.com/balzaczyy/golucene). Such ports are either semi-automatic or fully automatic. It's hard for Lucene ports to keep up, as Lucene is moving fast, and most ports have stalled.
One could also use a service-oriented architecture and use e.g. the ElasticSearch REST API or the C++-based Sphinx Search; both need little configuration and no custom code.
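As a rough idea of what the service route looks like from the client side, here is a sketch that builds (but does not send) a fuzzy match query for the `_search` endpoint. The host, the index name `articles` and the field `title` are assumptions for the example, not anything from a real setup.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build a POST request carrying a match query with fuzziness enabled."""
    body = {"query": {"match": {field: {"query": text, "fuzziness": "AUTO"}}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local node, index and field.
req = build_search_request("http://localhost:9200", "articles", "title", "lucene")
# To actually run the query: urllib.request.urlopen(req)
```

The application only speaks HTTP and JSON, so it can be written in Go or anything else while the JVM stays behind the service boundary.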
1: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is...