The project initially used DocumentCloud, but I guess that's built for scenarios which are a little less horrible than single-page JPEGs coming in via FTP.

Tesseract has been working beautifully; we've sent the first 1500 documents through it. Unfortunately, we don't know the language of each document (we're trying to set up some crowdsourcing for that), so each one gets run through three times (eng, rus, ukr). Finally, all three versions are indexed in Elasticsearch. If anyone has a neater way of doing this (e.g. a good, Python-based language detector), please shout!
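
In case it helps anyone thinking about the language question, here is a rough, standard-library-only sketch of one possible heuristic, not what the project actually runs. It guesses between eng, rus and ukr by counting Latin versus Cyrillic letters, then separates Ukrainian from Russian using letters that exist in only one of the two alphabets. The function name `guess_language` and the `min_letters` threshold are made up for illustration, and the guess is only as good as the OCR text it is fed.

```python
from collections import Counter

# Letters used in Ukrainian but not in Russian: і, ї, є, ґ
UKR_ONLY = "\u0456\u0457\u0454\u0491"
# Letters used in Russian but not in Ukrainian: ы, э, ъ, ё
RUS_ONLY = "\u044b\u044d\u044a\u0451"


def guess_language(text, min_letters=20):
    """Return 'eng', 'rus', 'ukr', or None if there is too little text.

    Hypothetical helper: min_letters is an arbitrary cut-off, not a tuned value.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if len(letters) < min_letters:
        return None

    counts = Counter(letters)
    latin = sum(n for ch, n in counts.items() if "a" <= ch <= "z")
    cyrillic = sum(n for ch, n in counts.items() if "\u0400" <= ch <= "\u04ff")

    # Mostly Latin script: assume English, the only Latin-script candidate here.
    if latin >= cyrillic:
        return "eng"

    # Mostly Cyrillic: compare counts of the alphabet-specific letters.
    ukr_hits = sum(counts[ch] for ch in UKR_ONLY)
    rus_hits = sum(counts[ch] for ch in RUS_ONLY)
    return "ukr" if ukr_hits > rus_hits else "rus"
```

Something like this could be run over the OCR output to decide which version to keep, with a `None` result (or a page where the counts are close) falling back to indexing all three versions as now.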