Hacker News

> Basically, I have no idea if the paper is reproducible

Worry more about whether the result is generalizable. How sensitive is that incremental improvement to hyperparameter tuning, to how the data is pre-processed, or to the specific problem domain? These days the academic literature rarely spends much time dwelling on these questions, which is, at least in my opinion, sufficient reason for industry practitioners to shy away from the cutting edge.
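As a concrete illustration of that sensitivity point, here is a minimal numpy sketch (the data, the nearest-centroid model, and the two preprocessing choices are all my own illustrative assumptions, not anything from a specific paper): the same model is evaluated twice, once on raw features and once standardized, and the gap between the two numbers is a crude measure of how much an unreported preprocessing choice can move a result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-class data where the second feature has a much larger scale,
# so the preprocessing choice matters. Purely illustrative.
n = 200
X0 = rng.normal([0.0, 0.0], [1.0, 100.0], size=(n, 2))
X1 = rng.normal([1.0, 50.0], [1.0, 100.0], size=(n, 2))
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

def nearest_centroid_accuracy(X, y):
    # Classify each point by its nearer class centroid; return training accuracy.
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    pred = (np.linalg.norm(X - c1, axis=1) < np.linalg.norm(X - c0, axis=1)).astype(int)
    return float((pred == y).mean())

# Same model, two preprocessing choices.
raw = nearest_centroid_accuracy(X, y)
standardized = nearest_centroid_accuracy((X - X.mean(0)) / X.std(0), y)
print(f"raw={raw:.2f} standardized={standardized:.2f}")
```

A real sensitivity check would sweep hyperparameters and datasets the same way; the point is just that the "headline number" can depend heavily on choices the paper may not report.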

I am less familiar with how this works out for classifiers, but I can say that this is the elephant in the room with topic modeling. Hyperparameter tuning and data cleaning are much more important than the choice of algorithm. Perhaps even more importantly (at least if you're trying to understand different algorithms' relative merits), the method you choose for evaluating quality is critical: one setup will be clearly better if you are focused on the data's syntagmatic qualities, but will perform terribly if you instead focus on the paradigmatic. And vice versa. In short, the question "What algorithm is best?" is malformed and unanswerable.

There's an interesting paper from a while ago where it turned out that the vector space model that performed best when evaluated against the TOEFL synonym test was good old latent semantic analysis. I find that result noteworthy because it's one of the few papers that took a real live test that was designed for evaluating the skills of real live humans and used it to evaluate a machine learning model. At the same time, that in no way implies that LSA is the best fit for your sentiment analysis pipeline.
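For anyone unfamiliar with the setup, here's a minimal numpy sketch of LSA answering a TOEFL-style synonym question. The toy corpus, the choice of k, and the candidate words are all my own illustrative assumptions; the real experiments used a large background corpus and far more dimensions.

```python
import numpy as np

# Toy corpus: "car" and "automobile" never co-occur, but share context words,
# which is exactly the second-order similarity LSA is meant to capture.
docs = [
    "car drove road fast",
    "automobile drove road fast",
    "cat sat mat soft",
    "kitten sat mat soft",
]

# Build a term-document count matrix.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[index[w], j] += 1

# LSA: truncated SVD of the term-document matrix; rows of U * S are
# low-dimensional term vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # number of latent dimensions (itself a hyperparameter)
term_vecs = U[:, :k] * S[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def pick_synonym(probe, candidates):
    # TOEFL-style question: choose the candidate closest to the probe word.
    p = term_vecs[index[probe]]
    return max(candidates, key=lambda c: cosine(p, term_vecs[index[c]]))

print(pick_synonym("car", ["automobile", "cat", "mat"]))  # -> automobile
```

The human test becomes a machine benchmark simply by scoring how many such multiple-choice questions the model gets right.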


