Test, evaluation, and inference are treated as near-synonyms because in academic research you run inference on a trained model almost exclusively to evaluate its performance on a test set. In the real world, of course, you also run inference in production to do useful work, where there are usually no labels to compare against. But the terminology comes from research.
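To make the distinction concrete, here is a minimal sketch in plain Python. The `predict` function is a hypothetical stand-in for a trained model, and the test inputs and labels are made up for illustration. Inference is the step that produces predictions; evaluation is the extra step of comparing those predictions against known test labels.

```python
# Hypothetical stand-in for a trained model: a fixed threshold classifier.
def predict(x: float) -> int:
    return 1 if x >= 0.5 else 0

# Made-up held-out test set: inputs paired with known labels.
test_inputs = [0.1, 0.4, 0.6, 0.9]
test_labels = [0, 1, 1, 1]

# Inference: run the model on the inputs to get predictions.
predictions = [predict(x) for x in test_inputs]

# Evaluation: compare the predictions against the test labels.
correct = sum(p == y for p, y in zip(predictions, test_labels))
accuracy = correct / len(test_labels)
print(accuracy)
```

In production you would typically call only `predict` on new inputs, since no labels are available to score against.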