Interesting point about model variation. It would be useful to run multiple trials and look at the statistical distribution of results rather than single runs. This could help identify which models are more consistent in their outputs.
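A minimal sketch of what that could look like (assuming hypothetical `query_model` and `score_output` helpers standing in for your actual API call and task-specific metric):

```python
import statistics
import random

def query_model(model: str, prompt: str) -> str:
    # Placeholder: replace with your actual API call.
    return f"response from {model}"

def score_output(response: str) -> float:
    # Placeholder metric: replace with your task-specific scoring.
    return random.random()

def run_trials(model: str, prompt: str, n: int = 30) -> list[float]:
    """Run the same prompt n times and collect per-run scores."""
    return [score_output(query_model(model, prompt)) for _ in range(n)]

scores = run_trials("model-a", "Summarize this article ...", n=30)
print(f"n={len(scores)}  mean={statistics.mean(scores):.3f}  "
      f"stdev={statistics.stdev(scores):.3f}  "
      f"min={min(scores):.3f}  max={max(scores):.3f}")
```

Comparing the spread (stdev, min/max) across models, rather than a single score, would at least show which ones are more consistent under repeated runs.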
> Interesting point about model variation. It would be useful to run multiple trials and look at the statistical distribution of results rather than single runs. This could help identify which models are more consistent in their outputs.
That doesn't help in practical usage: all you'd know is how consistent the models were at the time of testing. Five minutes after your test is done, your API request might be routed to a different model in the background because the current one hit its limits.
I tried that on a few problems; even on the same model, the results varied too much.
And when comparing different models, repeating the experiment gives you different results each time.