Accuracy is very rarely a useful metric. It's more an engineering metric than something a user would ever care about.
What users want is to have their own credences properly calibrated by engaging with some system. From a physics textbook, they want a systematic presentation of ideas which allows them to build intuitions etc.
It's important to formulate the actual goal of the system, rather than just the engineer's goal (consider, e.g., "width of pipes" vs. "clean running water").
In the case of statistical AI systems, the goal is often best formulated in terms of the confidences of the system, not its output, since output accuracy is a nonlinear, discontinuous function of those confidences.
So from a statistical AI Q&A system we don't want The Answer; we want the system to have expert-like confidences over the possible answers.
Of course, as soon as you start formulating these metrics, all the state-of-the-art 99%+ accuracy hype evaporates, since most of these systems have terrible confidence distributions.
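To make that concrete, here's a minimal sketch (plain Python, with invented numbers, not from any real system) of two hypothetical Q&A systems that have identical accuracy but very different confidence distributions. Accuracy only looks at the argmax, so it can't tell them apart; a proper scoring rule such as a Brier-style score or log loss can.

    import math

    # Each item: (probability the system assigned to the correct answer, was its top answer correct?)
    # All numbers are invented purely for illustration.
    calibrated_system = [(0.95, True), (0.60, True), (0.55, True), (0.40, False)]
    overconfident_system = [(0.99, True), (0.99, True), (0.99, True), (0.01, False)]

    def accuracy(preds):
        # Accuracy only checks whether the top answer was right: it is flat in the
        # confidences until one crosses a decision boundary, then it jumps.
        return sum(correct for _, correct in preds) / len(preds)

    def brier(preds):
        # Simplified Brier-style score on the probability given to the correct answer.
        return sum((1.0 - p) ** 2 for p, _ in preds) / len(preds)

    def log_loss(preds):
        # Log loss heavily punishes confident wrong answers.
        return -sum(math.log(max(p, 1e-12)) for p, _ in preds) / len(preds)

    for name, preds in [("calibrated", calibrated_system), ("overconfident", overconfident_system)]:
        print(name, "acc:", accuracy(preds), "brier:", round(brier(preds), 3), "logloss:", round(log_loss(preds), 3))

Both toy systems score 75% accuracy, but the overconfident one is much worse on the Brier score and log loss, which is exactly the gap the accuracy hype hides.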
Consider, e.g., ChatGPT, whose answers are often plausibly accurate (they count as an answer) but just repeat some Silicon Valley hype in a way an expert wouldn't. ChatGPT rarely has the careful scepticism of an expert, rarely presents ideas in an even-handed way, rarely mentions the opposing view.
This makes generating reference materials on areas with expert disagreement quite dangerous: ChatGPT presents the non-expert credence distribution. (And indeed it always does, since it just models (Q, A) frequencies, which are not truth-apt.)
This is mixing two meanings of confidence, which could lead to confusion. The OP is using confidence to describe how high the per-token probability scores are, while you are talking about the confidence expressed in the tone of voice of the language generated by the model. Really those are orthogonal issues. (E.g., a model could predict with high probability that an output should be "I don't know".)
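A toy illustration of that orthogonality (hypothetical answer distribution, numbers invented):

    # The model's statistical confidence is high, yet the content it is
    # confident in is itself an epistemic hedge.
    next_answer_distribution = {
        "I don't know": 0.90,       # high probability on a hedged answer
        "Yes, definitely": 0.06,    # low probability on a confident-sounding claim
        "No": 0.04,
    }
    top_answer = max(next_answer_distribution, key=next_answer_distribution.get)
    print(top_answer)  # -> "I don't know": high model confidence, low expressed confidence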
I'm saying that, as a matter of fact, ChatGPT should have different confidences in propositions. My issue isn't the tone of voice; my issue is that the content of what it's saying is wrong with respect to what we care about, i.e., expert credences (/confidences) in the claims it's generating.
It can "express confidently" scepticism; it does not. That's the issue.
In my language above I was mostly using "credence" to talk about the strength of the mental state of belief, and "confidence" to talk about the model of that used in statistical AI.