
Isn't "confidence" measuring the internal state of the net

I don't quite follow your meaning, but no, that's not quite how I think about it.

If x is the input image and C is the class selected by the classifier, we expect the confidence to be representative of the conditional probability P(C | x). There is theory showing this to be true under certain assumptions, but, as is very often the case in this field, the extent to which those assumptions hold is rarely clear. So if the net outputs a confidence of 0.9999, we expect there to be only 1 chance in 10,000 that the true class is something other than C. If we test the net and find such a confidence associated with an incorrect classification more frequently than predicted then we can say that the confidence is objectively incorrect.
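Roughly, the check I have in mind looks like this (synthetic numbers standing in for real net outputs, just to show the bookkeeping):

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-ins for real outputs on a held-out test set: `conf` is the
    # confidence the net assigned to its chosen class, `correct` is whether that
    # class matched the true label (calibrated by construction in this toy case).
    conf = rng.uniform(0.5, 1.0, size=10_000)
    correct = rng.uniform(size=10_000) < conf

    # Compare average stated confidence with observed accuracy in each bin.
    bins = np.linspace(0.5, 1.0, 11)
    idx = np.digitize(conf, bins) - 1
    for b in range(len(bins) - 1):
        mask = idx == b
        if mask.any():
            print(f"stated {conf[mask].mean():.3f}  observed {correct[mask].mean():.3f}  n={mask.sum()}")

If the observed accuracy in a bin is consistently well below the stated confidence, that's the sense in which I'd call the confidence objectively incorrect.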



This isn't something I know much about, but I would think that for a classification problem, if the right loss function is used, the NN should learn to maximise the chance that the true answer falls within the intervals it provides. However...

> If we test the net and find such a confidence associated with an incorrect classification more frequently than predicted then we can say that the confidence is objectively incorrect.

On the test data, performance is never going to match the NN's training-set performance, so it would always overestimate its confidence.


It sounds like you are describing accuracy, whereas confidence is closer to a measure of precision. You can have class outputs with low confidence (relative to 1) and still have a highly accurate NN, so long as the correct output has a higher confidence than the other outputs (even if the difference is 0.0001). If you've got 4 outputs and you're getting results like {0, 1, 0, 0}, you should really make sure the net isn't overfit, because that is the most obvious sign of overfitting.


I think you're describing accuracy, as in prediction accuracy.

Confidence is simply how much confidence you have that the predicted class is correct.

This is often important because in a real system you often want to know whether to accept the model's prediction or pass the case on to a more accurate, more expensive model (sometimes a human). You can then select a confidence threshold that gives an acceptable trade-off between false positives and false negatives (i.e. choose an operating point on the ROC curve). If your model just gives a yes/no answer with no measure of confidence, or if you don't know how to interpret the output as a confidence, then you don't have enough information to do this.
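As a rough sketch (again with made-up numbers rather than a real model), picking such an operating point might look like:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-ins for validation-set outputs; the "model" here is
    # deliberately a bit overconfident (accuracy below its stated confidence).
    conf = rng.uniform(0.5, 1.0, size=5_000)
    correct = rng.uniform(size=5_000) < conf ** 2

    # Accept the cheap model's answer above a confidence threshold and pass
    # everything else on; report coverage and accuracy on the accepted cases.
    for t in (0.6, 0.7, 0.8, 0.9, 0.95):
        kept = conf >= t
        acc = correct[kept].mean() if kept.any() else float("nan")
        print(f"threshold {t:.2f}: coverage {kept.mean():.1%}, accuracy on kept {acc:.1%}")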

Over-fitting is often part of the problem, but in my experience things aren't as black and white as you seem to imply. There are things you can do to limit over-fitting, but I don't know of any simple test that will give you a 0 if you're over-fitting and 1 if you're not.


> Confidence is simply how much confidence you have that the predicted class is correct.

If you mean correct among the number of output nodes, then we are talking about the same thing. If you mean correct among all the possibilities, including those unknown to the NN, then I'd appreciate a link, because I've never seen that before and it sounds incredibly useful.

> ...I don't know of any simple test that will give you a 0 if you're over-fitting and 1 if you're not.

Neither do I; that is why I used all those weasel words. But I also don't know of a neural network that wasn't either being misused or overfit that returned such results. A simple example would be the 2 inputs for the XOR test data, a hidden layer with enough nodes to memorize every permutation, and 2 output nodes, each representing the confidence for one potential result of the binary operation.
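A quick stand-in for that setup, using scikit-learn's MLPClassifier with an oversized hidden layer rather than any particular net discussed here:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # XOR truth table: 4 points, 2 classes.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])

    # An oversized hidden layer trained to convergence on the full truth table
    # just memorizes it, and the predicted probabilities are driven toward {0, 1}.
    net = MLPClassifier(hidden_layer_sizes=(16,), activation="tanh",
                        solver="lbfgs", max_iter=10_000, random_state=0)
    net.fit(X, y)
    print(net.predict_proba(X).round(4))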


On your first point: in my experience, if you use a net with logistic output nodes and you feed it an input that looks nothing like any of the (labeled) training examples, I would expect to see low values for all outputs. I would add the caveat that there may be the odd input that is not an example of any class but still produces an erroneously high-confidence output, though this is not common. These are the types of nets I have the most experience with. I don't have a link for this, only my recollections from when I was working with these types of nets. I did once see a paper that proved the relationship between P(C|x) and logistic output values trained with the squared-error loss, but I don't remember whether it applied to the case you're talking about, and I doubt I could find it again; it may even be behind a paywall I no longer have access to.

I don't think this holds with a softmax output layer, because that gives an output vector that is implicitly normalized across all classes. It also obviously doesn't hold if you explicitly normalize the output vector to sum to 1.
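A toy illustration of the difference, with made-up logits for an input that matches no class well:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Made-up logits for an input that resembles none of the classes:
    # all weakly negative.
    logits = np.array([-2.0, -1.5, -2.5, -1.8])

    print(sigmoid(logits))  # independent per-class outputs, all well below 0.5
    print(softmax(logits))  # forced to sum to 1, so the top class still gets ~0.37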

For your second point, I'm not sure I disagree with what you're saying, and certainly not on the basis of the example you provided before, since I would never expect to see exact 0s and 1s in the output of a net.

EDIT: After thinking about this some more, I realized I may have been mistaken in what I said above. It may be that for an unknown input you would expect to see a high-entropy output distribution whose sum is close to 1 but where the average output value is about 1/N, where N is the number of output classes. If that is the case, then the logistic and softmax cases would be similar. This still indicates low confidence, though: I wouldn't expect to see outputs in the upper quartile, unlike for a cleanly recognized input, where you might see 0.8 or 0.9. I'm going to try this with a net I've trained for MNIST, using random inputs, to check. Unfortunately I have to retrain the net since I didn't save it, and my laptop is busy with another task that I don't want to stop now, so it may take a while.


OK, this isn't a very scientific test, but just eyeballing some outputs for a lightly trained net (classification error about 10%, whereas a fully trained net can get well below 2%): with the test-set inputs, every instance I've looked at has had one output > 0.5, and most > 0.7. For random images, the highest output I'm seeing is 0.27.
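If anyone wants to reproduce the flavor of this without retraining an MNIST net, here's a rough stand-in using ten independent logistic "output nodes" on the small 8x8 digits set; it's not the same net, so the exact numbers will differ:

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Ten independent logistic "output nodes" (one per digit, class-vs-rest)
    # on the small 8x8 digits set, standing in for a net with logistic outputs.
    X, y = load_digits(return_X_y=True)
    X = X / 16.0                                    # scale pixels to [0, 1]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    nodes = [LogisticRegression(max_iter=1000).fit(X_tr, y_tr == c) for c in range(10)]

    def max_output(images):
        # Highest per-class sigmoid output for each image, with no normalization.
        probs = np.column_stack([n.predict_proba(images)[:, 1] for n in nodes])
        return probs.max(axis=1)

    rng = np.random.default_rng(0)
    noise = rng.uniform(0.0, 1.0, size=X_te.shape)
    print("test images :", np.percentile(max_output(X_te), [5, 50, 95]).round(3))
    print("random noise:", np.percentile(max_output(noise), [5, 50, 95]).round(3))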


Also, the output is a probability, but it's a single number. You don't get anything like a 95% CI, error bars, or a posterior distribution.


If the net outputs a probability of 0.9999 but the lower bound of the CI is 0.4, then you don't have to trust it. With linear models you can estimate the CI from the data and the mean squared error of the classifier.


It sounds like you want to stack another probability on top of a probability: the net gives you an estimate of P(C|x), and then you want a probabilistic measure of how accurate that estimate is. This starts to get a little difficult for me to think about. It seems like you might be able to estimate it by running the net on a test set (disjoint, of course, from the training set) and applying standard statistical methods to the results, just as you could for any experiment.

EDIT: Maybe it's more complicated than that, though, because the experiment is itself estimating a probability, not just the value of some random variable. Maybe we need a real statistics expert to look into this.


The net output is overconfident, but this can be (partially) corrected with probability calibration. Calibration is also used for SVMs, which are usually underconfident.
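A minimal sketch of the usual way to do this in scikit-learn, with Platt scaling (the "sigmoid" method) fitted via cross-validation on top of an uncalibrated linear SVM:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.metrics import brier_score_loss
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # The wrapper fits the base model on cross-validation folds and learns a
    # sigmoid (Platt) mapping from its decision scores to probabilities on the
    # held-out folds, so the calibration isn't fit on the same data as the model.
    calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
    calibrated.fit(X_tr, y_tr)

    p = calibrated.predict_proba(X_te)[:, 1]
    print(f"Brier score on held-out data: {brier_score_loss(y_te, p):.4f}")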


An SVM doesn't give a CI either. In fact, it doesn't even give per-sample probability estimates unless you use Platt's technique, whose probabilities don't always agree with the classifier's outputs.

What I mean is a confidence interval on the probabilities. That is, if you did the training on 100 different training sets that follow the same distribution as your training set, the resulting models would give a range of outputs for a particular sample. The 95th highest and 5th lowest would have specific values; these can be estimated and are the bounds of the CI.
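A brute-force sketch of that idea, using bootstrap resamples of a single training set in place of 100 independently drawn training sets:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Refit the model on bootstrap resamples of the training set and record the
    # predicted probability for one particular test sample each time.
    rng = np.random.default_rng(0)
    sample = X_te[:1]
    preds = []
    for _ in range(200):
        idx = rng.integers(0, len(X_tr), size=len(X_tr))
        model = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
        preds.append(model.predict_proba(sample)[0, 1])

    lo, hi = np.percentile(preds, [5, 95])
    print(f"mean {np.mean(preds):.3f}, 90% interval [{lo:.3f}, {hi:.3f}]")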

For example, for logistic regression you can estimate the standard error per coefficient like so: https://en.wikipedia.org/wiki/Ordinary_least_squares#Finite_... That would give you a vector of standard errors the size of your variable (feature) set. You can then use that to estimate the overall range of the CI per sample.
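And a sketch of the analytic route, using statsmodels for the per-coefficient standard errors and the delta method for a per-sample interval (synthetic data just to keep it self-contained):

    import numpy as np
    import statsmodels.api as sm
    from scipy.special import expit

    # Synthetic data, purely to make the sketch runnable.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = (rng.uniform(size=500) < expit(X @ np.array([1.0, -2.0, 0.5]))).astype(int)

    Xc = sm.add_constant(X)
    res = sm.Logit(y, Xc).fit(disp=0)
    print(res.bse)  # one standard error per coefficient (including the intercept)

    # Delta method for a single sample: a normal interval on the linear
    # predictor x'beta from the coefficient covariance, mapped through the
    # logistic function to get an (asymmetric) interval on the probability.
    x = Xc[:1]
    eta = (x @ res.params).item()
    se = np.sqrt(x @ res.cov_params() @ x.T).item()
    lo, hi = expit(eta - 1.96 * se), expit(eta + 1.96 * se)
    print(f"p = {expit(eta):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")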

This would be useful in the example provided in the OP's link. For example, let's say the classifier says there is an 86% probability that the license plate is RBX735.

Can we really trust this 86% probability, or is the model primarily relying on information that is available in only 2 samples out of 1000, which might have been outliers?




