Tuesday, September 2, 2014

Platt scaling - calibrating classifier probabilities

In many real-world cases, it is very important to predict well-calibrated probabilities.
In medical domains, for example, doctors are more interested in the probability value associated with a prediction than in a categorical output that merely says whether a patient has a certain condition or not.
In document clustering, a ranked affinity of documents to a cluster label is more informative than a hard assignment.
In this two-part series, we explore how to go about calibrating the output probabilities from classifiers. In this part we detail the problem and use reliability charts to visualize the output probabilities. A reliability chart indicates whether the probabilities need to be calibrated.
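The idea behind a reliability chart is simple: bin the predicted probabilities, and within each bin compare the mean predicted probability against the observed fraction of positives. For a well-calibrated model the two track each other. Below is a minimal sketch using only NumPy; the function name `reliability_chart` and the toy data are illustrative, not part of the original post.

```python
import numpy as np

def reliability_chart(y_true, y_prob, n_bins=10):
    """Bin predicted probabilities and, per bin, compare the mean
    predicted probability with the observed fraction of positives."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ids = np.digitize(y_prob, edges[1:-1])  # bin index for each prediction
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = ids == b
        if mask.any():
            mean_pred.append(y_prob[mask].mean())
            frac_pos.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(frac_pos)

# Toy example of a perfectly calibrated model: predictions of 0.1 are
# positive 10% of the time, predictions of 0.9 are positive 90% of the time.
y_prob = np.array([0.1] * 10 + [0.9] * 10)
y_true = np.array([1] * 1 + [0] * 9 + [1] * 9 + [0] * 1)
mp, fp = reliability_chart(y_true, y_prob)
```

Plotting `mp` against `fp` gives the reliability chart; a calibrated model hugs the diagonal.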
Let us simulate some classification data for this exercise.
from sklearn.datasets import make_classification

def load_data():
    X, Y = make_classification(n_samples=100, n_features=100, random_state=100)
    return X, Y
We use a maximum-margin classifier (an SVM in this case) and a naive Bayes classifier.
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB

def max_margin_model(X, Y):
    mdl = SVC()
    mdl.fit(X, Y)
    return mdl

def nb_model(X, Y):
    mdl = BernoulliNB()
    mdl.fit(X, Y)
    return mdl
SVC does not return probability values; its decision_function returns the signed distance from the separating hyperplane. We can min-max scale these distances to values between 0 and 1 so they can be compared on a probability-like scale, though they are not calibrated probabilities.
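As a small illustration of this min-max scaling (the distances below are made-up values, not output from the model):

```python
import numpy as np

# Hypothetical decision_function outputs: signed distances to the hyperplane.
distances = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])

# Min-max scaling maps the smallest distance to 0 and the largest to 1.
scaled = (distances - distances.min()) / (distances.max() - distances.min())
# scaled -> [0. , 0.3, 0.4, 0.6, 1. ]
```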
We fit the simulated data with these two classifiers and examine the output probabilities.
from sklearn.model_selection import train_test_split

X, Y = load_data()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)

mdl_nb = nb_model(X_train, Y_train)
predict_prob_nb = mdl_nb.predict_proba(X_test)

# Relabel the negative class as -1 for the SVM.
Y_train[Y_train == 0] = -1

mdl_max_margin = max_margin_model(X_train, Y_train)
predict_prob_max_margin = mdl_max_margin.decision_function(X_test)
We want to look at the values assigned to the true positives. Given the predicted probabilities (or distances) and the true labels, the following function returns the predicted values for the positive instances.
import numpy
from sklearn import preprocessing

def getProbability(distances, Y, scale=False):
    Y_ones = numpy.where(Y == 1)[0]
    distances_ones = distances[Y_ones]
    if scale:
        # MinMaxScaler expects a 2D array, so reshape before scaling.
        min_max_scaler = preprocessing.MinMaxScaler()
        distances_ones = min_max_scaler.fit_transform(
            distances_ones.reshape(-1, 1)).ravel()
    return distances_ones
For the distance output from SVC, we set the scale argument to True, so we get values between 0 and 1. For naive Bayes, we simply return the probability values for the true positives.
We use the above function to get the probability values attributed to the true positives for both models.
predict_prob_nb = getProbability(predict_prob_nb[:, 1], Y_test, False)
predict_prob_max_margin = getProbability(mdl_max_margin.decision_function(X_test), Y_test, True)
We look at the probability histogram for the SVC output:
bins = list(numpy.arange(0,1.1,0.1))
hist,bins = numpy.histogram(predict_prob_max_margin,bins)
>>> hist
array([ 9,  7, 12, 16, 19, 16, 24,  7, 21, 18])
It is interesting to see the spread of the probability values for SVC.
As seen in the output, the SVM pushes the predicted values away from 0 and 1, toward the middle of the range.
With naive bayes
>>> hist,bins = numpy.histogram(predict_prob_nb,bins)
>>> hist
array([ 25,   1,   2,  10,   9,  25,  17,  15,  12, 131])
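The two histograms can be drawn side by side with matplotlib; a sketch using the counts shown above (the figure filename is illustrative):

```python
import numpy
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; we only save the figure
import matplotlib.pyplot as plt

# Histogram counts from the SVC and naive Bayes outputs above.
bins = numpy.arange(0, 1.1, 0.1)
hist_svc = numpy.array([9, 7, 12, 16, 19, 16, 24, 7, 21, 18])
hist_nb = numpy.array([25, 1, 2, 10, 9, 25, 17, 15, 12, 131])

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].bar(bins[:-1], hist_svc, width=0.09, align="edge")
axes[0].set_title("SVC (scaled distances)")
axes[1].bar(bins[:-1], hist_nb, width=0.09, align="edge")
axes[1].set_title("Bernoulli naive Bayes")
for ax in axes:
    ax.set_xlabel("predicted probability")
axes[0].set_ylabel("count")
fig.savefig("probability_histograms.png")
```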
As expected, naive Bayes pushes the probability values towards 0 and 1.
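Part two covers the calibration step itself. As a preview, scikit-learn ships an implementation of Platt scaling: CalibratedClassifierCV with method='sigmoid' fits a sigmoid to the classifier's decision values via internal cross-validation. This is a library shortcut, not the code developed in this post; a minimal sketch on the same kind of simulated data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, Y = make_classification(n_samples=200, n_features=100, random_state=100)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=100)

# method='sigmoid' is Platt scaling: a sigmoid is fit to the SVC's
# decision values using internal cross-validation.
calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=3)
calibrated.fit(X_train, Y_train)
probs = calibrated.predict_proba(X_test)[:, 1]  # calibrated values in [0, 1]
```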

Reference

http://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
http://www.stat.cmu.edu/~fienberg/Statistics36-756/DegrootFienberg-Statistician-1983.pdf
