(Step 4) Model Evaluation

Always select evaluation measures according to the goals of studies

Different studies often have different goals when developing defect models. For example, one study may want its defect models to yield the highest discriminatory power (i.e., optimise based on the Area Under the receiver operating characteristic Curve (AUC) [HM82]), while another study may want its defect models to discover an actual defective module as early as possible (i.e., optimise based on Initial False Alarm [HXL17]). Thus, several evaluation measures have been proposed to serve the different goals of developing defect models.

Below, we describe 5 predictive accuracy measures and 4 effort-aware measures in detail with interactive tutorials. We also provide a code snippet below for setting up the environment for these interactive tutorials, i.e., data preparation and model construction.

## Load data and prepare datasets

# Import for Load Data
from os import listdir
from os.path import isfile, join
import pandas as pd
import numpy as np

# Import for Split Data into Training and Testing Samples
from sklearn.model_selection import train_test_split

# Import for Construct Defect Models (Classification)
from sklearn.ensemble import RandomForestClassifier # Random Forests


train_dataset = pd.read_csv("../../datasets/lucene-2.9.0.csv", index_col='File')
test_dataset = pd.read_csv("../../datasets/lucene-3.0.0.csv", index_col='File')

outcome = 'RealBug'
features = ['OWN_COMMIT', 'Added_lines', 'CountClassCoupled', 'AvgLine', 'RatioCommentToCode']

# process outcome to 0 and 1
train_dataset[outcome] = pd.Categorical(train_dataset[outcome])
train_dataset[outcome] = train_dataset[outcome].cat.codes

test_dataset[outcome] = pd.Categorical(test_dataset[outcome])
test_dataset[outcome] = test_dataset[outcome].cat.codes

X_train = train_dataset.loc[:, features]
X_test = test_dataset.loc[:, features]

y_train = train_dataset.loc[:, outcome]
y_test = test_dataset.loc[:, outcome]


# rename the selected features with more readable names
# nCommit - number of commits that modify the file of interest
# AddedLOC - number of added lines of code
# nCoupledClass - number of classes that interact or couple with the class of interest
# LOC - number of lines of code
# CommentToCodeRatio - the ratio of lines of comments to lines of code
features = ['nCommit', 'AddedLOC', 'nCoupledClass', 'LOC', 'CommentToCodeRatio']

X_train.columns = features
X_test.columns = features
training_data = pd.concat([X_train, y_train], axis=1)
testing_data = pd.concat([X_test, y_test], axis=1)

## Construct defect models
# Random Forests
rf_model = RandomForestClassifier(random_state=1234, n_jobs = 10)
rf_model.fit(X_train, y_train)  

Predictive Accuracy Measures

Precision

Precision measures the proportion between the number of lines that are correctly identified as defective and the number of lines that are identified as defective by the models. More specifically, precision is computed as \(\frac{TP}{(TP+FP)}\), where \(TP\) is the number of actual defective lines that are predicted as defective and \(FP\) is the number of clean lines that are predicted as defective. A high precision value indicates that the models can correctly identify a high number of defective lines.

from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score

# construct a confusion matrix
print('Construct a confusion matrix:')
tn, fp, fn, tp = confusion_matrix(y_test,rf_model.predict(X_test)).ravel()
print('(True Positive, False Positive) = (' + str(tp) +','+str(fp)+')')
print('(False Negative, True Negative) = (' + str(fn) +','+str(tn)+')\n')

# calculate precision manually
rf_precision_manual = tp/(tp + fp)

# calculate precision with a function
rf_precision_function = precision_score(y_test, rf_model.predict(X_test))

print('Precision (manual calculation)\t\t:', rf_precision_manual)
print('Precision (precision_score function)\t:', rf_precision_function)
Construct a confusion matrix:
(True Positive, False Positive) = (99,147)
(False Negative, True Negative) = (56,1035)

Precision (manual calculation)		: 0.4024390243902439
Precision (precision_score function)	: 0.4024390243902439

Recall

Recall measures the proportion between the number of lines that are correctly identified as defective and the number of actual defective lines. More specifically, recall is computed as \(\frac{TP}{(TP+FN)}\), where \(TP\) is the number of actual defective lines that are predicted as defective and \(FN\) is the number of actual defective lines that are predicted as clean. A high recall value indicates that the approach can identify more of the defective lines.

from sklearn.metrics import recall_score

# calculate recall manually
rf_recall_manual = tp/(tp + fn)

# calculate recall with a function
rf_recall_function = recall_score(y_test, rf_model.predict(X_test))

print('Recall (manual calculation)\t\t:', rf_recall_manual)
print('Recall (recall_score function)\t:', rf_recall_function)
Recall (manual calculation)		: 0.6387096774193548
Recall (recall_score function)	: 0.6387096774193548

False Alarm Rate (FAR) or False Positive Rate (FPR)

FAR measures the proportion between the number of clean lines that are incorrectly identified as defective and the number of actual clean lines. More specifically, FAR is computed as \(\frac{FP}{(FP+TN)}\), where \(FP\) is the number of actual clean lines that are predicted as defective and \(TN\) is the number of actual clean lines that are predicted as clean. The lower the FAR value, the fewer the clean lines that are identified as defective. In other words, a low FAR value indicates that developers spend less effort when inspecting the defect-prone lines identified by an approach.

# calculate FAR manually
rf_FAR_manual = fp / (fp + tn)

print('FAR (manual calculation)\t\t:', rf_FAR_manual)
FAR (manual calculation)		: 0.12436548223350254
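
As an optional cross-check, FAR can equivalently be obtained as 1 minus the recall of the clean class (label 0), since that recall equals the specificity \(\frac{TN}{(TN+FP)}\). The snippet below is a minimal sketch of this equivalence (the variable name is illustrative only).

from sklearn.metrics import recall_score

# FAR = FP / (FP + TN) = 1 - specificity,
# where specificity is the recall of the clean class (label 0)
rf_FAR_crosscheck = 1 - recall_score(y_test, rf_model.predict(X_test), pos_label=0)

print('FAR (1 - recall of the clean class)\t:', rf_FAR_crosscheck)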

Area Under the receiver operating characteristic Curve (AUC)

AUC measures the discriminatory power of predictive models and is widely suggested by recent research [LBMP08][GMH15][RD13]. The receiver operating characteristic (ROC) curve plots the coverage of defective modules (true positive rate) on the y-axis against the proportion of non-defective modules that are incorrectly classified as defective (false positive rate) on the x-axis. The AUC measure is a threshold-independent performance measure that evaluates the ability of models to discriminate between defective and clean instances. AUC values range from 0 (worst) to 1 (best), where a value of 0.5 indicates a model that is no better than random guessing [HM82].

from sklearn.metrics import roc_auc_score 

# calculate AUC with a function
rf_AUC_function = roc_auc_score(y_test, rf_model.predict_proba(X_test)[:,1])

print('AUC (roc_auc_score function)\t\t:', rf_AUC_function)
AUC (roc_auc_score function)		: 0.8589023524916761
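
To illustrate where this value comes from, the sketch below computes the underlying ROC curve with roc_curve and approximates the area under it with the trapezoidal rule; the result should match roc_auc_score up to numerical precision (the variable names here are illustrative only).

from sklearn.metrics import roc_curve

# compute the ROC curve points (false positive rate vs. true positive rate)
# for every probability threshold produced by the model
fpr, tpr, thresholds = roc_curve(y_test, rf_model.predict_proba(X_test)[:, 1])

# approximate the area under the ROC curve with the trapezoidal rule
rf_AUC_trapezoid = np.trapz(tpr, fpr)

print('AUC (trapezoidal rule over the ROC curve)\t:', rf_AUC_trapezoid)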

Matthews Correlation Coefficient (MCC)

MCC measures the correlation coefficient between actual and predicted outcomes using the following calculation: \begin{equation} MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \end{equation} An MCC value ranges from -1 to +1, where a value of +1 indicates a perfect prediction, 0 indicates a prediction no better than random, and -1 indicates total disagreement between the predicted and actual outcomes.

from sklearn.metrics import matthews_corrcoef

# calculate MCC manually
rf_MCC_manual = ((tp * tn) - (fp * fn)) / (((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** (1/2))

# calculate MCC with a function
rf_MCC_function = matthews_corrcoef(y_test, rf_model.predict(X_test))

print('MCC (manual calculation)\t\t:', rf_MCC_manual)
print('MCC (matthews_corrcoef function)\t:', rf_MCC_function)
MCC (manual calculation)		: 0.4249604383453459
MCC (matthews_corrcoef function)	: 0.4249604383453459

Effort-Aware Measures

Initial False Alarm (IFA)

IFA measures the number of clean lines on which developers spend SQA effort until the first defective line is found when the lines are ranked by their defect-proneness [HXL17]. A low IFA value indicates that few clean lines are ranked at the top, while a high IFA value indicates that developers will spend unnecessary effort on clean lines. The intuition behind this measure is that developers may stop inspecting if they cannot get promising results (i.e., find defective lines) within the first few inspected lines [PO11].

# Generate a defect-proneness ranking of testing instances
X_test_df = X_test.copy()
X_test_df['predicted_prob'] = rf_model.predict_proba(X_test)[:, 1]
X_test_df = X_test_df.sort_values(by = ['predicted_prob'], ascending = False)

# Determine the Initial False Alarm (IFA)
IFA = 0
for test_index in X_test_df.index:
    IFA += 1
    if y_test.loc[test_index] == 1:
        break
        
print('Initial False Alarm (IFA)\t\t:', IFA)
Initial False Alarm (IFA)		: 1
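
The loop above effectively counts the 1-based rank of the first actual defective line in the ranking. An equivalent vectorised sketch is shown below, assuming the test set contains at least one defective line (the variable names are illustrative only).

# reorder the actual labels according to the defect-proneness ranking
ranked_labels = y_test.loc[X_test_df.index].values

# 1-based rank of the first actual defective line in the ranking;
# this reproduces the counting convention of the loop above
IFA_vectorised = int(np.argmax(ranked_labels == 1)) + 1

print('Initial False Alarm (vectorised sketch)\t:', IFA_vectorised)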

Distance-to-Heaven (D2H)

D2H (distance to heaven) is a combination of recall and FAR proposed by Agrawal and Menzies [AM18] [AFC+19]. D2H is the root mean square of \((1-Recall)\) and \(FAR\), i.e., the normalised distance to the ideal point of Recall \(= 1\) and FAR \(= 0\): \(\sqrt{\frac{(1-Recall)^2 + (0-FAR)^2}{2}}\). A D2H value of 0 indicates that an approach achieves perfect identification, i.e., the approach identifies all defective lines (Recall \(= 1\)) without any false positives (FAR \(= 0\)). A high D2H value indicates that the performance of an approach is far from perfect, e.g., achieving a high recall value but also a high FAR value, and vice versa.

from sklearn.metrics import recall_score

# calculate recall with a function
rf_recall_function = recall_score(y_test, rf_model.predict(X_test))

# calculate FAR manually
rf_FAR_manual = fp / (fp + tn)

# calculate D2H manually
rf_D2H_numerator = ((1 - rf_recall_function)**2) + ((0 - rf_FAR_manual)**2)
rf_D2H_denominator = 2
rf_D2H_manual = (rf_D2H_numerator / rf_D2H_denominator)**(1/2)

print('D2H (manual calculation)\t\t:', rf_D2H_manual)
D2H (manual calculation)		: 0.27018278105904375
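
As a quick sanity check of the formula, plugging the two extreme cases into the same calculation gives the bounds of the D2H range (these values are illustrative and not model outputs).

# D2H for a perfect prediction: Recall = 1, FAR = 0
perfect_D2H = (((1 - 1.0)**2 + (0 - 0.0)**2) / 2)**(1/2)

# D2H for the worst case: Recall = 0, FAR = 1
worst_D2H = (((1 - 0.0)**2 + (0 - 1.0)**2) / 2)**(1/2)

print('D2H (perfect prediction)\t:', perfect_D2H)
print('D2H (worst case)\t\t:', worst_D2H)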

Top k% LOC Precision

Top k% LOC Precision measures how many defective lines are found when inspecting the top k% of lines ranked by the defect-proneness estimated by the models [HXL17]. A high Top k% LOC Precision value indicates that the model ranks many defective lines at the top, so many defective lines can be found for a fixed amount of effort (i.e., k% of LOC). On the other hand, a low Top k% LOC Precision value indicates that many clean lines appear in the top k% of LOC and developers need to inspect more lines to identify defects. Similar to prior studies [MK10] [KMM+10] [RKBD14] [RHG+16], we use 20% of LOC as the fixed effort cutoff in this interactive tutorial.

from sklearn.metrics import precision_score

# Generate a defect-proneness ranking of testing instances
X_test_df = X_test.copy()
X_test_df['predicted_prob'] = rf_model.predict_proba(X_test)[:, 1]
X_test_df = X_test_df.sort_values(by = ['predicted_prob'], ascending = False)

# calculate the value of k% LOC, where k = 20
k_percent = 20.0
total_LOC = np.sum(X_test_df['LOC'])
p20_LOC = total_LOC * k_percent / 100

# find Top k% LOC according to the defect-proneness ranking of testing instances
cumsum_LOC = 0
last_index = -1
for i in range(len(X_test)):
    cumsum_LOC += X_test_df['LOC'].iloc[i]
    last_index = i
    
    if cumsum_LOC > p20_LOC:
        print('Cumsum_LOC =', cumsum_LOC, 'Index', last_index)
        break
p20_LOC_X_test_df = X_test_df.iloc[:last_index, :]
p20_LOC_y_test_df = y_test[p20_LOC_X_test_df.index]

# calculate precision of Top k% LOC
rf_p20_precision_function = precision_score(p20_LOC_y_test_df, rf_model.predict(p20_LOC_X_test_df.loc[:, X_test.columns]))

print('Top 20% LOC Precision (precision_score function)\t:', rf_p20_precision_function)
Cumsum_LOC = 3736 Index 231
Top 20% LOC Precision (precision_score function)	: 0.4025974025974026

Top k% LOC Recall

Top k% LOC Recall measures how many actual defective lines are found given a fixed amount of effort, i.e., the top k% of lines ranked by their defect-proneness [HXL17]. A high Top k% LOC Recall value indicates that an approach ranks many actual defective lines at the top, so many actual defective lines can be found for a fixed amount of effort. On the other hand, a low Top k% LOC Recall value indicates that many clean lines appear in the top k% of LOC and developers need to spend more effort to identify defective lines. Similarly, we use 20% of LOC as the fixed effort cutoff.

from sklearn.metrics import recall_score

# Generate a defect-proneness ranking of testing instances
X_test_df = X_test.copy()
X_test_df['predicted_prob'] = rf_model.predict_proba(X_test)[:, 1]
X_test_df = X_test_df.sort_values(by = ['predicted_prob'], ascending = False)

# calculate the value of k% LOC, where k = 20
k_percent = 20.0
total_LOC = np.sum(X_test_df['LOC'])
p20_LOC = total_LOC * k_percent / 100

# find Top k% LOC according to the defect-proneness ranking of testing instances
cumsum_LOC = 0
last_index = -1
for i in range(len(X_test)):
    cumsum_LOC += X_test_df['LOC'].iloc[i]
    last_index = i
    
    if cumsum_LOC > p20_LOC:
        print('Cumsum_LOC =', cumsum_LOC, 'Index', last_index)
        break
p20_LOC_X_test_df = X_test_df.iloc[:last_index, :]
p20_LOC_y_test_df = y_test[p20_LOC_X_test_df.index]

# calculate recall of Top k% LOC
rf_p20_recall_function = recall_score(p20_LOC_y_test_df, rf_model.predict(p20_LOC_X_test_df.loc[:, X_test.columns]))

print('Top 20% LOC Recall (recall_score function)\t:', rf_p20_recall_function)
Cumsum_LOC = 3736 Index 231
Top 20% LOC Recall (recall_score function)	: 1.0
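
Since the Top k% LOC Precision and Top k% LOC Recall computations share the same ranking and cutoff logic, they can be wrapped into a single helper. The function below, top_k_loc_scores, is a hypothetical refactoring sketch (the function name and its parameters are our own) that reproduces the two calculations above for any classifier and cutoff k.

from sklearn.metrics import precision_score, recall_score

def top_k_loc_scores(model, X, y, loc_column='LOC', k_percent=20.0):
    # rank the testing instances by their estimated defect-proneness
    ranked = X.copy()
    ranked['predicted_prob'] = model.predict_proba(X)[:, 1]
    ranked = ranked.sort_values(by='predicted_prob', ascending=False)

    # keep the top-ranked instances whose cumulative LOC stays within k% of the total LOC
    effort_budget = ranked[loc_column].sum() * k_percent / 100
    within_budget = ranked[ranked[loc_column].cumsum() <= effort_budget]

    # compute precision and recall on the selected instances only
    y_top = y[within_budget.index]
    pred_top = model.predict(within_budget[X.columns])
    return precision_score(y_top, pred_top), recall_score(y_top, pred_top)

p20_precision, p20_recall = top_k_loc_scores(rf_model, X_test, y_test)
print('Top 20% LOC Precision (helper sketch)\t:', p20_precision)
print('Top 20% LOC Recall (helper sketch)\t:', p20_recall)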

Note

Parts of this chapter have been published by Supatsara Wattanakriengkrai, Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Hideaki Hata, Kenichi Matsumoto: Predicting Defective Lines Using a Model-Agnostic Technique. CoRR abs/2009.03612 (2020).

Suggested Readings

[1] Amritanshu Agrawal, Tim Menzies: Is “better data” better than “better data miners”?: on the benefits of tuning SMOTE for defect prediction. ICSE 2018: 1050-1061.

[2] Amritanshu Agrawal, Wei Fu, Di Chen, Xipeng Shen, Tim Menzies: How to “DODGE” Complex Software Analytics? CoRR abs/1902.01838 (2019).

[3] Chris Parnin, Alessandro Orso: Are automated debugging techniques actually helping programmers? ISSTA 2011: 199-209.

[4] Qiao Huang, Xin Xia, David Lo: Supervised vs Unsupervised Models: A Holistic Look at Effort-Aware Just-in-Time Defect Prediction. ICSME 2017: 159-170.