Part 1: Regression and Classification Model Evaluation
An introduction and intuition on how to evaluate regression and classification models in general

1. Introduction
Data scientists often use machine learning models to generate insight, but how does a data scientist decide whether a model should be implemented or not? When a model is implemented, it will have both positive and negative impacts on the business. In order to prevent or minimize the negative impacts, it is necessary to evaluate the model, so that the positive and negative impacts it generates can be estimated. For this reason, model evaluation is one of the most important parts of machine learning.
In this article, we’ll learn about the most important performance metrics that can be used to assess regression and classification models. The following is the list of metrics:
Regression
- R-Squared
- RMSE : Root Mean Squared Error [0, infinity)
- MAE : Mean Absolute Error [0, infinity)
Classification
- Accuracy
- Precision
- Recall
- Specificity
- F1 Score
2. Regression
In the regression case, the most popular evaluation metrics are R-Squared, RMSE, and MAE.
R-Squared
Mathematically, we can write R-Squared as follows:
R-squared = 1 - ( Σ (y_i - ŷ_i)² / Σ (y_i - ȳ)² )
where y_i is the observed value, ŷ_i is the predicted value, and ȳ is the mean of the observed values. In words, we can express the R-Squared formula as:
R-squared = Explained variation / Total variation
R-Squared measures how well the predictions approximate the ground truth, or how close the data are to the fitted regression line. In essence, R-Squared represents the proportion of the variance of a dependent variable that is explained by an independent variable or variables in a regression model. R-squared is always between 0 and 1, or between 0% and 100% when expressed as a percentage.
- 0% indicates that the model explains none of the variability of the response data around its mean.
- 100% indicates that the model explains all the variability of the response data around its mean.
Plotting fitted values by observed values graphically illustrates different R-squared values for regression models.

The regression model on the left accounts for 38.0% of the variance, while the one on the right accounts for 87.4%. The more variance that is accounted for by the regression model, the closer the data points will fall to the fitted regression line. Theoretically, if a model could explain 100% of the variance, the fitted values would always equal the observed values and, therefore, all the data points would fall on the fitted regression line [1].
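As a minimal sketch (using scikit-learn’s r2_score on a couple of hypothetical arrays, not data from this article), R-Squared can be computed like this:

```python
# A minimal sketch of computing R-Squared with scikit-learn.
# y_true and y_pred are hypothetical example values.
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.5, 10.0, 12.5]   # observed values
y_pred = [2.8, 5.4, 7.0, 10.3, 12.1]   # fitted values from some regression model

# r2_score returns the proportion of variance explained (1.0 means a perfect fit)
print(r2_score(y_true, y_pred))
```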
RMSE and MAE
RMSE : mathematically, we can write RMSE as follows:
RMSE = sqrt( (1/n) Σ (y_i - ŷ_i)² )
According to Wikipedia [2], RMSE (Root Mean Squared Error) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. RMSE is always non-negative and ranges from 0 to infinity.
MAE : mathematically, we can write MAE as follows:
MAE = (1/n) Σ |y_i - ŷ_i|
According to Wikipedia [3], MAE is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement.
When do we use RMSE and MAE?

If the data contain outliers, the value of RMSE will be higher than MAE, because squaring the errors gives large errors more weight; in that sense RMSE is more “sensitive” than MAE. Using RMSE is therefore more appropriate when large errors, such as those caused by outliers, should be penalized more heavily.
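As a rough sketch with hypothetical numbers, the snippet below shows how a single large error (an “outlier” miss) pushes RMSE up much more than MAE:

```python
# A minimal sketch comparing RMSE and MAE on the same hypothetical predictions.
# The last observation is deliberately missed by a wide margin.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0, 30.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.3, 12.1])   # large miss on the last value

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # squaring gives the large error more weight
mae = mean_absolute_error(y_true, y_pred)           # absolute error grows only linearly

print(rmse, mae)  # RMSE (~8.0) is much larger than MAE (~3.9) because of the single large error
```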
3. Classification
In the case of classification, several metrics can describe the performance of a classification model on a set of test data for which the true values are known.
Before discussing model evaluation for classification, we should understand the confusion matrix. A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.

- Positive (P) : Observation is positive (for example: is an apple).
- Negative (N) : Observation is not positive (for example: is not an apple).
- True Positive (TP) : Observation is positive, and is predicted to be positive.
- False Negative (FN) : Observation is positive, but is predicted negative.
- True Negative (TN) : Observation is negative, and is predicted to be negative.
- False Positive (FP) : Observation is negative, but is predicted positive.
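As a small sketch (the labels below are hypothetical, with 1 = “apple” as the positive class), scikit-learn’s confusion_matrix returns these four counts directly:

```python
# A minimal sketch of building a confusion matrix with scikit-learn.
# Hypothetical labels: 1 = "apple" (positive), 0 = "not apple" (negative).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

# For binary labels {0, 1} the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FN={fn}, TN={tn}, FP={fp}")
```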
Accuracy
Accuracy is the most basic metric for evaluating classification models and measures the percentage of correct predictions out of all predictions. It is best used for a balanced data set. In order to compute accuracy, we need the confusion matrix.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy = (True positive + True negative) / Total count of elements
In this case, the accuracy is (44 + 37) / (44 + 37 + 15 + 4) = 81/100 = 81%.
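A quick check of this arithmetic in Python, reusing the same counts from the confusion matrix above:

```python
# Accuracy from the counts used in the example above.
tp, tn, fp, fn = 44, 37, 15, 4

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.81, i.e. 81%
```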
Precision
Precision is a metric that quantifies the number of correct positive predictions made. It can be used for an unbalanced data set. In essence, it is calculated as the number of correctly predicted positive examples divided by the total number of examples that were predicted positive [4].
It is the ratio of true positive predictions to all results that were predicted positive. Precision answers the question, “Of all the students predicted to drop out, what percentage actually dropped out?” [5]

Precision = (TP) / (TP + FP)
Precision = (True positive) / (True positive + False positive)
In this case, the precision is 44 / (44 + 15) = 44/59 = 74.6%, or roughly 3/4.
For example, out of every 4 emails predicted to be SPAM, roughly 1 is actually NOT SPAM but gets categorized as SPAM. This metric is used to avoid false positives (the observation is negative, but is predicted positive).
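A quick check of the precision arithmetic in Python:

```python
# Precision from the counts used in the example above.
tp, fp = 44, 15

precision = tp / (tp + fp)
print(round(precision, 3))  # 0.746, i.e. roughly 3 out of 4 positive predictions are correct
```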
Recall / Sensitivity
Recall is calculated as the number of true positives divided by the total number of true positives and false negatives. This is a metric that can be used for an unbalanced data set.
It is the ratio of true positive predictions to all data that are actually positive. Recall answers the question, “Of all the students who actually dropped out, what percentage were predicted to drop out?” [5]

Recall = (TP) / (TP + FN)
Recall = (True positive) / (True positive + False negative)
In this case, the recall is 40 / (40 + 10) = 40/50 = 80%.
For example, out of 10 people who are actually positive for Covid-19, the model fails to predict 2 of them as positive. This is very dangerous, because those two people can infect other people. This metric is used to avoid false negatives (the observation is positive, but is predicted negative).
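A quick check of the recall arithmetic in Python:

```python
# Recall from the counts used in the example above.
tp, fn = 40, 10

recall = tp / (tp + fn)
print(recall)  # 0.8, i.e. 80% of the actual positives are detected
```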
Specificity
Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such.
It is the ratio of correct negative predictions to all data that are actually negative. Specificity answers the question, “Of all the students who did not actually drop out, what percentage were correctly predicted not to drop out?” [5]

Specificity = (TN) / (TN + FP)
Specificity = (True negative) / (True negative + False positive)
In this case, the specificity is 98 / (98 + 2) = 98/100 = 98%.
For example, in fraud detection, out of 100 transactions that are not fraudulent, the algorithm correctly predicts 98 of them as non-fraudulent. This metric is used to maximize the true negative rate.
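A quick check of the specificity arithmetic in Python:

```python
# Specificity from the counts used in the example above.
tn, fp = 98, 2

specificity = tn / (tn + fp)
print(specificity)  # 0.98, i.e. 98% of the actual negatives are identified correctly
```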
F1 Score
The F1 Score provides a way to combine both precision and recall into a single measure that captures both properties.
As mentioned before, precision is a metric for minimizing the false positive rate (the model is too “confident” and labels data as “positive”). On the other hand, recall is a metric for minimizing the false negative rate (the model fails to detect data that are actually “positive”).
Is it possible to minimize both (FP and FN) simultaneously? Yes, with the F1 Score. The F1 Score is the harmonic mean of precision and recall, which weights both equally.

F1 Score = 2 * (Recall*Precision) / (Recall + Precision)
For example, suppose TP = 5, FP = 95, and FN = 0:
Precision = 5/(5+95) = 5%
Recall = 5/(5+0) = 100%
F1 score = (2 x 5% x 100%) / (5% + 100%) = 9.5%
The score is very low, but it is quite fair, because the F1 Score balances precision against recall. Thus, we are not fooled by the 100% recall (sensitivity).
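A quick check of the F1 arithmetic in Python, using the same counts (TP = 5, FP = 95, FN = 0):

```python
# F1 Score from the counts used in the example above.
tp, fp, fn = 5, 95, 0

precision = tp / (tp + fp)                            # 0.05 -> 5%
recall = tp / (tp + fn)                               # 1.0  -> 100%
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.095, i.e. about 9.5%
```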
Conclusion
- R-squared : how well the predictions approximate the ground truth [0–100%]
- RMSE and MAE are more interpretable than R-squared; RMSE is more appropriate when large errors, such as those caused by outliers, should be penalized more heavily.
- Accuracy : This is a metric that is best used for a balanced data set
- Precision : this metric is used to avoid false positives (the observation is negative, but is predicted positive).
- Recall / Sensitivity : this metric is used to avoid false negatives (the observation is positive, but is predicted negative) and to maximize the true positive rate.
- Specificity : this metric is used to maximize the true negative rate and minimize the false positive rate.
- F1 Score : this metric is used to minimize both false positives and false negatives.
About Me
I’m a Data Scientist focusing on Machine Learning and Deep Learning. You can reach me on Medium and LinkedIn.
Linkedin arif romadhan
My Website : https://komuternak.com/
Reference
- [1] Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?
- [2] Root-mean-square deviation or Root-mean-square error
- [3] Mean absolute error
- [4] How to Calculate Precision, Recall, and F-Measure for Imbalanced Classification
- [5] Mengenal Accuracy, Precision, Recall dan Specificity serta yang diprioritaskan dalam Machine Learning