Chapter 3 Conclusions

Both machine learning and statistical learning techniques are tools with great potential in many areas of society. Their impact is growing; however, as this project has shown, the models require analysis and understanding if they are to be optimized for the application in which they are used.

The metrics obtained by the models are the following (the neural networks are excluded because of their poor performance, caused by the scarcity of data and the need for a very specific architecture):

Model            RMSE     MAE
KNN             97.84   95.75
SVMlinear       60.55   54.54
SVMradial      144.98  144.62
DT              12.24    7.8
RF              32.28   29.83
XGB             36.71   35.75
Ensemble (RF)    9.38    6.9
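
For reference, the two metrics reported above can be computed as in the following minimal sketch. It uses the standard definitions of RMSE and MAE; the function names and the sample values are hypothetical and are not taken from the project's data or code.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical daily case counts, for illustration only.
actual = [100, 120, 130, 150]
predicted = [110, 115, 140, 145]

print(rmse(actual, predicted))
print(mae(actual, predicted))
```

RMSE is never smaller than MAE, and the two coincide only when every absolute error has the same size. A model whose RMSE and MAE are very close therefore makes errors of nearly uniform magnitude, whereas a larger gap points to a few large errors.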

Among the individual models, the decision tree obtains the best values on both metrics, followed by random forest and extreme gradient boosting. As discussed in Chapter 2, the decision tree probably achieves such a good result because of the weak relationship between the distribution of the training set and that of the test set. On the other hand, in this particular problem the bagging approach, which reduces the variance of the final estimator, works slightly better than the boosting approach (RMSE of 32.28 for RF versus 36.71 for XGB).

Likewise, the remaining models do not obtain very good results, precisely because of the difference between the training and test sets. This is where the power of ensembles lies: the ensemble draws on the best part of each model to make its predictions, and thereby achieves the best result of all the models tested (RMSE 9.38, MAE 6.9).

The results also have a positive side in the social context of the problem: almost all the models overestimate the number of new cases. Although overestimation is far from ideal in terms of resource management, it is better to err on the high side than to lack resources for new cases, who are ultimately COVID-19 patients at risk of losing their lives.
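
Whether a model tends to overestimate can be checked with the sign of its errors, which RMSE and MAE discard. The following is a minimal sketch under that assumption; the function names and the sample values are hypothetical and do not come from the project.

```python
def mean_error(actual, predicted):
    """Mean signed error (predicted minus actual); positive means overestimation."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


def overestimation_share(actual, predicted):
    """Fraction of observations where the prediction exceeds the actual value."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    over = sum(1 for a, p in zip(actual, predicted) if p > a)
    return over / len(actual)


# Hypothetical daily case counts, for illustration only.
actual = [100, 120, 130, 150]
predicted = [110, 115, 140, 160]

print(mean_error(actual, predicted))
print(overestimation_share(actual, predicted))
```

A positive mean error together with a share well above one half would be consistent with the systematic overestimation described above.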

Finally, as future work it would be interesting to enlarge the data sample in order to better predict the following days. As the sample grows, it would also become increasingly worthwhile to use neural networks, which, with a suitable architecture, are probably capable of obtaining very good prediction results.

Regarding the work done so far, it would be interesting to continue working with ensembles to improve the results. Once a good and reliable model is available, it could be put into production by connecting it to real-time data, yielding a decision support system based on statistical and computational tools.