# 6 Feature Selection and Modeling

It is nearly impossible to label every observation or agent correctly as corrupt or not. We therefore chose to characterize potentially corrupt cases through the suspicious behaviour of the agents involved in the process, looking at two main dimensions:

## 6.1 Missing Data

Owing to the amount of missing and dubious information in Compranet’s database, which is not necessarily related to corruption, we created an index that grades agencies and RUs by how clearly they report public information. The index accounts for the amount of missing or dubious observations both for each variable and for each agency or RU.

Specifically, we counted the missing observations in each column by agency and by RU and weighted that count twice: first, by the share of the column's missing observations that belongs to the agency or RU, and second, by the share of the column's observations that belongs to the agency or RU. Finally, we combined the results for all columns into a single index per agency or RU by summing them.

$I_i = \sum_j \text{missing}_{i,j} \cdot \frac{\text{missing}_{i,j}}{\text{missing}_j} \cdot \frac{\text{observations}_{i,j}}{\text{observations}_j}$ where $j$ represents columns and $i$ represents either agencies or RUs. A subscript $j$ alone denotes the total over all agencies or RUs in column $j$.
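The weighting described in the prose above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the report: the function name `missing_data_index`, the record layout (a list of dictionaries), and the rule that `None` or an empty string counts as missing are all hypothetical. It also assumes that every record counts as one observation in every column, whether or not the value is missing, and it treats the two weights as shares, as the text describes them.

```python
from collections import defaultdict


def missing_data_index(rows, unit_key, columns):
    """Return {unit: index}, where a unit is an agency or an RU.

    Assumptions (not from the source): rows are dicts, a value of None
    or "" is missing, and each row is one observation in every column.
    """
    missing_ij = defaultdict(lambda: defaultdict(int))  # unit -> column -> missing count
    obs_ij = defaultdict(int)                           # unit -> number of records
    missing_j = defaultdict(int)                        # column -> total missing count
    obs_j = 0                                           # total records (same for every column)

    for row in rows:
        unit = row[unit_key]
        obs_ij[unit] += 1
        obs_j += 1
        for col in columns:
            if row.get(col) in (None, ""):
                missing_ij[unit][col] += 1
                missing_j[col] += 1

    index = {}
    for unit, n_obs in obs_ij.items():
        total = 0.0
        for col in columns:
            if missing_j[col] == 0:
                continue  # a column with no missing values contributes nothing
            m = missing_ij[unit][col]
            share_missing = m / missing_j[col]  # unit's share of the column's missing values
            share_obs = n_obs / obs_j           # unit's share of the column's observations
            total += m * share_missing * share_obs
        index[unit] = total
    return index


# Toy data with hypothetical field names.
example_rows = [
    {"agency": "A", "amount": None, "supplier": "x"},
    {"agency": "A", "amount": None, "supplier": None},
    {"agency": "B", "amount": 10, "supplier": "y"},
    {"agency": "B", "amount": None, "supplier": "z"},
]
example_index = missing_data_index(example_rows, "agency", ["amount", "supplier"])
```

In this toy data, agency A holds two of the three missing `amount` values and the only missing `supplier` value, so it scores higher than agency B. The same function grades RUs if `unit_key` points at the RU field instead.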

knitr::include_graphics("images/indice_info.png")

## 6.2 Compliance with Transparency Laws

Although the way in which RUs report some of the information may not be ideal for analysis, this does not mean that they are failing to comply with transparency laws on government procurement.

To obtain an accurate indicator of compliance with transparency laws, and to evaluate that compliance with respect to corruption in government procurement, we studied carefully how Compranet’s platform works in order to identify deliberate omissions of information. These deliberate omissions were coded differently from missing values in fields that are not available to a given procurement. We did this carefully, cross-referencing with other existing databases, such as RUPC (short for ’Registro Unico de Proveedores y Contratistas’), before assigning codes to missing observations.
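The distinction between the two kinds of missing value can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the code labels `OMITTED` and `NOT_APPLICABLE`, the field names, the idea of a per-procurement set of applicable fields, and the rule that an omission is only coded as deliberate when an external registry record (standing in for a source such as RUPC) shows the value exists. The actual coding rules used in the analysis are not specified in this section.

```python
NOT_APPLICABLE = "NOT_APPLICABLE"  # hypothetical code: field does not apply to this procurement
OMITTED = "OMITTED"                # hypothetical code: value exists elsewhere but was not reported


def recode_missing(record, applicable_fields, registry_record):
    """Return a copy of `record` with missing values replaced by a code.

    record            -- dict of field -> value for one procurement
    applicable_fields -- set of fields this procurement is expected to report
    registry_record   -- dict with values known from an external registry
    """
    recoded = {}
    for field, value in record.items():
        if value not in (None, ""):
            recoded[field] = value           # reported value, keep as is
        elif field not in applicable_fields:
            recoded[field] = NOT_APPLICABLE  # missing because the field does not apply
        elif registry_record.get(field) not in (None, ""):
            recoded[field] = OMITTED         # the registry shows the information exists
        else:
            recoded[field] = None            # undetermined, left for manual review
    return recoded


# Toy example with hypothetical fields.
example_record = {
    "supplier_id": "S1",
    "supplier_size": "",
    "framework_contract": None,
    "amount": None,
}
example_recoded = recode_missing(
    example_record,
    applicable_fields={"supplier_id", "supplier_size", "amount"},
    registry_record={"supplier_size": "small"},
)
```

Leaving undetermined cases as `None` rather than forcing a code mirrors the cautious approach described above: a value is only counted as a deliberate omission once another source confirms it.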

The new coding of missing values allows us to apply the missing data index described above to grade agencies and buying units according to their level of compliance with transparency laws on government procurement.