Chapter 3 Data interpolation, extrapolation and imputation

As a general rule of thumb, the higher a country’s level of development, the more complete the data available. The PCA requirement for complete data is therefore only fulfilled for a limited number of relevant variables or countries. If all incomplete variables and country data led to exclusion, it would result in a very limited measure of inclusive growth and, most likely, a bias towards developed countries. It would also introduce complications in the application of PCA, as this requires a relatively large sample to produce stable results. Consequently, imputation is required to maximize the inclusion of the available source indicators and economies.

Former editions of the IGI aimed to construct the index for one year only. Data for indicators were collected and data gaps filled by various methods (linear interpolation, using external data as proxies, imputation). To minimize data loss, all countries with complete information for at least one pillar’s indicators were included in the calculations. Consequently, analyses were based on data from 167, 131 and 90 countries for the first, the second and the third pillars, respectively. However, the final index was computed only for the 86 states with complete data available for all three pillars.

Since 2023, data have been collected for all years from 2000 to the most recent available data point. Obtaining a full panel data set for all years requires the following steps:

  • Data interpolation
  • Data extrapolation
  • Data imputation

3.1 Data interpolation

Filling data gaps from external sources

Some indicators are highly correlated with another indicator or with data from alternative sources. Such proxies can help to fill data gaps, especially where no data point is available for the entire series. In that case, missing data are populated by a regression model based on the auxiliary variable. This approach is considered for the following indicators.

  • IGI 1.5: Electric power consumption, kWh/person

Gaps in electric power consumption per capita can be filled by using data from the U.S. Energy Information Administration website. The database includes values on electricity net consumption (billion kWh). Per capita values can be calculated by using population figures from WPP. These data show a very high correlation (0.9971) with the principal data source.

  • IGI 2.4: People using safely managed drinking water services (% of population)

The indicator “People using safely managed drinking water services” is seen as a more relevant indicator for medium-to-high income countries, but this variable is unavailable for some countries. However, a similar indicator “People using at least basic drinking water services” is available for almost all countries. Thus, missing data for the preferred indicator were populated using a regression model based on the highly correlated (0.8213) variable on basic water services as the auxiliary variable.

  • IGI 3.1: Income concentration ratio (Gini index), units

Gaps in the Gini index can be filled by using data from the World Income Inequality Database.

  • IGI 4.1: CO\(_2\) emissions (kg per PPP USD of GDP)

Gaps in CO\(_2\) emissions per unit of GDP can be filled by using data from EDGAR. These data show a very high correlation (0.9971) with the principal data source.

  • IGI 4.3: Efficiency of water use (water productivity)

The indicator “Efficiency of water use (water productivity)” is unavailable for some countries. However, a similar indicator “Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)” is available for almost all countries. Thus, missing data for the preferred indicator were populated using a regression model based on the highly correlated (0.815) variable on water productivity as the auxiliary variable.
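The regression-based gap filling described above can be sketched as follows. This is only a minimal illustration, assuming a simple one-variable linear regression fitted on the records where both the preferred indicator and the auxiliary variable are observed; the function names and the sample values are hypothetical and are not taken from the IGI data or code.

```python
def fit_ols(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def fill_from_proxy(target, proxy):
    """Fill None entries of `target` with predictions from a fit on `proxy`."""
    # Fit only on records where both series are observed.
    pairs = [(p, t) for p, t in zip(proxy, target)
             if p is not None and t is not None]
    a, b = fit_ols([p for p, _ in pairs], [t for _, t in pairs])
    filled = []
    for t, p in zip(target, proxy):
        if t is None and p is not None:
            filled.append(a + b * p)  # predicted from the auxiliary variable
        else:
            filled.append(t)          # observed values are kept as they are
    return filled


# Hypothetical values (% of population), for illustration only.
basic_water = [60.0, 70.0, 80.0, 90.0, 100.0]
safely_managed = [30.0, None, 50.0, None, 70.0]
filled_water = fill_from_proxy(safely_managed, basic_water)
print(filled_water)
```

Observed values are never overwritten; only the gaps receive model predictions, so the quality of the filled values depends on how strongly the two series are correlated.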

Remaining issues

Although data availability improved significantly by filling gaps from external sources, there are still some indicators identified as having lower data coverage.

  • IGI 2.1: Logistics Performance Index
    • The index is available for 139 economies. There are gaps because the index is not compiled for all years, but the overall coverage remains good.
  • IGI 2.7: Access to financial services
    • The indicator “Proportion of adults (15 years and older) with an account at a financial institution or mobile-money-service provider, by sex (% of adults aged 15 years and older)” is drawn from Gallup survey data covering almost 128,000 people in 123 economies. According to the metadata, four rounds of data collection were completed, for the years 2011, 2014, 2017 and 2022. However, only two years are available in the SDG Global Database: 2017 and 2021.
  • IGI 3.8: Ratio of female age of first marriage to male age of first marriage
    • World Marriage data combine various data sources and often include duplicates. Manual cleaning was conducted to prepare the final data set. However, data gaps still persist. For new IGI editions, a better decision rule needs to be developed to reduce the burden of manual selection during data processing.
    • The average age of first marriage is not expected to change significantly over time. Data gaps for countries with no data can be filled by calculating regional averages, for instance for M49 sub-regions.
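Filling by regional averages, as suggested for IGI 3.8, could look like the sketch below. It assumes one value per country and a lookup from country to region; the country names, region labels and ratios are hypothetical placeholders, not actual M49 codes or World Marriage data.

```python
from collections import defaultdict


def fill_with_region_mean(values, region_of):
    """Replace missing (None) country values with the mean of their region.

    values:    dict mapping country -> float or None
    region_of: dict mapping country -> region label
    Countries in a region without any observation stay missing.
    """
    observed_by_region = defaultdict(list)
    for country, value in values.items():
        if value is not None:
            observed_by_region[region_of[country]].append(value)

    filled = {}
    for country, value in values.items():
        observed = observed_by_region.get(region_of[country])
        if value is None and observed:
            filled[country] = sum(observed) / len(observed)
        else:
            filled[country] = value
    return filled


# Hypothetical ratios of female to male age at first marriage.
ratio = {"A": 0.90, "B": 0.80, "C": None, "D": None}
region = {"A": "R1", "B": "R1", "C": "R1", "D": "R2"}
filled_ratio = fill_with_region_mean(ratio, region)
print(filled_ratio)
```

A region with no observed country at all cannot be filled this way, so such cases would still have to be handled in the imputation step.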

In addition, poverty indicators such as poverty headcount ratio, Gini index or Palma ratio are not available for some countries. For instance, data for Singapore, which is very high in the ranking, are fully imputed for the indicator IGI 3.2 Poverty headcount ratio.

In 2023, there are 12 countries that lack only one data series required for inclusion in the index. These countries are potential candidates for inclusion in future editions once the data gaps are filled.

  • FB_BNK_ACCSS (IGI 2.7) is missing for Fiji, Gabon, Timor-Leste
  • MAR_AGE_MAL (IGI 3.8) is missing for Jamaica, North Macedonia
  • SI.POV.LMIC (IGI 3.2) is missing for Cambodia, New Zealand, Singapore
  • NE.EXP.GNFS.ZS (IGI 1.6) is missing for Liberia, Malawi
  • NY.ADJ.NNTY.PC.KD (IGI 1.2) is missing for Malta
  • LP.LPI.OVRL.XQ (IGI 2.1) is missing for Eswatini

Linear interpolation

In this interpolation step, for each country, series with sufficient observations (more than 8 observations) and with gaps within the observations (with a maximum gap size of 5) are linearly interpolated. Series which do not satisfy these conditions are imputed at a later stage together with series with no observations at all.
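The interpolation rule can be illustrated with the sketch below. It follows one reading of the conditions above, namely that a series qualifies only if it has more than 8 observations and no interior gap longer than 5 years; the function name and sample series are hypothetical. Leading and trailing gaps are left untouched, since they belong to the extrapolation or imputation steps.

```python
def interpolate_series(values, min_obs=8, max_gap=5):
    """Linearly interpolate interior gaps (None) if the series qualifies.

    The series is returned unchanged when it has `min_obs` or fewer
    observations, or when any interior gap is longer than `max_gap`.
    """
    observed = [i for i, v in enumerate(values) if v is not None]
    if len(observed) <= min_obs:
        return list(values)
    # Pairs of consecutive observed positions delimit the interior gaps.
    neighbours = list(zip(observed, observed[1:]))
    if any(right - left - 1 > max_gap for left, right in neighbours):
        return list(values)

    out = list(values)
    for left, right in neighbours:
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


# Hypothetical annual series with a two-year interior gap and a trailing gap.
series = [1.0, 2.0, None, None, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, None]
interpolated = interpolate_series(series)
print(interpolated)
```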

3.2 Data extrapolation

Data extrapolation (or forecasting) refers to the prediction of the future values of a series after the last observed data point and up to the last year for which a prediction is wanted (hereafter the end year). If the end year is the current year, this extrapolation is generally referred to as “nowcasting”. Forecasting values more than a couple of years beyond the available data is to be considered meaningless.

Since most series exhibit some trend, using double-exponential smoothing is likely to improve the extrapolations over simply repeating the last observed value. Moreover, this method needs fewer parameters to be estimated than ARIMA. Double-exponential smoothing is known to perform well in a wide range of forecasting tasks including for time series with relatively few observations, as is often our case. In view of the above, we found that the simplicity of the double-exponential smoothing and its better empirical performance were strong arguments for the change. For the reasons already discussed, it was also decided to perform extrapolation right after the interpolation step.
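For illustration, double-exponential smoothing (Holt's linear trend method) can be sketched in a few lines. The smoothing parameters alpha and beta are fixed here at arbitrary, hypothetical values and the initialisation is a simple common choice; in practice the parameters would be estimated from the data, and the settings used for the IGI are not stated in this chapter.

```python
def holt_forecast(series, horizon, alpha=0.5, beta=0.3):
    """Forecast `horizon` steps ahead with double-exponential smoothing.

    alpha smooths the level and beta smooths the trend (both in (0, 1]).
    The level starts at the first observation and the trend at the first
    difference, which is one simple initialisation among several.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # The h-step-ahead forecast continues the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]


# Hypothetical series with a steady upward trend, extended by two years.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
forecast = holt_forecast(observed, horizon=2)
print(forecast)
```

Unlike repeating the last observed value, the forecast follows the estimated trend, which is why only short horizons are advisable.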

Accounting for the COVID-19 pandemic: Some indicators are thought to have been affected by the COVID-19 pandemic. These indicators can be identified and extrapolated in the same way as for the Productive Capacities Index (PCI).

3.3 Data imputation

The next step is data imputation using non-parametric methods. The missing data are imputed with the missForest imputation method (Stekhoven and Buehlmann 2012), which is frequently used for mixed-type data. The imputation was conducted using the R package missForest (Stekhoven 2022).

Series which have too many missing values need to be imputed for the PCA to run: the threshold for imputation (as opposed to interpolation and extrapolation) is set to gaps of 5 or more observations (and can be changed in the settings); this of course includes cases where series have no observation at all.

MissForest method

MissForest is a nonparametric imputation method that can accommodate almost any kind of data, and is provided in the package of the same name (Stekhoven 2022). It can cope with mixed-type variables, nonlinear relations, complex interactions and high dimensionality. It only requires the observations (i.e. the rows of the data frame supplied to the function) to be pairwise independent. The algorithm is based on random forests (Breiman 2001), which are powerful predictive models that, for the sake of brevity, can be compared to very flexible nonlinear regression models.

For each variable, missForest fits a random forest on the observed part and then predicts the missing part. The algorithm repeats these two steps until a stopping criterion is met or the user-specified maximum number of iterations is reached. For further details, see Stekhoven and Buehlmann (2012).

In more detail: a decision tree starts by splitting on the most discriminative variable to create new branches. Those branches are then split by the same logic until the tree has fine-grained nodes. The average of the values in a node becomes the estimated value. By default, for continuous variables, a terminal node contains only one observation. A missing value is thus estimated as an observed value from a record with similar values on all other chosen variables. This is just one decision tree; the estimate is averaged over all 100 trees. The process is then repeated in the next iteration until the estimates of the missing values hardly change.
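The iterative structure of the procedure (initial guess, fit on the observed part, predict the missing part, repeat) can be sketched as below. This is not the missForest package: to keep the example short and dependency-free, a k-nearest-neighbour average is swapped in for the random forest, the stopping rule is a simple tolerance on the change in the imputed values, and every column is assumed to be numeric with at least one observed value. All names and data are hypothetical.

```python
def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training rows closest to `query`."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


def without_column(row, j):
    """Return the row with column j removed (the predictors for column j)."""
    return row[:j] + row[j + 1:]


def iterative_impute(rows, max_iter=10, tol=1e-6, k=2):
    """Impute None entries column by column until the estimates stabilise."""
    n_cols = len(rows[0])
    missing = [[v is None for v in row] for row in rows]
    data = [list(row) for row in rows]
    # Initial guess: the mean of the observed values in each column.
    for j in range(n_cols):
        obs = [row[j] for row in rows if row[j] is not None]
        mean_j = sum(obs) / len(obs)
        for i, row in enumerate(data):
            if missing[i][j]:
                row[j] = mean_j
    for _ in range(max_iter):
        change = 0.0
        for j in range(n_cols):
            # Train on rows where column j was originally observed.
            train_x = [without_column(r, j)
                       for i, r in enumerate(data) if not missing[i][j]]
            train_y = [r[j] for i, r in enumerate(data) if not missing[i][j]]
            for i, row in enumerate(data):
                if missing[i][j]:
                    new = knn_predict(train_x, train_y, without_column(row, j), k)
                    change += (new - row[j]) ** 2
                    row[j] = new
        if change < tol:  # estimates hardly changed: stop iterating
            break
    return data


# Hypothetical two-variable data set with one missing value.
table = [[1.0, 10.0], [2.0, 20.0], [3.0, None],
         [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]]
imputed = iterative_impute(table)
print(imputed[2])
```

As with a random forest, each estimate is an average of observed values from similar records, so the imputed value cannot leave the observed range of its variable.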

Lessons learned

MissForest is a powerful imputation method and is easy to apply. Nevertheless, it reveals certain limitations when applied to country data with significant differences in performance. A good example is GDP, which can be very low for LDCs but very high for high-income economies. The imputed values could be biased towards those extreme outlying observations.

More detailed notes resulting from using the method on the PCI data and input values for the SDG costing exercise:

  • If countries are imputed separately

    • It should not be possible to impute completely missing data for a variable - there is nothing to train the model on.
    • If there is one observed value - this will be the estimate for all missing values.
    • The range of estimated missing values is the same as for the observed values.
    • The positive is that the imputations stay in the range for the country. But, for example, if there are only data up until 2010, there is no room for growth beyond the 2010 level.
  • If all countries of a group/region are imputed together

    • There is more data to model on.
    • The range of possible imputed values is the range of that variable for the group.
    • Implicitly, if \(y\) is the imputed variable and \(x\) is the set of other variables - the relationship between \(y\) and each \(x\) is taken into account.
    • One \(x\) will be Year and supposedly there is at least one observation in the class for each year. Trends over time will be accounted for and implicitly used for extrapolation. Year can come in at any point in the decision tree. In some trees it could be the first variable to split the tree, in others it will represent the time trend among observations with similar values on one or more \(x\).
    • One \(x\) is the country itself. If that is a discriminating variable (i.e. if the values of the country are distinct from those of other countries), it will be used in the model and the estimate will look more like the other values of the country. In the current setup, all observations from a country (2000 - 2022) are, a priori, treated as independent observations. Countries with fewer observations will influence the estimates less than others.
  • If all countries, regardless of group, are imputed together

    • The logic is extended. There is even more data to model on.
    • The range of possible imputed values is now the same as the observed values of the variable in all countries.
    • One \(x\) will be the group; another \(x\) will be a broader split, such as developed/developing. If the groups actually differ on the imputed variable, this will be accounted for.
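The range limitation noted in the points above follows from the fact that tree-based predictions are averages of observed values. The sketch below illustrates this with a single-split regression tree (a stump) on the year variable, standing in for a full random forest; the series is hypothetical and grows linearly up to 2010.

```python
def fit_stump(years, values):
    """Fit a one-split regression tree on year.

    Returns (threshold, left_mean, right_mean) for the split that
    minimises the sum of squared errors.
    """
    pairs = sorted(zip(years, values))
    best = None
    for cut in range(1, len(pairs)):
        left = [v for _, v in pairs[:cut]]
        right = [v for _, v in pairs[cut:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        if best is None or sse < best[0]:
            threshold = (pairs[cut - 1][0] + pairs[cut][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(model, year):
    """Predict the mean of the leaf that `year` falls into."""
    threshold, left_mean, right_mean = model
    return left_mean if year <= threshold else right_mean


# Hypothetical series observed for 2000-2010 only, growing by 10 per year.
obs_years = list(range(2000, 2011))
obs_values = [100.0 + 10.0 * (y - 2000) for y in obs_years]
stump = fit_stump(obs_years, obs_values)
estimate_2020 = predict_stump(stump, 2020)
print(estimate_2020, max(obs_values))
```

However far the target year lies beyond the observed period, the estimate stays at or below the highest observed value, whereas continuing the linear trend would give 300 for 2020.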

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45: 5–32.
Stekhoven, Daniel J. 2022. missForest: Nonparametric Missing Value Imputation Using Random Forest. Available at cran.r-project.org/web/packages/missForest/vignettes/missForest_1.5.pdf.
Stekhoven, Daniel J., and Peter Buehlmann. 2012. “MissForest - Non-Parametric Missing Value Imputation for Mixed-Type Data.” Bioinformatics 28 (1): 112–18.