Anthony Warsah Liu

GitHub | LinkedIn | anthonywarsahliu@gmail.com

World Health Index Analysis

Kaggle source link | Download R Script | Download CSV file

The data

This was a school project analyzing a toy dataset from Kaggle with 16 variables:

  1. Country
  2. Year
  3. Headline Consumer Price Inflation
  4. Energy Consumer Price Inflation
  5. Food Consumer Price Inflation
  6. Official Core Consumer Price Inflation
  7. Producer Price Inflation
  8. GDP deflator index growth rate
  9. Continent/Region
  10. Score
  11. GDP per capita
  12. Social support
  13. Healthy life expectancy at birth
  14. Freedom to make life choices
  15. Generosity
  16. Perceptions of corruption

After downloading the CSV file and loading it into RStudio, my team wanted to test every column for significance with respect to Score (the World Happiness Index score). Multiple linear regression lets us do this.

Significant variables

With this in mind, we first looked at the missingness of each column, as a percentage of rows:

colSums(is.na(df)) / nrow(df) * 100

We then removed the columns with a large share of N/A values. We identified two such columns: Official Core Consumer Price Inflation and Producer Price Inflation.

library(dplyr)  # provides %>% and select()
df <- df %>% select(-`Official Core Consumer Price Inflation`, -`Producer Price Inflation`)
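As a rough illustration of the same step, the sketch below drops columns by a missingness cut-off instead of by name. The toy data frame and the 50% threshold are assumptions made for the example; the project itself chose the two columns by inspection.

```r
# Toy data frame standing in for the Kaggle CSV (values are made up).
toy <- data.frame(
  Score = c(5.1, 6.3, NA, 4.8, 7.0),
  `Producer Price Inflation` = c(NA, NA, NA, 2.1, NA),
  `GDP per capita` = c(1.2, 1.5, 0.9, NA, 1.8),
  check.names = FALSE  # keep the column names with spaces as they are
)

# Percentage of missing values per column.
pct_missing <- colSums(is.na(toy)) / nrow(toy) * 100

# Hypothetical cut-off: keep only columns with at most 50% missing values.
threshold <- 50
kept <- toy[, pct_missing <= threshold, drop = FALSE]

print(round(pct_missing, 1))
print(names(kept))
```

A cut-off like this makes the decision reproducible, but the right value depends on the dataset and on how much imputation one is willing to do.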

Before running the multiple linear regression, we removed Country and Continent/Region to reduce the number of binary dummy variables. This adds the assumption that country and continent/region do not affect the happiness score. We felt this was reasonable at the time, but thinking back on it now, certain countries do carry branding and inherent bias attached to their names: the United States' American dream, for example, or the stigma against countries on the African continent, or the European Union's relatively high standard of living.

After running the multiple linear regression, we found these variables to be significant:

  1. Year
  2. Energy Consumer Price Inflation
  3. GDP per capita
  4. Social support
  5. Healthy life expectancy at birth
  6. Freedom to make life choices
  7. Generosity
  8. Perceptions of corruption
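The significance test described above can be sketched with base R's lm(). The data here are simulated, the column names (gdp, social, noise, score) are placeholders, and the 0.05 cut-off is an assumption; none of it is the project's actual data or result.

```r
set.seed(1)

# Simulated stand-in for the cleaned dataset: two real predictors and one
# column of pure noise that has no effect on the score.
n <- 200
toy <- data.frame(
  gdp = rnorm(n),
  social = rnorm(n),
  noise = rnorm(n)
)
toy$score <- 5 + 0.8 * toy$gdp + 0.5 * toy$social + rnorm(n, sd = 0.5)

# Multiple linear regression of score on every other column.
fit <- lm(score ~ ., data = toy)

# Coefficient table; drop the intercept row and read off the p-values.
coefs <- summary(fit)$coefficients
p_values <- coefs[-1, "Pr(>|t|)"]

# Assumed significance level of 0.05.
significant <- names(p_values)[p_values < 0.05]
print(significant)
```

On the real data the formula would be Score ~ . after dropping Country and Continent/Region, and the same p-value column would be read from summary(fit).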

Imputation

Having found the significant columns, we needed to decide how to handle the missing values. The team agreed that we should impute them. However, because we were not part of the collection process, we had no way of knowing the relationship between the missing and the recorded data, so predictive mean matching (PMM) imputation was brought up as a method. I argued that the best practice was stochastic regression by country.

In short, we faced two choices at this point in time:

  1. Global PMM imputation: less work, but a less accurate imputation variance
  2. Stochastic linear regression imputation by country: more work, but a more accurate imputation variance
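To make the second option concrete, here is a minimal base R sketch of stochastic regression imputation run separately for each country. The helper impute_stochastic(), the single predictor, the toy data, and the minimum of three observed rows are all assumptions for illustration; this is not the code the team used.

```r
# Hypothetical helper: fill missing `target` values with a regression
# prediction plus random noise drawn from the residual spread.
impute_stochastic <- function(data, target, predictor) {
  observed <- !is.na(data[[target]]) & !is.na(data[[predictor]])
  missing <- is.na(data[[target]]) & !is.na(data[[predictor]])

  # With fewer than 3 observed rows the residual spread cannot be
  # estimated, so the group is returned unchanged.
  if (sum(observed) < 3 || sum(missing) == 0) {
    return(data)
  }

  fit <- lm(reformulate(predictor, response = target), data = data[observed, ])
  predicted <- predict(fit, newdata = data[missing, , drop = FALSE])

  # The "stochastic" part: add noise with the residual standard error.
  noise <- rnorm(sum(missing), mean = 0, sd = sigma(fit))
  data[[target]][missing] <- predicted + noise
  data
}

set.seed(42)
toy <- data.frame(
  country = rep(c("A", "B"), each = 6),
  gdp = c(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
  score = c(6.0, 6.2, NA, 6.5, 6.7, 6.8, 4.1, NA, 4.4, 4.6, 4.7, 4.9)
)

# Fit and impute one country at a time, then stack the pieces back together.
by_country <- lapply(split(toy, toy$country), impute_stochastic,
                     target = "score", predictor = "gdp")
imputed <- do.call(rbind, by_country)
print(imputed)
```

The guard on the number of observed rows reflects the practical difficulty with this option: a country with only a handful of filled-in rows gives the regression very little to work with.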

It was a big disagreement. I tried stochastic linear regression but could not make it work. I worried that there wasn't enough data for some countries, which had only filled in some categories (columns) and/or years. My solution was to take those countries out, which the team again disagreed with. We were at an impasse and needed an arbitrator, so we emailed the professor for guidance.

With this advice, it was clear that we should just go ahead with stochastic regression imputation by country, even though some countries had missing years and/or columns. During the school project, short on time and skill, we went against this directive and used a global, blanket predictive mean matching (PMM) imputation. Now, I will attempt to add blank rows for the missing years and then fill them in with imputation per country.
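One way to add the blank rows is sketched below in base R. The toy data are made up, and using the full range of years across all countries as the target grid is an assumption; a per-country range would be an equally valid choice.

```r
# Toy panel with gaps: country A lacks 2017, country B lacks 2016 and 2018.
toy <- data.frame(
  Country = c("A", "A", "A", "B", "B"),
  Year = c(2015, 2016, 2018, 2015, 2017),
  Score = c(6.0, 6.1, 6.4, 4.2, 4.5)
)

# Every Country x Year combination over the overall year range.
full_grid <- expand.grid(
  Country = unique(toy$Country),
  Year = seq(min(toy$Year), max(toy$Year)),
  stringsAsFactors = FALSE
)

# Left join: combinations absent from the data become rows with NA values.
completed <- merge(full_grid, toy, by = c("Country", "Year"), all.x = TRUE)
completed <- completed[order(completed$Country, completed$Year), ]
print(completed)
```

After this step every country has one row per year, and the NA cells are what a per-country imputation would then fill.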