World Health Index Analysis

Kaggle source link | Download R Script | Download CSV file

The data

A school project analyzing a toy dataset from Kaggle consisting of 16 variables:

Country
Year
Headline Consumer Price Inflation
Energy Consumer Price Inflation
Food Consumer Price Inflation
Official Core Consumer Price Inflation
Producer Price Inflation
GDP deflator index growth rate
Continent/Region
Score
GDP per capita
Social support
Healthy life expectancy at birth
Freedom to make life choices
Generosity
Perceptions of corruption

After the csv file is downloaded and put into RStudio, my team wished to test all the columns for significance in regards to score (World Happiness Index score), we could do this using multiple linear regression.

Significant variables

With this intent, we first looked at the missing-ness of each column:

colSums(is.na(df)) / nrow(df) * 100

And removed those columns that have a significant amount of N/A values. We identified 2 such columns; Official Core Consumer Price Inflation and Producer Price Inflation.

df <- df %>% select(-`Official Core Consumer Price Inflation`) %>% select(-`Producer Price Inflation`)

Before doing multiple linear regression, we removed Country and Continent/Region to reduce the amount of dummy binary variables. This adds an assumption that countries and continents/regions does not affect the happiness score, a reasonable assumption we felt - but thinking back on it now, certain countries do have branding and inherent bias to its name. The United States' American dream, for example, or the stigma against countries in the African continent, or the European Union's relatively high standard of living, etcetera.

After running multiple linear regression we have discovered these variables are significant:

Year
Energy Consumer Price Inflation
GDP per Capita
Social support
Healthy life expectancy at birth
Freedom to make life choices
Generosity
Perceptions of corruption

Imputation

After finding out the significant columns, we need to decide how to handle all the missing values. The team agreed we should do imputation, however, with no way to know the relationship between the missing and recorded data (as we are not part of the collection process), predictive mean matching (PMM) imputation was brought up as a method. I argued that the best practice was stochastic regression by country.

In short, we faced two choices at this point in time:

Global PMM imputation - less work, less accurate imputation variance
by Country stochastic linear regression imputation - more work, more accurate imputation variance

It was a big disagreement - I tried stochastic linear regression but could not make it work, I worried there wasn't enough data for some countries which only filled in some categories(columns) and/or years, my solution was to take those countries out, which was disagreed on by the team again. At an impasse, there was a need for an arbitrator, so we emailed the professor for guidance

With this advice, it was clear that we should just go ahead with stochastic regression imputation on the countries regardless that some of them had missing years and/or columns. During the school project, out of time and skill, we had gone against this directive and used a global blanket predictive mean method (PMM) of imputation, but now, I will attempt to fill in missing years with blank rows and attempt to fill them out with imputation per country.