DRAFT

Project Overview

  • Goal: Help CharityML maximize the likelihood of receiving donations
  • How: Construct a model that predicts whether an individual earns more than $50K/yr, an income level associated with being a likely donor

  • Data Source: 1994 US Census data, from the UCI Machine Learning Repository

Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes have been made to the dataset, such as removing the 'fnlwgt' feature and dropping records with missing or ill-formatted entries.
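
The cleaning described in the note was done before this notebook, and the exact steps are not shown here. As a rough sketch only, assuming the raw file marks missing entries with "?" (possibly padded with spaces), such a step might look like the following. The helper name clean_raw_census and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

def clean_raw_census(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop 'fnlwgt' and any row with a missing entry."""
    cleaned = raw.drop(columns=["fnlwgt"], errors="ignore")
    # Assumption: a missing value appears as "?" with optional surrounding spaces
    cleaned = cleaned.replace(r"^\s*\?\s*$", np.nan, regex=True)
    return cleaned.dropna().reset_index(drop=True)

# Made-up rows for illustration only, not taken from census.csv
raw_example = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", " ?", "Private"],
    "fnlwgt": [77516, 83311, 215646],
    "income": ["<=50K", "<=50K", ">50K"],
})
cleaned_example = clean_raw_census(raw_example)
print(cleaned_example)
```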

In [1]:
import numpy as np                                # Library for numerical computing with Python
import pandas as pd                               # Library to work with data in tabular form and the like
from time import time                             # Package to work with time values
from multiprocessing import Pool                  # Process pool for running work in parallel across CPU cores
In [2]:
from IPython.display import display               # Allows the use of display() for DataFrames
import matplotlib.pyplot as plt                   # Package for plotting
import seaborn as sns                             # Library for plotting, prettier than matplotlib
import visuals as vs                              # Adapted from Udacity
import visualization                              # Module for creating plots more simply
import plotly.graph_objects as go                 # Interactive plots
import plotly.express as px                       # Interactive plots
from plotly.subplots import make_subplots         # Interactive plots
from dython.nominal import associations           # Categorical plots
In [68]:
import modeling                                                            # Module for simplifying modeling items
import statsmodels.api as sm                                               # Statistical analysis toolbox
from scipy.stats import skew                                               # Tool to evaluate statistical measure
from sklearn.preprocessing import MinMaxScaler                             # Feature scaling tool
from sklearn.model_selection import train_test_split, GridSearchCV         # Data splitting and tuning 
from sklearn.naive_bayes import MultinomialNB                              # Naive Bayes Classifier model
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV  # Logistic Regression model
from sklearn.svm import SVC                                                # Support Vector Machine
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier    # Ensemble models
from sklearn.ensemble import GradientBoostingClassifier                    # Gradient boosting ensemble model
from sklearn.neighbors import KNeighborsClassifier                         # k-nearest neighbors model
from sklearn.metrics import fbeta_score, accuracy_score, make_scorer       # Model metrics
from sklearn.base import clone                                             # Unfitted copy of an estimator
import xgboost as xgb                                                      # XGBoost gradient boosting library
In [4]:
# iPython Notebook formatting
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Account for changes made to imported packages
%load_ext autoreload
%autoreload 2
In [5]:
data = pd.read_csv("census.csv")

1.1 EDA: Data Dictionary

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • education_level: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
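
One way to use the dictionary above is as a sanity check on the loaded data. The sketch below compares the observed values of a categorical column against its listed categories. It runs on a small made-up frame named toy, not the real census.csv, and the expected mapping covers only two columns for brevity.

```python
import pandas as pd

# Toy stand-in for the census frame (made-up rows, for illustration only)
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "White"],
})

# Allowed categories copied from the data dictionary
expected = {
    "sex": {"Female", "Male"},
    "race": {"Black", "White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other"},
}

# For each column, collect any observed value that the dictionary does not list
unexpected = {col: set(toy[col].unique()) - allowed
              for col, allowed in expected.items()}
print(unexpected)   # {'sex': set(), 'race': set()}
```

An empty set for every column means no value falls outside the dictionary.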

1.2 EDA: Simple Cleaning and Engineering

Renaming the columns to snake_case, in line with PEP 8 naming conventions, keeps them consistent and easy to reference in code.

Several of the variables are categorical. One-hot encoding converts them into numeric indicator columns that the models can use.
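
As a minimal sketch of one-hot encoding, assuming pandas' get_dummies is an acceptable tool for it, the example below encodes one categorical column and maps the income label to 0/1. The frame toy is made up for illustration; the notebook itself may encode the data differently.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Female", "Male"],
    "income": ["<=50K", "<=50K", ">50K"],
})

# One indicator column per category; numeric columns pass through unchanged
features = pd.get_dummies(toy.drop(columns="income"), columns=["sex"])

# Binary target: 1 for income above 50K, 0 otherwise
target = (toy["income"] == ">50K").astype(int)

print(features.columns.tolist())   # ['age', 'sex_Female', 'sex_Male']
print(target.tolist())             # [0, 0, 1]
```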

In [6]:
name_changes = {x: x.lower().replace("-", "_") for x in data.columns.tolist() if "-" in x}
data = data.rename(columns=name_changes)
In [7]:
data.info(show_counts=True)   # Show information for each factor: non-null counts and data type of column (null_counts was removed in pandas 2.0)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              45222 non-null  int64  
 1   workclass        45222 non-null  object 
 2   education_level  45222 non-null  object 
 3   education_num    45222 non-null  float64
 4   marital_status   45222 non-null  object 
 5   occupation       45222 non-null  object 
 6   relationship     45222 non-null  object 
 7   race             45222 non-null  object 
 8   sex              45222 non-null  object 
 9   capital_gain     45222 non-null  float64
 10  capital_loss     45222 non-null  float64
 11  hours_per_week   45222 non-null  float64
 12  native_country   45222 non-null  object 
 13  income           45222 non-null  object 
dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB
In [8]:
data.describe(include='all').T    # Summarize each factor, transpose the summary
Out[8]:
                 count  unique                 top   freq     mean      std  min  25%  50%  75%    max
age              45222     NaN                 NaN    NaN  38.5479  13.2179   17   28   37   47     90
workclass        45222       7             Private  33307      NaN      NaN  NaN  NaN  NaN  NaN    NaN
education_level  45222      16             HS-grad  14783      NaN      NaN  NaN  NaN  NaN  NaN    NaN
education_num    45222     NaN                 NaN    NaN  10.1185  2.55288    1    9   10   13     16
marital_status   45222       7  Married-civ-spouse  21055      NaN      NaN  NaN  NaN  NaN  NaN    NaN
occupation       45222      14        Craft-repair   6020      NaN      NaN  NaN  NaN  NaN  NaN    NaN
relationship     45222       6             Husband  18666      NaN      NaN  NaN  NaN  NaN  NaN    NaN
race             45222       5               White  38903      NaN      NaN  NaN  NaN  NaN  NaN    NaN
sex              45222       2                Male  30527      NaN      NaN  NaN  NaN  NaN  NaN    NaN
capital_gain     45222     NaN                 NaN    NaN  1101.43  7506.43    0    0    0    0  99999
capital_loss     45222     NaN                 NaN    NaN  88.5954  404.956    0    0    0    0   4356
hours_per_week   45222     NaN                 NaN    NaN   40.938  12.0075    1   40   40   45     99
native_country   45222      41       United-States  41292      NaN      NaN  NaN  NaN  NaN  NaN    NaN
income           45222       2               <=50K  34014      NaN      NaN  NaN  NaN  NaN  NaN    NaN
In [9]:
n_records = data.shape[0]                                               # First element of .shape indicates n
n_greater_50k = data[data['income'] == '>50K'].shape[0]                 # n of those with income > 50k
n_at_most_50k = data[data['income'] == '<=50K'].shape[0]                # n of those with income <= 50k
greater_percent = round((n_greater_50k / n_records)*100,2)              # Show proportion of > 50k to whole data

data_details = {"Number of observations": n_records,
                "Number of people with income > 50k": n_greater_50k,
                "Number of people with income <= 50k": n_at_most_50k,
                "Percent of people with income > 50k": greater_percent}     # Cache values of analysis

for item in data_details:                                                   # Iterate through the cache
    print("{0}: {1}".format(item, data_details[item]))                      # Print the values
Number of observations: 45222
Number of people with income > 50k: 11208
Number of people with income <= 50k: 34014
Percent of people with income > 50k: 24.78
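
The same class counts and percentages can also be read off in one step with value_counts. The sketch below uses a made-up Series named toy_income rather than the real income column, so its numbers differ from those above.

```python
import pandas as pd

# Made-up labels for illustration only
toy_income = pd.Series(["<=50K", "<=50K", ">50K", "<=50K"], name="income")

counts = toy_income.value_counts()                                  # Records per class
shares = toy_income.value_counts(normalize=True).mul(100).round(2)  # Percent per class

# Both Series are indexed by class label, so they align into one table
summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```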
In [10]:
fig = px.histogram(data, x="income", nbins=2)
fig.update_layout(height=600, width=750,
                  title_text="Distribution of Income",
                  showlegend=False)
fig.update_yaxes(title_text="Number of Records")
fig.show()