DRAFT
How: Construct a model that predicts whether an individual earns more than $50k per year, an income level associated with being a likely donor.
Data Source: 1994 US Census Data, UCI Machine Learning Repository
Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes to the dataset have been made, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries.
1.1 Data Dictionary
1.2 Simple Cleaning
1.3 Summary Statistics
1.4 Distributions
1.5 Skew and Variance
1.6 Relationships
2.1 Separate Labels from Factors
2.2 Transformation
2.2.1 Indicator Variables
2.2.2 Impact
2.2.3 Logarithmic Transform
2.2.4 Normalization and Standardization
2.4 Pipeline
3. Metrics
3.1 Accuracy
3.2 Precision
3.3 Recall
3.4 F$\beta$-Score
4. Models
4.1 Selection
4.2.1 Application
4.3 Model Application Pipeline
4.4.1 Application
4.4.2 Tuning
4.5 Random Forest
4.5.1 Application
4.5.2 Tuning
4.6 Ada Boost
4.6.1 Application
4.6.2 Tuning
4.7 Gradient Boost
4.7.1 Application
4.7.2 Tuning
4.8.1 Application
4.8.2 Tuning
4.9.1 Application
4.9.2 Tuning
4.10 Comparison
4.10.1 Feature Importance
4.10.2 Selection
4.10.3 Comp: Reduced Feature Model Performance
5. Summary
import numpy as np # Library for numerical computing with Python
import pandas as pd # Library to work with data in tabular form and the like
from time import time # Package to work with time values
from multiprocessing import Pool # Process pool for running work in parallel across CPU cores
from IPython.display import display # Allows the use of display() for DataFrames
import matplotlib.pyplot as plt # Package for plotting
import seaborn as sns # Library for plotting, prettier than matplotlib
import visuals as vs # Adapted from Udacity
import visualization # Module for creating plots more simply
import plotly.graph_objects as go # Interactive plots
import plotly.express as px # Interactive plots
from plotly.subplots import make_subplots # Interactive plots
from dython.nominal import associations # Categorical plots
import modeling # Module for simplifying modeling items
import statsmodels.api as sm # Statistical analysis toolbox
from scipy.stats import skew # Tool to evaluate statistical measure
from sklearn.preprocessing import MinMaxScaler # Feature scaling tool
from sklearn.model_selection import train_test_split, GridSearchCV # Data splitting and tuning
from sklearn.naive_bayes import MultinomialNB # Naive Bayes Classifier model
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV # Logistic Regression model
from sklearn.svm import SVC # Support Vector Machine
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier # Ensemble models
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import fbeta_score, accuracy_score, make_scorer # Model metrics
from sklearn.base import clone
import xgboost as xgb
# iPython Notebook formatting
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Account for changes made to imported packages
%load_ext autoreload
%autoreload 2
data = pd.read_csv("census.csv")
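Standardizing the factor (column) names to follow PEP 8 naming conventions, lowercase with underscores, makes them easier to reference in code.
Several of the variables are categorical. One-hot encoding converts each of them into numeric indicator columns that the models can use.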
name_changes = {x: x.lower().replace("-", "_") for x in data.columns.tolist() if "-" in x}
data = data.rename(columns=name_changes)
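The one-hot encoding mentioned above could look roughly like the following. This is only a sketch on a made-up toy frame, not the notebook's actual transformation step; the frame `toy` and its values are hypothetical, and `pd.get_dummies` is one of several ways to build indicator variables.

```python
import pandas as pd

# Hypothetical toy frame; the column names mimic the census data, the values are made up.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "marital_status": ["Never-married", "Married-civ-spouse", "Divorced"],
})

# Replace the categorical column with one 0/1 indicator column per category.
# Numeric columns such as 'age' pass through unchanged.
encoded = pd.get_dummies(toy, columns=["marital_status"], dtype=int)

print(encoded.columns.tolist())
```

Each category becomes its own column named `<original column>_<category>`, so a column with many distinct values widens the frame considerably.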
data.info(show_counts=True) # Show non-null counts and dtype per column ('null_counts' was deprecated in pandas 1.2 and later removed)
data.describe(include='all').T # Summarize each factor, transpose the summary
n_records = data.shape[0] # First element of .shape indicates n
n_greater_50k = data[data['income'] == '>50K'].shape[0] # n of those with income > 50k
n_at_most_50k = data.where(data['income'] == '<=50K').dropna().shape[0] # .where() requires dropping na for this
greater_percent = round((n_greater_50k / n_records)*100,2) # Show proportion of > 50k to whole data
data_details = {"Number of observations": n_records,
                "Number of people with income > 50k": n_greater_50k,
                "Number of people with income <= 50k": n_at_most_50k,
                "Percent of people with income > 50k": greater_percent} # Cache values of analysis
for item in data_details: # Iterate through the cache
    print("{0}: {1}".format(item, data_details[item])) # Print the values
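The class counts and percentage computed above can also be obtained with `value_counts`. The sketch below uses a made-up stand-in for the `income` column (`toy_income` is hypothetical), so the numbers it prints say nothing about the census data itself.

```python
import pandas as pd

# Hypothetical stand-in for the 'income' column; the values are made up.
toy_income = pd.Series([">50K", "<=50K", "<=50K", "<=50K"], name="income")

counts = toy_income.value_counts()                # number of records per class
shares = toy_income.value_counts(normalize=True)  # fraction of records per class

# Percentage of records in the '>50K' class, rounded as in the cell above.
greater_percent_toy = round(shares[">50K"] * 100, 2)

print(counts.to_dict())
print(greater_percent_toy)
```

Because `value_counts` reads only the one column, it does not depend on other columns being free of missing values, which the `.where(...).dropna()` approach does.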
fig = px.histogram(data, x="income", nbins=2)
fig.update_layout(height=600, width=750,
                  title_text="Distribution of Income",
                  showlegend=False)
fig.update_yaxes(title_text="Number of Records")
fig.show()