DRAFT
How: Build a model that predicts whether an individual earns more than 50k/yr, an income level associated with being a likely donor.
Data Source: 1994 US Census Data, UCI Machine Learning Repository
Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes have been made to the dataset, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries.
1.1 Data Dictionary
1.2 Simple Cleaning
1.3 Summary Statistics
1.4 Distributions
1.5 Skew and Variance
1.6 Relationships
2.1 Separate Labels from Factors
2.2 Transformation
2.2.1 Indicator Variables
2.2.2 Impact
2.2.3 Logarithmic Transform
2.2.4 Normalization and Standardization
2.4 Pipeline
3. Metrics
3.1 Accuracy
3.2 Precision
3.3 Recall
3.4 F$\beta$-Score
4. Models
4.1 Selection
4.2.1 Application
4.3 Model Application Pipeline
4.4.1 Application
4.4.2 Tuning
4.5 Random Forest
4.5.1 Application
4.5.2 Tuning
4.6 Ada Boost
4.6.1 Application
4.6.2 Tuning
4.7 Gradient Boost
4.7.1 Application
4.7.2 Tuning
4.8.1 Application
4.8.2 Tuning
4.9.1 Application
4.9.2 Tuning
4.10 Comparison
4.10.1 Feature Importance
4.10.2 Selection
4.10.3 Comp: Reduced Feature Model Performance
5. Summary
Standardizing factor names to follow PEP 8 naming conventions (lowercase words separated by underscores) can be helpful.
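One way to standardize the names is sketched here. The helper `to_snake_case` and the raw column names in the toy frame are hypothetical and only illustrate the idea; the notebook's actual renaming step may differ.

```python
import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a column name and replace hyphens, spaces and dots with underscores."""
    cleaned = name.strip().lower()
    for separator in ("-", " ", "."):
        cleaned = cleaned.replace(separator, "_")
    return cleaned


# Hypothetical raw column names, used for illustration only.
raw = pd.DataFrame(
    {"education-num": [13.0], "Capital Gain": [2174.0], "hours-per-week": [40.0]}
)

# DataFrame.rename accepts a callable that is applied to every column label.
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```

Passing a function to `rename` keeps the rule in one place, so the same convention applies to any column added later.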
Several of the variables are categorical (stored with dtype object). One-hot encoding them can be helpful.
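A minimal sketch of one-hot encoding with `pandas.get_dummies` follows. The three-row frame and its category values are made up for illustration, and the notebook may use a different encoder (for example inside a pipeline).

```python
import pandas as pd

# Toy records for illustration only; the real dataset has many more columns.
toy = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "workclass": ["State-gov", "Private", "Private"],
        "sex": ["Male", "Female", "Male"],
    }
)

# Columns to expand; listed explicitly here rather than inferred from dtypes.
categorical = ["workclass", "sex"]

# Each categorical column becomes one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=categorical)
print(encoded.columns.tolist())
```

Numeric columns such as `age` pass through unchanged, while each categorical column is replaced by as many indicator columns as it has distinct values.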
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):

# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | age | 45222 non-null | int64 |
1 | workclass | 45222 non-null | object |
2 | education_level | 45222 non-null | object |
3 | education_num | 45222 non-null | float64 |
4 | marital_status | 45222 non-null | object |
5 | occupation | 45222 non-null | object |
6 | relationship | 45222 non-null | object |
7 | race | 45222 non-null | object |
8 | sex | 45222 non-null | object |
9 | capital_gain | 45222 non-null | float64 |
10 | capital_loss | 45222 non-null | float64 |
11 | hours_per_week | 45222 non-null | float64 |
12 | native_country | 45222 non-null | object |
13 | income | 45222 non-null | object |

dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB

column | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|---|---|
age | 45222 | NaN | NaN | NaN | 38.5479 | 13.2179 | 17 | 28 | 37 | 47 | 90 |
workclass | 45222 | 7 | Private | 33307 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
education_level | 45222 | 16 | HS-grad | 14783 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
education_num | 45222 | NaN | NaN | NaN | 10.1185 | 2.55288 | 1 | 9 | 10 | 13 | 16 |
marital_status | 45222 | 7 | Married-civ-spouse | 21055 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
occupation | 45222 | 14 | Craft-repair | 6020 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
relationship | 45222 | 6 | Husband | 18666 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
race | 45222 | 5 | White | 38903 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
sex | 45222 | 2 | Male | 30527 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
capital_gain | 45222 | NaN | NaN | NaN | 1101.43 | 7506.43 | 0 | 0 | 0 | 0 | 99999 |
capital_loss | 45222 | NaN | NaN | NaN | 88.5954 | 404.956 | 0 | 0 | 0 | 0 | 4356 |
hours_per_week | 45222 | NaN | NaN | NaN | 40.938 | 12.0075 | 1 | 40 | 40 | 45 | 99 |
native_country | 45222 | 41 | United-States | 41292 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
income | 45222 | 2 | <=50K | 34014 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

Number of observations: 45222
Number of people with income > 50k: 11208
Number of people with income <= 50k: 34014
Percent of people with income > 50k: 24.78
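
The class-balance figures above can be reproduced along these lines. This is a sketch, not the notebook's own code: the label column is rebuilt from the reported counts, and the label string `">50K"` is an assumption (only `"<=50K"` appears in the summary table).

```python
import pandas as pd

# Rebuild an income column from the reported class counts.
# The ">50K" label text is assumed; "<=50K" is taken from the summary table.
income = pd.Series(["<=50K"] * 34014 + [">50K"] * 11208, name="income")

n_records = len(income)
n_greater_50k = int((income == ">50K").sum())
n_at_most_50k = int((income == "<=50K").sum())
greater_percent = 100 * n_greater_50k / n_records

print(f"Number of observations: {n_records}")
print(f"Number of people with income > 50k: {n_greater_50k}")
print(f"Number of people with income <= 50k: {n_at_most_50k}")
print(f"Percent of people with income > 50k: {greater_percent:.2f}")
```

Roughly one record in four belongs to the higher-income class, so the classes are imbalanced, which is worth keeping in mind when reading accuracy alongside precision and recall.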