DRAFT
How: Construct a model that predicts whether an individual earns more than 50K per year, an income level associated with being a likely donor.
Data Source: 1994 US Census data, from the UCI Machine Learning Repository
Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes have been made to the dataset, such as removing the 'fnlwgt' feature and dropping records with missing or ill-formatted entries.
1.1 Data Dictionary
1.2 Simple Cleaning
1.3 Summary Statistics
1.4 Distributions
1.5 Skew and Variance
1.6 Relationships
2.1 Separate Labels from Factors
2.2 Transformation
2.2.1 Indicator Variables
2.2.2 Impact
2.2.3 Logarithmic Transform
2.2.4 Normalization and Standardization
2.4 Pipeline
3. Metrics
3.1 Accuracy
3.2 Precision
3.3 Recall
3.4 F$\beta$-Score
4. Models
4.1 Selection
4.2.1 Application
4.3 Model Application Pipeline
4.4.1 Application
4.4.2 Tuning
4.5 Random Forest
4.5.1 Application
4.5.2 Tuning
4.6 Ada Boost
4.6.1 Application
4.6.2 Tuning
4.7 Gradient Boost
4.7.1 Application
4.7.2 Tuning
4.8.1 Application
4.8.2 Tuning
4.9.1 Application
4.9.2 Tuning
4.10 Comparison
4.10.1 Feature Importance
4.10.2 Selection
4.10.3 Comp: Reduced Feature Model Performance
5. Summary
Standardizing factor names to follow PEP 8 naming conventions (lowercase words separated by underscores) makes them easier to reference consistently in code.
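One way to standardize the names is sketched below with pandas. The helper `to_snake_case` and the sample column names are illustrative assumptions, not the notebook's actual code or the exact headers in the file.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


# Hypothetical sample with inconsistently styled column names.
df = pd.DataFrame(
    {
        "education-num": [13, 9],
        "Marital Status": ["Never-married", "Divorced"],
        " hours-per-week": [40, 50],
    }
)

# rename() accepts a callable and applies it to every column label.
df = df.rename(columns=to_snake_case)
print(list(df.columns))
```

Passing a function to `rename` keeps the rule in one place, so any columns added later can be normalized the same way.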
The dataset contains a number of categorical variables. One-hot encoding converts each of them into numeric indicator columns that the models can use.
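A minimal sketch of one-hot encoding with `pandas.get_dummies` follows. The three-row frame and the choice of `workclass` and `sex` as the categorical columns are assumptions for illustration; the real dataset has more rows and more categorical factors.

```python
import pandas as pd

# Hypothetical miniature of the census data.
df = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "workclass": ["State-gov", "Private", "Private"],
        "sex": ["Male", "Female", "Male"],
    }
)

# Columns to expand into indicator variables (assumed for this sketch).
categorical_cols = ["workclass", "sex"]

# Each category value becomes its own 0/1 column; numeric columns pass through.
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)
print(encoded.columns.tolist())
```

Each original categorical column is replaced by one indicator column per distinct value, so the feature count grows with the number of categories.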