Aaditya Bansal

California Housing Price Prediction - 4. Modeling

Housing Price Prediction


Welcome to my very first Machine Learning / Data Science Project.


This post is a continuation of Part 1 - Data Extraction, Part 2 - EDA and Visualization, and Part 3 - Preprocessing of this project; please check them out if you haven't already.


I will be sharing the process and updates through these blog posts.


In this blog post I give an overview of the project and focus on the most important part of a Machine Learning / Data Science project: Modeling!!


 

Overview

This project notebook covers all the necessary steps to complete the machine learning task of predicting housing prices on the California Housing Dataset available in scikit-learn. We will perform the following steps to successfully create a model for house price prediction:

1. Data Extraction (See Details in Previous Blog)

  • Import libraries

  • Import Dataset from scikit-learn

  • Understanding the given Description of Data and the problem Statement

  • Take a look at the different inputs and details available with the dataset

  • Storing the obtained dataset into a Pandas Data frame

2. EDA (Exploratory Data Analysis) and Visualization (See Details in Previous Blog)

  • Getting a closer Look at obtained Data

  • Exploring different Statistics of the Data (Summary and Distributions)

  • Looking at Correlations (between individual features and between Input features and Target)

  • Geospatial Data / Coordinates - Longitude and Latitude features

3. Preprocessing (See Details in Previous Blog)

  • Dealing with Duplicate and Null (NaN) values

  • Dealing with Categorical features (e.g. Dummy coding)

  • Dealing with Outlier values

    • Visualization (Box-Plots)

    • Using IQR

    • Using Z-Score

  • Separating Target and Input Features

  • Target feature Normalization (Plots and Tests)

  • Splitting Dataset into train and test sets

  • Feature Scaling (Feature Transformation)

4. Modeling (Detailed in this Blog)

  • Specifying the Evaluation Metric - R squared (using Cross-Validation)

  • Model Training - trying multiple models and hyperparameters

  • Model Selection (by comparing evaluation metrics)

  • Learning Feature Importance and Relations

  • Prediction on new data

5. Deployment

  • Exporting the trained model to be used for later predictions (by storing the model object as a byte file - Pickling); a brief sketch follows this list.
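As a preview of that final step, here is a minimal sketch of how the export could work, assuming the chosen model object is named gradient_boosting_model (the model selected later in this post) and using an arbitrary file name:

import pickle

#SKETCH ONLY - "housing_price_model.pkl" IS AN ARBITRARY FILE NAME FOR ILLUSTRATION
with open("housing_price_model.pkl", "wb") as f:
    pickle.dump(gradient_boosting_model, f)

#THE STORED BYTE FILE CAN LATER BE LOADED BACK TO MAKE PREDICTIONS
with open("housing_price_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)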

 

4. Modeling

 

Specifying Evaluation Metric - R squared (using Cross-Validation)

To check the quality and accuracy of our predictions we will use the R squared score to evaluate our models. We will compute the R squared score with cross-validation to estimate how well each model will generalize to future unseen values.

#FUNCTION TO CALCULATE THE EVALUATION SCORE USING CROSS-VALIDATION AND R SQUARED SCORING
from sklearn.model_selection import cross_val_score

def calculate_eval_metric(model, X, y, cv = 3):
    scores = cross_val_score(model, X, y, cv = cv, scoring = 'r2')
    print("Evaluation score on 3 cross-validation sets : ", scores)
    print("Average R squared score : ", scores.mean())
    return scores.mean()

Dictionary to store the cross-validation metric for each model:

#DICTIONARY TO STORE CV SCORES
cv_scores = {}
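For intuition, here is a small illustrative sketch (not from the notebook) of what the R squared score measures, using made-up numbers and sklearn.metrics.r2_score:

import numpy as np
from sklearn.metrics import r2_score

#HYPOTHETICAL TRUE AND PREDICTED TARGET VALUES, FOR ILLUSTRATION ONLY
y_true = np.array([2.5, 0.5, 2.0, 3.0])
y_pred = np.array([2.4, 0.8, 1.9, 3.2])

#R SQUARED = 1 - (RESIDUAL SUM OF SQUARES / TOTAL SUM OF SQUARES)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)        #manual computation
print(r2_score(y_true, y_pred))   #same value from scikit-learn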

 

Model Training - trying multiple models and hyperparameters

 

Linear Regression

Linear Regression can act as a baseline model to get an understanding of how well we can perform using a simple model without much fine-tuning.

#IMPORTING LINEAR REGRESSOR IMPLEMENTATION FROM SKLEARN
from sklearn.linear_model import LinearRegression

#CREATING A MODEL OBJECT AND TRAINING
linear_regression_model = LinearRegression()
cv_scores['linear_regression_model'] = calculate_eval_metric(linear_regression_model, X_train, y_train)

Output:
Evaluation score on 3 cross-validation sets :  [0.61733889 0.60654907 0.5905632 ]
Average R squared score :  0.6048170562554777

 

Polynomial Regression

We can try to create polynomial features in order to learn more non-linear (complex) relationships between the features.

#CREATING POLYNOMIAL FEATURES TO LEARN MORE COMPLEX RELATIONS BETWEEN FEATURES
from sklearn.preprocessing import PolynomialFeatures

#specify degree of 3 for the polynomial features
#include_bias = False means no constant bias (intercept) column is added
poly = PolynomialFeatures(degree = 3, include_bias = False, interaction_only = True)

#CREATE POLYNOMIAL FEATURES FOR ALL FEATURES EXCEPT THE COORDINATES
X_train_temp = X_train[:, :-2]
poly_X_train = poly.fit_transform(X_train_temp)

#CONCAT THE COORDINATE FEATURES
poly_X_train = np.concatenate((poly_X_train, X_train[:, -2:]), axis = 1)

X_train.shape
Output: (16512, 8)

poly_X_train.shape
Output: (16512, 43)

#CREATING A MODEL OBJECT AND TRAINING
polynomial_regression_model = LinearRegression()
cv_scores['polynomial_regression_model'] = calculate_eval_metric(polynomial_regression_model, poly_X_train, y_train)

Output:
Evaluation score on 3 cross-validation sets :  [   0.51319899 -100.3940004    -4.14467619]
Average R squared score :  -34.675159200163826

The model seems to perform very poorly. This might be a result of overfitting.
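One way to check the overfitting hypothesis (a quick sketch, not part of the original notebook) is to compare the model's score on the data it was fit on with the cross-validated score above - a large gap points to overfitting:

#FIT ON THE FULL POLYNOMIAL TRAINING SET AND COMPARE WITH THE CV SCORE ABOVE
polynomial_regression_model.fit(poly_X_train, y_train)
print("R squared on training data : ", polynomial_regression_model.score(poly_X_train, y_train))
#A HIGH TRAINING SCORE NEXT TO A NEGATIVE CROSS-VALIDATION SCORE MEANS THE MODEL
#MEMORIZES THE TRAINING FOLDS BUT FAILS TO GENERALIZE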

 

Ridge Regression

Ridge Regression is the L2-regularized version of Linear / Polynomial Regression - it can help reduce overfitting.

#IMPORT RIDGE REGRESSION IMPLEMENTATION FROM SKLEARN
from sklearn.linear_model import Ridge

poly_ridge_regression_model = Ridge(alpha = 2500.0)
cv_scores['poly_ridge_regression_model'] = calculate_eval_metric(poly_ridge_regression_model, poly_X_train, y_train)

Output:
Evaluation score on 3 cross-validation sets :  [0.54003682 0.40483983 0.53270562]
Average R squared score :  0.4925274246450928

ridge_regression_model = Ridge(alpha = 10.0)
cv_scores["ridge_regression_model"] = calculate_eval_metric(ridge_regression_model, X_train, y_train)

Output:
Evaluation score on 3 cross-validation sets :  [0.61717619 0.60642186 0.59115334]
Average R squared score :  0.6049171307366464
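The alpha values above (2500.0 and 10.0) set the strength of the L2 penalty. If you want to pick alpha more systematically, a simple sketch reusing the calculate_eval_metric helper could look like this (the alpha grid is just an example):

#TRYING A RANGE OF REGULARIZATION STRENGTHS ON THE ORIGINAL (NON-POLYNOMIAL) FEATURES
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    print("alpha =", alpha)
    calculate_eval_metric(Ridge(alpha = alpha), X_train, y_train)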

 

Decision Tree Regressor

from sklearn.tree import DecisionTreeRegressor

decision_tree_model = DecisionTreeRegressor(random_state = 0)
cv_scores["decision_tree_model"] = calculate_eval_metric(decision_tree_model, X_train, y_train)

Output:
Evaluation score on 3 cross-validation sets :  [0.6292367  0.55985191 0.59136255]
Average R squared score :  0.593483722209434

#PARAMETER TUNING
#TRYING DIFFERENT VALUES OF MAX_DEPTH
for depth in [2, 4, 6, 8, 10, 12]:
    decision_tree_model_depth = DecisionTreeRegressor(random_state = 0, max_depth = depth)
    cv_scores["decision_tree_model_depth", depth] = calculate_eval_metric(decision_tree_model_depth, X_train, y_train)

Output: (evaluation scores for each max_depth value; the resulting averages are collected in the cv_scores dictionary shown below)

cv_scores

Output: (dictionary of all cross-validation scores computed so far)

From the above we can observe that a max_depth value of 8 works best and generalizes better than the other values.
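The same max_depth search can also be delegated to scikit-learn's GridSearchCV, which runs the cross-validation loop for us; a sketch with the same depth values and scoring:

from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [2, 4, 6, 8, 10, 12]}
grid = GridSearchCV(DecisionTreeRegressor(random_state = 0), param_grid, cv = 3, scoring = 'r2')
grid.fit(X_train, y_train)
print(grid.best_params_)   #EXPECTED TO SELECT max_depth = 8, MATCHING THE MANUAL SEARCH
print(grid.best_score_)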

 

Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor

random_forest_regressor_model = RandomForestRegressor(n_estimators = 100, max_depth = 8, random_state = 0)
cv_scores["random_forest_regressor_model"] = calculate_eval_metric(random_forest_regressor_model, X_train, y_train)

Output:
Evaluation score on 3 cross-validation sets :  [0.76016949 0.74570302 0.747854  ]
Average R squared score :  0.7512421711769676

Random Forest gives the best performance of all the models so far.

#TRYING DIFFERENT VALUES FOR N_ESTIMATORS IN TRAINING OUR MODEL
for estimators in [80, 100, 120]:
    random_forest_regressor_model_estimators = RandomForestRegressor(n_estimators = estimators, max_depth = 8, random_state = 0)
    cv_scores["random_forest_regressor_model_estimators", estimators] = calculate_eval_metric(random_forest_regressor_model_estimators, X_train, y_train)

Output:
Evaluation score on 3 cross-validation sets :  [0.75947895 0.7444624  0.74759133]
Average R squared score :  0.7505108924375108
Evaluation score on 3 cross-validation sets :  [0.76016949 0.74570302 0.747854  ]
Average R squared score :  0.7512421711769676
Evaluation score on 3 cross-validation sets :  [0.76075865 0.74723152 0.74797942]
Average R squared score :  0.7519898651687122

cv_scores

Output:
{'linear_regression_model': 0.6048170562554777,
 'polynomial_regression_model': -34.675159200163826,
 'ridge_regression_model': 0.6049171307366464,
 'poly_ridge_regression_model': 0.4925274246450928,
 'decision_tree_model': 0.593483722209434,
 ('decision_tree_model_depth', 2): 0.44105621127826095,
 ('decision_tree_model_depth', 4): 0.5690941288023251,
 ('decision_tree_model_depth', 6): 0.6461755081549374,
 ('decision_tree_model_depth', 8): 0.6820067052741757,
 ('decision_tree_model_depth', 10): 0.6790157396768074,
 ('decision_tree_model_depth', 12): 0.6613241838056646,
 'random_forest_regressor_model': 0.7512421711769676,
 ('random_forest_regressor_model_estimators', 80): 0.7505108924375108,
 ('random_forest_regressor_model_estimators', 100): 0.7512421711769676,
 ('random_forest_regressor_model_estimators', 120): 0.7519898651687122}

More estimators in a random forest almost always result in better performance, as we can confirm here, but we will stick with n_estimators = 100 due to the training time and because increasing the number of estimators does not result in a major improvement.

 

Gradient Boosting Regressor

from sklearn.ensemble import GradientBoostingRegressor

gradient_boosting_model = GradientBoostingRegressor(random_state = 0, n_estimators = 100, learning_rate = 0.1, max_depth = 8)
cv_scores["gradient_boosting_model"] = calculate_eval_metric(gradient_boosting_model, X_train, y_train)

Output:
Evaluation score on 3 cross-validation sets :  [0.83041307 0.83215497 0.818246  ]
Average R squared score :  0.8269380107026487

The Gradient Boosting Regressor gives the best performance of any model yet.

 

eXtreme Gradient Boosting Regressor (XGBoost)

import xgboost as xgb

xgb_model = xgb.XGBRegressor(max_depth = 8, n_estimators = 100, random_state = 0)
cv_scores["xgb_model"] = calculate_eval_metric(xgb_model, X_train, y_train)

Output:
Evaluation score on 3 cross-validation sets :  [0.83280003 0.83361391 0.82231405]
Average R squared score :  0.8295759945972049

 

Support Vector Regression

from sklearn.svm import SVR

svr_model = SVR()
cv_scores["svr_model"] = calculate_eval_metric(svr_model, X_train, y_train)

Output:
Evaluation score on 3 cross-validation sets :  [0.73671607 0.72975087 0.72730831]
Average R squared score :  0.7312584168808836
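The SVR above runs with scikit-learn's defaults (RBF kernel, C = 1.0). A sketch of how one could probe its main regularization parameter, again reusing calculate_eval_metric (the C values are just examples):

#TRYING A FEW VALUES OF THE REGULARIZATION PARAMETER C WITH THE DEFAULT RBF KERNEL
for C in [0.1, 1.0, 10.0]:
    print("C =", C)
    calculate_eval_metric(SVR(C = C), X_train, y_train)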

 

Model Selection (by comparing evaluation metrics)

Let's take a look at the evaluation (R squared) scores of the different models on our training data.

cv_scores

Output: (dictionary of all cross-validation scores, as shown above)

From the above data we can observe that the gradient boosting models ('gradient_boosting_model' and 'xgb_model') provide the best performance. We will use the Gradient Boosting Regressor for making predictions on new data.

from sklearn.ensemble import GradientBoostingRegressor

#BOOSTING MODEL
gradient_boosting_model = GradientBoostingRegressor(random_state = 0, n_estimators = 100, learning_rate = 0.1, max_depth = 8)

#FIT THE MODEL USING THE COMPLETE TRAINING DATA FOR BETTER PERFORMANCE
gradient_boosting_model.fit(X_train, y_train)

Output: GradientBoostingRegressor(max_depth=8, random_state=0)

#EVALUATION SCORE (R SQUARED) ON TRAIN DATA - HOW WELL OUR MODEL PERFORMS ON TRAINING DATA
gradient_boosting_model.score(X_train, y_train)

Output: 0.9543701442570848

#EVALUATION SCORE ON TEST DATA TO SEE HOW WELL OUR MODEL GENERALIZES TO NEW DATA
gradient_boosting_model.score(X_test, y_test)

Output: 0.8349827250092899
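R squared is the only metric reported in this notebook. If an error in the target's own units is also wanted (the target is the median house value in units of $100,000), a small sketch with mean absolute error and root mean squared error on the test set:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_test_pred = gradient_boosting_model.predict(X_test)
print("MAE  : ", mean_absolute_error(y_test, y_test_pred))
print("RMSE : ", np.sqrt(mean_squared_error(y_test, y_test_pred)))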

 

Learn Feature Importance and Relations

#IMPORTANCE OF DIFFERENT FEATURES IN OUR MODEL
list_of_features = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "Longitude"]
feature_imp = gradient_boosting_model.feature_importances_

feature_importance = pd.DataFrame(list_of_features, columns = ["Features"])
feature_importance["Importance"] = feature_imp
feature_importance

Output: (table of the eight features and their importance values)

plt.figure(figsize = (12, 8))
sns.barplot(x = feature_importance['Importance'], y = feature_importance['Features'], orient = 'h')
plt.show()

Output: (horizontal bar plot of feature importances)

!pip install shap

import shap

explainer = shap.TreeExplainer(gradient_boosting_model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train)

Output: (SHAP summary plot of each feature's impact on the model output)

From the above analysis we can conclude that the features 'MedInc', 'Latitude', 'Longitude' and 'AveOccup' contribute significantly more to the final predictions than the other features.

 

Prediction

First we have to standardize the new data.

x_pred = cal_housing_dataset.data.loc[0].values.reshape(1, -1)
x_pred

Output: array([[   8.3252    ,   41.        ,    6.98412698,    1.02380952,
         322.        ,    2.55555556,   37.88      , -122.23      ]])

#STANDARDIZE THE INPUT
x_pred = scaler.transform(x_pred)

#PREDICTION ON DATA
gradient_boosting_model.predict(x_pred)

Output: array([4.50939797])

#ACTUAL VALUE
cal_housing_dataset.target[0]

Output: 4.526
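To avoid having to remember the manual standardization step at prediction time, the scaler and the model could also be bundled into a single scikit-learn Pipeline. A sketch, assuming a hypothetical unscaled training split named X_train_raw (this notebook only keeps the already-scaled X_train):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

#X_train_raw IS A HYPOTHETICAL NAME FOR THE UNSCALED TRAINING FEATURES
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingRegressor(random_state = 0, n_estimators = 100, learning_rate = 0.1, max_depth = 8)),
])
pipeline.fit(X_train_raw, y_train)

#NEW RAW SAMPLES CAN THEN BE PASSED DIRECTLY - SCALING HAPPENS INSIDE THE PIPELINE
pipeline.predict(cal_housing_dataset.data.loc[[0]].values)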

 

Thank you for your time!! The following file contains the progress of the project so far.

 

Did you like my Notebook and my approach??

  • Yes, Absolutely 🤩

  • Nice Try 😅

  • No, can improve a lot 👀






