Original notebook source: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
The original default score was 0.11543; the new best score is 0.11356.
How to Kaggle: https://www.youtube.com/watch?v=GJBOMWpLpTQ
June 2020
Need to submit every 2 months, as the leaderboard rolls over after that 2-month period.
Typical flow of model building: use GridSearchCV to determine the best parameters => best_estimator_ / best_params_
good basic resources for models: https://scikit-learn.org/stable/modules/cross_validation.html https://scikit-learn.org/stable/modules/grid_search.html
A search consists of:
an estimator (regressor or classifier such as sklearn.svm.SVC());
a parameter space;
a method for searching or sampling candidates;
a cross-validation scheme; and
a score function.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold
estimator.get_params()  # list all hyperparameters the estimator exposes
inner_cv = KFold(n_splits=4, shuffle=True, random_state=None)
# To be used in outer CV
outer_cv = KFold(n_splits=4, shuffle=True, random_state=None)
# inner-loop KFold example (rf, param_grid, MyTimer, X_train, y_train, verbose are defined elsewhere):
gsc = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    #scoring='roc_auc',  # or 'r2', etc.
    scoring='f1',
    #scoring='neg_mean_squared_error',  # or look here for other choices:
    # https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    #cv=5,
    cv=inner_cv,  # KFold splitting randomly rotates the train/validation folds
    verbose=verbose
)
print("Running GridSearchCV:")
with MyTimer():
    grid_result = gsc.fit(X_train, y_train)
non_nested_scores[i] = grid_result.best_score_
rf = RandomForestClassifier(**grid_result.best_params_)  # rebuild the model with the best parameters
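For the outer loop, a minimal nested-CV sketch (my addition, reusing gsc and outer_cv from above): passing the GridSearchCV object itself to cross_val_score scores the entire tuning procedure, which gives a less optimistic estimate of generalization error than best_score_ alone.
from sklearn.model_selection import cross_val_score
# each outer fold re-runs the inner grid search on its own training portion
nested_scores = cross_val_score(gsc, X_train, y_train, cv=outer_cv, scoring='f1')
print("Nested CV f1: {:.4f} (+/- {:.4f})".format(nested_scores.mean(), nested_scores.std()))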
comments from the original author:
This competition is very important to me as it helped me to begin my journey on Kaggle a few months ago. I've read some great notebooks here. To name a few:
Comprehensive data exploration with Python by Pedro Marcelino: great and very motivational data analysis
A study on Regression applied to the Ames dataset by Julien Cohen-Solal: thorough feature engineering and a deep dive into linear regression analysis, but really easy to follow for beginners
Regularized Linear Models by Alexandru Papiu: great starter kernel on modelling and cross-validation
I can't recommend enough that every beginner go carefully through these kernels (and of course through many other great kernels) and get their first insights in data science and Kaggle competitions.
After that (and some basic practice) you should be more confident to go through this great script by Human Analog, who did an impressive job on feature engineering.
As the dataset is particularly handy, I decided a few days ago to get back into this competition and apply the things I've learnt so far, especially stacking models. For that purpose, we build two stacking classes (the simplest approach and a less simple one).
As these classes are written for general purpose, you can easily adapt them and/or extend them for your regression problems. The overall approach is hopefully concise and easy to follow.
The feature engineering is rather parsimonious (at least compared to some other great scripts). It is pretty much:
Imputing missing values by proceeding sequentially through the data
Transforming some numerical variables that seem really categorical
Label Encoding some categorical variables that may contain information in their ordering set
Box Cox Transformation of skewed features (instead of log-transformation) : This gave me a slightly better result both on leaderboard and cross-validation.
Getting dummy variables for categorical features.
Then we choose many base models (mostly sklearn based models + sklearn API of DMLC's XGBoost and Microsoft's LightGBM), cross-validate them on the data before stacking/ensembling them. The key here is to make the (linear) models robust to outliers. This improved the result both on LB and cross-validation.
To my surprise, this does well on the LB (0.11420 and top 4% the last time I tested it: July 2, 2017).
Hope that at the end of this notebook, stacking will be clear for those, like myself, who found the concept not so easy to grasp
#import some necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)
from scipy import stats
from scipy.stats import norm, skew, kurtosis, boxcox #for some statistics
from scipy.special import boxcox1p, inv_boxcox
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8")) #check the files available in the directory
#Now let's import and put the train and test datasets in pandas dataframe
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
##display the first five rows of the train dataset.
train.head(5)
##display the first five rows of the test dataset.
test.head(5)
#check the numbers of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))
#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']
#Now drop the 'Id' column since it's unnecessary for the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
#check again the data size after dropping the 'Id' variable
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape))
print("The test data size after dropping Id feature is : {} ".format(test.shape))
Documentation for the Ames Housing Data indicates that there are outliers present in the training data
Let's explore these outliers
fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
m, b = np.polyfit(train['GrLivArea'], train['SalePrice'], 1)
#m = slope, b=intercept
plt.plot(train['GrLivArea'], m*train['GrLivArea'] + b)
plt.show()
Edit: look for outliers (fliers) in the other columns
train.shape[1]
# lay out a grid of scatter plots, one per feature, against SalePrice
#a = int(np.sqrt(train.shape[1]))
a = 4                         # grid width
b = int(train.shape[1]/4)
r = int(train.shape[1]/a)     # rows of the plot grid
c = int(train.shape[1]/b)     # columns (evaluates to 4)
i = 0
fig, ax = plt.subplots(nrows=r, ncols=c, figsize=(15, 60))
for row in ax:
    for col in row:
        try:
            col.scatter(x=train[train.columns[i]], y=train['SalePrice'])
            col.title.set_text(train.columns[i])
        except Exception:
            pass  # non-numeric columns (and any overrun past the last column) can't be scattered; skip
        i = i + 1
plt.show()
More filtering of data to try
#train = train.drop(train[(train['LotFrontage']>300) & (train['SalePrice']<400000)].index)
#train = train.drop(train[(train['LotArea']>100000) & (train['SalePrice']<400000)].index)
#train = train.drop(train[(train['BsmtFinSF1']>4000) & (train['SalePrice']<250000)].index)
#train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<250000)].index)
At the bottom right we can see two points with extremely large GrLivArea but a low price. These values are huge outliers. Therefore, we can safely delete them.
Edit: I think this deletion is safe to do, as there are only 2 points and they seem abnormal, possibly a data error
#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
m, b = np.polyfit(train['GrLivArea'], train['SalePrice'], 1)
#m = slope, b=intercept
plt.plot(train['GrLivArea'], m*train['GrLivArea'] + b)
plt.show()
Outlier removal is not always safe. We decided to delete these two as they are very extreme and really bad (extremely large areas for very low prices).
There are probably other outliers in the training data. However, removing them all may badly affect our models if there were also outliers in the test data. That's why, instead of removing them all, we will just make some of our models robust to them. You can refer to the modelling part of this notebook for that.
Edit: to do - look for other outliers in all the columns. Maybe better to use StandardScaler() and np.clip; a sketch of that idea follows.
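A hedged sketch of that StandardScaler/clip screening (my addition, not applied in this notebook; the 4-sigma threshold is an arbitrary assumption):
from sklearn.preprocessing import StandardScaler
# z-score the numeric features, clip anything beyond +/- 4 sigma, then undo the scaling
# (StandardScaler disregards NaNs in fit and keeps them in transform)
num_cols = train.select_dtypes(include=[np.number]).columns.drop('SalePrice')
scaler = StandardScaler()
z = scaler.fit_transform(train[num_cols])
train_clipped = train.copy()
train_clipped[num_cols] = scaler.inverse_transform(np.clip(z, -4, 4))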
SalePrice is the variable we need to predict. So let's do some analysis on this variable first.
sns.distplot(train['SalePrice'] , fit=norm)
_skew = skew(train['SalePrice'])
_kurtosis = kurtosis(train['SalePrice'])
# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}, skew = {:.2f} kurtosis = {:.2f}\n'.format(mu, sigma, _skew, _kurtosis))
#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
The target variable is right-skewed. As (linear) models love normally distributed data, we need to transform this variable and make it more normally distributed.
Log-transformation of the target variable
Edit: also trying the Box-Cox transform. The Box-Cox transform is given by $y = (x^\lambda - 1)/\lambda$ for $\lambda \neq 0$, and $y = \log(x)$ for $\lambda = 0$.
#We use the numpy function log1p, which applies log(1+x) to all elements of the column
# Option 1
#train["SalePrice"] = np.log1p(train["SalePrice"])
# Option 2: Box-Cox transform of 1+x; per the change log at the end, lam_l = 0.35 gave the best score
lam_l = 0.35
train["SalePrice"] = boxcox1p(train["SalePrice"], lam_l)
# Option 3: boxcox, letting the algorithm select lmbda by maximizing the log-likelihood
# (this performed worse than boxcox1p, reverting back)
#train["SalePrice"], lam_l = boxcox(x=train["SalePrice"], lmbda=None)
# to revert, use y = inv_boxcox(x, lam_l) => train["SalePrice"] = inv_boxcox(train["SalePrice"], lam_l)
# the formula is: pred_y = np.power((y_box * lam_l) + 1, 1 / lam_l); the exact inverse of
# boxcox1p is inv_boxcox1p, which also subtracts the 1 back off at the end
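A quick sanity check on the inversion (my addition): inv_boxcox undoes boxcox, while boxcox1p is exactly undone by scipy.special.inv_boxcox1p. Using inv_boxcox on a boxcox1p-transformed value recovers 1 + x, an offset of one dollar here, which is negligible for house prices.
from scipy.special import inv_boxcox1p
x = 180000.0                   # a typical SalePrice
y = boxcox1p(x, lam_l)         # forward transform of 1 + x
print(inv_boxcox1p(y, lam_l))  # 180000.0 (exact inverse)
print(inv_boxcox(y, lam_l))    # 180001.0 (recovers 1 + x)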
#Check the new distribution
sns.distplot(train['SalePrice'] , fit=norm);
_skew = skew(train['SalePrice'])
_kurtosis = kurtosis(train['SalePrice'])
# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}, skew = {:.2f} kurtosis = {:.2f}\n'.format(mu, sigma, _skew, _kurtosis))
#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
The skew now seems corrected and the data appears more normally distributed.
Edit: both distributions have a positive (excess) kurtosis, which means a sharper peak and heavier tails than a normal distribution. Skewness is close to 0 now, so on that metric we are closer to a normal distribution.
Let's first concatenate the train and test data in the same dataframe.
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
all_data.head()
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
Data Correlation
A cross-correlation chart to view collinearity within the features. Kendall's correlation seems appropriate, as the output variable is numerical and much of the input is categorical. Here is a chart to guide you to which method to use, based on your dataset.
#Correlation map to see how features are correlated with each other and with SalePrice
corrmat = train.corr(method='kendall')
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)
We impute them by proceeding sequentially through features with missing values
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data.dtypes
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
all_data["Alley"] = all_data["Alley"].fillna("None")
all_data["Fence"] = all_data["Fence"].fillna("None")
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
#Group by neighborhood and fill in missing value by the median LotFrontage of all the neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
all_data[col] = all_data[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
all_data[col] = all_data[col].fillna(0)
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
all_data[col] = all_data[col].fillna(0)
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
all_data[col] = all_data[col].fillna('None')
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
all_data['MSZoning'].value_counts()
# this one may be a bit dangerous; maybe try to get the zone from the neighborhood's most common value, similar to LotFrontage previously (see the sketch below)
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
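A minimal sketch of that neighborhood-based alternative (my addition, left commented out since the line above is what the notebook scored with; it assumes every neighborhood with a missing MSZoning also has non-missing rows):
# hypothetical alternative: impute MSZoning from each neighborhood's most common value
#all_data['MSZoning'] = all_data.groupby('Neighborhood')['MSZoning'].transform(
#    lambda x: x.fillna(x.mode()[0]))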
all_data['Utilities'].value_counts()
all_data = all_data.drop(['Utilities'], axis=1)
all_data["Functional"] = all_data["Functional"].fillna("Typ")
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")
Is there any remaining missing value?
#Check remaining missing values if any
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head()
No missing values remain.
Transforming some numerical variables that are really categorical
Edit: I will add a new column encoding year and month sold as an ordinal value; since there is some order to the values, it may be useful
all_data['MSSubClass'].value_counts()
all_data['OverallCond'].value_counts()
all_data.shape
We want to add an ordinal (integer) column for year and month sold; this is the function that performs that task
import datetime
Yr = all_data['YrSold'].min()
Mo = all_data['MoSold'].min()
t = datetime.datetime(Yr, Mo, 1, 0, 0)
def calculateYrMo(row):
    # seconds elapsed between this row's sale month and the earliest sale month in the data
    return int((datetime.datetime(row.YrSold, row.MoSold, 1) - t).total_seconds())
# either way will work
#all_data['YrMoSold'] = all_data.apply(lambda row: int((datetime.datetime(row.YrSold, row.MoSold, 1) - t).total_seconds()), axis=1)
all_data['YrMoSold'] = all_data.apply(calculateYrMo, axis=1)
#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)
#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)
#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)
Label Encoding some categorical variables that may contain information in their ordering set
Adding one more important feature
Since area related features are very important to determine house prices, we add one more feature which is the total area of basement, first and second floor areas of each house
from sklearn.preprocessing import LabelEncoder
#cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
# 'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
# 'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
# 'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
# 'YrSold', 'MoSold', 'YrMoSold')
# Edit: Dropping PoolQC for missing values => makes the model worse, reverting
#all_data = all_data.drop(['PoolQC'], axis=1)
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
        'YrSold', 'MoSold', 'YrMoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))
# shape
print('Shape all_data: {}'.format(all_data.shape))
# Adding total sqfootage feature
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
Skewed features
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)
Box Cox Transformation of (highly) skewed features
We use the scipy function boxcox1p which computes the Box-Cox transformation of $1 + x$.
Note that setting $ \lambda = 0 $ is equivalent to the log1p used in option 1 above for the target variable.
See this page for more details on Box Cox Transformation as well as the scipy function's page
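A one-line check of that equivalence (my addition):
# boxcox1p with lambda = 0 reduces to log(1 + x)
assert np.isclose(boxcox1p(100.0, 0.0), np.log1p(100.0))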
skewness = skewness[abs(skewness) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_features = skewness.index
lam_f = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam_f)
#all_data[skewed_features] = np.log1p(all_data[skewed_features])
I recommend plotting the cumulative sum of eigenvalues (assuming they are in descending order). If you divide each value by the total sum of eigenvalues prior to plotting, then your plot will show the fraction of total variance retained vs. number of eigenvalues. The plot will then provide a good indication of when you hit the point of diminishing returns (i.e., little variance is gained by retaining additional eigenvalues).
Getting dummy categorical features
all_data = pd.get_dummies(all_data)
print(all_data.shape)
Do some PCA on the dataset to remove some of the collinearity. Not sure if this will have any effect, as collinearity is usually not detrimental to most algorithms.
# to choose number of components, look at this chart. Reference: https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
from sklearn.decomposition import PCA
pca = PCA().fit(all_data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
We choose the number of components based on the previous chart: 20 looks like a good number, as the curve starts to roll off around 15 and is almost at its maximum by 20. We can also try 30, 40 and 50 to squeeze out a little more variance; a quick way to pick the count for a target variance fraction is sketched below.
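A small helper for that choice (my addition; the 95% threshold is an arbitrary assumption):
# number of components needed to retain 95% of the variance, using the full PCA fit above
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_for_95 = int(np.argmax(cumvar >= 0.95)) + 1
print("components for 95% of the variance:", n_for_95)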
#do PCA/StandardScaler+clip here or before the skew boxcox1p transform
n_components = 50
pca = PCA(n_components=n_components)
all_data_pca = pca.fit_transform(all_data)
Our column count went from 216 to the n_components value.
print(all_data.shape)
print(all_data_pca.shape)
Edit: pca.components_ is the matrix you can use to invert the PCA analysis, i.e. to go (approximately) back to the original dataset. Reference: https://stats.stackexchange.com/a/143949
weights = np.round(pca.components_, 3)
ev = np.round(pca.explained_variance_ratio_, 3)
print('explained variance ratio',ev)
pca_wt = pd.DataFrame(weights)#, columns=all_data.columns)
pca_wt.head()
The shape is (number of PCA components) x (number of original columns)
pca_wt.shape
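A minimal sketch of that inversion (my addition): with a truncated PCA, multiplying the scores by components_ and adding back mean_ reproduces pca.inverse_transform, an approximation of the original data.
# reconstruct (approximately) the original features from the n_components PCA scores
X_approx = all_data_pca @ pca.components_ + pca.mean_
assert np.allclose(X_approx, pca.inverse_transform(all_data_pca))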
#Correlation map to see how features are correlated with each other and with SalePrice
corrmat = pd.DataFrame(all_data_pca).corr(method='kendall')
plt.subplots(figsize=(12,9))
plt.title("Kendall's Correlation Matrix PCA applied", fontsize=16)
sns.heatmap(corrmat, vmax=0.9, square=True)
#Correlation map to see how features are correlated with each other and with SalePrice
corrmat = train.corr(method='kendall')
plt.subplots(figsize=(12,9))
plt.title("Kendall's Correlation Matrix Initial Train Set", fontsize=16)
sns.heatmap(corrmat, vmax=0.9, square=True)
As you can see, the PCA did its job: the features show little correlation now.
Getting the new train and test sets.
print(type(all_data))
print(type(pd.DataFrame(all_data_pca)))
Testing the new PCA dataset for analysis - RMSE looks worse after PCA is applied; need to look at the Kaggle score later and see if it correlates. It could be a mistake to use PCA on categorical dummy data. However, XGB is better with the PCA n=50 option. Maybe use a heavier weight for that portion, or use all_data_pca only on that model...
use_pca = 0  # using PCA currently hurts the score
if (use_pca == 1):
    all_data_pca = pd.DataFrame(all_data_pca)
    train = all_data_pca[:ntrain]
    test = all_data_pca[ntrain:]
    all_data_pca.head()
else:
    train = all_data[:ntrain]
    test = all_data[ntrain:]
Import libraries
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
Define a cross-validation strategy
We use the cross_val_score function of sklearn. However, this function has no shuffle attribute, so we add one line of code in order to shuffle the dataset prior to cross-validation.
To do: replace cross_val_score() with cross_validate(), which can evaluate several metrics at once. From the sklearn docs:
# reference: https://scikit-learn.org/stable/modules/cross_validation.html
>>> from sklearn.metrics import make_scorer
>>> scoring = {'prec_macro': 'precision_macro',
... 'rec_macro': make_scorer(recall_score, average='macro')}
>>> scores = cross_validate(clf, X, y, scoring=scoring,
... cv=5, return_train_score=True)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_prec_macro', 'test_rec_macro',
'train_prec_macro', 'train_rec_macro']
>>> scores['train_rec_macro']
array([0.97..., 0.97..., 0.99..., 0.98..., 0.98...])
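Adapted to this regression problem, a hedged sketch of cross_validate (my addition; the model and fold count are illustrative):
from sklearn.model_selection import cross_validate
# score a lasso pipeline with two regression metrics at once
lasso_demo = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
cv_results = cross_validate(lasso_demo, train.values, y_train,
                            scoring=('neg_mean_squared_error', 'r2'),
                            cv=5, return_train_score=True)
print("rmse: {:.4f}".format(np.sqrt(-cv_results['test_neg_mean_squared_error']).mean()))
print("r2:   {:.4f}".format(cv_results['test_r2'].mean()))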
KFold is useful for thorough testing of a model: it gives a more accurate score by holding out some of the data, testing on the remainder, and rotating which data is held out each time. See the image below for details:
#Validation function
n_folds = 10  # was 5 => better score but twice as slow now
def rmsle_cv(model):
    print("running rmsle_cv code")
    # note: the original called .get_n_splits() here, which returns the integer 10,
    # so the shuffle and random_state were silently dropped; pass the KFold object itself
    kf = KFold(n_folds, shuffle=True, random_state=42)
    # other scores: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv=kf))  # also r2
    print("raw rmse scores for each fold:", rmse)
    return rmse
This model may be very sensitive to outliers, so we need to make it more robust to them. For that we use sklearn's RobustScaler() method in a pipeline
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
again made robust to outliers
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
With huber loss that makes it robust to outliers
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
max_depth=4, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state =5)
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
learning_rate=0.05, max_depth=3,
min_child_weight=1.7817, n_estimators=2200,
reg_alpha=0.4640, reg_lambda=0.8571,
subsample=0.5213, silent=1,
random_state =7, nthread = -1)
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
learning_rate=0.05, n_estimators=720,
max_bin = 55, bagging_fraction = 0.8,
bagging_freq = 5, feature_fraction = 0.2319,
feature_fraction_seed=9, bagging_seed=9,
min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
Let's see how these base models perform on the data by evaluating the cross-validation rmsle error
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))
We begin with this simple approach of averaging base models. We build a new class to extend scikit-learn with our model and also to leverage encapsulation and code reuse (inheritance)
Averaged base models class
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self

    # Now we do the predictions for the cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
Averaged base models score
We just average four models here: ENet, GBoost, KRR and lasso. Of course we could easily add more models to the mix.
averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))
score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Wow! It seems even the simplest stacking approach really improves the score. This encourages us to go further and explore a less simple stacking approach.
In this approach, we add a meta-model on averaged base models and use the out-of-folds predictions of these base models to train our meta-model.
The procedure, for the training part, may be described as follows:
Split the total training set into two disjoint sets (here, train and holdout)
Train several base models on the first part (train)
Test these base models on the second part (holdout)
Use the predictions from 3) (called out-of-fold predictions) as the inputs, and the correct responses (target variable) as the outputs, to train a higher-level learner called the meta-model.
The first three steps are done iteratively. If we take, for example, a 5-fold stacking, we first split the training data into 5 folds. Then we do 5 iterations. In each iteration, we train every base model on 4 folds and predict on the remaining fold (the holdout fold).
So after 5 iterations we are sure that the entire data has been used to get out-of-fold predictions, which we then use as new features to train our meta-model in step 4.
For the prediction part, we average the predictions of all base models on the test data and use them as meta-features, on which the final prediction is done with the meta-model.
Another way to combine multiple models: the function cross_val_predict is appropriate for:
visualization of predictions obtained from different models; and
model blending, when predictions of one supervised estimator are used to train another estimator in ensemble methods. A sketch of this idea follows.
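A minimal blending sketch with cross_val_predict (my addition; it parallels the stacking class below, but in a single pass with default fold settings):
from sklearn.model_selection import cross_val_predict
# each base model's out-of-fold predictions become one column of the meta-feature matrix
base = (ENet, GBoost, KRR)
meta_X = np.column_stack([cross_val_predict(m, train.values, y_train, cv=5) for m in base])
meta_model = clone(lasso).fit(meta_X, y_train)  # lasso as the meta-learner, as below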
(Image taken from Faron)
Gif taken from KazAnova's interview
On this gif, the base models are algorithms 0, 1, 2 and the meta-model is algorithm 3. The entire training dataset is A+B (target variable y known) that we can split into train part (A) and holdout part (B). And the test dataset is C.
B1 (which is the prediction from the holdout part) is the new feature used to train the meta-model 3 and C1 (which is the prediction from the test dataset) is the meta-feature on which the final prediction is done.
Stacking averaged Models Class
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=10):
        # a larger n_folds should give a more accurate prediction, averaged over n_folds iterations
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        # Train cloned base models, then create the out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):  # for each base model passed in
            for train_index, holdout_index in kfold.split(X, y):  # rotate train/holdout across the folds
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])  # fit the clone on this fold's train part
                y_pred = instance.predict(X[holdout_index])   # predict the held-out part
                # column i collects all of model i's out-of-fold predictions; every row of X
                # gets covered exactly once, so these become the meta-model's training features
                out_of_fold_predictions[holdout_index, i] = y_pred
        # Now train the cloned meta-model using the out-of-fold predictions as the new, and only, features
        print("out_of_fold_predictions", out_of_fold_predictions)
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    # Do the predictions of all base models on the test data and use the averaged predictions as
    # meta-features for the final prediction, which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)
Stacking Averaged models Score
To make the two approaches comparable (by using the same number of models), we just average ENet, KRR and GBoost, then we add lasso as the meta-model.
stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
meta_model = lasso)
score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score mean and std: {:.4f} ({:.4f})".format(score.mean(), score.std()))
We again get a better score by adding a meta-learner.
We add XGBoost and LightGBM to the StackedRegressor defined previously.
We first define an rmsle evaluation function (strictly, this is RMSE in the transformed target space)
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))
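Because y_train is Box-Cox transformed, the function above measures error in Box-Cox space rather than the leaderboard's log space, which may be part of why local rmsle and LB scores don't track each other well in the change log. A hedged leaderboard-style sketch (my addition):
from scipy.special import inv_boxcox1p
# RMSE of log prices, computed from Box-Cox-space values
def rmsle_log_space(y_true_bc, y_pred_bc, lam):
    y_true = inv_boxcox1p(y_true_bc, lam)  # back to dollars
    y_pred = inv_boxcox1p(y_pred_bc, lam)
    return np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))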
StackedRegressor:
stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
#stacked_pred = np.expm1(stacked_averaged_models.predict(test.values))
stacked_pred = inv_boxcox(stacked_averaged_models.predict(test.values), lam_l)
print(rmsle(y_train, stacked_train_pred))
XGBoost:
model_xgb.fit(train, y_train)
xgb_train_pred = model_xgb.predict(train)
#xgb_pred = np.expm1(model_xgb.predict(test))
xgb_pred = inv_boxcox(model_xgb.predict(test), lam_l)
print(rmsle(y_train, xgb_train_pred))
LightGBM:
model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)
#lgb_pred = np.expm1(model_lgb.predict(test.values))
lgb_pred = inv_boxcox(model_lgb.predict(test.values), lam_l)
print(rmsle(y_train, lgb_train_pred))
'''RMSE on the entire Train data when averaging'''
print('RMSLE score on train data:')
# weight names chosen so they don't shadow the imported xgb/lgb modules
w_stk = 0.70
w_xgb = 0.15
w_lgb = 0.15
print(rmsle(y_train, stacked_train_pred * w_stk + xgb_train_pred * w_xgb + lgb_train_pred * w_lgb))
Ensemble prediction:
ensemble = stacked_pred * w_stk + xgb_pred * w_xgb + lgb_pred * w_lgb
print(lgb_pred)
print(xgb_pred)
print(stacked_pred)
print(ensemble)
Create File for Submission to Kaggle
sub = pd.DataFrame()
sub['Id'] = test_ID
sub['SalePrice'] = ensemble
sub.to_csv('submission.csv',index=False)
sub.head()
!pwd
!ls -arlt
!head -10 submission.csv
Issue: the submit screen data did not match the run/edit/save version screen. Trust the submit screen over the edit/save version screen; the latter seems to be not what is actually submitted.
wrong: https://www.kaggle.com/donaldst/stacked-regressions-part-2/edit/run/37370884
correct: https://www.kaggle.com/donaldst/stacked-regressions-part-2?scriptVersionId=37373730
Change log (attempt number, leaderboard rank #, score):
1) #400 0.11543 default notebook
2) #400 4.19420 Box-Cox on SalePrice (forgot to revert back)
5) #400 0.11543 back to default
6) #288 0.11513 added the YrMo cat/ord column
8) #270 0.11482 use boxcox1p on SalePrice and revert using inv_boxcox
10) #270 0.11631 use boxcox on SalePrice, which selects the default lambda by maximizing the log-likelihood
13) 0.11482 revert back to boxcox1p
14) 0.11485 change lam from .15 to .05 (better skew, similar kurtosis)
15) #263 0.11465 change lam to .20
16) 0.11492 change lam to .25
17) #259 0.11461 change lam to .30
18) 0.11474 change lam to .40 (too far)
19) #217 0.11395 change lam to .35 (stick with this for now)
20) 0.11479 drop the PoolQC column for missing values (the column was accurate once the missing values were replaced with None)
22) #217 0.11546 put PoolQC back, drop MiscFeature
23) 0.11395* revert, no dropping other columns
25) #223 0.14229 add PCA(n=50) with no other adjustments
27) #223 0.17279 PCA(n=20)
28) 0.14292 revert to n=50 (noticed the score seems to change slightly even with the same settings)
29) 0.15168 change the proportions of the models to use the model with the lowest rmsle (XGBoost) more, from .70/.15/.15 to .15/.70/.15 => score is worse; decreasing rmsle is not an accurate method to get a better score
30) 0.11445 change max_depth of model_xgb from 3 to 2 (weaker learner) => increases the rmsle and has a worse score
31) 0.11410 try max_depth=4 => still worse; looks like 3 is the best number
32) 0.11395 revert to ~#23
33) #211 0.11356 change n_folds from 5 to 10 for more thorough testing (takes 15 mins to run now)
34) 0.11356 use default scoring in the cross_val_score function, not neg_mean_squared_error => same score; is this line of code even used?
35) revert to #33: stacked=.70, xgb=.15, lgb=.15
38) 0.11409 stacked=.98, xgb=.01, lgb=.01 => worse
39) #216 0.11363 stacked=.70, xgb=.05, lgb=.25 => also worse
41) 0.11375 stacked=.70, xgb=.25, lgb=.05 => also worse; seems ensembling is better...
42) 0.11364 stacked=.60, xgb=.15, lgb=.25 => worse
43) 0.11366 stacked=.60, xgb=.20, lgb=.20 => worse; looks like the kernel already has the optimal mix
44) 0.11486 filter the other 4 columns => worse; seems this information is useful...
45) 0.11622 put 2 back; hmm, even worse now...
46) 0.11464 try the other 2 instead - ready to submit, just wait for tomorrow
47) 0.11825 remove all filtering; reverting, as any other combination is worse - the original author must have already checked this
49) #176 0.11356 revert to #33 (I think a few scores dropped off already; my rank moved up even though the score didn't improve from #33)
50) try StandardScaler() instead of RobustScaler()
To do:
try PCA on features => no improvement in score, will drop this for now
change the ratios of the models to find which is the best predictor (gives the best score), then work on improving that model first
change the scale factor of the 3 models => score does not correlate to rmsle; try to find a better metric
try StandardScaler and np.clip on the features; currently RobustScaler() is used, but I already screened fliers, so StandardScaler should be adequate
more filtering of fliers
use all_data_pca only on the models that show improvement
use GridSearchCV to find optimal parameters; however, need to find the best way to evaluate the scores. RMSLE doesn't seem to match well...