Original notebook source: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

original default score was .11543, new best score is .11356

Stacked and Ensembled Regressions to predict House Prices

How to Kaggle: https://www.youtube.com/watch?v=GJBOMWpLpTQ

Donald S

June 2020

Need to submit every 2 months as the leaderboard will rollover after this 2 month period

Typical flow of model building: Use GridSearch for determining Best parameters => **bestestimator

good basic resource for models: https://scikit-learn.org/stable/modules/cross_validation.html https://scikit-learn.org/stable/modules/grid_search.html

A search consists of: an estimator (regressor or classifier such as sklearn.svm.SVC());
a parameter space;
a method for searching or sampling candidates;
a cross-validation scheme; and
a score function. **

estimator.get_params() # get all hyperparameters


inner_cv = KFold(n_splits=4, shuffle=True, random_state=None)

# To be used in outer CV 
outer_cv = KFold(n_splits=4, shuffle=True, random_state=None)    

#inner loop KFold example:
gsc = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    #scoring='roc_auc', # or 'r2', etc
    scoring='f1',
    #scoring='neg_mean_squared_error', # or look here for other choices 
    # https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    #cv=5,
    cv=inner_cv, # this will use KFold splitting to change train/test/validation datasets randomly
    verbose=verbose
)

print("Running GridSearchCV:")
with MyTimer():    
    grid_result = gsc.fit(X_train, y_train)   
non_nested_scores[i] = grid_result.best_score_

rf = RandomForestClassifier(**grid_result.best_params_)

image.png

comments from the original author:

This competition is very important to me as it helped me to begin my journey on Kaggle few months ago. I've read some great notebooks here. To name a few:

  1. Comprehensive data exploration with Python by Pedro Marcelino : Great and very motivational data analysis

  2. A study on Regression applied to the Ames dataset by Julien Cohen-Solal : Thorough features engeneering and deep dive into linear regression analysis but really easy to follow for beginners.

  3. Regularized Linear Models by Alexandru Papiu : Great Starter kernel on modelling and Cross-validation

I can't recommend enough every beginner to go carefully through these kernels (and of course through many others great kernels) and get their first insights in data science and kaggle competitions.

After that (and some basic pratices) you should be more confident to go through this great script by Human Analog who did an impressive work on features engeneering.

As the dataset is particularly handy, I decided few days ago to get back in this competition and apply things I learnt so far, especially stacking models. For that purpose, we build two stacking classes ( the simplest approach and a less simple one).

As these classes are written for general purpose, you can easily adapt them and/or extend them for your regression problems. The overall approach is hopefully concise and easy to follow..

The features engeneering is rather parsimonious (at least compared to some others great scripts) . It is pretty much :

  • Imputing missing values by proceeding sequentially through the data

  • Transforming some numerical variables that seem really categorical

  • Label Encoding some categorical variables that may contain information in their ordering set

  • Box Cox Transformation of skewed features (instead of log-transformation) : This gave me a slightly better result both on leaderboard and cross-validation.

  • Getting dummy variables for categorical features.

Then we choose many base models (mostly sklearn based models + sklearn API of DMLC's XGBoost and Microsoft's LightGBM), cross-validate them on the data before stacking/ensembling them. The key here is to make the (linear) models robust to outliers. This improved the result both on LB and cross-validation.

To my surprise, this does well on LB ( 0.11420 and top 4% the last time I tested it : July 2, 2017 )

Hope that at the end of this notebook, stacking will be clear for those, like myself, who found the concept not so easy to grasp

In [90]:
#import some necessary libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)


from scipy import stats
from scipy.stats import norm, skew, kurtosis, boxcox #for some statistics
from scipy.special import boxcox1p, inv_boxcox


pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points


from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8")) #check the files available in the directory
data_description.txt
sample_submission.csv
test.csv
train.csv

In [91]:
#Now let's import and put the train and test datasets in  pandas dataframe

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
In [92]:
##display the first five rows of the train dataset.
train.head(5)
Out[92]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.000 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.000 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.000 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.000 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.000 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [93]:
##display the first five rows of the test dataset.
test.head(5)
Out[93]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.000 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.000 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.000 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.000 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.000 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

In [94]:
#check the numbers of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))

#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']

#Now drop the  'Id' colum since it's unnecessary for  the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

#check again the data size after dropping the 'Id' variable
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape)) 
print("The test data size after dropping Id feature is : {} ".format(test.shape))
The train data size before dropping Id feature is : (1460, 81) 
The test data size before dropping Id feature is : (1459, 80) 

The train data size after dropping Id feature is : (1460, 80) 
The test data size after dropping Id feature is : (1459, 79) 

Data Processing

Outliers

Documentation for the Ames Housing Data indicates that there are outliers present in the training data

Let's explore these outliers

In [95]:
fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)

m, b = np.polyfit(train['GrLivArea'], train['SalePrice'], 1)
#m = slope, b=intercept
plt.plot(train['GrLivArea'], m*train['GrLivArea'] + b)

plt.show()

Edit: Look for fliers in other columns

In [96]:
train.shape[1]
#a = int(np.sqrt(train.shape[1]))
a = 4
b = int(train.shape[1]/4)
r = int(train.shape[1]/a)
c = int(train.shape[1]/b)
i = 0
fig, ax = plt.subplots(nrows=r, ncols=c, figsize=(15, 60))
for row in ax:
    for col in row:
        #print(train.columns[i])
        #print(train[train.columns[i]].dtype)
        #col.plot()
        #ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
        try:
            col.scatter(x = train[train.columns[i]], y = train['SalePrice'])
            col.title.set_text(train.columns[i])
            #col.title(train.columns[i])
        except:
            temp=1
        #except Exception as e:
        #    print(e.message, e.args)
        finally:
            temp=1
        i = i + 1
        
plt.show()

More filtering of data to try

In [97]:
#train = train.drop(train[(train['LotFrontage']>300) & (train['SalePrice']<400000)].index)
#train = train.drop(train[(train['LotArea']>100000) & (train['SalePrice']<400000)].index)
#train = train.drop(train[(train['BsmtFinSF1']>4000) & (train['SalePrice']<250000)].index)
#train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<250000)].index)

We can see at the bottom right two with extremely large GrLivArea that are of a low price. These values are huge oultliers. Therefore, we can safely delete them.

Edit: Think this deletion is safe to do as there are only 2 points and they seem to be abnormal, possibly data error

In [9]:
#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
m, b = np.polyfit(train['GrLivArea'], train['SalePrice'], 1)
#m = slope, b=intercept
plt.plot(train['GrLivArea'], m*train['GrLivArea'] + b)
plt.show()

Note :

Outliers removal is note always safe. We decided to delete these two as they are very huge and really bad ( extremely large areas for very low prices).

There are probably others outliers in the training data. However, removing all them may affect badly our models if ever there were also outliers in the test data. That's why , instead of removing them all, we will just manage to make some of our models robust on them. You can refer to the modelling part of this notebook for that.

Edit: todo - look for other outliers in all columns. Maybe better to use StandardScaler() and np.clip

Target Variable

SalePrice is the variable we need to predict. So let's do some analysis on this variable first.

In [10]:
sns.distplot(train['SalePrice'] , fit=norm)

_skew = skew(train['SalePrice'])
_kurtosis = kurtosis(train['SalePrice'])

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}, skew = {:.2f} kurtosis = {:.2f}\n'.format(mu, sigma, _skew, _kurtosis))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
 mu = 180932.92 and sigma = 79467.79, skew = 1.88 kurtosis = 6.50

The target variable is right skewed. As (linear) models love normally distributed data , we need to transform this variable and make it more normally distributed.

Log-transformation of the target variable

Edit: also trying box-cox transform. The Box-Cox transform is given by:

y = (x**lmbda - 1) / lmbda,  for lmbda > 0
log(x), for lmbda = 0

In [11]:
#We use the numpy fuction log1p which  applies log(1+x) to all elements of the column

# option 1
#train["SalePrice"] = np.log1p(train["SalePrice"])

# Option 2: use box-cox transform - this performs worse than the log(1+x), reverting back
lam_l = 0.35
train["SalePrice"] = boxcox1p(train["SalePrice"], lam_l)

# Option 3: boxcox letting the algorithm select lmbda based on least-likelihood calculation
#train["SalePrice"], lam_l = boxcox(x=train["SalePrice"], lmbda=None)
In [12]:
# to revert back use y = inv_boxcox(x, lam_l) => train["SalePrice"] = inv_boxcox(train["SalePrice"], lam_l)
# think the formula is this: # pred_y = np.power((y_box * lambda_) + 1, 1 / lambda_) - 1

#Check the new distribution 
sns.distplot(train['SalePrice'] , fit=norm);

_skew = skew(train['SalePrice'])
_kurtosis = kurtosis(train['SalePrice'])

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}, skew = {:.2f} kurtosis = {:.2f}\n'.format(mu, sigma, _skew, _kurtosis))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
 mu = 191.18 and sigma = 27.62, skew = 0.69 kurtosis = 1.31

The skew seems now corrected and the data appears more normally distributed.

Edit: Both distributions have a positive kurtosis, which means there is a steep dropoff in the curve from the center, or the tails have few points. Skewness is close to 0 now, so that metric is closer to a normal distribution

Features engineering

let's first concatenate the train and test data in the same dataframe

In [13]:
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
all_data size is : (2917, 79)
In [14]:
all_data.head()
Out[14]:
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
0 856 854 0 NaN 3 1Fam TA No 706.000 0.000 ... WD 0 Pave 8 856.000 AllPub 0 2003 2003 2008
1 1262 0 0 NaN 3 1Fam TA Gd 978.000 0.000 ... WD 0 Pave 6 1262.000 AllPub 298 1976 1976 2007
2 920 866 0 NaN 3 1Fam TA Mn 486.000 0.000 ... WD 0 Pave 6 920.000 AllPub 0 2001 2002 2008
3 961 756 0 NaN 3 1Fam Gd No 216.000 0.000 ... WD 0 Pave 7 756.000 AllPub 0 1915 1970 2006
4 1145 1053 0 NaN 4 1Fam TA Av 655.000 0.000 ... WD 0 Pave 9 1145.000 AllPub 192 2000 2000 2008

5 rows × 79 columns

Missing Data

In [15]:
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)
Out[15]:
Missing Ratio
PoolQC 99.691
MiscFeature 96.400
Alley 93.212
Fence 80.425
FireplaceQu 48.680
LotFrontage 16.661
GarageQual 5.451
GarageCond 5.451
GarageFinish 5.451
GarageYrBlt 5.451
GarageType 5.382
BsmtExposure 2.811
BsmtCond 2.811
BsmtQual 2.777
BsmtFinType2 2.743
BsmtFinType1 2.708
MasVnrType 0.823
MasVnrArea 0.788
MSZoning 0.137
BsmtFullBath 0.069
In [16]:
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
Out[16]:
Text(0.5,1,'Percent missing data by feature')

Data Correlation

Cross Correlation chart to view collinearity within the features. Kendall's seems appropriate as the Output Variable is numerical and much of the input is categorical. Here is a chart to guide you to which method to use based on your dataset.

image.png

In [17]:
#Correlation map to see how features are correlated with each other and with SalePrice
corrmat = train.corr(method='kendall')
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fdadf45a438>

Imputing missing values

We impute them by proceeding sequentially through features with missing values

  • PoolQC : data description says NA means "No Pool". That make sense, given the huge ratio of missing value (+99%) and majority of houses have no Pool at all in general.
In [18]:
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
In [19]:
all_data.dtypes
Out[19]:
1stFlrSF           int64
2ndFlrSF           int64
3SsnPorch          int64
Alley             object
BedroomAbvGr       int64
BldgType          object
BsmtCond          object
BsmtExposure      object
BsmtFinSF1       float64
BsmtFinSF2       float64
BsmtFinType1      object
BsmtFinType2      object
BsmtFullBath     float64
BsmtHalfBath     float64
BsmtQual          object
BsmtUnfSF        float64
CentralAir        object
Condition1        object
Condition2        object
Electrical        object
EnclosedPorch      int64
ExterCond         object
ExterQual         object
Exterior1st       object
Exterior2nd       object
Fence             object
FireplaceQu       object
Fireplaces         int64
Foundation        object
FullBath           int64
                  ...   
LotFrontage      float64
LotShape          object
LowQualFinSF       int64
MSSubClass         int64
MSZoning          object
MasVnrArea       float64
MasVnrType        object
MiscFeature       object
MiscVal            int64
MoSold             int64
Neighborhood      object
OpenPorchSF        int64
OverallCond        int64
OverallQual        int64
PavedDrive        object
PoolArea           int64
PoolQC            object
RoofMatl          object
RoofStyle         object
SaleCondition     object
SaleType          object
ScreenPorch        int64
Street            object
TotRmsAbvGrd       int64
TotalBsmtSF      float64
Utilities         object
WoodDeckSF         int64
YearBuilt          int64
YearRemodAdd       int64
YrSold             int64
Length: 79, dtype: object
  • MiscFeature : data description says NA means "no misc feature"
In [20]:
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
  • Alley : data description says NA means "no alley access"
In [21]:
all_data["Alley"] = all_data["Alley"].fillna("None")
  • Fence : data description says NA means "no fence"
In [22]:
all_data["Fence"] = all_data["Fence"].fillna("None")
  • FireplaceQu : data description says NA means "no fireplace"
In [23]:
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
  • LotFrontage : Since the area of each street connected to the house property most likely have a similar area to other houses in its neighborhood , we can fill in missing values by the median LotFrontage of the neighborhood.
In [24]:
#Group by neighborhood and fill in missing value by the median LotFrontage of all the neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
  • GarageType, GarageFinish, GarageQual and GarageCond : Replacing missing data with None
In [25]:
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
  • GarageYrBlt, GarageArea and GarageCars : Replacing missing data with 0 (Since No garage = no cars in such garage.)
In [26]:
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
  • BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath and BsmtHalfBath : missing values are likely zero for having no basement
In [27]:
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
  • BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2 : For all these categorical basement-related features, NaN means that there is no basement.
In [28]:
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')
  • MasVnrArea and MasVnrType : NA most likely means no masonry veneer for these houses. We can fill 0 for the area and None for the type.
In [29]:
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
  • MSZoning (The general zoning classification) : 'RL' is by far the most common value. So we can fill in missing values with 'RL'
In [30]:
all_data['MSZoning'].value_counts()
Out[30]:
RL         2263
RM          460
FV          139
RH           26
C (all)      25
Name: MSZoning, dtype: int64
In [31]:
# this one may be a bit dangerous, maybe try to get zone from neighborhood most common value, similar to LotFrontage previously
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
  • Utilities : For this categorical feature all records are "AllPub", except for one "NoSeWa" and 2 NA . Since the house with 'NoSewa' is in the training set, this feature won't help in predictive modelling. We can then safely remove it.
In [32]:
all_data['Utilities'].value_counts()
Out[32]:
AllPub    2914
NoSeWa       1
Name: Utilities, dtype: int64
In [33]:
all_data = all_data.drop(['Utilities'], axis=1)
  • Functional : data description says NA means typical
In [34]:
all_data["Functional"] = all_data["Functional"].fillna("Typ")
  • Electrical : It has one NA value. Since this feature has mostly 'SBrkr', we can set that for the missing value.
In [35]:
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
  • KitchenQual: Only one NA value, and same as Electrical, we set 'TA' (which is the most frequent) for the missing value in KitchenQual.
In [36]:
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
  • Exterior1st and Exterior2nd : Again Both Exterior 1 & 2 have only one missing value. We will just substitute in the most common string
In [37]:
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
  • SaleType : Fill in again with most frequent which is "WD"
In [38]:
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
  • MSSubClass : Na most likely means No building class. We can replace missing values with None
In [39]:
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")

Is there any remaining missing value ?

In [40]:
#Check remaining missing values if any 
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head()
Out[40]:
Missing Ratio

It remains no missing value.

More features engeneering

Transforming some numerical variables that are really categorical

Edit: I will add a new column year and date as ordinal, since there is some order to the values it may be useful

In [41]:
all_data['MSSubClass'].value_counts()
Out[41]:
20     1079
60      573
50      287
120     182
30      139
160     128
70      128
80      118
90      109
190      61
85       48
75       23
45       18
180      17
40        6
150       1
Name: MSSubClass, dtype: int64
In [42]:
all_data['OverallCond'].value_counts()
Out[42]:
5    1643
6     531
7     390
8     144
4     101
3      50
9      41
2      10
1       7
Name: OverallCond, dtype: int64
In [43]:
all_data.shape
Out[43]:
(2917, 78)

Want to add ordinal or int column for Year and Month, this is the function to perform that task

In [44]:
import datetime
Yr = all_data['YrSold'].min()
Mo = all_data['MoSold'].min()
t = datetime.datetime(Yr, Mo, 1, 0, 0)

def calculateYrMo (row):   
    return int((datetime.datetime(row.YrSold,row.MoSold,1) - t).total_seconds())
In [45]:
# either way will work
#all_data['YrMoSold'] = all_data.apply(lambda row: int((datetime.datetime(row.YrSold,row.MoSold,1) - t).total_seconds()), axis=1)

all_data['YrMoSold'] = all_data.apply(lambda row: calculateYrMo(row), axis=1)
In [46]:
#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)


#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)


#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

Label Encoding some categorical variables that may contain information in their ordering set

Adding one more important feature

Since area related features are very important to determine house prices, we add one more feature which is the total area of basement, first and second floor areas of each house

In [47]:
from sklearn.preprocessing import LabelEncoder
#cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
#        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
#        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
#        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
#        'YrSold', 'MoSold', 'YrMoSold')

# Edit: Dropping PoolQC for missing values => makes the model worse, reverting
#all_data = all_data.drop(['PoolQC'], axis=1)

cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold', 'YrMoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(all_data[c].values)) 
    all_data[c] = lbl.transform(list(all_data[c].values))

# shape        
print('Shape all_data: {}'.format(all_data.shape))
Shape all_data: (2917, 79)
In [48]:
# Adding total sqfootage feature 
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']

Skewed features

In [49]:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)
Skew in numerical features: 

Out[49]:
Skew
MiscVal 21.940
PoolArea 17.689
LotArea 13.109
LowQualFinSF 12.085
3SsnPorch 11.372
LandSlope 4.973
KitchenAbvGr 4.301
BsmtFinSF2 4.145
EnclosedPorch 4.002
ScreenPorch 3.945

Box Cox Transformation of (highly) skewed features

We use the scipy function boxcox1p which computes the Box-Cox transformation of $1 + x$.

Note that setting $ \lambda = 0 $ is equivalent to log1p used above for the target variable.

See this page for more details on Box Cox Transformation as well as the scipy function's page

In [50]:
skewness = skewness[abs(skewness) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index

lam_f = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam_f)
    
    #all_data[skewed_features] = np.log1p(all_data[skewed_features])
There are 60 skewed numerical features to Box Cox transform

I recommend plotting the cumulative sum of eigenvalues (assuming they are in descending order). If you divide each value by the total sum of eigenvalues prior to plotting, then your plot will show the fraction of total variance retained vs. number of eigenvalues. The plot will then provide a good indication of when you hit the point of diminishing returns (i.e., little variance is gained by retaining additional eigenvalues).

Getting dummy categorical features

In [51]:
all_data = pd.get_dummies(all_data)
print(all_data.shape)
(2917, 221)

Do some PCA for the dataset, to remove some of the collinearity. Not sure if this will have any effect as collinearity is usually not detrimental to many or most algorithms

In [52]:
# to choose number of components, look at this chart. Reference: https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
from sklearn.decomposition import PCA

pca = PCA().fit(all_data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
Out[52]:
Text(0,0.5,'cumulative explained variance')

We choose number of eigenvalues to calculate based on previous chart, 20 looks like a good number, the chart starts to roll off around 15 and almost hits max a 20. We can also try 30,40 and 50 to squeeze a little more variance out...

In [53]:
#do PCA/StandardScaler+clip here or before the skew boxcox1p transform

n_components = 50
pca = PCA(n_components=n_components)
all_data_pca = pca.fit_transform(all_data)

our column count went from 216 to n_component value

In [54]:
print(all_data.shape)
print(all_data_pca.shape)
(2917, 221)
(2917, 50)

Edit: pca.components_ is the matrix you can use to calculate the inverse of the PCA analysis, i.e. go back to the original dataset reference: https://stats.stackexchange.com/a/143949

In [55]:
weights = np.round(pca.components_, 3) 
ev = np.round(pca.explained_variance_ratio_, 3)
print('explained variance ratio',ev)
pca_wt = pd.DataFrame(weights)#, columns=all_data.columns)
pca_wt.head()
explained variance ratio [ 0.213  0.165  0.114  0.076  0.074  0.061  0.048  0.044  0.033  0.028
  0.026  0.021  0.01   0.009  0.006  0.006  0.004  0.004  0.003  0.003
  0.003  0.002  0.002  0.002  0.002  0.002  0.002  0.002  0.001  0.001
  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001
  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001]
Out[55]:
0 1 2 3 4 5 6 7 8 9 ... 211 212 213 214 215 216 217 218 219 220
0 -0.069 0.671 -0.008 -0.003 0.014 -0.007 0.028 -0.618 -0.122 0.069 ... 0.000 -0.000 -0.000 -0.000 0.000 0.000 0.000 0.000 -0.000 -0.000
1 -0.028 -0.597 -0.003 -0.003 -0.015 -0.009 0.018 -0.349 0.020 0.029 ... -0.005 0.002 -0.000 -0.000 0.001 0.000 0.000 -0.005 0.001 0.001
2 0.086 -0.375 -0.000 0.002 -0.001 0.010 -0.013 -0.499 -0.169 0.062 ... 0.017 0.001 -0.000 -0.000 -0.000 0.000 -0.001 0.017 -0.000 -0.016
3 -0.008 -0.067 -0.006 -0.001 -0.006 -0.002 -0.016 -0.205 0.220 0.008 ... -0.003 -0.001 0.000 0.000 -0.000 -0.000 0.000 -0.002 -0.000 0.004
4 0.004 0.001 -0.001 -0.001 0.002 -0.002 -0.022 0.016 -0.097 -0.002 ... 0.001 -0.003 -0.000 -0.000 0.000 0.000 -0.000 0.002 0.001 0.001

5 rows × 221 columns

The shape values are the number of columns in the PCA x the number of original columns

In [56]:
pca_wt.shape 
Out[56]:
(50, 221)
In [57]:
#Correlation map to see how features are correlated with each other and with SalePrice
corrmat = pd.DataFrame(all_data_pca).corr(method='kendall')
plt.subplots(figsize=(12,9))
plt.title("Kendall's Correlation Matrix PCA applied", fontsize=16)
sns.heatmap(corrmat, vmax=0.9, square=True)


#Correlation map to see how features are correlated with each other and with SalePrice
corrmat = train.corr(method='kendall')
plt.subplots(figsize=(12,9))
plt.title("Kendall's Correlation Matrix Initial Train Set", fontsize=16)
sns.heatmap(corrmat, vmax=0.9, square=True)
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fdadf260080>

As you can see, the PCA analysis did its job, the features show little correlation now

Getting the new train and test sets.

In [58]:
print(type(all_data))
print(type(pd.DataFrame(all_data_pca)))
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

Testing the new PCA dataset for analysis - RMSE looks works after PCA is applied, need to look at the Kaggle score later and see if it correlates, could be a mistake to use PCA on categorical dummy data. However, XGB is better with PCA n=50 option. Maybe use a heavier weight for that portion, or use all_data_pca only on that model...

In [59]:
use_pca = 0 # using PCA currently hurts the score

if (use_pca == 1):
    all_data_pca = pd.DataFrame(all_data_pca)
    train = all_data_pca[:ntrain]
    test = all_data_pca[ntrain:]
    all_data_pca.head()
else:
    train = all_data[:ntrain]
    test = all_data[ntrain:]

Modelling

Import librairies

In [60]:
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

Define a cross validation strategy

We use the cross_val_score function of Sklearn. However this function has not a shuffle attribut, we add then one line of code, in order to shuffle the dataset prior to cross-validation

In [ ]:
replace cross_val_score() with cross_validate()
# reference: https://scikit-learn.org/stable/modules/cross_validation.html

>>> from sklearn.metrics import make_scorer
>>> scoring = {'prec_macro': 'precision_macro',
...            'rec_macro': make_scorer(recall_score, average='macro')}
>>> scores = cross_validate(clf, X, y, scoring=scoring,
...                         cv=5, return_train_score=True)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_prec_macro', 'test_rec_macro',
 'train_prec_macro', 'train_rec_macro']
>>> scores['train_rec_macro']
array([0.97..., 0.97..., 0.99..., 0.98..., 0.98...])

Kfold is useful for thorough testing of a model, will give a more accurate score based on remove some data test on the remaining and change the data removed each time. See image below for details:

image.png

In [61]:
#Validation function
n_folds = 10 # was 5 => better score but twice as slow now

def rmsle_cv(model):
    print("running rmsle_cv code")
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    # other scores: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    rmse= np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv = kf)) # also r2
    print("raw rmse scores for each fold:", rmse)
    return(rmse)

Base models

  • LASSO Regression :

This model may be very sensitive to outliers. So we need to made it more robust on them. For that we use the sklearn's Robustscaler() method on pipeline

In [62]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
  • Elastic Net Regression :

again made robust to outliers

In [63]:
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
  • Kernel Ridge Regression :
In [64]:
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
  • Gradient Boosting Regression :

With huber loss that makes it robust to outliers

In [65]:
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)
  • XGBoost :
In [66]:
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =7, nthread = -1)
  • LightGBM :
In [67]:
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

Base models scores

Let's see how these base models perform on the data by evaluating the cross-validation rmsle error

In [68]:
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
running rmsle_cv code
raw rmse scores for each fold: [  7.19684436   7.83096536   7.97772403   8.93177266  10.49682815
   8.04675617   8.28662615   7.599477     8.42205722   8.54552938]

Lasso score: 8.3335 (0.8594)

In [69]:
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
running rmsle_cv code
raw rmse scores for each fold: [  7.15904091   7.78193025   7.91408931   8.91373775  10.45940089
   7.99418817   8.23189536   7.57711627   8.36993344   8.51012054]
ElasticNet score: 8.2911 (0.8618)

In [70]:
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
running rmsle_cv code
raw rmse scores for each fold: [ 7.03427211  7.22688049  7.00275119  8.39527856  9.65371154  7.55943726
  7.81966482  7.01846884  7.04988685  7.83070123]
Kernel Ridge score: 7.6591 (0.7984)

In [71]:
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
running rmsle_cv code
raw rmse scores for each fold: [ 7.55827713  6.73463856  6.86203471  8.66038166  8.99847597  6.94307838
  6.97634313  6.63589994  7.07199744  8.48170664]
Gradient Boosting score: 7.4923 (0.8406)

In [72]:
score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
running rmsle_cv code
raw rmse scores for each fold: [ 7.21398482  6.73254564  6.98554565  8.28139022  9.39601867  7.13936985
  6.62671221  6.61027593  6.77791082  8.57770974]
Xgboost score: 7.4341 (0.9200)

In [73]:
score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))
running rmsle_cv code
raw rmse scores for each fold: [ 7.27522306  6.91808677  6.92575802  8.68239738  9.58161954  7.44332536
  7.15073174  6.60353745  7.28141389  8.85430731]
LGBM score: 7.6716 (0.9469)

Stacking models

Simplest Stacking approach : Averaging base models

We begin with this simple approach of averaging base models. We build a new class to extend scikit-learn with our model and also to leverage encapsulation and code reuse (inheritance)

Averaged base models class

In [74]:
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)   

Averaged base models score

We just average four models here ENet, GBoost, KRR and lasso. Of course we could easily add more models in the mix.

In [75]:
averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))

score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
running rmsle_cv code
raw rmse scores for each fold: [ 6.69022411  6.93785197  6.93195041  8.24602774  9.5626853   7.06876413
  7.3195536   6.81481109  7.34112387  7.87970299]
 Averaged base models score: 7.4793 (0.8342)

Wow ! It seems even the simplest stacking approach really improve the score . This encourages us to go further and explore a less simple stacking approch.

Less simple Stacking : Adding a Meta-model

In this approach, we add a meta-model on averaged base models and use the out-of-folds predictions of these base models to train our meta-model.

The procedure, for the training part, may be described as follows:

  1. Split the total training set into two disjoint sets (here train and .holdout )

  2. Train several base models on the first part (train)

  3. Test these base models on the second part (holdout)

  4. Use the predictions from 3) (called out-of-folds predictions) as the inputs, and the correct responses (target variable) as the outputs to train a higher level learner called meta-model.

The first three steps are done iteratively . If we take for example a 5-fold stacking , we first split the training data into 5 folds. Then we will do 5 iterations. In each iteration, we train every base model on 4 folds and predict on the remaining fold (holdout fold).

So, we will be sure, after 5 iterations , that the entire data is used to get out-of-folds predictions that we will then use as new feature to train our meta-model in the step 4.

For the prediction part , We average the predictions of all base models on the test data and used them as meta-features on which, the final prediction is done with the meta-model.

Another way to combine multiple models:

The function cross_val_predict is appropriate for: Visualization of predictions obtained from different models.

Model blending: When predictions of one supervised estimator are used to train another estimator in ensemble methods.

Faron

(Image taken from Faron)

kaz

Gif taken from KazAnova's interview

On this gif, the base models are algorithms 0, 1, 2 and the meta-model is algorithm 3. The entire training dataset is A+B (target variable y known) that we can split into train part (A) and holdout part (B). And the test dataset is C.

B1 (which is the prediction from the holdout part) is the new feature used to train the meta-model 3 and C1 (which is the prediction from the test dataset) is the meta-feature on which the final prediction is done.

Stacking averaged Models Class

In [76]:
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=10): # increasing this value should give a more accurate prediction, averaged over n_fold iterations
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models): # for each model passed in
            for train_index, holdout_index in kfold.split(X, y): # create train,test for the number of folds
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index]) # fit the model for this fold
                y_pred = instance.predict(X[holdout_index]) # predict values for this fold
                out_of_fold_predictions[holdout_index, i] = y_pred # think we either use all of these values as features later, or the mean value?
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new and only feature
        print("out_of_fold_predictions", out_of_fold_predictions)
        self.meta_model_.fit(out_of_fold_predictions, y) # need to see out_of_fold_predictions feature set
        return self
   
    #Do the predictions of all base models on the test data and use the averaged predictions as 
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)

Stacking Averaged models Score

To make the two approaches comparable (by using the same number of models) , we just average Enet, KRR and Gboost, then we add lasso as meta-model.

In [77]:
stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
                                                 meta_model = lasso)

score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score mean and std: {:.4f} ({:.4f})".format(score.mean(), score.std()))
running rmsle_cv code
raw rmse scores for each fold: [ 6.85001007  6.64341104  6.52040034  8.00855758  9.17740571  6.75796759
  6.8917827   6.43514757  6.78515951  7.78967574]
Stacking Averaged models score: 7.1860 (0.8277)

We get again a better score by adding a meta learner

Ensembling StackedRegressor, XGBoost and LightGBM

We add XGBoost and LightGBM to the StackedRegressor defined previously.

We first define a rmsle evaluation function

In [78]:
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

Final Training and Prediction

StackedRegressor:

In [79]:
stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
#stacked_pred = np.expm1(stacked_averaged_models.predict(test.values))

stacked_pred = inv_boxcox(stacked_averaged_models.predict(test.values), lam_l)
print(rmsle(y_train, stacked_train_pred))
4.08436669454

XGBoost:

In [80]:
model_xgb.fit(train, y_train)
xgb_train_pred = model_xgb.predict(train)
#xgb_pred = np.expm1(model_xgb.predict(test))
xgb_pred = inv_boxcox(model_xgb.predict(test), lam_l)
print(rmsle(y_train, xgb_train_pred))
1.32708218101

LightGBM:

In [81]:
model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)
#lgb_pred = np.expm1(model_lgb.predict(test.values))
lgb_pred = inv_boxcox(model_lgb.predict(test.values), lam_l)
print(rmsle(y_train, lgb_train_pred))
4.70958373675
In [82]:
'''RMSE on the entire Train data when averaging'''

print('RMSLE score on train data:')


stk = 0.70
xgb = 0.15
lgb = 0.15

print(rmsle(y_train,stacked_train_pred * stk + xgb_train_pred * xgb + lgb_train_pred * lgb ))
RMSLE score on train data:
4.05323713769

Ensemble prediction:

In [83]:
ensemble = stacked_pred*stk + xgb_pred*xgb + lgb_pred*lgb
In [84]:
print(lgb_pred)
print(xgb_pred)
print(stacked_pred)
print(ensemble)
[ 121761.01583231  158356.17942425  180909.50509846 ...,  157326.31615489
  120894.28993839  218781.60095824]
[ 129230.8515625  159499.234375   192326.65625   ...,  157675.484375
  113760.09375    213956.46875  ]
[ 120132.52392508  169295.06014787  188693.36182129 ...,  168010.88921857
  109330.69723372  219098.93293885]
[ 120239.79207658  169087.71304873  188651.85611534 ...,  167800.68935649
  109490.62714058  219044.33498692]

Create File for Submission to Kaggle

In [85]:
sub = pd.DataFrame()
sub['Id'] = test_ID
sub['SalePrice'] = ensemble
sub.to_csv('submission.csv',index=False)
In [86]:
sub.head()
Out[86]:
Id SalePrice
0 1461 120239.792
1 1462 169087.713
2 1463 188651.856
3 1464 193552.244
4 1465 194743.100
In [87]:
!pwd
/kaggle/working

In [88]:
ls -arlt
total 48

---------- 1 root root   263 Jun 26 07:42 __notebook_source__.ipynb

drwxr-xr-x 5 root root  4096 Jun 26 07:42 ../

drwxr-xr-x 2 root root  4096 Jun 26 08:03 ./

-rw-r--r-- 1 root root 34427 Jun 26 08:03 submission.csv

In [89]:
!head -10 submission.csv
Id,SalePrice

1461,120239.79207658161

1462,169087.71304872783

1463,188651.85611534293

1464,193552.2444678471

1465,194743.100180123

1466,173112.5025885276

1467,180265.70787655286

1468,162231.92584240084

1469,182156.82712075484

Issue with submit screen data not matching run/edit/save version screen
Trust the submit screen over the edit/save version screen. This seems to be not what is actually submitted.
wrong: https://www.kaggle.com/donaldst/stacked-regressions-part-2/edit/run/37370884
correct: https://www.kaggle.com/donaldst/stacked-regressions-part-2?scriptVersionId=37373730

1) #400 0.11543 Default notebook 2) #400 4.19420 BoxCox on Sale Price (forgot to revert back) 5) #400 0.11543 Back to default 6) #288 0.11513 Added YrMo Cat/Ord column 8) #270 0.11482 Use BoxCox1p([SalePrice], lam) and revert using inv_boxcox 10) #270 0.11631 use BoxCox([SalePrice]) which uses default lambda minimizing least-likelihood 13) 0.11482 revert back to BoxCox1p 14) 0.11485 change lam from .15 to .05 (better skew, similar kurtosis 15) #263 0.11465 change lam to .20 16) 0.11492 change lam to .25 17) #259 0.11461 change lam to .30 18) 0.11474 change lam to .40 (too far) 19) #217 0.11395 change lam to .35 (stick with this for now) 20) 0.11479 drop PoolQC column for missing values (column was accurate by replacing the missing with None) 22) #217 0.11546 put PoolQC back, drop MiscFeature 23) 0.11395* revert, no dropping other columns 25) #223 0.14229 add PCA(n=50) with no other adjustments 27) #223 0.17279 PCA(n=20) 28) 0.14292 revert to n=50 (noticed the score seems to change slightly even with the same settings 29) 0.15168 change proportions of the models to use the model with the lowest rmsle (XGBoost) more, from .70, 15, .15 to .15, .70, .15 => score is worse, decreasing rmsle is not an accurate method to get a better score 30) 0.11445 change max_depth of model_xgb from 3 to 2 (weaker learner) => increases the rmsle and has worse score 31) 0.11410 try max_depth=4 => still worse, looks like 3 is the best number 32) 0.11395 revert to ~ #23 33) #211 0.11356 change n_folds from 5 to 10 for more thorough testing (takes 15 mins to run now) 34) 0.11356 use default scoring in cross_val_score function, not neg_mean_squared_error => same score, is this line of code not used? 35) revert to #33 stacked=.70, xgb=.15, lgb=.15 38) 0.11409 stacked=.98, xgb=.01, lgb=.01 => worse 39) #216 0.11363 stacked=.70, xgb=.05, lgb=.25 => also worse 41) 0.11375 stacked=.70, xgb=.25, lgb=.05 => also worse, seems ensembling is better... 42) 0.11364 stacked=.60, xgb=.15, lgb=.25 => worse 43) 0.11366 stacked=.60, xgb=.20, lgb=.20 => worse, looks like the kernel already hsa the optimal mix 44) 0.11486 filter the other 4 columns => worse, seems this information is useful... 45) 0.11622 put 2 back, hmm. even worse now... 46) 0.11464 try other 2 instead - ready to submit, just wait for tomorrow 47) 0.11825 remove all filtering, reverting as any other combinations are worse, the original user must have already checked this 49) #176 0.11356 revert to #33 (think a few scores drpped off already, my score moved up even though it didn't improve from #33) 50) try StandardScaler() insteaf of RobustScaler()

to do: try pca on features => no improvement in score. will drop this for now change ratios of models to find which is best predictor (gives best score), then work on improving that model first change scale factor of 3 models => score does not correlate to rmsle, try to find a better metric try StandardScaler and np.clip on features, currenly see RobustScaler() being used but i already screened fliers so StandardScaler should be adequate more filtering of fliers use all_data_pca only on models that show improvement use GridsearchCV to find optimal parameters, however, need to find the best way to evaluate the scores. RMSLE doesn't seem to match well...

In [ ]: