Original notebook source: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
Original default score was 0.11543; new best score is 0.11356
How to Kaggle: https://www.youtube.com/watch?v=GJBOMWpLpTQ
June 2020
Need to submit every 2 months, as the leaderboard will roll over after this 2-month period
Typical flow of model building: use GridSearchCV to determine the best parameters => best_estimator_
Good basic resources for models: https://scikit-learn.org/stable/modules/cross_validation.html and https://scikit-learn.org/stable/modules/grid_search.html
From the scikit-learn docs, a search consists of:
an estimator (regressor or classifier such as sklearn.svm.SVC());
a parameter space;
a method for searching or sampling candidates;
a cross-validation scheme; and
a score function.
estimator.get_params() # get all hyperparameters
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

inner_cv = KFold(n_splits=4, shuffle=True, random_state=None)
# To be used in the outer CV loop
outer_cv = KFold(n_splits=4, shuffle=True, random_state=None)

# Inner-loop KFold example. rf and param_grid must be defined beforehand, e.g.:
#   rf = RandomForestClassifier()
#   param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10]}
gsc = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    #scoring='roc_auc',  # or 'r2', etc.
    scoring='f1',
    #scoring='neg_mean_squared_error',  # for other choices see
    # https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    #cv=5,
    cv=inner_cv,  # KFold splitting randomizes which rows land in each train/validation fold
    verbose=1
)
print("Running GridSearchCV:")
with MyTimer():  # MyTimer: a custom timing context manager defined elsewhere in the notebook
    grid_result = gsc.fit(X_train, y_train)
non_nested_scores[i] = grid_result.best_score_  # assumes an enclosing loop over trials, as in sklearn's nested CV example
rf = RandomForestClassifier(**grid_result.best_params_)  # rebuild the model with the best parameters
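The outer loop then evaluates the whole tuning procedure on held-out folds. A minimal sketch of nested CV (assuming the gsc, outer_cv, X_train, and y_train from above):

from sklearn.model_selection import cross_val_score

# Nested CV: cross_val_score clones gsc, so each outer fold runs its own
# inner GridSearchCV. The outer score therefore estimates the generalization
# of the tuning procedure itself, not just of one fitted model.
nested_scores = cross_val_score(gsc, X=X_train, y=y_train, cv=outer_cv)
print("Nested CV f1: {:.4f} +/- {:.4f}".format(nested_scores.mean(), nested_scores.std()))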
Comments from the original author:
This competition is very important to me as it helped me begin my journey on Kaggle a few months ago. I've read some great notebooks here. To name a few:
Comprehensive data exploration with Python by Pedro Marcelino: great and very motivational data analysis
A study on Regression applied to the Ames dataset by Julien Cohen-Solal: thorough feature engineering and a deep dive into linear regression analysis, yet really easy for beginners to follow.
Regularized Linear Models by Alexandru Papiu: great starter kernel on modelling and cross-validation
I can't recommend enough that every beginner go carefully through these kernels (and of course through many other great kernels) to get their first insights into data science and Kaggle competitions.
After that (and some basic practice) you should be more confident going through this great script by Human Analog, who did impressive work on feature engineering.
As the dataset is particularly handy, I decided a few days ago to get back into this competition and apply the things I've learnt so far, especially stacking models. For that purpose, we build two stacking classes (the simplest approach and a less simple one).
As these classes are written for general purposes, you can easily adapt and/or extend them for your own regression problems. The overall approach is hopefully concise and easy to follow.
The feature engineering is rather parsimonious (at least compared to some other great scripts). It is pretty much:
Imputing missing values by proceeding sequentially through the data
Transforming some numerical variables that are really categorical
Label encoding some categorical variables that may contain information in their ordering
Box-Cox transformation of skewed features (instead of log-transformation): this gave me a slightly better result both on the leaderboard and in cross-validation.
Getting dummy variables for categorical features (a short sketch of the last two steps follows this list).
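A minimal sketch of those last two steps (assuming a combined train/test frame like the all_data built later in this notebook; the 0.75 skew threshold and lam=0.15 are illustrative choices, not tuned values):

import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew

# Illustrative values, not this notebook's tuned settings:
skew_threshold = 0.75
lam = 0.15

# Find the most skewed numeric features...
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewness = all_data[numeric_feats].apply(lambda x: skew(x.dropna()))
skewed_feats = skewness[abs(skewness) > skew_threshold].index

# ...apply the Box-Cox(1+x) transform to them, then one-hot encode the categoricals
for feat in skewed_feats:
    all_data[feat] = boxcox1p(all_data[feat], lam)
all_data = pd.get_dummies(all_data)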
Then we choose many base models (mostly sklearn-based models plus the sklearn APIs of DMLC's XGBoost and Microsoft's LightGBM) and cross-validate them on the data before stacking/ensembling them. The key here is to make the (linear) models robust to outliers (see the sketch below); this improved the result both on the LB and in cross-validation.
To my surprise, this does well on the LB (0.11420 and top 4% the last time I tested it: July 2, 2017).
Hopefully, by the end of this notebook, stacking will be clear for those who, like myself, found the concept not so easy to grasp.
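One common way to realize that robustness (a sketch mirroring the pipelines built later in this notebook; the alpha value is illustrative rather than a tuned setting):

from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# RobustScaler centers and scales using the median and IQR, so any remaining
# outliers distort a linear model far less than mean/std scaling would.
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))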
#import some necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)
from scipy import stats
from scipy.stats import norm, skew, kurtosis, boxcox #for some statistics
from scipy.special import boxcox1p, inv_boxcox
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8")) #check the files available in the directory
#Now let's import and put the train and test datasets in pandas dataframe
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
##display the first five rows of the train dataset.
train.head(5)
##display the first five rows of the test dataset.
test.head(5)
#check the numbers of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))
#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']
#Now drop the 'Id' colum since it's unnecessary for the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
#check again the data size after dropping the 'Id' variable
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape))
print("The test data size after dropping Id feature is : {} ".format(test.shape))
Documentation for the Ames Housing data indicates that there are outliers present in the training data.
Let's explore these outliers.
fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
m, b = np.polyfit(train['GrLivArea'], train['SalePrice'], 1)
#m = slope, b=intercept
plt.plot(train['GrLivArea'], m*train['GrLivArea'] + b)
plt.show()
Edit: look for outliers in the other columns as well
#Edit: scatter every feature against SalePrice to scan for outliers
n_features = train.shape[1]
ncols = 4
nrows = int(np.ceil(n_features / ncols))
i = 0
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 60))
for row in ax:
    for col in row:
        try:
            col.scatter(x=train[train.columns[i]], y=train['SalePrice'])
            col.title.set_text(train.columns[i])
        except Exception:
            pass  # skip columns that cannot be scatter-plotted (and any surplus axes)
        i = i + 1
plt.show()
More filtering of data to try
#train = train.drop(train[(train['LotFrontage']>300) & (train['SalePrice']<400000)].index)
#train = train.drop(train[(train['LotArea']>100000) & (train['SalePrice']<400000)].index)
#train = train.drop(train[(train['BsmtFinSF1']>4000) & (train['SalePrice']<250000)].index)
#train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<250000)].index)
We can see at the bottom right two points with extremely large GrLivArea that sold at a low price. These values are huge outliers, so we can safely delete them.
Edit: this deletion seems safe to do as there are only 2 points and they appear abnormal, possibly data errors
#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
m, b = np.polyfit(train['GrLivArea'], train['SalePrice'], 1)
#m = slope, b=intercept
plt.plot(train['GrLivArea'], m*train['GrLivArea'] + b)
plt.show()
Outlier removal is not always safe. We decided to delete these two because they are extreme and clearly bad (very large areas for very low prices).
There are probably other outliers in the training data. However, removing them all may badly affect our models if there are also outliers in the test data. That's why, instead of removing them all, we will just make some of our models robust to them. You can refer to the modelling part of this notebook for that.
Edit: todo - look for other outliers in all columns. It may be better to use StandardScaler() and np.clip, as sketched below.
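A minimal sketch of that idea (illustrative only; the +/-3 standard deviation clipping bound is an assumed choice, not a tuned value):

from sklearn.preprocessing import StandardScaler

# Standardize the numeric feature columns, then clip anything beyond +/-3
# standard deviations instead of dropping rows (the bound of 3 is an assumption).
numeric_cols = train.select_dtypes(include=[np.number]).columns.drop('SalePrice')  # leave the target alone
scaler = StandardScaler()
scaled = scaler.fit_transform(train[numeric_cols])
clipped = np.clip(scaled, -3, 3)
# Map the clipped values back to the original scale to keep the columns interpretable
train[numeric_cols] = scaler.inverse_transform(clipped)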
SalePrice is the variable we need to predict. So let's do some analysis on this variable first.
sns.distplot(train['SalePrice'] , fit=norm)
_skew = skew(train['SalePrice'])
_kurtosis = kurtosis(train['SalePrice'])
# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}, skew = {:.2f} kurtosis = {:.2f}\n'.format(mu, sigma, _skew, _kurtosis))
#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
The target variable is right-skewed. As (linear) models love normally distributed data, we need to transform this variable and make it more normally distributed.
Log-transformation of the target variable
Edit: also trying the Box-Cox transform. The Box-Cox transform is given by:
y = (x**lmbda - 1) / lmbda, for lmbda != 0
log(x), for lmbda = 0
(boxcox1p applies the same transform to 1 + x.)
#We use the numpy function log1p which applies log(1+x) to all elements of the column
# Option 1
#train["SalePrice"] = np.log1p(train["SalePrice"])
# Option 2: Box-Cox(1+x) transform with a fixed lambda - note: an earlier test found this performed worse than log(1+x)
lam_l = 0.35
train["SalePrice"] = boxcox1p(train["SalePrice"], lam_l)
# Option 3: boxcox letting the algorithm select lmbda by maximum-likelihood estimation
#train["SalePrice"], lam_l = boxcox(x=train["SalePrice"], lmbda=None)
# To revert option 3 use inv_boxcox: train["SalePrice"] = inv_boxcox(train["SalePrice"], lam_l)
# To revert option 2 (boxcox1p) use scipy.special.inv_boxcox1p, i.e. pred_y = np.power((y_box * lam_l) + 1, 1 / lam_l) - 1
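A quick round-trip check of that inverse (a sketch; inv_boxcox1p is scipy's built-in inverse of boxcox1p):

import numpy as np
from scipy.special import boxcox1p, inv_boxcox1p

x = np.array([100000.0, 200000.0, 300000.0])  # example sale prices
lam_l = 0.35
y_box = boxcox1p(x, lam_l)           # forward: ((1 + x)**lam - 1) / lam
x_back = inv_boxcox1p(y_box, lam_l)  # built-in inverse
x_manual = np.power((y_box * lam_l) + 1, 1 / lam_l) - 1  # the same formula by hand
assert np.allclose(x, x_back) and np.allclose(x, x_manual)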
#Check the new distribution
sns.distplot(train['SalePrice'] , fit=norm);
_skew = skew(train['SalePrice'])
_kurtosis = kurtosis(train['SalePrice'])
# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}, skew = {:.2f} kurtosis = {:.2f}\n'.format(mu, sigma, _skew, _kurtosis))
#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
The skew now seems corrected and the data appears more normally distributed.
Edit: both distributions have a positive (excess) kurtosis, meaning they are more sharply peaked and heavier-tailed than a normal distribution. Skewness is close to 0 now, so by that metric the data is closer to a normal distribution
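A small sanity check of that reading (illustrative; scipy's kurtosis uses the Fisher/excess definition, which is 0 for a normal distribution):

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)
heavy_tailed = rng.standard_t(df=5, size=100_000)  # Student-t: heavier tails than normal

print(skew(normal_sample), kurtosis(normal_sample))  # both near 0
print(skew(heavy_tailed), kurtosis(heavy_tailed))    # kurtosis clearly positive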
Let's first concatenate the train and test data in the same dataframe.
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
all_data.head()
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
Data Correlation
A cross-correlation chart to view collinearity within the features. Kendall's correlation seems appropriate here, as the output variable is numerical and much of the input is categorical/ordinal. (There are reference charts that guide you to which correlation method to use based on your dataset's types.) A sketch of such a chart follows.
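A minimal sketch of that chart (assuming the train frame from above; Kendall's tau is computed only over the numeric columns here):

# Kendall rank correlation across the numeric columns, drawn as a heatmap
numeric_train = train.select_dtypes(include=[np.number])
corr = numeric_train.corr(method='kendall')
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corr, vmax=0.9, square=True)
plt.show()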