Kaggle Housing challenge, my take

In this article, I’m doing the Kaggle Housing challenge, which is probably the second most popular after Titanic. This was very much a “keeping track of what I’m doing for my own learning” thing, but by the end I’d gotten a ranking of 178/5419 on the public leaderboard (LB). That said, it’s super long because I tried a million things, so it’s essentially a full log of my workflow on this problem.

I’ve really learned a bunch from going through this carefully. What I did here was first try the few techniques I knew when I started, and then look at notebooks/kernels for this challenge on Kaggle. A word on these kernels: even the top-rated ones vary immensely in quality. Some are excellently explained, and you can tell the authors experimented to get an optimal result. Others are clearly people trying random stuff they’ve heard of, misapplying relatively basic techniques, and even copying code from other kernels. So I treated them as loose suggestions and guideposts for techniques.

This challenge is a good one. In contrast to the Titanic’s classification, this is a regression for the label “SalePrice”, the price a given property sold for. Another key characteristic is that it has 80 features, which is a large number (for me, anyway), and a decent number of them have missing values (NAs).

I’m evaluating models by sometimes looking at the value returned by the score() function of sklearn regressors (which is the R², so at most 1.0, and higher is better), and then towards the end only looking at the root mean squared logarithmic error (RMSLE), for which lower is better.
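(For reference, the rmsle() helper I call throughout is just the root mean squared error of the log1p’d values; a minimal sketch, which may differ slightly from my actual helper:)

import numpy as np

def rmsle(y_true, y_pred):
    # root mean squared logarithmic error; lower is better
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred))**2))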

Starting off:

Using literally only the feature “GrLivArea”, the one most obviously linearly related to SalePrice:

import numpy as np
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# start with a single feature plus the label
TTdat = train_df[["GrLivArea","SalePrice"]]
display(TTdat.isnull().sum())

X_train, X_test, y_train, y_test = train_test_split(TTdat.drop("SalePrice",axis=1), 
                                                    TTdat["SalePrice"], test_size=0.3, random_state=42)

# plain linear regression
lm = linear_model.LinearRegression()
lm.fit(X_train,y_train)
print("\nLM score: {}".format(lm.score(X_test,y_test)))

print("\n\n\tLM RMSLE: {}".format(rmsle(np.array(y_test),np.array(lm.predict(X_test)))))

# random forest regressor
rfr = RandomForestRegressor(n_estimators=300)
rfr.fit(X_train,y_train)

print("\n\nRFR score: {}".format(rfr.score(X_test,y_test)))

print("\n\n\tRFR RMSLE: {}".format(rmsle(np.array(y_test),np.array(rfr.predict(X_test)))))

We get:

LM score: 0.5507664896759454

  LM RMSLE: 0.2738983861120738

RFR score: 0.4520886446608937

  RFR RMSLE: 0.302943758322113

That’s not great, obviously. But adding only a few more numerical features that are obviously linear:

TTdat = train_df[["OverallCond","LotArea","GrLivArea","OverallQual","YearBuilt","SalePrice"]]

We get:

LM score: 0.7779114744671904

  LM RMSLE: 0.2335872532120071

RFR score: 0.8661858933432334

  RFR RMSLE: 0.15824766173991986

So we’re getting an R² of ~0.78 by using the simplest model there is, with no feature engineering or cleaning, and ~0.87 by just using a more flexible model (RFR).

At this point I’d like to note that the RMSLE is ~0.16. If you look at the leaderboard, there are a few at the top with a score of 0.0000, which are almost certainly cheaters (a classic way is to find the original dataset somewhere and then just submit the literal answers). Then, there are some between 0.01 and 0.1 that I don’t know enough about Kaggle/this competition to know whether or not they’re cheating the system somehow. But the part I want to highlight is that, if you look at #8 downwards, there’s a long plateau of people getting about 0.1 RMSLE, so whether or not the higher ranked ones are BS we can probably safely assume 0.1 is a real score to aim for.

That said, I think it’s also kinda funny that literally the simplest stuff gets you relatively close to a top-bracket score (maybe even better, depending on which of the above are BS) (also, I should point out that RMSLE is a little deceptive; 0.16 is further from 0.11 than you might think). That could make you scoff at ML/DS as a field, but I try to keep in mind that a) this is a pretty simple dataset, and real life is probably messier, and b) the gains at the top of lots of pursuits have very diminishing returns, but they’re still worth it; making a much more complex model to get a 4% improvement may seem silly, but it’s probably worth it if it’s 4% of a country’s GDP or something.

Next, as a tiny improvement, I made a function that takes all the numeric (type int and float) features, fits a line to each one of them separately (against the SalePrice label), and then returns a list of them sorted by descending R² value. This lets me pick the top N “most linear” features:
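Here’s a minimal sketch of what such a helper might look like (using scipy’s linregress; the details are illustrative and my actual function may differ):

from scipy.stats import linregress

def rankFeatsByLinearity(df, label="SalePrice"):
    # fit label vs. each numeric feature separately and record the R^2 of that line
    numFeats = df.select_dtypes(include=["int64", "float64"]).columns.drop(label)
    r2 = {}
    for feat in numFeats:
        sub = df[[feat, label]].dropna()   # ignore rows where this feature is missing
        r2[feat] = linregress(sub[feat], sub[label]).rvalue**2
    return sorted(r2.items(), key=lambda kv: kv[1], reverse=True)

topFeats = [feat for feat, r2 in rankFeatsByLinearity(train_df)[:20]]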

Plotting the sorted R² values, you can see they drop off pretty steeply after the first few. Using these (I chose the top 20 here) in the same LM and RFR models gives:

LM score: 0.8069334646105912

  LM RMSLE: 0.16820072580858902

RFR score: 0.8859240899628174

  RFR RMSLE: 0.147238869323511

Definitely an improvement, but veryyy incremental. So I probably have to try other stuff to get that sweet, juicy 0.1 RMSLE. The two things that come to mind are 1) actually using the categorical features, since this is based only on the numerical ones (though I should point out that some of the int-typed ones kind of seem like categorical features, because they aren’t really continuous), and 2) feature engineering on the numerical features (e.g., some of them are clearly more curved, which would make a linear fit inaccurate).

_________________________________________________________________________________________________________________________________

 

So at this point, I’ve read a few kernels, and here’s what I’ve figured out. For the most part (one (possible?) exception), they’re not actually doing much clever feature engineering.

The first notable thing I didn’t do above is carefully read the “data_description.txt” file they give you, which explains each feature and its meaning. The reason this matters is that some categorical features legitimately take the string value “NA”. However, pandas reads these in as NaN, which is usually what you get in a DataFrame when a cell is actually missing.

So, when you check for which features have the most nulls:

nullCount = pd.DataFrame(train_df.isnull().sum(),columns=["nullcount"]).sort_values(by="nullcount",ascending=False)
display(nullCount.head(25))

you can see that some of them have a lot of nulls. Because I wasn’t being careful before and didn’t read the data description, I just assumed these NaNs were actual missing cells, not the value “NA”, which has a different meaning:

Though it might seem equivalent (“NA” here clearly means “not applicable”, since “Basement quality” can’t really be rated if there isn’t a basement), it’s actually not the same thing: NA contains actual information (there isn’t a basement), whereas NaN is a lack of information (“we just don’t know about this field”).

So, we replace all the Na’s for these features (the ones that have NA as a value) with a more explicit “None” string:

allFeats_df[["Alley","BsmtQual","BsmtCond","BsmtExposure",
                     "BsmtFinType1","BsmtFinType2","FireplaceQu","GarageType","GarageFinish","GarageQual",
                    "GarageCond","PoolQC","Fence","MiscFeature"]] = allFeats_df[["Alley","BsmtQual","BsmtCond","BsmtExposure",
                     "BsmtFinType1","BsmtFinType2","FireplaceQu","GarageType","GarageFinish","GarageQual",
                    "GarageCond","PoolQC","Fence","MiscFeature"]].fillna(value="None")

This actually isn’t perfect: if any of these features also had genuinely missing cells, we just filled those in with the “None” string when they probably should have stayed NaN. However, a brief glance at the spreadsheet showed very few empty cells for these features, and as we’ll see in a second, actual NaNs would just get filled in anyway.

Now, looking at the nullcount should only show us features that have actual real NaN values:

nullCount = pd.DataFrame(allFeats_df.isnull().sum(),columns=["nullcount"]).sort_values(by="nullcount",ascending=False)
display(nullCount.head(10))

There are lots of ways to handle this. If a given feature was missing lots of data, and didn’t seem particularly useful, I might just drop it. However, there’s clearly some trend in the top one, LotFrontage:

and there may be in others too. Another thing you could do is, instead of dropping the whole feature, just drop the rows with NaN values. However, that would drop 486 rows for LotFrontage, which is a large portion of our data. Definitely the most common thing to do is to fill in the values. People usually seem to fill in numerical features with the median. I haven’t actually seen what people do for categorical features, but I fill them with the most common value (the mode). I made this handy function that reads the data type and does the appropriate thing:

def replaceAllMissingNaByType(df):
    for feat in df.columns.values:
        # numeric features: fill NaN with the median
        if (df.dtypes[feat]=="float" or df.dtypes[feat]=="int") and df[feat].isnull().sum()!=0:
            print("{} type: {}".format(feat,df.dtypes[feat]))
            df[feat] = df[feat].fillna(df[feat].median())
            print("filling {} with median value {}".format(feat,df[feat].median()))
        # categorical (object) features: fill NaN with the most common value
        if df.dtypes[feat]=="object" and df[feat].isnull().sum()!=0:
            print("{} type: {}".format(feat,df.dtypes[feat]))
            df[feat] = replaceNaByMostCommon(df,feat)
    return(df)

def replaceNaByMostCommon(df,feat):
    mostCommon = df[feat].value_counts().index.tolist()[0]
    print("filling {} with most common value {}".format(feat,mostCommon))
    df[feat] = df[feat].fillna(mostCommon)
    return(df[feat])

replaceAllMissingNaByType(allFeats_df)

nullCount = pd.DataFrame(allFeats_df.isnull().sum(),columns=["nullcount"]).sort_values(by="nullcount",ascending=False)
display(nullCount.head(10))

Now we can see that there are no nulls left:

One last thing to notice: a couple of features have an annoying quirk where the field is numeric even though, if you look at the data description, the numbers actually encode something else:

Clearly these are really categorical, not numerical, so we wouldn’t want to pass them to (most) models as numbers. Taking a look at the data:

There’s clearly info in this feature that would be lost, because there’s no trend in its numerical form, but there are definitely differences in the distributions for different values. As for the other feature with this property, eh, I dunno if we’re losing much:

Anyway, we make these categorical by converting each feature value into a string and prepending the feature name, to make sure it doesn’t get turned back into a number at some point (I dunno, pandas might do that or something):

def replaceFeatWithStrings(df,feat):
    df[feat] = df[feat].apply(lambda x:(feat+"_"+str(x)))
    return(df)

allFeats_df = replaceFeatWithStrings(allFeats_df,"MSSubClass")
allFeats_df = replaceFeatWithStrings(allFeats_df,"MoSold")

display(allFeats_df[["MSSubClass","MoSold"]])

Lastly, we do One-Hot Encoding (OHE), which converts all the categorical features (I didn’t realize you could apply it to the whole DataFrame and it would pick out the categorical columns!):

allFeatsOHE_df = pd.get_dummies(allFeats_df)

display(allFeatsOHE_df.shape)

Due to the OHE, the DataFrame now has the shape (2919, 329), so 329 features.

Anyway, like I mentioned above, at this point, we haven’t done much actual feature engineering (in my opinion), just cleaning. But that’s still really important and clearly there were a lot of things I missed on my first pass!

There’s actually one piece of what I think could be called engineering that turns out to be crucial. At this point, if you plug in what we have and try both plain Linear Regression (LR) and L1-regularized LR (L1), you get a strange negative test score for LR (I should figure out what’s happening with that…) and generally poor test error for both LR and L1:

X_train = allFeatsOHE_df[:trainLength]
X_submit = allFeatsOHE_df[trainLength:]

print("shape of X_train: {}".format(X_train.shape))
print("shape of X_submit: {}".format(X_submit.shape))

X_tr, X_test, y_tr, y_test = train_test_split(X_train, y_train, test_size=0.3, random_state=42)

print("shape of X_tr: {}".format(X_tr.shape))
print("shape of X_test: {}".format(X_test.shape))
print("shape of y_tr: {}".format(y_tr.shape))
print("shape of y_test: {}".format(y_test.shape))


lr = LinearRegression()
lr.fit(X_tr, y_tr)


print("\nLR train score: {}".format(lr.score(X_tr,y_tr)))
print("\nLR test score: {}".format(lr.score(X_test,y_test)))

print("\n\tLR train RMSLE: {}".format(rmsle(np.array(y_tr),np.array(lr.predict(X_tr)))))
print("\n\tLR test RMSLE: {}".format(rmsle(np.array(y_test),np.array(lr.predict(X_test)))))


lasso = LassoCV(alphas = [.0001, .0003, .0006, .001, .003, .006, .01, .03, .06, .1, .3, .6, 1, 3, 6], max_iter = 5000, cv = 10)
lasso.fit(X_tr,y_tr)
bestAlpha1 = lasso.alpha_
print("best alpha for first run: {}".format(bestAlpha1))

distOut = .6
steps = 30
newAlphaList = [bestAlpha1*(1+i) for i in np.linspace(-distOut,distOut,steps)]

lasso = LassoCV(alphas=newAlphaList, max_iter=5000, cv=10)
lasso.fit(X_tr,y_tr)
bestAlpha2 = lasso.alpha_
print("best alpha for finer run: {}".format(bestAlpha2))


print("\nL1 train score: {}".format(lasso.score(X_tr,y_tr)))
print("\nL1 test score: {}".format(lasso.score(X_test,y_test)))

print("\n\tL1 train RMSLE: {}".format(rmsle(np.array(y_tr),np.array(lasso.predict(X_tr)))))
print("\n\tL1 test RMSLE: {}".format(rmsle(np.array(y_test),np.array(lasso.predict(X_test)))))

# Plot important coefficients
coefs = pd.Series(lasso.coef_, index = X_tr.columns)
print("\nLasso picked " + str(sum(coefs != 0)) + " features and eliminated the other " +  \
      str(sum(coefs == 0)) + " features")
imp_coefs = pd.concat([coefs.sort_values().head(10),
                     coefs.sort_values().tail(10)])
imp_coefs.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")
plt.show()
LR train score: 0.9420562612988085

LR test score: -0.054035905520274596

  LR train RMSLE: 0.10046012794730877

  LR test RMSLE: 0.19075749114106955

L1 train score: 0.9403517450852872

L1 test score: 0.8242730131005287

  L1 train RMSLE: 0.10158786351457423

  L1 test RMSLE: 0.15753933510781765

(I’ll get to the stuff about the models themselves in a minute.)

Anyway, one thing it seems like most tutorials I’ve seen do is a log transform, by which I mean taking log(1+x) of the SalePrice label, which has the effect of “unskewing” the distribution if it’s skewed (the blue line is a smoothed fit to the histogram, the black line is a fitted Gaussian):

y_train_log = np.log1p(y_train)

sns.distplot(y_train, fit=norm)
plt.show()
sns.distplot(y_train_log, fit=norm)
plt.xlabel('log(1 + SalePrice)')
plt.show()

The log1p one is definitely closer to a Gaussian, which is cool. Why is this supposed to help? I…don’t know, I should figure that out. All I’ve read so far is that “models work better with normally distributed data”. One thing I realized is that, to get answers in actual SalePrice dollars, we need to undo the transform on the models’ predictions (since they were trained on log1p(SalePrice) labels), i.e., apply the inverse function exp(x) – 1. Fortunately, numpy has that: np.expm1(). Yay!

Now, using the log1p(SalePrice) label instead (from now on):

lr = LinearRegression()
lr.fit(X_tr, y_tr)

print("\nLR train score: {}".format(lr.score(X_tr,y_tr)))
print("LR test score: {}".format(lr.score(X_test,y_test)))

print("\n\tLR train RMSLE: {}".format(rmsle(np.expm1(np.array(y_tr)),np.expm1(np.array(lr.predict(X_tr))))))
print("\tLR test RMSLE: {}".format(rmsle(np.expm1(np.array(y_test)),np.expm1(np.array(lr.predict(X_test))))))
LR train score: 0.9504707390629588
LR test score: 0.8753800474440097

  LR train RMSLE: 0.0876197228265237
  LR test RMSLE: 0.14539951846487162

Damn, getting ~0.88 and an RMSLE of 0.145 on the test set, with plain ol’ LR! That’s better than what we got from the RFR before. One thing to note is that, comparing the test to the train errors, it’s definitely overfitting to the train data. This makes sense, because we have so many goddamn features and no regularization; it will use all those features as much as possible to fit the train data even if it doesn’t generalize.

The next few models use regularization.

L1 (LASSO):

lasso = LassoCV(alphas = [.0001, .0003, .0006, .001, .003, .006, .01, .03, .06, .1, .3, .6, 1, 3, 6], max_iter = 50000, cv = 10)
lasso.fit(X_tr,y_tr)
bestAlpha1 = lasso.alpha_
print("best alpha for first run: {}".format(bestAlpha1))

distOut = .6
steps = 30
newAlphaList = [bestAlpha1*(1+i) for i in np.linspace(-distOut,distOut,steps)]

lasso = LassoCV(alphas=newAlphaList, max_iter=50000, cv=10)
lasso.fit(X_tr,y_tr)
bestAlpha2 = lasso.alpha_
print("best alpha for finer run: {}".format(bestAlpha2))

print("\nL1 train score: {}".format(lasso.score(X_tr,y_tr)))
print("L1 test score: {}".format(lasso.score(X_test,y_test)))

print("\n\tL1 train RMSLE: {}".format(rmsle(np.expm1(np.array(y_tr)),np.expm1(np.array(lasso.predict(X_tr))))))
print("\tL1 test RMSLE: {}".format(rmsle(np.expm1(np.array(y_test)),np.expm1(np.array(lasso.predict(X_test))))))

# Plot important coefficients
coefs = pd.Series(lasso.coef_, index = X_tr.columns)
print("\nLasso picked " + str(sum(coefs != 0)) + " features and eliminated the other " +  \
      str(sum(coefs == 0)) + " features")
imp_coefs = pd.concat([coefs.sort_values().head(10),
                     coefs.sort_values().tail(10)])
imp_coefs.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")
plt.show()
L1 train score: 0.9352871379332616
L1 test score: 0.9122346091469562

  L1 train RMSLE: 0.1001535286506382
  L1 test RMSLE: 0.12201994448365171

Wow, 0.91 and 0.12! Clearly regularization is important. Note that you need to set the regularization parameter alpha (or, I think there’s an option to let it choose the alphas itself, but in the tutorials I saw, people seemed to set them manually). What I’m doing here (which I saw others do) is give it a list of alphas, find the one with the lowest CV error, and then search a little around that value to “fine tune” it.

L1 regularizes by adding the sum of the absolute values of the feature weights to the loss. It has a tendency to set the weights of unimportant features exactly to 0. We can plot the coefficients it found, showing the first and last 10 when sorted. Keep in mind that very negative is just as important as very positive; the magnitude is what matters. We can see that it got rid of 183 features.

One last thing I’ll note: the max_iter parameter caps the number of coordinate-descent iterations. It’s 50,000 here and the fit runs in a second or two. When I originally ran this on the non-log1p SalePrice, it simply wouldn’t finish in a reasonable time with max_iter=50000; I had to drop it to 5000 to get a result at all, and even that took on the order of minutes.

 

L2 (Ridge regularization):

ridge = RidgeCV(alphas = [0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6, 10, 30, 60])
ridge.fit(X_tr,y_tr)
bestAlpha1 = ridge.alpha_
print("best alpha for first run: {}".format(bestAlpha1))

distOut = .6
steps = 30
newAlphaList = [bestAlpha1*(1+i) for i in np.linspace(-distOut,distOut,steps)]

ridge = RidgeCV(alphas=newAlphaList)
ridge.fit(X_tr,y_tr)
bestAlpha2 = ridge.alpha_
print("best alpha for finer run: {}".format(bestAlpha2))

print("\nL2 train score: {}".format(ridge.score(X_tr,y_tr)))
print("L2 test score: {}".format(ridge.score(X_test,y_test)))

print("\n\tL2 train RMSLE: {}".format(rmsle(np.expm1(np.array(y_tr)),np.expm1(np.array(ridge.predict(X_tr))))))
print("\tL2 test RMSLE: {}".format(rmsle(np.expm1(np.array(y_test)),np.expm1(np.array(ridge.predict(X_test))))))

# Plot important coefficients
coefs = pd.Series(ridge.coef_, index = X_tr.columns)
print("\nRidge picked " + str(sum(coefs != 0)) + " features and eliminated the other " +  \
      str(sum(coefs == 0)) + " features")
imp_coefs = pd.concat([coefs.sort_values().head(10),
                     coefs.sort_values().tail(10)])
imp_coefs.plot(kind = "barh")
plt.title("Coefficients in the Ridge Model")
plt.show()
L2 train score: 0.9227768076158541
L2 test score: 0.8936095268487547

  L2 train RMSLE: 0.10940692876749346
  L2 test RMSLE: 0.13434470301395546

Not quite as good, apparently. L2 regularizes by adding the sum of the squares of the feature weights to the loss. We also don’t have to pick max_iter and other parameters for this. Note that it doesn’t eliminate as many features as L1:

L1 + L2 (ElasticNet):

This is just a mix of L1 and L2 regularization: it combines them in some ratio, which the code below tries to tune the same way it does for alpha. It does so in a bit of a weird iterative order which seems like it might be voodoo, but what do I know. Note that the code caps l1_ratio at 1: l1_ratio is the fraction of the penalty that is L1, so l1_ratio=0 is entirely the L2 penalty and l1_ratio=1 is entirely the L1 penalty.
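For reference, sklearn’s ElasticNet minimizes the following, where ρ is l1_ratio and n is the number of samples:

\[
\min_w \; \frac{1}{2n}\lVert y - Xw\rVert_2^2 \;+\; \alpha\,\rho\,\lVert w\rVert_1 \;+\; \frac{\alpha(1-\rho)}{2}\lVert w\rVert_2^2
\]

So ρ = 1 recovers the pure Lasso penalty and ρ = 0 the pure Ridge penalty, which is why capping l1_ratio at 1 makes sense.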

alphaList = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 
                                    0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6]

elasticNet = ElasticNetCV(l1_ratio = [0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1],
                          alphas = alphaList, 
                          max_iter = 50000, cv = 10)
elasticNet.fit(X_tr,y_tr)
bestAlpha1 = elasticNet.alpha_
bestRatio1 = elasticNet.l1_ratio_
print("Best l1_ratio :", bestRatio1)
print("Best alpha :", bestAlpha1)

print("Try again for more precision with l1_ratio centered around " + str(bestRatio1))

distOut = .2
steps = 10
newRatioList = [bestRatio1*(1+i) for i in np.linspace(-distOut,distOut,steps)]

elasticNet = ElasticNetCV(l1_ratio = newRatioList,
                          alphas = alphaList, 
                          max_iter = 50000, cv = 10)
elasticNet.fit(X_tr,y_tr)
if (elasticNet.l1_ratio_ > 1):
    elasticNet.l1_ratio_ = 1    
bestAlpha2 = elasticNet.alpha_
bestRatio2 = elasticNet.l1_ratio_
print("Best l1_ratio :", bestRatio2)
print("Best alpha :", bestAlpha2)


print("Now try again for more precision on alpha, with l1_ratio fixed at " + str(bestRatio2) + 
      " and alpha centered around " + str(bestAlpha2))

distOut = .5
steps = 20
newAlphaList = [bestAlpha2*(1+i) for i in np.linspace(-distOut,distOut,steps)]

elasticNet = ElasticNetCV(l1_ratio = bestRatio2,
                          alphas = newAlphaList, 
                          max_iter = 50000, cv = 10)
elasticNet.fit(X_tr,y_tr)
bestAlpha3 = elasticNet.alpha_
bestRatio3 = elasticNet.l1_ratio_
print("Best l1_ratio :", bestRatio3)
print("Best alpha :", bestAlpha3)


print("\nL1 train score: {}".format(elasticNet.score(X_tr,y_tr)))
print("L1 test score: {}".format(elasticNet.score(X_test,y_test)))

print("\n\tL1 train RMSLE: {}".format(rmsle(np.expm1(np.array(y_tr)),np.expm1(np.array(elasticNet.predict(X_tr))))))
print("\tL1 test RMSLE: {}".format(rmsle(np.expm1(np.array(y_test)),np.expm1(np.array(elasticNet.predict(X_test))))))

# Plot important coefficients
coefs = pd.Series(elasticNet.coef_, index = X_tr.columns)
print("\nElasticNet picked " + str(sum(coefs != 0)) + " features and eliminated the other " +  \
      str(sum(coefs == 0)) + " features")
imp_coefs = pd.concat([coefs.sort_values().head(10),
                     coefs.sort_values().tail(10)])
imp_coefs.plot(kind = "barh")
plt.title("Coefficients in the ElasticNet Model")
plt.show()
Best l1_ratio : 1.0
Best alpha : 0.0006
Try again for more precision with l1_ratio centered around 1.0
Best l1_ratio : 1
Best alpha : 0.0006
Now try again for more precision on alpha, with l1_ratio fixed at 1 and alpha centered around 0.0006
Best l1_ratio : 1
Best alpha : 0.00042631578947368415

L1L2 train score: 0.9348473494236876
L1L2 test score: 0.9120412054711957

  L1L2 train RMSLE: 0.10049327401296412
  L1L2 test RMSLE: 0.12215431475676389

You can see that it really wants l1_ratio to be 1, i.e., entirely the L1 penalty, and that it ends up with nearly exactly the same error and feature-elimination count as the L1 model above.

 

XGBoost:

I also tried XGBoost, which is gradient boosting over decision trees (so an ensemble of trees like a random forest, but built by boosting rather than bagging). It’s not included in scikit-learn like the others, so you have to import its own package. To be honest, it’s even a little more voodoo-y than the others. I looked at the documentation to learn what the parameters do, but I still don’t know exactly what to tune.

dtrain = xgb.DMatrix(X_tr, label = y_tr)
dtest = xgb.DMatrix(X_test)

params = {"max_depth":2, "eta":0.1}
model = xgb.cv(params, dtrain,  num_boost_round=500, early_stopping_rounds=100)

model.loc[30:,["test-rmse-mean", "train-rmse-mean"]].plot()

model_xgb = xgb.XGBRegressor(n_estimators=3000, max_depth=2, learning_rate=0.1, reg_alpha=2) #the params were tuned using xgb.cv
model_xgb.fit(X_tr, y_tr)

xgb_tr_preds = np.expm1(np.array(model_xgb.predict(X_tr)))
xgb_preds = np.expm1(np.array(model_xgb.predict(X_test)))

print("\nXGB train score: {}".format(model_xgb.score(X_tr,y_tr)))
print("XGB test score: {}".format(model_xgb.score(X_test,y_test)))

print("\n\tXGB train RMSLE: {}".format(rmsle(np.expm1(np.array(y_tr)),xgb_tr_preds)))
print("\tXGB test RMSLE: {}".format(rmsle(np.expm1(np.array(y_test)),xgb_preds)))
XGB train score: 0.9367862166074793
XGB test score: 0.8933270957580999

  XGB train RMSLE: 0.09898669774192935
  XGB test RMSLE: 0.1345229063370828

It gets a slightly worse test score than L1, and the gap between the train and test scores implies it’s overfitting. It’s not really clear to me which parameter to adjust to fix this; I messed around with reg_alpha a bit but it didn’t seem to change things significantly.

Cool! So I submitted the L1 results to the leaderboard, and got 0.123 which jumped me up several thousand positions to 1121.

 

BUT WE CAN DO BETTER

 

The next thing that comes to mind, that I saw being done, was un-skewing other distributions besides SalePrice.

The results:

LR train score: 0.9536220848519422
LR test score: 0.5309034955562024

  LR train RMSLE: 0.08478647152623621
  LR test RMSLE: 0.2820981745167067

L1 train score: 0.934633933944826
L1 test score: 0.917479229823466

  L1 train RMSLE: 0.10065772845795878
  L1 test RMSLE: 0.11831799832380148

L2 train score: 0.9280338244782977
L2 test score: 0.9038758615016176

  L2 train RMSLE: 0.10561732482352758
  L2 test RMSLE: 0.12769838585618798

L1L2 train score: 0.9343291673046935
L1L2 test score: 0.9173580069218953

  L1L2 train RMSLE: 0.1008921118694655
  L1L2 test RMSLE: 0.11840487092261676

XGB train score: 0.9650214444888001
XGB test score: 0.8986660236532873

  XGB train RMSLE: 0.07363289689027702
  XGB test RMSLE: 0.13111329347012513

As you can see, L1 and L2 do very incrementally better on the test sets. Probably worth doing then.

Submitting them to the leaderboard, XGBoost isn’t an improvement over my old L1. This new (unskewed) L1 improves very incrementally to 0.12248, which gets me to position 1060. That’s pretty good, right? We must be done?

ARE YOU KIDDING ME

There’s way more we can try!

Okay, here are the things off the top of my head that I can think of to try:

  1. Remove outliers from the training set. There are some obvious outliers if you plot some features. However, they’re so few that I doubt it would make a huge difference. Also, I kind of suspect some of the models can handle this stuff anyway: the tree-based ones are ensembles, which tend to be less sensitive to outliers.
  2. Some actual feature engineering. In one tutorial I read, the person created a ton of new features, basically by multiplying the (numerical) features together to create cross-terms, and also creating squared terms, cubed terms, etc. It ended up with a massive number of features, but those are all basically knobs the model can turn. Obviously, this is incredibly prone to overfitting, but the regularization should handle that.
  3. Trying another model (I’ve seen and been told that LightGBM is good). Will this work significantly better? Eh maybe. I personally kind of doubt it, based on the scores I see in the kernels (which are worse than L1 if anything).
  4. Ensemble stacking. This is combining models in clever ways such that you have a “meta model” that learns how to combine the other models’ predictions. These are extremely powerful and popular on Kaggle, though I can’t say I fully understand them yet.
  5. A few other tricks that, going more carefully through kernels, might actually be important.

I think I’ll try #2 first.

The idea here is that we create new features that may model the data better. I’m mostly following this kernel for this part. For example, if you look at OverallQual:

you can see that though a line fits it pretty well, it might actually be better fit by a quadratic (probably more visible if the mean of each of those values were plotted instead). So, we can create the feature OverallQual_pow2 which is just each of those values squared:

numFeats = getNumFeats(allFeats_df)
print("this many num feats: ",len(numFeats))

interactFeats_df = deepcopy(allFeats_df)

for feat in numFeats:
    interactFeats_df[feat+"_pow2"] = interactFeats_df.loc[:,feat]**2
    interactFeats_df[feat+"_pow3"] = interactFeats_df.loc[:,feat]**3
    interactFeats_df[feat+"_sqrt"] = np.sqrt(np.absolute(interactFeats_df.loc[:,feat]))

print("shape of new df: ",interactFeats_df.shape)

Because the kernel I was following also did the cube and sqrt of each (numerical) feature, I did the same. To be honest, especially with data this noisy, it’s hard for me to imagine the cube modeling something significantly better than the square would, but why not. At this point there are 185 features (there are now 4x the number of numerical features there were before).

unskewed_df = deepcopy(interactFeats_df)

skewed_feats = unskewed_df[getNumFeats(interactFeats_df)].apply(lambda x: skew(x)) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.5]
skewed_feats = skewed_feats.index.tolist()

#Applying log1p to unskew certain feats
print("unskewing this many feats: ",len(skewed_feats))
for feat in skewed_feats:
    unskewed_df[feat] = np.log1p(unskewed_df[feat])

#OHE
unskewed_df = pd.get_dummies(unskewed_df)
display(unskewed_df.shape)

#Make it so "Id" doesn't get scaled
unskewed_df = convertFeatToString(unskewed_df,"Id")

stdSc = StandardScaler()
numFeats = getNumFeats(unskewed_df)
unskewed_df.loc[:,numFeats] = stdSc.fit_transform(unskewed_df.loc[:,numFeats])

display(unskewed_df.head())

Then I unskew a lot of the features, do OHE, and use a StandardScaler() from sklearn so all the features have roughly the same mean and variance. This is apparently pretty important for most models. I tried doing it before and after OHE, and it doesn’t seem to matter when you do it.

LR train score: 0.9641406070235962
LR test score: -113595263551.11363

  LR train RMSLE: 0.07455425212125777
  LR test RMSLE: inf


L1 train score: 0.938638248047312
L1 test score: 0.9181345505183663

  L1 train RMSLE: 0.09752586851153972
  L1 test RMSLE: 0.11784726361511702


L2 train score: 0.9309125003503657
L2 test score: 0.9067301940560699

  L2 train RMSLE: 0.10348339991864564
  L2 test RMSLE: 0.1257881454061949

On the LB, this gave an error of 0.12199 and moved me up to position 1026.

Another thing to try is a different type of feature creation: interactions. Instead of just taking functions of individual features, you combine pairs of them; I tried multiplying them together. You need to be careful here. Doing squared, cubed, and sqrt of each numeric feature is “cheap”, since it only multiplies the count by a constant factor: we had ~34 numeric features, so we end up with ~136 numeric features, which is large but doable (though it’s already making the solvers noticeably slow). On the other hand, the number of pairwise interactions grows as the square of the number of features, so we’d end up with something like 578 features. Therefore, I only looked at the interactions among the 10 features most correlated with SalePrice, which gives a more manageable number.
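Something like this (a sketch; the correlation-based selection and the feature naming here are illustrative, not my exact code):

from itertools import combinations

# pick the 10 numeric features most correlated with SalePrice on the train set
topCorrFeats = train_df.corr()["SalePrice"].abs().sort_values(ascending=False).index[1:11]

# add a product feature for every pair among those 10 (45 new features)
for featA, featB in combinations(topCorrFeats, 2):
    interactFeats_df[featA + "_x_" + featB] = interactFeats_df[featA] * interactFeats_df[featB]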

It didn’t really improve the results at all. I’m not saying this technique can never work, but it feels like a real shot in the dark compared to the other feature creation I did above. It’s easy to see why the square of a feature might help a model, but it seems like kind of a stretch that the product between two unrelated (or even tangentially related) things would.

Let’s try #5, a few little tricks.

One thing I saw that I think might actually help a bunch is the following. I glossed over it a little above, but a standard step during data preparation is to fill in (actual) NA values, so you can use the rest of the info for that entry/observation/sample/whatever rather than throwing it away. A standard way to do this (which I did above) is to replace NA values for a numeric feature with the median or mean of that feature. That works somewhat well; it intuitively feels reasonable because if you’re filling in the mean, you’re not shifting the mean of that feature’s distribution, and you’re adding a value that seems like it can’t be too wrong.

However, this can be pretty inaccurate. For example, if you have good reason to believe a house is really large (say, it has a huge 2nd-floor square footage), then guessing that its 1st-floor square footage is just the feature’s average is almost certainly wrong. The effect is to add a bunch of inaccurate info to the table, turning entries that probably aren’t actually outliers into outliers. What makes more sense is to fill in a more plausible value for the missing feature, given the other features.

So, going back to our list of “real” NA’s (i.e., not actual entries):

We can see that the top two are ones that could actually have a real effect on the data. Now what the guy did in the kernel was to fill in missing LotFrontage entries with the mean based on the neighborhood. That’s not a bad idea; you would expect houses in the same neighborhood to have the same rough area properties. Here’s a plot of the LotFrontage based on neighborhood, where those black lines are the mean for each neighborhood:

However, you can see that except for those first 4 neighborhoods (which don’t account for many houses anyway), it’s a preeeeetty huge plateau for the rest, with the means ranging only from about 50 to 80 while the actual values (within each neighborhood) range from about 25 to 125. So honestly, using this to set the missing LotFrontage values isn’t that different from just using the overall average of the feature.

Something I thought makes more sense is to use another numeric feature, LotArea, which by eye seemed to correlate well with LotFrontage, and which also makes physical sense: the lot area is probably (for a rectangular property) the lot frontage times the depth of the lot. Given that, it made sense to take the square root of LotArea and plot LotFrontage against it, which gives:

This seems pretty reasonable and probably more useful. I also plotted the entries that are missing LotFrontage (at the feature’s average value, which is what they would otherwise have been assigned). You can see that this method probably gives them much better values on average. Also note that for sqrt(LotArea) < 50, LotFrontage is all one low value. That’s probably not a coincidence; it’s likely something like a standard value given to condos.

Anyway, I fit these new LotFrontage values and set them in the DataFrame:
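It was something along these lines (a sketch of the fit-and-fill; my actual code differs in the details):

# fit LotFrontage ~ sqrt(LotArea) using only the rows where LotFrontage is known
known = allFeats_df["LotFrontage"].notnull()
sqrtArea = np.sqrt(allFeats_df["LotArea"])
slope, intercept = np.polyfit(sqrtArea[known], allFeats_df.loc[known,"LotFrontage"], 1)

# fill the missing LotFrontage values with the fitted line
allFeats_df.loc[~known,"LotFrontage"] = slope*sqrtArea[~known] + intercept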

Well, to be honest, this didn’t do much. It got my LB score up to .12141, which is barely better than before, and position 997.

One more thing I tried (on the data with all features unskewed, but no feature creation) is something else I saw in a kernel: unskewing with a Box-Cox transformation rather than log1p. It actually gives slightly better results, don’t ask me why.
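That looks roughly like this (a sketch using scipy’s boxcox1p; the fixed lambda and skew threshold are values I’ve seen in kernels, not necessarily what I used):

from scipy.special import boxcox1p
from scipy.stats import skew

numFeats = getNumFeats(allFeats_df)
skewedFeats = allFeats_df[numFeats].apply(lambda x: skew(x.dropna()))
skewedFeats = skewedFeats[abs(skewedFeats) > 0.5].index

lam = 0.15  # lambda = 0 would be exactly log1p
for feat in skewedFeats:
    allFeats_df[feat] = boxcox1p(allFeats_df[feat], lam)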

Okay. At this point my score has improved vastly from when I started, but also it’s still in the ~1000th place on the leaderboard. At this point I’m going to straight up copy a successful kernel and see where my thing diverges from theirs. Here’s the one I’m looking at right now.

Here’s the order I do stuff in (in the version where I don’t create all those extra features):

  1. Separate y_train, combine X_train and X_test, and create y_train_log
  2. Fix a specific point
  3. Replace NA’s that actually mean something else
  4. Fill in LotFrontage NA’s with accurate vals
  5. Fill in GarageYrBlt NA’s with YearBuilt
  6. Fill in the other NA’s with median or mode
  7. Convert MSSubClass and MoSold to strings
  8.  Unskew skewed numeric features
  9. Do OHE on categorical features
  10. Turn “Id” into a string
  11. Use a standard scaler
  12. Split up data, model

Here’s the order he does stuff in:

  1. Separate “Id” from train and test sets, drop it from them
  2. Delete the two outliers with large GrLivArea but small SalePrice
  3. Do a log1p transform of SalePrice
  4. Combine train and test datasets
  5. Fill in a bunch of cat. feats NA’s with “None”,  fill in LotFrontage NA’s with mean of Neighborhood, fill Garage numeric feats NA’s with 0, fill in several cat. feats with mode, drop Utilities
  6. Convert MSSubClass, MoSold, YrSold, and OverallCond to strings
  7. Label Encode a bunch of, but not all, cat. feats
  8. Create a new “TotalSF” feature
  9. Unskew skewed numeric feats
  10. Do OHE to the remaining cat. feats
  11. Split up train/test
  12. Use Robust Scaler, model

To be honest, we do pretty similar things. The main difference that sticks out to me is the step where he Label Encodes all those cat. feats. In the end, he has 220 feats to my ~330 (I tried switching to RobustScaler, which supposedly handles outliers better, with pretty much the same results). I’m a little skeptical of Label Encoding: he did choose to LE features that have a natural ordering, but LE forces the category values onto an integer scale whose spacing the model then takes literally, which may not match reality. Additionally, I’m nearly positive he messed up, because LabelEncoder assigns labels in sorted order for that feature, which in this case is alphabetical because they’re strings. That is NOT what you want here, because it will sort the “quality” grades in the order [“Ex”, “Fa”, “Gd”, “None”, “Po”, “TA”], which is not the order of increasing quality. So I kind of wonder how the hell he even got a good score at all.
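(For reference, the “ordered LE” I mention below is just a manual mapping of the quality-type grades to integers in their real order; a sketch, since my actual mapping and feature list may differ:)

qualOrder = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
for feat in ["ExterQual","ExterCond","BsmtQual","BsmtCond","HeatingQC",
             "KitchenQual","FireplaceQu","GarageQual","GarageCond","PoolQC"]:
    allFeats_df[feat] = allFeats_df[feat].map(qualOrder)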

When I add the LE (the way he has it), my L1 test score is 0.127, not great.

Next I tried really just replicating what he’s doing. It seems like OHE vs. LE doesn’t matter at all, because I got 0.1118 with LassoCV.

The key insight for this section was that getting rid of those two GrLivArea outliers improves things: down to a test RMSLE of 0.1088.
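(That removal is just something like this, dropping the two houses with huge GrLivArea but low SalePrice; the exact thresholds are illustrative:)

outliers = train_df[(train_df["GrLivArea"] > 4000) & (train_df["SalePrice"] < 300000)].index
train_df = train_df.drop(outliers)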

Okay, now I’m trying to improve this a little. My “control” at this point is: remove the outliers, combine the data, remove Id, fill NA’s the way I have been, create the TotalSF feature (it helps a little, I think), unskew features with skew > 0.5, and use RobustScaler. This gives an L1 test RMSLE of 0.1072 (my LB submission is currently 0.1179, at position 632).
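(TotalSF, taken from his kernel, is just the total square footage of the house, something like:)

allFeats_df["TotalSF"] = allFeats_df["TotalBsmtSF"] + allFeats_df["1stFlrSF"] + allFeats_df["2ndFlrSF"]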

Now I’m trying it with his LE, which I believe is done wrong. It gives an L1 test of 0.1096. Trying it with my ordered LE gives a worse 0.1178, which is truly bizarre to me. I’m genuinely not sure how this could be the case, because it really seems like his version is throwing out information.

_________________________________________________________________________________________________________________________

Another day, another model.

One thing I’m trying today is using his ensemble stacking. For now I’m just going to try it on my “simplest” solution, which is the data cleaning above (using things like the LotFrontage fill function) and then just doing OHE, no LE or feature creation (aside from TotalSF).

The first “stacking” I’m using is his averaging method, where he really just averages the outputs of several models, which should be more robust to outliers. First, we create the models and the scoring function:

lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)

GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)


model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =7, nthread = -1)


model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

#Validation function
n_folds = 5

def rmsle_cv(model,X_tr,y_tr):
    # note: get_n_splits() just returns the integer n_folds, so cross_val_score ends up
    # doing a plain (unshuffled) 5-fold CV here; the shuffle/random_state have no effect
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(X_tr.values)
    rmse= np.sqrt(-cross_val_score(model, X_tr.values, y_tr, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

What’s interesting is that, if you test these models independently, you can see that they all do pretty well, but mostly a little worse than the test scores I got above: Lasso here is about the same as my L1 from before, but all the other models being used do worse:

score = rmsle_cv(lasso,X_train,y_train_log)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(ENet,X_train,y_train_log)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(KRR,X_train,y_train_log)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(GBoost,X_train,y_train_log)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(model_xgb,X_train,y_train_log)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(model_lgb,X_train,y_train_log)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))
Lasso score: 0.1096 (0.0061)

ElasticNet score: 0.1096 (0.0062)

Kernel Ridge score: 0.1124 (0.0057)

Gradient Boosting score: 0.1167 (0.0090)

Xgboost score: 0.1159 (0.0076)

LGBM score: 0.1178 (0.0038)

Now we do the averaging stack:

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
    
    
averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))

And now, submitting it… I get 0.11501, which puts me at position 187 on the leaderboard!! That’s a significant jump, and it definitely made me realize the power of stacking (and we haven’t even truly stacked yet!). Let’s try the more advanced version!
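For reference, the “real” stacking from that kernel works roughly like this: each base model is fit on K-1 folds and predicts the held-out fold, and those out-of-fold predictions become the training features for a meta-model. A sketch in the spirit of his code, not an exact copy:

from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone
from sklearn.model_selection import KFold
import numpy as np

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    def fit(self, X, y):
        # note: expects numpy arrays (X.values, y.values), not DataFrames/Series
        self.base_models_ = [list() for _ in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)

        # out-of-fold predictions from each base model become the meta-model's features
        out_of_fold_preds = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                out_of_fold_preds[holdout_index, i] = instance.predict(X[holdout_index])

        self.meta_model_.fit(out_of_fold_preds, y)
        return self

    def predict(self, X):
        # average each base model's fold-clones, then let the meta-model combine them
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)

stacked_models = StackingAveragedModels(base_models=(ENet, GBoost, KRR), meta_model=lasso)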

For some reason, just plugging in my data like before isn’t working… Ahhh, after what took longer than I’d like to admit to figure out, I found that he passes y_train as a numpy array, whereas mine was a pandas Series by default, which apparently breaks his code. That’s annoying, but a valuable lesson learned.

Huh, well, weirdly enough, the actual stacking version (not just averaging) did slightly worse on the LB (in the 4th decimal place).

Anyway, at this point, I’m not 100% sure what to do. I also tried doing stacking on data that had feature creation done, but that didn’t add much. I suspect that the final stretch will be either some tiny improvement to data cleaning I do, or specific looking at outliers in the test data.

The first thing I’ll try is fixing the LotFrontage NA’s even a little more carefully. I noticed before (but didn’t think it would be worth trying) that for very low sqrt(LotArea) values, LotFrontage seems to cut off at a constant. So, for those points, instead of using the linear fit, I’ll just fill in that constant.

That’s what it looks like after this manual fix (the green dots are the affected points). As you can see, it really only affects about 8 points in the combined train/test sets. Submitting it gives a tiny improvement to 0.11495, which moves me up 8 positions on the LB. Eh.

I suspect that what could move me up a bit farther at this point is removing more outliers. Remember before, when I got a roughly 0.01 improvement for removing those two GrLivArea outliers? At this point I’m fighting for improvements 1-2 orders of magnitude smaller than that. So let’s go back and look at a few outliers that might be messing up the train data. Then we could even look at the test data to see if it has any outliers we could manually fix.

Hm, strangely, removing these outliers actually made my score worse. This is weird to me, because the few I removed really do seem like outliers. Here’s the before and after, plotted against SalePrice:

Before:

After:

I’m not sure what to take away from that. My logic was: if you have non-representative points in your train set, the model gets trained sub-optimally for all the typical points, which should make results worse. But I can think of two things that might outweigh that. One is that, despite being non-representative, those points still contain good information in their other features. The other is that outliers might hurt a purely linear model, but the nature of our stacked model is that it plays to the strengths of the different models inside it. Something like the GradientBoostingRegressor can probably handle these outliers fine (being tree-based, it can capture weird nonlinear things), so if we remove those points, it may not know how to deal with similar ones in the test set. This makes me wonder if I should actually put back those GrLivArea outliers from before, although removing them did appear to help at the time.

At this point, there are actually only two entries with a better score than mine that have public kernels. I went through them to see if there was some specific technique they used that I hadn’t seen yet, but they appear to use the same stuff, which suggests they just did more careful hand tuning and experimenting. That itself might be a good skill (so maybe I’ll come back to it), but for now I think I’ve gotten a ton out of this challenge and the returns are quickly diminishing. Till next time!
