Kaggle Housing challenge, my take

In this article, I’m doing the Kaggle Housing challenge, which is probably the second most popular after Titanic. This was very much a “keeping track of what I’m doing for learning/my own sake” thing, but by the end I’ve gotten a ranking of 178/5419 on the public leaderboard (LB). That said, this is super long because I tried a million things and it’s kind of a full log of my workflow on this problem.

I’ve really learned a bunch from going through this very carefully. What I did here was to try the few techniques I knew when I started, and then I looked at notebooks/kernels for this challenge on Kaggle. A word on these kernels: even the very most top rated ones vary in quality immensely. Some are excellently explained and you can tell they tried different things to try and get an optimal result. Others are clearly people just trying random stuff they’ve heard of, misapplying relatively basic techniques, and even copying code from other kernels. So I viewed these as loose suggestions and guideposts for techniques.

This challenge is a good one. In contrast to the Titanic’s classification, this is a regression for the label “SalePrice”, the price a given property sold for. Another key characteristic of this one is that it has 80 features, which is a large number (for me, anyway), and of those, a decent number of missing values (NA’s).

I’m evaluating the scores by sometimes looking at the accuracy from the score() function of sklearn models (which is out of 100, increasing score is better), and then towards the end only looking at the Root mean square logarithmic error (RMSLE), for which lower is better.

Starting off:

Using literally only the feature “GrLivArea”, the most obviously linear:

TTdat = train_df[["GrLivArea","SalePrice"]] display(TTdat.isnull().sum()) X_train, X_test, y_train, y_test = train_test_split(TTdat.drop("SalePrice",axis=1), TTdat["SalePrice"], test_size=0.3, random_state=42) lm = linear_model.LinearRegression() lm.fit(X_train,y_train) print("\nLM score: {}".format(lm.score(X_test,y_test))) print("\n\n\tLM RMSLE: {}".format(rmsle(np.array(y_test),np.array(lm.predict(X_test))))) rfr = RandomForestRegressor(n_estimators=300) rfr.fit(X_train,y_train) print("\n\nRFR score: {}".format(rfr.score(X_test,y_test))) print("\n\n\tRFR RMSLE: {}".format(rmsle(np.array(y_test),np.array(rfr.predict(X_test))))) read more

Grouping IMDb top movies by runtime


This is a fun lil one. For an upcoming article, I need to know a list of (hopefully good) movies I haven’t yet seen, with similar runtimes. Now, I could have just scrolled down the list of IMDb.com’s top 250 movies, ctrl + clicking on the ones I haven’t seen, and then compared them by eye, because, to be honest, I think I’ve seen many (/most?) of them (we’ll see shortly).

But of course, that would be an efficient use of my time, and I’m learning pandas these days anyway, so why not use it!

To do what I want, I basically need to take that top 250 list (let’s say I don’t really care about the ratings within that list, I just want to select movies from it), get a column with the runtimes, and then manually make a column with 0/1 entries for if I haven’t/have seen it, and then select for the ones I have. Then, I could group, or at least visualize them.

The first (tiny) hurdle comes from the data source. IMDb.com very helpfully provides several zipped files of ALL their movies (beware, the “basics” one is 420MB unzipped!), buuut… they have separate files for the table with the runtimes (title.basics.tsv.gz) and the ratings (title.ratings.tsv.gz). That wouldn’t be too bad, you might think: you could just sort the ratings file by the rating column, take those entries, and select from the basics file, to get the runtimes.

Buuuut… a quick glance (or if you’ve ever just perused the dark back alleys of IMDb itself) reveals that there are many entries with very high ratings, higher than the top 250 scores (which range from 8.0 to 9.2). This is probably because there are lots of smaller productions where you get a selection bias, such that you pretty much only get people who really like the movie voting, so they’re all voting 10. Of course, IMDb links actually links to an explanation of this effect:

As indicated at the Top Rated Movies page, only votes from regular IMDb voters are considered when creating the top 250 out of the full voting database. This explains any difference between the vote averages reported in the top 250 lists and those on the individual title pages. This also explains why movies or shows you might think from their averages ought to appear on the list yet do not actually appear there.

To maintain the effectiveness of the top 250 lists, we deliberately do not disclose the criteria used for a person to be counted as a regular voter.

and says on the top 250 page itself:

The list is ranked by a formula which includes the number of ratings each movie received from users, and value of ratings received from regular users
To be included on the list, a movie must receive ratings from at least 25000 users

I assumed this before starting this thing but wasn’t sure how they did it. I always assumed that the simplest way would be to just have some threshold of a minimum number of votes (which they do), to even qualify. The “small, religiously devoted fanbase theory” of those artificially high ratings would probably break if you set it correctly — I mean, once you set the threshold of minimum votes high enough, if the rating is still high, it’s not really “artificial” anymore, is it? There are potential problems with that, like only selecting for really large productions (depending on the threshold). It appears they actually do this, but also add a secret “special sauce” where they weigh certain votes more, but they don’t say how.

Anyway, that’s a bit of a long winded way to say that it’d be hard to do what I said above, to get the runtimes of the top 250. At this point, I saw a few options: I probably could try their method manually, using the ratings file (which has average ratings and number of votes for each one), just taking the subset of movies that have a rating at and above the minimum of the top 250 list, and then thresholding those for the minimum number of votes. Maybe I’ll try this anyway, but I assumed (because they say they do something else in addition) that I might end up with another list if I did that.

Another thing I briefly considered (that, at this point, may have been much easier) would be some sort of web scraping. It’d be reaaaaaal easy (in theory) to have a script go to the link for each entry on the top 250 page (which would lead directly to the actual movie, which as we’ll see shortly, is actually a bit of a pain), and then each page has a well defined “runtime” field right below the title. I briefly debated this (and maybe I’ll try it later), but I don’t actually know how to do web scraping in python yet, so it would probably be a really hacky job on my part.

So, speaking of hack-y, here’s what I ended up doing. Everything is presented in bits because I did it in a Jupyter notebook.

To get around the “which are actually the top 250 movies” problem, I literally copied and pasted the top 250 list from the page, and pasted it into a text file, which I imported and dropped two things that ended up being garbage columns. Then I had to do a tiny bit of parsing, because using “\t” as a delimiter worked to separate the title and year, but not the rank and title. So, I had to delimit that column with the “.” after the rank #, but setting n=1, because some titles have a period in their name as well (Monsters, Inc. for example), so you wouldn’t want to chop those up into separate fields. Ahhh details!

movieRatings_df = pd.read_csv("copypasteRatings.csv","\t") display(movieRatings_df.head()) movieRatings_df = movieRatings_df.drop(["Your Rating","Unnamed: 3"],axis=1) display(movieRatings_df.head()) dotsplit = movieRatings_df["Rank & Title"].str.split(".",n=1) titleYears = pd.Series([entry[1] for entry in dotsplit]) rank = pd.Series([entry[0] for entry in dotsplit]).rename("Rank") years = pd.Series([int(title[-5:-1]) for title in titleYears]).rename("Year") titles = pd.Series([title[1:-7] for title in titleYears]).rename("Title") ratings_df = pd.concat([rank,titles,years],axis=1) ratings_df.head() read more

Dimensionality reduction via Principle Component Analysis in python on face images

Hey there! It’s been a while since I wrote anything other than stuff about travel (oh, don’t you worry, there’s still more of that coming!), so it feels good to write about something like this.

Right now, I’m almost finished with the Andrew Ng Machine Learning course on Coursera. Maybe I’ll write about it sometime, but it’s really, really solid and I’m learning a lot. He’s pretty great at explaining concepts and the course is constructed pretty well. What I really like is that, for the assignments, he’ll take the concept from that week and demonstrate a really interesting application of it (even if it’s a little contrived and may not actually be a practical use for it). Either way, it just gets me to think about the breadth of what this stuff can be applied to. read more

Getting back on the horse…er, Python

As of this writing, I just defended and I’m considering various options for what I’ll do next. That’s a whole other story, but the important part for this post is that, probably for whatever I do, I’ll be coding.

I’ve coded a decent amount in my life. I started with dinky web stuff wayyy back, then picked up a now-tattered and probably outdated “C++ for Dummies” book in highschool. I did small programs with that, as well as some silly things for crappy AVR projects I did. In college, I used mostly Java because that’s what the computer science classes I took asked for. Towards the end of college, though, I was working on my own research, and used C++ instead (why? I honestly don’t remember. Wait, I just did! My advisor had heard of some multiprocessor module for C++ that he wanted me to try, so that’s why I didn’t stick with Java). read more