Snackin my way across Viet-nom nom nom

This will proooobably have to be a two-parter. I spent more time in Vietnam than in any other country, and my other posts probably should’ve been two parts themselves… Also, I’m gonna try a new format, with sections! Lawdy, at this point, it’s been a while, so I’m struggling to sit down and write it all down before I forget it.

  1. Getting into Vietnam, friends, and Hanoi
  2. Ninh Binh
  3. Cat Ba Island and Ha Long bay
  4. The Ha Giang Loop
  5. Motorcycles and setting off

Hanoi

I flew into Hanoi, where my friends from back home were going to visit me. When I got into the airport, Phil was already there. It was great seeing a familiar face after so long on the road and meeting and saying goodbye to people in such quick succession. I had already booked a hostel for us in Hanoi. Hanoi is one of those big tourist cities that has such high throughput that there are tons of cheap and excellent hostels. We got there from the airport pretty easily at night and immediately went out to sample the local snacks. A night market was still going, but I quickly noticed that it seemed to have much less focus on food than night markets in other countries, where it would usually be a long procession of different food carts. read more

Motion detection with the Raspberry Pi, part 1

Okay Declan, let’s try making this post a short and sweet update, not a rambling Homeric epic about simple stuff.

I got a Raspberry Pi (RPi) and an RPi camera because I wanted to learn about them and mess around with them. If I could do image recognition with them, that’d be a good platform for ML and NN stuff, and if I got enough data, maybe even some DS-type projects. Luckily, there’s a ton of resources and code out there already. I drew heavily from www.pyimagesearch.com, which is a REALLY useful site with explanations that are great for beginners. Two articles that I basically copied code from and then butchered were this and this.

He’s not quite doing “image recognition” in this code; it’s more like “difference recognition”. Very simply, he has a stream of frames coming in from the camera. He starts off by taking what will be considered a “background frame”. Then, for all subsequent frames, he subtracts the background from the current frame and looks at the absolute difference of the pixels (all done in grayscale, to make it simpler). If two frames were identical, you’d expect very little difference. If an object appeared in the new frame, the difference would show that object. Then, he uses some OpenCV tools to figure out where the object is and draw a box around it.
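Roughly, the core of that pipeline looks something like this minimal sketch (not his exact code: I’m using cv2.VideoCapture as a stand-in for the Pi camera stream, a fixed first frame as the background, and OpenCV 4’s findContours signature):

import cv2

cap = cv2.VideoCapture(0)   # stand-in for the Pi camera stream
background = None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)   # blur to suppress pixel noise

    if background is None:
        background = gray        # first frame becomes the background
        continue

    # Pixel-wise absolute difference between the current frame and the background
    delta = cv2.absdiff(gray, background)
    thresh = cv2.threshold(delta, 25, 255, cv2.THRESH_BINARY)[1]
    thresh = cv2.dilate(thresh, None, iterations=2)

    # Find blobs of changed pixels and box the big ones
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) >= 500:
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imshow("frame", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()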

I was able to put his code together and run it pretty quickly (though I removed some stuff, like uploading to Dropbox, and instead did the kind of naive thing of sending the files via scp to my other machine), producing this gif of local traffic outside my window:

Of course, the devil is in the details. If you watch it a few times, you’ll notice some weird behavior. Most obviously, boxes are detected around the objects, but then the boxes appear to remain where the object was for several frames. Here you can see it frame by frame:

Why does this happen? Well it’s actually a smart feature, but done in a somewhat clumsy way. In his code, he has the following (I combined the few relevant snippets) inside the main frame capturing loop:

if avg is None:
    # First frame: initialize the background model
    print("[INFO] starting background model...")
    avg = gray.copy().astype("float")
    rawCapture.truncate(0)
    continue

# Fold the current frame into the running weighted-average background
cv2.accumulateWeighted(gray, avg, alpha)

# Absolute difference between the current frame and the background
frameDelta = cv2.absdiff(gray, cv2.convertScaleAbs(avg))

Here, the variable gray is the (grayscale) frame we’re capturing each time. The avg variable is the background I mentioned, the thing we’ll be subtracting from all following frames. So in the if statement, it’s simply setting avg to be the first gray value if it hasn’t been set yet. The last line is simple, too: it’s the subtraction of avg from gray, each time. But the middle line is the key. The OpenCV function accumulateWeighted() lets you keep a running weighted average. The first argument (gray) is what you’re adding to this average, the second argument (avg) is the average you’ll be updating, and the last is a parameter that determines how much weight the new addition gets; effectively, each call does avg = (1 - alpha)*avg + alpha*gray. This is actually a pretty smart feature, because if you wanted to run this all day, the lighting and other stuff would change, so eventually you’d be comparing how the scene looked at 6PM to how it looked at noon, and maybe even frames with no objects would get triggered. So this is an “adaptive” background, which was a smart choice on his part.
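To get a feel for that update, here’s a tiny self-contained demo (made-up alpha, fake 2x2 “frames”) showing the accumulator creeping toward a new scene over a few calls:

import cv2
import numpy as np

alpha = 0.5
avg = np.zeros((2, 2), dtype="float")          # pretend background: all black
bright = np.full((2, 2), 200, dtype="uint8")   # pretend new frame: something bright appeared

for i in range(4):
    cv2.accumulateWeighted(bright, avg, alpha)
    print(i, avg[0, 0])   # 100.0, 150.0, 175.0, 187.5 -- creeping toward 200

That creep is exactly why the ghost hangs around: once a moving object has been blended into avg, it takes several more frames for it to fade back out.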

So can you see the problem? To illustrate it, here’s another example of three images, where I’ve also plotted the avg and frameDelta images for each:

(it looks kind of crappy because I just arranged a bunch of windows rather than making it produce a grid.)

Anyway, you can probably tell what’s happening. The middle column, for each example, shows avg after it’s been updated with the current gray frame. The right column shows frameDelta as a result. However, you can see that in the 2nd row, in avg, there’s still a “ghost” of the arm there. So when the arm is actually gone from gray, absdiff(gray,avg) will still have the arm.

So I fixed it with the following pseudocode:

for frame in cameraStream:
    frameOccupied = False
    gray = grayscaleAndOtherOps(frame)
    frameDelta = absdiff(gray, avg)
    frameOccupied = isObjectInFrame(frameDelta)
    # Do stuff with the object if there is one
    if not frameOccupied:
        cv2.accumulateWeighted(gray, avg, alpha)

read more
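In real code, the fix is basically just wrapping the accumulateWeighted() call in that check. Here’s a sketch of the whole loop (again with cv2.VideoCapture standing in for the Pi camera stream, and frame_occupied as a hypothetical helper, not his function):

import cv2

def frame_occupied(delta, min_area=500):
    # Hypothetical helper: is there a big enough blob of changed pixels?
    thresh = cv2.threshold(delta, 25, 255, cv2.THRESH_BINARY)[1]
    thresh = cv2.dilate(thresh, None, iterations=2)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return any(cv2.contourArea(c) >= min_area for c in contours)

cap = cv2.VideoCapture(0)
avg, alpha = None, 0.1

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)
    if avg is None:
        avg = gray.copy().astype("float")
        continue
    frameDelta = cv2.absdiff(gray, cv2.convertScaleAbs(avg))
    # Only blend the frame into the background when nothing is detected in it,
    # so a passing object never leaves a ghost in avg.
    if not frame_occupied(frameDelta):
        cv2.accumulateWeighted(gray, avg, alpha)

cap.release()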

Kaggle Housing challenge, my take

In this article, I’m doing the Kaggle Housing challenge, which is probably the second most popular one after Titanic. This was very much a “keeping track of what I’m doing for learning/my own sake” thing, but by the end I’d gotten a ranking of 178/5419 on the public leaderboard (LB). That said, this is super long, because I tried a million things and it’s basically a full log of my workflow on this problem.

I’ve really learned a bunch from going through this very carefully. What I did here was to try the few techniques I knew when I started, and then look at notebooks/kernels for this challenge on Kaggle. A word on these kernels: even the very top rated ones vary immensely in quality. Some are excellently explained, and you can tell the authors tried different things to get an optimal result. Others are clearly people just trying random stuff they’ve heard of, misapplying relatively basic techniques, and even copying code from other kernels. So I viewed these as loose suggestions and guideposts for techniques.

This challenge is a good one. In contrast to the Titanic’s classification, this is a regression for the label “SalePrice”, the price a given property sold for. Another key characteristic is that it has 80 features, which is a large number (for me, anyway), and a decent number of them have missing values (NAs).

I’m evaluating models by sometimes looking at the score() function of the sklearn models (which for regressors is the R² coefficient of determination: 1.0 is perfect, and higher is better), and then towards the end only looking at the root mean squared logarithmic error (RMSLE), for which lower is better.
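(The rmsle helper called in the snippets below isn’t defined in this excerpt; presumably it’s the standard definition, something along these lines:)

import numpy as np

def rmsle(y_true, y_pred):
    # Root mean squared logarithmic error: measures relative rather than absolute error,
    # which makes sense for prices spanning a wide range.
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))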

Starting off:

Using literally only the feature “GrLivArea”, the one most obviously linearly related to the sale price:

TTdat = train_df[["GrLivArea","SalePrice"]]
display(TTdat.isnull().sum())

X_train, X_test, y_train, y_test = train_test_split(TTdat.drop("SalePrice",axis=1), TTdat["SalePrice"], test_size=0.3, random_state=42)

lm = linear_model.LinearRegression()
lm.fit(X_train, y_train)
print("\nLM score: {}".format(lm.score(X_test, y_test)))
print("\n\n\tLM RMSLE: {}".format(rmsle(np.array(y_test), np.array(lm.predict(X_test)))))

rfr = RandomForestRegressor(n_estimators=300)
rfr.fit(X_train, y_train)
print("\n\nRFR score: {}".format(rfr.score(X_test, y_test)))
print("\n\n\tRFR RMSLE: {}".format(rmsle(np.array(y_test), np.array(rfr.predict(X_test)))))

read more

Squall Moan: Small Clone clone

squallmoan

Ahhh, where it all started.

I was jamming with a friend in his basement and he had a bunch of pedals, which I was noodling around with. None really stuck out to me until this little guy. If you want a sample of what it sounds like, there are plenty of test drives on YouTube. You may recognize its sound from Nirvana songs (only 90s kids will myeh myeh myeeehhh).

I immediately fell in love with it. It has such a distinctive sound, dark and warbly. I always describe it as “watery”, though that doesn’t seem to make sense to anyone else. Anyway, at this point I hadn’t made any pedals, but once I got into it, this was the one that was always on my mind. By the time I attempted it I had made several others, but I wanted this one to be my best yet. read more

The Knapsack Problem: Discrete Optimization, week 2

I’ve been doing this Coursera Discrete Optimization course with my friends. It’s a lot of fun and I’m learning a bunch. The instructor is a total goofball too, which is a plus. I’ve taken a handful of online courses before, but to be honest, the assignments have usually been pretty easy. Not so with this Discrete Optimization (DO (my initials!)) course! Each week, you have to solve 6 problems, and each is graded out of 10 depending on how well you do. I believe the breakdown is: 0/10 if your answer doesn’t even match the required output format, 3/10 if you do basically anything (even if your answer is quite wrong), 7/10 if you have an answer above some threshold but still not perfect, and 10/10 if your answer is optimal (I guess they know the optimal solution for all of them?). Usually, the problems increase in difficulty throughout the set; often, the last one is difficult enough that (I believe the instructor said this in the course forums) it would be a challenge for people who specialize in DO for a living. I think that’s pretty cool! They usually give you a ton of practice problems of various difficulties, and (though I’m not 100% sure) I think the 6 you’re graded on are usually among those.

So what is DO? I certainly didn’t know when I started this course, though I guess I should’ve been able to guess. Optimization is what it sounds like: finding the best solution you can for a given problem. The “discrete” part is that the quantities involved are integers or other discrete components (it’s right there in the title!). It turns out I had actually heard of many of the problems DO applies to before, but didn’t know they were DO problems. I had heard of most of them in the context of P vs NP complexity.

Anyway, enough of that. Week 1 didn’t have an assignment, so the first one was in week 2, and it was the famous “knapsack problem”. I’m guessing if you’re reading this (sike, no one reads this!), you’re familiar with it, but in case you aren’t, it’s as follows. You have a knapsack with a certain (weight) capacity, K. There are a bunch of items in front of you, and each one has a certain weight (w) and value (v). You wanna fill your backpack with the most valuable (i.e., largest combined value) set of items you can, and you can only take the items whole (i.e., you can’t break one in half to fill the bag completely).

Because I’ll be referring to it a few times, the format of a given problem (just a text file) they give you looks like this:

n K
v_0 w_0
v_1 w_1
...
v_n-1 w_n-1

n is the number of items there are, and K is the capacity of your bag in this problem. v and w are the value and weight of each item. (I’ll refer to the items by their line number, starting with 0, in the following.)

Here’s where the “discrete-iness” comes into play. If you were allowed to take non-whole pieces of the items, the solution would be trivial: you would calculate the density (d = v/w) of each item, and start taking the most (densely) valuable items in decreasing order. When you get to the point where your bag can’t take the next whole item, you take as much of it as you can fit, and this is the theoretically most valuable set.

Now, for the actual problem (where you can’t take fractional pieces of items) it turns out that this is still often a decent solution, for intuitive reasons; denser items are giving you more bang for your buck. This is called the “greedy solution”, because it just greedily takes the best item it can at any moment. However, the problem gets interesting when that strategy will actually give you sub-optimal results. In fact, helpfully, the simplest problem they give you already has a non-greedy optimal solution (OS):

4 11
8 4
10 5
15 8
4 3

The items already happen to be ordered by decreasing density. If you did the greedy solution and took item 0 (8, 4) and then item 1 (10, 5), you couldn’t take any more items and your total value would be 18. However, this isn’t the OS! You can see that you actually wanna take the two least dense items, 2 and 3, which perfectly fill the bag and also add up to 19, the OS.
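Here’s a tiny sketch (not the course code, just the idea) of that greedy, run on this exact problem:

def greedy(items, K):
    # items: list of (value, weight); take whole items in decreasing-density order while they fit.
    value = 0
    for v, w in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if w <= K:
            value += v
            K -= w
    return value

print(greedy([(8, 4), (10, 5), (15, 8), (4, 3)], 11))   # 18 -- but the true optimum is 19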

So, that’s the problem! Now, how do you actually solve it? (And, what does it mean to solve it?)

The most naive way you can imagine is to try every possible set, and take the largest valid (i.e., all the items fit in the bag) solution. You have a list of n items, so for each one, you can either take it (1) or leave it (0). So there are really only (hah!) 2^n possible sets of items. Naturally, this lends itself to a tree data structure, but it should also be clear that 2^n quickly becomes infeasible for surprisingly low values of n.
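To make that concrete, here’s a minimal brute-force sketch (again, not the course code): at each item you branch on take vs. leave and keep the best feasible value.

def brute_force(items, K, i=0):
    # Best total value achievable from items[i:] (a list of (value, weight) pairs) with capacity K.
    if i == len(items):
        return 0
    v, w = items[i]
    best = brute_force(items, K, i + 1)                          # leave item i
    if w <= K:
        best = max(best, v + brute_force(items, K - w, i + 1))   # take item i
    return best

print(brute_force([(8, 4), (10, 5), (15, 8), (4, 3)], 11))   # 19, but the call tree has ~2^n nodes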

Therefore, the real game is in traversing that tree in some clever way so that you don’t have to go through the whole thing. I should make a distinction here. There are two types of solution you can go for. You can go for the OS, where you’re 100% sure the solution you end up with is optimal. This actually doesn’t necessitate going through the whole tree: sometimes you can know that you don’t need to go farther down a branch, which lets you skip all the solutions contained in that part of the tree while still knowing that the solution you find will be the OS. On the other hand, sometimes even this is infeasible, so you have to go for an approximate solution (AS), where you’re just doing the best you can. Not requiring the solution to be provably optimal can let you find great solutions way faster, and sometimes you’ll even still find the OS; the downside is that you won’t know 100% that you got the OS.
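One standard way to do that pruning, sketched below (my own illustration, not necessarily what the course has in mind): sort by density, keep track of the best complete value found so far, and at each node compute an optimistic estimate by filling the remaining capacity fractionally. If the current value plus that estimate can’t beat the best so far, the whole subtree can be skipped without risking the OS.

def solve(items, K):
    # items: list of (value, weight) pairs. Sort by decreasing density so the bound is tight.
    items = sorted(items, key=lambda it: it[0] / it[1], reverse=True)
    best = 0

    def bound(i, cap):
        # Optimistic estimate for items[i:] with capacity cap: the fractional relaxation.
        est = 0.0
        for v, w in items[i:]:
            if w <= cap:
                est += v
                cap -= w
            else:
                est += v * cap / w
                break
        return est

    def go(i, cap, value):
        nonlocal best
        best = max(best, value)
        if i == len(items) or value + bound(i, cap) <= best:
            return   # even the optimistic estimate can't beat what we already have
        v, w = items[i]
        if w <= cap:
            go(i + 1, cap - w, value + v)   # take item i
        go(i + 1, cap, value)               # leave item i

    go(0, K, 0)
    return best

print(solve([(8, 4), (10, 5), (15, 8), (4, 3)], 11))   # 19, with several subtrees pruned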

So let’s actually start. The way my friends and I originally did it (and I stuck with), we actually don’t explicitly have a tree data structure; however, by dint of doing recursion with two possible recursive calls, this implicitly forms a tree.

The first thing done is reading in the data and storing it in a list of namedtuple Item objects, which I believe was given to us. Then, because we’re going to go through the tree in order of decreasing density, we sort the list by that:

def sortListDensity(inList):
    return sorted(inList, key=lambda item: -item.density)

lines = input_data.split('\n')
firstLine = lines[0].split()
item_count = int(firstLine[0])
capacity = int(firstLine[1])

items = []
for i in range(1, item_count+1):
    line = lines[i]
    parts = line.split()
    items.append(Item(i-1, int(parts[0]), int(parts[1]), float(parts[0])/float(parts[1])))

sortedList = sortListDensity(items)

read more

Fruits of Southeast Asia!

I forget how I found it, but for some reason I stumbled upon this page of Thai (or, more generally, Southeast Asian) fruits. A bunch of them are really obvious ones (mango, banana, coconut…), but a handful of them are ones I’ve never even heard of. Naturally, I have to try them all.

I mean, isn’t that kind of weird? I like to think I’ve at least heard of most of the major fruits, vegetables, and animals. I don’t mean that I’ve heard/tasted/seen literally all of them, but I usually think that the ones I haven’t heard of are just some small variant of one I already know. Like, if you learn about a blood orange, it’s cool, and yeah, a little different… but not mind blowingly different than a regular orange. It looks about the same aside from the red flesh, and tastes pretty similar. If you hear about moon bears, they’re…basically kinda weird black bears. You get the point. read more

Reading a book in one hour!

When you’re unemployed, you get to do all sorts of things that, if you had a job, you’d correctly judge as stupid, and then not do. Here’s one of them!

I was curious about how much information I could pick up in an hour. I mean, I’ve gone to lots of hour-long talks, but those tend to be on pretty specific, advanced topics (they’re usually about someone’s research), so I don’t think they’re the best metric for this. Part of why I’m curious is that there’s gotta be some sort of “information retention vs. time spent learning” curve, and I don’t really have a great grasp of its shape. I mean, I’m pretty sure it’s monotonically increasing with time, but I really don’t know where the best retention-per-time tradeoff is. read more

Grouping IMDb top movies by runtime

Howdy!

This is a fun lil one. For an upcoming article, I need to know a list of (hopefully good) movies I haven’t yet seen, with similar runtimes. Now, I could have just scrolled down the list of IMDb.com’s top 250 movies, ctrl + clicking on the ones I haven’t seen, and then compared them by eye, because, to be honest, I think I’ve seen many (/most?) of them (we’ll see shortly).

But of course, that would be an efficient use of my time, and I’m learning pandas these days anyway, so why not use it!

To do what I want, I basically need to take that top 250 list (let’s say I don’t really care about the ratings within the list, I just want to select movies from it), get a column with the runtimes, manually make a 0/1 column for whether I’ve seen each one, and then filter on that. Then I could group them, or at least visualize them.
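Just to sketch where that’s headed in pandas (totally made-up rows and column names; getting the real data into this shape turns out to be the whole problem):

import pandas as pd

# Hypothetical end state: one row per top-250 movie, with runtime and a seen flag.
movies = pd.DataFrame({
    "Title":   ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Runtime": [142, 96, 175, 101],   # minutes
    "Seen":    [1, 0, 1, 0],          # 1 = seen it, 0 = haven't
})

unseen = movies[movies["Seen"] == 0].copy()
# Bucket runtimes into half-hour bins so similar-length movies land together.
unseen["RuntimeBin"] = pd.cut(unseen["Runtime"], bins=range(60, 241, 30))
print(unseen.groupby("RuntimeBin")["Title"].apply(list))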

The first (tiny) hurdle comes from the data source. IMDb.com very helpfully provides several zipped files of ALL their movies (beware, the “basics” one is 420MB unzipped!), buuut… they have separate files for the table with the runtimes (title.basics.tsv.gz) and the ratings (title.ratings.tsv.gz). That wouldn’t be too bad, you might think: you could just sort the ratings file by the rating column, take those entries, and select from the basics file, to get the runtimes.

Buuuut… a quick glance (or if you’ve ever just perused the dark back alleys of IMDb itself) reveals that there are many entries with very high ratings, higher than the top 250 scores (which range from 8.0 to 9.2). This is probably because there are lots of smaller productions where you get a selection bias: pretty much the only people voting are the ones who really like the movie, so they all vote 10. Of course, IMDb actually links to an explanation of this effect:

As indicated at the Top Rated Movies page, only votes from regular IMDb voters are considered when creating the top 250 out of the full voting database. This explains any difference between the vote averages reported in the top 250 lists and those on the individual title pages. This also explains why movies or shows you might think from their averages ought to appear on the list yet do not actually appear there.

To maintain the effectiveness of the top 250 lists, we deliberately do not disclose the criteria used for a person to be counted as a regular voter.

and says on the top 250 page itself:

The list is ranked by a formula which includes the number of ratings each movie received from users, and value of ratings received from regular users
To be included on the list, a movie must receive ratings from at least 25000 users

I assumed something like this before starting, but wasn’t sure how they did it. I always figured the simplest way would be to have some threshold on the minimum number of votes (which they do) just to qualify. The “small, religiously devoted fanbase” theory of those artificially high ratings would probably break down if you set it correctly; I mean, once you set the minimum-vote threshold high enough, if the rating is still high, it’s not really “artificial” anymore, is it? There are potential problems with that, like only selecting for really large productions (depending on the threshold). It appears they actually do this, but they also add a secret “special sauce” where they weight certain votes more, and they don’t say how.

Anyway, that’s a bit of a long-winded way of saying that it’d be hard to do what I described above to get the runtimes of the top 250. At this point, I saw a few options. I could probably try their method manually, using the ratings file (which has the average rating and number of votes for each title), taking the subset of movies rated at or above the minimum of the top 250 list and then thresholding those on the minimum number of votes. Maybe I’ll try this anyway, but I assumed (because they say they do something else in addition) that I might end up with a different list if I did.

Another thing I briefly considered (and that, at this point, may have been much easier) would be some sort of web scraping. It’d be reaaaaaal easy (in theory) to have a script go to the link for each entry on the top 250 page (which leads directly to the actual movie’s page; matching titles to the right entries otherwise is, as we’ll see shortly, a bit of a pain), and then each page has a well-defined “runtime” field right below the title. I briefly debated this (and maybe I’ll try it later), but I don’t actually know how to do web scraping in Python yet, so it would probably be a really hacky job on my part.

So, speaking of hack-y, here’s what I ended up doing. Everything is presented in bits because I did it in a Jupyter notebook.

To get around the “which movies are actually in the top 250” problem, I literally copied the top 250 list from the page, pasted it into a text file, imported that, and dropped two columns that ended up being garbage. Then I had to do a tiny bit of parsing, because using “\t” as a delimiter worked to separate the title and year, but not the rank and title. So I had to split that column on the “.” after the rank number, setting n=1, because some titles have a period in their name as well (Monsters, Inc., for example), and you wouldn’t want to chop those up into separate fields. Ahhh, details!

movieRatings_df = pd.read_csv("copypasteRatings.csv","\t")
display(movieRatings_df.head())

movieRatings_df = movieRatings_df.drop(["Your Rating","Unnamed: 3"], axis=1)
display(movieRatings_df.head())

dotsplit = movieRatings_df["Rank & Title"].str.split(".", n=1)
titleYears = pd.Series([entry[1] for entry in dotsplit])
rank = pd.Series([entry[0] for entry in dotsplit]).rename("Rank")
years = pd.Series([int(title[-5:-1]) for title in titleYears]).rename("Year")
titles = pd.Series([title[1:-7] for title in titleYears]).rename("Title")

ratings_df = pd.concat([rank, titles, years], axis=1)
ratings_df.head()

read more

Stickin it to the Myan-mar

Ahhhhhhhhh, Myanmar. “You most likely know it as Myanmar, but it’ll always be Burma to me.”

While I originally planned to go to Thailand, Laos, and Cambodia, and maybe Vietnam, I really didn’t expect to go to Myanmar at all. To be honest, pretty much all I knew about it was that line from Seinfeld and that there’s currently a genocide/ethnic cleansing/refugee crisis happening with the Rohingya in the west of the country being committed by the Myanmar government (more on that later). However, I kept meeting people who told me that it was the highlight of their whole trip in SEA. When I had a few weeks to kill before meeting my friends in Vietnam, since I had kind of tired of Cambodia, it seemed like the perfect opportunity. read more

Dimensionality reduction via Principal Component Analysis in python on face images

Hey there! It’s been a while since I wrote anything other than stuff about travel (oh, don’t you worry, there’s still more of that coming!), so it feels good to write about something like this.

Right now, I’m almost finished with the Andrew Ng Machine Learning course on Coursera. Maybe I’ll write about it sometime, but it’s really, really solid and I’m learning a lot. He’s pretty great at explaining concepts and the course is constructed pretty well. What I really like is that, for the assignments, he’ll take the concept from that week and demonstrate a really interesting application of it (even if it’s a little contrived and may not actually be a practical use for it). Either way, it just gets me to think about the breadth of what this stuff can be applied to. read more