Fruits of south east Asia!

I forget how I found it, but for some reason I stumbled upon this page of Thai (or more generally, south east Asian) fruits. A bunch of them are really obvious ones (mango, banana, coconut…), but a handful of them are ones I’ve never even heard of. Naturally, I have to try them all.

I mean, isn’t that kind of weird? I like to think I’ve at least heard of most of the major fruits, vegetables, and animals. I don’t mean that I’ve heard/tasted/seen literally all of them, but I usually think that the ones I haven’t heard of are just some small variant of one I already know. Like, if you learn about a blood orange, it’s cool, and yeah, a little different… but not mind blowingly different than a regular orange. It looks about the same aside from the red flesh, and tastes pretty similar. If you hear about moon bears, they’re…basically kinda weird black bears. You get the point.

Reading a book in one hour!

When you’re unemployed, you get to do all sorts of things that, if you had a job, you’d correctly judge as stupid, and then not do. Here’s one of them!

I was curious as to how much information I can pick up in an hour. I mean, I’ve gone to lots of talks, but I think a lot of the time it’s because they’re pretty specific, advanced topics (I mean, they’re usually talks about someone’s research). So I don’t think they’re necessarily the best metric for that. Part of why I’m curious is that there’s gotta be some sort of “information retention vs time spent learning it” curve, and I don’t really have a great grasp of the shape of it. I mean, I’m pretty sure that it’s monotonically increasing with time, but I really don’t know where the best ratio of it is.

Grouping IMDb top movies by runtime


This is a fun lil one. For an upcoming article, I need to know a list of (hopefully good) movies I haven’t yet seen, with similar runtimes. Now, I could have just scrolled down the list of IMDb’s top 250 movies, ctrl + clicking on the ones I haven’t seen, and then compared them by eye, because, to be honest, I think I’ve seen many (/most?) of them (we’ll see shortly).

But of course, that would be an efficient use of my time, and I’m learning pandas these days anyway, so why not use it!

To do what I want, I basically need to take that top 250 list (let’s say I don’t really care about the ratings within that list, I just want to select movies from it), get a column with the runtimes, and then manually make a column with 0/1 entries for if I haven’t/have seen it, and then select for the ones I haven’t. Then, I could group, or at least visualize them.
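For what it's worth, here is a rough sketch of that plan on made-up data. The `Seen` and `runtimeMinutes` column names, the placeholder titles, and the 15-minute bin width are all assumptions for illustration, not what the notebook actually used. It keeps the unseen movies, since the stated goal is finding ones not yet watched:

```python
import pandas as pd

# Made-up rows; titles, runtimes, and the Seen flags are placeholders.
movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "runtimeMinutes": [95, 102, 141, 148, 99],
    "Seen": [1, 0, 0, 0, 0],  # 1 = already watched, 0 = not yet
})

# Keep only the movies flagged as unseen.
unseen = movies[movies["Seen"] == 0].copy()

# Bucket runtimes into 15-minute bins so similar-length movies land together.
bins = list(range(90, 166, 15))  # 90, 105, 120, 135, 150, 165
unseen["RuntimeBin"] = pd.cut(unseen["runtimeMinutes"], bins=bins, right=False)

# One list of titles per runtime bin that actually contains something.
grouped = unseen.groupby("RuntimeBin", observed=True)["Title"].agg(list)
print(grouped)
```

The bin width is the knob to play with: narrower bins give tighter runtime matches but fewer movies per group.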

The first (tiny) hurdle comes from the data source. IMDb very helpfully provides several zipped files of ALL their movies (beware, the “basics” one is 420MB unzipped!), buuut… they have separate files for the table with the runtimes (title.basics.tsv.gz) and the ratings (title.ratings.tsv.gz). That wouldn’t be too bad, you might think: you could just sort the ratings file by the rating column, take those entries, and select from the basics file, to get the runtimes.
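That "sort the ratings, then look up the runtimes" idea would look roughly like the sketch below. It runs on two tiny stand-in tables rather than the real files, and the column names (`tconst`, `averageRating`, `numVotes`, `primaryTitle`, `runtimeMinutes`) are assumed to match the real TSVs. The IDs, titles, and numbers are invented:

```python
import pandas as pd

# Toy stand-in for title.ratings.tsv.gz (assumed columns, invented values).
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "averageRating": [9.1, 8.4, 6.0],
    "numVotes": [500000, 120000, 900],
})

# Toy stand-in for title.basics.tsv.gz (assumed columns, invented values).
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Movie A", "Movie B", "Movie C"],
    "runtimeMinutes": [142, 118, 95],
})

# Take the highest-rated entries, then pull their runtimes from the other table.
top = ratings.sort_values("averageRating", ascending=False).head(2)
with_runtimes = top.merge(basics, on="tconst", how="left")
print(with_runtimes[["primaryTitle", "averageRating", "runtimeMinutes"]])
```

With the real files, each table would presumably come from something like `pd.read_csv("title.ratings.tsv.gz", sep="\t")`, since pandas can infer gzip compression from the file extension.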

Buuuut… a quick glance (or if you’ve ever just perused the dark back alleys of IMDb itself) reveals that there are many entries with very high ratings, higher than the top 250 scores (which range from 8.0 to 9.2). This is probably because there are lots of smaller productions where you get a selection bias, such that you pretty much only get people who really like the movie voting, so they’re all voting 10. Of course, IMDb actually links to an explanation of this effect:

As indicated at the Top Rated Movies page, only votes from regular IMDb voters are considered when creating the top 250 out of the full voting database. This explains any difference between the vote averages reported in the top 250 lists and those on the individual title pages. This also explains why movies or shows you might think from their averages ought to appear on the list yet do not actually appear there.

To maintain the effectiveness of the top 250 lists, we deliberately do not disclose the criteria used for a person to be counted as a regular voter.

and says on the top 250 page itself:

The list is ranked by a formula which includes the number of ratings each movie received from users, and value of ratings received from regular users
To be included on the list, a movie must receive ratings from at least 25000 users

I assumed this before starting this thing but wasn’t sure how they did it. I always assumed that the simplest way would be to just have some threshold of a minimum number of votes (which they do), to even qualify. The “small, religiously devoted fanbase theory” of those artificially high ratings would probably break if you set it correctly — I mean, once you set the threshold of minimum votes high enough, if the rating is still high, it’s not really “artificial” anymore, is it? There are potential problems with that, like only selecting for really large productions (depending on the threshold). It appears they actually do this, but also add a secret “special sauce” where they weigh certain votes more, but they don’t say how.

Anyway, that’s a bit of a long winded way to say that it’d be hard to do what I said above, to get the runtimes of the top 250. At this point, I saw a few options: I probably could try their method manually, using the ratings file (which has average ratings and number of votes for each one), just taking the subset of movies that have a rating at and above the minimum of the top 250 list, and then thresholding those for the minimum number of votes. Maybe I’ll try this anyway, but I assumed (because they say they do something else in addition) that I might end up with another list if I did that.
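A minimal sketch of that manual thresholding, again on invented rows. The 8.0 and 25000 cutoffs come from the numbers quoted above; the column names are assumed to match the ratings file. As noted, this probably would not reproduce the real top 250, since the secret vote weighting isn't included:

```python
import pandas as pd

# Invented rows; the second one is the "tiny devoted fanbase" case.
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003", "tt0000004"],
    "averageRating": [9.2, 9.8, 8.1, 7.5],
    "numVotes": [2000000, 40, 30000, 500000],
})

MIN_RATING = 8.0   # lowest score on the top 250 list
MIN_VOTES = 25000  # stated minimum number of ratings to qualify

# Keep rows that clear both thresholds, best-rated first.
mask = (ratings["averageRating"] >= MIN_RATING) & (ratings["numVotes"] >= MIN_VOTES)
candidates = ratings[mask].sort_values("averageRating", ascending=False)
print(candidates)
```

The 9.8-rated entry with 40 votes drops out on the vote threshold alone, which is the whole point of having one.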

Another thing I briefly considered (that, at this point, may have been much easier) would be some sort of web scraping. It’d be reaaaaaal easy (in theory) to have a script go to the link for each entry on the top 250 page (which would lead directly to the actual movie, which as we’ll see shortly, is actually a bit of a pain), and then each page has a well defined “runtime” field right below the title. I briefly debated this (and maybe I’ll try it later), but I don’t actually know how to do web scraping in python yet, so it would probably be a really hacky job on my part.

So, speaking of hack-y, here’s what I ended up doing. Everything is presented in bits because I did it in a Jupyter notebook.

To get around the “which are actually the top 250 movies” problem, I literally copied the top 250 list from the page and pasted it into a text file, which I then imported, dropping two columns that ended up being garbage. Then I had to do a tiny bit of parsing, because using “\t” as a delimiter worked to separate the title and year, but not the rank and title. So, I had to split that column on the “.” after the rank #, but with n=1, because some titles have a period in their name as well (Monsters, Inc. for example), so you wouldn’t want to chop those up into separate fields. Ahhh details!

    import pandas as pd
    from IPython.display import display

    # Load the pasted list; the columns in the text file are tab-separated.
    movieRatings_df = pd.read_csv("copypasteRatings.csv", sep="\t")
    display(movieRatings_df.head())

    # Drop the two garbage columns.
    movieRatings_df = movieRatings_df.drop(["Your Rating", "Unnamed: 3"], axis=1)
    display(movieRatings_df.head())

    # Split "rank. title (year)" at the first period only (n=1).
    dotsplit = movieRatings_df["Rank & Title"].str.split(".", n=1)
    titleYears = pd.Series([entry[1] for entry in dotsplit])
    rank = pd.Series([entry[0] for entry in dotsplit]).rename("Rank")
    years = pd.Series([int(title[-5:-1]) for title in titleYears]).rename("Year")
    titles = pd.Series([title[1:-7] for title in titleYears]).rename("Title")

    ratings_df = pd.concat([rank, titles, years], axis=1)
    ratings_df.head()
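The n=1 detail is easy to check on a couple of made-up rows in the same "rank. title (year)" shape. The ranks here are placeholders, not the real list positions, and `expand=True` is just a convenience for getting two columns directly rather than something the notebook code uses:

```python
import pandas as pd

# Hypothetical rows shaped like the pasted "Rank & Title" column.
raw = pd.Series([
    "1. The Shawshank Redemption (1994)",
    "202. Monsters, Inc. (2001)",
])

# n=1 splits only at the first period, so "Inc." stays inside the title.
parts = raw.str.split(".", n=1, expand=True)
parts.columns = ["Rank", "TitleYear"]
parts["Rank"] = parts["Rank"].astype(int)
parts["TitleYear"] = parts["TitleYear"].str.strip()
print(parts)

# Without n=1, the second row breaks into three pieces instead of two.
unlimited = raw.str.split(".")
print(unlimited[1])
```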

Stickin it to the Myan-mar

Ahhhhhhhhh, Myanmar. “You most likely know it as Myanmar, but it’ll always be Burma to me.”

While I originally planned to go to Thailand, Laos, and Cambodia, and maybe Vietnam, I really didn’t expect to go to Myanmar at all. To be honest, pretty much all I knew about it was that line from Seinfeld and that there’s currently a genocide/ethnic cleansing/refugee crisis happening with the Rohingya in the west of the country being committed by the Myanmar government (more on that later). However, I kept meeting people who told me that it was the highlight of their whole trip in SEA. When I had a few weeks to kill before meeting my friends in Vietnam, since I had kind of tired of Cambodia, it seemed like the perfect opportunity.

Dimensionality reduction via Principal Component Analysis in python on face images

Hey there! It’s been a while since I wrote anything other than stuff about travel (oh, don’t you worry, there’s still more of that coming!), so it feels good to write about something like this.

Right now, I’m almost finished with the Andrew Ng Machine Learning course on Coursera. Maybe I’ll write about it sometime, but it’s really, really solid and I’m learning a lot. He’s pretty great at explaining concepts and the course is constructed pretty well. What I really like is that, for the assignments, he’ll take the concept from that week and demonstrate a really interesting application of it (even if it’s a little contrived and may not actually be a practical use for it). Either way, it just gets me to think about the breadth of what this stuff can be applied to.

Yes we Cam..bodia

The 4000 Islands are in the very south of Laos, where it borders Cambodia, so Cambodia was naturally my next stop. We boarded a bus to Siem Reap and the SEA nonsense quickly began. We were herded into two minivans for the border, which was only about 15 minutes away. These minivans actually weren’t too bad, but of course they weren’t the ones we’d actually be riding in for the long haul (noticing a pattern here: nice transport for the beginning, then switch to awful transport…maybe because if people saw what they paid for before they got on, they’d demand money back or something?). The border was a typical hilarious chaotic shitshow of nervously handing over our passports, short barked orders, and not knowing what’s going on. The entrance visa for Cambodia was $37 (for the US). I’d heard that you should only hand over exact change, because if you handed over $40 for example, they’d probably invent some “fee” so you didn’t get money back (welcome to Scambodia, as I’ve heard it called). Strangely, after getting our visas, they kinda just pointed us in a direction to walk, and we walked for a couple minutes through these empty parking lots towards a bunch of stores (the “pickup spot” I guess) that probably should have been a lot closer to the crossing point… That was a little strange, and probably an accurate introduction to Cambodia.

Vientiane to the 4000 Islands, the La(o)st of Laos

Hey there again! I guess last time I left off, I was about to leave Vang Vieng to head farther south in Laos, by way of Vientiane, first. The main goal was to do two motorbike loops in central and south Laos, but I’ll get to that later. If you want to go south, especially by common bus routes, you’ll almost certainly end up going through the capital, Vientiane. I had heard pretty dismal stuff about it, but figured I’d give it about a day’s worth of attention, which I think was a good choice.

Hoo boy, I’ve fallen behind: Luang Prabang to Vientiane

Welp, despite my promise to do this more frequently, here we are. Two months in and number three. Womp womp.

Let’s see, I guess I left off in Luang Prabang…

The next day was a big one. My two friends and I agreed to get up early to see the sun rise from Mount Phousi, which is a huge hill in the middle of Luang Prabang with a temple on top. I would’ve liked to see sunset there too, but one of my friends had gone the day before and said that it was a shitshow mob of tourists, and that you see more of a sea of cellphones than the sun itself. However, get up at 5AM and…you’ll definitely have less company up there. So we did that, and were climbing the hill around 5:50AM or so. We actually passed some people and monks coming back from the giving of the alms ceremony thing they do on the streets in many Laos cities. Getting up that early to do pretty much anything always feels pretty cool. You can be doing something fairly mundane but it’ll feel like a mission because you’re awake when it’s dark, but on the “other side” of the day. But it felt especially cool climbing up these steep stairs through the trees before sunrise. Anyway, we finally got there, panting from being out-of-shape travelers, and it was already getting light enough to make out the city. There were only maybe…2 or 3? other people already there, which was cool. We all sat in mostly silence, happily watching it get lighter, and occasionally whispering to each other. However, soon enough, a fairly large group of tourists came up talking pretty loudly and even shouting to each other. A girl, one of the ones who had been up there before us, noticeably winced and then moved away to a different area with a worse view, to get away from them. I…kind of don’t get it. I get if they want to talk to their friends, but they can still do that quietly, right? Anyway, enough kvetching.

A little catchup: the slow boat and beginning of Luang Prabang

Let’s see, where did my rambling last leave off…

Ahh, in Pai. Well. Let me actually finish up there. Pai is a cool place, but towards the end I had pretty much exhausted what you can do there and seen all the dreadlocks and elephant pants I needed to. There were a couple smaller things I hadn’t seen yet, but I was at the “diminishing returns” part of my stay there so it seemed like I should push on from there. To do that, there are a few options. Pai itself doesn’t have an airport (I think..? Maybe the jungle swallowed it or something..?), so if you want to make a big jump, you could go back to Chiang Mai and then fly pretty much anywhere. Otherwise, you can go by land. There are a few ways to go, but a common one is heading to Laos.

11.13.2017: First couple weeks in SEA

Hoooooo boy.

When I left, I pictured having enough time and energy that I would be writing little blog posts every two days or so. To be honest, I still have the time; I could definitely set aside 20 minutes every couple days and blast something out, but my laptop is usually locked away which adds another hurdle to getting myself to do anything.

So! Lemme try and run through things pretty quickly. The day I left, my mom (hi mom, my lone reader!) really helpfully drove me to the train station at some ungodly hour like 5AM, where I got the MTA train to GCT. From there I had to get a bus (the “Airporter”) to JFK. Despite being “from New York”, just going like, 2 blocks over from GCT to the bus stop somehow confused me and made me realize how little I know NYC at this point.