Debugging Tips for Running mrjob on Elastic MapReduce

by Yavar Naddaf | Jan 31, 2013

mrjob is a package that greatly simplifies writing Hadoop Streaming jobs in Python and running them on Amazon’s Elastic MapReduce (EMR). Written and maintained by the great guys at Yelp, mrjob lets you write a multi-step MapReduce job as a handful of Python functions, test and run it locally to make sure it works, and then send it off to 1000 EMR instances to run and return the results.

Because several steps are involved in getting code from your local machine to run as distributed jobs on EMR instances, debugging mrjob programs can be very challenging. If you have already read the official troubleshooting guide and still find yourself struggling to debug your mrjob code, here are a few tips that may save you some time:

  1. Maximize test coverage

    It’s not shocking news that you need good test coverage to make debugging easier and faster. However, it is still worth reiterating that when writing mrjob code, you really want as much test coverage as you can get. Without good tests, re-running the full MapReduce job repeatedly as you hunt down bugs (even if you run it on your local machine) can quickly eat up your day. mrjob allows you to write tests that run a full job. Personally, though, I prefer to break down each MapReduce step into a sequence of smaller functions and write unit tests for each of them. MapReduce is a functional programming model, so this usually works pretty well.
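
As a rough sketch of that approach, here is a word-count step split into small pure functions, each with its own unit test. The function names (tokenize, word_pairs, total) are made up for illustration and are not part of mrjob; the idea is that a real mapper or reducer would simply call helpers like these.

```python
import re
import unittest

# Hypothetical helpers for a word-count job. None of these names come from
# mrjob; a mapper would loop over word_pairs(line) and a reducer would
# return total(counts).
WORD_RE = re.compile(r"[a-z']+")


def tokenize(line):
    """Split a raw input line into lowercase words."""
    return WORD_RE.findall(line.lower())


def word_pairs(line):
    """Build the (key, value) pairs a word-count mapper would emit."""
    return [(word, 1) for word in tokenize(line)]


def total(counts):
    """Combine the values a reducer receives for a single key."""
    return sum(counts)


class WordCountStepsTest(unittest.TestCase):

    def test_tokenize_lowercases_and_drops_punctuation(self):
        self.assertEqual(tokenize("Hello, hello World!"),
                         ["hello", "hello", "world"])

    def test_word_pairs_emits_one_pair_per_word(self):
        self.assertEqual(word_pairs("a b a"),
                         [("a", 1), ("b", 1), ("a", 1)])

    def test_total_accepts_an_iterator(self):
        # Reducers receive their values as an iterator, not a list.
        self.assertEqual(total(iter([1, 1, 1])), 3)
```

Tests like these run in milliseconds with your usual test runner and never touch Hadoop, so a failing tokenizer shows up long before you launch a job.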

  2. Watch out for different Python versions

    As of this blog post, the latest AMI version supported in Amazon EMR is 2.3.1, which runs Debian 6.0.5 and Python 2.6. If you use any Python 2.7 features (for instance set literals such as {1,2,3}, or automatic field numbering in str.format, as in '{}:{}:{}'.format(2013, 01, 'Sunday')), your tests will pass on a local machine running Python 2.7 but will fail on EMR. One way to address this is to install Python 2.6 alongside 2.7 (short guide for Ubuntu) and make sure that your tests pass on Python 2.6 as well.
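
For the two features mentioned above, the portable spellings are simple. This is a minimal sketch; the variable names are only for illustration.

```python
# Set literal: a SyntaxError on Python 2.6.
#   days = {1, 2, 3}
# Portable spelling that works on 2.6 and later:
days = set([1, 2, 3])

# Automatic field numbering: a ValueError on Python 2.6.
#   stamp = '{}:{}:{}'.format(2013, 1, 'Sunday')
# Portable spelling with explicit field indexes:
stamp = '{0}:{1}:{2}'.format(2013, 1, 'Sunday')

print(sorted(days))
print(stamp)
```

Note that the set literal fails when the file is compiled, so the job dies as soon as the script is loaded, while the format call only fails when that line actually runs.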

  3. Run your tests on an EMR instance

    Even when your tests pass on Python 2.6, the jobs may still fail with cryptic errors on EMR Hadoop instances. The problem is that, aside from the Python version, other packages on Debian 6.0.5 may also differ from those on your local development machine. For instance, I once had an HTML processing job that used BeautifulSoup. On my dev box, running Ubuntu 12.04, BeautifulSoup 3.2 is installed by default. It turns out that the EMR instances also come with a version of BeautifulSoup pre-installed. Unfortunately for me, it is a much older and buggier version that breaks on cases that work perfectly well on my local machine.

To fix this, I ended up launching a sandbox EMR instance (by selecting a running instance in the AWS console and picking "launch more like this" from the actions menu). I then ssh’ed into this box and ran my bootstrap_actions manually. Once my tests passed on this sandbox instance, my code was good to run on EMR.
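
One quick way to spot this kind of mismatch is to run the same small script on your dev box and on the sandbox instance and compare the output. The sketch below is only an illustration: the helper names are hypothetical, and it assumes each module exposes a __version__ attribute, which is a common convention but not guaranteed.

```python
import sys


def module_version(name):
    """Return a top-level module's __version__, or a placeholder string."""
    try:
        module = __import__(name)
    except ImportError:
        return 'not installed'
    # Assumption: the module follows the __version__ convention.
    return str(getattr(module, '__version__', 'unknown'))


def environment_report(module_names):
    """Collect the interpreter version and the version of each named module."""
    report = {'python': '%d.%d.%d' % tuple(sys.version_info[:3])}
    for name in module_names:
        report[name] = module_version(name)
    return report


if __name__ == '__main__':
    # List whichever modules your job imports; these two are examples.
    modules = ['BeautifulSoup', 'json']
    for key, value in sorted(environment_report(modules).items()):
        print('%s: %s' % (key, value))
```

The script sticks to Python 2.6-compatible syntax so that it can run unchanged on both machines.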