Lab 7: Analyze ratings with Crunch
In this lab, you will run a Crunch job to analyze movie ratings. You will learn to:
- Build a crunch job
- Build and run a movie rating web application
If you haven’t already, make sure you’ve completed Lab 5: Create a partitioned dataset and, optionally, Lab 6: Create a Flume pipeline.
Background
In this lab, you will build and run a Crunch job that analyzes the movie ratings dataset you imported in lab 5.
You want to identify movies that have two distinct rating peaks. This would show two groups of people with very different opinions of a particular movie.
A simple way to detect this is done in two steps:
- Count the ratings for each movie to produce a histogram of each rating and the number of people that chose it.
- Look for two peaks in the histogram, where the count goes up, then down, then back up.
The Crunch job you will run in this lab uses a MapReduce round to produce the rating-to-count histogram for each movie, then filters out the movies that don’t meet the criteria in the second step. Filtering is done at the end of the reduce phase.
Steps
1. Build and run the Crunch job
Build the Crunch job with Maven.
If you are using the lab VM, this repository is already downloaded for you; skip the clone step.
git clone https://github.com/rdblue/ratings-crunch.git ~/ratings-crunch
cd ~/ratings-crunch
mvn clean package
Next, submit the job:
mvn kite:run-tool
This uses the Kite maven plugin to submit the job jar with all of its dependencies. The configuration, which sets the main class, is in the project’s POM file.
2. Look at the code
While the Crunch job runs, look at the source located at src/main/java/org/kitesdk/examples/movies/AnalyzeRatings.java
.
The Crunch job is made of a series of small functions that do a single task.
The map phase extracts two pieces of information from a Rating object:
GetMovieID
returns the ID that is used to group ratings together.GetRating
returns the rating value.
Next, the id and rating pairs are grouped by id. All of the ratings for a movie are processed as a single group in the reduce phase.
The reduce phase is made of three tasks:
AccumulateRatings
produces the rating-to-count histogram by counting the each rating for a movie.BuildRatingsHistogram
converts the movie id and accumulated ratings into an Avro object.IdentifyBimodalMovies
filters the histograms, selecting those with a peak, then a drop, then another peak.
The final output is stored in a Hive dataset called ratings_histograms
.
3. Look at the results
Find an interesting movie id in the ratings_histograms
table and find its title.
Hint: Impala doesn’t support nested types yet. To join the results using SQL, run your query in Hive.