Lab 5: Create a partitioned dataset
In this lab, you will create a partitioned dataset to store movie ratings. You will learn to:
- Build a partition strategy JSON configuration
- Create a partitioned dataset and import data
- Explore the dataset using Impala
If you haven’t already, make sure you’ve completed Lab 2: Create a movies dataset.
Data preparation
In this lab, you will use the MovieLens ratings data from lab 2.
The ratings in u.data need a slight transformation before they are usable. Run the following commands to produce ratings.csv.
echo "timestamp,user_id,movie_id,rating" > ratings.csv
cat u.data | awk '{ print $4 "000," $1 "," $2 "," $3 }' | sort >> ratings.csv
This transformation could be done during import with MapReduce, but that would require covering topics outside the scope of this series of labs.
Your ratings.csv file should look like this, including a header line and the timestamp value in milliseconds:
timestamp,user_id,movie_id,rating
874724710000,259,255,4
874724727000,259,286,4
Steps
1. Infer a schema from the ratings.csv file, named rating.avsc
The ratings will be partitioned by time, so the timestamp field must be required by the schema.
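One way to do this is with the kite-dataset csv-schema command; this sketch assumes kite-dataset is on your PATH and ratings.csv is in the current directory (check kite-dataset help csv-schema for the exact flags in your version):

```shell
# Infer an Avro schema from the CSV header and sample values;
# --require marks timestamp as a required (non-null) field
kite-dataset csv-schema ratings.csv --class Rating --require timestamp -o rating.avsc
```

Marking the field required matters because a nullable timestamp could not be used to compute a partition for every record.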
2. Build a partition strategy JSON file, named year-month.json
Using the kite-dataset command and the rating.avsc schema, build a partition strategy that stores data by year and month.
You might need to refer to the online reference, which has more detail than the built-in help.
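A possible invocation of the partition-config command looks like the following; the field:type pairs shown here are a sketch, so confirm the syntax against the reference for your Kite version:

```shell
# Derive year and month partitions from the timestamp field
# and write the strategy as JSON
kite-dataset partition-config timestamp:year timestamp:month -s rating.avsc -o year-month.json
```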
3. Create a dataset called ratings
With the configuration you created, rating.avsc and year-month.json, create a Hive dataset called ratings.
Hive is the default storage for datasets; you can use either the full URI, dataset:hive:ratings, or simply ratings instead.
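Putting the two configuration files together, the create command might look like this (a sketch assuming the rating.avsc and year-month.json files from the previous steps):

```shell
# Create a partitioned Hive dataset named ratings
kite-dataset create ratings --schema rating.avsc --partition-by year-month.json
```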
4. Import the ratings data
You might want to refer to the online reference, or the built-in help:
kite-dataset help csv-import
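Assuming the ratings dataset was created as above, the import itself is likely a single command along these lines:

```shell
# Import ratings.csv into the ratings dataset;
# records are routed to year/month partitions by their timestamp
kite-dataset csv-import ratings.csv ratings
```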
5. Run a SQL query in Impala
You can run queries in the Impala shell with impala-shell -q.
Before running queries over the ratings data, run invalidate metadata to force an immediate re-sync with Hive:
impala-shell -q 'invalidate metadata'
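After the metadata refresh, any SQL over the table should work; for example, a hypothetical sanity-check query might be:

```shell
# Count the imported ratings (u.data contains 100,000 of them)
impala-shell -q 'select count(*) from ratings'
```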
Next
- View the solution
- Move on to the next lab: Create a Flume pipeline