Lab 5: Create a partitioned dataset
In this lab, you will create a partitioned dataset to store movie ratings. You will learn to:
- Build a partition strategy JSON configuration
- Create a partitioned dataset and import data
- Explore the dataset using Impala
If you haven’t already, make sure you’ve completed Lab 2: Create a movies dataset.
Data preparation
In this lab, you will use the MovieLens ratings data from lab 2.
The ratings in u.data need a slight transformation before they are usable. Run the following commands to produce ratings.csv:
echo "timestamp,user_id,movie_id,rating" > ratings.csv
cat u.data | awk '{ print $4 "000," $1 "," $2 "," $3 }' | sort >> ratings.csv
This transformation could be done during import using MapReduce, but that requires topics outside the scope of this series of labs.
Your ratings.csv file should look like this, including a header line and the timestamp value in milliseconds:
timestamp,user_id,movie_id,rating
874724710000,259,255,4
874724727000,259,286,4
Steps
1. Infer a schema from the ratings.csv file, named rating.avsc
The ratings will be partitioned by time, so the timestamp field must be required by the schema.
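One way to do this is with the kite-dataset csv-schema command; a sketch, assuming the file names above:

```shell
# Infer an Avro schema from the CSV sample.
# --class sets the record name; --require marks timestamp as a
# required (non-null) field so it can be used for partitioning.
kite-dataset csv-schema ratings.csv --class Rating --require timestamp -o rating.avsc
```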
2. Build a partition strategy JSON file, named year-month.json
Using the kite-dataset command and the rating.avsc schema, build a partition strategy that stores data by year and month. You might need to refer to the online reference, which has more detail than the built-in help.
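A sketch of the command, assuming the timestamp field from the schema above:

```shell
# Partition by year, then by month, both derived from the timestamp field.
kite-dataset partition-config timestamp:year timestamp:month --schema rating.avsc -o year-month.json
```

The generated year-month.json should look roughly like:

```json
[
  { "name": "year",  "source": "timestamp", "type": "year" },
  { "name": "month", "source": "timestamp", "type": "month" }
]
```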
3. Create a dataset called ratings
Using the configuration files you created, rating.avsc and year-month.json, create a Hive dataset called ratings. Hive is the default storage for datasets, so you can use either the full URI, dataset:hive:ratings, or simply ratings.
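A sketch, assuming the two configuration files from the previous steps:

```shell
# Create a partitioned Hive dataset from the schema and partition strategy.
kite-dataset create ratings --schema rating.avsc --partition-by year-month.json
```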
4. Import the ratings data
You might want to refer to the online reference, or the built-in help:
kite-dataset help csv-import
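A sketch of the import, assuming the ratings.csv file and the ratings dataset created above:

```shell
# Import the CSV records into the partitioned ratings dataset.
kite-dataset csv-import ratings.csv ratings
```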
5. Run a SQL query in Impala
You can run queries in the Impala shell with impala-shell -q.
Before running queries over the ratings data, run invalidate metadata to force an immediate re-sync with Hive:
impala-shell -q 'invalidate metadata'
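For example, to count ratings per partition (assuming the year and month partition names from step 2):

```shell
# Count ratings grouped by the year/month partition columns.
impala-shell -q 'select year, month, count(*) from ratings group by year, month'
```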
Next
- View the solution
- Move on to the next lab: Create a Flume pipeline