Lab 5: Create a partitioned dataset
In this lab, you will create a partitioned dataset to store movie ratings. You will learn to:
- Build a partition strategy JSON configuration
- Create a partitioned dataset and import data
- Explore the dataset using Impala
If you haven’t already, make sure you’ve completed Lab 2: Create a movies dataset.
Data preparation
In this lab, you will use the MovieLens ratings data from lab 2.
The ratings in u.data need a slight transformation before they are usable. Run the following commands to produce ratings.csv.
echo "timestamp,user_id,movie_id,rating" > ratings.csv
cat u.data | awk '{ print $4 "000," $1 "," $2 "," $3 }' | sort >> ratings.csv
This transformation could be done during import with MapReduce, but that would require covering topics outside the scope of this series of labs.
Your ratings.csv file should look like this, including a header line and the timestamp value in milliseconds:
timestamp,user_id,movie_id,rating
874724710000,259,255,4
874724727000,259,286,4
Steps
1. Infer a schema from the ratings.csv file, named rating.avsc
The ratings will be partitioned by time, so the timestamp field must be required by the schema.
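One way to do this is with the kite-dataset csv-schema command; this sketch assumes kite-dataset is on your PATH and ratings.csv is in the current directory (check kite-dataset help csv-schema for the exact flags in your version):

```shell
# Infer an Avro schema from the CSV header and sample values;
# --require marks timestamp as a required (non-null) field
kite-dataset csv-schema ratings.csv --class Rating --require timestamp -o rating.avsc
```

Marking the field required matters because a nullable timestamp could not be used to compute a partition for every record.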
2. Build a partition strategy JSON file, named year-month.json
Using the kite-dataset command and the rating.avsc schema, build a partition strategy that stores data by year and month.
You might need to refer to the online reference, which has more detail than the built-in help.
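A possible invocation of the partition-config command looks like the following; the field:type pairs shown here are a sketch, so confirm the syntax against the reference for your Kite version:

```shell
# Derive year and month partitions from the timestamp field
# and write the strategy as JSON
kite-dataset partition-config timestamp:year timestamp:month -s rating.avsc -o year-month.json
```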
3. Create a dataset called ratings
With the configuration you created, rating.avsc and year-month.json, create a Hive dataset called ratings.
Hive is the default storage for datasets; you can use either the full URI, dataset:hive:ratings, or simply ratings instead.
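Putting the two configuration files together, the create command might look like this (a sketch assuming the rating.avsc and year-month.json files from the previous steps):

```shell
# Create a partitioned Hive dataset named ratings
kite-dataset create ratings --schema rating.avsc --partition-by year-month.json
```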
4. Import the ratings data
You might want to refer to the online reference, or the built-in help:
kite-dataset help csv-import
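Assuming the ratings dataset was created as above, the import itself is likely a single command along these lines:

```shell
# Import ratings.csv into the ratings dataset;
# records are routed to year/month partitions by their timestamp
kite-dataset csv-import ratings.csv ratings
```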
5. Run a SQL query in Impala
You can run queries in the Impala shell with impala-shell -q.
Before running queries over the ratings data, run invalidate metadata to force an immediate re-sync with Hive:
impala-shell -q 'invalidate metadata'
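After the metadata refresh, any SQL over the table should work; for example, a hypothetical sanity-check query might be:

```shell
# Count the imported ratings (u.data contains 100,000 of them)
impala-shell -q 'select count(*) from ratings'
```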
Next
- View the solution
- Move on to the next lab: Create a Flume pipeline