Lab 5 Solution: Create a partitioned dataset
This is the solution page for Lab 5: Create a partitioned dataset.
Prepare the ratings data
echo "timestamp,user_id,movie_id,rating" > ratings.csv
cat u.data | awk '{ print $4 "000," $1 "," $2 "," $3 }' | sort >> ratings.csv
Your ratings.csv file should now look like this:
timestamp,user_id,movie_id,rating
874724710000,259,255,4
874724727000,259,286,4
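For context, u.data from the MovieLens 100k dataset is tab-separated, with the user id, movie id, rating, and a timestamp in seconds since the epoch; the awk command above reorders those columns and appends 000 to turn seconds into the milliseconds that Kite's time-based partitioners work with. A quick way to eyeball both layouts (assuming u.data is in the working directory):
# u.data layout: user_id<TAB>movie_id<TAB>rating<TAB>timestamp in seconds
head -n 1 u.data
# ratings.csv layout: timestamp in milliseconds,user_id,movie_id,rating
head -n 2 ratings.csv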
1. Infer a schema for the ratings data
Use the --require option to list fields that cannot be null.
kite-dataset csv-schema ratings.csv --require timestamp --record-name Rating -o rating.avsc
Because the data has a header, there is no need to edit the schema’s field names. Your schema should look like this:
{
  "type" : "record",
  "name" : "Rating",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "timestamp",
    "type" : "long"
  }, {
    "name" : "user_id",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "movie_id",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "rating",
    "type" : [ "null", "long" ],
    "default" : null
  } ]
}
Notice that the type of timestamp is long, rather than the nullable union of null and long, because it was listed with --require.
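If other fields should be non-nullable as well, the --require option is meant to be repeated, one field per use. The variant below is a sketch, not part of the lab, and rating-strict.avsc is just an illustrative output name; check kite-dataset help csv-schema for the exact usage on your version:
# Hypothetical variant: also mark user_id and movie_id as required.
kite-dataset csv-schema ratings.csv \
  --require timestamp --require user_id --require movie_id \
  --record-name Rating -o rating-strict.avsc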
2. Build a partition strategy JSON file
The partition-config command expects a series of colon-separated source and type pairs, where source is a field and type is the partition information to compute from the source field. For example, timestamp:year adds a partition for year values taken from the timestamp field.
Use the partition-config command to create the partition strategy for the ratings data:
kite-dataset partition-config timestamp:year timestamp:month --schema rating.avsc -o year-month.json
The schema is required to validate that the partition strategy is compatible with the ratings data.
Your partition strategy, year-month.json, should look like this:
[ {
  "name" : "year",
  "source" : "timestamp",
  "type" : "year"
}, {
  "name" : "month",
  "source" : "timestamp",
  "type" : "month"
} ]
3. Create a dataset called ratings
Add --partition-by to pass the partition strategy file when creating the dataset.
kite-dataset create ratings --schema rating.avsc --partition-by year-month.json
The info command also shows a partition strategy when one is configured.
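For example, to confirm the dataset was created as intended, print its descriptor and look for the schema and the year and month partition fields (output omitted here):
# Show the dataset's descriptor, including its partition strategy.
kite-dataset info ratings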
4. Import the ratings data
kite-dataset csv-import ratings.csv ratings
You can see the partition structure with this HDFS tree hack:
hadoop fs -ls -R /user/hive/warehouse/ratings | awk '{print $8}' | sed -e 's/[^-][^\/]*\//--/g' -e 's/^/ /' -e 's/-/|/'
|-------year=1997
|---------month=09
|-----------ffc4a424-d22b-477f-b065-50616e821288.avro
|---------month=10
|-----------1c3b86d4-152e-4401-9b48-1161aab999ce.avro
|---------month=11
|-----------d2a0631f-e998-4545-995d-f10805f723d9.avro
|---------month=12
|-----------4d730d3d-d370-4043-95e7-95cf6ba0b65e.avro
|-------year=1998
|---------month=01
|-----------0ee5f1ba-652d-49ab-bd6d-e7d8308712f9.avro
|---------month=02
|-----------f7e18f2b-ca4f-4bcc-9aea-db68bb3f05e2.avro
|---------month=03
|-----------46496658-ff2e-46ed-9dea-916621ae2a8e.avro
|---------month=04
|-----------b5407194-f3f0-4259-b4c0-57aec6012d85.avro
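As a quick sanity check on the import, you can also print a few records back out of the dataset; by default, show prints a small number of leading records:
# Read a handful of records back from the partitioned dataset.
kite-dataset show ratings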
5. Run an Impala query
Impala syncs with the Hive metastore and can read any Hive tables created with Kite.
With the Impala shell, run invalidate metadata to force an immediate re-sync with Hive:
impala-shell -q 'invalidate metadata'
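To confirm that Impala now sees the table, including the year and month partition columns alongside the data columns, you can describe it (assuming the table landed in the default database):
impala-shell -q 'describe ratings'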
Now you can run SQL queries over the ratings dataset, like this one to find the top 20 movies by average rating:
impala-shell -q 'select movie_id, avg(rating) from ratings group by movie_id order by avg(rating) desc limit 20'
Query: select movie_id, avg(rating) from ratings group by movie_id order by avg(rating) desc limit 20
+----------+-------------------+
| movie_id | avg(rating) |
+----------+-------------------+
| 1293 | 5 |
| 1500 | 5 |
| 1201 | 5 |
| 1536 | 5 |
| 814 | 5 |
| 1467 | 5 |
| 1189 | 5 |
| 1599 | 5 |
| 1122 | 5 |
| 1653 | 5 |
| 1449 | 4.625 |
| 1594 | 4.5 |
| 1398 | 4.5 |
| 1642 | 4.5 |
| 119 | 4.5 |
| 408 | 4.491071428571429 |
| 318 | 4.466442953020135 |
| 169 | 4.466101694915254 |
| 483 | 4.45679012345679 |
| 114 | 4.447761194029851 |
+----------+-------------------+
Fetched 20 row(s) in 1.21s
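Because year and month are genuine partition columns, Impala can prune partitions when a query filters on them; an illustrative query like this one should only scan the files under year=1997/month=12:
# Partition pruning: only the December 1997 directory is read.
impala-shell -q 'select count(*), avg(rating) from ratings where year = 1997 and month = 12'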
Next
- Back to the lab
- Move on to the next lab: Create a Flume pipeline