Lab 5 Solution: Create a partitioned dataset
This is the solution page for Lab 5: Create a partitioned dataset.
Prepare the ratings data
echo "timestamp,user_id,movie_id,rating" > ratings.csv
cat u.data | awk '{ print $4 "000," $1 "," $2 "," $3 }' | sort >> ratings.csv
Your ratings.csv file should now look like this:
timestamp,user_id,movie_id,rating
874724710000,259,255,4
874724727000,259,286,4
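For context, u.data from the MovieLens 100k dataset is tab-separated, with the user id, movie id, rating, and a timestamp in seconds since the epoch; the awk command above reorders those columns and appends 000 to turn seconds into the milliseconds that Kite's time-based partitioners work with. A quick way to eyeball both layouts (assuming u.data is in the working directory):
# u.data layout: user_id<TAB>movie_id<TAB>rating<TAB>timestamp in seconds
head -n 1 u.data
# ratings.csv layout: timestamp in milliseconds,user_id,movie_id,rating
head -n 2 ratings.csv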
1. Infer a schema for the ratings data
Use the --require option to list fields that cannot be null.
kite-dataset csv-schema ratings.csv --require timestamp --record-name Rating -o rating.avsc
Because the data has a header, there is no need to edit the schema’s field names. Your schema should look like this:
{
  "type" : "record",
  "name" : "Rating",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "timestamp",
    "type" : "long"
  }, {
    "name" : "user_id",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "movie_id",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "rating",
    "type" : [ "null", "long" ],
    "default" : null
  } ]
}
Notice that the type of timestamp is long, rather than the nullable union of null and long, because it was listed with --require.
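If other fields should be non-nullable as well, the --require option is meant to be repeated, one field per use. The variant below is a sketch, not part of the lab, and rating-strict.avsc is just an illustrative output name; check kite-dataset help csv-schema for the exact usage on your version:
# Hypothetical variant: also mark user_id and movie_id as required.
kite-dataset csv-schema ratings.csv \
  --require timestamp --require user_id --require movie_id \
  --record-name Rating -o rating-strict.avsc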
2. Build a partition strategy JSON file
The partition-config command expects a series of colon-separated source and type pairs, where source is a field and type is the partition information to compute from the source field. For example, timestamp:year adds a partition for year values taken from the timestamp field.
Use the partition-config command to create the partition strategy for the ratings data:
kite-dataset partition-config timestamp:year timestamp:month --schema rating.avsc -o year-month.json
The schema is required to validate that the partition strategy is compatible with the ratings data.
Your partition strategy, year-month.json, should look like this:
[ {
  "name" : "year",
  "source" : "timestamp",
  "type" : "year"
}, {
  "name" : "month",
  "source" : "timestamp",
  "type" : "month"
} ]
3. Create a dataset called ratings
Add --partition-by to pass the partition strategy file when creating the dataset.
kite-dataset create ratings --schema rating.avsc --partition-by year-month.json
The info command also shows a partition strategy when one is configured.
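For example, to confirm the dataset was created as intended, print its descriptor and look for the schema and the year and month partition fields (output omitted here):
# Show the dataset's descriptor, including its partition strategy.
kite-dataset info ratings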
4. Import the ratings data
kite-dataset csv-import ratings.csv ratings
You can see the partition structure with this HDFS tree hack:
hadoop fs -ls -R /user/hive/warehouse/ratings | awk '{print $8}' | sed -e 's/[^-][^\/]*\//--/g' -e 's/^/ /' -e 's/-/|/'
|-------year=1997
|---------month=09
|-----------ffc4a424-d22b-477f-b065-50616e821288.avro
|---------month=10
|-----------1c3b86d4-152e-4401-9b48-1161aab999ce.avro
|---------month=11
|-----------d2a0631f-e998-4545-995d-f10805f723d9.avro
|---------month=12
|-----------4d730d3d-d370-4043-95e7-95cf6ba0b65e.avro
|-------year=1998
|---------month=01
|-----------0ee5f1ba-652d-49ab-bd6d-e7d8308712f9.avro
|---------month=02
|-----------f7e18f2b-ca4f-4bcc-9aea-db68bb3f05e2.avro
|---------month=03
|-----------46496658-ff2e-46ed-9dea-916621ae2a8e.avro
|---------month=04
|-----------b5407194-f3f0-4259-b4c0-57aec6012d85.avro
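As a quick sanity check on the import, you can also print a few records back out of the dataset; by default, show prints a small number of leading records:
# Read a handful of records back from the partitioned dataset.
kite-dataset show ratings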
5. Run an Impala query
Impala syncs with the Hive metastore and can read any Hive tables created with Kite.
With the Impala shell, run invalidate metadata to force an immediate re-sync with Hive:
impala-shell -q 'invalidate metadata'
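To confirm that Impala now sees the table, including the year and month partition columns alongside the data columns, you can describe it (assuming the table landed in the default database):
impala-shell -q 'describe ratings'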
Now you can run SQL queries over the ratings dataset, like this one to find the top 20 movies by average rating:
impala-shell -q 'select movie_id, avg(rating) from ratings group by movie_id order by avg(rating) desc limit 20'
Query: select movie_id, avg(rating) from ratings group by movie_id order by avg(rating) desc limit 20
+----------+-------------------+
| movie_id | avg(rating) |
+----------+-------------------+
| 1293 | 5 |
| 1500 | 5 |
| 1201 | 5 |
| 1536 | 5 |
| 814 | 5 |
| 1467 | 5 |
| 1189 | 5 |
| 1599 | 5 |
| 1122 | 5 |
| 1653 | 5 |
| 1449 | 4.625 |
| 1594 | 4.5 |
| 1398 | 4.5 |
| 1642 | 4.5 |
| 119 | 4.5 |
| 408 | 4.491071428571429 |
| 318 | 4.466442953020135 |
| 169 | 4.466101694915254 |
| 483 | 4.45679012345679 |
| 114 | 4.447761194029851 |
+----------+-------------------+
Fetched 20 row(s) in 1.21s
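Because year and month are genuine partition columns, Impala can prune partitions when a query filters on them; an illustrative query like this one should only scan the files under year=1997/month=12:
# Partition pruning: only the December 1997 directory is read.
impala-shell -q 'select count(*), avg(rating) from ratings where year = 1997 and month = 12'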
Next
- Back to the lab
- Move on to the next lab: Create a Flume pipeline