Lab 5 Solution: Create a partitioned dataset
This is the solution page for Lab 5: Create a partitioned dataset.
Prepare the ratings data
echo "timestamp,user_id,movie_id,rating" > ratings.csv
awk '{ print $4 "000," $1 "," $2 "," $3 }' u.data | sort >> ratings.csv
Your ratings.csv file should now look like this:
timestamp,user_id,movie_id,rating
874724710000,259,255,4
874724727000,259,286,4
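The awk step reorders the tab-separated u.data columns (user, item, rating, timestamp) into the CSV column order and appends "000" to convert the seconds-since-epoch timestamps into milliseconds. As a sanity check, you can run the same awk program over a single raw line (the values here correspond to the first sample row above):

```shell
# One raw u.data line: user_id, movie_id, rating, timestamp (tab-separated, seconds)
printf '259\t255\t4\t874724710\n' |
  awk '{ print $4 "000," $1 "," $2 "," $3 }'
# prints: 874724710000,259,255,4
```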
1. Infer a schema for the ratings data
Use the --require option to list fields that cannot be null.
kite-dataset csv-schema ratings.csv --require timestamp --record-name Rating -o rating.avsc
Because the data has a header, there is no need to edit the schema’s field names. Your schema should look like this:
{
"type" : "record",
"name" : "Rating",
"doc" : "Schema generated by Kite",
"fields" : [ {
"name" : "timestamp",
"type" : "long"
}, {
"name" : "user_id",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "movie_id",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "rating",
"type" : [ "null", "long" ],
"default" : null
} ]
}
Notice that the type of timestamp is long, rather than the nullable union [ "null", "long" ] used for the other fields, because it was listed with --require.
2. Build a partition strategy JSON file
The partition-config command expects a series of colon-separated source and type pairs, where source is a field and type is the partition information to compute from the source field. For example, timestamp:year adds a partition for year values taken from the timestamp field.
Use the partition-config command to create the partition strategy for the ratings data:
kite-dataset partition-config timestamp:year timestamp:month --schema rating.avsc -o year-month.json
The schema is required so that the command can validate that the partition strategy is compatible with the ratings data.
Your partition strategy, year-month.json, should look like this:
[ {
"name" : "year",
"source" : "timestamp",
"type" : "year"
}, {
"name" : "month",
"source" : "timestamp",
"type" : "month"
} ]
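The year and month partitioners interpret the source field as milliseconds since the Unix epoch. You can preview which partition a record will land in by dividing the stored value by 1000 and asking date for the UTC year and month; here, using the first timestamp from the sample data (this assumes GNU date, whose -d @N flag reads seconds since the epoch):

```shell
# 874724710000 ms -> 874724710 s; print the partition values for that instant
date -u -d @874724710 '+year=%Y month=%m'
# prints: year=1997 month=09
```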
3. Create a dataset called ratings
Add --partition-by to pass the partition strategy file when creating the dataset.
kite-dataset create ratings --schema rating.avsc --partition-by year-month.json
The info command also shows a partition strategy when one is configured.
4. Import the ratings data
kite-dataset csv-import ratings.csv ratings
You can see the partition structure with this HDFS tree hack:
hadoop fs -ls -R /user/hive/warehouse/ratings | awk '{print $8}' | sed -e 's/[^-][^\/]*\//--/g' -e 's/^/ /' -e 's/-/|/'
|-------year=1997
|---------month=09
|-----------ffc4a424-d22b-477f-b065-50616e821288.avro
|---------month=10
|-----------1c3b86d4-152e-4401-9b48-1161aab999ce.avro
|---------month=11
|-----------d2a0631f-e998-4545-995d-f10805f723d9.avro
|---------month=12
|-----------4d730d3d-d370-4043-95e7-95cf6ba0b65e.avro
|-------year=1998
|---------month=01
|-----------0ee5f1ba-652d-49ab-bd6d-e7d8308712f9.avro
|---------month=02
|-----------f7e18f2b-ca4f-4bcc-9aea-db68bb3f05e2.avro
|---------month=03
|-----------46496658-ff2e-46ed-9dea-916621ae2a8e.avro
|---------month=04
|-----------b5407194-f3f0-4259-b4c0-57aec6012d85.avro
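The sed expressions in that pipeline turn each path component into indentation: the first replaces every `component/` prefix with `--`, the second indents the line by one space, and the third converts the first hyphen into a `|`. Tracing a single directory path through the same expressions shows where one tree line comes from:

```shell
# Indent a single HDFS path the same way the tree hack does
echo '/user/hive/warehouse/ratings/year=1997/month=09' |
  sed -e 's/[^-][^\/]*\//--/g' -e 's/^/ /' -e 's/-/|/'
# prints a leading space followed by: |---------month=09
```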
5. Run an Impala query
Impala syncs with the Hive metastore and can read any Hive tables created with Kite.
With the Impala shell, run invalidate metadata to force an immediate re-sync with Hive:
impala-shell -q 'invalidate metadata'
Now you can run SQL queries over the ratings dataset, like this one to find the top 20 movies by average rating.
impala-shell -q 'select movie_id, avg(rating) from ratings group by movie_id order by avg(rating) desc limit 20'
Query: select movie_id, avg(rating) from ratings group by movie_id order by avg(rating) desc limit 20
+----------+-------------------+
| movie_id | avg(rating) |
+----------+-------------------+
| 1293 | 5 |
| 1500 | 5 |
| 1201 | 5 |
| 1536 | 5 |
| 814 | 5 |
| 1467 | 5 |
| 1189 | 5 |
| 1599 | 5 |
| 1122 | 5 |
| 1653 | 5 |
| 1449 | 4.625 |
| 1594 | 4.5 |
| 1398 | 4.5 |
| 1642 | 4.5 |
| 119 | 4.5 |
| 408 | 4.491071428571429 |
| 318 | 4.466442953020135 |
| 169 | 4.466101694915254 |
| 483 | 4.45679012345679 |
| 114 | 4.447761194029851 |
+----------+-------------------+
Fetched 20 row(s) in 1.21s
Next
- Back to the lab
- Move on to the next lab: Create a Flume pipeline