Lab 2 Solution: Create a movies dataset
This is the solution page for Lab 2: Create a movies dataset.
Download and unzip the source data
curl http://files.grouplens.org/datasets/movielens/ml-100k.zip -o ml-100k.zip
unzip ml-100k.zip
cd ml-100k
1. Infer a schema from the movies data file
The command to infer the file’s schema is:
kite-dataset csv-schema u.item --delimiter '|' --no-header --record-name Movie -o movie.avsc
If you add a header to the data file with just the columns you want, the csv-schema command will use those field names. Otherwise, you need to edit the schema to correct the field names and remove unnecessary columns. You should end up with a schema like this:
{
  "type" : "record",
  "name" : "Movie",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "id",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "title",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "release_date",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "video_release_date",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "imdb_url",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
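The header-based alternative mentioned above can be sketched in shell. This is only a minimal sketch: trim_movies is a hypothetical helper, not part of Kite, and it assumes that the five columns you want are the first five '|'-separated columns of u.item, as in the schema shown here.

```shell
# Hypothetical helper (not part of Kite): copy the first five '|'-separated
# columns of a u.item-style file into a new file, preceded by a header line.
#   $1 = input file, $2 = output file
trim_movies() {
  # Write the header line with the field names used in the schema above.
  printf '%s\n' 'id|title|release_date|video_release_date|imdb_url' > "$2"
  # Append columns 1-5 of every input row below the header.
  cut -d '|' -f 1-5 "$1" >> "$2"
}
```

With the helper defined, the header-based flow would look roughly like this, where movies.csv is an assumed file name and --no-header is dropped because the file now has a header:
trim_movies u.item movies.csv
kite-dataset csv-schema movies.csv --delimiter '|' --record-name Movie -o movie.avsc
If you take this route, movies.csv would presumably also be the file to pass to csv-import in step 3, again without --no-header.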
2. Create the movies dataset
The create command will create and configure an empty dataset:
kite-dataset create dataset:hdfs:/user/cloudera/example/movies --schema movie.avsc
You can verify the dataset’s configuration using the info command:
kite-dataset info dataset:hdfs:/user/cloudera/example/movies
3. Import movies into the new dataset
Import the u.item file using the csv-import command:
kite-dataset csv-import u.item --delimiter '|' --no-header dataset:hdfs:/user/cloudera/example/movies
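Because the import relies on the '|' delimiter and on column position rather than a header, it can help to confirm that every row of the input has the same number of fields. The following is a minimal sketch; field_counts is a hypothetical helper, not a Kite command.

```shell
# Hypothetical helper (not part of Kite): for a '|'-delimited file, print how
# many lines have each field count, one "<n> fields: <m> lines" line per count.
#   $1 = input file
field_counts() {
  awk -F '|' '
    { counts[NF]++ }                      # tally lines by number of fields
    END {
      for (n in counts)
        printf "%d fields: %d lines\n", n, counts[n]
    }
  ' "$1"
}
```

Run it as:
field_counts u.item
A single output line means every row has the same number of fields. The MovieLens README describes u.item as five descriptive columns followed by 19 genre flags, so 24 fields per row is the expected result.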
You can see the contents of the dataset using the show command:
kite-dataset show dataset:hdfs:/user/cloudera/example/movies
Show will output the first few records:
{"id": 1, "title": "Toy Story (1995)", "release_date": "01-Jan-1995", "video_release_date": "", "imdb_url": "..."}
{"id": 2, "title": "GoldenEye (1995)", "release_date": "01-Jan-1995", "video_release_date": "", "imdb_url": "..."}
{"id": 3, "title": "Four Rooms (1995)", "release_date": "01-Jan-1995", "video_release_date": "", "imdb_url": "..."}
Next
- Back to the lab
- Move on to the next lab: Using avro-tools