This is the solution page for Lab 2: Create a movies dataset.

Download and unzip the source data

curl http://files.grouplens.org/datasets/movielens/ml-100k.zip -o ml-100k.zip
unzip ml-100k.zip
cd ml-100k

1. Infer a schema from the movies data file

The command to infer the file’s schema is:

kite-dataset csv-schema u.item --delimiter '|' --no-header --record-name Movie -o movie.avsc

If you add a header to the data file with just the columns you want, the csv-schema command will use those field names. Otherwise, you need to edit the schema to correct the field names and remove unnecessary columns. You should end up with a schema like this:

{
  "type" : "record",
  "name" : "Movie",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "id",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "title",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "release_date",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "video_release_date",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "imdb_url",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
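
The header approach mentioned above can be sketched as a small shell function. This is only an illustration: the function name add_header, the output file name movies.csv, and the choice of the first five columns are assumptions made here to match the field names in the schema above.

```shell
# Sketch: write a copy of a pipe-delimited file that keeps only the first
# five columns and starts with a header line naming them.
# add_header and the header names are illustrative assumptions.
add_header() {
  in="$1"
  out="$2"
  {
    echo 'id|title|release_date|video_release_date|imdb_url'
    # LC_ALL=C makes cut treat the input as raw bytes, so titles that are
    # not valid UTF-8 do not cause errors on some platforms.
    LC_ALL=C cut -d '|' -f 1-5 "$in"
  } > "$out"
}
```

With the headered copy in place, the schema could then be inferred without the --no-header flag, for example:

add_header u.item movies.csv
kite-dataset csv-schema movies.csv --delimiter '|' --record-name Movie -o movie.avsc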

2. Create the movies dataset

The create command will create and configure an empty dataset:

kite-dataset create dataset:hdfs:/user/cloudera/example/movies --schema movie.avsc

You can verify the dataset’s configuration using the info command:

kite-dataset info dataset:hdfs:/user/cloudera/example/movies

3. Import movies into the new dataset

Import the u.item file using the csv-import command:

kite-dataset csv-import u.item --delimiter '|' --no-header dataset:hdfs:/user/cloudera/example/movies

You can see the contents of the dataset using the show command:

kite-dataset show dataset:hdfs:/user/cloudera/example/movies

Show will output the first few records:

{"id": 1, "title": "Toy Story (1995)", "release_date": "01-Jan-1995", "video_release_date": "", "imdb_url": "..."}
{"id": 2, "title": "GoldenEye (1995)", "release_date": "01-Jan-1995", "video_release_date": "", "imdb_url": "..."}
{"id": 3, "title": "Four Rooms (1995)", "release_date": "01-Jan-1995", "video_release_date": "", "imdb_url": "..."}
