Lab 2: Create a movies dataset
In this lab, you will create a simple dataset of movies stored in HDFS. You will learn to:
- Infer a schema from a CSV data sample
- Create a dataset using command-line tools
- Import data from a CSV file
If you haven’t already, make sure you’ve completed Lab 1: Setting up the Quickstart VM.
Movies data from GroupLens
This lab uses the MovieLens data, collected and made available by GroupLens. The MovieLens data is a large set of real ratings for a group of movies. For this series of labs, you need the small version with 100,000 ratings.
Start by downloading the data:
curl http://files.grouplens.org/datasets/movielens/ml-100k.zip -o ml-100k.zip
unzip ml-100k.zip
In the ml-100k directory, the three files you need are README, u.item, and u.data.
- README has the column names for u.item
- u.item is a headerless pipe-separated CSV file of movies
- u.data is a headerless tab-separated CSV file of movie ratings
The u.user file has anonymous data about raters, but isn’t needed for this series of labs.
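To get a feel for the two layouts before building a schema, a quick look with standard shell tools can help. The sketch below uses made-up sample rows (not actual MovieLens records) written to the hypothetical files sample.item and sample.data, in the shapes described above. On the real data you would point the same cut commands at ml-100k/u.item and ml-100k/u.data.

```shell
# Made-up rows in the shapes described above (not real MovieLens records).
printf '1|Example Movie (1995)|01-Jan-1995||http://example.com/movie/1|0|1|0\n' > sample.item
printf '10\t1\t4\t880000000\n' > sample.data

# Pipe-separated: select the first five columns with an explicit delimiter.
movie_cols=$(cut -d'|' -f1-5 sample.item)
echo "$movie_cols"

# Tab-separated: tab is cut's default delimiter, so no -d option is needed.
rating_first=$(cut -f1 sample.data)
echo "$rating_first"
```

Note the empty fourth column in the sample movie row: empty values are still delimited, so the column positions stay stable.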
Steps
1. Use the kite-dataset command to create a schema for the movies data, named movie.avsc.
The columns in the data, from the README, are: id, title, release_date, video_release_date, and imdb_url.
You might need to refer to the kite-dataset online reference or the built-in help:
kite-dataset help csv-schema
You should edit the schema to replace the generic field names with the column names above.
Hint: you only need the first few data columns and can remove the genre columns from the schema (field_5 to the end). The next steps will ignore data columns that aren’t in the schema.
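The renaming can be done in any text editor. As one alternative, here is a minimal sed sketch. The generated.avsc content is a hypothetical, trimmed stand-in for a generated schema (the record name, field types, and layout are assumptions, and real output from kite-dataset will differ), and both file names are chosen only for illustration.

```shell
# Hypothetical, trimmed stand-in for a generated schema; real output will differ.
cat > generated.avsc <<'EOF'
{
  "type": "record",
  "name": "Movie",
  "fields": [
    { "name": "field_0", "type": "long" },
    { "name": "field_1", "type": "string" },
    { "name": "field_2", "type": "string" },
    { "name": "field_3", "type": ["null", "string"] },
    { "name": "field_4", "type": "string" }
  ]
}
EOF

# Replace each generic name with the matching column name from the README.
# Matching the surrounding quotes keeps "field_1" from also matching "field_10".
sed -e 's/"field_0"/"id"/' \
    -e 's/"field_1"/"title"/' \
    -e 's/"field_2"/"release_date"/' \
    -e 's/"field_3"/"video_release_date"/' \
    -e 's/"field_4"/"imdb_url"/' \
    generated.avsc > renamed.avsc

cat renamed.avsc
```

Writing to a separate file rather than editing in place leaves the generated schema untouched, so it is easy to compare the two before copying the result to movie.avsc.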
2. Create a dataset in HDFS
Create a dataset stored in HDFS, using the kite-dataset command and the movie.avsc schema.
Kite identifies a dataset by a URI, which should be dataset:hdfs:/user/cloudera/example/movies to work with the other lab modules.
You might need to refer to the online reference or built-in help:
kite-dataset help create
3. Import the movies into the new dataset, dataset:hdfs:/user/cloudera/example/movies
You might need to refer to the online reference or built-in help:
kite-dataset help csv-import
Hints:
- You need to use --no-header to tell Kite there is no data header
- Remember to set the delimiter character correctly
If your imported data has only null values, check that you’ve followed these hints!
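To see why the delimiter matters, it can help to count how many fields a line splits into under different separators. This sketch uses a made-up pipe-separated row (not a real MovieLens record) and awk. It only illustrates the splitting itself and makes no claim about how Kite parses CSV internally.

```shell
# Made-up row in the same shape as u.item (not a real MovieLens record).
row='1|Example Movie (1995)|01-Jan-1995||http://example.com/movie/1'

# Split on the pipe: the row breaks into separate columns.
# The pipe is wrapped in brackets so awk treats it as a literal character.
pipe_fields=$(printf '%s\n' "$row" | awk -F'[|]' '{ print NF }')

# Split on a comma, the usual CSV separator: the whole row stays one field.
comma_fields=$(printf '%s\n' "$row" | awk -F',' '{ print NF }')

echo "pipe: $pipe_fields, comma: $comma_fields"
# prints "pipe: 5, comma: 1"
```

With the wrong separator the entire line lands in a single column, so the remaining columns in the schema have nothing to read.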
Next
- View the solution
- Move on to the next lab: Using avro-tools