In this lab, you will create a simple dataset of movies stored in HDFS. You will learn to:

  1. Infer a schema from a CSV data sample
  2. Create a dataset using command-line tools
  3. Import data from a CSV file

If you haven’t already, make sure you’ve completed Lab 1: Setting up the Quickstart VM.

Movies data from GroupLens

This lab uses the MovieLens data, collected and made available by GroupLens. The MovieLens data is a large set of real ratings for a group of movies. For this series of labs, you need the small version with 100,000 ratings.

Start by downloading the data:

curl http://files.grouplens.org/datasets/movielens/ml-100k.zip -o ml-100k.zip
unzip ml-100k.zip

In the ml-100k directory, the three files you need are README, u.item, and u.data.

  • README has the column names for the data files
  • u.item is a headerless pipe-separated CSV file of movies
  • u.data is a headerless tab-separated CSV file of movie ratings

The u.user file has anonymous data about raters, but isn’t needed for this series of labs.
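Before inferring a schema, it can help to confirm which delimiter each file uses. The sketch below defines a hypothetical helper, count_fields (not part of Kite or the MovieLens download), that reports how many fields the first line of a file has for a given delimiter. A result of 1 usually means the delimiter is wrong for that file.

```shell
# count_fields FILE DELIMITER
# Print the number of DELIMITER-separated fields on the first line of FILE.
# Hypothetical helper for this lab; not part of Kite.
count_fields() {
  awk -F "$2" 'NR == 1 { print NF }' "$1"
}

# Example usage, assuming the archive was unzipped into ./ml-100k:
#   count_fields ml-100k/u.item '|'
#   count_fields ml-100k/u.data "$(printf '\t')"
```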

Steps

1. Use the kite-dataset command to create a schema for the movies data, named movie.avsc.

The columns in the data, from the README, are: id, title, release_date, video_release_date, and imdb_url.

You might need to refer to the kite-dataset online reference or the built-in help:

kite-dataset help csv-schema
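As a rough sketch, the schema inference could look like the function below. The record name Movie, the function name infer_movie_schema, and the KITE_DATASET variable are assumptions made for illustration. Check the flags against the built-in help for your Kite version before relying on them.

```shell
# Set KITE_DATASET=echo to preview the command without running Kite.
KITE_DATASET="${KITE_DATASET:-kite-dataset}"

# infer_movie_schema CSV_FILE SCHEMA_FILE
# Infer an Avro schema from a headerless, pipe-separated CSV file.
infer_movie_schema() {
  "$KITE_DATASET" csv-schema "$1" \
    --delimiter '|' \
    --no-header \
    --class Movie \
    -o "$2"
}

# Example usage, assuming the archive was unzipped into ./ml-100k:
#   infer_movie_schema ml-100k/u.item movie.avsc
```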

You should edit the schema to replace the generic field names with the column names above.

Hint: you only need the first few data columns and can remove the genre columns from the schema (field_5 to the end). The next steps will ignore data columns that aren’t in the schema.
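If you prefer not to rename the fields by hand, the renaming can be sketched with sed. This assumes the generated names run from field_0 to field_4 for the five columns listed above, which matches the hint that the genre columns start at field_5. The function name rename_movie_fields is hypothetical. Removing the genre fields is still left to a text editor.

```shell
# rename_movie_fields SCHEMA_IN SCHEMA_OUT
# Replace the first five generic field names with the README column names.
# The closing quote in each pattern keeps "field_1" from matching "field_10".
rename_movie_fields() {
  sed \
    -e 's/"field_0"/"id"/' \
    -e 's/"field_1"/"title"/' \
    -e 's/"field_2"/"release_date"/' \
    -e 's/"field_3"/"video_release_date"/' \
    -e 's/"field_4"/"imdb_url"/' \
    "$1" > "$2"
}

# Example usage (write to a new file, then review it before replacing the original):
#   rename_movie_fields movie.avsc movie.renamed.avsc
```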

2. Create a dataset in HDFS

Create a dataset stored in HDFS, using the kite-dataset command and the movie.avsc schema.

Kite identifies a dataset by a URI, which should be dataset:hdfs:/user/cloudera/example/movies to work with the other lab modules.

You might need to refer to the online reference or built-in help:

kite-dataset help create
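A minimal sketch of the create step follows, using the dataset URI given above. The function name create_movies_dataset and the KITE_DATASET and MOVIES_URI variables are illustrative assumptions. Confirm the --schema option with the built-in help.

```shell
# Set KITE_DATASET=echo to preview the command without running Kite.
KITE_DATASET="${KITE_DATASET:-kite-dataset}"
MOVIES_URI="dataset:hdfs:/user/cloudera/example/movies"

# create_movies_dataset SCHEMA_FILE
# Create the movies dataset in HDFS from an Avro schema file.
create_movies_dataset() {
  "$KITE_DATASET" create "$MOVIES_URI" --schema "$1"
}

# Example usage:
#   create_movies_dataset movie.avsc
```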

3. Import the movies into the new dataset, dataset:hdfs:/user/cloudera/example/movies

You might need to refer to the online reference or built-in help:

kite-dataset help csv-import

Hints:

  • You need to use --no-header to tell Kite there is no data header
  • Remember to set the delimiter character correctly

If your imported data has only null values, check that you’ve followed these hints!
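Putting the two hints together, the import might be sketched as follows. The function name import_movies and the KITE_DATASET and MOVIES_URI variables are assumptions for illustration, and the option spelling should be checked against the built-in help.

```shell
# Set KITE_DATASET=echo to preview the command without running Kite.
KITE_DATASET="${KITE_DATASET:-kite-dataset}"
MOVIES_URI="dataset:hdfs:/user/cloudera/example/movies"

# import_movies CSV_FILE
# Import a headerless, pipe-separated CSV file into the movies dataset.
import_movies() {
  "$KITE_DATASET" csv-import "$1" "$MOVIES_URI" \
    --delimiter '|' \
    --no-header
}

# Example usage, assuming the archive was unzipped into ./ml-100k:
#   import_movies ml-100k/u.item
```

If your Kite version provides the show command, running kite-dataset show with the same URI afterwards is one way to spot-check that the imported records are not all null.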

Next