In this lab, you will use the parquet-tools utility to inspect Parquet files. You will learn to:

  1. Print the metadata and schema for a Parquet file
  2. View column-level compression ratios
  3. Dump the content of a Parquet file
  4. Explore the structure of a Parquet file from its metadata

If you haven’t already, make sure you’ve completed Lab 2: Create a movies dataset.

Create a Parquet dataset of movies

This lab requires a Parquet data file, which you will create by importing the movies data into a Parquet-formatted dataset.

First, create a Parquet dataset in Hive. Use the --format option to configure the dataset to store its data as Parquet files.

kite-dataset create dataset:hive:movies --schema movie.avsc --format parquet
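
If you want to confirm that the dataset was configured to store Parquet, the Kite CLI's info command prints the dataset descriptor, including its format (assuming the info command is available in your Kite version):

kite-dataset info dataset:hive:movies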

Next, import the movies data using csv-import:

kite-dataset csv-import u.item --no-header --delimiter '|' dataset:hive:movies

You can verify the data import using the show command.
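
For example, the following prints a sample of records so you can confirm the import succeeded (the exact output depends on your data):

kite-dataset show dataset:hive:movies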

Copy a Parquet file from HDFS

Start by listing the contents of the dataset you created above, which is in the Hive warehouse directory:

hadoop fs -ls hdfs:/user/hive/warehouse/movies
Found 1 items
-rw-r--r--   1 cloudera hive      77314 2015-02-03 17:26 /user/hive/warehouse/movies/72bc5d73-000f-4f16-ae03-fa14eeb74c38.parquet

Use the hadoop command to copy the .parquet file to the local file system. Note that the file name is a generated UUID, so it will differ on your system; substitute the name from your own listing.

hadoop fs -copyToLocal /user/hive/warehouse/movies/72bc5d73-000f-4f16-ae03-fa14eeb74c38.parquet movies.parquet
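
You can confirm the copy with a local listing; the file size should match the HDFS listing above:

ls -l movies.parquet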

Inspect a Parquet data file

For the rest of this lab, use parquet-tools to inspect the movies.parquet file.

You might need to refer to the built-in help:

parquet-tools --help

Running a subcommand with -h prints help for that command:

parquet-tools meta -h

Using parquet-tools (hints follow the list):

1. Find the file schema

2. Find the file's Avro schema

3. Find the column with the best compression ratio
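
If you get stuck, the following commands are a reasonable starting point (a sketch; the exact flags and output layout can vary between parquet-tools versions):

# Print the Parquet file schema
parquet-tools schema movies.parquet

# Print the footer metadata. The "extra" key/value section typically
# includes the Avro schema that Kite stored with the file, and each
# column chunk's SZ field reports its compressed and uncompressed
# sizes, from which you can compute a compression ratio.
parquet-tools meta movies.parquet

# Dump the file's content, column by column
parquet-tools dump movies.parquet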
