This is the solution page for Lab 4: Using parquet-tools.

Copy the file from HDFS

List the files in your Hive dataset

hadoop fs -ls hdfs:/user/hive/warehouse/movies
Found 1 items
-rw-r--r--   1 cloudera hive      77314 2015-02-03 17:26 /user/hive/warehouse/movies/72bc5d73-000f-4f16-ae03-fa14eeb74c38.parquet

Use the hadoop command to copy the .parquet file to the local file system.

hadoop fs -copyToLocal /user/hive/warehouse/movies/72bc5d73-000f-4f16-ae03-fa14eeb74c38.parquet movies.parquet
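
The long generated file name does not have to be typed by hand. The sketch below pulls the .parquet path out of the listing with awk. The helper name parquet_paths is hypothetical (it is not part of Hadoop), and the sketch assumes paths contain no spaces. It runs against the sample listing shown above, so it can be tried without a cluster.

```shell
# parquet_paths: hypothetical helper, reads `hadoop fs -ls` output on stdin
# and prints the path of every entry that ends in .parquet.
parquet_paths() {
    # The path is the last whitespace-separated field of an entry line.
    # The "Found N items" header does not end in .parquet, so it is skipped.
    awk '$NF ~ /\.parquet$/ { print $NF }'
}

# Sample listing copied from the lab output.
listing='Found 1 items
-rw-r--r--   1 cloudera hive      77314 2015-02-03 17:26 /user/hive/warehouse/movies/72bc5d73-000f-4f16-ae03-fa14eeb74c38.parquet'

first_path=$(printf '%s\n' "$listing" | parquet_paths | head -n 1)
echo "$first_path"

# On a cluster, the helper could feed the copy step directly:
#   src=$(hadoop fs -ls /user/hive/warehouse/movies | parquet_paths | head -n 1)
#   hadoop fs -copyToLocal "$src" movies.parquet
```

If the dataset holds more than one .parquet file, head -n 1 keeps only the first; drop it to see all of them.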

1. Dump the file’s schema

parquet-tools schema movies.parquet

Produces:

message Movie {
  optional int64 id;
  optional binary title (UTF8);
  optional binary release_date (UTF8);
  optional binary video_release_date (UTF8);
  optional binary imdb_url (UTF8);
}

2. Find the file’s Avro schema

parquet-tools meta movies.parquet 

Produces the following output. The Avro schema is stored in the extra field, under the avro.schema key:

creator:            parquet-mr (build 8e266e052e423af592871e2dfe09d54c03f6a0e8) 
extra:              avro.schema = {"type":"record","name":"Movie","doc":"Schema generated by Kite"," [more]...

file schema:        Movie 
--------------------------------------------------------------------------------------------------------------
id:                 OPTIONAL INT64 R:0 D:1
title:              OPTIONAL BINARY O:UTF8 R:0 D:1
release_date:       OPTIONAL BINARY O:UTF8 R:0 D:1
video_release_date: OPTIONAL BINARY O:UTF8 R:0 D:1
imdb_url:           OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1:        RC:1682 TS:172459 
--------------------------------------------------------------------------------------------------------------
id:                  INT64 SNAPPY DO:0 FPO:4 SZ:7546/13508/1.79 VC:1682 ENC:PLAIN,RLE,BIT_PACKED
title:               BINARY SNAPPY DO:0 FPO:7550 SZ:30166/46293/1.53 VC:1682 ENC:PLAIN,RLE,BIT_PACKED
release_date:        BINARY SNAPPY DO:0 FPO:37716 SZ:3007/5332/1.77 VC:1682 ENC:RLE,PLAIN_DICTIONARY [more]...
video_release_date:  BINARY SNAPPY DO:0 FPO:40723 SZ:57/53/0.93 VC:1682 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
imdb_url:            BINARY SNAPPY DO:0 FPO:40780 SZ:35344/107273/3.04 VC:1682 ENC:PLAIN,RLE,BIT_PACKED
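
The per-column SZ values can be cross-checked against the row group line. The sketch below sums the compressed and uncompressed sizes copied from the listing above. It assumes that the first two SZ numbers are the compressed and uncompressed sizes in bytes, which is consistent with the ratios shown.

```shell
# SZ pairs (compressed/uncompressed) copied from the row group listing,
# in column order: id, title, release_date, video_release_date, imdb_url.
sizes='7546/13508
30166/46293
3007/5332
57/53
35344/107273'

# Sum each side of the pairs; -F/ splits every line on the slash.
totals=$(printf '%s\n' "$sizes" | awk -F/ '{ c += $1; u += $2 } END { print c, u }')
echo "$totals"   # compressed total, then uncompressed total
```

The uncompressed total comes to 172459, which matches the TS value on the row group line. The compressed total, 76120, is a little under the 77314 bytes reported by hadoop fs -ls; the difference is presumably file metadata such as the footer.
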

3. Display the file’s contents

parquet-tools cat movies.parquet

4. Find the column with the best compression ratio

parquet-tools meta movies.parquet

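In each column's SZ field, the three numbers are the compressed size, the uncompressed size, and their ratio. imdb_url has the best compression ratio, 3.04.

The same column summary is available from dump with the data suppressed: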

parquet-tools dump -d movies.parquet
row group 0 
--------------------------------------------------------------------------------------------------------------
id:                  INT64 SNAPPY DO:0 FPO:4 SZ:7546/13508/1.79 VC:1682 ENC:BIT_PACKED,RLE,PLAIN
title:               BINARY SNAPPY DO:0 FPO:7550 SZ:30166/46293/1.53 VC:1682 ENC:BIT_PACKED,RLE,PLAIN
release_date:        BINARY SNAPPY DO:0 FPO:37716 SZ:3007/5332/1.77 VC:1682 ENC:BIT_PACKED,RLE,PLAIN [more]...
video_release_date:  BINARY SNAPPY DO:0 FPO:40723 SZ:57/53/0.93 VC:1682 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
imdb_url:            BINARY SNAPPY DO:0 FPO:40780 SZ:35344/107273/3.04 VC:1682 ENC:BIT_PACKED,RLE,PLAIN

    id TV=1682 RL=0 DL=1
    ----------------------------------------------------------------------------------------------------------
    page 0:                                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:13463 VC:1682

    title TV=1682 RL=0 DL=1
    ----------------------------------------------------------------------------------------------------------
    page 0:                                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:46218 VC:1682

    release_date TV=1682 RL=0 DL=1 DS:       241 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------------------------------------
    page 0:                                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:1675 VC:1682

    video_release_date TV=1682 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------------------------------------
    page 0:                                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:10 VC:1682

    imdb_url TV=1682 RL=0 DL=1
    ----------------------------------------------------------------------------------------------------------
    page 0:                                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:107183 VC:1682
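
Reading the ratios by eye works for five columns; for wider schemas the comparison can be scripted. The sketch below recomputes each ratio from the SZ field and prints the column with the largest one. The helper name best_ratio is hypothetical, and the input is the column summary copied from the output above rather than a live parquet-tools call.

```shell
# Column summary lines copied from the parquet-tools output.
meta='id:                  INT64 SNAPPY DO:0 FPO:4 SZ:7546/13508/1.79 VC:1682 ENC:BIT_PACKED,RLE,PLAIN
title:               BINARY SNAPPY DO:0 FPO:7550 SZ:30166/46293/1.53 VC:1682 ENC:BIT_PACKED,RLE,PLAIN
release_date:        BINARY SNAPPY DO:0 FPO:37716 SZ:3007/5332/1.77 VC:1682 ENC:BIT_PACKED,RLE,PLAIN [more]...
video_release_date:  BINARY SNAPPY DO:0 FPO:40723 SZ:57/53/0.93 VC:1682 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
imdb_url:            BINARY SNAPPY DO:0 FPO:40780 SZ:35344/107273/3.04 VC:1682 ENC:BIT_PACKED,RLE,PLAIN'

# best_ratio: hypothetical helper, reads column summary lines on stdin and
# prints the column name with the highest uncompressed/compressed ratio.
best_ratio() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^SZ:/) {
                    # Drop the "SZ:" prefix, then split compressed/uncompressed/ratio.
                    split(substr($i, 4), sz, "/")
                    if (sz[1] > 0) {
                        ratio = sz[2] / sz[1]
                        if (ratio > best) { best = ratio; name = $1 }
                    }
                }
            }
        }
        END {
            sub(/:$/, "", name)            # strip the trailing colon from the column name
            printf "%s %.2f\n", name, best
        }
    '
}

result=$(printf '%s\n' "$meta" | best_ratio)
echo "$result"
```

On a real file, the output of parquet-tools meta movies.parquet could be piped into the helper instead of the pasted text. Lines without a complete SZ triple are ignored.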
