Lab 4 Solution: Using parquet-tools
This is the solution page for Lab 4: Using parquet-tools.
Copy the file from HDFS
List the files in your Hive dataset:
hadoop fs -ls hdfs:/user/hive/warehouse/movies
Found 1 items
-rw-r--r-- 1 cloudera hive 77314 2015-02-03 17:26 /user/hive/warehouse/movies/72bc5d73-000f-4f16-ae03-fa14eeb74c38.parquet
Use the hadoop command to copy the .parquet file to the local file system:
hadoop fs -copyToLocal /user/hive/warehouse/movies/72bc5d73-000f-4f16-ae03-fa14eeb74c38.parquet movies.parquet
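The file name contains a generated identifier, so it will likely differ on your cluster. The following is a minimal sketch for picking the path out of the listing rather than typing it by hand. It assumes the listing format shown above (path in the last whitespace-separated field, no spaces in the path); parquet_paths is a hypothetical helper name, not part of Hadoop.

```shell
# Hypothetical helper: print the path (last field) of every .parquet entry
# found in `hadoop fs -ls` output read from stdin.
parquet_paths() {
  awk '$NF ~ /\.parquet$/ { print $NF }'
}

# Sample listing copied from the output above. On a cluster you would pipe
# `hadoop fs -ls /user/hive/warehouse/movies` into the helper instead.
listing='Found 1 items
-rw-r--r-- 1 cloudera hive 77314 2015-02-03 17:26 /user/hive/warehouse/movies/72bc5d73-000f-4f16-ae03-fa14eeb74c38.parquet'

# Keep only the first match; the lab dataset holds a single file.
first_path=$(printf '%s\n' "$listing" | parquet_paths | head -n 1)
echo "$first_path"
```

On the cluster, the result could then be passed to the copy step as hadoop fs -copyToLocal "$first_path" movies.parquet.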
1. Dump the file’s schema
parquet-tools schema movies.parquet
Produces:
message Movie {
optional int64 id;
optional binary title (UTF8);
optional binary release_date (UTF8);
optional binary video_release_date (UTF8);
optional binary imdb_url (UTF8);
}
2. Find the file’s Avro schema
parquet-tools meta movies.parquet
Produces:
creator: parquet-mr (build 8e266e052e423af592871e2dfe09d54c03f6a0e8)
extra: avro.schema = {"type":"record","name":"Movie","doc":"Schema generated by Kite"," [more]...
file schema: Movie
--------------------------------------------------------------------------------------------------------------
id: OPTIONAL INT64 R:0 D:1
title: OPTIONAL BINARY O:UTF8 R:0 D:1
release_date: OPTIONAL BINARY O:UTF8 R:0 D:1
video_release_date: OPTIONAL BINARY O:UTF8 R:0 D:1
imdb_url: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:1682 TS:172459
--------------------------------------------------------------------------------------------------------------
id: INT64 SNAPPY DO:0 FPO:4 SZ:7546/13508/1.79 VC:1682 ENC:PLAIN,RLE,BIT_PACKED
title: BINARY SNAPPY DO:0 FPO:7550 SZ:30166/46293/1.53 VC:1682 ENC:PLAIN,RLE,BIT_PACKED
release_date: BINARY SNAPPY DO:0 FPO:37716 SZ:3007/5332/1.77 VC:1682 ENC:RLE,PLAIN_DICTIONARY [more]...
video_release_date: BINARY SNAPPY DO:0 FPO:40723 SZ:57/53/0.93 VC:1682 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
imdb_url: BINARY SNAPPY DO:0 FPO:40780 SZ:35344/107273/3.04 VC:1682 ENC:PLAIN,RLE,BIT_PACKED
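In the row group section, each SZ field reads as compressed bytes/uncompressed bytes/ratio (for id, 13508 / 7546 ≈ 1.79). As a rough cross-check, the sketch below sums those fields over the column lines copied from the output above; the uncompressed total should come out equal to the TS value on the row group line (172459). sum_sizes is a hypothetical helper name, and the script only parses text, so it runs without parquet-tools installed.

```shell
# Hypothetical helper: sum the compressed and uncompressed sizes found in
# the SZ:compressed/uncompressed/ratio fields of `parquet-tools meta` output.
sum_sizes() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^SZ:/) {
        split(substr($i, 4), sz, "/")   # sz[1] = compressed, sz[2] = uncompressed
        comp += sz[1]
        raw += sz[2]
      }
    }
  }
  END {
    # Print: total compressed, total uncompressed, overall ratio.
    if (comp > 0) printf "%d %d %.2f\n", comp, raw, raw / comp
  }'
}

# Column lines copied from the row group section above.
meta_columns='id: INT64 SNAPPY DO:0 FPO:4 SZ:7546/13508/1.79 VC:1682 ENC:PLAIN,RLE,BIT_PACKED
title: BINARY SNAPPY DO:0 FPO:7550 SZ:30166/46293/1.53 VC:1682 ENC:PLAIN,RLE,BIT_PACKED
release_date: BINARY SNAPPY DO:0 FPO:37716 SZ:3007/5332/1.77 VC:1682 ENC:RLE,PLAIN_DICTIONARY [more]...
video_release_date: BINARY SNAPPY DO:0 FPO:40723 SZ:57/53/0.93 VC:1682 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
imdb_url: BINARY SNAPPY DO:0 FPO:40780 SZ:35344/107273/3.04 VC:1682 ENC:PLAIN,RLE,BIT_PACKED'

totals=$(printf '%s\n' "$meta_columns" | sum_sizes)
echo "$totals"   # prints: 76120 172459 2.27
```

The compressed total of 76120 bytes is slightly below the 77314-byte file size reported by hadoop fs -ls, which is consistent with the file also carrying its footer metadata.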
3. Print the content of the file
parquet-tools cat movies.parquet
4. Find the column with the best compression ratio
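parquet-tools meta movies.parquet
In the output shown in step 2, each column's SZ field lists compressed bytes/uncompressed bytes/ratio, so imdb_url has the best compression ratio (3.04).
Alternatively, use dump with -d to suppress the data: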
parquet-tools dump -d movies.parquet
row group 0
--------------------------------------------------------------------------------------------------------------
id: INT64 SNAPPY DO:0 FPO:4 SZ:7546/13508/1.79 VC:1682 ENC:BIT_PACKED,RLE,PLAIN
title: BINARY SNAPPY DO:0 FPO:7550 SZ:30166/46293/1.53 VC:1682 ENC:BIT_PACKED,RLE,PLAIN
release_date: BINARY SNAPPY DO:0 FPO:37716 SZ:3007/5332/1.77 VC:1682 ENC:BIT_PACKED,RLE,PLAIN [more]...
video_release_date: BINARY SNAPPY DO:0 FPO:40723 SZ:57/53/0.93 VC:1682 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
imdb_url: BINARY SNAPPY DO:0 FPO:40780 SZ:35344/107273/3.04 VC:1682 ENC:BIT_PACKED,RLE,PLAIN
id TV=1682 RL=0 DL=1
----------------------------------------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:13463 VC:1682
title TV=1682 RL=0 DL=1
----------------------------------------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:46218 VC:1682
release_date TV=1682 RL=0 DL=1 DS: 241 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:1675 VC:1682
video_release_date TV=1682 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:10 VC:1682
imdb_url TV=1682 RL=0 DL=1
----------------------------------------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:107183 VC:1682
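Reading the ratios by eye works for five columns; for wider files the comparison can be scripted. The sketch below is one possible approach, assuming the SZ:compressed/uncompressed/ratio layout shown in the column lines above. best_ratio is a hypothetical helper name, and the sample input is copied from the dump output so the script runs without parquet-tools.

```shell
# Hypothetical helper: print the column whose SZ ratio (third SZ component)
# is highest. Page lines such as "SZ:13463" have no ratio and are skipped.
best_ratio() {
  awk '{
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^SZ:[0-9]+\/[0-9]+\//) {
        split(substr($i, 4), sz, "/")
        if (sz[3] + 0 > best) { best = sz[3] + 0; name = $1 }
      }
    }
  }
  END {
    sub(/:$/, "", name)               # drop the trailing colon from the column name
    printf "%s %.2f\n", name, best
  }'
}

# Column lines copied from the dump output above. On a real file you would
# pipe `parquet-tools dump -d movies.parquet` into the helper instead.
dump_columns='id: INT64 SNAPPY DO:0 FPO:4 SZ:7546/13508/1.79 VC:1682 ENC:BIT_PACKED,RLE,PLAIN
title: BINARY SNAPPY DO:0 FPO:7550 SZ:30166/46293/1.53 VC:1682 ENC:BIT_PACKED,RLE,PLAIN
release_date: BINARY SNAPPY DO:0 FPO:37716 SZ:3007/5332/1.77 VC:1682 ENC:BIT_PACKED,RLE,PLAIN [more]...
video_release_date: BINARY SNAPPY DO:0 FPO:40723 SZ:57/53/0.93 VC:1682 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
imdb_url: BINARY SNAPPY DO:0 FPO:40780 SZ:35344/107273/3.04 VC:1682 ENC:BIT_PACKED,RLE,PLAIN'

winner=$(printf '%s\n' "$dump_columns" | best_ratio)
echo "$winner"   # prints: imdb_url 3.04
```

Note that video_release_date has a ratio below 1 (0.93): its 53 uncompressed bytes grow to 57 bytes after Snappy compression, so a higher ratio is better and a value under 1 means compression did not help.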
Next
- Back to the lab
- Move on to the next lab: Create a partitioned dataset