Lab 3 Solution: Using avro-tools
This is the solution page for Lab 3: Using avro-tools.
Copy the file from HDFS
List the files in your dataset
hadoop fs -ls hdfs:/user/cloudera/example/movies
Found 2 items
drwxr-xr-x - cloudera cloudera 0 2015-02-03 14:19 hdfs:///user/cloudera/example/movies/.metadata
-rw-r--r-- 1 cloudera cloudera 73090 2015-02-03 14:20 hdfs:///user/cloudera/example/movies/a0f892e6-74e5-4098-bfe3-68e2b119046f.avro
Use the hadoop command to copy the .avro file to the local file system:
hadoop fs -copyToLocal /user/cloudera/example/movies/a0f892e6-74e5-4098-bfe3-68e2b119046f.avro movies.avro
1. Dump the file’s header key-value metadata
avro-tools getmeta movies.avro
Produces the schema and compression codec properties:
avro.codec snappy
avro.schema {"type":"record","name":"Movie","doc":"Schema generated by Kite","fields":[{"name":"id","type":["null","long"],"doc":"Type inferred from '1'","default":null},{"name":"title","type":["null","string"],"doc":"Type inferred from 'Toy Story (1995)'","default":null},{"name":"release_date","type":["null","string"],"doc":"Type inferred from '01-Jan-1995'","default":null},{"name":"video_release_date","type":["null","string"],"doc":"Type inferred from 'null'","default":null},{"name":"imdb_url","type":["null","string"],"doc":"Type inferred from 'http://us.imdb.com/M/title-exact?Toy%20Story%20(19'","default":null}]}
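To use these properties in a script, the getmeta output can be parsed into a dictionary. The Python sketch below assumes each output line holds a property name followed by whitespace and then the value, as in the listing above. The function name parse_getmeta and the shortened sample text are hypothetical and only for illustration.

```python
import json

def parse_getmeta(text):
    """Parse `avro-tools getmeta` output into a dict of property name -> value.

    Assumes one property per line, with the key separated from the value
    by whitespace (a tab or a space).
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split on the first run of whitespace only
        props[parts[0]] = parts[1] if len(parts) > 1 else ""
    return props

# Shortened, hypothetical sample of the output shown above.
sample = (
    'avro.codec\tsnappy\n'
    'avro.schema\t{"type":"record","name":"Movie","fields":[]}\n'
)

props = parse_getmeta(sample)
schema = json.loads(props["avro.schema"])  # the schema value is itself JSON
print(props["avro.codec"])
print(schema["name"])
```

In practice the text would come from capturing the standard output of avro-tools getmeta movies.avro rather than from an inline sample.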
2. Dump the file’s schema
avro-tools getschema movies.avro
Produces a readable version of the schema:
{
  "type" : "record",
  "name" : "Movie",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "id",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "title",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "release_date",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "video_release_date",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "imdb_url",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
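A schema like this can also be inspected programmatically. The Python sketch below loads a shortened copy of the schema with the standard json module and lists each field with its non-null type. The helper name field_types is hypothetical, and the sketch assumes every field type is either a plain type name or a union written as a JSON list.

```python
import json

# Shortened copy of the Movie schema, for illustration only.
SCHEMA_JSON = """
{ "type": "record", "name": "Movie", "fields": [
  {"name": "id", "type": ["null", "long"], "default": null},
  {"name": "title", "type": ["null", "string"], "default": null}
]}
"""

def field_types(schema):
    """Map each field name to (list of non-null types, nullable flag)."""
    result = {}
    for field in schema["fields"]:
        ftype = field["type"]
        # A union is a JSON list; treat a plain type as a one-branch union.
        branches = ftype if isinstance(ftype, list) else [ftype]
        non_null = [b for b in branches if b != "null"]
        result[field["name"]] = (non_null, "null" in branches)
    return result

types = field_types(json.loads(SCHEMA_JSON))
for name, (non_null, nullable) in types.items():
    print(name, non_null, "nullable" if nullable else "required")
```

Every field in the Movie schema is a union with null, so each one is reported as nullable.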
3. Dump the content of an Avro data file as JSON
The cat command sounds promising, but it dumps encoded Avro data, and the totext command requires a special file schema. The best way to dump the content is with tojson:
avro-tools tojson movies.avro | tail
The result is Avro-specific JSON with additional structure that preserves Avro type information. Because every field in this schema is a union of null and another type, each non-null field value is printed as a JSON object that maps the Avro type name to the data value.
{"id":{"long":1680},"title":{"string":"Sliding Doors (1998)"},"release_date":{"string":"01-Jan-1998"},"video_release_date":{"string":""},"imdb_url":{"string":"..."}}
{"id":{"long":1681},"title":{"string":"You So Crazy (1994)"},"release_date":{"string":"01-Jan-1994"},"video_release_date":{"string":""},"imdb_url":{"string":"..."}}
{"id":{"long":1682},"title":{"string":"Scream of Stone (Schrei aus Stein) (1991)"},"release_date":{"string":"08-Mar-1996"},"video_release_date":{"string":""},"imdb_url":{"string":"..."}}
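The type wrappers in this output are awkward for tools that expect plain JSON. As a minimal Python sketch, each record can be flattened by replacing every single-key wrapper object with its inner value. The helper name unwrap_unions is hypothetical, and the set of wrapper keys is an assumption based on the two non-null types in this schema (long and string). A field whose real data was an object with exactly one of those keys would be unwrapped by mistake.

```python
import json

# Assumption: only these Avro type names appear as union branches in this schema.
UNION_BRANCHES = {"long", "string"}

def unwrap_unions(record):
    """Return a copy of a tojson record with union type wrappers removed."""
    plain = {}
    for name, value in record.items():
        if isinstance(value, dict) and len(value) == 1:
            (branch, inner), = value.items()
            if branch in UNION_BRANCHES:
                value = inner  # keep only the wrapped data value
        plain[name] = value  # null values and unwrapped data pass through
    return plain

# One record in the style of the tojson output, shortened for illustration.
line = '{"id":{"long":1680},"title":{"string":"Sliding Doors (1998)"},"video_release_date":{"string":""}}'
flat = unwrap_unions(json.loads(line))
print(json.dumps(flat))
```

To process a whole file, the output of avro-tools tojson movies.avro could be piped to a script that applies this function to each line of standard input.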
Next
- Back to the lab
- Move on to the next lab: Using parquet-tools