Lab 3 Solution: Using avro-tools

This is the solution page for Lab 3: Using avro-tools.

Copy the file from HDFS

List the files in your dataset:

hadoop fs -ls hdfs:/user/cloudera/example/movies

Found 2 items
drwxr-xr-x   - cloudera cloudera          0 2015-02-03 14:19 hdfs:///user/cloudera/example/movies/.metadata
-rw-r--r--   1 cloudera cloudera      73090 2015-02-03 14:20 hdfs:///user/cloudera/example/movies/a0f892e6-74e5-4098-bfe3-68e2b119046f.avro

Use the hadoop command to copy the .avro file to the local file system.

hadoop fs -copyToLocal /user/cloudera/example/movies/a0f892e6-74e5-4098-bfe3-68e2b119046f.avro movies.avro
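The generated file name does not have to be copied by hand. As a minimal sketch, assuming the listing format shown above (the path is the last column and contains no spaces), a small filter can pick the .avro paths out of the listing. The function name avro_paths is hypothetical and not part of Hadoop.

```shell
# Hypothetical helper: read `hadoop fs -ls` output on stdin and print
# the path (last column) of every entry whose name ends in .avro.
avro_paths() {
  awk '$NF ~ /\.avro$/ { print $NF }'
}

# Possible usage on the lab VM (assumes the dataset holds exactly one .avro file):
#   hadoop fs -copyToLocal "$(hadoop fs -ls /user/cloudera/example/movies | avro_paths)" movies.avro
```

The "Found 2 items" line and the .metadata directory are skipped because their last column does not end in .avro.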

1. Dump the file’s header key-value metadata

avro-tools getmeta movies.avro

This prints the compression codec and schema properties stored in the file header:

avro.codec  snappy
avro.schema {"type":"record","name":"Movie","doc":"Schema generated by Kite","fields":[{"name":"id","type":["null","long"],"doc":"Type inferred from '1'","default":null},{"name":"title","type":["null","string"],"doc":"Type inferred from 'Toy Story (1995)'","default":null},{"name":"release_date","type":["null","string"],"doc":"Type inferred from '01-Jan-1995'","default":null},{"name":"video_release_date","type":["null","string"],"doc":"Type inferred from 'null'","default":null},{"name":"imdb_url","type":["null","string"],"doc":"Type inferred from 'http://us.imdb.com/M/title-exact?Toy%20Story%20(19'","default":null}]}
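To use a single property in a script, one option is to filter the output by key. This is a sketch that assumes each output line is a key, then whitespace, then the value, as in the output above. The helper name meta_value is hypothetical.

```shell
# Hypothetical helper: read getmeta-style "key<whitespace>value" lines on
# stdin and print only the value belonging to the key given as $1.
meta_value() {
  awk -v key="$1" '$1 == key { sub(/^[^ \t]+[ \t]+/, ""); print }'
}

# Possible usage:
#   avro-tools getmeta movies.avro | meta_value avro.codec
```

Only the key and the separator after it are stripped, so a value that itself contains spaces, such as the schema JSON, is printed unchanged.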

2. Dump the file’s schema

avro-tools getschema movies.avro

Produces a readable version of the schema:

{
  "type" : "record",
  "name" : "Movie",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "id",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "title",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "release_date",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "video_release_date",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "imdb_url",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
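For a quick overview of the columns, the field names can be pulled out of this output. The sketch below assumes the pretty-printed layout shown above, with one "name" entry per line and the field list starting at the "fields" line. It is not a general JSON parser, and field_names is a hypothetical helper name.

```shell
# Hypothetical helper: read pretty-printed getschema output on stdin and
# print the name of each field. Lines before "fields" are ignored, so the
# record name itself is not printed.
field_names() {
  awk -F'"' '/"fields"/ { in_fields = 1 } in_fields && $2 == "name" { print $4 }'
}

# Possible usage:
#   avro-tools getschema movies.avro | field_names
```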

3. Dump the content of an Avro data file as JSON

The cat command sounds like the obvious choice, but it writes encoded Avro data rather than readable text, and the totext command requires a file with a special schema. The best way to dump the content is tojson:

avro-tools tojson movies.avro | tail

The result is Avro-specific JSON with additional structure that preserves Avro type information. Because every field in this schema is a union of null and another type, each non-null value is printed as a JSON object whose key is the Avro type name and whose value is the data.

{"id":{"long":1680},"title":{"string":"Sliding Doors (1998)"},"release_date":{"string":"01-Jan-1998"},"video_release_date":{"string":""},"imdb_url":{"string":"..."}}
{"id":{"long":1681},"title":{"string":"You So Crazy (1994)"},"release_date":{"string":"01-Jan-1994"},"video_release_date":{"string":""},"imdb_url":{"string":"..."}}
{"id":{"long":1682},"title":{"string":"Scream of Stone (Schrei aus Stein) (1991)"},"release_date":{"string":"08-Mar-1996"},"video_release_date":{"string":""},"imdb_url":{"string":"..."}}

Back to the lab
Move on to the next lab: Using parquet-tools