Datasets have a fairly predictable lifecycle: creation, population, validation, modification, and, ultimately, annihilation.

For the sake of simplicity, let’s break the lifecycle down into six general steps:

1. Generate Schema
2. Create Dataset
3. Populate Dataset
4. Validate Dataset
5. Update Dataset
6. Annihilate Dataset

Preparation

If you have not done so already, install the Kite command-line interface.

Generate a Schema for the Dataset

You define a Kite dataset using an Avro schema. The schema defines the fields for each row in your dataset.

You can create the schema yourself as a plain text file. Avro schema syntax is designed to be concise rather than easy to read, so it can be tricky to create and troubleshoot a schema definition by hand. See Defining a schema in the Apache Avro documentation for more information on writing your own Avro schema.

In most cases, it’s easier to generate a schema definition than to create one by hand. You can generate an Avro schema based on a Java object or a CSV data file.

Inferring a Schema from a Java Class

You can use the CLI command obj-schema to infer a dataset schema from the instance variable fields of a Java class. Classes are mapped to Avro records. Avro reflection only supports concrete classes with no-argument constructors, and it includes all inherited fields that are not static or transient. Fields cannot be null unless they are annotated with @Nullable or declared as a union containing null.

For example, the following code sample shows the pertinent lines from a Java class that describes a movie dataset.

package org.kitesdk.examples.data;
/** Movie class */
class Movie {
  private int id;
  private String title;
  private String releaseDate;
  . . .
  public Movie() {
    // Empty constructor for serialization purposes
  }
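
By default, Avro reflection rejects null values. If a field such as releaseDate can be null, you must annotate it; this is a minimal sketch using Avro's @Nullable annotation:

import org.apache.avro.reflect.Nullable;

class Movie {
  private int id;
  private String title;
  // Becomes a union of null and string in the generated schema
  @Nullable private String releaseDate;

  public Movie() {
  }
}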

Use the CLI command obj-schema to generate an Avro schema file based on the source Java class.

kite-dataset obj-schema org.kitesdk.examples.data.Movie -o movie.avsc

The CLI uses the names and data types of the instance variables in the Java object to construct an Avro schema definition. For the Movie class, it looks like this.

{
  "type":"record",
  "name":"Movie",
  "namespace":"org.kitesdk.examples.data",
  "fields":[
    {"name":"id","type":"int"},
    {"name":"title","type":"string"},
    {"name":"releaseDate","type":"string"}
  ]
}

For more insight into Avro reflection, see the Javadoc entry for Avro reflect.
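
If you prefer to work programmatically, Avro's reflection API can derive the same schema from within Java. A minimal sketch using ReflectData, which ships with Apache Avro:

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class PrintMovieSchema {
  public static void main(String[] args) {
    // Derive the Avro schema from the Movie class, as obj-schema does
    Schema schema = ReflectData.get().getSchema(Movie.class);
    System.out.println(schema.toString(true)); // pretty-printed JSON
  }
}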

Inferring a Schema from a CSV File

The Kite CLI can generate an Avro schema based on a CSV data file.

The CSV data file for the Movie dataset might start off like this.

id,title,releaseDate
1,Sam and the Big Dog,"August 14, 2014"
2,Crocophiles,"November 18, 1995"
. . .

Use the CLI command csv-schema to generate the Avro schema.

kite-dataset csv-schema movie.csv --class Movie -o movie.avsc

The Kite CLI infers field names from the values in the first row and data types from the values in the second row of the CSV file.

{
  "type":"record",
  "name":"Movie",
  "namespace":"org.kitesdk.examples.data",
  "fields":[
    {"name":"id","type":"int"},
    {"name":"title","type":"string"},
    {"name":"releaseDate","type":"string"}
  ]
}

Create Dataset

Once you have an Avro schema, you can create your dataset.

kite-dataset create movie --schema movie.avsc
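
You can also create the dataset from Java. The following is a minimal sketch using the Kite Datasets API; the dataset:hive:movie URI is an assumption that mirrors the default Hive repository the CLI uses for a bare dataset name:

import java.io.File;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.Datasets;

public class CreateMovieDataset {
  public static void main(String[] args) throws Exception {
    // Build a descriptor from the schema file generated earlier
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schema(new File("movie.avsc"))
        .build();
    Datasets.create("dataset:hive:movie", descriptor);
  }
}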

Partition Strategy

In some cases, you can improve the performance of your dataset by creating logical partitions. For example, the Movie dataset could be partitioned by ID. A search by ID then reads only the matching partition, rather than scanning the entire dataset: if you were looking for movie ID 3215, the search would be limited to the partition that contains that record.

You define a partition strategy in JSON format. The following code sample, saved as movie.json, defines a partition strategy for the Movie dataset.

[ {
  "source" : "id",
  "type" : "identity",
  "name" : "id"
} ]

Include the partition-by argument when you execute the create command.

kite-dataset create movie --schema movie.avsc --partition-by movie.json

See Partitioned Datasets for more detail on partition strategies.
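
The same strategy can be expressed in Java with Kite's PartitionStrategy.Builder and attached to the dataset descriptor. A sketch, mirroring movie.json:

import java.io.File;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.PartitionStrategy;

public class PartitionedDescriptor {
  public static void main(String[] args) throws Exception {
    // An identity partition on the id field, as in movie.json
    PartitionStrategy strategy = new PartitionStrategy.Builder()
        .identity("id")
        .build();
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schema(new File("movie.avsc"))
        .partitionStrategy(strategy)
        .build();
    System.out.println(descriptor.getPartitionStrategy());
  }
}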

Column Mapping

Column mapping allows you to configure how your records should be stored in HBase for maximum performance and efficiency. You define the mapping based on the type of data you want to store, and Kite handles the infrastructure required to support your mapping strategy. See Column Mapping.
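
As a rough sketch only (the builder calls below are based on Kite's column mapping documentation and should be checked against it), a mapping might store id as the row key and the remaining fields in an HBase column family:

import org.kitesdk.data.ColumnMapping;

public class MovieColumnMapping {
  public static void main(String[] args) {
    // Hypothetical mapping: id as the key, other fields in column family "m"
    ColumnMapping mapping = new ColumnMapping.Builder()
        .key("id")
        .column("title", "m", "title")
        .column("releaseDate", "m", "releaseDate")
        .build();
    System.out.println(mapping);
  }
}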

Parquet

If you typically work with a subset of the fields in your dataset rather than an entire row, you might want to create the dataset in Parquet format, rather than the default Avro format. See Parquet vs Avro Format.

kite-dataset create movie --schema movie.avsc -f parquet
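
The format can also be set through the Java API when building the descriptor; a brief sketch:

import java.io.File;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.Formats;

public class ParquetDescriptor {
  public static void main(String[] args) throws Exception {
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schema(new File("movie.avsc"))
        .format(Formats.PARQUET) // the default is Formats.AVRO
        .build();
    System.out.println(descriptor.getFormat());
  }
}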

Populate Dataset

Once you create the dataset, you can insert data in a number of ways.

Import CSV

You can use the CLI command csv-import to insert records from a CSV file to your dataset.

kite-dataset csv-import /kite/example/movie.csv movie

See csv-import for additional options.

Copy Dataset

Use the copy command to transfer the contents of one dataset into another.

kite-dataset copy movie_parquet movie

Validate Dataset

Select the first few records of your dataset to ensure that they loaded properly. Use the show command to view the first 10 records in your dataset.

kite-dataset show movie

Ten records is the default. You can set the number of records to return with the -n option. For example, this command returns the first 50 records.

kite-dataset show movie -n 50
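
You can run the same spot check from Java with a dataset reader. A minimal sketch, assuming the dataset:hive:movie URI used earlier:

import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetReader;
import org.kitesdk.data.Datasets;

public class ShowMovies {
  public static void main(String[] args) {
    Dataset<GenericRecord> movies = Datasets.load("dataset:hive:movie", GenericRecord.class);
    DatasetReader<GenericRecord> reader = movies.newReader();
    try {
      int shown = 0;
      // Print the first 10 records, like "kite-dataset show movie"
      while (reader.hasNext() && shown < 10) {
        System.out.println(reader.next());
        shown++;
      }
    } finally {
      reader.close();
    }
  }
}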

Update Dataset

Loading Data

Once you have created your Kite dataset, you can add records as you would with any CDH dataset. If you use csv-import to add more records, they are appended to the dataset.
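
From Java, records are appended through a dataset writer. This sketch writes one generic record; the field values are illustrative only:

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetWriter;
import org.kitesdk.data.Datasets;

public class AppendMovie {
  public static void main(String[] args) {
    Dataset<GenericRecord> movies = Datasets.load("dataset:hive:movie", GenericRecord.class);
    // Build a record that matches the dataset's schema
    GenericRecord record = new GenericRecordBuilder(movies.getDescriptor().getSchema())
        .set("id", 3)                       // hypothetical values
        .set("title", "Example Title")
        .set("releaseDate", "June 1, 2015")
        .build();
    DatasetWriter<GenericRecord> writer = movies.newWriter();
    try {
      writer.write(record); // appended alongside existing records
    } finally {
      writer.close();
    }
  }
}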

Compacting the Dataset

One by-product of the copy command is that it compacts the multiple files in a partition into a single file. This can be particularly useful for datasets with streaming input: you can periodically copy the active dataset to an archive, or copy it aside, delete the current data, and copy the compacted data back into the active dataset.

Updating the Dataset Schema

Over time, your dataset requirements might change. You can add, remove, or change the datatype of columns in your dataset, provided you don’t attempt a change that would result in the loss or corruption of data. Kite follows the guidelines for Schema resolution in the Avro spec. See Schema Evolution for more detail and examples.
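
For example, after adding a new field with a default value to movie.avsc, you could apply the revised schema with the CLI update command (see the CLI reference for its full options):

kite-dataset update movie --schema movie.avsc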

Annihilate Dataset

When you first create a dataset, you might want to tweak it before you go live, and it is often cleaner and easier to delete the nascent dataset and start over. Alternatively, your dataset might simply no longer be needed. Regardless of your motivation, you can permanently remove one or more datasets from CDH using the CLI command delete.

For example, to remove the movie dataset, use the following command.

kite-dataset delete movie
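
The Java equivalent is a single call; a sketch assuming the same Hive-backed URI:

import org.kitesdk.data.Datasets;

public class DeleteMovie {
  public static void main(String[] args) {
    // Permanently removes the dataset's data and metadata
    boolean deleted = Datasets.delete("dataset:hive:movie");
    System.out.println("Deleted: " + deleted);
  }
}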