Kite Dataset API
Most of the time, you can create datasets and system prototypes using the Kite command line interface (CLI). When you want to perform these tasks using a Java program, you can use the Kite API. With the Kite API, you can perform tasks such as reading a dataset, defining and reading views of a dataset, and using MapReduce to process a dataset.
Dataset
A dataset is a collection of records, like a relational table. Records are similar to table rows, but the columns can contain strings, numbers, or nested data structures such as lists, maps, and other records.
The Dataset interface provides methods to work with the collection of records it represents. A Dataset is immutable.
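For example, here is a minimal sketch of inspecting a loaded dataset through the interface's accessors. It assumes a products dataset already exists in Hive; Datasets.load is covered below.

```java
// A minimal sketch; assumes a "products" dataset already exists in Hive.
// Requires org.kitesdk.data.* and org.apache.avro.generic.GenericData.Record.
Dataset<Record> products = Datasets.load("dataset:hive:products", Record.class);

System.out.println(products.getName());                    // "products"
System.out.println(products.getDescriptor().getSchema());  // the Avro record schema
```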
Dataset URIs
Datasets are identified by URI. The dataset URI determines how Kite stores your dataset and its configuration metadata.
For example, if you want to create the products dataset in Hive, you can use this URI:

```
dataset:hive:products
```
Common dataset URI patterns are Hive, HDFS, Local FileSystem, and HBase. See Dataset and View URIs.
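As a rough sketch, the common URI patterns look like this; the paths and ZooKeeper hosts below are placeholders:

```java
// Illustrative dataset URIs; paths and hosts are placeholders.
String hive  = "dataset:hive:products";               // managed Hive table
String hdfs  = "dataset:hdfs:/datasets/products";     // files in HDFS
String local = "dataset:file:/tmp/datasets/products"; // local filesystem
String hbase = "dataset:hbase:zk1,zk2,zk3/products";  // HBase, via its ZooKeeper quorum
```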
DatasetDescriptors
A DatasetDescriptor provides the structural definition of a dataset. It encapsulates all of the configuration needed to read and write data.
When you create a Dataset, you supply a DatasetDescriptor. That descriptor is saved and used by Kite when you interact with the dataset.
At a minimum, a DatasetDescriptor requires the record schema, which describes the records. You create a DatasetDescriptor object using the fluent DatasetDescriptor.Builder to set the schema and other configuration. See DatasetDescriptor Options for more configuration options.
Datasets
The Datasets class is the starting point when working with the Kite Data API. It provides operations around datasets, such as creating or deleting a dataset.
create
With a storage URI and a DatasetDescriptor, you can use Datasets.create to create a dataset instance. This example creates a dataset named products in the Hive metastore.
```java
DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schemaUri("resource:product.avsc")
    .build();

Dataset<Record> products = Datasets.create("dataset:hive:products", descriptor);
```
The create method creates an empty dataset. You can use a DatasetWriter to populate your dataset.
load
Load an existing dataset for processing using the load method. The load method verifies that the dataset exists, retrieves its metadata, and confirms that you can communicate with its services.
```java
Dataset<Record> products = Datasets.load("dataset:hive:products");
```
Once you load the dataset, you can retrieve and view the dataset records using DatasetReader.
The load method can also be used with a view URI, as in the sketch below.
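This hedged example loads a view that constrains the records returned; the field name and value in the query string are illustrative:

```java
// Hypothetical view URI: only records whose id field equals 1.
// The field name and value are illustrative; see Dataset and View URIs.
// View is org.kitesdk.data.View.
View<Record> product1 = Datasets.load("view:hive:products?id=1", Record.class);
```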
update
Over time, your dataset requirements might change. You can use update to change a dataset's configuration by replacing its DatasetDescriptor.
You can add, remove, or change the datatype of columns in your dataset, provided you don't attempt a change that would corrupt the data. Kite follows the Avro schema evolution guidelines. See Schema Evolution for more detail and examples.
This example updates the schema for the existing products dataset. First, it creates a new descriptor builder from the existing descriptor, to copy its settings, then sets product_v2.avsc as the schema and builds a new descriptor. Then it updates the dataset to use that new descriptor.
```java
Dataset<Record> products = Datasets.load(
    "dataset:hive:products", Record.class);

// Copy the settings of the existing descriptor, replacing only the schema
DatasetDescriptor updatedDescriptor =
    new DatasetDescriptor.Builder(products.getDescriptor())
        .schemaUri("resource:product_v2.avsc")
        .build();

Datasets.update("dataset:hive:products", updatedDescriptor);
```
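For illustration only, a compatible product_v2.avsc might add an optional field with a default value, which is a safe change under Avro's evolution rules. This file is hypothetical:

```json
{
  "type": "record",
  "name": "Product",
  "namespace": "org.kitesdk.examples.data.generic",
  "doc": "Hypothetical v2 of the product record, for illustration",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "id", "type": "long" },
    { "name": "price", "type": ["null", "double"], "default": null }
  ]
}
```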
delete
Delete a dataset, based on its URI, with the delete method. Kite takes care of any housekeeping, such as deleting both the data and any metadata stored separately.
```java
Datasets.delete("dataset:hive:products");
```
Working with Datasets
Avro Objects
Regardless of the underlying storage format, Kite uses Avro’s object models for its in-memory representations of data. This means you can write applications that use the same object classes and store the dataset in any of the available formats. To change the underlying storage format in your application, you only need to change its dataset URI.
In this introduction, Kite returns Avro’s generic data classes. Kite also supports Avro’s specific and reflect object models.
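For example, here is a sketch of loading the same dataset with Avro's specific object model. The Product class is assumed to have been generated from the Avro schema and is hypothetical here:

```java
// Assumes a generated Product class (hypothetical); passing the class
// selects Avro's specific object model instead of generic records.
Dataset<Product> products = Datasets.load("dataset:hive:products", Product.class);
```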
Avro Schema
A schema defines the field names and data types for records in a dataset. For example, this is the schema definition for the products dataset. It defines a name field as a string and an id field as a long.
```json
{
  "type": "record",
  "name": "Product",
  "namespace": "org.kitesdk.examples.data.generic",
  "doc": "A product record",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "id", "type": "long" }
  ]
}
```
Once you have defined a schema, you can use DatasetDescriptor.Builder to create a descriptor instance, and then create a dataset using that descriptor. Once a dataset is created, its schema is loaded automatically.
DatasetDescriptor Options
A DatasetDescriptor encapsulates the configuration needed to read and write a dataset. A DatasetDescriptor is immutable, and is created by DatasetDescriptor.Builder.
There are several options available when creating a descriptor:
* Set the record schema (required)
* Add a partition strategy
* Choose the data format and compression
* Add custom key-value properties
Schema
A key element of the dataset descriptor is the record schema. There are a number of ways to find and set the schema using the DatasetDescriptor.Builder#schema methods. For example, you can read a schema definition file from HDFS or the classpath, use the schema from an existing data file, or have the builder inspect a Java class and build a schema for it.
This example reads a schema definition file, product.avsc, from a project's JAR that was built with the schema in src/main/resources/.
```java
DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schemaUri("resource:product.avsc")
    .build();
```
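You can also let the builder derive a schema by reflecting on a Java class. A sketch, again assuming a hypothetical Product class:

```java
// Derive the schema from a Java class via reflection (Product is hypothetical).
DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schema(Product.class)
    .build();
```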
Partition Strategy
Datasets commonly use a partition strategy to control the data layout for efficient storage and retrieval. You can pass either a PartitionStrategy object or a partition strategy JSON definition to DatasetDescriptor.Builder#partitionStrategy. See Partitioned Datasets for a conceptual introduction.
This example constructs a descriptor with its partition strategy created by PartitionStrategy.Builder.
```java
PartitionStrategy ymd = new PartitionStrategy.Builder()
    .year("timestamp")
    .month("timestamp")
    .day("timestamp")
    .build();

DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schemaUri("resource:event.avsc")
    .partitionStrategy(ymd)
    .build();
```
Note that a schema is always required to build a descriptor.
Storage Format
Storage format is set when you create a dataset and cannot be changed. See DatasetDescriptor.Builder#format.
The default storage format is Avro, which is the recommended format for most use cases.
You can alternatively use the Parquet format, which can result in smaller files and better read performance when using a subset of the dataset’s record fields (columns).
Both Avro and Parquet are efficient binary formats that are designed for Hadoop. You can use the Kite CLI or the Hue data browser to view the compressed binary data stored in these formats.
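As a brief sketch, choosing Parquet at creation time is a one-line change on the builder:

```java
// Store the dataset as Parquet instead of the default Avro format.
DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schemaUri("resource:product.avsc")
    .format(Formats.PARQUET)
    .build();
```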
Compression Type
Kite uses Snappy compression by default. You also have the option of using Deflate, Bzip2, Lzo, or uncompressed storage. See DatasetDescriptor.Builder#compressionType.
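A sketch of selecting an alternative codec on the builder; Deflate here is an arbitrary choice:

```java
// Use Deflate instead of the default Snappy codec.
DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schemaUri("resource:product.avsc")
    .compressionType(CompressionType.Deflate)
    .build();
```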
DatasetWriter
The DatasetWriter class stores data in your dataset, using the layout and format you chose when creating the dataset.
This code snippet uses an Avro generic record builder to create a product record for each name in a list, assigns an ID number, and writes each record to the dataset.
```java
Dataset<Record> products = Datasets.load("dataset:hive:products", Record.class);

// Build generic records against the dataset's schema
GenericRecordBuilder builder =
    new GenericRecordBuilder(products.getDescriptor().getSchema());

DatasetWriter<Record> writer = null;
try {
  writer = products.newWriter();
  long i = 0;  // the id field is a long in the schema
  for (String item : items) {
    Record product = builder
        .set("name", item)
        .set("id", i)
        .build();
    writer.write(product);
    i += 1;
  }
} finally {
  if (writer != null) {
    writer.close();
  }
}
```
DatasetReader
DatasetReader retrieves records in a dataset for inspection and processing. It supports iterating through the records as they are read.
This code snippet loads a dataset and prints each record to the console.
```java
Dataset<Record> products = Datasets.load(
    "dataset:hive:products", Record.class);

DatasetReader<Record> reader = null;
try {
  reader = products.newReader();
  for (GenericRecord product : reader) {
    System.out.println(product);
  }
} finally {
  if (reader != null) {
    reader.close();
  }
}
```
If you don’t need the entire dataset, you can use the View API to select a subset of its records.
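For example, a hedged sketch using the view refinement methods; the field name and value are illustrative:

```java
// Refine the loaded dataset to records whose name field equals "toaster".
// The field name and value are illustrative; View is org.kitesdk.data.View.
View<Record> toasters = products.with("name", "toaster");

DatasetReader<Record> reader = toasters.newReader();  // reads only the matching records
```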
Kite Data Artifacts
You can use the Kite data API by adding dependencies for the artifacts described below.
* kite-data-core has the Kite data API, including all of the Kite classes used in this introduction. It also includes the Dataset implementation for both HDFS and local file systems.
* kite-data-hive is a Dataset implementation that creates datasets as Hive tables and stores metadata in the Hive MetaStore. Add a dependency on kite-data-hive if you want to interact with your data through Hive or Impala.
* kite-data-hbase is an experimental Dataset implementation that creates datasets as HBase tables.
* kite-data-crunch provides helpers for using a Kite dataset as a source or target in a Crunch pipeline.
* kite-data-mapreduce provides MapReduce input and output formats that read from or write to Kite datasets.
See the dependencies article for more information.