The Kite Dataset command line interface (CLI) provides utility commands that let you quickly create a schema and dataset, import data from a CSV file, then view the results.

Each command is described below. See "Using the Kite CLI to Create a Dataset" for a practical example of the CLI in use.

  • csv-schema (create a schema from a CSV data file)
  • obj-schema (create a schema from a Java object)
  • create (create a dataset based on an existing schema)
  • schema (view the schema for an existing dataset)
  • csv-import (import a CSV data file)
  • show (show the first n records of a dataset)
  • delete (delete a dataset)
  • partition-config (create a partition strategy for a schema)
  • help (get help for the dataset command in general or a specific command)


Use csv-schema to generate an Avro schema from a comma-separated values (CSV) file.


dataset [general options] csv-schema <sample csv path> [command options]



--skip-lines

The number of lines to skip before the start of the CSV data. Default is 0.

--quote

Quote character in the CSV data file. Default is the double-quote (").

--delimiter

Delimiter character in the CSV data file. Default is the comma (,).

--escape

Escape character in the CSV data file. Default is the backslash (\).

--class, --record-name

A class name or record name for the schema result. This value is required.

-o, --output

Save the schema (.avsc file) to the given path.

--no-header

Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1, and so on.

--minimize

Minimize schema file size by eliminating white space.


Print the schema to standard out: dataset csv-schema sample.csv --class Sample

Write the schema to sample-schema.avsc: dataset csv-schema sample.csv --class Sample -o sample-schema.avsc
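To try csv-schema without existing data, a small test file can be created first. The following is a minimal sketch: the file contents and column names are made up for illustration, and the dataset call is skipped when the Kite CLI is not on the PATH.

```shell
# Hypothetical sample data; the column names are illustrative only.
cat > sample.csv <<'EOF'
id,name,email
1,Ada,ada@example.com
2,Linus,linus@example.com
EOF

# Generate the schema only if the Kite CLI is installed.
if command -v dataset >/dev/null 2>&1; then
  dataset csv-schema sample.csv --class Sample -o sample-schema.avsc
else
  echo "dataset command not found; wrote sample.csv only" >&2
fi
```

Because the first line of this file holds the column names, the header is expected to supply the field names and no header-related option should be needed.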

Use obj-schema to build an Avro schema from a Java class.


dataset [general options] obj-schema <class name> [command options]


-o, --output

Save schema in Avro format to a given path.

--jar

Add a jar to the classpath used when loading the Java class.

--lib-dir

Add a directory to the classpath used when loading the Java class.

--minimize

Minimize schema file size by eliminating white space.


Create a schema for an example User class: dataset obj-schema org.kitesdk.cli.example.User

Create a schema for a class in a jar: dataset obj-schema com.example.MyRecord --jar my-application.jar

Save the schema for the example User class to user.avsc: dataset obj-schema org.kitesdk.cli.example.User -o user.avsc
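The jar example above assumes that my-application.jar already exists. As a rough sketch, one way to produce such a jar is shown below. The MyRecord class and its fields are hypothetical, and the sketch assumes the JDK tools javac and jar are installed. Each step is skipped if the tool it needs is missing.

```shell
# Hypothetical record class; the fields are made up for illustration.
mkdir -p src/com/example classes
cat > src/com/example/MyRecord.java <<'EOF'
package com.example;

public class MyRecord {
  public long id;
  public String name;
}
EOF

# Compile and package the class if the JDK tools are available.
if command -v javac >/dev/null 2>&1 && command -v jar >/dev/null 2>&1; then
  javac -d classes src/com/example/MyRecord.java
  jar cf my-application.jar -C classes .
fi

# Build the schema from the packaged class if the jar and the Kite CLI exist.
if [ -f my-application.jar ] && command -v dataset >/dev/null 2>&1; then
  dataset obj-schema com.example.MyRecord --jar my-application.jar -o my-record.avsc
fi
```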

After you have generated an Avro schema, you can use create to make an empty dataset.


dataset [general options] create <dataset name> [command options]


-d, --directory

The root directory of the dataset repository. Optional if using Hive for metadata storage.

--use-hbase

Store data in HBase tables.

--use-hdfs

Store data in HDFS files.

--use-hive

Store data in Hive managed tables (default).

--zookeeper

ZooKeeper host list as host or host:port.

-s, --schema

The file containing the Avro schema. This value is required.

-f, --format

By default, the dataset is created in Avro format. Use this switch to set the format to Parquet (-f parquet).

-p, --partition-by

The file containing a JSON-formatted partition strategy.


Create dataset "users" in Hive: dataset create users --schema user.avsc

Create dataset "users" using Parquet: dataset create users --schema user.avsc --format parquet

Create dataset "users" partitioned by JSON configuration: dataset create users --schema user.avsc --partition-by user_part.json
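The create examples assume a schema file named user.avsc. The sketch below writes a small Avro schema by hand so the command has something to work with. The record and field names are assumptions chosen for illustration, not the schema of the Kite example User class.

```shell
# Hypothetical Avro schema; the field names are illustrative only.
cat > user.avsc <<'EOF'
{
  "type": "record",
  "name": "User",
  "namespace": "org.kitesdk.cli.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "username", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
EOF

# Create the empty dataset only if the Kite CLI is installed.
if command -v dataset >/dev/null 2>&1; then
  dataset create users --schema user.avsc
else
  echo "dataset command not found; wrote user.avsc only" >&2
fi
```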

Use schema to show the Avro schema of an existing dataset.


dataset [general options] schema <dataset name> [command options]


-d, --directory

The root directory of the dataset repository. Optional if using Hive for metadata storage.

--use-hbase

Store data in HBase tables.

--use-hdfs

Store data in HDFS files.

--use-hive

Store data in Hive managed tables (default).

--zookeeper

ZooKeeper host list as host or host:port.

--minimize

Minimize schema file size by eliminating white space.

-o, --output

Save schema in Avro format to a given path.


Print the schema for dataset "users" to standard out: dataset schema users

Save the schema for dataset "users" to user.avsc: dataset schema users -o user.avsc

Use csv-import to copy CSV records into a dataset.


dataset [general options] csv-import <csv path> <dataset name> [command options]


-d, --directory

The root directory of the dataset repository. Optional if using Hive for metadata storage.

--use-hbase

Store data in HBase tables.

--use-hdfs

Store data in HDFS files.

--use-hive

Store data in Hive managed tables (default).

--zookeeper

ZooKeeper host list as host or host:port.

--escape

Escape character in the CSV data file. Default is the backslash (\).

--delimiter

Delimiter character in the CSV data file. Default is the comma (,).

--quote

Quote character in the CSV data file. Default is the double-quote (").

--skip-lines

The number of lines to skip before the start of the CSV data. Default is 0.

--no-header

Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1, and so on.


Copy the records from sample.csv to a dataset named "sample": dataset csv-import path/to/sample.csv sample

Use show to print the first n records in a dataset.


dataset [general options] show <dataset name> [command options]


-d, --directory

The root directory of the dataset repository. Optional if using Hive for metadata storage.

--use-hbase

Store data in HBase tables.

--use-hdfs

Store data in HDFS files.

--use-hive

Store data in Hive managed tables (default).

--zookeeper

ZooKeeper host list as host or host:port.

-n, --num-records

The number of records to print. The default number is 10.


Show the first 10 records in dataset "users": dataset show users

Show the first 50 records in dataset "users": dataset show users -n 50
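The csv-schema, create, csv-import, and show commands can be chained into one step. The helper below is a hypothetical sketch, not part of the Kite CLI. It assumes the dataset command is on the PATH and uses the dataset name as both the record name and the schema file prefix.

```shell
# Hypothetical helper: generate a schema from a CSV file, create a dataset
# from that schema, import the CSV records, and print the first records.
# Usage: import_csv <csv path> <dataset name> [number of records to show]
import_csv() (
  if [ "$#" -lt 2 ]; then
    echo "usage: import_csv <csv path> <dataset name> [count]" >&2
    exit 2
  fi
  csv_path=$1
  name=$2
  count=${3:-10}
  schema_file="${name}.avsc"

  # Each step runs only if the previous one succeeded.
  dataset csv-schema "$csv_path" --class "$name" -o "$schema_file" &&
    dataset create "$name" --schema "$schema_file" &&
    dataset csv-import "$csv_path" "$name" &&
    dataset show "$name" -n "$count"
)
```

For example, import_csv path/to/sample.csv sample 5 would build sample.avsc, create and fill the dataset "sample", and print its first 5 records. The function body runs in a subshell, so its variables do not leak into the calling shell.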

Use delete to delete one or more datasets and their related metadata.


dataset [general options] delete <dataset names> [command options]


-d, --directory

The root directory of the dataset repository. Optional if using Hive for metadata storage.

--use-hbase

Store data in HBase tables.

--use-hdfs

Store data in HDFS files.

--use-hive

Store data in Hive managed tables (default).

--zookeeper

ZooKeeper host list as host or host:port.


Delete all data and metadata for the dataset "users": dataset delete users
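Because delete removes both data and metadata, it may be worth guarding it in scripts. The wrapper below is a hypothetical sketch, not a feature of the Kite CLI. It asks for confirmation on standard error and calls dataset delete only after an explicit yes.

```shell
# Hypothetical wrapper: ask for confirmation before deleting datasets.
# Usage: confirm_delete <dataset name>...
confirm_delete() {
  if [ "$#" -eq 0 ]; then
    echo "usage: confirm_delete <dataset name>..." >&2
    return 2
  fi
  # The prompt goes to standard error so standard output stays clean.
  printf 'Delete dataset(s) %s and all related metadata? [y/N] ' "$*" >&2
  read -r answer
  case "$answer" in
    y|Y|yes|YES) dataset delete "$@" ;;
    *) echo "Skipped: $*" ;;
  esac
}
```

Any answer other than y or yes, including an empty line, leaves the datasets untouched.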

Use partition-config to build a partition strategy for a schema.


dataset [general options] partition-config <field:type pairs> [command options]


-s, --schema

The file containing the Avro schema. This value is required.

-o, --output

Save the partition strategy JSON file to the given path.

--minimize

Minimize output size by eliminating white space.


Partition by email address, balanced across 16 hash partitions, and save the strategy as a JSON file: dataset partition-config email:hash[16] email:copy -s user.avsc -o part.json

Partition by the year, month, and day of the created_at time: dataset partition-config created_at:year created_at:month created_at:day -s event.avsc
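The JSON file written by partition-config is the input that create accepts through --partition-by, so the two commands are often run back to back. The helper below is a hypothetical sketch of that pairing. It assumes the dataset command is on the PATH, and the output file name is chosen arbitrarily.

```shell
# Hypothetical helper: build a partition strategy, then create a dataset with it.
# Usage: create_partitioned <dataset name> <schema file> <field:type>...
create_partitioned() (
  if [ "$#" -lt 3 ]; then
    echo "usage: create_partitioned <dataset name> <schema file> <field:type>..." >&2
    exit 2
  fi
  name=$1
  schema=$2
  shift 2
  part_file="${name}-part.json"

  # Create the dataset only if the strategy file was written successfully.
  dataset partition-config "$@" -s "$schema" -o "$part_file" &&
    dataset create "$name" --schema "$schema" --partition-by "$part_file"
)
```

A call might look like create_partitioned users user.avsc 'email:hash[16]' email:copy. The single quotes matter in some shells: square brackets are glob characters, so an unquoted email:hash[16] can be expanded against file names or rejected outright.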

Use help to retrieve details on the functions of one or more dataset commands.


dataset [general options] help <commands> [command options]


Retrieve details for the create, show, and delete commands: dataset help create show delete

