The Kite Dataset command line interface (CLI) provides utility commands that let you perform essential tasks such as creating a schema and dataset, importing data from a CSV file, and viewing the results.

Each command is described below. See Using the Kite CLI to Create a Dataset for a practical example of the CLI in use.


  • general options: options for all commands.
  • help: get help for the dataset command in general or a specific command.
  • create: create a dataset based on an existing schema.
  • copy: copy one dataset to another dataset.
  • transform: transform records from one dataset and store them in another dataset.
  • update: update the metadata descriptor for a dataset.
  • delete: delete a dataset.
  • schema: view the schema for an existing dataset.
  • info: show metadata for a dataset.
  • show: show the first n records of a dataset.
  • csv-schema: create a schema from a CSV data sample.
  • csv-import: import CSV data.
  • json-schema: create a schema from a JSON data sample.
  • json-import: import JSON data.
  • obj-schema: create a schema from a Java object.
  • partition-config: create a partition strategy for a schema.
  • mapping-config: create a mapping strategy for a schema.
  • log4j-config: configure Log4j.
  • flume-config: configure Flume.

General options

Every command begins with kite-dataset, followed by general options. Currently, the only general option turns on debugging, which shows a stack trace if something goes wrong while the command runs. Additional options might be added as the product matures.

-v Turn on debug logging and show stack traces.

The Kite CLI supports the following environment variables.

HIVE_HOME Root directory of Hive instance
HIVE_CONF_DIR Configuration directory for Hive instance
HBASE_HOME Root directory of HBase instance
HADOOP_MAPRED_HOME Root directory for MapReduce
HADOOP_HOME Root directory for Hadoop instance

To show the values for these variables at runtime, set the debug= option to true. This can be helpful when troubleshooting issues where one or more of these resources is not found. For example:

debug=true kite-dataset info users

Use the flags= option to pass arguments to the internal Hadoop jar command. For example:

flags="-Xmx512m" kite-dataset info users
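Before you turn on debug=true, it can help to check which of these variables your shell already has set. The following is a minimal sketch, not part of the Kite CLI; the function name show_kite_env is made up for this example.

```shell
# Report which of the environment variables read by the Kite CLI are set.
# show_kite_env is a hypothetical helper, not a Kite command.
show_kite_env() {
  for name in HIVE_HOME HIVE_CONF_DIR HBASE_HOME HADOOP_MAPRED_HOME HADOOP_HOME; do
    # Indirect lookup of the variable named by $name (names are fixed above).
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      printf '%s=%s\n' "$name" "$value"
    else
      printf '%s is not set\n' "$name"
    fi
  done
}

show_kite_env
```

A variable reported as not set is a likely starting point when a command cannot find Hive, HBase, or Hadoop.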

Use csv-schema to generate an Avro schema from a comma separated value (CSV) file.

The schema produced by this command is a record based on the first few lines of the file. If the first line is a header, it is used to name the fields.

Field schemas are set by inspecting the first non-empty value in each field. Fields are nullable unless the field’s name is passed using --require. Nullable fields default to null.

The type is determined by the following rules:
* If the data is numeric and has a decimal point, the type is double
* If the data is numeric and has no decimal point, the type is long
* Otherwise, the type is string
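The rules above can be approximated with a small shell function. This is only a sketch of the decision order, not Kite's implementation. It assumes that "numeric" means an optional minus sign followed by digits with at most one decimal point, and the name infer_type is made up for this example.

```shell
# infer_type VALUE: print the Avro type the rules above would assign.
# Assumption: "numeric" means an optional minus sign and digits,
# with at most one decimal point. Kite's own parsing may differ.
infer_type() {
  if printf '%s\n' "$1" | grep -Eq '^-?([0-9]+\.[0-9]*|\.[0-9]+)$'; then
    echo double
  elif printf '%s\n' "$1" | grep -Eq '^-?[0-9]+$'; then
    echo long
  else
    echo string
  fi
}

# Try the function on a few made-up values.
for sample in 42 3.14 -7 2014-05-01 hello; do
  printf '%s -> %s\n' "$sample" "$(infer_type "$sample")"
done
```

Note that a date such as 2014-05-01 falls through to string, because it is not purely numeric.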

See CSV format details.


kite-dataset [-v] csv-schema <sample csv path> [command options]


--class, --record-name A class name or record name for the schema result. This value is required.
-o, --output Save schema avsc to path.
--require Mark a field required; the schema for this field will not allow null values.
Use more than once to require multiple fields.
--no-header Use this option when the CSV data file does not have header information in the first line.
Fields are given the default names field_0, field_1,…field_n.
--skip-lines The number of lines to skip before the start of the CSV data. Default is 0.
--delimiter Delimiter character in the CSV data file. Default is the comma (,).
--escape Escape character in the CSV data file. Default is the backslash (\).
--quote Quote character in the CSV data file. Default is the double-quote (").
--minimize Minimize schema file size by eliminating white space.


Print the schema to standard out:

kite-dataset csv-schema sample.csv --class Sample

Write the schema to sample.avsc:

kite-dataset csv-schema sample.csv --class Sample -o sample.avsc

Use json-schema to build an Avro schema from a JSON data sample.

This command produces a Schema by inspecting the first few JSON objects in the data sample. Each JSON object is converted to a Schema that describes it, and the final Schema is the result of merging each sample object’s Schema.

For example, the following two-object data sample:

{ "id": 1, "color": "green", "shade": "dark" }
{ "id": 2, "color": "red" }

produces the following merged Schema:

{
  "type" : "record",
  "name" : "Sample",
  "fields" : [ {
    "name" : "id",
    "type" : "int"
  }, {
    "name" : "color",
    "type" : "string"
  }, {
    "name" : "shade",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}

See JSON format details.
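In the merged Schema above, shade is nullable because it is missing from one of the sample objects. To preview which fields would end up nullable for that reason, a rough sketch like the following can help. It is not how Kite merges schemas. It assumes one flat JSON object per line and no text that looks like "key": inside values, and the name optional_fields is made up for this example.

```shell
# optional_fields FILE: list keys missing from at least one object in FILE.
# Assumes one flat JSON object per line; this is a text-based approximation,
# not a JSON parser and not Kite's merge logic.
optional_fields() {
  total=$(grep -c '{' "$1")
  grep -o '"[A-Za-z_][A-Za-z0-9_]*" *:' "$1" |
    tr -d '": ' | sort | uniq -c |
    awk -v total="$total" '$1 < total { print $2 }'
}

# The two-object sample from this section.
cat > samples.json <<'EOF'
{ "id": 1, "color": "green", "shade": "dark" }
{ "id": 2, "color": "red" }
EOF

optional_fields samples.json   # prints: shade
```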


kite-dataset [-v] json-schema <sample json path> [command options]


--class, --record-name A class name or record name for the schema result. This value is required.
-o, --output Save schema avsc to path.
--minimize Minimize schema file size by eliminating white space.


Print an inferred schema for samples.json to standard out:

kite-dataset json-schema samples.json --record-name Sample

Write an inferred schema to sample.avsc:

kite-dataset json-schema samples.json --record-name Sample -o sample.avsc

Use obj-schema to build a schema from a Java class.

Fields are assumed to be nullable if they are Objects, or required if they are primitives. You can edit the generated schema directly to remove the null option for specific fields.


kite-dataset [-v] obj-schema <class name> [command options]


-o, --output Save schema in Avro format to a given path.
--jar Add a jar to the classpath used when loading the Java class.
--lib-dir Add a directory to the classpath used when loading the Java class.
--minimize Minimize schema file size by eliminating white space.


Create a schema for an example User class:

kite-dataset obj-schema org.kitesdk.cli.example.User

Create a schema for a class in a jar:

kite-dataset obj-schema com.example.MyRecord --jar my-application.jar

Save the schema for the example User class to user.avsc:

kite-dataset obj-schema org.kitesdk.cli.example.User -o user.avsc

After you have generated an Avro schema, you can use create to make an empty dataset.


kite-dataset [-v] create <dataset> [command options]


-s, --schema A file containing the Avro schema. This value is required.
-f, --format The storage format. By default, the dataset is created in Avro format.
Use this option to set the format to Parquet: -f parquet
-p, --partition-by A file containing a JSON-formatted partition strategy.
-m, --mapping A file containing a JSON-formatted column mapping.
--set, --property A property to set in the dataset’s descriptor, given as name=value.

Note: The dataset name must not contain a period (.).


Create dataset “users” in Hive:

kite-dataset create users --schema user.avsc

Create dataset “users” using Parquet:

kite-dataset create users --schema user.avsc --format parquet

Create dataset “users” partitioned by JSON configuration using a cache size of 20 (rather than the default cache size of 10):

kite-dataset create users --schema user.avsc --partition-by user_part.json --set kite.writer.cache-size=20

Create dataset “users” and set multiple properties:

kite-dataset create users --schema user.avsc --set kite.writer.cache-size=20 --set dfs.blocksize=256m
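If you do not have a generated schema at hand, you can also write one by hand and pass it to create. The sketch below writes a small, hypothetical user.avsc (the field names are invented for illustration) and runs create only when kite-dataset is on the PATH.

```shell
# Write a small, hypothetical Avro schema; the fields are made up.
cat > user.avsc <<'EOF'
{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "username", "type": "string" },
    { "name": "email", "type": [ "null", "string" ], "default": null }
  ]
}
EOF

# Create the dataset only if the Kite CLI is installed.
if command -v kite-dataset >/dev/null 2>&1; then
  kite-dataset create users --schema user.avsc
else
  echo "kite-dataset not found; skipping create"
fi
```

The email field uses a union with null and a null default, which is the same shape the schema commands produce for nullable fields.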

Use update to change the metadata descriptor for a dataset.


kite-dataset [-v] update <dataset> [command options]


-s, --schema The file containing the Avro schema.
--set, --property Add a property pair, given as name=value.


Update schema for dataset “users” in Hive:

kite-dataset update users --schema user.avsc

Update an HDFS dataset by URI and add a property:

kite-dataset update dataset:hdfs:/user/me/datasets/users --set kite.writer.cache-size=20

Use schema to show the schema for a dataset.


kite-dataset [-v] schema <dataset> [command options]


-o, --output Save schema in Avro format to a given path.
--minimize Minimize schema file size by eliminating white space.


Print the schema for dataset “users” to standard out:

kite-dataset schema users

Save the schema for dataset “users” to user.avsc:

kite-dataset schema users -o user.avsc

Use csv-import to copy CSV records into a dataset.

Kite matches the CSV header to the target record schema’s fields by name. If a header is not present (that is, you use the --no-header option), then CSV columns are matched with the target fields based on their position.

As Kite constructs each record, it validates values using the target field’s schema. Invalid values (in numeric fields) and null values (in required fields) cause exceptions. Kite handles empty strings as null values for numeric fields.

See CSV format details.


kite-dataset [-v] csv-import <csv path> <dataset> [command options]


--no-header Use this option when the CSV data file does not have header information in the first line.
Fields are given the default names field_0, field_1,…field_n.
--skip-lines Lines to skip before CSV start (default: 0)
--delimiter Delimiter character. Default is comma (,).
--escape Escape character. Default is backslash (\).
--quote Quote character. Default is double quote (").
--num-writers The number of writer processes to use
--no-compaction Copy to output directly, without compacting the data
--jar Add a jar to the runtime classpath
--transform A transform DoFn class name


Copy the records from sample.csv to a Hive dataset named “sample”:

kite-dataset csv-import path/to/sample.csv sample
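The commands described so far can be chained into one small workflow. The following sketch uses a made-up ratings.csv and a hypothetical run helper that only prints each command unless RUN=1 is set, so you can read through the steps without a Hadoop cluster. The empty rating in the last row illustrates an empty string in a numeric field, which Kite handles as null.

```shell
# A tiny sample with a header line; names and values are made up.
cat > ratings.csv <<'EOF'
user_id,movie_id,rating
1,10,4.5
2,11,
EOF

# run CMD...: print the command, or execute it when RUN=1 is set.
# This helper is hypothetical and exists only for this sketch.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi
}

run kite-dataset csv-schema ratings.csv --class Rating --require user_id -o rating.avsc
run kite-dataset create ratings --schema rating.avsc
run kite-dataset csv-import ratings.csv ratings
run kite-dataset show ratings
```

Because user_id is passed with --require, a row with an empty user_id would cause an exception during the import.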

Use json-import to copy JSON objects into a dataset.

Kite uses the target dataset’s Schema to validate and store the JSON objects.

  • All values must match the type specified in the target Schema
  • JSON objects will match both record and map Schemas
  • When converting a JSON object with a record Schema:
    • Only the record’s fields are used; other key-value pairs are ignored
    • All fields must be present or have a default value in the record Schema
  • When converting a JSON object with a map Schema, all key-value pairs are used

Invalid values, missing record fields, and other problems cause exceptions.

See JSON format details.


kite-dataset [-v] json-import <json path> <dataset name> [command options]


--num-writers The number of writer processes to use
--no-compaction Copy to output directly, without compacting the data
--jar Add a jar to the runtime classpath
--transform A transform DoFn class name


Copy the records from sample.json to dataset sample:

kite-dataset json-import path/to/sample.json sample

Copy the records from sample.json to a dataset URI:

kite-dataset json-import path/to/sample.json dataset:hdfs:/user/me/datasets/sample

Copy the records from an HDFS directory to sample:

kite-dataset json-import hdfs:/data/path/samples/ sample

Use show to print the first n records in a dataset.


kite-dataset [-v] show <dataset> [command options]


-n, --num-records The number of records to print. The default number is 10.


Show the first 10 records in dataset “users”:

kite-dataset show users

Show the first 50 records in dataset “users”:

kite-dataset show users -n 50

Use copy to copy records from one dataset to another.


kite-dataset [-v] copy <source dataset> <destination dataset> [command options]


--no-compaction Copy to output directly, without compacting the data.
--num-writers The number of writer processes to use.


Copy the contents of movies_avro to movies_parquet:

kite-dataset copy movies_avro movies_parquet

Copy the movies dataset into HBase in a map-only job:

kite-dataset copy movies dataset:hbase:zk-host/movies --no-compaction

Use delete to delete one or more datasets and related metadata.


kite-dataset [-v] delete <datasets> [command options]


Delete all data and metadata for the dataset “users”:

kite-dataset delete users

Use partition-config to build a partition strategy for a schema.

The resulting partition strategy is a valid JSON partition strategy file.

Entries in the partition strategy are specified by field:type pairs, where field is the source field from the given schema and type can be:

year Extract the year from a timestamp
month Extract the month from a timestamp
day Extract the day from a timestamp
hour Extract the hour from a timestamp
minute Extract the minute from a timestamp
hash[N] Hash the source field, using N buckets
copy Copy the field without modification (identity)
provided Doesn’t use a source field; the field name is used to name the partition

Provided partitioners do not reference a source field and instead require that a value is provided when writing. Values can be provided by writing to views.


kite-dataset [-v] partition-config <field:type pairs> [command options]


-s, --schema The file containing the Avro schema. This value is required
-o, --output Save partition JSON file to path
--minimize Minimize output size by eliminating white space


Partition by email address, balanced across 16 hash partitions, and save to a file:

kite-dataset partition-config email:hash[16] email:copy -s user.avsc -o part.json

Partition by created_at time’s year, month, and day:

kite-dataset partition-config created_at:year created_at:month created_at:day -s event.avsc
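To get a feel for what hash[N] does, the sketch below assigns values to one of N buckets. It uses cksum (a CRC) as a stand-in for Kite's own hash function, so the bucket numbers it prints will not match the partitions Kite creates. The function name bucket_for is made up for this example.

```shell
# bucket_for VALUE N: print a bucket number in the range 0..N-1.
# cksum (a CRC) stands in for Kite's hash function, which this sketch
# does not reproduce; only the "hash, then modulo N" idea is shown.
bucket_for() {
  crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( crc % $2 ))
}

# Made-up email addresses spread across 16 buckets.
for email in alice@example.com bob@example.com carol@example.com; do
  printf '%s -> bucket %s\n' "$email" "$(bucket_for "$email" 16)"
done
```

The same value always lands in the same bucket, which is what lets a hash partition balance records while keeping each email address in one place.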

Use mapping-config to build a column mapping for a schema. A column mapping is required for HBase. The resulting mapping definition is a valid JSON mapping file.

Mappings are specified by field:type pairs, where field is a source field from the given schema and type can be:

key Uses a key mapping
version Uses a version mapping (for optimistic concurrency)
any string The given string is used as the family in a column mapping

If the last option is used, the mapping type will be determined by the source field type. Numbers will use counter, hash maps and records will use keyAsColumn, and all others will use column.
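The choice of mapping type for a column family can be written as a small lookup. This sketch restates the rule above and is not Kite's code. It assumes that "numbers" means the Avro types int and long, and the name mapping_type_for is made up for this example.

```shell
# mapping_type_for AVRO_TYPE: print the mapping type chosen when a column
# family name is given. Assumption: "numbers" means Avro int and long.
mapping_type_for() {
  case "$1" in
    int|long)   echo counter ;;
    map|record) echo keyAsColumn ;;
    *)          echo column ;;
  esac
}

# Show the mapping type for a few Avro types.
for avro_type in long map record string; do
  printf '%s -> %s\n' "$avro_type" "$(mapping_type_for "$avro_type")"
done
```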


kite-dataset [-v] mapping-config <field:type pairs> [command options]


-s, --schema The file containing the Avro schema.
-p, --partition-by The file containing the JSON partition strategy.
--minimize Minimize output size by eliminating white space.


Store email in the key, other fields in column family u:

kite-dataset mapping-config email:key username:u id:u --schema user.avsc -o user-cols.json

Store preferences hash-map in column family prefs:

kite-dataset mapping-config preferences:prefs --schema user.avsc

Use the version field as an OCC version:

kite-dataset mapping-config version:version --schema user.avsc

Use help to retrieve details on the functions of one or more dataset commands.


kite-dataset [-v] help <commands> [command options]


Retrieve details for the create, show, and delete commands:

kite-dataset help create show delete

Use transform to transform records from one dataset and store them in another dataset.


kite-dataset transform <source dataset> <destination dataset> [command options]


--no-compaction Copy to output without compacting the data
--num-writers The number of writer processes to use
--transform A transform DoFn class name
--jar Add a jar to the runtime class path


Transform the contents of movies_src using com.example.TransformFn:

kite-dataset transform movies_src movies --transform com.example.TransformFn --jar fns.jar

Use info to print all metadata for a dataset.


kite-dataset info <dataset name>


Print all metadata for the “users” dataset:

kite-dataset info users


Use log4j-config to build a log4j configuration to log events to a dataset.


kite-dataset log4j-config <dataset name> --host <flume hostname> [command options]


--port Flume port
--class, --package Java class or package from which to log
--log-all Configure the root logger to send to Flume
-o, --output Save the log4j configuration to a file


Print log4j configuration to log to dataset “users”:

kite-dataset log4j-config --host <flume hostname> --class org.kitesdk.examples.MyLoggingApp users

Save log4j configuration to a file:

kite-dataset log4j-config --host <flume hostname> --package org.kitesdk.examples -o <output file> users

Print log4j configuration to log from all classes:

kite-dataset log4j-config --host <flume hostname> --log-all users

Use flume-config to build a Flume configuration to log events to a dataset.


kite-dataset flume-config <dataset name or URI> [command options]


--use-dataset-uri Configure Flume with a dataset URI. Requires Flume 1.6 or later.
--agent Flume agent name
--source Flume source name
--bind Avro source bind address
--port Avro source port
--channel Flume channel name
--channel-type Flume channel type (memory or file)
--channel-capacity Flume channel capacity
--channel-transaction-capacity Flume channel transaction capacity
--checkpoint-dir File channel checkpoint directory (required when using --channel-type file)
--data-dir File channel data directory. Use the option multiple times for multiple data directories. (required when using --channel-type file)
--sink Flume sink name
--batch-size Records to write per batch
--roll-interval Time in seconds before starting the next file
--proxy-user User identity to use when writing to HDFS
-o, --output Save the Flume configuration to a file


Print Flume configuration to log to dataset “users”:

kite-dataset flume-config --checkpoint-dir /data/0/flume/checkpoint --data-dir /data/1/flume/data users

Print Flume configuration to log to dataset dataset:hdfs:/datasets/default/users:

kite-dataset flume-config --channel-type memory dataset:hdfs:/datasets/default/users

Save Flume configuration to a file:

kite-dataset flume-config --channel-type memory -o <output file> users

