Kite CLI Reference
The Kite Dataset command line interface (CLI) provides utility commands that let you perform essential tasks such as creating a schema and dataset, importing data from a CSV file, and viewing the results.
Each command is described below. See Using the Kite CLI to Create a Dataset for a practical example of the CLI in use.
Commands
- general options: options for all commands.
- help: get help for the dataset command in general or a specific command.
- create: create a dataset based on an existing schema.
- update: update the metadata descriptor for a dataset.
- compact: compact all or part of a dataset.
- list: list datasets.
- show: show the first n records of a dataset.
- copy: copy one dataset to another dataset.
- transform: transform records from one dataset and store them in another dataset.
- delete: delete a dataset.
- info: show metadata for a dataset.
- schema: view the schema for an existing dataset.
- csv-schema: create a schema from a CSV data sample.
- json-schema: create a schema from a JSON data sample.
- obj-schema: create a schema from a Java object.
- csv-import: import CSV data.
- json-import: import JSON data.
- inputformat-import: import data using a custom InputFormat.
- tar-import: import files from a tarball as a dataset.
- partition-config: create a partition strategy for a schema.
- mapping-config: create a mapping strategy for a schema.
- log4j-config: configure Log4j.
- flume-config: configure Flume.
General options
Every command begins with kite-dataset, followed by general options. Currently, the only general option turns on debugging, which shows a stack trace if something goes awry during execution of the command. Additional options may be added as the product matures.

| Option | Description |
| --- | --- |
| -v, --verbose, --debug | Turn on debug logging and show stack traces. |
The Kite CLI supports the following environment variables.

| Variable | Description |
| --- | --- |
| HIVE_HOME | Root directory of the Hive instance |
| HIVE_CONF_DIR | Configuration directory for the Hive instance |
| HBASE_HOME | Root directory of the HBase instance |
| HADOOP_MAPRED_HOME | Root directory for MapReduce |
| HADOOP_HOME | Root directory of the Hadoop instance |
To show the values for these variables at runtime, set the debug= option to true. This can be helpful when troubleshooting issues where one or more of these resources is not found. For example:
debug=true kite-dataset info users
Use the flags= option to pass arguments to the internal Hadoop jar command. For example:
flags="-Xmx512m" kite-dataset info users
help
Retrieves details on the functions of one or more dataset commands.
Syntax
kite-dataset [-v] help <commands> [command options]
Examples
Retrieve details for the create command:
kite-dataset help create
create
Create a dataset in a new location or using existing data.
The dataset must be either a full dataset URI beginning with “dataset:” or a dataset name that will be created as a Hive table using the default “dataset:hive:<name>” URI.
Any dataset configuration set in the command’s options will be validated against existing data.
If there is no existing data, a schema is required. If existing data is found, the inferred schema, partition strategy, and format are used unless overridden by command options.
Usage
kite-dataset [-v] create <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| -s, --schema | A file containing the Avro schema. |
| -f, --format | Set the dataset format, either avro or parquet. Defaults to avro. |
| -p, --partition-by | A file containing a JSON-formatted partition strategy. |
| -m, --mapping | A file containing a JSON-formatted column mapping. |
| --set, --property | A property to set in the dataset’s descriptor: prop.name=value. |
| --location | The location where data is or should be stored. |
Examples:
Create a new dataset in Hive called “users”:
kite-dataset create users --schema user.avsc
Create dataset “users” using Parquet format:
kite-dataset create users --schema user.avsc --format parquet
Create a Hive dataset for existing data in HDFS using the inferred schema and partition strategy:
kite-dataset create events --location /path/to/events
Create dataset “events” with the given partition strategy and set the writer cache size:
kite-dataset create events --partition-by config.json --set kite.writer.cache-size=20
Create dataset “users” and set multiple properties:
kite-dataset create users --schema user.avsc --set kite.writer.cache-size=20 --set dfs.blocksize=256m
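Create dataset “users” at a specific location by passing a full dataset URI (path illustrative):
kite-dataset create dataset:hdfs:/user/me/datasets/users --schema user.avsc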
update
Update the metadata for a dataset.
This command can update a dataset’s schema or partition strategy, and add or change dataset properties.
Schema updates are validated according to Avro’s Schema evolution rules to ensure that the updated schema can read data written with any previous version of the schema.
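For example, adding an optional field with a default value is a compatible change, because the default is used when reading data written before the field existed. A field added for this purpose might look like the following (field name illustrative):
{ "name" : "favorite_color", "type" : [ "null", "string" ], "default" : null }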
Partition strategy updates only allow replacing a provided partitioner with another partitioner that is compatible with the existing partition data. For example, a provided partitioner called “year” with integer values can be replaced with a year partitioner called “year” that uses a valid timestamp field as its source.
Syntax
kite-dataset [-v] update <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| -s, --schema | A file containing the updated Avro schema. |
| -p, --partition-by | A file containing an updated partition strategy. |
| --set, --property | Add a property pair: prop.name=value. |
Examples:
Update schema for dataset “users” in Hive:
kite-dataset update users --schema user.avsc
Update an HDFS dataset by URI and add a property:
kite-dataset update dataset:hdfs:/user/me/datasets/users --set kite.write.cache-size=20
compact
Compact all or part of a dataset.
Compaction will rewrite partitions in the dataset, combining all files in each partition into a single large file, using the dataset’s current descriptor properties, such as dfs.blocksize or parquet.block.size.
Partitions that have been rewritten will replace existing partitions by moving the rewritten content to a hidden location (a dot folder), deleting the existing partition, and renaming the hidden folder to replace it. This results in a small window of time when data for the partition is not visible.
This compaction does not coordinate with other readers or writers. No other processes should be reading from or writing to the dataset while this command is running.
If multiple directories make up a single logical partition, all of the directories will be replaced with a single rewritten directory containing all of the data. This can happen when reading data with older naming schemes. For example, month=5 and month=05 are two directory names that would be considered the same logical partition.
Syntax
kite-dataset [-v] compact <dataset or view> [command options]
Options
| Option | Description |
| --- | --- |
| --num-writers | The number of writer processes to use. |
Examples:
Compact all partitions of the events dataset:
kite-dataset compact events
Compact all partitions under year=2015 in events:
kite-dataset compact view:hive:events?year=2015
list
Lists available dataset URIs.
An optional repository URI can be given to list datasets in repositories other than Hive.
Repository URIs start with “repo:” and omit the table and namespace options found in dataset or view URIs.
Syntax
kite-dataset [-v] list [repository] [command options]
Examples
Show all supported Hive datasets:
kite-dataset list
Show all datasets in HDFS under /data:
kite-dataset list repo:hdfs:/data
show
Print the first n records in a dataset.
Syntax
kite-dataset [-v] show <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| -n, --num-records | The number of records to print. The default number is 10. |
Examples
Show the first 10 records in dataset “users”:
kite-dataset show users
Show the first 50 records in dataset “users”:
kite-dataset show users -n 50
copy
Copy records from one dataset to another.
Syntax
kite-dataset [-v] copy <source dataset> <destination dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --no-compaction | Copy to output directly, without compacting the data. |
| --num-writers | The number of writer processes to use. |
Examples
Copy the contents of movies_avro to movies_parquet:
kite-dataset copy movies_avro movies_parquet
Copy the movies dataset into HBase in a map-only job:
kite-dataset copy movies dataset:hbase:zk-host/movies --no-compaction
transform
Transforms records from one dataset and stores them in another dataset.
Syntax
kite-dataset transform <source dataset> <destination dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --no-compaction | Copy to output directly, without compacting the data. |
| --num-writers | The number of writer processes to use. |
| --transform | A transform DoFn class name. |
| --jar | Add a jar to the runtime class path. |
Examples
Transform the contents of movies_src using com.example.TransformFn:
kite-dataset transform movies_src movies --transform com.example.TransformFn --jar fns.jar
delete
Delete one or more datasets or views.
If deleting a dataset, all data and metadata is deleted. If deleting a view, only data is deleted.
Both datasets and views are identified by URI, but arguments that do not start with “dataset:” or “view:” are assumed to be Hive table names.
Syntax
kite-dataset [-v] delete <datasets> [command options]
Examples
Delete all data and metadata for the dataset “users”:
kite-dataset delete users
Delete just data from the Hive dataset “users”:
kite-dataset delete view:hive:users
info
Print all metadata for a dataset.
Syntax
kite-dataset info <dataset name>
Example
Print all metadata for the “users” dataset:
kite-dataset info users
schema
Show the schema for a dataset.
Syntax
kite-dataset [-v] schema <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| -o, --output | Save the schema in Avro format to a given path. |
| --minimize | Minimize schema file size by eliminating white space. |
Examples:
Print the schema for dataset “users” to standard out:
kite-dataset schema users
Save the schema for dataset “users” to user.avsc:
kite-dataset schema users -o user.avsc
csv-schema
Use csv-schema to generate an Avro schema from a comma-separated values (CSV) file.
The schema produced by this command is a record based on the first few lines of the file. If the first line is a header, it is used to name the fields.
Field schemas are set by inspecting the first non-empty value in each field. Fields are nullable unless the field’s name is passed using --require. Nullable fields default to null.
The type is determined by the following rules:
- If the data is numeric and has a decimal point, the type is double.
- If the data is numeric and has no decimal point, the type is long.
- Otherwise, the type is string.
See CSV format details.
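As a sketch of these rules, a sample file like the following (data illustrative):
id,name,price
1,apple,0.99
would produce a schema along these lines (record name taken from --class Sample; all fields nullable because --require is not used):
{
  "type" : "record",
  "name" : "Sample",
  "fields" : [
    { "name" : "id", "type" : [ "null", "long" ], "default" : null },
    { "name" : "name", "type" : [ "null", "string" ], "default" : null },
    { "name" : "price", "type" : [ "null", "double" ], "default" : null }
  ]
}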
Syntax
kite-dataset [-v] csv-schema <sample csv path> [command options]
Options
| Option | Description |
| --- | --- |
| --class, --record-name | A class name or record name for the schema result. This value is required. |
| -o, --output | Save the schema .avsc file to the given path. |
| --require | Mark a field required; the schema for this field will not allow null values. Use more than once to require multiple fields. |
| --no-header | Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1,…field_n. |
| --skip-lines | The number of lines to skip before the start of the CSV data. Default is 0. |
| --delimiter | Delimiter character in the CSV data file. Default is the comma (,). |
| --escape | Escape character in the CSV data file. Default is the backslash (\). |
| --quote | Quote character in the CSV data file. Default is the double quote ("). |
| --minimize | Minimize schema file size by eliminating white space. |
Examples
Print the schema to standard out:
kite-dataset csv-schema sample.csv --class Sample
Write the schema to sample.avsc:
kite-dataset csv-schema sample.csv --class Sample -o sample.avsc
json-schema
Build a schema from a JSON data sample.
This command produces a Schema by inspecting the first few JSON objects in the data sample. Each JSON object is converted to a Schema that describes it, and the final Schema is the result of merging each sample object’s Schema.
For example, consider a two-object data sample along the following lines (values illustrative; note that shade appears in only one of the two objects):
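{ "id" : 1, "color" : "blue" }
{ "id" : 2, "color" : "red", "shade" : "dark" }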
This sample produces the following merged Schema:
{
  "type" : "record",
  "name" : "Sample",
  "fields" : [
    { "name" : "id", "type" : "int" },
    { "name" : "color", "type" : "string" },
    { "name" : "shade", "type" : [ "null", "string" ], "default" : null }
  ]
}
See JSON format details.
Syntax
kite-dataset [-v] json-schema <sample json path> [command options]
Options
| Option | Description |
| --- | --- |
| --class, --record-name | A class name or record name for the schema result. This value is required. |
| -o, --output | Save the schema .avsc file to the given path. |
| --minimize | Minimize schema file size by eliminating white space. |
Examples
Print an inferred schema for samples.json to standard out:
kite-dataset json-schema samples.json --record-name Sample
Write an inferred schema to sample.avsc:
kite-dataset json-schema samples.json --record-name Sample -o sample.avsc
obj-schema
Build a schema from a Java class.
Fields are assumed to be nullable if they are Objects, or required if they are primitives. You can edit the generated schema directly to remove the null option for specific fields.
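For instance, a minimal class like this hypothetical User (fields illustrative) yields a schema with a required long id and a nullable string email:
public class User {
  private long id;       // primitive, so the generated field is required
  private String email;  // Object, so the generated field is nullable
}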
Syntax
kite-dataset [-v] obj-schema <class name> [command options]
Options
| Option | Description |
| --- | --- |
| -o, --output | Save the schema in Avro format to a given path. |
| --jar | Add a jar to the classpath used when loading the Java class. |
| --lib-dir | Add a directory to the classpath used when loading the Java class. |
| --minimize | Minimize schema file size by eliminating white space. |
Examples
Create a schema for an example User class:
kite-dataset obj-schema org.kitesdk.cli.example.User
Create a schema for a class in a jar:
kite-dataset obj-schema com.example.MyRecord --jar my-application.jar
Save the schema for the example User class to user.avsc:
kite-dataset obj-schema org.kitesdk.cli.example.User -o user.avsc
csv-import
Copy CSV records into a dataset.
Kite matches the CSV header to the target record schema’s fields by name. If a header is not present (that is, you use the --no-header option), then CSV columns are matched with the target fields based on their position.
As Kite constructs each record, it validates values using the target field’s schema. Invalid values (in numeric fields) and null values (in required fields) cause exceptions. Kite handles empty strings as null values for numeric fields.
See CSV format details.
Syntax
kite-dataset [-v] csv-import <csv path> <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --no-header | Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1,…field_n. |
| --skip-lines | Lines to skip before the start of the CSV data. Default is 0. |
| --delimiter | Delimiter character. Default is the comma (,). |
| --escape | Escape character. Default is the backslash (\). |
| --quote | Quote character. Default is the double quote ("). |
| --num-writers | The number of writer processes to use. |
| --no-compaction | Copy to output directly, without compacting the data. |
| --jar | Add a jar to the runtime classpath. |
| --transform | A transform DoFn class name. |
Examples
Copy the records from sample.csv to a Hive dataset named “sample”:
kite-dataset csv-import path/to/sample.csv sample
json-import
Copy JSON objects into a dataset.
Kite uses the target dataset’s Schema to validate and store the JSON objects.
- All values must match the type specified in the target Schema.
- JSON objects will match both record and map Schemas.
- When converting a JSON object with a record Schema:
  - Only the record’s fields are used; other key-value pairs are ignored.
  - All fields must be present or have a default value in the record Schema.
- When converting a JSON object with a map Schema, all key-value pairs are used.
Invalid values, missing record fields, and other problems cause exceptions.
See JSON format details.
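A sketch of these rules, assuming a record Schema with a required id (long) and an optional note (string, default null):
- { "id" : 7, "note" : "ok", "extra" : true } is stored, and the unknown extra pair is ignored.
- { "id" : 8 } is stored, with note falling back to its null default.
- { "note" : "no id" } causes an exception, because the required id field is missing.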
Syntax
kite-dataset [-v] json-import <json path> <dataset name> [command options]
Options
| Option | Description |
| --- | --- |
| --num-writers | The number of writer processes to use. |
| --no-compaction | Copy to output directly, without compacting the data. |
| --jar | Add a jar to the runtime classpath. |
| --transform | A transform DoFn class name. |
Examples
Copy the records from sample.json to dataset “sample”:
kite-dataset json-import path/to/sample.json sample
Copy the records from sample.json to a dataset URI:
kite-dataset json-import path/to/sample.json dataset:hdfs:/user/me/datasets/sample
Copy the records from an HDFS directory to “sample”:
kite-dataset json-import hdfs:/data/path/samples/ sample
inputformat-import
Copy records read by an InputFormat into a dataset.
This command uses a custom InputFormat, specified by name, and copies either the keys or the values (set by --record-type) into a dataset.
Use the obj-schema command to infer a schema for the key or value class used by the InputFormat.
Syntax
kite-dataset [-v] inputformat-import <data path> <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --format | The InputFormat class name. Must include the package. |
| --jar | Add a jar to the runtime classpath. |
| --record-type | InputFormat argument to use as the record (key or value). |
| --num-writers | The number of writer processes to use. |
| --no-compaction | Copy to output directly, without compacting the data. |
| --transform | A transform DoFn class name. |
| --set, --property | A property to set on the configuration: prop.name=value. |
Examples
Import the keys from a sequence file of MyObject defined in myobject.jar:
kite-dataset inputformat-import data.seq mytable --jar myobject.jar --record-type key \
--format org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
tar-import
Create a dataset from a tarball or load a tarball into an existing dataset.
The datasets that this command creates or writes to use a static schema, TarFileEntry, which has two fields: filename and filecontent.
Tarballs imported using this command are converted to Avro or Parquet files of TarFileEntry records, compressed using Snappy compression.
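A sketch of that schema (the exact field types are an assumption; filename holds the entry path and filecontent the raw bytes of each file):
{
  "type" : "record",
  "name" : "TarFileEntry",
  "fields" : [
    { "name" : "filename", "type" : "string" },
    { "name" : "filecontent", "type" : "bytes" }
  ]
}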
Syntax
kite-dataset [-v] tar-import <tarball> <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --compression | The compression algorithm used to compress the incoming tarball. |
Examples
Convert a tarball of files to an Avro dataset:
kite-dataset tar-import data.tar.gz dataset:hdfs:/user/me/tar_data
partition-config
Builds a partition strategy for a schema.
The resulting partition strategy is a valid JSON partition strategy file.
Entries in the partition strategy are specified by field:type pairs, where field is the source field from the given schema and type can be:
| Type | Description |
| --- | --- |
| year | Extract the year from a timestamp |
| month | Extract the month from a timestamp |
| day | Extract the day from a timestamp |
| hour | Extract the hour from a timestamp |
| minute | Extract the minute from a timestamp |
| hash[N] | Hash the source field, using N buckets |
| copy | Copy the field without modification (identity) |
| provided | Does not use a source field; the field name is used to name the partition |
Provided partitioners do not reference a source field and instead require that a value is provided when writing. Values can be provided by writing to views.
Syntax
kite-dataset [-v] partition-config <field:type pairs> [command options]
Options

| Option | Description |
| --- | --- |
| -s, --schema | The file containing the Avro schema. This value is required. |
| -o, --output | Save the partition strategy JSON file to a path. |
| --minimize | Minimize output size by eliminating white space. |
Examples
Partition by email address, balanced across 16 hash partitions, and save to a file:
kite-dataset partition-config email:hash[16] email:copy -s user.avsc -o part.json
Partition by created_at time’s year, month, and day:
kite-dataset partition-config created_at:year created_at:month created_at:day -s event.avsc
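The strategy file written by this command is plain JSON. For the example above it would look something like the following (a sketch; exact serialization details may vary by version):
[
  { "type" : "year", "source" : "created_at" },
  { "type" : "month", "source" : "created_at" },
  { "type" : "day", "source" : "created_at" }
]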
mapping-config
Builds a column mapping for a schema, required for HBase. The resulting mapping definition is a valid JSON mapping file.
Mappings are specified by field:type pairs, where field is a source field from the given schema and type can be:
| Type | Description |
| --- | --- |
| key | Uses a key mapping |
| version | Uses a version mapping (for optimistic concurrency) |
| any string | The given string is used as the family in a column mapping |
If the last option is used, the mapping type is determined by the source field type: numbers use counter, hash maps and records use keyAsColumn, and all others use column.
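As a sketch, the first example below (email in the key, username in family u) would produce a mapping file along these lines (the exact serialization is an assumption):
[
  { "source" : "email", "type" : "key" },
  { "source" : "username", "type" : "column", "family" : "u", "qualifier" : "username" }
]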
Syntax
kite-dataset [-v] mapping-config <field:type pairs> [command options]
Options
| Option | Description |
| --- | --- |
| -s, --schema | The file containing the Avro schema. |
| -p, --partition-by | The file containing the JSON partition strategy. |
| -o, --output | Save the mapping JSON file to a path. |
| --minimize | Minimize output size by eliminating white space. |
Examples
Store email in the key and the other fields in column family u:
kite-dataset mapping-config email:key username:u id:u --schema user.avsc -o user-cols.json
Store the preferences hash-map in column family prefs:
kite-dataset mapping-config preferences:prefs --schema user.avsc
Use the version field as an OCC version:
kite-dataset mapping-config version:version --schema user.avsc
log4j-config
Builds a log4j configuration to log events to a dataset.
Syntax
kite-dataset log4j-config <dataset name> --host <flume hostname> [command options]
Options
| Option | Description |
| --- | --- |
| --port | Flume port. |
| --class, --package | Java class or package from which to log. |
| --log-all | Configure the root logger to send to Flume. |
| -o, --output | Save the log4j configuration to a file. |
Examples
Print log4j configuration to log to dataset “users”:
kite-dataset log4j-config --host flume.cluster.com --class org.kitesdk.examples.MyLoggingApp users
Save log4j configuration to the file log4j.properties:
kite-dataset log4j-config --host flume.cluster.com --package org.kitesdk.examples -o log4j.properties users
Print log4j configuration to log from all classes:
kite-dataset log4j-config --host flume.cluster.com --log-all users
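The generated configuration attaches Flume’s log4j appender to the requested loggers. A sketch of what the output contains (property values illustrative):
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = flume.cluster.com
log4j.appender.flume.Port = 41414
log4j.logger.org.kitesdk.examples.MyLoggingApp = INFO, flume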
flume-config
Builds a Flume configuration to log events to a dataset.
Syntax
kite-dataset flume-config <dataset name or URI> [command options]
Options
| Option | Description |
| --- | --- |
| --use-dataset-uri | Configure Flume with a dataset URI. Requires Flume 1.6 or later. |
| --agent | Flume agent name. |
| --source | Flume source name. |
| --bind | Avro source bind address. |
| --port | Avro source port. |
| --channel | Flume channel name. |
| --channel-type | Flume channel type (memory or file). |
| --channel-capacity | Flume channel capacity. |
| --channel-transaction-capacity | Flume channel transaction capacity. |
| --checkpoint-dir | File channel checkpoint directory. Required when using --channel-type file. |
| --data-dir | File channel data directory. Use the option multiple times for multiple data directories. Required when using --channel-type file. |
| --sink | Flume sink name. |
| --batch-size | Records to write per batch. |
| --roll-interval | Time in seconds before starting the next file. |
| --proxy-user | User identity to use when writing to HDFS. |
| -o, --output | Save the Flume configuration to a file. |
Examples
Print Flume configuration to log to dataset “users”:
kite-dataset flume-config --checkpoint-dir /data/0/flume/checkpoint --data-dir /data/1/flume/data users
Print Flume configuration to log to dataset dataset:hdfs:/datasets/default/users:
kite-dataset flume-config --channel-type memory dataset:hdfs:/datasets/default/users
Save Flume configuration to the file flume.properties:
kite-dataset flume-config --channel-type memory -o flume.properties users
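The generated file is a standard Flume agent configuration whose sink writes to the target dataset. A sketch of its shape, assuming --use-dataset-uri (Flume 1.6 or later) and illustrative agent and component names:
tier1.sources = avro-event-source
tier1.channels = mem-channel
tier1.sinks = kite-dataset

tier1.sources.avro-event-source.type = avro
tier1.sources.avro-event-source.bind = 0.0.0.0
tier1.sources.avro-event-source.port = 41415
tier1.sources.avro-event-source.channels = mem-channel

tier1.channels.mem-channel.type = memory

tier1.sinks.kite-dataset.type = org.apache.flume.sink.kite.DatasetSink
tier1.sinks.kite-dataset.channel = mem-channel
tier1.sinks.kite-dataset.kite.dataset.uri = dataset:hdfs:/datasets/default/users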