Kite CLI Reference
The Kite Dataset command line interface (CLI) provides utility commands that let you perform essential tasks such as creating a schema and dataset, importing data from a CSV file, and viewing the results.
Each command is described below. See Using the Kite CLI to Create a Dataset for a practical example of the CLI in use.
Commands
- general options: options for all commands.
- help: get help for the dataset command in general or a specific command.
- create: create a dataset based on an existing schema.
- copy: copy one dataset to another dataset.
- transform: transform records from one dataset and store them in another dataset.
- update: update the metadata descriptor for a dataset.
- delete: delete a dataset.
- schema: view the schema for an existing dataset.
- info: show metadata for a dataset.
- show: show the first n records of a dataset.
- csv-schema: create a schema from a CSV data sample.
- csv-import: import CSV data.
- json-schema: create a schema from a JSON data sample.
- json-import: import JSON data.
- obj-schema: create a schema from a Java object.
- partition-config: create a partition strategy for a schema.
- mapping-config: create a mapping strategy for a schema.
- log4j-config: configure Log4j.
- flume-config: configure Flume.
General options
Every command begins with kite-dataset, followed by general options. Currently, the only general option turns on debugging, which shows a stack trace if something goes wrong while the command runs. A concise set of additional options might be added as the product matures.
-v --verbose --debug |
Turn on debug logging and show stack traces. |
The Kite CLI supports the following environment variables.
HIVE_HOME |
Root directory of Hive instance |
HIVE_CONF_DIR |
Configuration directory for Hive instance |
HBASE_HOME |
Root directory of HBase instance |
HADOOP_MAPRED_HOME |
Root directory for MapReduce |
HADOOP_HOME |
Root directory for Hadoop instance |
To show the values for these variables at runtime, set the debug= option to true. This can be helpful when troubleshooting issues where one or more of these resources is not found. For example:
debug=true kite-dataset info users
Use the flags= option to pass arguments to the internal Hadoop jar command. For example:
flags="-Xmx512m" kite-dataset info users
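As a rough sketch of a troubleshooting run, the script below exports three of the variables listed above and then runs the debug example. The directory paths are placeholders (assumptions), not defaults defined by Kite; substitute the locations used by your installation.

```shell
# Placeholder locations (assumptions); any value already set is kept.
export HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"
export HIVE_HOME="${HIVE_HOME:-/opt/hive}"
export HIVE_CONF_DIR="${HIVE_CONF_DIR:-$HIVE_HOME/conf}"

# Run the debug example only when the Kite CLI is on the PATH.
if command -v kite-dataset >/dev/null 2>&1; then
  debug=true kite-dataset info users || echo "kite-dataset info failed" >&2
fi
```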
csv-schema
Use csv-schema to generate an Avro schema from a comma-separated value (CSV) file.
The schema produced by this command is a record based on the first few lines of the file. If the first line is a header, it is used to name the fields.
Field schemas are set by inspecting the first non-empty value in each field. Fields are nullable unless the field’s name is passed using --require. Nullable fields default to null.
The type is determined by the following rules:
* If the data is numeric and has a decimal point, the type is double
* If the data is numeric and has no decimal point, the type is long
* Otherwise, the type is string
See CSV format details.
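The three type rules above can be approximated in a small shell function. This is an illustrative sketch, not Kite's implementation: infer_type is a hypothetical helper, and what counts as "numeric" here (an optional leading minus sign and digits, with no exponent notation) is an assumption.

```shell
# infer_type VALUE: print double, long, or string for a single CSV value.
infer_type() {
  if printf '%s\n' "$1" | grep -Eq '^-?[0-9]+$'; then
    echo long      # numeric, no decimal point
  elif printf '%s\n' "$1" | grep -Eq '^-?[0-9]*\.[0-9]+$'; then
    echo double    # numeric, with a decimal point
  else
    echo string    # everything else
  fi
}
```

For example, infer_type 19.99 prints double and infer_type 2014-01-01 prints string.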
Syntax
kite-dataset [-v] csv-schema <sample csv path> [command options]
Options
--class, --record-name |
A class name or record name for the schema result. This value is required. |
-o, --output |
Save schema avsc to path. |
--require |
Mark a field required; the schema for this field will not allow null values. Use more than once to require multiple fields. |
--no-header |
Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1,…field_n. |
--skip-lines |
The number of lines to skip before the start of the CSV data. Default is 0. |
--delimiter |
Delimiter character in the CSV data file. Default is the comma (,). |
--escape |
Escape character in the CSV data file. Default is the backslash (\). |
--quote |
Quote character in the CSV data file. Default is the double quote ("). |
--minimize |
Minimize schema file size by eliminating white space. |
Examples
Print the schema to standard out:
kite-dataset csv-schema sample.csv --class Sample
Write the schema to sample.avsc:
kite-dataset csv-schema sample.csv --class Sample -o sample.avsc
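To see how the options combine, the following sketch writes a small sample file and generates a schema with one required field. The file contents and the scratch directory are hypothetical. Following the rules described above, id should be inferred as long, name as string, and balance as double (from its first non-empty value), with id not nullable because of --require.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)

# Hypothetical sample: a header line, then two records; the last balance is empty.
cat > "$workdir/sample.csv" <<'EOF'
id,name,balance
1,alice,10.50
2,bob,
EOF

# Generate the schema only when the Kite CLI is on the PATH.
if command -v kite-dataset >/dev/null 2>&1; then
  kite-dataset csv-schema "$workdir/sample.csv" --class Sample --require id -o "$workdir/sample.avsc"
fi
```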
json-schema
Build a schema from a JSON data sample.
This command produces a Schema by inspecting the first few JSON objects in the data sample. Each JSON object is converted to a Schema that describes it, and the final Schema is the result of merging each sample object’s Schema.
For example, a two-object data sample in which only one object has a shade value produces the following merged Schema:
{
  "type" : "record",
  "name" : "Sample",
  "fields" : [ {
    "name" : "id",
    "type" : "int"
  }, {
    "name" : "color",
    "type" : "string"
  }, {
    "name" : "shade",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
See JSON format details.
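As an illustration, the sketch below writes a hypothetical two-object sample that is consistent with the merged Schema shown above (it is not necessarily the sample the Schema was generated from) and passes it to json-schema. Only the second object has a shade value, which is why a merged field of that kind would be nullable.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)

# Hypothetical sample data: one JSON object per line.
cat > "$workdir/samples.json" <<'EOF'
{ "id": 1, "color": "green" }
{ "id": 2, "color": "red", "shade": "dark" }
EOF

# Infer the schema only when the Kite CLI is on the PATH.
if command -v kite-dataset >/dev/null 2>&1; then
  kite-dataset json-schema "$workdir/samples.json" --record-name Sample -o "$workdir/sample.avsc"
fi
```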
Syntax
kite-dataset [-v] json-schema <sample json path> [command options]
Options
--class, --record-name |
A class name or record name for the schema result. This value is required. |
-o, --output |
Save schema avsc to path. |
--minimize |
Minimize schema file size by eliminating white space. |
Examples
Print an inferred schema for samples.json to standard out:
kite-dataset json-schema samples.json --record-name Sample
Write an inferred schema to sample.avsc:
kite-dataset json-schema samples.json --record-name Sample -o sample.avsc
obj-schema
Build a schema from a Java class.
Fields are assumed to be nullable if they are Objects, or required if they are primitives. You can edit the generated schema directly to remove the null option for specific fields.
Syntax
kite-dataset [-v] obj-schema <class name> [command options]
Options
-o, --output |
Save schema in Avro format to a given path. |
--jar |
Add a jar to the classpath used when loading the Java class. |
--lib-dir |
Add a directory to the classpath used when loading the Java class. |
--minimize |
Minimize schema file size by eliminating white space. |
Examples
Create a schema for an example User class:
kite-dataset obj-schema org.kitesdk.cli.example.User
Create a schema for a class in a jar:
kite-dataset obj-schema com.example.MyRecord --jar my-application.jar
Save the schema for the example User class to user.avsc:
kite-dataset obj-schema org.kitesdk.cli.example.User -o user.avsc
create
After you have generated an Avro schema, you can use create to make an empty dataset.
Syntax
kite-dataset [-v] create <dataset> [command options]
Options
-s, --schema |
A file containing the Avro schema. This value is required. |
-f, --format |
By default, the dataset is created in Avro format. Use this option to set the format to Parquet (-f parquet). |
-p, --partition-by |
A file containing a JSON-formatted partition strategy. |
-m, --mapping |
A file containing a JSON-formatted column mapping. |
--set, --property |
A property to set in the dataset’s descriptor: prop.name=value. |
Note: The dataset name must not contain a period (.).
Examples:
Create dataset “users” in Hive:
kite-dataset create users --schema user.avsc
Create dataset “users” using Parquet:
kite-dataset create users --schema user.avsc --format parquet
Create dataset “users” partitioned by JSON configuration using a cache size of 20 (rather than the default cache size of 10):
kite-dataset create users --schema user.avsc --partition-by user_part.json --set kite.writer.cache-size=20
Create dataset “users” and set multiple properties:
kite-dataset create users --schema user.avsc --set kite.writer.cache-size=20 --set dfs.blocksize=256m
update
Update the metadata descriptor for a dataset.
Syntax
kite-dataset [-v] update <dataset> [command options]
Options
-s, --schema |
The file containing the Avro schema. |
--set, --property |
Add a property pair: prop.name=value. |
Examples:
Update schema for dataset “users” in Hive:
kite-dataset update users --schema user.avsc
Update HDFS dataset by URI, add property:
kite-dataset update dataset:hdfs:/user/me/datasets/users --set kite.writer.cache-size=20
schema
Show the schema for a dataset.
Syntax
kite-dataset [-v] schema <dataset> [command options]
Options
-o, --output |
Save schema in Avro format to a given path. |
--minimize |
Minimize schema file size by eliminating white space. |
Examples:
Print the schema for dataset “users” to standard out:
kite-dataset schema users
Save the schema for dataset “users” to user.avsc:
kite-dataset schema users -o user.avsc
csv-import
Copy CSV records into a dataset.
Kite matches the CSV header to the target record schema’s fields by name. If a header is not present (that is, you use the --no-header option), then CSV columns are matched with the target fields based on their position.
As Kite constructs each record, it validates values using the target field’s schema. Invalid values (in numeric fields) and null values (in required fields) cause exceptions. Kite handles empty strings as null values for numeric fields.
See CSV format details.
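Because invalid values in numeric fields cause exceptions during import, it can help to scan a file beforehand. The function below is a rough pre-check sketch, not part of Kite: check_numeric is a hypothetical helper, it assumes a header line, and it splits naively on commas without handling quote or escape characters. Empty values are skipped, since Kite treats them as null for numeric fields.

```shell
# check_numeric FILE COLUMN: print "line N: VALUE" for each data row whose
# COLUMN (1-based) is non-empty and not a plain number. The header is skipped.
check_numeric() {
  awk -F',' -v col="$2" '
    NR > 1 && $col != "" && $col !~ /^-?[0-9]*\.?[0-9]+$/ {
      printf "line %d: %s\n", NR, $col
    }
  ' "$1"
}
```

Calling check_numeric path/to/sample.csv 3 lists the rows whose third column would probably be rejected for a numeric target field.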
Syntax
kite-dataset [-v] csv-import <csv path> <dataset> [command options]
Options
--no-header |
Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1,…field_n. |
--skip-lines |
Lines to skip before CSV start (default: 0) |
--delimiter |
Delimiter character. Default is comma (,). |
--escape |
Escape character. Default is backslash (\). |
--quote |
Quote character. Default is double quote ("). |
--num-writers |
The number of writer processes to use |
--no-compaction |
Copy to output directly, without compacting the data |
--jar |
Add a jar to the runtime classpath |
--transform |
A transform DoFn class name |
Examples
Copy the records from sample.csv to a Hive dataset named “sample”:
kite-dataset csv-import path/to/sample.csv sample
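The commands described so far can be chained into one workflow: infer a schema, create the dataset, import the data, and view the result. The script below is a minimal sketch under stated assumptions: the sample data is hypothetical, and run is a hypothetical helper that logs each command and executes it only when kite-dataset is installed, so the sequence can be reviewed as a dry run first.

```shell
workdir=$(mktemp -d)
log="$workdir/commands.log"

# run CMD...: append the command to the log, then execute it only when the
# Kite CLI is on the PATH.
run() {
  echo "$*" >> "$log"
  if command -v kite-dataset >/dev/null 2>&1; then
    "$@"
  fi
}

# Hypothetical sample data with a header line.
cat > "$workdir/sample.csv" <<'EOF'
id,name
1,alice
2,bob
EOF

run kite-dataset csv-schema "$workdir/sample.csv" --class Sample -o "$workdir/sample.avsc"
run kite-dataset create sample --schema "$workdir/sample.avsc"
run kite-dataset csv-import "$workdir/sample.csv" sample
run kite-dataset show sample -n 5
```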
json-import
Copy JSON objects into a dataset.
Kite uses the target dataset’s Schema to validate and store the JSON objects.
- All values must match the type specified in the target Schema
- JSON objects will match both record and map Schemas
- When converting a JSON object with a record Schema:
- Only the record’s fields are used, other key-value pairs are ignored
- All fields must be present or have a default value in the record Schema
- When converting a JSON object with a map Schema, all key-value pairs are used
Invalid values, missing record fields, and other problems cause exceptions.
See JSON format details.
Syntax
kite-dataset [-v] json-import <json path> <dataset name> [command options]
Options
--num-writers |
The number of writer processes to use |
--no-compaction |
Copy to output directly, without compacting the data |
--jar |
Add a jar to the runtime classpath |
--transform |
A transform DoFn class name |
Examples
Copy the records from sample.json to dataset sample:
kite-dataset json-import path/to/sample.json sample
Copy the records from sample.json to a dataset URI:
kite-dataset json-import path/to/sample.json dataset:hdfs:/user/me/datasets/sample
Copy the records from an HDFS directory to sample:
kite-dataset json-import hdfs:/data/path/samples/ sample
show
Print the first n records in a dataset.
Syntax
kite-dataset [-v] show <dataset> [command options]
Options
-n, --num-records |
The number of records to print. The default number is 10. |
Examples
Show the first 10 records in dataset “users”:
kite-dataset show users
Show the first 50 records in dataset “users”:
kite-dataset show users -n 50
copy
Copy records from one dataset to another.
Syntax
kite-dataset [-v] copy <source dataset> <destination dataset> [command options]
Options
--no-compaction |
Copy to output directly, without compacting the data. |
--num-writers |
The number of writer processes to use. |
Examples
Copy the contents of movies_avro to movies_parquet:
kite-dataset copy movies_avro movies_parquet
Copy the movies dataset into HBase in a map-only job:
kite-dataset copy movies dataset:hbase:zk-host/movies --no-compaction
delete
Delete one or more datasets and related metadata.
Syntax
kite-dataset [-v] delete <datasets> [command options]
Examples
Delete all data and metadata for the dataset “users”:
kite-dataset delete users
partition-config
Builds a partition strategy for a schema.
The resulting partition strategy is a valid JSON partition strategy file.
Entries in the partition strategy are specified by field:type pairs, where field is the source field from the given schema and type can be:
year |
Extract the year from a timestamp |
month |
Extract the month from a timestamp |
day |
Extract the day from a timestamp |
hour |
Extract the hour from a timestamp |
minute |
Extract the minute from a timestamp |
hash[N] |
Hash the source field, using N buckets |
copy |
Copy the field without modification (identity) |
provided |
Does not use a source field; the field name is used to name the partition |
Provided partitioners do not reference a source field and instead require that a value is provided when writing. Values can be provided by writing to views.
Syntax
kite-dataset [-v] partition-config <field:type pairs> [command options]
Options:
-s, --schema |
The file containing the Avro schema. This value is required |
-o, --output |
Save partition JSON file to path |
--minimize |
Minimize output size by eliminating white space |
Examples
Partition by email address, balanced across 16 hash partitions and save to a file.
kite-dataset partition-config email:hash[16] email:copy -s user.avsc -o part.json
Partition by created_at time’s year, month, and day:
kite-dataset partition-config created_at:year created_at:month created_at:day -s event.avsc
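When several time-based partitions share one source field, the field:type pairs can be assembled in a loop and the resulting strategy passed on to create. The following is a sketch under assumptions: the output file event_part.json and the dataset name events are hypothetical, and event.avsc must already exist.

```shell
field="created_at"
pairs=""

# Build "created_at:year created_at:month created_at:day".
for unit in year month day; do
  pairs="${pairs:+$pairs }${field}:${unit}"
done
echo "$pairs"

if command -v kite-dataset >/dev/null 2>&1 && [ -f event.avsc ]; then
  # $pairs is left unquoted on purpose so each pair becomes its own argument.
  kite-dataset partition-config $pairs -s event.avsc -o event_part.json
  kite-dataset create events --schema event.avsc --partition-by event_part.json
fi
```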
mapping-config
Builds a column mapping for a schema, required for HBase. The resulting mapping definition is a valid JSON mapping file.
Mappings are specified by field:type pairs, where field is a source field from the given schema and type can be:
key |
Uses a key mapping |
version |
Uses a version mapping (for optimistic concurrency) |
any string |
The given string is used as the family in a column mapping |
If the last option is used, the mapping type is determined by the source field type. Numbers use counter, hash maps and records use keyAsColumn, and all others use column.
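The rule for choosing a mapping type from the source field type can be summarized as a lookup. This is only an illustration of the rule as stated, not Kite's code: mapping_type is a hypothetical helper that takes an Avro type name, and treating only int and long as "numbers" is an assumption.

```shell
# mapping_type AVRO_TYPE: print the mapping type suggested by the rule above.
mapping_type() {
  case "$1" in
    int | long)   echo counter ;;      # assumption: "numbers" means integer types
    map | record) echo keyAsColumn ;;  # hash maps and records
    *)            echo column ;;       # all other types
  esac
}
```

For example, mapping_type map prints keyAsColumn, which corresponds to the preferences hash map stored in a column family.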
Syntax
kite-dataset [-v] mapping-config <field:type pairs> [command options]
Options
-s, --schema |
The file containing the Avro schema. |
-p, --partition-by |
The file containing the JSON partition strategy. |
--minimize |
Minimize output size by eliminating white space. |
Examples
Store email in the key, other fields in column family u:
kite-dataset mapping-config email:key username:u id:u --schema user.avsc -o user-cols.json
Store preferences hash-map in column family prefs:
kite-dataset mapping-config preferences:prefs --schema user.avsc
Use the version field as an OCC version:
kite-dataset mapping-config version:version --schema user.avsc
help
Retrieves details on the functions of one or more dataset commands.
Syntax
kite-dataset [-v] help <commands> [command options]
Examples
Retrieve details for the create, show, and delete commands.
kite-dataset help create show delete
transform
Transforms records from one dataset and stores them in another dataset.
Syntax
kite-dataset transform <source dataset> <destination dataset> [command options]
Options
--no-compaction |
Copy to output without compacting the data |
--num-writers |
The number of writer processes to use |
--transform |
A transform DoFn class name |
--jar |
Add a jar to the runtime class path |
Examples
Transform the contents of movies_src using com.example.TransformFn:
kite-dataset transform movies_src movies --transform com.example.TransformFn --jar fns.jar
info
Print all metadata for a dataset.
Syntax
kite-dataset info <dataset name>
Example
Print all metadata for the “users” dataset:
kite-dataset info users
log4j-config
Builds a log4j configuration to log events to a dataset.
Syntax
kite-dataset log4j-config <dataset name> --host <flume hostname> [command options]
Options
--port |
Flume port |
--class, --package |
Java class or package from which to log |
--log-all |
Configure the root logger to send to Flume |
-o, --output |
Save the log4j configuration to a file |
Examples
Print log4j configuration to log to dataset “users”:
kite-dataset log4j-config --host flume.cluster.com --class org.kitesdk.examples.MyLoggingApp users
Save log4j configuration to the file log4j.properties:
kite-dataset log4j-config --host flume.cluster.com --package org.kitesdk.examples -o log4j.properties users
Print log4j configuration to log from all classes:
kite-dataset log4j-config --host flume.cluster.com --log-all users
flume-config
Builds a Flume configuration to log events to a dataset.
Syntax
kite-dataset flume-config <dataset name or URI> [command options]
Options
--use-dataset-uri |
Configure Flume with a dataset URI. Requires Flume 1.6 or later. |
--agent |
Flume agent name |
--source |
Flume source name |
--bind |
Avro source bind address |
--port |
Avro source port |
--channel |
Flume channel name |
--channel-type |
Flume channel type (memory or file) |
--channel-capacity |
Flume channel capacity |
--channel-transaction-capacity |
Flume channel transaction capacity |
--checkpoint-dir |
File channel checkpoint directory (required when using --channel-type file) |
--data-dir |
File channel data directory (required when using --channel-type file). Use the option multiple times for multiple data directories. |
--sink |
Flume sink name |
--batch-size |
Records to write per batch |
--roll-interval |
Time in seconds before starting the next file |
--proxy-user |
User identity to use when writing to HDFS |
-o, --output |
Save the Flume configuration to a file |
Examples
Print Flume configuration to log to dataset “users”:
kite-dataset flume-config --checkpoint-dir /data/0/flume/checkpoint --data-dir /data/1/flume/data users
Print Flume configuration to log to dataset dataset:hdfs:/datasets/default/users:
kite-dataset flume-config --channel-type memory dataset:hdfs:/datasets/default/users
Save Flume configuration to the file flume.properties:
kite-dataset flume-config --channel-type memory -o flume.properties users