Kite CLI Reference
The Kite Dataset command line interface (CLI) provides utility commands that let you perform essential tasks such as creating a schema and dataset, importing data from a CSV file, and viewing the results.
Each command is described below. See Using the Kite CLI to Create a Dataset for a practical example of the CLI in use.
Commands
- general options: options for all commands.
- help: get help for the dataset command in general or a specific command.
- create: create a dataset based on an existing schema.
- update: update the metadata descriptor for a dataset.
- compact: compact all or part of a dataset.
- list: list datasets.
- show: show the first n records of a dataset.
- copy: copy one dataset to another dataset.
- transform: transform records from one dataset and store them in another dataset.
- delete: delete a dataset.
- info: show metadata for a dataset.
- schema: view the schema for an existing dataset.
- csv-schema: create a schema from a CSV data sample.
- json-schema: create a schema from a JSON data sample.
- obj-schema: create a schema from a Java object.
- csv-import: import CSV data.
- json-import: import JSON data.
- inputformat-import: import data using a custom InputFormat.
- tar-import: import files from a tarball as a dataset.
- partition-config: create a partition strategy for a schema.
- mapping-config: create a mapping strategy for a schema.
- log4j-config: configure Log4j.
- flume-config: configure Flume.
General options
Every command begins with kite-dataset, followed by general options. Currently, the only general option turns on debugging, which shows a stack trace if something goes awry during execution of the command. Additional options may be added as the product matures.

| Option | Description |
| --- | --- |
| -v, --verbose, --debug | Turn on debug logging and show stack traces. |
The Kite CLI supports the following environment variables.

| Variable | Description |
| --- | --- |
| HIVE_HOME | Root directory of the Hive instance |
| HIVE_CONF_DIR | Configuration directory for the Hive instance |
| HBASE_HOME | Root directory of the HBase instance |
| HADOOP_MAPRED_HOME | Root directory for MapReduce |
| HADOOP_HOME | Root directory of the Hadoop instance |
To show the values for these variables at runtime, set the debug= option to true. This can be helpful when troubleshooting issues where one or more of these resources is not found. For example:
debug=true kite-dataset info users
Use the flags= option to pass arguments to the internal Hadoop jar command. For example:
flags="-Xmx512m" kite-dataset info users
help
Retrieves details on the functions of one or more dataset commands.
Syntax
kite-dataset [-v] help <commands> [command options]
Examples
Retrieve details for the create command:
kite-dataset help create
create
Create a dataset in a new location or using existing data.
The dataset must be either a full dataset URI beginning with “dataset:” or a dataset name that will be created as a Hive table using the default “dataset:hive:<name>” URI.
Any dataset configuration set in the command’s options will be validated against existing data.
If there is no existing data, a schema is required. If existing data is found, the inferred schema, partition strategy, and format are used unless overridden by command options.
Usage
kite-dataset [-v] create <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| -s, --schema | A file containing the Avro schema. |
| -f, --format | Set the dataset format, either avro or parquet. Defaults to avro. |
| -p, --partition-by | A file containing a JSON-formatted partition strategy. |
| -m, --mapping | A file containing a JSON-formatted column mapping. |
| --set, --property | A property to set in the dataset’s descriptor: prop.name=value. |
| --location | The location where data is or should be stored. |
Examples:
Create a new dataset in Hive called “users”:
kite-dataset create users --schema user.avsc
Create dataset “users” using Parquet format:
kite-dataset create users --schema user.avsc --format parquet
Create a Hive dataset for existing data in HDFS using the inferred schema and partition strategy:
kite-dataset create events --location /path/to/events
Create dataset “events” with the given partition strategy and set the writer cache size:
kite-dataset create events --partition-by config.json --set kite.writer.cache-size=20
Create dataset “users” and set multiple properties:
kite-dataset create users --schema user.avsc --set kite.writer.cache-size=20 --set dfs.blocksize=256m
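Create dataset “users” at a specific location by passing a full dataset URI (path illustrative):
kite-dataset create dataset:hdfs:/user/me/datasets/users --schema user.avsc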
update
Update the metadata for a dataset.
This command can update a dataset’s schema or partition strategy, and add or change dataset properties.
Schema updates are validated according to Avro’s Schema evolution rules to ensure that the updated schema can read data written with any previous version of the schema.
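For example, adding an optional field with a default value is a compatible change, because the default is used when reading data written before the field existed. A field added for this purpose might look like the following (field name illustrative):
{ "name" : "favorite_color", "type" : [ "null", "string" ], "default" : null }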
Partition strategy updates only allow replacing a provided partitioner with another partitioner that is compatible with the existing partition data. For example, a provided partitioner called “year” with integer values can be replaced with a year partitioner called “year” that uses a valid timestamp field as its source.
Syntax
kite-dataset [-v] update <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| -s, --schema | A file containing the updated Avro schema. |
| -p, --partition-by | A file containing an updated partition strategy. |
| --set, --property | Add a property pair: prop.name=value. |
Examples:
Update schema for dataset “users” in Hive:
kite-dataset update users --schema user.avsc
Update an HDFS dataset by URI and add a property:
kite-dataset update dataset:hdfs:/user/me/datasets/users --set kite.write.cache-size=20
compact
Compact all or part of a dataset.
Compaction will rewrite partitions in the dataset, combining all files in each partition into a single large file, using the dataset’s current descriptor properties, such as dfs.blocksize or parquet.block.size.
Partitions that have been rewritten will replace existing partitions by moving the rewritten content to a hidden location (a dot folder), deleting the existing partition, and renaming the hidden folder to replace it. This results in a small window of time when data for the partition is not visible.
This compaction does not coordinate with other readers or writers. No other processes should be reading from or writing to the dataset while this command is running.
If multiple directories make up a single logical partition, all of the directories will be replaced with a single rewritten directory containing all of the data. This can happen when reading data with older naming schemes. For example, month=5 and month=05 are two directory names that would be considered the same logical partition.
Syntax
kite-dataset [-v] compact <dataset or view> [command options]
Options
| Option | Description |
| --- | --- |
| --num-writers | The number of writer processes to use. |
Examples:
Compact all partitions of the events dataset:
kite-dataset compact events
Compact all partitions under year=2015 in events:
kite-dataset compact view:hive:events?year=2015
list
Lists available dataset URIs.
An optional repository URI can be given to list datasets in repositories other than Hive.
Repository URIs start with “repo:” and omit the table and namespace options found in dataset or view URIs.
Syntax
kite-dataset [-v] list [repository] [command options]
Examples
Show all supported Hive datasets:
kite-dataset list
Show all datasets in HDFS under /data:
kite-dataset list repo:hdfs:/data
show
Print the first n records in a dataset.
Syntax
kite-dataset [-v] show <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| -n, --num-records | The number of records to print. The default number is 10. |
Examples
Show the first 10 records in dataset “users”:
kite-dataset show users
Show the first 50 records in dataset “users”:
kite-dataset show users -n 50
copy
Copy records from one dataset to another.
Syntax
kite-dataset [-v] copy <source dataset> <destination dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --no-compaction | Copy to output directly, without compacting the data. |
| --num-writers | The number of writer processes to use. |
Examples
Copy the contents of movies_avro to movies_parquet:
kite-dataset copy movies_avro movies_parquet
Copy the movies dataset into HBase in a map-only job:
kite-dataset copy movies dataset:hbase:zk-host/movies --no-compaction
transform
Transforms records from one dataset and stores them in another dataset.
Syntax
kite-dataset transform <source dataset> <destination dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --no-compaction | Copy to output directly, without compacting the data. |
| --num-writers | The number of writer processes to use. |
| --transform | A transform DoFn class name. |
| --jar | Add a jar to the runtime class path. |
Examples
Transform the contents of movies_src using com.example.TransformFn:
kite-dataset transform movies_src movies --transform com.example.TransformFn --jar fns.jar
delete
Delete one or more datasets or views.
If deleting a dataset, all data and metadata is deleted. If deleting a view, only data is deleted.
Both datasets and views are identified by URI, but arguments that do not start with “dataset:” or “view:” are assumed to be Hive table names.
Syntax
kite-dataset [-v] delete <datasets> [command options]
Examples
Delete all data and metadata for the dataset “users”:
kite-dataset delete users
Delete just data from the Hive dataset “users”:
kite-dataset delete view:hive:users
info
Print all metadata for a dataset.
Syntax
kite-dataset info <dataset name>
Example
Print all metadata for the “users” dataset:
kite-dataset info users
schema
Show the schema for a dataset.
Syntax
kite-dataset [-v] schema <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| -o, --output | Save the schema in Avro format to a given path. |
| --minimize | Minimize schema file size by eliminating white space. |
Examples:
Print the schema for dataset “users” to standard out:
kite-dataset schema users
Save the schema for dataset “users” to user.avsc:
kite-dataset schema users -o user.avsc
csv-schema
Use csv-schema to generate an Avro schema from a comma-separated values (CSV) file.
The schema produced by this command is a record based on the first few lines of the file. If the first line is a header, it is used to name the fields.
Field schemas are set by inspecting the first non-empty value in each field. Fields are nullable unless the field’s name is passed using --require. Nullable fields default to null.
The type is determined by the following rules:
- If the data is numeric and has a decimal point, the type is double.
- If the data is numeric and has no decimal point, the type is long.
- Otherwise, the type is string.
See CSV format details.
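As a sketch of these rules, a sample file like the following (data illustrative):
id,name,price
1,apple,0.99
would produce a schema along these lines (record name taken from --class Sample; all fields nullable because --require is not used):
{
  "type" : "record",
  "name" : "Sample",
  "fields" : [
    { "name" : "id", "type" : [ "null", "long" ], "default" : null },
    { "name" : "name", "type" : [ "null", "string" ], "default" : null },
    { "name" : "price", "type" : [ "null", "double" ], "default" : null }
  ]
}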
Syntax
kite-dataset [-v] csv-schema <sample csv path> [command options]
Options
| Option | Description |
| --- | --- |
| --class, --record-name | A class name or record name for the schema result. This value is required. |
| -o, --output | Save the schema .avsc file to the given path. |
| --require | Mark a field required; the schema for this field will not allow null values. Use more than once to require multiple fields. |
| --no-header | Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1,…field_n. |
| --skip-lines | The number of lines to skip before the start of the CSV data. Default is 0. |
| --delimiter | Delimiter character in the CSV data file. Default is the comma (,). |
| --escape | Escape character in the CSV data file. Default is the backslash (\). |
| --quote | Quote character in the CSV data file. Default is the double quote ("). |
| --minimize | Minimize schema file size by eliminating white space. |
Examples
Print the schema to standard out:
kite-dataset csv-schema sample.csv --class Sample
Write the schema to sample.avsc:
kite-dataset csv-schema sample.csv --class Sample -o sample.avsc
json-schema
Build a schema from a JSON data sample.
This command produces a Schema by inspecting the first few JSON objects in the data sample. Each JSON object is converted to a Schema that describes it, and the final Schema is the result of merging each sample object’s Schema.
For example, consider a two-object data sample along the following lines (values illustrative; note that shade appears in only one of the two objects):
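{ "id" : 1, "color" : "blue" }
{ "id" : 2, "color" : "red", "shade" : "dark" }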
This sample produces the following merged Schema:
{
  "type" : "record",
  "name" : "Sample",
  "fields" : [
    { "name" : "id", "type" : "int" },
    { "name" : "color", "type" : "string" },
    { "name" : "shade", "type" : [ "null", "string" ], "default" : null }
  ]
}
See JSON format details.
Syntax
kite-dataset [-v] json-schema <sample json path> [command options]
Options
| Option | Description |
| --- | --- |
| --class, --record-name | A class name or record name for the schema result. This value is required. |
| -o, --output | Save the schema .avsc file to the given path. |
| --minimize | Minimize schema file size by eliminating white space. |
Examples
Print an inferred schema for samples.json to standard out:
kite-dataset json-schema samples.json --record-name Sample
Write an inferred schema to sample.avsc:
kite-dataset json-schema samples.json --record-name Sample -o sample.avsc
obj-schema
Build a schema from a Java class.
Fields are assumed to be nullable if they are Objects, or required if they are primitives. You can edit the generated schema directly to remove the null option for specific fields.
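For instance, a minimal class like this hypothetical User (fields illustrative) yields a schema with a required long id and a nullable string email:
public class User {
  private long id;       // primitive, so the generated field is required
  private String email;  // Object, so the generated field is nullable
}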
Syntax
kite-dataset [-v] obj-schema <class name> [command options]
Options
| Option | Description |
| --- | --- |
| -o, --output | Save the schema in Avro format to a given path. |
| --jar | Add a jar to the classpath used when loading the Java class. |
| --lib-dir | Add a directory to the classpath used when loading the Java class. |
| --minimize | Minimize schema file size by eliminating white space. |
Examples
Create a schema for an example User class:
kite-dataset obj-schema org.kitesdk.cli.example.User
Create a schema for a class in a jar:
kite-dataset obj-schema com.example.MyRecord --jar my-application.jar
Save the schema for the example User class to user.avsc:
kite-dataset obj-schema org.kitesdk.cli.example.User -o user.avsc
csv-import
Copy CSV records into a dataset.
Kite matches the CSV header to the target record schema’s fields by name. If a header is not present (that is, you use the --no-header option), then CSV columns are matched with the target fields based on their position.
As Kite constructs each record, it validates values using the target field’s schema. Invalid values (in numeric fields) and null values (in required fields) cause exceptions. Kite handles empty strings as null values for numeric fields.
See CSV format details.
Syntax
kite-dataset [-v] csv-import <csv path> <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --no-header | Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1,…field_n. |
| --skip-lines | Lines to skip before the start of the CSV data. Default is 0. |
| --delimiter | Delimiter character. Default is the comma (,). |
| --escape | Escape character. Default is the backslash (\). |
| --quote | Quote character. Default is the double quote ("). |
| --num-writers | The number of writer processes to use. |
| --no-compaction | Copy to output directly, without compacting the data. |
| --jar | Add a jar to the runtime classpath. |
| --transform | A transform DoFn class name. |
Examples
Copy the records from sample.csv to a Hive dataset named “sample”:
kite-dataset csv-import path/to/sample.csv sample
json-import
Copy JSON objects into a dataset.
Kite uses the target dataset’s Schema to validate and store the JSON objects.
- All values must match the type specified in the target Schema.
- JSON objects will match both record and map Schemas.
- When converting a JSON object with a record Schema:
  - Only the record’s fields are used; other key-value pairs are ignored.
  - All fields must be present or have a default value in the record Schema.
- When converting a JSON object with a map Schema, all key-value pairs are used.
Invalid values, missing record fields, and other problems cause exceptions.
See JSON format details.
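A sketch of these rules, assuming a record Schema with a required id (long) and an optional note (string, default null):
- { "id" : 7, "note" : "ok", "extra" : true } is stored, and the unknown extra pair is ignored.
- { "id" : 8 } is stored, with note falling back to its null default.
- { "note" : "no id" } causes an exception, because the required id field is missing.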
Syntax
kite-dataset [-v] json-import <json path> <dataset name> [command options]
Options
| Option | Description |
| --- | --- |
| --num-writers | The number of writer processes to use. |
| --no-compaction | Copy to output directly, without compacting the data. |
| --jar | Add a jar to the runtime classpath. |
| --transform | A transform DoFn class name. |
Examples
Copy the records from sample.json to dataset “sample”:
kite-dataset json-import path/to/sample.json sample
Copy the records from sample.json to a dataset URI:
kite-dataset json-import path/to/sample.json dataset:hdfs:/user/me/datasets/sample
Copy the records from an HDFS directory to “sample”:
kite-dataset json-import hdfs:/data/path/samples/ sample
inputformat-import
Copy records read by an InputFormat into a dataset.
This command uses a custom InputFormat, specified by name, and copies either the keys or the values (set by --record-type) into a dataset.
Use the obj-schema command to infer a schema for the key or value class used by the InputFormat.
Syntax
kite-dataset [-v] inputformat-import <data path> <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --format | The InputFormat class name. Must include the package. |
| --jar | Add a jar to the runtime classpath. |
| --record-type | InputFormat argument to use as the record (key or value). |
| --num-writers | The number of writer processes to use. |
| --no-compaction | Copy to output directly, without compacting the data. |
| --transform | A transform DoFn class name. |
| --set, --property | A property to set on the configuration: prop.name=value. |
Examples
Import the keys from a sequence file of MyObject defined in myobject.jar:
kite-dataset inputformat-import data.seq mytable --jar myobject.jar --record-type key \
--format org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
tar-import
Create a dataset from a tarball or load a tarball into an existing dataset.
The datasets that this command creates or writes to use a static schema, TarFileEntry, which has two fields: filename and filecontent.
Tarballs imported using this command are converted to Avro or Parquet files of TarFileEntry records, compressed using Snappy compression.
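A sketch of that schema (the exact field types are an assumption; filename holds the entry path and filecontent the raw bytes of each file):
{
  "type" : "record",
  "name" : "TarFileEntry",
  "fields" : [
    { "name" : "filename", "type" : "string" },
    { "name" : "filecontent", "type" : "bytes" }
  ]
}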
Syntax
kite-dataset [-v] tar-import <tarball> <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --compression | The compression algorithm used to compress the incoming tarball. |
Examples
Convert a tarball of files to an Avro dataset:
kite-dataset tar-import data.tar.gz dataset:hdfs:/user/me/tar_data
partition-config
Builds a partition strategy for a schema.
The resulting partition strategy is a valid JSON partition strategy file.
Entries in the partition strategy are specified by field:type pairs, where field is the source field from the given schema and type can be:
| Type | Description |
| --- | --- |
| year | Extract the year from a timestamp |
| month | Extract the month from a timestamp |
| day | Extract the day from a timestamp |
| hour | Extract the hour from a timestamp |
| minute | Extract the minute from a timestamp |
| hash[N] | Hash the source field, using N buckets |
| copy | Copy the field without modification (identity) |
| provided | Does not use a source field; the field name is used to name the partition |
Provided partitioners do not reference a source field and instead require that a value is provided when writing. Values can be provided by writing to views.
Syntax
kite-dataset [-v] partition-config <field:type pairs> [command options]
Options

| Option | Description |
| --- | --- |
| -s, --schema | The file containing the Avro schema. This value is required. |
| -o, --output | Save the partition strategy JSON file to a path. |
| --minimize | Minimize output size by eliminating white space. |
Examples
Partition by email address, balanced across 16 hash partitions, and save to a file:
kite-dataset partition-config email:hash[16] email:copy -s user.avsc -o part.json
Partition by created_at time’s year, month, and day:
kite-dataset partition-config created_at:year created_at:month created_at:day -s event.avsc
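The strategy file written by this command is plain JSON. For the example above it would look something like the following (a sketch; exact serialization details may vary by version):
[
  { "type" : "year", "source" : "created_at" },
  { "type" : "month", "source" : "created_at" },
  { "type" : "day", "source" : "created_at" }
]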
mapping-config
Builds a column mapping for a schema, required for HBase. The resulting mapping definition is a valid JSON mapping file.
Mappings are specified by field:type pairs, where field is a source field from the given schema and type can be:
| Type | Description |
| --- | --- |
| key | Uses a key mapping |
| version | Uses a version mapping (for optimistic concurrency) |
| any string | The given string is used as the family in a column mapping |
If the last option is used, the mapping type is determined by the source field type: numbers use counter, hash maps and records use keyAsColumn, and all others use column.
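As a sketch, the first example below (email in the key, username in family u) would produce a mapping file along these lines (the exact serialization is an assumption):
[
  { "source" : "email", "type" : "key" },
  { "source" : "username", "type" : "column", "family" : "u", "qualifier" : "username" }
]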
Syntax
kite-dataset [-v] mapping-config <field:type pairs> [command options]
Options
| Option | Description |
| --- | --- |
| -s, --schema | The file containing the Avro schema. |
| -p, --partition-by | The file containing the JSON partition strategy. |
| -o, --output | Save the mapping JSON file to a path. |
| --minimize | Minimize output size by eliminating white space. |
Examples
Store email in the key and the other fields in column family u:
kite-dataset mapping-config email:key username:u id:u --schema user.avsc -o user-cols.json
Store the preferences hash-map in column family prefs:
kite-dataset mapping-config preferences:prefs --schema user.avsc
Use the version field as an OCC version:
kite-dataset mapping-config version:version --schema user.avsc
log4j-config
Builds a log4j configuration to log events to a dataset.
Syntax
kite-dataset log4j-config <dataset name> --host <flume hostname> [command options]
Options
| Option | Description |
| --- | --- |
| --port | Flume port. |
| --class, --package | Java class or package from which to log. |
| --log-all | Configure the root logger to send to Flume. |
| -o, --output | Save the log4j configuration to a file. |
Examples
Print log4j configuration to log to dataset “users”:
kite-dataset log4j-config --host flume.cluster.com --class org.kitesdk.examples.MyLoggingApp users
Save log4j configuration to the file log4j.properties:
kite-dataset log4j-config --host flume.cluster.com --package org.kitesdk.examples -o log4j.properties users
Print log4j configuration to log from all classes:
kite-dataset log4j-config --host flume.cluster.com --log-all users
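The generated configuration attaches Flume’s log4j appender to the requested loggers. A sketch of what the output contains (property values illustrative):
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = flume.cluster.com
log4j.appender.flume.Port = 41414
log4j.logger.org.kitesdk.examples.MyLoggingApp = INFO, flume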
flume-config
Builds a Flume configuration to log events to a dataset.
Syntax
kite-dataset flume-config <dataset name or URI> [command options]
Options
| Option | Description |
| --- | --- |
| --use-dataset-uri | Configure Flume with a dataset URI. Requires Flume 1.6 or later. |
| --agent | Flume agent name. |
| --source | Flume source name. |
| --bind | Avro source bind address. |
| --port | Avro source port. |
| --channel | Flume channel name. |
| --channel-type | Flume channel type (memory or file). |
| --channel-capacity | Flume channel capacity. |
| --channel-transaction-capacity | Flume channel transaction capacity. |
| --checkpoint-dir | File channel checkpoint directory. Required when using --channel-type file. |
| --data-dir | File channel data directory. Use the option multiple times for multiple data directories. Required when using --channel-type file. |
| --sink | Flume sink name. |
| --batch-size | Records to write per batch. |
| --roll-interval | Time in seconds before starting the next file. |
| --proxy-user | User identity to use when writing to HDFS. |
| -o, --output | Save the Flume configuration to a file. |
Examples
Print Flume configuration to log to dataset “users”:
kite-dataset flume-config --checkpoint-dir /data/0/flume/checkpoint --data-dir /data/1/flume/data users
Print Flume configuration to log to dataset dataset:hdfs:/datasets/default/users:
kite-dataset flume-config --channel-type memory dataset:hdfs:/datasets/default/users
Save Flume configuration to the file flume.properties:
kite-dataset flume-config --channel-type memory -o flume.properties users
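The generated file is a standard Flume agent configuration whose sink writes to the target dataset. A sketch of its shape, assuming --use-dataset-uri (Flume 1.6 or later) and illustrative agent and component names:
tier1.sources = avro-event-source
tier1.channels = mem-channel
tier1.sinks = kite-dataset

tier1.sources.avro-event-source.type = avro
tier1.sources.avro-event-source.bind = 0.0.0.0
tier1.sources.avro-event-source.port = 41415
tier1.sources.avro-event-source.channels = mem-channel

tier1.channels.mem-channel.type = memory

tier1.sinks.kite-dataset.type = org.apache.flume.sink.kite.DatasetSink
tier1.sinks.kite-dataset.channel = mem-channel
tier1.sinks.kite-dataset.kite.dataset.uri = dataset:hdfs:/datasets/default/users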