Kite Dataset Command Line Interface
The Kite Dataset command line interface (CLI) provides utility commands that let you perform essential tasks such as creating a schema and dataset, importing data from a CSV file, and viewing the results.
Each command is described below. See Using the Kite CLI to Create a Dataset for a practical example of the CLI in use.
Commands
- general options: options for all commands.
- help: get help for the dataset command in general or a specific command.
- create: create a dataset based on an existing schema.
- copy: copy one dataset to another dataset.
- transform: transform records from one dataset and store them in another dataset.
- update: update the metadata descriptor for a dataset.
- delete: delete a dataset.
- schema: view the schema for an existing dataset.
- info: show metadata for a dataset.
- show: show the first n records of a dataset.
- csv-schema: create a schema from a CSV data file.
- csv-import: import a CSV data file.
- obj-schema: create a schema from a Java object.
- partition-config: create a partition strategy for a schema.
- mapping-config: create a column mapping for a schema.
- log4j-config: configure Log4j to log to a dataset.
- flume-config: configure Flume to log to a dataset.
General options
Every command begins with kite-dataset, followed by general options. Currently, the only general option turns on debugging, which shows a stack trace if something goes awry during execution of the command. Additional options might be added as the product matures.
| Option | Description |
| --- | --- |
| -v, --verbose, --debug | Turn on debug logging and show stack traces. |
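For example, debug logging can be enabled on any command by placing the flag directly after kite-dataset; the info command and users dataset here match examples later on this page:
kite-dataset -v info users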
The Kite CLI supports the following environment variables.
| Variable | Description |
| --- | --- |
| HIVE_HOME | Root directory of the Hive instance |
| HIVE_CONF_DIR | Configuration directory for the Hive instance |
| HBASE_HOME | Root directory of the HBase instance |
| HADOOP_MAPRED_HOME | Root directory for MapReduce |
| HADOOP_HOME | Root directory of the Hadoop instance |
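If one of these services is installed in a non-default location, you can export the matching variable before running a command. The paths below are illustrative assumptions, not required values:
export HIVE_HOME=/usr/lib/hive
export HIVE_CONF_DIR=/etc/hive/conf
kite-dataset info users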
To show the values for these variables at runtime, set the debug= option to true. This can be helpful when troubleshooting issues where one or more of these resources is not found. For example:
debug=true kite-dataset info users
Use the flags= option to pass arguments to the internal Hadoop jar command. For example:
flags="-Xmx512m" kite-dataset info users
csv-schema
Use csv-schema to generate an Avro schema from a comma-separated values (CSV) file.
Syntax
kite-dataset [-v] csv-schema <sample csv path> [command options]
Options
| Option | Description |
| --- | --- |
| --skip-lines | The number of lines to skip before the start of the CSV data. Default is 0. |
| --quote | Quote character in the CSV data file. Default is the double quote ("). |
| --delimiter | Delimiter character in the CSV data file. Default is the comma (,). |
| --escape | Escape character in the CSV data file. Default is the backslash (\). |
| --class, --record-name | A class name or record name for the schema result. This value is required. |
| -o, --output | Save the schema (.avsc) to the given path. |
| --no-header | Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1, ... field_n. |
| --minimize | Minimize schema file size by eliminating white space. |
Examples
Print the schema to standard out:
kite-dataset csv-schema sample.csv --class Sample
Write the schema to sample.avsc:
kite-dataset csv-schema sample.csv --class Sample -o sample.avsc
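The parsing options can be combined for non-standard input. For instance, a tab-delimited file without a header line might be handled as sketched below; sample.tsv is a hypothetical file, and $'\t' is bash syntax for a literal tab character:
kite-dataset csv-schema sample.tsv --class Sample --delimiter $'\t' --no-header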
obj-schema
Build a schema from a Java class. Fields are assumed to be nullable by default. You can edit the generated schema directly to remove the “null” option for specific fields.
Syntax
kite-dataset [-v] obj-schema <class name> [command options]
Options
| Option | Description |
| --- | --- |
| -o, --output | Save schema in Avro format to a given path. |
| --jar | Add a jar to the classpath used when loading the Java class. |
| --lib-dir | Add a directory to the classpath used when loading the Java class. |
| --minimize | Minimize schema file size by eliminating white space. |
Examples
Create a schema for an example User class:
kite-dataset obj-schema org.kitesdk.cli.example.User
Create a schema for a class in a jar:
kite-dataset obj-schema com.example.MyRecord --jar my-application.jar
Save the schema for the example User class to user.avsc:
kite-dataset obj-schema org.kitesdk.cli.example.User -o user.avsc
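If the class is compiled to a directory rather than packaged in a jar, the classpath can be extended with --lib-dir instead; target/classes is an assumed build output directory, not a required path:
kite-dataset obj-schema com.example.MyRecord --lib-dir target/classes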
create
After you have generated an Avro schema, you can use create to make an empty dataset.
Syntax
kite-dataset [-v] create <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| -s, --schema | A file containing the Avro schema. This value is required. |
| -f, --format | By default, the dataset is created in Avro format. Use -f parquet to set the format to Parquet. |
| -p, --partition-by | A file containing a JSON-formatted partition strategy. |
| -m, --mapping | A file containing a JSON-formatted column mapping. |
| --set, --property | A property to set in the dataset’s descriptor: prop.name=value. |
Note: The dataset name must not contain a period (.).
Examples
Create dataset “users” in Hive:
kite-dataset create users --schema user.avsc
Create dataset “users” using Parquet:
kite-dataset create users --schema user.avsc --format parquet
Create dataset “users” partitioned by JSON configuration using a cache size of 20 (rather than the default cache size of 10):
kite-dataset create users --schema user.avsc --partition-by user_part.json --set kite.writer.cache-size=20
Create dataset “users” and set multiple properties:
kite-dataset create users --schema user.avsc --set kite.writer.cache-size=20 --set dfs.blocksize=256m
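A partition strategy and a column mapping can also be supplied together, for example when the dataset targets HBase; user_part.json and user-cols.json are files of the kind produced by the partition-config and mapping-config commands described later on this page:
kite-dataset create users --schema user.avsc --partition-by user_part.json --mapping user-cols.json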
update
Update the metadata descriptor for a dataset.
Syntax
kite-dataset [-v] update <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| -s, --schema | The file containing the Avro schema. |
| --set, --property | Add a property pair: prop.name=value. |
Examples
Update schema for dataset “users” in Hive:
kite-dataset update users --schema user.avsc
Update an HDFS dataset by URI and add a property:
kite-dataset update dataset:hdfs:/user/me/datasets/users --set kite.writer.cache-size=20
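A schema update and a property change can also be combined in one call; this sketch reuses the users dataset and user.avsc file from the previous examples:
kite-dataset update users --schema user.avsc --set kite.writer.cache-size=20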
schema
Show the schema for a dataset.
Syntax
kite-dataset [-v] schema <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| -o, --output | Save schema in Avro format to a given path. |
| --minimize | Minimize schema file size by eliminating white space. |
Examples
Print the schema for dataset “users” to standard out:
kite-dataset schema users
Save the schema for dataset “users” to user.avsc:
kite-dataset schema users -o user.avsc
csv-import
Copy CSV records into a dataset.
Syntax
kite-dataset [-v] csv-import <csv path> <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --escape | Escape character. Default is the backslash (\). |
| --delimiter | Delimiter character. Default is the comma (,). |
| --quote | Quote character. Default is the double quote ("). |
| --skip-lines | Lines to skip before the CSV data starts. Default is 0. |
| --no-header | Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1, ... field_n. |
Examples
Copy the records from sample.csv to a Hive dataset named “sample”:
kite-dataset csv-import path/to/sample.csv sample
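The same parsing options as csv-schema apply here, so a tab-delimited file with an extra leading line might be imported as sketched below; ratings.tsv is a hypothetical file, and $'\t' is bash syntax for a literal tab:
kite-dataset csv-import ratings.tsv sample --delimiter $'\t' --skip-lines 1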
show
Print the first n records in a dataset.
Syntax
kite-dataset [-v] show <dataset> [command options]
Options
| Option | Description |
| --- | --- |
| -n, --num-records | The number of records to print. The default number is 10. |
Examples
Show the first 10 records in dataset “users”:
kite-dataset show users
Show the first 50 records in dataset “users”:
kite-dataset show users -n 50
copy
Copy records from one dataset to another.
Syntax
kite-dataset [-v] copy <source dataset> <destination dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --no-compaction | Copy to output directly, without compacting the data. |
| --num-writers | The number of writer processes to use. |
Examples
Copy the contents of movies_avro to movies_parquet:
kite-dataset copy movies_avro movies_parquet
Copy the movies dataset into HBase in a map-only job:
kite-dataset copy movies dataset:hbase:zk-host/movies --no-compaction
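To control parallelism during the copy, the number of writer processes can be set explicitly; 4 is an arbitrary example value:
kite-dataset copy movies_avro movies_parquet --num-writers 4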
delete
Delete one or more datasets and related metadata.
Syntax
kite-dataset [-v] delete <datasets> [command options]
Examples
Delete all data and metadata for the dataset “users”:
kite-dataset delete users
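Because the command accepts more than one dataset, several can be deleted in a single call; users and events are example dataset names:
kite-dataset delete users events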
partition-config
Builds a partition strategy for a schema.
Syntax
kite-dataset [-v] partition-config <field:type pairs> [command options]
Options
| Option | Description |
| --- | --- |
| -s, --schema | The file containing the Avro schema. This value is required. |
| -o, --output | Save the partition strategy JSON file to the given path. |
| --minimize | Minimize output size by eliminating white space. |
Examples
Partition by email address, balanced across 16 hash partitions, and save the strategy to a file:
kite-dataset partition-config email:hash[16] email:copy -s user.avsc -o part.json
Partition by the year, month, and day of the created_at time:
kite-dataset partition-config created_at:year created_at:month created_at:day -s event.avsc
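The strategy can also be written to a file in compact form using the output options above; part.json is the chosen output name:
kite-dataset partition-config created_at:year created_at:month created_at:day -s event.avsc --minimize -o part.json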
mapping-config
Builds a column mapping for a schema, required for HBase. The resulting mapping definition is a valid JSON mapping file.
Mappings are specified by field:type pairs, where field is a source field from the given schema and type can be:
| Type | Description |
| --- | --- |
| key | Uses a key mapping. |
| version | Uses a version mapping (for optimistic concurrency). |
| any string | The given string is used as the column family in a column mapping. |
If the last option is used, the mapping type is determined by the source field type: numbers use counter, hash maps and records use keyAsColumn, and all others use column.
Syntax
kite-dataset [-v] mapping-config <field:type pairs> [command options]
Options
| Option | Description |
| --- | --- |
| -s, --schema | The file containing the Avro schema. |
| -p, --partition-by | The file containing the JSON partition strategy. |
| -o, --output | Save the column mapping JSON file to the given path. |
| --minimize | Minimize output size by eliminating white space. |
Examples
Store email in the key and other fields in column family u:
kite-dataset mapping-config email:key username:u id:u --schema user.avsc -o user-cols.json
Store the preferences hash map in column family prefs:
kite-dataset mapping-config preferences:prefs --schema user.avsc
Use the version field as an OCC version:
kite-dataset mapping-config version:version --schema user.avsc
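Key, version, and column mappings are often combined into a single configuration; this sketch reuses the fields from the examples above:
kite-dataset mapping-config email:key version:version username:u id:u --schema user.avsc -o user-cols.json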
help
Retrieves details on the functions of one or more dataset commands.
Syntax
kite-dataset [-v] help <commands> [command options]
Examples
Retrieve details for the create, show, and delete commands:
kite-dataset help create show delete
transform
Transforms records from one dataset and stores them in another dataset.
Syntax
kite-dataset transform <source dataset> <destination dataset> [command options]
Options
| Option | Description |
| --- | --- |
| --no-compaction | Copy to output directly, without compacting the data. |
| --num-writers | The number of writer processes to use. |
| --transform | A transform DoFn class name. |
| --jar | Add a jar to the runtime classpath. |
Examples
Transform the contents of movies_src using com.example.TransformFn:
kite-dataset transform movies_src movies --transform com.example.TransformFn --jar fns.jar
info
Print all metadata for a dataset.
Syntax
kite-dataset info <dataset name>
Example
Print all metadata for the “users” dataset:
kite-dataset info users
log4j-config
Builds a log4j configuration to log events to a dataset.
Syntax
kite-dataset log4j-config <dataset name> --host <flume hostname> [command options]
Options
| Option | Description |
| --- | --- |
| --port | Flume port. |
| --class, --package | Java class or package from which to log. |
| --log-all | Configure the root logger to send to Flume. |
| -o, --output | Save the log4j configuration to a file. |
Examples
Print log4j configuration to log to dataset “users”:
kite-dataset log4j-config --host flume.cluster.com --class org.kitesdk.examples.MyLoggingApp users
Save log4j configuration to the file log4j.properties:
kite-dataset log4j-config --host flume.cluster.com --package org.kitesdk.examples -o log4j.properties users
Print log4j configuration to log from all classes:
kite-dataset log4j-config --host flume.cluster.com --log-all users
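If the Flume Avro source listens on a non-default port, it can be given explicitly with --port; 41415 is an arbitrary example value, not a Flume default:
kite-dataset log4j-config --host flume.cluster.com --port 41415 --class org.kitesdk.examples.MyLoggingApp users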
flume-config
Builds a Flume configuration to log events to a dataset.
Syntax
kite-dataset flume-config <dataset name or URI> [command options]
Options
| Option | Description |
| --- | --- |
| --use-dataset-uri | Configure Flume with a dataset URI. Requires Flume 1.6 or later. |
| --agent | Flume agent name. |
| --source | Flume source name. |
| --bind | Avro source bind address. |
| --port | Avro source port. |
| --channel | Flume channel name. |
| --channel-type | Flume channel type (memory or file). |
| --channel-capacity | Flume channel capacity. |
| --channel-transaction-capacity | Flume channel transaction capacity. |
| --checkpoint-dir | File channel checkpoint directory. Required when using --channel-type file. |
| --data-dir | File channel data directory. Use the option multiple times for multiple data directories. Required when using --channel-type file. |
| --sink | Flume sink name. |
| --batch-size | Records to write per batch. |
| --roll-interval | Time in seconds before starting the next file. |
| --proxy-user | User identity to use when writing to HDFS. |
| -o, --output | Save the Flume configuration to a file. |
Examples
Print Flume configuration to log to dataset “users”:
kite-dataset flume-config --checkpoint-dir /data/0/flume/checkpoint --data-dir /data/1/flume/data users
Print Flume configuration to log to dataset dataset:hdfs:/datasets/default/users:
kite-dataset flume-config --channel-type memory dataset:hdfs:/datasets/default/users
Save Flume configuration to the file flume.properties:
kite-dataset flume-config --channel-type memory -o flume.properties users
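The generated configuration can also be adapted to an existing Flume topology by overriding the component names; tier1, ch1, and k1 are illustrative assumptions rather than required names:
kite-dataset flume-config --agent tier1 --channel ch1 --sink k1 --channel-type memory -o flume.properties users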