Kite Dataset Command Line Interface
The Kite Dataset command line interface (CLI) provides utility commands that let you perform essential tasks such as creating a schema and dataset, importing data from a CSV file, and viewing the results.
Each command is described below. See Using the Kite CLI to Create a Dataset for a practical example of the CLI in use.
Commands
- general options: options for all commands.
- csv-schema: create a schema from a CSV data file.
- obj-schema: create a schema from a Java object.
- create: create a dataset based on an existing schema.
- update: update the metadata descriptor for a dataset.
- schema: view the schema for an existing dataset.
- csv-import: import a CSV data file.
- show: show the first n records of a dataset.
- copy: copy one dataset to another dataset.
- delete: delete a dataset.
- partition-config: create a partition strategy for a schema.
- mapping-config: create a column mapping for a schema.
- help: get help for the dataset command in general or a specific command.
General options
Every command begins with dataset, followed by general options. Currently, the only general option turns on debugging, which prints a stack trace if something goes awry during execution of the command. Additional options may be added as the product matures.
- -v, --verbose, --debug: Turn on debug logging and show stack traces.
csv-schema
Use csv-schema to generate an Avro schema from a comma-separated value (CSV) file.
Syntax
dataset [-v] csv-schema <sample csv path> [command options]
Options
- --skip-lines: The number of lines to skip before the start of the CSV data. Default is 0.
- --quote: Quote character in the CSV data file. Default is the double quote (").
- --delimiter: Delimiter character in the CSV data file. Default is the comma (,).
- --escape: Escape character in the CSV data file. Default is the backslash (\).
- --class, --record-name: A class name or record name for the schema result. This value is required.
- -o, --output: Save the schema (.avsc) to the given path.
- --no-header: Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1, …, field_n.
- --minimize: Minimize schema file size by eliminating white space.
Examples
Print the schema to standard out:
dataset csv-schema sample.csv --class Sample
Write the schema to sample.avsc:
dataset csv-schema sample.csv --class Sample -o sample.avsc
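As an illustration of the kind of schema csv-schema produces, here is a rough Python sketch of header-based inference. This is a simplification, not Kite's implementation: the infer_avro_schema helper is hypothetical, and Kite's actual type inference is more thorough.

```python
import csv
import io
import json

def infer_avro_schema(csv_text, record_name):
    """Read the header line for field names and guess each field's
    Avro type from the first data row (simplified sketch)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, first_row = rows[0], rows[1]

    def guess_type(value):
        # Try the narrower types first; fall back to string.
        for avro_type, cast in (("long", int), ("double", float)):
            try:
                cast(value)
                return avro_type
            except ValueError:
                pass
        return "string"

    return {
        "type": "record",
        "name": record_name,
        "fields": [
            {"name": name, "type": guess_type(value)}
            for name, value in zip(header, first_row)
        ],
    }

sample = "id,username,score\n1,alice,3.5\n"
print(json.dumps(infer_avro_schema(sample, "Sample"), indent=2))
```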
obj-schema
Build a schema from a Java class.
Syntax
dataset [-v] obj-schema <class name> [command options]
Options
- -o, --output: Save the schema in Avro format to the given path.
- --jar: Add a jar to the classpath used when loading the Java class.
- --lib-dir: Add a directory to the classpath used when loading the Java class.
- --minimize: Minimize schema file size by eliminating white space.
Examples
Create a schema for an example User class:
dataset obj-schema org.kitesdk.cli.example.User
Create a schema for a class in a jar:
dataset obj-schema com.example.MyRecord --jar my-application.jar
Save the schema for the example User class to user.avsc:
dataset obj-schema org.kitesdk.cli.example.User -o user.avsc
create
After you have generated an Avro schema, you can use create to make an empty dataset.
Syntax
dataset [-v] create <dataset> [command options]
Options
- -s, --schema: A file containing the Avro schema. This value is required.
- -f, --format: By default, the dataset is created in Avro format. Use this option to set the format to Parquet: -f parquet.
- -p, --partition-by: A file containing a JSON-formatted partition strategy.
- -m, --mapping: A file containing a JSON-formatted column mapping.
Examples
Create dataset “users” in Hive:
dataset create users --schema user.avsc
Create dataset “users” using Parquet:
dataset create users --schema user.avsc --format parquet
Create dataset “users” partitioned by JSON configuration:
dataset create users --schema user.avsc --partition-by user_part.json
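The --schema option expects an Avro schema file, such as the output of csv-schema or obj-schema. For reference, a minimal user.avsc for the examples above might look like the following (the field names are illustrative):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "username", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}
```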
update
Update the metadata descriptor for a dataset.
Syntax
dataset [-v] update <dataset> [command options]
Options
- -s, --schema: The file containing the Avro schema.
- --set, --property: Add a property pair: prop.name=value.
Examples
Update schema for dataset “users” in Hive:
dataset update users --schema user.avsc
Update HDFS dataset by URI, add property:
dataset update dataset:hdfs:/user/me/datasets/users --set kite.write.cache-size=20
schema
Show the schema for a dataset.
Syntax
dataset [-v] schema <dataset> [command options]
Options
- -o, --output: Save the schema in Avro format to the given path.
- --minimize: Minimize schema file size by eliminating white space.
Examples
Print the schema for dataset “users” to standard out:
dataset schema users
Save the schema for dataset “users” to user.avsc:
dataset schema users -o user.avsc
csv-import
Copy CSV records into a dataset.
Syntax
dataset [-v] csv-import <csv path> <dataset> [command options]
Options
- --escape: Escape character. Default is the backslash (\).
- --delimiter: Delimiter character. Default is the comma (,).
- --quote: Quote character. Default is the double quote (").
- --skip-lines: The number of lines to skip before the start of the CSV data. Default is 0.
- --no-header: Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1, …, field_n.
Examples
Copy the records from sample.csv to a Hive dataset named “sample”:
dataset csv-import path/to/sample.csv sample
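The --delimiter, --quote, and --escape options correspond to standard CSV dialect settings. As an analogy (this uses Python's csv module, not Kite's parser), parsing a pipe-delimited line with those settings looks like:

```python
import csv
import io

# A pipe-delimited line: a quoted field containing a comma,
# and an escaped literal pipe inside the last field.
line = 'id|"smith, jane"|note \\| pipe\n'

reader = csv.reader(
    io.StringIO(line),
    delimiter="|",    # --delimiter |
    quotechar='"',    # --quote "
    escapechar="\\",  # --escape \
)
print(next(reader))  # ['id', 'smith, jane', 'note | pipe']
```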
show
Print the first n records in a dataset.
Syntax
dataset [-v] show <dataset> [command options]
Options
- -n, --num-records: The number of records to print. Default is 10.
Examples
Show the first 10 records in dataset “users”:
dataset show users
Show the first 50 records in dataset “users”:
dataset show users -n 50
copy
Copy records from one dataset to another.
Syntax
dataset [-v] copy <source dataset> <destination dataset> [command options]
Options
- --no-compaction: Copy to the output directly, without compacting the data.
- --num-writers: The number of writer processes to use.
Examples
Copy the contents of movies_avro to movies_parquet:
dataset copy movies_avro movies_parquet
Copy the movies dataset into HBase in a map-only job:
dataset copy movies dataset:hbase:zk-host/movies --no-compaction
delete
Delete one or more datasets and related metadata.
Syntax
dataset [-v] delete <datasets> [command options]
Examples
Delete all data and metadata for the dataset “users”:
dataset delete users
partition-config
Builds a partition strategy for a schema.
Syntax
dataset [-v] partition-config <field:type pairs> [command options]
Options
- -s, --schema: The file containing the Avro schema. This value is required.
- -o, --output: Save the partition strategy JSON file to the given path.
- --minimize: Minimize output size by eliminating white space.
Examples
Partition by email address, balanced across 16 hash partitions, and save to a file:
dataset partition-config email:hash[16] email:copy -s user.avsc -o part.json
Partition by created_at time's year, month, and day:
dataset partition-config created_at:year created_at:month created_at:day -s event.avsc
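For reference, the partition strategy is written as a JSON array with one partitioner per field:type pair. A sketch of what part.json from the first example might contain (the exact keys shown are an assumption, not verified output):

```json
[
  {"type": "hash", "source": "email", "buckets": 16},
  {"type": "identity", "source": "email"}
]
```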
mapping-config
Builds a column mapping for a schema, required for HBase. The resulting mapping definition is a valid JSON mapping file.
Mappings are specified by field:type pairs, where field is a source field from the given schema and type can be:
- key: uses a key mapping.
- version: uses a version mapping (for optimistic concurrency).
- any string: the given string is used as the family in a column mapping.
If the last option is used, the mapping type is determined by the source field type: numbers use counter, hash maps and records use keyAsColumn, and all others use column.
Syntax
dataset [-v] mapping-config <field:type pairs> [command options]
Options
- -s, --schema: The file containing the Avro schema.
- -p, --partition-by: The file containing the JSON partition strategy.
- -o, --output: Save the column mapping JSON file to the given path.
- --minimize: Minimize output size by eliminating white space.
Examples
Store email in the key, other fields in column family u:
dataset mapping-config email:key username:u id:u --schema user.avsc -o user-cols.json
Store the preferences hash map in column family prefs:
dataset mapping-config preferences:prefs --schema user.avsc
Use the version field as an OCC version:
dataset mapping-config version:version --schema user.avsc
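For reference, the resulting mapping file is a JSON array with one entry per field:type pair. A sketch of what user-cols.json from the first example might contain (the exact keys shown are an assumption, not verified output):

```json
[
  {"type": "key", "source": "email"},
  {"type": "column", "source": "username", "family": "u"},
  {"type": "column", "source": "id", "family": "u"}
]
```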
help
Retrieves details on the functions of one or more dataset commands.
Syntax
dataset [-v] help <commands> [command options]
Examples
Retrieve details for the create, show, and delete commands:
dataset help create show delete