Kite Dataset Command Line Interface
The Kite Dataset command line interface (CLI) provides utility commands that let you perform essential tasks such as creating a schema and dataset, importing data from a CSV file, and viewing the results.
Each command is described below. See Using the Kite CLI to Create a Dataset for a practical example of the CLI in use.
Commands
- general options: options for all commands.
- csv-schema: create a schema from a CSV data file.
- obj-schema: create a schema from a Java object.
- create: create a dataset based on an existing schema.
- update: update the metadata descriptor for a dataset.
- schema: view the schema for an existing dataset.
- csv-import: import a CSV data file.
- show: show the first n records of a dataset.
- copy: copy one dataset to another dataset.
- delete: delete a dataset.
- partition-config: create a partition strategy for a schema.
- mapping-config: create a column mapping for a schema.
- help: get help for the dataset command in general or a specific command.
General options
Every command begins with dataset, followed by general options. Currently, the only general option turns on debugging, which will show a stack trace if something goes awry during execution of the command. A concise set of additional options might be added as the product matures.
- -v, --verbose, --debug: Turn on debug logging and show stack traces.
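For example, any command can be run with debug logging turned on:
dataset -v show users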
csv-schema
Use csv-schema to generate an Avro schema from a comma separated value (CSV) file.
Syntax
dataset [-v] csv-schema <sample csv path> [command options]
Options
- --skip-lines: The number of lines to skip before the start of the CSV data. Default is 0.
- --quote: Quote character in the CSV data file. Default is the double quote (").
- --delimiter: Delimiter character in the CSV data file. Default is the comma (,).
- --escape: Escape character in the CSV data file. Default is the backslash (\).
- --class, --record-name: A class name or record name for the schema result. This value is required.
- -o, --output: Save the generated .avsc schema to the given path.
- --no-header: Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1, …, field_n.
- --minimize: Minimize schema file size by eliminating white space.
Examples
Print the schema to standard out:
dataset csv-schema sample.csv --class Sample
Write the schema to sample.avsc:
dataset csv-schema sample.csv --class Sample -o sample.avsc
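As a rough illustration (field types are inferred from the sample data, so the exact output may vary by Kite version), a sample.csv that begins with
id,username,created_at
1,alice,1412107200000
would produce a Sample schema along these lines:
{
  "type": "record",
  "name": "Sample",
  "fields": [
    {"name": "id", "type": ["null", "long"]},
    {"name": "username", "type": ["null", "string"]},
    {"name": "created_at", "type": ["null", "long"]}
  ]
}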
obj-schema
Build a schema from a Java class.
Syntax
dataset [-v] obj-schema <class name> [command options]
Options
- -o, --output: Save the schema in Avro format to the given path.
- --jar: Add a jar to the classpath used when loading the Java class.
- --lib-dir: Add a directory to the classpath used when loading the Java class.
- --minimize: Minimize schema file size by eliminating white space.
Examples
Create a schema for an example User class:
dataset obj-schema org.kitesdk.cli.example.User
Create a schema for a class in a jar:
dataset obj-schema com.example.MyRecord --jar my-application.jar
Save the schema for the example User class to user.avsc:
dataset obj-schema org.kitesdk.cli.example.User -o user.avsc
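The schema is built by reflecting on the class's instance fields, so a hypothetical class such as the following (not the actual example class that ships with Kite) would yield a record schema with an id field and a name field:
package com.example;

public class MyRecord {
  private long id;      // becomes a long field in the schema
  private String name;  // becomes a string field in the schema
}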
create
After you have generated an Avro schema, you can use create to make an empty dataset.
Syntax
dataset [-v] create <dataset> [command options]
Options
- -s, --schema: A file containing the Avro schema. This value is required.
- -f, --format: By default, the dataset is created in Avro format. Use this option to set the format to Parquet: -f parquet.
- -p, --partition-by: A file containing a JSON-formatted partition strategy.
- -m, --mapping: A file containing a JSON-formatted column mapping.
Examples
Create dataset “users” in Hive:
dataset create users --schema user.avsc
Create dataset “users” using Parquet:
dataset create users --schema user.avsc --format parquet
Create dataset “users” partitioned by JSON configuration:
dataset create users --schema user.avsc --partition-by user_part.json
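For reference, user_part.json above is a partition strategy file of the kind generated by the partition-config command described below. A hypothetical strategy that spreads users across 16 hash buckets of the email field might look roughly like this (key names are illustrative and may vary by Kite version):
[
  {"type": "hash", "source": "email", "name": "email_hash", "buckets": 16}
]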
update
Update the metadata descriptor for a dataset.
Syntax
dataset [-v] update <dataset> [command options]
Options
- -s, --schema: A file containing the Avro schema.
- --set, --property: Add a property pair: prop.name=value.
Examples
Update schema for dataset “users” in Hive:
dataset update users --schema user.avsc
Update HDFS dataset by URI, add property:
dataset update dataset:hdfs:/user/me/datasets/users --set kite.write.cache-size=20
schema
Show the schema for a dataset.
Syntax
dataset [-v] schema <dataset> [command options]
Options
- -o, --output: Save the schema in Avro format to the given path.
- --minimize: Minimize schema file size by eliminating white space.
Examples
Print the schema for dataset “users” to standard out:
dataset schema users
Save the schema for dataset “users” to user.avsc:
dataset schema users -o user.avsc
csv-import
Copy CSV records into a dataset.
Syntax
dataset [-v] csv-import <csv path> <dataset> [command options]
Options
- --escape: Escape character. Default is the backslash (\).
- --delimiter: Delimiter character. Default is the comma (,).
- --quote: Quote character. Default is the double quote (").
- --skip-lines: The number of lines to skip before the start of the CSV data. Default is 0.
- --no-header: Use this option when the CSV data file does not have header information in the first line. Fields are given the default names field_0, field_1, …, field_n.
Examples
Copy the records from sample.csv to a Hive dataset named “sample”:
dataset csv-import path/to/sample.csv sample
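The CSV fields are typically matched to the dataset's schema by the header names (or by the generated field_0, field_1, … names when --no-header is used), so a file imported into a dataset created from the hypothetical Sample schema sketched earlier might begin:
id,username,created_at
1,alice,1412107200000
2,bob,1412193600000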
show
Print the first n records in a dataset.
Syntax
dataset [-v] show <dataset> [command options]
Options
- -n, --num-records: The number of records to print. Default is 10.
Examples
Show the first 10 records in dataset “users”:
dataset show users
Show the first 50 records in dataset “users”:
dataset show users -n 50
copy
Copy records from one dataset to another.
Syntax
dataset [-v] copy <source dataset> <destination dataset> [command options]
Options
- --no-compaction: Copy to the output directly, without compacting the data.
- --num-writers: The number of writer processes to use.
Examples
Copy the contents of movies_avro to movies_parquet:
dataset copy movies_avro movies_parquet
Copy the movies dataset into HBase in a map-only job:
dataset copy movies dataset:hbase:zk-host/movies --no-compaction
delete
Delete one or more datasets and related metadata.
Syntax
dataset [-v] delete <datasets> [command options]
Examples
Delete all data and metadata for the dataset “users”:
dataset delete users
partition-config
Builds a partition strategy for a schema.
Syntax
dataset [-v] partition-config <field:type pairs> [command options]
Options
- -s, --schema: The file containing the Avro schema. This value is required.
- -o, --output: Save the partition strategy JSON file to the given path.
- --minimize: Minimize output size by eliminating white space.
Examples
Partition by email address, balanced across 16 hash partitions, and save to a file:
dataset partition-config email:hash[16] email:copy -s user.avsc -o part.json
Partition by created_at time’s year, month, and day:
dataset partition-config created_at:year created_at:month created_at:day -s event.avsc
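As a rough sketch, the part.json written by the first example would be a JSON array with one partitioner per field:type pair, along these lines (exact key names and generated partition names may differ between Kite versions):
[
  {"type": "hash", "source": "email", "name": "email_hash", "buckets": 16},
  {"type": "identity", "source": "email", "name": "email_copy"}
]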
mapping-config
Builds a column mapping for a schema, required for HBase. The resulting mapping definition is a valid JSON mapping file.
Mappings are specified by field:type pairs, where field is a source field from the given schema and type can be:
- key: Uses a key mapping.
- version: Uses a version mapping (for optimistic concurrency).
- any string: The given string is used as the column family in a column mapping.
If the last option is used, the mapping type is determined by the source field type: numbers use counter, hash maps and records use keyAsColumn, and all other types use column.
Syntax
dataset [-v] mapping-config <field:type pairs> [command options]
Options
- -s, --schema: The file containing the Avro schema.
- -p, --partition-by: The file containing the JSON partition strategy.
- --minimize: Minimize output size by eliminating white space.
Examples
Store email in the key, other fields in column family u:
dataset mapping-config email:key username:u id:u --schema user.avsc -o user-cols.json
Store preferences hash-map in column family prefs:
dataset mapping-config preferences:prefs --schema user.avsc
Use the version field as an OCC version:
dataset mapping-config version:version --schema user.avsc
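As a rough sketch, the mapping written to user-cols.json by the first example would be a JSON array with one entry per field:type pair, along these lines (exact key names may differ between Kite versions):
[
  {"type": "key", "source": "email"},
  {"type": "column", "source": "username", "family": "u", "qualifier": "username"},
  {"type": "column", "source": "id", "family": "u", "qualifier": "id"}
]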
help
Retrieves details on the functions of one or more dataset commands.
Syntax
dataset [-v] help <commands> [command options]
Examples
Retrieve details for the create, show, and delete commands:
dataset help create show delete