Using the Kite Command Line Interface to Create a Dataset
Kite provides a set of tools that handle the basic legwork for creating a dataset, allowing you to focus on the specifics of the business problem you want to solve. This short tutorial walks you through the process of creating a dataset and viewing the results using the command line interface (CLI).
Preparation
If you have not done so already, download the Kite command-line interface jar. This jar is the executable that runs the command-line interface, so save it as dataset. To download with curl, run:
curl http://central.maven.org/maven2/org/kitesdk/kite-tools/0.15.0/kite-tools-0.15.0-binary.jar -o dataset
chmod +x dataset
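Before moving on, it can help to confirm that the download and chmod steps took effect. The following is a minimal sketch, not part of Kite: the helper name have_dataset_cli is hypothetical, and it only checks that a file named dataset exists and is executable in a given directory. It does not run the jar.

```shell
# Hypothetical helper (not a Kite command): succeeds when an executable
# file named "dataset" exists in the directory passed as the first argument.
have_dataset_cli() {
  [ -f "$1/dataset" ] && [ -x "$1/dataset" ]
}

if have_dataset_cli .; then
  echo "dataset is present and executable"
else
  echo "dataset is missing or not executable; repeat the curl and chmod steps"
fi
```

The commands in this tutorial are written as dataset. If the directory that holds the file is not on your PATH, most shells will need you to type ./dataset instead.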
Create a CSV Data File
If you have a CSV file sitting around waiting to be used, you can substitute your file for the one that follows. The truth is, it doesn’t matter whether you have 100 columns or 2 columns; the process is the same. Larger datasets are only larger, not more complex.
If you don’t have a CSV file handy, you can copy the next code snippet and save it as a plain text file named sandwiches.csv.
name, description
Reuben, Pastrami and sauerkraut on toasted rye with Russian dressing.
PBJ, Peanut butter and grape jelly on white bread.
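If you would rather not open a text editor, one way to create the file is a shell here-document. This is only a convenience sketch; it writes exactly the three lines shown above into sandwiches.csv in the current directory, overwriting any file of that name.

```shell
# Write the sample data to sandwiches.csv in the current directory.
# Quoting 'EOF' keeps the shell from expanding anything in the body.
cat > sandwiches.csv <<'EOF'
name, description
Reuben, Pastrami and sauerkraut on toasted rye with Russian dressing.
PBJ, Peanut butter and grape jelly on white bread.
EOF

# Print the file back to confirm its contents.
cat sandwiches.csv
```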
Infer the Schema
All right. Now we get to use the CLI. Start by inferring an Avro schema file from the sandwiches.csv file you just created. Enter the following command to create an Avro schema file named sandwich.avsc with the class name Sandwich. The schema details are based on the headings and data in sandwiches.csv.
dataset csv-schema sandwiches.csv --class Sandwich -o sandwich.avsc
If you open sandwich.avsc in a text editor, it looks something like the code block below. csv-schema infers the names of the columns from the header, and the data type for each column from the first row of values.
{
  "type" : "record",
  "name" : "Sandwich",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "name",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from \"Reuben\""
  }, {
    "name" : "description",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from \" Pastrami and sauerkraut on toasted rye with Russian dressing.\""
  } ]
}
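To see which field names to expect before you run csv-schema on your own file, you can list the header columns yourself. The sketch below is an illustration only and is not how Kite is implemented; csv_header_fields is a hypothetical helper that splits the first line on commas and trims surrounding whitespace. In the schema above the field is named description even though the header reads " description", while the sample value keeps its leading space. The helper assumes a simple header with no quoted commas.

```shell
# Hypothetical helper (not a Kite command): print one column name per line
# from the header row of a CSV file, trimming whitespace around each name.
# Assumes the header contains no quoted fields with embedded commas.
csv_header_fields() {
  head -n 1 "$1" | tr ',' '\n' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}

if [ -f sandwiches.csv ]; then
  csv_header_fields sandwiches.csv
fi
```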
Create the Dataset
With a schema, you can create a new dataset. Enter the following command.
dataset create sandwiches -s sandwich.avsc
While it does not create actual sandwiches, it does create an empty dataset in which you can store sandwich descriptions, which is the next best thing. Probably.
Just for giggles, you can reverse the process you just completed and look at the underlying schema of your dataset using the following command.
dataset schema sandwiches
You’ll get the same schema back, but this time, trust me, it’s coming from the Hive repository. Honest.
{
  "type" : "record",
  "name" : "Sandwich",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "name",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from \"Reuben\""
  }, {
    "name" : "description",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from \" Pastrami and sauerkraut on toasted rye with Russian dressing.\""
  } ]
}
Import the CSV Data
You’ve created a dataset in the Hive repository, but that is only the container, not the information itself. Next, you might want to add some data so that you can run some queries. Use the following command to import the sandwiches in your CSV file.
dataset csv-import sandwiches.csv sandwiches
The command returns a record count.
Added 2 records to dataset "sandwiches"
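One quick sanity check is to compare the reported count with the number of data rows in the CSV file. The sketch below uses only standard shell tools, not the Kite CLI; csv_data_rows is a hypothetical helper. It counts lines after the header and ignores blank lines, so it would miscount a file whose quoted fields contain embedded newlines.

```shell
# Hypothetical helper (not a Kite command): count the data rows in a CSV
# file by skipping the header line and ignoring blank lines.
csv_data_rows() {
  tail -n +2 "$1" | grep -c -v '^[[:space:]]*$'
}

if [ -f sandwiches.csv ]; then
  echo "sandwiches.csv has $(csv_data_rows sandwiches.csv) data rows"
fi
```

For the sample file, the number printed here should agree with the count that csv-import reports.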
But can you believe that? Inquiring minds want to verify that the information is actually in the dataset.
Show the Results
You can list records from your newly created dataset using the show command.
dataset show sandwiches
By default, the CLI retrieves up to the first 10 records from your dataset.
{"name": "Reuben", "description": " Pastrami and sauerkraut on toasted rye with Russian dressing."}
{"name": "PBJ", "description": " Peanut butter and grape jelly on white bread."}
If you find that number of sandwiches overwhelming, you can change the number of records the query returns.
dataset show sandwiches -n 1
This time only the first record prints to screen.
{"name": "Reuben", "description": " Pastrami and sauerkraut on toasted rye with Russian dressing."}
You can import additional records into the dataset and use Hive or Impala to query the results.
Delete the Dataset
Given the ease with which you just created the sandwiches dataset, it seems a shame to destroy it out of hand. Keep in mind that this was only an example, and not something you were meant to treasure. I suppose you don’t have to delete it. You might want to keep it around as a souvenir, like the first dollar earned or something like that. If so, create a dataset you hate, and prepare to annihilate it using this unassuming command.
dataset delete sandwiches
There they go. Reuben and PBJ are gone. But you can create them again, or any other dataset you choose, using the CLI.