Dataset, View, and Repository URIs
Datasets, views, and repositories are identified by URI. As a general rule, you can use a URI any time you would specify a dataset or view. If you attempt to perform an action on a view that is not allowed, the action fails.
Dataset URIs
You construct a dataset URI using one of the following patterns, depending on your chosen dataset scheme.
Scheme | Pattern |
---|---|
Hive | dataset:hive:<namespace>/<dataset> |
HDFS | dataset:hdfs:/<path>/<namespace>/<dataset-name> |
S3 | dataset:s3a://<bucket>/<namespace>/<dataset-name> dataset:s3n://<bucket>/<path>/<namespace>/<dataset-name> |
Local FS | dataset:file:/<path>/<namespace>/<dataset-name> |
HBase | dataset:hbase:<zookeeper>/<dataset-name> |
Dataset patterns always begin with the dataset:
prefix. Any of these patterns can be modified to create a View URI.
Hive
Hive manages your datatables for you. You only have to provide the dataset name. You also have the option of providing a namespace.
dataset:hive:<namespace>/<dataset>
If you want to use external Hive datatables, you must also provide a path to the dataset. If you don’t explicitly set a namespace, Kite uses the default namespace. For a Hive dataset, a Kite namespace maps one-to-one to a Hive database.
dataset:hive:/<path>/<namespace>/<dataset-name>
In earlier versions of Kite, dataset:hive:a/b
meant directory ./a/b Now, it has changed to namespace=a dataset=b.
To create an external table, add location=/path/to/data/dir
to the dataset URI.
dataset:hive:namespace/dataset?location=/path/to/data/dir
HDFS
The URI for a dataset in HDFS uses the following pattern. You provide a path to the dataset.
dataset:hdfs:/<path>/<namespace>/<dataset-name>
While it is not a typical use case, there might be times where it is useful to specify the HDFS host and port. You can insert the host and port before the path in the URI. You can use these URIs to select between HDFS instances.
The host and port are required if your Hadoop configuration files aren’t on your classpath. This can happen if your application is running outside the cluster and you haven’t taken the step of deploying client configuration files to the server running the application. The host is the hostname for the namenode; the port is the port for the namenode (typically 8020). If you have an HA configuration, you always need client configuration files, and the host should be the NameService ID.
dataset:hdfs://<host>[:port]/<path>/<namespace>/<dataset-name>
S3
Kite supports datasets stored in S3 using both s3a
and s3n
file system schemes. The URI host is used to pass a S3 bucket name.
S3 credentials should be set in the environment configuration using the right property for the FS scheme:
s3a
: usefs.s3a.access.key
for id andfs.s3a.secret.key
for keys3n
: usefs.s3n.awsAccessKeyId
for id andfs.s3n.awsSecretAccessKey
for key
Local File System
The local file system dataset URI follows a pattern similar to the HDFS URI, with the file:
scheme.
dataset:file:/<path>/<namespace>/<dataset-name>
HBase
hbase:<zookeeper>/<dataset-name>
The zookeeper argument is a comma separated list of hosts. For example
hbase:host1,host2:9999,host3/myDataset
View URIs
A view URI is constructed by changing the prefix of a dataset URI from dataset:
to view:
. You then add query arguments as name/value pairs, similar to query arguments in an HTTP URL. Query arguments place constraints on the information returned in the view.
view:<scheme-specific-URI>?<field>=<constraint>
For example, you can restrict values returned from a table of users to users whose favorite color is pink.
view:hdfs:/default/cloudera/users?favoriteColor=pink
You can insert records in a view, and the changes are reflected in the source dataset.
You can also set constraints based on dataset partitions. For example, if a dataset of movie ratings were partitioned by date, you might create the view URI this way, constraining the ratings returned to March, 2014.
view:hdfs:/default/cloudera/ratings?year=2014&month=3
If the URI begins with dataset:
, any constraints are ignored.
There are three formats used to set constraint values. The values can be numbers or strings, but the values you specify must match the schema definition for the field.
Format | Constraint Type | Example | Meaning |
---|---|---|---|
empty | Exists (value is not null) | favoriteColor= |
Field favoriteColor is populated. |
comma-separated list | In (any of the specified values) | genre=comedy,animation |
Field genre is comedy or animation. |
interval | Range of values | month=[1,4] |
Date is from January 1 through April 30. |
See Interval Notation for more examples of defining ranges of values.
Repository URIs
Repository URI patterns always begin with the repo:
prefix and leave out table and namespace options that are in dataset or view URIs.
Scheme | Pattern |
---|---|
Hive | repo:hive |
HDFS | repo:hdfs:/<path> |
Local FS | repo:file:/<path> |
HBase | repo:hbase:<zookeeper> |
In the Kite Dataset API, you use a repository URI with the Datasets.list
method to retrieve a list of valid datasets. You can also pass a repository URI to the CLI list command.
For example, to list the dataset URIs for the Hive repository, use Datasets.list("repo:hive");
.