org.kitesdk.data
Class DatasetRepositories

java.lang.Object
  extended by org.kitesdk.data.DatasetRepositories

public class DatasetRepositories
extends Object

Convenience methods for working with DatasetRepository instances.

Since:
0.8.0

Constructor Summary
DatasetRepositories()
           
 
Method Summary
static DatasetRepository open(String uri)
          Synonym for open(java.net.URI) for String URIs.
static DatasetRepository open(URI repositoryUri)
           Open a DatasetRepository for the given URI.
static RandomAccessDatasetRepository openRandomAccess(String uri)
          Synonym for openRandomAccess(java.net.URI) for String URIs.
static RandomAccessDatasetRepository openRandomAccess(URI repositoryUri)
           Synonym for open(java.net.URI) that returns a RandomAccessDatasetRepository.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DatasetRepositories

public DatasetRepositories()
Method Detail

open

public static DatasetRepository open(String uri)
Synonym for open(java.net.URI) for String URIs.

Parameters:
uri - a String URI
Returns:
a DatasetRepository for the given URI.
Throws:
IllegalArgumentException - If the String cannot be parsed into a valid URI.
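
The exception described above is the same one java.net.URI.create throws for malformed input. The following is a minimal sketch of checking a string before passing it on, assuming the String overload performs a comparable parse (the documentation states only that IllegalArgumentException is thrown). The class and method names are hypothetical and not part of Kite.

```java
import java.net.URI;

// Hypothetical helper, not part of Kite: pre-checks a repository URI string
// using only java.net.URI from the JDK.
class RepoUriSyntax {

    // Returns true if the string parses as a URI, false otherwise.
    static boolean isParseable(String uri) {
        try {
            URI.create(uri); // throws IllegalArgumentException on bad syntax
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isParseable("repo:hdfs://localhost:8020/data"));
        // An unescaped space is not legal in a URI, so this one is rejected.
        System.out.println(isParseable("repo:file:/my data"));
    }
}
```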

open

public static DatasetRepository open(URI repositoryUri)

Open a DatasetRepository for the given URI.

This method provides a simpler way to connect to a DatasetRepository while providing information about the appropriate MetadataProvider and other options to use. For almost all cases, this is the preferred method of retrieving an instance of a DatasetRepository.

The format of a repository URI is as follows.

repo:[storage component]

The [storage component] indicates the underlying metadata and, in some cases, physical storage of the data, along with any options. The supported storage backends are:

Local FileSystem URIs

file:[path] where [path] is a relative or absolute filesystem path to be used as the dataset repository root directory in which to store dataset data. When specifying an absolute path, the null authority form (e.g. file:///my/path) may be used. Alternatively, the authority section may be omitted entirely (e.g. file:/my/path). Either way, it is illegal to provide an authority (e.g. file://this-part-is-illegal/my/path). This storage backend will produce a DatasetRepository that stores both data and metadata on the local filesystem. See FileSystemDatasetRepository for more information.
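
To make the authority rule concrete, here is a small sketch that inspects the file: forms mentioned above with java.net.URI. It only illustrates how the JDK parses these strings. It is not Kite's validation logic, and the class and method names are hypothetical.

```java
import java.net.URI;

// Illustrative only: shows which file: URI forms carry an authority
// according to java.net.URI. Names here are hypothetical.
class FileUriAuthority {

    // True when the URI has an authority section, the form the text calls illegal.
    static boolean hasAuthority(String fileUri) {
        return URI.create(fileUri).getAuthority() != null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "file:///my/path",                     // null authority
            "file:/my/path",                       // authority omitted
            "file:foo/bar",                        // relative path
            "file://this-part-is-illegal/my/path"  // authority present
        };
        for (String s : samples) {
            System.out.println(s + " -> authority present: " + hasAuthority(s));
        }
    }
}
```

Note that java.net.URI treats the relative form file:foo/bar as an opaque URI, so getPath() returns null for it and the path is available through getSchemeSpecificPart() instead.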

HDFS FileSystem URIs

hdfs://[host]:[port]/[path] where [host] and [port] indicate the location of the Hadoop NameNode, and [path] is the dataset repository root directory in which to store dataset data. This form will load the Hadoop configuration information per the usual methods (i.e. searching the process's classpath for the various configuration files). This storage backend will produce a DatasetRepository that stores both data and metadata in HDFS. See FileSystemDatasetRepository for more information.
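
As a sketch of how the three parts of an HDFS storage component map onto a parsed URI, the following uses java.net.URI alone. The class name and the output format are hypothetical; nothing here connects to a NameNode or loads Hadoop configuration.

```java
import java.net.URI;

// Illustrative only: splits an hdfs: URI into the NameNode location and the
// repository root path using the JDK's URI parser.
class HdfsUriParts {

    static String describe(String hdfsUri) {
        URI u = URI.create(hdfsUri);
        // getHost/getPort give the NameNode location, getPath the root directory.
        return "namenode=" + u.getHost() + ":" + u.getPort() + " root=" + u.getPath();
    }

    public static void main(String[] args) {
        System.out.println(describe("hdfs://localhost:8020/data"));
    }
}
```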

Hive/HCatalog URIs

hive and hive://[metastore-host]:[metastore-port]/ will connect to the Hive MetaStore. Dataset locations will be determined by Hive as managed tables.

hive:/[path] and hive://[metastore-host]:[metastore-port]/[path] will also connect to the Hive MetaStore, but tables will be external and stored under [path]. The repository storage layout will be the same as hdfs and file repositories. HDFS connection options can be supplied by adding hdfs-host and hdfs-port query options to the URI (see examples).
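
The managed/external distinction and the query options described above can be sketched as follows. This is an illustration of the stated rules under the assumption that an empty or root path means managed tables. It is not Kite's actual URI matching code, and all names in it are hypothetical.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: classifies a Hive storage component (the part after
// "repo:") and extracts its query options.
class HiveUriSketch {

    // External when the URI has a non-root path, managed otherwise.
    static boolean isExternal(String storage) {
        if (storage.equals("hive")) {
            return false; // bare "hive": managed tables
        }
        String path = URI.create(storage).getPath();
        return path != null && !path.isEmpty() && !path.equals("/");
    }

    // Returns query options such as hdfs-host and hdfs-port, in URI order.
    static Map<String, String> queryOptions(String storage) {
        Map<String, String> options = new LinkedHashMap<>();
        String query = URI.create(storage).getQuery();
        if (query == null) {
            return options; // no "?" section present
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                options.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return options;
    }

    public static void main(String[] args) {
        String storage = "hive://meta-host:9083/path?hdfs-host=localhost&hdfs-port=8020";
        System.out.println("external: " + isExternal(storage));
        System.out.println("options: " + queryOptions(storage));
    }
}
```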

HBase URIs

repo:hbase:[zookeeper-host1]:[zk-port],[zookeeper-host2],... will open an HBase-backed DatasetRepository. The same URI may also be passed to openRandomAccess(URI) to obtain a RandomAccessDatasetRepository.
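
A minimal sketch of pulling the ZooKeeper quorum entries out of an HBase repository URI is shown below. It keeps each comma-separated entry as written ("host" or "host:port") and makes no assumption about how Kite applies the port. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: lists the ZooKeeper entries in a repo:hbase: URI.
class HBaseQuorumSketch {

    static final String PREFIX = "repo:hbase:";

    // Returns the quorum entries in the order they appear in the URI.
    static List<String> quorum(String repoUri) {
        if (!repoUri.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not an HBase repository URI: " + repoUri);
        }
        List<String> entries = new ArrayList<>();
        for (String entry : repoUri.substring(PREFIX.length()).split(",")) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(quorum("repo:hbase:zk1,zk2,zk3"));
    }
}
```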

Examples

repo:file:foo/bar - Stores data+metadata on the local filesystem in the directory ./foo/bar.
repo:file:///data - Stores data+metadata on the local filesystem in the directory /data.
repo:hdfs://localhost:8020/data - Same as above, but stores data+metadata on HDFS.
repo:hive - Connects to the Hive MetaStore and creates managed tables.
repo:hive://meta-host:9083/ - Connects to the Hive MetaStore at thrift://meta-host:9083 and creates managed tables. This only matches when the path is /. Any non-root path will match the external Hive URIs.
repo:hive:/path?hdfs-host=localhost&hdfs-port=8020 - Connects to the default Hive MetaStore and creates external tables stored in hdfs://localhost:8020/ at path. hdfs-host and hdfs-port are optional.
repo:hive://meta-host:9083/path?hdfs-host=localhost&hdfs-port=8020 - Connects to the Hive MetaStore at thrift://meta-host:9083/ and creates external tables stored in hdfs://localhost:8020/ at path. hdfs-host and hdfs-port are optional.
repo:hbase:zk1,zk2,zk3 - Connects to HBase via the given ZooKeeper quorum nodes.
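
The repo:[storage component] layout used by all of the examples above can be taken apart with java.net.URI, which parses any repo: URI as an opaque URI whose scheme-specific part is the storage component. The sketch below only demonstrates that structure. It is not how Kite resolves a backend, and the class and method names are hypothetical.

```java
import java.net.URI;

// Illustrative only: separates a repository URI into its storage component
// and the backend name that starts that component.
class RepoUriParts {

    // Returns everything after the leading "repo:".
    static String storageComponent(String repoUri) {
        URI u = URI.create(repoUri);
        if (!"repo".equals(u.getScheme())) {
            throw new IllegalArgumentException("Not a repo: URI: " + repoUri);
        }
        return u.getSchemeSpecificPart();
    }

    // Returns the backend name: file, hdfs, hive, or hbase in the examples.
    static String backend(String repoUri) {
        String storage = storageComponent(repoUri);
        int colon = storage.indexOf(':');
        return colon < 0 ? storage : storage.substring(0, colon);
    }

    public static void main(String[] args) {
        String[] samples = {
            "repo:file:///data",
            "repo:hdfs://localhost:8020/data",
            "repo:hive",
            "repo:hbase:zk1,zk2,zk3"
        };
        for (String s : samples) {
            System.out.println(backend(s) + " <- " + storageComponent(s));
        }
    }
}
```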

Parameters:
repositoryUri - The repository URI
Returns:
An appropriate implementation of DatasetRepository
Since:
0.8.0

openRandomAccess

public static RandomAccessDatasetRepository openRandomAccess(String uri)
Synonym for openRandomAccess(java.net.URI) for String URIs.

Parameters:
uri - a String URI
Returns:
An appropriate implementation of RandomAccessDatasetRepository
Throws:
IllegalArgumentException - If the String cannot be parsed into a valid URI.
Since:
0.9.0

openRandomAccess

public static RandomAccessDatasetRepository openRandomAccess(URI repositoryUri)

Synonym for open(java.net.URI) that returns a RandomAccessDatasetRepository.

This method connects to a DatasetRepository in the same way as open(java.net.URI), but returns an implementation of type RandomAccessDatasetRepository. Use it when you need RandomAccessDataset instances and their random access methods.

The format of a repository URI is as follows.

repo:[storage component]

The [storage component] indicates the underlying metadata and, in some cases, physical storage of the data, along with any options. The supported storage backends are:

HBase URIs

repo:hbase:[zookeeper-host1]:[zk-port],[zookeeper-host2],... identifies an HBase-backed DatasetRepository. Passing this URI to openRandomAccess(URI) returns it as a RandomAccessDatasetRepository.

Examples

repo:hbase:zk1,zk2,zk3 - Connects to HBase via the given ZooKeeper quorum nodes.

Parameters:
repositoryUri - The repository URI
Returns:
An appropriate implementation of RandomAccessDatasetRepository
Since:
0.9.0


Copyright © 2013–2014. All rights reserved.