org.kitesdk.data.filesystem
Class FileSystemDatasetRepository

java.lang.Object
  extended by org.kitesdk.data.spi.AbstractDatasetRepository
      extended by org.kitesdk.data.filesystem.FileSystemDatasetRepository
All Implemented Interfaces:
DatasetRepository

public class FileSystemDatasetRepository
extends org.kitesdk.data.spi.AbstractDatasetRepository

A DatasetRepository that stores data in a Hadoop FileSystem.

Given a FileSystem, a root directory, and a MetadataProvider, this DatasetRepository implementation can load and store Datasets on both local filesystems and the Hadoop Distributed FileSystem (HDFS). Users may directly instantiate this class with the three dependencies above and then perform dataset-related operations using any of the provided methods. The primary methods of interest are create(String, org.kitesdk.data.DatasetDescriptor), load(String), and delete(String), which create a new dataset, load an existing dataset, and delete an existing dataset, respectively. Once a dataset has been created or loaded, users can invoke the appropriate Dataset methods to get a reader or writer as needed.
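The create, load, and delete workflow described above can be illustrated with a small in-memory model. This is only a sketch of the contract, not Kite code: InMemoryRepository and createOrLoad are hypothetical names, descriptors are reduced to plain strings, and the exception types are standard-library stand-ins for the ones Kite throws.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Toy in-memory model of the repository contract; not part of Kite. */
public class RepositoryLifecycleSketch {

  /** Hypothetical stand-in: maps dataset names to descriptor strings. */
  static class InMemoryRepository {
    private final Map<String, String> datasets = new LinkedHashMap<>();

    /** Rejects duplicate names, mirroring the documented create() rule. */
    String create(String name, String descriptor) {
      if (datasets.containsKey(name)) {
        throw new IllegalStateException("Dataset already exists: " + name);
      }
      datasets.put(name, descriptor);
      return descriptor;
    }

    /** Throws for unknown names (stand-in for DatasetNotFoundException). */
    String load(String name) {
      String descriptor = datasets.get(name);
      if (descriptor == null) {
        throw new NoSuchElementException("No such dataset: " + name);
      }
      return descriptor;
    }

    /** Returns true when a dataset was actually removed. */
    boolean delete(String name) {
      return datasets.remove(name) != null;
    }

    boolean exists(String name) {
      return datasets.containsKey(name);
    }

    /** Returns a copy of the names; empty when nothing is stored. */
    Collection<String> list() {
      return new ArrayList<>(datasets.keySet());
    }
  }

  /** Loads the dataset if it is already present, otherwise creates it. */
  static String createOrLoad(InMemoryRepository repo, String name, String descriptor) {
    return repo.exists(name) ? repo.load(name) : repo.create(name, descriptor);
  }

  public static void main(String[] args) {
    InMemoryRepository repo = new InMemoryRepository();
    repo.create("events", "avro");
    System.out.println(repo.list());
    System.out.println(createOrLoad(repo, "events", "parquet"));
    System.out.println(repo.delete("events"));
    System.out.println(repo.exists("events"));
  }
}
```

Checking exists(String) before create avoids the duplicate-name exception when the same setup code may run more than once.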

DatasetWriter instances returned from this implementation have the following flush() method semantics. For Avro files, flush() will invoke HDFS hflush, which guarantees that client buffers are flushed, so new readers will see all entries written up to that point. For Parquet files, flush() has no effect.

See Also:
DatasetRepository, Dataset, DatasetDescriptor, PartitionStrategy, MetadataProvider

Nested Class Summary
static class FileSystemDatasetRepository.Builder
          A fluent builder to aid in the construction of FileSystemDatasetRepository instances.
 
Constructor Summary
FileSystemDatasetRepository(Configuration conf, MetadataProvider metadataProvider)
          Construct a FileSystemDatasetRepository that uses the given MetadataProvider for metadata storage.
 
Method Summary
<E> Dataset<E>
create(String name, DatasetDescriptor descriptor)
          Create a Dataset with the supplied descriptor.
 boolean delete(String name)
          Delete the named Dataset.
 boolean exists(String name)
          Checks if there is a Dataset in this repository named name.
 MetadataProvider getMetadataProvider()
           
 Collection<String> list()
          List the names of the Datasets in this DatasetRepository.
<E> Dataset<E>
load(String name)
          Get the latest version of a named Dataset.
static PartitionKey partitionKeyForPath(Dataset dataset, URI partitionPath)
          Get a PartitionKey corresponding to a partition's filesystem path represented as a URI.
 String toString()
           
<E> Dataset<E>
update(String name, DatasetDescriptor descriptor)
          Update an existing Dataset to reflect the supplied descriptor.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

FileSystemDatasetRepository

public FileSystemDatasetRepository(Configuration conf,
                                   MetadataProvider metadataProvider)
Construct a FileSystemDatasetRepository that uses the given MetadataProvider for metadata storage.

Parameters:
conf - a Configuration for FileSystem access
metadataProvider - the provider for metadata storage
Since:
0.8.0
Method Detail

create

public <E> Dataset<E> create(String name,
                             DatasetDescriptor descriptor)
Description copied from interface: DatasetRepository
Create a Dataset with the supplied descriptor. Depending on the underlying dataset storage, some schema types or configurations may not be supported. If an illegal schema is supplied, an exception will be thrown by the implementing class. It is illegal to create more than one dataset with a given name. If a duplicate name is provided, an exception is thrown.

Parameters:
name - The fully qualified dataset name
descriptor - A descriptor that describes the schema and other properties of the dataset
Returns:
The newly created dataset

update

public <E> Dataset<E> update(String name,
                             DatasetDescriptor descriptor)
Description copied from interface: DatasetRepository
Update an existing Dataset to reflect the supplied descriptor. The common case is updating a dataset schema. Depending on the underlying dataset storage, some updates may not be supported, such as a change in format or partition strategy. Any attempt to make an unsupported or incompatible update will result in an exception being thrown and no change being made to the dataset.

Parameters:
name - The fully qualified dataset name
descriptor - A descriptor that describes the schema and other properties of the dataset
Returns:
The updated dataset
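The guarantee that a rejected update leaves the dataset unchanged can be pictured as "validate first, then commit". The sketch below is an illustration under assumptions, not Kite's implementation: MiniDescriptor and UpdateSketch are hypothetical, a descriptor is reduced to a format name plus a schema string, and a format change is treated as the only unsupported update.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of an all-or-nothing descriptor update; not part of Kite. */
public class UpdateSketch {

  /** Hypothetical, simplified descriptor: a format name plus a schema string. */
  static final class MiniDescriptor {
    final String format;
    final String schema;

    MiniDescriptor(String format, String schema) {
      this.format = format;
      this.schema = schema;
    }
  }

  private final Map<String, MiniDescriptor> descriptors = new HashMap<>();

  void create(String name, MiniDescriptor descriptor) {
    descriptors.put(name, descriptor);
  }

  MiniDescriptor get(String name) {
    return descriptors.get(name);
  }

  /** Runs every check before touching stored state, so a failure changes nothing. */
  MiniDescriptor update(String name, MiniDescriptor updated) {
    MiniDescriptor current = descriptors.get(name);
    if (current == null) {
      throw new IllegalArgumentException("No such dataset: " + name);
    }
    if (!current.format.equals(updated.format)) {
      throw new UnsupportedOperationException(
          "Format change not supported: " + current.format + " -> " + updated.format);
    }
    // All checks passed: only now is the stored descriptor replaced.
    descriptors.put(name, updated);
    return updated;
  }

  public static void main(String[] args) {
    UpdateSketch repo = new UpdateSketch();
    repo.create("events", new MiniDescriptor("avro", "schema-v1"));
    repo.update("events", new MiniDescriptor("avro", "schema-v2"));
    System.out.println(repo.get("events").schema);
  }
}
```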

load

public <E> Dataset<E> load(String name)
Description copied from interface: DatasetRepository
Get the latest version of a named Dataset. If no dataset with the provided name exists, a DatasetNotFoundException is thrown.

Parameters:
name - The name of the dataset.

delete

public boolean delete(String name)
Description copied from interface: DatasetRepository
Delete the named Dataset. If no dataset with the provided name exists, a DatasetNotFoundException is thrown.

Parameters:
name - The name of the dataset.
Returns:
true if the dataset was successfully deleted, false if the dataset does not exist.

exists

public boolean exists(String name)
Description copied from interface: DatasetRepository
Checks if there is a Dataset in this repository named name.

Parameters:
name - a Dataset name to check the existence of
Returns:
true if a Dataset named name exists, false otherwise

list

public Collection<String> list()
Description copied from interface: DatasetRepository
List the names of the Datasets in this DatasetRepository. If the repository contains no Datasets, an empty list is returned.

Returns:
a Collection of Dataset names (Strings)

partitionKeyForPath

public static PartitionKey partitionKeyForPath(Dataset dataset,
                                               URI partitionPath)
Get a PartitionKey corresponding to a partition's filesystem path represented as a URI. If the path is not a valid partition, then IllegalArgumentException is thrown. Note that the partition does not have to exist.

Parameters:
dataset - the filesystem dataset
partitionPath - a directory path where the partition data is stored
Returns:
a partition key representing the partition at the given path
Since:
0.4.0
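As a rough illustration of mapping a partition path to key values, the sketch below parses a path relative to a dataset root without touching the filesystem, which is why the partition does not have to exist. It assumes a name=value directory layout (for example year=2014/month=03). That layout, the partitionValues helper, and the example URIs are assumptions made for illustration; the sketch returns a plain map rather than a PartitionKey.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy path parser; assumes name=value partition directories. Not part of Kite. */
public class PartitionPathSketch {

  /**
   * Returns the partition values encoded in partitionPath, in directory order.
   * Throws IllegalArgumentException if the path is not a partition under datasetRoot.
   */
  static Map<String, String> partitionValues(URI datasetRoot, URI partitionPath) {
    String root = datasetRoot.getPath();
    String path = partitionPath.getPath();
    if (root == null || path == null) {
      throw new IllegalArgumentException("Both URIs must have a path component");
    }
    if (!root.endsWith("/")) {
      root = root + "/";
    }
    if (!path.startsWith(root)) {
      throw new IllegalArgumentException("Not under the dataset root: " + partitionPath);
    }
    String relative = path.substring(root.length());
    if (relative.endsWith("/")) {
      relative = relative.substring(0, relative.length() - 1);
    }
    if (relative.isEmpty()) {
      throw new IllegalArgumentException("No partition directories in: " + partitionPath);
    }
    Map<String, String> values = new LinkedHashMap<>();
    for (String segment : relative.split("/")) {
      int eq = segment.indexOf('=');
      // Each directory must look like name=value with both parts non-empty.
      if (eq <= 0 || eq == segment.length() - 1) {
        throw new IllegalArgumentException("Not a partition directory: " + segment);
      }
      values.put(segment.substring(0, eq), segment.substring(eq + 1));
    }
    return values;
  }

  public static void main(String[] args) {
    URI root = URI.create("hdfs://namenode/data/events");
    URI partition = URI.create("hdfs://namenode/data/events/year=2014/month=03");
    System.out.println(partitionValues(root, partition));
  }
}
```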

toString

public String toString()
Overrides:
toString in class Object

getMetadataProvider

public MetadataProvider getMetadataProvider()
Returns:
the MetadataProvider being used by this repository.
Since:
0.2.0


Copyright © 2013–2014. All rights reserved.