java.lang.Object
  org.kitesdk.data.spi.AbstractDatasetRepository
    org.kitesdk.data.filesystem.FileSystemDatasetRepository
public class FileSystemDatasetRepository
A DatasetRepository that stores data in a Hadoop FileSystem.
Given a FileSystem, a root directory, and a MetadataProvider, this DatasetRepository implementation can load and store Datasets on both local filesystems and the Hadoop Distributed FileSystem (HDFS). Users may directly instantiate this class with the three dependencies above and then perform dataset-related operations using any of the provided methods. The primary methods of interest are create(String, org.kitesdk.data.DatasetDescriptor), load(String), and delete(String), which create a new dataset, load an existing dataset, or delete an existing dataset, respectively. Once a dataset has been created or loaded, users can invoke the appropriate Dataset methods to get a reader or writer as needed.
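The create/load/delete contract described above can be sketched with a toy in-memory model. This is not the Kite implementation and uses no Kite classes: RepositoryContractSketch, NotFoundException, and the String descriptor are hypothetical stand-ins, and IllegalStateException is only a placeholder for whatever exception the real repository throws on a duplicate name. For a missing name, delete here follows the documented return value and returns false.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

public class RepositoryContractSketch {

    /** Hypothetical stand-in for Kite's DatasetNotFoundException. */
    static class NotFoundException extends RuntimeException {
        NotFoundException(String name) {
            super("No such dataset: " + name);
        }
    }

    // Dataset name -> descriptor. A plain String stands in for DatasetDescriptor.
    private final Map<String, String> datasets = new LinkedHashMap<>();

    /** Registers a new dataset; a duplicate name is rejected. */
    String create(String name, String descriptor) {
        if (datasets.containsKey(name)) {
            // Placeholder exception type (assumption).
            throw new IllegalStateException("Dataset already exists: " + name);
        }
        datasets.put(name, descriptor);
        return descriptor;
    }

    /** Returns the stored descriptor, or throws if the name is unknown. */
    String load(String name) {
        String descriptor = datasets.get(name);
        if (descriptor == null) {
            throw new NotFoundException(name);
        }
        return descriptor;
    }

    boolean exists(String name) {
        return datasets.containsKey(name);
    }

    /** Returns true if a dataset was removed, false if none existed. */
    boolean delete(String name) {
        return datasets.remove(name) != null;
    }

    /** Returns a copy of the dataset names; empty when nothing is stored. */
    Collection<String> list() {
        return new ArrayList<>(datasets.keySet());
    }

    public static void main(String[] args) {
        RepositoryContractSketch repo = new RepositoryContractSketch();
        System.out.println(repo.list());
        repo.create("events", "format=avro");
        System.out.println(repo.exists("events"));
        System.out.println(repo.load("events"));
        System.out.println(repo.delete("events"));
        System.out.println(repo.delete("events"));
    }
}
```

The model keeps only names and descriptors; the real class additionally resolves a root directory on a FileSystem and delegates metadata storage to a MetadataProvider.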
DatasetWriter instances returned from this implementation have the following flush() method semantics. For Avro files, flush() invokes HDFS hflush, which guarantees that client buffers are flushed, so new readers will see all entries written up to that point. For Parquet files, flush() has no effect.
See Also: DatasetRepository, Dataset, DatasetDescriptor, PartitionStrategy, MetadataProvider
Nested Class Summary | |
---|---|
static class | FileSystemDatasetRepository.Builder: A fluent builder to aid in the construction of FileSystemDatasetRepository instances. |
Constructor Summary | |
---|---|
FileSystemDatasetRepository(Configuration conf, MetadataProvider metadataProvider) | Construct a FileSystemDatasetRepository for the given MetadataProvider for metadata storage. |
Method Summary | | |
---|---|---|
<E> Dataset<E> | create(String name, DatasetDescriptor descriptor) | Create a Dataset with the supplied descriptor. |
boolean | delete(String name) | Delete the named Dataset. |
boolean | exists(String name) | Checks if there is a Dataset in this repository named name. |
MetadataProvider | getMetadataProvider() | |
Collection<String> | list() | List the names of the Datasets in this DatasetRepository. |
<E> Dataset<E> | load(String name) | Get the latest version of a named Dataset. |
static PartitionKey | partitionKeyForPath(Dataset dataset, URI partitionPath) | Get a PartitionKey corresponding to a partition's filesystem path represented as a URI. |
String | toString() | |
<E> Dataset<E> | update(String name, DatasetDescriptor descriptor) | Update an existing Dataset to reflect the supplied descriptor. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
public FileSystemDatasetRepository(Configuration conf, MetadataProvider metadataProvider)
Construct a FileSystemDatasetRepository for the given MetadataProvider for metadata storage.
Parameters:
conf - a Configuration for FileSystem access
metadataProvider - the provider for metadata storage
Method Detail |
---|
public <E> Dataset<E> create(String name, DatasetDescriptor descriptor)
Description copied from interface: DatasetRepository
Create a Dataset with the supplied descriptor. Depending on the underlying dataset storage, some schema types or configurations may not be supported. If an illegal schema is supplied, an exception will be thrown by the implementing class. It is illegal to create more than one dataset with a given name. If a duplicate name is provided, an exception is thrown.
Parameters:
name - The fully qualified dataset name
descriptor - A descriptor that describes the schema and other properties of the dataset
public <E> Dataset<E> update(String name, DatasetDescriptor descriptor)
Description copied from interface: DatasetRepository
Update an existing Dataset to reflect the supplied descriptor. The common case is updating a dataset schema. Depending on the underlying dataset storage, some updates may not be supported, such as a change in format or partition strategy. Any attempt to make an unsupported or incompatible update will result in an exception being thrown and no change being made to the dataset.
Parameters:
name - The fully qualified dataset name
descriptor - A descriptor that describes the schema and other properties of the dataset
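The "exception thrown and no change made" guarantee can be sketched as a validate-then-commit step. The following is a minimal illustration under stated assumptions, not Kite's code: UpdateSketch and its Descriptor class are hypothetical, the only rule checked is "the format may not change", and UnsupportedOperationException is a placeholder for the real exception type.

```java
import java.util.HashMap;
import java.util.Map;

public class UpdateSketch {

    /** Hypothetical descriptor carrying only a format and a schema. */
    static final class Descriptor {
        final String format;
        final String schema;

        Descriptor(String format, String schema) {
            this.format = format;
            this.schema = schema;
        }
    }

    private final Map<String, Descriptor> datasets = new HashMap<>();

    void put(String name, Descriptor descriptor) {
        datasets.put(name, descriptor);
    }

    Descriptor get(String name) {
        return datasets.get(name);
    }

    /** Replaces the stored descriptor only if every check passes. */
    Descriptor update(String name, Descriptor updated) {
        Descriptor current = datasets.get(name);
        if (current == null) {
            throw new IllegalArgumentException("No such dataset: " + name);
        }
        // Validate first: this sketch rejects any change of format.
        if (!current.format.equals(updated.format)) {
            throw new UnsupportedOperationException(
                "Cannot change format from " + current.format + " to " + updated.format);
        }
        // Commit last, so a rejected update leaves the stored descriptor untouched.
        datasets.put(name, updated);
        return updated;
    }

    public static void main(String[] args) {
        UpdateSketch repo = new UpdateSketch();
        repo.put("events", new Descriptor("avro", "v1"));
        repo.update("events", new Descriptor("avro", "v2"));
        System.out.println(repo.get("events").schema);
        try {
            repo.update("events", new Descriptor("parquet", "v3"));
        } catch (UnsupportedOperationException e) {
            System.out.println(repo.get("events").schema);
        }
    }
}
```

Because all checks run before the single write, a failed update cannot leave the descriptor half-applied.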
public <E> Dataset<E> load(String name)
Description copied from interface: DatasetRepository
Get the latest version of a named Dataset. If no dataset with the provided name exists, a DatasetNotFoundException is thrown.
Parameters:
name - The name of the dataset.
public boolean delete(String name)
Description copied from interface: DatasetRepository
Delete the named Dataset. If no dataset with the provided name exists, a DatasetNotFoundException is thrown.
Parameters:
name - The name of the dataset.
Returns:
true if the dataset was successfully deleted, false if the dataset does not exist.
public boolean exists(String name)
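Description copied from interface: DatasetRepository
Checks if there is a Dataset in this repository named name.
Parameters:
name - a Dataset name to check the existence of
Returns:
true if a Dataset named name exists, false otherwise
public Collection<String> list()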
Description copied from interface: DatasetRepository
List the names of the Datasets in this DatasetRepository. If there is not at least one Dataset in this repository, an empty list will be returned.
Returns:
a Collection of Dataset names (Strings)
public static PartitionKey partitionKeyForPath(Dataset dataset, URI partitionPath)
Get a PartitionKey corresponding to a partition's filesystem path represented as a URI. If the path is not a valid partition, then IllegalArgumentException is thrown. Note that the partition does not have to exist.
Parameters:
dataset - the filesystem dataset
partitionPath - a directory path where the partition data is stored
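One way to picture how a path can be validated without touching the filesystem is to relativize the partition URI against the dataset root and inspect the remaining segments. This is a sketch of that idea using only java.net.URI, not the Kite implementation: PartitionPathSketch, partitionSegments, and the partitionDepth parameter are hypothetical, and the real method returns a PartitionKey rather than a list of strings.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class PartitionPathSketch {

    /**
     * Hypothetical helper: returns the path segments of partitionPath below
     * datasetRoot. partitionDepth is the number of partition fields (assumption).
     */
    static List<String> partitionSegments(URI datasetRoot, URI partitionPath, int partitionDepth) {
        URI relative = datasetRoot.relativize(partitionPath);
        // URI.relativize returns its argument unchanged when it is not under the base.
        if (relative.equals(partitionPath)) {
            throw new IllegalArgumentException(
                "Path " + partitionPath + " is not under dataset root " + datasetRoot);
        }
        List<String> segments = new ArrayList<>();
        for (String segment : relative.getPath().split("/")) {
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }
        // The root itself, or a path deeper than the partition strategy, is not a partition.
        if (segments.isEmpty() || segments.size() > partitionDepth) {
            throw new IllegalArgumentException("Not a valid partition path: " + partitionPath);
        }
        return segments;
    }

    public static void main(String[] args) {
        URI root = URI.create("file:/data/events");
        // No directory needs to exist; the check is purely on the URI text.
        System.out.println(partitionSegments(root, URI.create("file:/data/events/2014/03"), 2));
    }
}
```

The check never opens the directory, which is consistent with the note that the partition does not have to exist.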
public String toString()
Overrides:
toString in class Object
public MetadataProvider getMetadataProvider()
Returns:
the MetadataProvider being used by this repository.