public class CrunchDatasets extends Object
A helper class for exposing Datasets and Views as Crunch ReadableSources or Targets.
Constructor and Description |
---|
CrunchDatasets() |
Modifier and Type | Method and Description |
---|---|
static <E> ReadableSource<E> | asSource(String uri, Class<E> type) |
static <E> ReadableSource<E> | asSource(URI uri, Class<E> type) |
static <E> ReadableSource<E> | asSource(View<E> view) Expose the given View as a Crunch ReadableSource. |
static Target | asTarget(String uri) |
static Target | asTarget(URI uri) |
static <E> Target | asTarget(View<E> view) |
static <E> PCollection<E> | partition(PCollection<E> collection, Dataset<E> dataset) Partitions collection to be stored efficiently in dataset. |
static <E> PCollection<E> | partition(PCollection<E> collection, View<E> view) Partitions collection to be stored efficiently in View. |
static <E> PCollection<E> | partition(PCollection<E> collection, View<E> view, int numWriters) Partitions collection to be stored efficiently in View. |
static <E> PCollection<E> | partition(PCollection<E> collection, View<E> view, int numWriters, int numPartitionWriters) Partitions collection to be stored efficiently in View. |
public static <E> ReadableSource<E> asSource(View<E> view)
Expose the given View as a Crunch ReadableSource.
Type Parameters:
E - the type of entity produced by the source
Parameters:
view - the view to read from
Returns:
a ReadableSource for the view

public static <E> ReadableSource<E> asSource(URI uri, Class<E> type)
Type Parameters:
E - the type of entity produced by the source
Parameters:
uri - the URI of the view or dataset to read from
type - the Java type of the entities in the dataset
Returns:
a ReadableSource for the view

public static <E> ReadableSource<E> asSource(String uri, Class<E> type)
Type Parameters:
E - the type of entity produced by the source
Parameters:
uri - the URI of the view or dataset to read from
type - the Java type of the entities in the dataset
Returns:
a ReadableSource for the view

public static <E> Target asTarget(View<E> view)
Type Parameters:
E - the type of entity stored in the view
Parameters:
view - the view to write to
Returns:
a Target for the view

public static Target asTarget(String uri)
Parameters:
uri - the dataset or view URI
Returns:
a Target for the dataset or view

public static Target asTarget(URI uri)
Parameters:
uri - the dataset or view URI
Returns:
a Target for the dataset or view

public static <E> PCollection<E> partition(PCollection<E> collection, View<E> view)
Partitions collection to be stored efficiently in View.
This restructures the parallel collection so that all of the entities that will be stored in a given partition are processed by the same writer.
Type Parameters:
E - the type of entities in the collection and underlying dataset
Parameters:
collection - a collection of entities
view - a View of a dataset to partition the collection for

public static <E> PCollection<E> partition(PCollection<E> collection, Dataset<E> dataset)
Partitions collection to be stored efficiently in dataset.
This restructures the parallel collection so that all of the entities that will be stored in a given partition are processed by the same writer.
Type Parameters:
E - the type of entities in the collection and underlying dataset
Parameters:
collection - a collection of entities
dataset - a dataset to partition the collection for

public static <E> PCollection<E> partition(PCollection<E> collection, View<E> view, int numWriters)
Partitions collection to be stored efficiently in View.
This restructures the parallel collection so that all of the entities that will be stored in a given partition are processed by the same writer.
If the dataset is not partitioned, then this will structure all of the entities to produce a number of files equal to numWriters.
Type Parameters:
E - the type of entities in the collection and underlying dataset
Parameters:
collection - a collection of entities
view - a View of a dataset to partition the collection for
numWriters - the number of writers that should be used
See Also:
partition(PCollection, View)
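
The routing rule described for the numWriters overload can be modeled in plain Java. This is only an illustrative sketch, not Kite's or Crunch's implementation: WriterAssignmentSketch, assign, and the partitionKey function are hypothetical names. partitionKey stands in for the dataset's partition strategy, and passing null models an unpartitioned dataset.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

/** Plain-Java model of the routing described above; hypothetical, not Kite code. */
final class WriterAssignmentSketch {

  /**
   * Assigns each entity to one of numWriters writers (numbered from 0).
   *
   * @param partitionKey hypothetical stand-in for the partition strategy;
   *                     null models an unpartitioned dataset
   */
  static <E> Map<Integer, List<E>> assign(
      List<E> entities, Function<? super E, ?> partitionKey, int numWriters) {
    if (numWriters < 1) {
      throw new IllegalArgumentException("numWriters must be at least 1");
    }
    Map<Integer, List<E>> byWriter = new HashMap<>();
    int next = 0;
    for (E entity : entities) {
      int writer;
      if (partitionKey == null) {
        // Unpartitioned: round-robin, so up to numWriters outputs are produced.
        writer = next++ % numWriters;
      } else {
        // Partitioned: entities with the same key always reach the same writer.
        int hash = Objects.hashCode(partitionKey.apply(entity));
        writer = Math.floorMod(hash, numWriters);
      }
      byWriter.computeIfAbsent(writer, w -> new ArrayList<>()).add(entity);
    }
    return byWriter;
  }
}
```

In this model, two partitions may share a writer when their keys hash to the same slot, but a single partition is never split across writers, which is the property the description above relies on.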
public static <E> PCollection<E> partition(PCollection<E> collection, View<E> view, int numWriters, int numPartitionWriters)
Partitions collection to be stored efficiently in View.
This restructures the parallel collection so that all of the entities that will be stored in a given partition are evenly distributed across the specified number of writers, numPartitionWriters.
If the dataset is not partitioned, then this will structure all of the entities to produce a number of files equal to numWriters.
Type Parameters:
E - the type of entities in the collection and underlying dataset
Parameters:
collection - a collection of entities
view - a View of a dataset to partition the collection for
numWriters - the number of writers that should be used
numPartitionWriters - the number of writers that data for a single partition will be distributed across
See Also:
partition(PCollection, View)
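
The effect of numPartitionWriters can likewise be modeled in plain Java. The sketch below is an assumption-laden illustration rather than the library's algorithm: PartitionSpreadSketch, spread, and partitionKey are hypothetical names, and round-robin is used here simply as one way to get an even distribution within each partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java model of spreading one partition over several writers; hypothetical. */
final class PartitionSpreadSketch {

  /**
   * Returns, for each partition key, numPartitionWriters buckets holding
   * that partition's entities, filled round-robin.
   */
  static <E, K> Map<K, List<List<E>>> spread(
      List<E> entities,
      Function<? super E, ? extends K> partitionKey,
      int numPartitionWriters) {
    if (numPartitionWriters < 1) {
      throw new IllegalArgumentException("numPartitionWriters must be at least 1");
    }
    Map<K, List<List<E>>> buckets = new LinkedHashMap<>();
    Map<K, Integer> seen = new HashMap<>();
    for (E entity : entities) {
      K key = partitionKey.apply(entity);
      // Lazily create one bucket per writer for a newly seen partition.
      List<List<E>> writers = buckets.computeIfAbsent(key, k -> {
        List<List<E>> fresh = new ArrayList<>();
        for (int i = 0; i < numPartitionWriters; i++) {
          fresh.add(new ArrayList<>());
        }
        return fresh;
      });
      // Per-partition counter (1-based) selects the next writer in turn.
      int count = seen.merge(key, 1, Integer::sum);
      writers.get((count - 1) % numPartitionWriters).add(entity);
    }
    return buckets;
  }
}
```

With round-robin assignment, bucket sizes within one partition differ by at most one, which is one reading of "evenly distributed". A large partition is therefore no longer handled by a single writer, at the cost of producing more outputs per partition.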
Copyright © 2013–2015. All rights reserved.