Most of the time, you don’t need to work with all of the records stored in a dataset. It is common to work with subsets, like events last month rather than all events. The Views API is a way to express constraints for the records that Kite loads.
If your dataset is partitioned, Kite determines which partitions to draw from based on a view’s constraints. You don’t have to specify the partitions yourself, because Kite automatically filters out partitions that cannot contain matching records.
Kite filters records so that you can express requirements for your data and have Kite enforce them. For example,
events.with("type") is a view of
events where each record loaded by the view will have a non-null value for the type field.
The View interface represents a logical collection of records in a dataset. It might seem as though a view is a subset of a dataset, but it is more accurate to think of a dataset as a view with no constraints applied.
You can use a view as the input for a MapReduce job or read its content directly by using
View#newReader to get a
DatasetReader that returns only records in the view.
View instances are immutable. You can pass the view to other operations, safe in the knowledge that it won’t be changed at all.
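As a rough illustration of what immutability means here, the sketch below models a view as a fixed list of constraints, where each refinement copies the list instead of changing it. The class ImmutableViewSketch and its constraintCount helper are hypothetical and exist only for this example; they are not part of Kite, and Kite's own implementation may work differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of an immutable view; this is NOT Kite's View class.
final class ImmutableViewSketch {
    private final List<Predicate<Map<String, Object>>> constraints;

    private ImmutableViewSketch(List<Predicate<Map<String, Object>>> constraints) {
        this.constraints = Collections.unmodifiableList(constraints);
    }

    // A view with no constraints plays the role of the whole dataset.
    static ImmutableViewSketch unconstrained() {
        return new ImmutableViewSketch(new ArrayList<>());
    }

    // Non-null constraint: the field must be present in the record.
    ImmutableViewSketch with(String field) {
        return refine(record -> record.get(field) != null);
    }

    // Equality constraint: the field must equal the given value.
    ImmutableViewSketch with(String field, Object value) {
        return refine(record -> value.equals(record.get(field)));
    }

    // Copies the constraint list, so the parent view is never modified.
    private ImmutableViewSketch refine(Predicate<Map<String, Object>> extra) {
        List<Predicate<Map<String, Object>>> copy = new ArrayList<>(constraints);
        copy.add(extra);
        return new ImmutableViewSketch(copy);
    }

    // True only if the record satisfies every constraint of this view.
    boolean includes(Map<String, Object> record) {
        for (Predicate<Map<String, Object>> constraint : constraints) {
            if (!constraint.test(record)) {
                return false;
            }
        }
        return true;
    }

    int constraintCount() {
        return constraints.size();
    }
}
```

In this model, calling with("type") on a view leaves the original view untouched and hands back a second object, which is why a view can be shared between operations without defensive copying.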
Refining a View: Selecting Records
You create a view by adding a constraint to an existing view or dataset using one of the following methods.
||Add a non-null constraint for a field||
||Add an equality constraint for a field||
||Add a set-inclusion constraint for a field||
||Add a >= constraint for a field||
||Add a > constraint for a field||
||Add a <= constraint for a field||
||Add a < constraint for a field||
Each method returns a new
View with the additional constraint added to the parent view.
For example, if you want to work with the
ratings dataset but only the records with a numeric rating of 5, you would use ratings.with("rating", 5).
Kite inspects each record and applies this constraint before passing records to your application. Only ratings with the value 5 are returned.
The object you pass as a constraint must match the data type of the field. For example, if the rating field is a String data type, passing the numeric value 5 as a constraint will throw an exception.
You can chain refinement method calls to create a more complicated view all at once. For example, if you’ve defined start and end variables, you can select a range of times during which ratings are submitted by chaining
from and to for the same record field.
ratings.from("time", start).to("time", end).with("rating", 5);
If the ratings dataset is partitioned by
time, then the view will automatically take advantage of dataset partitioning. Kite intelligently determines which partitions to draw from in response to this filter value. See Partitioned Datasets.
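The idea behind partition selection can be sketched as a simple range-overlap test. The example below is an illustrative model only: the PartitionPruningSketch class, its Partition type, and the assumption that each partition covers a closed range of time values are all hypothetical and do not reflect Kite's internal data structures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of partition pruning; NOT Kite's implementation.
final class PartitionPruningSketch {

    // Assumed for illustration: a partition holds records whose time
    // value falls in the closed range [minTime, maxTime].
    static final class Partition {
        final String name;
        final long minTime;
        final long maxTime;

        Partition(String name, long minTime, long maxTime) {
            this.name = name;
            this.minTime = minTime;
            this.maxTime = maxTime;
        }
    }

    // Returns the names of partitions that could contain a record with
    // start <= time <= end. All other partitions can be skipped entirely.
    static List<String> select(List<Partition> partitions, long start, long end) {
        List<String> selected = new ArrayList<>();
        for (Partition p : partitions) {
            boolean overlaps = p.maxTime >= start && p.minTime <= end;
            if (overlaps) {
                selected.add(p.name);
            }
        }
        return selected;
    }
}
```

Pruning of this kind only narrows down where to look. A selected partition can still hold records outside the requested range, so each record is still checked against the constraints before it is returned.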
Using Different Classes and Schemas: Selecting Columns
By default, views will use Avro’s
GenericRecord type when returning records. You can set the type that will be constructed by calling
View#asType(Class) and passing a class that is compatible with the dataset’s schema. Kite will automatically set the read schema based on the type you pass.
You can also set the read schema and still use generic records by calling
View#asSchema(Schema) with your read schema.
asSchema will load only the requested data fields from the dataset. Selecting fields avoids spending extra time deserializing some fields in Avro and enables Parquet to skip large portions of the underlying data. This can be used to drastically improve read speeds.
Working with Views
Loading a View
In addition to creating a view with the API, you can load a view from a view URI. A view URI is analogous to a dataset URI, where the scheme is
view: instead of
dataset: and constraints are added as query arguments.
The following code snippet creates a view for a dataset of movie ratings submitted by the critic with user_id 125.
View<Record> ratings = Datasets.load("view:hive:ratings?user_id=125");
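To make the relationship between a view URI and a dataset URI concrete, the sketch below splits a view URI into the dataset URI it refers to and its equality constraints. ViewUriSketch is a hypothetical helper written for this example and is not part of Kite. It handles only simple name=value query arguments and does no percent-decoding, so treat it as a reading aid and not as a description of how Kite parses URIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; NOT part of the Kite API.
final class ViewUriSketch {
    private static final String VIEW_SCHEME = "view:";

    // Swaps the view: scheme for dataset: and drops the query arguments.
    static String datasetUri(String viewUri) {
        if (!viewUri.startsWith(VIEW_SCHEME)) {
            throw new IllegalArgumentException("Not a view URI: " + viewUri);
        }
        String rest = viewUri.substring(VIEW_SCHEME.length());
        int query = rest.indexOf('?');
        return "dataset:" + (query < 0 ? rest : rest.substring(0, query));
    }

    // Collects name=value query arguments in the order they appear.
    static Map<String, String> constraints(String viewUri) {
        Map<String, String> result = new LinkedHashMap<>();
        int query = viewUri.indexOf('?');
        if (query < 0 || query == viewUri.length() - 1) {
            return result; // no query arguments, so no constraints
        }
        for (String pair : viewUri.substring(query + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                result.put(pair, ""); // argument given without a value
            } else {
                result.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return result;
    }
}
```

For the URI in the snippet above, this helper yields the dataset URI dataset:hive:ratings and a single constraint that maps user_id to the string 125.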
Inspecting a View
In some use cases, you might not need to return a set of values, but only to verify that values do or do not exist. For example, you might want to submit a MapReduce job only if there are values to process. These methods allow you to inspect a view at runtime.
The isEmpty method returns true if your
View contains no records.
The getUri method returns a URI for a
View that can be passed to
The includes method returns whether an entity matches a view’s constraints. That is, whether the record would be included in this
View if it were present in the dataset.
Working with Records in a View
You can interact with records in a view the way you would work with records in a full dataset. You can use
newReader and newWriter to get the same reader or writer objects, but they are restricted to operations on the view.
The newReader method creates an appropriate
DatasetReader that returns only records that match the view’s constraints.
See Restricted Views.
The deleteAll method deletes all entities in the dataset that match the view’s constraints.
If the delete cannot be completed cleanly, then the method throws an
UnsupportedOperationException. In the FileSystem implementation, for example, individual records cannot be deleted, only entire files. That means that Kite only allows you to delete an entire partition directory.
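The restriction can be pictured with a small model in which a delete succeeds only when every affected partition lies entirely inside the requested range. PartitionDeleteSketch, its Partition type, and the closed-range layout are assumptions made for this illustration; the FileSystem implementation's actual checks are not shown in this guide and may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of partition-aligned deletes; NOT Kite's implementation.
final class PartitionDeleteSketch {

    // Assumed for illustration: a partition covers field values in [min, max].
    static final class Partition {
        final long min;
        final long max;

        Partition(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    // Returns the partitions that remain after deleting values in [from, to].
    // Throws before removing anything if a partition is only partly covered,
    // because individual records cannot be deleted in this model.
    static List<Partition> deleteAll(List<Partition> partitions, long from, long to) {
        List<Partition> remaining = new ArrayList<>();
        for (Partition p : partitions) {
            boolean overlaps = p.max >= from && p.min <= to;
            boolean contained = p.min >= from && p.max <= to;
            if (overlaps && !contained) {
                throw new UnsupportedOperationException(
                    "Range does not align with partition [" + p.min + ", " + p.max + "]");
            }
            if (!overlaps) {
                remaining.add(p);
            }
        }
        return remaining;
    }
}
```

Because the method builds a new list and throws during the scan, a rejected delete leaves the original partition list unchanged, which matches the idea of a delete that either completes cleanly or does not happen.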
This method will delete records in a dataset, and will not delete the dataset itself. When called on a dataset, all records in the dataset will be removed. To delete a dataset in addition to the data stored in that dataset, use