Kite Views API
Most of the time, you don’t need to work with all of the records stored in a dataset. It is common to work with subsets, like events last month rather than all events. The Views API is a way to express constraints for the records that Kite loads.
If your dataset is partitioned, Kite determines which partitions to read based on a view’s constraints. You don’t have to specify the partitions yourself, because Kite automatically skips partitions that cannot contain matching records.
Kite filters records so that you can express requirements for your data and have Kite enforce them. For example, events.with("type") is a view of events where each record loaded by the view has a non-null value for the type field.
Views
Kite’s View interface represents a logical collection of records in a dataset. It might seem as though a view is a subset of a dataset, but it is more accurate to think of a dataset as a view with no constraints applied.
You can use a view as the input for a MapReduce job or read its content directly by using View#newReader to get a DatasetReader that returns only records in the view.
View instances are immutable. You can pass a view to other operations, knowing that it won’t be changed.
Refining a View: Selecting Records
You create a view by adding a constraint to an existing view or dataset using one of the following methods.
Method | Definition | Example |
with | Add a non-null constraint for a field | events.with("level") |
with | Add an equality constraint for a field | events.with("level", "FATAL") |
with | Add a set-inclusion constraint for a field | events.with("level", "WARNING", "ERROR") |
from | Add a >= constraint for a field | events.from("day", 1) |
fromAfter | Add a > constraint for a field | events.fromAfter("day", 4) |
to | Add a <= constraint for a field | events.to("year", 2014) |
toBefore | Add a < constraint for a field | events.toBefore("year", 2015) |
Each method returns a new View with the additional constraint added to those of the parent view (see Notes).
For example, if you want to work with only the records in the ratings dataset that have a numeric rating of 5, you would use the with method.
ratings.with("rating", 5);
Kite inspects each record and applies this constraint before passing records to your application. Only ratings with the value 5 are returned.
The object you pass as a constraint must match the data type. For example, if the rating field is a String data type, sending the value 5 as a constraint will throw an exception.
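The following is a minimal sketch of why refinement is safe: each call builds a new object and leaves the parent untouched. ViewSketch and its methods are hypothetical stand-ins written for illustration, not Kite's View implementation. Unlike Kite, the sketch does not validate constraint values against a schema, so a mismatched type simply fails to match instead of throwing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of an immutable view; NOT Kite's View implementation.
final class ViewSketch {
    private final List<Predicate<Map<String, Object>>> constraints;

    // A view with no constraints, comparable to a whole dataset.
    ViewSketch() {
        this(new ArrayList<Predicate<Map<String, Object>>>());
    }

    private ViewSketch(List<Predicate<Map<String, Object>>> constraints) {
        this.constraints = Collections.unmodifiableList(constraints);
    }

    // Equality constraint: returns a NEW view, the receiver is not modified.
    ViewSketch with(String field, Object value) {
        return refine(r -> value.equals(r.get(field)));
    }

    // Inclusive lower bound (>=) on a numeric field.
    ViewSketch from(String field, long min) {
        return refine(r -> r.get(field) instanceof Number
                && ((Number) r.get(field)).longValue() >= min);
    }

    // Inclusive upper bound (<=) on a numeric field.
    ViewSketch to(String field, long max) {
        return refine(r -> r.get(field) instanceof Number
                && ((Number) r.get(field)).longValue() <= max);
    }

    // Copy the parent's constraints and append one more.
    private ViewSketch refine(Predicate<Map<String, Object>> extra) {
        List<Predicate<Map<String, Object>>> copy = new ArrayList<>(constraints);
        copy.add(extra);
        return new ViewSketch(copy);
    }

    // True when the record satisfies every constraint of this view.
    boolean includes(Map<String, Object> record) {
        return constraints.stream().allMatch(c -> c.test(record));
    }

    int constraintCount() {
        return constraints.size();
    }

    public static void main(String[] args) {
        ViewSketch ratings = new ViewSketch();
        ViewSketch fiveStar = ratings.with("rating", 5);

        Map<String, Object> record = new HashMap<>();
        record.put("rating", 3);

        System.out.println(ratings.includes(record));   // true: parent has no constraints
        System.out.println(fiveStar.includes(record));  // false: rating is not 5
    }
}
```

Because the parent is never modified, the same base view can be refined in several different ways without the refinements interfering with each other.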
You can chain refinement method calls to create a more complicated view all at once. For example, if you’ve defined start and end variables, you can select a range of times during which ratings were submitted by chaining from and to for the same record field.
ratings.from("time", start).to("time", end).with("rating", 5);
If the ratings dataset is partitioned by time, then the view automatically takes advantage of the partitioning: Kite reads only the partitions that can contain records in the requested time range. See Partitioned Datasets.
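To make the idea of partition pruning concrete, here is a small sketch of the overlap test it relies on. The Partition class, its path strings, and the prune method are assumptions made for this example; they do not describe how Kite stores or plans partitions internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of partition pruning; not Kite's actual planner.
final class PartitionPruning {

    // A partition that holds records with time in the closed range [minTime, maxTime].
    static final class Partition {
        final String path;
        final long minTime;
        final long maxTime;

        Partition(String path, long minTime, long maxTime) {
            this.path = path;
            this.minTime = minTime;
            this.maxTime = maxTime;
        }
    }

    // Keep only the partitions whose range overlaps the view's range [start, end].
    // Records inside a kept partition still have to be filtered one by one.
    static List<Partition> prune(List<Partition> all, long start, long end) {
        List<Partition> kept = new ArrayList<>();
        for (Partition p : all) {
            if (p.maxTime >= start && p.minTime <= end) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Partition> partitions = Arrays.asList(
                new Partition("day=1", 0, 99),
                new Partition("day=2", 100, 199),
                new Partition("day=3", 200, 299));

        // A view equivalent to from("time", 150).to("time", 250)
        for (Partition p : prune(partitions, 150, 250)) {
            System.out.println(p.path);  // prints day=2, then day=3
        }
    }
}
```

The partition day=1 is never opened, which is where the time savings come from on large datasets.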
Using Different Classes and Schemas: Selecting Columns
By default, views will use Avro’s GenericRecord type when returning records. You can set the type that will be constructed by calling View#asType(Class) and passing a class that is compatible with the dataset’s schema. Kite will automatically set the read schema based on the type you pass.
You can also set the read schema and still use generic records by calling View#asSchema(Schema) with your read schema.
Both asType and asSchema will load only the requested data fields from the dataset. Selecting fields avoids spending extra time deserializing some fields in Avro and enables Parquet to skip large portions of the underlying data. This can be used to drastically improve read speeds.
Working with Views
Loading a View
In addition to creating a view with the API, you can load a view from a view URI. A view URI is analogous to a dataset URI, where the scheme is view: instead of dataset: and constraints are added as query arguments.
The following code snippet creates a view for a dataset of movie ratings submitted by the critic with user_id 125.
View<Record> ratings = Datasets.load("view:hive:ratings?user_id=125");
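As a rough illustration of how constraints ride along as query arguments, the sketch below pulls the key=value pairs out of a view URI using only java.net.URI. The ViewUriSketch class and its constraints method are hypothetical and handle only simple equality arguments; Kite's own URI handling is not shown here and may support richer syntax.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Kite API.
final class ViewUriSketch {

    // Extract simple key=value query arguments from a view URI.
    static Map<String, String> constraints(String viewUri) {
        URI uri = URI.create(viewUri);
        if (!"view".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Not a view URI: " + viewUri);
        }

        // "view:hive:ratings?..." is an opaque URI, so URI#getQuery() returns null.
        // The query has to be cut out of the scheme-specific part by hand.
        // Values are left percent-encoded in this sketch.
        String rest = uri.getRawSchemeSpecificPart();
        Map<String, String> result = new LinkedHashMap<>();
        int q = rest.indexOf('?');
        if (q < 0) {
            return result;  // no constraints
        }
        for (String pair : rest.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                result.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(constraints("view:hive:ratings?user_id=125"));  // {user_id=125}
    }
}
```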
Inspecting a View
In some use cases, you don’t need to read the records in a view, only to verify whether matching records exist. For example, you might want to submit a MapReduce job only if there are records for it to process. These methods allow you to inspect a view at runtime.
isEmpty
The isEmpty method reports whether your View is empty, that is, whether it contains no records at all.
getUri
The getUri method returns a URI for a View that can be passed to Datasets.load.
includes
The includes method returns whether an entity matches a view’s constraints. That is, whether the record would be included in this View if it were present in the Dataset.
Working with Records in a View
You can interact with records in a view the way you would work with records in a full dataset. You can use newReader and newWriter to get the same reader or writer objects, but they are restricted to operations on the view.
newReader
The newReader method creates an appropriate DatasetReader that returns only records that match the view’s constraints.
newWriter
The newWriter method creates an appropriate DatasetWriter that will write only records that match the view’s constraints. For more information, see Writing to Views.
See also Restricted Views.
deleteAll
The deleteAll method deletes all entities in the dataset that match the view’s constraints.
If the delete cannot be completed cleanly, then the method throws an UnsupportedOperationException. In the FileSystem implementation, for example, individual records cannot be deleted, only entire files. That means that Kite only allows you to delete entire partition directories.
This method deletes records in a dataset, but does not delete the dataset itself. When called on a dataset, it removes all records in the dataset. To delete a dataset along with the data stored in it, use Datasets.delete.
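The partition-alignment rule can be sketched as a simple containment check. This is only a model under stated assumptions: the Partition class, the planDelete method, and the use of closed time ranges are invented for the example, and the real FileSystem implementation may decide differently in edge cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a "whole partitions only" delete; not Kite's implementation.
final class PartitionDeleteSketch {

    // A partition directory holding records with time in [minTime, maxTime].
    static final class Partition {
        final String path;
        final long minTime;
        final long maxTime;

        Partition(String path, long minTime, long maxTime) {
            this.path = path;
            this.minTime = minTime;
            this.maxTime = maxTime;
        }
    }

    // Return the partition paths that a delete over [start, end] would remove.
    // Throws if the range covers only part of some partition, because that
    // would require deleting individual records inside a file.
    static List<String> planDelete(List<Partition> partitions, long start, long end) {
        List<String> toDelete = new ArrayList<>();
        for (Partition p : partitions) {
            boolean overlaps = p.maxTime >= start && p.minTime <= end;
            boolean contained = p.minTime >= start && p.maxTime <= end;
            if (overlaps && !contained) {
                throw new UnsupportedOperationException(
                        "Range covers only part of partition " + p.path);
            }
            if (contained) {
                toDelete.add(p.path);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<Partition> partitions = Arrays.asList(
                new Partition("day=1", 0, 99),
                new Partition("day=2", 100, 199),
                new Partition("day=3", 200, 299));

        System.out.println(planDelete(partitions, 0, 199));  // [day=1, day=2]

        try {
            planDelete(partitions, 50, 199);  // cuts through day=1
        } catch (UnsupportedOperationException e) {
            System.out.println("Cannot delete: " + e.getMessage());
        }
    }
}
```

In practice this means a view passed to deleteAll should be constrained on the partition fields so that its boundaries line up with partition boundaries.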
Notes:
- Views are created by refining other views because they are immutable and cannot be changed. This works like Java’s String methods, which always return new strings.