Partitioned Datasets

You can partition your dataset on one or more attributes of an entity. Proper partitioning determines how records are grouped when Hadoop stores them, which can significantly improve query performance. You can partition your records using hash, identity, or date (year, month, day, hour) strategies.

Imagine that you are working the registration desk at a conference. There are 100 attendees. You might decide you can hand out the name badges more efficiently if you have two lines that distribute the badges alphabetically, from A-M and N-Z. The attendee knows to go to one line or the other. The conference worker at the N-Z station doesn't bother looking through the badges A-M, because she knows that the attendee is going to be in her set of badges. It's more efficient, because she only has to search half of the items to find the right one.

However, that assumes that there is an equal distribution of names across all letters of the alphabet. It could be that there just happen to be an unusually high number of attendees whose names start with "B," resulting in long waits in the first line and little activity in the second. Splitting into three lines (A-I, J-S, T-Z) might provide greater efficiency, sharing the workload and reducing the search time at any one station.

Fastest of all might be to have 100 stations where each individual could pick up her badge, but that would be a waste of time and effort for most of the staff. There is a "sweet spot" that maximizes efficiency with minimal resources.

When you store records in HDFS, the data is stored in "partitions" that provide coarse-grained organization. You can improve performance by providing hints to the system using a partitioning strategy. For example, if you most often retrieve your data using time-based queries, you can define the partitioning strategy by year, month, day, and hour, depending on the frequency with which you're capturing data. Your queries will run more quickly if the buckets used to organize your data correspond with the types of queries you make most often.

Partition Strategies

Partition strategies configure how Kite logically divides your dataset. Good partitioning enhances performance by making it easier to look in the right places (and, more importantly, to avoid looking in places where the data is not likely to be found). For example, if you partition a dataset by year and month and then send a query for data from February 2012, the system knows to search only the partition for February 2012: it doesn't have to search any other year, or any other month in that year. This can lead to significant performance improvements.

A partition strategy is defined using JSON notation.

Partition strategies have the following traits:
* Each must be defined in a valid JSON file.
* Each is created from a list of partition definitions.
* Each partition must define a source and a type, but the name is optional (if you do not define a name explicitly, it is given a generic name at configuration time).
* Valid types are:
  * year
  * month
  * day
  * hour
  * minute
  * identity
  * hash*

*The hash type also requires an integer value, representing the number of partitions into which the field entries should be distributed.
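
For example, a hash field definition might look something like this sketch, which spreads records across 16 buckets based on UserID (the buckets attribute name and the UserID_hash field name are illustrative; verify them against your version of Kite):

{
  "source" : "UserID",
  "type" : "hash",
  "name" : "UserID_hash",
  "buckets" : 16
}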

You can create a partition strategy by hand, but it’s usually easier to use the CLI to create the definition in valid JSON format and check it against your schema, then edit it if needed.

If you write your own field definition with only a source and a type, the system fills in the name automatically. If you later change that default name, the system internally still uses the name assigned when the partition strategy was created, in order to maintain consistency.
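
For example, you might define a field with only a source and a type and let the system supply the generic name (the exact generated name depends on your version of Kite):

{
  "source" : "EntryTime",
  "type" : "year"
}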

Defining a Partitioning Strategy

Partitioning strategies are described in JSON format. You have the option of providing a partitioning strategy when you first create a dataset; you cannot apply one to an existing dataset after it is created.

For example, here is a schema for a dataset that keeps track of visitors to a casino who belong to a loyalty club called the "High Rollers." The dataset stores information about their activity by UserID, recording when they enter the casino and how long they stay each time they visit.

HighRollers.avsc

{
  "type" : "record",
  "name" : "HighRollersClub",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "UserID",
    "type" : [ "null", "long" ],
    "doc" : "Unique customer identifier."
  }, {
    "name" : "EntryTime",
    "type" : "long",
    "doc" : "Timestamp at entry in milliseconds since Unix epoch"
  }, {
    "name" : "Duration",
    "type" : [ "null", "long" ],
    "doc" : "Time interval in milliseconds"
  } ]
}

Since the most common query against this data is time-based, you can define a partitioning strategy that gives Hadoop a hint that it should partition the information by year, month, and day. The partitioning strategy looks like this.

HighRollers.json

[ {
  "source" : "EntryTime",
  "type" : "year",
  "name" : "year"
}, {
  "source" : "EntryTime",
  "type" : "month",
  "name" : "month"
}, {
  "source" : "EntryTime",
  "type" : "day",
  "name" : "day"
} ]

You can also use the CLI command partition-config to generate the JSON file. See partition-config.
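
For example, a command along these lines should generate the strategy above from the schema (the field:type syntax and option names are an approximation; check the partition-config reference for exact usage):

dataset partition-config EntryTime:year EntryTime:month EntryTime:day -s HighRollers.avsc -o HighRollers.json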

Creating a Dataset That Uses a Partition Strategy

When you create a new dataset, you can specify the partition strategy along with the schema in the command-line arguments. Remember that you can apply a partition strategy only when creating the dataset, not to an existing dataset.

For example, you can create a dataset for our HighRollers club using this command.

dataset create HighRollersClub -s HighRollers.avsc -p HighRollers.json 

See create for more options when creating a dataset.
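
Once the dataset exists, records you write to it are stored in nested partition directories derived from the strategy. With the year, month, and day strategy above, a record whose EntryTime falls on October 14, 2014 would typically land in a path along these lines (the root path depends on where and how the dataset was created):

.../HighRollersClub/year=2014/month=10/day=14/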

You can also use Kite to manage datasets in HBase, using the same tools and APIs. HBase datasets work differently from datasets backed by files in HDFS in two ways. First, dataset partitioning is handled by HBase, and configuring it is a little different. Second, HBase stores data as a group of values, or cells, so you will need to configure how Kite divides your records into separate cells.