All past Kite releases are documented on this page. Upcoming release dates can be found in JIRA.

Version 0.17.0

Release date: 9 October 2014

Version 0.17.0 contains the following notable changes:

  • The Kite examples now require the Cloudera Quickstart VM version 5.1 or later.
  • Kite 0.15.0 and 0.16.0 defaulted to an appender that wrote both Avro and Parquet files when writing to a Parquet dataset, doubling the I/O cost. The default has reverted to the behavior of 0.14.0 and earlier, which writes only Parquet. With this default, the DatasetWriter#flush() and DatasetWriter#sync() methods have no effect on a Parquet dataset, so data written to a Parquet dataset is not durable until DatasetWriter#close() returns successfully. Users who want the behavior found in 0.15.0 and 0.16.0 can set the property kite.parquet.non-durable-writes to false using the API or the update command in the CLI. With that setting, the DatasetWriter#flush() and DatasetWriter#sync() methods flush and sync the Avro version of the data, respectively. If a failure occurs before the writer is closed, the data can be recovered by reading the Avro version of the file and writing its records to Parquet. This recovery is a manual process.
  • Kite now supports namespaces for datasets. For Hive datasets, the Kite namespace maps to the Hive database where the table will be stored. Namespaces also changed the file system repository layout for local file and HDFS datasets. Dataset URIs used with previous releases will work unmodified. New datasets created using the DatasetRepository API (which moved to the SPI in 0.16.0) will not end up in the same location as in previous releases. The work-around is to use dataset URIs with the Datasets API. See the docs on dataset URIs for more details.
  • Users can now select the compression codec for Avro and Parquet datasets. See CDK-299 for more details.
  • The kite-data-hcatalog module has been renamed to kite-data-hive. A Maven relocation was put in place to prevent projects from breaking. However, we strongly encourage you to update your dependency to kite-data-hive in your projects. See CDK-452 for details.
  • Hive external table URIs no longer support relative locations. A URI with the pattern dataset:hive:examples/ratings now refers to a dataset named ratings in the namespace examples. You can create external URIs using the location query parameter. For example: dataset:hive:examples/ratings?location=/tmp/data/examples/ratings.
  • The Kite CLI tool has been renamed from dataset to kite-dataset. See CDK-670 for more information.
  • Kite will no longer use an embedded Hive MetaStore if it is not configured to connect to a remote MetaStore. Instead, Kite will throw an exception to avoid confusing behavior. See CDK-651 for more information.
  • You can now partition datasets by sub-fields. See CDK-435 for details.
  • File-based dataset names that are not alphanumeric (plus underscore) now issue a deprecation warning. Non-conforming names will be made illegal in a future release. See CDK-673.
  • There is a new experimental module, kite-minicluster, for running Hadoop services for testing and development purposes. The minicluster currently supports HDFS, Hive, HBase, and Flume services, and can be run directly from a Java program, or using the CLI. See CDK-679. The minicluster is experimental since its API and CLI are still subject to incompatible changes.
  • The Oozie portion of the demo example was removed. See CDK-605 for details.
  • Morphlines Library
    • Added morphline command that removes all record field values for which the field name and value match a blacklist but not a whitelist: removeValues
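
The Hive URI change listed above can be illustrated with a small sketch. This is not Kite's URI parser: parse_hive_dataset_uri is a hypothetical Python helper that handles only the two URI shapes quoted in these notes. It shows how the path now maps to a namespace and a dataset name, while an external location moves to the location query parameter.

```python
from urllib.parse import parse_qs


def parse_hive_dataset_uri(uri):
    """Split a dataset:hive: URI into namespace, dataset name, and location.

    Hypothetical helper for illustration only; it covers just the
    namespace/dataset form shown in the release notes.
    """
    prefix = "dataset:hive:"
    if not uri.startswith(prefix):
        raise ValueError("not a Hive dataset URI: " + uri)
    rest = uri[len(prefix):]
    # Everything before '?' is the path; everything after is the query string.
    path, _, query = rest.partition("?")
    parts = path.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError("expected namespace/dataset, got: " + path)
    namespace, name = parts
    # The location parameter is optional; None means a managed table.
    location = parse_qs(query).get("location", [None])[0]
    return {"namespace": namespace, "dataset": name, "location": location}


managed = parse_hive_dataset_uri("dataset:hive:examples/ratings")
external = parse_hive_dataset_uri(
    "dataset:hive:examples/ratings?location=/tmp/data/examples/ratings")
```

In this sketch the path segment examples is always read as the namespace, never as a relative directory, which mirrors the behavior change described in the bullet.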

The full change log is available from JIRA.

Version 0.16.0

Release date: 21 August 2014

Version 0.16.0 contains the following notable changes:

  • Kite datasets can be read from and written to by Apache Spark jobs. See the new Spark example for details on usage.
  • Added a CLI transform task for transforming entities read from a source dataset before storing them into a target dataset.
  • Added a CDH5 application parent POM that makes it easy to build Kite applications on CDH5 using Maven. The Spark example uses this parent POM.
  • The DatasetRepository and DatasetRepositories APIs have been moved to the SPI and deprecated from the public API. Users should move to the new Datasets API before the next release.
  • Kite will now properly generate Parquet Hive tables on Hive 0.13 and later.
  • Writing to a non-empty dataset or view from MapReduce or Crunch will now fail unless the write mode is explicitly set to append or overwrite. This is a change from the previous behavior which was to append. See CDK-572 and CDK-347 for details.
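
The write-mode rule in the last bullet can be summarized in a few lines. The sketch below is plain Python over in-memory lists, not the Kite MapReduce or Crunch API; the function name write_records and the mode strings are assumptions chosen for illustration.

```python
def write_records(existing, incoming, mode="default"):
    """Return the dataset contents after a write, following the 0.16.0 rule.

    Hypothetical model: 'existing' and 'incoming' are plain lists that
    stand in for a dataset's current records and the job output.
    """
    if mode == "overwrite":
        # Replace whatever was there.
        return list(incoming)
    if mode == "append":
        # Keep current records and add the new ones.
        return list(existing) + list(incoming)
    if mode == "default":
        # No explicit mode: only an empty target is acceptable.
        if existing:
            raise ValueError("dataset is not empty; choose append or overwrite")
        return list(incoming)
    raise ValueError("unknown write mode: " + mode)
```

Before 0.16.0 the default case behaved like append; the model above makes that choice explicit instead.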

The full change log is available from JIRA.

Version 0.15.0

Release date: 15 July 2014

Version 0.15.0 contains the following notable changes:

  • Kite artifacts are built against Apache Hadoop 2 and related projects, and are now available in Maven Central.
  • Added new introduction and concepts documentation.
  • Added a new Datasets convenience class for opening and working with Datasets, superseding DatasetRepositories.
  • Deprecated partition related methods in Dataset in favor of the views API.
  • Added a CLI copy task for copying datasets and also for dataset format conversion and data compaction.
  • Added an application parent POM that makes it easy to use Kite in a Maven project. The examples now use this parent POM.
  • Updated to Crunch 0.10.0
  • Morphlines Library
    • Added morphline command that parses an InputStream that contains protobuf data: readProtobuf (Rober Fiser via whoschek)
    • Added morphline command that extracts specific values from a protobuf object, akin to a simple form of XPath: extractProtobufPaths (Rober Fiser via whoschek)
    • Added morphline command that removes all record fields for which the field name matches a blacklist but not a whitelist: removeFields
    • Added optional parameters maxCharactersPerRecord and onMaxCharactersPerRecord to morphline command readCSV
    • Upgraded kite-morphlines-maxmind module from maxmind-db-0.3.1 to bug fix release maxmind-db-0.3.3
    • Upgraded kite-morphlines-core module from metrics-0.3.1 to bug fix release metrics-0.3.2
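
The blacklist and whitelist semantics of removeFields, listed above, can be sketched as follows. This is an illustration in Python, not the morphline command itself: remove_fields is a hypothetical function, and it assumes glob-style name patterns, which may differ from the predicate syntax the real command accepts.

```python
from fnmatch import fnmatchcase


def remove_fields(record, blacklist, whitelist=()):
    """Drop fields whose name matches the blacklist but not the whitelist.

    Assumption: patterns are shell-style globs matched case-sensitively.
    """
    def matches(name, patterns):
        return any(fnmatchcase(name, pattern) for pattern in patterns)

    return {
        name: value
        for name, value in record.items()
        # Keep a field unless it is blacklisted and not rescued by the whitelist.
        if not matches(name, blacklist) or matches(name, whitelist)
    }


record = {"tmp_a": 1, "tmp_keep": 2, "id": 3}
cleaned = remove_fields(record, blacklist=["tmp_*"], whitelist=["tmp_keep"])
```

Here tmp_a is removed, tmp_keep survives because the whitelist overrides the blacklist, and id is untouched because it never matched the blacklist.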

The full change log is available from JIRA.

Version 0.14.1

Release date: 23 May 2014

Version 0.14.1 is a bug-fix release with the following notable changes:

The full change log is available from JIRA.

Version 0.14.0

Release date: 13 May 2014

Version 0.14.0 has the following notable additions:

And the following bug fixes:

  • Updated CLI environment setup for CDH5.0 QuickStart VM
  • Fixed compatibility with CDH5 Hive, CDK-416
  • Fixed schema update validation bug, CDK-410
  • Added reconnect support when Hive connections drop, CDK-415

The full change log is available from JIRA.

Version 0.13.0

Release date: April 23, 2014

Version 0.13.0 has the following notable changes:

  • Added datasets command-line interface
    • Build Avro schemas from CSV data samples and Java classes
    • Create, view, and delete Kite datasets
    • Import CSV data into a dataset
  • Morphlines Library
    • Added morphline command that opens an HDFS file for read and returns a corresponding Java InputStream: openHdfsFile
    • Added morphline command that converts an InputStream to a byte array in main memory: readBlob
    • Upgraded kite-morphlines-saxon module from Saxon-HE-9.5.1-4 to Saxon-HE-9.5.1-5

The full change log is available from JIRA.

Version 0.12.1

Release date: March 18, 2014

Version 0.12.1 is a bug-fix release with the following notable changes:

  • Fixed slow job setup for Crunch when using large Datasets (thanks Gabriel Reid!)
  • Fixed CDK-328, Hive metastore concurrent access bug (thanks Karel Vervaeke!)
  • Clarified documentation for deleting datasets
  • Added better checking to catch errors earlier
    • Catch partition strategies that rely on missing data fields
    • Catch Hive-incompatible table, column, and partition names
  • Added warnings when creating FS or HBase datasets that are incompatible with Hive

The full change log is available from JIRA.

Version 0.12.0

Release date: March 10, 2014

Version 0.12.0 has the following notable changes:

  • MapReduce support for Datasets. New input and output formats (DatasetKeyInputFormat and DatasetKeyOutputFormat) make it possible to use Datasets with MapReduce.
  • Views API. There is an incompatible change in this release: RefineableView in the org.kitesdk.data package has been renamed to RefinableView (no ‘e’). Clients should update and recompile.
  • Morphlines Library
    • Added a sampling command that forwards each input record with a given probability to its child command: sample
    • Added a command that ignores all input records beyond the N-th record, akin to the Unix head command: head
    • Improved morphline import performance if all commands are specified via fully qualified class names.
    • Added several performance enhancements.
    • Added an example module that describes how to unit test Morphline config files and custom Morphline commands.
    • Improved documentation.
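
The behavior of the new sample and head commands can be modeled with Python generators. This is a sketch of the semantics described in the bullets, not morphline configuration; the function signatures and the optional seed parameter are assumptions made for illustration.

```python
import random
from itertools import islice


def head(records, limit):
    """Pass through the first 'limit' records and ignore the rest."""
    return islice(records, limit)


def sample(records, probability, seed=None):
    """Forward each record independently with the given probability.

    The seed is a hypothetical convenience so that runs can be repeated.
    """
    rng = random.Random(seed)
    for record in records:
        # rng.random() is in [0.0, 1.0), so 1.0 forwards everything
        # and 0.0 forwards nothing.
        if rng.random() < probability:
            yield record


first_three = list(head(iter(range(10)), 3))
everything = list(sample(range(5), 1.0))
nothing = list(sample(range(5), 0.0))
```

Chaining the two, for example head(sample(records, 0.1), 100), gives a bounded random subset, which is the kind of pipeline these commands are meant to support.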

The full change log is available from JIRA.

Version 0.11.0

Release date: February 6, 2014

Version 0.11.0 has the following notable changes:

  • Views API. A new API for expressing a subset of a dataset using logical constraints such as field matching or ranges. See the documentation for RefineableView for details. The HBase example has been extended to use a view for doing a partial scan of the table.
  • Dataset API. Removed APIs that were deprecated in 0.9.0. See the API Diffs for all the changes.
  • Upgrade to Crunch 0.9.0.
  • Morphlines Library
    • Added morphline command to read from Hadoop Avro Parquet Files: readAvroParquetFile
    • Added support for multi-character separators as well as regex separators to splitKeyValue command.
    • Added addEmptyStrings parameter to readCSV command to indicate whether or not to add zero length strings to the output field.
    • Upgraded kite-morphlines-solr-* modules from solr-4.6.0 to solr-4.6.1.
    • Upgraded kite-morphlines-json module from jackson-databind-2.2.1 to jackson-databind-2.3.1.
    • Upgraded kite-morphlines-metrics-servlets module from jetty-8.1.13.v20130916 to jetty-8.1.14.v20131031.
    • Upgraded kite-morphlines-saxon module from Saxon-HE-9.5.1-3 to Saxon-HE-9.5.1-4.
    • Fixed CDK-282 readRCFile command is broken (Prasanna Rajaperumal via whoschek).

The full change log is available from JIRA.

Version 0.10.1

Release date: January 13, 2014

Version 0.10.1 includes the following bug fixes:

  • CDK-249: Correctly add new partitions to the Hive MetaStore
  • CDK-260: Fixed the date-format partition function in expressions
  • CDK-266: Fixed file name uniqueness
  • CDK-273: Fixed spurious batch size warning in log4j integration
  • Fixed NoClassDefFoundError for Crunch in kite-tools module
  • Added more debug logging to Morphlines
  • Solr now fails fast if ZooKeeper has no Solr configuration

This patch release is fully compatible with 0.9.1, which uses the deprecated CDK packages.

The full change log is available from JIRA.

Version 0.10.0

Release date: December 9, 2013

Version 0.10.0 has the following notable changes:

Version 0.9.0

Release date: December 5, 2013

Version 0.9.0 has the following notable changes:

  • HBase support. There is a new experimental API for working with random access datasets stored in HBase. The API exposes get/put operations, but there is no support for scans from an arbitrary row in this release. (The latter will be added in 0.11.0 as a part of the forthcoming views API.) For usage information, consult the new HBase example.
  • Parquet. Datasets in Parquet format can now be written (and read) using Crunch.
  • CSV. Datasets in CSV format can now be read using the dataset APIs. See the compatibility example.
  • Dataset API. Removed APIs that were deprecated in 0.8.0. See the API Diffs for all the changes.
  • Morphlines Library
    • Added morphline command to read from RCFile: readRCFile (Prasanna Rajaperumal via whoschek)
    • Added morphline command to convert a morphline record to an Avro record: toAvro
    • Added morphline command that serializes Avro records into a byte array: writeAvroToByteArray
    • Added morphline command that returns Geolocation information for a given IP address, using an efficient in-memory Maxmind database lookup: geoIP
    • Added morphline command that parses a user agent string and returns structured higher level data like user agent family, operating system, version, and device type: userAgent
    • Added option to fail the following commands if a URI is syntactically invalid: extractURIComponents, extractURIComponent, extractURIQueryParameters
    • Upgraded cdk-morphlines-solr-core module from solr-4.4 to solr-4.6.
    • Upgraded cdk-morphlines-saxon module from saxon-HE-9.5.1-2 to saxon-HE-9.5.1-3.
    • Fixed race condition on parallel initialization of multiple Solr morphlines within the same JVM.
    • For enhanced safety, the readSequenceFile command no longer reuses the identity of Hadoop Writable objects.

The full change log is available from JIRA.

Version 0.8.1

Release date: October 23, 2013

Version 0.8.1 has the following notable changes:

  • Morphlines Library
    • Made xquery and xslt commands also compatible with woodstox-3.2.7 (not just woodstox-4.x).

Version 0.8.0

Release date: October 7, 2013

Version 0.8.0 has the following notable changes:

  • Dataset Repository URIs. Repositories can be referred to (and opened) by a URI. For example, repo:hdfs://namenode:8020/data specifies a Dataset Repository stored in HDFS. Dataset descriptors carry the repository URI.
  • Dataset API. Removed APIs that were deprecated in 0.7.0. Deprecated some constructors in favor of builders. See API Diffs for all the changes.
  • Upgrade to Parquet 1.2.0.
  • Morphlines Library
    • Added option for commands to register health checks (not just metrics) with the MorphlineContext.
    • Added registerJVMMetrics command that registers metrics that are related to the Java Virtual Machine with the MorphlineContext. For example, this includes metrics for garbage collection events, buffer pools, threads and thread deadlocks.
    • Added morphline commands to publish the metrics of all morphline commands to JMX, SLF4J and CSV files. The new commands are: startReportingMetricsToJMX, startReportingMetricsToSLF4J and startReportingMetricsToCSV.
    • Added EXPERIMENTAL cdk-morphlines-metrics-servlets maven module with new startReportingMetricsToHTTP command that exposes liveness status, health check status, metrics state and thread dumps via a set of HTTP URIs served by Jetty, using the AdminServlet.
    • Added cdk-morphlines-hadoop-core maven module with new downloadHdfsFile command for transferring HDFS files, e.g. to help with centralized configuration file management.
    • Added option to specify boost values to loadSolr command.
    • Added several performance enhancements.
    • Upgraded cdk-morphlines-solr-cell maven module from tika-1.3 to tika-1.4 to pick up some bug fixes.
    • Upgraded cdk-morphlines-core maven module from com.google.code.regexp-0.1.9 to 0.2.3 to pick up some bug fixes (Internally shaded version).
    • The constructor of AbstractCommand now has an additional parameter that refers to the CommandBuilder. The old constructor has been deprecated and will be removed in the next release.
    • The ISO8601_TIMEZONE grok pattern now allows the omission of minutes in a timezone offset.
    • Ensured morphline commands can refer to record field names containing arbitrary characters. Previously some commands could not refer to record field names containing the ‘.’ dot character. This limitation has been removed.
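
The ISO8601_TIMEZONE change listed above, which allows the minutes of a timezone offset to be omitted, can be sketched with a regular expression. The pattern below is an assumption written for illustration in Python; it is not copied from the grok pattern file shipped with the library.

```python
import re

# Assumed shape: either 'Z' or a signed two-digit hour, optionally
# followed by two-digit minutes with or without a ':' separator.
ISO8601_TIMEZONE = re.compile(r"Z|[+-]\d{2}(?::?\d{2})?")


def is_timezone(text):
    """Return True if the whole string is an accepted timezone designator."""
    return ISO8601_TIMEZONE.fullmatch(text) is not None
```

Under this sketch the strings Z, +02:00, +0200, and -07 are accepted, while +2 and +02:0 are rejected. The -07 case is the form the 0.8.0 change makes legal.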

The full change log is available from JIRA.

Version 0.7.0

Release date: September 5, 2013

Version 0.7.0 has the following notable changes:

  • Dataset API. Changes to make the API more consistent and better integrated with standard Java APIs like Iterator, Iterable, Flushable, and Closeable.
  • Java 7. CDK now also works with Java 7.
  • Upgrade to Avro 1.7.5.
  • Morphlines Library
    • Added commands splitKeyValue, extractURIComponent and toByteArray
    • Added outputFields parameter to the split command to support a list of column names similar to the readCSV command
    • Added tika-xmp maven module as a dependency to cdk-morphlines-solr-cell module
    • Added several performance enhancements
    • Upgraded cdk-morphlines-saxon module from saxon-HE-9.5.1-1 to saxon-HE-9.5.1-2

The full change log is available from JIRA.

Version 0.6.0

Release date: August 16, 2013

Version 0.6.0 has the following notable changes:

  • Dependency management. Solr and Lucene dependencies have been upgraded to 4.4.
  • Build system. The version of the Maven Javadoc plugin has been upgraded.

Version 0.5.0

Release date: August 1, 2013

Version 0.5.0 has the following notable changes:

  • Examples. All examples can be run from the user’s host machine, as an alternative to running from within the QuickStart VM guest.
  • CDK Maven Plugin. A new plugin with goals for manipulating datasets, and packaging, deploying, and running distributed applications.
  • Dependency management. Hadoop components are now marked as provided to give users more control. See the dependencies page.
  • Upgrade to Parquet 1.0.0 and Crunch 0.7.0.
  • Morphlines Library
    • Added commands xquery, xslt, convertHTML for reading, extracting and transforming XML and HTML with XPath, XQuery and XSLT
    • Added tokenizeText command that uses the embedded Solr/Lucene Analyzer library to generate tokens from a text string, without sending data to a Solr server
    • Added translate command that examines each value in a given field and replaces it with the replacement value defined in a given dictionary (a lookup hash table)
    • Disabled the quoting, multi-line field, and comment line features of the readCSV morphline command by default.
    • Added several performance enhancements

The full change log is available from JIRA.

Version 0.4.1

Release date: July 11, 2013

Version 0.4.1 has the following notable changes:

  • Morphlines Library
    • Expanded documentation and examples
    • Made SolrLocator and ZooKeeperDownloader collection alias aware
    • Added commands readJson and extractJsonPaths for reading, extracting, and transforming JSON files and JSON objects, in the same style as Avro support
    • Added commands split, findReplace, extractURIComponents, extractURIQueryParameters, decodeBase64
    • Fixed extractAvroPaths exception with flatten = true if path represents a non-leaf node of type Record
    • Added several performance enhancements

Version 0.4.0

Release date: June 22, 2013

Version 0.4.0 has the following notable changes:

  • Morphlines Library. A morphline is a rich configuration file that makes it easy to define an ETL transformation chain embedded in Hadoop components such as Search, Flume, MapReduce, Pig, Hive, and Sqoop.
  • An Oozie example. A new example of using Oozie to run a transformation job periodically.
  • QuickStart VM update. The examples now use version 4.3.0 of the Cloudera QuickStart VM.
  • Java package changes. The com.cloudera.data package and subpackages have been renamed to com.cloudera.cdk.data, and com.cloudera.cdk.flume has become com.cloudera.cdk.data.flume.
  • Finer-grained Maven modules. The module organization and naming has changed, including making all group IDs com.cloudera.cdk. Please see the new dependencies page for details.

The full change log is available from JIRA.

Version 0.3.0

Release date: June 6, 2013

Version 0.3.0 has the following notable changes:

  • Logging to a dataset. Using log4j as the logging API and Flume as the log transport, it is now possible to log application events to datasets.
  • Crunch support. Datasets can be exposed as Crunch sources and targets.
  • Date partitioning. New partitioning functions for partitioning datasets by year/month/day/hour/minute.
  • New examples. The new examples repository has examples for all these new features. The examples use the Cloudera QuickStart VM, version 4.2.0, to make running the examples as simple as possible.
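
The date partitioning idea listed above can be sketched as a function that derives year, month, day, hour, and minute values from a record's timestamp. This is an illustration in Python, not the CDK partitioning API: the function names, the use of UTC, and the key=value path layout are all assumptions.

```python
from datetime import datetime, timezone


def date_partition(timestamp):
    """Derive partition values from a timezone-aware datetime.

    Assumption: values are computed in UTC so that all writers agree.
    """
    utc = timestamp.astimezone(timezone.utc)
    return {
        "year": utc.year,
        "month": utc.month,
        "day": utc.day,
        "hour": utc.hour,
        "minute": utc.minute,
    }


def partition_path(timestamp):
    """Join the partition values into a hypothetical directory path."""
    values = date_partition(timestamp)
    return "/".join(f"{key}={value}" for key, value in values.items())


event_time = datetime(2013, 6, 6, 14, 5, tzinfo=timezone.utc)
path = partition_path(event_time)
```

Records whose timestamps fall in the same minute land in the same partition, so a query restricted to a time range only needs to read the matching directories.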

The full change log is available from JIRA.

Version 0.2.0

Release date: May 2, 2013

Version 0.2.0 has two major additions:

  • Experimental support for reading and writing datasets in Parquet format.
  • Support for storing dataset metadata in a Hive/HCatalog metastore.

The examples module has example code for both of these usages.

The full change log is available from JIRA.

Version 0.1.0

Release date: April 5, 2013

Version 0.1.0 is the first release of the CDK Data module. This is considered a beta release. As a sub-1.0.0 release, this version is not subject to the normal API compatibility guarantees. See the Compatibility Statement for information about API compatibility guarantees.


Version: 0.17.0. Last Published: 2014-10-09.
