All past Kite releases are documented on this page. Upcoming release dates can be found in JIRA.

Version 1.0.0

Release date: 23 February 2015

Version 1.0.0 contains the following notable changes:

  • All deprecated classes and methods have been removed from the data modules.
  • DatasetWriter no longer has a flush() or a sync() method (see CDK-892). Some (but not all) implementations of DatasetWriter implement the org.kitesdk.data.Flushable or org.kitesdk.data.Syncable interfaces, so you need to use the following idiom to flush the stream. (Calling sync() is similar.)

        DatasetWriter<Record> writer = ...
        if (writer instanceof Flushable) {
          ((Flushable) writer).flush();
        }

  • Avro schemas are now stored in HDFS for Hive datasets. This overcomes the 4K limit on schema size and provides better schema evolution checking, since all versions of the schema are stored. See CDK-969.
  • Removing a partition from a dataset now removes the partition from the Hive metastore (see CDK-924).
  • Morphlines Library
    • Added support for nested documents aka child documents to loadSolr morphline command.

The full change log
is available from JIRA.

Version 0.18.0

Release date: 11 February 2015

Version 0.18.0 contains the following notable changes:

  • There is a new kite-dataset command, tar-import, for importing the contents of a tarfile into a dataset.
  • The delete command can now delete the data contained in a view.
  • The csv-schema and csv-import commands now take a --header argument for specifying the CSV header.
  • The restriction on filesystem dataset names is now enforced: attempting to create a dataset with a non-alphanumeric
    name (underscores are valid too) results in an error.
  • Morphlines Library
    • Upgraded kite-morphlines-solr-* to solr-4.10.3
    • Upgraded kite-morphlines-tika-* to tika-1.5 (in sync with solr-4.10.3)
    • Avoid NPE in geoIP morphline command if IP is not found (Santiago Mola via whoschek)
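
The naming restriction on filesystem datasets noted above can be illustrated with a small check. This is only a sketch: the class DatasetNameCheck and its method are hypothetical and not part of the Kite API, and it assumes that "alphanumeric" means ASCII letters and digits.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the Kite API. */
final class DatasetNameCheck {
    // Assumption: valid names consist only of ASCII letters, digits, and underscores.
    private static final Pattern VALID_NAME = Pattern.compile("[A-Za-z0-9_]+");

    /** Returns true if the name is non-empty and contains only allowed characters. */
    static boolean isValidName(String name) {
        return name != null && VALID_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"ratings", "user_events_2015", "my-dataset", "a.b"};
        for (String name : candidates) {
            System.out.println(name + " -> " + (isValidName(name) ? "accepted" : "rejected"));
        }
    }
}
```

Checking names before calling Kite lets an application report a clear message instead of relying on the error raised at dataset creation.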

The full change log
is available from JIRA.

Version 0.17.1

Release date: 9 December 2014

Version 0.17.1 is a bug-fix release with the following notable changes:

  • Kite data
    • CSV imports will now use the dataset schema to read CSV records rather than inferring a schema from the data (see CDK-800).
    • CSV floats or doubles read with an integer or long type now result in a NumberFormatException during import. Previously, this was caught by checking the inferred schema against the dataset schema, but this method was unreliable. See CDK-801 for more information.
  • Morphlines Library
    • Added support for deleting documents stored in Solr by unique id as well as by query
    • Added documentation on how to update a subset of fields of an existing document stored in Solr: partial document updates
    • Added ability to register custom Java extension functions with xquery and xslt morphline commands: xquery morphline command.
    • Enhanced documentation for xquery morphline command.
    • Upgraded kite-morphlines-maxmind module from maxmind-db-0.3.3 to bug fix release maxmind-db-1.0.0
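
The NumberFormatException mentioned above matches how standard Java integer parsing treats decimal text. The sketch below shows that behavior with Integer.parseInt. Whether the CSV reader calls this exact method internally is an assumption, and CsvIntegerParseDemo is a hypothetical name.

```java
/** Hypothetical demo class; it only illustrates standard Java parsing behavior. */
final class CsvIntegerParseDemo {
    /** Returns the parsed int, or null if the text is not a valid integer literal. */
    static Integer tryParseInt(String text) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            // Decimal text such as "3.14" or "7.0" is not a valid int literal.
            return null;
        }
    }

    public static void main(String[] args) {
        for (String field : new String[] {"42", "3.14", "7.0"}) {
            Integer value = tryParseInt(field);
            System.out.println(field + " -> " + (value == null ? "NumberFormatException" : value));
        }
    }
}
```

If a CSV column can contain decimal values, declaring the corresponding schema field as float or double avoids the failure.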

The full change log
is available from JIRA.

Version 0.17.0

Release date: 9 October 2014

Version 0.17.0 contains the following notable changes:

  • The Kite examples now require the Cloudera Quickstart VM
    version 5.1 or later.
  • Kite 0.15.0 and 0.16.0 defaulted to an appender that writes to both Avro and Parquet files when writing
    to a Parquet dataset, thus incurring twice the I/O. That default has switched back to the behavior
    from 0.14.0 and before, which is to write just to Parquet. When using a Parquet dataset, the DatasetWriter#flush() and DatasetWriter#sync() methods have no effect. That means data written to
    a Parquet dataset is not durable until after a successful call to DatasetWriter#close(). Users who
    want the behavior found in 0.15.0 and 0.16.0 can set the property kite.parquet.non-durable-writes to false using the API or the update command
    in the CLI. After setting the property, the DatasetWriter#flush() and DatasetWriter#sync() methods
    will flush and sync the Avro version of the data, respectively. If there is a failure before the writer
    is closed, the data can be recovered by reading the Avro version of the file and writing the records
    to Parquet. This recovery is a manual process.
  • Kite now supports namespaces for datasets. For Hive datasets, the Kite namespace maps to the
    Hive database where the table will be stored. Namespaces also changed the file system repository
    layout for local file and HDFS datasets. Dataset URIs used with previous releases will work
    unmodified. New datasets created using the DatasetRepository API (which moved to the SPI in 0.16.0)
    will not end up in the same location as in previous releases. The work-around is to use dataset URIs
    with the Datasets API. See the docs on dataset URIs
    for more details.
  • Users can now select the compression codec for Avro and Parquet datasets. See CDK-299 for more details.
  • The kite-data-hcatalog module has been renamed to kite-data-hive. A Maven relocation
    was put in place to prevent projects from breaking. However, we strongly encourage you to update
    your dependency to kite-data-hive in your projects. See CDK-452
    for details.
  • Hive external table URIs no longer support relative locations. A URI with the pattern dataset:hive:examples/ratings now means to use a namespace of examples and a
    dataset named ratings. You can create external URIs using the location query parameter.
    For example: dataset:hive:examples/ratings?location=/tmp/data/examples/ratings.
  • The Kite CLI tool has been renamed from dataset to kite-dataset. See CDK-670 for more information.
  • Kite will no longer use an embedded Hive MetaStore if it is not configured to
    connect to a remote MetaStore. Instead, Kite will throw an exception to avoid
    confusing behavior. See CDK-651
    for more information.
  • You can now partition datasets by sub-fields. See CDK-435
    for details.
  • File-based dataset names that are not alphanumeric (plus underscore) now issue a
    deprecation warning. Non-conforming names will be made illegal in a future release.
    See CDK-673.
  • There is a new experimental module, kite-minicluster, for running Hadoop services
    for testing and development purposes. The minicluster currently supports HDFS, Hive,
    HBase, and Flume services, and can be run directly from a Java program,
    or using the CLI. See CDK-679. The
    minicluster is experimental since its API and CLI are still subject to
    incompatible changes.
  • The Oozie portion of the demo example
    was removed. See CDK-605 for details.
  • Morphlines Library
    • Added morphline command that removes all record field values for which the field name and value matches a blacklist but not a whitelist: removeValues
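
The new reading of Hive dataset URIs described above (namespace, then dataset name, with an optional location query parameter) can be sketched as follows. HiveDatasetUriSketch is a hypothetical class, not Kite's own URI handling. It only covers the short dataset:hive:namespace/name form shown in these notes and does no percent-decoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for the short Hive URI form; not part of the Kite API. */
final class HiveDatasetUriSketch {
    final String namespace;
    final String name;
    final Map<String, String> options;

    private HiveDatasetUriSketch(String namespace, String name, Map<String, String> options) {
        this.namespace = namespace;
        this.name = name;
        this.options = options;
    }

    static HiveDatasetUriSketch parse(String uri) {
        final String prefix = "dataset:hive:";
        if (!uri.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a Hive dataset URI: " + uri);
        }
        String rest = uri.substring(prefix.length());
        int q = rest.indexOf('?');
        String path = q >= 0 ? rest.substring(0, q) : rest;
        String queryString = q >= 0 ? rest.substring(q + 1) : "";

        // Exactly two non-empty segments are expected: namespace/name.
        String[] parts = path.split("/");
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException("Expected namespace/name, got: " + path);
        }

        // Collect key=value pairs such as location=/tmp/data/examples/ratings.
        Map<String, String> options = new LinkedHashMap<>();
        for (String pair : queryString.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                options.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return new HiveDatasetUriSketch(parts[0], parts[1], options);
    }

    public static void main(String[] args) {
        HiveDatasetUriSketch external =
            parse("dataset:hive:examples/ratings?location=/tmp/data/examples/ratings");
        System.out.println("namespace = " + external.namespace);
        System.out.println("name      = " + external.name);
        System.out.println("location  = " + external.options.get("location"));
    }
}
```

In this sketch, a URI without a location parameter yields an empty options map.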

The full change log
is available from JIRA.

Version 0.16.0

Release date: 21 August 2014

Version 0.16.0 contains the following notable changes:

  • Kite datasets can be read from and written to by Apache Spark jobs. See the new Spark example for details on usage.
  • Added a CLI transform task for transforming
    entities read from a source dataset before storing them into a target dataset.
  • Added a CDH5 application parent POM that makes it easy to build Kite applications on CDH5 using Maven.
    The Spark example uses this parent POM.
  • The DatasetRepository and DatasetRepositories APIs have been moved to the SPI and
    deprecated from the public API. Users should move to the new Datasets API before the
    next release.
  • Kite will now properly generate Parquet Hive tables on Hive 0.13 and later.
  • Writing to a non-empty dataset or view from MapReduce or Crunch will now fail unless
    the write mode is explicitly set to append or overwrite. This is a change from
    the previous behavior which was to append. See CDK-572 and CDK-347 for details.

The full change log
is available from JIRA.

Version 0.15.0

Release date: 15 July 2014

Version 0.15.0 contains the following notable changes:

  • Kite artifacts are built against Apache Hadoop 2 and related projects,
    and are now available in Maven Central.
  • Added new introduction and concepts documentation.
  • Added a new Datasets
    convenience class for opening and working with Datasets, superseding DatasetRepositories.
  • Deprecated partition related methods in Dataset in favor of the views API.
  • Added a CLI copy task for copying
    datasets and also for dataset format conversion and data compaction.
  • Added an application parent POM that makes it easy to use Kite in a Maven project.
    The examples now use this parent POM.
  • Updated to Crunch 0.10.0
  • Morphlines Library
    • Added morphline command that parses an InputStream that contains protobuf data: readProtobuf (Rober Fiser via whoschek)
    • Added morphline command that extracts specific values from a protobuf object, akin to a simple form of XPath: extractProtobufPaths (Rober Fiser via whoschek)
    • Added morphline command that removes all record fields for which the field name matches a blacklist but not a whitelist: removeFields
    • Added optional parameters maxCharactersPerRecord and onMaxCharactersPerRecord to morphline command readCSV
    • Upgraded kite-morphlines-maxmind module from maxmind-db-0.3.1 to bug fix release maxmind-db-0.3.3
    • Upgraded kite-morphlines-core module from metrics-0.3.1 to bug fix release metrics-0.3.2

The full change log
is available from JIRA.

Version 0.14.1

Release date: 23 May 2014

Version 0.14.1 is a bug-fix release with the following notable changes:

The full change log
is available from JIRA.

Version 0.14.0

Release date: 13 May 2014

Version 0.14.0 has the following notable additions:

And the following bug fixes:

  • Updated CLI environment setup for CDH5.0 QuickStart VM
  • Fixed compatibility with CDH5 Hive, CDK-416
  • Fixed schema update validation bug, CDK-410
  • Added reconnect support when Hive connections drop, CDK-415

The full change log
is available from JIRA.

Version 0.13.0

Release date: April 23, 2014

Version 0.13.0 has the following notable changes:

  • Added datasets command-line interface
    • Build avro schemas from CSV data samples and java classes
    • Create, view, and delete Kite datasets
    • Import CSV data into a dataset
  • Morphlines Library
    • Added morphline command that opens an HDFS file for read and returns a corresponding Java InputStream: openHdfsFile
    • Added morphline command that converts an InputStream to a byte array in main memory: readBlob
    • Upgraded kite-morphlines-saxon module from Saxon-HE-9.5.1-4 to Saxon-HE-9.5.1-5

The full change log
is available from JIRA.

Version 0.12.1

Release date: March 18, 2014

Version 0.12.1 is a bug-fix release with the following notable changes:

  • Fixed slow job setup for crunch when using large Datasets (thanks Gabriel Reid!)
  • Fixed CDK-328, Hive metastore concurrent access bug (thanks Karel Vervaeke!)
  • Clarified documentation for deleting datasets
  • Added better checking to catch errors earlier
    • Catch partition strategies that rely on missing data fields
    • Catch Hive-incompatible table, column, and partition names
  • Added warnings when creating FS or HBase datasets that are incompatible with Hive

The full change log
is available from JIRA.

Version 0.12.0

Release date: March 10, 2014

Version 0.12.0 has the following notable changes:

  • MapReduce support for Datasets. New input and output formats (DatasetKeyInputFormat
    and DatasetKeyOutputFormat) make it possible to use Datasets with MapReduce.
  • Views API. There is an incompatible change in this release: RefineableView in the
    org.kitesdk.data package has been renamed to RefinableView (no ‘e’). Clients should
    update and recompile.
  • Morphlines Library
    • Added a sampling command that forwards each input record with a given probability to its child command: sample
    • Added a command that ignores all input records beyond the N-th record, akin to the Unix head command: head
    • Improved morphline import performance if all commands are specified via fully qualified class names.
    • Added several performance enhancements.
    • Added an example module that describes how to unit test Morphline config files and custom Morphline commands.
    • Improved documentation.

The full change log
is available from JIRA.

Version 0.11.0

Release date: February 6, 2014

Version 0.11.0 has the following notable changes:

  • Views API. A new API for expressing a subset of a dataset using logical constraints
    such as field matching or ranges. See the documentation for RefineableView
    for details. The HBase example
    has been extended to use a view for doing a partial scan of the table.
  • Dataset API. Removed APIs that were deprecated in 0.9.0. See the API Diffs for all the changes.
  • Upgrade to Crunch 0.9.0.
  • Morphlines Library
    • Added morphline command to read from Hadoop Avro Parquet Files: readAvroParquetFile
    • Added support for multi-character separators as well as regex separators to splitKeyValue command.
    • Added addEmptyStrings parameter to readCSV command to indicate whether or not to add zero length strings to the output field.
    • Upgraded kite-morphlines-solr-* modules from solr-4.6.0 to solr-4.6.1.
    • Upgraded kite-morphlines-json module from jackson-databind-2.2.1 to jackson-databind-2.3.1.
    • Upgraded kite-morphlines-metrics-servlets module from jetty-8.1.13.v20130916 to jetty-8.1.14.v20131031.
    • Upgraded kite-morphlines-saxon module from Saxon-HE-9.5.1-3 to Saxon-HE-9.5.1-4.
    • Fixed CDK-282 readRCFile command is broken (Prasanna Rajaperumal via whoschek).

The full change log
is available from JIRA.

Version 0.10.1

Release date: January 13, 2014

Version 0.10.1 includes the following bug fixes:

  • CDK-249: Correctly add new partitions to the Hive MetaStore
  • CDK-260: Fixed the date-format partition function in expressions
  • CDK-266: Fixed file name uniqueness
  • CDK-273: Fixed spurious batch size warning in log4j integration
  • Fixed NoClassDefFoundError for crunch in kite-tools module
  • Added more debug logging to Morphlines
  • Solr should fail fast if ZK has no solr configuration

This patch release is fully-compatible with 0.9.1, which uses the deprecated CDK packages.

The full change log
is available from JIRA.

Version 0.10.0

Release date: December 9, 2013

Version 0.10.0 has the following notable changes:

  • Renamed the project from CDK to Kite.
    The main goal of Kite is to increase the accessibility of Apache Hadoop as a platform.
    This isn’t specific to Cloudera, so we updated the name to correctly represent the project as an open, community-driven set of tools.
    To make migration easier, there are no feature changes and migration instructions have been added for existing projects.
    • Renamed java packages com.cloudera.cdk.* to org.kitesdk.*. This change is trivial and mechanical but it does break backwards compatibility. This is a one-time event - going forward no such backwards incompatible renames are planned. This mass rename is the only change going from the cdk-0.9.0 release to the kite-0.10.0 release.
    • Renamed maven module names and jar files from cdk-* to kite-*.
    • Moved github repo from http://github.com/cloudera/cdk to http://github.com/kite-sdk/kite.
    • Moved documentation from http://cloudera.github.io/cdk/docs/current to http://kitesdk.org/docs/current.
    • Moved morphline reference guide from http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html to http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html.

Version 0.9.0

Release date: December 5, 2013

Version 0.9.0 has the following notable changes:

  • HBase support. There is a new experimental API for working with random access datasets
    stored in HBase. The API exposes get/put operations, but there is no support for
    scans from an arbitrary row in this release. (The latter will be added in 0.11.0 as a
    part of the forthcoming views API.) For
    usage information, consult the new HBase example.
  • Parquet. Datasets in Parquet format can now be written (and read) using Crunch.
  • CSV. Datasets in CSV format can now be read using the dataset APIs. See the compatibility example.
  • Dataset API. Removed APIs that were deprecated in 0.8.0. See the API Diffs for all the
    changes.
  • Morphlines Library
    • Added morphline command to read from RCFile: readRCFile (Prasanna Rajaperumal via whoschek)
    • Added morphline command to convert a morphline record to an Avro record: toAvro
    • Added morphline command that serializes Avro records into a byte array: writeAvroToByteArray
    • Added morphline command that returns Geolocation information for a given IP address, using an efficient in-memory Maxmind database lookup: geoIP
    • Added morphline command that parses a user agent string and returns structured higher level data like user agent family, operating system, version, and device type: userAgent
    • Added option to fail the following commands if a URI is syntactically invalid: extractURIComponents, extractURIComponent, extractURIQueryParameters
    • Upgraded cdk-morphlines-solr-core module from solr-4.4 to solr-4.6.
    • Upgraded cdk-morphlines-saxon module from saxon-HE-9.5.1-2 to saxon-HE-9.5.1-3.
    • Fixed race condition on parallel initialization of multiple Solr morphlines within the same JVM.
    • For enhanced safety, the readSequenceFile command no longer reuses the identity of Hadoop Writable objects.

The full change log
is available from JIRA.

Version 0.8.1

Release date: October 23, 2013

Version 0.8.1 has the following notable changes:

  • Morphlines Library
    • Made xquery and xslt commands also compatible with woodstox-3.2.7 (not just woodstox-4.x).

Version 0.8.0

Release date: October 7, 2013

Version 0.8.0 has the following notable changes:

  • Dataset Repository URIs. Repositories can be referred to (and opened) by a URI. For
    example, repo:hdfs://namenode:8020/data specifies a Dataset Repository stored in HDFS.
    Dataset descriptors carry the repository URI.
  • Dataset API. Removed APIs that were deprecated in 0.7.0. Deprecated some constructors
    in favor of builders. See API Diffs
    for all the changes.
  • Upgrade to Parquet 1.2.0.
  • Morphlines Library
    • Added option for commands to register health checks (not just metrics) with the MorphlineContext.
    • Added registerJVMMetrics command that registers metrics that are related to the Java Virtual Machine
      with the MorphlineContext. For example, this includes metrics for garbage collection events,
      buffer pools, threads and thread deadlocks.
    • Added morphline commands to publish the metrics of all morphline commands to JMX, SLF4J and CSV files.
      The new commands are: startReportingMetricsToJMX, startReportingMetricsToSLF4J and startReportingMetricsToCSV.
    • Added EXPERIMENTAL cdk-morphlines-metrics-servlets maven module with new startReportingMetricsToHTTP command that
      exposes liveness status, health check status, metrics state and thread dumps via a set of HTTP URIs served by Jetty,
      using the AdminServlet.
    • Added cdk-morphlines-hadoop-core maven module with new downloadHdfsFile
      command for transferring HDFS files, e.g. to help with centralized configuration file management.
    • Added option to specify boost values to loadSolr command.
    • Added several performance enhancements.
    • Upgraded cdk-morphlines-solr-cell maven module from tika-1.3 to tika-1.4 to pick up some bug fixes.
    • Upgraded cdk-morphlines-core maven module from com.google.code.regexp-0.1.9 to 0.2.3 to pick up some bug fixes (Internally shaded version).
    • The constructor of AbstractCommand now has an additional parameter that refers to the CommandBuilder.
      The old constructor has been deprecated and will be removed in the next release.
    • The ISO8601_TIMEZONE grok pattern now allows the omission of minutes in a timezone offset.
    • Ensured morphline commands can refer to record field names containing arbitrary characters.
      Previously some commands could not refer to record field names containing the ‘.’ dot character.
      This limitation has been removed.
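
The repository URI form introduced in this release, such as repo:hdfs://namenode:8020/data, wraps a storage URI inside the repo scheme. The sketch below unpacks such a URI with the standard java.net.URI class. RepoUriSketch is a hypothetical name and this is not how Kite itself resolves repositories.

```java
import java.net.URI;

/** Hypothetical helper; illustrates the shape of a repo: URI using only the JDK. */
final class RepoUriSketch {
    /** Strips the outer "repo" scheme and returns the wrapped storage URI. */
    static URI storageUri(String repoUri) {
        URI outer = URI.create(repoUri);
        if (!"repo".equals(outer.getScheme())) {
            throw new IllegalArgumentException("Not a repository URI: " + repoUri);
        }
        // Everything after "repo:" is the scheme-specific part, itself a URI.
        return URI.create(outer.getSchemeSpecificPart());
    }

    public static void main(String[] args) {
        URI storage = storageUri("repo:hdfs://namenode:8020/data");
        System.out.println("scheme = " + storage.getScheme());
        System.out.println("host   = " + storage.getHost());
        System.out.println("port   = " + storage.getPort());
        System.out.println("path   = " + storage.getPath());
    }
}
```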

The full change log
is available from JIRA.

Version 0.7.0

Release date: September 5, 2013

Version 0.7.0 has the following notable changes:

  • Dataset API. Changes to make the API more consistent and better integrated with
    standard Java APIs like Iterator, Iterable, Flushable, and Closeable.
  • Java 7. CDK now also works with Java 7.
  • Upgrade to Avro 1.7.5.
  • Morphlines Library
    • Added commands splitKeyValue, extractURIComponent and toByteArray
    • Added outputFields parameter to the split command to support a list of column names similar to the readCSV command
    • Added tika-xmp maven module as a dependency to cdk-morphlines-solr-cell module
    • Added several performance enhancements
    • Upgraded cdk-morphlines-saxon module from saxon-HE-9.5.1-1 to saxon-HE-9.5.1-2

The full change log
is available from JIRA.

Version 0.6.0

Release date: August 16, 2013

Version 0.6.0 has the following notable changes:

  • Dependency management. Solr and Lucene dependencies have been upgraded to 4.4.
  • Build system. The version of the Maven Javadoc plugin has been upgraded.

Version 0.5.0

Release date: August 1, 2013

Version 0.5.0 has the following notable changes:

  • Examples. All examples can be run from the user’s host machine,
    as an alternative to running from within the QuickStart VM guest.
  • CDK Maven Plugin. A new plugin with
    goals for manipulating datasets, and packaging, deploying,
    and running distributed applications.
  • Dependency management. Hadoop components are now marked as provided to give users
    more control. See the dependencies page.
  • Upgrade to Parquet 1.0.0 and Crunch 0.7.0.
  • Morphlines Library
    • Added commands xquery, xslt, convertHTML for reading, extracting and transforming XML and HTML with XPath, XQuery and XSLT
    • Added tokenizeText command that uses the embedded Solr/Lucene Analyzer library to generate tokens from a text string, without sending data to a Solr server
    • Added translate command that examines each value in a given field and replaces it with the replacement value defined in a given dictionary aka lookup hash table
    • Disabled the quoting, multi-line fields, and comment line features of the readCSV morphline command by default.
    • Added several performance enhancements

The full change log
is available from JIRA.

Version 0.4.1

Release date: July 11, 2013

Version 0.4.1 has the following notable changes:

  • Morphlines Library
    • Expanded documentation and examples
    • Made SolrLocator and ZooKeeperDownloader collection alias aware
    • Added commands readJson and extractJsonPaths for reading, extracting, and transforming JSON files and JSON objects, in the same style as Avro support
    • Added commands split, findReplace, extractURIComponents, extractURIQueryParameters, decodeBase64
    • Fixed extractAvroPaths exception with flatten = true if path represents a non-leaf node of type Record
    • Added several performance enhancements

Version 0.4.0

Release date: June 22, 2013

Version 0.4.0 has the following notable changes:

  • Morphlines Library. A morphline is a rich configuration file that makes it easy to
    define an ETL transformation chain embedded in Hadoop components such as Search, Flume,
    MapReduce, Pig, Hive, Sqoop.
  • An Oozie example. A new example of using Oozie to run a transformation job
    periodically.
  • QuickStart VM update. The examples now use version 4.3.0 of the Cloudera QuickStart VM.
  • Java package changes. The com.cloudera.data package and subpackages
    have been renamed to com.cloudera.cdk.data, and com.cloudera.cdk.flume has become com.cloudera.cdk.data.flume.
  • Finer-grained Maven modules. The module organization and naming has changed,
    including making all group IDs com.cloudera.cdk. Please see the new dependencies
    page
    for details.

The full change log
is available from JIRA.

Version 0.3.0

Release date: June 6, 2013

Version 0.3.0 has the following notable changes:

  • Logging to a dataset. Using log4j as the logging API and Flume as the log transport,
    it is now possible to log application events to datasets.
  • Crunch support. Datasets can be exposed as Crunch sources and targets.
  • Date partitioning. New partitioning functions for partitioning datasets by
    year/month/day/hour/minute.
  • New examples. The new examples repository
    has examples for all these new features. The examples use the Cloudera QuickStart VM,
    version 4.2.0, to make running the examples as simple as possible.
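
Partitioning by year/month/day/hour/minute amounts to deriving those five values from a record's timestamp. The sketch below shows one way to compute them. DatePartitionSketch is a hypothetical class, the key=value directory layout and the use of UTC are assumptions, and java.time is used only for brevity. It is not the CDK partitioning API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

/** Hypothetical illustration of date partitioning; not the CDK partition functions. */
final class DatePartitionSketch {
    /** Maps a timestamp (milliseconds since the epoch) to a partition path, in UTC. */
    static String partitionPath(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        // Assumption: Hive-style key=value directory names, zero-padded.
        return String.format(Locale.ROOT,
            "year=%04d/month=%02d/day=%02d/hour=%02d/minute=%02d",
            t.getYear(), t.getMonthValue(), t.getDayOfMonth(), t.getHour(), t.getMinute());
    }

    public static void main(String[] args) {
        long timestamp = Instant.parse("2013-06-06T12:34:56Z").toEpochMilli();
        System.out.println(partitionPath(timestamp));
    }
}
```

Records whose timestamps fall in the same minute map to the same path, which is what lets a reader skip partitions outside a requested time range.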

The full change log
is available from JIRA.

Version 0.2.0

Release date: May 2, 2013

Version 0.2.0 has two major additions:

  • Experimental support for reading and writing datasets in Parquet format.
  • Support for storing dataset metadata in a Hive/HCatalog metastore.

The examples module has example code for both of these usages.

The full change log
is available from JIRA.

Version 0.1.0

Release date: April 5, 2013

Version 0.1.0 is the first release of the CDK Data module. This is considered a beta release. As a sub-1.0.0 release, this version is not subject to the
normal API compatibility guarantees. See the Compatibility Statement for
information about API compatibility guarantees.