Kite SDK Release Notes

All past Kite releases are documented on this page. Upcoming release dates can be found in JIRA.

Version 1.1.0

Release date: 16 June 2015

Version 1.1.0 contains the following notable changes:

Extended the “create” command to automatically detect schema and partitioning of existing data
Added a “compact” command to the CLI that will merge small files in place via a MapReduce job
Updated copy, transform, and compact to create a configurable number of files in each partition
Fixed out-of-memory errors when writing large files with the CLI
Added Kite URI handler for Oozie
Added support for datasets stored in S3
Added asSchema and asType methods to the API to allow setting the reader schema and type
Added isReady and signalReady to views for signaling when a view is ready for processing
Added a “list” CLI command
Added partition view to represent a view of a dataset partition
Added a new partitioner for fixed-size ranges
Removed the hadoop-1 test profile
Fixed copy to or from a Hive table on Kerberos-enabled clusters
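
The fixed-size range partitioner listed above can be pictured with a little arithmetic. The sketch below is only an illustration of the idea, assuming each value is mapped to the lower bound of the range that contains it; the class and method names (FixedRangeSketch, bucketStart) are hypothetical and are not part of the Kite API, and Kite's own partitioner may compute or label its ranges differently.

```java
// Hypothetical illustration of fixed-size range partitioning; not Kite code.
final class FixedRangeSketch {

    // Returns the lower bound of the fixed-size range containing the value.
    // Math.floorDiv rounds toward negative infinity, so negative values fall
    // into the range below zero instead of being pulled toward it.
    static long bucketStart(long value, long size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        return Math.floorDiv(value, size) * size;
    }
}
```

With a range size of 10, the values 20 through 29 all map to 20 under this assumption, so records whose partition field falls in the same range land in the same partition.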

Version 1.0.0

Release date: 23 February 2015

Version 1.0.0 contains the following notable changes:

All deprecated classes and methods have been removed from the data modules.
DatasetWriter no longer has a flush() or a sync() method (see CDK-892). Some (but not all) implementations of DatasetWriter implement the org.kitesdk.data.Flushable or org.kitesdk.data.Syncable interfaces, so you need to use the following idiom to flush the stream. (Calling sync() is similar.)
    DatasetWriter<Record> writer = ...
    if (writer instanceof Flushable) {
      ((Flushable) writer).flush();
    }
Avro schemas are now stored in HDFS for Hive datasets. This overcomes the 4K limit on schema size, as well as providing better schema evolution checking since all versions of the schema are stored. See CDK-969.
Removing a partition from a dataset now removes the partition from the Hive metastore (see CDK-924).
Morphlines Library
- Added support for nested documents aka child documents to loadSolr morphline command.

The full change log
is available from JIRA.

Version 0.18.0

Release date: 11 February 2015

Version 0.18.0 contains the following notable changes:

There is a new kite-dataset command, tar-import, for importing the contents of a tarfile into a dataset.
The delete command can now delete the data contained in a view.
The csv-schema and csv-import commands now take a --header argument for specifying the CSV header.
The restriction on filesystem dataset names is now enforced: attempting to create a dataset with a non-alphanumeric
name (underscores are valid too) results in an error.
Morphlines Library
- Upgraded kite-morphlines-solr-* to solr-4.10.3
- Upgraded kite-morphlines-tika-* to tika-1.5 (in sync with solr-4.10.3)
- Avoid NPE in geoIP morphline command if IP is not found (Santiago Mola via whoschek)
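
The naming restriction on filesystem datasets described above (alphanumeric characters plus underscores) can be checked before a create call is attempted. This is a minimal sketch under that reading of the rule; DatasetNames and isValid are hypothetical names, and Kite's own validation may apply further constraints that are not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical pre-check for dataset names; not Kite's own validator.
final class DatasetNames {

    // One or more ASCII letters, digits, or underscores.
    private static final Pattern VALID = Pattern.compile("[A-Za-z0-9_]+");

    // Returns true when the whole name consists only of allowed characters.
    static boolean isValid(String name) {
        return name != null && VALID.matcher(name).matches();
    }
}
```

Under this sketch a name such as ratings_2015 passes, while ratings-2015 is rejected because of the hyphen.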

The full change log
is available from JIRA.

Version 0.17.1

Release date: 9 December 2014

Version 0.17.1 is a bug-fix release with the following notable changes:

Kite data
- CSV imports will now use the dataset schema to read CSV records rather than inferring a schema from the data (see CDK-800).
- CSV floats or doubles read with an integer or long type will result in NumberFormatException during import. Previously, this was caught by checking the inferred schema with the dataset schema, but this method was unreliable. See CDK-801 for more information.
Morphlines Library
- Added support for deleting documents stored in Solr by unique id as well as by query
- Added documentation on how to update a subset of fields of an existing document stored in Solr: partial document updates
- Added ability to register custom Java extension functions with xquery and xslt morphline commands: xquery morphline command.
- Enhanced documentation for xquery morphline command.
- Upgraded kite-morphlines-maxmind module from maxmind-db-0.3.3 to bug fix release maxmind-db-1.0.0
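
The NumberFormatException mentioned in the CSV notes above matches how plain Java integer parsing treats decimal text. The sketch below shows that standard behavior and one way a caller might screen values before an import; CsvIntField and parseOrNull are hypothetical names, and this is not the code path Kite itself uses.

```java
// Hypothetical screening helper; illustrates standard Java parsing only.
final class CsvIntField {

    // Returns the parsed integer, or null when the text is not an integer
    // literal (for example "3.14", which Integer.parseInt rejects).
    static Integer parseOrNull(String text) {
        try {
            return Integer.valueOf(Integer.parseInt(text.trim()));
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

A value such as 42 parses normally, while 3.14 yields null here instead of an exception, which makes it easy to report offending rows before running an import.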

The full change log
is available from JIRA.

Version 0.17.0

Release date: 9 October 2014

Version 0.17.0 contains the following notable changes:

The Kite examples now require the Cloudera Quickstart VM
version 5.1 or later.
When writing to a Parquet dataset, Kite 0.15.0 and 0.16.0 defaulted to an appender that wrote to both Avro and Parquet files,
doubling the I/O cost. The default has switched back to the behavior
from 0.14.0 and earlier, which is to write only to Parquet. When using a Parquet dataset, the DatasetWriter#flush() and DatasetWriter#sync() methods have no effect. That means data written to
a Parquet dataset is not durable until after a successful call to DatasetWriter#close(). Users that
want the behavior found in 0.15.0 and 0.16.0 can set the property kite.parquet.non-durable-writes to false using the API or the update command
in the CLI. After setting the property, the DatasetWriter#flush() and DatasetWriter#sync() methods
will flush and sync the Avro version of the data respectively. If there is a failure before the writer
is closed, the data can be recovered by reading the Avro version of the file and writing the records
to Parquet. This recovery is a manual process.
Kite now supports namespaces for datasets. For Hive datasets, the Kite namespace maps to the
Hive database where the table will be stored. Namespaces also changed the file system repository
layout for local file and HDFS datasets. Dataset URIs used with previous releases will work
unmodified. New datasets created using the DatasetRepository API (which moved to the SPI in 0.16.0)
will not end up in the same location as in previous releases. The work-around is to use dataset URIs
with the Datasets API. See the docs on dataset URIs
for more details.
Users can now select the compression codec for Avro and Parquet datasets. See CDK-299 for more details.
The kite-data-hcatalog module has been renamed to kite-data-hive. A Maven relocation
was put in place to prevent projects from breaking. However, we strongly encourage you to update
your dependency to kite-data-hive in your projects. See CDK-452
for details.
Hive external table URIs no longer support relative locations. A URI with the pattern dataset:hive:examples/ratings now means to use a namespace of examples and a
dataset named ratings. You can create external URIs using the location query parameter.
For example: dataset:hive:examples/ratings?location=/tmp/data/examples/ratings.
The Kite CLI tool has been renamed from dataset to kite-dataset. See CDK-670 for more information.
Kite will no longer use an embedded Hive MetaStore if it is not configured to
connect to a remote MetaStore. Instead, Kite will throw an exception to avoid
confusing behavior. See CDK-651
for more information.
You can now partition datasets by sub-fields. See CDK-435
for details.
File-based dataset names that are not alphanumeric (plus underscore) now issue a
deprecation warning. Non-conforming names will be made illegal in a future release.
See CDK-673.
There is a new experimental module, kite-minicluster, for running Hadoop services
for testing and development purposes. The minicluster currently supports HDFS, Hive,
HBase, and Flume services, and can be run directly from a Java program,
or using the CLI. See CDK-679. The
minicluster is experimental since its API and CLI are still subject to
incompatible changes.
The Oozie portion of the demo example
was removed. See CDK-605 for details.
Morphlines Library
- Added morphline command that removes all record field values for which the field name and value matches a blacklist but not a whitelist: removeValues
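
The Hive URI pattern described above (dataset:hive:namespace/name, with an optional location query parameter) can be taken apart with a few string operations. The sketch below only illustrates how the pieces relate, assuming exactly two path segments; HiveDatasetUri and its fields are hypothetical and this is not Kite's URI parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for URIs of the form dataset:hive:<namespace>/<name>[?k=v&...].
final class HiveDatasetUri {
    final String namespace;
    final String dataset;
    final Map<String, String> options;

    private HiveDatasetUri(String namespace, String dataset, Map<String, String> options) {
        this.namespace = namespace;
        this.dataset = dataset;
        this.options = options;
    }

    static HiveDatasetUri parse(String uri) {
        String prefix = "dataset:hive:";
        if (!uri.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a Hive dataset URI: " + uri);
        }
        String rest = uri.substring(prefix.length());
        String path = rest;
        Map<String, String> options = new LinkedHashMap<>();
        int q = rest.indexOf('?');
        if (q >= 0) {
            path = rest.substring(0, q);
            // Split the query string into key=value pairs.
            for (String pair : rest.substring(q + 1).split("&")) {
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    options.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
        // Assumption: the path is exactly <namespace>/<name>.
        String[] parts = path.split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected namespace/name in: " + uri);
        }
        return new HiveDatasetUri(parts[0], parts[1], options);
    }
}
```

Parsing dataset:hive:examples/ratings?location=/tmp/data/examples/ratings this way gives the namespace examples, the dataset ratings, and a location option holding the external path.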

The full change log
is available from JIRA.

Version 0.16.0

Release date: 21 August 2014

Version 0.16.0 contains the following notable changes:

Kite datasets can be read from and written to by Apache Spark jobs. See the new Spark example for details on usage.
Added a CLI transform task for transforming
entities read from a source dataset before storing them into a target dataset.
Added a CDH5 application parent POM that makes it easy to build Kite applications on CDH5 using Maven.
The Spark example uses this parent POM.
The DatasetRepository and DatasetRepositories APIs have been moved to the SPI and
deprecated from the public API. Users should move to the new Datasets API before the
next release.
Kite will now properly generate Parquet Hive tables on Hive 0.13 and later.
Writing to a non-empty dataset or view from MapReduce or Crunch will now fail unless
the write mode is explicitly set to append or overwrite. This is a change from
the previous behavior which was to append. See CDK-572 and CDK-347 for details.

The full change log
is available from JIRA.

Version 0.15.0

Release date: 15 July 2014

Version 0.15.0 contains the following notable changes:

Kite artifacts are built against Apache Hadoop 2 and related projects,
and are now available in Maven Central.
Added new introduction and concepts documentation.
Added a new Datasets
convenience class for opening and working with Datasets, superseding DatasetRepositories.
Deprecated partition related methods in Dataset in favor of the views API.
Added a CLI copy task for copying
datasets and also for dataset format conversion and data compaction.
Added an application parent POM that makes it easy to use Kite in a Maven project.
The examples now use this parent POM.
Updated to Crunch 0.10.0
Morphlines Library
- Added morphline command that parses an InputStream that contains protobuf data: readProtobuf (Rober Fiser via whoschek)
- Added morphline command that extracts specific values from a protobuf object, akin to a simple form of XPath: extractProtobufPaths (Rober Fiser via whoschek)
- Added morphline command that removes all record fields for which the field name matches a blacklist but not a whitelist: removeFields
- Added optional parameters maxCharactersPerRecord and onMaxCharactersPerRecord to morphline command readCSV
- Upgraded kite-morphlines-maxmind module from maxmind-db-0.3.1 to bug fix release maxmind-db-0.3.3
- Upgraded kite-morphlines-core module from metrics-0.3.1 to bug fix release metrics-0.3.2

The full change log
is available from JIRA.

Version 0.14.1

Release date: 23 May 2014

Version 0.14.1 is a bug-fix release with the following notable changes:

Usability improvements for the CLI, CDK-424
Fixed Dataset examples: dataset-hbase, dataset-staging, dataset-compatibility
Fixed a bug in the Kite Maven Plugin, CDK-406

The full change log
is available from JIRA.

Version 0.14.0

Release date: 13 May 2014

Version 0.14.0 has the following notable additions:

Added View support to Kite MapReduce and Crunch
Added more documentation on kitesdk.org
- Kite command-line interface tutorial and reference
Added HBase storage option to the CLI, --use-hbase (experimental)
Added a new JSON configuration format for partition strategies
- Supports hash, identity, year, month, day, hour, minute partitioners
- Partition strategy documentation
Added partition strategy support to the CLI
- Create and validate partition strategies using partition-config
- Create partitioned datasets with create
Added a builder and JSON configuration format for HBase column mappings
- Supports column, counter, keyAsColumn, key, and version mappings
Updated to parquet 1.4.1

And the following bug fixes:

Updated CLI environment setup for CDH5.0 QuickStart VM
Fixed compatibility with CDH5 Hive, CDK-416
Fixed schema update validation bug, CDK-410
Added reconnect support when Hive connections drop, CDK-415

The full change log
is available from JIRA.

Version 0.13.0

Release date: April 23, 2014

Version 0.13.0 has the following notable changes:

Added datasets command-line interface
- Build Avro schemas from CSV data samples and Java classes
- Create, view, and delete Kite datasets
- Import CSV data into a dataset
Morphlines Library
- Added morphline command that opens an HDFS file for read and returns a corresponding Java InputStream: openHdfsFile
- Added morphline command that converts an InputStream to a byte array in main memory: readBlob
- Upgraded kite-morphlines-saxon module from Saxon-HE-9.5.1-4 to Saxon-HE-9.5.1-5

The full change log
is available from JIRA.

Version 0.12.1

Release date: March 18, 2014

Version 0.12.1 is a bug-fix release with the following notable changes:

Fixed slow job setup for Crunch when using large Datasets (thanks Gabriel Reid!)
Fixed CDK-328, Hive metastore concurrent access bug (thanks Karel Vervaeke!)
Clarified documentation for deleting datasets
Added better checking to catch errors earlier
- Catch partition strategies that rely on missing data fields
- Catch Hive-incompatible table, column, and partition names
Added warnings when creating FS or HBase datasets that are incompatible with Hive

The full change log
is available from JIRA.

Version 0.12.0

Release date: March 10, 2014

Version 0.12.0 has the following notable changes:

MapReduce support for Datasets. New input and output formats (DatasetKeyInputFormat
and DatasetKeyOutputFormat) make it possible to use Datasets with MapReduce.
Views API. There is an incompatible change in this release: RefineableView in the
org.kitesdk.data package has been renamed to RefinableView (no ‘e’). Clients should
update and recompile.
Morphlines Library
- Added a sampling command that forwards each input record with a given probability to its child command: sample
- Added a command that ignores all input records beyond the N-th record, akin to the Unix head command: head
- Improved morphline import performance if all commands are specified via fully qualified class names.
- Added several performance enhancements.
- Added an example module that describes how to unit test Morphline config files and custom Morphline commands.
- Improved documentation.
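
The sample and head commands listed above are configured in morphline files, but their behavior is easy to picture in ordinary Java. The sketch below is a stand-alone illustration over in-memory lists, not the Morphlines implementation; RecordFilters and both method signatures are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-ins for the behavior of the sample and head commands.
final class RecordFilters {

    // Forwards each record independently with the given probability.
    // nextDouble() returns a value in [0.0, 1.0), so a probability of 1.0
    // keeps every record and 0.0 keeps none.
    static <T> List<T> sample(List<T> records, double probability, Random random) {
        List<T> forwarded = new ArrayList<>();
        for (T record : records) {
            if (random.nextDouble() < probability) {
                forwarded.add(record);
            }
        }
        return forwarded;
    }

    // Keeps the first `limit` records and ignores everything after them.
    static <T> List<T> head(List<T> records, int limit) {
        int end = Math.min(Math.max(limit, 0), records.size());
        return new ArrayList<>(records.subList(0, end));
    }
}
```

Passing a seeded Random makes a sampled run repeatable, which is convenient when testing a pipeline on a fraction of its input.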

The full change log
is available from JIRA.

Version 0.11.0

Release date: February 6, 2014

Version 0.11.0 has the following notable changes:

Views API. A new API for expressing a subset of a dataset using logical constraints
such as field matching or ranges. See the documentation for RefineableView
for details. The HBase example
has been extended to use a view for doing a partial scan of the table.
Dataset API. Removed APIs that were deprecated in 0.9.0. See the API Diffs for all the changes.
Upgrade to Crunch 0.9.0.
Morphlines Library
- Added morphline command to read from Hadoop Avro Parquet Files: readAvroParquetFile
- Added support for multi-character separators as well as regex separators to splitKeyValue command.
- Added addEmptyStrings parameter to readCSV command to indicate whether or not to add zero length strings to the output field.
- Upgraded kite-morphlines-solr-* modules from solr-4.6.0 to solr-4.6.1.
- Upgraded kite-morphlines-json module from jackson-databind-2.2.1 to jackson-databind-2.3.1.
- Upgraded kite-morphlines-metrics-servlets module from jetty-8.1.13.v20130916 to jetty-8.1.14.v20131031.
- Upgraded kite-morphlines-saxon module from Saxon-HE-9.5.1-3 to Saxon-HE-9.5.1-4.
- Fixed CDK-282 readRCFile command is broken (Prasanna Rajaperumal via whoschek).

The full change log
is available from JIRA.

Version 0.10.1

Release date: January 13, 2014

Version 0.10.1 includes the following bug fixes:

CDK-249: Correctly add new partitions to the Hive MetaStore
CDK-260: Fixed the date-format partition function in expressions
CDK-266: Fixed file name uniqueness
CDK-273: Fixed spurious batch size warning in log4j integration
Fixed NoClassDefFoundError for Crunch in kite-tools module
Added more debug logging to Morphlines
Solr should fail fast if ZK has no solr configuration

This patch release is fully compatible with 0.9.1, which uses the deprecated CDK packages.

The full change log
is available from JIRA.

Version 0.10.0

Release date: December 9, 2013

Version 0.10.0 has the following notable changes:

Renamed the project from CDK to Kite.
The main goal of Kite is to increase the accessibility of Apache Hadoop as a platform.
This isn’t specific to Cloudera, so we updated the name to correctly represent the project as an open, community-driven set of tools.
To make migration easier, there are no feature changes and migration instructions have been added for existing projects.
- Renamed Java packages com.cloudera.cdk.* to org.kitesdk.*. This change is trivial and mechanical but it does break backwards compatibility. This is a one-time event; going forward, no such backwards-incompatible renames are planned. This mass rename is the only change going from the cdk-0.9.0 release to the kite-0.10.0 release.
- Renamed Maven module names and jar files from cdk-* to kite-*.
- Moved GitHub repo from http://github.com/cloudera/cdk to http://github.com/kite-sdk/kite.
- Moved documentation from http://cloudera.github.io/cdk/docs/current to http://kitesdk.org/docs/current.
- Moved morphline reference guide from http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html to http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html.

Version 0.9.0

Release date: December 5, 2013

Version 0.9.0 has the following notable changes:

HBase support. There is a new experimental API for working with random access datasets
stored in HBase. The API exposes get/put operations, but there is no support for
scans from an arbitrary row in this release. (The latter will be added in 0.11.0 as a
part of the forthcoming views API.) For
usage information, consult the new HBase example.
Parquet. Datasets in Parquet format can now be written (and read) using Crunch.
CSV. Datasets in CSV format can now be read using the dataset APIs. See the compatibility example.
Dataset API. Removed APIs that were deprecated in 0.8.0. See the API Diffs for all the
changes.
Morphlines Library
- Added morphline command to read from RCFile: readRCFile (Prasanna Rajaperumal via whoschek)
- Added morphline command to convert a morphline record to an Avro record: toAvro
- Added morphline command that serializes Avro records into a byte array: writeAvroToByteArray
- Added morphline command that returns Geolocation information for a given IP address, using an efficient in-memory Maxmind database lookup: geoIP
- Added morphline command that parses a user agent string and returns structured higher level data like user agent family, operating system, version, and device type: userAgent
- Added option to fail the following commands if a URI is syntactically invalid: extractURIComponents, extractURIComponent, extractURIQueryParameters
- Upgraded cdk-morphlines-solr-core module from solr-4.4 to solr-4.6.
- Upgraded cdk-morphlines-saxon module from saxon-HE-9.5.1-2 to saxon-HE-9.5.1-3.
- Fixed race condition on parallel initialization of multiple Solr morphlines within the same JVM.
- For enhanced safety, the readSequenceFile command no longer reuses the identity of Hadoop Writable objects.

The full change log
is available from JIRA.

Version 0.8.1

Release date: October 23, 2013

Version 0.8.1 has the following notable changes:

Morphlines Library
- Made xquery and xslt commands also compatible with woodstox-3.2.7 (not just woodstox-4.x).

Version 0.8.0

Release date: October 7, 2013

Version 0.8.0 has the following notable changes:

Dataset Repository URIs. Repositories can be referred to (and opened) by a URI. For
example, repo:hdfs://namenode:8020/data specifies a Dataset Repository stored in HDFS.
Dataset descriptors carry the repository URI.
Dataset API. Removed APIs that were deprecated in 0.7.0. Deprecated some constructors
in favor of builders. See API Diffs
for all the changes.
Upgrade to Parquet 1.2.0.
Morphlines Library
- Added option for commands to register health checks (not just metrics) with the MorphlineContext.
- Added registerJVMMetrics command that registers metrics that are related to the Java Virtual Machine
  with the MorphlineContext. For example, this includes metrics for garbage collection events,
  buffer pools, threads and thread deadlocks.
- Added morphline commands to publish the metrics of all morphline commands to JMX, SLF4J and CSV files.
  The new commands are: startReportingMetricsToJMX, startReportingMetricsToSLF4J and startReportingMetricsToCSV.
- Added EXPERIMENTAL cdk-morphlines-metrics-servlets maven module with new startReportingMetricsToHTTP command that
  exposes liveness status, health check status, metrics state and thread dumps via a set of HTTP URIs served by Jetty,
  using the AdminServlet.
- Added cdk-morphlines-hadoop-core maven module with new downloadHdfsFile
  command for transferring HDFS files, e.g. to help with centralized configuration file management.
- Added option to specify boost values to loadSolr command.
- Added several performance enhancements.
- Upgraded cdk-morphlines-solr-cell maven module from tika-1.3 to tika-1.4 to pick up some bug fixes.
- Upgraded cdk-morphlines-core maven module from com.google.code.regexp-0.1.9 to 0.2.3 to pick up some bug fixes (Internally shaded version).
- The constructor of AbstractCommand now has an additional parameter that refers to the CommandBuilder.
  The old constructor has been deprecated and will be removed in the next release.
- The ISO8601_TIMEZONE grok pattern now allows the omission of minutes in a timezone offset.
- Ensured morphline commands can refer to record field names containing arbitrary characters.
  Previously some commands could not refer to record field names containing the ‘.’ dot character.
  This limitation has been removed.
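
The repository URI form introduced in this release (for example repo:hdfs://namenode:8020/data) wraps an ordinary storage URI behind a repo: prefix. The sketch below uses only java.net.URI to peel off that prefix and inspect the inner URI; RepoUriSketch and storageUri are hypothetical names, and this is not how Kite itself resolves repositories.

```java
import java.net.URI;

// Hypothetical helper that unwraps a repo: URI with the standard library.
final class RepoUriSketch {

    // Returns the storage URI nested inside a repo: URI.
    static URI storageUri(String repoUri) {
        URI outer = URI.create(repoUri);
        if (!"repo".equals(outer.getScheme())) {
            throw new IllegalArgumentException("Not a repository URI: " + repoUri);
        }
        // For repo:hdfs://host:port/path the scheme-specific part is the
        // nested URI, which can be parsed on its own.
        return URI.create(outer.getSchemeSpecificPart());
    }
}
```

For repo:hdfs://namenode:8020/data the nested URI has the scheme hdfs, the host namenode, the port 8020, and the path /data.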

The full change log
is available from JIRA.

Version 0.7.0

Release date: September 5, 2013

Version 0.7.0 has the following notable changes:

Dataset API. Changes to make the API more consistent and better integrated with
standard Java APIs like Iterator, Iterable, Flushable, and Closeable.
Java 7. CDK now also works with Java 7.
Upgrade to Avro 1.7.5.
Morphlines Library
- Added commands splitKeyValue, extractURIComponent and toByteArray
- Added outputFields parameter to the split command to support a list of column names similar to the readCSV command
- Added tika-xmp maven module as a dependency to cdk-morphlines-solr-cell module
- Added several performance enhancements
- Upgraded cdk-morphlines-saxon module from saxon-HE-9.5.1-1 to saxon-HE-9.5.1-2

The full change log
is available from JIRA.

Version 0.6.0

Release date: August 16, 2013

Version 0.6.0 has the following notable changes:

Dependency management. Solr and Lucene dependencies have been upgraded to 4.4.
Build system. The version of the Maven Javadoc plugin has been upgraded.

Version 0.5.0

Release date: August 1, 2013

Version 0.5.0 has the following notable changes:

Examples. All examples can be run from the user’s host machine,
as an alternative to running from within the QuickStart VM guest.
CDK Maven Plugin. A new plugin with
goals for manipulating datasets, and packaging, deploying,
and running distributed applications.
Dependency management. Hadoop components are now marked as provided to give users
more control. See the dependencies page.
Upgrade to Parquet 1.0.0 and Crunch 0.7.0.
Morphlines Library
- Added commands xquery, xslt, convertHTML for reading, extracting and transforming XML and HTML with XPath, XQuery and XSLT
- Added tokenizeText command that uses the embedded Solr/Lucene Analyzer library to generate tokens from a text string, without sending data to a Solr server
- Added translate command that examines each value in a given field and replaces it with the replacement value defined in a given dictionary aka lookup hash table
- The quoting, multi-line fields, and comment line features of the readCSV morphline command are now disabled by default.
- Added several performance enhancements

The full change log
is available from JIRA.

Version 0.4.1

Release date: July 11, 2013

Version 0.4.1 has the following notable changes:

Morphlines Library
- Expanded documentation and examples
- Made SolrLocator and ZooKeeperDownloader collection alias aware
- Added commands readJson and extractJsonPaths for reading, extracting, and transforming JSON files and JSON objects, in the same style as Avro support
- Added commands split, findReplace, extractURIComponents, extractURIQueryParameters, decodeBase64
- Fixed extractAvroPaths exception with flatten = true if path represents a non-leaf node of type Record
- Added several performance enhancements

Version 0.4.0

Release date: June 22, 2013

Version 0.4.0 has the following notable changes:

Morphlines Library. A morphline is a rich configuration file that makes it easy to
define an ETL transformation chain embedded in Hadoop components such as Search, Flume,
MapReduce, Pig, Hive, Sqoop.
An Oozie example. A new example of using Oozie to run a transformation job
periodically.
QuickStart VM update. The examples now use version 4.3.0 of the Cloudera QuickStart VM.
Java package changes. The com.cloudera.data package and subpackages
have been renamed to com.cloudera.cdk.data, and com.cloudera.cdk.flume has become com.cloudera.cdk.data.flume.
Finer-grained Maven modules. The module organization and naming has changed,
including making all group IDs com.cloudera.cdk. Please see the new dependencies
page for details.

The full change log
is available from JIRA.

Version 0.3.0

Release date: June 6, 2013

Version 0.3.0 has the following notable changes:

Logging to a dataset. Using log4j as the logging API and Flume as the log transport,
it is now possible to log application events to datasets.
Crunch support. Datasets can be exposed as Crunch sources and targets.
Date partitioning. New partitioning functions for partitioning datasets by
year/month/day/hour/minute.
New examples. The new examples repository
has examples for all these new features. The examples use the Cloudera QuickStart VM,
version 4.2.0, to make running the examples as simple as possible.
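
The date partitioning functions described above derive year, month, day, hour, and minute values from a timestamp. The sketch below shows one way such values could be turned into a partition path, assuming UTC and a key=value directory layout; both are assumptions for illustration, and DatePartitionSketch and partitionPath are hypothetical names rather than CDK API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical illustration of date-based partitioning; not CDK code.
final class DatePartitionSketch {

    // Builds a year/month/day/hour/minute path for an epoch-millisecond
    // timestamp, interpreted in UTC (an assumption of this sketch).
    static String partitionPath(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return String.format(
            "year=%d/month=%02d/day=%02d/hour=%02d/minute=%02d",
            t.getYear(), t.getMonthValue(), t.getDayOfMonth(),
            t.getHour(), t.getMinute());
    }
}
```

Records whose timestamps fall within the same minute resolve to the same path under this scheme, so they would be stored together.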

The full change log
is available from JIRA.

Version 0.2.0

Release date: May 2, 2013

Version 0.2.0 has two major additions:

Experimental support for reading and writing datasets in Parquet format.
Support for storing dataset metadata in a Hive/HCatalog metastore.

The examples module has example code for both of these usages.

The full change log
is available from JIRA.

Version 0.1.0

Release date: April 5, 2013

Version 0.1.0 is the first release of the CDK Data module. This is considered a beta release. As a sub-1.0.0 release, this version is not subject to the
normal API compatibility guarantees. See the Compatibility Statement for
information about API compatibility guarantees.
