Kite SDK Release Notes
All past Kite releases are documented on this page. Upcoming release dates can be found in JIRA.
Version 0.17.1
Release date: 9 December 2014
Version 0.17.1 is a bug-fix release with the following notable changes:
- Kite data
- CSV imports will now use the dataset schema to read CSV records rather than inferring a schema from the data (see CDK-800).
- CSV float or double values read with an integer or long schema type now result in a NumberFormatException during import. Previously, this case was caught by checking the inferred schema against the dataset schema, but that check was unreliable. See CDK-801 for more information.
- Morphlines Library
- Added support for deleting documents stored in Solr by unique id as well as by query
- Added documentation on how to update a subset of fields of an existing document stored in Solr: partial document updates
- Added the ability to register custom Java extension functions with the xquery and xslt morphline commands.
- Enhanced documentation for the xquery morphline command.
- Upgraded kite-morphlines-maxmind module from maxmind-db-0.3.3 to bug fix release maxmind-db-1.0.0
The full change log
is available from JIRA.
Version 0.17.0
Release date: 9 October 2014
Version 0.17.0 contains the following notable changes:
- The Kite examples now require the Cloudera QuickStart VM version 5.1 or later.
- Kite 0.15.0 and 0.16.0 default to an appender that writes to both Avro and Parquet files when writing to a Parquet dataset, incurring 2x the I/O resources. That default has switched back to the behavior from 0.14.0 and before, which is to write to Parquet only. When using a Parquet dataset, the `DatasetWriter#flush()` and `DatasetWriter#sync()` methods have no effect, which means data written to a Parquet dataset is not durable until after a successful call to `DatasetWriter#close()`. Users who want the behavior found in 0.15.0 and 0.16.0 can set the property `kite.parquet.non-durable-writes` to `false` using the API or the `update` command in the CLI. After setting the property, the `DatasetWriter#flush()` and `DatasetWriter#sync()` methods will flush and sync the Avro version of the data, respectively. If there is a failure before the writer is closed, the data can be recovered by reading the Avro version of the file and writing the records to Parquet. This recovery is a manual process.
- Kite now supports namespaces for datasets. For Hive datasets, the Kite namespace maps to the Hive database where the table will be stored. Namespaces also changed the file system repository layout for local file and HDFS datasets. Dataset URIs used with previous releases will work unmodified, but new datasets created using the `DatasetRepository` API (which moved to the SPI in 0.16.0) will not end up in the same location as in previous releases. The work-around is to use dataset URIs with the `Datasets` API. See the docs on dataset URIs for more details.
- Users can now select the compression codec for Avro and Parquet datasets. See CDK-299 for more details.
- The `kite-data-hcatalog` module has been renamed to `kite-data-hive`. A Maven relocation was put in place to prevent projects from breaking. However, we strongly encourage you to update your dependency to `kite-data-hive` in your projects. See CDK-452 for details.
- Hive external table URIs no longer support relative locations. A URI with the pattern `dataset:hive:examples/ratings` now means to use a namespace of `examples` and a dataset named `ratings`. You can create external URIs using the `location` query parameter. For example: `dataset:hive:examples/ratings?location=/tmp/data/examples/ratings`.
- The Kite CLI tool has been renamed from `dataset` to `kite-dataset`. See CDK-670 for more information.
- Kite will no longer use an embedded Hive MetaStore if it is not configured to connect to a remote MetaStore. Instead, Kite throws an exception to avoid confusing behavior. See CDK-651 for more information.
- You can now partition datasets by sub-fields. See CDK-435 for details.
- File-based dataset names that are not alphanumeric (plus underscore) now issue a deprecation warning. Non-conforming names will be made illegal in a future release. See CDK-673.
- There is a new experimental module, `kite-minicluster`, for running Hadoop services for testing and development purposes. The minicluster currently supports HDFS, Hive, HBase, and Flume services, and can be run directly from a Java program or using the CLI. See CDK-679. The minicluster is experimental since its API and CLI are still subject to incompatible changes.
- The Oozie portion of the demo example was removed. See CDK-605 for details.
- Morphlines Library
- Added morphline command that removes all record field values for which the field name and value match a blacklist but not a whitelist: removeValues
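A hedged sketch of how the new removeValues command might appear in a morphline config file; the import pattern and the blacklist/whitelist expressions below are illustrative assumptions, so check the morphlines reference guide for the exact matcher syntax:

```
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Illustrative: drop blacklisted values unless they are whitelisted.
      {
        removeValues {
          blacklist : ["regex:debug_.*"]
          whitelist : ["literal:debug_keep"]
        }
      }
    ]
  }
]
```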
The full change log
is available from JIRA.
Version 0.16.0
Release date: 21 August 2014
Version 0.16.0 contains the following notable changes:
- Kite datasets can be read from and written to by Apache Spark jobs. See the new Spark example for details on usage.
- Added a CLI transform task for transforming entities read from a source dataset before storing them into a target dataset.
- Added a CDH5 application parent POM that makes it easy to build Kite applications on CDH5 using Maven. The Spark example uses this parent POM.
- The `DatasetRepository` and `DatasetRepositories` APIs have been moved to the SPI and deprecated in the public API. Users should move to the new `Datasets` API before the next release.
- Kite will now properly generate Parquet Hive tables on Hive 0.13 and later.
- Writing to a non-empty dataset or view from MapReduce or Crunch will now fail unless the write mode is explicitly set to append or overwrite. This is a change from the previous behavior, which was to append. See CDK-572 and CDK-347 for details.
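As a pseudocode-style sketch of the migration away from the repository APIs (loosely based on the Kite javadoc; the dataset name and URI here are illustrative):

```
// Before (now deprecated): open a repository, then load by name
DatasetRepository repo = DatasetRepositories.open("repo:hive");
Dataset<GenericRecord> events = repo.load("events");

// After: load directly by dataset URI with the Datasets API
Dataset<GenericRecord> events2 = Datasets.load("dataset:hive:events");
```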
The full change log
is available from JIRA.
Version 0.15.0
Release date: 15 July 2014
Version 0.15.0 contains the following notable changes:
- Kite artifacts are built against Apache Hadoop 2 and related projects, and are now available in Maven Central.
- Added new introduction and concepts documentation.
- Added a new Datasets convenience class for opening and working with Datasets, superseding DatasetRepositories.
- Deprecated partition-related methods in Dataset in favor of the views API.
- Added a CLI copy task for copying datasets, and also for dataset format conversion and data compaction.
- Added an application parent POM that makes it easy to use Kite in a Maven project. The examples now use this parent POM.
- Updated to Crunch 0.10.0
- Morphlines Library
- Added morphline command that parses an InputStream that contains protobuf data: readProtobuf (Rober Fiser via whoschek)
- Added morphline command that extracts specific values from a protobuf object, akin to a simple form of XPath: extractProtobufPaths (Rober Fiser via whoschek)
- Added morphline command that removes all record fields for which the field name matches a blacklist but not a whitelist: removeFields
- Added optional parameters `maxCharactersPerRecord` and `onMaxCharactersPerRecord` to morphline command readCSV
- Upgraded kite-morphlines-maxmind module from maxmind-db-0.3.1 to bug fix release maxmind-db-0.3.3
- Upgraded kite-morphlines-core module from metrics-0.3.1 to bug fix release metrics-0.3.2
The full change log
is available from JIRA.
Version 0.14.1
Release date: 23 May 2014
Version 0.14.1 is a bug-fix release with the following notable changes:
- Usability improvements for the CLI, CDK-424
- Fixed Dataset examples: dataset-hbase, dataset-staging, dataset-compatibility
- Fixed a bug in the Kite Maven Plugin, CDK-406
The full change log
is available from JIRA.
Version 0.14.0
Release date: 13 May 2014
Version 0.14.0 has the following notable additions:
- Added View support to Kite MapReduce and Crunch
- Added more documentation on kitesdk.org
- Added HBase storage option to the CLI, `--use-hbase` (experimental)
- Added a new JSON configuration format for partition strategies
- Supports hash, identity, year, month, day, hour, and minute partitioners
- Partition strategy documentation
- Added partition strategy support to the CLI
- Create and validate partition strategies using partition-config
- Create partitioned datasets with create
- Added a builder and JSON configuration format for HBase column mappings
- Supports column, counter, keyAsColumn, key, and version mappings
- Updated to Parquet 1.4.1
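A hedged sketch of what the JSON partition strategy format looks like; the source field names (`created_at`, `user_id`) are illustrative, and the partition strategy documentation linked above is authoritative:

```json
[
  {"type": "year",  "source": "created_at"},
  {"type": "month", "source": "created_at"},
  {"type": "day",   "source": "created_at"},
  {"type": "hash",  "source": "user_id", "buckets": 16}
]
```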
And the following bug fixes:
- Updated CLI environment setup for CDH5.0 QuickStart VM
- Fixed compatibility with CDH5 Hive, CDK-416
- Fixed schema update validation bug, CDK-410
- Added reconnect support when Hive connections drop, CDK-415
The full change log
is available from JIRA.
Version 0.13.0
Release date: April 23, 2014
Version 0.13.0 has the following notable changes:
- Added datasets command-line interface
- Build Avro schemas from CSV data samples and Java classes
- Create, view, and delete Kite datasets
- Import CSV data into a dataset
- Morphlines Library
- Added morphline command that opens an HDFS file for read and returns a corresponding Java InputStream: openHdfsFile
- Added morphline command that converts an InputStream to a byte array in main memory: readBlob
- Upgraded kite-morphlines-saxon module from Saxon-HE-9.5.1-4 to Saxon-HE-9.5.1-5
The full change log
is available from JIRA.
Version 0.12.1
Release date: March 18, 2014
Version 0.12.1 is a bug-fix release with the following notable changes:
- Fixed slow job setup for crunch when using large Datasets (thanks Gabriel Reid!)
- Fixed CDK-328, Hive metastore concurrent access bug (thanks Karel Vervaeke!)
- Clarified documentation for deleting datasets
- Added better checks to catch errors earlier
- Catch partition strategies that rely on missing data fields
- Catch Hive-incompatible table, column, and partition names
- Added warnings when creating FS or HBase datasets that are incompatible with Hive
The full change log
is available from JIRA.
Version 0.12.0
Release date: March 10, 2014
Version 0.12.0 has the following notable changes:
- MapReduce support for Datasets. New input and output formats (DatasetKeyInputFormat and DatasetKeyOutputFormat) make it possible to use Datasets with MapReduce.
- Views API. There is an incompatible change in this release: RefineableView in the org.kitesdk.data package has been renamed to RefinableView (no ‘e’). Clients should update and recompile.
- Morphlines Library
- Added a sampling command that forwards each input record with a given probability to its child command: sample
- Added a command that ignores all input records beyond the N-th record, akin to the Unix head command: head
- Improved morphline import performance if all commands are specified via fully qualified class names.
- Added several performance enhancements.
- Added an example module that describes how to unit test Morphline config files and custom Morphline commands.
- Improved documentation.
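The new sample and head commands compose naturally in a command chain. A sketch, with parameter names per our reading of the morphlines reference guide (verify before use):

```
commands : [
  # Forward each record with probability 0.1 to the next command.
  { sample { probability : 0.1 } }

  # Ignore all records beyond the first 1000, like the Unix head command.
  { head { limit : 1000 } }
]
```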
The full change log
is available from JIRA.
Version 0.11.0
Release date: February 6, 2014
Version 0.11.0 has the following notable changes:
- Views API. A new API for expressing a subset of a dataset using logical constraints such as field matching or ranges. See the documentation for RefineableView for details. The HBase example has been extended to use a view for doing a partial scan of the table.
- Dataset API. Removed APIs that were deprecated in 0.9.0. See the API Diffs for all the changes.
- Upgrade to Crunch 0.9.0.
- Morphlines Library
- Added morphline command to read from Hadoop Avro Parquet Files: readAvroParquetFile
- Added support for multi-character separators as well as regex separators to the splitKeyValue command.
- Added `addEmptyStrings` parameter to the readCSV command to indicate whether or not to add zero-length strings to the output field.
- Upgraded kite-morphlines-solr-* modules from solr-4.6.0 to solr-4.6.1.
- Upgraded kite-morphlines-json module from jackson-databind-2.2.1 to jackson-databind-2.3.1.
- Upgraded kite-morphlines-metrics-servlets module from jetty-8.1.13.v20130916 to jetty-8.1.14.v20131031.
- Upgraded kite-morphlines-saxon module from Saxon-HE-9.5.1-3 to Saxon-HE-9.5.1-4.
- Fixed CDK-282: readRCFile command is broken (Prasanna Rajaperumal via whoschek).
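A sketch of readCSV using the new addEmptyStrings parameter; the column names are illustrative, and the other parameters shown are pre-existing readCSV options:

```
{
  readCSV {
    separator : ","
    columns : [id, city, country]
    ignoreFirstLine : true
    # New in this release: whether zero-length strings are added to the output field
    addEmptyStrings : false
  }
}
```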
The full change log
is available from JIRA.
Version 0.10.1
Release date: January 13, 2014
Version 0.10.1 includes the following bug fixes:
- CDK-249: Correctly add new partitions to the Hive MetaStore
- CDK-260: Fixed the date-format partition function in expressions
- CDK-266: Fixed file name uniqueness
- CDK-273: Fixed spurious batch size warning in log4j integration
- Fixed NoClassDefFoundError for crunch in kite-tools module
- Added more debug logging to Morphlines
- Solr should fail fast if ZK has no solr configuration
This patch release is fully compatible with 0.9.1, which uses the deprecated CDK packages.
The full change log
is available from JIRA.
Version 0.10.0
Release date: December 9, 2013
Version 0.10.0 has the following notable changes:
- Renamed the project from CDK to Kite.
The main goal of Kite is to increase the accessibility of Apache Hadoop as a platform.
This isn’t specific to Cloudera, so we updated the name to correctly represent the project as an open, community-driven set of tools.
To make migration easier, there are no feature changes, and migration instructions have been added for existing projects.
- Renamed Java packages from `com.cloudera.cdk.*` to `org.kitesdk.*`. This change is trivial and mechanical, but it does break backwards compatibility. This is a one-time event: going forward, no such backwards-incompatible renames are planned. This mass rename is the only change going from the `cdk-0.9.0` release to the `kite-0.10.0` release.
- Renamed Maven module names and jar files from `cdk-*` to `kite-*`.
- Moved the GitHub repo from http://github.com/cloudera/cdk to http://github.com/kite-sdk/kite.
- Moved documentation from http://cloudera.github.io/cdk/docs/current to http://kitesdk.org/docs/current.
- Moved the morphline reference guide from http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html to http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html.
Version 0.9.0
Release date: December 5, 2013
Version 0.9.0 has the following notable changes:
- HBase support. There is a new experimental API for working with random-access datasets stored in HBase. The API exposes get/put operations, but there is no support for scans from an arbitrary row in this release. (The latter will be added in 0.11.0 as part of the forthcoming views API.) For usage information, consult the new HBase example.
- Parquet. Datasets in Parquet format can now be written (and read) using Crunch.
- CSV. Datasets in CSV format can now be read using the dataset APIs. See the compatibility example.
- Dataset API. Removed APIs that were deprecated in 0.8.0. See the API Diffs for all the changes.
- Morphlines Library
- Added morphline command to read from RCFile: readRCFile (Prasanna Rajaperumal via whoschek)
- Added morphline command to convert a morphline record to an Avro record: toAvro
- Added morphline command that serializes Avro records into a byte array: writeAvroToByteArray
- Added morphline command that returns Geolocation information for a given IP address, using an efficient in-memory Maxmind database lookup: geoIP
- Added morphline command that parses a user agent string and returns structured higher level data like user agent family, operating system, version, and device type: userAgent
- Added option to fail the following commands if a URI is syntactically invalid: extractURIComponents, extractURIComponent, extractURIQueryParameters.
- Upgraded cdk-morphlines-solr-core module from solr-4.4 to solr-4.6.
- Upgraded cdk-morphlines-saxon module from saxon-HE-9.5.1-2 to saxon-HE-9.5.1-3.
- Fixed race condition on parallel initialization of multiple Solr morphlines within the same JVM.
- For enhanced safety, the readSequenceFile command no longer reuses the identity of Hadoop Writable objects.
The full change log
is available from JIRA.
Version 0.8.1
Release date: October 23, 2013
Version 0.8.1 has the following notable changes:
- Morphlines Library
- Made xquery and xslt commands also compatible with woodstox-3.2.7 (not just woodstox-4.x).
Version 0.8.0
Release date: October 7, 2013
Version 0.8.0 has the following notable changes:
- Dataset Repository URIs. Repositories can be referred to (and opened) by a URI. For example, repo:hdfs://namenode:8020/data specifies a Dataset Repository stored in HDFS. Dataset descriptors carry the repository URI.
- Dataset API. Removed APIs that were deprecated in 0.7.0. Deprecated some constructors in favor of builders. See API Diffs for all the changes.
- Upgrade to Parquet 1.2.0.
- Morphlines Library
- Added option for commands to register health checks (not just metrics) with the MorphlineContext.
- Added registerJVMMetrics command that registers metrics related to the Java Virtual Machine with the MorphlineContext. For example, this includes metrics for garbage collection events, buffer pools, threads, and thread deadlocks.
- Added morphline commands to publish the metrics of all morphline commands to JMX, SLF4J, and CSV files. The new commands are: startReportingMetricsToJMX, startReportingMetricsToSLF4J, and startReportingMetricsToCSV.
- Added EXPERIMENTAL `cdk-morphlines-metrics-servlets` Maven module with a new startReportingMetricsToHTTP command that exposes liveness status, health check status, metrics state, and thread dumps via a set of HTTP URIs served by Jetty, using the AdminServlet.
- Added `cdk-morphlines-hadoop-core` Maven module with a new downloadHdfsFile command for transferring HDFS files, e.g. to help with centralized configuration file management.
- Added option to specify boost values to the loadSolr command.
- Added several performance enhancements.
- Upgraded `cdk-morphlines-solr-cell` Maven module from tika-1.3 to tika-1.4 to pick up some bug fixes.
- Upgraded `cdk-morphlines-core` Maven module from com.google.code.regexp-0.1.9 to 0.2.3 to pick up some bug fixes (internally shaded version).
- The constructor of AbstractCommand now has an additional parameter that refers to the CommandBuilder. The old constructor has been deprecated and will be removed in the next release.
- The ISO8601_TIMEZONE grok pattern now allows the omission of minutes in a timezone offset.
- Ensured morphline commands can refer to record field names containing arbitrary characters. Previously some commands could not refer to record field names containing the ‘.’ dot character. This limitation has been removed.
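Only the HDFS form of a repository URI is shown above; other storage schemes follow the same `repo:` pattern. An illustrative (not exhaustive) sketch:

```
repo:hdfs://namenode:8020/data   # Dataset Repository stored in HDFS
repo:file:/tmp/data              # Dataset Repository on the local filesystem
```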
The full change log
is available from JIRA.
Version 0.7.0
Release date: September 5, 2013
Version 0.7.0 has the following notable changes:
- Dataset API. Changes to make the API more consistent and better integrated with standard Java APIs like Iterator, Iterable, Flushable, and Closeable.
- Java 7. CDK now also works with Java 7.
- Upgrade to Avro 1.7.5.
- Morphlines Library
- Added commands `splitKeyValue`, `extractURIComponent`, and `toByteArray`
- Added `outputFields` parameter to the `split` command to support a list of column names, similar to the `readCSV` command
- Added tika-xmp Maven module as a dependency to the cdk-morphlines-solr-cell module
- Added several performance enhancements
- Upgraded cdk-morphlines-saxon module from saxon-HE-9.5.1-1 to saxon-HE-9.5.1-2
The full change log
is available from JIRA.
Version 0.6.0
Release date: August 16, 2013
Version 0.6.0 has the following notable changes:
- Dependency management. Solr and Lucene dependencies have been upgraded to 4.4.
- Build system. The version of the Maven Javadoc plugin has been upgraded.
Version 0.5.0
Release date: August 1, 2013
Version 0.5.0 has the following notable changes:
- Examples. All examples can be run from the user’s host machine, as an alternative to running from within the QuickStart VM guest.
- CDK Maven Plugin. A new plugin with goals for manipulating datasets, and for packaging, deploying, and running distributed applications.
- Dependency management. Hadoop components are now marked as provided to give users more control. See the dependencies page.
- Upgrade to Parquet 1.0.0 and Crunch 0.7.0.
- Morphlines Library
- Added commands `xquery`, `xslt`, and `convertHTML` for reading, extracting, and transforming XML and HTML with XPath, XQuery, and XSLT
- Added `tokenizeText` command that uses the embedded Solr/Lucene Analyzer library to generate tokens from a text string, without sending data to a Solr server
- Added `translate` command that examines each value in a given field and replaces it with the replacement value defined in a given dictionary, aka lookup hash table
- By default, disabled the quoting, multi-line fields, and comment line features for the `readCSV` morphline command
- Added several performance enhancements
The full change log
is available from JIRA.
Version 0.4.1
Release date: July 11, 2013
Version 0.4.1 has the following notable changes:
- Morphlines Library
- Expanded documentation and examples
- Made `SolrLocator` and `ZooKeeperDownloader` collection alias aware
- Added commands `readJson` and `extractJsonPaths` for reading, extracting, and transforming JSON files and JSON objects, in the same style as Avro support
- Added commands `split`, `findReplace`, `extractURIComponents`, `extractURIQueryParameters`, `decodeBase64`
- Fixed `extractAvroPaths` exception with flatten = true if the path represents a non-leaf node of type Record
- Added several performance enhancements
Version 0.4.0
Release date: June 22, 2013
Version 0.4.0 has the following notable changes:
- Morphlines Library. A morphline is a rich configuration file that makes it easy to define an ETL transformation chain embedded in Hadoop components such as Search, Flume, MapReduce, Pig, Hive, and Sqoop.
- An Oozie example. A new example of using Oozie to run a transformation job periodically.
- QuickStart VM update. The examples now use version 4.3.0 of the Cloudera QuickStart VM.
- Java package changes. The `com.cloudera.data` package and subpackages have been renamed to `com.cloudera.cdk.data`, and `com.cloudera.cdk.flume` has become `com.cloudera.cdk.data.flume`.
- Finer-grained Maven modules. The module organization and naming have changed, including making all group IDs `com.cloudera.cdk`. Please see the new dependencies page for details.
The full change log
is available from JIRA.
Version 0.3.0
Release date: June 6, 2013
Version 0.3.0 has the following notable changes:
- Logging to a dataset. Using log4j as the logging API and Flume as the log transport, it is now possible to log application events to datasets.
- Crunch support. Datasets can be exposed as Crunch sources and targets.
- Date partitioning. New partitioning functions for partitioning datasets by year/month/day/hour/minute.
- New examples. The new examples repository has examples for all these new features. The examples use the Cloudera QuickStart VM, version 4.2.0, to make running the examples as simple as possible.
The full change log
is available from JIRA.
Version 0.2.0
Release date: May 2, 2013
Version 0.2.0 has two major additions:
- Experimental support for reading and writing datasets in Parquet format.
- Support for storing dataset metadata in a Hive/HCatalog metastore.
The examples module has example code for both of these usages.
The full change log
is available from JIRA.
Version 0.1.0
Release date: April 5, 2013
Version 0.1.0 is the first release of the CDK Data module. This is considered a
beta release. As a sub-1.0.0 release, this version is not subject to the
normal API compatibility guarantees. See the Compatibility Statement for
information about API compatibility guarantees.