In this lab, you will create a Flume pipeline to send movie ratings from a web application into HDFS. You will learn to:

  1. Configure Flume to write records to a dataset
  2. Build and run a movie rating web application

If you haven’t already, make sure you’ve completed Lab 2: Create a movies dataset and Lab 5: Create a partitioned dataset.

Background

In this lab, you will build a Flume pipeline that adds movie ratings to the ratings dataset you created in Lab 5.

Flume collects individual ratings sent via RPC and continuously appends them to the dataset. Ratings are produced by a simple web application located in /home/cloudera/ratings-app.

This architecture mimics a real-world deployment:

  • The web application is decoupled from Hadoop. It does a minimal amount of work to send records, then moves on with its purpose: user interaction.
  • Flume accepts records from the application as quickly as possible to avoid interrupting the user experience.
  • Flume handles durability and eventual storage in HDFS.
  • The Hadoop cluster doesn’t need to be exposed to the web application tier, unless the application tier chooses to access Hadoop data directly. (This application caches the movies dataset.)

Steps

1. Create a Flume configuration named flume.conf

You might want to refer to the online reference or the built-in help:

kite-dataset help flume-config

Hints:

  • You should set the agent name to agent when generating your Flume configuration
  • The web application is configured to use the default RPC port, 41415
  • For this example, it is easier to use a memory channel; a sketch of the resulting configuration appears below
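
As a sanity check, the configuration that kite-dataset flume-config generates should look roughly like the sketch below. This is only an illustration, not the exact output: the source, channel, and sink names, the property names, and the dataset URI (dataset:hive:ratings is assumed here) depend on your dataset and your Kite and Flume versions, so trust the file you generate over this example.

# Rough sketch of a generated flume.conf (names and dataset URI are assumptions)
agent.sources = avro-event-source
agent.channels = mem-channel
agent.sinks = kite-dataset

# Avro RPC source that the web application sends ratings to (default port 41415)
agent.sources.avro-event-source.type = avro
agent.sources.avro-event-source.bind = 0.0.0.0
agent.sources.avro-event-source.port = 41415
agent.sources.avro-event-source.channels = mem-channel

# Memory channel, as suggested in the hints
agent.channels.mem-channel.type = memory
agent.channels.mem-channel.capacity = 10000

# Kite DatasetSink that appends buffered events to the ratings dataset
agent.sinks.kite-dataset.type = org.apache.flume.sink.kite.DatasetSink
agent.sinks.kite-dataset.channel = mem-channel
agent.sinks.kite-dataset.kite.dataset.uri = dataset:hive:ratings
agent.sinks.kite-dataset.kite.rollInterval = 30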

2. Restart Flume using your flume.conf

Copy your flume.conf over /etc/flume-ng/conf/flume.conf, where Flume looks for its configuration, then restart the agent:

sudo cp flume.conf /etc/flume-ng/conf/flume.conf
sudo /etc/init.d/flume-ng-agent restart

You can verify that there were no start-up errors by looking at the Flume log, /var/log/flume-ng/flume.log.
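
For example, you can scan the log for problems and confirm that the Avro source is listening on its RPC port. These are ordinary shell commands, not a required part of the lab, and the port check assumes the default RPC port 41415 from the hints:

tail -n 50 /var/log/flume-ng/flume.log
grep -i -e error -e exception /var/log/flume-ng/flume.log
sudo netstat -tlnp | grep 41415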

3. Build and run the ratings application

Build the ratings web application with Maven.

If you are using the lab VM, this repository is already downloaded for you; skip the clone step.

git clone https://github.com/rdblue/ratings-app.git ~/ratings-app
cd ~/ratings-app
mvn package

Then run the runtime jar with the hadoop command:

hadoop jar target/ratings-app-0.17.1-runtime.jar

4. Rate some movies

Once the application has started, open it in a browser and rate a few movies. The startup output below shows the address and port the application listens on.

In the terminal window where the application is running, you should see each rating printed as it is submitted:

>> Listening on 0.0.0.0:4567
...
Sending rating: {"timestamp": 1423073425248, "user_id": 34, "movie_id": 176, "rating": 5}
Sending rating: {"timestamp": 1423073571712, "user_id": 34, "movie_id": 367, "rating": 4}

5. Verify the ratings are written correctly

The Flume configuration rolls (closes) data files every 30 seconds, so new records can take up to half a minute to appear. Use Impala or the Kite show command to see the new records that Flume has added.

Remember that you can run individual Impala queries with -q, and that you will need to re-synchronize Impala’s metadata so it picks up the new partitions:

impala-shell -q 'invalidate metadata'
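
For example, assuming your dataset is named ratings and is registered in the Hive metastore as in the earlier labs (the exact name, URI, and show options may differ in your environment), the first command below counts the records Impala can see and the second prints a few records directly from the dataset:

impala-shell -q 'select count(*) from ratings'
kite-dataset show ratings --num-records 10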

Next