Lab 6 Solution: Create a Flume pipeline
This is the solution page for Lab 6: Create a Flume pipeline.
Steps
1. Create a Flume configuration
The flume-config
command will produce a working configuration with the following options:
ratings
– the target dataset--agent agent
– configure the agent named agent--channel-type memory
– hold ratings in memory while waiting for them to be written
kite-dataset flume-config ratings --agent agent --channel-type memory --output flume.conf
2. Update Flume’s configuration and restart
Replace the /etc/flume-ng/conf/flume.conf
file, where Flume will look for its configuration.
sudo cp flume.conf /etc/flume-ng/conf/flume.conf
sudo /etc/init.d/flume-ng-agent restart
Flume’s log (/var/log/flume-ng/flume.log
) should show successful start-up messages for the source and sink:
...
INFO [thread-1] (o.a.f.s.k.DatasetSink.start:191) - Started DatasetSink dataset:hive://127.0.0.1:9083/default/default/ratings
...
INFO [thread-3] (o.a.f.s.AvroSource.start:253) - Avro source avro-event-source started.
...
3. Build and run the ratings application
cd /home/cloudera/ratings-app
mvn package
hadoop jar target/ratings-app-0.17.1-runtime.jar
4. Rate some movies
Use the running app to rate a few movies.
You will see the ratings show in the terminal window as they are submitted:
>> Listening on 0.0.0.0:4567
...
Sending rating: {"timestamp": 1423073425248, "user_id": 34, "movie_id": 176, "rating": 5}
Sending rating: {"timestamp": 1423073571712, "user_id": 34, "movie_id": 367, "rating": 4}
5. Verify the ratings are written correctly
To use Impala, re-sync the metadata and select ratings from this year:
impala-shell -q 'invalidate metadata'
impala-shell -q 'select * from ratings where year=2015'
Query: select * from ratings where year=2015
+---------------+---------+----------+--------+------+-------+
| timestamp | user_id | movie_id | rating | year | month |
+---------------+---------+----------+--------+------+-------+
| 1423073571712 | 34 | 367 | 4 | 2015 | 2 |
| 1423073425248 | 34 | 176 | 5 | 2015 | 2 |
+---------------+---------+----------+--------+------+-------+
Fetched 2 row(s) in 0.61s
To use Kite, you can pass a view URI to the show command:
kite-dataset show view:hive:ratings?year=2015
{"timestamp": 1423073571712, "user_id": 34, "movie_id": 367, "rating": 4}
{"timestamp": 1423073425248, "user_id": 34, "movie_id": 176, "rating": 5}
Next
- Back to the lab
- Move on to the next lab: Analyze ratings with Crunch