This repository contains sample code to create a Facebook app that streams data to Hadoop. It includes:

- Sample PHP code for the Facebook HTTP GETs and POSTs
- Flume configuration for a Facebook HTTP source
- The Flume agent
- An INI file for the Facebook PHP code
- DDL for a Facebook Hive table

This guide assumes a working knowledge of these technical components.
- Install Apache
  a) yum install httpd
  b) vi /etc/httpd/conf/httpd.conf
  c) Uncomment the NameVirtualHost directive, and set DocumentRoot /var/www/html and ServerName www.example.com
  d) chkconfig httpd on
  e) /etc/init.d/httpd restart
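A minimal sketch of the httpd.conf edits from step c (the port and ServerName are the usual defaults; adjust to your environment):
<pre>
NameVirtualHost *:80

<VirtualHost *:80>
    DocumentRoot /var/www/html
    ServerName www.example.com
</VirtualHost>
</pre>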
- Install PHP
  a) See http://www.howtoforge.com/installing-apache2-with-php5-and-mysql-support-on-centos-5.3
  b) yum install php
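To confirm PHP is working, a quick check along the lines of the guide above (the file name info.php is arbitrary):
<pre>
echo "<?php phpinfo(); ?>" > /var/www/html/info.php
# then browse to http://your-host/info.php and confirm the PHP info page renders
</pre>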
- Install the Facebook PHP SDK
  a) mkdir -p /home/cloudera/facebook/php-sdk/facebook-phpsdk-master
  b) Download https://github.com/facebook/facebook-php-sdk
     i) wget https://github.com/facebook/facebook-php-sdk/archive/master.zip
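Then unpack the archive; the target path below is an assumption, so match it to wherever your PHP code expects the SDK:
<pre>
unzip master.zip -d /home/cloudera/facebook/php-sdk/
</pre>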
- See flume.conf for proper values
  a) Make sure you have flume-ng-core and flume-ng-sdk 1.3.0 or greater
- You also need to download and add the following missing jars:
<pre>
cp flume-ng-core-1.3.0.jar /usr/lib/flume-ng/lib/
cp /home/cloudera/facebook/apache-flume-1.3.0-bin/lib/flume-ng-sdk-1.3.0.jar /usr/lib/flume-ng/lib/
</pre>
- NOTE - the YYYYMMDDHH bucketing isn't working; disabled for now, revisit later
  a. Reason (exception):
<pre>
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Flume wasn't able to parse timestamp header in the event to resolve time based bucketing. Please check that you're correctly populating timestamp header (for example using TimestampInterceptor source interceptor).
</pre>
  b. You need to supply a timestamp in the event header as documented at http://flume.apache.org/FlumeUserGuide.html (see JSONHandler; an example of the expected POST format follows below)
  c. Modified flume.conf to make use of interceptors as documented at http://flume.apache.org/FlumeUserGuide.html
     i. Here are the new properties in flume.conf:
<pre>
SocialAgent.sources.FacebookHttpSource.interceptors = Ts
SocialAgent.sources.FacebookHttpSource.interceptors.Ts.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
</pre>
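For reference, Flume's JSONHandler expects the HTTP POST body to be a JSON array of events, each with a headers map and a string body; populating a timestamp header there is the alternative to the interceptor above. The values in this sketch are placeholders:
<pre>
[{
  "headers": {
    "timestamp": "1357157400000"
  },
  "body": "the Facebook payload as a string"
}]
</pre>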
- Modify /etc/default/flume-ng-agent for a single agent called SocialAgent, then edit /etc/flume-ng-agent/flume.conf to define Twitter and Facebook as a single agent with different channels and sinks
  a. Make sure two channels are specified, otherwise only the last channel specified will run
  b. e.g.
<pre>
SocialAgent.sources = FacebookHttpSource Twitter
SocialAgent.channels = FBmemoryChannel MemChannel
SocialAgent.sinks = fbHDFS HDFS
</pre>
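A fuller sketch of how the two flows might be wired together. The component names follow the snippet above; the channel types and capacities are illustrative assumptions, not the repo's actual flume.conf:
<pre>
# Facebook flow: HTTP source -> memory channel -> HDFS sink
SocialAgent.sources.FacebookHttpSource.channels = FBmemoryChannel
SocialAgent.sinks.fbHDFS.channel = FBmemoryChannel

# Twitter flow: Twitter source -> memory channel -> HDFS sink
SocialAgent.sources.Twitter.channels = MemChannel
SocialAgent.sinks.HDFS.channel = MemChannel

# The two channels themselves
SocialAgent.channels.FBmemoryChannel.type = memory
SocialAgent.channels.FBmemoryChannel.capacity = 10000
SocialAgent.channels.MemChannel.type = memory
SocialAgent.channels.MemChannel.capacity = 10000
</pre>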
- Create a Facebook app: https://developers.facebook.com/docs/guides/canvas/
- Create a real-time subscription for the Facebook app (see the example call below)
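A sketch of what the subscription call looked like under the Graph API real-time updates documentation of the era; the object/fields values and tokens are placeholders, so check the current Facebook docs:
<pre>
curl -X POST "https://graph.facebook.com/YOUR_APP_ID/subscriptions" \
  -d "object=page" \
  -d "fields=feed" \
  -d "callback_url=http://your-host/facebook_callback.php" \
  -d "verify_token=SOME_TOKEN" \
  -d "access_token=YOUR_APP_ACCESS_TOKEN"
</pre>
Facebook verifies the callback URL with a GET request carrying a hub.challenge parameter, which the callback must echo back.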
- Add the app to the page you care about
- Set up a callback URL
- Change the parameters in facebook.ini according to your app and callback URL (a sketch of the INI file follows below)
  a. To add an app as a tab to a Page:
  b. www.facebook.com/add.php?api_key=YOUR_APP_ID&pages=1&page=YOUR_PAGE_ID
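The key names below are illustrative guesses at what facebook.ini might contain, not the repo's actual keys; match them to what the PHP code reads:
<pre>
[facebook]
app_id       = YOUR_APP_ID
app_secret   = YOUR_APP_SECRET
callback_url = http://your-host/facebook_callback.php
verify_token = SOME_TOKEN
</pre>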
- /etc/init.d/flume-ng-agent stop
- /etc/init.d/flume-ng-agent start
- Make sure everything looks good in /var/log/flume-ng/flume.log
- Create a custom Java package to ignore Flume temp files
  You'll want to create a new Java package using the following steps. No Java programming knowledge is required; simply follow these instructions. This class must be created and packaged into a JAR so that the temporary files Flume creates while streaming tweets to HDFS can be excluded from queries.
<pre>
mkdir -p com/twitter/util
export CLASSPATH=/usr/lib/hadoop/hadoop-common-2.0.0-cdh4.1.3.jar:hadoop-common.jar   # reference the CDH 4.x version you are working with
vi com/twitter/util/FileFilterExcludeTmpFiles.java   # copy in the Java source code at the end of this posting and save it
javac com/twitter/util/FileFilterExcludeTmpFiles.java
jar cf TwitterUtil.jar com
cp TwitterUtil.jar /usr/lib/hadoop
</pre>
- Remove the wait condition from the Oozie job configuration
  Open coord-app.xml (in the location where you placed the oozie-workflows folder) and remove the tags shown below. This is extremely important in making the tutorial as close to real time as possible. The default Oozie workflow defines a readyIndicator that acts as a wait event: it instructs the workflow to create a new partition only after a full hour completes. If you leave the configuration as-is, there can be a lag of up to one hour between tweets arriving and the tweets becoming queryable. The reason for this default is that the tutorial did not define the custom JAR we built and deployed for Hive, which instructs MapReduce to omit temporary Flume files. Because we deployed that custom package above, we do not have to wait for a full hour to complete before querying tweets.
  Remove the readyIndicator's instance element (and the readyIndicator tags that enclose it):
<pre>
<instance>${coord:current(1 + (coord:tzOffset() / 60))}</instance>
</pre>
- Restart the Oozie workflow
<pre>
sudo -u hdfs oozie job -oozie http://localhost:11000/oozie -kill <job-id>
sudo -u hdfs oozie job -oozie http://localhost:11000/oozie -config /home/oozie_lib/oozie-workflows/job.properties -run
</pre>
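To find the coordinator job ID to pass to -kill, you can list jobs with the standard Oozie CLI; the -jobtype flag narrows the listing to coordinators:
<pre>
oozie jobs -oozie http://localhost:11000/oozie -jobtype coordinator
</pre>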
- Modify the Hive configuration file to use the package built above to ignore Flume temp files
  Edit /etc/hive/conf/hive-site.xml and add the following properties. The first ensures that you won't have to add the JSON SerDe package and the new custom package that excludes Flume temporary files in each Hive session; it becomes part of the overall Hive configuration available to every session. The second tells MapReduce the class name and location of the new Java class we created and compiled above (via mapred.input.pathFilter.class, the standard MapReduce input path-filter setting).
<pre>
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value>
</property>
<property>
  <name>mapred.input.pathFilter.class</name>
  <value>com.twitter.util.FileFilterExcludeTmpFiles</value>
</property>
</pre>
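If you prefer not to touch hive-site.xml, the per-session equivalent (which the first property is meant to save you from typing) would be roughly:
<pre>
hive> ADD JAR /usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar;
hive> ADD JAR /usr/lib/hadoop/TwitterUtil.jar;
hive> SET mapred.input.pathFilter.class=com.twitter.util.FileFilterExcludeTmpFiles;
</pre>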
Sample Code
Use the following Java code (many thanks to the contributors to the CDH Google group for the working example):
<pre>
package com.twitter.util;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class FileFilterExcludeTmpFiles implements PathFilter {
    public boolean accept(Path p) {
        // Exclude Flume's in-flight files: underscore- and dot-prefixed
        // names, and files still carrying the .tmp suffix.
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".") && !name.endsWith(".tmp");
    }
}
</pre>
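A hypothetical quick check of the filter logic (not part of the original repo; compile it alongside the class above with hadoop-common on the classpath):
<pre>
package com.twitter.util;

import org.apache.hadoop.fs.Path;

// Hypothetical helper, not in the original posting: sanity-checks the filter.
public class FileFilterCheck {
    public static void main(String[] args) {
        FileFilterExcludeTmpFiles filter = new FileFilterExcludeTmpFiles();
        // An in-flight Flume file should be rejected...
        System.out.println(filter.accept(new Path("/user/flume/tweets/FlumeData.123.tmp"))); // false
        // ...while a completed file should be accepted.
        System.out.println(filter.accept(new Path("/user/flume/tweets/FlumeData.123")));     // true
    }
}
</pre>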