This repository contains sample code to create a Facebook app that streams data to Hadoop. It includes:

- Sample PHP code for the Facebook HTTP GETs and POSTs
- Flume configuration for a Facebook HTTP source
- The Flume agent
- An INI file for the Facebook PHP code
- DDL for a Facebook Hive table

This guide assumes a working knowledge of these technical components.
- Install Apache
  a) yum install httpd
  b) vi /etc/httpd/conf/httpd.conf
  c) Uncomment the NameVirtualHost directive, and set DocumentRoot /var/www/html and ServerName www.example.com
  d) chkconfig httpd on
  e) /etc/init.d/httpd restart
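A minimal sketch of the httpd.conf edits from step c (the port and ServerName are the usual defaults; adjust to your environment):
<pre>
NameVirtualHost *:80

<VirtualHost *:80>
    DocumentRoot /var/www/html
    ServerName www.example.com
</VirtualHost>
</pre>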
- Install PHP
  a) See http://www.howtoforge.com/installing-apache2-with-php5-and-mysql-support-on-centos-5.3
  b) yum install php
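To confirm PHP is working, a quick check along the lines of the guide above (the file name info.php is arbitrary):
<pre>
echo "<?php phpinfo(); ?>" > /var/www/html/info.php
# then browse to http://your-host/info.php and confirm the PHP info page renders
</pre>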
- Install the Facebook PHP SDK
  a) mkdir -p /home/cloudera/facebook/php-sdk/facebook-phpsdk-master
  b) Download https://github.com/facebook/facebook-php-sdk
     i) wget https://github.com/facebook/facebook-php-sdk/archive/master.zip
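Then unpack the archive; the target path below is an assumption, so match it to wherever your PHP code expects the SDK:
<pre>
unzip master.zip -d /home/cloudera/facebook/php-sdk/
</pre>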
- See flume.conf for proper values
  a) Make sure you have flume-ng-core and flume-ng-sdk 1.3.0 or greater
- You also need to download and add the following missing jars:
<pre>
cp flume-ng-core-1.3.0.jar /usr/lib/flume-ng/lib/
cp /home/cloudera/facebook/apache-flume-1.3.0-bin/lib/flume-ng-sdk-1.3.0.jar /usr/lib/flume-ng/lib/
</pre>
- NOTE - the YYYYMMDDHH bucketing isn't working; disabled for now, revisit later
  a. Reason (exception):
<pre>
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Flume wasn't able to parse timestamp header in the event to resolve time based bucketing. Please check that you're correctly populating timestamp header (for example using TimestampInterceptor source interceptor).
</pre>
  b. You need to supply a timestamp in the event header as documented at http://flume.apache.org/FlumeUserGuide.html (see JSONHandler; an example of the expected POST format follows below)
  c. Modified flume.conf to make use of interceptors as documented at http://flume.apache.org/FlumeUserGuide.html
     i. Here are the new properties in flume.conf:
<pre>
SocialAgent.sources.FacebookHttpSource.interceptors = Ts
SocialAgent.sources.FacebookHttpSource.interceptors.Ts.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
</pre>
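For reference, Flume's JSONHandler expects the HTTP POST body to be a JSON array of events, each with a headers map and a string body; populating a timestamp header there is the alternative to the interceptor above. The values in this sketch are placeholders:
<pre>
[{
  "headers": {
    "timestamp": "1357157400000"
  },
  "body": "the Facebook payload as a string"
}]
</pre>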
- Modify /etc/default/flume-ng-agent for a single agent called SocialAgent, then edit /etc/flume-ng-agent/flume.conf to define Twitter and Facebook as a single agent with different channels and sinks
  a. Make sure two channels are specified, otherwise only the last channel specified will run
  b. e.g.
<pre>
SocialAgent.sources = FacebookHttpSource Twitter
SocialAgent.channels = FBmemoryChannel MemChannel
SocialAgent.sinks = fbHDFS HDFS
</pre>
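A fuller sketch of how the two flows might be wired together. The component names follow the snippet above; the channel types and capacities are illustrative assumptions, not the repo's actual flume.conf:
<pre>
# Facebook flow: HTTP source -> memory channel -> HDFS sink
SocialAgent.sources.FacebookHttpSource.channels = FBmemoryChannel
SocialAgent.sinks.fbHDFS.channel = FBmemoryChannel

# Twitter flow: Twitter source -> memory channel -> HDFS sink
SocialAgent.sources.Twitter.channels = MemChannel
SocialAgent.sinks.HDFS.channel = MemChannel

# The two channels themselves
SocialAgent.channels.FBmemoryChannel.type = memory
SocialAgent.channels.FBmemoryChannel.capacity = 10000
SocialAgent.channels.MemChannel.type = memory
SocialAgent.channels.MemChannel.capacity = 10000
</pre>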
- Create a Facebook app: https://developers.facebook.com/docs/guides/canvas/
- Create a real-time subscription for the Facebook app (see the example call below)
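A sketch of what the subscription call looked like under the Graph API real-time updates documentation of the era; the object/fields values and tokens are placeholders, so check the current Facebook docs:
<pre>
curl -X POST "https://graph.facebook.com/YOUR_APP_ID/subscriptions" \
  -d "object=page" \
  -d "fields=feed" \
  -d "callback_url=http://your-host/facebook_callback.php" \
  -d "verify_token=SOME_TOKEN" \
  -d "access_token=YOUR_APP_ACCESS_TOKEN"
</pre>
Facebook verifies the callback URL with a GET request carrying a hub.challenge parameter, which the callback must echo back.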
- Add the app to the page you care about
- Set up a callback URL
- Change the parameters in facebook.ini according to your app and callback URL (a sketch of the INI file follows below)
  a. To add an app as a tab to a Page:
  b. www.facebook.com/add.php?api_key=YOUR_APP_ID&pages=1&page=YOUR_PAGE_ID
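The key names below are illustrative guesses at what facebook.ini might contain, not the repo's actual keys; match them to what the PHP code reads:
<pre>
[facebook]
app_id       = YOUR_APP_ID
app_secret   = YOUR_APP_SECRET
callback_url = http://your-host/facebook_callback.php
verify_token = SOME_TOKEN
</pre>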
- /etc/init.d/flume-ng-agent stop
- /etc/init.d/flume-ng-agent start
- Make sure everything looks good in /var/log/flume-ng/flume.log
- Create a custom Java package to ignore Flume temp files
  You'll want to create a new Java package using the following steps. No Java programming knowledge is required; simply follow these instructions. This class must be created and packaged into a JAR so that the temporary files Flume creates while streaming tweets to HDFS can be excluded from queries.
<pre>
mkdir -p com/twitter/util
export CLASSPATH=/usr/lib/hadoop/hadoop-common-2.0.0-cdh4.1.3.jar:hadoop-common.jar   # reference the CDH 4.x version you are working with
vi com/twitter/util/FileFilterExcludeTmpFiles.java   # copy in the Java source code at the end of this posting and save it
javac com/twitter/util/FileFilterExcludeTmpFiles.java
jar cf TwitterUtil.jar com
cp TwitterUtil.jar /usr/lib/hadoop
</pre>
- Remove the wait condition from the Oozie job configuration
  Open coord-app.xml (in the location where you placed the oozie-workflows folder) and remove the tags shown below. This is extremely important in making the tutorial as close to real time as possible. The default Oozie workflow defines a readyIndicator that acts as a wait event: it instructs the workflow to create a new partition only after a full hour completes. If you leave the configuration as-is, there can be a lag of up to one hour between tweets arriving and the tweets becoming queryable. The reason for this default is that the tutorial did not define the custom JAR we built and deployed for Hive, which instructs MapReduce to omit temporary Flume files. Because we deployed that custom package above, we do not have to wait for a full hour to complete before querying tweets.
  Remove the readyIndicator's instance element (and the readyIndicator tags that enclose it):
<pre>
<instance>${coord:current(1 + (coord:tzOffset() / 60))}</instance>
</pre>
- Restart the Oozie workflow
<pre>
sudo -u hdfs oozie job -oozie http://localhost:11000/oozie -kill <job-id>
sudo -u hdfs oozie job -oozie http://localhost:11000/oozie -config /home/oozie_lib/oozie-workflows/job.properties -run
</pre>
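To find the coordinator job ID to pass to -kill, you can list jobs with the standard Oozie CLI; the -jobtype flag narrows the listing to coordinators:
<pre>
oozie jobs -oozie http://localhost:11000/oozie -jobtype coordinator
</pre>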
- Modify the Hive configuration file to use the package built above to ignore Flume temp files
  Edit /etc/hive/conf/hive-site.xml and add the following properties. The first ensures that you won't have to add the JSON SerDe package and the new custom package that excludes Flume temporary files in each Hive session; it becomes part of the overall Hive configuration available to every session. The second tells MapReduce the class name and location of the new Java class we created and compiled above (via mapred.input.pathFilter.class, the standard MapReduce input path-filter setting).
<pre>
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value>
</property>
<property>
  <name>mapred.input.pathFilter.class</name>
  <value>com.twitter.util.FileFilterExcludeTmpFiles</value>
</property>
</pre>
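If you prefer not to touch hive-site.xml, the per-session equivalent (which the first property is meant to save you from typing) would be roughly:
<pre>
hive> ADD JAR /usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar;
hive> ADD JAR /usr/lib/hadoop/TwitterUtil.jar;
hive> SET mapred.input.pathFilter.class=com.twitter.util.FileFilterExcludeTmpFiles;
</pre>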
Sample Code
Use the following Java code (many thanks to the contributors to the CDH Google group for the working example):
<pre>
package com.twitter.util;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class FileFilterExcludeTmpFiles implements PathFilter {
    public boolean accept(Path p) {
        // Exclude Flume's in-flight files: underscore- and dot-prefixed
        // names, and files still carrying the .tmp suffix.
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".") && !name.endsWith(".tmp");
    }
}
</pre>
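A hypothetical quick check of the filter logic (not part of the original repo; compile it alongside the class above with hadoop-common on the classpath):
<pre>
package com.twitter.util;

import org.apache.hadoop.fs.Path;

// Hypothetical helper, not in the original posting: sanity-checks the filter.
public class FileFilterCheck {
    public static void main(String[] args) {
        FileFilterExcludeTmpFiles filter = new FileFilterExcludeTmpFiles();
        // An in-flight Flume file should be rejected...
        System.out.println(filter.accept(new Path("/user/flume/tweets/FlumeData.123.tmp"))); // false
        // ...while a completed file should be accepted.
        System.out.println(filter.accept(new Path("/user/flume/tweets/FlumeData.123")));     // true
    }
}
</pre>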