FCCNhviana edited this page Dec 28, 2015 · 23 revisions

This code was developed and tested in a Linux environment (Red Hat Enterprise Linux 5).

Requirements

  • Maven 2.x
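
Before starting, it can help to confirm that the required tools are on the PATH. The following is a minimal sketch, not part of the project: `check_tool` is a hypothetical helper name, and git is included only because the clone steps below use it.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: it reports whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
        return 0
    fi
    echo "missing: $1" >&2
    return 1
}

# Maven is the listed requirement; git is needed for the clone steps.
for tool in git mvn; do
    check_tool "$tool" || echo "install $tool before continuing" >&2
done

# Print the Maven version so it can be compared against 2.x by eye.
if command -v mvn >/dev/null 2>&1; then
    mvn -version || true
fi
```

The script only reports what it finds; it does not enforce a particular Maven version.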

Step-by-step

Check out Hadoop (branch-0.14):

  • git clone -b branch-0.14 https://github.com/arquivo/hadoop-common

Install Hadoop:

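  • cd hadoop-common (the directory created by the git clone above)
  • create a pom.xml file in that directory with the following content: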
 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>
   <name>hadoop</name>
   <url>http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.14</url>
   <groupId>org.apache</groupId>
   <artifactId>hadoop</artifactId>
   <version>0.14.5-dev-core</version>
   <packaging>jar</packaging>
   <build>
     <directory>build/</directory>
     <plugins>
       <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-antrun-plugin</artifactId>
         <version>1.8</version>
         <executions>
           <execution>
             <phase>compile</phase>
             <configuration>
               <target>
                 <ant target="jar" inheritRefs="true">
                   <property name="build.compiler" value="extJavac"/>
                   <property name="build.sysclasspath" value="last"/>
                 </ant>
               </target>
             </configuration>
             <goals>
               <goal>run</goal>
             </goals>
           </execution>
         </executions>
       </plugin>
     </plugins>
   </build>
 </project>
  • mvn install

This version of Hadoop (http://hadoop.apache.org/) must be used for all MapReduce processing.
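
To confirm that the Hadoop JAR was installed, one option is to look for it in the local Maven repository. The sketch below builds the repository-relative path from the groupId, artifactId, and version given in the pom.xml, following Maven's standard repository layout. `maven_artifact_path` is a hypothetical helper name, and the repository location `~/.m2/repository` is an assumption (Maven's default, which a custom settings.xml can change).

```shell
#!/bin/sh
# maven_artifact_path is a hypothetical helper. It prints the repository-relative path
#   <groupId with dots as slashes>/<artifactId>/<version>/<artifactId>-<version>.<ext>
maven_artifact_path() {
    group_dir=$(printf '%s' "$1" | tr '.' '/')
    printf '%s/%s/%s/%s-%s.%s\n' "$group_dir" "$2" "$3" "$2" "$3" "$4"
}

# Coordinates copied from the pom.xml for Hadoop.
rel=$(maven_artifact_path org.apache hadoop 0.14.5-dev-core jar)

# Assumption: the local repository is at the default location.
repo="${HOME}/.m2/repository"
if [ -f "$repo/$rel" ]; then
    echo "installed: $repo/$rel"
else
    echo "not installed yet: $repo/$rel"
fi
```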

Check out PwaLucene + PwaArchiveAccess:

  • git clone https://github.com/arquivo/pwa-technologies.git

Install PwaLucene:

  • cd pwa-technologies/PwaLucene
  • mvn install

Install PwaArchiveAccess:

  • cd pwa-technologies/PwaArchive-access (path relative to the directory where you ran git clone)
  • mvn install
  • configure (only if you need to change the default configuration)
  • mvn install (rebuild after changing the configuration)

The JAR and WAR files are available in:

  • pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-job/target/nutchwax-job-0.11.0-SNAPSHOT.jar
  • pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-webapp/target/nutchwax-webapp-0.11.0-SNAPSHOT.war
  • pwa-technologies/PwaArchive-access/projects/wayback/wayback-webapp/target/wayback-1.2.1.war
  • pwa-technologies/PwaLucene/target/pwalucene-1.0.0-SNAPSHOT.jar
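
A quick way to verify the build is to check that each of the files listed above exists. This is a minimal sketch: `check_artifacts` is a hypothetical helper name, and its argument is assumed to be the directory that contains the pwa-technologies clone.

```shell
#!/bin/sh
# check_artifacts is a hypothetical helper. It reports which expected build
# outputs exist under the given parent directory (default: current directory).
check_artifacts() {
    root=${1:-.}
    missing=0
    for f in \
        pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-job/target/nutchwax-job-0.11.0-SNAPSHOT.jar \
        pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-webapp/target/nutchwax-webapp-0.11.0-SNAPSHOT.war \
        pwa-technologies/PwaArchive-access/projects/wayback/wayback-webapp/target/wayback-1.2.1.war \
        pwa-technologies/PwaLucene/target/pwalucene-1.0.0-SNAPSHOT.jar
    do
        if [ -f "$root/$f" ]; then
            echo "ok      $f"
        else
            echo "MISSING $f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every file was found.
    [ "$missing" -eq 0 ]
}

check_artifacts . || echo "some artifacts are missing; re-run mvn install" >&2
```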

Others

Create a symbolic link to nutch for nutch-trec:

  • cd pwa-technologies/PwaArchive-access
  • ln -s ../../projects/nutchwax/nutchwax-thirdparty/nutch/ projects/nutch-trec/

This is only necessary if you plan to use the TREC datasets for tests.
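
The symbolic link step can be made repeatable and self-checking, as in the sketch below. `link_nutch_trec` is a hypothetical helper name, and its argument is assumed to be the PwaArchive-access directory. Unlike the command above, it names the link explicitly (`projects/nutch-trec/nutch`) so the result does not depend on how `ln` treats a trailing slash.

```shell
#!/bin/sh
# link_nutch_trec is a hypothetical helper. It creates the nutch symlink for
# nutch-trec if it is missing, then checks that the link resolves to a directory.
link_nutch_trec() {
    (
        cd "$1" || exit 1
        link=projects/nutch-trec/nutch
        # The target is relative to the directory holding the link (projects/nutch-trec).
        [ -L "$link" ] || ln -s ../../projects/nutchwax/nutchwax-thirdparty/nutch "$link" || exit 1
        # -d follows the link, so this fails if the target directory does not exist.
        [ -d "$link" ]
    )
}

if [ -d pwa-technologies/PwaArchive-access ]; then
    link_nutch_trec pwa-technologies/PwaArchive-access && echo "nutch-trec link OK"
fi
```

Running the helper a second time leaves an existing link untouched and only repeats the check.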