This project demonstrates how to set up a Hadoop HDFS cluster for distributed file storage and integrate it with Amazon S3 to leverage scalable cloud storage. The project includes steps for setting up Hadoop, configuring it to work with S3, and performing file operations.
Objectives:

- Set up a Hadoop HDFS cluster for distributed file storage.
- Integrate HDFS with Amazon S3 for scalable cloud storage.
- Perform file operations in both HDFS and S3.
- Understand the architecture and components of HDFS and S3.
Prerequisites:

- Java (JDK 1.8 or later)
- Hadoop (version 3.4.0)
- AWS CLI
- AWS S3 bucket
- Internet connection
Technologies Used:

- Hadoop HDFS
- Amazon S3
- AWS CLI (Command Line Interface)
- Hadoop Ecosystem Tools (HDFS Shell Commands)
Hadoop Setup:

- Install Hadoop: Download Hadoop from the official Apache Hadoop website and follow the installation guide for your operating system.
- Configure Hadoop: Edit the Hadoop configuration files (`core-site.xml`, `hdfs-site.xml`) to set up the HDFS cluster and the S3A connection to your bucket.
  - `core-site.xml`:

    ```xml
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <value>YourAccessKey</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>YourSecretKey</value>
      </property>
      <property>
        <name>fs.s3a.endpoint</name>
        <value>s3.amazonaws.com</value>
      </property>
      <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
      </property>
    </configuration>
    ```
  - `hdfs-site.xml`:

    ```xml
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/hadoop/hdfs/datanode</value>
      </property>
    </configuration>
    ```
- Start HDFS: Format the namenode and start the HDFS services:

  ```bash
  hdfs namenode -format
  start-dfs.sh
  ```
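After starting the services, it is worth confirming that the daemons are actually running before moving on. A quick sanity check for the single-node layout configured above:

```bash
# Java daemons started by start-dfs.sh; expect NameNode, DataNode, and SecondaryNameNode
jps

# Report cluster capacity and the number of live datanodes
hdfs dfsadmin -report
```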
AWS and S3 Setup:

- Install AWS CLI: Follow the installation guide for the AWS CLI.
- Configure AWS CLI: Configure the AWS CLI with your AWS credentials:

  ```bash
  aws configure
  ```

- Set Up S3 Bucket: Create an S3 bucket using the AWS Management Console or the CLI:

  ```bash
  aws s3 mb s3://your-bucket-name
  ```
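Once the bucket exists, a quick check that both access paths work can save debugging later. This assumes the `fs.s3a.*` keys in `core-site.xml` and the `aws configure` profile point at the same account, and reuses the `your-bucket-name` placeholder:

```bash
# The bucket should be listable via the AWS CLI...
aws s3 ls s3://your-bucket-name/

# ...and via Hadoop's S3A connector, which exercises the fs.s3a.* settings in core-site.xml
hadoop fs -ls s3a://your-bucket-name/
```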
Transfer Data Between HDFS and S3 (for bulk transfers, see the DistCp sketch after this list):

- Copy a file from HDFS to S3:

  ```bash
  hdfs dfs -cp /user/sowrabh/hadoopfiles/input.txt s3a://distributed-data-storage/Input-files/
  ```

- Copy a file from S3 to HDFS:

  ```bash
  hdfs dfs -cp s3a://distributed-data-storage/Input-files/input.txt /user/sowrabh/hadoopfiles/
  ```
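The `hdfs dfs -cp` commands above copy one file at a time through a single client. For larger directories, Hadoop's DistCp tool performs the copy as a MapReduce job (distributed when a YARN cluster is available). A minimal sketch reusing the bucket and HDFS paths from the examples above; the destination directory names are illustrative:

```bash
# Bulk-copy an HDFS directory into S3 with DistCp
hadoop distcp /user/sowrabh/hadoopfiles s3a://distributed-data-storage/hadoopfiles-backup

# Bulk-copy it back from S3 into HDFS
hadoop distcp s3a://distributed-data-storage/hadoopfiles-backup /user/sowrabh/hadoopfiles-restored
```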
HDFS File Operations:

- List files in HDFS:

  ```bash
  hdfs dfs -ls /
  ```

- Create a directory in HDFS:

  ```bash
  hdfs dfs -mkdir /user/yourname
  ```

- Upload a file to HDFS:

  ```bash
  hdfs dfs -put localfile.txt /user/yourname/
  ```

- Download a file from HDFS:

  ```bash
  hdfs dfs -get /user/yourname/localfile.txt localfile.txt
  ```
S3 File Operations:

- List files in S3:

  ```bash
  aws s3 ls s3://your-bucket-name/
  ```

- Upload a file to S3:

  ```bash
  aws s3 cp localfile.txt s3://your-bucket-name/localfile.txt
  ```

- Download a file from S3:

  ```bash
  aws s3 cp s3://your-bucket-name/localfile.txt localfile.txt
  ```
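Putting the pieces together, the sketch below walks a small file from the local disk into HDFS, across to S3, and back out again. The file name `sample.txt`, the `/user/yourname/demo` directory, and the `demo/` prefix are illustrative placeholders, and `your-bucket-name` is the bucket created earlier:

```bash
#!/usr/bin/env bash
# End-to-end sketch: local file -> HDFS -> S3 -> back to local disk.
set -euo pipefail

echo "hello distributed storage" > sample.txt        # create a small local test file
hdfs dfs -mkdir -p /user/yourname/demo               # working directory in HDFS
hdfs dfs -put -f sample.txt /user/yourname/demo/     # local -> HDFS
hdfs dfs -cp -f /user/yourname/demo/sample.txt s3a://your-bucket-name/demo/   # HDFS -> S3 via the S3A connector
aws s3 ls s3://your-bucket-name/demo/                # confirm the object landed in S3
aws s3 cp s3://your-bucket-name/demo/sample.txt sample-from-s3.txt            # S3 -> local via the AWS CLI
```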
Troubleshooting:

- ClassNotFoundException: Ensure that `hadoop-aws-3.4.0.jar` and `aws-java-sdk-bundle-1.11.1026.jar` are placed in the `$HADOOP_HOME/share/hadoop/tools/lib/` directory, then add the following to `hadoop-env.sh`:

  ```bash
  export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
  ```
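If the error persists after adding the export and restarting HDFS, one way to confirm that the jars are actually visible is to inspect Hadoop's effective classpath. A quick check, assuming the jar locations above:

```bash
# Expand Hadoop's effective classpath, one entry per line,
# and look for the S3A connector and the bundled AWS SDK
hadoop classpath --glob | tr ':' '\n' | grep -Ei 'hadoop-aws|aws-java-sdk'
```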
This project demonstrates how to set up a distributed data storage system using Hadoop HDFS and integrate it with Amazon S3 for scalable cloud storage. By following these steps, you can efficiently manage and transfer large volumes of data between on-premises and cloud storage solutions.
This project is licensed under the MIT License - see the LICENSE file for details.