This project demonstrates how to set up a Hadoop HDFS cluster for distributed file storage and integrate it with Amazon S3 to leverage scalable cloud storage. The project includes steps for setting up Hadoop, configuring it to work with S3, and performing file operations.
Objectives:

- Set up a Hadoop HDFS cluster for distributed file storage.
- Integrate HDFS with Amazon S3 for scalable cloud storage.
- Perform file operations in both HDFS and S3.
- Understand the architecture and components of HDFS and S3.
Prerequisites:

- Java (JDK 1.8 or later)
- Hadoop (version 3.4.0)
- AWS CLI
- AWS S3 bucket
- Internet connection
Technologies Used:

- Hadoop HDFS
- Amazon S3
- AWS CLI (Command Line Interface)
- Hadoop Ecosystem Tools (HDFS Shell Commands)
Hadoop Setup:

- Install Hadoop: Download Hadoop from the official Apache Hadoop website and follow the installation guide for your operating system.
- Configure Hadoop: Edit the Hadoop configuration files (`core-site.xml`, `hdfs-site.xml`) to set up the HDFS cluster and the S3A connection to your bucket.
  - `core-site.xml`:

    ```xml
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <value>YourAccessKey</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>YourSecretKey</value>
      </property>
      <property>
        <name>fs.s3a.endpoint</name>
        <value>s3.amazonaws.com</value>
      </property>
      <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
      </property>
    </configuration>
    ```
  - `hdfs-site.xml`:

    ```xml
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/hadoop/hdfs/datanode</value>
      </property>
    </configuration>
    ```
- Start HDFS: Format the namenode and start the HDFS services:

  ```bash
  hdfs namenode -format
  start-dfs.sh
  ```
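After starting the services, it is worth confirming that the daemons are actually running before moving on. A quick sanity check for the single-node layout configured above:

```bash
# Java daemons started by start-dfs.sh; expect NameNode, DataNode, and SecondaryNameNode
jps

# Report cluster capacity and the number of live datanodes
hdfs dfsadmin -report
```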
AWS and S3 Setup:

- Install AWS CLI: Follow the installation guide for the AWS CLI.
- Configure AWS CLI: Configure the AWS CLI with your AWS credentials:

  ```bash
  aws configure
  ```

- Set Up S3 Bucket: Create an S3 bucket using the AWS Management Console or the CLI:

  ```bash
  aws s3 mb s3://your-bucket-name
  ```
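Once the bucket exists, a quick check that both access paths work can save debugging later. This assumes the `fs.s3a.*` keys in `core-site.xml` and the `aws configure` profile point at the same account, and reuses the `your-bucket-name` placeholder:

```bash
# The bucket should be listable via the AWS CLI...
aws s3 ls s3://your-bucket-name/

# ...and via Hadoop's S3A connector, which exercises the fs.s3a.* settings in core-site.xml
hadoop fs -ls s3a://your-bucket-name/
```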
Transfer Data Between HDFS and S3 (for bulk transfers, see the DistCp sketch after this list):

- Copy a file from HDFS to S3:

  ```bash
  hdfs dfs -cp /user/sowrabh/hadoopfiles/input.txt s3a://distributed-data-storage/Input-files/
  ```

- Copy a file from S3 to HDFS:

  ```bash
  hdfs dfs -cp s3a://distributed-data-storage/Input-files/input.txt /user/sowrabh/hadoopfiles/
  ```
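The `hdfs dfs -cp` commands above copy one file at a time through a single client. For larger directories, Hadoop's DistCp tool performs the copy as a MapReduce job (distributed when a YARN cluster is available). A minimal sketch reusing the bucket and HDFS paths from the examples above; the destination directory names are illustrative:

```bash
# Bulk-copy an HDFS directory into S3 with DistCp
hadoop distcp /user/sowrabh/hadoopfiles s3a://distributed-data-storage/hadoopfiles-backup

# Bulk-copy it back from S3 into HDFS
hadoop distcp s3a://distributed-data-storage/hadoopfiles-backup /user/sowrabh/hadoopfiles-restored
```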
HDFS File Operations:

- List files in HDFS:

  ```bash
  hdfs dfs -ls /
  ```

- Create a directory in HDFS:

  ```bash
  hdfs dfs -mkdir /user/yourname
  ```

- Upload a file to HDFS:

  ```bash
  hdfs dfs -put localfile.txt /user/yourname/
  ```

- Download a file from HDFS:

  ```bash
  hdfs dfs -get /user/yourname/localfile.txt localfile.txt
  ```
S3 File Operations:

- List files in S3:

  ```bash
  aws s3 ls s3://your-bucket-name/
  ```

- Upload a file to S3:

  ```bash
  aws s3 cp localfile.txt s3://your-bucket-name/localfile.txt
  ```

- Download a file from S3:

  ```bash
  aws s3 cp s3://your-bucket-name/localfile.txt localfile.txt
  ```
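Putting the pieces together, the sketch below walks a small file from the local disk into HDFS, across to S3, and back out again. The file name `sample.txt`, the `/user/yourname/demo` directory, and the `demo/` prefix are illustrative placeholders, and `your-bucket-name` is the bucket created earlier:

```bash
#!/usr/bin/env bash
# End-to-end sketch: local file -> HDFS -> S3 -> back to local disk.
set -euo pipefail

echo "hello distributed storage" > sample.txt        # create a small local test file
hdfs dfs -mkdir -p /user/yourname/demo               # working directory in HDFS
hdfs dfs -put -f sample.txt /user/yourname/demo/     # local -> HDFS
hdfs dfs -cp -f /user/yourname/demo/sample.txt s3a://your-bucket-name/demo/   # HDFS -> S3 via the S3A connector
aws s3 ls s3://your-bucket-name/demo/                # confirm the object landed in S3
aws s3 cp s3://your-bucket-name/demo/sample.txt sample-from-s3.txt            # S3 -> local via the AWS CLI
```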
Troubleshooting:

- ClassNotFoundException: Ensure that `hadoop-aws-3.4.0.jar` and `aws-java-sdk-bundle-1.11.1026.jar` are placed in the `$HADOOP_HOME/share/hadoop/tools/lib/` directory, then add the following to `hadoop-env.sh`:

  ```bash
  export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
  ```
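If the error persists after adding the export and restarting HDFS, one way to confirm that the jars are actually visible is to inspect Hadoop's effective classpath. A quick check, assuming the jar locations above:

```bash
# Expand Hadoop's effective classpath, one entry per line,
# and look for the S3A connector and the bundled AWS SDK
hadoop classpath --glob | tr ':' '\n' | grep -Ei 'hadoop-aws|aws-java-sdk'
```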
This project demonstrates how to set up a distributed data storage system using Hadoop HDFS and integrate it with Amazon S3 for scalable cloud storage. By following these steps, you can efficiently manage and transfer large volumes of data between on-premises and cloud storage solutions.
This project is licensed under the MIT License - see the LICENSE file for details.