Welcome to the Spark Analysis repository, which analyzes Walmart's stock data from 2012 to 2017. The project offers practical, hands-on experience with Apache Spark DataFrames through a series of questions and exercises on arithmetic and logical operations.
Apache Spark is an engine for large-scale SQL, batch processing, stream processing, and machine learning. Using five years of Walmart stock data, the exercises cover fundamental DataFrame operations, data manipulation techniques, and basic analytics.
- Data Exploration: Understand the structure and characteristics of the dataset.
- Arithmetic Operations: Perform calculations and aggregations to derive insights.
- Logical Operations: Apply logical operations to filter and refine the data analysis.
- Question-Based Learning: Each exercise is framed as a question to guide your analysis.
- Dataset
  - The dataset consists of Walmart's stock prices from 2012 to 2017. It includes columns like Date, Open, High, Low, Close, Volume, and Adjusted Close.
- Prerequisites
  - Apache Spark (preferably the latest version)
  - Basic knowledge of Python and SQL
- Installation and Setup
  - Clone the Repository
git clone https://github.com/uannabi/PySparkExercise.git
cd PySparkExercise
The exercises are designed as Jupyter notebooks that you can run in your Spark environment. Ensure you have Jupyter installed and configured for use with Spark.
This repository aims to provide a foundational understanding of Spark DataFrames through practical exercises. Contributions, suggestions, and improvements are welcome: feel free to fork the repository and submit a pull request.