With these scripts you can scrape reviews for any category you choose from Daraz.pk, Pakistan's largest e-commerce platform. You can use the data for NLP, research, or just for fun.
The scripts need the following to be installed in your system:
This is a two-step process involving two scripts: 1_extract_products.py and 2_extract_reviews.py. Follow the instructions below (run the commands on the command line):
- Clone the repository in your system.
git clone https://github.com/sfsultan/daraz_review_data_scraping.git
- Change the working directory :
cd daraz_review_data_scraping
- Create a virtual environment:
virtualenv env
- If you have multiple Python installations, pass the interpreter you want with the -p option (for example, virtualenv -p python3 env).
- Activate the virtual environment:
- Linux :
source env/bin/activate
- Windows :
env\Scripts\activate
- Install all the dependencies:
pip install -r requirements.txt
- Run 1_extract_products.py to extract the product URLs for a particular category. The script accepts two arguments:
- category name : The name of the category.
- total pages : The total number of pages that the category has.
- example :
python 1_extract_products.py "smartphones" 56
The above command produces a csv file containing the URLs of all products in the given category.
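
To make the first step more concrete, here is a minimal sketch of how a category name and a page count could be turned into listing URLs, and how collected product URLs could be saved to a csv file. It is not the code of 1_extract_products.py. The `?page=N` URL pattern, the `url` header, and the function names `listing_page_urls` and `save_product_urls` are assumptions made for illustration.

```python
import csv
from urllib.parse import urlencode

BASE_URL = "https://www.daraz.pk"  # assumed site root


def listing_page_urls(category, total_pages):
    """Return one listing URL per page, numbered 1 to total_pages.

    The "/<category>/?page=N" pattern is an assumption, not taken from the script.
    """
    return [
        f"{BASE_URL}/{category}/?{urlencode({'page': page})}"
        for page in range(1, total_pages + 1)
    ]


def save_product_urls(product_urls, csv_path):
    """Write product URLs to a one-column csv file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url"])  # hypothetical column name
        writer.writerows([url] for url in product_urls)


if __name__ == "__main__":
    pages = listing_page_urls("smartphones", 56)
    print(pages[0])
    print(len(pages))
```

The page count is passed in explicitly, mirroring the second command-line argument, so the sketch never has to guess where a category ends.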
- Run 2_extract_reviews.py to extract all the reviews from the URLs listed in the csv file generated in the first step. This script takes only one argument:
- category name : The name of the category.
- example :
python 2_extract_reviews.py "smartphones"
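
The second step can be pictured as reading the csv file back and visiting each product URL in turn. The sketch below is an illustration under stated assumptions, not the code of 2_extract_reviews.py: it assumes the URLs sit in the first column of the csv file, and `load_product_urls`, `collect_reviews`, and the `fetch_reviews` callable are hypothetical names.

```python
import csv


def load_product_urls(csv_path):
    """Return the non-empty values from the first column of the csv file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    # Skip a header row if the first cell does not look like a URL.
    if rows and rows[0] and not rows[0][0].startswith(("http", "//")):
        rows = rows[1:]
    return [row[0].strip() for row in rows if row and row[0].strip()]


def collect_reviews(product_urls, fetch_reviews):
    """Pair every review returned by fetch_reviews(url) with its product URL.

    fetch_reviews is a hypothetical callable that takes a product URL and
    returns a list of review strings.
    """
    collected = []
    for url in product_urls:
        for review in fetch_reviews(url):
            collected.append((url, review))
    return collected
```

Passing the fetching logic in as a callable keeps the loop easy to try out with a stand-in function before pointing it at real pages.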
The scripts use two Python libraries, selenium and beautifulsoup, for the data extraction.
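
As a rough illustration of how these two libraries usually work together, the sketch below loads a page in a browser with selenium and then parses the rendered HTML with beautifulsoup. It is not the code of the scripts: the use of Chrome, the generic `a[href]` selector, and the function names are assumptions, and the selectors needed for product links or reviews on Daraz.pk may differ.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.daraz.pk/"  # assumed site root


def absolute_url(href):
    """Resolve relative or protocol-relative links against the site root."""
    return urljoin(BASE_URL, href)


def fetch_rendered_html(url):
    """Open the page in a browser and return the HTML after scripts have run."""
    # Imported lazily so the other helpers work without a browser driver.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


def extract_links(html):
    """Return every link on the page as an absolute URL."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # "a[href]" matches all anchors with an href; a real scraper would
    # narrow this to the elements that hold product links or reviews.
    return [absolute_url(anchor["href"]) for anchor in soup.select("a[href]")]
```

Selenium is useful here because it returns the page as the browser renders it, including content added by JavaScript, while beautifulsoup handles the parsing of that HTML.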
@misc{daraz-review-data-scraping-2021,
name = {Fahd Sultan},
author = {sfsultan},
title = {Extract reviews from Daraz.pk},
version = {1},
date = {2021-05-08},
type = {electronic resource}
}