- Scrape basic metadata with ratings and reviews
- Scrape all or specific brands
- Scrape unlocked, locked, or both cell phones
- Use multiple Puppeteer pages as workers
Read more on personalizing settings in the configuration section.
You can download pre-scraped datasets at Kaggle.
- puppeteer for browser-based scraping
- prettier for formatting source code
- ts-node for running TypeScript scripts
- Make sure the dependencies are downloaded by running npm install or yarn.
- (Optional) Copy config.default.ts to config.ts (this file is ignored by git) and customize the config variables in config.ts.
- Open the project directory in Visual Studio Code.
- Select and execute Scrape Search Results in the launch options on the Debug tab (exported to ./data/yyyymmdd-results.csv).
- Then select and execute Scrape Item Reviews (exported to ./data/yyyymmdd-reviews.csv).
- Run npm run scrape:items or yarn scrape:items first to scrape initial item results (exported to ./data/yyyymmdd-results.csv).
- Then run npm run scrape:reviews or yarn scrape:reviews to scrape item reviews (exported to ./data/yyyymmdd-reviews.csv).
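The export paths above follow a date-stamped pattern. A minimal sketch of how such a path could be built is shown below; `datedFileName` is a hypothetical helper, not a function from this project, and the use of UTC for the date is an assumption (the scraper may well use local time).

```typescript
// Hypothetical helper: builds a path like ./data/yyyymmdd-results.csv.
// Assumption: the date stamp is taken in UTC; the project may use local time.
function datedFileName(kind: 'results' | 'reviews', date: Date = new Date()): string {
  // Left-pad a number to two digits, e.g. 5 -> "05".
  const pad = (n: number): string => ('0' + n).slice(-2);
  const yyyy = String(date.getUTCFullYear());
  const mm = pad(date.getUTCMonth() + 1); // getUTCMonth() is zero-based
  const dd = pad(date.getUTCDate());
  return './data/' + yyyy + mm + dd + '-' + kind + '.csv';
}
```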
- scrape:items
  Scrapes and saves entry results for review scraping.
- scrape:reviews
  Scrapes and saves entry reviews based on scrape:items data.
- format
  Formats all .ts files.
- format:data
  Formats .json files in /data.
- brands - string[]
  Self-explanatory.
  Defaults to ten major phone manufacturers. Set to [] (empty array) to disable brand filtering and select all available brands. Note that when all brands are selected, items are not assigned a brand; this may be implemented in a future version.
- brandKeywords - {brand: string, keywords: string[]}[]
  Alternative brand names or keywords used for brand assignment.
  Since the search page does not explicitly state an item's brand, the brand is determined after scraping by comparing each item's URL and title against the brands and brandKeywords values.
- categories - 'unlocked' | 'locked' | 'both'
  Also self-explanatory.
  Whether to scrape unlocked, locked, or both categories. If both, workers scrape unlocked results first, then locked results.
- numberOfWorkers - number
  Number of active 'workers' (pages) to use for scraping.
  Note that Amazon's servers treat too many requests or workers as unusual traffic and return a captcha page instead of the intended result page.
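To illustrate what numberOfWorkers controls, here is a minimal sketch of a fixed-size worker pool. `runWithWorkers` and its `task` callback are hypothetical names, not this project's code; in a Puppeteer setup one would presumably give each worker its own page (for example one created with `browser.newPage()`) and have `task` load a URL in it.

```typescript
// Hypothetical sketch: process `items` with at most `numberOfWorkers`
// tasks in flight at any time. Not the project's actual implementation.
async function runWithWorkers<T, R>(
  items: T[],
  numberOfWorkers: number,
  task: (item: T, workerId: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item, shared by all workers

  const worker = async (workerId: number): Promise<void> => {
    while (next < items.length) {
      // Claiming is synchronous, so two workers never take the same item.
      const index = next++;
      results[index] = await task(items[index], workerId);
    }
  };

  // Never start more workers than there are items, and always at least one.
  const count = Math.max(1, Math.min(numberOfWorkers, items.length));
  const running: Promise<void>[] = [];
  for (let id = 0; id < count; id++) {
    running.push(worker(id));
  }
  await Promise.all(running);
  return results; // same order as `items`
}
```

Keeping the pool size small is the point of the setting: fewer concurrent pages means fewer simultaneous requests, which lowers the chance of hitting the captcha page described above.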
{
  brands: [
    'ASUS',
    'Apple',
    'Google',
    'HUAWEI',
    'Motorola',
    'Nokia',
    'OnePlus',
    'Samsung',
    'Sony',
    'Xiaomi',
  ],
  brandKeywords: [
    { brand: 'Apple', keywords: ['iPhone'] },
    { brand: 'Google', keywords: ['Pixel'] },
    { brand: 'HUAWEI', keywords: ['Honor'] },
    { brand: 'Motorola', keywords: ['Moto'] },
    { brand: 'Samsung', keywords: ['Haven'] },
    { brand: 'Sony', keywords: ['Xperia'] },
  ],
  categories: 'both',
  numberOfWorkers: 8,
}
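As a rough illustration of how brands and brandKeywords could be used together for brand assignment, the sketch below matches an item's title and URL against each brand name and its keywords. `assignBrand` and `BrandKeywords` are hypothetical names, and case-insensitive substring matching is an assumption; the project's actual matching rules may differ.

```typescript
interface BrandKeywords {
  brand: string;
  keywords: string[];
}

// Hypothetical sketch: returns the first brand whose name or keywords
// appear in the item's title or URL, or undefined if none match.
// Assumption: matching is a case-insensitive substring test.
function assignBrand(
  title: string,
  url: string,
  brands: string[],
  brandKeywords: BrandKeywords[],
): string | undefined {
  const haystack = (title + ' ' + url).toLowerCase();
  for (const brand of brands) {
    // Collect the brand name plus any alternative keywords configured for it.
    const candidates = [brand];
    for (const entry of brandKeywords) {
      if (entry.brand === brand) {
        candidates.push(...entry.keywords);
      }
    }
    const matched = candidates.some(
      (word) => haystack.indexOf(word.toLowerCase()) !== -1,
    );
    if (matched) {
      return brand;
    }
  }
  return undefined;
}

const example = assignBrand(
  'Moto G7 Unlocked',
  'https://example.com/dp/abc',
  ['Motorola'],
  [{ brand: 'Motorola', keywords: ['Moto'] }],
);
// example is 'Motorola' (matched through the 'Moto' keyword)
```

Substring matching is simple but can produce false positives when a short keyword happens to occur inside an unrelated word, so a stricter word-boundary match may be preferable in practice.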