crawl-screenshot-diff

Crawls a website, makes screenshots and compares those screenshots visually.

It uses angularjs 1.3 for the web frontend, and iojs with a mysql database on the server side.

About this project

This was a quick hack for a big website relaunch project, used to track the development progress and content population of a new website (using Typo3 as CMS). That's why the front-end is not fully developed, in the sense that there is not always feedback when a button is clicked or a similar event occurs. It's more of an "expert system": use the browser console and a terminal to get a grip on what's going on and whether a click on a button just activated the browser window or actually started a new job.

How to install

required software

  • iojs
  • ImageMagick
  • curl
  • phantomjs
  • mysql

clone the repository

git clone https://github.com/bumzack/crawl-screenshot-diff.git

use npm to install the required packages

npm install

create an SSL certificate

Go to the server directory and create an SSL certificate. If you want to use an existing certificate, change the path and filenames in server.js accordingly.

If you want to create your own certificate, create one using the following command:

openssl req -x509 -nodes -days 3650 -newkey rsa:2048 -keyout crawlscreen.key -out crawlscreen.crt
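
server.js then uses these files to start the HTTPS server. A minimal sketch of what that typically looks like in node/iojs (the actual code and variable names in server.js may differ):

var https = require('https');
var fs = require('fs');

// certificate files created with the openssl command above
var options = {
  key: fs.readFileSync('crawlscreen.key'),
  cert: fs.readFileSync('crawlscreen.crt')
};

// trivial request handler as a placeholder; server.js hands in its own express app here
https.createServer(options, function (req, res) {
  res.writeHead(200);
  res.end('ok\n');
}).listen(3000);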

Import the database dump

Import the database dump into a mysql database.

e.g.

mysql -u USER -p DATABASENAME < database/crawlscreendiff.sql

Replace USER and DATABASENAME with appropriate values for your setup.

Configuration

In the server directory you will find a file called config.js. Adjust the constants to your setup. The most important settings are (see the sketch after this list):

  • listenPort: the port the server listens on. Default: port 3000.
  • database: username, password and database name for the mysql database. Default values: crawlscreendiff
  • appSecret: secret for the JWT authentication. Default value: crawlScreenshotApp
  • phantomjs: path to the phantomjs binary (including the name of the binary).
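
A rough sketch of what the corresponding part of config.js could look like (the structure and constant names here are illustrative; check the actual file in the repository):

// illustrative sketch only - the real config.js in the repository may be structured differently
module.exports = {
  // port the HTTPS server listens on
  listenPort: 3000,

  // mysql connection settings (the database dump uses crawlscreendiff)
  database: {
    user: 'crawlscreendiff',
    password: 'crawlscreendiff',
    database: 'crawlscreendiff'
  },

  // secret used to sign the JWT tokens
  appSecret: 'crawlScreenshotApp',

  // full path to the phantomjs binary, including the binary name
  phantomjs: '/usr/local/bin/phantomjs'
};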

First run

Change to the server directory and start the server with node server.js. Open your browser and enter https://localhost:3000, which should present you with a login form. The database dump contains 2 users, which allow you to log in:

  • one admin user: username: admin, password: admin
  • one readonly user: username: readonly, password: readonly

If you want to add more users or change the default passwords, you can use the following mysql statements:

insert an admin user:

INSERT INTO crawl_user (username, password, admin) VALUES ('admin', SHA2('admin', 512), true);

insert an unprivileged user:

INSERT INTO crawl_user (username, password, admin, accessrights) VALUES ('readonlyuser', SHA2('readonlyuser', 512), false, 'PAGE_ANALYTICS,PAGE_SEO');
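
To change the password of an existing user later on, an UPDATE along these lines should work (assuming the crawl_user table from the database dump):

UPDATE crawl_user SET password = SHA2('newpassword', 512) WHERE username = 'admin';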

what it does

  • In a first step this tool uses curl to retrieve the HTML code of the domain, then uses jsdom and jquery to analyze the HTML code and find all internal links (and external links and images and forms and ...). Then it crawls through all internal links and saves the HTML code of all pages in the mysql database. No images, javascript or CSS files are downloaded. When there is no unvisited internal link left, the crawl job is finished.
  • In a second step screenshots can be made for all of these pages.
  • If the first 2 steps are repeated for the same domain, the differences on these pages can be visualized in a diff-image; additionally an "all-in-one" image is created (see the sketch after this list).
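
The image handling relies on ImageMagick. A minimal sketch of the kind of commands involved (the exact options and file names used by the tool may differ):

compare -metric AE screenshot_old.png screenshot_new.png diff.png
convert screenshot_old.png screenshot_new.png diff.png +append all-in-one.png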

3 different jobs / cronjobs

So there are 3 different jobs, which can be started manually via the web front-end or via cron jobs. See the directory server/cronjobs for examples of how this can be done. You'll find scripts there which start a crawl job and a screenshot job, and a separate script which starts a screenshot-comparison job for the last 2 crawl jobs of a domain. This makes it possible to automatically create diff-images for a domain on a daily or weekly basis.
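
A rough illustration of what such a crontab could look like (the script names below are placeholders - use the actual scripts from server/cronjobs):

# crawl and take screenshots every night at 02:00, compare the last 2 crawl jobs at 04:00
0 2 * * * /path/to/crawl-screenshot-diff/server/cronjobs/crawl-and-screenshot.sh
0 4 * * * /path/to/crawl-screenshot-diff/server/cronjobs/compare-last-two-jobs.sh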

logging

Several log files are created in the server/log directory; they can grow in size quite fast.

troubleshooting

  • mysql connection fails: set the correct path to the mysql socket file (mysql.sock) in the config.js file. (It worked on Mac OSX without the path; on CentOS 7.1 the path was necessary to make it work.)
  • phantomjs on Mac OSX: the current version is 2.0.0. If there are problems, try downgrading to 1.9.7.
  • if you get an error similar to yieldable wrapper dbQuery threadId 2770 error-msg: Error: write EPIPE then you can try either
    • decrease the value of curljobMaxarraylength or
    • if you have root access on your server, increase the values of the following 2 constants in the [mysqld] section of the my.cnf file (don't know if 512M is the right or a good value - it seems to be enough though :-) )
      • max_allowed_packet=512M
      • net_buffer_length=512M

The error indicates that the SQL queries are too big. For technical details see this stackoverflow question (the nodejs error doesn't reveal the true nature of the error).
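
To check the values currently in effect, or to raise max_allowed_packet without editing my.cnf (the change is not persistent across a mysql restart), something along these lines works in a mysql shell:

SHOW VARIABLES LIKE 'max_allowed_packet';
SHOW VARIABLES LIKE 'net_buffer_length';
-- requires sufficient privileges; only affects new connections and is lost on restart
SET GLOBAL max_allowed_packet = 536870912;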

the SEO part

is quite experimental.

When a crawl-job is finished, all the content is split into single words and inserted into a table in the mysql database. This makes it possible to count the occurrences of each word on a single page - and also to take into account in which kind of HTML element the word is placed. So it is possible to weigh the use of a word depending on whether it is only used once in a <p> element or whether it appears in the <title> tag, an <h1> and a <p> tag. To my limited knowledge of how google decides what's an important word on a page, this is sorta one of the criteria.
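
As an illustration of the kind of weighted query this enables (the table and column names below are made up for the example; the actual schema in the database dump differs):

-- hypothetical table page_word(page_id, word, html_element)
SELECT word,
       SUM(CASE html_element
             WHEN 'title' THEN 10
             WHEN 'h1'    THEN 5
             ELSE 1
           END) AS weighted_count
  FROM page_word
 WHERE page_id = 42
 GROUP BY word
 ORDER BY weighted_count DESC;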

a few words regarding security

  • not all mysql statements/values are escaped
  • both the curl and phantomjs commands use parameters which allow access to sites with untrusted SSL certificates. If you don't want that, remove the -k parameter from the curl call and the --ignore-ssl-errors=true parameter from the phantomjs call.

unused features

There is some code which is currently not used or activated, but it should be possible to make these features work without going through too much pain.

  • test external links: after parsing the html code with jsdom, different content types are stored in the database. Currently only internal links are used. External links are stored as well, and they can be tested: which HTTP code is returned when the link is curled. This allows for a simple check of which external links on a page are working and which are not.
  • user admin: add/remove users and change passwords via the web frontend.
  • blacklist of words which should be ignored for the SEO part.

data privacy

The software logs the IP addresses of all requests in server/log/access.log. If this does not comply with your privacy rules, remove the IP logging in the server.js file - search for this.ip as a starting point.

License

This software is licensed under The MIT License (MIT) - see The MIT License for details.
