Spider

💫 Spider is a PHP command-line tool that crawls a website to scrape information.

Spider is a modular website crawler written in PHP. It lets you retrieve information and execute code on the pages of a website, which can be useful for SEO or security audits. You can use modules created by the community or create your own (written in PHP via a web interface).

What is a Crawler?

A crawler is an indexing robot that automatically explores the pages of a website. A crawler is useful for several purposes:

  • Information search & retrieval
  • Validation of the SEO of your website
  • Integration testing
  • Execution of PHP code on several pages in an automated way
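
To make "automatically explores the pages" concrete, here is a minimal sketch of the link-following loop that any crawler performs. It is not Spider's implementation: the `$site` array is a hypothetical in-memory stand-in for a real website, and `crawl()` is an illustrative helper name.

```php
<?php
// Hypothetical in-memory "website": each URL maps to the links found on that page.
$site = [
    '/'            => ['/about', '/blog'],
    '/about'       => ['/'],
    '/blog'        => ['/blog/post-1', '/about'],
    '/blog/post-1' => ['/blog'],
];

/**
 * Visit every page reachable from $start exactly once (breadth-first).
 * Returns the URLs in the order they were visited.
 */
function crawl(array $site, string $start): array
{
    $visited = [];
    $queue = [$start];

    while ($queue) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already explored, avoids infinite loops on circular links
        }
        $visited[$url] = true;

        // Queue every link on the page that has not been explored yet.
        foreach ($site[$url] ?? [] as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }

    return array_keys($visited);
}

print_r(crawl($site, '/'));
```

The visited set is what keeps the loop from running forever when pages link back to each other, which is the normal case on a real website.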

Features

  • Get all links from a website
  • Check HTTP responses
  • Create your own modules (crawl & execute your PHP code)
  • No database, pure PHP & Symfony
  • Output to a JSON file

Libraries

I would be happy to receive your ideas and contributions to the project 😃

Getting started

Installation

Composer Usage

Use the Spider library in your project and create your own modules.

composer require mediashare/spider

Usage

Create an index.php file and initialize the config.

<?php
// ./index.php
require 'vendor/autoload.php';
use Mediashare\Spider\Entity\Config;
use Mediashare\Spider\Entity\Url;
use Mediashare\Spider\Spider;

// Website Config
$config = new Config();
$config->setWebspider(true); // Crawl all website
$config->setReportsDir(__DIR__.'/reports/'); // Default reports path
$config->setModulesDir(__DIR__.'/modules/'); // Default modules path
// Prompt Console / Dump
$config->setVerbose(true); // Prompt verbose output
$config->setJson(false); // Prompt json output
// Modules Activation
$config->enableDefaultModule(true); // Enable default SEO kernel modules
$config->removeModule('FileDownload'); // Disable Module

// Url
$url = new Url('https://mediashare.fr');

// Run Spider
$spider = new Spider($url, $config);
$result = $spider->run();

Modules

Requirements

  • The name of your class must be the same as the name of its .php file.
  • Every module must have a public run() function: it is the module's entry point. Spider calls run() as soon as a webpage has been crawled, so you can use the DomCrawler inside it.
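
The two requirements can be sketched as a minimal module skeleton. `PageTitle` is a hypothetical module name chosen for illustration, and the sketch assumes, as in the LinksTest example in this README, that Spider sets the public `$dom` property to a DomCrawler instance before calling `run()`.

```php
<?php
// ./modules/PageTitle.php (hypothetical module: the file name matches the class name)
namespace Mediashare\Modules;

class PageTitle
{
    // Assumed to be set by Spider to the DomCrawler of the page just crawled.
    public $dom;

    // Mandatory entry point: reports the page title, or flags it as missing.
    public function run()
    {
        $nodes = $this->dom->filter('title');

        // text() throws on an empty node list, so check count() first.
        $title = $nodes->count() > 0 ? trim($nodes->text()) : '';

        return ['title' => $title, 'missing' => $title === ''];
    }
}
```

Whatever `run()` returns is the module's result for that page; the array shape used here is only an example.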

Documentation

DomCrawler is a Symfony component for navigating the DOM of HTML and XML documents. See the Symfony DomCrawler documentation for details.

Create your own module to execute actions when the crawler scrapes a webpage.

<?php
// ./modules/LinksTest.php
namespace Mediashare\Modules;

class LinksTest {
    // DomCrawler for the page that has just been crawled
    public $dom;

    // Entry point called by Spider: counts how often each href appears on the page
    public function run() {
        $links = [];
        foreach ($this->dom->filter('a') as $link) {
            // Strip surrounding whitespace; anchors without an href are skipped
            $href = trim($link->getAttribute('href'));
            if ($href) {
                if (isset($links[$href])) {
                    $links[$href]['counter']++;
                } else {
                    $links[$href]['counter'] = 1;
                }
            }
        }
        return $links;
    }
}
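
To try the link-counting logic without running a crawl, the same idea can be sketched with PHP's built-in DOM extension instead of DomCrawler. This is a stand-alone approximation, not part of Spider: `countLinks()` and the sample HTML are illustrative, and the sketch assumes the `dom` extension is enabled.

```php
<?php
/**
 * Count how often each href appears in an HTML string.
 * Returns the same shape as the module: [href => ['counter' => n]].
 */
function countLinks(string $html): array
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml emits for imperfect real-world HTML.
    @$doc->loadHTML($html);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '') {
            continue; // anchor without a usable href
        }
        $links[$href]['counter'] = ($links[$href]['counter'] ?? 0) + 1;
    }

    return $links;
}

// Hypothetical sample page: two links to /a (one padded with spaces), one to /b.
$html = '<html><body><a href="/a">A</a><a href=" /a ">A again</a>'
      . '<a href="/b">B</a><a>no href</a></body></html>';

print_r(countLinks($html));
```

Once the logic behaves as expected on sample HTML, it can be moved into a module's run() function.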

Execute the code from the console.

php index.php

Output

-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*
* Output file result: /home/slote/Bureau/Spider/var/reports/marquand.pro/5dfaf1c0147c6.json
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*
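
The report is a plain JSON file, so it can be inspected with a few lines of PHP. The sketch below makes no assumption about the report's internal structure and only lists its top-level keys; `loadReport()` is a hypothetical helper, not part of Spider, and the path is passed on the command line.

```php
<?php
/**
 * Read a JSON report from disk and decode it into an associative array.
 */
function loadReport(string $path): array
{
    $json = file_get_contents($path);
    if ($json === false) {
        throw new RuntimeException("Cannot read report: $path");
    }

    $data = json_decode($json, true);
    if (!is_array($data)) {
        throw new RuntimeException("Report is not a JSON object or array: $path");
    }

    return $data;
}

// Usage: php read-report.php path/to/report.json
$path = $argv[1] ?? null;
if ($path !== null) {
    print_r(array_keys(loadReport($path)));
}
```

Listing the keys first is a cautious way to discover what the enabled modules wrote before relying on any particular field.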

Modules

Modules are tools created by the community to add features when crawling a website. Adding a module to the crawler automates code execution on one or more pages of a website: modules are executed each time a page is crawled.

Last Updated: 1/21/2020, 1:45:13 PM