BH SEO: Scraping with HTMLSQL

z. SEO 2010-12-07

First of all, you can get htmlSQL from the official website (no longer updated) or through the download link at the end of this post.

Why use htmlSQL?

Sure, you’ll ask me: "Why use a PHP library when scripts like XrumerSEO or AutoSplog exist?" The answer is: it can be scheduled with crontab and run fully autonomously.

Before a concrete example, here is a simple request you may have using the lib:

You can see the syntax is nearly identical to standard SQL, and it lets you parse the whole HTML easily!

It means you can:

extract all the images of a given page:
```
select src from img
```

extract all descriptions of products:
```
select text from div where $class == 'descr'
```

extract every anchor from a page:
```
select text from a
```

etc.
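Since htmlSQL is no longer updated, it can help to see what these three queries amount to with tools that ship with PHP. The sketch below is not htmlSQL: it swaps in the bundled DOM extension (DOMDocument and DOMXPath). The HTML sample and the helper name `select_from_html` are made up for illustration.

```php
<?php
// Sketch only: XPath equivalents of the three htmlSQL queries above.
// select_from_html() is a hypothetical helper, not part of htmlSQL.

function select_from_html(string $html, string $xpath): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so keep parser warnings quiet.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $values = [];
    foreach ((new DOMXPath($doc))->query($xpath) as $node) {
        // nodeValue is the attribute value or the element's text content.
        $values[] = trim($node->nodeValue);
    }
    return $values;
}

// Made-up sample page.
$html = '<html><body>'
      . '<img src="/img/a.png"><img src="/img/b.png">'
      . '<div class="descr">First product</div>'
      . '<div class="other">Not a description</div>'
      . '<a href="/one">One</a> <a href="/two">Two</a>'
      . '</body></html>';

// select src from img
$images = select_from_html($html, '//img/@src');
// select text from div where $class == 'descr'
$descriptions = select_from_html($html, '//div[@class="descr"]');
// select text from a
$anchors = select_from_html($html, '//a');

print_r($images);
print_r($descriptions);
print_r($anchors);
```

The SQL-like form is shorter to write, which is the appeal of htmlSQL; the XPath form has the advantage of relying only on extensions that are normally enabled in PHP.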

You can imagine how powerful the script can be. You can even navigate through pages by using a different instance of the object for each page (see the scraping example below):

Using:

select href from a where $class == 'download_url'

And then, for each of those URLs:

select href from a where $text == 'download files'
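The two-step navigation can be sketched as a small function. This is only an illustration under assumptions: `collect_download_links` and the `$runQuery` callable are hypothetical names, and the htmlSQL-backed closure uses only the calls that appear in this post (`connect('url', ...)`, `query()`, `fetch_array()`), assuming that selecting `href` yields rows with an `href` key.

```php
<?php
// Sketch only: two-level scraping with one query per level.
// $runQuery is a hypothetical callable: (url, htmlSQL query) -> array of rows.

function collect_download_links(string $listUrl, callable $runQuery): array
{
    $links = [];
    // Step 1: collect the detail-page URLs from the listing page.
    $pages = $runQuery($listUrl, 'SELECT href FROM a WHERE $class == "download_url"');
    foreach ($pages as $page) {
        // Step 2: on each detail page, look for the actual download link.
        $files = $runQuery($page['href'], 'SELECT href FROM a WHERE $text == "download files"');
        foreach ($files as $file) {
            $links[] = $file['href'];
        }
    }
    return $links;
}

// One possible $runQuery, backed by htmlSQL. A fresh instance is created
// for every page, as suggested in the text above.
$htmlsqlQuery = function (string $url, string $sql): array {
    $wsql = new htmlsql();
    if (!$wsql->connect('url', $url)) {
        return []; // skip pages that cannot be fetched
    }
    $wsql->query($sql);
    return $wsql->fetch_array();
};

// Usage, once htmlsql.class.php and snoopy.class.php are loaded:
// $links = collect_download_links('http://www.site.tld/category/nice/', $htmlsqlQuery);
```

Passing the query runner as a parameter keeps the crawling logic separate from the fetching, so it can be tried with canned data before it is pointed at a real site. Relative `href` values would still need to be made absolute before the second step.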

I hope you’re still following, because this is the main part: you can now get all the information you need for your website ;)

Example

ini_set('max_execution_time', 0); // no execution time limit

$categorie_toadd = 7;

$mon_site = "http://www.site.tld/category/nice/?page=2";

// Include the Snoopy and htmlSQL classes, plus the database connection class
require("snoopy.class.php");
require("htmlsql.class.php");
require("Db.class.php");

$DB = new DB();

$tab_img = $tab_url2 = $tab_url1 = $tab_descr = $tab_type = $tab_poid = $tab_langue = $tab_system = $tab_nom = $tab_version = $tab_url = array();

$wsql = new htmlsql();

if (!$wsql->connect('url', $mon_site))
{
    print "Error while connecting: " . $wsql->error;
    exit;
}

// Request: look for the links that have the see_results class,
// which is the class of the website URLs within the results
$wsql->query('SELECT * FROM a WHERE $class == "see_results"');
foreach ($wsql->fetch_array() as $link) {
//    echo $link['href'] . "\n";
    $tab_url[] = $link['href'];
}

// Operating system: map the text of each "system" div to a system id.
// eregi() was removed in PHP 7, so stripos() does the case-insensitive test.
$wsql->query('SELECT * FROM div WHERE $class == "system"');
foreach ($wsql->fetch_array() as $link) {
//    echo $link['text'] . "\n";
    $ret = 6; // default

    if (stripos($link['text'], "Linux/MacOSX/2000/XP") !== false) {
        $ret = 12;
    } elseif (stripos($link['text'], "MacOSX") !== false) {
        $ret = 1;
    } elseif (stripos($link['text'], "98/Me/2000/XP") !== false) {
        // Tested before "2000/XP", which it contains; in the original
        // sequence of ifs this id was always overwritten by the next check
        $ret = 5;
    } elseif (stripos($link['text'], "2000/XP") !== false) {
        $ret = 6;
    }

    $tab_system[] = $ret;
}

// Language: map the text of each "language" div to a language id
$wsql->query('SELECT * FROM div WHERE $class == "language"');
foreach ($wsql->fetch_array() as $link) {
//    echo $link['text'] . "\n";
    // Reset for each row so the previous row's id does not leak in;
    // 0 is an assumed placeholder for "unknown"
    $ret = 0;

    if (stripos($link['text'], "Français") !== false) {
        $ret = 1; // French
    } elseif (stripos($link['text'], "Anglais") !== false) {
        $ret = 2; // English
    }

    $tab_langue[] = $ret;
}

// Size and license type: both live in the "size" table cell
$wsql->query('SELECT * FROM td WHERE $class == "size"');
foreach ($wsql->fetch_array() as $link) {
//    echo $link['text'] . "\n";

    // First line: the size; second line: the license type.
    // If the cell separates them with a <br> tag, split on that instead.
    $toparse = explode("\n", $link['text']);
    $toparse[0] = strip_tags($toparse[0]);

    $last_op = explode(" ", $toparse[0]);
    $taille = (float) $last_op[0];

//    print_r($last_op);

    // A unit containing "m" (megabytes) is scaled by 1000
    if (isset($last_op[1]) && stripos($last_op[1], "m") !== false) $taille *= 1000;

    $toparse[1] = isset($toparse[1]) ? strip_tags($toparse[1]) : "";
    $toparse[1] = str_replace("Télécharger", "", $toparse[1]);

    $ret = 0; // assumed placeholder for "unknown license type"
    if (stripos($toparse[1], "freeware") !== false) {
        // freeware id
        $ret = 1;
    }
    if (stripos($toparse[1], "Source ") !== false) {
        // open source id
        $ret = 4;
    }
    if (stripos($toparse[1], "Demo ") !== false) {
        // shareware id
        $ret = 2;
    }
    $tab_poid[] = $taille;
    $tab_type[] = $ret;
}

The code above is part of one of my working scripts.
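The keyword checks in the script all follow the same pattern: find a keyword in the text, pick an id. One way to make the precedence between overlapping keywords explicit is an ordered rule table, as in the sketch below. The function name `id_for_text` is hypothetical, the ids are copied from the script, and the ordering of the rules is an assumption about the intent. It relies on `stripos()` because `eregi()` is not available from PHP 7.0 onward.

```php
<?php
// Sketch only: table-driven version of the "keyword -> id" checks.
// id_for_text() is a hypothetical helper name.

function id_for_text(string $text, array $rules, int $default): int
{
    // The first matching rule wins, so specific keywords must come
    // before the shorter keywords they contain.
    foreach ($rules as $needle => $id) {
        if (stripos($text, (string) $needle) !== false) {
            return $id;
        }
    }
    return $default;
}

// Ids taken from the script above; the order is an assumption.
$systemRules = [
    'Linux/MacOSX/2000/XP' => 12,
    'MacOSX'               => 1,
    '98/Me/2000/XP'        => 5,
    '2000/XP'              => 6,
];

echo id_for_text('Windows 98/Me/2000/XP', $systemRules, 6), "\n";
```

With a table like this, adding a new system means adding one line, and the order of the lines documents which keyword takes priority.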

Content spinning & Translation

Now it’s time for content spinning. You can find many scripts for that on BH websites; here is a link to one of them:

http://www.deliciouscadaver.com/script-php-de-content-spinning-multiniveaux-recursif.html

Here is the script from 512banque, a standard one but useful for getting started:

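function spinnage($text){
    // No complete {a|b|c} group left: return the text as is.
    // (Testing for a complete group, not just "{", avoids endless
    // recursion on an unmatched brace.)
    if (!preg_match('/\{[^{}]*\}/', $text)) {
        return $text;
    }
    else {
        // Match the innermost groups only (no nested braces inside)
        preg_match_all('/\{([^{}]*)\}/si', $text, $matches);
        $occur = count($matches[1]);
        for ($i = 0; $i < $occur; $i++)
        {
            // Pick one alternative at random
            $word_spinning = explode("|", $matches[1][$i]);
            shuffle($word_spinning);
            $text = str_replace($matches[0][$i], $word_spinning[0], $text);
        }
        // Recurse to resolve the groups that were one nesting level up
        return spinnage($text);
    }
}

echo spinnage($text);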
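For reference, the spin syntax is `{option one|option two}`, and groups can be nested. The sketch below is a variant, not the 512banque script: it uses `preg_replace_callback()` so that each occurrence of a group gets its own random pick, even when the same group appears twice in the text. The function name `spin` and the sample sentence are made up for illustration.

```php
<?php
// Sketch only: a spinner variant based on preg_replace_callback().

function spin(string $text): string
{
    // Resolve the innermost {a|b|c} groups, then repeat until none is left.
    while (preg_match('/\{[^{}]*\}/', $text)) {
        $text = preg_replace_callback(
            '/\{([^{}]*)\}/',
            function (array $m): string {
                $options = explode('|', $m[1]);
                // array_rand() returns a random key of $options.
                return $options[array_rand($options)];
            },
            $text
        );
    }
    return $text;
}

// Made-up sample with one nested group.
echo spin('{Hello|Hi} {world|{dear|kind} reader}'), "\n";
```

Each pass removes at least one group, so the loop ends, and an unmatched brace is simply left in place.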

Another tip: use the Google Translate API if you want to translate part of your data.

Now it’s time to add the data to your database. I’ll let you choose your DB class; if needed, ask me in the comments, but you probably already have your own. The author of htmlSQL also provides a class for that.

We are now close to the end: we can parse any web page and save the data to our database.
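The Db class used in the script is not shown in this post, so the sketch below swaps in PDO with prepared statements to illustrate the saving step. Everything specific in it is an assumption: the `software` table, its columns, the `save_rows` helper, and the in-memory SQLite database that stands in for a real one.

```php
<?php
// Sketch only: saving scraped rows with PDO prepared statements.
// Table name, column names and save_rows() are hypothetical.

function save_rows(PDO $pdo, array $rows): int
{
    // Prepared once, executed per row; values are bound, never concatenated.
    $stmt = $pdo->prepare(
        'INSERT INTO software (url, system_id, size_kb) VALUES (:url, :system_id, :size_kb)'
    );
    $saved = 0;
    foreach ($rows as $row) {
        $stmt->execute([
            ':url'       => $row['url'],
            ':system_id' => $row['system_id'],
            ':size_kb'   => $row['size_kb'],
        ]);
        $saved++;
    }
    return $saved;
}

// In-memory SQLite keeps the example self-contained; a real site would
// use its own DSN and credentials here.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE software (url TEXT, system_id INTEGER, size_kb REAL)');

// Rows like these could be assembled from parallel arrays such as
// $tab_url, $tab_system and $tab_poid.
$count = save_rows($pdo, [
    ['url' => 'http://www.site.tld/app-1', 'system_id' => 6,  'size_kb' => 1200.0],
    ['url' => 'http://www.site.tld/app-2', 'system_id' => 12, 'size_kb' => 350.0],
]);

echo $count, " rows saved\n";
```

Binding the values matters here because scraped text is untrusted input.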

Automation

Let’s now have a little "crontab -e" ;)

It’s time to automate the scraping process. Save your PHP files along with their requirements (API, connectors, required classes) and add the script to the crontab. Something like the line below works fine, but be careful to use full paths, both in the crontab entry and in the script’s includes:

10 10 * * * php -f /home/www/scrappyboy/scrappyboy-script.php

Note: the script will be executed every day at 10:10 and add fresh data to your website.
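The full-path advice matters because cron does not start the script from its own directory, so relative includes can fail. A minimal sketch of one way to handle it, using `__DIR__` to build absolute paths; the helper name `local_path` is hypothetical and the file names are the ones required by the script in this post.

```php
<?php
// Sketch only: include files relative to this script, not to the
// directory cron happens to start from. local_path() is hypothetical.

function local_path(string $file): string
{
    // __DIR__ is the absolute directory of the current file.
    return __DIR__ . DIRECTORY_SEPARATOR . $file;
}

$missing = [];
foreach (['snoopy.class.php', 'htmlsql.class.php', 'Db.class.php'] as $file) {
    $path = local_path($file);
    if (is_file($path)) {
        require_once $path;
    } else {
        // A real cron job would log this and stop here.
        $missing[] = $path;
    }
}
```

The same idea applies to any cache or log file the script writes: build the path from `__DIR__` instead of relying on the current working directory.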

We’re done now. You can combine this script with SimplePie, a good library if you want to parse RSS feeds.

Search

Another tip: use this lib on google.com to parse the SERP and write your own ranks.fr-style script :)

select href from a where $class == 'l'

This request works fine and lets you get the SERP for a given keyword. Useful for creating reports.
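Once the result links are collected, a ranking report comes down to finding the position of your domain in the list. A minimal sketch, assuming the `href` values have already been extracted into an array; the function name `rank_of` and the sample URLs are made up for illustration.

```php
<?php
// Sketch only: position of a domain in an ordered list of result URLs.
// rank_of() is a hypothetical helper name.

function rank_of(array $hrefs, string $domain): ?int
{
    // Compare hosts without a leading "www." and without case.
    $wanted = preg_replace('/^www\./', '', strtolower($domain));
    foreach (array_values($hrefs) as $i => $href) {
        $host = strtolower((string) parse_url($href, PHP_URL_HOST));
        if ($host === '') {
            continue; // relative or malformed link, not a result URL
        }
        if (preg_replace('/^www\./', '', $host) === $wanted) {
            return $i + 1; // ranks are 1-based
        }
    }
    return null; // domain not found in this result page
}

// Made-up result list.
$results = [
    'http://www.example.com/page',
    'http://www.site.tld/category/nice/',
    'http://site.tld/other',
];

var_dump(rank_of($results, 'site.tld'));
```

Storing the returned rank with the keyword and the date of the run is enough to build a simple history for reports.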

Have fun scraping, and sorry for the poor English ;)

Resources

Download htmlsql v0.5 (203)
Sources: -none atm- If you want any sources, just ask me in comments

Address: http://www.devquotes.com/2010/12/07/seo-scraping-with-websql/

3 comments so far

iop @ 2010-12-07 17:22

Nice, what have you got in your db.php plz ?

karim @ 2010-12-07 18:26

class Db is an interface to my database.

$Db->Query($query);
$Db->Getnumrow($query);
$Db->fetch_array($query);

etc…

if needed i can provide mine.

La veille du week-end (neuvième) | LoïcG @ 2010-12-10 08:06

[...] SEO: scraping with HTMLSQL (it’s “a bit” Black-Hat … use with caution!): via @512banque [...]
