I wrote a PHP crawler script. How can I get all the links from a web page that I need to parse? My code is:

PHP:
<?php
// parser of eBay domain name listings
$website = "http://www.example.com";
$filename = "services4.txt";
$fd = fopen($filename, "a+");

$content = file_get_contents($website);

$dom = new DOMDocument;
libxml_use_internal_errors(true); // real-world HTML triggers parser warnings
$dom->loadHTML($content);

$links = $dom->getElementsByTagName("a");
foreach ($links as $link) {
    $href = $link->getAttribute("href");
    // keep only links whose href contains "listings"
    if (strpos($href, "listings") !== false) {
        $href = rtrim($href);
        fwrite($fd, $website . $href . "\n");
    }
}
fclose($fd);
?>
Don't use regex for this; regex on HTML is fragile and error-prone. Parsing the HTML with a proper DOM parser is far more reliable. But note that to fetch *all* links you'd have to consider more than anchors: src="" attributes carry URLs too, not just href="".
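For example, here's a minimal sketch of grabbing every href and src value with DOMXPath (untested; $url is a placeholder, and it assumes the page is reachable via file_get_contents):

PHP:
<?php
// Sketch: collect every href and src attribute value from one page.
$url = "http://www.example.com"; // placeholder target
$html = file_get_contents($url);

$dom = new DOMDocument;
libxml_use_internal_errors(true); // tolerate malformed markup
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$urls = [];
// //@href matches href on any element (a, link, base, ...);
// //@src matches src on any element (img, script, iframe, ...).
foreach ($xpath->query('//@href | //@src') as $attr) {
    $urls[] = $attr->value;
}

print_r(array_unique($urls));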
You already seem to be pulling all the links, so what are you actually asking? Or do you want to filter out just the links that point to the same domain? If so, use parse_url: http://php.net/manual/en/function.parse-url.php

If you extract the PHP_URL_HOST component and the result is either empty or matches the domain you are parsing, it's likely a document on the same site. You may also want to check whether a <base> tag is present and use its value when PHP_URL_HOST is missing.

I would also stick to just parsing href on anchors, since SRC attributes on SCRIPT or IMG tags point at resources, not content pages. Of course, if the site being parsed builds its links with JavaScript, you're pretty well stuck trying to deal with that in PHP alone.
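Something like this sketch shows the idea (untested; $site_host is a placeholder, and it assumes $dom is the DOMDocument you already loaded the page into):

PHP:
<?php
// Sketch: keep only anchors that point back at the site being crawled.
$site_host = "www.example.com"; // host of the site you are parsing

// Honour a <base href="..."> tag if the page has one.
$base_host = null;
foreach ($dom->getElementsByTagName("base") as $base) {
    $base_host = parse_url($base->getAttribute("href"), PHP_URL_HOST);
}

$same_site = [];
foreach ($dom->getElementsByTagName("a") as $a) {
    $href = $a->getAttribute("href");
    $host = parse_url($href, PHP_URL_HOST);
    if ($host === null || $host === false) {
        // Relative URL: same site, unless <base> says otherwise.
        $host = $base_host ?: $site_host;
    }
    if (strcasecmp($host, $site_host) === 0) {
        $same_site[] = $href;
    }
}

print_r($same_site);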