Parser

ssimon171078 Well-Known Member

#1

I made a parser for eBay, but I have a problem: every line of text that I parse ends up in the output twice.
My code:
<?php
//parser of website ebay domain names
ini_set('memory_limit','1024M');
ini_set('max_execution_time',0);
$website="http://www.ebay.com/sch/Domain-Names-/3767/i.html";
$filename="ebay_domain_names3.txt";
$fd=fopen($filename,"a+");
function parse($Page){
    global $website;
    global $fd;
    if ($Page!=0)
    {
        $content=file_get_contents($website."?_pgn=".$Page."&_skc=200&rt=nc");
        echo ($website."?_pgn=".$Page."&_skc=200&rt=nc");
    }
    else
    {
        $content=file_get_contents($website);
    }
    $dom=new DOMDocument();
    $dom->loadhtml($content);
    $links=$dom->getElementsByTagName("a");
    foreach ($links as $link)
    {
        $links_ebay=$link->getAttribute("href");
        if (strpos($links_ebay,"itm")){
            fwrite($fd,$links_ebay);
            fwrite($fd,"\n");
        }
    }
}
for ($Page=0;$Page<22000;$Page++){
    parse($Page);
    sleep(10);
}


fclose($fd);
?>
PHP:
My text file:
http://www.ebay.com/itm/OSRON-COM-For-Sale-PREMIUM-DOMAIN-NAME-Aged-BRANDABLE-3-4-5-Letter-/271784332998?pt=LH_DefaultDomain_0&hash=item3f479bcec6
http://www.ebay.com/itm/OSRON-COM-For-Sale-PREMIUM-DOMAIN-NAME-Aged-BRANDABLE-3-4-5-Letter-/271784332998?pt=LH_DefaultDomain_0&hash=item3f479bcec6
http://www.ebay.com/itm/InsulinInhaled-com-FDA-Approved-Breakthrough-Diabetes-No-Injection-Treatment-/181672911427?pt=LH_DefaultDomain_0&hash=item2a4c8ca243
http://www.ebay.com/itm/InsulinInhaled-com-FDA-Approved-Breakthrough-Diabetes-No-Injection-Treatment-/181672911427?pt=LH_DefaultDomain_0&hash=item2a4c8ca243

ssimon171078, Feb 27, 2015

PDD Greenhorn

#2

It's because there are two <a> tags associated with each domain listing: one for the title link and one for the image.

PDD, Feb 27, 2015

PoPSiCLe Illustrious Member

#3

Just change the script to compare the current link with the previous one; if they're the same, don't write it.
Something like this:
foreach ($links as $link)
{
    $prev_link = '';
    $links_ebay=$link->getAttribute("href");
    if (strpos($links_ebay,"itm") && $links_ebay != $prev_link){
        fwrite($fd,$links_ebay);
        fwrite($fd,"\n");
        $prev_link = $links_ebay;
    }
}
PHP:

PoPSiCLe, Feb 27, 2015

PDD Greenhorn

#4

That code would reset $prev_link on every iteration. I think a cleaner solution is to use his DOM parser to check for the img class in each <a> tag's class attribute and skip the image link.
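
A minimal sketch of that check, under stated assumptions: the sample HTML is made up for illustration (it is not eBay's actual markup), and the class names `img` and `vip` are hypothetical stand-ins for whatever the real page uses on the image link and the title link.

```php
<?php
// Hypothetical sample markup: each listing has an image link and a title link.
$html = <<<HTML
<div>
  <a class="img" href="http://www.ebay.com/itm/Example-One/111"><img src="a.jpg" alt=""></a>
  <a class="vip" href="http://www.ebay.com/itm/Example-One/111">Example One</a>
  <a class="img" href="http://www.ebay.com/itm/Example-Two/222"><img src="b.jpg" alt=""></a>
  <a class="vip" href="http://www.ebay.com/itm/Example-Two/222">Example Two</a>
</div>
HTML;

$dom = new DOMDocument();
// Suppress the warnings that real-world, non-validating HTML tends to trigger.
@$dom->loadHTML($html);

$itemLinks = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    // Split the class attribute into individual class names.
    $classes = preg_split('/\s+/', trim($link->getAttribute('class')));
    if (in_array('img', $classes, true)) {
        continue; // skip the anchor that wraps the thumbnail
    }
    $href = $link->getAttribute('href');
    // strpos() can return 0, so compare against false explicitly.
    if (strpos($href, 'itm') !== false) {
        $itemLinks[] = $href;
    }
}

echo implode("\n", $itemLinks), "\n";
```

If the class names on the real page differ or change, testing whether the anchor contains an `<img>` child (for example with `$link->getElementsByTagName('img')->length > 0`) would be an alternative that does not depend on them.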

PDD, Feb 27, 2015

PoPSiCLe Illustrious Member

#5

Oops, the $prev_link = ''; should of course be OUTSIDE the foreach loop. Shit happens when you type on a mobile keyboard.
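
With that correction applied, the loop could look like the sketch below. To make it run on its own, it is wrapped in a hypothetical helper, `writeItemLinks()`, and fed made-up sample HTML; neither is part of the original script. Note that comparing against the previous link only removes duplicates that sit next to each other.

```php
<?php
// Hypothetical helper: writes each item link once, skipping a link that
// repeats the one written immediately before it.
function writeItemLinks(DOMDocument $dom, $fd)
{
    $prev_link = ''; // initialised once, outside the loop
    $written = 0;
    foreach ($dom->getElementsByTagName('a') as $link) {
        $links_ebay = $link->getAttribute('href');
        // strpos() can return 0, so compare against false explicitly.
        if (strpos($links_ebay, 'itm') !== false && $links_ebay != $prev_link) {
            fwrite($fd, $links_ebay . "\n");
            $prev_link = $links_ebay;
            $written++;
        }
    }
    return $written;
}

// Made-up sample: the same href appears twice in a row, as in the thread.
$html = '<a href="http://www.ebay.com/itm/A/1"><img src="a.jpg"></a>'
      . '<a href="http://www.ebay.com/itm/A/1">A</a>'
      . '<a href="http://www.ebay.com/itm/B/2"><img src="b.jpg"></a>'
      . '<a href="http://www.ebay.com/itm/B/2">B</a>';

$dom = new DOMDocument();
@$dom->loadHTML($html);

// An in-memory stream stands in for the output file.
$fd = fopen('php://memory', 'w+');
$count = writeItemLinks($dom, $fd);
rewind($fd);
$output = stream_get_contents($fd);
fclose($fd);

echo $output;
```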

PoPSiCLe, Feb 27, 2015

ssimon171078 Well-Known Member

#6

I have a small question. I have this HTML code:
<li class="lvprice prc">
            <span  class="bold">
                    <b>ILS</b> 597.46</span>
                </li>
HTML:
How can I get 597.46 in PHP? I am thinking of using DOMXPath.

ssimon171078, Feb 28, 2015

PDD Greenhorn

#7

In pseudocode:

doc('span[class=bold]')->innerText

Code (php):

or

$span = doc('span[class=bold]');
$span->children('b')->remove();
$price = $span->innerText;

Code (php):
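
A runnable sketch of the same idea with DOMXPath, which is what the question asked about. It assumes the snippet from the question sits inside a list (the surrounding `<ul>` is added here only to make the sample well-formed), and it selects the span's own text nodes so that the currency code inside `<b>` is left out without removing anything from the document.

```php
<?php
// The HTML from the question, wrapped in a <ul> for this example only.
$html = '<ul><li class="lvprice prc">
            <span  class="bold">
                    <b>ILS</b> 597.46</span>
                </li></ul>';

$dom = new DOMDocument();
// Suppress the warnings that real-world, non-validating HTML tends to trigger.
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// text() matches only the span's direct text children, which excludes
// the "ILS" text nested inside <b>.
$nodes = $xpath->query('//li[contains(@class, "lvprice")]/span[@class="bold"]/text()');

$price = '';
foreach ($nodes as $node) {
    $price .= $node->nodeValue;
}
$price = trim($price); // drop the surrounding whitespace and newlines

echo $price, "\n";
```

The result is a string; cast it with `(float)$price` if a number is needed.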

PDD, Feb 28, 2015

EricBruggema Well-Known Member

#8

Why not use a temporary array to store the URLs you've found? Before writing to the file, check whether the URL is already in the array: if it isn't, add it and write it; if it is, skip it and go on to the next one... it ain't that hard...
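
A rough sketch of that approach, assuming the hrefs have already been pulled out of the page. The helper name `filterNewLinks()` and the sample URLs are made up for illustration. URLs are stored as array keys so the "have I seen this?" check is a cheap `isset()`, and the array is passed by reference so it keeps its contents across pages.

```php
<?php
// Hypothetical helper: returns the item links from $hrefs that have not been
// seen before and records them in $seen.
function filterNewLinks(array $hrefs, array &$seen)
{
    $new = [];
    foreach ($hrefs as $href) {
        if (strpos($href, 'itm') === false) {
            continue; // not an item link
        }
        if (isset($seen[$href])) {
            continue; // already written
        }
        $seen[$href] = true;
        $new[] = $href;
    }
    return $new;
}

$seen = [];
// Made-up sample data standing in for two parsed result pages.
$page1 = ['http://www.ebay.com/itm/A/1', 'http://www.ebay.com/itm/A/1', 'http://www.ebay.com/sch/x'];
$page2 = ['http://www.ebay.com/itm/B/2', 'http://www.ebay.com/itm/A/1'];

$first = filterNewLinks($page1, $seen);
$second = filterNewLinks($page2, $seen);

echo implode("\n", array_merge($first, $second)), "\n";
```

Unlike the previous-link comparison, this also catches duplicates that are not adjacent, including a listing that shows up again on a later page.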

EricBruggema, Mar 3, 2015
