I made a parser for eBay, but I have a problem: each line of text that I parse is written twice. My code:

<?php
//parser of website ebay domain names
ini_set('memory_limit','1024M');
ini_set('max_execution_time',0);
$website="http://www.ebay.com/sch/Domain-Names-/3767/i.html";
$filename="ebay_domain_names3.txt";
$fd=fopen($filename,"a+");

function parse($Page){
    global $website;
    global $fd;
    if ($Page!=0) {
        $content=file_get_contents($website."?_pgn=".$Page."&_skc=200&rt=nc");
        echo ($website."?_pgn=".$Page."&_skc=200&rt=nc");
    } else {
        $content=file_get_contents($website);
    }
    $dom=new DOMDocument();
    $dom->loadhtml($content);
    $links=$dom->getElementsByTagName("a");
    foreach ($links as $link) {
        $links_ebay=$link->getAttribute("href");
        if (strpos($links_ebay,"itm")){
            fwrite($fd,$links_ebay);
            fwrite($fd,"\n");
        }
    }
}

for ($Page=0;$Page<22000;$Page++){
    parse($Page);
    sleep(10);
}
fclose($fd);
?>
PHP:

My text file:

http://www.ebay.com/itm/OSRON-COM-For-Sale-PREMIUM-DOMAIN-NAME-Aged-BRANDABLE-3-4-5-Letter-/271784332998?pt=LH_DefaultDomain_0&hash=item3f479bcec6
http://www.ebay.com/itm/OSRON-COM-For-Sale-PREMIUM-DOMAIN-NAME-Aged-BRANDABLE-3-4-5-Letter-/271784332998?pt=LH_DefaultDomain_0&hash=item3f479bcec6
http://www.ebay.com/itm/InsulinInhaled-com-FDA-Approved-Breakthrough-Diabetes-No-Injection-Treatment-/181672911427?pt=LH_DefaultDomain_0&hash=item2a4c8ca243
http://www.ebay.com/itm/InsulinInhaled-com-FDA-Approved-Breakthrough-Diabetes-No-Injection-Treatment-/181672911427?pt=LH_DefaultDomain_0&hash=item2a4c8ca243
It's because there are two <a> tags associated with each domain listing: one for the link and one for the image.
Just change the script to compare the current link with the previous one: if they're the same, don't write it. Something like this:

foreach ($links as $link) {
    $prev_link = '';
    $links_ebay=$link->getAttribute("href");
    if (strpos($links_ebay,"itm") && $links_ebay != $prev_link){
        fwrite($fd,$links_ebay);
        fwrite($fd,"\n");
        $prev_link = $links_ebay;
    }
}
PHP:
That code would reset $prev_link on every iteration. I think a cleaner solution is to just check for the img class attribute on the <a> tag using his DOM parser.
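A rough sketch of that idea, assuming the thumbnail anchor is the one that wraps an <img> element (I haven't verified this against eBay's actual markup, so checking the anchor's class attribute instead may be the better test). The function name itemLinksWithoutImages and the sample HTML are made up for illustration:

```php
<?php
// Sketch: keep only item links whose <a> does not wrap an <img>.
// Assumption: the duplicate comes from an anchor around the thumbnail image.
function itemLinksWithoutImages(string $html): array
{
    $dom = new DOMDocument();
    // Suppress the warnings libxml emits for real-world, non-well-formed HTML.
    @$dom->loadHTML($html);

    $result = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (strpos($href, 'itm') === false) {
            continue; // not an item link
        }
        if ($link->getElementsByTagName('img')->length > 0) {
            continue; // this anchor wraps the thumbnail, skip it
        }
        $result[] = $href;
    }
    return $result;
}

// Made-up sample: one listing with an image link and a text link.
$sample = '<div><a href="http://www.ebay.com/itm/example/1"><img src="t.jpg"></a>'
        . '<a href="http://www.ebay.com/itm/example/1">example.com</a></div>';
$found = itemLinksWithoutImages($sample);
print_r($found);
```

With this filter each listing contributes only its text link, so nothing has to be compared against earlier links.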
Oops - the $prev_link = ''; should of course be OUTSIDE the foreach loop. Shit happens when you type on a mobile keyboard.
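For reference, a minimal sketch of the comparison with $prev_link initialised once, before the loop. It works on a plain array of hrefs instead of writing to $fd so it can be run on its own; the function name skipAdjacentDuplicates and the sample URLs are hypothetical. Note that this only catches duplicates that directly follow each other:

```php
<?php
// Sketch: drop a link when it equals the one kept immediately before it.
function skipAdjacentDuplicates(array $hrefs): array
{
    $prev_link = ''; // initialised once, outside the loop
    $kept = [];
    foreach ($hrefs as $href) {
        // strpos() can return 0, so compare against false explicitly.
        if (strpos($href, 'itm') !== false && $href !== $prev_link) {
            $kept[] = $href;
            $prev_link = $href;
        }
    }
    return $kept;
}

// Made-up sample: every item link appears twice in a row.
$hrefs = [
    'http://www.ebay.com/itm/example/1',
    'http://www.ebay.com/itm/example/1',
    'http://www.ebay.com/help',
    'http://www.ebay.com/itm/example/2',
    'http://www.ebay.com/itm/example/2',
];
$kept = skipAdjacentDuplicates($hrefs);
print_r($kept);
```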
I have a small question. I have this HTML code:

<li class="lvprice prc">
    <span class="bold"> <b>ILS</b> 597.46</span>
</li>
HTML:

How can I get 597.46 using PHP? I'm thinking of using DOMXPath.
In pseudocode:

doc('span[class=bold]')->innerText
Code (php):

or

$span = doc('span[class=bold]');
$span->children('b')->remove();
$price = $span->innerText;
Code (php):
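In real PHP, the same thing could look roughly like this with DOMXPath. This is a sketch that assumes the markup is exactly the snippet from the question; the variable names are only illustrative:

```php
<?php
// Markup copied from the question.
$html = '<li class="lvprice prc"> <span class="bold"> <b>ILS</b> 597.46</span> </li>';

$dom = new DOMDocument();
// Suppress libxml warnings about the incomplete document.
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// text() selects only the span's own text nodes,
// so the currency inside <b> is left out without removing any element.
$nodes = $xpath->query('//li[contains(@class, "lvprice")]/span[@class="bold"]/text()');

$price = '';
foreach ($nodes as $node) {
    $price .= $node->nodeValue;
}
$price = trim($price);

echo $price, "\n";
```

Selecting the text nodes directly avoids modifying the DOM, which is what the second pseudocode variant does by removing the <b> child first.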
Why not use a temporary array to store the found URLs? Before writing to the file, check whether the URL already exists in the array: if not, add it and write it; if it does, ignore it and go to the next one... It ain't that hard...
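Something like this, as a minimal sketch. It uses the URLs as array keys so the lookup is a cheap isset() instead of a search through the whole array; the function name filterNewLinks and the sample URLs are made up, and the fwrite() to the file is left out so it runs on its own:

```php
<?php
// Sketch: remember every URL already seen and return only the new ones.
// $seen is passed by reference so it persists across pages.
function filterNewLinks(array $hrefs, array &$seen): array
{
    $new = [];
    foreach ($hrefs as $href) {
        if (strpos($href, 'itm') === false) {
            continue; // not an item link
        }
        if (isset($seen[$href])) {
            continue; // already written once, ignore
        }
        $seen[$href] = true;
        $new[] = $href; // this is where the fwrite() would go
    }
    return $new;
}

$seen = [];
// Made-up sample data for two result pages.
$page1 = filterNewLinks([
    'http://www.ebay.com/itm/example/1',
    'http://www.ebay.com/itm/example/2',
    'http://www.ebay.com/itm/example/1',
], $seen);
$page2 = filterNewLinks([
    'http://www.ebay.com/itm/example/2',
    'http://www.ebay.com/itm/example/3',
], $seen);
print_r($page1);
print_r($page2);
```

Unlike comparing with the previous link, this also catches duplicates that are not next to each other or that show up again on a later page, at the cost of keeping every URL in memory.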