I wrote a PHP crawler script. How can I get all the links from a web page that I need to parse? My code is:

PHP:
<?php
// parser of eBay domain name listings
$website = "http://www.example.com";
$filename = "services4.txt";
$fd = fopen($filename, "a+");

$content = file_get_contents($website);

$dom = new DOMDocument;
libxml_use_internal_errors(true); // real-world HTML triggers parser warnings
$dom->loadHTML($content);

$links = $dom->getElementsByTagName("a");
foreach ($links as $link) {
    $href = $link->getAttribute("href");
    // keep only links whose href contains "listings"
    if (strpos($href, "listings") !== false) {
        $href = rtrim($href);
        fwrite($fd, $website . $href . "\n");
    }
}
fclose($fd);
?>
Don't use regex for this; regex on HTML is fragile and error-prone. Parsing the HTML with a proper DOM parser is far more reliable. But note that to fetch *all* links you'd have to consider more than anchors: src="" attributes carry URLs too, not just href="".
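For example, here's a minimal sketch of grabbing every href and src value with DOMXPath (untested; $url is a placeholder, and it assumes the page is reachable via file_get_contents):

PHP:
<?php
// Sketch: collect every href and src attribute value from one page.
$url = "http://www.example.com"; // placeholder target
$html = file_get_contents($url);

$dom = new DOMDocument;
libxml_use_internal_errors(true); // tolerate malformed markup
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$urls = [];
// //@href matches href on any element (a, link, base, ...);
// //@src matches src on any element (img, script, iframe, ...).
foreach ($xpath->query('//@href | //@src') as $attr) {
    $urls[] = $attr->value;
}

print_r(array_unique($urls));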
You already seem to be pulling all the links, so what are you actually asking? Or do you want to filter out just the links that point to the same domain? If so, use parse_url: http://php.net/manual/en/function.parse-url.php

If you extract the PHP_URL_HOST component and the result is either empty or matches the domain you are parsing, it's likely a document on the same site. You may also want to check whether a <base> tag is present and use its value when PHP_URL_HOST is missing.

I would also stick to just parsing href on anchors, since SRC attributes on SCRIPT or IMG tags point at resources, not content pages. Of course, if the site being parsed builds its links with JavaScript, you're pretty well stuck trying to deal with that in PHP alone.
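Something like this sketch shows the idea (untested; $site_host is a placeholder, and it assumes $dom is the DOMDocument you already loaded the page into):

PHP:
<?php
// Sketch: keep only anchors that point back at the site being crawled.
$site_host = "www.example.com"; // host of the site you are parsing

// Honour a <base href="..."> tag if the page has one.
$base_host = null;
foreach ($dom->getElementsByTagName("base") as $base) {
    $base_host = parse_url($base->getAttribute("href"), PHP_URL_HOST);
}

$same_site = [];
foreach ($dom->getElementsByTagName("a") as $a) {
    $href = $a->getAttribute("href");
    $host = parse_url($href, PHP_URL_HOST);
    if ($host === null || $host === false) {
        // Relative URL: same site, unless <base> says otherwise.
        $host = $base_host ?: $site_host;
    }
    if (strcasecmp($host, $site_host) === 0) {
        $same_site[] = $href;
    }
}

print_r($same_site);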