I started writing a PHP crawler and I need to filter some links. How can I do it? preg_match_all? For example, I need to block https://www.website.com/user.. My code:

<?php
$page = file_get_contents('https://www.website.com/');
// crude split on anchor tags
$newstr = preg_split("/<a href=/", $page);
//print_r($newstr);
$fh = fopen("file111.txt", "a");
foreach ($newstr as $links) {
    $links = strip_tags($links);
    print("$links <br/>");
    $links = trim($links); // trim() returns the trimmed string; the result must be assigned
    fwrite($fh, $links);
}
fclose($fh);
?>
So disregard NetStar and perhaps do this:

<?php
$page = file_get_contents('https://www.website.com/');
$newstr = preg_split("/<a href=/", $page);
//print_r($newstr);
$fh = fopen("file111.txt", "a");
foreach ($newstr as $links) {
    // If $links does not contain the blocked URL, we echo and write to file
    if (strpos($links, 'https://www.website.com/user..') === false) {
        $links = strip_tags($links);
        print("$links <br/>");
        $links = trim($links); // assign the return value, trim() does not modify in place
        fwrite($fh, $links);
    }
}
fclose($fh);
?>
You can disagree with me. However, you can't disregard the use of a library for parsing HTML. A regular expression is not the right tool for extracting links. First, it assumes that ALL links begin with "a href", contain no extra spaces, and sit on a single line, which is not always true. Sometimes TITLE, NAME, etc. also follow the "A". Second, the posted code is NOT parsing all links. So again: look into PHP Simple HTML DOM Parser or phpQuery.
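For what it's worth, PHP's built-in DOMDocument can do this with no third-party install, and it finds links regardless of attribute order or line breaks. A minimal sketch; the sample markup and the blocked prefix are made-up examples:

```php
<?php
// Sketch: extract every href with PHP's built-in DOM extension,
// skipping links under a blocked path (hypothetical example URL).
$html = '<html><body>
  <a title="home" href="https://www.website.com/">Home</a>
  <a href="https://www.website.com/user/profile">Profile</a>
</body></html>';

$doc = new DOMDocument();
// Suppress warnings from imperfect real-world markup
@$doc->loadHTML($html);

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // Filter: drop links under the blocked path
    if (strpos($href, 'https://www.website.com/user') === 0) {
        continue;
    }
    $links[] = $href;
}
print_r($links);
```

Note the first link still parses even though TITLE comes before HREF, which is exactly the case the split-on-"a href" approach misses.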
He didn't ask if he was parsing it correctly; he asked what the next step was, which I have answered. If he wants to do a better job of parsing the page, you could just recommend that he use PHP's DOMXPath class, which works very well. No need to mess with any of the solutions you listed.
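A quick sketch of the DOMXPath approach, assuming made-up sample markup and the same hypothetical blocked prefix as above:

```php
<?php
// Sketch of link extraction with DOMXPath.
// The HTML and the blocked prefix are hypothetical examples.
$html = '<div><a href="/about">About</a>
<a name="x" href="https://www.website.com/user/1">User 1</a></div>';

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings on sloppy markup

$xpath = new DOMXPath($doc);
$kept = [];
// Select only <a> elements that actually carry an href attribute
foreach ($xpath->query('//a[@href]') as $a) {
    $href = $a->getAttribute('href');
    if (strpos($href, 'https://www.website.com/user') !== 0) {
        $kept[] = $href;
    }
}
print_r($kept);
```

The XPath expression `//a[@href]` skips anchors that are only named targets, so there is no need to special-case missing attributes.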
Disregard both of them. Use QueryPath, which is a wrapper around PHP's DOMDocument and is superior in every way to Simple HTML DOM or regex. Simple HTML DOM uses regex internally, so it's not really better than plain regex, plus it has nasty memory-leak issues (sort of).
So you took the time to provide an answer knowing that ultimately it wouldn't have served him any justice? Welp...that's terrible advice because these posts are archived for others to see.
I personally prefer regex for simple HTML search/replace. You can use this regex: <a[^>]*>[^\r\n]+</a> with the "ungreedy" modifier (look up pattern modifiers). Any HTML parser that is not based on regex will fail to parse an improper HTML document due to HTML errors (forgotten closing tags, typos, closing tags in the wrong order, etc.). That's why regex is better for a simple search/replace. If you need anything beyond simple search/replace, you should consider a proper HTML parser. Good luck!
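A small demonstration of the pattern above with PCRE's U (ungreedy) modifier; the sample HTML is made up:

```php
<?php
// The pattern above with the U (ungreedy) modifier: each quantifier
// matches minimally, so [^\r\n]+ stops at the first </a> instead of
// swallowing everything up to the last one on the line.
$html = '<p><a href="/a">one</a> and <a href="/b">two</a></p>';

preg_match_all('#<a[^>]*>[^\r\n]+</a>#U', $html, $m);
print_r($m[0]);
```

Without the U modifier the single match would run from the first `<a` to the last `</a>`, swallowing the " and " between the two links.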
It depends on what you're crawling for and what site you're crawling, but preg_match_all() (see the php.net search box) is an excellent way of getting the info you need at minimal CPU cost.
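For example, capturing just the href values with preg_match_all() and then filtering them might look like this; the URLs and the blocked prefix are placeholders:

```php
<?php
// Sketch: capture href values with preg_match_all, then drop a
// blocked prefix (the www.website.com URLs are placeholders).
$html = '<a href="https://www.website.com/news">News</a>
<a href="https://www.website.com/user/42">User</a>';

// Capture group 1 is the URL between the quotes around href=
preg_match_all('#<a\s[^>]*href=["\']([^"\']+)["\']#i', $html, $m);

// Keep every captured URL except those under the blocked path
$allowed = array_values(array_filter($m[1], function ($url) {
    return strpos($url, 'https://www.website.com/user') !== 0;
}));
print_r($allowed);
```

This still inherits the usual regex caveats from earlier in the thread (unquoted attributes, exotic markup), but for a known site it is cheap and direct.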