I started writing a PHP crawler and I need to filter some links. How can I do it? preg_match_all? For example, I need to block https://www.website.com/user.. My code:

<?php
$page = file_get_contents('https://www.website.com/');
// crude split on anchor tags
$newstr = preg_split("/<a href=/", $page);
//print_r($newstr);
$fh = fopen("file111.txt", "a");
foreach ($newstr as $links) {
    $links = strip_tags($links);
    print("$links <br/>");
    $links = trim($links); // trim() returns the trimmed string; the result must be assigned
    fwrite($fh, $links);
}
fclose($fh);
?>
So disregard NetStar and perhaps do this:

<?php
$page = file_get_contents('https://www.website.com/');
$newstr = preg_split("/<a href=/", $page);
//print_r($newstr);
$fh = fopen("file111.txt", "a");
foreach ($newstr as $links) {
    // If $links does not contain the blocked URL, we echo and write to file
    if (strpos($links, 'https://www.website.com/user..') === false) {
        $links = strip_tags($links);
        print("$links <br/>");
        $links = trim($links); // assign the return value, trim() does not modify in place
        fwrite($fh, $links);
    }
}
fclose($fh);
?>
You can disagree with me. However, you can't disregard the use of a library for parsing HTML. A regular expression is not the right tool for extracting links. First, it assumes that ALL links begin with "a href", contain no extra spaces, and sit on a single line, which is not always true. Sometimes TITLE, NAME, etc. also follow the "A". Second, the posted code is NOT parsing all links. So again: look into PHP Simple HTML DOM Parser or phpQuery.
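For what it's worth, PHP's built-in DOMDocument can do this with no third-party install, and it finds links regardless of attribute order or line breaks. A minimal sketch; the sample markup and the blocked prefix are made-up examples:

```php
<?php
// Sketch: extract every href with PHP's built-in DOM extension,
// skipping links under a blocked path (hypothetical example URL).
$html = '<html><body>
  <a title="home" href="https://www.website.com/">Home</a>
  <a href="https://www.website.com/user/profile">Profile</a>
</body></html>';

$doc = new DOMDocument();
// Suppress warnings from imperfect real-world markup
@$doc->loadHTML($html);

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // Filter: drop links under the blocked path
    if (strpos($href, 'https://www.website.com/user') === 0) {
        continue;
    }
    $links[] = $href;
}
print_r($links);
```

Note the first link still parses even though TITLE comes before HREF, which is exactly the case the split-on-"a href" approach misses.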
He didn't ask if he was parsing it correctly; he asked what the next step was, which I have answered. If he wants to do a better job of parsing the page, you could just recommend that he use PHP's DOMXPath class, which works very well. No need to mess with any of the solutions you listed.
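A quick sketch of the DOMXPath approach, assuming made-up sample markup and the same hypothetical blocked prefix as above:

```php
<?php
// Sketch of link extraction with DOMXPath.
// The HTML and the blocked prefix are hypothetical examples.
$html = '<div><a href="/about">About</a>
<a name="x" href="https://www.website.com/user/1">User 1</a></div>';

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings on sloppy markup

$xpath = new DOMXPath($doc);
$kept = [];
// Select only <a> elements that actually carry an href attribute
foreach ($xpath->query('//a[@href]') as $a) {
    $href = $a->getAttribute('href');
    if (strpos($href, 'https://www.website.com/user') !== 0) {
        $kept[] = $href;
    }
}
print_r($kept);
```

The XPath expression `//a[@href]` skips anchors that are only named targets, so there is no need to special-case missing attributes.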
Disregard both of them. Use QueryPath, which is a wrapper around PHP's DOMDocument and is superior in every way to Simple HTML DOM or regex. Simple HTML DOM uses regex internally, so it's not really better than plain regex, plus it has nasty memory-leak issues (sort of).
So you took the time to provide an answer knowing that ultimately it wouldn't have served him any justice? Welp...that's terrible advice because these posts are archived for others to see.
I personally prefer regex for simple HTML search/replace. You can use this regex: <a[^>]*>[^\r\n]+</a> with the "ungreedy" modifier (look up pattern modifiers). Any HTML parser that is not based on regex will fail to parse an improper HTML document due to HTML errors (forgotten closing tags, typos, closing tags in the wrong order, etc.). That's why regex is better for a simple search/replace. If you need anything beyond simple search/replace, you should consider a proper HTML parser. Good luck!
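A small demonstration of the pattern above with PCRE's U (ungreedy) modifier; the sample HTML is made up:

```php
<?php
// The pattern above with the U (ungreedy) modifier: each quantifier
// matches minimally, so [^\r\n]+ stops at the first </a> instead of
// swallowing everything up to the last one on the line.
$html = '<p><a href="/a">one</a> and <a href="/b">two</a></p>';

preg_match_all('#<a[^>]*>[^\r\n]+</a>#U', $html, $m);
print_r($m[0]);
```

Without the U modifier the single match would run from the first `<a` to the last `</a>`, swallowing the " and " between the two links.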
It depends on what you're crawling for and what site you're crawling, but preg_match_all() (see the php.net search box) is an excellent way of getting the info you need at minimal CPU cost.
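For example, capturing just the href values with preg_match_all() and then filtering them might look like this; the URLs and the blocked prefix are placeholders:

```php
<?php
// Sketch: capture href values with preg_match_all, then drop a
// blocked prefix (the www.website.com URLs are placeholders).
$html = '<a href="https://www.website.com/news">News</a>
<a href="https://www.website.com/user/42">User</a>';

// Capture group 1 is the URL between the quotes around href=
preg_match_all('#<a\s[^>]*href=["\']([^"\']+)["\']#i', $html, $m);

// Keep every captured URL except those under the blocked path
$allowed = array_values(array_filter($m[1], function ($url) {
    return strpos($url, 'https://www.website.com/user') !== 0;
}));
print_r($allowed);
```

This still inherits the usual regex caveats from earlier in the thread (unquoted attributes, exotic markup), but for a known site it is cheap and direct.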