PHP scan file, display matches?

Discussion in 'PHP' started by JosS, Jun 3, 2007.

  1. #1
    I don't know if this is possible and what sort of things I need to look into, and was hoping one of you could help me out.

    Basically I have a URL, for instance http://www.mysite.com/yeah.php

    and on this page is a list of URLs and page text. I need to set up something that goes through the source of the page and grabs all the URLs containing a term such as 'pageno='

    Then it should spit them out in a list.

    What sort of things would I need to look at on php.net to accomplish this?

    Cheers in advance.
     
    JosS, Jun 3, 2007 IP
  2. krt

    krt Well-Known Member

    #2
    File handling functions or cURL. This usually does the job:
    $html = file_get_contents('http://site.com/page');
    PHP:
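    For the cURL route, a minimal sketch might look like the following. It assumes the cURL extension is installed, and fetch_html is a made-up helper name used only for illustration:

```php
<?php
// Fetch a page body with the cURL extension.
// Returns the HTML as a string, or false on failure.
// fetch_html is a hypothetical helper name, not something from the thread.
function fetch_html($url)
{
    $ch = curl_init($url);
    if ($ch === false) {
        return false;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds

    // The handle is released when $ch goes out of scope at the end of the function.
    return curl_exec($ch);
}

// Usage (needs network access):
// $html = fetch_html('http://site.com/page');
```

    cURL is mostly worth the extra lines when allow_url_fopen is disabled on the host, or when timeouts and redirects need to be controlled.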
    As for getting the URLs from the HTML that the above function returns, look up regex pattern matching.
    Something like:
    preg_match_all('~http://site\.com/\?pageno=\d+~', $html, $m);
    PHP:
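    preg_match_all fills $m with one array per capture group, and $m[0] holds the full matches. A rough sketch of printing them one per line, assuming a hard-coded sample string in place of the fetched page (the sample HTML and the $lines and $output variables are made up for illustration):

```php
<?php
// Sample HTML standing in for the result of file_get_contents() (hypothetical content).
$html = '<a href="http://site.com/?pageno=1">One</a> '
      . '<a href="http://site.com/about">About</a> '
      . '<a href="http://site.com/?pageno=22">Two</a>';

// $m[0] receives every full match of the pattern.
preg_match_all('~http://site\.com/\?pageno=\d+~', $html, $m);

// Escape each URL so it is safe to show in a page, then join with line breaks.
$lines = array_map('htmlspecialchars', $m[0]);
$output = implode("<br />\n", $lines);

echo $output;
```

    Echoing $m directly only prints "Array", because $m is an array of arrays; the matches have to be pulled out of $m[0] first.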
     
    krt, Jun 3, 2007 IP
  3. JosS

    JosS Guest

    #3
    If I do that and then say "echo $m;", the page just says "Array".

    Shouldn't it just display every result it finds?
     
    JosS, Jun 3, 2007 IP
  4. JosS

    JosS Guest

    #4
    Ok I worked that part out,

    print_r($m);

    I'll see how I go from here. Thanks a tonne mate! I will post when I have my project completed :)
     
    JosS, Jun 3, 2007 IP
  5. JosS

    JosS Guest

    #5
    Gah,

    How do I get rid of all the "Array ( [0] => Array ( [0] =>" stuff, and just echo each result on its own with a <br /> after it?
     
    JosS, Jun 3, 2007 IP
  6. raredev

    raredev Peon

    #6
    How about this one:
    
    $html = file_get_contents('http://site.com/page');
    $lines = explode("\n", $html);
    foreach ($lines as $line)
        if (strpos($line, 'pageno=') !== FALSE)
            echo $line.'<br/>';
    
    PHP:
     
    raredev, Jun 3, 2007 IP
  7. JosS

    JosS Guest

    #7
    Thanks! That's nearly perfect, but I see it echoes all the other HTML on the same line as the URL.

    Is there any way to strip out just the URLs from $lines and display them?
     
    JosS, Jun 3, 2007 IP
  8. krakjoe

    krakjoe Well-Known Member

    #8
    
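    <?php
    // Return every href in the page at $location that matches the $inurl pattern, or false if none match.
    // Opening a URL with fopen() requires allow_url_fopen to be enabled in php.ini.
    function extract_urls_from_html( $location, $inurl )
    {
    	$returns = array( );
    	
    	if( ( $handle = fopen( $location, 'r' ) ) )
    	{
    		// fgets() returns false at end of file, which ends the loop
    		while( ( $line = fgets( $handle ) ) !== false )
    		{
    			// preg_match_all() catches several links on one line; $matches[1] holds the captured URLs
    			if( preg_match_all( $inurl, $line, $matches ) )
    			{
    				$returns = array_merge( $returns, $matches[1] );
    			}
    		}
    		fclose( $handle );
    	}
    	return $returns ? $returns : false ;
    }
    
    $urls = extract_urls_from_html( "http://yoursite.com/yeah.php", '~href="([^"]*pageno=[^"]*)"~' );
    
    if( is_array( $urls ) )
    {
    	print( implode("<br />\n", $urls ) );
    }
    ?>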
    
    PHP:
    Something like that... Give me a link to the page if the regex doesn't work. Regex is a pretty specific thing to use; it's not normally good enough just to say a link contains "pageno=". You would have to give the exact structure of the link if you don't want the function to return anything else at all.
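    To illustrate the point about link structure, here is a rough sketch comparing a loose pattern with a stricter one. The sample HTML, the www.mysite.com host in the strict pattern, and the assumption that pageno is always a number are all made up for the example; the real page may well differ:

```php
<?php
// Hypothetical page source containing one wanted link and two near misses.
$html = '<a href="http://www.mysite.com/yeah.php?pageno=3">Page 3</a>' . "\n"
      . '<a href="http://other.example/track?ref=pageno=x">Ad</a>' . "\n"
      . '<a href="/help.php?topic=pageno=">Help</a>';

// Loose: any href that contains "pageno=" somewhere.
preg_match_all('~href="([^"]*pageno=[^"]*)"~', $html, $loose);

// Strict: fixed host and script name, and pageno must be a number that ends the URL.
preg_match_all('~href="(http://www\.mysite\.com/yeah\.php\?pageno=\d+)"~', $html, $strict);

echo count($loose[1]), " loose matches, ", count($strict[1]), " strict matches\n";
```

    The loose pattern picks up all three links, while the strict one keeps only the link that has the expected shape.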
     
    krakjoe, Jun 3, 2007 IP