How do I extract all e-mail addresses from an HTML file?

Discussion in 'PHP' started by kkibak, Dec 11, 2006.

  1. #1
    I'm trying to write a little script that will run through an HTML page and identify every e-mail address on the page (even if it is inside a mailto: link). I want to store them all in an array so that I can echo them back one after another.

    Here's what I've tried so far. I'm not sure I'm approaching this correctly, and I can't get it to work (I'm also totally new to regex, so the problem could be there):

    
    	$url = "http://www.URL.com";
    
    	
    	if(!($contents = file_get_contents($url)))
    	{
    		echo 'could not open url';
    		exit;
    	}
    	
    	$contents = htmlentities($contents);
    	$p = "/[\._a-zA-Z0-9-]+@[\._a-zA-Z0-9-]+/i";
    	preg_match_all($p, $contents, $out);
    	  	   
    	echo "<h1>Email: ".$out[1]."</h1>";
    	echo "<hr>$contents";
    
    
    Code (markup):
    If the HTML code of www.URL.com had something like:

    
    blah blaeh email@email.com <table><tr><td>bleh <a href="mailto:email2@email.com">me</a></td></tr></table>
    
    and this email too: email3@email.com.
    
    Code (markup):
    I would want my script to get all 3 emails and echo them back to me.

    Any help is much appreciated! :) To preempt the suspicions: no, I'm not using this for anything spammy. It's to identify e-mails on a bunch of old HTML pages from before we had a db set up correctly.
     
    kkibak, Dec 11, 2006 IP
  2. kkibak

    kkibak Peon

    Messages:
    1,083
    Likes Received:
    78
    Best Answers:
    0
    Trophy Points:
    0
    #2
    This was meant to be an edit, but for some reason it made a new post. Sorry about that.

    I got it working and just wanted to share the code:

    
    	$url = "http://www.url.com";
    	
    	if(!($contents = file_get_contents($url)))
    	{
    		echo 'could not open url';
    		exit;
    	}
    	
    	$contents = htmlentities($contents);
    	$p = "/[\._a-zA-Z0-9-]+@[\._a-zA-Z0-9-]+/i";
    	
    	#$p = '/breeder/';
    	preg_match_all($p, $contents, $out);
    	  	   
    	$n = 0;
    	
    	$nout = $out[0];
    	foreach ($nout as $t) {
    		echo $nout[0];
    		$n++; 
    	}
    
    Code (markup):
     
    kkibak, Dec 11, 2006 IP
  3. krakjoe

    krakjoe Well-Known Member

    Messages:
    1,795
    Likes Received:
    141
    Best Answers:
    0
    Trophy Points:
    135
    #3
    or

    
    <?php
    	$url = "http://pozter.info";
    
    	$contents = @file_get_contents($url);
    		 
    	// [^\"]+ stops at the first closing quote; (.*) would run on to the last quote on the line
    	@preg_match_all("/mailto:([^\"]+)\"/", $contents, $out);
    	   	   
    	echo "<pre>"; print_r($out[1]);
    ?>
    
    PHP:
    PS, the foreach at the bottom of your code isn't doing anything
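
A note on mailto: patterns in general: a greedy `(.*)` followed by a quote runs to the last double quote on the line, not the first. The sketch below shows the difference on a made-up anchor tag (the `class` attribute is an assumption added for illustration, not something from the page in question).

```php
<?php
// Hypothetical markup: one mailto: link with a second quoted attribute after it.
$line = '<a href="mailto:email2@email.com" class="contact">me</a>';

// Greedy: .* grabs as much as it can, then backs up to the LAST double quote.
preg_match('/mailto:(.*)"/', $line, $greedy);

// Negated class: stops at the first quote (or at a ?subject=... query string).
preg_match('/mailto:([^"?]+)/', $line, $tight);

echo $greedy[1], "\n";
echo $tight[1], "\n";
```

The second pattern is only one way to do it; a lazy `(.*?)` before the quote would also stop at the first closing quote.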
     
    krakjoe, Dec 11, 2006 IP
  4. kkibak
    #4
    That's the part that echoes the email addresses.

    Also, that code would only match emails that have mailto: code surrounding them. I want all addresses, even those that are not clickable.
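
For what it's worth, a single address pattern run over the raw HTML should pick up both cases, since an address inside a mailto: link is still plain text in the source. This is a rough sketch, assuming the sample markup from the first post stands in for the fetched page. The pattern is a simplified one chosen for illustration, not a full e-mail validator, and `$emails` is just a name picked here.

```php
<?php
// Sample markup from the first post, used in place of a fetched page.
$html = 'blah blaeh email@email.com <table><tr><td>bleh '
      . '<a href="mailto:email2@email.com">me</a></td></tr></table>'
      . "\n" . 'and this email too: email3@email.com.';

// Local part, "@", then dot-separated domain labels.
// Each dot must be followed by a label, so a sentence-ending period is left out.
$pattern = '/[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+/i';

preg_match_all($pattern, $html, $out);

// $out[0] holds the full matches; drop duplicates and reindex from 0.
$emails = array_values(array_unique($out[0]));

print_r($emails);
```

The ":" in "mailto:" is not in the local-part character class, so the match starts right after it and the mailto: prefix is not captured.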
     
    kkibak, Dec 11, 2006 IP
  5. krakjoe
    #5
    $nout = $out[0];
    foreach ($nout as $t) {
    echo $nout[0];
    $n++;
    }

    You're echoing the same value every time you loop. The $n var is incremented but never read, and $t isn't used at all, so the foreach loop isn't doing anything useful.
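
A minimal sketch of a loop that prints each match once, using the loop variable directly so no separate counter is needed. The `$out` array here is hard-coded as a stand-in for what `preg_match_all()` would fill in, and `$lines` is a name introduced just for this example.

```php
<?php
// Stand-in for the array preg_match_all() fills in; index 0 holds the full matches.
$out = [['email@email.com', 'email2@email.com', 'email3@email.com']];

$lines = [];
foreach ($out[0] as $i => $email) {
    // Use the loop variable itself, and escape it since it came from a fetched page.
    $lines[] = ($i + 1) . ': ' . htmlspecialchars($email);
}

echo implode("\n", $lines), "\n";
```

The `$i => $email` form gives the index for free if a running number is wanted in the output.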

    every e-mail address on the page (even if it is in a mailto: ).

    Didn't see the word "even", sorry.
     
    krakjoe, Dec 11, 2006 IP
  6. kkibak
    #6
    Oops, there's a typo in that code. It should be echo $nout[$n];
     
    kkibak, Dec 11, 2006 IP