PHP-DOM Scraping HTML content

Discussion in 'PHP' started by Camay123, Oct 12, 2012.

  1. #1
    OK,

    So I want to be able to enter a URL in a form and process the page with PHP/DOM.

    I want to extract certain information from it based on HTML tags.

    For example, I want to enter a URL and extract the text content of the H1 tag.

    I'm running into 2 difficulties:

    1) I extract the whole h1 tag, but I just want the innerHTML.

    2) If the H1 contains a link, I also extract the <a href...> tag.

    I'm using this parser: http://simplehtmldom.sourceforge.net/manual.htm

    Here is my code so far:

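    <?php

    include("simple_html_dom.php");

    // Fetch and parse the page at the submitted URL.
    $html = file_get_html("http://".$_POST["url"]);

    // Print every H1 element (outer HTML, tags included).
    foreach($html->find('h1') as $element) {
        echo $element;
    }

    ?>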
    Code (markup):
    Any help is appreciated, as I'm trying to learn all this as an experimental project.
     
    Camay123, Oct 12, 2012
  2. pixelator

    #2
    Hi,

    I'm not sure if this will help, but why not use jQuery to access and manipulate the content on the client side?


     
    pixelator, Oct 12, 2012
  3. koconder

    #3
    jQuery is a good way to handle this. You can also use XPath, which treats the DOM as an XML tree of elements that you can query.
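
    For reference, here is a minimal sketch of the XPath route using PHP's bundled DOM extension (DOMDocument and DOMXPath). The helper name h1_texts and the sample markup are made up for illustration; a real page would have to be downloaded first, for example with file_get_contents.

    ```php
    <?php
    // Hypothetical helper: returns the text of every <h1> found in an HTML string.
    function h1_texts(string $html): array
    {
        $doc = new DOMDocument();
        // Real-world pages are rarely valid, so collect parse warnings instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($doc);
        $texts = [];
        foreach ($xpath->query('//h1') as $node) {
            // textContent joins all descendant text, so a nested <a> tag is dropped.
            $texts[] = trim($node->textContent);
        }
        return $texts;
    }

    // Sample markup standing in for a downloaded page.
    $sample = '<html><body><h1><a href="/home">Welcome</a> back</h1><p>Body</p></body></html>';
    print_r(h1_texts($sample));
    ```

    Because it reads textContent, this covers both difficulties from the first post at once: the h1 tags themselves and any link inside them are left out.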
     
    koconder, Oct 13, 2012
  4. Red Swordfish Media

    #4
    You might also look into PHP's file_get_contents. You can then use strpos to find the start and stop positions of the tag.
     
    Red Swordfish Media, Oct 13, 2012
  5. ThePHPMaster

    #5
    Seems like an interesting object:

    To get the InnerHTML:

    
    echo $element->innertext;
    
    PHP:
    To get the href of a link that wraps the H1, I would read the parent's href attribute:

    
    echo $element->parent()->href;
    
    PHP:
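
    For comparison, the same three pieces of information can be read with PHP's built-in DOM extension. This is only a rough sketch: inner_html is a hypothetical helper, the markup is invented, and here the link sits inside the H1 (as in the first post), so it is looked up among the H1's descendants.

    ```php
    <?php
    // Hypothetical helper: serializes the children of a node, i.e. its inner HTML.
    function inner_html(DOMNode $node): string
    {
        $html = '';
        foreach ($node->childNodes as $child) {
            $html .= $node->ownerDocument->saveHTML($child);
        }
        return $html;
    }

    $doc = new DOMDocument();
    $doc->loadHTML('<html><body><h1><a href="/home">Welcome</a> back</h1></body></html>');
    $h1 = $doc->getElementsByTagName('h1')->item(0);

    echo inner_html($h1), "\n";      // markup inside the h1, link tag included
    echo $h1->textContent, "\n";     // text only, link tag dropped

    // The link is a descendant of the h1 in this sample, not its parent.
    $link = $h1->getElementsByTagName('a')->item(0);
    if ($link !== null) {
        echo $link->getAttribute('href'), "\n";
    }
    ```

    In Simple HTML DOM itself, the plaintext property plays the role of textContent here, returning the element's text without any tags.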
     
    ThePHPMaster, Oct 15, 2012
  6. KsNitro

    #6
    You may want to have a look at Zend_Dom, which is part of the Zend PHP framework.

    Here is a Link.
     
    KsNitro, Oct 17, 2012
  7. SHRT.IN

    #7
    Alright I just tested this and it works. Enjoy:
    <?php
    $content = file_get_contents("http://" . $_POST['url']);
    $start_limiter = '<h1>';
    $end_limiter = '</h1>';
    $start_pos = strpos($content, $start_limiter);
    if ($start_pos === FALSE) { die("Starting limiter ".$start_limiter." not found "); }
    // Skip the opening tag so that only the inner HTML is kept.
    $start_pos += strlen($start_limiter);
    $end_pos = strpos($content, $end_limiter, $start_pos);
    if ($end_pos === FALSE) { die("Ending limiter ".$end_limiter." not found "); }

    $h1tag = substr($content, $start_pos, $end_pos - $start_pos);
    echo $h1tag;
    ?>
    PHP:
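
    A possible variation on the string-search approach, sketched under a few assumptions: the H1 may be written in upper case or carry attributes, and only the first H1 matters. The function name first_h1_inner and the sample markup are invented for illustration. strip_tags then removes a nested link tag when only the text is wanted.

    ```php
    <?php
    // Hypothetical helper: returns the inner HTML of the first <h1>, or null if none is found.
    function first_h1_inner(string $content): ?string
    {
        // stripos is case-insensitive, so <H1> and <h1 class="..."> both match.
        $open = stripos($content, '<h1');
        if ($open === false) {
            return null;
        }
        // The opening tag ends at the next '>', whatever attributes it carries.
        $start = strpos($content, '>', $open);
        if ($start === false) {
            return null;
        }
        $start++;
        $end = stripos($content, '</h1>', $start);
        if ($end === false) {
            return null;
        }
        return substr($content, $start, $end - $start);
    }

    $inner = first_h1_inner('<body><h1 class="title"><a href="/home">Welcome</a> back</h1></body>');
    echo $inner, "\n";               // inner HTML, link tag included
    echo strip_tags($inner), "\n";   // text only
    ```

    String searching like this is fragile (a '>' inside an attribute value would confuse it), so a DOM parser is usually the safer choice for arbitrary pages.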
     
    SHRT.IN, Oct 17, 2012