Is it possible to screen scrape / grab a div with an ID from an external site?

Discussion in 'PHP' started by Deliwasista, Feb 23, 2012.

  1. #1
    I have used PHP and cURL to screen-scrape a web page into another web page, but the page called was a very basic .txt file.

    What I need to do now is pull a div from an external web page. The div has an ID of "mobile".

    This div contains all of the page copy: no navigation, footer, header, etc.

    I own both sites. The reason for doing this is that the copy on the scraped page will be updated frequently, and as this copy is duplicated on the other site, this will mean not having to do the update twice.

    I have spent 6 hours wandering the web trying to work out how it's done. I see other people asking the same question, and the answers agree it is possible, but no one gives a clear answer as to how.

    I am not a PHP programmer, so partial instructions won't be enough, sadly. I have found a popular answer: "Download the page using cURL (There are a lot of examples in the documentation). Then use a DOM Parser, for example Simple HTML DOM or PHPs DOM to extract the value from the div element." But I've tried this and it's too advanced for me to work out. I've asked my service provider and they pointed me towards grabber v.01, which I've downloaded, but it isn't commented well enough for me to adjust.

    1) Can this be done by inserting something between the divs on my page that is calling the screen grab?

    2) Or does this need to be a program that runs and then delivers it to my page calling the screen grab?

    3) Is this something called in the header of my page calling the grab?

    I'm lost.

    I have tried:
    <div id="page">

    <?
    $html = file_get_contents('http://www.mysite.com',0);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $dom_element = $dom->getElementById('mobile');
    $inner_html = $dom_element->textContent;
    ?>

    </div>

    I've also tried:
    <div id="page">

    $html = file_get_html('http://www.mysite.com');
    $ret = $html->find('div[id=mobile]');
    </div>

    and that didn't work either.

    Any pointers much appreciated!
     
    Deliwasista, Feb 23, 2012 IP
  2. Deliwasista

    Deliwasista, Member (Messages: 35; Likes Received: 1; Best Answers: 0; Trophy Points: 43)
    #2
    PS: I've tried a whole heap of other things too, but too many to list! These were the last two.
     
    Deliwasista, Feb 23, 2012 IP
  3. Lee Stevens

    Lee Stevens, Active Member (Messages: 148; Likes Received: 3; Best Answers: 2; Trophy Points: 68)
    #3
    Try this (PHP):

    <?php
    $url = "http://www.mysite.com";

    $d = new DOMDocument();
    $d->loadHTMLFile($url);

    $xpath = new DOMXPath($d);
    // Select the element whose id attribute is "mobile" (null if there is none)
    $myMobile = $xpath->query('//*[@id="mobile"]')->item(0);
    ?>
     
    Lee Stevens, Feb 24, 2012 IP
  4. Deliwasista

    #4
    Hi Lee,

    Thanks for the suggestion, but this one returns a stream of page warnings, e.g.:
    Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: ID download already defined in http://mysite.com, line: 139 in /home/mysite/www/m/test.php on line 20


    I did get my hopes up with this one:

    <?php
    $html = file_get_contents('http://www.trailrun.co.nz/aucklandseries/hunua.php');

    $dom = new DOMDocument('1.0', 'iso-8859-1');

    // Suppress any warnings from invalid HTML markup
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $query = '//div[@id="mobile"]';

    $nodes = $xpath->query($query);
    foreach ($nodes as $node) {
        // nodeValue returns the text content only, without the inner tags
        echo $node->nodeValue;
    }
    ?>

    This one above does actually pull the information from the div into my page! But sadly it strips all of the layout inside that div :( and just displays it as one massive paragraph, so it must be stripping the titles, images and classes within the div.
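
The run-together paragraph is most likely down to `nodeValue`, which returns only the text of a node and drops the tags inside it. A minimal sketch, assuming the markup inside the div should be kept: serialise each child of the matched div with `DOMDocument::saveHTML()`. The helper name `inner_html()` and the inline sample markup are made up for illustration; on the real page `$html` would come from `file_get_contents()` as in the attempt above.

```php
<?php
// Hypothetical helper: returns the markup inside the div with the given id,
// or null if no such div exists. Assumes $id contains no double quotes.
function inner_html(string $html, string $id): ?string
{
    $dom = new DOMDocument();

    // Collect libxml warnings (duplicate IDs, invalid markup) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $node  = $xpath->query('//div[@id="' . $id . '"]')->item(0);
    if ($node === null) {
        return null;
    }

    // saveHTML($child) serialises a node with its tags; nodeValue would give text only.
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Sample markup standing in for the fetched page.
$html = '<html><body><div id="nav">menu</div>'
      . '<div id="mobile"><h2>Hunua</h2><p class="intro">Race <b>info</b></p></div>'
      . '</body></html>';

echo inner_html($html, 'mobile');
```

Two caveats with this approach: class names only take effect if the page doing the including also loads the same stylesheet, and relative image paths inside the div would need to be made absolute to resolve from the other site.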
     
    Last edited: Feb 24, 2012
    Deliwasista, Feb 24, 2012 IP
  5. Lee Stevens

    #5
    When you say "strips all of the layout inside that div" do you mean inner HTML syntax? Or do you mean CSS styling?
     
    Lee Stevens, Feb 25, 2012 IP
  6. Rukbat

    Rukbat, Well-Known Member (Messages: 2,908; Likes Received: 37; Best Answers: 51; Trophy Points: 125)
    #6
    Doing what you want is advanced, so you'll either have to advance your skills, find someone who wants to play and has the skills or pay someone to do it. It's really as simple as using cURL or opening the foreign page as a file, but either one is "advanced".
     
    Rukbat, Feb 25, 2012 IP
  7. Deliwasista

    #7
    Thanks Rukbat, I came to that conclusion last night :) I've worked around my lack of knowledge by removing the div from the page I'm trying to screen-scrape and placing it in its own file, then using a PHP include to pull it back into its old page and cURL to pull it into my other page :)

    So I have achieved the end result of only having to update one file when the copy changes, so all good, if a little roundabout lol. Mobile app here we come ;)
     
    Deliwasista, Feb 25, 2012 IP
  8. mcflause

    mcflause, Peon (Messages: 3; Likes Received: 0; Best Answers: 0; Trophy Points: 0)
    #8
    Thanks for opening this thread, Deliwasista, and thanks for answering it, Lee Stevens!

    I needed to scrape the ATP Rank (tennis ranking) from the ATP site for a player's site I was doing. This works perfectly!

    Thanks, and good luck with your project!
     
    mcflause, Mar 24, 2012 IP
  9. Deliwasista

    #9
    It stripped out all the HTML syntax and styling and displayed all of the copy as one massive run-together paragraph :)
     
    Deliwasista, Mar 27, 2012 IP
  10. Deliwasista

    #10
    Excellent! I'm glad you found an answer too.

    FYI, I'm using cURL to pull the external file holding the copy into the frame of my page, and I'm looking for other options as this seems to slow my page load time down considerably.

    <?php
    $data = file_get_contents("http://www.mysite.co.nz/training/training.php",0);
    echo $data;
    ?>
    <?php
    $url = "http://www.mysite.co.nz/training/training.php";
    $ch = curl_init();
    $timeout = 5; // set to zero for no timeout
    curl_setopt ($ch, CURLOPT_URL, $url);
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $file_contents = curl_exec($ch);
    curl_close($ch);
    ?>
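
On the slow page load: each request to the page has to wait for a remote HTTP fetch before it can finish, and if both blocks above sit on the same page the file is fetched twice (the cURL result is also never echoed). One common option is to cache the fetched copy on disk and only re-fetch when it is older than some limit. A minimal sketch, assuming a writable cache file; `cached_fetch()`, the cache path and the 10-minute lifetime are made-up names and values, and the fetcher is passed in as a callable so the example runs without a network connection:

```php
<?php
// Hypothetical helper: return the cached copy if it is younger than $ttl seconds,
// otherwise call $fetch() and store the result for next time.
function cached_fetch(string $cacheFile, int $ttl, callable $fetch): string
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return (string) file_get_contents($cacheFile);
    }

    $fresh = $fetch();
    if ($fresh === false) {
        // Remote fetch failed: fall back to a stale copy if there is one.
        return is_file($cacheFile) ? (string) file_get_contents($cacheFile) : '';
    }

    file_put_contents($cacheFile, $fresh, LOCK_EX);
    return $fresh;
}

// Stand-in for the real fetch. On the live page this could instead be
// function () { return file_get_contents('http://www.mysite.co.nz/training/training.php'); }
$calls = 0;
$fetch = function () use (&$calls) {
    $calls++;
    return '<p>Training copy</p>';
};

$cacheFile = sys_get_temp_dir() . '/training_cache_' . uniqid() . '.html';

echo cached_fetch($cacheFile, 600, $fetch); // first call fetches and writes the cache
echo cached_fetch($cacheFile, 600, $fetch); // second call is served from the cache file
```

The trade-off is that an update to the copy can take up to the cache lifetime to show on the second site, so the lifetime would need to suit how often the copy changes.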



    PS: In my travels I came across this answer from someone else on the subject of PHP includes. I'm only calling a simple text file so this has not affected me, but I'm adding it to my thread in case it helps anyone else trying to include a more complicated page.
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------

    "Something not previously stated here - but found elsewhere - is that if a file is included using a URL and it has a '.php' extension - the file is parsed by php - not just included as it would be if it were linked to locally.

    This means the functions and (more importantly) classes included will NOT work.

    for example:

    <?php
    include "http://example.com/MyInclude.php";
    ?>

    would not give you access to any classes or functions within the MyInclude.php file.

    to get access to the functions or classes you need to include the file with a different extension - such as '.inc' This way the php interpreter will not 'get in the way' and the text will be included normally. "
     
    Deliwasista, Mar 27, 2012 IP