PHP and scraping help

Discussion in 'PHP' started by chosenlight, Oct 26, 2013.

  1. #1
    Hi everyone,
    I'm working on a project where I need to extract the content between two HTML tags using PHP. I've tried a bunch of things, but nothing is working for me. If anyone can provide code or point me to a script, that would be wonderful.

    So for example, the target website has the following:

    bunch of stuff I don't need
    <font color="#0083c6">Data I need</font>
    bunch of stuff I don't need
    <font color="#0083c6">Data I need</font>
    bunch of stuff I don't need
    <font color="#0083c6">Data I need</font>
    bunch of stuff I don't need

    Printing the results on the screen, maybe with commas between each set, would be perfect.

    Thanks everyone
     
    chosenlight, Oct 26, 2013 IP
  2. jscg

    jscg Well-Known Member

    #2
    You should use PHP Simple HTML DOM Parser: http://simplehtmldom.sourceforge.net/

    Manual: http://simplehtmldom.sourceforge.net/manual.htm

    This is just an example of how it can be used:
    
    include('simple_html_dom.php');
    
    // Create DOM from URL
    $html = file_get_html('http://example.com/');
    
    foreach($html->find(('font[color=#0083c6]') as $element)
          echo $element->plaintext . '<br>';
    
    PHP:
     
    Last edited: Oct 27, 2013
    jscg, Oct 27, 2013 IP
  3. ryan_uk

    ryan_uk Illustrious Member

    #3
    That's not needed, as preg_match (or preg_match_all, if you need more than one occurrence) or DOMDocument can do this task; then just write your own custom, reusable function. I hope the OP is using it for ethical purposes or has permission to retrieve content from a site in this way.
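
As a rough sketch of the preg_match_all route, the following pulls every occurrence between an opening and a closing tag and prints them separated by commas. The helper name extract_between and the sample markup are made up for illustration, and the type hints assume PHP 7 or later.

```php
<?php
// Sample markup standing in for the fetched page (illustrative only).
$html = 'bunch of stuff I don\'t need'
      . '<font color="#0083c6">Alpha</font>'
      . 'bunch of stuff I don\'t need'
      . '<font color="#0083c6">Beta</font>';

// Hypothetical helper: return every piece of text found between $open and $close.
function extract_between(string $html, string $open, string $close): array
{
    // preg_quote() escapes characters in the tags that are special in a regex.
    $pattern = '/' . preg_quote($open, '/') . '(.*?)' . preg_quote($close, '/') . '/is';
    preg_match_all($pattern, $html, $matches);
    return array_map('trim', $matches[1]);
}

$items = extract_between($html, '<font color="#0083c6">', '</font>');
echo implode(', ', $items), PHP_EOL; // prints: Alpha, Beta
```

The non-greedy (.*?) keeps each match from running past the first closing tag, and the s modifier lets a match span line breaks.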
     
    ryan_uk, Oct 27, 2013 IP
  4. Darkforest

    Darkforest Active Member

    #4
    I'd say go the jscg way if you are working on bigger websites. The Simple HTML DOM library comes in really handy when you have to re-use the code a lot or the websites have complex structures.
    However, if it's a small site, then preg_match should be more than enough :)

    Also, a further note. PHP will not be able to parse ANY JavaScript that you give it. Say, if the site is partially generated using JavaScript, PHP has no way of seeing the part of the website that JavaScript has changed or generated.

    In order to make this work, you'll need far more complicated things. You'll probably need a whole server and a browser to render the page with some additional libraries to get the elements that you want ;)
     
    Darkforest, Oct 27, 2013 IP
  5. ian11

    ian11 Greenhorn

    #5
    I recommend using simple built-in PHP functions for this, like so:

    $url = "http://jafty.com/";
    $HTML = file_get_contents($url);
    $begin_tag = "<title>";
    $end_tag = "</title>";
    $array1 = explode($begin_tag, $HTML);//text you want will be at key 1
    $content1 = $array1[1];
    $array2 = explode($end_tag, $content1);//text you want will be at key 0
    $content2 = $array2[0];
    echo $content2;
    PHP:
    The above code will print the contents of the title tag of my website, Jafty.com, to the browser window. Simply change the $url, $begin_tag and $end_tag values to suit your needs, run the script, and it will work. I did test it, by the way.
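
The explode() approach above reads only key 1, so it returns just the first occurrence. A possible extension, sketched here with made-up sample markup in place of a fetched page, loops over every chunk that follows a begin tag:

```php
<?php
// Sample markup standing in for file_get_contents($url) (illustrative only).
$HTML = 'x<font color="#0083c6">One</font>y<font color="#0083c6">Two</font>z';
$begin_tag = '<font color="#0083c6">';
$end_tag = '</font>';

$results = array();
$chunks = explode($begin_tag, $HTML);
array_shift($chunks); // drop the text that precedes the first begin tag

foreach ($chunks as $chunk) {
    // Split once on the end tag; the wanted text is the part before it.
    $parts = explode($end_tag, $chunk, 2);
    if (count($parts) === 2) { // skip chunks that have no end tag
        $results[] = $parts[0];
    }
}

echo implode(', ', $results), PHP_EOL; // prints: One, Two
```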
     
    ian11, Oct 28, 2013 IP
  6. kutchbhi

    kutchbhi Active Member

    #6
    Say No to simple html dom (memory leak issues)
    Say No to regex (ugly)
    Say Yes to querypath!!

    // Assumes the QueryPath library is already loaded.
    $qp = htmlqp('file/string');
    $fontTags = $qp->find('font');

    foreach ($fontTags as $tag) {
        echo $tag->text();
    }

    http://www.ibm.com/developerworks/web/library/os-php-querypath/index.html
     
    kutchbhi, Oct 28, 2013 IP
  7. PinoyEngine™

    PinoyEngine™ Well-Known Member

    #7
    If you're scraping an online website, cURL could help you.
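
A minimal sketch of fetching a page with cURL, assuming the PHP cURL extension is installed. The helper name fetch_html, the timeout and the user-agent string are illustrative choices, not anything from this thread. cURL only downloads the page; the extraction still has to be done with one of the other methods discussed here.

```php
<?php
// Hypothetical helper: fetch a page body with cURL, or return null on failure.
function fetch_html(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,           // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,           // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'example-scraper/0.1', // placeholder value
    ]);
    $body = curl_exec($ch); // false on failure

    return is_string($body) ? $body : null;
}

// Usage (needs network access):
// $html = fetch_html('http://example.com/');
```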
     
    PinoyEngine™, Oct 29, 2013 IP
  8. hav0c

    hav0c Notable Member

    #8
    Maybe you could try something similar to this:

    Use print_r($data) to print the array and, of course, print_r($data[1]) or whichever specific piece(s) you need to make visible.
     
    Last edited: Oct 29, 2013
    hav0c, Oct 29, 2013 IP
  9. chosenlight

    chosenlight Active Member

    #9
    Wow - thank you guys for all your help! I have a few options here, and I'll try them out and let you know which worked best for me. Thanks again.
    Sorry for disappearing there, was traveling w/o inet access!

    @ryan_uk: what I'm retrieving is public domain.
     
    chosenlight, Oct 29, 2013 IP
  10. chosenlight

    chosenlight Active Member

    #10
    Hi all - just need a little bit more help.

    @ian11 I tried your method, and it retrieves only the first record. I tried print_r($content2) too, and it also showed only the first result.

    @jscg your method returns an unexpected T_AS error. I see that you're using a foreach and not a for loop, so that can't be the problem. I also changed color=#0083c6 to color=\"#0083c6\", but it didn't accomplish much.
     
    chosenlight, Oct 29, 2013 IP
  11. jscg

    jscg Well-Known Member

    #11
    Paste full error line.
     
    jscg, Oct 30, 2013 IP
  12. ezprint2008

    ezprint2008 Well-Known Member

    #12
    The problem with doing this is that you don't know what will be in a page's source code (if you're going to use the script to target multiple sites).
    I've tried this with dictionary.com to grab definitions of words and whether they are a noun, verb, etc.
    The code was rough off-road crapology to make it work, but since I was only using Dictionary.com I was able to see how they formatted their responses.
    It still sometimes comes back with extra characters. But that's a good question: how do you grab information when you don't know what it will be, without including all the junk from the source code? Strip the HTML characters, sanitize scripts, etc. I've never tried cURL, so maybe Pinoy has something?
    If you're only going to grab from one site all the time, I have some rough code that can do the trick as long as their formatting stays the same. I mark the area and line of the source code,
    and then remove the matching strings to clear the junk, which leaves only the part you want between those tags.
     
    ezprint2008, Oct 30, 2013 IP
  13. chosenlight

    chosenlight Active Member

    #13
    @jscg -

    Thanks for your help. Here is the error message; I just changed the file path and my username. It's hosted on Netfirms, which I believe runs PHP 5.2.

    Parse error: syntax error, unexpected T_AS in /hermes/bosoraweb128/b1783/nf.[myusername]/public_html/[myfile.php] on line 5
    
    <?php
    include('simple_html_dom.php');
    // Create DOM from URL
    $html = file_get_html('http://[webpath]');
    foreach($html->find(('font[color=\"#0083c6\"]') as $element)
      echo $element->plaintext . '<br>';
    ?>
    
    Code (markup):
    Now forgive me, I'm not familiar with how DOM works, but is simple_html_dom.php a file that has to be uploaded to the server?
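
A likely cause of the T_AS parse error is the extra opening parenthesis in find((...): the parser reaches "as" while a parenthesis is still open. The sketch below reproduces the shape of that loop without the library; find_stub is a made-up stand-in for $html->find(), and its return values are invented. On the include question, simple_html_dom.php is a third-party file, not part of PHP, so it has to exist at the path given to include().

```php
<?php
// Hypothetical stand-in for $html->find(...): any call that returns an array
// behaves the same way as far as the foreach syntax is concerned.
function find_stub($selector)
{
    return array('Data I need', 'More data I need');
}

// Broken, as posted: two "(" after the function name but only one ")" before "as".
//   foreach (find_stub(('font[color=#0083c6]') as $element)

// Fixed: the parentheses are balanced before "as".
$seen = array();
foreach (find_stub('font[color=#0083c6]') as $element) {
    $seen[] = $element;
    echo $element . "<br>\n";
}
```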

    @ezprint2008 you know, I'm not even trying to do that. I am trying to grab info from one page and one page only. All the data is between the colored font tags and there are no special characters. It's basically a whole bunch of names with no hyphens, apostrophes, or anything like that.
    Thanks
     
    chosenlight, Oct 30, 2013 IP
  14. hav0c

    hav0c Notable Member

    #14
    Have you had a chance to try the few lines of code I've suggested? Might be exactly what you need in the simplest of ways.
     
    hav0c, Oct 30, 2013 IP
  15. chosenlight

    chosenlight Active Member

    #15
    @hav0c Thanks for your reply. Your code prints "Array ( )", so it's not finding anything. I am sure the font tags are accurate; I copied and pasted them.
    Does it make a difference that the target page is an ASP page (i.e. the URL ends in "browse.asp?c=a")? I don't think it should, but thought I should mention it just in case.

    my code is :
    
    <?php
    $url = file_get_contents("http://www.[domainname].com/browse.asp?c=a");
    preg_match('/<font color="#0083c6">(.*?)<\/font>/isu', $url, $data);
    print_r($data);
    ?>
    
    Code (markup):
    Thanks so much for your help.
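
One possible cause of the empty array, not confirmed anywhere in this thread: with the /u modifier, preg_match returns false when the subject is not valid UTF-8, which is plausible for an older page served as Windows-1252 or Latin-1. Differences in attribute quoting or tag case on the live page could be another. The sketch below uses made-up sample bytes to show the failure, then a more tolerant pattern without /u that collects every match.

```php
<?php
// Sample input (illustrative): "\xE9" is a Latin-1 "e acute", not valid UTF-8.
$html = "junk <font color=\"#0083c6\">Ren\xE9e</font> junk "
      . "<FONT color='#0083c6'>Adam</FONT> junk";

// With /u the subject must be valid UTF-8; here the call fails and returns false.
$withU = preg_match('/<font color="#0083c6">(.*?)<\/font>/isu', $html, $data);
$failedOnUtf8 = ($withU === false && preg_last_error() === PREG_BAD_UTF8_ERROR);

// Without /u, accepting double, single, or no quotes around the colour value.
$pattern = '/<font\s+color=["\']?#0083c6["\']?\s*>(.*?)<\/font>/is';
preg_match_all($pattern, $html, $matches);
$names = array_map('trim', $matches[1]);

echo $failedOnUtf8 ? "The /u pattern failed on invalid UTF-8\n" : "The /u pattern ran\n";
echo count($names), " match(es) without /u\n";
```

Checking preg_last_error() after a call that returns false is a quick way to tell "no match" apart from "the regex engine gave up".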
     
    chosenlight, Oct 30, 2013 IP
  16. ROOFIS

    ROOFIS Well-Known Member

    #16
    DOM is definitely the way to go but you can do so without a major class to call on.

    I'm a bit rusty on the Document Object Model, but I've coded an example below that should do what you're asking for. Do yourself a favour and look at the PHP reference pages listed in the example script, particularly regarding character encoding etc.

    You can easily turn it into a function by giving it a function name and returning the values instead of echoing them.

    <pre>
    <?php
    #see:http://php.net/manual/en/domdocument.loadhtmlfile.php
    #see:http://php.net/manual/en/domdocument.getelementsbytagname.php
    
    # @ suppresses the warnings that invalid HTML pages would otherwise raise
    $doc = new DOMDocument();
    @$doc->loadHTMLFile("http://www.[webpage you seek]/");
    $elements = $doc->getElementsByTagName('font');
    foreach ($elements as $font) {
        echo $font->nodeValue, PHP_EOL; # or " " for side-by-side rendering
    }
    ?>
    </pre>
    PHP:
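
getElementsByTagName('font') returns every font tag on the page, not just the coloured ones. A sketch of narrowing it down with DOMXPath, using made-up sample markup in place of the remote page and joining the results with commas:

```php
<?php
// Sample markup standing in for the remote page (illustrative only).
$html = '<html><body>'
      . 'bunch of stuff I do not need'
      . '<font color="#0083c6">Data I need</font>'
      . '<font color="#ff0000">Other font tag</font>'
      . '<font color="#0083c6">More data I need</font>'
      . '</body></html>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select only the font elements whose color attribute matches exactly.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//font[@color="#0083c6"]');

$values = array();
foreach ($nodes as $node) {
    $values[] = trim($node->textContent);
}

echo implode(', ', $values), PHP_EOL; // prints: Data I need, More data I need
```

libxml_use_internal_errors(true) does the same job as the @ operator above, but keeps the warnings available through libxml_get_errors() if you want to inspect them.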

    .
     
    Last edited: Oct 31, 2013
    ROOFIS, Oct 31, 2013 IP
  17. stephan2307

    stephan2307 Well-Known Member

    #17
    I would strongly suggest not using the DOM. DOM will only work correctly if the HTML is structured correctly, and I find that most of the time it will fail. Use preg_match_all and it will work whether the HTML is valid or not.
     
    stephan2307, Nov 7, 2013 IP
    ryan_uk likes this.