Need a little text scraper script

Discussion in 'PHP' started by Mr.Bill, Oct 19, 2008.

  1. #1
    I have this text on the same page of many sites, so I would like to be able to see it all at once by having a script that scrapes the content off each one.

    What needs to be scraped can be seen here: sneaking.org/online.php

    I want it to be able to scrape multiple sites
    example

    and then display it like this

    Hope this makes sense and that some nice person can help out.
     
    Mr.Bill, Oct 19, 2008 IP
  2. riamathews

    riamathews Peon

    Messages:
    306
    Likes Received:
    7
    Best Answers:
    0
    Trophy Points:
    0
    #2
    Code is done.
     
    riamathews, Oct 19, 2008 IP
  3. Mr.Bill

    Mr.Bill Well-Known Member

    Messages:
    2,818
    Likes Received:
    134
    Best Answers:
    0
    Trophy Points:
    160
    #3
    Looking for someone to offer it for free so that others can use it on their sites too. Not in the market to purchase one.
     
    Mr.Bill, Oct 19, 2008 IP
  4. Kyosys

    Kyosys Peon

    Messages:
    226
    Likes Received:
    10
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Just Google it. Scripts like these are fairly easy to create, and I'm sure you'll find what you need.
     
    Kyosys, Oct 20, 2008 IP
  5. blackthought286

    blackthought286 Well-Known Member

    Messages:
    334
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    103
    #5
    A little regex and maybe cURL is all you should need.
     
    blackthought286, Oct 20, 2008 IP
  6. riamathews

    #6
    Hi,

    Here is the code..
    
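     <?php
    // Page to scrape.
    $target_url = 'http://sneaking.org/online.php';
    // User-agent string sent with the request. The original snippet used
    // $userAgent without defining it, so this value is only a placeholder.
    $userAgent = 'Mozilla/5.0 (compatible; TextScraper/1.0)';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP codes >= 400 as errors
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    // curl_exec() returns false on failure; an empty page is not an error.
    if ($html === false) {
    	echo "<br />cURL error number: " . curl_errno($ch);
    	echo "<br />cURL error: " . curl_error($ch);
    	exit;
    }
    curl_close($ch);
    echo $html;
    ?>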
    
    The variable $html contains the scraped content.

    Ria
     
    riamathews, Oct 20, 2008 IP
  7. Mr.Bill

    #7
    riamathews, thank you. How would I add more domains to this?
     
    Mr.Bill, Oct 20, 2008 IP
  8. Bind

    Bind Peon

    Messages:
    70
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    0
    #8
    It's untested, but it should work.

    
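    <?php
    // Pages to scrape; add or remove URLs as needed.
    $target_url = array(    "http://site1.org/online.php",
                            "http://site2.org/online.php",
                            "http://site3.org/online.php",
                            "http://site4.org/online.php"
                        );
    // Reuse the visitor's browser user-agent, with a fallback if none was sent.
    // The $_SERVER key is HTTP_USER_AGENT, and PHP variable names are
    // case-sensitive, so the name must match the one passed to curl_setopt().
    $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0';
    foreach ($target_url as $this_url)
        {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
            curl_setopt($ch, CURLOPT_URL, $this_url);
            curl_setopt($ch, CURLOPT_FAILONERROR, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_AUTOREFERER, true);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);
            $html = curl_exec($ch);
            // curl_exec() returns false on failure; report it and move on to the next URL.
            if ($html === false)
                {
                    echo "<p>cURL error number: " . curl_errno($ch) . "<br />";
                    echo "cURL error: " . curl_error($ch) . "</p>";
                }
            else
                {
                    echo "<p>$this_url<br />$html</p>";
                }
            curl_close($ch);
        }
    ?>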
     
    Bind, Oct 20, 2008 IP
  9. Mr.Bill

    #9
    Perfect, thank you both. It works great.
     
    Mr.Bill, Oct 20, 2008 IP
  10. Shadowplay

    Shadowplay Peon

    Messages:
    394
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #10
    How do you use code like this? What's the procedure?
     
    Shadowplay, Oct 25, 2008 IP
  11. Bind

    #11
    1. configure the array with your urls (or replace the array with data import from a database).

    2. save it with a .php file extension.

    3. upload it to your webserver.

    4. access it with a web browser
     
    Bind, Oct 25, 2008 IP
  12. Mr.Bill

    #12
    Would it be possible to scrape, say, a <table class="something"> so that it pulls only the info from that one table, still using an array for more than one URL?
     
    Mr.Bill, Oct 25, 2008 IP
  13. Mr.Bill

    #13
    Anyone know how this could be accomplished?
     
    Mr.Bill, Oct 28, 2008 IP
  14. Mr.Bill

    #14
    Is it possible to scrape only certain tables and not the whole page? No one has answered, so I am not sure if I should give up on this idea.
     
    Mr.Bill, Nov 12, 2008 IP
  15. Bind

    #15
    you would probably need to write a preg_match() call with a suitable regex to pull it out.

    research regular expressions and pulling data between html tags.

    it's doable.
     
    Bind, Nov 13, 2008 IP
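
The preg_match() suggestion above might look roughly like the sketch below. It is only an illustration, not code from anyone in the thread: the helper name extract_table() and the sample markup are made up, and it assumes the table you want is not nested inside (and does not contain) another table. For messier markup a DOM parser would likely be more reliable than a regex.

```php
<?php
// Hypothetical helper (not from the thread): returns the first <table>
// whose class attribute contains $class as a word, or null if none matches.
function extract_table($html, $class)
{
    // ~...~is : "i" ignores case, "s" lets "." match newlines.
    // ".*?" is lazy, so the match stops at the first </table>;
    // nested tables would therefore be cut short.
    $pattern = '~<table\b[^>]*\bclass\s*=\s*(["\'])[^"\']*\b'
             . preg_quote($class, '~')
             . '\b[^"\']*\1[^>]*>.*?</table>~is';

    return preg_match($pattern, $html, $m) ? $m[0] : null;
}

// Sample markup standing in for the $html returned by curl_exec().
$html = '<p>intro</p>'
      . '<table class="other"><tr><td>skip</td></tr></table>'
      . '<table class="something"><tr><td>keep</td></tr></table>';

$table = extract_table($html, 'something');
echo $table === null ? 'No matching table found.' : $table;
```

To use it with several URLs, one option is to call extract_table($html, 'something') inside the foreach loop, in place of echoing the whole $html.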