[Need Help] PHP Scraping

Discussion in 'PHP' started by LeetPCUser, Jul 10, 2009.

  1. #1
    LeetPCUser, Jul 10, 2009 IP
  2. wd_2k6

    #2
    wd_2k6, Jul 10, 2009 IP
  3. anthonywebs

    #3
    
    $file="http://web.minorleaguebaseball.com/milb/stats/stats.jsp?t=l_trn&lid=117&sid=l117";
    $contents=file_get_contents($file);
    echo $contents;
    
    PHP:
    This should work.
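A minimal sketch of the same fetch with a failure check, in case the remote site is down or the host blocks remote reads. `fetch_page` and its timeout parameter are hypothetical names, not something from the thread, and the 10-second default is only an assumption.

```php
<?php
// Hypothetical helper (not from the thread): read a URL or local path and
// return null instead of emitting a warning when the read fails.
function fetch_page($location, $timeoutSeconds = 10)
{
    // The 'http' options only take effect for http:// and https:// locations.
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeoutSeconds),
    ));

    // @ silences the warning; the false return value is checked instead.
    $contents = @file_get_contents($location, false, $context);

    return ($contents === false) ? null : $contents;
}
```

Calling `fetch_page($file)` and checking the result for `null` before the `echo` avoids printing a half-broken page when the fetch fails.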
     
    anthonywebs, Jul 10, 2009 IP
  4. LeetPCUser

    #4
    I want to be able to manipulate the data.

    http://breakpointdesigns.com/test2.php

    The information is being generated by JavaScript, so this does not work. Any more suggestions? I am seriously stumped. Someone once suggested JSON, but I am not familiar with that.
     
    LeetPCUser, Jul 10, 2009 IP
  5. anthonywebs

    #5
    Sorry, but that's because the links to the JavaScript and stylesheets are relative. You would have to do something like

    $contents=str_ireplace("/javascript.js","http://example.com/javascript.js",$contents);
    PHP:
    right before you echo the text, once for each of them.

    I will PM you the full working thing with all of the replacements and you can test it
     
    anthonywebs, Jul 10, 2009 IP
  6. LeetPCUser

    #6
    LeetPCUser, Jul 10, 2009 IP
  7. ThePHPMaster

    #7
    You will need to replace all occurrences of all links: CSS, JS, etc.
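Replacing every link one at a time gets tedious, so here is a rough sketch that rewrites all root-relative `href` and `src` values in one pass. `absolutize_links` is a hypothetical helper name, and the sketch assumes the links start with a single `/`. It does not touch inline CSS `url(...)` values.

```php
<?php
// Hypothetical helper: prefix root-relative href/src values with a base URL.
// Only values that start with a single "/" are changed; absolute URLs and
// protocol-relative values ("//host/...") are left alone.
function absolutize_links($html, $base)
{
    $base = rtrim($base, '/');

    return preg_replace_callback(
        '~\b(href|src)=(["\'])/(?!/)~i',
        function ($m) use ($base) {
            // $m[1] is the attribute name, $m[2] is the opening quote.
            return $m[1] . '=' . $m[2] . $base . '/';
        },
        $html
    );
}
```

For the page in this thread the call would look like `absolutize_links($contents, 'http://web.minorleaguebaseball.com')`.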
     
    ThePHPMaster, Jul 10, 2009 IP
  8. LeetPCUser

    #8
    It is calling writeData(). How can I get the contents that it generates? Any examples would be great. I am not a whiz at PHP.

    I do not want to copy the entire page; I just want to grab the data and print it on a blank white page.

    Please help.
     
    LeetPCUser, Jul 10, 2009 IP
  9. jazzcho

    #9
    It can be done with Selenium, but you need your own server.
     
    jazzcho, Jul 11, 2009 IP
  10. LeetPCUser

    #10
    I do not have my own server. I am trying to figure out how to get the data on this shared server and be able to CRON it later. I am sure somebody has to know how to do this. It is really frustrating.
     
    LeetPCUser, Jul 11, 2009 IP
  11. LeetPCUser

    #11
    Can anyone figure this out? I have realized it is using JS, but I can't determine where those files are.
     
    LeetPCUser, Jul 12, 2009 IP
  12. wd_2k6

    #12
    Hmm, I've had a look, but I can't see where this function is coming from. This is how far I got:

    
    <?php
    //Location of doc
    $file="http://web.minorleaguebaseball.com/milb/stats/stats.jsp?t=l_trn&lid=117&sid=l117";
    
    //Get all doc contents
    $contents=file_get_contents($file);
    
    //change stylesheet loc
    $contents = str_ireplace("<link href=\"","<link href=\"http://web.minorleaguebaseball.com",$contents);
    
    //all src
    $contents = str_ireplace("src=\"","src=\"http://web.minorleaguebaseball.com",$contents);
    
    //change inline css loc
    $contents = str_ireplace("style=\"background: url(","style=\"background: url(http://web.minorleaguebaseball.com",$contents);
    																			 
    $contents = str_ireplace("style=\"background-image: url(","style=\"background-image: url(http://web.minorleaguebaseball.com",$contents);
    																						 
    //display page
    echo $contents;
    ?>
    
    PHP:
    Basically, everything is displaying apart from the data. There must be a link I've missed somewhere that is defined differently from the ones I've already replaced.
    The problem is that the data doesn't seem to be viewable in the source code. Sometimes it is and sometimes it isn't (very odd): if I hover over the info and choose View Selection Source in Firebug, it is displayed, but if I just click View Source, it isn't.

    You need to be able to grab the JS data somehow with PHP, I guess, and find the script where this writeData() function is defined.
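Data that shows up in the rendered page but not in View Source is usually written in by a script after the page loads, and `file_get_contents()` only returns the raw HTML without running any JavaScript. One way to hunt for `writeData()` is to list every external script the page references and then open each one. This is only a sketch: `list_script_sources` is a hypothetical helper, and a regex like this will miss scripts that are added dynamically.

```php
<?php
// Hypothetical helper: return the src value of every external <script> tag,
// so each file can be fetched and searched for a function such as writeData().
function list_script_sources($html)
{
    // Group 1 is the opening quote, group 2 is the src value up to the
    // matching closing quote. Inline <script> blocks have no src and are skipped.
    preg_match_all('~<script\b[^>]*\bsrc=(["\'])(.*?)\1~is', $html, $matches);

    return $matches[2];
}
```

Each returned path can then be fetched and searched with something like `strpos($js, 'writeData')`.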
     
    wd_2k6, Jul 12, 2009 IP
  13. LeetPCUser

    #13
    I know the JS exists, I am just not sure where. Can anyone help locate the file?
     
    LeetPCUser, Jul 13, 2009 IP
  14. LeetPCUser

    #14
    LeetPCUser, Jul 13, 2009 IP
  15. wd_2k6

    #15
    What exactly do you mean by decrypt? You could chop the data up, for example:
    
    <?php
    $file = 'http://web.minorleaguebaseball.com/lookup/json/named.transaction_all.bam?league_id=112&start_date=20090712';
    $contents = file_get_contents($file);
    $contents = str_ireplace("\"player\":","<b>\"PLAYER\":</b>",$contents);
    $contents = str_ireplace("[","[<br /><br /><B>STARTLIST</B><BR />",$contents);
    $contents = str_ireplace("}","}<br /><br /><B>NEWPLAYER</B><BR />",$contents);
    $contents = str_ireplace(",",",<br />",$contents);
    
    echo $contents;
    
    ?>
    PHP:
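Since that URL returns JSON, another option is to let PHP parse it with `json_decode()` (built in since PHP 5.2) rather than chopping the string. The sketch below is hedged: the sample string is made up, its wrapper keys are assumptions about the general shape of such a feed, and `find_first_list` is a hypothetical helper that finds the first list of rows wherever it is nested.

```php
<?php
// Hypothetical helper: walk decoded JSON and return the first list (an array
// with sequential integer keys starting at 0), however deeply it is nested.
function find_first_list($data)
{
    if (!is_array($data)) {
        return null;
    }
    if ($data !== array() && array_keys($data) === range(0, count($data) - 1)) {
        return $data;
    }
    foreach ($data as $value) {
        $found = find_first_list($value);
        if ($found !== null) {
            return $found;
        }
    }
    return null;
}

// Made-up sample data; the wrapper keys are assumptions, not the real feed.
$json = '{"wrapper": {"rows": [
    {"player": "Player One", "team": "Team A", "note": "Signed, effective today."},
    {"player": "Player Two", "team": "Team B", "note": "Released."}
]}}';

// Passing true makes json_decode() return associative arrays.
$rows = find_first_list(json_decode($json, true));
if ($rows === null) {
    $rows = array(); // invalid JSON or no list found
}

foreach ($rows as $row) {
    echo $row['player'], ' (', $row['team'], '): ', $row['note'], "\n";
}
```

With the real feed, `$json` would come from `file_get_contents($file)`. Commas inside a note survive intact, which the string-replacement approach cannot guarantee.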
     
    wd_2k6, Jul 13, 2009 IP
  16. gregor171

    #16
    You can also do it with cURL and read the data with regex (you'd need two regular expressions: the first to isolate the data table and the second to read the rows) ;-)
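A rough sketch of that idea, assuming the rows really are present in the fetched HTML (earlier posts suggest that for this page they are filled in by JavaScript, so it may not apply here). Both function names are hypothetical. The fetch needs the PHP cURL extension and is only defined, not called. The parser isolates the first table, then reads its rows and their cells.

```php
<?php
// Hypothetical cURL fetch (requires the cURL extension); not called below.
function fetch_with_curl($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed 10-second limit
    $body = curl_exec($ch);

    return ($body === false) ? null : $body;
}

// Hypothetical parser: isolate the first <table>, then read each row's cells.
function parse_table_rows($html)
{
    // Step 1: separate the data table from the rest of the page.
    if (!preg_match('~<table\b[^>]*>(.*?)</table>~is', $html, $table)) {
        return array();
    }

    // Step 2: read the rows, then the <td> cells inside each row.
    preg_match_all('~<tr\b[^>]*>(.*?)</tr>~is', $table[1], $rowMatches);
    $rows = array();
    foreach ($rowMatches[1] as $rowHtml) {
        preg_match_all('~<td\b[^>]*>(.*?)</td>~is', $rowHtml, $cells);
        if ($cells[1]) {
            // Strip nested tags and surrounding whitespace from each cell.
            $rows[] = array_map('trim', array_map('strip_tags', $cells[1]));
        }
    }

    return $rows;
}
```

Header rows that contain only `<th>` cells are skipped, so the result holds data rows only.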
     
    gregor171, Jul 13, 2009 IP
  17. wd_2k6

    #17
    I'm not familiar with cURL or regex but you could also split the data up with some string functions, for example:

    
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Untitled Document</title>
    </head>
    
    <body>
    <?php
    
    function get_string_between($string, $start, $end){
            //strpos() returns false when a marker is missing, so compare strictly
            $ini = strpos($string,$start);
            if ($ini === false) return "";
            $ini += strlen($start);
            $stop = strpos($string,$end,$ini);
            if ($stop === false) return "";
            return substr($string,$ini,$stop - $ini);
    }
    
    $file = 'http://web.minorleaguebaseball.com/lookup/json/named.transaction_all.bam?league_id=112&start_date=20090712';
    $contents = file_get_contents($file);
    //Remove start of file
    $pos = strpos($contents, "[");
    $contents = substr_replace($contents, "",0, $pos);
    $pos = strpos($contents, "{");
    $contents = substr_replace($contents, "",0, $pos);
    
    //Remove end of file
    $pos = strpos($contents, "]");
    $contents = substr_replace($contents, "", $pos);
    
    //Add Line-Breaks
    $contents = str_ireplace(",",",<br />",$contents);
    
    //Count amount of players for use in loop
    $players = substr_count($contents,"}");
    
    //Create array of players
    $player = array();
    $x = 0;
    while($x < $players){
        $player[] = get_string_between($contents, "{", "}");
        //Drop the player just read, then skip ahead to the next "{"
        $pos = strpos($contents, "},");
        $contents = substr_replace($contents, "",0, (int) $pos);
        $pos2 = strpos($contents, "{");
        $contents = substr_replace($contents, "",0, (int) $pos2);
        $x++;
    }
    
    
    foreach ($player as $p){
    	echo "<h1>Player</h1>";
    	echo $p;
    }
    ?>
    </body>
    </html>
    
    PHP:
    Obviously the other methods suggested are probably better, but I have no experience with them :)
     
    wd_2k6, Jul 13, 2009 IP
  18. LeetPCUser

    #18
    I have gotten to the point where I need a regular expression.

    I need to do this:

    blah blah blah !#(@%O#&%)@*(# [

    Remove everything, all characters and letters, before the first [. What regex could I use for that?
     
    LeetPCUser, Jul 14, 2009 IP
  19. wd_2k6

    #19
    I don't know about regex, sorry, but I did this in my file with the following
    (assuming the whole file is kept in a variable called $contents):
    
    //Remove start of file
    
    //Find position of first [ character
    $pos = strpos($contents, "[");
    //Add 1 to this position so that the [ character itself is removed as well
    $pos += 1;
    //Now remove everything from the start position (0) up to and including this first [ character
    $contents = substr_replace($contents, "",0, $pos);
    
    PHP:
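For the regex asked about in #18, a minimal sketch follows. Unlike the snippet above, it keeps the `[` itself and only removes what comes before it. Both function names are hypothetical, and the second one shows that `strstr()` does the same job without a regex.

```php
<?php
// Drop everything before the first "[" but keep the bracket itself.
function strip_before_first_bracket($text)
{
    // ^[^\[]* matches the longest run of non-"[" characters at the start,
    // including line breaks.
    return preg_replace('/^[^\[]*/', '', $text);
}

// Same result without a regex: strstr() returns the string starting at the
// first "[", or false when there is none.
function strip_before_first_bracket_plain($text)
{
    $rest = strstr($text, '[');

    return ($rest === false) ? '' : $rest;
}
```

Both return an empty string when the input contains no `[` at all.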
     
    wd_2k6, Jul 14, 2009 IP
  20. wd_2k6

    #20
    Here, save this as a new PHP file and check it out in your browser:

    
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Untitled Document</title>
    <style type="text/css" media="screen">
    table { margin: 0 auto; border: 3px solid #000; }
    table th { border: 2px solid #000; padding: 10px; font-size: 20px; }
    table td { border: 2px solid #000; padding: 10px; }
    </style>
    </head>
    
    <body>
    <?php
    //function to get the string between two places
    function get_string_between($string, $start, $end){
            //strpos() returns false when a marker is missing, so compare strictly
            $ini = strpos($string,$start);
            if ($ini === false) return "";
            $ini += strlen($start);
            $stop = strpos($string,$end,$ini);
            if ($stop === false) return "";
            return substr($string,$ini,$stop - $ini);
    }
    
    //assign our file
    $file = 'http://web.minorleaguebaseball.com/lookup/json/named.transaction_all.bam?league_id=112&start_date=20090712';
    $contents = file_get_contents($file);
    
    //Remove start of file
    $pos = strpos($contents, "[");
    $pos += 1;
    $contents = substr_replace($contents, "",0, $pos);
    
    //Remove end of file
    $pos = strpos($contents, "]");
    $contents = substr_replace($contents, "", $pos);
    
    //Add Line-Breaks
    $contents = str_ireplace(",",",<br />",$contents);
    
    //Count amount of players for use in loop
    $players = substr_count($contents,"}");
    
    //Create array of players
    $player = array();
    $x = 0;
    while($x < $players){
        $player[] = get_string_between($contents, "{", "}");
        //Drop the player just read, then skip ahead to the next "{"
        $pos = strpos($contents, "},");
        $contents = substr_replace($contents, "",0, (int) $pos);
        $pos2 = strpos($contents, "{");
        $contents = substr_replace($contents, "",0, (int) $pos2);
        $x++;
    }
    
    
    //create sub array of data
    foreach ($player as $p){
    	$name[] = get_string_between($p, "\"player\": \"", "\",");
    	$team[] = get_string_between($p, "\"team\": \"", "\",");
    	$notes[] = get_string_between($p, "\"note\": \"", "\"");
    }
    
    //show table of data contained in sub arrays
    echo "<table><tr><th>Name</th><th>Team</th><th>Note</th></tr>";
    for ($x=0; $x < sizeof($player); $x++){
    	echo "<tr><td>". $name[$x]." </td>";
    	echo "<td>". $team[$x]."</td>"; 
    	echo "<td>". $notes[$x]."</td></tr>";
    }
    echo "</table>";
    
    // optional loop to show players array
    /*foreach ($player as $p){
    	echo "<h1>Player</h1>";
    	echo $p;
    }*/
    ?>
    </body>
    </html>
    
    PHP:
    I didn't transfer all of the columns (just name, team and note), but you can see the gist of things here. It stores the data tidily in arrays, so it could easily be inserted into a database :)
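On the database idea, here is a hedged sketch using PDO with a prepared statement. Everything named here is an assumption: the `transactions` table, its three columns, and the `save_transactions` helper are hypothetical, and the rows are expected as one associative array per player rather than the three parallel arrays used above.

```php
<?php
// Hypothetical helper: insert rows into an assumed "transactions" table.
// A prepared statement keeps quotes in names and notes from breaking the SQL.
function save_transactions(PDO $db, array $rows)
{
    $stmt = $db->prepare(
        'INSERT INTO transactions (player, team, note) VALUES (?, ?, ?)'
    );

    $saved = 0;
    foreach ($rows as $row) {
        $stmt->execute(array($row['player'], $row['team'], $row['note']));
        $saved++;
    }

    return $saved; // number of rows inserted
}
```

The parallel arrays from the script above can be combined into that shape inside the existing `for` loop, for example `array('player' => $name[$x], 'team' => $team[$x], 'note' => $notes[$x])`.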
     
    wd_2k6, Jul 14, 2009 IP