help for web scraping

Discussion in 'PHP' started by playwright, Jun 2, 2010.

  1. #1
    Hello. I'm new to PHP, so I need some real help here.
    I'm trying to create a web scraper that grabs a forum's content and shows only the posts. The source code is here:

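    <html>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <?php
    // Fetch the page (URL elided in the original post).
    $html = file_get_contents('http://www.......');
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ silences warnings about malformed forum HTML
    $xpath = new DOMXPath($dom);
    // Select every element whose class attribute is exactly "postTextContainer".
    $posts = $xpath->query('//*[@class="postTextContainer"]');
    foreach ($posts as $post) {
        echo $post->nodeValue, "<br/> \n";
    }
    ?>
    </html>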

    Can anyone tell me how I could grab all the posts in the same thread? Right now I can only grab the posts on the page at the above URL. I think it's called multi-page parsing? I also want to ask how I can delete content that sits between two tags inside what I have grabbed with the above code. More specifically, the tag is <div class="........">bla bla</div>.
     
    playwright, Jun 2, 2010 IP
  2. kidatum

    #2
    You're obviously new to PHP, because that code makes no sense at all, at least to me. You're asking how, and I will tell you, but I'm not going to write the code for you.

    1. Fetch the page with the posts.
    2. Use the preg_match_all() function with a regex to find the posts.
    3. Do whatever it is you want to do with them.
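
    The three steps above can be sketched roughly as follows. This is only an outline under assumptions: the helper names extract_posts() and scrape_pages() are made up for illustration, each post is assumed to sit between <!-- message --> marker comments, and the page keys are placeholders, since the real forum's markup and URL scheme are not shown in this thread. The fetcher is passed in as a callable so the sketch runs on canned HTML; for live use one could pass 'file_get_contents' and a list of real page URLs.

```php
<?php
// Step 2: pull every post body out of one page of HTML.
// Assumes each post sits between <!-- message --> and <!-- / message -->.
function extract_posts(string $html): array
{
    // Lazy (.*?) so each marker pair is matched on its own;
    // the "s" flag lets . span newlines.
    preg_match_all('%<!-- message -->(.*?)<!-- / message -->%s', $html, $m);
    return array_map('trim', $m[1]);
}

// Step 1 for every page of the thread, collecting the posts as we go.
// $fetch is any callable that maps a URL to an HTML string (or false on failure).
function scrape_pages(array $urls, callable $fetch): array
{
    $posts = [];
    foreach ($urls as $url) {
        $html = $fetch($url);
        if ($html === false) {
            continue; // skip pages that could not be fetched
        }
        $posts = array_merge($posts, extract_posts($html));
    }
    return $posts;
}

// Canned pages standing in for a two-page thread (placeholder keys, not real URLs).
$fake = [
    'page1' => "<div><!-- message -->first post<!-- / message --></div>\n"
             . "<div><!-- message -->second post<!-- / message --></div>",
    'page2' => "<div><!-- message -->third post<!-- / message --></div>",
];
$fetch = function (string $url) use ($fake) {
    return $fake[$url] ?? false;
};

// Step 3: do something with the posts, here just print them.
$posts = scrape_pages(['page1', 'page2'], $fetch);
foreach ($posts as $post) {
    echo strip_tags($post), "\n";
}
```

    Building the URL list is the forum-specific part: most boards expose the page number as a query parameter, but that has to be checked against the actual thread links.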

    If you want to delete HTML tags, there is a PHP function called strip_tags().
     
    kidatum, Jun 2, 2010 IP
  3. playwright

    #3
    Thanks for your answer. Actually, I tried it this way, but I couldn't find a way to grab the content between <!-- message --> and <!-- / message -->. I couldn't find the right regex pattern. Do you know which regex might fit?
    Does strip_tags() delete only the HTML tags, or the tags and the text between them?
     
    playwright, Jun 3, 2010 IP
  4. gapz101

    #4
    gapz101, Jun 3, 2010 IP
  5. saviola

    #5
    "string strip_tags ( string $str [, string $allowable_tags ] )"
    This function tries to return a string with all NUL bytes, HTML and PHP tags stripped from a given str.
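
    To make the difference concrete, here is a small sketch. strip_tags() removes only the tags and keeps the text between them, so dropping a whole element together with its contents needs something else, for example the DOM extension. The class names "post" and "quote" and the sample markup are invented for illustration; they are not taken from the forum being scraped.

```php
<?php
$html = '<div class="post">Hello <div class="quote">bla bla</div>world</div>';

// strip_tags() drops the tags but keeps the text inside them.
echo strip_tags($html), "\n"; // Hello bla blaworld

// To drop an element together with its contents, remove the node from the DOM.
$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Copy the query result to an array so the loop is not affected by the removals.
foreach (iterator_to_array($xpath->query('//div[@class="quote"]')) as $node) {
    $node->parentNode->removeChild($node);
}

// The remaining text of the post no longer contains the removed block.
$post = $xpath->query('//div[@class="post"]')->item(0);
echo $post->textContent, "\n"; // Hello world
```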


     
    saviola, Jun 3, 2010 IP
  6. flexdex

    #6
    
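    // Sample text: only the part between the two marker comments should be captured.
    $subject = <<<AAA
    notmatching
    <!-- message --> and 
    this
    is 
    matching<!-- / message -->
    notmatching
    AAA;

    // % is the delimiter; "s" lets . match newlines, "i" ignores case.
    // (.+?) is lazy, so the match stops at the first closing marker.
    if (preg_match('%<!-- message -->(.+?)<!-- / message -->%si', $subject, $regs)) {
        $result = "This is your captured text: {$regs[1]}";
    } else {
        $result = "does not match";
    }

    echo "$result\n";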
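
    With several posts on one page, the quantifier matters: a greedy (.+) runs from the first opening marker to the last closing marker and returns everything in between as one block, while a lazy (.+?) used with preg_match_all() returns each post separately. A minimal sketch on an invented two-post sample:

```php
<?php
// Invented sample: two posts on one page.
$page = "<!-- message -->one<!-- / message -->\n"
      . "junk between posts\n"
      . "<!-- message -->two<!-- / message -->";

// Greedy: a single match running from the first opener to the last closer.
preg_match_all('%<!-- message -->(.+)<!-- / message -->%s', $page, $greedy);
echo count($greedy[1]), "\n"; // 1

// Lazy: one match per post.
preg_match_all('%<!-- message -->(.+?)<!-- / message -->%s', $page, $lazy);
print_r($lazy[1]); // [0] => one, [1] => two
```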
    
    Regards, flexdex
     
    flexdex, Jun 3, 2010 IP
  7. playwright

    #7
    Thanks flexdex, your answer is totally right! Any idea how to parse all the pages of the thread?
     
    playwright, Jun 3, 2010 IP
  8. Gray Fox

    #8
    Talk about hypocrisy.
     
    Gray Fox, Jun 3, 2010 IP