grab text from website

Discussion in 'PHP' started by Shimurai, Aug 3, 2012.

  1. #1
    hey everyone,

    I'm trying to grab a number from a website, from here http://xtremetop100.com/mu-online

    It's a topsite, and I want to get my server's IN votes using PHP so I can make a script that predicts how many votes I will have at the end of the month.

    My site is in first place, it's the "ZHYPERMU SEASON 6 PRO SERVICE", and I want to get the IN and OUT votes using PHP. I tried using file_get_contents, but I'm a little stuck on how to grab that exact number.

    any help will be appreciated!
     
    Shimurai, Aug 3, 2012 IP
  2. writingwhiz

    writingwhiz Peon

    #2
    file_get_contents is a good start to pick up the HTML from that page.

    It looks like there is a table outputting all of the IN and OUT details, so you have a variety of options, ranging from searching the string for specific markers and then capturing the IN/OUT values, to using something like DOM to parse the entire table. Quite a few ways to go about it.
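
A minimal sketch of the string-marker option, assuming the site ID appears in the row's link and the IN and OUT counts follow it as plain digits inside span tags. The sample markup, the helper name findNumberAfterMarker, and the stats2 class are made up for illustration; the real page may be laid out differently.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<table><tr>'
      . '<td><a href="/out.php?site=1132207972">ZHYPERMU</a></td>'
      . '<td class="stats1"><span>4512</span></td>'
      . '<td class="stats2"><span>1203</span></td>'
      . '</tr></table>';

// Find $marker in the raw HTML, then return the ($skip + 1)-th
// all-digit <span> that follows it, or null if nothing matches.
function findNumberAfterMarker(string $html, string $marker, int $skip = 0): ?int {
    $pos = strpos($html, $marker);
    if ($pos === false) {
        return null; // marker not present at all
    }
    $tail = substr($html, $pos);
    if (!preg_match_all('~<span[^>]*>\s*(\d+)\s*</span>~i', $tail, $m)) {
        return null; // no numeric spans after the marker
    }
    return isset($m[1][$skip]) ? (int) $m[1][$skip] : null;
}

$in  = findNumberAfterMarker($html, '1132207972', 0);
$out = findNumberAfterMarker($html, '1132207972', 1);
echo "IN: $in, OUT: $out\n";
```

This is short, but it depends on the exact order of the markup after the marker, so even a small layout change on the topsite would probably break it.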

    I'd be able to code this for you very quickly (would be done by tonight) if you'd like (PM me if interested).
     
    writingwhiz, Aug 4, 2012 IP
  3. furqanartists

    furqanartists Well-Known Member

    #3
    You can use PHP cURL or the file_get_contents function to grab the data from another site!
     
    furqanartists, Aug 4, 2012 IP
  4. setyp

    setyp Member

    #4
    To download the content, use file_get_contents or the cURL library.
    To parse it, use regular expressions or DOMDocument.
    good luck ;)
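
A rough sketch of how those two steps might fit together, assuming the cURL and DOM extensions are available. The function names fetchHtml and numbersInCells, the sample markup, and the stats2 class are hypothetical; only the parsing half is exercised here, so the example runs without touching the network.

```php
<?php
// Step 1: download the page with cURL. Returns null on failure.
function fetchHtml(string $url): ?string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,   // seconds
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Step 2: parse the HTML and collect the numeric text of every
// <td> that carries the given class.
function numbersInCells(string $html, string $class): array {
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $numbers = [];
    foreach ($doc->getElementsByTagName('td') as $td) {
        $classes = preg_split('/\s+/', trim($td->getAttribute('class')));
        if (!in_array($class, $classes, true)) {
            continue;
        }
        $text = trim($td->textContent);
        if (preg_match('/^\d+$/', $text)) {
            $numbers[] = (int) $text;
        }
    }
    return $numbers;
}

// Hypothetical sample markup; with a live page you would instead call
// numbersInCells(fetchHtml('http://xtremetop100.com/mu-online') ?? '<html></html>', 'stats1').
$sample = '<table><tr>'
        . '<td class="stats1"><span>4512</span></td>'
        . '<td class="stats2"><span>1203</span></td>'
        . '</tr></table>';

print_r(numbersInCells($sample, 'stats1'));
```

Keeping the download and the parsing in separate functions means either half can be swapped out, for example file_get_contents in place of cURL, without changing the other.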
     
    setyp, Aug 6, 2012 IP
  5. deathshadow

    deathshadow Acclaimed Member

    #5
    I would NOT use file_get_contents, since you're basically looking to go through the DOM. setyp has it right: use DOMDocument and load the page with its loadHTMLFile method.

    http://www.php.net/manual/en/domdocument.loadhtmlfile.php

    Though the outdated 1990s-style code of that site is REALLY going to be hard to deal with -- what with all the spans for nothing, inlined style, bold tags for no reason, lack of using a table properly, multiple tables doing one table's job, the same ID being used more than once, lack of unique IDs to target each 'row' (each of which is built as its own table), etc, etc.

    Probably outside your control, but that steaming pile should be dragged kicking and screaming into THIS century before trying to data scrape it. Otherwise you're gonna drive yourself nuts trying to navigate the DOM. Right now your best bet is to get all the spans, check for the class you want, then go up to the parent node to make sure it contains the anchor you are after...
     
    deathshadow, Aug 6, 2012 IP
  6. deathshadow

    deathshadow Acclaimed Member

    #6
    Ok, took a moment to try it out...

    
    <?php
    
    // Return every element of $tagType whose class attribute includes $className.
    function getElementsByClassName(DOMDocument $document, $className, $tagType='*') {
    	$matches=array();
    	$elementList=$document->getElementsByTagName($tagType);
    	foreach ($elementList as $element) {
    		if ($element->hasAttribute('class')) {
    			// split on any whitespace, not just single spaces
    			$classes=preg_split('/\s+/',trim($element->getAttribute('class')));
    			if (in_array($className,$classes)) $matches[]=$element;
    		}
    	}
    	return $matches;
    }
    
    // Walk backwards through the siblings until an element named $tagName
    // is found; returns null if there is none.
    function getPreviousSiblingTag($element,$tagName) {
    	$result=$element->previousSibling;
    	while (
    		isset($result) && (
    			!($result instanceof DOMElement) ||
    			($result->tagName!=$tagName)
    		)
    	) {
    		$result=$result->previousSibling;
    	}
    	return $result;
    }
    
    function getStats($url,$siteId) {
    	$document=new DOMDocument();
    	/*
    		we have to discard errors since most sites are so poorly coded
    		it's a miracle they even render.
    	*/
    	if (!@$document->loadHTMLFile($url)) return false;
    	$statsList=getElementsByClassName($document,'stats1','td');
    	foreach ($statsList as $statsTD) {
    		$checkTD=getPreviousSiblingTag($statsTD,'td');
    		if (isset($checkTD)) {
    			$anchorList=$checkTD->getElementsByTagName('a');
    			foreach ($anchorList as $anchor) {
    				$testHREF=$anchor->getAttribute('href');
    				// strpos returns 0 for a match at the start, so compare against false
    				if (strpos($testHREF,(string) $siteId)!==false) {
    					$spanList=$statsTD->getElementsByTagName('span');
    					foreach ($spanList as $span) {
    						// instanceof is safe even when the span has no children
    						if ($span->firstChild instanceof DOMText) {
    							$value=trim($span->textContent);
    							if (is_numeric($value)) return $value;
    						}
    					}
    				}
    			}
    		}
    	}
    	return false;
    }
    
    echo 'Value: ',getStats('http://xtremetop100.com/mu-online',1132207972);
    
    ?>
    That should do it. In theory you could go simpler by loading it as text, then looking for that number, then taking the second span immediately following, but that would be really fragile. The above parses the page properly using the DOM to find the elements, so it's a bit more tolerant of changes to the content -- though it does take quite a while to run.

    You could probably do better using PHP's "xpath" routines, but I've not yet grasped quite how those work.
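
For what it's worth, here is one way the XPath route could look, using PHP's DOMXPath class. This is only a sketch under assumptions: the sample markup, the function name getStatsByXPath, and the stats2 class are invented for illustration, and the query assumes the link and the stats cells sit in the same table row, which may not hold on the real page.

```php
<?php
// Return the numeric stats cells of the row that links to $siteId,
// keyed by the cell's class attribute, or null if nothing is found.
function getStatsByXPath(string $html, string $siteId): ?array {
    // Only digits are allowed, so the ID is safe to embed in the query.
    if (!preg_match('/^\d+$/', $siteId)) {
        return null;
    }

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Rows containing a link whose href mentions the site ID,
    // then the cells in that row whose class starts with "stats".
    $query = sprintf(
        '//tr[.//a[contains(@href, "%s")]]/td[starts-with(@class, "stats")]',
        $siteId
    );
    $cells = $xpath->query($query);
    if ($cells === false) {
        return null; // malformed query
    }

    $stats = [];
    foreach ($cells as $td) {
        $text = trim($td->textContent);
        if (preg_match('/^\d+$/', $text)) {
            $stats[$td->getAttribute('class')] = (int) $text;
        }
    }
    return $stats ?: null;
}

// Hypothetical sample markup standing in for the downloaded page.
$sample = '<table><tr>'
        . '<td><a href="/out.php?site=1132207972">ZHYPERMU</a></td>'
        . '<td class="stats1"><span>4512</span></td>'
        . '<td class="stats2"><span>1203</span></td>'
        . '</tr></table>';

$stats = getStatsByXPath($sample, '1132207972');
print_r($stats);
```

The single query replaces the nested loops over cells, siblings and anchors, though it is just as dependent on the page keeping its current structure.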
     
    deathshadow, Aug 6, 2012 IP
  7. theolympian

    theolympian Peon

    #7
    I usually just view the page source to get that text, and it works.
     
    theolympian, Aug 6, 2012 IP