grab text from website

Shimurai Well-Known Member

#1

Hey everyone,

I'm trying to grab a number from a website: http://xtremetop100.com/mu-online

It's a topsite, and I want to get my server's IN votes using PHP so I can write a script that predicts how many votes I'll have at the end of the month.

My site is in first place (it's the "ZHYPERMU SEASON 6 PRO SERVICE"), and I want to get its IN and OUT votes using PHP. I tried file_get_contents, but I'm a little stuck on how to grab that exact number.

Any help will be appreciated!

Shimurai, Aug 3, 2012

writingwhiz Peon

#2

file_get_contents is a good start for picking up the HTML from that page.

It looks like there's a table outputting all of the IN and OUT details, so you have a variety of options, ranging from searching the string for specific markers and then capturing the IN/OUT values, to using something like DOM to parse the entire table. There are quite a few ways to go about it.

I'd be able to code this for you very quickly (would be done by tonight) if you'd like (PM me if interested).

writingwhiz, Aug 4, 2012

furqanartists Well-Known Member

#3

You can use PHP's cURL extension or the file_get_contents function to grab the data from another site!

furqanartists, Aug 4, 2012

setyp Member

#4

To download the content, use file_get_contents or the cURL library.
To parse it, use regular expressions or DOMDocument.
Good luck.
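
A minimal sketch of that download/parse split, in case it helps. The function names fetchHtml and parseHtml are made up for illustration, and the 15-second timeout is an arbitrary choice, not something the site requires.

```php
<?php
// Download step: use cURL when the extension is loaded, otherwise fall
// back to file_get_contents. Returns the page body, or false on failure.
function fetchHtml($url) {
	if (function_exists('curl_init')) {
		$ch = curl_init($url);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
		curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
		curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // arbitrary timeout (assumption)
		return curl_exec($ch);
	}
	return @file_get_contents($url);
}

// Parse step: build a DOMDocument from an HTML string, keeping libxml's
// warnings about sloppy markup out of the output. Returns false if there
// is nothing to parse.
function parseHtml($html) {
	if ($html === false || $html === '') return false;
	$previous = libxml_use_internal_errors(true);
	$document = new DOMDocument();
	$document->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);
	return $document;
}
```

Something like parseHtml(fetchHtml('http://xtremetop100.com/mu-online')) would then give you a document to search, with either regular expressions on the raw string or DOM calls on the parsed result.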

setyp, Aug 6, 2012

deathshadow Acclaimed Member

#5

I would NOT use file_get_contents, since you're basically looking to go through the DOM. setyp has it right: use DOMDocument and load the page with its loadHTMLFile method.

http://www.php.net/manual/en/domdocument.loadhtmlfile.php

Though the idiotic, outdated 1990s-style code of that site is REALLY going to be hard to deal with -- what with all the spans for nothing, inlined style, bold tags for no reason, lack of using a table properly, multiple tables doing one table's job, the same ID being used more than once, lack of unique IDs to target each 'row' (each of which is marked up as its own table, which makes no sense), etc, etc.

Probably outside your control, but that steaming pile should be dragged kicking and screaming into THIS century before trying to data scrape it. Otherwise you're gonna drive yourself nuts trying to navigate the DOM. Right now your best bet is to get all the spans, check for the class you want, then go up to the parent node to make sure it contains the anchor you are after...

deathshadow, Aug 6, 2012

deathshadow Acclaimed Member

#6

Ok, took a moment to try it out...


<?php

function getElementsByClassName(DOMDocument $domNode, $className,$tagType='*') {
	$matches=array();
	$elementList=$domNode->getElementsByTagName($tagType);
	foreach($elementList as $element) {
		if ($element->hasAttribute('class')) {
			$classes=explode(' ',$element->getAttribute('class'));
			if (in_array($className, $classes)) $matches[]=$element;
		}
	}
	return $matches;
}

function getPreviousSiblingTag($element,$tagName) {
	$result=$element->previousSibling;
	while (
		isset($result) && (
			(get_class($result)!='DOMElement') ||
			($result->tagName!=$tagName)
		)
	) {
		$result=$result->previousSibling;
	}
	return $result;
}

function getStats($url,$siteId) {
	$document=new DOMDocument();
	/*
		we have to discard errors since most sites are so piss poor coded
		it's a miracle they even render.
	*/
	@$document->loadHTMLFile($url);
	$statsList=getElementsByClassName($document,'stats1','td');
	foreach ($statsList as $statsTD) {
		$checkTD=getPreviousSiblingTag($statsTD,'td');
		if (isset($checkTD)) {
			$anchorList=$checkTD->getElementsByTagName('a');
			foreach ($anchorList as $anchor) {
				$testHREF=$anchor->getAttribute('href');
				if (strpos($testHREF,(string) $siteId)!==false) {
					$spanList=$statsTD->getElementsByTagName('span');
					foreach ($spanList as $span) {
						if ($span->firstChild instanceof DOMText) {
							if (is_numeric($span->textContent)) return $span->textContent;
						}
					}
				}
			}
		}
	}
	return false;
}

echo 'Value: ',getStats('http://xtremetop100.com/mu-online',1132207972);

?>

That should do it. In theory you could go simpler by loading it as text, then looking for that number, then taking the second span immediately following, but that would be really fragile. The above parses the page properly using the DOM to find the elements, so it's a bit more tolerant of changes to the content -- though it does take quite a while to run.
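
For comparison, here is a rough sketch of that text-only shortcut. The function name getStatsFromText is made up, and it assumes the wanted number really is in the second span after the first occurrence of the site ID, which I have not checked against the live page.

```php
<?php
// Text-based shortcut: no DOM parsing, just a string offset and a regex.
// ASSUMPTION: the wanted number sits in the second <span> after the first
// occurrence of the site ID in the raw HTML.
function getStatsFromText($html, $siteId) {
	$pos = strpos($html, (string) $siteId);
	if ($pos === false) return false; // ID is not on the page at all
	$tail = substr($html, $pos);
	// Collect every <span>...</span> after the ID; we need at least two.
	if (preg_match_all('#<span\b[^>]*>(.*?)</span>#is', $tail, $found) < 2) return false;
	$value = trim(strip_tags($found[1][1])); // contents of the second span
	return is_numeric($value) ? $value : false;
}
```

You would feed it the raw page, e.g. getStatsFromText(file_get_contents($url), 1132207972). It breaks as soon as the site adds, removes, or nests a span near that link, which is exactly the fragility mentioned above.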

You could probably do better using PHP's "xpath" routines, but I've not yet grasped quite how those work.
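
For what it's worth, a rough sketch of what an XPath version might look like, using DOMXPath. The function name getStatsXPath is made up, and the query assumes the same markup the code above relies on: a td with class stats1 whose nearest preceding td holds a link with the site ID in its href. It takes the HTML as a string so the download step stays separate.

```php
<?php
// XPath version of the same lookup.
// ASSUMPTION: a td.stats1 cell directly follows the <td> holding the site's
// link, and the number we want is the first numeric <span> in that cell.
function getStatsXPath($html, $siteId) {
	$id = preg_replace('/\D/', '', (string) $siteId); // digits only, safe to embed in the query
	if ($id === '' || $html === '' || $html === false) return false;

	$previous = libxml_use_internal_errors(true); // keep markup warnings quiet
	$document = new DOMDocument();
	$document->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);

	$xpath = new DOMXPath($document);
	$query = '//td[contains(concat(" ", normalize-space(@class), " "), " stats1 ")]'
		. '[preceding-sibling::td[1]//a[contains(@href, "' . $id . '")]]'
		. '//span';
	foreach ($xpath->query($query) as $span) {
		$text = trim($span->textContent);
		if (is_numeric($text)) return $text; // first numeric span wins
	}
	return false;
}
```

Usage would be along the lines of getStatsXPath(file_get_contents('http://xtremetop100.com/mu-online'), 1132207972). The single query replaces the three nested loops, but it is only as good as the assumptions about the page structure.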

deathshadow, Aug 6, 2012

theolympian Peon

#7

I used to use View Page Source to get that text, and it works.

theolympian, Aug 6, 2012
