PHP-DOM Scraping HTML content

Camay123 Well-Known Member

#1

OK,

I want to enter a URL in a form and process the page with PHP/DOM.

I want to extract certain information from it based on HTML tags.

For example, I want to enter a URL and extract the text content of the H1 tag.

I'm running into two difficulties:

1) I extract the whole H1 tag, but I just want the innerHTML.

2) If my H1 contains a link, I extract the <a href...> tag as well.

I'm using this parser: http://simplehtmldom.sourceforge.net/manual.htm

Here is my code so far:
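<?php
include("simple_html_dom.php");

// Fetch and parse the page at the submitted URL.
$html = file_get_html("http://" . $_POST["url"]);

// Print every <h1> found; echoing the element outputs the whole tag.
foreach ($html->find('h1') as $element) {
    echo $element;
}
?>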
Code (markup):
Any help is appreciated, as I'm trying to learn all this as an experimental project.

Camay123, Oct 12, 2012

pixelator Peon

#2

Hi,

I'm not sure if this will help, but why not use jQuery to access and manipulate the DOM on the client side?

$("h1 a").each(function(){

alert($(this).attr("href"));

});

pixelator, Oct 12, 2012

koconder Well-Known Member

#3

jQuery is a good way to handle this. You can also use XPath, which treats the DOM as an XML tree of elements.
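
For reference, here is a minimal sketch of the XPath route using PHP's bundled DOM extension (DOMDocument and DOMXPath). The helper name extract_h1() and the sample markup are made up for illustration, and it assumes the page has already been fetched into a string.

```php
<?php
// Sketch only: extract_h1() is a hypothetical helper, not part of any library.
function extract_h1(string $html): ?array
{
    $doc = new DOMDocument();

    // Real pages are rarely valid HTML, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // First <h1> in the document, if any.
    $h1 = $xpath->query('//h1')->item(0);
    if ($h1 === null) {
        return null;
    }

    // First link nested anywhere inside that <h1>, if any.
    $link = $xpath->query('.//a[@href]', $h1)->item(0);

    return [
        // textContent drops all nested tags and keeps only the text.
        'text' => trim($h1->textContent),
        'href' => $link instanceof DOMElement ? $link->getAttribute('href') : null,
    ];
}

// Made-up sample input for illustration.
$sample = '<html><body><h1><a href="/about">About us</a></h1></body></html>';
print_r(extract_h1($sample));
```

Because textContent ignores markup, a link inside the H1 no longer ends up in the extracted text, and its href is available separately.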

koconder, Oct 13, 2012

Red Swordfish Media Peon

#4

You might also look into PHP's file_get_contents to fetch the page. You can then use strpos to find the start and end positions of the tag you want.

Red Swordfish Media, Oct 13, 2012

ThePHPMaster Well-Known Member

#5

Seems like an interesting object:

To get the inner HTML:
echo $element->innertext;
PHP:
To get the href of an enclosing link, I would read the parent's href attribute:
echo $element->parent()->href;
PHP:
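
Note that innertext keeps any nested markup, so a link inside the H1 would still show up. Simple HTML DOM also exposes a plaintext property for the text alone. The same distinction can be sketched with PHP's bundled DOM extension; inner_html() below is a hypothetical helper and the markup is a made-up sample.

```php
<?php
// Hypothetical helper: serialises the children of a node, i.e. its inner HTML.
function inner_html(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument outputs just that node's markup.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Made-up sample input for illustration.
$doc = new DOMDocument();
$doc->loadHTML('<h1><a href="/about">About us</a></h1>');
$h1 = $doc->getElementsByTagName('h1')->item(0);

echo inner_html($h1), "\n";   // inner HTML, link markup included
echo $h1->textContent, "\n";  // text only, tags stripped
```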

ThePHPMaster, Oct 15, 2012

KsNitro Greenhorn

#6

You may want to have a look at Zend_Dom, which is part of the Zend Framework for PHP.

Here is a Link.

KsNitro, Oct 17, 2012

SHRT.IN Peon

#7

Alright, I just tested this and it works. Enjoy:

<?php
$content = file_get_contents("http://" . $_POST['url']);
$start_limiter = '<h1>';
$end_limiter = '</h1>';
$start_pos = strpos($content, $start_limiter);
if ($start_pos === false) { die("Starting limiter " . $start_limiter . " not found"); }
// Skip past the opening tag so it is not part of the output.
$start_pos += strlen($start_limiter);
$end_pos = strpos($content, $end_limiter, $start_pos);
if ($end_pos === false) { die("Ending limiter " . $end_limiter . " not found"); }
// Text between the tags; nested markup such as <a> is still included.
$h1tag = substr($content, $start_pos, $end_pos - $start_pos);
echo $h1tag;
?>

PHP:
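
A literal '<h1>' search misses headings written as <H1> or with attributes such as <h1 class="title">. A slightly more forgiving variant might look like the sketch below. The function name first_h1_text() is hypothetical, and it assumes no '>' character appears inside the opening tag's attribute values; a real HTML parser is more robust than string searching.

```php
<?php
// Hypothetical helper: returns the text of the first <h1>, or null if absent.
function first_h1_text(string $content): ?string
{
    // Case-insensitive search that also matches <h1 class="...">.
    $open = stripos($content, '<h1');
    if ($open === false) {
        return null;
    }

    // End of the opening tag (assumes no '>' inside attribute values).
    $openEnd = strpos($content, '>', $open);
    if ($openEnd === false) {
        return null;
    }

    $close = stripos($content, '</h1>', $openEnd);
    if ($close === false) {
        return null;
    }

    // Everything between the opening and closing tags.
    $inner = substr($content, $openEnd + 1, $close - $openEnd - 1);

    // strip_tags() removes nested markup such as a link inside the heading.
    return trim(strip_tags($inner));
}

// Made-up sample input for illustration.
echo first_h1_text('<H1 class="title"><a href="/about">About us</a></H1>'), "\n";
```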

SHRT.IN, Oct 17, 2012