PHP and scraping help

chosenlight Active Member

#1

Hi everyone,

I'm working on a project where I need to extract content between two HTML tags using PHP. I tried a bunch of things but it's not working for me. If anyone can provide code or point me to a script, that would be wonderful.

So, for example, the target website has the following:

bunch of stuff I don't need
Data I need
bunch of stuff I don't need
Data I need
bunch of stuff I don't need
Data I need
bunch of stuff I don't need

Printing the results on the screen, maybe with commas between each set, would be perfect.

Thanks everyone

chosenlight, Oct 26, 2013

jscg Well-Known Member

#2

You should use PHP Simple HTML DOM Parser: http://simplehtmldom.sourceforge.net/

Manual: http://simplehtmldom.sourceforge.net/manual.htm

This is just an example of how it can be used:
include('simple_html_dom.php');

// Create DOM from URL
$html = file_get_html('http://example.com/');

foreach($html->find(('font[color=#0083c6]') as $element)
 echo $element->plaintext . ' ';

Last edited: Oct 27, 2013

jscg, Oct 27, 2013

ryan_uk Illustrious Member

#3

jscg said: ↑

You should use PHP Simple HTML DOM Parser: http://simplehtmldom.sourceforge.net/

That's not needed, as preg_match (there is also preg_match_all, if you need more than one occurrence) or DOMDocument can be used for such a task (then just write your own custom, reusable function). I hope the OP is using it for ethical purposes or has permission to retrieve content from a site in this way.
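As a rough sketch of the preg_match_all route, assuming the wanted text sits inside font tags with a known colour: the sample markup and the helper name extract_font_text() below are made up for illustration, and the pattern tolerates the colour value being quoted or unquoted.

```php
<?php
// Made-up sample markup standing in for the fetched page.
$html = '<p>junk</p><font color="#0083c6">Alice</font>'
      . '<b>junk</b><font color=#0083c6>Bob</font>'
      . '<i>junk</i><FONT COLOR="#0083c6">Carol</FONT>';

/**
 * Return the text inside every <font> tag with the given colour.
 * extract_font_text() is a hypothetical helper, not a PHP built-in.
 */
function extract_font_text($html, $color)
{
    // Optional quotes around the value; i = case-insensitive, s = dot also matches newlines.
    $pattern = '/<font[^>]*\bcolor=["\']?' . preg_quote($color, '/')
             . '["\']?[^>]*>(.*?)<\/font>/is';
    preg_match_all($pattern, $html, $matches);
    return array_map('trim', $matches[1]);   // $matches[1] holds the captured text
}

echo implode(', ', extract_font_text($html, '#0083c6'));   // Alice, Bob, Carol
```

The non-greedy (.*?) stops at the first closing tag, which is what keeps each name separate.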

ryan_uk, Oct 27, 2013

Darkforest Active Member

#4

I'd say go the jscg way if you are working on bigger websites. The Simple HTML DOM library comes in really handy when you have to reuse the code a lot or the websites have complex structures.
However, if it's a small site, then preg_match should be more than enough.

Also, a further note: PHP will not execute any JavaScript on the page it fetches. If the site is partially generated using JavaScript, PHP has no way of seeing the parts of the page that JavaScript has changed or generated.

To make that work, you'll need something far more complicated. You'll probably need a whole server and a browser to render the page, plus some additional libraries to get the elements that you want.
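To illustrate the JavaScript point with a small sketch (the markup is made up): the name below only appears once a browser runs the script, so PHP's parser sees an empty div.

```php
<?php
// Hypothetical page: the name is inserted only when a browser executes the script.
$html = '<html><body><div id="names"></div>'
      . '<script>document.getElementById("names").textContent = "Alice";</script>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@id="names"]')->item(0);

// PHP parsed the markup but never ran the script, so the div is still empty.
var_dump($div->textContent);   // string(0) ""
```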

Darkforest, Oct 27, 2013

ian11 Greenhorn

#5

I recommend using simple built-in PHP functions for this, like so:
$url = "http://jafty.com/";
$HTML = file_get_contents($url);
$begin_tag = "<title>";
$end_tag = "</title>";
$array1 = explode($begin_tag, $HTML);//text you want will be at key 1
$content1 = $array1[1];
$array2 = explode($end_tag, $content1);//text you want will be at key 0
$content2 = $array2[0];
echo $content2;
The above code will print the contents of the title tag of my website, Jafty.com, to the browser window. Simply change the $url, $begin_tag and $end_tag values to suit your needs and run the script. I did test it, by the way.
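The snippet above only reads the piece at key 1, so it returns a single value. A possible extension of the same explode idea to every occurrence might look like this; the sample string and tag values are made up and stand in for the real page.

```php
<?php
// Made-up sample standing in for file_get_contents($url).
$HTML = 'junk<font color="#0083c6">Alice</font>junk'
      . '<font color="#0083c6">Bob</font>junk';
$begin_tag = '<font color="#0083c6">';
$end_tag   = '</font>';

$results = array();
$pieces = explode($begin_tag, $HTML);
array_shift($pieces);                     // drop everything before the first opening tag
foreach ($pieces as $piece) {
    $parts = explode($end_tag, $piece);   // the wanted text is at key 0
    $results[] = $parts[0];
}

echo implode(', ', $results);             // Alice, Bob
```

This only works when the opening tag is written exactly the same way every time it appears in the page source.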

ian11, Oct 28, 2013

kutchbhi Active Member

#6

Say no to Simple HTML DOM (memory leak issues)
Say no to regex (ugly)
Say yes to QueryPath!!

$qp = htmlqp('file/string');
$fontTags = $qp->find('font');

foreach ($fontTags as $tag) {
    echo $tag->text();
}

http://www.ibm.com/developerworks/web/library/os-php-querypath/index.html

kutchbhi, Oct 28, 2013

PinoyEngine™ Well-Known Member

#7

If you're scraping an online website, cURL could help you.
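A minimal sketch of fetching a page with cURL, assuming the cURL extension is enabled on the host. fetch_html() is a made-up helper name and the timeout values are arbitrary; the options themselves are standard cURL ones.

```php
<?php
/**
 * Fetch a URL with cURL and return the body, or false on failure.
 * fetch_html() is a hypothetical helper, not part of PHP.
 */
function fetch_html($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // seconds to wait for a connection
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);           // seconds allowed for the whole request
    return curl_exec($ch);
}

$html = fetch_html('http://example.com/');
if ($html === false) {
    echo "Request failed\n";
} else {
    echo strlen($html), " bytes fetched\n";
}
```

The returned string can then be handed to whichever extraction method you settle on.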

PinoyEngine™, Oct 29, 2013

hav0c Notable Member

#8

Maybe you could try something similar to this:

<?php
$url = file_get_contents("URL-HERE");

preg_match('/(.*?)<\/font>/isu', $url, $data);

?>

Use print_r($data) to print the array and, of course, print_r($data[1]) or whichever specific piece(s) you need to make visible.

Last edited: Oct 29, 2013

hav0c, Oct 29, 2013

chosenlight Active Member

#9

Wow, thank you guys for all your help! I have a few options here and I'll try them out and let you know which worked best for me. Thanks again.
Sorry for disappearing there, I was traveling without internet access!

@ryan_uk: what I'm retrieving is public domain.

chosenlight, Oct 29, 2013

chosenlight Active Member

#10

Hi all, I need just a little bit more help.

@ian11 I tried your method; it retrieves only the first record. I tried print_r($content2) and it also showed only the first result.

@jscg your method returns an unexpected T_AS error. I see that you're using a foreach, not a for loop, so that can't be the problem. I also changed color=#0083c6 to color=\"#0083c6\" but it didn't accomplish much.

chosenlight, Oct 29, 2013

jscg Well-Known Member

#11

Paste the full error line.

jscg, Oct 30, 2013

ezprint2008 Well-Known Member

#12

The problem with doing this is that you don't know what will be in a page's source code (if you're going to use the script to target multiple sites).
I've tried this with dictionary.com to grab definitions of words and whether they are a noun, verb, etc.
The code was rough off-road crapology to make it work, but since I was only using Dictionary.com I was able to see how they formatted their responses.
It still sometimes comes back with extra characters. But that's a good question: how do you grab information (when you don't know what it will be) and not include all the junk from the source code? Strip the HTML characters, sanitize scripts, etc. I've never tried cURL, so maybe Pinoy has something?
If you're only going to grab from one site all the time, I have some rough code that can do the trick as long as their formatting stays the same. I mark the area and line of the source code and then remove the matching strings to clear the junk, which leaves only the part you want between those tag-formatted areas.

ezprint2008, Oct 30, 2013

chosenlight Active Member

#13

@jscg -

Thanks for your help. Here is the error message; I just changed the file path and my username. It's hosted on Netfirms, I believe on PHP 5.2.

Parse error: syntax error, unexpected T_AS in /hermes/bosoraweb128/b1783/nf.[myusername]/public_html/[myfile.php] on line 5
<?php
include('simple_html_dom.php');
// Create DOM from URL
$html = file_get_html('http://[webpath]');
foreach($html->find(('font[color=\"#0083c6\"]') as $element)
 echo $element->plaintext . ' ';
?>
Now forgive me, I'm not familiar with how DOM works, but is simple_html_dom.php a file that has to be uploaded to the server?
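The unexpected T_AS most likely comes from the doubled opening parenthesis in find(('font[...]'): the parser reaches "as" while a parenthesis is still open. With a single parenthesis, foreach ($html->find('font[color=#0083c6]') as $element) should parse. And yes, simple_html_dom.php is a third-party file, so include() can only find it if it has been uploaded next to the script. If uploading a library is not wanted, a similar selection can be sketched with PHP's built-in DOMDocument and DOMXPath; the sample markup below is made up.

```php
<?php
// Made-up sample markup; in practice this would come from the target page.
$html = '<html><body><p>junk</p>'
      . '<font color="#0083c6">Alice</font>'
      . '<font color="#ff0000">not wanted</font>'
      . '<font color="#0083c6">Bob</font>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect warnings about sloppy HTML instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$names = array();
// Select only the <font> elements whose color attribute matches exactly.
foreach ($xpath->query('//font[@color="#0083c6"]') as $element) {
    $names[] = trim($element->textContent);
}

echo implode(', ', $names);   // Alice, Bob
```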

@ezprint2008 you know, I'm not even trying to do that. I am trying to grab info from one page and one page only. All the data is between the colored font tags and there are no special characters. It's basically a whole bunch of names with no hyphens, apostrophes, or anything like that.
Thanks

chosenlight, Oct 30, 2013

hav0c Notable Member

#14

Have you had a chance to try the few lines of code I've suggested? Might be exactly what you need in the simplest of ways.

hav0c, Oct 30, 2013

chosenlight Active Member

#15

@hav0c Thanks for your reply. Your code prints "Array ( )", so it's not finding anything. I am sure the font tags are accurate; I cut and pasted them.
Does it make a difference that the target page is an ASP page (i.e. the URL ends in "browse.asp?c=a")? I don't think it should, but I thought I should mention it just in case.

My code is:
<?php
$url = file_get_contents("http://www.[domainname].com/browse.asp?c=a");
preg_match('/(.*?)<\/font>/isu', $url, $data);
print_r($data);
?>
Thanks so much for your help.
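One possible cause of the empty array, assuming the page is not served as UTF-8 (older ASP sites often use Windows-1252): with the u modifier, preg_match fails outright when the subject is not valid UTF-8 and no match is recorded. preg_last_error() can confirm whether that is what happened. The sample string and pattern below are made up for the sketch.

```php
<?php
// Made-up sample: "\xE9" is "é" in Windows-1252/ISO-8859-1, which is not valid UTF-8.
$page = "<font color=\"#0083c6\">Ren\xE9e</font>";

// With the u modifier the subject must be valid UTF-8, so this call fails.
$result = preg_match('/<font[^>]*>(.*?)<\/font>/isu', $page, $data);
var_dump($result);                                     // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);   // bool(true)

// Without the u modifier the same pattern matches the raw bytes.
preg_match('/<font[^>]*>(.*?)<\/font>/is', $page, $data);
echo strlen($data[1]), "\n";                           // 5
```

If that turns out to be the cause, dropping the u modifier or converting the page to UTF-8 first are the two obvious things to try.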

chosenlight, Oct 30, 2013

ROOFIS Well-Known Member

#16

DOM is definitely the way to go, but you can do it without a major class to call on.

I'm a bit rusty on the Document Object Model, but I've coded an example below that should do what you're asking for. Do yourself a favour and look at the PHP reference pages listed in the example script regarding character encoding etc.

You can easily turn it into a function by giving it a function name and returning the result instead of echoing it.
<pre>
<?php
#see:http://php.net/manual/en/domdocument.loadhtmlfile.php
#see:http://php.net/manual/en/domdocument.getelementsbytagname.php

#@ to suppress errors for invalid html pages!
$doc = new DOMDocument();
@$doc->loadHTMLFile("http://www.[webpage you seek]/");
$elements = $doc->getElementsByTagName('font');
foreach ($elements as $font) {
echo $font->nodeValue, PHP_EOL; #or " ", for side by side rendering!
}
die();
?>
</pre>
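A sketch of the function version mentioned above, under a few assumptions: font_texts() is a made-up name, it takes an HTML string rather than a URL so it can be tried on sample data, and it uses libxml_use_internal_errors() in place of the @ operator to quiet warnings about invalid markup.

```php
<?php
/**
 * Return the text of every <font> element in an HTML string.
 * font_texts() is a hypothetical name for this sketch.
 */
function font_texts($html)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);          // restore the earlier setting

    $texts = array();
    foreach ($doc->getElementsByTagName('font') as $font) {
        $texts[] = trim($font->nodeValue);
    }
    return $texts;
}

// Made-up sample with unclosed <p> tags, to mimic sloppy HTML.
$sample = '<p>junk<font color="#0083c6">Alice</font><p>junk<font color="#0083c6">Bob</font>';
echo implode(', ', font_texts($sample));   // Alice, Bob
```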

Last edited: Oct 31, 2013

ROOFIS, Oct 31, 2013

stephan2307 Well-Known Member

#17

I would strongly suggest not using the DOM. DOM will only work correctly if the HTML is structured correctly, and I find that most of the time it will fail. Use preg_match_all and it will work whether the HTML is valid or not.

stephan2307, Nov 7, 2013