Hi everyone, I'm working on a project where I need to extract content between two HTML tags using PHP. I tried a bunch of things but it's not working for me. If anyone can provide code or point me to a script, that would be wonderful. For example, the target website has the following:
bunch of stuff I don't need <font color="#0083c6">Data I need</font> bunch of stuff I don't need <font color="#0083c6">Data I need</font> bunch of stuff I don't need <font color="#0083c6">Data I need</font> bunch of stuff I don't need
Printing it on the screen, maybe with commas between each set, would be perfect. Thanks everyone
You should use PHP Simple HTML DOM Parser: http://simplehtmldom.sourceforge.net/
Manual: http://simplehtmldom.sourceforge.net/manual.htm
This is just an example of how it can be used:
    include('simple_html_dom.php');
    // Create DOM from URL
    $html = file_get_html('http://example.com/');
    foreach($html->find(('font[color=#0083c6]') as $element)
        echo $element->plaintext . '<br>';
That's not needed, as preg_match (there is also preg_match_all, if you need it for more than one occurrence) or DOMDocument can be used for such a task (then just write your own custom, reusable function). I hope the OP is using it for ethical purposes or has permission to retrieve content from a site in this way.
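A minimal sketch of the preg_match_all route mentioned above, wrapped in a reusable function. The function name `extract_between` and the sample `$html` string are made up for illustration; a real page would be fetched first, for example with file_get_contents(). It assumes PHP 7 or later and tags that are not nested and appear exactly as written.

```php
<?php
/**
 * Return every chunk of text found between $open and $close.
 * Hypothetical helper for illustration; not part of PHP itself.
 */
function extract_between(string $html, string $open, string $close): array
{
    // preg_quote() escapes regex metacharacters in the literal tags;
    // its second argument also escapes the '/' delimiter.
    $pattern = '/' . preg_quote($open, '/') . '(.*?)' . preg_quote($close, '/') . '/is';

    // preg_match_all() collects all occurrences; index 1 holds the captured text.
    preg_match_all($pattern, $html, $matches);

    return $matches[1] ?? [];
}

// Sample input standing in for the downloaded page.
$html = 'junk <font color="#0083c6">Alice</font> junk '
      . '<font color="#0083c6">Bob</font> junk';

$names = extract_between($html, '<font color="#0083c6">', '</font>');
echo implode(', ', $names), PHP_EOL; // Alice, Bob
```

The same function can be pointed at other tags by changing the two tag arguments.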
I'd say go the jscg way if you are working on bigger websites. The Simple HTML DOM library comes in really handy when you have to reuse the code a lot or the websites have complex structures. However, if it's a small site, then preg_match should be more than enough.
Also, a further note: PHP will not run any JavaScript in the page it downloads. If the site is partially generated using JavaScript, PHP has no way of seeing the parts that JavaScript has changed or generated. To make that work you'll need something far more complicated, probably a whole server with a browser to render the page, plus some additional libraries to get the elements that you want.
I recommend using simple built-in PHP functions for this, like so:
    $url = "http://jafty.com/";
    $HTML = file_get_contents($url);
    $begin_tag = "<title>";
    $end_tag = "</title>";
    $array1 = explode($begin_tag, $HTML); // text you want will be at key 1
    $content1 = $array1[1];
    $array2 = explode($end_tag, $content1); // text you want will be at key 0
    $content2 = $array2[0];
    echo $content2;
The above code will print the contents of the title tag of my website, Jafty.com, to the browser window. Simply change the $url, $begin_tag and $end_tag values to suit your needs, run the script, and it will work. I did test it, by the way.
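The explode() approach above reads key 1 and stops, so it returns only the first match. A small sketch of extending the same idea to every occurrence follows. The sample `$HTML` string is invented here in place of a downloaded page, and the code assumes every opening tag has a matching closing tag.

```php
<?php
// Sample input standing in for file_get_contents($url).
$HTML = 'junk <font color="#0083c6">Alice</font> junk '
      . '<font color="#0083c6">Bob</font> junk';

$begin_tag = '<font color="#0083c6">';
$end_tag   = '</font>';

// Splitting on the opening tag leaves the unwanted leading text at key 0
// and one piece per occurrence after that.
$pieces = explode($begin_tag, $HTML);
array_shift($pieces); // discard everything before the first opening tag

$results = [];
foreach ($pieces as $piece) {
    // Everything before the first closing tag is the wanted text.
    $parts = explode($end_tag, $piece);
    $results[] = $parts[0];
}

echo implode(', ', $results), PHP_EOL; // Alice, Bob
```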
Say no to Simple HTML DOM (memory leak issues). Say no to regex (ugly). Say yes to QueryPath!!
    $qp = htmlqp('file/string');
    $fontTags = $qp->find('font');
    foreach ($fontTags as $tag) {
        echo $tag->text();
    }
http://www.ibm.com/developerworks/web/library/os-php-querypath/index.html
Maybe you could try something similar to this: Use print_r($data) to print the array and, of course, print_r($data[1]) or whichever specific piece(s) you need to make visible.
Wow, thank you guys for all your help! I have a few options here and I'll try them out and let you know which worked best for me. Thanks again. Sorry for disappearing there, I was traveling without internet access! @ryan_uk: what I'm retrieving is public domain.
Hi all - just a little bit more help, please.
@ian11 I tried your method; it works, but it retrieves only the first record. I tried print_r($content2) and it also showed only the first result.
@jscg your method returns an unexpected T_AS error. I see that you're using a foreach, not a for loop, so that can't be the problem. I also changed color=#0083c6 to color=\"#0083c6\" but it didn't accomplish much.
The problem with doing this is that you don't know what will be in a page's source code (if you're going to use the script to target multiple sites). I've tried this with dictionary.com to grab definitions of words and whether they are noun, verb, etc. The code was rough off-road crapology to make it work, but since I was only using Dictionary.com I was able to see how they formatted their responses. It still sometimes comes back with extra characters.
But that's a good question: how do you grab information (when you don't know what it will be) and not include all the junk from the source code? Strip it with HTML chars etc., sanitize scripts etc. I've never tried cURL, so maybe Pinoy has something?
If you're only going to grab from one site all the time, I have some rough code that can do the trick as long as their formatting stays the same. I mark the area and line of source code and then remove the string match to clear the junk, and it leaves only the part ya want between those tag-formatted areas.
@jscg - Thanks for your help. Here is the error message; I just changed the file path and my username. It's hosted on Netfirms, I believe on PHP 5.2.
    Parse error: syntax error, unexpected T_AS in /hermes/bosoraweb128/b1783/nf.[myusername]/public_html/[myfile.php] on line 5
    <?php
    include('simple_html_dom.php');
    // Create DOM from URL
    $html = file_get_html('http://[webpath]');
    foreach($html->find(('font[color=\"#0083c6\"]') as $element)
    echo $element->plaintext . '<br>';
    ?>
Now forgive me, I'm not familiar with how DOM works, but is simple_html_dom.php a file that has to be uploaded to the server?
@ezprint2008 you know, I'm not even trying to do that. I am trying to grab info from one page and one page only. All the data is between the colored font tags and there are no special characters. It's basically a whole bunch of names with no hyphens, apostrophes, or anything like that. Thanks
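One likely cause of the unexpected T_AS error, offered as a guess from the pasted code and not a confirmed diagnosis: `find((` opens two parentheses but only one is closed before `as`, so the parser reaches `as` while still inside the argument list. The sketch below checks both spellings without needing the library. The helper name `parses` is made up, and catching `ParseError` from eval() assumes PHP 7 or later, which is newer than the PHP 5.2 mentioned above.

```php
<?php
/**
 * Report whether a snippet of PHP is syntactically valid.
 * Hypothetical helper: the leading "return true;" stops the snippet from
 * actually running, but eval() still has to parse all of it first.
 */
function parses(string $code): bool
{
    try {
        return eval('return true; ' . $code);
    } catch (ParseError $e) {
        return false;
    }
}

// As posted: two opening parentheses after find, only one closed before "as".
$posted = 'foreach ($html->find((\'font[color=#0083c6]\') as $element) echo $element->plaintext;';

// With the stray parenthesis removed.
$fixed = 'foreach ($html->find(\'font[color=#0083c6]\') as $element) echo $element->plaintext;';

var_dump(parses($posted)); // bool(false)
var_dump(parses($fixed));  // bool(true)
```

Since include('simple_html_dom.php') reads that file from the server, the library file would also need to be uploaded next to the script.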
Have you had a chance to try the few lines of code I've suggested? Might be exactly what you need in the simplest of ways.
@hav0c Thanks for your reply. Your code prints "Array ( )", so it's not finding anything. I am sure the font tags are accurate; I cut and pasted them. Does it make a difference that the target page is an ASP page (i.e. the URL ends in "browse.asp?c=a")? I don't think it should, but thought I should mention it just in case. My code is:
    <?php
    $url = file_get_contents("http://www.[domainname].com/browse.asp?c=a");
    preg_match('/<font color="#0083c6">(.*?)<\/font>/isu', $url, $data);
    print_r($data);
    ?>
Thanks so much for your help.
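One possible reason for the empty "Array ( )", stated as an assumption since the page itself is not shown: the `u` modifier makes PCRE treat the input as UTF-8, and if the page is served in another encoding such as Windows-1252, preg_match returns false and leaves `$data` empty. The sketch below reproduces that with a made-up sample string containing a single non-UTF-8 byte, then retries without `u`, using preg_match_all to collect every name.

```php
<?php
// Made-up sample: "\xE9" is "é" in Windows-1252 and is not valid UTF-8 on its own.
$page = "caf\xE9 <font color=\"#0083c6\">Alice</font> junk "
      . "<font color=\"#0083c6\">Bob</font>";

// With the u modifier the match fails outright on invalid UTF-8 input.
$result = preg_match('/<font color="#0083c6">(.*?)<\/font>/isu', $page, $data);
var_dump($result === false);                         // bool(true)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// Without u the bytes are matched as-is, and preg_match_all finds every occurrence.
preg_match_all('/<font color="#0083c6">(.*?)<\/font>/is', $page, $all);
echo implode(', ', $all[1]), PHP_EOL;                // Alice, Bob
```

Checking preg_last_error() on the real page would show whether this is what is happening there.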
DOM is definitely the way to go, but you can do it without a major class to call on. I'm a bit rusty on the Document Object Model, but I've coded an example below that should do what you're asking for. Do yourself a favour and look at the PHP reference pages listed in the example script regarding character encoding etc. You can easily turn it into a function by giving it a function name and returning the values instead of echoing them.
<pre>
<?php
#see:http://php.net/manual/en/domdocument.loadhtmlfile.php
#see:http://php.net/manual/en/domdocument.getelementsbytagname.php
#@ to suppress errors for invalid html pages!
$doc = new DOMDocument();
@$doc->loadHTMLFile("http://www.[webpage you seek]/");
$elements = $doc->getElementsByTagName('font');
foreach ($elements as $font) {
    echo $font->nodeValue, PHP_EOL; #or " ", for side by side rendering!
}
die();
?>
</pre>
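The loop above prints every font element on the page. A sketch of narrowing it to the colored tags and joining the results with commas, using DOMXPath, follows. The sample `$html` string is invented here, and loadHTML is used in place of loadHTMLFile so the example runs without a network connection.

```php
<?php
// Sample input standing in for the remote page.
$html = '<html><body>junk <font color="#0083c6">Alice</font> junk '
      . '<font color="red">skip me</font> '
      . '<font color="#0083c6">Bob</font></body></html>';

// Collect libxml warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// XPath query: font elements whose color attribute equals the wanted value.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//font[@color="#0083c6"]');

$names = [];
foreach ($nodes as $node) {
    $names[] = trim($node->nodeValue);
}

echo implode(', ', $names), PHP_EOL; // Alice, Bob
```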
I would strongly suggest not using the DOM. DOM will only work correctly if the HTML is structured correctly, and I find that most of the time it will fail. Use preg_match_all and it will work whether the HTML is valid or not.