At the prime age of 49 I've decided to start attending night school to get some basic and advanced knowledge of programming. I want to make a web spider for my website that just collects the words and puts them in a database to use as a search engine, but I have no idea where to start. Will anyone be able to help?

The structure that I think will work is as follows:

index page.php - just a mock-up of a site home page with an 'include' for a search bar up top.
Results page - for the search results.
Spider - for searching through the pages. << (PROBLEM)
Database - for storing all the data collected by the spider, using MySQL.

If anyone has the time to write the code for the spider or to help me, I would be very appreciative. I have exhausted all avenues and have gotten nowhere, so any help here will be a bonus.

Thanks
Eddie
I just need one for a PHP website, so which one should I use? I just need a really basic one that collects all the words from a PHP webpage and puts them in a database.
Eddie, you have not defined several key factors, such as whether the website is dynamically driven with a DB backend or all content is regular HTML files. The more information you give, the more help you will receive.
Oh OK, really sorry. It's going to be regular HTML files. The content of the HTML files is going to be very basic: headings and a few bits of text, nothing major. It's more to see if I can create the spider for myself. From what I've read, the spider I want to create is a very basic one, but I don't know where to start...
I'm using Apache, which was distributed to me at my night classes, so I'm guessing it's shared, if that answers your question. I'm a beginner, so some of these questions are going a little over my head.
http://www.phpcodesnippets.com/tag/php-website-spider I've found this link with some code on it. I feel this is what I need, but I just want the spider to extract words from the webpage and nothing else. What code would have to be changed? Any help would be highly appreciated. Cheers Eddie
Eddie, use fopen, fread, etc. to get the content of the page into a variable. You are going to programming classes, so I am assuming you can do all this. The second thing you can do is split the text with the explode function, which splits a string at a delimiter you choose (for example, explode(" ", $content) splits at every space), and put the result into one big array. After this you can select which words to add to the database using loops and if statements. I'd love to give a step by step but, as Chemo said, it would be re-inventing the wheel.
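For what it's worth, the fetch-and-split idea above can be sketched roughly like this. It is only a minimal sketch, assuming PHP 7 or later. The function name extract_words and the sample HTML are made up for illustration. It swaps explode for preg_split so that punctuation does not stick to the words, and it pads each tag with a space first, because strip_tags on its own would glue "World" and "Hello" together across the closing heading tag.

```php
<?php
// Hypothetical helper (not from the thread): turn an HTML string into lowercase words.
function extract_words(string $html): array
{
    // Put a space before every tag so words in adjacent elements stay separate,
    // then remove the markup and turn entities such as &amp; back into characters.
    $text = strip_tags(str_replace('<', ' <', $html));
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Split on runs of anything that is not a letter, digit or underscore.
    return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// In a real spider the HTML would come from fopen/fread or file_get_contents on a URL;
// a literal string is used here so the sketch runs without a web server.
$html = '<h1>Hello World</h1><p>Hello again &amp; welcome.</p>';

$words = extract_words($html);

// Count how often each word occurs, ready to be written to a database.
$counts = array_count_values($words);
print_r($counts);
```

From here, the "loops and if statements" step would be a loop over $counts that skips words you don't want (very short ones, for example) and inserts the rest.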
Right, next problem guys... I'm persistent lol. I've got the code I want below:

<?
/*
 * populate.php
 *
 * Script for populating the search database with words,
 * pages and word-occurences.
 */

/* Connect to the database: */
mysql_pconnect("localhost","root","")
    or die("ERROR: Could not connect to database!");
mysql_select_db("test");

/* Define the URL that should be processed: */
$url = addslashes( $_GET['url'] );

if( !$url )
{
    die( "You need to define a URL to process." );
}
else if( substr($url,0,7) != "http://" )
{
    $url = "http://$url";
}

/* Does this URL already have a record in the page-table? */
$result = mysql_query("SELECT page_id FROM page WHERE page_url = \"$url\"");
$row = mysql_fetch_array($result);

if( $row['page_id'] )
{
    /* If yes, use the old page_id: */
    $page_id = $row['page_id'];
}
else
{
    /* If not, create one: */
    mysql_query("INSERT INTO page (page_url) VALUES (\"$url\")");
    $page_id = mysql_insert_id();
}

/* Start parsing through the text, and build an index in the database: */
if( !($fd = fopen($url,"r")) )
    die( "Could not open URL!" );

while( $buf = fgets($fd,1024) )
{
    /* Remove whitespace from beginning and end of string: */
    $buf = trim($buf);

    /* Try to remove all HTML-tags: */
    $buf = strip_tags($buf);
    $buf = ereg_replace('/&\w;/', '', $buf);

    /* Extract all words matching the regexp from the current line: */
    preg_match_all("/(\b[\w+]+\b)/",$buf,$words);

    /* Loop through all words/occurrences and insert them into the database: */
    for( $i = 0; $words[$i]; $i++ )
    {
        for( $j = 0; $words[$i][$j]; $j++ )
        {
            /* Does the current word already have a record in the word-table? */
            $cur_word = addslashes( strtolower($words[$i][$j]) );

            $result = mysql_query("SELECT word_id FROM word WHERE word_word = '$cur_word'");
            $row = mysql_fetch_array($result);

            if( $row['word_id'] )
            {
                /* If yes, use the old word_id: */
                $word_id = $row['word_id'];
            }
            else
            {
                /* If not, create one: */
                mysql_query("INSERT INTO word (word_word) VALUES (\"$cur_word\")");
                $word_id = mysql_insert_id();
            }

            /* And finally, register the occurrence of the word: */
            mysql_query("INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id)");
            print "Indexing: $cur_word<br>";
        }
    }
}

fclose($fd);
?>

It won't let me define the URL I want no matter what I try. What could possibly be the problem? The URL I want it to check is http://localhost/MyPHP/%23%23test1%23%23.php
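One possible cause, offered as a guess since the exact request isn't shown: PHP URL-decodes query-string values before it fills $_GET, so each %23 in the url parameter arrives in the script as a literal #. In a URL, everything after the first # is treated as a fragment and is not part of the path, so the script may end up asking for /MyPHP/ instead of the file. The sketch below only illustrates that effect and one way around it. It assumes PHP 7 or later, and reencode_hashes is a made-up helper name, not something from the original script.

```php
<?php
// Hypothetical helper (not part of populate.php): turn literal '#' characters
// back into %23 so they stay part of the path instead of starting a fragment.
function reencode_hashes(string $url): string
{
    return str_replace('#', '%23', $url);
}

// Roughly what $_GET['url'] would hold after PHP decodes
// ?url=http://localhost/MyPHP/%23%23test1%23%23.php
$received = urldecode('http://localhost/MyPHP/%23%23test1%23%23.php');

// With literal '#' characters, the file name is parsed as a fragment, not as path.
$parts = parse_url($received);
echo "Path before fix: {$parts['path']}\n";

// After re-encoding, the whole file name is part of the path again.
$fixed = reencode_hashes($received);
$fixedParts = parse_url($fixed);
echo "Path after fix:  {$fixedParts['path']}\n";
```

If this is what is happening, calling something like reencode_hashes on $url before the fopen line, or renaming the test file so it has no # in its name, would be the things to try first.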