I'm looking for a PHP script that examines a page and returns only its text, then tells which of these categories the page belongs to:
* Arts & Entertainment
* Shopping
* Sports & Recreation
* News
* Business & Industrial
* Health
* Home & Garden
* Culture & Society
* Technology
* Travel
* Reference
* Games
* Employment & Recruiting
* Education
* Finance
* Hobbies
* Law
* Parenting & Family
* People & Relationships
* Real Estate
* Automotive
Is it possible to create a PHP script that will do this, and if so, what does it have to do? Does it examine keywords to determine which category the page belongs to, and how would it know this? How would you do it if there are hundreds of thousands of keywords? Thanks.
Even though your post is still somewhat vague, I can tell you that if there are "hundreds of thousands" of entries, you will need some kind of database (the most common being MySQL). As for how: learn how. There are thousands of PHP tutorials, and PHP documents every single function it has on its website (php.net/functionnamehere). Telling you "how" would amount to giving you source code. Either learn to do it yourself, or look on PHP script listing websites and find something similar.
In PHP you can use file_get_contents() to get the page's source code. Then use a regular expression to parse the HTML and pull out the relevant information.
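A rough sketch of that step, in case it helps. The function name html_to_text() and the example URL are made up for illustration, and strip_tags() does most of the work here instead of a hand-written regular expression, so treat it as one possible approach and not the only way to do it:

```php
<?php
// Hypothetical helper: reduce an HTML string to its visible text.
function html_to_text(string $html): string
{
    // Drop <script> and <style> blocks entirely; their contents are not page text.
    // Fall back to the unmodified input if the regex engine reports an error.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;

    // Put a space in front of every tag so words in adjacent elements
    // do not run together, then remove the tags themselves.
    $text = strip_tags(str_replace('<', ' <', $html));

    // Decode entities such as &amp; and collapse runs of whitespace.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

// Usage (the URL is only a placeholder):
// $html = file_get_contents('http://example.com/');
// if ($html !== false) {
//     echo html_to_text($html);
// }
```

Keeping the fetch separate from the text extraction means you can try the extraction on any HTML string without hitting the network.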
Thanks allaboutgeo. I understand the basics. But how would it be possible to automatically sort a page into one of those categories? Would you need a giant database of keywords?
Determining which category a page belongs to would be quite difficult. If you could do it reliably, you would probably be working for Google or the like.
I'd create a dictionary of keywords for each category. Download the contents of the page, count how many times each keyword appears on the page, and then assign the page to the category whose keywords have the most matches.
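Something along these lines, as a minimal sketch. The categorize() function and the tiny $dictionary are invented for the example; a real dictionary would be far larger and would probably live in a database. It also assumes the page has already been reduced to plain text:

```php
<?php
// Hypothetical, deliberately tiny dictionary: category => keywords.
$dictionary = [
    'Sports & Recreation' => ['football', 'league', 'goal', 'team'],
    'Finance'             => ['loan', 'interest', 'bank', 'mortgage'],
    'Travel'              => ['flight', 'hotel', 'airport', 'tour'],
];

// Return the category whose keywords occur most often in $text,
// or null when no keyword matches at all.
function categorize(string $text, array $dictionary): ?string
{
    // Lower-case the text and split it into words on anything
    // that is not a letter or a digit.
    $words  = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    // Map each distinct word to the number of times it occurs.
    $counts = array_count_values($words);

    $best      = null;
    $bestScore = 0;
    foreach ($dictionary as $category => $keywords) {
        // A category's score is the total number of keyword occurrences.
        $score = 0;
        foreach ($keywords as $keyword) {
            $score += $counts[$keyword] ?? 0;
        }
        // On a tie, the category listed first in the dictionary wins.
        if ($score > $bestScore) {
            $best      = $category;
            $bestScore = $score;
        }
    }
    return $best;
}

// Usage:
// echo categorize('The team scored a late goal and won the league.', $dictionary);
```

Counting the words once and then looking keywords up in that table keeps the work proportional to the page length plus the dictionary size, which matters once there are hundreds of thousands of keywords. Note that this only matches single-word keywords exactly; plurals and phrases would need extra handling.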