Site Scraper?

nitreb Well-Known Member

#1

Hi,

I'm wondering whether a programmer could go to a website such as http://www.tandfonline.com/action/showPublications?display=byAlphabet and write a program that retrieves specific information (title and ISSN) for all of its journals and adds it to a site automatically. Is that possible? If not, what would be the easiest way of doing this, in your opinion?

Thanks in advance for your replies.

nitreb, Aug 2, 2012

Alejandro131 Greenhorn Best Answer

#2

Theoretically this is possible. If you don't want to compile your code and are more of a web programmer, you could use PHP. You can start with the function file_get_contents to get the source of the web page:
$pageSource = file_get_contents('http://www.tandfonline.com/action/showPublications?display=byAlphabet');
From there you would have to add code that recognizes which links point to journals, follows them, and gets their page source, as there are no ISSNs on the main list page.
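As a rough sketch of the link-recognition step, the snippet below parses a page with PHP's DOMDocument and keeps only the anchors whose href contains a given fragment. The sample markup, the '/toc/' fragment and the function name extractJournalLinks are assumptions made for illustration; the real page structure has to be checked by viewing its source.

```php
<?php
// Hypothetical sample markup; the real listing page will look different.
$pageSource = <<<HTML
<html><body>
<a href="/toc/abcd20/current">Journal A</a>
<a href="/action/showLogin">Log in</a>
<a href="/toc/wxyz19/current">Journal B</a>
</body></html>
HTML;

// Returns an array of href => link text for anchors whose href contains $needle.
function extractJournalLinks(string $html, string $needle): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, $needle) !== false) {
            $links[$href] = trim($a->textContent);
        }
    }
    return $links;
}

// '/toc/' is an assumed URL fragment, not taken from the real site.
$journalLinks = extractJournalLinks($pageSource, '/toc/');
print_r($journalLinks);
```

Keying the result by href also drops duplicate links to the same journal, which listing pages often contain.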

Another thing you would have to consider is code for traversing the page numbers, as I see that the site paginates the list and displays only 20 titles per page.
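One possible way to handle the pagination is to build the URL of every page up front and then fetch them one by one. This is only a sketch: the parameter name startPage and the total of 45 titles are made up for illustration, so check the site's own "next page" links for the real parameter and count. The 20 titles per page comes from the observation above.

```php
<?php
// Builds one URL per result page, given the total number of titles.
function buildPageUrls(string $baseUrl, int $totalTitles, int $perPage = 20): array
{
    $pageCount = (int) ceil($totalTitles / $perPage);
    $urls = [];
    for ($page = 0; $page < $pageCount; $page++) {
        // "startPage" is a hypothetical parameter name, not confirmed for this site.
        $urls[] = $baseUrl . '&' . http_build_query(['startPage' => $page]);
    }
    return $urls;
}

$base = 'http://www.tandfonline.com/action/showPublications?display=byAlphabet';
$urls = buildPageUrls($base, 45); // 45 is an illustrative total, not the real count

foreach ($urls as $url) {
    echo $url, PHP_EOL;
    // To actually download each page, something like:
    // $pageSource = file_get_contents($url);
    // sleep(1); // pause between requests to keep the load on the site low
}
```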

After you get the links to every journal page, you would have to get the title and ISSN from each journal's page (again by identifying things like the ISSN text or the div in which that information is written), and you would have to structure the gathered data to fit your site's database.
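The extraction step could look roughly like the sketch below. It reads the page's title element with DOMDocument and finds ISSNs with a regular expression, relying on the standard ISSN layout of four digits, a hyphen, three digits and a final digit or X. The sample markup, the placeholder ISSN values and the function name extractTitleAndIssns are assumptions; on the real pages the title may live in a different element.

```php
<?php
// Hypothetical journal page; the ISSN values are placeholders for illustration.
$journalPage = <<<HTML
<html><head><title>Example Journal of Testing</title></head>
<body><div class="info">Print ISSN: 1234-5679 Online ISSN: 2049-3630</div></body></html>
HTML;

// Returns ['title' => string|null, 'issn' => string[]] for one journal page.
function extractTitleAndIssns(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $titleNodes = $doc->getElementsByTagName('title');
    $title = $titleNodes->length > 0
        ? trim($titleNodes->item(0)->textContent)
        : null;

    // Matches anything shaped like an ISSN: NNNN-NNNC, where C is a digit or X.
    // This can also match unrelated numbers with the same shape.
    preg_match_all('/\b\d{4}-\d{3}[\dXx]\b/', $html, $matches);

    return [
        'title' => $title,
        'issn'  => array_values(array_unique($matches[0])),
    ];
}

$record = extractTitleAndIssns($journalPage);
print_r($record);
```

The returned array maps directly onto a database row, for example one title column plus one column per ISSN.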

All in all this isn't an unmanageable task, but you could have problems with traffic because there are a lot of pages, and I don't know whether you would run into legal issues using the information you have gathered from that site. If you have any doubts, you could ask the site's administration about it; they might even be willing to give you the title and ISSN information if you're going to use it properly.

Alejandro131, Aug 2, 2012

Rila Peon

#3

For scrapers, programmers use cURL or file_get_contents() in PHP, and then write an algorithm to sort through the data extracted from the source code (remove tags, scrape between two tags, etc.). Automated scrapers are run as cron jobs!

Rila, Aug 3, 2012
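A minimal sketch of those two pieces, fetching with cURL and scraping between two tags, might look like this. The function names fetchPage and textBetween and the sample markup are made up for illustration, and fetchPage is defined but not called here because it needs network access and the cURL extension.

```php
<?php
// Downloads a page with cURL and returns the body as a string.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 30,   // give up after 30 seconds
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    return $body;
}

// Returns the text between the first $start marker and the next $end marker,
// with any inner tags removed, or null if either marker is missing.
function textBetween(string $source, string $start, string $end): ?string
{
    $from = strpos($source, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($source, $end, $from);
    if ($to === false) {
        return null;
    }
    return trim(strip_tags(substr($source, $from, $to - $from)));
}

// Hypothetical markup standing in for a downloaded page.
$sample = '<div class="title"><b>Example Journal</b></div>';
$title = textBetween($sample, '<div class="title">', '</div>');
echo $title, PHP_EOL;
```

To run such a script automatically, a crontab entry along the lines of `0 3 * * * php /path/to/scraper.php` would start it every day at 03:00; the path is a placeholder.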

nitreb Well-Known Member

#4

Thanks for the replies, guys. Both were great. I just found out about the 'best answer' option, so I'll choose the first reply, but again, thanks to both of you, Alejandro and Rila.

nitreb, Aug 3, 2012
