Get all pages of website

Discussion in 'PHP' started by astrazone, Oct 8, 2009.

  1. #1
    I am working on a script that gets all the pages of a website,
    like the directory submitters do with their smart spiders.

    Let's take youtube.com as an example:
    I need to get all the "watch?v=biscKgR2Drc" style links

    and put them into a database.
    Is there a simple way to do it?
     
    astrazone, Oct 8, 2009 IP
  2. caprichoso

    caprichoso Well-Known Member

    Messages:
    433
    Likes Received:
    10
    Best Answers:
    0
    Trophy Points:
    110
    #2
    Well, it's quite simple:

    1. Get the HTML for a given URL, usually the home page
    2. Parse the HTML looking for any kind of link
    3. For each link found, go back to step 1 with that link as the URL

    You can improve this algorithm by adding a check in step 3 for whether you have already parsed a given URL
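
    The three steps plus the "already parsed" check could be sketched roughly like this. It is only a sketch under some assumptions: the function names are made up for illustration, $root is passed without a trailing slash, every relative link is treated as relative to the site root, and the page fetcher is passed in as a callable so the demo can run against a fake in-memory site instead of the network.

```php
<?php
// Pull href values out of anchor tags with a deliberately simple regex.
function extractLinks(string $html): array
{
    preg_match_all('/<a\s[^>]*href=["\']([^"\'#]+)/i', $html, $matches);
    return $matches[1];
}

// Turn a link into a full URL on $root, or null if this sketch should skip it.
// Simplification: every relative link is treated as relative to the site root.
function resolveLink(string $link, string $root): ?string
{
    if (strpos($link, '//') === 0) {
        return null; // protocol-relative links are ignored here
    }
    if (preg_match('#^https?://#i', $link)) {
        $sameHost = parse_url($link, PHP_URL_HOST) === parse_url($root, PHP_URL_HOST);
        return $sameHost ? $link : null;
    }
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return null; // mailto:, javascript: and similar
    }
    return $root . '/' . ltrim($link, '/');
}

// Breadth-first crawl: fetch a page, collect its links, queue the unseen ones.
// $fetch is any callable taking a URL and returning the HTML or null.
function crawl(string $root, callable $fetch, int $maxPages = 100): array
{
    $queue   = [$root . '/'];
    $seen    = [$root . '/' => true]; // the "already parsed" control
    $fetched = 0;
    while ($queue && $fetched < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        $fetched++;
        if ($html === null) {
            continue; // page missing or unreadable
        }
        foreach (extractLinks($html) as $link) {
            $next = resolveLink($link, $root);
            if ($next !== null && !isset($seen[$next])) {
                $seen[$next] = true;
                $queue[]     = $next;
            }
        }
    }
    return array_keys($seen);
}

// Demo against a fake site held in an array (hypothetical URLs).
$pages = [
    'http://a.example/'         => '<a href="/one">1</a> <a href="two.html">2</a> <a href="http://other.example/x">x</a>',
    'http://a.example/one'      => '<a href="/">home</a> <a href="/three">3</a>',
    'http://a.example/two.html' => '<a href="mailto:me@a.example">mail</a>',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? null;
};
$found = crawl('http://a.example', $fetch);
print_r($found);
```

    For a real site the callable could wrap file_get_contents() or cURL and return null on failure. Note that this only finds pages that are linked from somewhere, and $maxPages is there so a big site does not keep the script running forever.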

    :)
     
    caprichoso, Oct 8, 2009 IP
  3. astrazone

    astrazone Member

    Messages:
    358
    Likes Received:
    5
    Best Answers:
    0
    Trophy Points:
    33
    #3
    Thanks.

    But I am not sure that I will get 'all' the pages.
    My other idea was searching for something and getting all the links,
    then going to the next page until I reach the last one.

    I am still looking for new ideas.
     
    astrazone, Oct 8, 2009 IP
  4. caprichoso

    #4
    I don't see how that differs from what I wrote before.

    When you say "searching for something" and "getting all the links", can you be more precise? How would you "search for something" in PHP?
     
    caprichoso, Oct 8, 2009 IP
  5. astrazone

    #5
    No, I mean in the site's search box,

    like example.com/index.php?search=something&page=1

    I get all the links and then change the &page= value to the next page.
     
    astrazone, Oct 9, 2009 IP
  6. caprichoso

    #6
    Oh, I see. But again, I don't see the logic behind that method.

    Let's say you search Google for youtube.com and parse all the HTML, crawling the resulting links. That won't get you all the links from youtube.com. In fact, Google doesn't index ALL URLs for a given site, and it never returns all indexed links for a single search term.

    Anyway, you can create a script which does that Google crawling you proposed. What would you use for feeding the search term? How many URLs do you reckon the script will crawl?

    Take into account that Google may ban your robot after a while.
     
    caprichoso, Oct 9, 2009 IP
  7. astrazone

    #7
    I never actually wanted to search Google or YouTube.
    I wanted to search a simple site with lots of useful information:
    arcades/blogs/video sharing/images/others.

    I know that Google can ban my IP for that. That's why I am not going to use Google for it.
     
    astrazone, Oct 9, 2009 IP
  8. caprichoso

    #8
    OK. Then you will go to a site, let's call it A.com, and take a look at its search box to find out how to interact with it. Once you have done that, you can build a robot which parses URLs from the search results. Again, you will need something like a list of search terms for your robot to look up.

    You will have to modify your robot on a per-site basis, unless you are dealing with sites using open source forums. In that case a single robot script should work on a set of sites sharing the same forum software.

    Your robot will need to:

    Load a search term list for looking into the site
    For each search term, POST/GET to the search URL with search term
    Parse results and put URLs into the local database
    Step to next search term
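
    A rough sketch of that loop, combined with the &page= stepping mentioned earlier in the thread. The assumptions here: the parameter names "search" and "page" are taken from the example URL above, a page past the last one is assumed to contain no links at all (a real site will need its own stop condition), and the function names and the fake site are made up for illustration. Storing into the database is left out; the result is just an array keyed by URL.

```php
<?php
// Grab every href on a results page. A real robot would narrow this down
// to the result links only, since each site marks them up differently.
function extractResultLinks(string $html): array
{
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $m);
    return $m[1];
}

// For each search term, walk page 1, 2, 3... until a page yields no links.
// $fetch is any callable taking a URL and returning the HTML or null.
function harvestSearch(string $searchUrl, array $terms, callable $fetch, int $maxPages = 50): array
{
    $found = [];
    foreach ($terms as $term) {
        for ($page = 1; $page <= $maxPages; $page++) {
            $query = http_build_query(['search' => $term, 'page' => $page], '', '&');
            $html  = $fetch($searchUrl . '?' . $query);
            $links = $html === null ? [] : extractResultLinks($html);
            if (!$links) {
                break; // assumed to mean we ran past the last page
            }
            foreach ($links as $link) {
                if (!isset($found[$link])) {
                    $found[$link] = $term; // remember the first term that found it
                }
            }
        }
    }
    return $found;
}

// Demo against a fake site held in an array (hypothetical URLs).
$fake = [
    'http://a.example/index.php?search=cats&page=1' => '<a href="view.php?id=1">a</a><a href="view.php?id=2">b</a>',
    'http://a.example/index.php?search=cats&page=2' => '<a href="view.php?id=3">c</a>',
    'http://a.example/index.php?search=dogs&page=1' => '<a href="view.php?id=2">b</a>',
];
$fetch = function (string $url) use ($fake) {
    return $fake[$url] ?? null;
};
$found = harvestSearch('http://a.example/index.php', ['cats', 'dogs'], $fetch);
print_r($found);
```

    Keying the array by URL means a link that shows up under two search terms is only kept once.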

    There is no simple way to do this. Still, it's not a challenge for an experienced PHP programmer.
     
    caprichoso, Oct 9, 2009 IP
  9. astrazone

    #9
    Actually, I took a deeper look at that website and found the big category list.

    It's just a huge list of search terms.

    It looks like this:

    <li><a href="index.php?q=x">x</a></li>
    <li><a href="index.php?q=y">y</a></li>
    <li><a href="index.php?q=z">z</a></li>

    So my robot will retrieve all the data from that page
    and insert it into a database.
    The second part of the bot will scan all the pages from the db and retrieve all the links in each category,
    so I will have everything categorized.
    I need to be careful with duplicate URLs.
    The third part is going to each link and getting the "swf" with the video.

    I hope that I am heading in the right direction.
    I can't say that I am a very experienced PHP programmer, but I can't see any problems. YET.

    Thanks for all the help.
     
    astrazone, Oct 9, 2009 IP
  10. caprichoso

    #10
    Well, in that case this gets much simpler.

    Parsing such HTML is easy. preg_match() should work fine (or preg_match_all() to grab every link on the page in one call). And you can add a UNIQUE index to the URL column in your database to disallow duplicates.
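
    Something along these lines, as a sketch. The sample HTML is the list from the post above with one duplicate added on purpose. The table and column names are made up, and the database part assumes an in-memory SQLite database through PDO just so it runs anywhere; it is skipped if pdo_sqlite is not loaded. On MySQL the equivalent statement is INSERT IGNORE rather than INSERT OR IGNORE.

```php
<?php
// Sample category list, with one deliberate duplicate at the end.
$html = implode("\n", [
    '<li><a href="index.php?q=x">x</a></li>',
    '<li><a href="index.php?q=y">y</a></li>',
    '<li><a href="index.php?q=z">z</a></li>',
    '<li><a href="index.php?q=x">x</a></li>',
]);

// Group 1 is the link target, group 2 is the visible category name.
preg_match_all(
    '#<li><a href="(index\.php\?q=[^"]+)">([^<]*)</a></li>#',
    $html,
    $rows,
    PREG_SET_ORDER
);

$categories = [];
foreach ($rows as $row) {
    $categories[] = ['url' => $row[1], 'name' => $row[2]];
}

// Let the UNIQUE index reject the duplicate instead of checking by hand.
$stored = null;
if (extension_loaded('pdo_sqlite')) {
    $db = new PDO('sqlite::memory:');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE categories (url TEXT NOT NULL UNIQUE, name TEXT NOT NULL)');
    $insert = $db->prepare('INSERT OR IGNORE INTO categories (url, name) VALUES (?, ?)');
    foreach ($categories as $category) {
        $insert->execute([$category['url'], $category['name']]);
    }
    $stored = (int) $db->query('SELECT COUNT(*) FROM categories')->fetchColumn();
}

print_r($categories);
var_dump($stored);
```

    The regex finds four rows, but only three end up in the table because the repeated URL hits the UNIQUE index and is ignored.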

    The owner of the site you are going to copy all content from will be amused. :)
     
    caprichoso, Oct 9, 2009 IP
  11. w47w47

    w47w47 Peon

    Messages:
    255
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #11
    I can make you such a PHP script, but for $$$. If you are interested, just PM me how much you are willing to pay.
     
    w47w47, Oct 9, 2009 IP
  12. astrazone

    #12
    lol, I know PHP. Stop offering your services here; this is not the right place. Try the BST forum.
     
    astrazone, Oct 9, 2009 IP
  13. g_bot

    g_bot Well-Known Member

    Messages:
    248
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    150
    #13
    That's how I do it. Let me know if you find something; I was also considering using sitemap.xml.
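
    If the target site publishes a sitemap.xml, reading the <loc> entries from it might look like the sketch below. The function name and sample XML are made up for illustration, and a regex is used only to keep the example free of extensions; simplexml_load_string() is the sturdier choice where it is available.

```php
<?php
// Return the unique URLs listed in <loc> elements of a sitemap document.
function sitemapUrls(string $xml): array
{
    preg_match_all('#<loc>\s*(.*?)\s*</loc>#is', $xml, $m);
    $urls = array_map(function (string $loc): string {
        // Sitemaps escape "&" as "&amp;", so decode XML entities.
        return html_entity_decode($loc, ENT_QUOTES | ENT_XML1, 'UTF-8');
    }, $m[1]);
    return array_values(array_unique($urls));
}

// Demo with a tiny hand-written sitemap (hypothetical URLs).
$xml = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
     . '<url><loc>http://a.example/</loc></url>'
     . '<url><loc>http://a.example/index.php?q=x&amp;page=2</loc></url>'
     . '<url><loc>http://a.example/</loc></url>'
     . '</urlset>';
$urls = sitemapUrls($xml);
print_r($urls);
```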
     
    g_bot, Oct 9, 2009 IP
  14. w47w47

    #14
    Just use preg_match_all; I think that is the easiest way to do it.
     
    w47w47, Oct 10, 2009 IP
  15. astrazone

    #15
    I am still a newbie in PHP so can someone please explain to me what this means?

    preg_match_all("|<[^>]+>(.*)</[^>]+>|U", "<b>example: </b><div align=left>this is a test</div>",$out, PREG_PATTERN_ORDER);

    mainly the smiley faces part : "|<[^>]+>(.*)</[^>]+>|U"
     
    astrazone, Oct 11, 2009 IP
  16. w47w47

    #16
    Didn't you say that you know PHP? :O

    Try searching Google for "regular expressions" (aka regex).

    Or maybe I will explain it to you tomorrow. I'm going to sleep now. :>
     
    w47w47, Oct 11, 2009 IP
  17. astrazone

    #17
    I know some parts well and other parts less well. I'm somewhere between a good programmer and a lazy one.
    I have a lot of experience with XML handling and files, but less with regex.

    I just don't like it when people ask for money in help forums.
     
    astrazone, Oct 11, 2009 IP
  18. caprichoso

    #18
    Regular expressions are something you learn by reading the reference. You have to read a bit, and it can be boring, but regex is a wonderful tool.
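
    For the pattern asked about in #15, the pieces read as in the comments below. The code is the same call from that post, just spread over several lines so each part can be annotated.

```php
<?php
// |...|     the delimiters around the pattern (here a pipe instead of the usual slash)
// <[^>]+>   an opening tag: "<", one or more characters that are not ">", then ">"
// (.*)      capture group 1: whatever sits between the two tags
// </[^>]+>  a closing tag: "</", one or more characters that are not ">", then ">"
// U         "ungreedy" modifier: quantifiers match as little as possible,
//           so each match stops at the first closing tag instead of the last
preg_match_all(
    "|<[^>]+>(.*)</[^>]+>|U",
    "<b>example: </b><div align=left>this is a test</div>",
    $out,
    PREG_PATTERN_ORDER
);

// With PREG_PATTERN_ORDER, $out[0] holds the full matches
// and $out[1] holds what group 1 captured for each match.
print_r($out);
```

    Here $out[0] ends up with the two complete elements and $out[1] with just their inner text, "example: " and "this is a test".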
     
    caprichoso, Oct 11, 2009 IP