What about code to crawl web page and get keywords/summary?

Discussion in 'Programming' started by was2_0, Sep 13, 2006.

  1. #1
It should return a simple summary (title/description), in any programming language, ... it's more like a text processor/formatter.

    Is it possible? Thanks!
     
    was2_0, Sep 13, 2006 IP
  2. alemcherry

    alemcherry Guest

    #2
Possible? Definitely.
    You may be able to pick up some PHP script and do some customizations to fit your needs. Lots of similar PHP scripts are freely available.
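For a sense of what such a script does, here is a minimal title/meta-description extractor (JavaScript for illustration; the freely available scripts mentioned here are mostly PHP):

```javascript
// Minimal sketch: given raw HTML, pull out the <title> and the
// meta description as a crude page summary. Illustrative only.
function extractSummary(html) {
  const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  const descMatch = html.match(
    /<meta\s+name=["']description["']\s+content=["']([^"']*)["']/i
  );
  return {
    title: titleMatch ? titleMatch[1].trim() : '',
    description: descMatch ? descMatch[1].trim() : '',
  };
}
```

Pair it with whatever download step your language offers (file_get_contents or cURL in PHP, for instance).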
     
    alemcherry, Sep 13, 2006 IP
  3. drewbe121212

    drewbe121212 Well-Known Member

    #3
Yeah, though I wouldn't recommend PHP for this. It is too slow for this sort of application.
     
    drewbe121212, Sep 13, 2006 IP
  4. Mrblogs

    Mrblogs Peon

    #4
Do you want to do this server-side or client-side?

I.e., do you want a web spider, or a script that you can run on your server?
     
    Mrblogs, Sep 13, 2006 IP
  5. wmtips

    wmtips Well-Known Member

    #5
I can't agree. Yes, PHP is slower than compiled applications, but it's server-side and easier to implement. As a test, I added time counters to Keyword Density Analyzer. The results for the test URL http://www.w3c.org (~35 KB):
Download time 1.0035 sec., processing time 1.5908 sec.
I think that is not bad for a scripting language processing 35 KB of text. Of course the implementation depends on your goals, so if performance is critical you need a compiled language. But PHP is still good for simple tasks like this.
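The counting step behind a tool like that is cheap to sketch (JavaScript for illustration; this is not wmtips's actual analyzer code):

```javascript
// Rough keyword-density pass: tally each word and report its share of
// the total word count. Illustrative sketch, not the analyzer's code.
function keywordDensity(text, topN = 10) {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  const counts = new Map();
  for (const w of words) counts.set(w, (counts.get(w) || 0) + 1);
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([word, count]) => ({ word, count, density: count / words.length }));
}
```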
     
    wmtips, Sep 13, 2006 IP
  6. was2_0

    was2_0 Guest

    #6
Client-side would be very good! That costs my server nothing!

Server-side is also good, but it should not cost a lot of my server's CPU, otherwise...

Is there any JavaScript, PHP or C/C++ code available? I have not found any...

    Thanks!
     
    was2_0, Sep 14, 2006 IP
  7. drewbe121212

    drewbe121212 Well-Known Member

    #7
    I wrote a spider in PHP :)

While downloading individual pages to retrieve the text/links was not a problem at all, the recurring cycle of finding links, adding them to the unvisited list, and continuing on was what caused problems.

I wrote it using cURL/sockets, and was a little disappointed in it. Had I written it in Perl the same way, I would have seen much better performance.


    With that being said, I still love PHP as it is my baby ;) Just not very good for this type of thing! :)
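The unvisited-list loop described above can be sketched like this (JavaScript for illustration; drewbe's actual spider was PHP with cURL/sockets):

```javascript
// Sketch of the crawl loop: an unvisited queue plus a visited set.
// `fetchPage` stands in for the download step (cURL, sockets, ...),
// so the traversal logic can be seen on its own.
function crawl(seedUrl, fetchPage, maxPages = 100) {
  const visited = new Set();
  const queue = [seedUrl];
  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);
    const html = fetchPage(url);
    if (!html) continue; // download failed or empty page
    // Harvest href values and enqueue any we have not seen yet.
    for (const m of html.matchAll(/<a\s+[^>]*href=["']([^"']+)["']/gi)) {
      if (!visited.has(m[1])) queue.push(m[1]);
    }
  }
  return [...visited];
}
```

The visited set is what keeps the frontier from growing without bound; maintaining it (and deduplicating the queue) is exactly the part that gets expensive as the crawl grows.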
     
    drewbe121212, Sep 16, 2006 IP
  8. DrMalloc

    DrMalloc Peon

    #8
    DrMalloc, Sep 20, 2006 IP