Problems sitemapping a large site.

Discussion in 'Google Sitemaps' started by LaCabra, Nov 8, 2005.

  1. #1
    Hi folks,
    Well, I've been trying to sitemap a large site for several days now, and for some reason I can never get the entire site mapped. I have tried many of the free third-party products listed on the Google Sitemaps page and am unable to get a complete sitemap. The PC-based applications hang once they reach a certain number of pages. The web-based ones are very limited, designed for small sites, and again no go. I wouldn't mind paying for a PC-based application, but after these experiences I am a little doubtful that any of them will work. If any of you want to try, I would greatly appreciate hearing whether you succeed and which app you used. Of course I would like the entire site mapped. The site in question is www.leatherpages.com, a leather industry portal. I don't want to scare anyone, but there should be around 100,000+ mappable pages. :eek:

    cheers
    Frank
     
    LaCabra, Nov 8, 2005 IP
  2. davert

    davert Banned

    #2
    That includes the Python one?
     
    davert, Nov 8, 2005 IP
  3. Michael

    Michael Raider

    #3

    It is normally better to run a sitemap generator on your server for exactly that reason, but have you tried GSiteCrawler?

    - Michael
     
    Michael, Nov 8, 2005 IP
  4. LaCabra

    LaCabra Goats R Us

    #4
    My site is an ASP site, and my ISP is a pain when it comes to installing other apps. GSiteCrawler hung at around 7,500 pages.
     
    LaCabra, Nov 8, 2005 IP
  5. hans

    hans Well-Known Member

    #5
    Hi all

    i had the very same problems with my site and google sitemaps
    especially when sitemapping my dynamic pages
    ... of course i tried just about everything on the market
    all the tools listed anywhere

    python
    google sitemapper
    php
    java
    online and offline
    remote or installed
    all failed

    then one day i convinced a friend to adopt the rather orphaned webalizer and to bring the code up to date ...
    and to add a new feature right away - a google sitemapper

    the new access_log analyzing tool is now called Angolizer - it has its own new project page
    and produces the best google sitemap so far - any number of pages, dynamic or static
    it also reads the 1AND1.com modified logfile format as well as the common logfile format

    however of course

    you have to run Angolizer on logfiles - offline or online - to create the google sitemap
    ( i know that another tool - maybe the python one - also does the same - but it sucks up so many resources that my host killed the application online
    and offline it simply used too much to be of use )

    Angolizer is written in C and fast

    recently i ran it on a x86_64 turion ( Acer Ferrari 4005WMLi ) over my current year's accumulated logfiles, stats INCLUDING sitemaps
    for some 25 million lines it took about 14 minutes on an otherwise busy system
    hence it is fast enough for any heavy duty pro use

    assuming that every page has been visited at least once
    all useful URLs are logged and hence also can be sitemapped - dynamic OR static never mind !!

    a URL-exclude list lets you configure URLs to be excluded from the sitemap
    and a URL-list can be viewed for manual corrections if needed
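
The log-based approach described above can be sketched roughly as follows. This is not Angolizer's code (that tool is written in C); it is a minimal Python illustration that assumes Common Log Format input, treats only status-200 GET/HEAD requests as mappable, and uses the sitemaps.org 0.9 namespace. The names `urls_from_log`, `build_sitemap`, and `exclude_prefixes` are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Pulls the requested path and the status code out of a Common Log Format line, e.g.
# 1.2.3.4 - - [09/Nov/2005:10:00:00 +0000] "GET /page.asp?id=1 HTTP/1.1" 200 512
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3})\b')


def urls_from_log(lines, exclude_prefixes=()):
    """Return the sorted, de-duplicated paths that were served with status 200.

    exclude_prefixes plays the role of the URL-exclude list (assumption:
    simple prefix matching is enough for this illustration).
    """
    seen = set()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a GET/HEAD request, or not a parseable line
        path, status = match.group(1), match.group(2)
        if status != "200":
            continue  # skip redirects, errors, missing pages
        if any(path.startswith(prefix) for prefix in exclude_prefixes):
            continue
        seen.add(path)
    return sorted(seen)


def build_sitemap(base_url, paths):
    """Render the paths as a sitemap document, XML-escaping each URL."""
    rows = [f"  <url><loc>{escape(base_url + p)}</loc></url>" for p in paths]
    return "\n".join(
        [
            '<?xml version="1.0" encoding="UTF-8"?>',
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
            *rows,
            "</urlset>",
        ]
    ) + "\n"
```

Because `urls_from_log` takes any iterable of lines, an open file object can be passed in directly, so a large logfile is read line by line and only the set of distinct paths stays in memory.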

    the sitemap can be automatically stored to a path for instant upload or availability ( online or offline )
    the sitemap creates code that is validated by google

    maybe give it a try ...

    the statistics as well have improvements compared to webalizer
    more bots listed
    more query strings analyzed
    country resolution using GeoIP
    validated HTML output
    better colors for easier reading

    I expect some additional nice new features to come out soon
    the possibilities are manifold

    the Angolizer compiles currently on Linux X86 and X86_64

    have fun
    God bless
     
    hans, Nov 9, 2005 IP
  6. LaCabra

    LaCabra Goats R Us

    #6
    wheeewww thanks for the great info!
     
    LaCabra, Nov 9, 2005 IP
  7. Mister Tut

    Mister Tut Guest

    #7
    I was using the free php tool from, I think, enarion? Anyway, it did get most pages, but sucked up so many cpu cycles, my host started threatening me with a plan bump.
     
    Mister Tut, Nov 9, 2005 IP
  8. LaCabra

    LaCabra Goats R Us

    #8
    Thanks for trying, Mister Tut! It's a big site!
     
    LaCabra, Nov 9, 2005 IP
  9. chachi

    chachi The other Jason

    #9
    Yeah, we have been toying with our own setup now and using a real crawler (as that seemed to make the most sense from a "visibility" standpoint). What we found is that the crawlers end up eating a crapload of memory as they get deeper into the bigger sites. Death comes anywhere from 5k to 10k pages. The only reason we started on this endeavor is because the ones that are out there are tremendous pieces of crap. All I can say is that I have a hell of a lot more respect for what the big search engines do now. :)
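
The thread does not diagnose why these crawlers run out of memory. One plausible cause (an assumption) is that they hold every visited URL and every result in memory until the crawl ends. The sketch below shows one way to keep the footprint smaller: remember only a 16-byte digest per visited URL and write each URL to the output as soon as it is reached. `crawl` and `fetch_links` are hypothetical names, and fetching and HTML parsing are left to the caller.

```python
import hashlib
from collections import deque


def crawl(start_url, fetch_links, out, max_pages=None):
    """Breadth-first crawl that streams visited URLs to `out`.

    fetch_links(url) is a caller-supplied callable (hypothetical) returning
    an iterable of absolute same-site URLs found on that page.
    Returns the number of URLs written.
    """

    def key(url):
        # Fixed-size digest instead of the full URL string in the visited set.
        return hashlib.blake2b(url.encode("utf-8"), digest_size=16).digest()

    seen = {key(start_url)}
    queue = deque([start_url])  # pending URLs; still stored in full
    count = 0
    while queue:
        url = queue.popleft()
        out.write(url + "\n")  # stream the result instead of accumulating it
        count += 1
        if max_pages is not None and count >= max_pages:
            break
        for link in fetch_links(url):
            k = key(link)
            if k not in seen:
                seen.add(k)
                queue.append(link)
    return count
```

The pending queue can still grow large on a wide site, so this only reduces memory use rather than bounding it. A crawler meant for 100,000+ pages would likely also need to spill the queue to disk.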
     
    chachi, Nov 9, 2005 IP