GSiteCrawler

Discussion in 'Google Sitemaps' started by CalBoy, Oct 20, 2005.

  1. #1
    Can anyone share their experience with GSiteCrawler?
     
    CalBoy, Oct 20, 2005 IP
  2. Deano

    Deano Sail away with me.

    Messages:
    890
    Likes Received:
    41
    Best Answers:
    0
    Trophy Points:
    0
    #2
    I use GSiteCrawler; it is simple to use, gives good results and lots of stats, and it's free. Go for it.
     
    Deano, Oct 22, 2005 IP
  3. SportsOutlaw

    SportsOutlaw Active Member

    Messages:
    952
    Likes Received:
    37
    Best Answers:
    0
    Trophy Points:
    70
    #3
    I really hope it is worth it. The damn thing has been crawling my site all day building the site map.

    I am now at "Records waiting: 48,200" and that number keeps getting bigger, whatever that means.
     
    SportsOutlaw, Oct 23, 2005 IP
  4. mdvaldosta

    mdvaldosta Peon

    Messages:
    4,079
    Likes Received:
    362
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Can anyone explain what this is or does, or share a link?
     
    mdvaldosta, Oct 23, 2005 IP
  5. SportsOutlaw

    SportsOutlaw Active Member

    Messages:
    952
    Likes Received:
    37
    Best Answers:
    0
    Trophy Points:
    70
    #5
    http://johannesmueller.com/gs/

    After hearing a lot of positive feedback, I figured I would give it a try. Right now I am simply tired of watching it spider my site and just want it to be done.
     
    SportsOutlaw, Oct 23, 2005 IP
  6. sgtsloth

    sgtsloth Peon

    Messages:
    205
    Likes Received:
    7
    Best Answers:
    0
    Trophy Points:
    0
    #6
    It generates Google Sitemaps.

    And yes, SportsOutlaw, it can take a while. But I don't know how its speed compares to other sitemap generators out there.
     
    sgtsloth, Oct 23, 2005 IP
  7. kniveswood

    kniveswood Well-Known Member

    Messages:
    764
    Likes Received:
    29
    Best Answers:
    0
    Trophy Points:
    120
    #7
    I like it for its ease of use too. It can crawl your site, ftp your sitemap, and notify Google with a ping. Can't vouch for its speed though, as my site is fairly small.
     
    kniveswood, Oct 23, 2005 IP
  8. softplus

    softplus Peon

    Messages:
    79
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #8
    Usually, if it seems to take much too long, you might be crawling too much. Many sites offer lots of ways to navigate through them, which often lead to the same pages under different URLs (the worst case being session-ids in the parameters). If your site should be much smaller but you see the GSiteCrawler working on lots and lots of URLs, you might want to pause the crawlers and check the data they have been returning in the URLs table. Often you will see that you "should have" set up some filters to make sure it only grabs the URLs you actually want.
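
    As a rough illustration of what such a filter does (just a Python sketch of the idea, not the actual GSiteCrawler code; the parameter names and URLs are made up), stripping the session parameters makes the duplicates collapse into a single URL:

    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    # Hypothetical session parameter names -- use whatever your site actually appends.
    SESSION_PARAMS = {"sid", "PHPSESSID"}

    def canonicalize(url):
        """Drop known session parameters so duplicate URLs collapse into one."""
        parts = urlparse(url)
        query = [(k, v) for k, v in parse_qsl(parts.query) if k not in SESSION_PARAMS]
        return urlunparse(parts._replace(query=urlencode(query)))

    urls = [
        "http://www.example.com/page.php?id=5&sid=abc123",
        "http://www.example.com/page.php?id=5&sid=def456",  # same page, different session
    ]
    print({canonicalize(u) for u in urls})  # only one canonical URL remains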

    On the other hand, keep in mind that search engines work similarly: if your site generates a massive tangle of URLs leading to the same pages, the search engine bots are probably getting tied up in it as well. That would be a bad thing and something you should try to clean up as fast as possible. It can also easily trigger Google's "duplicate content" filtering, which has become a bit of a killer lately.

    One other thing you can do, if the crawl seems to keep growing in the GSiteCrawler, is to stop the program, run the database compressor (dbcompress.exe in the same folder), then restart the program and the crawlers. It won't miss anything; you can do that as often as you want. The database grows very fast and will slow things down if it gets too large (I'm working on a version with larger databases).

    The crawlers are actually pretty fast, but of course nothing like full line speed, as they have lots of work to do. If you have a network, you can set up a shared GSiteCrawler installation and start it on several PCs at the same time to speed it up (however, the database will end up slowing things down at some point again).

    Hope it helps!
     
    softplus, Oct 24, 2005 IP
  9. Pootwan

    Pootwan Active Member

    Messages:
    153
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    53
    #9
    Do you really have such a huge number of static pages (as in .html files sitting on a file system)? Maybe not -- if it is a database-generated site, you might want to consider creating the XML directly from your database. That would be more efficient and save you some work (and bandwidth).
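
    For instance, something along these lines would write the sitemap straight from a query -- a minimal Python sketch, where the database file, the table and column names, and the domain are all hypothetical:

    import sqlite3
    from xml.sax.saxutils import escape

    # Hypothetical schema: a "pages" table with a URL slug and a last-modified date (YYYY-MM-DD).
    conn = sqlite3.connect("site.db")
    rows = conn.execute("SELECT slug, updated FROM pages")

    entries = []
    for slug, updated in rows:
        loc = escape("http://www.example.com/" + slug)
        entries.append(f"  <url>\n    <loc>{loc}</loc>\n    <lastmod>{updated}</lastmod>\n  </url>")

    # Standard sitemap namespace (see sitemaps.org for the full protocol).
    sitemap = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries)
        + "\n</urlset>\n"
    )

    with open("sitemap.xml", "w") as f:
        f.write(sitemap)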

    Pootwan
     
    Pootwan, Oct 26, 2005 IP
  10. srijit

    srijit Peon

    Messages:
    75
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    0
    #10
    Hey, that looks nice. I was having no luck running Google Sitemaps. Will definitely try this. Thanks a lot :)
     
    srijit, Oct 29, 2005 IP
  11. Fishing Forum

    Fishing Forum Active Member

    Messages:
    537
    Likes Received:
    21
    Best Answers:
    0
    Trophy Points:
    60
    #11
    Thanks for the tool, as I've been looking for one (most online ones only go up to a maximum of 750 URLs).

    Must admit it takes a long time to make one, as I started using it today and 2 hrs later it is still going on a smallish site. So I think this is a click-start-and-go-to-bed (or to work) kind of tool.

    Can anyone say how Google likes the maps this makes, i.e. full indexing etc.?
     
    Fishing Forum, Oct 31, 2005 IP
  12. scottj

    scottj Peon

    Messages:
    168
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    0
    #12
    I tried this one. It crawled something like 150,000 pages on my site before I finally killed it. It seems to get the job done, but if you have a large site, beware.
     
    scottj, Nov 10, 2005 IP
  13. NosferatusCoffin

    NosferatusCoffin Active Member

    Messages:
    49
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    91
    #13
    GSite is a great program. However, there is one caveat that was mentioned earlier: if you are using it to create a map of a BB that uses session-ids, it will spider on forever, creating hundreds of thousands of links, many of them going to the same page via many different paths.

    One of my BBs uses Ikonboard, and when I have the session filter enabled, it will crawl the correct URLs; but without the session-id appended to the URL, they only lead to the forum's home page (i.e. http://www.domain/cgi-bin/ib/ikonboard.cgi?f=1&t=1 will not go to the forum 1/topic 1 page, but just to the forum's home page).

    Granted, this seems to be an Ikonboard issue and not a GSite one, as Ikonboard seems to demand that the session-id be embedded within the URL in order for the browser to go to the correct page.

    For all other sites I have used it for, it works very well. Even for one of my client's sites, which runs WordPress and has over 2,000 pages, it crawled the whole site properly in less than 45 minutes. So I would highly recommend it, except for sites that require a session parameter.
     
    NosferatusCoffin, Jan 17, 2007 IP
  14. softplus

    softplus Peon

    Messages:
    79
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #14
    Yep, session-ids are a problem... but not just with the GSiteCrawler - they will kill just about any crawler, including those from the search engines. It makes a lot of sense to get rid of the session-ids for search engines, and you will need to at the very least filter them out when using the GSiteCrawler. If your board removes session-ids for certain user-agents, you can just set the user-agent in the GSiteCrawler (File / Global Options) to one that is on your list. Session-IDs should never be indexed; they can cause all sorts of problems.
     
    softplus, Jan 17, 2007 IP
  15. NosferatusCoffin

    NosferatusCoffin Active Member

    Messages:
    49
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    91
    #15
    Right. I filter out the session-ids when I set up the site for GSite indexing. It will crawl and report correctly. Unfortunately, stupid Ikonboard requires the session-id to be in the actual URL in order to actually go to the desired post page, as opposed to just going to the BB's home page.

    I am thinking of just junking Ikonboard and going with another CMS, probably Joomla, since that has finally matured to the point that I am not getting a Secunia security alert every other day. And GSite will crawl such a CMS properly. The main problem is either finding or writing an Ikonboard-to-Joomla DB converter.
     
    NosferatusCoffin, Jan 17, 2007 IP
  16. softplus

    softplus Peon

    Messages:
    79
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #16
    Can you PM me your URL so that I can take a quick look? How do the search engines index the site?
     
    softplus, Jan 17, 2007 IP
  17. NosferatusCoffin

    NosferatusCoffin Active Member

    Messages:
    49
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    91
    #17
    The URL is:
    http://www.asmallcornerofsanity.com/

    As I mentioned, filtering is enabled. The problem is that the session-id tag that Ikonboard uses is "s=", which is too open-ended.

    For example, to view a post within a thread, Ikonboard will return it as:
    http://www.asmallcornerofsanity.com/cgi-bin/ib/ikonboard.cgi?;act=ST;f=6;t=794

    I went to "Remove Parameters" and see that it had "s" as one the session-id tags as default filter. I changed that to "s=", as Remove Parameter is supposed to strip that portion out and should return:
    http://www.asmallcornerofsanity.com/cgi-bin/ib/ikonboard.cgi?act=ST;f=6;t=794 which will take you to the actual post within the thread as opposed to the BB's main home page.

    When "s" or "s=" is enabled within the filter, the crawler returns next to nothing. When they are removed, the index returns well over 300,000 URLs before it is killed. So, I am sort of damned if I do and damned if I don't remove the session-id or vice-versa.

    The only thing I can think of is that it might be stumbling over the ";" and then not returning the URL at all.
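
    Here is a rough Python sketch of what I suspect is going on (just my guess at the logic, not GSiteCrawler's actual code, and the session value is made up): a filter keyed on "s" can only match if the query string is split on ";" as well as "&".

    # Ikonboard-style query string: parameters separated by ";" instead of "&"
    query = "s=abc123;act=ST;f=6;t=794"

    def params(qs, separators):
        """Split a query string on the given separators and return (name, value) pairs."""
        parts = [qs]
        for sep in separators:
            parts = [piece for chunk in parts for piece in chunk.split(sep)]
        return [tuple(p.split("=", 1)) for p in parts if "=" in p]

    # Splitting only on "&" sees one giant parameter, so a filter looking for "s" never matches:
    print(params(query, ["&"]))       # [('s', 'abc123;act=ST;f=6;t=794')]

    # Splitting on ";" as well isolates "s", so it could be stripped cleanly:
    print(params(query, ["&", ";"]))  # [('s', 'abc123'), ('act', 'ST'), ('f', '6'), ('t', '794')]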

    As for how Google etc. index the site: very little, as the site is not indexed properly via their sitemap generators or GSite. I have a few thousand posts on the site and none of them get indexed. This is why I am thinking of switching to Joomla, Mambo or some other CMS that will index properly.
     
    NosferatusCoffin, Jan 17, 2007 IP
  18. vwdforum

    vwdforum Well-Known Member

    Messages:
    782
    Likes Received:
    12
    Best Answers:
    0
    Trophy Points:
    110
    #18
    I'm crawling a vBulletin board at the moment and it's taking ages!

    I've set up the following as banned URLs to try and reduce it:

    /forum/member.php
    /forum/calendar.php
    /forum/search.php



    I seem to be getting a load of cron.php URLs; anyone know what that is?

    Anyone able to tell me what not to crawl on a forum?

    Thanks
     
    vwdforum, Jan 24, 2007 IP
  19. sukantab

    sukantab Well-Known Member

    Messages:
    2,075
    Likes Received:
    49
    Best Answers:
    0
    Trophy Points:
    110
    #19
    I always prefer xml-sitemaps.com for making sitemaps...
    They are trustworthy and they come up first on Google...
     
    sukantab, Jan 24, 2007 IP
  20. softplus

    softplus Peon

    Messages:
    79
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #20
    NosferatusCoffin, those kinds of URLs will be problematic, since you can't remove the session ID using normal methods. Even if the GSiteCrawler were to remove the session IDs, Google would find them and index those URLs anyway. What you might try is a bit of URL rewriting with mod_rewrite to get the session-IDs filtered out when a bot visits (including the GSiteCrawler). You should really work on getting the session-IDs removed if you want to be indexed...

    vwdforum, those URLs look like good candidates for filtering; you might also want to block them in your robots.txt. If a crawler reaches a "cron.php", trace back how it got there - in general you should not have any cron scripts linked on your site, and you can block those in robots.txt as well.
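
    If you want to sanity-check what a well-behaved bot would skip, a quick Python script like this works (the Disallow lines are only an example for a typical vBulletin setup and the URLs are placeholders -- adjust them for your own board):

    from urllib import robotparser

    # Example robots.txt rules for a vBulletin-style forum.
    rules = """
    User-agent: *
    Disallow: /forum/member.php
    Disallow: /forum/calendar.php
    Disallow: /forum/search.php
    Disallow: /forum/cron.php
    """.splitlines()

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    for url in ["http://www.example.com/forum/showthread.php?t=123",
                "http://www.example.com/forum/cron.php?rand=456"]:
        print(url, "->", "crawl" if rp.can_fetch("*", url) else "blocked")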

    Keep in mind that any links like that which the GSiteCrawler finds could (and usually will) also be found by all the other search engine crawlers. It makes a lot of sense to check your site to see what is crawled and what is (safely) ignored. By minimizing unnecessary URLs you can concentrate your site's value on the important URLs, instead of having the crawlers go all sorts of places where you don't want them to go (like endless calendar scripts...).
     
    softplus, Jan 24, 2007 IP