so here's how just under 2 hours of my day went today

13:15 - came up with an idea for a web archiving website
13:24 - registered dayarchive.net and pointed it to the server
13:29 - knocked up a quick script to generate the site I wanted
13:41:19 - ran the script to generate the site
13:50:32 - site generated, started doing the css/templates
14:22:04 - completed the site templates and css
14:23:39 - regenerated the site.. site made
14:29 - added the site to google webmaster tools and submitted to google + yahoo add url
14:30 - pinged a few blog ping servers (quick sketch of that after this post)
14:30:07 - yahoo slurp visits (head call)
14:30:44 - moreoverbot visits
14:30:49 - moreoverbot grabs homepage
14:30:49 - moreoverbot grabs rss
14:30:54 - googlebot grabs rss
14:30:55 - googlebot grabs homepage
14:33:05 - first scraper comes in via google blog search
14:35:47 - googlebot grabs its first 2nd level page
14:37:14 - blogpulse live grabs rss + homepage
14:39:44 - technorati bot grabs rss + homepage
14:39:54 through 14:45:58 - googlebot grabs 82 pages and indexes them
14:44:11 - another scraper
14:52:53 - first visitor from google.com with a search for "Turista movie spoilers" (on the front page for this already)
14:56:59 through 15:06:36 - R6_feedFetcher harvests the site (meanwhile....)
15:00:15 - second real visitor comes in with the referrer blocked, but a real user
15:02:56 - third real visitor comes in from a google.com search for "lou mcfadden winesburg"
15:06:00 - second visitor views a second page
15:07:35 - fourth visitor lands from a google.com search for "julia louis-dreyfus bares all"

it's now another 2 hours later and the site's just hit 38 visits from google and my first $1 in adsense

i guess the lesson is (and this is purely to myself).. I've spent 8 months developing a "perfect" bh system and it's still not finished - in 1 hour I used scripts I'd made months ago to get a site into the top tens on 2/3 word phrases simply by keeping it simple (and weirdly with no seo, just neat free-for-all content) - I could have made 500 sites in those 8 months. I guess you just can't (and possibly shouldn't) automate everything - the human touch is what makes it work!

hope you're all having a good holiday!

blacknet
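for the curious, "pinged a few blog ping servers" is just the standard weblogUpdates XML-RPC call - a minimal sketch in python, where the server list and site details are placeholders rather than the exact ones used here:

```python
# Minimal sketch of the standard weblogUpdates.ping XML-RPC call.
# Server urls and site details are placeholders, not the exact ones used.
import xmlrpc.client

PING_SERVERS = [
    "http://rpc.pingomatic.com/",
    "http://rpc.weblogs.com/RPC2",
]

SITE_NAME = "dayarchive"
SITE_URL = "http://dayarchive.net/"

for server in PING_SERVERS:
    try:
        proxy = xmlrpc.client.ServerProxy(server)
        # weblogUpdates.ping(blog name, blog url) is the basic ping method
        result = proxy.weblogUpdates.ping(SITE_NAME, SITE_URL)
        print(server, result)
    except Exception as exc:
        print(server, "failed:", exc)
```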
first two lines mate...

13:15 - came up with an idea for a web archiving website
13:24 - registered dayarchive.net and pointed it to the server

registered.... ^^^^

going to add in the archiving in a mo and maybe monetize with something better than adsense! (however that's not my niche)
sure can.. it's not - I made it myself

update: traffic was going up and up - added in another level with another 380+ pages and now the big g has started de-indexing me, or should I say not re-crawling.. I'm currently trying a few things to see if I can force them to come back..

update 00:21 GMT: not sure what difference it made, but g didn't index or re-crawl - however in the last hour both msn and ask/teoma have done full spiders, with the initial entry point being the new url of the rss feed - yahoo also hit it 7 times over the past 45 minutes..

the change.. changed the "link"s in the rss to point to more rss feeds rather than actual pages..
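to make that last change concrete, the idea is that each rss item's link points at another (dated) feed url rather than at the article page itself - a rough sketch, with the urls and helper invented purely for illustration:

```python
# Rough sketch of the idea: each rss <item>'s <link> points at another
# (dated) feed url instead of at the actual article page.
# Urls and field names here are illustrative, not the real site's.
from xml.sax.saxutils import escape

def rss_item(title, day):
    feed_url = f"http://dayarchive.net/{day}.xml"   # dated feed, not the page
    return (
        "<item>"
        f"<title>{escape(title)}</title>"
        f"<link>{escape(feed_url)}</link>"
        f'<guid isPermaLink="false">{escape(feed_url)}#{escape(title)}</guid>'
        "</item>"
    )

print(rss_item("example hot term", "2007-12-27"))
```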
another update!

I decided that adding in the extra pages was a bit of overkill - and it seemed to be more of a negative, especially with over 1000 generated in 2 days (on a new domain).

Also discovered that gbot seemed to confuse itself with the daily feeds, constantly checking /2007-12-27.xml for updates rather than /rss (so watch out for that in your logs guys).

Further, I realised that although the rss was blog-like, the front page wasn't, and there was perhaps overkill on links (100+) - that whole link to word count ratio killed it I assume; as did the big g - mass de-indexing occurred (on blog search, and only a vague index started on the main serps).

Furthermore! the whole "update once a day" thing was doing me no favours at all.

So I've changed the whole site!! changes made:

page subjects are now gathered from multiple sources and checked to find "new trends" on the net - and actually verify them! the process is cron'd every 2 minutes
content retrieval happens every 2 minutes as well, with pages pre-generated
publishing of a single item occurs randomly sometime between 2 and 20 minutes (rough sketch of the idea after this post)
frontpage, rss and archives are all updated at every "publish"
the front page now lists the latest 25 in a blog stylee (as does the main rss) - with a full daily archive available

so.. how's it working.. well the changes have been live for just under an hour and:

the big g, technorati, moreover and spherebot have all hit the index and /rss on every publish
on the first publish g went looking at the old rss feed, so I 301'd it to the new rss
2nd update, g went to the correct rss and homepage
3rd update, the big g realised the site was changing and sent in the proper bot to do a full crawl of everything in the rss

I must stress this is all within 32 minutes of changing the site. and more to the point, G now classes it as a real site and not a blog, so blog search = nothing whilst the normal serps are http://www.google.com/search?hl=en&q=site:dayarchive.net&sa=N&tab=bw

you can see the changes by checking these two links:
new: http://dayarchive.net/
old: http://dayarchive.net/2007-12-28

if anybody wants any clarification just reply and I'll clarify whatever you want. hope nobody minds, just keeping a report here, if not just for my own benefit [keeps me focussed]
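the random publish timing is simpler than it sounds - one way to do it (this is an illustrative sketch only, not the actual scripts; the queue/state files and publish() helper are made-up names) is a job cron'd every 2 minutes that only publishes once a randomly chosen delay since the last publish has passed:

```python
# Illustrative sketch of "publish one item at a random interval between
# 2 and 20 minutes", driven by a job cron'd every 2 minutes.
# File names and the publish() helper are made up for the example.
import json, random, time
from pathlib import Path

STATE = Path("publish_state.json")   # remembers last publish + next delay
QUEUE = Path("queue")                # directory of pre-generated pages

def publish(item: Path) -> None:
    # placeholder: the real thing would move the pre-generated page into
    # the live site and regenerate frontpage/rss/archives
    print("publishing", item.name)
    item.unlink()

def run_once() -> None:
    state = json.loads(STATE.read_text()) if STATE.exists() else {"last": 0, "delay": 0}
    now = time.time()
    if now - state["last"] < state["delay"]:
        return  # not time yet
    items = sorted(QUEUE.glob("*.html"))
    if not items:
        return  # nothing ready to publish
    publish(items[0])
    # pick the next random gap: somewhere between 2 and 20 minutes
    state = {"last": now, "delay": random.randint(2 * 60, 20 * 60)}
    STATE.write_text(json.dumps(state))

if __name__ == "__main__":
    run_once()   # crontab: */2 * * * * python publish_tick.py
```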
Cool script and nice idea for a site. Some noob questions if you don't mind.... In the header you have a google ad but it doesn't say 'ads by...', is that a glitch? Is that allowed? If so how? Also, I'm curious as to where the feeds are coming from.
yeah sure thing:

the google ads are "referral" ads, text link only - much nicer, one thinks! they're also embedded in the text right near "more" to hopefully get some extra clicks in a legal manner.

the feeds, well they're not actually all feeds - it's a cross reference between a few yahoo api's, google apps, my own db's and some general rss feeds filtered by time. there are 700+ scripts working together to find the data, 3 to build the site, and 2 to display it lol.

edit: the pages displayed aren't actually any feeds, they're pre-generated static pages, which were gen'd when the new "hot term" is found - cron's great (rough sketch of that idea below)
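for anyone curious how the "pre-generate a static page when a hot term is found" part can hang together, here's a rough sketch - the term source, template and paths are all invented for illustration, nothing like the actual 700-odd scripts:

```python
# Rough illustration of pre-generating a static page the moment a new
# "hot term" shows up. The template and paths are invented for the
# example - the real system cross-references several sources.
import html
from pathlib import Path

SEEN = Path("seen_terms.txt")
QUEUE = Path("queue")           # the publish job picks pages up from here

TEMPLATE = """<html><head><title>{title}</title></head>
<body><h1>{title}</h1>{body}</body></html>"""

def already_seen(term: str) -> bool:
    return SEEN.exists() and term in SEEN.read_text().splitlines()

def pregenerate(term: str, snippets: list[str]) -> None:
    if already_seen(term):
        return
    body = "".join(f"<p>{html.escape(s)}</p>" for s in snippets)
    page = TEMPLATE.format(title=html.escape(term), body=body)
    QUEUE.mkdir(exist_ok=True)
    (QUEUE / f"{term.replace(' ', '-')}.html").write_text(page)
    with SEEN.open("a") as f:
        f.write(term + "\n")

# e.g. called from the cron'd trend checker:
pregenerate("example hot term", ["snippet one...", "snippet two..."])
```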
another update: the big g was only doing token index hits, so I changed to doing a proper rpc post ping to them (sketch below) - sure enough the big g came and spidered again, hitting the rss, getting the changes and spidering them. indexed within 2 minutes of the crawl - nice

update: 48 minutes later, and after another 5 updates and 3 g index hits, another crawl and index! something's working
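by "proper rpc post ping" I mean an actual XML-RPC POST that includes the feed url, rather than a simple get-style ping. A sketch of what that looks like - the endpoint and extendedPing signature below are the commonly used ones from that era, so treat them as assumptions rather than the exact call used here:

```python
# Sketch of a "proper" XML-RPC POST ping, including the feed url via
# weblogUpdates.extendedPing. The endpoint below is the commonly used
# blog search ping url from that era - treat it as an assumption.
import xmlrpc.client

ENDPOINT = "http://blogsearch.google.com/ping/RPC2"

proxy = xmlrpc.client.ServerProxy(ENDPOINT)
result = proxy.weblogUpdates.extendedPing(
    "dayarchive",                     # site name
    "http://dayarchive.net/",         # site url
    "http://dayarchive.net/",         # url of the page that changed
    "http://dayarchive.net/rss",      # feed url
)
print(result)
```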
update: now the big g is crawling and indexing every page within 30 seconds of publish. joy - cracked it?
Cool. Thanks for the answers. Does your script have the ability to filter more specific results rather than just the search trends?
Well yes, the scripts collectively can do pretty much anything, it's just a matter of putting them together in a way that works for what you want.

Bet I don't, and if I do, I bet the site's back in within 24 hours - I've already been down the delisting route (within 36 hours), counteracted it, and this is three days later!

edit: that's a bit big headed actually, I hope I don't get de-listed, and will do everything I can to prevent and counter it; at the end of the day it's out of my total control though *shrugs*

Update

well I just left the system to work away by itself overnight, and sure enough google's been hitting on every publish, and doing a full update every 2-4 publishes (roughly every 45 minutes), same with moreover, sphere scout, technorati etc. google's now got all 169 published pages indexed and is passing traffic through quite frequently. Should be an interesting day today

unique visitors (not inc spiders or myself):
Fri 12/28/2007: 169
Sat 12/29/2007: 77
Sun 12/30/2007: 233 (so far)

remember, friday was launch one, which got listed great then delisted over saturday; saturday late on through to sunday is the new method.

edit update

on closer inspection I found that the big g hadn't actually done a full index for over 2.5 hours, so I got my head to thinking why - the final reasons were:
1: the site had been running out of content to publish, thus publishing less frequently
2: I was using a loop to generate the index page, which was making it take 1.8 seconds to generate

actions taken:
1: broadened the filters to allow an extra 3 phrases per 5 minutes to be checked and verified - this means there's always something to publish, without overkilling it and making it too obvious where the db sources are coming from and getting done for dup content
2: made a script to pre-publish the index page on every site update, so it's now static html (with a little scripting twist) - a rough sketch of the idea is below
3: also removed some google ads, as they were killing load times in normal browsers

results
the big g is back to crawling and indexing every 40 (rough) minutes (every third publish)
traffic's had a major boost as well thanks to a few long tails
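the "pre-publish the index page" fix is just rendering the homepage once at publish time instead of looping over the db on every hit - a minimal sketch, reusing the same made-up queue/published layout as the earlier sketches:

```python
# Minimal sketch of pre-rendering the index page to static html at publish
# time, rather than building it with a loop on every request.
# Paths and the item layout are invented for the example.
import html
from pathlib import Path

PUBLISHED = Path("published")        # one .html file per published item
INDEX = Path("htdocs/index.html")    # served directly as static html

def rebuild_index(limit: int = 25) -> None:
    # newest first, latest 25 in a blog-style list on the front page
    items = sorted(PUBLISHED.glob("*.html"),
                   key=lambda p: p.stat().st_mtime, reverse=True)[:limit]
    links = "\n".join(
        f'<li><a href="/{p.name}">{html.escape(p.stem.replace("-", " "))}</a></li>'
        for p in items
    )
    INDEX.write_text(f"<html><body><ul>\n{links}\n</ul></body></html>")

# called once at the end of every publish, so page load is just a static file
rebuild_index()
```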
Another Day; Day 4

Had moments of paranoia over the past 24, mainly because I noticed that with all the changes to the system I'd managed to gain myself a few repeat pages; coupled with the fact I've been monitoring logs line by line to see exactly what's going on, and the reaction to each change.

To cut it short, gbot was hitting 3 day old pages that were almost exact duplicates of pages the system had just published about 20 minutes earlier. After g did this 3 times in a row, and the site's indexed page count dropped by 3, I feared the worst; reacted and deleted 630 rss feeds and mod_rewrite'd a quick fix to 404 everything prior to the change. An hour later and gbot hadn't appeared back - oh hell - checked the serps and the indexed pages count had gone up by 230 (with the old pages) - waited another hour and g was both frequently crawling the new pages and doing a slow cache-forming crawl of the old pages (all getting 404'd). I took a gamble, removed the rules to allow all the old content back, and removed a couple of the duplicates manually; seems to have paid off! Really though, either possible action was a gamble..

Other than that, yesterday was all about monitoring and making sure things are going as planned; as this is only a tiny practical test of something far bigger that's been months (years) in the making.

Stats Update (unique visitors, spiders and myself removed):
28/12/2007 169
29/12/2007 77
30/12/2007 336
31/12/2007 184 (so far, it's early..)

pages in g: 327
latest page in index: 22 minutes ago

happy new year
Well done, blacknet! There's a lot to be said for spontaneity, for just going out there and implementing an idea the minute it occurs. Good for you. Wishing you continued success with it.
many thanks Isaa

a quick update: the site just hit the 290 unique mark for the day a few minutes ago, and the system has published 113 articles today with a further 115 articles ready to publish. This would indicate to me that it could be producing content twice as quickly as it is.

here's a new year's gamble then: let's double up the publish speed and see what happens with the big g.. in fact, perhaps a vague increase, gradually ramping up over two days, would be better, and some alternating article posters..? I'll let you know what I decide and how it pans out.. 336 uniques to beat!

ps: slow delisting is happening - which is GREAT!!! as the site is being de-listed as a blog, and listed as a "real" site in the proper serps instead - I've managed to get it to flick between the two twice, so I think I've finally figured out what the "technical" difference between a blog and a "site" is (as far as g is concerned, at this time) - probably already out of date :'(
New Year Update!

First Off.. HAPPY NEW YEAR GUYS - it's 1AM here in the uk

Stats: I'd got my figures wrong previously, I'd been generating reports with the time offset wrong (by an hour) *doh* - so here's the daily update and the correct figures:
28/12/2007 177
29/12/2007 69
30/12/2007 350
31/12/2007 413

that's all uniques with my own ip's and bots removed. only source of traffic is serps - fully automated system.

well that's it! it's going well.. also managed to get me a nice little domain "usyou.com" for free thanks to register.com of all people (and it has backlinks in yahoo and gblogs + webmaster info in google webmaster tools). the best bit i guess is the phrase "us you" has 580,000,000 pages in g, and thankfully that means I don't need to seo for anything, seeing as almost every paragraph of text ever written will have the words "us" and "you" in there

so hopefully by next week we'll be on round 2

ps: the goal of all this: I want at least half of cnet's "tv.com" traffic, if not all of it
Quick Update.. the site's been left to its own devices, all traffic is organic and coming in nicely.. ads are being clicked!

stats are as follows:
28/12/2007 177
29/12/2007 69
30/12/2007 350
31/12/2007 413
01/01/2008 399
02/01/2008 360
03/01/2008 374
04/01/2008 494
05/01/2008 498
06/01/2008 514