Hi everyone, I want to share my insights about the data mining business. From the start it has been a very strange area for me. I've been in the web scraping business for the last two years or so, on and off. Two years in, it's still difficult for me to understand how big and important this market is. Actually, it took me a while to understand who my main competitors are, and to this point I still believe I haven't identified them all (yeah, that's kind of crazy...). This is what I don't get. For many businesses, the main idea depends on the success of data harvesting, or in other words, on how effectively they can 'borrow' information from multiple sources. Yet there is almost no information on the internet about companies that provide such services, and the ones that are a bit more open try to present their business as a legit service for low-end data harvesting. In most cases that's true, but high-scale data scraping, where thousands of websites are constantly monitored and scanned, is nowhere near being 'legal' (at least in terms of social responsibility). Let me explain: data can be the main ingredient of your product/service, and sometimes you can get it without any problems. You scrape a website that does not really care if it's being scraped (e.g. government agencies, national statistics departments). The complete opposite is when your targets don't like being scraped. They are going to try to block you (by your IP address most of the time), and that's where 'advanced techniques' come in. And that's where it all starts to look a little less transparent. I can understand why these companies want to stay out of the public eye. They don't need to shout about what they're doing, as their customers find them on their own. What I don't get is that no one else talks about it. No one really shares an opinion about the data scraping market. Technically it falls under the 'big data' definition, but let's be realistic: it's a completely different thing. Strange, to say the least.
What do you guys think about that? Do you think this market is going to become more transparent, especially with 'big data' expanding? Cheers!
The question is one of incentive: why would someone tell you they're getting a competitive advantage from web scraping? The minute they do that, they lose the edge.
You paint with a broad brush when you say "Everybody Does It"! I certainly don't scrape, and I know thousands more who don't. It's illegal and against the TOS of most service providers.
I would say the fact of the matter is, it's going to happen. You can either take advantage of it or not; that doesn't mean it'll go away.
I bet a lot fewer people scrape than you think. Is there any chance that you're reading about techniques from articles that are several years old? From the perspective of a web host, I can definitely tell you that we see almost no abuse tickets about that anymore, but we used to see them constantly 5+ years ago. I know correlation doesn't mean causation, but still.
I wouldn't use the word scrape. It's more of a spider or miner. It's not about stealing anymore. It's about interoperability. You know what the biggest scraper on the planet is? Guess!
I think you want to believe "everybody does it" to make yourself feel better about doing something unethical and often illegal. Not everyone does it, and obviously most people who do participate in such activities don't want to talk openly, especially in writing on the internet, about their illicit activities. Even in the cases where the specific activity isn't illegal (and let's be honest, in most cases it IS illegal), it's still a shady, lowlife way of doing business. So the question pretty much answers itself.
Thanks for your responses! I guess you're all right, at least in a way. Once you start talking about web scraping, you risk losing that advantage, especially if you're a big player. On the other hand, it's pretty obvious who does anonymous web scraping and who doesn't. Actually, the biggest players (someone like Bl00mberg, I suppose) are no longer doing it, as many smaller companies are giving the data for free and on demand (via APIs), but the market's #2 and everyone below are surely scraping/mining/digging as hard as they can. I was just generalizing by saying 'everyone'. I had in mind businesses that are directly involved in competitive intelligence (even though that's just one example where data mining can be beneficial or even a must). They ARE scraping the internet. And I wouldn't say data scraping is illegal per se. It's still a grey area. Some TOS clearly prohibit it, some are vague about it, and some don't even mention it. Most surprising to me is that sometimes websites want to be scraped, even though their generic TOS prohibit it. If you're a small low-cost airline, you'd love your flight rates to be scraped and compared with those of the big airlines, as you know you'd be cheaper. If you're Turkish Airlines, though, you won't be that happy, as your service is more expensive because it is premium. A simple price comparison therefore won't tell the full story and will mislead the customer. Anyway, I consider myself new to this area. I have a lot to learn and understand. Thank you for your opinions, no matter what they are. TheDataPlanet, it's Google, isn't it? Nathan, D-Fish was an amazing player! Calm and composed. As a coach he's terrible though...
So much misinformation in this thread. Scraping is not illegal. And it's not against many websites' ToS barring Google and other search engines. How do you think Google gets its information? Scraping and data mining. Also, scraping isn't the same as data mining. Data mining is using algorithms to extract patterns, trends, and other meaningful information from a data set (e.g. machine learning). Scraping is the process of actually getting the data set.
PDD, thanks for your input. Perhaps the terminology I used wasn't completely right, so thank you for making it clear what's what. Either way, you're not totally right about scraping being legal and ToS not prohibiting it. Most of the time that's true, especially when it comes to what Google is doing. Even though they are scraping, it's a very simple activity with almost no real-time data monitoring or deep drilling. The data harvesting techniques that I'm talking about are way more serious.