I am paying someone to put data into some directories / personal sites, and I do not want it to end up on DP with someone selling it. Can someone just spider my site and take the data? How do I stop them from doing this? I see many databases for sale here on DP, and there is also something called a spider that takes stuff from other sites.
Basically, if you allow people to visit and read a website, someone will be able to steal it! It's harder to take something like a database, as access permissions mean they can't just download it outright, but the data will still be accessible. Your best defence is hiring a lawyer and sending out letters telling people to stop using your copyrighted data.
It is not that easy to spider your site to rebuild the database. But as MattD said, if someone makes up his mind to do it, there is no way for you to prevent the stealing. You can use Copyscape to search for copies of your content on other sites. If you find a site using it, you can file a DMCA notice. With the DMCA notice, you can have the site removed from Google and Yahoo. Warn the site owner about that, and most probably he will remove it. But of course, you still won't be able to prevent the stealing itself.
I think the correct term is "anything that isn't nailed down"... If you leave something unattended, there's always a chance someone will steal it. There is no foolproof way to stop people stealing your content, and lawyers cost loads of money, probably more than you'll lose in the long run. Even if it's not, lawyers are famously arrogant, and nobody wants that. I sat for a while with my IDE open (not writing, helps me think), and I suppose there are some measures you could put in place:

- BEFORE you make this data available to the public at large, research the sort of thing you're trying to stop. Find out if this "spider" sends any identifiable data that can differentiate it from an actual browser or a legitimate robot.
- Set up the sites in such a way that a human presence is needed to retrieve data. I'm certain that can be done, but I have no concrete suggestions.
- Only accept posted data from your own domain name.
- If your site is aimed at a particular country, deploy GeoIP services on the equipment to block requests from outside that region. It might seem quite an odd thing to say, but in a way, the fewer people that can get in, the better; if there's no point in all of England viewing your data, that's some 60 million potential thieves you have stopped in their tracks.
- Possibly the best suggestion I have: spiders work by reading particular tags at particular locations and matching predefined patterns to extract whatever they are stealing/storing. This is a massive weakness on their part; you only have to change the source code of your page and it renders the spider useless.
- If there's a particular section of the site with sensitive data, you could make it so all traffic going there has to have the correct referer to get in.

All of these things aren't fantastic, but it's better than doing nothing or having to put up with any arrogance from jumped-up schoolboys in their daddies' law firms...
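A couple of the ideas above (only accepting requests with your own domain as the referer, and spotting a user agent that looks like a script rather than a browser) can be sketched as a simple request filter. This is a sketch only: the domain name and the agent blocklist are made-up example values, and the header names follow standard HTTP.

```python
# Sketch of two checks from the list above: refuse requests whose
# Referer is not our own domain, and refuse obvious non-browser
# user agents. "example.com" and SUSPICIOUS_AGENTS are placeholders.
from urllib.parse import urlparse

ALLOWED_DOMAIN = "example.com"          # hypothetical: your own site
SUSPICIOUS_AGENTS = ("curl", "wget", "python-requests", "libwww")

def referer_ok(headers: dict) -> bool:
    """Accept only requests whose Referer points at our own domain."""
    referer = headers.get("Referer", "")
    return urlparse(referer).hostname == ALLOWED_DOMAIN

def agent_ok(headers: dict) -> bool:
    """Reject user agents that look like scripts rather than browsers."""
    agent = headers.get("User-Agent", "").lower()
    return bool(agent) and not any(s in agent for s in SUSPICIOUS_AGENTS)

def allow_request(headers: dict) -> bool:
    return referer_ok(headers) and agent_ok(headers)
```

Both headers can be forged by anyone who knows what they're doing, so this only filters out the laziest scripts.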
Lastly, the person inputting the data has everything they need to steal it. First, make sure you trust that person; second, make sure you pay them enough for it to be worthwhile for them to do a good job and come back. That's all I got...
All excellent suggestions. I just have one little thing to add... The most common way I'm seeing to do this lately is CAPTCHAs. They're those funky images with hard-to-read letters and numbers that people are putting on their blogs to make you prove that you're a real person before adding a comment.
CAPTCHAs are not foolproof, though, and have the possible side effect of pissing off your users: think very carefully before you implement them. Spiders are not limited at all by the structure of a page or the HTML; sure, some might operate this way, but it is not "how they work". Also, trying to block IP addresses based on country is essentially pointless. If someone wanted to steal your data, they could just use one of the millions of proxy sites, or even just Google's cache (presuming you haven't already stopped Google caching your page).
The spiders he is talking about do work in that way; there is NO other way for them to work. We're not on about web bots like Google's here, we're on about script kiddies writing scripts that just nick other people's data by downloading pages... Like I said, none of these measures are foolproof, but it's better than doing nothing.
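To illustrate the point about pattern-matching scrapers: here is a toy example of the kind of script being described, which only knows one hard-coded markup pattern. The HTML snippets are invented, but they show why a trivial change to your page's source code can render such a spider useless.

```python
import re

# A toy scraper of the kind described above: it pulls listing names
# out of one hard-coded tag pattern and nothing else.
PATTERN = re.compile(r'<td class="name">(.*?)</td>')

def scrape_names(html: str) -> list:
    return PATTERN.findall(html)

old_page = '<td class="name">Acme Ltd</td><td class="name">Foo Inc</td>'
# Same data, but the site owner switched <td> to <span> and renamed
# the class, so the scraper now finds nothing:
new_page = '<span class="biz">Acme Ltd</span><span class="biz">Foo Inc</span>'
```

Of course, the scraper's author can update his pattern just as easily, so this only buys time, not safety.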
I understand they can just copy. That's easy one by one, but can someone just steal my database right off? Is that so easy? Stopping by IP sounds good; a normal visitor will not view more than 10 pages for sure. Open proxies are not that great a threat here, because Thailand has them available for everybody as well.
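If a genuine visitor really never views more than 10 pages, one way to act on that is a per-IP view counter. A minimal in-memory sketch, assuming the limit of 10 from the post above (a real site would persist the counts and expire them over time):

```python
from collections import defaultdict

PAGE_LIMIT = 10             # threshold suggested in the post above
_hits = defaultdict(int)    # in-memory; resets when the process restarts

def register_hit(ip: str) -> bool:
    """Count a page view; return False once an IP exceeds the limit."""
    _hits[ip] += 1
    return _hits[ip] <= PAGE_LIMIT
```

As pointed out earlier in the thread, anyone using proxies will rotate IPs and get around this, but it raises the bar for a simple script hitting every page from one address.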
But we can change the referer header programmatically, right? I think asking for CAPTCHA validation every time a user makes a request is not good, and validating just once can be manipulated. You should limit it to every N requests. Of course, it is not easy, but if the database is worth the difficulty, then it can be done.
They won't be able to steal the schema (unless you are using an MS Access database or something, which they might be able to get over HTTP), but they will be able to view some data that subsequently appears on your pages. I've not given this idea much thought (ha!), but you might be able to do something sneaky with the table structure that isn't visible to the end user (as your PHP etc. will deal with all of the queries/joins automatically) but makes duplicating the database difficult. Obviously this is of limited value, as they will still see some data, but it would make wholesale duplication of your database difficult.
How do I do something sneaky with the table structure like that?
No idea - it might not even be possible, but it's something to sit down and think about when you are designing your db schema.
I'd like to see some scripts you guys could come up with for that... MattD seems to have it right, though... CAPTCHAs will piss off customers, and high-anonymity proxy servers with some fancy coding get right around any IP blocking mechanism you might come up with... but again, I'd take a look at whatever anyone comes up with and let you know what I think...
I'll add that I would for sure start with IP blocking... that will weed out many lil script kiddies... maybe from there look into membership-based pages where users are required to log in, with some "remember me" cookie set or something... if your site is actually useful, people will sign up, and then you can really stop any spiders at that point... as long as the login is checking the user's IP, that is...
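The login-cookie-tied-to-IP idea above could be sketched like this. It assumes you already authenticate users somehow; the token scheme is illustrative only, not production-grade session handling.

```python
import secrets

# Sketch of the "login session tied to the user's IP" idea above.
# _sessions maps token -> IP; a real site would persist and expire these.
_sessions = {}

def create_session(ip: str) -> str:
    """Issue a random session token bound to the IP that logged in."""
    token = secrets.token_hex(16)
    _sessions[token] = ip
    return token

def session_valid(token: str, ip: str) -> bool:
    """Reject the cookie if it is replayed from a different IP."""
    return _sessions.get(token) == ip
```

One trade-off to be aware of: users whose IP changes mid-session (mobile connections, some ISPs) will get logged out, so this is stricter than a plain session cookie.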