Hello all, I have a question/confusion. A friend of mine has millions of documents in .doc format. Each .doc file is about 20 MB and contains 5 tables, and altogether he has more than 2 terabytes of files. He now wants a website to store all of this data. His files mainly consist of tables. One solution I know of is to convert all the files to Excel format, then to HTML, and then index all the files systematically (I know this is very time-consuming). But can you please suggest any easier way of doing this? Also, what kind of server should I use? Any other suggestions are most welcome.
You've got problems. I would probably install Apache locally, then get PHP to zip -> upload (cURL and fsockopen allow FTP connections) -> delete (or move, if he wants to keep them) the files individually. All of this can be done programmatically, so you don't have to sweat any of it. It's a bigger problem if he wants all of these to be static pages that Google can index. I've never heard of a single 2 TB server, so he's potentially looking at multiple dedicated ones ... I hope he's got some cash.
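The zip -> upload -> delete loop described above could be sketched like this (the poster suggests PHP; this Python version is just an illustration, and the FTP host and credentials in the comments are placeholders, not real values):

```python
import zipfile
from pathlib import Path
# from ftplib import FTP  # stdlib FTP client; upload step kept as comments below

def zip_upload_delete(src_dir, staging_dir, delete_originals=False):
    """Zip each .doc in src_dir into staging_dir, one archive per file.

    The actual FTP upload is left commented out -- the hostname and login
    would be assumptions anyway. Returns the list of archives created.
    """
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    archives = []
    for doc in sorted(Path(src_dir).glob("*.doc")):
        archive = staging / (doc.name + ".zip")
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.write(doc, arcname=doc.name)
        # with FTP("ftp.example.com") as ftp:        # hypothetical host
        #     ftp.login("user", "password")           # hypothetical credentials
        #     with open(archive, "rb") as fh:
        #         ftp.storbinary("STOR " + archive.name, fh)
        if delete_originals:
            doc.unlink()  # only do this after a confirmed upload in real use
        archives.append(archive)
    return archives
```

Running it over the whole collection is then one call per directory, which a cron job could drive unattended.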
Click a link and wait for the download. A 20 MB file loads in my browser in about a minute ... if it doesn't crash Firefox. It's probably best to offer a download link so that IE won't try to open the file inline.
Are the documents larger than they need to be? Can you reduce them to data and eliminate the Microsoft cruft that inflates the file size? It depends on what the content is like and what sort of presentation you require.
More importantly, do they have to be documents? If it's possible to just store the content of the documents in (several) databases, then it might be more feasible. Although, with this amount of data, multiple dedicated servers are needed, and it won't be cheap. There is no way you can make this work on crappy hardware. I wouldn't even try to run a project like this on a home server, even if all the files were stored locally.
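To make the "store the content in a database" idea concrete, here is a minimal SQLite sketch. The table name and columns are invented for illustration (nothing in the thread specifies a schema), and at 2 TB a real deployment would use a server-grade RDBMS rather than SQLite:

```python
import sqlite3

def load_rows(db_path, rows):
    """Store parsed table rows in SQLite.

    Each row is a (source_file, row_num, cells) tuple -- a made-up
    schema for demonstration, with cell values joined into one string.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        " source_file TEXT, row_num INTEGER, cells TEXT)"
    )
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

# Two hypothetical rows extracted from one document:
conn = load_rows(":memory:", [("a.doc", 1, "x|y|z"), ("a.doc", 2, "p|q|r")])
count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

Once the rows are in a database, the website only has to render query results, instead of serving 20 MB documents.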
No, they don't have to be documents; they're just simple tables in .doc format. We can convert them to XML or HTML. Any other solution besides Scribd? The pages must be Google-search-friendly. Sample page: http://ecoport.org/ep?SearchType=interactiveTableView&itableId=80010
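For the .doc -> HTML conversion itself, LibreOffice/OpenOffice can run headless in batch mode. This helper only builds the command line; the `soffice` binary name and flags assume a LibreOffice-style install on the server's PATH, which is an assumption:

```python
def build_convert_cmd(doc_path, out_dir):
    """Command line for a headless LibreOffice .doc -> HTML conversion.

    Execute it with subprocess.run(cmd, check=True). Converting millions
    of files would mean looping this over batches of documents.
    """
    return [
        "soffice", "--headless",
        "--convert-to", "html",
        "--outdir", out_dir,
        doc_path,
    ]
```

The resulting plain-HTML tables are exactly what a search engine can index, which addresses the Google-friendliness requirement.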
Scribd is indeed an awful solution; don't waste any further time even thinking about it. At the very least, gzip all your HTML files. You'll save an incredible amount of space and make the web pages load about 50 times faster. And it doesn't require any advanced knowledge. The page you linked to is 420 KB of raw HTML, and gzips down to 9 KB. With one gzip batch job, you can compress your 2 terabytes down to (roughly extrapolated) 43 gigabytes, which is much easier to handle.
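The gzip batch job could be as simple as this sketch (the directory layout is an assumption; the web server then serves the .gz files with a `Content-Encoding: gzip` header):

```python
import gzip
import shutil
from pathlib import Path

def gzip_tree(root):
    """Compress every .html under root to .html.gz, removing the original."""
    compressed = []
    for html in Path(root).rglob("*.html"):
        gz = html.with_suffix(html.suffix + ".gz")  # page.html -> page.html.gz
        with open(html, "rb") as src, gzip.open(gz, "wb", compresslevel=9) as dst:
            shutil.copyfileobj(src, dst)  # streams, so large files are fine
        html.unlink()
        compressed.append(gz)
    return compressed
```

Repetitive table markup compresses extremely well, which is why the 420 KB sample page shrinks to a few kilobytes.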