I have a data storage question. Let's say you have a list of: 12,000 venues and 36,000 contests. Each venue entry has a name and address along with about 150 other fields of data (ranging from 2 char INTs to 10k text fields). Each contest has a name, prize amount, and about 20 other fields of data (ranging from 2 char INTs to 10k text fields). You've concluded the following: - Each venue will probably have 48-96k of data and each contest will probably have 24-36k of data. - Users will be heavily searching for venue/contest names and also filtering results by location (address is actually a few fields... street, town, state, zipcode). - You'll want to have an advanced search which will allow users to search for venues/contests based on the data. - You'll want to run various analyitical routines on the data quite regularly (comparisons, recommendations, etc.). My question is, what's the best way to set this up assuming you have a high amount of traffic and 1 box handling this? My first instinct was to create 2 db tables. 1 to store the basic venue information (for search and indexing purposes). 1 to store the basic contest information (same as above). Then store the "other" fields in a flat file data-table format (<dl><dt><dd>). My concern is that to search/anayalize these files will be ridiculously stressful but at the same time if I have 5-20 tables each with a ton of rows/fields that just seems wrong too. How should I be doing this? Development platform will be PHP and MySQL.
First off, get your text fields out of the primary table. If a user needs the data in the text field, query it directly. TEXT and BLOB columns destroy efficiency when they are included in large result-sets. Personally I wouldn't even consider a flat file storage for something like this. Simply by the fact that there will be a lot of searching, a database seems the most appropriate. After that, I would start with tables: venues, contests, and venue_detail. If the overhead is too high, figure out how to further segment and optimize the design for efficiency. More than likely you aren't going to need every column for most queries, so it's possible to logically separate date based on what will be queried on the application level. Also, make sure you are using the correct indexes and query specific columns (IE: Don't use SELECT(*)). What sort of traffic volume are you anticipating, and what hardware do you have available for this? You should definitely use a front-end caching system like APC or memcached. Either way, the amount of overhead could quickly get out of hand if you have a large volume of users, or there are many unique queries so db caching is bypassed.
The primary tables contain no TEXT/BLOB fields. Just a couple of INTs and few VCHARS (ranging from 4-32 chars) along with a DATETIME. Just the bare necessities to include relevant basic data and the most common searching method (by name or location and these are indexed). The other data includes a ton of information, including some TEXT fields. It can be categorized to make sense logically and semantically but there's just so much of it that I thought maybe having 20 tables would have been a bit silly. Some of those tables will need a lot of fields too (dozens). I guess what you're saying is, this is the most efficient way of doing this? Maybe some type of mix between flat file and database recording? As in, the stuff that'll get queried should go in the db and the more "TEXTY" fields can be dropped into flat files? I'll have to look into a caching system but I did plan to create .php files for each venue/contest upon entry, this way when those pages are viewed it will use the file system rather than the db server. Keep in mind this would be done only for viewing the full data. Traffic expectations are pretty high but not insane. Maybe 20,000-40,000 unique visitors per day. The box specs are unknown at this time. Figure something pretty decent but not amazing, perhaps something in the $5-7k USD price range. This isn't something I really researched yet so the cost vs expectation might be way off heh.