Search Algorithm

Discussion in 'Programming' started by binu794, Feb 8, 2009.

  1. #1
    Hi all,
    I'm planning to design a knowledge base site. The site will use MS Access as the database and ASP.NET. The idea is to store the details in the database so that when the user enters a keyword in the search box, the results bring up exactly the pages where the information the user wants is stored.
    The search page will also include some filters for advanced searching.
    Can anyone please help me with any good search algorithms?
     
    binu794, Feb 8, 2009 IP
  2. rohan_shenoy

    If you are developing the site just for practice or learning, then that's fine. Just look up the reference manual for Access; it may have a function like MATCH() [MATCH() ... AGAINST() is the full-text search function in MySQL, so search for its equivalent in Access].

    If you want an advanced algorithm, you will have to consider weighting for keywords in the topic title, keyword density in the body text, etc., which can grow complex.
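
    As a rough illustration of that weighting idea, here is a minimal Python sketch (Python is used purely for illustration; nothing here is specific to Access or MySQL). The weights, the sample articles and the `score_article` helper are all assumptions made up for this example, and a real site would need to tune them.

```python
import re

# Hypothetical weights: a title hit counts for more than body density.
TITLE_WEIGHT = 5.0
BODY_WEIGHT = 1.0


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_article(keywords, title, body):
    """Score one article: title matches plus keyword density in the body."""
    title_words = set(tokenize(title))
    body_words = tokenize(body)
    score = 0.0
    for kw in keywords:
        if kw in title_words:
            score += TITLE_WEIGHT
        if body_words:
            # Density = share of body words equal to the keyword.
            score += BODY_WEIGHT * body_words.count(kw) / len(body_words)
    return score


# Made-up sample data, not taken from any real knowledge base.
articles = [
    {"title": "Resetting a password", "body": "Open settings and choose reset password."},
    {"title": "Billing overview", "body": "If you forget your password, billing is unaffected."},
]
keywords = tokenize("password reset")
ranked = sorted(
    articles,
    key=lambda a: score_article(keywords, a["title"], a["body"]),
    reverse=True,
)
```

    Note that this matches whole words only, so "reset" does not match "resetting" in the title; handling word forms is one of the ways this "can grow complex".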

    However, if you are trying to develop a commercial script, I would advise against it, as it will only waste your time and effort. There are many free and open-source scripts that run on PHP and MySQL (both free). E.g. WordPress, with a little theme modification, can be used as an excellent knowledge base script. Or maybe even a wiki, for that matter. I am not discouraging you from designing your site, just warning you, because a lot of programmers waste time reinventing the wheel without exploring existing and better options.
     
    rohan_shenoy, Feb 8, 2009 IP
  3. mmerlinn

    mmerlinn, Feb 8, 2009 IP
  4. ccoonen

    Access is not the right choice for a large-scale website. Although its documented limit is 255 concurrent users, in practice it struggles beyond a handful of simultaneous connections, which caps how many searches can run at once. I'd switch to MySQL, PostgreSQL, SQL Server, or another server-based database.
     
    ccoonen, Feb 9, 2009 IP
  5. binu794

    Thank you all for replying. I think I need to change the database.
     
    binu794, Feb 14, 2009 IP
  6. it career

    Yes, MySQL would be better.
     
    it career, Feb 15, 2009 IP
  7. dimitar christoff

    Erm, the bloke's asking about search algorithms and you are telling him to change his database?

    The one thing that really annoys me when shopping around is a site search that gives you irrelevant results. For instance, take a search for 'black pack', a very generic search string where I'd expect to get backpacks and daypacks in black. The site, chosen at random from Google, is gooutdoors.co.uk (see the search results for yourselves here: http://www.gooutdoors.co.uk/product-list&Text=black pack).

    When you expect backpacks and get things like "Silva Ranger 3 Compass", "Lifesystems HeadNet Mosquito Hat" and "Wayfayrer Beef Stew and Dumplings", you know something is wrong with the search script.

    Deciding to look into this further, I clicked on the Lifesystems Mosquito Hat and scanned for the words 'pack' and 'black' - fairly generic. Here they were:

    - Can screw down into a small "stuff pack"
    - Ultrafine black mesh

    Why do we get this problem? Lazy coding. The most basic search practice out there is to do something like:

    1. Break string into words.
    2. Compose the search query targeting known data fields like title, description and features, word by word, imploding the pieces into the query. At this point the where statement can look like 'where (description like '%black%' or features like '%black%' or title like '%black%') and (description like '%pack%' ... etc etc)'
    3. Display the results and hope for the best.
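
    To make the problem concrete, the three steps above can be sketched in Python with an in-memory SQLite database. The `products` table, the second product row and the `build_naive_where` helper are assumptions invented for this example; only the mosquito hat text comes from the page quoted above. The sketch binds the words as parameters rather than pasting them into the SQL string.

```python
import sqlite3

FIELDS = ("title", "description", "features")


def build_naive_where(search):
    """Build the word-by-word LIKE clause and its bound parameters."""
    clauses, params = [], []
    for word in search.split():
        # One OR-group per word, covering every searchable field.
        clauses.append("(" + " OR ".join(f"{f} LIKE ?" for f in FIELDS) + ")")
        params.extend([f"%{word}%"] * len(FIELDS))
    # An empty search falls back to a clause that matches every row.
    return " AND ".join(clauses) or "1=1", params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (title TEXT, description TEXT, features TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Mosquito Hat", "Can screw down into a small stuff pack", "Ultrafine black mesh"),
        ("Black Daypack 20L", "A compact black pack for day hikes", "Padded straps"),
    ],
)

where, params = build_naive_where("black pack")
rows = conn.execute(f"SELECT title FROM products WHERE {where}", params).fetchall()
titles = [r[0] for r in rows]
```

    Both rows come back, and nothing in the query says the daypack is a better match than the hat: 'black' and 'pack' each appear somewhere in the hat's text, which is exactly the irrelevant-result problem described above.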

    Here is another favourite search of mine that works on this site:
    'the this': Found 253 product(s) - page 1 of 22

    I think it's fair to say that certain words should not be used to score results; they are just too generic to be considered. Unless I am typing something like 'the north face', my 'the' should be dismissed, just as 'this' needs to be dropped.
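
    A minimal Python sketch of that filtering, assuming a small stand-in stop-word set and a hypothetical list of protected phrases (such as brand names) whose words should survive even though they contain stop words:

```python
import re

# A tiny stand-in for a full stop-word list.
STOP_WORDS = {"a", "an", "and", "of", "the", "this"}

# Hypothetical protected phrases whose words must not be dropped.
PROTECTED_PHRASES = ["the north face"]


def useful_terms(search):
    """Return protected phrases first, then the remaining non-stop words."""
    text = " ".join(search.lower().split())
    terms = []
    for phrase in PROTECTED_PHRASES:
        # Match the phrase on word boundaries only.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text):
            terms.append(phrase)
            text = re.sub(pattern, " ", text)
    terms.extend(w for w in text.split() if w not in STOP_WORDS)
    return terms
```

    With this, a search for 'the this' leaves no terms at all, so the site could show a "please be more specific" message instead of 253 products.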

    So, what is the alternative? Oddly enough, the most accurate search results are achieved via accurate tagging and product knowledge. It goes like this:

    1. Assign tags to each product. You can build an aliases table for common tags and errors. For example, you want to alias misspellings like berghouse and burghaus to berghaus (you'd be surprised how many people make mistakes).
    2. Build the search algorithm to break down the string into parts and analyse them. Drop all common words that won't help and keep the 'useful' bits only. See below for a suggestion of what can be removed from your search.
    3. Treat the words you have left as tags and select all products that have these tags applied to them.
    4. Refine for relevance. This is done by counting the number of tag hits on each product. Basically, if I search for Berghaus RG1 Jacket, that's a possible 3 tagword hits. If the store has the RG1, it should show me just that and none of the results with 2 hits (jacket + berghaus). If they don't have the RG1, we will have an array of jackets by Berghaus (2 hits) and finally an array of just jackets (1 hit), so we can display the ones with 2 hits first as the more relevant results.
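
    The four steps above could be sketched in Python roughly as follows. This is only one possible reading of them: the alias table, the catalogue and the `search_by_tags` helper are hypothetical, and the sketch assumes the common words from step 2 have already been stripped from the search string.

```python
from collections import defaultdict

# Hypothetical alias table: common misspellings -> canonical tag.
ALIASES = {"berghouse": "berghaus", "burghaus": "berghaus"}

# Hypothetical catalogue: product name -> set of tags.
PRODUCTS = {
    "Berghaus RG1 Jacket": {"berghaus", "rg1", "jacket"},
    "Berghaus Fleece Jacket": {"berghaus", "jacket"},
    "Generic Rain Jacket": {"jacket"},
    "Silva Ranger 3 Compass": {"silva", "compass"},
}


def search_by_tags(search, products=PRODUCTS):
    """Return lists of product names grouped by hit count, best tier first."""
    # Steps 1 and 3: normalise each word through the alias table, treat as tags.
    tags = {ALIASES.get(w, w) for w in search.lower().split()}
    tiers = defaultdict(list)
    for name, product_tags in products.items():
        hits = len(tags & product_tags)
        if hits:
            tiers[hits].append(name)
    # Step 4: if something matches every tag, show only that tier.
    if tiers.get(len(tags)):
        return [tiers[len(tags)]]
    return [tiers[h] for h in sorted(tiers, reverse=True)]
```

    Searching for 'Burghaus RG1 jacket' returns only the RG1. If the RG1 is removed from the catalogue, the same search returns the Berghaus jackets first and the plain jackets after them.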

    Advantages: always get the right and relevant results.
    Disadvantages: you need to manage it, you need to update it and you need to monitor for people making mistakes and aliasing them. The increased conversion ratio will justify the man hours put into tagging your product base.

    I hope this gives you some ideas anyway.

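    As promised, here is my list of 'bad words' that I strip from search strings (PHP):
    // Stop words: terms too generic to help rank results. All entries are lowercase, so lowercase the input before comparing.
    $badwords = array(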
            "a", "a's", "able", "about", "above", "according", "accordingly", "across", "actually", 
            "afterwards", "again", "against", "ain't", "all", "allow", "allows", "almost", "alone", 
            "along", "already", "also", "although", "always", "am", "among", "amongst", "an", "and", 
            "another", "any", "anybody", "anyhow", "anyone", "anything", "anyway", "anyways", "anywhere", 
            "apart", "appear", "appreciate", "appropriate", "are", "aren't", "around", "as", "aside", 
            "ask", "asking", "associated", "at", "available", "away", "awfully", "b", "be", "became", 
            "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", 
            "being", "believe", "below", "beside", "besides", "best", "better", "between", "beyond", 
            "both", "brief", "but", "by", "c", "c'mon", "c's", "came", "can", "can't", "cannot", "cant", 
            "cause", "causes", "certain", "certainly", "changes", "clearly", "co", "com", "come", "comes", 
            "concerning", "consequently", "consider", "considering", "contain", "containing", "contains", 
            "corresponding", "could", "couldn't", "course", "currently", "d", "definitely", "described", 
            "despite", "did", "didn't", "different", "do", "does", "doesn't", "doing", "don't", "done", 
            "down", "downwards", "during", "e", "each", "edu", "eg", "eight", "either", "else", 
            "elsewhere", "enough", "entirely", "especially", "et", "etc", "even", "ever", "every", 
            "everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", 
            "f", "far", "few", "fifth", "first", "five", "followed", "following", "follows", "for", 
            "former", "formerly", "forth", "four", "from", "further", "furthermore", "g", "get", "gets", 
            "getting", "given", "gives", "go", "goes", "going", "gone", "got", "gotten", "greetings", 
            "h", "had", "hadn't", "happens", "hardly", "has", "hasn't", "have", "haven't", "having", 
            "he", "he's", "hello", "help", "hence", "her", "here", "here's", "hereafter", "hereby", 
            "herein", "hereupon", "hers", "herself", "hi", "him", "himself", "his", "hither", 
            "hopefully", "how", "howbeit", "however", "i", "i'd", "i'll", "i'm", "i've", "ie", "if", 
            "ignored", "immediate", "in", "inasmuch", "inc", "indeed", "indicate", "indicated", 
            "indicates", "inner", "insofar", "instead", "into", "inward", "is", "isn't", "it", 
            "it'd", "it'll", "it's", "its", "itself", "j", "just", "k", "keep", "keeps", "kept", 
            "know", "knows", "known", "l", "last", "lately", "later", "latter", "latterly", "least", 
            "less", "lest", "let", "let's", "like", "liked", "likely", "little", "look", "looking", 
            "looks", "ltd", "m", "mainly", "many", "may", "maybe", "me", "mean", "meanwhile", "merely", 
            "might", "more", "moreover", "most", "mostly", "much", "must", "my", "myself", "n", "name", 
            "namely", "nd", "near", "nearly", "necessary", "need", "needs", "neither", "never", 
            "nevertheless", "new", "next", "nine", "no", "nobody", "non", "none", "noone", "nor", 
            "normally", "not", "nothing", "novel", "now", "nowhere", "o", "obviously", "of", "off", 
            "often", "oh", "ok", "okay", "old", "on", "once", "one", "ones", "only", "onto", "or", 
            "other", "others", "otherwise", "ought", "our", "ours", "ourselves", "out", "outside", 
            "over", "overall", "own", "p", "particular", "particularly", "per", "perhaps", "placed", 
            "please", "plus", "possible", "presumably", "probably", "provides", "q", "que", "quite", 
            "qv", "r", "rather", "rd", "re", "really", "reasonably", "regarding", "regardless", 
            "regards", "relatively", "respectively", "right", "s", "said", "same", "saw", "say", 
            "saying", "says", "second", "secondly", "see", "seeing", "seem", "seemed", "seeming", 
            "seems", "seen", "self", "selves", "sensible", "sent", "serious", "seriously", "seven", 
            "several", "shall", "she", "should", "shouldn't", "since", "six", "so", "some", "somebody", 
            "somehow", "someone", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", 
            "sorry", "specified", "specify", "specifying", "still", "sub", "such", "sup", "sure", "t", 
            "t's", "take", "taken", "tell", "tends", "th", "than", "thank", "thanks", "thanx", "that", 
            "that's", "thats", "the", "their", "theirs", "them", "themselves", "then", "thence", "there", 
            "there's", "thereafter", "thereby", "therefore", "therein", "theres", "thereupon", "these", 
            "they", "they'd", "they'll", "they're", "they've", "think", "third", "this", "thorough", 
            "thoroughly", "those", "though", "three", "through", "throughout", "thru", "thus", "to", 
            "together", "too", "took", "toward", "towards", "tried", "tries", "truly", "try", "trying", 
            "twice", "two", "u", "un", "under", "unfortunately", "unless", "unlikely", "until", "unto", 
            "up", "upon", "us", "use", "used", "useful", "uses", "using", "usually", "v", "value", 
            "various", "very", "via", "viz", "vs", "w", "want", "wants", "was", "wasn't", "way", "we", 
            "we'd", "we'll", "we're", "we've", "welcome", "well", "went", "were", "weren't", "what", 
            "what's", "whatever", "when", "whence", "whenever", "where", "where's", "whereafter", "whereas", 
            "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", 
            "who's", "whoever", "whole", "whom", "whose", "why", "will", "willing", "wish", "with", 
            "within", "without", "won't", "wonder", "would", "wouldn't", "x", "y", "yes", "yet", 
            "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves", "z"
    );
     
    dimitar christoff, Feb 15, 2009 IP
  8. binu794

    Thank you very much, christoff. You were really helpful. I will definitely take your idea for building my knowledge base search. Thanks again.
     
    binu794, Feb 15, 2009 IP
  9. rohan_shenoy

    rohan_shenoy, Feb 16, 2009 IP
  10. manilaoffice

    Thank you for that post and for sharing those links; it helps a lot.
     
    manilaoffice, Feb 16, 2009 IP
  11. websecrets

    Are you planning on using keywords for search, or spoken text?

    More specifically: are you going to have separate fields in the database, with the articles in one and the specific keywords/tags that relate to those articles in another? Or do you want to store just the articles and let people search through all of the text?

    I've done it both ways. Using keywords/tags will generate a more specific answer, but you'll need to do some database modifications, plus a speech query to search entire articles. I can post some code if you need.
     
    websecrets, Feb 17, 2009 IP