Search Algorithm

binu794 Active Member

#1

Hi all,
I'm planning to design a knowledge base site. The site will use MS Access as the database and ASP.NET. The idea is to store the details in the database so that, when the user enters a keyword in the search box, the results bring up the pages where the information the user wants is stored.
The search page also includes some filters for advanced searches.
Can anyone please help me with any good search algorithms?

binu794, Feb 8, 2009 IP

rohan_shenoy Active Member

#2

If you are developing the site just for practice or learning, then go ahead. Look up the reference manual for Access; it may have a function like MATCH() [MATCH() ... AGAINST() is found in MySQL, so search for its equivalent in Access].

If you want a more advanced algorithm, you will have to consider weighting for keywords in the topic title, keyword density in the body text, etc., which can grow complex.
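To make the weighting idea above concrete, here is a minimal PHP sketch, not a definitive implementation. It assumes a title hit is worth more than a body hit and uses keyword density (hits divided by body word count) for the body. The weights and the names TITLE_WEIGHT, BODY_WEIGHT, words() and scoreArticle() are made up for illustration.

```php
<?php
// Hypothetical weights (assumption): a title hit counts more than a body hit.
const TITLE_WEIGHT = 5.0;
const BODY_WEIGHT = 1.0;

/** Lowercase a string and split it into alphanumeric words. */
function words(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

/**
 * Score one article against the keywords: each title hit adds TITLE_WEIGHT,
 * and body hits are scaled by keyword density (hits / body word count).
 */
function scoreArticle(array $keywords, string $title, string $body): float
{
    $titleWords = words($title);
    $bodyWords = words($body);
    $bodyCount = max(count($bodyWords), 1); // avoid division by zero
    $score = 0.0;
    foreach ($keywords as $keyword) {
        $keyword = strtolower($keyword);
        $titleHits = count(array_keys($titleWords, $keyword, true));
        $bodyHits = count(array_keys($bodyWords, $keyword, true));
        $score += TITLE_WEIGHT * $titleHits + BODY_WEIGHT * ($bodyHits / $bodyCount);
    }
    return $score;
}
```

Sorting articles by this score in descending order puts title matches ahead of passing mentions in the body. The weights would need tuning against real content.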

However, if you are trying to develop a commercial script, I would advise against it, as it will only waste your time and effort. There are many free and open-source scripts that run on PHP and MySQL (both free). E.g. WordPress, with a little theme modification, can be used as an excellent knowledge base script. Or maybe even a wiki, for that matter. I am not discouraging you from designing your site, just warning you, because a lot of programmers waste time reinventing the wheel without exploring existing and better options.

rohan_shenoy, Feb 8, 2009 IP

mmerlinn Prominent Member

#3

rohan_shenoy

I went and looked at your blog about PHP email validation. Suspecting that this was too simple I did a search and found these:

FILTER_VALIDATE_EMAIL is not RFC2822 compliant

PHP Filter_Var FILTER_VALIDATE_EMAIL Newline Injection Vulnerability

You might want to note the potential problems in your blog.

mmerlinn, Feb 8, 2009 IP

ccoonen Well-Known Member

#4

Access is not the right route for a large-scale website. In practice it copes with only a handful of concurrent connections (about 10), which means only about 10 searches at a time. I'd switch to MySQL, PostgreSQL, SQL Server, or another database solution.

ccoonen, Feb 9, 2009 IP

binu794 Active Member

#5

Thank you all for replying. I think I need to change the database.

binu794, Feb 14, 2009 IP

it career Notable Member

#6

binu794 said: ↑

Thank you all for replying. I think I need to change the database.

Yes, MySQL would be better.

it career, Feb 15, 2009 IP

dimitar christoff Active Member

#7

Erm, the bloke's asking about search algorithms and you are telling him to change his database?

The one thing that really annoys me when shopping around is a site search that gives you irrelevant results. Take 'black pack', a very generic search string where I'd expect to get backpacks and daypacks in black. The site, chosen at random from Google: gooutdoors.co.uk (see the search results for yourselves here: http://www.gooutdoors.co.uk/product-list&Text=black pack).

When you expect backpacks and get things like "Silva Ranger 3 Compass", "Lifesystems HeadNet Mosquito Hat" and "Wayfayrer Beef Stew and Dumplings", you know something is wrong with the search script.

Deciding to look into this further, I clicked on the Lifesystems Mosquito Hat and scanned the page for the words 'pack' and 'black'. Here they were:

- Can screw down into a small "stuff pack"
- Ultrafine black mesh

Why do we get this problem? Lazy coding. The most basic search practice out there is to do something like:

1. Break the string into words.
2. Compose the search query targeting known data fields like title, description and features, word by word, imploding the parts into the query. At this point the where statement can look like: where (description like '%black%' or features like '%black%' or title like '%black%') and (description like '%pack%' ... etc etc)
3. Display the results and hope for the best.
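
For reference, the basic practice in the steps above might look roughly like this in PHP. This is a sketch of the naive approach being criticised, not a recommendation. The function name buildNaiveWhere() and the column names are hypothetical, and the words are bound as parameters rather than pasted into the SQL.

```php
<?php
/**
 * Build the WHERE clause for the naive search: every word must appear
 * somewhere in at least one of the given columns.
 * Returns [sql, params]. Column names must come from trusted code;
 * the search words are only ever passed as bound parameters.
 * Note: % and _ typed by the user are not escaped in this sketch.
 */
function buildNaiveWhere(string $search, array $columns): array
{
    // Step 1: break the string into words.
    $words = preg_split('/\s+/', trim($search), -1, PREG_SPLIT_NO_EMPTY);
    $groups = [];
    $params = [];
    // Step 2: one OR-group per word, all groups joined with AND.
    foreach ($words as $word) {
        $likes = [];
        foreach ($columns as $column) {
            $likes[] = "$column LIKE ?";
            $params[] = '%' . $word . '%';
        }
        $groups[] = '(' . implode(' OR ', $likes) . ')';
    }
    return [implode(' AND ', $groups), $params];
}

[$where, $params] = buildNaiveWhere('black pack', ['title', 'description', 'features']);
```

Because '%pack%' matches any text containing those letters, a mosquito hat with a "stuff pack" qualifies just as well as a backpack, which is exactly the problem described above.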

Here is another favourite search of mine that works on this site:
the this, Found 253 product(s) - page 1 of 22

I think it's fair to say that certain words should not be used to score results; they are just too generic to be considered. Unless I am typing something like 'the north face', my 'the' should be dismissed, just as 'this' needs to be dropped.

So, what is the alternative? Oddly enough, the most accurate search results are achieved via accurate tagging and product knowledge. It goes like this:

1. Assign tags to each product. You can build an aliases table for common tags and errors. For example, you want to alias berghaus to misspellings like berghouse, burghaus etc (you'd be surprised how many people make mistakes).
2. Build the search algorithm to break the string down into parts and analyse them. Drop all the common words that won't help and keep only the 'useful' bits. See below for a suggestion of what can be removed from your search.
3. Treat the words you have left as tags and select all products that have these tags applied to them.
4. Refine for relevance. This is done by counting the number of tag hits on each product. Basically, if I search for Berghaus RG1 Jacket, that's a possible 3 tagword hits. If the store has the RG1, it should show me just that and none of the results with 2 hits (jacket + berghaus). If they don't have the RG1, we will have an array of jackets by Berghaus and finally an array of just jackets, so we can display the ones with 2 hits first as the more relevant results for the search.
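
The four steps above could be sketched like this in PHP. It is only an illustration under stated assumptions: the alias table and the product catalogue are in-memory arrays made up for the example (a real site would keep them in database tables), and the names searchTags() and tagSearch() are hypothetical.

```php
<?php
// Hypothetical alias table (assumption): common misspelling => canonical tag.
$aliases = ['berghouse' => 'berghaus', 'burghaus' => 'berghaus'];

// Hypothetical catalogue (assumption): product name => tags assigned to it.
$products = [
    'Berghaus RG1 Jacket'    => ['berghaus', 'rg1', 'jacket'],
    'Berghaus Fleece Jacket' => ['berghaus', 'fleece', 'jacket'],
    'Generic Rain Jacket'    => ['rain', 'jacket'],
    'Silva Ranger 3 Compass' => ['silva', 'compass'],
];

/** Turn a search string into unique canonical tags, resolving known aliases. */
function searchTags(string $search, array $aliases): array
{
    $words = preg_split('/\s+/', strtolower(trim($search)), -1, PREG_SPLIT_NO_EMPTY);
    $tags = [];
    foreach ($words as $word) {
        $tags[] = $aliases[$word] ?? $word;
    }
    return array_values(array_unique($tags));
}

/**
 * Return product names ordered by number of matching tags, best first.
 * If any product matches every tag, only those products are returned.
 */
function tagSearch(string $search, array $products, array $aliases): array
{
    $tags = searchTags($search, $aliases);
    $hits = [];
    foreach ($products as $name => $productTags) {
        $count = count(array_intersect($tags, $productTags));
        if ($count > 0) {
            $hits[$name] = $count;
        }
    }
    // Full matches win outright: show the RG1 and nothing else.
    $full = array_keys($hits, count($tags), true);
    if ($full !== []) {
        return $full;
    }
    // Otherwise fall back to partial matches, most hits first.
    arsort($hits);
    return array_keys($hits);
}

$exact = tagSearch('Berghouse RG1 jacket', $products, $aliases);
```

Stopword removal (step 2) is left out here to keep the sketch short; it would run inside searchTags() before the alias lookup.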

Advantages: you get the right, relevant results.
Disadvantages: you need to manage it, update it, and monitor the mistakes people make so you can alias them. The increased conversion ratio should justify the man hours put into tagging your product base.

I hope this gives you some ideas anyway.

As promised, here is my list of 'bad words' that I disregard from search strings:
$badwords = array(
        "a", "a's", "able", "about", "above", "according", "accordingly", "across", "actually",
        "afterwards", "again", "against", "ain't", "all", "allow", "allows", "almost", "alone",
        "along", "already", "also", "although", "always", "am", "among", "amongst", "an", "and",
        "another", "any", "anybody", "anyhow", "anyone", "anything", "anyway", "anyways", "anywhere",
        "apart", "appear", "appreciate", "appropriate", "are", "aren't", "around", "as", "aside",
        "ask", "asking", "associated", "at", "available", "away", "awfully", "b", "be", "became",
        "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind",
        "being", "believe", "below", "beside", "besides", "best", "better", "between", "beyond",
        "both", "brief", "but", "by", "c", "c'mon", "c's", "came", "can", "can't", "cannot", "cant",
        "cause", "causes", "certain", "certainly", "changes", "clearly", "co", "com", "come", "comes",
        "concerning", "consequently", "consider", "considering", "contain", "containing", "contains",
        "corresponding", "could", "couldn't", "course", "currently", "d", "definitely", "described",
        "despite", "did", "didn't", "different", "do", "does", "doesn't", "doing", "don't", "done",
        "down", "downwards", "during", "e", "each", "edu", "eg", "eight", "either", "else",
        "elsewhere", "enough", "entirely", "especially", "et", "etc", "even", "ever", "every",
        "everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except",
        "f", "far", "few", "fifth", "first", "five", "followed", "following", "follows", "for",
        "former", "formerly", "forth", "four", "from", "further", "furthermore", "g", "get", "gets",
        "getting", "given", "gives", "go", "goes", "going", "gone", "got", "gotten", "greetings",
        "h", "had", "hadn't", "happens", "hardly", "has", "hasn't", "have", "haven't", "having",
        "he", "he's", "hello", "help", "hence", "her", "here", "here's", "hereafter", "hereby",
        "herein", "hereupon", "hers", "herself", "hi", "him", "himself", "his", "hither",
        "hopefully", "how", "howbeit", "however", "i", "i'd", "i'll", "i'm", "i've", "ie", "if",
        "ignored", "immediate", "in", "inasmuch", "inc", "indeed", "indicate", "indicated",
        "indicates", "inner", "insofar", "instead", "into", "inward", "is", "isn't", "it",
        "it'd", "it'll", "it's", "its", "itself", "j", "just", "k", "keep", "keeps", "kept",
        "know", "knows", "known", "l", "last", "lately", "later", "latter", "latterly", "least",
        "less", "lest", "let", "let's", "like", "liked", "likely", "little", "look", "looking",
        "looks", "ltd", "m", "mainly", "many", "may", "maybe", "me", "mean", "meanwhile", "merely",
        "might", "more", "moreover", "most", "mostly", "much", "must", "my", "myself", "n", "name",
        "namely", "nd", "near", "nearly", "necessary", "need", "needs", "neither", "never",
        "nevertheless", "new", "next", "nine", "no", "nobody", "non", "none", "noone", "nor",
        "normally", "not", "nothing", "novel", "now", "nowhere", "o", "obviously", "of", "off",
        "often", "oh", "ok", "okay", "old", "on", "once", "one", "ones", "only", "onto", "or",
        "other", "others", "otherwise", "ought", "our", "ours", "ourselves", "out", "outside",
        "over", "overall", "own", "p", "particular", "particularly", "per", "perhaps", "placed",
        "please", "plus", "possible", "presumably", "probably", "provides", "q", "que", "quite",
        "qv", "r", "rather", "rd", "re", "really", "reasonably", "regarding", "regardless",
        "regards", "relatively", "respectively", "right", "s", "said", "same", "saw", "say",
        "saying", "says", "second", "secondly", "see", "seeing", "seem", "seemed", "seeming",
        "seems", "seen", "self", "selves", "sensible", "sent", "serious", "seriously", "seven",
        "several", "shall", "she", "should", "shouldn't", "since", "six", "so", "some", "somebody",
        "somehow", "someone", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon",
        "sorry", "specified", "specify", "specifying", "still", "sub", "such", "sup", "sure", "t",
        "t's", "take", "taken", "tell", "tends", "th", "than", "thank", "thanks", "thanx", "that",
        "that's", "thats", "the", "their", "theirs", "them", "themselves", "then", "thence", "there",
        "there's", "thereafter", "thereby", "therefore", "therein", "theres", "thereupon", "these",
        "they", "they'd", "they'll", "they're", "they've", "think", "third", "this", "thorough",
        "thoroughly", "those", "though", "three", "through", "throughout", "thru", "thus", "to",
        "together", "too", "took", "toward", "towards", "tried", "tries", "truly", "try", "trying",
        "twice", "two", "u", "un", "under", "unfortunately", "unless", "unlikely", "until", "unto",
        "up", "upon", "us", "use", "used", "useful", "uses", "using", "usually", "v", "value",
        "various", "very", "via", "viz", "vs", "w", "want", "wants", "was", "wasn't", "way", "we",
        "we'd", "we'll", "we're", "we've", "welcome", "well", "went", "were", "weren't", "what",
        "what's", "whatever", "when", "whence", "whenever", "where", "where's", "whereafter", "whereas",
        "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who",
        "who's", "whoever", "whole", "whom", "whose", "why", "will", "willing", "wish", "with",
        "within", "without", "won't", "wonder", "would", "wouldn't", "x", "y", "yes", "yet",
        "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves", "z"
);
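
One way such a list might be applied to a search string is sketched below. It is a minimal example under assumptions: $badwords is cut down to a short excerpt so the snippet runs on its own, and $protectedPhrases and usefulWords() are hypothetical names added to handle cases like 'the north face', where 'the' has to survive.

```php
<?php
// Short excerpt of the list above so the sketch runs on its own.
$badwords = array("a", "and", "for", "in", "of", "the", "this", "to", "with");

// Hypothetical whitelist (assumption): phrases whose stopwords must survive.
$protectedPhrases = array("the north face");

/** Return the useful parts of a search string, in lowercase. */
function usefulWords(string $search, array $badwords, array $protectedPhrases): array
{
    $search = strtolower(trim($search));
    $kept = [];
    // Pull out protected phrases first (plain substring match) so they stay intact.
    foreach ($protectedPhrases as $phrase) {
        if (strpos($search, $phrase) !== false) {
            $kept[] = $phrase;
            $search = str_replace($phrase, ' ', $search);
        }
    }
    $words = preg_split('/\s+/', $search, -1, PREG_SPLIT_NO_EMPTY);
    // Flip the list once so each lookup is a key check, not a scan.
    $bad = array_flip($badwords);
    foreach ($words as $word) {
        if (!isset($bad[$word])) {
            $kept[] = $word;
        }
    }
    return $kept;
}

$parts = usefulWords('The North Face jacket for this winter', $badwords, $protectedPhrases);
```

With this in place a search for 'the this' reduces to nothing at all, so the site can show a "please be more specific" message instead of 253 products.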

dimitar christoff, Feb 15, 2009 IP

binu794 Active Member

#8

Thank you very much, christoff. You were really helpful. I will definitely take your idea for building my knowledge base search. Thanks again.

binu794, Feb 15, 2009 IP

rohan_shenoy Active Member

#9

mmerlinn said: ↑

rohan_shenoy

I went and looked at your blog about PHP email validation. Suspecting that this was too simple I did a search and found these:

FILTER_VALIDATE_EMAIL is not RFC2822 compliant

PHP Filter_Var FILTER_VALIDATE_EMAIL Newline Injection Vulnerability

You might want to note the potential problems in your blog.

Thanks a lot Mmerlinn, I have updated my post.

rohan_shenoy, Feb 16, 2009 IP

manilaoffice Peon

#10

Thank you for that post and for sharing those links; it helps a lot.

manilaoffice, Feb 16, 2009 IP

websecrets Peon

#11

Are you planning on using keywords for search, or free text?

More specifically: are you going to have separate fields in the database, with the articles in one and the specific keywords/tags that relate to those articles in another? Or do you want to keep just the articles and let people search through all the text?

I've done it both ways. Using keywords/tags will generate a more specific answer, but you'll need to make some database modifications, plus a free-text query if you also want to search entire articles. I can post some code if you need.

websecrets, Feb 17, 2009 IP
