I'm trying to write a local search engine that handles CJK (Chinese, Japanese, Korean) character variants. See: http://hkiug.ln.edu.hk/unicode/hkiug_tsvcc_table-UnicodeVersion-1.0.html Each line of that table lists the different ways a 'word' can be written/typed; they all mean the same 'word'. Ideally, if I search for a word, my results will also pull up entries that contain its variants. How do I process a query that has many words without having to run an exponential number of queries? Is that even possible? I'm in a SQL/ASP environment, but any hint at how to go about this would be a lifesaver!
Interesting question! If I were doing this, I'd choose one form as the canonical form. All data would be indexed using this canonical form, and all queries would be converted to canonical form before searching the database. That way each query is searched once, no matter how many variants its characters have. That should do it, I think.
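To make the canonical-form idea concrete, here is a minimal sketch in Python (not the original poster's ASP; the same logic should carry over). The three rows in `VARIANT_ROWS` are only illustrative stand-ins for lines of the variant table, and the names `build_canonical_map` and `canonicalize` are made up for this example. It assumes each character appears in at most one row and arbitrarily treats the first character of a row as canonical.

```python
# Illustrative rows only: each string lists characters treated as the same
# "word". Real rows would be loaded from the variant table linked above.
VARIANT_ROWS = [
    "国國",
    "学學",
    "体體",
]


def build_canonical_map(rows):
    """Map every character in a row to the first character of that row."""
    mapping = {}
    for row in rows:
        canonical = row[0]  # assumption: first character is the canonical one
        for ch in row:
            mapping[ord(ch)] = canonical  # str.translate wants ordinals as keys
    return mapping


CANONICAL_MAP = build_canonical_map(VARIANT_ROWS)


def canonicalize(text):
    """Replace each known variant with its canonical character.

    Characters that are not in the table pass through unchanged.
    """
    return text.translate(CANONICAL_MAP)
```

A query with n characters that each have two variants would otherwise expand to 2^n literal strings; after `canonicalize` it is a single string, so only one search is needed.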
The database I'm searching through uses different variants in the entries, and that can't be changed due to the need for accurate representation (it's books, btw). It's all kinds of frustrating!
You can still show the results in their original form. It's just that when you index or search, you only do it on the canonical form. If you're relying on SQL to do the search, just keep 2 copies in the database:
- a canonical copy for searching, and
- an original copy for showing search results to the user.
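A rough sketch of the two-copy layout, using Python's built-in sqlite3 module purely so it can be run as-is; the table name `books`, the column names, and the three-character mapping are all hypothetical, and a real setup would use the full variant table and whatever SQL database is already in place.

```python
import sqlite3

# Hypothetical variant -> canonical pairs, standing in for the full table.
CANONICAL_MAP = str.maketrans({"國": "国", "學": "学", "體": "体"})


def canonicalize(text):
    """Replace each known variant character with its canonical form."""
    return text.translate(CANONICAL_MAP)


def build_index(titles):
    """Store every title twice: as written, and in canonical form."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE books (title_original TEXT, title_canonical TEXT)"
    )
    conn.executemany(
        "INSERT INTO books (title_original, title_canonical) VALUES (?, ?)",
        [(t, canonicalize(t)) for t in titles],
    )
    return conn


def search(conn, query):
    """Search the canonical column, but return the original text."""
    # Note: % and _ in the query are not escaped in this sketch.
    pattern = "%" + canonicalize(query) + "%"
    rows = conn.execute(
        "SELECT title_original FROM books "
        "WHERE title_canonical LIKE ? ORDER BY rowid",
        (pattern,),
    )
    return [row[0] for row in rows]
```

For example, after `conn = build_index(["中國文學史", "中国文学史"])`, calling `search(conn, "文學")` and `search(conn, "文学")` should both return both titles, each exactly as it was entered, so the book records themselves never need to be altered.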
I'm not sure it's going to work for those characters (the manual says SOUNDEX is designed for English-language strings, so it may well not), but you might be able to use the MySQL SOUNDEX function: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
Something like this:
SELECT * FROM mytable WHERE SOUNDEX(`character`) = SOUNDEX(inputted_text)
where `character` is the column that holds the character (in backticks because CHARACTER is a reserved word in MySQL) and inputted_text is the text to match against.