1. Advertising
    y u no do it?

    Advertising (learn more)

    Advertise virtually anything here, with CPM banner ads, CPM email ads and CPC contextual links. You can target relevant areas of the site and show ads based on geographical location of the user if you wish.

    Starts at just $1 per CPM or $0.10 per CPC.

What's the faster search algorithm?

Discussion in 'PHP' started by djDeathx, Oct 14, 2005.

  1. #1
    Hello!

    What would be a good seach algorithm in terms of having thousands and thousands (50,000 - 100,000) of different files:

    Searching by extension of file? Given that each file name is stored in a .txt based on its extension. (ie: there would be many text files pertaining to each unique file extension)

    OR

    Searching by name? Given that each file name is stored in a .txt based on the first character (ie: there would be 26 different text files, each containing files whose first char corresponds to a unique letter of the alphabet)

    I keep thinking that by name is faster, but I'm having second doubts about by extension.

    Ideas?
    Thx
     
    djDeathx, Oct 14, 2005 IP
  2. exam

    exam Peon

    Messages:
    2,434
    Likes Received:
    120
    Best Answers:
    0
    Trophy Points:
    0
    #2
    You can't implement a database driven solution?
     
    exam, Oct 14, 2005 IP
  3. nohaber

    nohaber Well-Known Member

    Messages:
    276
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    138
    #3
    Give more specifics on what you are trying to do (what is your data structure, what do you search in it, is it almost static or you often need to add more data etc.). Your question is too broad.
     
    nohaber, Oct 14, 2005 IP
  4. exam

    exam Peon

    Messages:
    2,434
    Likes Received:
    120
    Best Answers:
    0
    Trophy Points:
    0
    #4
    The fastest would be store it all in one file which you read in and sort into a searchable data structure on program startup. Then when you do the searches everything's already in memory and will be extremely fast :)
     
    exam, Oct 14, 2005 IP
  5. djDeathx

    djDeathx Peon

    Messages:
    10
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #5
    I have a list files, which are going to be available on the web through private access, and have decided to use php to build a search engine.

    I'm listing every single file, getting its name and then based on the algorithm I plan on implementing, store all files it in a .txt or db, but I haven't set-up MySQL on this server. I'm debating whether I should, or would it be easier to use 'plain and simple' .txt to store. When I add more files, I plan on updating my .txt's so searching can include the new data.

    I think the following search algorithm is best if someone was to search all files:

    I'm think that having 26 folders (a-z) and each folder having 2 .txt files then it would work out best.

    example:

    Folder: a
    a-k.txt
    l-z.txt

    Folder: b
    a-k.txt
    l-z.txt

    Folder: c
    a-k.txt
    l-z.txt

    where the folder name is the first character of the file being stored, and each file in the folder pertains to the 2nd character. If the word to be searched was apple.jpg then searching would proceed as follows:
    -Go to folder a
    -Go to file l-z.txt and find the name (the extension would be irrelevant)

    The .txt 2 files in each folder will be sorted alphabetically, so it should really speed things up!

    I figure that sorting by 1st and 2nd char would make the search fast and simple. Improvement constantly happens.

    If people want to search by extension then that would make a much faster search. I'm thinking of having both, depending on what the user picks: Search for (for example) .jpg, .mp3, .zip/.rar, .pdf, .avi, .exe, etc.. or if the user doesn't know the extension s/he can search all files

    This is a personal project, so I can make own my rules

    Thanks for the replies :)
     
    djDeathx, Oct 15, 2005 IP
  6. forkqueue

    forkqueue Guest

    Messages:
    401
    Likes Received:
    21
    Best Answers:
    0
    Trophy Points:
    0
    #6
    The answer rather depends on what file system you're using, but I suspect that they would both be pretty similar. In each case the search will involve reading from the file table and comparing against a string. Neither should be massively computationally expensive.

    One thing I would recommend is that you split the files into subdirectories, as most file systems perform better with less that 10 000 files per directory. When setting up mail/web clusters I've tried a variety of naming schemes, but to be honest the simple /i/m/image.jpg type structure seems to work as well as any, and is very easy to code for. It'd be a good idea to take a look at your filenames first though - if there's a lot of files with the first two characters then it might be a good idea to come up with a more elaborate scheme for separating things out.

    HTH
     
    forkqueue, Oct 16, 2005 IP
  7. djDeathx

    djDeathx Peon

    Messages:
    10
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #7
    I'm running it on Win Server 2k3 Corp. Ed. I have Thousands and thousands of directories. It's a fileserver which hosts a lot of data, which is organized by name w/ directories and sub-directories depending on the files/data. There isn't a single folder which has thousands of files with no sub-directories. (A folder can have 5,000 files, split possibly in roughly 500 sub-directories)

    My filenaming doesn't have anything special, by that I mean, there aren't any 2 character filenames, they're all greater than 2, that's why if I sort from the first 2 characters, I think it would be very efficient and effective. If I was to sort by 3 characters that would be even more efficient and effective.

    ie:
    Folder a (first char)
    --Folder a-k (2nd char)
    ----File a-k (3rd char)
    ----File l-z
    --Folder l-z
    ----File a-k
    ----File l-z

    so now searching for apple.jpg would involve
    -Go to folder a
    -Go to folder l-z
    -Go to file l-z, find apple.jpg

    each time it's sorted in a sub-folder, searching becomes O(n/2), it cuts down the places to look for a word by 2

    Although I have all these ideas, I don't know if it would be very efficient.
    I want 1 way (if possible) to search by name and/or by extension.

    Current Counted Files: 161,551
    Current Counted Folders: 51,668
     
    djDeathx, Oct 16, 2005 IP
  8. Dread

    Dread Peon

    Messages:
    323
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    0
    #8
    I think the way your looking at it now is a good way of doing it. Perhaps the last step could just be 1 file, i doubt you'd have to narrow it down further. This way of doing it should give you plenty of scope for file count growth up to about the million mark.
    26 folders (a-z) for first letter
    2 folders (a-k, l-z) for second letter
    1 file (a-z) for third letter

    I'd still take the database solution over any file based solution though.
     
    Dread, Oct 17, 2005 IP
  9. nohaber

    nohaber Well-Known Member

    Messages:
    276
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    138
    #9
    Do you search text in files? or filenames? If so, you can make a hash table on the file names, where each hash index has a list of files whose names match the hash index.
     
    nohaber, Oct 17, 2005 IP
  10. djDeathx

    djDeathx Peon

    Messages:
    10
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #10
    Ah! soo much pressure! Don't you know that peer-pressure always wins?
    I got MySQL 5.1 installed. I guess I'll try it out and if it works better with MySQL then I'll stick with it, else I'm using my new found algorithm!

    Thanks guys!
     
    djDeathx, Oct 17, 2005 IP
  11. djDeathx

    djDeathx Peon

    Messages:
    10
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #11
    I'm in love again! First was when I discovered php, now MySQL!
     
    djDeathx, Oct 17, 2005 IP
  12. exam

    exam Peon

    Messages:
    2,434
    Likes Received:
    120
    Best Answers:
    0
    Trophy Points:
    0
    #12
    Glad you made the switch! You're in for some fun.

    What I would do is basically scan your HHD and put all the file information in a MySql table. The table would have 3 columns. filename - ext - path. The filename and extention columns each would have an index to make searching fast. When you want to search by filename, search on column one and when you want to search by extention, search on column 2, and return the filesystem path to take you directly to the file.

    Have fun!
     
    exam, Oct 17, 2005 IP
  13. djDeathx

    djDeathx Peon

    Messages:
    10
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #13
    On top of those fields I also decided to include last modified date of files and folders. The directory though will only show a relative directory, not the full directory, since I need to map it to a URL. Made a nice php script to dump all the data sepparated by tabs in a .txt, then load it in MySQL, Baam! Works like a charm.

    The only problem I came accross was the attempt to use connect_mysql() in php, after Googling it I found out that for php5, I'll have to manually go in my php.ini and fix(allow) the extension... and as the smartass that I am, I decide to do the latest upgrade ...from 5.04 to 5.05. ahem, and I lost it all.

    Now I get "The parameter is incorrect." every time I try to run a script. I don't know what I've possibly done wrong but I followed the same steps from memory from the first time I installed it...then looked up the steps again (which were the same) and no luck! :( It feels like loosing the power to be "God"..err.. not that I've ever had it.

    Sigh.. has anyone received this problem or know what it is?
     
    djDeathx, Oct 17, 2005 IP
  14. Dread

    Dread Peon

    Messages:
    323
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    0
    #14
    "The parameter is incorrect."

    Thats a little vague, wanna elaborate and post the entire error? Is that a mysql error or php? also when you said connect_mysql() you meant mysql_connect() right? :p
     
    Dread, Oct 18, 2005 IP
  15. exam

    exam Peon

    Messages:
    2,434
    Likes Received:
    120
    Best Answers:
    0
    Trophy Points:
    0
    #15
    Open your php.ini and set the extension_dir to where the extentions are and then a little below that, uncomment the line including the mysql extention ;extension=php_mysql.dll
     
    exam, Oct 18, 2005 IP
  16. djDeathx

    djDeathx Peon

    Messages:
    10
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #16
    Been there done that. I don't know what else could be the problem.


    I get the error when attempting to execute any php script, I can't even get close to executing a my mysql code to the server :(
     
    djDeathx, Oct 18, 2005 IP
  17. exam

    exam Peon

    Messages:
    2,434
    Likes Received:
    120
    Best Answers:
    0
    Trophy Points:
    0
    #17
    Try reverting to php 5.04 or even php 4 and see if that helps. Might be a but in php's mysql support?
     
    exam, Oct 18, 2005 IP