Web Hosting Talk







When to start hashing directories


freakysid
04-11-2002, 03:30 AM
Hi,

I am writing some scripts that will cache output into various directories, which will be created on the fly and routinely removed by a cron job. Thinking ahead about scalability, I was wondering: is there a rule of thumb or a formula for when you should start hashing a directory into sub-directories?

Thanks
:)

freakysid
04-11-2002, 05:42 PM
Further clarification of what I am on about :) I'm not talking about anything complicated like messing with the file system itself (it's ext2, BTW). I just know that, in general, when you are creating lots of, say, virtual server doc roots or virtual maildirs, you often "hash" them out (if hash is the correct term in this context) using a crude method.

Example:

a/
    abanathy/
    almond/
    apple/
b/
    bah/
    banana/
    boobies/

etc

Add another level to the hashing:

a/
    ab/
        aboriginie/
        abooboo/
    al/
        almond/
        alarmist/

etc
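
A minimal sketch of the prefix scheme above, in Python for illustration. The function name `prefix_path` and the `cache` root are hypothetical, and `posixpath` is used only so the separator is always `/`:

```python
import posixpath

def prefix_path(root, name, levels=1):
    """Build root/<1-char prefix>/<2-char prefix>/.../name.

    levels=1 gives a/almond, levels=2 gives a/al/almond.
    Names shorter than `levels` simply repeat their full length as the prefix.
    """
    prefixes = [name[:i] for i in range(1, levels + 1)]
    return posixpath.join(root, *prefixes, name)

print(prefix_path("cache", "banana"))      # cache/b/banana
print(prefix_path("cache", "almond", 2))   # cache/a/al/almond
```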



So, there must be some rule of thumb or algorithm that people use to decide how many levels of directory hashing produce the best-balanced tree structure in the file system.

:)

priyadi
04-11-2002, 07:58 PM
I don't think there is a specific algorithm that works in all cases. It depends on the distribution and number of your 'names'. Your goal is to minimize the number of entries in each directory.

Your example could be OK for a small number of names, but it won't work well for a very large number. It also has a distribution problem: some letters are used far more often than others.
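
One rough way to see the skew is to count how many names land in each bucket. The sketch below is only illustrative: the helper `bucket_sizes` and the sample names are made up, and a real check would run over the actual list of names.

```python
from collections import Counter

def bucket_sizes(names, bucket_of):
    """Count how many names fall into each bucket under a given scheme."""
    return Counter(bucket_of(n) for n in names)

# Hypothetical sample; real data would come from the existing directory names.
names = ["almond", "apple", "apricot", "avocado", "banana", "zucchini"]

# First-letter scheme: the bucket is simply the first character.
sizes = bucket_sizes(names, lambda n: n[0])
print(sizes.most_common())  # the largest bucket shows how uneven the split is
```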

I suggest not using the first-letter scheme and hashing the names instead. The simplest way is to sum the ASCII codes of the characters in a name, divide the sum by a number (the maximum number of hash directories, which you choose), and take the remainder. Use the remainder as the directory name. Do it twice or more with a different number each time if you need more directory levels. This way you solve the distribution problem in almost all cases.
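
The sum-and-remainder idea could be sketched like this. The names `ascii_sum_bucket` and `hashed_path`, the `cache` root, and the divisors 31 and 37 are all assumptions chosen for illustration. The two divisors are picked to share no common factor, since otherwise the second level would partly repeat the first.

```python
import posixpath

def ascii_sum_bucket(name, buckets):
    """Sum the character codes of `name` and take the remainder mod `buckets`.

    Assumption: names are encoded as UTF-8, so plain ASCII names sum their
    ASCII codes and other characters contribute their byte values.
    """
    return sum(name.encode("utf-8")) % buckets

def hashed_path(root, name, divisors=(31, 37)):
    """Build root/<sum % d1>/<sum % d2>/.../name, one level per divisor."""
    levels = [str(ascii_sum_bucket(name, d)) for d in divisors]
    return posixpath.join(root, *levels, name)

# "almond" sums to 635; 635 % 31 == 15 and 635 % 37 == 6
print(hashed_path("cache", "almond"))  # cache/15/6/almond
```

Because the sum ignores character order, anagrams always share a bucket, so this is only as even as the names allow.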