Web Hosting Talk







View Full Version : google searchengine programming


onauc
12-14-2004, 08:53 PM
Howdy,

I want to know how the google, webcrawler etc. searchengines really work as I am learning php programming and want to write a searchengine.
I have read around 10 websites, found on google, about “how searchengines work” and not a single one of them make it clear if it is the spider or the index or the search software does the ranking according to it’s ranking algorithm.
All they ever say is that, a searchengine has 3 softwares :
a) the spider
b) the index
c) the search system (search-box, template, etc.)
The spiders crawl the web collecting webpages and then forward them to the index and then the search software searches the index for the sought keywords/phrases.
Also, some say that the spiders copy the whole website into it’s index. So, in other words, there is 2 copies of a website. One residing in the website owner’s webserver and the other residing on the index of the searchengine.
So now, I can only assume 3 possibilities how a searchengine works from all this:

1.
The spider does not do the ranking according to any algorithm.
All it does is visit a website, grab all it’s html codes (copy a website) and then dump the html codes to it’s index.
The Index is nothing but a big txt file (.txt, .html) on the searchengine’s webserver that keeps full copy (html codes) of each website.
The search-system, when searching and finding links (in the index) gives the ranking according to the searchengine’s ranking algorithm.
This means, the spider nor the index is responsible for the ranking because these 2 parts of the searchengine are not taught the ranking algorithm.

OR

2.
The spider does the ranking according to the searchengine’s ranking algorithm.
It visits a website and grabs all it’s html codes (copy a website) and then finally dump the html codes to it’s index. When it dumps the copies of websites it ranks them according to the searchengine’s algorithm.
The Index is nothing but a big txt file (.txt, .html) on the searchengine’s webserver that keeps full copy (html codes) of each website.
The search-system, when searching and finding links (in the index) does not give the ranking according to the searchengine’s ranking algorithm because that has been already done by the spider when dumping the data onto the index.
This means, the spider is responsible for giving the ranking and not the index nor the search-system responsible for the ranking because these 2 parts of the searchengine are not taught the ranking algorithm.

OR

3.
The spider does not do the ranking according to any algorithm.
All it does is visit a website, grab all it’s html codes (copy a website) and then dump the html codes to it’s index.
The Index is not only a big txt file (.txt, .html) on the searchengine’s webserver that keeps full copy (html codes) of each website but also the system that does the ranking.
When it receives data from the spider, it ranks the links in it’s database according to the searchengine’s ranking algorithm.
The search-system, when searching and finding links (in the index) does not give the ranking according to the searchengine’s ranking algorithm.
Frankly, all it does is output a copy of certain parts of the index onto a searcher’s screen.
This means, neither the spider or the search-system is responsible for the ranking because these 2 parts of the searchengine are not taught the ranking algorithm.


So, which assumption is correct according to the 3 above ?


Ok, I am not thinking of competing with google but you should understand that I want to run a searchengine and it should have a spider, an index and a search facility and I should be able to teach it ranking algorithms.
The web-scripts out there do not offer the admin to teach his searchengine (that runs with these ready-made web-scripts) their own ranking algorithm.
The web-script developing company built the ranking algorithms and we admins cannot change them.
The major searchengines can change their ranking algorithms from time to time when they find-out that webmasters have guessed their ranking algorithms and are abusing them to get their non-relevant websites ranked high under every keyword under the sky.
eg.
I run a search-engine. I use a ready-made web-script. My search-engine one day gets popular. Now, you decide to get traffic to your website from it.
You check what ready-made web-script I am using and you buy that script and experiment on it and find-out the ranking algorithm.
Now, you falsely optimise your website so it ranks high under every keyword on my searchengine, even those keywords that are not really related to your website. Sooner or later, people dump my searchengine. My venture comes to a dead-end.
Now, to avoid all this, I must be able to change my ranking algorithm when I fiond-out that webmasters have found-out my ranking algorithm and are abusing it.
Typical that these ready-made searchengine web-scripts do not offer the admin to change the ranking algorithm and create their own algorithms too.

Also, what is peer-to-peer searchengine ?

luki
12-14-2004, 11:31 PM
That's a long post ;) you have... my thoughts:

1) The key is not the data collection or even the data management (those are trivial), it's the algorithm to rank the data to be able to extract parts of it that are relevant to a query. This algorithm is you "trade secret" so you just wouldn't use a ready-made script. It's just not good enough. Hence no worries about some one getting the same script and falsely boosting his rank.

2) It depends on your algorithm, but I would assume to rank a site properly you need to know other data in the dataset. Hence a spider cannot do the ranking because it doesn't know what else is out there. If you only have one site, how will you know if it's good or bad? A spider should concentrate on what it does best: spider the web for data.

Once you collect enough data you can rank the elements/records properly -- presumably considering other data in the dataset. Assuming your ranking algorithm is a computationally "expensive", you would do it once per dataset, and do it again when the database refreshes. You don't want to do it more often than needed... so once.

Then you have your "front end" use the ranking index and pull whatever data is necessary based on the query.

This scenario is very simplified, but in the gist I think that's it.

onauc
12-15-2004, 05:29 PM
Originally posted by luki
That's a long post ;) you have... my thoughts:

1) The key is not the data collection or even the data management (those are trivial), it's the algorithm to rank the data to be able to extract parts of it that are relevant to a query. This algorithm is you "trade secret" so you just wouldn't use a ready-made script. It's just not good enough. Hence no worries about some one getting the same script and falsely boosting his rank.

2) It depends on your algorithm, but I would assume to rank a site properly you need to know other data in the dataset. Hence a spider cannot do the ranking because it doesn't know what else is out there. If you only have one site, how will you know if it's good or bad? A spider should concentrate on what it does best: spider the web for data.

Once you collect enough data you can rank the elements/records properly -- presumably considering other data in the dataset. Assuming your ranking algorithm is a computationally "expensive", you would do it once per dataset, and do it again when the database refreshes. You don't want to do it more often than needed... so once.

Then you have your "front end" use the ranking index and pull whatever data is necessary based on the query.

This scenario is very simplified, but in the gist I think that's it.

So, what's this part of the searchengine called that does the ranking ?
I guess the spider only collects data just like an offline browser to your hard-drive's certain folder called index which is really a cache and some other agent ranks and creates a 2nd index and the query interface collects results of searches from this 2nd index.
I must know what they really call this 2nd index or this agent.:D

m-b
12-15-2004, 07:12 PM
the algorithm itself is imho called "pagerank algorithm": http://www.google.com/search?hl=en&q=pagerank+algorithm&btnG=Google+Search
But this feature comes behind other problems: crawling (parallel tasks or processes, or even distributed -> knowledge on distributed computing), parsing (meta-tags if wanted) or simply striping html-tags off the document, indexing itself (e.g stemming+stop words, creating an inverted index, optimizing that index), parsing of search queries ...

I guess that google calculates page ranking parallel to the parsing-process, but handling those huge matrices (terrabyte size!) might also only be possible through slpitting the data into chunks and using distributed calculations on those chunks (might be somehow similar to the edge chasing algorithm for recognition of dead locks on distributed systems: through the network connected graphs and searching for circles in those graphs)!

Well, you've got much work coming up to you ;)
Have fun and good luck!

You might want to read this book: http://www.amazon.com/exec/obidos/tg/detail/-/1558605703/qid=1103151115/sr=8-1/ref=pd_ka_1/002-2410179-4637616?v=glance&s=books&n=507846


Michael

onauc
12-16-2004, 09:33 PM
M-B,

What do you mean by "inverted index" ?
Anyway, I thought that each keyword would have a folder in the index and each folder will list links associated with the folder keyword. This way, if you search for a keyword (eg cars) then my searchengine will only look for the links in that key-word's folder (eg cars) and grab all the links in that folder (eg cars) and then display it on your screen or the search result's page and of-course the listings of each link will be according to the ranking algorithm.
This way, the searchengine does not have to search through the whole index (terrabytes of data).
Ok, this method is ok for me to undertake providing you do a keyword search but it's not ok if you do a keyphrase search because it is obvious that I cannot create folders based on keyphrases apart from the known-ones cos I can't predict what phrases people will search. And so, I am stuck. :(

m-b
12-17-2004, 04:37 AM
well, operations based on your file system might not be performant!
but you are trying to create a normal index.

figure out a file (e.g. called index.dat) with the following structure:
word1: 1,5,12,34,67
word2: 2,4,9,12,23

this would be a normal index with the indexed words and document numbers or pages of a document where these words occur.
an inverted index would now not have the document numbers, but the differences between them stored:
word1: 1,4,7,22,33
word2: 2,2,5,3,11

this will save space, since you won't have to save document numbers like 111786096345, ... but smaller difference values!

If you are really stuck, you might want to take a look on Jakarta Lucene (http://jakarta.apache.org/lucene/docs/index.html) or Nutch (http://www.nutch.org), where all of this work is already done ...


Michael

onauc
12-17-2004, 08:25 AM
M - B,

Mmm, maybe, you can write one for me and we newbies can learn from your source-code ? ;)