Need to design a small search engine something like Solr or ElasticSearch.
We need to create our own inverted index and store it(Trie/B+Tree/Hashing).
We have already covered the part of designing a web crawler. That will fetch documents which we now need to index and search.
This indexing and searching is also a part of Type Ahead Search (General).
We will discuss how to maintain the inverted index (not focussing on how to create these).
So now we have the posting and documents ids and frequency of each token like
token fre docId
queen 23 8711
Now we need to use proper DS to save this data. Below are a few proposed solutions –
1. We could maintain a hash table with tokens as key and the frequency and docId as value. If the hash gets too big we could use a consistence hashing to distributed the key across several machines.
2. We could also use a sorted set. All the tokens will be sorted lexicographically. Whenever we need to find a word we could do a binary search as the data is sorted. This data could also get too big to fit in memory and so we can distributed it across several machines. We could again use a consistence hashing algo to distribute the load where each set will be stored.
The key in set will be token and the values will be fre and the docId. Now when we search we will find the data in set using binary search.
– If we need to search for multiple full word search like “arts and delhi” for example.
We first do a search for arts and then a search for delhi. Then we need to find the intersection of these two result set.
– We also maintain a the frequency each word in a hash table. This sometimes is useful when we need to apply intersection.
3. For partial text search we could also maintain a Trie. This work but requires a lot of space for saving the extra pointers.
4. B+ Tree – This is by far the most efficient method to store and query the data that I found. B+ is used in DMBS for fast lookup.
In a B+ tree all the data resides in the leaf nodes. With a depth of just 4, these tree can store upto 5 billion records.
Sharding a b+ tree also seems simpler. The root node can be in the main memory and the leaf node can be in the disk.
Let’s first remember the query types. Our search engine is going to answer 3 types of queries that we generally use while searching.
1) One Word Queries (OWQ): OWQ consist of only a single word. Such as computer, or university. The matching documents are the ones containing the single query term.
2) Free Text Queries (FTQ): FTQ contain sequence of words separated by space like an actual sentence. Such as computer science, or Brown University. The matching documents are the ones that contain any of the query terms.
3) Phrase Queries (PQ): PQ also contain sequence of words just like FTQ, but they are typed within double quotes. The meaning is, we want to see all query terms in the matching documents, and exactly in the order specified. Such as “Turing Award”, or “information retrieval and web search”.
The input in PQ is again a sequence of words, and the matching documents are the ones that contain all query terms in the specified order. We will now use the positional information about the terms which we didn’t need to use in OWQ and FTQ. The implementation is as follows. First we need the documents that all query terms appear in. We will again get the list of documents for each query term as we did in FTQ, but now we will take the intersection of them instead of union. Because we want the documents that all query terms appear in, instead of the documents that any query term appears. Then, we should check whether they are in correct order or not. This is the tricky part. For each document that contains all query terms, we should do the following operations. Get positions of the query terms in the current document, and put each to a separate list. So, if there are n query terms, there will be n lists where each list contains the positions of the corresponding query term in the current document. Leave the position list of the 1st query term as it is, subtract 1 from each element of the 2nd position list, subtract 2 from each element of the 3rd position list, …, subtract n-1 from each element of nth position list. Then intersect all the position lists. If the result of the intersection is non-empty, then all query terms appear in the current document in correct order, meaning this is a matching document. Perform these operations on all documents that every query term appears in. The matching documents are the ones that have non-empty intersection. Why does this algorithm work? Because, for each query term, we check whether it appears right after the previous query terms in the document. And we do this in an elegant way. Additionally, there is an optimization while performing position list intersections. We can start intersecting from smaller lists, because the size of the final intersection is always less than or equal to the size of the smallest list. Therefore, we can short-circuit whenever the intermediate intersection result becomes an empty list, obtaining an efficient algorithm.
An example will make everything clear. Let’s say we search for “computer science department”. We first get the document list of all query terms, as we did in FTQ: computer: [1, 2, 3], science: [1, 2, 3], and department: [1, 2]. Then we intersect all these lists to get the documents that contain all query terms, which is [1, 2]. Next, we should check whether the order is correct or not. First, we get the postings list of the query terms for document 1. Which is computer: [1, [2, 5]], science: [1, ], and department: [1, [4, 6]. Then, we extract the positions of each query term, and put them in separate lists, resulting in [ [2, 5], , [4, 6] ]. Each list corresponds to the positional information of a query term. We don’t touch the first list, but subtract i-1 from the elements in the ith list, resulting in [ [2, 5], , [2, 4] ]. Finally, we take the intersection of the lists, which is . Since the intersection is not empty, we conclude that document 1 is a matching document. Next, we check document 2. Get the positions of query terms and put them to separate lists as before: [ [1, 7], [2, 5], [0, 6] ]. Perform the subtractions: [ [1, 7], [1, 4], [-2, 4] ]. And take the intersection: . The result of the intersection is empty, meaning the query terms don’t appear in the correct order, so this is not a matching document. There is no more document that contains all query terms. So, the result of the phrase query is document 1 only: .