CTCI: How would you design the data structures for a very large social network (Facebook, LinkedIn, etc)? Describe how you would design an algorithm to show the connection, or path, between two people (e g , Me -> Bob -> Susan -> Jason -> You).
Facebook went through a number of iterations on production before finalizing there current design. This kind of search differs from general search as it’s more personalized.
FB architecture is divided into two part – first degree and general search.
These two are executed in parallel and then aggregated.
The main focus of this post is on first degree search(friend and friends of friends(fof)).
When searching data for a new user, if they don’t have enough friends then fb goes on to search friends of friends for this case for a better result.
1. FB started with Trie DS to store the data.
In this case, for each user has, there is a Trie structure that contains all his/her friends (and objects ?). The image below for the Trie is maintained for each user(in this image for Chuck). Now we do prefix search on each of those Tries and get the result. The leave node is there friend and the path is the name (So there is a trie ds which has all my friend and friends’s object( this is needed to show – Sameer’s Photo) …??).
– Those pointers in the Tries are really big.
– Pointers point to random places: touch the graph randomly, no partition. There’s no distinction between hot and cold data. Since the data set is large and it’s not easy to shard trie it’s difficult to maintain.
2. Option 2: Sorted Vector of name, ID – Linearization of the Trie.
It’s read biased. For each of the friend, there is a sorted vector structure for his/her friends and objects. Now prefix matches becomes binary search for the range of the prefix matches. There are no random pointers now, so we can store this object with the original user as the key, and partition gets easier.
Not good at memory consumption: multiple copies of one user’s data
Scales badly when indexing more terms (multiple names for a user)
O(n) insertion: When you inserting a new data, you need to move all consequent data to give space for the new data.
3. Brute Force using Adjacency Matrix
No duplication. We save the pointer to friends. Then we pick all those friends and do a brute force for prefix search. Resource intensive. A simple variation of this is to use a bloom filter. Each user has a bloom filter which tracks as to what all prefix are present. So suppose Chuck has no friend with name starting with ‘d’ then this can be verified from bloom filter before brute force.
FB had one bloom filter with one hashing algo for friends of friends.
FB has two bloom filter with two hashing algo for objects of friends. One on the first letter of the name and the second for di-graph of the name.
For update there’s a lock on the object.