10:16 – Vertical Scaling – Instead of having all services on a single box – mail server, db server, application server – we move each one to its own specialized server.
This allows for task-based specialization and tuning, since each kind of server requires a different type of tuning. Tuning a web server (needs more processing power and TCP/IP tuning, since it serves many requests) is different from tuning a db server (needs more RAM).
So we can individually optimize each box for the task it needs to perform.
This gives only finite scalability and is not economical beyond a point. What we discussed here is vertical partitioning at the services layer.
14:07 – Horizontal Scaling
In horizontal scaling we have homogeneous nodes, unlike vertical partitioning where the nodes are heterogeneous – each node runs the same copy of the app server and performs the same tasks – the nodes are basically identical.
15:34 – Load Balancer
Two types – hardware (faster) and software. Both are now about equally configurable.
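As a rough sketch of what a software balancer does at its simplest – rotating requests across identical nodes – here's a round-robin picker (the class name and node addresses are made up for illustration):

```python
from itertools import cycle

class RoundRobinBalancer:
    """Rotate incoming requests evenly across a fixed pool of app servers."""
    def __init__(self, nodes):
        self._pool = cycle(nodes)

    def pick_node(self):
        return next(self._pool)

balancer = RoundRobinBalancer(["app1:8080", "app2:8080", "app3:8080"])
for _ in range(6):
    print(balancer.pick_node())  # app1, app2, app3, app1, app2, app3
```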
17:00 – Sticky Session
If we have multiple nodes and each has its own session management, then if one request goes to node 1 and the next goes to node 2, there's no way to maintain the session. With sticky sessions, the load balancer routes all requests from a given client to the same node.
Cons – creates an asymmetrical load distribution.
If an app server goes down for some time, all its requests get redirected to the other servers, and because sessions are sticky, all further requests keep going to those servers even after the failed server comes back up. So load distribution becomes uneven.
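Sticky routing is typically done by hashing a stable client key (a session cookie or source IP) to a node. A minimal sketch, assuming a session-id cookie and a fixed node list:

```python
import hashlib

NODES = ["app1", "app2", "app3"]

def sticky_node(session_id: str, nodes=NODES) -> str:
    """Pin a session to one node via a stable hash of its id."""
    digest = hashlib.md5(session_id.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

print(sticky_node("sess-42"))  # same node on every call while the pool is unchanged
```

Note that if a node drops out of the pool, the modulo remaps most sessions at once – which is exactly how the uneven distribution above creeps in.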
19:20 – Central Session Storage – can be an NFS mount or an RDBMS. Now the load balancer can send a request to any app server, since all of them use the same central session store. But this store becomes a single point of failure.
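A sketch of the RDBMS option: every app server reads and writes one shared sessions table. Here sqlite stands in for the central db, and the table/column names are invented:

```python
import json, sqlite3

# in production this would be a shared db server or NFS mount, not a local file
conn = sqlite3.connect("sessions.db")
conn.execute("CREATE TABLE IF NOT EXISTS sessions (sid TEXT PRIMARY KEY, data TEXT)")

def save_session(sid: str, data: dict) -> None:
    conn.execute("INSERT OR REPLACE INTO sessions VALUES (?, ?)", (sid, json.dumps(data)))
    conn.commit()

def load_session(sid: str) -> dict:
    row = conn.execute("SELECT data FROM sessions WHERE sid = ?", (sid,)).fetchone()
    return json.loads(row[0]) if row else {}

save_session("abc", {"user_id": 42})
print(load_session("abc"))  # any app server sees the same session data
```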
21:00 – Clustered Session Management – peer-to-peer topology. Each app server is a peer. Whenever a session is created/updated, the peer broadcasts the change to all the other peers to bring them up to date – mutual message passing between the servers.
Pro – no central session store
Con – the number of messages passed b/w servers can very quickly increase as we go on adding more nodes (each update triggers a message to every other peer, so total traffic grows roughly with the square of the node count).
Rare situation – if the message passing is slow, a new request from the user may arrive before the session data has been updated across all nodes. This can make a valid session look invalid, or an app server may work with old session data. This mostly happens with automated clients, since a human user's requests can't come that fast.
(This is what Reddit follows. For the same page, different users might see a different upvote count for the same article. When there's an upvote/downvote, the count is updated on a single session server and then propagated to the other servers. If the page is loaded before this data has replicated everywhere, the serving app server uses the session data in its own cluster, which is stale.)
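A toy model of the peer-to-peer broadcast, with in-process objects standing in for real servers (real implementations go over the network and can lag, which is exactly the stale-read case above):

```python
class AppServer:
    """Each peer holds a full session copy and broadcasts every update to the others."""
    def __init__(self, name):
        self.name = name
        self.sessions = {}
        self.peers = []

    def update_session(self, sid, data):
        self.sessions[sid] = data
        for peer in self.peers:  # one message per peer, so traffic grows fast with node count
            peer.receive(sid, data)

    def receive(self, sid, data):
        self.sessions[sid] = data

a, b, c = AppServer("a"), AppServer("b"), AppServer("c")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
a.update_session("s1", {"upvotes": 10})
print(c.sessions["s1"])  # {'upvotes': 10} once the broadcast has landed
```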
23:15 – Sticky session with a central session store / Sticky session with a clustered session
26:00 – Load Balancer
Active-Passive Load Balancer – one is active, the other passive. The passive one keeps checking whether the active one is still up. If the active one fails, the passive load balancer takes over.
Active-Active Load Balancer – both are serving a certain number of app servers. If one goes down, the other handles all the requests, so the load on the surviving balancer increases. So when we begin with this configuration, we should make sure both balancers run at less than 50% utilization – so either can absorb the other's load when it fails.
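A rough sketch of the active-passive heartbeat: the passive balancer polls the active one and takes over after repeated failures. The health URL and thresholds are assumptions for illustration:

```python
import time
import urllib.request

ACTIVE_HEALTH_URL = "http://active-lb.internal/health"  # hypothetical health endpoint
FAILURES_BEFORE_TAKEOVER = 3

def active_is_healthy() -> bool:
    try:
        with urllib.request.urlopen(ACTIVE_HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def passive_loop() -> None:
    failures = 0
    while True:
        failures = 0 if active_is_healthy() else failures + 1
        if failures >= FAILURES_BEFORE_TAKEOVER:
            print("active balancer is down - taking over")
            break  # real setups move a floating IP here (e.g. keepalived/VRRP)
        time.sleep(1)
```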
29:00 – Database Server –
Common Mistake – assuming the problem is that disk/RAM capacity is being consumed, and directly increasing RAM and CPU – but this doesn't solve the problem. The real problem is heavy disk I/O – the disk can't keep up with the CPU – meaning the DB is not able to read and write at a rate that matches the rate of incoming requests.
So we move the data to a SAN. SANs can churn out data at a very fast rate, so we can utilize the CPU to a much better extent. SANs are not very expensive these days.
32:00 – Horizontal Scaling for the database
36:00 – Replication
Master – Slave Replication – writes go to the master; reads can be served by the slaves.
Master – Master Replication – both masters accept writes, so conflict resolution is needed.
The actual replication can happen async or sync.
Pros and cons of both –
async – fast response: once the write is done on the master, the response is sent. We need to take care of critical reads (e.g. after a payment is written to the master db, the page showing the payment detail should read from the master, since a slave may not have the write yet).
sync – slow, as the writes to all the slave dbs must complete before the response is sent to the client.
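With async replication the app has to route reads deliberately. A sketch of that read-routing idea (the connection strings are placeholders):

```python
import random

MASTER = "db-master:5432"  # placeholder connection names
REPLICAS = ["db-replica1:5432", "db-replica2:5432"]

def pick_db(critical_read: bool) -> str:
    """Critical reads (e.g. the payment-detail page right after a payment write)
    must not see replication lag, so they go to the master."""
    return MASTER if critical_read else random.choice(REPLICAS)

print(pick_db(critical_read=True))   # db-master:5432
print(pick_db(critical_read=False))  # one of the replicas
```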
This shared-nothing replication setup generally hits a limit. It basically works when there are many more reads than writes. Once we hit the vertical scaling limit of the database, we move to partitioning.
52:00 – Horizontal and Vertical Partitioning of the database
In vertical partitioning we split the tables across two nodes. So if we have 20 tables – 10 tables on each node. There will be a proxy which handles the routing internally.
In horizontal partitioning we slice a table horizontally. We take the data from a table and spread it across a few clusters. So when querying for the data, we need code at the app level / db level to decide which cluster to query. Con – SQL unions across clusters will not be straightforward. Some drivers support this, but we might need to do it at the app level.
How can we divide data across these clusters? This can be done in multiple ways: e.g. the first one million users go to the first cluster and so on, or we can use a hash-based function to decide which cluster a row goes to (both sketched below).
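Both placement schemes, sketched. The cluster count and per-cluster sizing here are illustrative:

```python
import hashlib

NUM_CLUSTERS = 4
USERS_PER_CLUSTER = 1_000_000

def range_shard(user_id: int) -> int:
    """First million users -> cluster 0, next million -> cluster 1, and so on."""
    return user_id // USERS_PER_CLUSTER

def hash_shard(user_id: int) -> int:
    """A stable hash spreads users evenly regardless of signup order."""
    digest = hashlib.sha1(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_CLUSTERS

print(range_shard(1_500_000))  # 1
print(hash_shard(1_500_000))   # some cluster in 0..3, the same on every run
```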
Do note that this is not Table Partitioning, which partitions data into multiple tables in the same database on the same node – e.g. a separate points table for users for each month.
Each of these clusters can handle a specific number of users, e.g. 1 million.
So what we do now is combine all the above components into a single set. One set contains a load balancer, db server, session management server, app server, mail server and so forth. Each component is configured to avoid a single point of failure, with clustering and so on. This one set is configured to handle 1 million users. When our user base grows, instead of adding more db servers and so on, we create a new set that handles the next million users.
This also has problems. If we need to generate reports, we need special code to pull data for users on the other sets (see the sketch below). And if the sizing is not right, migrating data across sets is a pain.
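The reporting pain is essentially a scatter-gather: every set must be queried and the results merged by hand. A sketch with hypothetical set endpoints and metric:

```python
SETS = ["set1.internal", "set2.internal", "set3.internal"]  # one self-contained stack per million users

def active_users_on(set_host: str) -> int:
    return 0  # placeholder: in reality this queries the db inside that set

def total_active_users() -> int:
    # no single db holds everyone, so the report must fan out to every set and merge
    return sum(active_users_on(host) for host in SETS)
```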
1:03 – Adding caching. We can cache the user object and even cache session data.
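A minimal cache-aside sketch for the user object, with an in-process dict standing in for memcached/redis:

```python
cache = {}  # stand-in for a real cache server

def fetch_user_from_db(user_id):
    return {"id": user_id, "name": f"user-{user_id}"}  # placeholder db query

def get_user(user_id):
    """Cache-aside: check the cache first, fall back to the db, then populate."""
    if user_id not in cache:
        cache[user_id] = fetch_user_from_db(user_id)  # miss: one db hit
    return cache[user_id]

get_user(7)  # miss - goes to the "db"
get_user(7)  # hit  - served from memory
```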
1:05 – Adding an HTTP accelerator in front of the web server – a reverse proxy. E.g. for static content, the heavier server (Tomcat, Apache) hands off to a lighter web server (which has a smaller memory requirement).
Choosing the correct server – blocking vs non-blocking.
1:08 – A few things to use –
IP Routing Table – this will redirect the request to the nearest node
CDNs like Akamai
Async non-blocking IO
Grid computing and parallel processing across nodes
HTTP accelerators (like Varnish)
In horizontal partitioning we split data across multiple servers, but there will be a lot of static data (like config) which we need on all nodes. Such data can live in a central cluster rather than in just one cluster.
Adding queues for async tasks (sketched below).
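A stdlib sketch of the idea: the request path just enqueues work and returns, while a background worker drains the queue (mail sending is only an example task):

```python
import queue, threading

tasks = queue.Queue()

def worker():
    while True:
        task = tasks.get()
        print(f"sending mail to {task}")  # the slow work happens off the request path
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

# the web request just enqueues and returns immediately
tasks.put("user@example.com")
tasks.join()
```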