We start with a simple question that is also the toughest of all time: why is the website slow?
– Slow MySQL queries (too many queries; MySQL was out of capacity). Vertical scaling had reached its limit.
– Scale out with a master-slave configuration: the master handles writes and the slaves handle reads, which distributes the read load.
– They separated all critical writes and reads from non-critical reads. For example, when an order is placed it is written to the master. If the order confirmation page is then loaded before the slave DB has been updated, the page cannot show the order. That makes it a critical read. These critical reads are separated in the data layer so that, if master-to-slave replication is slow, the important details are still shown to the user.
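A minimal sketch of such a data layer, assuming that critical reads are served from the master (the talk only says they were separated, so this routing rule is an assumption). `FakeDb`, `DataLayer`, and the SQL strings are hypothetical names used for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDb:
    """Hypothetical stand-in for a DB connection; it only records queries."""
    name: str
    queries: list = field(default_factory=list)

    def execute(self, sql: str) -> str:
        self.queries.append(sql)
        return f"{self.name}: {sql}"


class DataLayer:
    """Sends writes and critical reads to the master, other reads to a slave."""

    def __init__(self, master: FakeDb, slave: FakeDb):
        self.master = master
        self.slave = slave

    def write(self, sql: str) -> str:
        return self.master.execute(sql)

    def read(self, sql: str, critical: bool = False) -> str:
        # Assumption: a critical read must see the latest write, so it
        # bypasses the possibly lagging slave and goes to the master.
        target = self.master if critical else self.slave
        return target.execute(sql)


master, slave = FakeDb("master"), FakeDb("slave")
db = DataLayer(master, slave)
db.write("INSERT INTO orders (id) VALUES (1)")
confirmation = db.read("SELECT * FROM orders WHERE id = 1", critical=True)
listing = db.read("SELECT * FROM products")
```

Marking the read as critical at the call site keeps the routing decision in one place, so the order confirmation page does not need to know about replication lag.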
– A lot of analytics queries added heavy load to the slave (read-only) DB, so analytics got a separate server. This also allows us to tune that DB for reads, for example by using the MyISAM storage engine, which they considered better suited to reads, and by adding extra indexes that were not required on the prod DB. So we should always separate the user-facing DB from the analytical DB.
– Started moving towards a Service Oriented Architecture (SOA).
– The website was slow again, so they added a lot of logging to monitor memory and time. This logging started taking a toll on website performance. All the responses were being written to a single file, and as the number of concurrent users grew, so did the number of writes. Every process tried to write to the same file, so many processes had to wait until they got hold of it. They attributed this to the way PHP works: only one process could write to the file at a time, so the others had to wait. (Currently at Perk we use LogStash for logging.)
– Since they had moved to an SOA, each request needed to connect to a number of services, around 5 to 10. PHP does not support multi-threading and does not allow maintaining a thread pool. So for each request we had to create new connections, and the calls were made serially (no multi-threading).
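To illustrate why serial calls hurt, here is a rough sketch in Python (not PHP) that compares calling five services one after another with calling them from a thread pool. The service names and the 50 ms delay are made up, and `call_service` is a hypothetical stand-in for a real network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical service names; each call is simulated with a short sleep.
SERVICES = ["pricing", "inventory", "reviews", "recommendations", "shipping"]


def call_service(name: str) -> str:
    time.sleep(0.05)  # pretend this is network latency
    return f"{name}: ok"


def fetch_serially(names):
    # Total time is roughly the sum of all the call latencies.
    return [call_service(n) for n in names]


def fetch_concurrently(names):
    # Total time is roughly the latency of the slowest call.
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        return list(pool.map(call_service, names))


start = time.perf_counter()
serial_results = fetch_serially(SERVICES)
serial_elapsed = time.perf_counter() - start

start = time.perf_counter()
concurrent_results = fetch_concurrently(SERVICES)
concurrent_elapsed = time.perf_counter() - start
```

With 5 to 10 services per request, the serial version pays the sum of all latencies on every page load, which is the cost described above.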
– To fix the logging issues they wrote a Java middleware for logging. PHP hands the logging data over to the Java process, which handles it, so the PHP process is no longer blocked by logging.
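The same hand-off idea can be sketched inside a single Python process with the standard library's `QueueHandler` and `QueueListener`. This is only an analogue of their Java middleware, not their implementation: the request code puts the record on a queue and returns, and a background thread does the slow writing. `ListHandler` is a hypothetical handler that stands in for the real log sink.

```python
import logging
import logging.handlers
import queue

log_queue = queue.Queue()
records = []


class ListHandler(logging.Handler):
    """Hypothetical sink; a real one would write to a file or a log service."""

    def emit(self, record):
        records.append(record.getMessage())


# The listener drains the queue on its own thread.
listener = logging.handlers.QueueListener(log_queue, ListHandler())
listener.start()

logger = logging.getLogger("request")
logger.setLevel(logging.INFO)
logger.propagate = False
# The request path only enqueues the record, so it never waits on the sink.
logger.addHandler(logging.handlers.QueueHandler(log_queue))

logger.info("GET /product/42 took %d ms", 120)

listener.stop()  # processes whatever is still queued, then joins the thread
```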
– They deploy new features to only 10% of the users and then monitor the metrics. A drastic change in the metrics suggests that the feature may have issues.
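One common way to pick the 10% is to hash the user ID into a stable bucket. The talk does not say how they selected users, so the sketch below is an assumption; `in_rollout` and the feature name are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls inside the rollout percentage."""
    # Hashing feature and user together gives each feature its own,
    # stable assignment of users to buckets 0..99.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


users = [f"user-{i}" for i in range(10_000)]
enabled = [u for u in users if in_rollout(u, "new-checkout", 10)]
share = len(enabled) / len(users)  # close to 0.10 for a large user base
```

Because the bucket depends only on the user and the feature, a user sees the same variant on every request, which keeps the monitored metrics comparable.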
– They use APC cache. They have issues with cache invalidation.
– They were using Thrift for serialization and de-serialization. At this point we always need to consider the speed of serialization and de-serialization. For example, PHP's native serialization and de-serialization will be faster (as expected).
– They started using memcache. They would serialize the object and save it.
On a cache miss, they had to make a network call to fetch the data for the product (from the CMS), de-serialize it, and then serialize it again for the cache. This added more complexity.
So they were facing the issue of cache invalidation. To handle this, they set the TTL of each object to infinite and expired it only when the object was updated in the CMS.
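A small sketch of this pattern, assuming a cache-aside read path: entries never expire on their own and are dropped only by an explicit invalidation. A plain dict stands in for memcache, `pickle` stands in for their serializer, and `ProductCache`, the product data, and the `cms` dict are all hypothetical.

```python
import pickle


class ProductCache:
    """Cache with no TTL; entries leave only through invalidate()."""

    def __init__(self, fetch_from_cms):
        self._store = {}  # stand-in for memcache, holds serialized bytes
        self._fetch = fetch_from_cms
        self.misses = 0

    def get(self, product_id):
        blob = self._store.get(product_id)
        if blob is None:
            # Cache miss: fetch from the CMS, then serialize and store.
            self.misses += 1
            product = self._fetch(product_id)
            self._store[product_id] = pickle.dumps(product)
            return product
        return pickle.loads(blob)

    def invalidate(self, product_id):
        # Called when the CMS reports that the object changed.
        self._store.pop(product_id, None)


cms = {42: {"name": "Kettle", "price": 30}}  # made-up sample data
cache = ProductCache(lambda pid: dict(cms[pid]))

first = cache.get(42)    # miss, loaded from the CMS
second = cache.get(42)   # hit
cms[42]["price"] = 25    # the CMS changes, the cache does not know yet
stale = cache.get(42)    # still the old price
cache.invalidate(42)
fresh = cache.get(42)    # miss again, now the new price
```

The `stale` read shows why the invalidation signal matters: with an infinite TTL, nothing else will ever refresh the entry.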
– They had a notification service that indicated which objects had been updated. This service had a number of endpoints to invalidate and update the cache.
What happens if the notification service fails? (The object would never be updated.) Their notification service handled failure gracefully: it retried a specified number of times and after that raised an issue (this can be handled any way we want, for example by writing to a DB or sending out an email).
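The retry-then-escalate behaviour can be sketched as follows. The function name, the default of three attempts, and the use of `ConnectionError` are assumptions for illustration; a real version would usually also wait between attempts, which is left out here.

```python
def notify_with_retry(invalidate, object_id, max_attempts=3, on_give_up=None):
    """Call invalidate(object_id), retrying on connection errors.

    Returns True on success. After max_attempts failures it calls
    on_give_up(object_id), if given, and returns False.
    """
    for _ in range(max_attempts):
        try:
            invalidate(object_id)
            return True
        except ConnectionError:
            continue  # try again until the attempts run out
    if on_give_up is not None:
        # Escalate: this could write to a DB or send an email.
        on_give_up(object_id)
    return False


calls = {"count": 0}


def flaky_endpoint(object_id):
    # Hypothetical endpoint that fails twice and then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("cache endpoint unavailable")


def dead_endpoint(object_id):
    # Hypothetical endpoint that never recovers.
    raise ConnectionError("cache endpoint unavailable")


alerts = []
recovered = notify_with_retry(flaky_endpoint, "product-42")
gave_up = notify_with_retry(dead_endpoint, "product-43", on_give_up=alerts.append)
```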
– Timeouts: set a timeout for every service call.
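As a rough illustration of what a timeout buys, the sketch below calls a local server that accepts the connection but never answers. Without the timeout the caller would block indefinitely; with it, the call gives up and returns `None`. `call_with_timeout` and the 0.2 second limit are hypothetical choices.

```python
import socket


def call_with_timeout(host, port, timeout_s):
    """Send a request and wait for a reply, but never longer than timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"ping")
            return conn.recv(1024)
    except socket.timeout:
        # The service did not answer in time; the caller can fall back.
        return None


# A local server that listens but never replies, to simulate a hung service.
server = socket.socket()
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
host, port = server.getsockname()

result = call_with_timeout(host, port, 0.2)
server.close()
```

When a request fans out to 5 to 10 services, one hung service without a timeout is enough to hang the whole page.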
– Load parts of the page asynchronously via AJAX calls. For example, on the home page or a product page we can load the product details first. Once that is done, make an AJAX call to load non-critical data such as recommendations. This improves the user experience.