Six ways to kill your servers – learning scalability the hard way
Learning how to scale your application without any prior experience is very difficult. There are many websites devoted to the subject, but unfortunately no single solution fits every case. You still have to work out the solutions that suit your own requirements – just as I did.
Several years ago, my boss came to me and said: “We have a new project for you: take over a website that already gets one million visitors a month. Move it, and make sure the traffic can keep growing without any problems.” I was already an experienced programmer, but I had no experience whatsoever with scalability. I had to learn scalability the hard way.
The site's software was a CMS written in PHP, using MySQL and Smarty. The first step was finding a hosting company that had experience with such projects. We gave them our required configuration:
• Load balancer (with a standby)
• 2 web servers
• MySQL server (with a standby)
• Development machine
What we got (the hosting company assured us this would be enough):
• Load balancer – single core, 1 GB RAM, Pound
• 2 web servers – dual core, 4 GB RAM, Apache
• MySQL server – quad core, 8 GB RAM
• Development machine – single core, 1 GB RAM
To synchronize files between the web servers, the hosting company set up DRBD in an active-active configuration.
Finally, migration day arrived. Early in the morning we switched the domain to the new IP and started monitoring our scripts. Traffic arrived immediately and everything seemed to work well. Pages loaded quickly, MySQL handled plenty of queries, and everyone was happy.
Then suddenly the phone rang: “We cannot open the website – what is happening?” We looked at our monitoring software and saw that the servers were down and the website was offline. Naturally, the first thing we did was call the hosting company: “All our servers are down. What is happening?” They promised to check the servers and call us back. Later they called and said: “Your file system is corrupted. What did you do?” They stopped the load balancer and told me to look at one of the web servers. I opened index.php and was shocked: it contained strange fragments of C code, error messages, and something resembling log files. After a little investigation, we found that the culprit was our DRBD setup.
Lesson # 1
Put the Smarty cache on an active-active DRBD cluster under high load, and your website will go down.
While the hosting company was restoring the web servers, I rewrote part of the CMS so that the Smarty cache files were stored on the local file system. The problem was found and eliminated, and we were back online.
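The fix boiled down to pointing Smarty's working directories at local disk instead of the DRBD-replicated volume. Roughly like this (the paths are illustrative, not our actual layout; `compile_dir` and `cache_dir` are Smarty's real configuration properties):

```php
// Keep Smarty's volatile files on the LOCAL file system of each web server –
// only shared, rarely-written data belongs on the DRBD volume.
$smarty->compile_dir = '/var/local/smarty/compile'; // per-server, not replicated
$smarty->cache_dir   = '/var/local/smarty/cache';   // per-server, not replicated
```

Since each server renders its own cache anyway, nothing is lost by not replicating these files.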
By now it was early in the day. Typically the traffic peak starts in the late afternoon and lasts into the early evening; at night there were hardly any visitors. We went back to watching the system. The website was serving pages, but as peak time approached the load climbed and responses slowed. I increased the lifetime of the Smarty cache, hoping it would help, but it did not. Soon the servers began returning timeout errors or blank pages. Two web servers simply could not handle the load.
Our client was nervous, but he understood that moving a website brings some problems with it.
We had to reduce the load somehow, and we discussed it with the hosting company. One of their administrators made a good suggestion: “Your servers currently run Apache + mod_php. Would you consider moving to Lighttpd? It is a small project, but even Wikipedia uses it.” We agreed.
Lesson # 2
Deploy an “out of the box” web server, configure nothing, and your website will go down.
The administrator reconfigured our servers as quickly as he could. He dropped Apache entirely and moved to a Lighttpd + FastCGI + XCache setup. How long would the servers last this time?
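The key difference from stock Apache + mod_php is that Lighttpd talks to a fixed pool of PHP FastCGI processes instead of embedding PHP in every worker. A minimal illustrative fragment of such a setup (not our exact production config; paths and pool sizes are assumptions):

```
# lighttpd.conf – illustrative fragment
server.modules += ( "mod_fastcgi" )

fastcgi.server = ( ".php" =>
  (( "bin-path" => "/usr/bin/php-cgi",
     "socket"   => "/tmp/php-fastcgi.socket",
     "max-procs" => 4,   # fixed pool: memory use stays bounded under load
     "bin-environment" => ( "PHP_FCGI_CHILDREN" => "16" )
  ))
)
```

XCache is then enabled on the PHP side (as a php.ini extension), so scripts are compiled once and served from the opcode cache.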
Surprisingly, the servers held up well. The load was lower than before and average response times were good. We decided to go home and get some rest, since it was already late.
The servers handled the load reasonably well over the following days, but at peak times they were close to collapse. We found that the bottleneck was MySQL, and once again we called the hosting company. They advised us to set up MySQL master-slave replication, with a slave running on each web server.
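On the MySQL side, master-slave replication needs only a few settings: binary logging and a unique server ID on the master, and a unique ID on each slave. An illustrative sketch (not our actual config files):

```
# master my.cnf (illustrative)
[mysqld]
server-id = 1
log-bin   = mysql-bin     # writes are recorded here and shipped to slaves

# slave my.cnf, one per web server (illustrative)
[mysqld]
server-id = 2             # must be unique across the whole topology
read-only = 1             # the application must never write to a slave
```

Each slave is then attached with `CHANGE MASTER TO ...` and `START SLAVE`; the harder part, as it turned out, was on the application side.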
Lesson # 3
Even a powerful database server has its limits – and when they are reached, your website will go down.
This problem was not so easy to fix. The CMS was very simplistic in this respect, and there was no way to route SQL queries separately. Modifying it took some time, but the result was worth it.
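The modification amounts to routing read queries to the local slave and everything else to the master, since replication only flows one way. A minimal sketch of the idea in Python (our CMS was PHP; all names here are illustrative):

```python
class QueryRouter:
    """Route read-only queries to a local slave, everything else to the master."""

    READ_PREFIXES = ("SELECT", "SHOW", "DESCRIBE", "EXPLAIN")

    def __init__(self, master, slave):
        self.master = master  # write connection (single source of truth)
        self.slave = slave    # local replica on this web server, read-only

    def connection_for(self, sql):
        # Writes (INSERT/UPDATE/DELETE/DDL) must always hit the master;
        # replication then copies them out to every slave.
        if sql.lstrip().upper().startswith(self.READ_PREFIXES):
            return self.slave
        return self.master

router = QueryRouter(master="db-master", slave="localhost")
```

A real implementation also has to handle replication lag (e.g. reading your own just-written row), which this sketch ignores.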
MySQL replication turned out to be a true miracle, and the website was finally stable. Over the following weeks the website grew in popularity and the number of users kept rising. It was only a matter of time before traffic would once again exceed our resources.
Lesson # 4
Plan nothing in advance, and sooner or later your website will go down.
Fortunately, we kept observing and planning. We optimized the code, reduced the number of SQL queries, and then discovered memcached. At first I added memcached to a few of the most heavily used core functions, which was hard work. When we pushed the changes to production we could not believe the results – it felt like finding the Holy Grail. We cut the number of queries per second by at least 50%. Instead of buying another web server, we decided to lean on memcached.
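The pattern behind those numbers is the classic read-through cache: check memcached first, fall back to MySQL only on a miss, and store the result with a TTL. A minimal sketch in Python, with an in-memory stand-in for the memcached client (our real code was PHP; the helper names are illustrative):

```python
import hashlib

class FakeMemcached:
    """In-memory stand-in for a memcached client (get/set subset, TTL ignored)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value, ttl=300):
        self.store[key] = value

def cached_query(cache, sql, run_query, ttl=300):
    # Key the cache on a digest of the SQL text: memcached keys must be
    # short and contain no spaces, so hashing is the usual choice.
    key = "sql:" + hashlib.md5(sql.encode()).hexdigest()
    result = cache.get(key)
    if result is None:               # miss: hit the database once...
        result = run_query(sql)
        cache.set(key, result, ttl)  # ...and keep it for the next ttl seconds
    return result

cache = FakeMemcached()
calls = []
fetch = lambda sql: calls.append(sql) or ["row1", "row2"]
cached_query(cache, "SELECT * FROM news", fetch)  # miss: queries the database
cached_query(cache, "SELECT * FROM news", fetch)  # hit: served from the cache
```

Every hit is a SQL query that never reaches MySQL, which is exactly where the 50% reduction came from.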
Lesson # 5
Cache nothing, spend the money on new hardware instead, and your website will go down.
memcached helped us reduce the load on MySQL by 70-80%, which gave a huge performance boost. Pages loaded faster.
At last our configuration seemed ideal. Even at peak times we no longer had to worry about outages or long response times. But then one of the web servers started causing trouble – error messages, blank pages, and so on. The load was normal, and most of the time the server ran fine – but only “most of the time”.
Lesson # 6
Put a few hundred thousand small files into a single directory, forget about inodes, and your website will go down.
We had been so busy optimizing MySQL, PHP and the web servers that we had not paid enough attention to the file system. The Smarty cache lived on the local file system, with every file in the same directory. The solution was to move the Smarty cache onto a separate partition formatted with ReiserFS, and we also enabled Smarty's ‘use_subdirs’ option.
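What ‘use_subdirs’ does, in essence, is spread the cache files across nested subdirectories so that no single directory accumulates hundreds of thousands of entries. A sketch of the idea in Python (the hashing scheme here is illustrative, not Smarty's actual one):

```python
import hashlib
import os

def cache_path(base_dir, key, levels=2):
    """Spread cache files over nested subdirectories by hash prefix,
    so no single directory ends up with hundreds of thousands of files."""
    digest = hashlib.md5(key.encode()).hexdigest()
    # Two hex characters per level: each directory holds at most 256 subdirs.
    parts = [digest[i * 2:(i + 1) * 2] for i in range(levels)]
    return os.path.join(base_dir, *parts, digest + ".cache")

path = cache_path("/var/cache/smarty", "homepage|de")
```

Lookups in a directory with a handful of entries stay cheap, whereas a flat directory of this size degrades badly on many file systems.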
Over the following years we kept optimizing: we moved the Smarty cache into memcached, set up Varnish to reduce the load on the I/O subsystem, switched to Nginx (Lighttpd was randomly returning 500 errors), bought better hardware, and so on.
Scaling a website is an endless process. As soon as you fix one bottleneck, you will most likely run into the next. Never think, “That's it, we're done.” That attitude will ruin your servers, and possibly your business. Optimization is an ongoing process. If you cannot do the work yourself for lack of expertise or resources, find a competent partner to work with. And never stop discussing current issues – and the problems that may arise in the future – with your team and your partners.
About the author: Steffen Konerow, High Performance Blog.
Translated from another source.