MySQL: Execution Cannot Be Pardoned
The "Eldorado" website and online store handles about 40 thousand purchases every day. There is probably no need to explain what that means for the company's business.
Historically, the store runs on the Bitrix engine with a huge amount of custom code and add-ons. The storage is a MySQL cluster with four master servers.
A significant number of companies run monolithic applications, and many have to live with them. There are plenty of ways to deal with a monolith, but, unfortunately, few people write about the successful ones. I hope the story of how we prop up our monolith (while sawing it apart) will be of interest to you.
We are well aware that a monolithic architecture brings plenty of problems, but destroying it by a simple act of will is impossible: sales are ongoing, the site must keep working, and changes must not be radical for users. So the transition from a monolith to a set of microservices takes time, during which we need to hold out: keep the system operational and resilient under load.
What was the problem

For a very long time, the database cluster behind the eldorado.ru website was built according to the following scheme:
All masters in this scheme run simultaneously, all in active mode, sequentially replaying the replication stream: ... M1 -> M2 -> M3 -> M4 -> M1 -> M2 -> M3 -> M4 -> M1 -> M2 ...
In our setup, this configuration had exactly one advantage: it allowed the system to work and carry its load. The application's request balancer, after any update, switches the entire read stream to that master to ensure consistency, and a single master was at one time not enough to hold the entire read stream.
But such a scheme could provide neither reliability nor speed. Although it looked simple, it had a number of flaws. It was very slow to propagate data updates across the cluster: in the worst case there were up to five replication hops (depending on which master the change originated on). These delays caused many problems both in the operation of the site and in placing orders in the online store.
Cons of this scheme:
Slaves in the zone farthest from the active master receive updates, in the worst case, only after four times the transaction execution time; replication delays were sometimes enormous;
Any failure on any of the masters leads to data inconsistency throughout the cluster until it is eliminated;
In some hardware-failure scenarios, restoring the ring without data loss is close to impossible (dumping and restoring from a backup takes many hours);
Even a planned shutdown of any master automatically requires either removing the slaves from the load or switching them to another master;
There are no tools that work well in such a topology (and it is good that there are none: do not use it);
Any heavy UPDATE / DELETE (and, under certain circumstances, even SELECT) blocks replication across the entire ring for as long as it runs;
Until a transaction completes, a slave does not yet know that it is lagging behind: Seconds_Behind_Master in its slave status does not reflect the delay.
You can emulate this behavior in your test environments by enabling an artificial replication delay of 1-2 seconds on the slaves (which we did) using the MASTER_DELAY = N option: a great way to test your application's readiness for such distributed architectures.
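Such a delayed slave can be set up with a couple of statements; a minimal sketch (the 2-second delay is the only value from our setup, everything else here is illustrative):

```shell
# On a test slave: pause replication, add an artificial 2-second lag, resume.
# MASTER_DELAY is available from MySQL 5.6 and is measured in seconds.
mysql -e "STOP SLAVE;"
mysql -e "CHANGE MASTER TO MASTER_DELAY = 2;"
mysql -e "START SLAVE;"
# SQL_Delay in the slave status should now show 2:
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'SQL_Delay|SQL_Remaining_Delay'
```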
And, finally, switching to another database in our case is not an option, because the system is too large-scale and much in it is tied to the use of MySQL features and even to the nuances of its internal query optimizer.
How we solved it

We did not want to change the scheme on our own (a risky operation) and started by looking for a consulting company that could propose and deploy a new architecture, doing it so that the site stayed available and the switch went unnoticed. Among the candidates were the largest integrators and software developers.
Some companies simply did not answer us (and this is normal), while others wrote that they were not ready to take on such a task. At the same time, the question of the possible cost of the project did not even arise.
If the big integrators would not get involved with the task, solving it ourselves became doubly interesting. We had good hardware at our disposal, stability and fault-tolerance problems were second priority, and at the very beginning we simply wanted to speed up data propagation. Several options that appeared in MySQL 5.6 and 5.7 suited this well.
True, the documentation transparently hinted that simply enabling them would not work, because the ring would inevitably contain a slave running an older version. The upgrade order matters:
From the MySQL documentation:

"The 5.7 master is able to read the old binary logs written prior to the upgrade and to send them to the 5.7 slaves. The slaves recognize the old format and handle it properly."

"Binary logs created by the master subsequent to the upgrade are in 5.7 format. These too are recognized by the 5.7 slaves."

"In other words, when upgrading to MySQL 5.7, the slaves must be MySQL 5.7 before you can upgrade the master to 5.7."

For tests, it was enough to raise a test ring, for example via mysqld_multi, and run typical queries on it (you can even do it on a single host: four instances on different ports with different option sets), like this:

mysql -h127.0.0.1 -P 3302 -e "RESET SLAVE; RESET MASTER; SHOW MASTER STATUS\G;"
mysql -h127.0.0.1 -P 3302 -e "CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3301, MASTER_USER='root', MASTER_PASSWORD='', MASTER_LOG_FILE='master1-binlog.000001', MASTER_LOG_POS = 107; START SLAVE;"
mysql -h127.0.0.1 -P 3303 -e "RESET SLAVE; RESET MASTER; SHOW MASTER STATUS\G;"
mysql -h127.0.0.1 -P 3303 -e "CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3302, MASTER_USER='root', MASTER_PASSWORD='', MASTER_LOG_FILE='master2-binlog.000001', MASTER_LOG_POS = 107; START SLAVE;"
mysql -h127.0.0.1 -P 3304 -e "RESET SLAVE; RESET MASTER; SHOW MASTER STATUS\G;"
mysql -h127.0.0.1 -P 3304 -e "CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3303, MASTER_USER='root', MASTER_PASSWORD='', MASTER_LOG_FILE='master3-binlog.000001', MASTER_LOG_POS = 107; START SLAVE;"
mysql -h127.0.0.1 -P 3301 -e "RESET SLAVE; RESET MASTER; SHOW MASTER STATUS\G;"
mysql -h127.0.0.1 -P 3301 -e "CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3304, MASTER_USER='root', MASTER_PASSWORD='', MASTER_LOG_FILE='master4-binlog.000001', MASTER_LOG_POS = 107; START SLAVE;"

After that, you can change the version of any instance by launching the config for the desired port with a different binary (placed alongside the old one) and running mysql_upgrade for the data / system tables.
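The mysqld_multi configuration for such a single-host ring might look like this (paths and values are a sketch, not our real config; only two of the four instances are shown):

```shell
# my_multi.cnf: one [mysqldN] group per instance; the groups differ only in
# port, socket, datadir and binlog name. server-id must be unique in the ring.
cat > my_multi.cnf <<'EOF'
[mysqld1]
port      = 3301
socket    = /tmp/mysql3301.sock
datadir   = /data/m1
server-id = 1
log-bin   = master1-binlog

[mysqld2]
port      = 3302
socket    = /tmp/mysql3302.sock
datadir   = /data/m2
server-id = 2
log-bin   = master2-binlog
EOF

# Start every instance defined in the file:
mysqld_multi --defaults-file=my_multi.cnf start
```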
On production, we could, for a limited time, send all traffic to a single master, upgrade another one in the meantime, then switch over to it and upgrade all the rest. But for this we had to ensure binlog-format compatibility between versions, so that absolutely all transactions replayed successfully.
A little more documentation useful for our case:
To avoid incompatibilities, set the following variables on the MySQL 5.6 master:

binlog_checksum=NONE
log_bin_use_v1_row_events=1 (NDB Cluster only)

The binlog_checksum flag turned out to be mandatory for us even though we do not use NDB: without it, replication between the 5.6 and 5.5 servers broke, and mysqlbinlog could not read the log, failing with "ERROR: Error in Log_event::read_log_event(): 'Sanity check failed', data_len: 8203, event_type: 30". With it set, everything starts and works.
We did not enable GTID: besides the requirement of compatibility with all our old tools, we objectively do not see enough advantages in it to justify the transition.

gtid_mode=OFF

The simplest test of replication correctness is to load a dump onto the 5.5 server and the 5.6 server in turn and see whether everything works.
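In sketch form, that check might look like this (host names are illustrative): replay the same dump through both versions and see whether any slave stops with an error.

```shell
# Dump from the 5.5 side, then load it on each version in turn: the
# resulting binlog events must replicate cleanly in both directions.
mysqldump -h mysql55-master --single-transaction eldorado2 > dump.sql
mysql -h mysql55-master eldorado2 < dump.sql   # events flow 5.5 -> 5.6
mysql -h mysql56-master eldorado2 < dump.sql   # events flow 5.6 -> 5.5
# Any conversion failure shows up in the slave status as Last_Error:
mysql -h mysql56-master -e "SHOW SLAVE STATUS\G" | grep -E 'Running|Last_Error'
```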
Unfortunately, although expectedly, the tests were unsuccessful.
Last_Error: Column 18 of table 'eldorado2.b_adv_banner' cannot be converted from type '<unknown type>' to type 'datetime'

The datetime type in 5.6 is special: fractional seconds (microseconds) were added to it, so 5.6 has a new datetime format unknown to 5.5. MySQL 5.6 can work in a ring alongside 5.5 only as long as no columns of the new types are created in any replicated table (datetime in 5.6 != datetime in 5.5, and the same goes for time and timestamp; our database has more than 240 such columns).
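Columns of the affected types can be inventoried in advance with a query against information_schema (the query is our own sketch; the schema name eldorado2 comes from the error above):

```shell
mysql -h mysql55-master -e "
  SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
  FROM information_schema.COLUMNS
  WHERE TABLE_SCHEMA = 'eldorado2'
    AND DATA_TYPE IN ('datetime', 'time', 'timestamp');"
```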
We could not completely guarantee the absence of DDL with these fields, and we did not want to jeopardize the performance of the entire cluster. But we had a safer Plan B.
Plan B required spare hardware for maneuvering and raising a complete copy of the cluster alongside the old one; fortunately, we had that hardware. And since the possibility existed, it made sense to build a "normal" cluster right away.
At the same time, we had to keep all the current monitoring, debugging, and binlog-analysis tools working, and eliminate as many of the existing architecture's shortcomings as possible.
Multichannel replication

To keep the site intact and the admins fed, you need a silver bullet, and that is multichannel (multi-source) replication. We did not trust the new features a priori and were not sure of the technology; we could not find information or case studies anywhere, and there is little public experience with it in large production deployments.
Therefore, we thought of everything ourselves, the plan was as follows:
From one of the slaves, we begin to deploy a second arm: from scratch, we roll out a new cluster directly on version 5.7, building the entire new topology with new options;
We carry out all the load tests we need and check the stability;
We tune the configuration and, if everything is fine, re-deploy the entire cluster, without modifying the data in any way, so that it is ready for the subsequent switchover of the production load.
"Everyone pleases the one in front, while the one standing behind pleases him. The question is: which of them is happier than the rest?" I stared dumbfounded at the outlandish creature with the face of a tired woman. So here it is, the famous "riddle of the Sphinx"! For my taste, all this was somehow too much!
(Max Frei, "My Ragnarök")

In the target scheme, 4 masters mean 4 independent write streams to each slave, and those streams are applied independently.
On all masters, it was now possible to turn off log_slave_updates: they no longer need to relay anything anywhere, each sends only its own changes in its own stream (which makes the load on the masters even lower).
Along the way, you can also enable the minimal binlog format and parallel transaction processing (quite conditionally; this needs to be understood correctly):

slave_parallel_workers=5

With this setup, we could switch the load to the new cluster at any time, but the route is one-way and provides no rollback.
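On a 5.7 slave, the multi-channel setup boils down to one CHANGE MASTER per source, each bound to a named channel (host names, channel names, and credentials below are illustrative, not our actual configuration):

```shell
# Channels require the replication metadata repositories to live in tables:
#   master_info_repository=TABLE, relay_log_info_repository=TABLE
mysql -e "CHANGE MASTER TO MASTER_HOST='m1.db.local', MASTER_USER='repl',
  MASTER_PASSWORD='secret', MASTER_LOG_FILE='master1-binlog.000001',
  MASTER_LOG_POS=4 FOR CHANNEL 'master1';"
mysql -e "CHANGE MASTER TO MASTER_HOST='m2.db.local', MASTER_USER='repl',
  MASTER_PASSWORD='secret', MASTER_LOG_FILE='master2-binlog.000001',
  MASTER_LOG_POS=4 FOR CHANNEL 'master2';"
# ... likewise for channels 'master3' and 'master4' ...
mysql -e "START SLAVE FOR CHANNEL 'master1';"
mysql -e "SHOW SLAVE STATUS FOR CHANNEL 'master1'\G"
```

Each channel gets its own I/O and SQL threads, which is exactly why a stall on one master no longer holds up changes from the other three.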
For the lifetime of connections to the old cluster, log_slave_updates stayed enabled on the one master serving as the entry point to the new cluster, so all changes arriving through connections to the "old" cluster were faithfully delivered to the new one. As soon as those connections died off, the option was disabled; by that point the application was already pointed at the other 3 masters, and the data streams did not intersect in any way.
As a result, we got the following set of advantages:
If a long query blocks something, it affects only one replication thread out of 4 and does not touch the others;
The new binlog format, previously unavailable simply because of the MySQL version, takes several times less space (and, accordingly, traffic), so far more changes can pass through the cluster;
Any master can now be turned off absolutely painlessly, without affecting the rest;
Master failures are no longer so frightening: in any unclear situation, you can clone any server within minutes, regenerate its server_id, create credentials for slave access, and the new master is ready.
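That cloning procedure can be sketched roughly as follows (the tool choice, paths, and credentials are illustrative; we are not describing our exact scripts):

```shell
# 1. Take a consistent physical copy of a healthy master,
#    e.g. with Percona XtraBackup:
xtrabackup --backup --target-dir=/backup/clone
xtrabackup --prepare --target-dir=/backup/clone

# 2. On the new host: restore the datadir, set a unique server_id
#    in my.cnf, start mysqld, then create slave credentials:
mysql -e "CREATE USER 'repl'@'%' IDENTIFIED BY 'secret';"
mysql -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';"

# 3. Point the existing slaves' channels at the new master.
```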
There is also a "minus":
Each master now has far more slaves, and it is easier to saturate the network channel (in fact, this is not an increase in traffic but a redistribution of it in time and space).
What did the new scheme give

We moved to the new scheme in one day, on August 28, 2020, and the move was a success. Experience with the new architecture has shown that replication problems are down three- to four-fold (getting rid of them entirely is impossible), system stability has increased, and, most importantly, the maximum throughput of the system has grown. Where developers could previously blame any incomprehensible problem on replication, that excuse no longer works.
The number of client-facing problems caused by replication delays has been cut several-fold, which means clients feel at least a little less pain. We can now take any master server offline at any time to carry out maintenance on it, without affecting the rest of the cluster or stopping replication.
The cluster serves the main "Eldorado" site: for the most part, an old monolithic application with product cards, a personal account, a shopping cart, order processing, a call center, and so on. At the time of writing, the total read load on the cluster (slaves only) is 40k rps, roughly 5k rps per database server, excluding the load on dedicated technical slaves; at peak times it is significantly higher. It may not sound like much, but you have to consider the nature and complexity of these queries.
We really hope our experience proves useful to someone. Besides multichannel replication, we use many other interesting things, such as blackhole and federated tables; they also remove a lot of headaches (and add a few for those who do not understand why they are needed). If you are interested in the nuances, or have any other questions about our MySQL, welcome to the comments.
In half a year of production operation, we have not yet encountered a single problem related to multichannel replication, and we can confidently recommend such a setup as fault-tolerant and reliable.
The monolith itself is now being cut into a number of separate, independent services, some of which we will shard; we will share that experience too, so stay tuned.
Special thanks to my excellent team: we could not have done this without them!