Software RAID-6 on Linux: the experience of recovering a 16 TB array
A few days ago, one of the hard drives in a budget array of 16x1 TB disks failed. The array level: RAID 6. The situation was complicated by the fact that (as it turned out) a cooler on the video card of the same server had stopped working, which had gone unnoticed, and after the HDD was replaced the changed cooling regime of the case showed itself as hangs during synchronization, which is very unpleasant in itself. As a result the array stopped auto-assembling and several drives were marked as faulty, so I had to dig in seriously, consulting the wiki, the manuals, and forums (forums are the most useful, since they describe the experience of specific people in specific situations).
The structure of the array:
md0 - /root: 8x1 GB, RAID 6
md1 - /data: 16x999 GB, RAID 6
First, all assembly experiments were performed on md0, that is on the root partition, which has no great value in itself apart from being a configured system.
So I booted Debian from the first disc in “Rescue mode”.
An attempt to assemble the array with
mdadm --assemble --scan
ended with the error that “there are not enough drives to start the array”.
Proceeding by the book:
1. First, save the metadata description of the arrays, which records each disk's number within the array, in case you later have to assemble using “dangerous methods”:
mdadm --examine /dev/sd[abcdefghijklmnop]1 > /mnt/raid_layout1
mdadm --examine /dev/sd[abcdefghijklmnop]2 > /mnt/raid_layout2
These files contain, for every HDD whose sdX1 partition has a superblock (in my case only 8 of the 16 sdX1 partitions, those belonging to md0, have one), something similar to the following.
Below is a sample for one of the partitions:
Version : 0.90.00
Raid Level : raid6
Used Dev Size : 975360 (952.66 MiB 998.77 MB)
Raid Devices : 8
Total Devices : 8
Number Major Minor RaidDevice State
this 4 8 177 4 active sync /dev/sdl1
0 0 8 97 0 active sync /dev/sdg1
1 1 8 113 1 active sync /dev/sdh1
2 2 8 129 2 active sync /dev/sdi1
3 3 8 145 3 active sync /dev/sdj1
4 4 8 177 4 active sync /dev/sdl1
5 5 8 193 5 active sync /dev/sdm1
6 6 8 209 6 active sync /dev/sdn1
7 7 8 225 7 active sync /dev/sdo1
Briefly, what this means:
the “this” line describes the partition being examined (here /dev/sdl1)
Version 0.90.00 is the superblock version
You will also see a lot of other useful information, such as the size of the array, the UUID, the level, the number of devices, etc.
But the most important thing is the table at the bottom of the listing; its first line shows the disk's position in the array:
this 4 8 177 4 active sync /dev/sdl1
Also, pay close attention to the superblock version! In my case it is 0.90.00.
Here we see that its number in the array is 4; the same numbers can be found in the output for every other device in the list. Please note that the drive letter in the status line is different (sdl1): this means the disk was initialized on another SATA port and was later moved. This is not critical information, but it can be useful.
What is critical is the device name and its number in the array (the names change when devices are moved from port to port; the numbers in the array do not).
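The saved layout files can also be parsed mechanically to double-check the disk order. Below is a small sketch of my own (not from the original article); the sample file reproduces part of the device table shown above, and on a real system you would point awk at the saved raid_layout file instead:

```shell
# Sample data: a fragment of the device table from `mdadm --examine`.
cat <<'EOF' > /tmp/raid_layout.sample
      Number   Major   Minor   RaidDevice State
this     4       8      177        4      active sync   /dev/sdl1
   0     0       8       97        0      active sync   /dev/sdg1
   1     1       8      113        1      active sync   /dev/sdh1
EOF
# Each data row is: row-label, Number, Major, Minor, RaidDevice, State, device.
# Field 5 is the position in the array (RaidDevice), the last field is the device.
awk '/active sync/ { print $5, $NF }' /tmp/raid_layout.sample
```

This prints one “position device” pair per line (e.g. `4 /dev/sdl1`), which is exactly the order you would need if you ever had to recreate the array by hand.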
Save the resulting raid_layout files somewhere safe (for example, on a flash drive), so they will not be lost, and proceed to the next step:
2. Let us try to assemble the array
The array can be assembled in two ways: automatically and manually.
mdadm --assemble --scan -v
If it assembles automatically, consider yourself lucky: you only need to check whether all HDDs are in the array and, if they are not, add the missing ones. In my case, however, automatic assembly failed with an error that there were not enough working devices:
mdadm: /dev/md2 assembled from 4 drives - not enough to start the array
The array was assembled from only 4 of the 8 disks. It certainly will not start, because RAID 6 tolerates the loss of at most 2 disks.
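The arithmetic behind that error: RAID 6 stores two parity blocks per stripe, so an n-disk array survives any 2 failures and needs at least n-2 members present to start. A quick check with the numbers from this array (the variable names are mine, the figures are from the text above):

```shell
# RAID 6 needs at least n-2 of its n members present to start.
N=8          # members in md0 (from this article)
PRESENT=4    # members mdadm actually found
MIN=$(( N - 2 ))
if [ "$PRESENT" -lt "$MIN" ]; then
    echo "only $PRESENT of $N members, need at least $MIN: cannot start"
fi
```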
Check the status of the array:
cat /proc/mdstat
md2 : inactive sdn1(S) sdk1(S) sdj1(S) sdp1(S)
There is a subtle point: if mdadm encounters a disk in the list that is marked as failed, assembly stops immediately. That is why the “-v” flag is useful: it shows on which HDD the assembly stopped.
mdadm --assemble /dev/md2 /dev/sd[abcdefgh]1 -v
The same thing, but here we specify explicitly which HDDs to use for the assembly.
Most likely, the array will not assemble this way either, just as with automatic assembly. But assembling manually, you begin to understand better what is actually happening.
The array will also refuse to assemble if a disk is marked as “faulty” in its metadata.
At this point I started with the data array, because the /root array I had already lost (why and how, I will describe below). To assemble an array while ignoring the “faulty” status, add the “-f” (force) flag. In my case this solved the problem of assembling the main data partition; it was successfully reassembled with the following command:
mdadm --assemble /dev/md3 /dev/sd[abcdefghijklmnop]2 -f
Surely, assembling with the “-f” flag right away would have been the easy path; but since I arrived at “-f” the hard way, so be it.
The partitions that had been marked as faulty or outdated were added back to the array. Most often a partition gets marked faulty or outdated when a SATA cable is not seated tightly, which is not uncommon.
Nevertheless, my array was in degraded mode, running on 14 disks out of 16.
Now, to restore normal operation of the array, the two missing disks need to be added:
mdadm --add /dev/md3 /dev/sdX2
where X is the letter of the new HDD's partition.
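After the disks are added, the array begins to resynchronize, and progress can be followed in /proc/mdstat. A small sketch of my own for pulling the recovery percentage out of it; the mdstat excerpt below is a made-up sample with hypothetical numbers, since the real file exists only on the affected machine:

```shell
# A made-up /proc/mdstat excerpt standing in for the real file.
cat <<'EOF' > /tmp/mdstat.sample
md3 : active raid6 sda2[0] sdb2[1] sdc2[2]
      [=>...................]  recovery =  8.5% (1158144/13631488) finish=12.3min speed=16875K/sec
EOF
# On the live system, point grep at /proc/mdstat instead of the sample.
grep -o 'recovery = *[0-9.]*%' /tmp/mdstat.sample
```

Running the same grep in a loop (or under `watch`) on the real /proc/mdstat gives a convenient progress indicator during the rebuild.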
Below are the difficulties I encountered, in order to save others from my own mistakes:
I followed the recommendations of the Linux RAID wiki, “RAID Recovery” (raid.wiki.kernel.org/index.php/RAID_Recovery). I advise using them with care: the page describes the process only briefly, and thanks to these recommendations I destroyed the /root (md0) part of my array.
mdadm --create --assume-clean --level=6 --raid-devices=10 /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 missing missing /dev/sdl1 /dev/sdk1 /dev/sdj1
This line shows how to recreate an array when you know which devices it includes and in what order. It is very important to take the superblock version into account: new versions of mdadm create superblock 1.2, which is located at the beginning of the partition, while 0.90 lives at the end. Therefore the flag “--metadata=0.90” must be added.
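The placement difference can be made concrete. For version 0.90 the superblock sits near the end of the device, at the device size rounded down to a 64 KiB boundary minus 64 KiB. A sketch of that arithmetic, treating the partition size as equal to the “Used Dev Size” of 975360 KiB from the --examine output above (an approximation of mine, since the real offset is computed from the raw device size):

```shell
# Superblock 0.90 offset from the start of the device:
#   (size in KiB rounded down to a 64 KiB boundary) - 64 KiB
SIZE_KIB=975360
OFFSET_KIB=$(( SIZE_KIB / 64 * 64 - 64 ))
echo "0.90 superblock at ${OFFSET_KIB} KiB from the start of the partition"
```

This is why recreating with the wrong metadata version is dangerous: a 1.2 superblock written near the start of the partition lands on top of data (including the ext4 superblock area), while 0.90 stays out of the way at the end.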
After I had assembled the array using “--create”, the file system turned out to be destroyed: neither the primary nor the backup ext4 superblock could be found. At first I thought the cause was that the new superblock was 1.2 rather than 0.90, which could have corrupted the partition, but that was not it: recreating with RAID superblock version 0.90 and searching for a backup ext4 superblock were equally unsuccessful.
Since the /root partition is not the most important part, I decided to experiment: the array was reformatted and then stopped:
mdadm --stop /dev/md2
and once again created via “--create”. As a result, the file system was broken again, though this should not have happened.
Perhaps someone has successfully restored partitions via “--create”; I would be glad to add that information to this article.
Obviously, the recommendations in this article should be used at your own risk; there is no guarantee that in your case everything will work exactly as it did in mine.
(Translated from another resource.)