{"id":2726,"date":"2023-12-17T19:10:15","date_gmt":"2023-12-18T00:10:15","guid":{"rendered":"https:\/\/blog.lufamily.ca\/kang\/?p=2726"},"modified":"2023-12-18T09:30:34","modified_gmt":"2023-12-18T14:30:34","slug":"replacing-nvme-boot-disk","status":"publish","type":"post","link":"https:\/\/blog.lufamily.ca\/kang\/2023\/12\/17\/replacing-nvme-boot-disk\/","title":{"rendered":"Replacing NVME Boot Disk"},"content":{"rendered":"\n<p>A few months ago, the boot disk of our media server began to report errors such as the ones below:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code>Dec 17 03:01:35 avs kernel: &#91;32515.068669] EXT4-fs error (device nvme1n1p2): htree_dirblock_to_tree:1080: inode #10354778: comm tar: Directory block failed checksum\nDec 17 03:02:35 avs kernel: &#91;32575.183005] EXT4-fs error (device nvme1n1p2): htree_dirblock_to_tree:1080: inode #13500463: comm tar: Directory block failed checksum\nDec 17 03:02:35 avs kernel: &#91;32575.183438] EXT4-fs error (device nvme1n1p2): htree_dirblock_to_tree:1080: inode #13500427: comm tar: Directory block failed checksum<\/code><\/pre>\n\n\n\n<p>The boot disk is an NVME device and I thought the errors might be due to overheating, so I purchased a heat sink and installed it. Unfortunately, the errors persisted after the heat sink was installed.<\/p>\n\n\n\n<p>I decided to replace the boot disk with the exact same model, a Samsung 980Pro 1TB. This should have been a pretty easy maintenance task: clone the drive, then swap in the new one. However, Murphy is sure to strike!<\/p>\n\n\n\n<p>My usual go-to cloning utility is Clonezilla. Unfortunately, it did not like cloning NVME drives: I tried multiple versions, and each one ended in a kernel panic. I am not sure what the problem is here. 
It could be Clonezilla or the USB 3.0 NVME enclosure that I was using for the new disk.<\/p>\n\n\n\n<p>I resigned myself to using the <code>dd<\/code> command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>dd if=\/dev\/source of=\/dev\/target status=progress<\/code><\/pre>\n\n\n\n<p>Unfortunately, this would have taken far too long, something like 20+ hours, so I gave up on this approach.<\/p>\n\n\n\n<p>I decided to do a good old restore of the nightly backup. I started by cloning the partition table:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sfdisk -d \/dev\/olddisk | sfdisk \/dev\/newdisk<\/code><\/pre>\n\n\n\n<p>I then proceeded with the restore of the nightly backup. Murphy strikes twice! The nightly backup was corrupted! I guess that is not surprising when the root directory&#8217;s integrity is in question, which is the whole reason we are doing this exercise.<\/p>\n\n\n\n<p>Without the nightly backup, I had to resort to a live backup. I booted the system again and ran:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sudo su -\nmount \/dev\/new_disk_root_partition \/mnt\/newboot\ncd \/\ntar -cvpf - --exclude=\/tmp --exclude=\/home\/kang\/log --exclude=\/span --exclude=\"\/var\/lib\/plexmediaserver\/Library\/Application Support\/Plex Media Server\/Cache\" --one-file-system \/ | tar xvpf - -C \/mnt\/newboot --numeric-owner<\/code><\/pre>\n\n\n\n<p>The above took about an hour. I then copied the <code>\/span<\/code> directory manually, because this directory tends to change while the server is up and running.<\/p>\n\n\n\n<p>With all the contents copied, I realized I had forgotten how to install GRUB and had to teach myself again. 
I booted the machine from an Ubuntu live USB, and then mounted both the root and EFI partitions.<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code>nvme1n1                              259:0    0 931.5G  0 disk\n\u251c\u2500nvme1n1p1                          259:1    0     1G  0 part  \/boot\/efi\n\u251c\u2500nvme1n1p2                          259:2    0 915.4G  0 part  \/\n\u2514\u2500nvme1n1p3                          259:3    0  15.1G  0 part  &#91;SWAP]<\/code><\/pre>\n\n\n\n<p>Then I installed GRUB:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sudo su -\nmkdir \/efi\nmount \/dev\/nvme1n1p1 \/efi\nmount \/dev\/nvme1n1p2 \/mnt\ngrub-install --efi-directory \/efi --root-directory \/mnt<\/code><\/pre>\n\n\n\n<p>I also had to fix <code>\/etc\/fstab<\/code> so that the root partition and the <code>\/boot\/efi<\/code> partition are each referenced by their correct <code>UUID<\/code>. The <code>blkid<\/code> command came in handy to find each <code>UUID<\/code>. For the swap partition, I had to run the <code>mkswap<\/code> command before I could get its <code>UUID<\/code>.<\/p>\n\n\n\n<p>After I rebooted, I reinstalled GRUB one more time with the following as superuser:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>grub-install \/dev\/nvme1n1<\/code><\/pre>\n\n\n\n<p>I also updated the <code>initramfs<\/code>&nbsp;using:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>update-initramfs -c -k all<\/code><\/pre>\n\n\n\n<p>Something that should have taken less than an hour took the majority of the day. The server is now running with the new NVME replacement disk. Hopefully this resolves the file system corruption. We have to wait and see!<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Update: The Day After<\/h4>\n\n\n\n<p>The same errors occurred again! I noticed that these corruptions occur when we do a system backup. How ironic! 
I later confirmed that running the <code>tar<\/code> command on the root directory during the backup process can cause such an error. I now have to find out why. I will disable the system backup for the next few days to see whether the errors come back.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A few months ago, the boot disk of our media server began to report errors such as the ones below: The boot disk is an NVME device and I thought the errors might be due to overheating, so I purchased a heat sink and installed it. Unfortunately, the errors persisted after the heat sink was installed. &hellip; <a href=\"https:\/\/blog.lufamily.ca\/kang\/2023\/12\/17\/replacing-nvme-boot-disk\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Replacing NVME Boot Disk&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[111],"tags":[97,5,28,6],"class_list":["post-2726","post","type-post","status-publish","format-standard","hentry","category-tech","tag-linux","tag-nas","tag-technology","tag-ubuntu"],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p7V6i8-HY","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/posts\/2726","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.lufa
mily.ca\/kang\/wp-json\/wp\/v2\/comments?post=2726"}],"version-history":[{"count":3,"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/posts\/2726\/revisions"}],"predecessor-version":[{"id":2730,"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/posts\/2726\/revisions\/2730"}],"wp:attachment":[{"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/media?parent=2726"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/categories?post=2726"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/tags?post=2726"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}