Journey to Ubuntu 24.04 LTS Ended in Another Rescue

My NAS is currently running Ubuntu 22.04.5 LTS. I have tried in the past to perform a do-release-upgrade, and ended up with a system that will not boot.

Since then, I have moved many services away from the NAS. I thought I should give it one more try, and I did just that yesterday. Unfortunately the result was the same, resulting in another rescue.

I thought I should document the rescue process here again.

# Wipe the root fs
mkfs.ext4 /dev/nvme1n1p2

# Restore from backup
mkdir -p /mntb
mount /dev/nvme1n1p2 /mnt
mount /dev/backup_partition /mntb
rsync -aAXv /mntb/ /mnt/

# mkfs assigned a new UUID to the root file system; make sure /etc/fstab uses it
blkid /dev/nvme1n1p2
vi /mnt/etc/fstab

# Mount the EFI partition, bind the system directories, and chroot
mount /dev/nvme1n1p1 /mnt/boot/efi
for i in /dev /dev/pts /proc /sys /run; do mount -B $i /mnt$i; done
mount -t efivarfs efivarfs /mnt/sys/firmware/efi/efivars
chroot /mnt

# Inside the chroot: reinstall GRUB (the EFI partition is mounted at /boot/efi)
sudo grub-install

# Below is more forceful but mostly optional and unnecessary
# grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=GRUB --removable --recheck

sudo update-grub
exit

# Exit and reboot
umount -l /mnt
reboot
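Since mkfs.ext4 assigns a fresh UUID, the fstab step is the easiest one to get wrong. Below is a minimal sketch of a check, assuming the restored fstab identifies the root file system with a UUID= entry. The helper name fstab_root_uuid is hypothetical and not part of any tool.

```shell
#!/bin/sh
# Print the UUID that an fstab file lists for the root ("/") mount point.
# Assumes the entry is written as UUID=<value>; comment lines are skipped.
fstab_root_uuid() {
    awk '$1 !~ /^#/ && $2 == "/" { sub(/^UUID=/, "", $1); print $1 }' "$1"
}

# Intended use from the rescue shell, before the chroot (run as root):
#   [ "$(fstab_root_uuid /mnt/etc/fstab)" = "$(blkid -s UUID -o value /dev/nvme1n1p2)" ] \
#       && echo "fstab matches" || echo "fstab needs editing"
```

If the two values differ, the system will most likely fail to mount its root file system on the next boot, so this is worth checking before running grub-install.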

By now I have become an expert in rescuing failed upgrades with Ubuntu.

I have upgraded my TUF GAMING B550-PLUS motherboard to version 3636. This is a recently released BIOS from ASUS in January of 2026. My previous version of the BIOS was from 2024.

I will give myself another breather, say about a week, before attempting to try again.

Resetting SolarEdge Inverters

In a previous post, I described how one of our two SolarEdge inverters encountered an error, and how one quick fix was to reset the inverters. This year we had a similar issue.

Three days ago, our solar system encountered a grid voltage issue. Our XWPro inverter was in AC PassThru mode, causing the SolarEdge inverters to detect the same grid issue. Our solar system is AC coupled, with the XWPro handling grid-tied net metering and battery charging and discharging, and the SolarEdge inverters handling solar energy generation.

AC Qualification Limit Exceeded

This grid event caused both SolarEdge inverters to go into a “Grid Profile Limit” mode where their AC output was limited to around 100W. When I reset both inverters through the main breaker panel, one recovered while the other continued with the limited output behaviour. To fix the second one, I had to perform a hard reset on the inverter. Below are the steps needed.

Main Breaker Panel
SolarEdge Inverter Control Positions

First I had to switch off the inverter at position A, and then turn off the DC disconnect at position B. I then had to switch off the breaker on the main panel.

The important part is to wait 5 to 10 minutes for the inverter to discharge so that the full reset can happen.

Once the time has passed, perform the actions in reverse: turn the breaker back on, then the DC disconnect (B), and finally the inverter (A).

Luckily, after this hard reset procedure, the second SolarEdge inverter was fully restored to normal operation.

Home Automation Garage Door Opener on Life Support

More than nine years ago, I created a remote garage door opener that connected to my HomeKit setup. This has proven to be a budget-friendly and super handy device, as I am able to control my garage door from anywhere in the world. I came up with this solution before WiFi-based remote garage door openers were commercialized.

However, recently the Raspberry Pi Zero W started to randomly lose its WiFi network connection, and I had to reboot it all the time. Of course, this is very frustrating. Since the device is plugged into a ceiling plug, the same socket that is used for the actual garage door opener, it is quite inconvenient to cycle the device. I typically had to restart the whole garage by resetting the breaker on the main electrical panel.

I have some spare ESP32-S3 SuperMini boards that I was going to use to replace the Pi Zero W. I bought these from Pinduoduo (拼多多) when I was in China last year. Due to my laziness, I never got around to it. Then something else happened that let me find another workaround.

About three and a half years ago, I purchased the VOCOlinc HomeKit Smart Plugs from Amazon. I used these to remotely control some fans in the house. One of these was recently freed up, so I plugged the adapter that powers the Pi Zero into the Smart Plug. Now I have a way to power cycle the Pi Zero remotely. A remote device to control the power of another remote device! Not only can I cycle the Pi Zero remotely, I can also programmatically determine when to cycle the device.

The Smart Plug is set up in my HomeKit environment, and I recently learned that on a Mac, you can use the Shortcuts app to toggle a HomeKit accessory or scene.

I also found out that once I have a Shortcut, I can invoke it using the shortcuts command-line tool.

Using this shortcut concept, I can create a periodic cron job that checks the connectivity of the Pi Zero every 15 minutes. If I am unable to connect, I can remotely restart the Pi Zero. The script is listed below:

#!/usr/bin/env zsh
#
# This script is meant to be run as root

logger "cyclePizero.sh: INFO test connectivity to pizero.localdomain"
if ! ping -q -c 1 pizero.localdomain >/dev/null; then
        logger "cyclePizero.sh: ERROR unable to ping pizero.localdomain"
        logger "cyclePizero.sh: INFO restarting the resolved daemon"
        systemctl restart systemd-resolved.service
        logger "cyclePizero.sh: INFO cycling pizero.localdomain"
        ssh bigbird -n 'shortcuts run "Toggle Garage Opener"'
        sleep 3
        ssh bigbird -n 'shortcuts run "Toggle Garage Opener"'
        logger "cyclePizero.sh: INFO cycling completed"
else
        logger "cyclePizero.sh: INFO pizero.localdomain ping successful"
fi

Note that I also sometimes have to restart the name resolution service, systemd-resolved. A stale resolver is another reason HomeKit sometimes fails to communicate with the Pi Zero.
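A single dropped ping would trigger a power cycle with the script as written. Below is a minimal sketch of one way to soften that, assuming a small helper named retry (a hypothetical name) and an illustrative install path in the cron entry. Neither is part of the script above.

```shell
#!/bin/sh
# Hypothetical root crontab entry for the 15-minute check (path is illustrative):
#   */15 * * * * /usr/local/bin/cyclePizero.sh

# Run a command up to N times; succeed as soon as one attempt succeeds.
# RETRY_DELAY (seconds between attempts) defaults to 5.
retry() {
    retry_max=$1
    shift
    retry_n=1
    while [ "$retry_n" -le "$retry_max" ]; do
        "$@" && return 0
        retry_n=$((retry_n + 1))
        if [ "$retry_n" -le "$retry_max" ]; then
            sleep "${RETRY_DELAY:-5}"
        fi
    done
    return 1
}
```

With this in place, the test in the script could become `if ! retry 3 ping -q -c 1 pizero.localdomain >/dev/null; then`, so the plug is only toggled after three consecutive failures.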

Hopefully this patch will work until I finally have time to replace it with the ESP32.

Target Sports Canada

Today we had fun at our first shooting range in Canada. A friend of ours was kind enough to arrange a group outing at Target Sports Canada. They offered an unlicensed shooting experience for groups of 2 to 5 people.

The whole experience was about 2 hours. The registration was very simple. After about 30 minutes of orientation, we got prepped with glasses and ear protection and went into the shooting range.

The range was safe and organized. We ended up shooting a rifle, a shotgun, a 9mm handgun and a 45 Colt 1911. I cannot recall the other models. We had about 10 rounds each, and it was fun to experience the different models of guns.

I found the handguns to be the most enjoyable. The shotgun’s kickback was something to experience. Overall I think the entire group had loads of fun, including my wife who tagged along for the trip.

Here is a short video of our experience:

Our first experience at the shooting range

WebAuthn with Email Implementation

Over the past few years I have developed several services that can be accessed through a web site. Many if not all of these sites require authentication. In the past I typically adopted the usual user id and password technique, and more recently an email-based authentication combined with the user’s external IP address, so that users are not burdened with remembering a password.

When my iPhone started to adopt WebAuthn passkeys, I wanted to make use of this convenient solution for my sites as well. As you can see from the chart below, adoption across the different platforms and devices is now universal.

Compatibility List

I set out to develop my own identity provider server using the Python WebAuthn package. Why did I develop my own solution instead of using one of the open source solutions? I wanted to learn how this works, and what better way to do it than implementing my own version? I also wanted to customize it based on a list of authorized email addresses, with the ability to track and manage access.

This was also the first solution where I used AI to help me vibe code the browser side of the solution. It used the navigator.credentials object to do most of the heavy lifting. The AI-generated code at the time was fraught with errors and bad assumptions, which I had to fix manually. This was more than a year ago, so I am sure things have improved by now.

In the end, I deployed this custom identity service on auth.lufamily.ca. This custom service also handled the email authentication flow, which goes something like this:

[Flowchart: Goto Site; Already Registered? (Yes / No); Register with Email; Email Sent; Read Email and Click on Welcome; Login with Email; Enter Site]

There are no passwords with the above approach. All the users need to remember are the email addresses that they used to register for site access. The login and registration page looks like this:

Registration and Login Page

For access provisioning, I simply use a JSON file to bind the email address to the allowed web sites. Below is an example:

{
  "jdoe@gmail.com": [
    {
      "user": "John",
      "site": "https://site1.lufamily.ca"
    }
  ],
  "jane.doe@icloud.com": [
    {
      "user": "Jane",
      "site": "https://site1.lufamily.ca"
    },
    {
      "user": "Jane",
      "site": "https://site2.lufamily.ca"
    }
  ]
}

When the user registers, they will receive an email looking like:

Sample onboarding or registration email

In the beginning, I wrote custom code on my web site to use my identity service. However, I found out that I can write an Apache Lua script to check for token provisioning and to invoke the identity service. I needed some other Lua packages to write my script, so I had to figure out which version of Lua my Apache2 was using.

ldd /usr/lib/apache2/modules/mod_lua.so
        linux-vdso.so.1 (0x00007ffd141ea000)
        liblua5.3.so.0 => /lib/x86_64-linux-gnu/liblua5.3.so.0 (0x0000729ff6d47000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x0000729ff6d0d000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000729ff6a00000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000729ff6919000)
        /lib64/ld-linux-x86-64.so.2 (0x0000729ff6dbe000)

Once I found out that I was using version 5.3, I needed to enable mod_lua with Apache 2:

sudo a2enmod lua

Followed by the installation of three packages that I needed for my script:

sudo luarocks --lua-version 5.3 install lua-cjson
sudo luarocks --lua-version 5.3 install luasec
sudo luarocks --lua-version 5.3 install luasocket

These three packages allow me to process JSON data structures, and communicate with my identity server (auth.lufamily.ca). This way I can add authentication to any sites that I host with Apache2 web server with a virtual host configuration that looks like this:

<VirtualHost *:80>

  <Location / >
    LuaCodeCache forever
    LuaHookAccessChecker /path/to/checkAuthLuFamily.lua handle

    RewriteEngine On

    RewriteCond %{ENV:token} ^$
    RewriteRule ^ https://auth.lufamily.ca/register/%{ENV:cbsite} [L,R=302]

    Header set Set-Cookie "token=%{token}e; Max-Age=1800; Path=/; HttpOnly; Secure; SameSite=Strict" env=token
  </Location>

</VirtualHost>

The checkAuthLuFamily.lua script is used to check if a token is provided either as an HTTP GET parameter, Authorization Bearer value, or a secure, http-only cookie. The token is actually a JWT token containing user specific attributes derived from the JSON file earlier. This token is provisioned when authentication is successful. If the token is missing, then this means the user has yet to be authenticated so we automatically redirect them to the registration page. If the token is valid, then the script will store a new refreshed token with extended expiry (another 30 minutes) into the environment variable which we use to reset the cookie. Any future requests to the same site will preserve the cookie/token.

I love this flexibility. This means I can add authentication to any site that I host with Apache2 without changing the code or modifying the site. This also means that I can develop future sites and services without having to worry about authentication.

I have not provided the source code here, because I am still testing it, but I wanted to document the concept and the approach, so that I can refer to my own creation in the future.

Replacing a Failing Drive in an Existing VDEV

In a previous post, I discussed creating a brand new VDEV with new drives to replace an existing VDEV. However, there is another approach that I chose to use in a very recent event for my NAS (Network Attached Storage) hard drive when it started to encounter write errors and later checksum errors.

The output of zpool status -v

The affected VDEV is mirror-4. Since there are 16 hard drives involved in this storage pool, I had to find out which hard drive is having the issue. I had to perform the following command line operations to obtain the serial numbers of the drives within the VDEV.

Shell commands to get the Serial Number.

It was the WD60EFRX drive that failed, a Western Digital Red 6TB 5400RPM drive. I was curious to see how old the drive was, so I used the smartctl utility to find out the number of powered-on hours that this particular drive had endured.

The 4.2 years (37033 / 24 / 365 = 4.2) is well over the 3-year warranty promised by Western Digital, so I took this unfortunate opportunity to get two new Seagate IronWolf Pro 12TB Enterprise NAS internal hard drives. The idea is not just to replace the failing drive but also to expand the pool, and to free up the 6TB drive in the existing mirror that is still good so I can use it as part of my offline backup strategy.
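The arithmetic above is easy to script. Below is a minimal sketch, assuming the usual ATA attribute table printed by smartctl -A, where the attribute is named Power_On_Hours and the raw value is the tenth column. The function names are hypothetical, and some drives report this value in a different format.

```shell
#!/bin/sh
# Extract the raw Power_On_Hours value from `smartctl -A` output on stdin.
power_on_hours() {
    awk '$2 == "Power_On_Hours" { print $10 }'
}

# Convert hours to years with one decimal place (365-day years).
hours_to_years() {
    awk -v h="$1" 'BEGIN { printf "%.1f\n", h / 24 / 365 }'
}

hours_to_years 37033    # prints 4.2
# Intended use (run as root; /dev/sdX is a placeholder):
#   hours_to_years "$(smartctl -A /dev/sdX | power_on_hours)"
```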

Once the new drives arrived and were connected to the system, I simply performed an attach command to add them to the mirror VDEV.

commands to attach the new drive

After attaching the new drives, the ZFS pool automatically begins to resilver. The above image was taken several hours after the attachment, and we are now waiting for the last drive to complete its resilvering. Since one of the new drives has already completed its resilvering, we have regained full redundancy.

After the resilvering is completed, I will then detach both old drives from the mirror using the detach command.

zpool detach vault /dev/disk/by-id/wwn-0x50014ee2b9f82b35-part1
zpool detach vault /dev/disk/by-id/wwn-0x50014ee2b96dac7c-part1

The first drive will be chucked into the garbage bin, and the second drive will be used for offline backup. Before I use the second drive for offline backup, I need to remove all ZFS information and metadata from the drive to avoid any unintentional future conflicts. We do this using the labelclear command as shown below.

zpool labelclear /dev/disk/by-id/wwn-0x50014ee2b96dac7c

For extra safety, we can also destroy the old partition by using parted to relabel the disk and create a new partition table. If the above command fails, we can use the dd command to zero out the first few blocks of the drive.

dd if=/dev/zero of=/dev/disk/by-id/wwn-0x50014ee2b96dac7c bs=1M count=100

In summary, this is the general strategy moving forward. When a drive on my NAS pool starts to fail (before actual failure), I take the opportunity to replace all the drives in the entire mirror with higher capacity drives, and use the remaining good one to serve as offline backup.

Moving My Blog

Since I had difficulties upgrading my NAS, as I detailed in this post, I decided that I need to move my NAS services to another server called workervm. The first service that I decided to move is this web site, my blog, which is a WordPress site hosted by an Apache2 instance with a MySQL database backend.

I decided that instead of installing all the required components on workervm, I would run WordPress inside a podman container. I already have podman installed and configured for rootless quadlet deployment.

The first step was to back up my entire WordPress document root directory and move the contents to the target server. I placed the contents in /mnt/hdd/backup on workervm. I also needed to perform a dump of the SQL database. On the old blog server, I had to do the following:

sudo mysqldump -u wordpressuser -p wordpress > ../../wordpress.bk.sql
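Before moving the dump to the new server, it is worth confirming that it was not cut short. Below is a minimal sketch, assuming mysqldump was run with its default comment settings, in which case the last line of the file starts with "-- Dump completed". The function name is hypothetical.

```shell
#!/bin/sh
# Succeed only if the last line of the dump carries mysqldump's completion marker.
dump_is_complete() {
    tail -n 1 "$1" | grep -q '^-- Dump completed'
}

# Intended use on the old blog server:
#   dump_is_complete ../../wordpress.bk.sql && echo "dump looks complete"
```

This only detects a truncated file. It does not verify that the contents can be restored.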

I then proceeded to create the following network, volume, and container files on workervm in ${HOME}/.config/containers/systemd:

I wanted a private network for all WordPress related containers to share and also ensure that DNS requests are resolved properly. Contents of wordpress.network:

[Unit]
Description=Network for WordPress and MariaDB
After=podman-user-wait-network-online.service

[Network]
Label=app=wordpress
NetworkName=wordpress
Subnet=10.100.0.0/16
Gateway=10.100.0.1
DNS=192.168.168.198

[Install]
WantedBy=default.target

I also created three podman volumes. The first is where the database contents will be stored. Contents of wordpress-db.volume:

[Unit]
Description=Volume for WordPress Database

[Volume]
Label=app=wordpress

Contents of wordpress.volume:

[Unit]
Description=Volume for WordPress Site itself

[Volume]
Label=app=wordpress

We also needed a volume to store Apache2 related configurations for WordPress. Contents of wordpress-config.volume:

[Unit]
Description=Volume for WordPress configurations

[Volume]
Label=app=wordpress

Now with the network and volumes configured, let’s create our database container with wordpress-db.container:

[Unit]
Description=MariaDB for WordPress

[Container]
Image=docker.io/library/mariadb:10
ContainerName=wordpress-db
Network=wordpress.network
Volume=wordpress-db.volume:/var/lib/mysql:U
# Customize configuration via environment
Environment=MARIADB_DATABASE=wordpress
Environment=MARIADB_USER=wordpressuser
Environment=MARIADB_PASSWORD=################
Environment=MARIADB_RANDOM_ROOT_PASSWORD=1

[Install]
WantedBy=default.target

Note that the above container refers to the database volume that we configured earlier as well as the network. We are also using the community fork of MySQL (MariaDB).

Finally we come to the configuration of the WordPress container, wordpress.container:

[Unit]
Description=WordPress Application
# Ensures the DB starts first
Requires=wordpress-db.service
After=wordpress-db.service

[Container]
Image=docker.io/library/wordpress:latest
ContainerName=wordpress-app
Network=wordpress.network
PublishPort=8168:80
Volume=wordpress.volume:/var/www/html:z
Volume=wordpress-config.volume:/etc/apache2:Z
# Customize via Environment
Environment=WORDPRESS_DB_HOST=wordpress-db
Environment=WORDPRESS_DB_USER=wordpressuser
Environment=WORDPRESS_DB_PASSWORD=################
Environment=WORDPRESS_DB_NAME=wordpress

[Install]
WantedBy=default.target

Notice the requirement for the database container to be started first, and this container also uses the same network but the two volumes are different.

We have to refresh the system since we changed the container configurations.

systemctl --user daemon-reload

We can then start the WordPress container with:

systemctl --user start wordpress

Once the container is started, we can check both the WordPress and its database container status with:

systemctl --user status wordpress wordpress-db

And track its log with:

journalctl --user -xefu wordpress

It is now time to restore our old content with:

podman cp /mnt/hdd/backup/. wordpress-app:/var/www/html/

podman unshare chmod -R go-w ${HOME}/.local/share/containers/storage/volumes/systemd-wordpress/_data

podman unshare chown -R 33:33 ${HOME}/.local/share/containers/storage/volumes/systemd-wordpress/_data

The copy will take some time, and once it is completed, we have to fix the permissions and ownership. Note that both of these have to be performed with the podman unshare command so that the proper uid and gid mapping is applied.

I also had to restore the database contents with:

cat wordpress.bk.sql | podman exec -i wordpress-db /usr/bin/mariadb -u wordpressuser --password=############# wordpress

Lastly I needed to modify my main/old Apache server where the port forwarding is directed to so that blog.lufamily.ca requests are forwarded to this new server and port.

Define BlogHostName blog.lufamily.ca
Define DestBlogHostName workervm.localdomain:8168

<VirtualHost *:443>
    ServerName ${BlogHostName}
    ServerAdmin kangclu@gmail.com
    DocumentRoot /mnt/airvideo/Sites/blogFallback
    Include /home/kang/gitwork/apache2config/ssl.lufamily.ca

    SSLProxyEngine  on

    ProxyPreserveHost On
    ProxyRequests Off

    ProxyPass / http://${DestBlogHostName}/
    ProxyPassReverse / http://${DestBlogHostName}/

    # Specifically map the vaultAuth.php to avoid reverse proxy
    RewriteEngine On
    RewriteRule /vaultAuth.php(.*)$ /vaultAuth.php$1 [L]

    ErrorLog ${APACHE_LOG_DIR}/blog-error.log
    CustomLog ${APACHE_LOG_DIR}/blog-access.log combined
</VirtualHost>

Note that on the old server I still have the document root pointed to a fallback directory. In this fallback directory I have PHP files that need to be served directly without being passed to WordPress, even though the requested path shares the same domain name as my WordPress site. The rewrite rule performs this short-circuit processing. When vaultAuth.php is requested, we skip the reverse proxy altogether.

This is working quite well. I am actually using the new location of this blog site to write this post. I plan to migrate the other services on my NAS in a similar manner with podman.

The idea is that once the majority of the services have been ported to workervm, then I can reinstall my NAS with a fresh install of Ubuntu 24.04 LTS without doing a migration.

Update 2026-02-28:

I had to move my blog to a different virtual machine because the current one had a network stack corruption. What I found was that the podman volume concept was super handy. I was able to use podman import/export commands to easily move my blog storage and database without having to worry about permissions and other file system nuances.

Rescuing Old MacBook Pros

I have a couple of old MacBook Pros, one from late 2016 (MacBook Pro 13,3) and another from mid 2017 (MacBook Pro 14,3). These laptops have been sitting on my shelves since the pandemic. In 2023 I upgraded them to Sonoma using OpenCore Legacy Patcher (OCLP). I documented the process here. Both of these laptops are Intel-based Macs and they have the infamous Touch Bar. These computers are no longer compatible with the most recent macOS. At the time of writing, the latest version is macOS 26, code named Tahoe.

Old laptop hardware specs

My original idea in 2026 was to install a suitable Linux distribution. I prepared three distributions:

  • Linux Mint
  • Lubuntu
  • Zorin OS

After several hours of trying these distributions, I found that they all had issues with Wi-Fi. The driver simply failed to install. A laptop without Wi-Fi is somewhat pointless because you cannot move around with it. Another show stopper with Linux was that I could not get the Touch Bar to work. At first I didn’t think it was a big deal, until I realized that the all-important ESC key and all the function keys are on the Touch Bar. Therefore, Linux is somewhat impractical on these machines.

At this point, I was going to chuck them into the e-waste bin, but then I remembered that a couple of years ago I played with OCLP. This is a little app that allows you to download a macOS installer and create a bootable USB drive with a boot loader that makes certain firmware adjustments, so that an incompatible macOS can be installed on old unsupported hardware, such as these laptops. This time instead of Sonoma, we’ll install Sequoia.

Unfortunately, OCLP still does not support macOS Tahoe, but Sequoia is not too bad. On another Intel based Mac mini, I prepared a bootable USB drive with Sequoia using OCLP, and then I went into the program’s settings to select my targeted Mac model. This allows the program to build and install OpenCore on to the same USB boot drive’s EFI partition.

Once the USB drive is prepared with BOTH the installer and the OpenCore EFI partition with the selected targeted hardware (in our case either MacBook Pro 13,3 or 14,3), we can then use the bootable USB drive on our old MacBooks.

Sequoia on a 2017 MacBook Pro!

The installation process begins with powering on the old MacBook with the USB drive plugged in while holding down the Option key. This will show the current bootable OS that we will be replacing, the EFI partition containing OpenCore, and the new installer that we prepared with macOS Sequoia. We want to select the EFI OpenCore first, and then select the Sequoia Installer. This way the installer will be running with the firmware fixes.

When the installer is running, there will be several reboots. Once the install is completed, there is one last step that we must do. We have to perform a Post Install Root Patch. This effectively replaces the OS drivers with old drivers that are compatible with the old hardware.

With OCLP, I was able to get both laptops to run Sequoia, giving 8- and 9-year-old laptops new life. However, there are downsides:

  • We cannot perform automated updates from Apple, so I turned off automatic updates and downloads of new OS updates;
  • When OCLP has a new app version, we will need to build a new OpenCore and install it on the EFI partition of the laptop’s boot drive, and we will also have to reapply the root patches;
  • We can only update to a new OS when it is supported by OCLP, so for Tahoe we will have to await a new version.

I think the disadvantages are negligible when compared to just throwing away the hardware.

I still have a MacBook Air that is more than 10 years old, which I look forward to trying with Sequoia.

Ubuntu 22.04 LTS to 24.04 LTS Upgrade Fail

Last Saturday, I decided it was time to switch my NAS server from 22.04 LTS to 24.04 LTS. I’ve been putting it off for ages, worried that the upgrade might not go as planned and something could go wrong. Since 24.04 is already in its fourth point release, I figured the risks should be manageable and it’s time to take the plunge.

I back up my system nightly, so the insurance was in place. After performing a final regular update to the system, I started with the following:

sudo apt update && sudo apt upgrade && sudo apt dist-upgrade

I then rebooted the system and executed:

sudo do-release-upgrade

After answering a few questions to save my custom configuration files for different services, it said the upgrade was done. I then rebooted the system, but BOOM! It won’t boot.

The BIOS knows the bootable drive, but when I tried to boot it, it just went back into the BIOS. It didn’t even give me a GRUB prompt or menu.

I figured this wasn’t a big deal, so I booted up the system with the 24.04 LTS Live USB. The plan is to just reinstall GRUB, and hopefully, that will fix the system.

Once I’ve booted into the Live USB and picked English as my language, I can jump into a command shell by pressing ALT-F2. Alternatively, you can press F1 and choose the shell option from the help menu. But, I found that the first method opens up a shell with command line completion, so I went with that.

The boot disk had the following layout (output from both fdisk and parted):

sudo fdisk -l /dev/nvme1n1
Disk /dev/nvme1n1: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: Samsung SSD 980 PRO 1TB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 90B9F208-2D05-484D-8C8C-B3AE71475167

Device              Start        End    Sectors   Size Type
/dev/nvme1n1p1       2048    2203647    2201600     1G EFI System
/dev/nvme1n1p2    2203648 1921875000 1919671353 915.4G Linux filesystem
/dev/nvme1n1p3 1921875968 1953523711   31647744  15.1G Linux swap

sudo parted /dev/nvme1n1
GNU Parted 3.4
Using /dev/nvme1n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: Samsung SSD 980 PRO 1TB (nvme)
Disk /dev/nvme1n1: 1000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system     Name  Flags
 1      1049kB  1128MB  1127MB  fat32                 boot, esp
 2      1128MB  984GB   983GB   ext4
 3      984GB   1000GB  16.2GB  linux-swap(v1)  swap  swap

As I described in this post, we want to make sure that the first partition is marked for EFI boot. This can be done in parted with:

set 1 boot on
set 1 esp on

I didn’t have to perform the above since the first partition (/dev/nvme1n1p1) is already recognized as EFI System. We also need to ensure that this partition is formatted with FAT32. This can be done with:

sudo mkfs.vfat -F 32 /dev/nvme1n1p1

Since this was already the case, I also did not have to perform this formatting step.
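The partition type can also be checked from the shell without entering parted. Below is a minimal sketch, assuming the fdisk -l layout shown above, where the type column reads "EFI System". The function name is hypothetical.

```shell
#!/bin/sh
# Print the device name of every partition that fdisk reports as an EFI System partition.
efi_partitions() {
    awk '/EFI System[[:space:]]*$/ { print $1 }'
}

# Intended use from the Live USB shell (run as root):
#   fdisk -l /dev/nvme1n1 | efi_partitions
# The file system type can be checked with blkid, which reports vfat for FAT32:
#   blkid -s TYPE -o value /dev/nvme1n1p1
```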

The next step is to mount the root directory and the boot partition.

mount /dev/nvme1n1p2 /mnt
mount /dev/nvme1n1p1 /mnt/boot/efi

We now need to bind certain directories under /mnt in preparation for us to change our root directory to /mnt.

for i in /dev /dev/pts /proc /run; do sudo mount --bind $i /mnt$i; done
mount --rbind /dev /mnt/dev
mount --rbind /sys /mnt/sys
mount --rbind /run /mnt/run
mount -t proc /proc /mnt/proc
chroot /mnt
grub-install --efi-directory=/boot/efi /dev/nvme1n1
update-grub
exit

# Back outside the chroot: keep unmount events from propagating to the live system, then unmount
mount --make-rslave /mnt/dev
umount -R /mnt

If we do not use the --rbind option for /sys, then we may get an EFI error when running grub-install. There are two less common alternatives that solve the same issue; choose one of the following (but not BOTH):

mount --bind /sys/firmware/efi/efivars /mnt/sys/firmware/efi/efivars
mount -t efivarfs none /sys/firmware/efi/efivars

The reinstallation of GRUB did not solve the problem. I had to perform a full system restore using my backup. The backup was created using rsync as described in this post. However, I learned that this backup was done incorrectly! I excluded certain directories using name instead of /name, which caused more to be excluded than intended. The correct backup command should be:

sudo rsync --delete \
        --exclude '/dev' \
        --exclude '/proc' \
        --exclude '/sys' \
        --exclude '/tmp' \
        --exclude '/run' \
        --exclude '/mnt' \
        --exclude '/media' \
        --exclude '/cdrom' \
        --exclude 'lost+found' \
        -aAXv / ${BACKUP}

and the restoration command is very similar:

mount /dev/sdt1 /mnt/backup
mount /dev/nvme1n1p2 /mnt/system

sudo rsync --delete \
        --exclude '/dev' \
        --exclude '/proc' \
        --exclude '/sys' \
        --exclude '/tmp' \
        --exclude '/run' \
        --exclude '/mnt' \
        --exclude '/media' \
        --exclude '/cdrom' \
        --exclude 'lost+found' \
        -aAXv /mnt/backup/ /mnt/system/

After the restore, double check that /var/run is soft-linked to /run.

Once the restoration was completed, I followed the above instructions again to reinstall GRUB, and I was able to boot back into my boot disk.

Since this upgrade attempt has failed, I now have to figure out a way to move my system forward. I think what I will do is to port all of my services on my NAS as podman root-less quadlets, and then just move the services into a brand new Ubuntu clean installation. This is probably easier to manage in the future.