Instaclustr Node Patching — finding a Kernel bug with the AWS EC2 Xen Hypervisor
As some of our AWS customers are aware, last week Instaclustr delayed upcoming maintenance on some of our AWS clusters. This was due to a bug affecting a portion of our managed service fleet running on certain AWS hypervisors.
We want to follow up with technical details about the issue we found, and provide more information on the current status of maintenance on affected clusters.
How does the Instaclustr OS patching cycle work?
For our Managed Platform customer cluster instances, we follow a documented process listed below for patching our Debian-based host operating systems.
At each stage of our patching process, Instaclustr performs a thorough performance evaluation and metric analysis of all applications to ensure that node health or performance does not degrade when customer clusters are patched. Only after an extended soak period to identify any lingering issues do we proceed to the next step.
- Step 1: Perform thorough internal testing of various node infrastructures, including extensive performance testing
- Step 2: Upgrade a small portion of our non-production SLA tier clusters (developer size nodes)
- Step 3: Upgrade all remaining non-production SLA tier clusters
- Step 4: Make this Operating System version the default for all new nodes
- Step 5: Upgrade all production SLA tier clusters
- Step 6: Perform final analysis of the fleet to ensure that all applicable nodes are running the correct OS version
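The gating logic behind these steps can be sketched roughly as follows. This is only an illustration, not Instaclustr's actual tooling: the `Stage` type, its `action` and `is_healthy` callables, and `run_rollout` are all hypothetical names. The idea is that each stage runs, is checked after its soak period, and the rollout stops at the first stage that fails its check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Stage:
    """One step of a staged rollout (hypothetical structure)."""
    name: str
    action: Callable[[], None]      # hypothetical: performs the upgrade for this stage
    is_healthy: Callable[[], bool]  # hypothetical: health check run after the soak period


def run_rollout(stages: Iterable[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that fails its health check.

    Returns the names of the stages that completed and passed.
    """
    completed: List[str] = []
    for stage in stages:
        stage.action()
        if not stage.is_healthy():
            # Halt here: later stages (e.g. production clusters) are never touched.
            break
        completed.append(stage.name)
    return completed
```

In a flow like this, a failure observed in the non-production tier prevents the production tier from ever being upgraded, which matches how the issue described below was caught.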
Issues found in testing
While testing these changes on nodes in our non-production SLA tier, we experienced some unexpected behaviour from a small portion of customer nodes. It appeared that some instances failed to boot after upgrading Debian. After searching through the logs, we found an unusual error trace that appeared during startup:
[ 6.902955] ena: The ena device sent a completion but the driver didn't receive a MSI-X interrupt (cmd 8), autopolling mode is OFF
[ 6.914601] ena: Failed to submit get_feature command 12 error: -62
[ 6.921175] ena 0000:00:03.0: Cannot init indirect table
[ 6.927008] ena 0000:00:03.0: Cannot init RSS rc: -62
[ 6.947883] ena: probe of 0000:00:03.0 failed with error -62
[ 65.783000] nvme nvme0: I/O 15 QID 0 timeout, completion polled
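A signature like this can be searched for mechanically. The following is a minimal sketch, assuming the kernel log is available as plain text; the function name `find_ena_probe_failure` and the regular expression are illustrative and are not part of any Instaclustr or AWS tool. On Linux, errno 62 is ETIME (timer expired), which is consistent with the driver never receiving the interrupt it was waiting for.

```python
import re
from typing import Optional, Tuple

# Matches lines such as: "ena: probe of 0000:00:03.0 failed with error -62"
ENA_PROBE_FAILURE = re.compile(
    r"ena: probe of (?P<pci>[0-9a-fA-F:.]+) failed with error (?P<errno>-?\d+)"
)


def find_ena_probe_failure(log_text: str) -> Optional[Tuple[str, int]]:
    """Return (pci_address, error_code) for the first ENA probe failure, or None."""
    match = ENA_PROBE_FAILURE.search(log_text)
    if match is None:
        return None
    return match.group("pci"), int(match.group("errno"))
```

Scanning console output for this pattern is one way to tell an instance that hit this failure apart from one that failed to boot for an unrelated reason.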
Immediately after noticing instability in this subset of nodes during the customer upgrades, we remediated the affected instances and halted our patching process while we investigated the issue further and formed a plan to ensure customer cluster availability.
After some trial and error, we were able to replicate this issue, but not consistently. From our testing, it seemed to affect only AWS I3, R4 and M4 instances. Even on these instance types, the failure occurred only around 90% of the time, with failure rates varying across regions.
With some further investigation, we tracked it down to a Linux kernel bug, and found that it had already been reported. The reported problem is that MSI-X vector setup fails, which in turn causes the enhanced networking setup on Elastic Network Adapters to fail. As a result, instances that use this configuration may fail to boot.
After some further digging and confirming hardware used on the affected instances, we were able to determine that this bug affected the AWS Xen hypervisor, in particular version 4.2. We were able to identify the Xen hypervisor version on our managed AWS EC2 Instances by viewing
We did not find any issues with patching our offerings on Microsoft Azure, Google Cloud Platform, or customers running on their own infrastructure.
Recommendations from AWS
While testing the instances that had successfully rebooted after the OS upgrade, we found that version 4.11 of the Xen hypervisor was unaffected by the bug. Once we had confirmed this, we were able to determine with 100% accuracy whether an instance was vulnerable to this bug or not.
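A version-based check along these lines could be sketched as below. This is an illustration under stated assumptions, not the method used by Instaclustr: it assumes the Xen version is read from the Linux sysfs files `/sys/hypervisor/type` and `/sys/hypervisor/version/{major,minor}`, which Xen guests expose, and the names `read_xen_version` and `classify` are hypothetical. Only the two versions named in this post are classified; anything else is reported as unknown.

```python
from pathlib import Path
from typing import Optional, Tuple

# Versions taken from this post; any other version is treated as unknown.
AFFECTED_VERSIONS = {(4, 2)}
UNAFFECTED_VERSIONS = {(4, 11)}


def read_xen_version(sysfs_root: str = "/sys/hypervisor") -> Optional[Tuple[int, int]]:
    """Return (major, minor) of the Xen hypervisor, or None if not running on Xen.

    Assumption: the version is exposed through sysfs under `sysfs_root`.
    """
    root = Path(sysfs_root)
    try:
        if (root / "type").read_text().strip() != "xen":
            return None
        major = int((root / "version" / "major").read_text().strip())
        minor = int((root / "version" / "minor").read_text().strip())
    except (OSError, ValueError):
        # Missing files (e.g. a Nitro instance) or unparsable contents.
        return None
    return major, minor


def classify(version: Optional[Tuple[int, int]]) -> str:
    """Map a hypervisor version to 'affected', 'unaffected', 'unknown' or 'not-xen'."""
    if version is None:
        return "not-xen"
    if version in AFFECTED_VERSIONS:
        return "affected"
    if version in UNAFFECTED_VERSIONS:
        return "unaffected"
    return "unknown"
```

Calling `classify(read_xen_version())` on a node before upgrading it would flag the instances that should be held back from the kernel update.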
Rumours were circulating online that AWS would be fixing this bug on all their Xen hypervisors by early June 2022. As we have patching cycles to adhere to, we wanted further clarification on their timeline.
We reached out to AWS for assistance and for more information on the schedule for their deployment. Unfortunately, they were unable to provide an ETA for a resolution or for the patching of their Xen hypervisors.
AWS suggested that affected instances could be migrated to instance sizes that use their Nitro hypervisor. Instaclustr does offer a range of instance sizes, including Nitro hypervisor based instances such as the R6g and I3en; however, for some workloads, moving to a Nitro hypervisor based instance size may lead to an increase in infrastructure cost of up to 50%, and/or a reduction in application performance. This meant that quickly switching a large portion of our fleet to a new instance size would not be feasible.
We attempted to mitigate this issue by testing Debian backport images. Debian backport images contain the same OS patches but use a much newer Linux kernel version. Unfortunately, we found that using a Debian backport image led to a different but similar boot-time exception involving Marvell NVMe devices.
With the scheduled patching of customer clusters rapidly approaching, we were unable to sufficiently test and implement any alternative solutions to this issue. As a result, for a subset of our AWS fleet, we have adopted a different patching plan, which involves targeted updates for specific packages and various other fixes, without updating our host kernel.
For any customers currently running on GCP or Azure, your clusters are unaffected and will continue to be patched on schedule.
For any customers on AWS, we have put in place appropriate measures to ensure that you will not be directly affected by this issue. We will reach out to you directly if we need to reschedule your maintenance window.
If you do have any questions, please don’t hesitate to reach out to our support team.
Originally published at https://www.instaclustr.com on June 06, 2022.