A few weeks ago I started setting up the servers in my home office: two Dell R440 rack machines. The only problem was that one of them did not show up on the network.

The purpose of these is just to get me off the cloud, give me some real hands-on experience with infrastructure and hardware, and make it super easy for me to spin up any kind of random business / hobby project that I have in mind without having to think about cost.

These servers connect to the internet via a Deco XE75 mesh router. Basically, they hook into one Deco unit via Ethernet (which is that white cylinder on top of the drawers), but since the Deco only has a few ports, I use a Netgear 108e switch to fan out the connection and give us more ports.

Unfortunately AT&T has its modem in the living room rather than the office, so I need a second Deco unit in the living room to initialize the network (otherwise I would have cables running all over the house, which is really annoying and not something I want). The good news is that the two Deco units communicate with each other over their dedicated 6 GHz backhaul network, so gigabit internet is available to the servers in this indirect setup, and I've not noticed any network speed issues from it.

Now, the primary Deco does not literally connect directly to the modem as I mentioned above. AT&T provides a modem called an Arris BGW320 which they give to their business class customers, but I don't really trust AT&T hardware to handle firewall/security stuff.

Instead, I place the AT&T modem into what is basically a "transparent bridge" mode so that it passes its network and all static IP addresses to a SonicWall TZ470 hardware firewall, which manages the connection, security, and NAT for the entire home network.

Now back to the server rack. Bessy, the top server, was on the network running fine. The bottom server, Swirl, was not showing as connected to the Deco network. Very strange!

What I know is that the Ethernet used to work fine, but now it doesn't. What changed? Well, three things I can think of. 1. I updated the BIOS. 2. I updated Ubuntu and ran $ apt autoremove. 3. I moved offices. Yes, I did all of this without testing things in between, which in retrospect I now know was a mistake, so any of these three things could be the cause. Oopsies!

At first, I checked the basics: since I knew Bessy had a known-working Ethernet cable and port, I just swapped the cable from Bessy into Swirl to test the connection. Swirl still did not show as being on the network. The servers have a second Ethernet port for failover, so I tried the second one as well, but it didn't work either. This ruled out cable/port issues.

Swirl is running Ubuntu 22.04, and since it is a Dell server, it comes with this nice remote BIOS-like system called iDRAC (not a sponsor).

You connect to iDRAC through the web browser. It shows up as its own device on the network with its own IP Address. Notably, Swirl's iDRAC is actually connected to the network and functioning fine. It does have its own Ethernet port, which made me wonder if it also used its own network card that is separate from the network card that the host OS (i.e. Ubuntu) interfaces with.

I like iDRAC a lot because it shows the entire status of the machine, including hardware status. For example, as you can see in the screenshot, it even logged that the server lid was closed while the power is off. Cool! This is because even when the server is powered off, iDRAC runs in its own separate subsystem and is always powered on as long as the system has a power supply.

When I checked the iDRAC to see if something was wrong with the network card, everything looked okay.

It said the link was "Up" and that the link speed was 1000 Mbps. I confirmed that this was not referring to the iDRAC network connection, because it would toggle to "Down" when I disconnected the OS Ethernet cable.

This indicated to me that it probably was not a hardware issue, and therefore it was probably something wrong with my Ubuntu setup or my home network.

I would love to ssh into the server and check the network configuration, but I cannot because it isn't on the network. Fortunately, iDRAC also provides a "virtual console" for enterprise licenses, which is basically a remote feed of the video card output with some keyboard/mouse input forwarding. In other words, it is their remote desktop client, and it provides access even before the OS has booted.

After using this to get access to the server, I see something similar to the following (from here on, most of these outputs are my closest recollection of what happened, since unfortunately I was not smart enough to actually document everything I did in real time):

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state DOWN

eno1 is the server's Ethernet interface, and it doesn't show any IP address allocation, instead just saying that the state is DOWN. Weird.

First I try to manually set it back to "UP"

$ sudo ip link set eno1 up

It pauses for a second and then returns with no output, so I'm not sure what exactly it did. When I check $ ip addr again, the result is the same: eno1 state DOWN
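
Something that would have helped me here is separating the administrative state (the part that ip link set ... up changes) from the carrier state (whether the NIC actually sees a live link on the wire). Below is a minimal sketch, assuming the standard Linux sysfs layout under /sys/class/net. The function name link_summary and its optional second argument (an alternate sysfs root, there only so the function can be tried against a fake directory) are my own inventions.

```shell
# Print the kernel's view of an interface's link state.
# Usage: link_summary IFACE [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys/class/net (the override is a hypothetical
# convenience for testing, not something the kernel provides).
link_summary() {
    ls_iface="$1"
    ls_dir="${2:-/sys/class/net}/$ls_iface"

    if [ ! -d "$ls_dir" ]; then
        echo "$ls_iface: not present"
        return 1
    fi

    # operstate is a word such as "up", "down" or "unknown".
    ls_oper=$(cat "$ls_dir/operstate" 2>/dev/null) || ls_oper="unreadable"
    # carrier is 1 when a link is detected and 0 when it is not; reading it
    # fails while the interface is administratively down.
    ls_carrier=$(cat "$ls_dir/carrier" 2>/dev/null) || ls_carrier="unreadable"

    echo "$ls_iface: operstate=$ls_oper carrier=$ls_carrier"
}
```

Calling link_summary eno1 prints one line. If carrier reads 0 with a known-good cable plugged in, the problem sits below the IP layer, and editing the network configuration is unlikely to fix it.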

Ubuntu uses a program called netplan to manage network interfaces, so I double check to make sure my netplan is configured properly:

$ sudo vim /etc/netplan/networkmanager.yaml

network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: true
    eno2:
      dhcp4: true

This all looks correct. eno2 defines the second failover ethernet port, which as mentioned above was not working either.

Running netplan with debug to ensure it can apply properly:

$ sudo netplan --debug generate
DEBUG:command generate: running ['/lib/netplan/generate']
** (generate:330275): DEBUG: 15:10:22.213: starting new processing pass
** (generate:330275): DEBUG: 15:10:22.213: starting new processing pass
** (generate:330275): DEBUG: 15:10:22.213: We have some netdefs, pass them through a final round of validation
** (generate:330275): DEBUG: 15:10:22.213: eno2: setting default backend to 1
** (generate:330275): DEBUG: 15:10:22.213: Configuration is valid
** (generate:330275): DEBUG: 15:10:22.213: eno1: setting default backend to 1
** (generate:330275): DEBUG: 15:10:22.213: Configuration is valid
** (generate:330275): DEBUG: 15:10:22.214: Generating output files..
** (generate:330275): DEBUG: 15:10:22.214: openvswitch: definition eno1 is not for us (backend 1)
** (generate:330275): DEBUG: 15:10:22.214: NetworkManager: definition eno1 is not for us (backend 1)
** (generate:330275): DEBUG: 15:10:22.214: openvswitch: definition eno2 is not for us (backend 1)
** (generate:330275): DEBUG: 15:10:22.214: NetworkManager: definition eno2 is not for us (backend 1)

Everything looks normal... I think. After some research, I realized it says NetworkManager: definition eno1 is not for us because netplan supports multiple backends, one of them being NetworkManager. We're actually using the networkd backend, which is why we get those messages: the other backends are confirming that they're not handling those interfaces. Fine.

$ sudo netplan apply provides no output, just silently succeeding or failing. In any case, running it still did not change the eno1 status.
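
Since the command prints nothing either way, one cheap way to tell success from failure is to make the exit status explicit. A small sketch follows. run_and_report is a hypothetical helper name of mine, not a netplan feature.

```shell
# Run any command, then print its exit status, so that a silent command
# still tells us whether it succeeded (0) or failed (non-zero).
run_and_report() {
    if "$@"; then
        rr_status=0
    else
        rr_status=$?
    fi
    echo "exit status: $rr_status"
    return "$rr_status"
}
```

Used as run_and_report sudo netplan apply, it prints "exit status: 0" when netplan considers the apply successful. That only says the configuration was accepted, not that the link came up.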

During this process, I learned of a command which would provide more detailed information about the network.

$ networkctl status -a
● 2: eno1                                                                      
                     Link File: /usr/lib/systemd/network/99-default.link
                  Network File: /run/systemd/network/10-netplan-eno1.network
                          Type: ether
                         State: no-carrier (configuring)                                

It showed that the Ethernet status was yellow, no-carrier, configuring. What could this mean?

This led me to a thread with a similar issue: https://askubuntu.com/questions/497850/eth0-no-carrier-ifconfig-shows-no-ip-address

Jithin Pavithran suggests that the Ethernet may be misconfigured internally: the OS may be unaware of the required speed and duplex settings.

$ ethtool eno1

Settings for eno1:
    ...
    Speed: Unknown!
    Duplex: Unknown! (255)
    ...

Yes! This is the same thing that I have. Now I'm feeling optimistic that this is the problem, so I apply their suggested solution of manually defining these configurations.

$ sudo ethtool -s eno1 speed 1000 duplex full

This pauses for a second and provides no output. Sadly, when I check the configurations again afterward, Speed and Duplex are still Unknown! so this seems to have done nothing.
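
In hindsight, Speed and Duplex reading Unknown! is what ethtool normally reports when no link has been negotiated at all, so the more telling line in that output is Link detected. Here is a minimal sketch that pulls just those fields out of the ethtool output. The function name ethtool_link_fields is made up for illustration, and it assumes the usual "Field: value" layout that ethtool prints.

```shell
# Read the output of `ethtool IFACE` on stdin and print only the fields
# that describe the negotiated link, one FIELD=value pair per line.
ethtool_link_fields() {
    sed -n -E 's/^[[:space:]]*(Speed|Duplex|Link detected):[[:space:]]*(.*)$/\1=\2/p'
}
```

Usage would be sudo ethtool eno1 | ethtool_link_fields. If Link detected says no, forcing speed and duplex has nothing to act on, which may be why the command above appeared to do nothing.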

What's crazy to me is that, at this point, I still have absolutely no idea if this is a hardware issue or a software issue. There was a feeling of having done a decent amount of troubleshooting and diagnostics and yet having made very little progress toward actually understanding the problem.

At this point, I realize that maybe a lot of logs are getting swallowed up and placed in journalctl or dmesg, so I repeat all of these troubleshooting steps while monitoring both in real time, which I could do by splitting my session using Byobu.

Funny side note, I only knew about Byobu because one time months ago I read through a lot of the Ubuntu Server Guide, which mentions (at time of writing) that Byobu is installed by default. Why is my long-term memory crystal-clear for random facts like this, but my short term memory barely allows me to remember where I put my phone or the names of people?

Unfortunately, I did not see anything that appeared useful or relevant. If there were errors, I either missed them or they did not appear in those logs.
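
Part of the problem with watching raw kernel logs is the noise. A rough sketch of a filter that keeps only lines mentioning the interface or a link state change is below. nic_log_filter is a hypothetical name, and the "link is up/down" wording is an assumption about how the NIC driver phrases its messages, so the patterns may need adjusting for a given driver.

```shell
# Filter log lines on stdin down to those that mention the interface given
# as $1, or that look like a link state change. Case-insensitive.
# An interface name is required: an empty pattern would match every line.
nic_log_filter() {
    grep -i -E -e "$1" -e 'link( is)? (up|down)'
}
```

For example, sudo dmesg --follow | nic_log_filter eno1 in one Byobu pane, or journalctl -k -f | nic_log_filter eno1, while replugging the cable in another. Complete silence while the cable is being plugged and unplugged is itself a clue.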

In retrospect, there were other troubleshooting steps I could have taken much earlier on to provide bigger clues to the problem, which I will mention later.

At this point, however, I'm instead theorizing that maybe my firmware/driver for the Ethernet card was corrupted or deleted somehow during the autoremove process. The idea is that networkd may not be able to configure the Ethernet properly because it does not have the drivers it needs.

In order to fix this, I learn that I have to dig through this mountain of firmware that exists in a single repository to find the firmware for my card, then save it on my laptop and use a USB drive to transfer it to the server. Hats off to the maintainers of these firmwares and this repo; you are all built different.

So I install the firmware manually and reboot the server... and... still nothing on the network. Actually, now the problem is a lot worse.

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever

What in tarnation? ip addr doesn't show eno1 exists at all.

Hang on...

$ networkctl status -a

Nothing

$ ethtool eno1      
netlink error: no device matches name (offset 24)
netlink error: No such device
netlink error: no device matches name (offset 24)
netlink error: No such device
netlink error: Operation not permitted
netlink error: no device matches name (offset 24)
netlink error: No such device
netlink error: no device matches name (offset 24)
netlink error: No such device
No data available

WTF? It's as if Ubuntu doesn't even know the Ethernet interface exists.

By this point, I'm feeling that I've so thoroughly fucked my system that it's better to just completely reinstall Ubuntu. So I back up as much of the system as I reasonably can into a thumbdrive and then boot into a fresh Ubuntu Installer.

Earlier, I mentioned that there were other troubleshooting steps I could have done sooner. This is one of them. I think I should have tried to boot into an OS from a thumbdrive long before doing a firmware reinstallation to see if it could connect to the internet. The idea occurred to me when, during the process of trying to reinstall Ubuntu, not even the installer recognized that there was an available network connection.

Now I'm starting to strongly suspect that this actually is a hardware issue.

After doing some research, I find that you can run some commands to see what hardware is installed on the system. I'm not entirely sure how this all works, but my vague understanding is that the BIOS provides all hardware to the host somehow, so even if the OS is missing firmware/drivers, it should still be aware of all hardware, including unknown/incompatible hardware.

To see this, you can use lshw to ls (List) hw (Hardware). Well, I did that, and it did not show any networking/Ethernet hardware at all. Finally getting somewhere! Because lshw is not showing the Ethernet, that can only mean one of two things: either the BIOS is not providing it (maybe a setting is disabled?) or the hardware is malfunctioning.
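
A narrower version of the same check is to look at the PCI bus directly. As far as I understand it, the kernel enumerates PCI devices whether or not a driver is loaded for them, so lspci should list a NIC even when its driver or firmware is missing. A minimal sketch, where list_ethernet_devices is a name I made up and the match relies on lspci labelling NICs with the class "Ethernet controller":

```shell
# Read `lspci` output on stdin and print the Ethernet controllers.
# If there are none, print a warning and return non-zero.
list_ethernet_devices() {
    if ! grep -i 'ethernet controller'; then
        echo "no Ethernet controller found on the PCI bus"
        return 1
    fi
}
```

Run it as lspci | list_ethernet_devices. The lshw equivalent is sudo lshw -class network. If both come back empty from a live USB as well, the installed OS is effectively ruled out.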

I go back into the iDRAC to check the system logs and the BIOS settings. Sure enough, another clue to the problem has appeared:

iDRAC has complained that the Ethernet device is not detected. Also, it spams "Embedded NIC 1 Port 1 link is started/down" dozens of times per minute. Apparently it has been doing this link up/down loop for quite some time.
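
To put a number on "dozens of times per minute" when talking to support, the flaps can be counted. This is only a sketch: it assumes the Lifecycle Log has been exported to plain text with one message per line (a hypothetical format), and count_link_down is my own helper name.

```shell
# Count the "link is down" messages in a log export read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status clean in that case.
count_link_down() {
    grep -c -i 'link is down' || true
}
```

Something like count_link_down < lifecycle-export.txt (the file name is a placeholder) gives a single figure to quote instead of a screenshot full of repeated entries.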

Finally the evidence seems very compelling that it is a hardware issue. I'm not sure why the issue suddenly got worse after reinstalling the firmware. Maybe just the act of rebooting the server was the straw that broke it? In any case, since the server is under warranty, this felt like enough evidence to bring to a discussion with Dell support.

I got on a support chat with Dell and showed them the problem. Initially they repeated some of my troubleshooting steps for themselves, and dismissed it as an OS issue because the Ethernet green lights flashed a signal indicating the link was functioning normally, and iDRAC reported that the network card link was Up.

They wanted to close the ticket as wont-fix, but just before they hung up, I asked them if they could explain the Lifecycle Logs posted above, and why the network link would go down/up repeatedly with known working links.

They paused for a while before admitting they did not know. Luckily the support agent I had was a badass, and they felt compelled to investigate further.

This is when they had me boot into their own custom diagnostics OS. Finally they saw the same missing hardware reported by their diagnostics and agreed it was not an OS issue, it was a hardware issue.

Dell ordered a full motherboard replacement which a technician came and replaced within a couple of business days. This in fact resolved the issue, and the server is back on the network functioning normally.

The lesson here seems to be that diagnosing unknown motherboard issues can make you feel like you're going insane because your systems can give you conflicting / wrong information.

Also, don't make multiple massive system changes without testing in between, because the number of variables is insane.

Also also, always try the easiest troubleshooting steps (such as booting into an OS on a USB thumbdrive to reproduce the issue) before reinstalling obscure firmware and making strange system configuration changes.

And above all else, do not get a broken motherboard to begin with!

Diagnosing and fixing an elusive network issue on Ubuntu 22.04 (Jammy)