DHCP changing the IP address suddenly on embedded systems

This post was written by eli on October 8, 2012
Posted Under: ARM,Linux

The problem

It first seemed extremely odd: The IP address of my embedded Linux machine changed suddenly after a few hours, breaking the ssh connection I had up, and messing up the NFS mount.

The problem turned out to be the lack of a hardware clock (RTC) on the board + clock being updated with NTP + the DHCP address obtained by dhclient.

(Why there’s no RTC is completely beyond me. After all, a decent chip costs no more than $0.75 and is so important to the OS’ health).

Here’s a typical acquisition of an address, as seen in /var/log/syslog:

Jan  1 00:00:13 localhost dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port
 67 interval 3
Jan  1 00:00:14 localhost dhclient: DHCPREQUEST of 10.1.1.196 on eth0 to 255.255.255.255 port 67
Jan  1 00:00:14 localhost dhclient: DHCPOFFER of 10.1.1.196 from 10.1.1.3
Jan  1 00:00:14 localhost dhclient: DHCPACK of 10.1.1.196 from 10.1.1.3

A few lines later it says

Jan  1 00:00:16 localhost dhclient: bound to 10.1.1.196 -- renewal in 8512 seconds.

So far so good. Only the time is around January 1st, 1970 midnight, that is epoch time zero + 16 seconds. No hardware clock, so the system started from zero. But soon enough the clock jumps:

Sep 29 17:32:44 localhost ntpdate[1858]: step time server 91.189.94.4 offset 1348939931.970235 sec

And after quite a while we have

Sep 29 20:23:39 localhost NetworkManager[1375]: <info> (eth0): DHCPv4 state changed bound -> expire
Sep 29 20:23:39 localhost NetworkManager[1375]: <info> (eth0): DHCPv4 state changed expire -> preinit
Sep 29 20:23:39 localhost dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
Sep 29 20:23:42 localhost dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 4
Sep 29 20:23:43 localhost dhclient: DHCPREQUEST of 10.1.1.197 on eth0 to 255.255.255.255 port 67
Sep 29 20:23:43 localhost dhclient: DHCPOFFER of 10.1.1.197 from 10.1.1.3
Sep 29 20:23:43 localhost dhclient: DHCPACK of 10.1.1.197 from 10.1.1.3
Sep 29 20:23:43 localhost dhclient: bound to 10.1.1.197 -- renewal in 9793 seconds.

The DHCPDISCOVER indicates that the DHCP client isn’t renewing its lease, but wants to start over.

And /var/lib/dhcp/dhclient-ca4860e2-17c8-4280-a48c-f9d3a77f53aa-eth0.lease read, after all this:

lease {
 interface "eth0";
 fixed-address 10.1.1.196;
 filename "/pxelinux.0";
 option subnet-mask 255.255.255.0;
 option routers 10.1.1.2;
 option dhcp-lease-time 21600;
 option dhcp-message-type 5;
 option domain-name-servers 10.2.0.1,10.2.0.2;
 option dhcp-server-identifier 10.1.1.3;
 renew 4 1970/01/01 02:51:26;
 rebind 4 1970/01/01 05:15:19;
 expire 4 1970/01/01 06:00:19;
}
lease {
 interface "eth0";
 fixed-address 10.1.1.197;
 filename "/pxelinux.0";
 option subnet-mask 255.255.255.0;
 option routers 10.1.1.2;
 option dhcp-lease-time 21600;
 option dhcp-message-type 5;
 option domain-name-servers 10.2.0.1,10.2.0.2;
 option dhcp-server-identifier 10.1.1.3;
 renew 6 2012/09/29 23:06:56;
 rebind 0 2012/09/30 01:38:43;
 expire 0 2012/09/30 02:23:43;
}

Now let’s try to figure out what happened: First, Network Manager launched dhcpclient with something like

/sbin/dhclient -d -4 -sf /usr/lib/NetworkManager/nm-dhcp-client.action -pf /var/run/sendsigs.omit.d/network-manager.dhclient-eth0.pid -lf /var/lib/dhcp/dhclient-ca4860e2-17c8-4280-a48c-f9d3a77f53aa-eth0.lease -cf /var/run/nm-dhclient-eth0.conf eth0

Note that the lease file is given explicitly.  An address is acquired, and the first entry in the lease file was made.

There is a slight contradiction between the statement in the log file that renewal was to take place 8512 seconds later (that is on epoch time 8528 = 02:22:08) and the renewal time in the lease file (02:51:26).

And then the NTP client kicked the time by some 42 years, so the leased IP address suddenly seems way outdated. But the DHCP client sleeps carelessly, being confident there is nothing to do for quite a while. This changes when apparently Network Manager wakes it up again, on what should have been 02:51:26 on Jan 1st 1970, but turned out to be 20:23:39 on Sep 29th 2012, which is 02:51:05 hours after the clock jump to 17:32:44. Talk about oversleeping.

So the DHCP client assumes everything has expired, and issues two DHCPDISCOVER requests, starting it all over again. The server supplies a new IP address, and oops, all network connections are suddenly lost. And the second entry in the lease file is generated.

The somewhat odd solution

Kill dhclient when ntpdate has caused a big jump. As simple as that. It’s not clear why this works at all, but there doesn’t seem to be a way to convince dhclient to just renew the IP and update the timestamps. Maybe updating the timestamps in the lease file would work, but that’s a bit too sophisticated for me right now.

Did Network Manager like it? No and no. It actually restarted the entire setup of eth0, causing a connection blackout of a few seconds.

Oct  8 12:48:57 localhost NetworkManager[1306]: <info> (eth0): DHCPv4 client pid 1501 exited with status -1
Oct  8 12:48:57 localhost NetworkManager[1306]: <warn> DHCP client died abnormally
Oct  8 12:48:57 localhost NetworkManager[1306]: <info> (eth0): device state change: activated -> failed (reason 'ip-config-expired') [100 120 6]
Oct  8 12:48:58 localhost NetworkManager[1306]: <warn> Activation (eth0) failed.
Oct  8 12:48:58 localhost dbus[831]: [system] Activating service name='org.freedesktop.nm_dispatcher' (using servicehelper)
Oct  8 12:48:58 localhost NetworkManager[1306]: <info> (eth0): device state change: failed -> disconnected (reason 'none') [120 30 0]
Oct  8 12:48:58 localhost NetworkManager[1306]: <info> (eth0): deactivating device (reason 'none') [0]
Oct  8 12:48:58 localhost avahi-daemon[1336]: Withdrawing address record for 10.1.1.188 on eth0.
Oct  8 12:48:58 localhost avahi-daemon[1336]: Leaving mDNS multicast group on interface eth0.IPv4 with address 10.1.1.188.
Oct  8 12:48:58 localhost avahi-daemon[1336]: Interface eth0.IPv4 no longer relevant for mDNS.
Oct  8 12:48:58 localhost dnsmasq[1700]: exiting on receipt of SIGTERM
Oct  8 12:48:58 localhost NetworkManager[1306]: <info> DNS: starting dnsmasq...
Oct  8 12:48:58 localhost NetworkManager[1306]: <info> (eth0): writing resolv.conf to /sbin/resolvconf
Oct  8 12:48:58 localhost dnsmasq[2083]: started, version 2.59 cache disabled
Oct  8 12:48:58 localhost dnsmasq[2083]: compile time options: IPv6 GNU-getopt DBus i18n DHCP TFTP conntrack IDN
Oct  8 12:48:58 localhost dnsmasq[2083]: warning: no upstream servers configured
Oct  8 12:48:58 localhost dbus[831]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Auto-activating connection 'Wired connection 1'.
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) starting connection 'Wired connection 1'
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> (eth0): device state change: disconnected -> prepare (reason 'none') [30 40 0]
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 1 of 5 (Device Prepare) scheduled...
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 1 of 5 (Device Prepare) started...
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 2 of 5 (Device Configure) scheduled...
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 1 of 5 (Device Prepare) complete.
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 2 of 5 (Device Configure) starting...
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> (eth0): device state change: prepare -> config (reason 'none') [40 50 0]
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 2 of 5 (Device Configure) successful.
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) scheduled.
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 2 of 5 (Device Configure) complete.
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) started...
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> (eth0): device state change: config -> ip-config (reason 'none') [50 70 0]
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Beginning DHCPv4 transaction (timeout in 45 seconds)
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> dhclient started with pid 2105
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Beginning IP6 addrconf.
Oct  8 12:49:01 localhost avahi-daemon[1336]: Withdrawing address record for fe80::bc2d:19ff:fe77:94ab on eth0.
Oct  8 12:49:01 localhost avahi-daemon[1336]: Leaving mDNS multicast group on interface eth0.IPv6 with address fe80::bc2d:19ff:fe77:94ab.
Oct  8 12:49:01 localhost avahi-daemon[1336]: Interface eth0.IPv6 no longer relevant for mDNS.
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) complete.
Oct  8 12:49:01 localhost dhclient: Internet Systems Consortium DHCP Client 4.1-ESV-R4
Oct  8 12:49:01 localhost dhclient: Copyright 2004-2011 Internet Systems Consortium.

and then

Oct  8 12:49:01 localhost dhclient: Listening on LPF/eth0/be:2d:19:77:94:ab
Oct  8 12:49:01 localhost dhclient: Sending on   LPF/eth0/be:2d:19:77:94:ab
Oct  8 12:49:01 localhost dhclient: Sending on   Socket/fallback
Oct  8 12:49:01 localhost dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
Oct  8 12:49:02 localhost dhclient: DHCPREQUEST of 10.1.1.188 on eth0 to 255.255.255.255 port 67
Oct  8 12:49:02 localhost dhclient: DHCPOFFER of 10.1.1.188 from 10.1.1.3
Oct  8 12:49:02 localhost dhclient: DHCPACK of 10.1.1.188 from 10.1.1.3
Oct  8 12:49:02 localhost dhclient: bound to 10.1.1.188 -- renewal in 239 seconds

The use of DHCPDISCOVER is not a good sign: A renewal would consist of a request only. But there’s only one of them. When the address changed, there were two DHCPDISCOVER in a row.

(And never mind the short lease time. I changed this on the server temporarily to speed up things)

And the lease file now reads:

lease {
 interface "eth0";
 fixed-address 10.1.1.188;
 filename "/pxelinux.0";
 option subnet-mask 255.255.255.0;
 option dhcp-lease-time 600;
 option routers 10.1.1.2;
 option dhcp-message-type 5;
 option dhcp-server-identifier 10.1.1.3;
 option domain-name-servers 10.2.0.1,10.2.0.2;
 renew 4 1970/01/01 00:05:19;
 rebind 4 1970/01/01 00:09:05;
 expire 4 1970/01/01 00:10:20;
}
lease {
 interface "eth0";
 fixed-address 10.1.1.188;
 filename "/pxelinux.0";
 option subnet-mask 255.255.255.0;
 option routers 10.1.1.2;
 option dhcp-lease-time 600;
 option dhcp-message-type 5;
 option domain-name-servers 10.2.0.1,10.2.0.2;
 option dhcp-server-identifier 10.1.1.3;
 renew 1 2012/10/08 12:53:01;
 rebind 1 2012/10/08 12:57:47;
 expire 1 2012/10/08 12:59:02;
}

Still messy, but at least it’s the same address!

Hooking on ntpdate

So the trick is to catch the calls to ntpdate, and kill dchclient if the time jump was big. First some background regarding Ubuntu 12.04 LTS:

If the “Set the time” in the “Time & Date” GUI setting is set to “Automatically set from the Internet”, we’ll have /etc/network/if-up.d/ntpdate call /usr/sbin/ntpdate-debian each time an interface comes up. That’s actually a script, which runs /etc/default/ntpdate to get the server (ntp.ubuntu.com = 91.189.94.4) and options, and eventually goes

exec /usr/sbin/ntpdate $NTPOPTIONS "$@" $NTPSERVERS

So this is a good place to check before-after. Change that to

before=$(/bin/date +%s)
maxafter=$((before + 600))

/usr/sbin/ntpdate $NTPOPTIONS "$@" $NTPSERVERS

after=$(/bin/date +%s)

[ "$after" -gt "$maxafter" ] && killall dhclient

which simply checks the difference between the epoch time before and after the call to ntpdate. If that exceeds 10 minutes, better kill the DHCP client before it wakes up by itself and maybe switches an IP address.

In reality, it’s not such a problem if the IP address happens to change due to a time jump, because it takes some 7-8 seconds from the moment the IP address is acquired until the time jump occurs, and the eth0 interface is shut down again for a restart. So even if the new address is different, it’s compared with an IP address that was there for a very brief moment.

Is this an ugly hack? Indeed it is. But compared with the time travel of 42 years and more, what’s sending a little signal to an innocent process?

 

Reader Comments

Which dhcp client is it? you could consider modifying it to use a monotonic clock on the system.

I recently had a similar problem with nut upsd where if the time was moved backwards it would stop monitoring the ups for the duration of the gap.

I guess that this problem will be prevalent in many applications as well. Most times people do not assume the clock is really going to move in any other direction than forward and in a constant rate.

#1 
Written By Baruch on October 10th, 2012 @ 14:30

The DHCP client is dhclient (as it says all over the post).

And in my case, the clock jumped from 1970 to 2012, so it surely remained monotonic. This was not time moving backwards, it was time jumping.

#2 
Written By eli on October 10th, 2012 @ 14:36

I believe most monotonic clock implementations are also irreverent of time jumps as they look at things like rdtsc and the like so their time movement is constant. At least it was on my tests on intel cpus. The implementation on arm devices may differ.

#3 
Written By Baruch on October 10th, 2012 @ 15:50

It doesn’t matter if it is an ugly hack if it gets the job done. There are no points for style.

#4 
Written By Matt on February 7th, 2013 @ 21:10

This does not work for me without editing the leases file to fix the dates.

So for an even uglier, but working, hack, you can use the following shell function. It sets all dates to next year in the leases file. It seems that they will be reset to correct values in dhcp-lease-time seconds anyway (as far as I can see). I am not going to even pretend that this is very robust, but it seems to solve this very vexing problem for me.

fixleases()
{
year=`date +%Y`
year=`expr $year + 1`
dt=`date +”$year/%m/%d %H:%M:%S”`
leases=`echo /var/lib/NetworkManager/dhclient-*.lease`
#echo LEASES $leases

sed -i \
-e “/^ *renew/c \\ \\ renew 2 ${dt};” \
-e “/^ *rebind/c \\ \\ rebind 2 ${dt};” \
-e “/^ *expire/c \\ \\ expire 2 ${dt};” \
$leases
}

#5 
Written By Jean-Francois Dockes on November 7th, 2017 @ 16:50

Hi !

thanks all this explanation. It was helpfull for me to find out what’s wrong on my laptop.

I’ve had the same kind of issue… only when I’m out of my country.
At home, everything is fine. When I arrive in India, I’ve got the same problem than you. With same message on syslog.

It only needed for me to set up time on my computer to …. local time and reboot.
And then, everything is fine.

#6 
Written By Xavier on May 2nd, 2018 @ 11:42

Add a Comment

required, use real name
required, will not be published
optional, your blog address