A Perl script sending mails for testing a mail server

Just set up your mail server? Congratulations! Now you should test it. In particular, check if it relays mails to other servers and if the response time is reasonable. Here’s a script for doing the testing. Just edit the arguments to send_mail() to match your setting.

#!/usr/bin/perl
use warnings;
use strict;
use Net::SMTP;

send_mail('127.0.0.1', # Host
 'sender@nowhere.com', #From
 'myself@myhost.com', #to
 'Just a test, please ignore',  #Message body
 "Testing email.\n" # Subject
 );

sub send_mail {
 my ($SMTP_HOST, $from, $to_addr, $body, $subject, $msg) = @_;

 $msg = "MIME-Version: 1.0\n"
 . "From: $from\n"
 . "To: " . ( ref($to_addr) ? join(', ', @$to_addr) : $to_addr ) . "\n"
 . "Subject: $subject\n\n"  # Double \n
 . $body;

 #
 # Open a SMTP session
 #
 my $smtp = Net::SMTP->new( $SMTP_HOST,
 'Debug' => 1,       # Set to 0 to turn off debug messages
 Port => 587,
 );

 die("SMTP ERROR: Unable to open smtp session.\n")
 if(!defined($smtp) || !($smtp));

 die("Failed to set FROM address\n")
 if (! ($smtp->mail( $from ) ) );

 die("Failed to set recipient\n")
 if (! ($smtp->recipient( ( ref($to_addr) ? @$to_addr : $to_addr ) ) ) );

 die("Failed to send the message\n")
 if (! ($smtp->data( $msg ) ) );

 $smtp->quit;
}

Two things to note:

The Port assignment above (587) selects the mail submission port. By itself it doesn’t encrypt the connection, and indeed there is no STARTTLS in the session transcript further down. It can be changed to 25, but many servers don’t answer strangers on that port.

July 2024 update: Nowadays, the situation seems to be the opposite. Mail servers that appear in the MX records and are owned by Internet actors seem to answer only on port 25, and apparently only small servers answer on 587. Possibly because port 587 is commonly used for outgoing mail by MUAs (mail clients wishing to send an email) and port 25 is used only between servers (MTAs) …?

And if this script is used to talk with a remote server, odds are it won’t work due to authentication issues. If your server runs sendmail, it can be made less picky by making the following temporary changes to allow for testing:

In /etc/mail/sendmail.mc, change confAUTH_OPTIONS from `A' to `' (nothing, no authentication required). Also, change

DAEMON_OPTIONS(`Port=submission, Name=MSA, M=Ea')dnl

to

DAEMON_OPTIONS(`Port=submission, Name=MSA')dnl

and then compile the configuration file and restart the server with

# make -C /etc/mail
# service sendmail restart

Needless to say, it’s recommended to return the original settings after the testing is done. Your mail server should have some self-respect.

Anyhow, a typical output should look like this:

Net::SMTP>>> Net::SMTP(2.31)
Net::SMTP>>>   Net::Cmd(2.29)
Net::SMTP>>>     Exporter(5.62)
Net::SMTP>>>   IO::Socket::INET(1.31)
Net::SMTP>>>     IO::Socket(1.30_01)
Net::SMTP>>>       IO::Handle(1.27)
Net::SMTP=GLOB(0x7c9d98)<<< 220 myhost.localdomain ESMTP Sendmail 8.14.4/8.14.4; Mon, 14 Jan 2013 14:03:26 +0200
Net::SMTP=GLOB(0x7c9d98)>>> EHLO localhost.localdomain
Net::SMTP=GLOB(0x7c9d98)<<< 250-myhost.localdomain Hello localhost.localdomain [127.0.0.1], pleased to meet you
Net::SMTP=GLOB(0x7c9d98)<<< 250-ENHANCEDSTATUSCODES
Net::SMTP=GLOB(0x7c9d98)<<< 250-PIPELINING
Net::SMTP=GLOB(0x7c9d98)<<< 250-8BITMIME
Net::SMTP=GLOB(0x7c9d98)<<< 250-SIZE
Net::SMTP=GLOB(0x7c9d98)<<< 250-DSN
Net::SMTP=GLOB(0x7c9d98)<<< 250-ETRN
Net::SMTP=GLOB(0x7c9d98)<<< 250-AUTH DIGEST-MD5 CRAM-MD5 LOGIN PLAIN
Net::SMTP=GLOB(0x7c9d98)<<< 250-DELIVERBY
Net::SMTP=GLOB(0x7c9d98)<<< 250 HELP
Net::SMTP=GLOB(0x7c9d98)>>> MAIL FROM:<sender@nowhere.com>
Net::SMTP=GLOB(0x7c9d98)<<< 250 2.1.0 <sender@nowhere.com>... Sender ok
Net::SMTP=GLOB(0x7c9d98)>>> RCPT TO:<myself@myhost.com>
Net::SMTP=GLOB(0x7c9d98)<<< 250 2.1.5 <myself@myhost.com>... Recipient ok
Net::SMTP=GLOB(0x7c9d98)>>> DATA
Net::SMTP=GLOB(0x7c9d98)<<< 354 Enter mail, end with "." on a line by itself
Net::SMTP=GLOB(0x7c9d98)>>> MIME-Version: 1.0
Net::SMTP=GLOB(0x7c9d98)>>> From: sender@nowhere.com
Net::SMTP=GLOB(0x7c9d98)>>> To: myself@myhost.com
Net::SMTP=GLOB(0x7c9d98)>>> Subject: Testing email.
Net::SMTP=GLOB(0x7c9d98)>>>
Net::SMTP=GLOB(0x7c9d98)>>>
Net::SMTP=GLOB(0x7c9d98)>>> Just a test, please ignore
Net::SMTP=GLOB(0x7c9d98)>>> .
Net::SMTP=GLOB(0x7c9d98)<<< 250 2.0.0 r0EC3Qm3030991 Message accepted for delivery
Net::SMTP=GLOB(0x7c9d98)>>> QUIT
Net::SMTP=GLOB(0x7c9d98)<<< 221 2.0.0 myhost.localdomain closing connection
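
For comparison, here is a rough sketch of the same test in Python, which also measures how long the whole session takes (the response time mentioned at the top). It uses the standard smtplib and email modules. The function names build_message and send_test_mail are made up for this example, and the host, port and addresses are placeholders just like in the Perl script.

```python
import smtplib
import time
from email.message import EmailMessage


def build_message(sender, recipients, subject, body):
    """Build a plain-text message. Several recipients are joined with commas."""
    if isinstance(recipients, str):
        recipients = [recipients]
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_test_mail(host, port, msg, timeout=30):
    """Send msg through host:port and return the elapsed time in seconds."""
    start = time.monotonic()
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.set_debuglevel(1)  # Print the SMTP dialog, like Debug => 1
        smtp.send_message(msg)
    return time.monotonic() - start
```

Something like send_test_mail('127.0.0.1', 587, build_message('sender@nowhere.com', 'myself@myhost.com', 'Testing email.', 'Just a test, please ignore')) should then produce a dialog similar to the one above.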

Perl script rectifying the encoding of a mixed UTF-8 / windows-1255 file

Suppose that some hodge-podge of scripts and template files create a single file, which has multiple encodings of Hebrew. That is, both UTF-8 and windows-1255. If the different encodings don’t appear in the same lines, the following script makes it all UTF-8:

#!/usr/bin/perl
use strict;
use warnings;

use Encode;
binmode(STDOUT, ":utf8");

while (defined (my $l = <>)) {
 eval { print decode("utf-8", $l, Encode::FB_CROAK); };
 next unless ($@);

 eval { print decode("windows-1255", $l, Encode::FB_CROAK); };

 if ($@) {
    print STDERR "Failed to decode line $.: $l\n";
    print $l;
  }
}

The binmode() call mutes warnings about wide characters (UTF-8) going to standard output.

The first decode() call does nothing if the handled line contains pure ASCII or UTF-8 encoded characters. The thing about it is the third argument, CHECK, which is set to FB_CROAK (man Encode for details). This tells decode() to die() if a malformed character is encountered. Being enclosed in an eval { }, it’s just a test. If this no-operation decoding goes by peacefully (that is, $@ is empty) we know that the line was printed and one can go on to the next line.

If it does fail, a second attempt to decode() takes place, this time from the selected encoding. If it fails too, the line is printed as is, and a warning is issued to standard error.

As a final note, it looks like this script could be improved by using FB_QUIET instead. This option makes decode() run as far as it can, and overwrites the input variable to contain the undecoded part. So this could be a method to munch through a string chunk by chunk, trying different encodings each time. Or so says the manual page. I never tried it.
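
The same per-line fallback can be sketched in Python, for those who prefer it. This is only an illustration of the idea, not a tested replacement for the Perl script: to_utf8 and main are names invented here, and "cp1255" is Python's name for windows-1255.

```python
import sys


def to_utf8(raw, fallback="cp1255"):
    """Return (utf8_bytes, ok) for one line given as raw bytes.

    Strict UTF-8 is tried first, then the fallback encoding. If both
    fail, the line is returned untouched and ok is False.
    """
    for encoding in ("utf-8", fallback):
        try:
            return raw.decode(encoding).encode("utf-8"), True
        except UnicodeDecodeError:
            pass  # Malformed for this encoding, try the next one
    return raw, False


def main():
    """Filter standard input to standard output, line by line."""
    for number, raw in enumerate(sys.stdin.buffer, start=1):
        out, ok = to_utf8(raw)
        if not ok:
            print(f"Failed to decode line {number}", file=sys.stderr)
        sys.stdout.buffer.write(out)
```

Calling main() turns the sketch into a filter. Working on bytes all the way means there's nothing like the binmode() call to worry about.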

 

Setting up a VPS server. It was a bumpy road.

Introduction

These are my own notes as I set up an OpenVZ VPS server, based upon CentOS 5.6 to function as a web and mailing list server. A $36/year 128 MB RAM machine was good enough for this.

Since there’s some criticism about the hosting provider, and it looks like they’re OK after all, I’m leaving their name out for now. The main purpose of this post is to help myself getting started again, if that is ever necessary (I sure hope it will never be).

Foul #1: Mails from hosting provider marked as spam

This is the first time that automated emails from a service provider went to Gmail’s spam box on me. That includes the welcome mails as I subscribed, the payment confirmation and the message telling me the server is up. And messages about support tickets. None arrived.

Spamassassin gives these mails some points (1.3 or so) as well. I’ve hardly seen anything like this from any decent automatic mail producer. I first thought this was a major blunder, but then it turns out that machine-generated emails tend to get caught by spam filters. Since email messages that are relayed by a mailing list (mailman) don’t get caught, it looks like the spam filter checks the “received” chain of headers for the first hops of the message, and tries to figure out if that’s a decent ISP there. Just a wild guess.

Workaround: Add a filter in Gmail to never send emails from *@the-hosting-provider.com to the spam box. Simple, when you know about it.

Foul #2: 12 hours from payment to server running

Even for a low-cost service, 12 hours of “pending” is a bit too much. In particular when $36 have been paid. That alone filters out most scammers, I suppose.

Foul #3: Root password not set

Maybe I was naive to expect that the root password would be set in the server, so I tried to SSH the server with the password I had assigned during the subscription, but was consistently denied.

Workaround: Enter the VPS control panel and change the password.

Foul #4: Uncertified HTTPS link

The control panel of the VPS is accessed with a link to an IP address. Which is a bit weird, but let’s leave that alone. I mean, what about buying a domain for that purpose? To make things even worse, they supply an HTTPS link as well. Which works, but makes the browser display a scary “GET ME OUT OF HERE” message.

An uncertified HTTPS link is better than HTTP, even though cryptologists will argue that in the absence of a certificate, a man-in-the-middle attack is possible. But let’s get serious. It’s not really dangerous. It’s just yet another sign that they don’t give a shit. Setting up a domain and certifying it is something you would expect from any serious company, just to avoid that scary warning message. But they didn’t.

Bump #1: Lacking yum repository

Among the first things I did after logging in (because I’m addicted):

# yum install git
Loaded plugins: fastestmirror
Determining fastest mirrors
 * base: mirror01.th.ifl.net
 * extras: mirror01.th.ifl.net
 * updates: mirror01.th.ifl.net
base                                                    | 1.1 kB     00:00    
base/primary                                            | 967 kB     00:00    
base                                                                 2725/2725

[ ... yada yada ... ]

vz-updates/primary                                      | 1.0 kB     00:00    
vz-updates                                                                 3/3
Setting up Install Process
No package git available.
Nothing to do

Are you kidding me? Using

# rpm -qa --last | head

I got a list of packages installed, many of which were marked with “el5”, which isn’t surprising,

# cat /etc/redhat-release
CentOS release 5.6 (Final)

since it’s a CentOS 5 distro (EL = Enterprise Linux).

The list of existing repos:

# yum repolist
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirror01.th.ifl.net
 * extras: mirror01.th.ifl.net
 * updates: mirror01.th.ifl.net
repo id                                repo name                                      status
base                                   CentOS-5 - Base                                enabled: 2,725
extras                                 CentOS-5 - Extras                              enabled:   286
updates                                CentOS-5 - Updates                             enabled: 1,003
vz-base                                vz-base                                        enabled:     5
vz-updates                             vz-updates                                     enabled:     3
repolist: 4,022

So where’s git? On Fedora 12, I checked where git was loaded from (for comparison) with

$ yumdb info 'git*'
Loaded plugins: presto, refresh-packagekit
git-1.7.2.3-1.fc12.x86_64
 changed_by = 1010
 checksum_data = 470af233244731e51076c6aac5007e1eebd2f73f23cd685db7cd8bd6fb2b3dd1
 checksum_type = sha256
 command_line = install git-email
 from_repo = updates
 from_repo_revision = 1291265770
 from_repo_timestamp = 1291266900
 reason = user
 releasever = 12

[ ... here comes info about git-daemon and other packages ]

So CentOS’ repository doesn’t have git? That looks odd to me. A last try:

# yum list available | less

No, git wasn’t on the list. The fix was to add Repoforge to the list of repositories on the server (following the instructions):

# wget http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
# rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm

And then “yum install git” went fine.

Bump #2: Bad host name

OK, it’s partly my fault. A short host name (without a dot) isn’t good enough. At least not for sending mails. A fully-qualified host name (such as example.com, as opposed to just “example”) is needed. Or sendmail starts up very slowly and then refuses to send mails.

Bump #3: Setting up reverse DNS

For the server to be able to send emails that aren’t immediately detected as spam, its IP must have the reverse DNS set to its host name.

There is a place to edit the rDNS name in the control panel (under “Network”) but it was disabled. Contact support, it said. So I did.

Having rDNS disabled by default is somewhat understandable to at least keep an eye on spammers. On the other hand, show me a spammer paying $36 upfront.

It took support 11 hours to answer this support request, asking me to supply the rDNS record I needed for manual setting. The actual fix came an hour later, so overall this was fairly OK.

The automatic feature is simply not supported. But it’s not like decent people need to change their rDNS every day.

Bump #4: No swap

It seems like there is no way to activate a swap file on the VPS server (in particular, losetup returns with “permission denied” so there is nothing to attach the swap file to). So there’s no choice but to make sure that the overall memory consumption doesn’t exceed the allocation of virtual RAM, which is 128 MB in my case. Or processes will just die. I can understand the commercial sense in this limitation: If users would start putting large swap files on their systems, they would buy lower-cost machines and then complain that they’re not responsive.

The figure to keep track of is the amount of free + cached memory. For example,

$ cat /proc/meminfo
MemTotal:         131072 kB
MemFree:           47516 kB
Cached:            27156 kB
[ ... ]

The free memory is 47516 + 27156 = 74672 kB, which means 131072 – 74672 = 56400 kB is used. These are the figures that appear in the Control Panel.
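
To keep an eye on this figure without doing the arithmetic by hand, something like the following Python sketch could do. The function names are mine, and SAMPLE just repeats the three lines from the listing above. On the real machine, the text would come from reading /proc/meminfo.

```python
def parse_meminfo(text):
    """Turn 'Name:   12345 kB' lines into a dictionary of integers (kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_kb(fields):
    """Free + cached memory, the figure to keep track of."""
    return fields["MemFree"] + fields["Cached"]


def used_kb(fields):
    """What counts as used against the VPS' allocation."""
    return fields["MemTotal"] - available_kb(fields)


SAMPLE = """MemTotal:         131072 kB
MemFree:           47516 kB
Cached:            27156 kB
"""

fields = parse_meminfo(SAMPLE)
print(available_kb(fields), used_kb(fields))
```

For live figures, replace SAMPLE with open("/proc/meminfo").read().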

Installation note: Setting up sendmail

By default, sendmail doesn’t accept external connections. Edit /etc/mail/sendmail.mc, changing

DAEMON_OPTIONS(`Port=smtp,Addr=127.0.0.1, Name=MTA')dnl

to

DAEMON_OPTIONS(`Port=smtp, Name=MTA')dnl

This removes the restriction that only the local address is listened to. And also from

dnl DAEMON_OPTIONS(`Port=submission, Name=MSA, M=Ea')dnl

to

DAEMON_OPTIONS(`Port=submission, Name=MSA, M=Ea')dnl

(remove the “dnl” hence uncommenting the line).

Check with the hosting provider if they supply a mail relay server. Relaying through a well-reputed server can decrease the spam score of your mails. Besides, the hosting provider can decide to block all outgoing direct connections with mail servers out of the blue, because spams were flying out from their servers.

A line like the following should be added (possibly close to where SMART_HOST is mentioned in the file).

define(`SMART_HOST',`relay.myprovider.com')dnl

And then compile the file into sendmail.cf, and restart the server as follows:

# make -C /etc/mail
# service sendmail restart

Then test the server, possibly using a script. In particular, verify that the mail server isn’t relaying (accepting messages to other domains) or the server turns into a spam machine.
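
A relay check can be sketched in Python with the standard smtplib module. The idea is to offer the server a recipient in a domain it has nothing to do with, and see if it says yes. No mail is actually sent, because the session is reset right after RCPT TO. The names probe_relay and rcpt_accepted are invented for this example, and so are the addresses one would pass. Note that this is meaningful only when run from a machine outside the server, since connections from localhost are typically allowed to relay.

```python
import smtplib


def rcpt_accepted(code):
    """SMTP reply codes 250 and 251 mean that the recipient was accepted."""
    return code in (250, 251)


def probe_relay(host, port, sender, outside_rcpt, timeout=30):
    """Return True if host accepts a recipient in a foreign domain.

    The session is reset after RCPT TO, so no message is delivered.
    """
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.ehlo()
        smtp.mail(sender)
        code, _reply = smtp.rcpt(outside_rcpt)
        smtp.rset()
    return rcpt_accepted(code)
```

If probe_relay() returns True for an address that belongs to someone else's domain, the server is probably willing to relay, and that should be fixed before anyone else finds out.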

Installation note: Mailman

# yum install mailman

Then customize. See instructions here and also have a look at /usr/share/doc/mailman-2.1.9/INSTALL.REDHAT. There’s a need to access the host by its domain name (as opposed to just the IP address) so the local computer’s /etc/hosts may need to be fiddled with when working on a server not yet allocated with the new address.

Note: Do not edit /usr/lib/mailman/Mailman/mm_cfg.py for changing DEFAULT_URL_HOST and DEFAULT_EMAIL_HOST, if they happen to say

DEFAULT_URL_HOST   = fqdn
DEFAULT_EMAIL_HOST = fqdn

because in this case they’re set up automagically.

First, make sure that the mailman daemon is off:

# service mailman stop

To migrate a few lists from one server to another, copy the respective lists in /var/lib/mailman/{lists,archives} into the new server. Note that the “data” directory doesn’t contain any information on the lists, so just adding these directories is enough.

The lists will not appear in the web interface if there was a domain switch during the list migration, as can be observed by searching for ‘web_page_url’ in the output of

# /usr/lib/mailman/bin/dumpdb config.pck

To fix this, go (for each list)

# /usr/lib/mailman/bin/withlist -l -r fix_url the-list-name

Make sure the files are owned by mailman with

# chown -R mailman:mailman ... the copied directories ...

At this point, the list should appear on the web console. Copy the entries into /etc/aliases, more or less like this:

## listname mailing list
listname:              "|/usr/lib/mailman/mail/mailman post listname"
listname-admin:        "|/usr/lib/mailman/mail/mailman admin listname"
listname-bounces:      "|/usr/lib/mailman/mail/mailman bounces listname"
listname-confirm:      "|/usr/lib/mailman/mail/mailman confirm listname"
listname-join:         "|/usr/lib/mailman/mail/mailman join listname"
listname-leave:        "|/usr/lib/mailman/mail/mailman leave listname"
listname-owner:        "|/usr/lib/mailman/mail/mailman owner listname"
listname-request:      "|/usr/lib/mailman/mail/mailman request listname"
listname-subscribe:    "|/usr/lib/mailman/mail/mailman subscribe listname"
listname-unsubscribe:  "|/usr/lib/mailman/mail/mailman unsubscribe listname"
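
With more than a couple of lists, typing these entries by hand gets boring. A small Python sketch for generating them, assuming the same wrapper path as above (alias_lines is a name I made up, and the column alignment merely imitates the listing above):

```python
ACTIONS = ["post", "admin", "bounces", "confirm", "join",
           "leave", "owner", "request", "subscribe", "unsubscribe"]


def alias_lines(listname, wrapper="/usr/lib/mailman/mail/mailman"):
    """Return the /etc/aliases lines for one mailing list."""
    # Align the commands two spaces after the longest alias name
    width = len(listname) + len("-unsubscribe:") + 2
    lines = [f"## {listname} mailing list"]
    for action in ACTIONS:
        # The posting address is the bare list name, without a suffix
        alias = listname if action == "post" else f"{listname}-{action}"
        key = (alias + ":").ljust(width)
        lines.append(f'{key}"|{wrapper} {action} {listname}"')
    return lines


print("\n".join(alias_lines("listname")))
```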

And finally, turn the service on again:

# service mailman start

And make it a permanent service:

# chkconfig mailman on

Changing the subscription confirmation message (to tell users to look in the spam folder): Edit /usr/lib/mailman/templates/en/subscribe.html, and remove /var/lib/mailman/lists/{listname}/en/subscribe.html (if it exists, and contains nothing special for the list).

Then restart qrunner to flush cached templates:

# /usr/lib/mailman/bin/mailmanctl restart

See these two pages for more info about this (even though they’re not very accurate).

Fax to PC on HP Officejet 4500

I bought the Officejet 4500 to be able to send and receive faxes every now and then, use it as a scanner, and actually never print with it. I’m not sure I would recommend this to anyone. Not that it matters much, as it’s pretty phased out. Anyhow, the overall feeling about this machine and its software is that it looks for any excuse in the world to waste some ink. After all, ink is what HP makes its profits from.

The device (including its fax) refuses to do anything without ink cartridges with a good level of ink installed. Since I don’t intend to print at all, my main concern is that the ink will dry out, and I won’t be able to fax with it a few years from now, with no spare parts available.

Sometimes the machine won’t power up completely (with the exclamation sign blinking) unless there is paper loaded. It’s like the printer prepares to hijack some paper for something I never wanted it to do.

Many years ago, HP represented decent engineering. They’ve gone a long way since.

OK, to the point

First, FAX to PC must be activated on the PC side. In the HP solution center, click Settings, hover over “More Fax Settings” and pick “Digital Fax Setup Wizard”. This setting has most likely been completed during installation (a destination folder set up etc.).

If the Fax to PC feature has been (accidentally) deactivated on the printer, pick “Fax Settings” in the HP solution center, and make sure the feature is activated on the “Digital Fax Settings” tab.

Then, on the printer, press the button with the wrench (“Tools”) > Fax Settings > Fax to PC and press the right arrow button until it says “Fax Print: Off” and then press OK. Don’t get confused by the part that says “Fax to PC: Off”: It isn’t really off, but it will be when pressing OK (and the fix above applies). An asterisk next to the setting indicates that it’s active, otherwise that state will become active only when pressing OK.

To receive a fax manually, press the left arrow as necessary to move the triangle to its leftmost position, press the green (“Start”?) button and the “2” button for receiving a fax (with the phone off hook at least).

Remember that the HP Digital Imaging Monitor must be active for the fax to be received to the computer.

After a fax has arrived, a new file is silently generated in the designated destination folder (e.g. C:\fax). It’s a TIFF file with the name made up from the date and time the fax arrived (e.g. _20121217_114702.tif). No message appears on the screen.

If the computer is off, or the Monitor is off, the received fax will be pending in the machine without any message on the LCD display or anything indicating that action should be taken. The file is created as soon as the Monitor is activated (plus some 20 seconds or so). I don’t know how long the fax stays there, but I suppose turning off the fax machine will delete it.

It seems like there’s no support for Fax to PC for Linux. It looks like someone got fired at HP for supporting an ink-saving feature altogether.

I really miss my good old plain fax modem.

Signed arithmetics in Verilog: The only rule one needs to know

The golden rule is: All operands must be signed.

Verilog, it seems, is strongly inclined towards unsigned numbers. Any of the following yield an unsigned value:

  • Any operation on two operands, unless both operands are signed.
  • Based numbers (e.g. 12'd10), unless the explicit “s” modifier is used (as in 12'sd10)
  • Bit-select results
  • Part-select results
  • Concatenations

So the bottom line is to either use the $signed system function, or define signed wires and registers.

For example, to multiply a signed and unsigned register, yielding a signed value (of course), go something like this:

reg         [15:0] a; // Unsigned
reg signed  [15:0] b;
wire signed [16:0] signed_a;
wire signed [31:0] a_mult_b;

assign signed_a = a; // Convert to signed
assign a_mult_b = signed_a * b;

Note that signed_a is one bit wider than “a”, so there’s room for the sign bit, which is always zero. If it weren’t for this extra bit, a’s MSB would be treated as the sign bit in signed_a.

It may seem necessary to explicitly determine signed_a’s MSB with something like {1'b0, a} instead of just “a”, but the Verilog standard is pretty explicit about the signed vs. unsigned being determined by the expression only, and not by the left hand side. So “a” is treated as an unsigned value, and is hence extended by zero.
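
To see why the extra bit matters, here is the arithmetic modeled in Python (this is a model of two's complement interpretation, not Verilog simulation, and to_signed is a helper invented for the purpose). The values are the worst cases for the example above: the largest “a” and the most negative “b”.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's complement number."""
    value &= (1 << width) - 1
    if value & (1 << (width - 1)):
        value -= 1 << width
    return value


a = 0xFFFF                  # 16-bit unsigned, 65535
b = to_signed(0x8000, 16)   # 16-bit signed, -32768

as_16 = to_signed(a, 16)    # Same width: the MSB becomes the sign bit
as_17 = to_signed(a, 17)    # One bit wider: zero extension, value kept

product = as_17 * b
fits_32 = to_signed(product, 32) == product

print(as_16, as_17, product, fits_32)
```

With 16 bits, 0xFFFF turns into -1. With 17 bits it remains 65535, and the most negative product still fits in the 32 bits allocated for a_mult_b.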

 

Permission denied to directory, despite group permission set OK

I tried to change directory to eli from other users belonging to the group “eli” and it failed with

$ cd ../eli/
-bash: cd: ../eli/: Permission denied

despite everything being OK with the classic UNIX settings.

Reminder: After setting groups, there’s a need to either logout and login again, or use “su -” to refresh group settings. The “id” command reveals the effective group memberships.

It turns out that there’s another layer of settings, ACL (Access Control List), which is yet another way to make sure the computer is so protected that it drives you mad.

So let’s list the files:

$ ls -l
total 44
drwxrwx---+ 86 eli         eli          4096 2012-10-16 16:14 eli/
drwx------.  2 root        root        16384 2010-01-15 23:59 lost+found/

Note the ‘+’ and ‘.’ at the end of the “regular” permissions. The ‘+’ tells us that there’s an ACL record on the “eli” directory (a ‘.’ means there is no ACL, only a security context label). So effectively, the classic permissions are overridden. And this has nothing to do with SELinux, which is disabled on my computer.

Let’s see what we’ve got there:

$ getfacl eli
# file: eli
# owner: eli
# group: eli
user::rwx
user:qemu:--x
group::---
mask::rwx
other::---

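So it means what it says: Despite the classic permissions, no one except myself and qemu has permissions to the directory. The “rwx” that ls shows for the group is actually the ACL mask, not the permissions of the owning group, which are “---”.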
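
How such an ACL gets evaluated can be sketched in Python. This is a simplified model of the access check as described in the acl(5) man page, written for illustration only: the function names are mine, root's privileges are ignored, and the text fed to it is the getfacl output for the directory.

```python
def parse_getfacl(text):
    """Return a list of (tag, qualifier, permission set) from getfacl output."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        tag, qualifier, perms = line.split(":")
        entries.append((tag, qualifier, set(perms.replace("-", ""))))
    return entries


def may_access(entries, owner, owning_group, user, groups, wanted):
    """Check if user (a member of groups) gets all permissions in wanted."""
    wanted = set(wanted)
    mask = {"r", "w", "x"}  # Without a mask entry, nothing is masked out
    for tag, _qualifier, perms in entries:
        if tag == "mask":
            mask = perms
    if user == owner:  # The owner's entry isn't subject to the mask
        return any(t == "user" and q == "" and wanted <= p
                   for t, q, p in entries)
    for t, q, p in entries:  # Named user entries come next
        if t == "user" and q == user:
            return wanted <= (p & mask)
    matching = [p for t, q, p in entries if t == "group" and
                ((q == "" and owning_group in groups) or
                 (q != "" and q in groups))]
    if matching:  # Any group match means that "other" is never consulted
        return any(wanted <= (p & mask) for p in matching)
    return any(t == "other" and wanted <= p for t, q, p in entries)


ACL = parse_getfacl("""user::rwx
user:qemu:--x
group::---
mask::rwx
other::---
""")

print(may_access(ACL, "eli", "eli", "someone", ["eli"], "x"))
```

The last line asks whether a member of the “eli” group may enter the directory. The answer is False, because the group’s own entry is “---”, no matter what the mask says.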

The remedy is to remove all ACL entries, and then set the permissions with chmod.

$ setfacl -b eli
$ ls -l
total 40
drwx------. 86 eli         eli          4096 2012-10-16 16:14 eli/
drwx------.  2 root        root        16384 2010-01-15 23:59 lost+found/
$ chmod g+xrw eli/
$ getfacl eli
# file: eli
# owner: eli
# group: eli
user::rwx
group::rwx
other::---

And now the system behaves like good old UNIX.

 

Workaround: “git push” from msysgit (git for Windows) hangs

I had set up my plain git-daemon and everything seemed to work fine, until I tried it from Windows. It just didn’t return from the command. According to a discussion in a newsgroup, the problem is a bug in msysgit’s implementation of side-band-64k, whatever that is.

Personally, even if an upgrade to msysgit existed, or if it was fairly easy to fix this and recompile, I wouldn’t want to take that path: That would require quite a few fixes to make in my case.

So my choice was to fix it on the server. The bug isn’t there, but I can stop the server from announcing that it supports this feature, so the client won’t even try. I don’t know what the impact of disabling this feature is, but it seems like it is about allowing status data for the impatient user to be sent multiplexed with actual object data.

In a xinetd setting, the main daemon runs git-daemon when a connection is made, which in turn calls “/usr/bin/git receive-pack” on a push request from the client. Just to make it clear, there is a git-receive-pack executable, but it’s not the one executed. Makes me wonder why it’s there at all.

The git program discloses its capabilities by sending a string saying something like “report-status delete-refs side-band-64k ofs-delta” on the TCP stream (on git version 1.7.2.3). So all that’s needed is to make sure the “side-band-64k” part is not transmitted. I suppose that will influence the way I’ll fetch from remote repositories in the future, but that’s a minor impact (I hope).

The sane way to fix this would be getting the source for git and recompile. I went for hacking the binary directly.

Namely, use XEmacs to open /usr/bin/git for hex editing (after making a backup copy, of course), find the place where that long capability string appears, and change “side-band-64k” to “side-bond-64k”. It’s exactly one byte to fix in the binary file.
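
For those who don’t feel like hex editing by hand, the same kind of patch can be sketched in Python. The helper below is my own invention for this post. It insists that the replacement has the same length and that the string appears exactly once, so that nothing shifts inside the binary and nothing else is changed by mistake. If the string happens to appear more than once in a given binary, it refuses to do anything, and one is back to the hex editor.

```python
def patch_same_length(data, old, new):
    """Replace the single occurrence of old in data with new (same length)."""
    if len(old) != len(new):
        raise ValueError("replacement must have the same length")
    found = data.count(old)
    if found != 1:
        raise ValueError(f"expected exactly one occurrence, found {found}")
    return data.replace(old, new)


# Demonstration on a made-up blob containing the capability string
blob = b"\x00report-status delete-refs side-band-64k ofs-delta\x00"
patched = patch_same_length(blob, b"side-band-64k", b"side-bond-64k")
print(patched)
```

On the real thing, the data would be read from a backup copy of the executable in binary mode, and the result written to a new file to replace it.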

And that does the trick. Not the most beautiful workaround, but quick and effective!

I suppose a similar manipulation would work on the client’s executable. But as mentioned above, it’s not practical for me.

DHCP changing the IP address suddenly on embedded systems

The problem

It first seemed extremely odd: The IP address of my embedded Linux machine changed suddenly after a few hours, breaking the ssh connection I had up, and messing up the NFS mount.

The problem turned out to be the lack of a hardware clock (RTC) on the board + clock being updated with NTP + the DHCP address obtained by dhclient.

(Why there’s no RTC is completely beyond me. After all, a decent chip costs no more than $0.75 and is so important to the OS’ health).

Here’s a typical acquisition of an address, as seen in /var/log/syslog:

Jan  1 00:00:13 localhost dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
Jan  1 00:00:14 localhost dhclient: DHCPREQUEST of 10.1.1.196 on eth0 to 255.255.255.255 port 67
Jan  1 00:00:14 localhost dhclient: DHCPOFFER of 10.1.1.196 from 10.1.1.3
Jan  1 00:00:14 localhost dhclient: DHCPACK of 10.1.1.196 from 10.1.1.3

A few lines later it says

Jan  1 00:00:16 localhost dhclient: bound to 10.1.1.196 -- renewal in 8512 seconds.

So far so good. Only the time is around January 1st, 1970 midnight, that is epoch time zero + 16 seconds. No hardware clock, so the system started from zero. But soon enough the clock jumps:

Sep 29 17:32:44 localhost ntpdate[1858]: step time server 91.189.94.4 offset 1348939931.970235 sec

And after quite a while we have

Sep 29 20:23:39 localhost NetworkManager[1375]: <info> (eth0): DHCPv4 state changed bound -> expire
Sep 29 20:23:39 localhost NetworkManager[1375]: <info> (eth0): DHCPv4 state changed expire -> preinit
Sep 29 20:23:39 localhost dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
Sep 29 20:23:42 localhost dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 4
Sep 29 20:23:43 localhost dhclient: DHCPREQUEST of 10.1.1.197 on eth0 to 255.255.255.255 port 67
Sep 29 20:23:43 localhost dhclient: DHCPOFFER of 10.1.1.197 from 10.1.1.3
Sep 29 20:23:43 localhost dhclient: DHCPACK of 10.1.1.197 from 10.1.1.3
Sep 29 20:23:43 localhost dhclient: bound to 10.1.1.197 -- renewal in 9793 seconds.

The DHCPDISCOVER indicates that the DHCP client isn’t renewing its lease, but wants to start over.

And /var/lib/dhcp/dhclient-ca4860e2-17c8-4280-a48c-f9d3a77f53aa-eth0.lease read, after all this:

lease {
 interface "eth0";
 fixed-address 10.1.1.196;
 filename "/pxelinux.0";
 option subnet-mask 255.255.255.0;
 option routers 10.1.1.2;
 option dhcp-lease-time 21600;
 option dhcp-message-type 5;
 option domain-name-servers 10.2.0.1,10.2.0.2;
 option dhcp-server-identifier 10.1.1.3;
 renew 4 1970/01/01 02:51:26;
 rebind 4 1970/01/01 05:15:19;
 expire 4 1970/01/01 06:00:19;
}
lease {
 interface "eth0";
 fixed-address 10.1.1.197;
 filename "/pxelinux.0";
 option subnet-mask 255.255.255.0;
 option routers 10.1.1.2;
 option dhcp-lease-time 21600;
 option dhcp-message-type 5;
 option domain-name-servers 10.2.0.1,10.2.0.2;
 option dhcp-server-identifier 10.1.1.3;
 renew 6 2012/09/29 23:06:56;
 rebind 0 2012/09/30 01:38:43;
 expire 0 2012/09/30 02:23:43;
}

Now let’s try to figure out what happened: First, Network Manager launched dhclient with something like

/sbin/dhclient -d -4 -sf /usr/lib/NetworkManager/nm-dhcp-client.action -pf /var/run/sendsigs.omit.d/network-manager.dhclient-eth0.pid -lf /var/lib/dhcp/dhclient-ca4860e2-17c8-4280-a48c-f9d3a77f53aa-eth0.lease -cf /var/run/nm-dhclient-eth0.conf eth0

Note that the lease file is given explicitly.  An address is acquired, and the first entry in the lease file was made.

There is a slight contradiction between the statement in the log file that renewal was to take place 8512 seconds later (that is on epoch time 8528 = 02:22:08) and the renewal time in the lease file (02:51:26).

And then the NTP client kicked the time by some 42 years, so the leased IP address suddenly seems way outdated. But the DHCP client sleeps carelessly, being confident there is nothing to do for quite a while. This changes when apparently Network Manager wakes it up again, on what should have been 02:51:26 on Jan 1st 1970, but turned out to be 20:23:39 on Sep 29th 2012, which is 02:50:55 (hours, minutes and seconds) after the clock jump to 17:32:44. Talk about oversleeping.

So the DHCP client assumes everything has expired, and issues two DHCPDISCOVER requests, starting it all over again. The server supplies a new IP address, and oops, all network connections are suddenly lost. And the second entry in the lease file is generated.
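
The time arithmetic of this story can be double-checked with a few lines of Python. All figures are copied from the log excerpts and the lease file quoted in this post, and only the variable names are mine.

```python
from datetime import datetime, timedelta

# First lease: bound at epoch + 16 seconds, "renewal in 8512 seconds"
first_renewal = timedelta(seconds=16 + 8512)

# The lease time handed out by the server (option dhcp-lease-time)
lease_time = timedelta(seconds=21600)

# From the clock jump (ntpdate's log line) to the DHCPDISCOVER
jump = datetime(2012, 9, 29, 17, 32, 44)
rediscover = datetime(2012, 9, 29, 20, 23, 39)
overslept = rediscover - jump

# Second lease: bound at 20:23:43, "renewal in 9793 seconds"
second_renewal = datetime(2012, 9, 29, 20, 23, 43) + timedelta(seconds=9793)

print(first_renewal, lease_time, overslept, second_renewal)
```

The second renewal time comes out as 23:06:56, which matches the second entry in the lease file, so the discrepancy between the log and the lease file is limited to the first lease.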

The somewhat odd solution

Kill dhclient when ntpdate has caused a big jump. As simple as that. It’s not clear why this works at all, but there doesn’t seem to be a way to convince dhclient to just renew the IP and update the timestamps. Maybe updating the timestamps in the lease file would work, but that’s a bit too sophisticated for me right now.
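
Detecting the big jump can be done by looking at ntpdate’s line in the syslog. Here is a sketch in Python of that part only. The regular expression is written against the single log line quoted above, so it’s an assumption that other versions of ntpdate format it the same way. The one hour threshold is an arbitrary choice of mine, and what to do with dhclient after a jump has been detected is left out.

```python
import re

# Matches e.g. "ntpdate[1858]: step time server 91.189.94.4 offset 1348939931.970235 sec"
STEP_RE = re.compile(
    r"ntpdate\[\d+\]: step time server \S+ offset (-?\d+(?:\.\d+)?) sec")


def clock_step_seconds(line):
    """Return the step reported by ntpdate in a syslog line, or None."""
    match = STEP_RE.search(line)
    return float(match.group(1)) if match else None


def is_big_jump(line, threshold=3600.0):
    """True if the line reports a clock step larger than threshold seconds."""
    step = clock_step_seconds(line)
    return step is not None and abs(step) > threshold


LINE = ("Sep 29 17:32:44 localhost ntpdate[1858]: step time server "
        "91.189.94.4 offset 1348939931.970235 sec")
print(clock_step_seconds(LINE), is_big_jump(LINE))
```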

Did Network Manager like it? No and no. It actually restarted the entire setup of eth0, causing a connection blackout of a few seconds.

Oct  8 12:48:57 localhost NetworkManager[1306]: <info> (eth0): DHCPv4 client pid 1501 exited with status -1
Oct  8 12:48:57 localhost NetworkManager[1306]: <warn> DHCP client died abnormally
Oct  8 12:48:57 localhost NetworkManager[1306]: <info> (eth0): device state change: activated -> failed (reason 'ip-config-expired') [100 120 6]
Oct  8 12:48:58 localhost NetworkManager[1306]: <warn> Activation (eth0) failed.
Oct  8 12:48:58 localhost dbus[831]: [system] Activating service name='org.freedesktop.nm_dispatcher' (using servicehelper)
Oct  8 12:48:58 localhost NetworkManager[1306]: <info> (eth0): device state change: failed -> disconnected (reason 'none') [120 30 0]
Oct  8 12:48:58 localhost NetworkManager[1306]: <info> (eth0): deactivating device (reason 'none') [0]
Oct  8 12:48:58 localhost avahi-daemon[1336]: Withdrawing address record for 10.1.1.188 on eth0.
Oct  8 12:48:58 localhost avahi-daemon[1336]: Leaving mDNS multicast group on interface eth0.IPv4 with address 10.1.1.188.
Oct  8 12:48:58 localhost avahi-daemon[1336]: Interface eth0.IPv4 no longer relevant for mDNS.
Oct  8 12:48:58 localhost dnsmasq[1700]: exiting on receipt of SIGTERM
Oct  8 12:48:58 localhost NetworkManager[1306]: <info> DNS: starting dnsmasq...
Oct  8 12:48:58 localhost NetworkManager[1306]: <info> (eth0): writing resolv.conf to /sbin/resolvconf
Oct  8 12:48:58 localhost dnsmasq[2083]: started, version 2.59 cache disabled
Oct  8 12:48:58 localhost dnsmasq[2083]: compile time options: IPv6 GNU-getopt DBus i18n DHCP TFTP conntrack IDN
Oct  8 12:48:58 localhost dnsmasq[2083]: warning: no upstream servers configured
Oct  8 12:48:58 localhost dbus[831]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Auto-activating connection 'Wired connection 1'.
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) starting connection 'Wired connection 1'
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> (eth0): device state change: disconnected -> prepare (reason 'none') [30 40 0]
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 1 of 5 (Device Prepare) scheduled...
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 1 of 5 (Device Prepare) started...
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 2 of 5 (Device Configure) scheduled...
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 1 of 5 (Device Prepare) complete.
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 2 of 5 (Device Configure) starting...
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> (eth0): device state change: prepare -> config (reason 'none') [40 50 0]
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 2 of 5 (Device Configure) successful.
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) scheduled.
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 2 of 5 (Device Configure) complete.
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) started...
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> (eth0): device state change: config -> ip-config (reason 'none') [50 70 0]
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Beginning DHCPv4 transaction (timeout in 45 seconds)
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> dhclient started with pid 2105
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Beginning IP6 addrconf.
Oct  8 12:49:01 localhost avahi-daemon[1336]: Withdrawing address record for fe80::bc2d:19ff:fe77:94ab on eth0.
Oct  8 12:49:01 localhost avahi-daemon[1336]: Leaving mDNS multicast group on interface eth0.IPv6 with address fe80::bc2d:19ff:fe77:94ab.
Oct  8 12:49:01 localhost avahi-daemon[1336]: Interface eth0.IPv6 no longer relevant for mDNS.
Oct  8 12:49:01 localhost NetworkManager[1306]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) complete.
Oct  8 12:49:01 localhost dhclient: Internet Systems Consortium DHCP Client 4.1-ESV-R4
Oct  8 12:49:01 localhost dhclient: Copyright 2004-2011 Internet Systems Consortium.

and then

Oct  8 12:49:01 localhost dhclient: Listening on LPF/eth0/be:2d:19:77:94:ab
Oct  8 12:49:01 localhost dhclient: Sending on   LPF/eth0/be:2d:19:77:94:ab
Oct  8 12:49:01 localhost dhclient: Sending on   Socket/fallback
Oct  8 12:49:01 localhost dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
Oct  8 12:49:02 localhost dhclient: DHCPREQUEST of 10.1.1.188 on eth0 to 255.255.255.255 port 67
Oct  8 12:49:02 localhost dhclient: DHCPOFFER of 10.1.1.188 from 10.1.1.3
Oct  8 12:49:02 localhost dhclient: DHCPACK of 10.1.1.188 from 10.1.1.3
Oct  8 12:49:02 localhost dhclient: bound to 10.1.1.188 -- renewal in 239 seconds

The use of DHCPDISCOVER is not a good sign: a renewal would consist of a request only. But at least there’s only one of them. When the address changed, there were two DHCPDISCOVERs in a row.

(And never mind the short lease time. I changed this on the server temporarily to speed up things)

And the lease file now reads:

lease {
 interface "eth0";
 fixed-address 10.1.1.188;
 filename "/pxelinux.0";
 option subnet-mask 255.255.255.0;
 option dhcp-lease-time 600;
 option routers 10.1.1.2;
 option dhcp-message-type 5;
 option dhcp-server-identifier 10.1.1.3;
 option domain-name-servers 10.2.0.1,10.2.0.2;
 renew 4 1970/01/01 00:05:19;
 rebind 4 1970/01/01 00:09:05;
 expire 4 1970/01/01 00:10:20;
}
lease {
 interface "eth0";
 fixed-address 10.1.1.188;
 filename "/pxelinux.0";
 option subnet-mask 255.255.255.0;
 option routers 10.1.1.2;
 option dhcp-lease-time 600;
 option dhcp-message-type 5;
 option domain-name-servers 10.2.0.1,10.2.0.2;
 option dhcp-server-identifier 10.1.1.3;
 renew 1 2012/10/08 12:53:01;
 rebind 1 2012/10/08 12:57:47;
 expire 1 2012/10/08 12:59:02;
}

Still messy, but at least it’s the same address!
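To see at a glance which entries in a lease file are already considered expired, something like the following can help. This is only a rough sketch: the function name is made up, the path to the lease file is passed as an argument (it differs between setups), GNU date is assumed for the -d option, and I'm assuming the timestamps in the file are in UTC, which is dhclient's default as far as I know.

```shell
#!/bin/sh
# Print each "expire" timestamp found in a dhclient lease file, followed by
# "expired" or "valid". Hypothetical helper, not part of dhclient.
# $1: lease file, $2: "now" in epoch seconds (defaults to the current time)
lease_expiry_status() {
    now=${2:-$(date +%s)}
    # Matching lines look like:  expire 1 2012/10/08 12:59:02;
    sed -n 's/^[[:space:]]*expire [0-9] \([0-9/]* [0-9:]*\);.*$/\1/p' "$1" |
    while read -r stamp; do
        # Turn 2012/10/08 into 2012-10-08 before handing it to GNU date
        iso=$(printf '%s' "$stamp" | tr '/' '-')
        t=$(date -u -d "$iso" +%s) || continue
        if [ "$t" -lt "$now" ]; then
            echo "$stamp expired"
        else
            echo "$stamp valid"
        fi
    done
}
```

Run against the file above right after the clock jump, the 1970 entry shows up as expired, which is exactly what makes dhclient start all over.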

Hooking on ntpdate

So the trick is to catch the calls to ntpdate, and kill dhclient if the time jump was big. First some background regarding Ubuntu 12.04 LTS:

If “Set the time” in the “Time & Date” GUI settings is set to “Automatically set from the Internet”, /etc/network/if-up.d/ntpdate calls /usr/sbin/ntpdate-debian each time an interface comes up. That’s actually a script, which reads /etc/default/ntpdate to get the server (ntp.ubuntu.com = 91.189.94.4) and options, and eventually goes

exec /usr/sbin/ntpdate $NTPOPTIONS "$@" $NTPSERVERS

So this is a good place to check before-after. Change that to

before=$(/bin/date +%s)
maxafter=$((before + 600))

/usr/sbin/ntpdate $NTPOPTIONS "$@" $NTPSERVERS

after=$(/bin/date +%s)

[ "$after" -gt "$maxafter" ] && killall dhclient

which simply checks the difference between the epoch time before and after the call to ntpdate. If that exceeds 10 minutes, better kill the DHCP client before it wakes up by itself and maybe switches an IP address.
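The same check can be wrapped in a couple of functions, which also keeps ntpdate's exit status for whoever called the script. This is just a sketch under my own assumptions: the function names are invented, and treating a backward jump like a forward one (by taking the absolute difference) is my guess at what's sensible, not something I've tested with dhclient.

```shell
#!/bin/sh
# Succeeds (returns 0) if the clock moved by more than the limit, in either
# direction. $1: epoch before, $2: epoch after, $3: limit in seconds (600).
clock_jumped() {
    before=$1
    after=$2
    limit=${3:-600}
    diff=$((after - before))
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    [ "$diff" -gt "$limit" ]
}

# Run a time-setting command, kill dhclient if the clock jumped, and return
# the command's own exit status.
run_and_check() {
    t0=$(date +%s)
    "$@"
    status=$?
    t1=$(date +%s)
    if clock_jumped "$t0" "$t1"; then
        killall dhclient
    fi
    return "$status"
}

# Inside ntpdate-debian, the call would then become something like:
# run_and_check /usr/sbin/ntpdate $NTPOPTIONS "$@" $NTPSERVERS
```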

In reality, it’s not such a problem if the IP address happens to change due to a time jump, because it takes some 7-8 seconds from the moment the IP address is acquired until the time jump occurs, and the eth0 interface is shut down again for a restart. So even if the new address is different, it’s compared with an IP address that was there for a very brief moment.

Is this an ugly hack? Indeed it is. But compared with the time travel of 42 years and more, what’s sending a little signal to an innocent process?

 

High resolution images of the Zedboard

At times, it’s useful to have a high-resolution picture of the board in front of you. For example, finding the correct place to touch with a probe is easier when the point is first found on the computer screen.

These are two very detailed images of the Zedboard by Digilent (and Avnet), which is one of two boards carrying Xilinx’ Zynq-7000 EPP platform (as of writing these lines).

I suppose this will save a few pairs of eyes out there. Unfortunately, the text on most chips is unreadable (it’s quite tricky to capture it on camera. Or at all).

The images below are small, and are just links to the bigger files.

The Zedboard, front view. Click to enlarge.

The Zedboard, back view. Click to enlarge.

Zynq-7000 EPP: Does it connect with AXI3 or AXI4?

The short answer

  • The P7 ARM processor’s buses run AXI3
  • It’s not as important as it seems at first

The supposed conflict

Xilinx has been moving most of its CoreGen IP cores from all kinds of interfaces to AXI4 over the last few years. With the transition of Microblaze-related IP cores, together with the anticipation for an ARM-based platform to replace the old PowerPC embedded processors, it was somehow obvious that these new Cortex A9 processors would talk with the world through AXI4. But no, they ended up with AXI3, and Xilinx’ documentation seems to prefer hinting at that fact, rather than spelling it out loud and clear.

So what’s the difference between AXI3 and AXI4? Well, the most outstanding difference is that AXI4 allows up to 256 beats of data per burst (1 kByte on 32-bit data buses) while AXI3 allows no more than 16 beats (64 bytes on 32-bit data buses). This means that if an AXI4-based IP core is connected as master directly to the Zynq-7000’s AXI wires, it may attempt bursts that the slave will not be able to support. Or more precisely, the burst length signals on AXI3 are only 4 bits wide, as opposed to AXI4’s counterparts with 8 bits. So the burst requests won’t go through correctly. As I’ll explain below, this is not likely to happen.
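The numbers above are easy to reproduce with plain shell arithmetic. The function names below are made up for this illustration. The last function shows what the burst length would look like if only the low four bits of an 8-bit length field were wired through, which is a hypothetical naive hookup for the sake of the example, not a description of what any real hardware does.

```shell
#!/bin/sh
# Bytes moved by one burst. $1: number of beats, $2: data bus width in bits
burst_bytes() {
    echo $(( $1 * $2 / 8 ))
}

# Longest burst that a length field of $1 bits can express. The field holds
# (beats - 1), so n bits cover 1 to 2^n beats.
max_beats() {
    echo $(( 1 << $1 ))
}

# Hypothetical: the burst length a 4-bit field would convey if the upper bits
# of an 8-bit length value were simply dropped. $1: intended number of beats
truncated_beats() {
    echo $(( (($1 - 1) & 15) + 1 ))
}

max_beats 4          # AXI3: 16
max_beats 8          # AXI4: 256
burst_bytes 256 32   # 1024
burst_bytes 16 32    # 64
truncated_beats 64   # 16, not the 64 that were asked for
```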

There is a variety of other differences between the bus interfaces, many of which don’t pose a compatibility issue when connecting an existing IP core running AXI4 to an AXI3 processor, since the newer version is more restrictive in many respects. A summary can be found in chapter 13 of the IHI0022C AMBA specification.

No reason for alarm

On the face of it, this AXI3/4 interconnect conflict is a recipe for disaster. In reality, XPS gracefully hides this detail from the innocent design engineer.

The key is in the way XPS requires us to connect an AXI master to the processor: Through an axi_interconnect IP core. These IP cores are always involved, being the embodiment of those colored bus columns displayed in the XPS GUI for connecting and disconnecting modules in a processor design.

As it says in the AXI interconnect IP core’s datasheet (“AXI3 Slave Converted”, page 27 or so) this interconnect module splits bursts larger than 16 beats (regardless of AWCACHE/ARCACHE), if the slave declares itself as AXI3. So if an engineer who is unaware of this AXI version issue just connects an AXI4 IP core to the Zynq processor, it will turn out fine: XPS automatically generates the AXI interconnect IP core, which detects the need for translation, and transparently makes sure the bursts end up on the slave OK.
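Just to make the arithmetic of such a split concrete, here's a toy sketch in shell. It only shows how a long burst's beat count breaks down into pieces of no more than 16 beats. The function name is invented, and the real interconnect obviously has more to take care of (addresses, burst types and so on), none of which is modeled here.

```shell
#!/bin/sh
# Print the beat counts of the sub-bursts that a burst of $1 beats breaks
# into, when no sub-burst may exceed $2 beats (16 if not given).
split_burst() {
    remaining=$1
    max=${2:-16}
    out=""
    while [ "$remaining" -gt 0 ]; do
        if [ "$remaining" -gt "$max" ]; then
            chunk=$max
        else
            chunk=$remaining
        fi
        out="$out${out:+ }$chunk"
        remaining=$((remaining - chunk))
    done
    echo "$out"
}

split_burst 40   # 16 16 8
split_burst 16   # 16 (already fits, nothing to split)
```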

But sometimes it matters

An interesting case is when there’s a single master and a single slave on a bus. If the bus widths are the same on both ends, and the bus protocols match, there’s no need for any intermediate logic: The interconnect should consist of just plain wires. Indeed, the AXI interconnect IP core’s datasheet describes this situation on its page 8 (or so) under the title “Pass Through”. It’s not just an issue of saving logic and memory for interconnect FIFOs: The conversion logic also has some combinatorial paths between the master and slave, which may reduce the maximal bus clock frequency. I suppose it’s possible to configure the interconnect so it pushes a register into that path, but I never got to investigate that.

This is relevant in particular with custom IP cores. If they are made AXI3 compatible and declared as such (with the C_M_AXI_PROTOCOL parameter in the .mpd file), both logic and some headache can be saved. It’s also important to remember to match the processor’s data bus width to the core’s.

So all in all, this AXI3/4 incompatibility sounds worse than it really is. I’m just left to wonder why this happened at all.