Xilinx’ Zynq Z007s: Is it really single core?

This post was written by eli on September 14, 2018
Posted Under: ARM,FPGA,Linux kernel,Zynq

Introduction

Xilinx’ documentation says that XC7Z007S, among other “S” devices, is a single-core device, as opposed to, for example, its older brother XC7Z010, which is dual-core. So I compared several aspects of the PS part of a Z007S vs. Z010, and to my astonishment, I found that Z007S is exactly the same: Two CPUs are reported by the hardware itself, SMP is kicked off on both, and a simple performance test I made showed that Z007S runs two processes in parallel as fast as Z010.

So the question is: In what sense is XC7Z007S single-core? For now, I have no answer to that. I’ll update this post as soon as someone manages to explain this to me. In the meantime, I’ve tried to get this figured out in Xilinx’ forum.

The rest of this post outlines the various similarities between Z007S and Z010 that I tested. The PL bitfiles of different Zynq devices are incompatible, so there’s no chance I mistook which devices I worked with.

The tests below were made with Xillinux-2.0 (kernel v4.4) on two Z-turn Lite boards, one carrying Z007S, and one Z010.

Found 2 CPUs?

I started wondering when the kernel’s dmesg log indicated that it had found 2 CPUs on a Z007S:

[    0.132523] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[    0.132586] Setting up static identity map for 0x82c0 - 0x82f4
[    0.310962] CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
[    0.311065] Brought up 2 CPUs
[    0.311102] SMP: Total of 2 processors activated (2664.03 BogoMIPS).
[    0.311121] CPU: All CPU(s) started in SVC mode.

Also, /proc/cpuinfo consistently listed two CPUs. One could think that it’s because two CPUs are declared in the device tree, but removing one of them makes no difference.

On Z010, the exact same log appears in this matter, and /proc/cpuinfo says the same.

CPU’s hardware register reporting two CPUs

According to the Zynq-7000 AP SoC Technical Reference Manual (UG585), the processor’s SCU_CONFIGURATION_REGISTER indicates the number of CPUs present in the Cortex-A9 MPCore processor in bits 1:0. Binary ‘01’ means two Cortex-A9 processors, CPU0 and CPU1. Binary ‘00’ means one Cortex-A9 processor, CPU0.

Using Xillinux-2.0’s poke kernel utility to read the processor’s SCU_CONFIGURATION_REGISTER, I got exactly the same result on Z007S and Z010:

poke read addr=f8f00004: value=00000511

In other words, both devices report two processors.

I’m under the impression that the kernel uses this register to tell the number of CPUs by way of the scu_get_core_count() function (defined in arch/arm/kernel/smp_scu.c), which is called by zynq_smp_init_cpus() in arch/arm/mach-zynq/platsmp.c.

The latter function sets the kernel’s “CPU possible” bits, so it’s how the Zynq-specific kernel setup code tells the kernel framework which CPU indexes are good for use.

Also, the U-Boot code used by Xillinux for Z-Turn Lite prints out the processor count, based upon SCU_CONFIGURATION_REGISTER, as well as other info. For Z007S it gave:

U-Boot 2013.07 (Sep 17 2018 - 11:51:45)              

Detected device ID code 0x3 (XC7Z007S) with 2 CPU(s), PS_VERSION = 3
Strapped boot mode: 5 (SD Card)

and for Z010:

U-Boot 2013.07 (Sep 17 2018 - 11:51:45)              

Detected device ID code 0x2 (XC7Z010) with 2 CPU(s), PS_VERSION = 3
Strapped boot mode: 5 (SD Card)

A simple benchmark test

The proof is in the pudding. I wrote a simple program, which forks into two processes, each running a certain amount of plain CPU-intensive operations, and then quits. The output of this program is of no interest, but it’s printed out to prevent the compiler from optimizing away the crunching. Its listing is given at the end of this post for reference.

Using the “time” utility to measure the execution times, I ran the program on Z007S and Z010, and consistently got the same results, except for slight fluctuations:

# time ./work 400
Parent process done with LSR at e89c4641
Child process quitting with LSR at e89c4641
Parent process quitting

real	0m3.604s
user	0m7.030s
sys	0m0.010s

The 3.6 seconds given as “real” is the wall clock time. The 7 seconds of “user” time is the amount of CPU time consumed. And as one would expect from a program that runs as two processes on a dual-core machine, the consumed CPU time is approximately double the wall clock time. This is the result I expected from Z010, but not from Z007S.

Just to be sure I wasn’t being silly, I booted the kernel with “nosmp” in the kernel command line, which forced a single-CPU bringup. Indeed, the kernel reported finding one CPU in its logs, and /proc/cpuinfo reflected that as well.

And the pudding?

# time ./work 400
Parent process done with LSR at e89c4641
Child process quitting with LSR at e89c4641
Parent process quitting

real	0m6.998s
user	0m6.970s
sys	0m0.010s

Exactly as expected: With one processor, forking into two processes has no advantage. The CPU time is the wall clock time. I waited twice as long for it to finish.

At some point I suspected that the specific Linux version I used had a scheduler issue, which allowed a single-core CPU to perform as well as a dual-core. However, I repeated the dual-core test on a Zybo board with three completely different kernels (other than Xillinux-2.0’s), and it yielded the same results (or slightly worse, with older kernels).

Conclusion

Given the results above, it’s not clear why Z007S is labeled as a single-core device. It’s not a matter of how it quacks or walks, but in the end, the device performs twice as fast when the work is split into two processes.

Or I missed something here. Kindly comment below if you found my mistake.

———————————–

Appendix: The benchmark program’s listing

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>
#include <signal.h>
#include <errno.h>
#include <string.h>
#include <sys/wait.h>

static unsigned int lsr_state;

int main(int argc, char *argv[]) {

  int count, i, j, bit;
  pid_t pid;

  if (argc != 2) {
    fprintf(stderr, "Usage: %s count\n", argv[0]);
    exit(1);
  }

  count = atoi(argv[1]);

  lsr_state = 1;

  pid = fork();

  if (pid < 0) {
    perror("Failed to fork");
    exit(1);
  }

  for (i=0; i<count; i++)
    for (j=0; j<(1<<20); j++) {
      bit = ((lsr_state >> 19) ^ (lsr_state >> 2)) & 0x01;

      lsr_state = (lsr_state << 1) | bit;

      if (lsr_state == 0) {
        fprintf(stderr, "Huh? The LSR state is zero!\n");
        exit(1);
      }
    }

  if (pid == 0) {
    fprintf(stderr, "Child process quitting with LSR at %x\n", lsr_state);
    return 0;
  }

  fprintf(stderr, "Parent process done with LSR at %x\n", lsr_state);

  pid = wait(&i);

  fprintf(stderr, "Parent process quitting\n");

  return 0;
}

This was saved as work.c and compiled with

# gcc -O3 -Wall work.c -o work

directly on the Zynq board itself (Xillinux comes with a native gcc compiler). But cross compilation should make no difference.
