Cache coherency on i.MX25 running Linux

This post was written by eli on February 4, 2014
Posted Under: ARM,Freescale,Linux kernel

What this blob is all about

Running some home-cooked SDMA scripts on Freescale’s Linux 2.6.28 kernel on an i.MX25 processor, I’m puzzled by the fact, that cache flushing with dma_map_single(…, DMA_TO_DEVICE) doesn’t hurt, but nothing happens if the calls are removed. On the other hand, attempting to remove cache invalidation calls, as in dma_map_single(…, DMA_FROM_DEVICE) does cause data corruption, as one would expect.

The de-facto lack of need for cache flushing could be explained by the small size of the cache: The sequence of events is typically preparing the data in the buffer, then some stuff in the middle, and only then is the SDMA script kicked off. If the cache lines are evicted naturally as a result of that “some stuff” activity, one gets away with not flushing the cache explicitly.

I’m by no means saying that cache flushing shouldn’t be done. On the contrary, I’m surprised that things don’t break when it’s removed.

So why doesn’t one get away with not invalidating the cache? In my tests, I saw 32-byte segments going wrong when I dropped the invalidation. That is, some segments, typically after a handful of successful data transactions of less than 1 kB of data.

Why does dropping the invalidation break things, and dropping the flushing doesn’t? As I said above, I’m still puzzled by this.

So I went down to the details of what these calls to dma_map_single() do. Spoiler: I didn’t find an explanation. At the end of the foodchain, there are several MCR assembly instructions, as one should expect. Both flushing and invalidation apparently does something useful.

The rest of this post is the dissection of Linux’ kernel code in this respect.

The gory details

DMA mappings and sync functions practically wrap the dma_cache_maint() function, e.g. in arch/arm/include/asm/dma-mapping.h:

static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr,
		size_t size, enum dma_data_direction dir)
{
	BUG_ON(!valid_dma_direction(dir));

	if (!arch_is_coherent())
		dma_cache_maint(cpu_addr, size, dir);

	return virt_to_dma(dev, cpu_addr);
}

It was verified with disassembly that dma_map_single() was implemented with a call to dma_cache_maint().

This function can be found in arch/arm/mm/dma-mapping.c as follows

/*
 * Make an area consistent for devices.
 * Note: Drivers should NOT use this function directly, as it will break
 * platforms with CONFIG_DMABOUNCE.
 * Use the driver DMA support - see dma-mapping.h (dma_sync_*)
 */
void dma_cache_maint(const void *start, size_t size, int direction)
{
	const void *end = start + size;

	BUG_ON(!virt_addr_valid(start) || !virt_addr_valid(end - 1));

	switch (direction) {
	case DMA_FROM_DEVICE:		/* invalidate only */
		dmac_inv_range(start, end);
		outer_inv_range(__pa(start), __pa(end));
		break;
	case DMA_TO_DEVICE:		/* writeback only */
		dmac_clean_range(start, end);
		outer_clean_range(__pa(start), __pa(end));
		break;
	case DMA_BIDIRECTIONAL:		/* writeback and invalidate */
		dmac_flush_range(start, end);
		outer_flush_range(__pa(start), __pa(end));
		break;
	default:
		BUG();
	}
}
EXPORT_SYMBOL(dma_cache_maint);

The outer_* calls are defined as null functions in arch/arm/include/asm/cacheflush.h, since the CONFIG_OUTER_CACHE kernel configuration flag isn’t set.

The dmac_* macros are defined in arch/arm/include/asm/cacheflush.h as follows:

#define dmac_inv_range			__glue(_CACHE,_dma_inv_range)
#define dmac_clean_range		__glue(_CACHE,_dma_clean_range)
#define dmac_flush_range		__glue(_CACHE,_dma_flush_range)

where __glue() simply glues the two strings together (see arch/arm/include/asm/glue.h) and _CACHE equals “arm926″ for the i.MX25, so e.g. dmac_clean_range becomes arm926_dma_clean_range.

These actual functions are implemented in assembler in arch/arm/mm/proc-arm926.S:

/*
 *	dma_inv_range(start, end)
 *
 *	Invalidate (discard) the specified virtual address range.
 *	May not write back any entries.  If 'start' or 'end'
 *	are not cache line aligned, those lines must be written
 *	back.
 *
 *	- start	- virtual start address
 *	- end	- virtual end address
 *
 * (same as v4wb)
 */
ENTRY(arm926_dma_inv_range)
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
	tst	r0, #CACHE_DLINESIZE - 1
	mcrne	p15, 0, r0, c7, c10, 1		@ clean D entry
	tst	r1, #CACHE_DLINESIZE - 1
	mcrne	p15, 0, r1, c7, c10, 1		@ clean D entry
#endif
	bic	r0, r0, #CACHE_DLINESIZE - 1
1:	mcr	p15, 0, r0, c7, c6, 1		@ invalidate D entry
	add	r0, r0, #CACHE_DLINESIZE
	cmp	r0, r1
	blo	1b
	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
	mov	pc, lr

/*
 *	dma_clean_range(start, end)
 *
 *	Clean the specified virtual address range.
 *
 *	- start	- virtual start address
 *	- end	- virtual end address
 *
 * (same as v4wb)
 */
ENTRY(arm926_dma_clean_range)
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
	bic	r0, r0, #CACHE_DLINESIZE - 1
1:	mcr	p15, 0, r0, c7, c10, 1		@ clean D entry
	add	r0, r0, #CACHE_DLINESIZE
	cmp	r0, r1
	blo	1b
#endif
	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
	mov	pc, lr

/*
 *	dma_flush_range(start, end)
 *
 *	Clean and invalidate the specified virtual address range.
 *
 *	- start	- virtual start address
 *	- end	- virtual end address
 */
ENTRY(arm926_dma_flush_range)
	bic	r0, r0, #CACHE_DLINESIZE - 1
1:
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
	mcr	p15, 0, r0, c7, c14, 1		@ clean+invalidate D entry
#else
	mcr	p15, 0, r0, c7, c6, 1		@ invalidate D entry
#endif
	add	r0, r0, #CACHE_DLINESIZE
	cmp	r0, r1
	blo	1b
	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
	mov	pc, lr

The CONFIG_CPU_DCACHE_WRITETHROUGH kernel configuration flag is not set, so there are no shortcuts.

Exactly the same snippet, only disassembled from the object file (using objdump -d):

000004d4 <arm926_dma_inv_range>:
 4d4:	e310001f 	tst	r0, #31
 4d8:	1e070f3a 	mcrne	15, 0, r0, cr7, cr10, {1}
 4dc:	e311001f 	tst	r1, #31
 4e0:	1e071f3a 	mcrne	15, 0, r1, cr7, cr10, {1}
 4e4:	e3c0001f 	bic	r0, r0, #31
 4e8:	ee070f36 	mcr	15, 0, r0, cr7, cr6, {1}
 4ec:	e2800020 	add	r0, r0, #32
 4f0:	e1500001 	cmp	r0, r1
 4f4:	3afffffb 	bcc	4e8 <arm926_dma_inv_range+0x14>
 4f8:	ee070f9a 	mcr	15, 0, r0, cr7, cr10, {4}
 4fc:	e1a0f00e 	mov	pc, lr

00000500 <arm926_dma_clean_range>:
 500:	e3c0001f 	bic	r0, r0, #31
 504:	ee070f3a 	mcr	15, 0, r0, cr7, cr10, {1}
 508:	e2800020 	add	r0, r0, #32
 50c:	e1500001 	cmp	r0, r1
 510:	3afffffb 	bcc	504 <arm926_dma_clean_range+0x4>
 514:	ee070f9a 	mcr	15, 0, r0, cr7, cr10, {4}
 518:	e1a0f00e 	mov	pc, lr

0000051c <arm926_dma_flush_range>:
 51c:	e3c0001f 	bic	r0, r0, #31
 520:	ee070f3e 	mcr	15, 0, r0, cr7, cr14, {1}
 524:	e2800020 	add	r0, r0, #32
 528:	e1500001 	cmp	r0, r1
 52c:	3afffffb 	bcc	520 <arm926_dma_flush_range+0x4>
 530:	ee070f9a 	mcr	15, 0, r0, cr7, cr10, {4}
 534:	e1a0f00e 	mov	pc, lr

So there’s actually little to learn from the disassembly. Or at all…

Add a Comment

required, use real name
required, will not be published
optional, your blog address