Cache coherency on i.MX25 running Linux
What this blob is all about
Running some home-cooked SDMA scripts on Freescale’s Linux 2.6.28 kernel on an i.MX25 processor, I’m puzzled by the fact that the cache flushing calls, dma_map_single(…, DMA_TO_DEVICE), don’t hurt, but nothing changes if they are removed either. On the other hand, removing the cache invalidation calls, as in dma_map_single(…, DMA_FROM_DEVICE), does cause data corruption, as one would expect.
The de-facto lack of need for cache flushing could be explained by the small size of the cache: The sequence of events is typically preparing the data in the buffer, then some stuff in the middle, and only then is the SDMA script kicked off. If the cache lines are evicted naturally as a result of that “some stuff” activity, one gets away with not flushing the cache explicitly.
I’m by no means saying that cache flushing shouldn’t be done. On the contrary, I’m surprised that things don’t break when it’s removed.
So why doesn’t one get away with not invalidating the cache? In my tests, I saw 32-byte segments (the size of one cache line) going wrong when I dropped the invalidation. Only some segments went wrong, typically after a handful of successful data transactions of less than 1 kB of data.
Why does dropping the invalidation break things, and dropping the flushing doesn’t? As I said above, I’m still puzzled by this.
So I went down to the details of what these calls to dma_map_single() do. Spoiler: I didn’t find an explanation. At the end of the food chain, there are several MCR assembly instructions, as one should expect. Both flushing and invalidation apparently do something useful.
The rest of this post is a dissection of the Linux kernel code in this respect.
The gory details
DMA mappings and sync functions practically wrap the dma_cache_maint() function, e.g. in arch/arm/include/asm/dma-mapping.h:
static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr,
		size_t size, enum dma_data_direction dir)
{
	BUG_ON(!valid_dma_direction(dir));

	if (!arch_is_coherent())
		dma_cache_maint(cpu_addr, size, dir);

	return virt_to_dma(dev, cpu_addr);
}
It was verified with disassembly that dma_map_single() was implemented with a call to dma_cache_maint().
This function can be found in arch/arm/mm/dma-mapping.c as follows:
/*
 * Make an area consistent for devices.
 * Note: Drivers should NOT use this function directly, as it will break
 * platforms with CONFIG_DMABOUNCE.
 * Use the driver DMA support - see dma-mapping.h (dma_sync_*)
 */
void dma_cache_maint(const void *start, size_t size, int direction)
{
	const void *end = start + size;

	BUG_ON(!virt_addr_valid(start) || !virt_addr_valid(end - 1));

	switch (direction) {
	case DMA_FROM_DEVICE:		/* invalidate only */
		dmac_inv_range(start, end);
		outer_inv_range(__pa(start), __pa(end));
		break;
	case DMA_TO_DEVICE:		/* writeback only */
		dmac_clean_range(start, end);
		outer_clean_range(__pa(start), __pa(end));
		break;
	case DMA_BIDIRECTIONAL:		/* writeback and invalidate */
		dmac_flush_range(start, end);
		outer_flush_range(__pa(start), __pa(end));
		break;
	default:
		BUG();
	}
}
EXPORT_SYMBOL(dma_cache_maint);
The outer_* calls are defined as null functions in arch/arm/include/asm/cacheflush.h, since the CONFIG_OUTER_CACHE kernel configuration flag isn’t set.
The dmac_* macros are defined in arch/arm/include/asm/cacheflush.h as follows:
#define dmac_inv_range		__glue(_CACHE,_dma_inv_range)
#define dmac_clean_range	__glue(_CACHE,_dma_clean_range)
#define dmac_flush_range	__glue(_CACHE,_dma_flush_range)
where __glue() simply pastes the two tokens together (see arch/arm/include/asm/glue.h) and _CACHE equals “arm926” for the i.MX25, so e.g. dmac_clean_range becomes arm926_dma_clean_range.
These actual functions are implemented in assembler in arch/arm/mm/proc-arm926.S:
/*
 *	dma_inv_range(start, end)
 *
 *	Invalidate (discard) the specified virtual address range.
 *	May not write back any entries.  If 'start' or 'end'
 *	are not cache line aligned, those lines must be written
 *	back.
 *
 *	- start	- virtual start address
 *	- end	- virtual end address
 *
 * (same as v4wb)
 */
ENTRY(arm926_dma_inv_range)
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
	tst	r0, #CACHE_DLINESIZE - 1
	mcrne	p15, 0, r0, c7, c10, 1		@ clean D entry
	tst	r1, #CACHE_DLINESIZE - 1
	mcrne	p15, 0, r1, c7, c10, 1		@ clean D entry
#endif
	bic	r0, r0, #CACHE_DLINESIZE - 1
1:	mcr	p15, 0, r0, c7, c6, 1		@ invalidate D entry
	add	r0, r0, #CACHE_DLINESIZE
	cmp	r0, r1
	blo	1b
	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
	mov	pc, lr

/*
 *	dma_clean_range(start, end)
 *
 *	Clean the specified virtual address range.
 *
 *	- start	- virtual start address
 *	- end	- virtual end address
 *
 * (same as v4wb)
 */
ENTRY(arm926_dma_clean_range)
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
	bic	r0, r0, #CACHE_DLINESIZE - 1
1:	mcr	p15, 0, r0, c7, c10, 1		@ clean D entry
	add	r0, r0, #CACHE_DLINESIZE
	cmp	r0, r1
	blo	1b
#endif
	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
	mov	pc, lr

/*
 *	dma_flush_range(start, end)
 *
 *	Clean and invalidate the specified virtual address range.
 *
 *	- start	- virtual start address
 *	- end	- virtual end address
 */
ENTRY(arm926_dma_flush_range)
	bic	r0, r0, #CACHE_DLINESIZE - 1
1:
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
	mcr	p15, 0, r0, c7, c14, 1		@ clean+invalidate D entry
#else
	mcr	p15, 0, r0, c7, c6, 1		@ invalidate D entry
#endif
	add	r0, r0, #CACHE_DLINESIZE
	cmp	r0, r1
	blo	1b
	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
	mov	pc, lr
The CONFIG_CPU_DCACHE_WRITETHROUGH kernel configuration flag is not set, so there are no shortcuts.
The same three routines from proc-arm926.S, only disassembled from the object file (using objdump -d):
000004d4 <arm926_dma_inv_range>:
 4d4:	e310001f 	tst	r0, #31
 4d8:	1e070f3a 	mcrne	15, 0, r0, cr7, cr10, {1}
 4dc:	e311001f 	tst	r1, #31
 4e0:	1e071f3a 	mcrne	15, 0, r1, cr7, cr10, {1}
 4e4:	e3c0001f 	bic	r0, r0, #31
 4e8:	ee070f36 	mcr	15, 0, r0, cr7, cr6, {1}
 4ec:	e2800020 	add	r0, r0, #32
 4f0:	e1500001 	cmp	r0, r1
 4f4:	3afffffb 	bcc	4e8 <arm926_dma_inv_range+0x14>
 4f8:	ee070f9a 	mcr	15, 0, r0, cr7, cr10, {4}
 4fc:	e1a0f00e 	mov	pc, lr

00000500 <arm926_dma_clean_range>:
 500:	e3c0001f 	bic	r0, r0, #31
 504:	ee070f3a 	mcr	15, 0, r0, cr7, cr10, {1}
 508:	e2800020 	add	r0, r0, #32
 50c:	e1500001 	cmp	r0, r1
 510:	3afffffb 	bcc	504 <arm926_dma_clean_range+0x4>
 514:	ee070f9a 	mcr	15, 0, r0, cr7, cr10, {4}
 518:	e1a0f00e 	mov	pc, lr

0000051c <arm926_dma_flush_range>:
 51c:	e3c0001f 	bic	r0, r0, #31
 520:	ee070f3e 	mcr	15, 0, r0, cr7, cr14, {1}
 524:	e2800020 	add	r0, r0, #32
 528:	e1500001 	cmp	r0, r1
 52c:	3afffffb 	bcc	520 <arm926_dma_flush_range+0x4>
 530:	ee070f9a 	mcr	15, 0, r0, cr7, cr10, {4}
 534:	e1a0f00e 	mov	pc, lr
So there’s actually little to learn from the disassembly. Or at all…