Hardware acceleration
GavinWraith (26) 1563 posts
Are there any parts of RO5 which could take advantage of the XScale’s ability to transfer data from memory to memory in parallel? See http://homepage.ntlworld.com/rik.griffin/ . I am thinking simply of providing hooks which XScale machines could exploit.
Ben Avison (25) 445 posts
Lots. The obvious candidates are filing systems, many of which do a lot of copying around of data, often in the background. But due to the rather broken memory controller in the Iyonix’s IOP321 chip, even synchronous memory copies work a lot faster from the application accelerator, at least once you pass the threshold of setting up the transfer. This means that things like the C library’s memcpy() function or SWI Wimp_TransferBlock could see major speed improvements.

I originally planned to add support for memory-to-memory DMA to the DMAManager (currently it only does memory-to-peripheral and peripheral-to-memory) and then create a HAL device to drive the application accelerator. This has the advantage of keeping all the PagesSafe/PagesUnsafe handling in one place and presenting a unified scatter list API to all callers. But the main phase of development of the Iyonix stopped before this could be done.

There’s also a bit of an issue with there being a number of modules out there all trying to use limited resources like the XScale DMA engine and AAU. I recognise that for some uses (like copying a section-mapped RAM screen buffer across the PCI bus to a region of IO space) a general-purpose scatter list engine might be overkill, so I’m not suggesting that everything goes via the DMAManager. When I heard that Geminus was going to use the DMA engine, I proposed to Adrian that a service call interface be used to avoid clashes, modelled on Service_ClaimFIQ (and related calls); I’ve no idea whether he implemented anything, though.

Another thing I remember planning was a general-purpose memory-copy SWI, either in the kernel for speed of dispatch, or possibly in a separate module. This could be a central point which knew whether the current platform has memory-to-memory DMA acceleration and the threshold at which to use it. It would also act as a central repository of CPU-driven memory-to-memory copy routines, and could select the best-performing one for the current platform based upon CPU features (e.g. speed of unexecuted LDM/STM or support for instructions like LDRH or LDRD), cache line lengths, the presence of a merging write buffer, word alignment, whether the source and/or destination were uncached, and so on.
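A minimal sketch of what such a central memory-copy dispatcher might look like in C follows. All names here (FastCopy_Copy, best_cpu_copy, dma_threshold) are hypothetical illustrations of the idea, not an existing RISC OS API:

    /* Hypothetical central memory-copy dispatcher, as described above. */

    #include <stddef.h>
    #include <string.h>

    typedef void (*copy_fn)(void *dst, const void *src, size_t n);

    /* Fallback CPU routine: plain C library copy. A real module would
     * install the best routine for the platform, chosen by probing
     * cache line length, LDRD support, the write buffer, and so on. */
    static void cpu_copy(void *dst, const void *src, size_t n)
    {
        memcpy(dst, src, n);
    }

    static copy_fn best_cpu_copy = cpu_copy;
    static copy_fn dma_copy      = NULL;  /* set if mem-to-mem DMA exists  */
    static size_t  dma_threshold = 4096;  /* crossover size, measured once */

    /* One central entry point: take the DMA/AAU path only when the
     * transfer is large enough that the setup overhead pays for itself. */
    void FastCopy_Copy(void *dst, const void *src, size_t n)
    {
        if (dma_copy != NULL && n >= dma_threshold)
            dma_copy(dst, src, n);       /* e.g. IOP321 application accelerator */
        else
            best_cpu_copy(dst, src, n);  /* best CPU copy for this platform */
    }

Keeping the platform probing behind one entry point is what lets callers like memcpy() or Wimp_TransferBlock benefit without knowing anything about the hardware underneath.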
Thomas Milius (126) 44 posts
In 2005 I wrote the IntelDMA module, which makes use of the XScale DMA unit. Due to the absence of a general interface (RISC OS has a DMA module, e.g. on the Risc PC), it stands a bit alone. The module mentioned above is a similar attempt. IntelDMA access was added to KinoAmp a while ago and gives it a big speed boost on the Iyonix. Geminus also provides acceleration, but AFAIR it makes use of the NVIDIA card itself (perhaps including the card’s own DMA). However, the XScale acceleration unit has a heavy setup overhead to cope with, so it only makes sense for really large memory transfers. And where would you use it? For the hard disc, the Southbridge DMA seems to be used already. Inside applications I see only rare uses for large memory copies (perhaps memory clean-up). Most interesting, in my opinion, is memory-to-video transfer, which Geminus already provides in a fine way. The XScale’s checksum calculation and memory filling are not features you would use really often, in my opinion.
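To put a number on that crossover point, one could time the CPU copy against the DMA path over a range of block sizes and see where the setup overhead is amortised. This is purely an illustrative sketch: dma_copy() below is a stand-in stub, not the IntelDMA module’s actual interface.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Stand-in for the real DMA call: on an Iyonix this would go through
     * the IntelDMA module or a HAL device. Stubbed so the sketch compiles. */
    static void dma_copy(void *dst, const void *src, size_t n)
    {
        memcpy(dst, src, n);
    }

    int main(void)
    {
        enum { MAX = 1 << 20, ITERS = 64 };
        char *src = malloc(MAX), *dst = malloc(MAX);
        if (src == NULL || dst == NULL)
            return 1;
        memset(src, 0x55, MAX);

        /* Time both paths at each power-of-two size; the crossover is
         * where the DMA column first beats the CPU column. */
        for (size_t n = 1024; n <= MAX; n <<= 1) {
            clock_t t0 = clock();
            for (int i = 0; i < ITERS; i++)
                memcpy(dst, src, n);
            clock_t t1 = clock();
            for (int i = 0; i < ITERS; i++)
                dma_copy(dst, src, n);
            clock_t t2 = clock();
            printf("%7lu bytes: cpu %ld ticks, dma %ld ticks\n",
                   (unsigned long)n, (long)(t1 - t0), (long)(t2 - t1));
        }
        free(src);
        free(dst);
        return 0;
    }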