Raspberry Pi RISC OS System Programming Book
Pages: 1 2
Rick Murray (539) 13840 posts |
Far be it from me to be anywhere near smart enough to disagree with you – but could I please ask you to justify the statement that calling SWIs directly is “bad”? You are correct in that calling SWIs the RISC OS way thrashes the data cache, basically because the OS needs to load the instruction as data (not as code) to see what the SWI number is.1 However, the OS needs to load the SWI in any case to know it is the CallASWI SWI; and since this is the cache-thrash, surely it doesn’t matter if it is CallASWI or any other SWI being called – the damage has been done. The SWI handler, in Kernel.s.Kernel, does this (from line 577):
So… why is calling SWIs directly frowned upon? By my understanding of this, the SWI is always read by loading the instruction as data. However, as CallASWI is decoded (line 735/737), the SWI number is fudged appropriately (function at line 780/787), then we are thrown back into the dispatch again for the SWI proper (line 590).

For the purposes of the book, however, it is much less grief to call _swix(). ;-) I have written software where all the SWI calls were assembler veneers to the SWIs, basically ‘cos I like writing ARM code. However, I think the amount of work that it took (a lot) over the time saved on calls to _kernel_swi() doesn’t necessarily justify the effort.

1 Not teaching you to suck eggs ;-) it is for others reading who might not know why; this is why Linux on ARM no longer uses SWIs by number, instead SWI &0 is always called and the number is provided in a register (R7?), so the instruction read doesn’t need to be done… SWI was a brilliant idea, but it lives in a time before processors had caches, certainly before the instruction and data caches were separated.

2 “né” (said like neck) is a Japanesism for when you are asking a question, but expecting an affirmative response. |
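For readers who have not seen why the kernel must read the instruction as data, the following is a rough C sketch of the decode step described above. It is not the kernel's code (which is ARM assembler in Kernel.s.Kernel); the function names are hypothetical, OS_CallASWI is taken to be SWI &6F with the target number in R10, and the kernel's real handling of details such as the X bit during the "fudge" is deliberately not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define SWI_NUMBER_MASK 0x00FFFFFFu  /* low 24 bits of an ARM-state SWI instruction */
#define SWI_X_BIT       0x00020000u  /* bit 17: error-returning "X" form */
#define OS_CALLASWI     0x6Fu        /* assumed: OS_CallASWI, target SWI number in R10 */

/* Extract the 24-bit SWI field from an instruction word that has been
   loaded from memory as data (hypothetical helper name). */
uint32_t swi_field(uint32_t instr)
{
    /* Bits 27..24 are all ones in a SWI/SVC instruction. */
    assert((instr & 0x0F000000u) == 0x0F000000u);
    return instr & SWI_NUMBER_MASK;
}

/* Simplified dispatch decision: if the instruction is (X)OS_CallASWI,
   the number to dispatch comes from R10 instead of the instruction.
   The instruction word still had to be read to find that out. */
uint32_t effective_swi(uint32_t instr, uint32_t r10)
{
    uint32_t field = swi_field(instr);
    if ((field & ~SWI_X_BIT) != OS_CALLASWI)
        return field;
    return r10;
}
```

Either way the first step is the same data load of the instruction word, which is the point being made: the read happens whether the instruction turns out to be OS_CallASWI or any other SWI.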
Jeffrey Lee (213) 6048 posts |
It’s because there’s only one implementation of _swix(). This means there’s only one SWI instruction, and subsequently only one address which the kernel needs to read from to get the SWI number. This means there’ll only be one cache line needed to hold that address. The more code that uses _swix(), the greater the chances are that the next time a SWI is called (via _swix()) the D-cache will still contain the _swix() SWI number – thereby avoiding the CPU having to stall while performing a main memory access. With other approaches (worst-case being inline assembler, where you’d have one SWI instruction for each piece of code you want to call a SWI from) there’d be many more addresses for the kernel to have to load from, and as a consequence if you were to call lots of SWIs in a sequence the D-cache could quickly fill up with cache lines containing nothing but SWI numbers. This is great if you’re planning on continuing to call those SWIs, but not so great if you want to switch to doing something memory-intensive where you’d much rather the D-cache contained the data you’re working on. Note that machines with unified caches aren’t immune to the problem. They won’t suffer from the kernel having to stall in order to read the SWI instruction (because executing the instruction will have already caused it to be loaded into the cache), but I’ll admit I’m not sure how much of an impact this has on code execution. It’s probably only noticeable on machines with terrible memory buses (e.g. standard StrongARM RiscPC), or in extreme cases where you’re expecting a throughput of millions of SWIs per second. |
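The "one address, one cache line" argument can be illustrated with a toy count of distinct D-cache lines touched by the kernel's reads of SWI instructions. This is only a sketch: the function name is hypothetical, the 32-byte line size and the addresses are assumptions chosen for illustration, and eviction, associativity and timing are not modelled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count how many distinct cache lines a set of addresses falls into.
   addrs holds the addresses of the SWI instructions the kernel reads;
   line_size must be a power of two. */
size_t distinct_cache_lines(const uint32_t *addrs, size_t n, uint32_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t line = addrs[i] / line_size;
        int seen = 0;
        /* Linear search of earlier entries; fine for a small illustration. */
        for (size_t j = 0; j < i; j++) {
            if (addrs[j] / line_size == line) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}
```

With every call going through a single veneer, the address list is the same address repeated and the count is 1. With inline SWIs scattered through the code, the count approaches the number of call sites, which is the D-cache footprint being described.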
Steve Pampling (1551) 8170 posts |
Rhetorical questions. Questions to which you already know the answer but you ask to make a point. |
Rick Murray (539) 13840 posts |
Interesting idea, using a single address. That said, I looked at specs for my iPad’s A5 SoC (I’d imagine the ARM part to be “typical”) and it has a 32K instruction cache, and a 32K data cache. So isn’t this question more or less moot for anything other than an idling system? Plus the number of SWIs called by the Wimp (not to mention other parts of the OS) is scary. All of these will be likewise affected. Let’s not even think about what happens when tasks are paged in/out. ;-) |
nemo (145) 2546 posts |
I have to admit to being sceptical about theoretical claims rather than empirical timings of SWI/swix strategies. While on the one hand one does litter the D-cache with SWI instructions to enable them to be decoded, any other method falls foul of the ARM’s limited range of immediate constants…
This isn’t the whole picture – how did that SWI number get to be in that “one address”? It had to be put there, having been got from somewhere else. Now some SWI numbers may happen to be immediate constants, and perhaps the compiler makes others out of a couple of immediate constants… but maybe it just loads it out of memory somewhere and you have another useless cacheline fetched.
And whatever impact it is found to have will be massively dependent on platform and memory map details. Cache coherency is not straightforward in non-trivial programs, so I’d hesitate to recommend any particular strategy with the promise that It Will Always Be Fastest. In other words, use what is most convenient, and in assembler that’s certainly plain old SWIs. |
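The remark about the ARM's limited range of immediate constants can be made concrete. In the classic ARM data-processing encoding an immediate is an 8-bit value rotated right by an even number of bits, so only some SWI numbers can be placed in a register with a single MOV. The check below is a small sketch with hypothetical function names; it says nothing about what any particular compiler actually emits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (0 <= n < 32). */
static uint32_t rol32(uint32_t v, unsigned n)
{
    assert(n < 32);
    return n == 0 ? v : (v << n) | (v >> (32 - n));
}

/* True if v fits the classic ARM immediate form:
   an 8-bit value rotated right by an even amount. */
bool arm_immediate_encodable(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by 'rot' and see whether 8 bits remain. */
        if (rol32(v, rot) <= 0xFFu)
            return true;
    }
    return false;
}
```

For example, &20000 (XOS_WriteC) passes the check, whereas &400C0 (Wimp_Initialise) does not, because its set bits span more than eight positions. A value like that has to be built from two instructions or loaded from a literal pool, which is the extra memory access mentioned above.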
Rick Murray (539) 13840 posts |
…? Isn’t a SWI number always encoded into the instruction itself? Or are you referring to synthesising a SWI instruction? |
Rick Murray (539) 13840 posts |
While we’re talking about SWIs, what’s this? http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch02s08s13.html It says that That’s like the old 26 bit behaviour. Has this been reintroduced with the ARM11, or am I looking at Thumb code? I always liked the PS – to me, SVC is a processor mode, so if you don’t mind, I’m going to keep calling it SWI. ;-) |
Jeffrey Lee (213) 6048 posts |
“Now some SWI numbers may happen to be immediate constants, and perhaps the compiler makes others out of a couple of immediate constants… but maybe it just loads it out of memory somewhere and you have another useless cacheline fetched.”
I think he’s talking about the SWI number that’s passed to OS_CallASWI, although the wording of his post makes it sound a bit like he’s talking about synthesising the SWI instruction (which is of course how _kernel_swi()/_swix() worked before OS_CallASWI was introduced). So basically: although the OS_CallASWI instruction may stay in the data cache, there’s also likely to be a bit of code elsewhere which uses LDR to load the SWI number into a register before calling OS_CallASWI. So it doesn’t reach the holy grail of only needing one D-cache line to call any number of SWIs. But, if you consider that the C compiler will have grouped any immediate constants together into literal pools, there’s a chance that the D-cache line containing the SWI number also contains several other useful bits of data, so there should still be a lower percentage of wasted D-cache space than with calling SWIs directly.
There are two main reasons
|
Rick Murray (539) 13840 posts |
Ah, I see now. MOVS (etc.) still do what they used to, but only in kernel-level code – such as returning from exceptions and the like. Anything less would not set the flags accordingly, so these instructions cannot be used in the (historically) expected manner. |
GavinWraith (26) 1563 posts |
I hope the spelling on the website does not leak into the book. Remember that kernel has no a in it. |
Holger Palmroth (487) 115 posts |
Apart from one prominent exception: http://en.wikipedia.org/wiki/KERNAL |
Bruce Smith (1838) 31 posts |
VFP Context Switching Kernal/Kernel |
GavinWraith (26) 1563 posts |
You are welcome. I am pretty arrogant ( and usually correct :) in matters of spelling and grammar, and I have had lots of experience in helping authors of books about programming tidy up their prose. The first book I did this for was Aake Wikstroem’s Introduction to Programming using Standard ML and more recently the first edition of Roberto Ierusalimschy’s Programming in Lua. I am certainly interested in reading about context switching and VFP. In fact if RISC OS on the RPi is to be exploited properly by young programmers this topic will be very useful. |
Trevor Johnson (329) 1645 posts |
Not to be confused with “petty error grant” ;-) |
Bruce Smith (1838) 31 posts |
*SOUNDGAIN |
Chris Hall (132) 3554 posts |
SoundGain is not in the PRM. Which module provides it? |
Trevor Johnson (329) 1645 posts |
Maybe SoundDMA. (New in 4.39, Where did |
Jeffrey Lee (213) 6048 posts |
Yes, it’s SoundDMA which provides *SoundGain. There isn’t a SWI to control it, and there aren’t any CMOS bytes to store the value either – it gets reset to 0 on each boot. |
Bruce Smith (1838) 31 posts |
Vector Floating Point: I just wondered, is there any kernel support for printing floating point numbers, i.e. some form of OS_ConvertVFP to convert a Dx or Sx register value into a string for printing? Ditto in reverse? |
Jeffrey Lee (213) 6048 posts |
Unfortunately not, but it’s probably something we should consider adding. |
Bruce Smith (1838) 31 posts |
If using VFPSupport_ calls to check for and then create a context, is it still a requirement to turn the coprocessor on/off, or does VFPSupport_CreateContext etc. do this automatically? If it is still a requirement, what is the correct order of calls in and out? |
Jeffrey Lee (213) 6048 posts |
VFPSupport handles enabling/disabling the coprocessor as appropriate. Any code people have that manually enables the coprocessor is obsolete and should be deleted (it could interfere with VFPSupport’s operation). |
Bruce Smith (1838) 31 posts |
How do I opt-in to Thumb code using the BBC BASIC Assembler? |