Thinking ahead: Supporting multicore CPUs
Stuart Swales (8827) 1357 posts |
Believe the source when it does… |
Stuart Swales (8827) 1357 posts |
A candidate for recompiling with /softfp and apcs_softpcs for the easy win? |
Clive Semmens (2335) 3276 posts |
Gawd help us. |
Simon Willcocks (1499) 519 posts |
Strange. I can see BASIC, relatively well-defined, as being easier to port than assembler. Just consider WriteS! In assembler, is something code or data? It’s hard to tell. |
Clive Semmens (2335) 3276 posts |
And how, if you don’t have source. Especially if some devious c**t has been writing self-modifying code. Personally I’ve never done that in ARM assembler, but it was sometimes the only way to get things to run fast enough for real-time applications in 6502 code. |
Stuart Swales (8827) 1357 posts |
DPIngScan is 99% C, Simon. Clearly good enough performance for typical size images, but hundreds of millions of kernel mode transitions for the FP exceptions will soon start to hurt with larger images. [Ah, talking about Printer Manager. Just put a bomb under it.] |
Dave Higton (1515) 3534 posts |
BBC_Error: Untrapped Parser Error – Output File will be Empty 114 syntax errors detected |
Dave Higton (1515) 3534 posts |
For Printer Manager, we have the source. However, even the uncrunched (if that’s the correct term) version, from the ROOL RISC OS sources, is hard to understand because there are no comments, all the structure offsets are magic numbers rather than named variables, and some of the variable names are too short. These are presumably compromises because of the sheer size of it. It’s difficult to work out what a given function is trying to do. But if enough minds co-operate, I’m sure it can be decoded. |
Rick Murray (539) 13850 posts |
It’s stranger than that. The function is “rotate” in the code file “bm”. It looks like it will first perform a quick rotation (90°, 180°, -90°) to get to the closest orientation, and then it’ll perform a slow, long-winded rotation (with much use of FP) to make up the difference. It’s odd that it uses FP, given that it looks like a lot of the rest of the program uses 16-bit fixed point? Or am I mistaken on that? It looks to me as if “zmath” deals with fixed point maths.
Fixed that for you, as FPA code on a VFP machine is an insanity. |
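The quick 90° step described above is pure index shuffling, so it costs no FP at all. A minimal sketch for an 8bpp bitmap (illustrative only, not DPScan’s actual code; contiguous rows and a separate destination buffer are assumed):

#include <stdint.h>
#include <stddef.h>

/* Rotate an 8bpp image 90 degrees clockwise.  src is width x height,
   row-major; dst must be height x width (i.e. height pixels wide).
   The source pixel at column x, row y lands at column height-1-y,
   row x of dst. */
void rotate90_cw(const uint8_t *src, int width, int height, uint8_t *dst)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            dst[(size_t)x * height + (height - 1 - y)] =
                src[(size_t)y * width + x];
}

The slow part is only the residual rotation by the remaining few degrees, which is where the FP (or fixed-point) arithmetic comes in.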
Stuart Swales (8827) 1357 posts |
Always get the algorithm right before optimising. If rotating your test images only takes a few seconds, you’re not doing it to animate a real-time display, and other users haven’t complained loudly, leave it be. For all that folk round here gripe about FPA use on VFP-based systems, only a couple of people have bothered to seriously try out apcs_softpcs, with one deploying. Those that haven’t, well probably because their software is already ‘good enough’. Even I haven’t bothered releasing Fireworkz/VFP to mainstream, just to select users that I know it will genuinely help with their large matrix problems. |
Sveinung Wittington Tengelsen (9758) 237 posts |
I’d go for the relative complexity of bitmaps vs. the basic simplicity of geometric figures, straight lines and Bézier curves. |
David J. Ruck (33) 1636 posts |
It depends whether interpolation is used to reduce artifacts from rotation; some of the better algorithms are going to be difficult to do in fixed point without introducing errors which negate the benefit of using them. Some interpolation algorithms are described at the bottom of https://web.archive.org/web/20101127050322/http://www.all-in-one.ee:80/~dersch/interpolator/interpolator.html and there are more examples at http://photocreations.ca/interpolator/index.html |
Clive Semmens (2335) 3276 posts |
Not really – at worst you could use 64-bit integers instead of 32-bit, it’s still a lot faster than FP. Even if you’ve got VFP. |
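A minimal sketch of what Clive means, in C: blending two pixels with a 16.16 fixed-point weight, widening to 64 bits for the intermediate so nothing overflows (illustrative only; not taken from DPScan or PhotoDesk):

#include <stdint.h>

/* Linear blend of two 8-bit samples; w is a 16.16 fixed-point weight in
   the range 0..65536.  The 64-bit intermediate keeps the multiply exact. */
uint8_t lerp_fx(uint8_t a, uint8_t b, int32_t w)
{
    int64_t v = (int64_t)a * (65536 - w) + (int64_t)b * w;
    return (uint8_t)((v + 32768) >> 16);    /* round to nearest */
}

Bilinear interpolation is three such blends per output pixel; whether the fixed-point rounding is good enough is exactly the trade-off described above.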
David Pilling (8394) 96 posts |
I had to look at DPScan: rotations are done as combinations of X and Y shears (algorithm in Graphics Gems), and those shears are done using integers. Maybe slow (I’m sorry to hear how slow), but the results are anti-aliased, so they look better than the simple way of rotating a bitmap.

So slow… I’m hoping that you have the ‘virtual memory’ set up wrongly. It needs the chance to use lots of memory whilst doing the rotation. I revisited the rotation code in my TWAIN drivers and made a better effort at it there. It would be great to recompile with a different floating point set-up and get better performance, but so much time back in the early 90s was spent removing/avoiding floating point code.

There is a Windows version of DPScan, 64 bit (!), multiple processor/thread (!!). Not a big deal actually; after years of hankering after a dual processor machine, it was a let-down to find that two processors only let you go twice as fast. Even so, the things going on in DPScan are suitable for concurrent processing. Would it make a good benchmark? How big are these bitmaps that take so long to rotate?

Seemingly the US Minuteman missiles were programmed using integers. You can do a lot if you can be bothered. |
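For anyone curious what an integer shear pass looks like, here is a minimal sketch (not DPScan’s actual code; 8bpp, nearest-pixel shifts, and the stride/background handling are assumptions):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Shear the rows of an 8bpp image horizontally: row y moves sideways by
   shear_fx * y pixels, where shear_fx is a 16.16 fixed-point fraction.
   Integer arithmetic only; vacated pixels are filled with bg. */
void shear_rows_x(uint8_t *img, int width, int height, size_t stride,
                  int32_t shear_fx, uint8_t bg)
{
    for (int y = 0; y < height; y++) {
        uint8_t *row = img + (size_t)y * stride;
        /* per-row offset, rounded to the nearest whole pixel */
        int off = (int)(((int64_t)shear_fx * y + 0x8000) >> 16);

        if (off > 0) {                               /* shift right */
            int keep = width > off ? width - off : 0;
            memmove(row + (width - keep), row, (size_t)keep);
            memset(row, bg, (size_t)(width - keep));
        } else if (off < 0) {                        /* shift left */
            int keep = width > -off ? width + off : 0;
            memmove(row, row + (width - keep), (size_t)keep);
            memset(row + keep, bg, (size_t)(width - keep));
        }
    }
}

A rotation by a small angle is then an x-shear by -tan(angle/2), a y-shear by sin(angle), and a second x-shear by -tan(angle/2) (Paeth, Graphics Gems I); interpolating the fractional part of each shift, rather than rounding it as above, is what gives the anti-aliased result.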
Rick Murray (539) 13850 posts |
I didn’t see any virtual memory option, but I noticed that the memory setting was about 9MB. I changed that to be about quarter of a gigabyte. The rotation of one degree took just under eight seconds on a Pi 2. So there’s the trick – the memory allocation is a bit weird (probably due to its heritage) so one needs to tell it to use a tonne of memory if it needs to. Once you’ve done so, it doesn’t lose its noodle trying to rotate a 25.83MB image in 9MB.
Yes, it looks good. As good as PhotoDesk with bi-cubic interpolation, but twice as fast. ;)
Nothing much has changed, the DDE still doesn’t do VFP code even after all this time.
Mobile phone photo – in my case 4480×2016.
….what were you expecting? Of course, over in the x86 world there’s plenty of cheating. My Pentium 4 box is a dual core hyperthreading gizmo. It’s a lie, there’s only a single core but it does stuff concurrently so it sort of fakes having two cores.
I hope car ECUs and ABS units are too. The thing about floating point is that there is inherent imprecision that can accumulate. This is why it’s a bad idea to use floats for money calculations.
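The classic demonstration, as a small hedged sketch in C (a hypothetical till adding up 10p a million times):

#include <stdio.h>

/* Binary floating point cannot represent 0.10 exactly, so the error
   accumulates; integer pence stay exact. */
int main(void)
{
    double pounds = 0.0;
    long pence = 0;
    for (int i = 0; i < 1000000; i++) {
        pounds += 0.10;
        pence  += 10;
    }
    printf("float: %.6f  integer: %ld.%02ld\n",
           pounds, pence / 100, pence % 100);
    return 0;
}

The float total comes out very slightly off 100000.000000, which is exactly the kind of drift you don’t want in a ledger (or a brake controller).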
As Stuart says: A candidate for recompiling with /softfp and apcs_softpcs for the easy win? |
Stuart Swales (8827) 1357 posts |
My fault for an insufficiently deep dive into the DPIngScan source. I just saw that it was still passing a f.p. arg down to the proc that sheared each scanline of the image. I’d still be tempted here to pass in the corresponding fixed-point integer scaled fractional value as you can then be damn sure the called proc isn’t mucking about pushing/pulling FP temporaries and reloading FP from stack.
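In code terms the suggestion is roughly this (names and values hypothetical, not the actual DPIngScan routines): scale the fraction to 16.16 fixed point once in the caller, so everything below it works in integer registers.

#include <stdint.h>
#include <stdio.h>

/* Convert a fractional value to 16.16 fixed point, rounding to nearest. */
int32_t to_fix16(double x)
{
    return (int32_t)(x * 65536.0 + (x >= 0.0 ? 0.5 : -0.5));
}

int main(void)
{
    int32_t shear_fx = to_fix16(0.0174551);      /* roughly tan(1 degree) */
    /* the per-scanline code then shifts row y by this many pixels,
       with no FP anywhere near it: */
    for (int y = 1000; y <= 3000; y += 1000)
        printf("row %4d shifts by %ld pixels\n", y,
               (long)(((int64_t)shear_fx * y + 0x8000) >> 16));
    return 0;
}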
You don’t think there’s loss of precision for general calculation by using fixed point??? My Pentium 4 box is a … room heater ;-) [My first P4 died when one of the heatsink clips broke the plastic tab on the CPU socket, and the heatsink and fan assembly, being mounted vertically, torqued away slightly. It was running a climate model at full tilt, so it overheated very quickly. I think the thermal throttle on the chip was up the other end, which was still being cooled a bit.] |
Paul Sprangers (346) 525 posts |
So sorry. My max memory also defaulted to about 9 MB. After changing that to about 250 MB, the very same picture rotated instantaneously on the Pi4. Garlands for DPScan! |
Rick Murray (539) 13850 posts |
;) It’s been a learning experience for us, so it’s all good. Just shows it pays to fiddle in the settings.
Indeed. I’m impressed that C code outperforms something in assembler. I wonder what the algorithm differences are. I was using bi-cubic interpolation. There’s not a lot of difference between that and bi-linear, but both are better than nearest neighbour.
Generally my PC runs fairly cool, but when running OpenDuke (the old Duke Nukem open source runtime) it kicks up to vacuum cleaner mode in a matter of seconds. It’s worth noting that Redneck Rampage and SimCopter (etc) do not do this, so I wonder if it’s using busy loops or something wildly inefficient like that?
Found myself a copy of this as a PDF. The start talks about Derived Objects and says stuff like:

|M|   determinant of M

and then we’re on to Basic Expressions and Functions with things like:

(n over i)   binomial coefficient = n! / ((n-i)! i!)

WTF? I don’t think I’m going to understand |
Stuart Swales (8827) 1357 posts |
Binomial coefficient. That’s more easily understood as n choose i: given n things, how many (distinct) ways are there to choose i of them? It’s easy to grok concrete examples for small n and i; after that it’s just doing the arithmetic. n! is factorial: it’s just the product of all the whole numbers up to and including n. |
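A concrete small case: C(5, 2) = 5!/(3! 2!) = 120/(6 × 2) = 10 ways to pick 2 things out of 5. And a minimal sketch of computing it without letting the factorials blow up (illustrative, not taken from the Graphics Gems code):

#include <stdio.h>

/* n choose i, built up term by term so intermediates stay small.
   After k steps the running value is C(n-i+k, k), always an integer,
   so each division is exact. */
unsigned long long binomial(unsigned n, unsigned i)
{
    unsigned long long result = 1;
    if (i > n) return 0;
    if (i > n - i) i = n - i;                 /* C(n,i) == C(n,n-i) */
    for (unsigned k = 1; k <= i; k++)
        result = result * (n - i + k) / k;
    return result;
}

int main(void)
{
    printf("C(5,2)  = %llu\n", binomial(5, 2));    /* 10 */
    printf("C(52,5) = %llu\n", binomial(52, 5));   /* 2598960 poker hands */
    return 0;
}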
Rick Murray (539) 13850 posts |
Thanks but… I briefly skipped through the PDF and… it got increasingly scarier. Some of the diagrams were pretty. The maths was straight up nightmare fuel. |
David Pilling (8394) 96 posts |
Graphics Gems was a series of five books of academic papers published year by year in the early 1990s. A bit like conference proceedings, somewhere for people to publish research papers. Each volume consists of maybe 40 papers, and source code was available for them. You’d have to be keen to read the 1000s of pages of dense text from start to finish; more likely you’d pick out papers of interest. The rotation-by-shearing one is page 179 of volume 1. No binomials are involved. But matrices…

DPScan: as I recall, you set the max amount of RAM it is to use; beyond that it puts stuff on disc. If ‘disc’ is an SD card, you probably don’t want it hammering it with writes.
I am an optimist: two people together can do more than twice what an individual can. Teamwork, management. So why not cores? |
Clive Semmens (2335) 3276 posts |
The reason a lot of (programmers’) time was spent avoiding floating point in the 1990s is that using fixed point saves a lot of computer time. This was especially true in FPA days, but it’s still true even if you’ve got VFP, assuming you’re looking for the same degree of precision (or looked at another way, the same rate of accumulation of errors). The only time FP is essential (as distinct from merely saving programmer effort) is when you’re multiplying or dividing quantities whose size is many orders of magnitude different. |
Rick Murray (539) 13850 posts |
To put this into context, given Acorn’s reticence to include FP hardware… I did a test on my Pi1 back in 2015.

Performing FPA MUL 4,096,000 times: 80779.853376 in 388cs
Performing VFP MUL 4,096,000 times: 80779.853376 in 7cs

So, a lot of time saved not using FP. |
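For reference, a rough sketch of that sort of benchmark in C (the original test was presumably hand-coded; whether you get FPA, VFP or soft-float multiplies is decided entirely by how the binary is built):

#include <stdio.h>
#include <time.h>

int main(void)
{
    volatile double x = 1.0000002, y = 1.0;
    clock_t t0 = clock();
    for (long i = 0; i < 4096000; i++)
        y = y * x;                           /* 4,096,000 FP multiplies */
    clock_t t1 = clock();
    printf("%f in %ldcs\n", y,
           (long)((t1 - t0) * 100 / CLOCKS_PER_SEC));
    return 0;
}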
Clive Semmens (2335) 3276 posts |
Interesting, thanks Rick. That’s an even bigger difference than I expected. Next question: how long for 4,096,000 integer MULs? For 4,096,000 64-bit integer MULs? (64-bit integer MUL requires 4 MULs + 3 ADCs on Arch32…) |
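The four MULs are the partial products. A minimal sketch in C of what the compiler (or hand-written code) has to do on a 32-bit machine for the full 64×64 product (only three of the multiplies are needed if you just want the low 64 bits):

#include <stdint.h>
#include <stdio.h>

/* 64x64 -> 128-bit multiply built from four 32x32 -> 64 partial products;
   the extra additions carry the middle terms into the high word. */
void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
    uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);

    uint64_t p00 = (uint64_t)a0 * b0;
    uint64_t p01 = (uint64_t)a0 * b1;
    uint64_t p10 = (uint64_t)a1 * b0;
    uint64_t p11 = (uint64_t)a1 * b1;

    uint64_t mid1 = p01 + (p00 >> 32);        /* cannot overflow */
    uint64_t mid2 = p10 + (uint32_t)mid1;     /* cannot overflow */

    *lo = ((uint64_t)(uint32_t)mid2 << 32) | (uint32_t)p00;
    *hi = p11 + (mid1 >> 32) + (mid2 >> 32);
}

int main(void)
{
    uint64_t hi, lo;
    mul64x64(0xFFFFFFFFFFFFFFFFull, 0xFFFFFFFFFFFFFFFFull, &hi, &lo);
    printf("%016llx%016llx\n", (unsigned long long)hi, (unsigned long long)lo);
    return 0;    /* prints fffffffffffffffe0000000000000001 */
}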
Graeme (8815) 106 posts |
If I understand correctly, the VFP unit is a co-processor. So it can do the calculations for some instructions and at the same time let the ARM processor continue on. That is a big time advantage. |