Thinking ahead: Supporting multicore CPUs
Stuart Swales (8827) 1357 posts |
Believe the source when it does… |
Stuart Swales (8827) 1357 posts |
A candidate for recompiling with /softfp and apcs_softpcs for the easy win? |
Clive Semmens (2335) 3276 posts |
Gawd help us. |
Simon Willcocks (1499) 519 posts |
Strange. I can see BASIC, relatively well-defined, as being easier to port than assembler. Just consider WriteS! In assembler, is something code or data? It’s hard to tell. |
Clive Semmens (2335) 3276 posts |
And how, if you don’t have source. Especially if some devious c**t has been writing self-modifying code. Personally I’ve never done that in ARM assembler, but it was sometimes the only way to get things to run fast enough for real-time applications in 6502 code. |
Stuart Swales (8827) 1357 posts |
DPIngScan is 99% C, Simon. Clearly good enough performance for typical size images, but hundreds of millions of kernel mode transitions for the FP exceptions will soon start to hurt with larger images. [Ah, talking about Printer Manager. Just put a bomb under it.] |
Dave Higton (1515) 3534 posts |
BBC_Error: Untrapped Parser Error – Output File will be Empty 114 syntax errors detected |
Dave Higton (1515) 3534 posts |
For Printer Manager, we have the source. However, even the uncrunched (if that’s the correct term) version, from the ROOL RISC OS sources, is hard to understand because there are no comments, all the structure offsets are magic numbers rather than named variables, and some of the variable names are too short. These are presumably compromises because of the sheer size of it. It’s difficult to work out what a given function is trying to do. But if enough minds co-operate, I’m sure it can be decoded. |
Rick Murray (539) 13850 posts |
It’s stranger than that. The function is “rotate” in the code file “bm”. It looks like it will first perform a quick rotation (90°, 180°, -90°) to get to the closest orientation, and then it’ll perform a slow, long-winded rotation (with much use of FP) to make up the difference. It’s odd that it uses FP, given that it looks like a lot of the rest of the program uses 16-bit fixed point? Or am I mistaken on that? It looks to me as if “zmath” deals with fixed point maths.
Fixed that for you, as FPA code on a VFP machine is an insanity. |
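The quick 90° step described above is pure index shuffling, so it costs no FP at all. A minimal sketch for an 8bpp bitmap (illustrative only, not DPScan’s actual code; contiguous rows and a separate destination buffer are assumed):

#include <stdint.h>
#include <stddef.h>

/* Rotate an 8bpp image 90 degrees clockwise.  src is width x height,
   row-major; dst must be height x width (i.e. height pixels wide).
   The source pixel at column x, row y lands at column height-1-y,
   row x of dst. */
void rotate90_cw(const uint8_t *src, int width, int height, uint8_t *dst)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            dst[(size_t)x * height + (height - 1 - y)] =
                src[(size_t)y * width + x];
}

The slow part is only the residual rotation by the remaining few degrees, which is where the FP (or fixed-point) arithmetic comes in.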
Stuart Swales (8827) 1357 posts |
Always get the algorithm right before optimising. If rotating your test images only takes a few seconds, you’re not doing it to animate a real-time display, and other users haven’t complained loudly, leave it be. For all that folk round here gripe about FPA use on VFP-based systems, only a couple of people have bothered to seriously try out apcs_softpcs, with one deploying. Those that haven’t, well probably because their software is already ‘good enough’. Even I haven’t bothered releasing Fireworkz/VFP to mainstream, just to select users that I know it will genuinely help with their large matrix problems. |
Sveinung Wittington Tengelsen (9758) 237 posts |
I’d go for the relative complexity of bitmaps vs. the basic simplicity of geometric figures, straight lines and Bézier curves. |
David J. Ruck (33) 1636 posts |
It depends whether interpolation is used to reduce artifacts from rotation; some of the better algorithms are going to be difficult to do in fixed point without introducing errors which negate the benefit of using them. Some interpolation algorithms are described at the bottom of https://web.archive.org/web/20101127050322/http://www.all-in-one.ee:80/~dersch/interpolator/interpolator.html and there are more examples at http://photocreations.ca/interpolator/index.html |
Clive Semmens (2335) 3276 posts |
Not really – at worst you could use 64-bit integers instead of 32-bit, it’s still a lot faster than FP. Even if you’ve got VFP. |
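A minimal sketch of what Clive means, in C: blending two pixels with a 16.16 fixed-point weight, widening to 64 bits for the intermediate so nothing overflows (illustrative only; not taken from DPScan or PhotoDesk):

#include <stdint.h>

/* Linear blend of two 8-bit samples; w is a 16.16 fixed-point weight in
   the range 0..65536.  The 64-bit intermediate keeps the multiply exact. */
uint8_t lerp_fx(uint8_t a, uint8_t b, int32_t w)
{
    int64_t v = (int64_t)a * (65536 - w) + (int64_t)b * w;
    return (uint8_t)((v + 32768) >> 16);    /* round to nearest */
}

Bilinear interpolation is three such blends per output pixel; whether the fixed-point rounding is good enough is exactly the trade-off described above.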
David Pilling (8394) 96 posts |
I had to look at DPScan: rotations are done as combinations of X and Y shears (algorithm in Graphics Gems), and those shears are done using integers. Maybe slow (I’m sorry to hear how slow), but the results are anti-aliased, so they look better than the simple way of rotating a bitmap.

So slow… I’m hoping that you have the ‘virtual memory’ set up wrongly. It needs the chance to use lots of memory whilst doing the rotation. I revisited the rotation code in my TWAIN drivers and made a better effort at it there. It would be great to recompile with a different floating point set-up and get better performance, but so much time back in the early 90s was spent removing/avoiding floating point code.

There is a Windows version of DPScan, 64 bit (!), multiple processor/thread (!!). Not a big deal actually; after years of hankering after a dual processor machine, it was a let-down to find that two processors only let you go twice as fast. Even so, the things going on in DPScan are suitable for concurrent processing. Would it make a good benchmark? How big are these bitmaps that take so long to rotate?

Seemingly the US Minuteman missiles were programmed using integers. You can do a lot if you can be bothered. |
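For anyone curious what an integer shear pass looks like, here is a minimal sketch (not DPScan’s actual code; 8bpp, nearest-pixel shifts, and the stride/background handling are assumptions):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Shear the rows of an 8bpp image horizontally: row y moves sideways by
   shear_fx * y pixels, where shear_fx is a 16.16 fixed-point fraction.
   Integer arithmetic only; vacated pixels are filled with bg. */
void shear_rows_x(uint8_t *img, int width, int height, size_t stride,
                  int32_t shear_fx, uint8_t bg)
{
    for (int y = 0; y < height; y++) {
        uint8_t *row = img + (size_t)y * stride;
        /* per-row offset, rounded to the nearest whole pixel */
        int off = (int)(((int64_t)shear_fx * y + 0x8000) >> 16);

        if (off > 0) {                               /* shift right */
            int keep = width > off ? width - off : 0;
            memmove(row + (width - keep), row, (size_t)keep);
            memset(row, bg, (size_t)(width - keep));
        } else if (off < 0) {                        /* shift left */
            int keep = width > -off ? width + off : 0;
            memmove(row, row + (width - keep), (size_t)keep);
            memset(row + keep, bg, (size_t)(width - keep));
        }
    }
}

A rotation by a small angle is then an x-shear by -tan(angle/2), a y-shear by sin(angle), and a second x-shear by -tan(angle/2) (Paeth, Graphics Gems I); interpolating the fractional part of each shift, rather than rounding it as above, is what gives the anti-aliased result.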
Rick Murray (539) 13850 posts |
I didn’t see any virtual memory option, but I noticed that the memory setting was about 9MB. I changed that to be about quarter of a gigabyte. The rotation of one degree took just under eight seconds on a Pi 2. So there’s the trick – the memory allocation is a bit weird (probably due to its heritage) so one needs to tell it to use a tonne of memory if it needs to. Once you’ve done so, it doesn’t lose its noodle trying to rotate a 25.83MB image in 9MB.
Yes, it looks good. As good as PhotoDesk with bi-cubic interpolation, but twice as fast. ;)
Nothing much has changed, the DDE still doesn’t do VFP code even after all this time.
Mobile phone photo – in my case 4480×2016.
….what were you expecting? Of course, over in the x86 world there’s plenty of cheating. My Pentium 4 box is a dual core hyperthreading gizmo. It’s a lie, there’s only a single core but it does stuff concurrently so it sort of fakes having two cores.
I hope car ECUs and ABS units are too. The thing about floating point is that there is inherent imprecision that can accumulate. This is why it’s a bad idea to use floats for money calculations.
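The classic demonstration, as a small hedged sketch in C (a hypothetical till adding up 10p a million times):

#include <stdio.h>

/* Binary floating point cannot represent 0.10 exactly, so the error
   accumulates; integer pence stay exact. */
int main(void)
{
    double pounds = 0.0;
    long pence = 0;
    for (int i = 0; i < 1000000; i++) {
        pounds += 0.10;
        pence  += 10;
    }
    printf("float: %.6f  integer: %ld.%02ld\n",
           pounds, pence / 100, pence % 100);
    return 0;
}

The float total comes out very slightly off 100000.000000, which is exactly the kind of drift you don’t want in a ledger (or a brake controller).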
As Stuart says: A candidate for recompiling with /softfp and apcs_softpcs for the easy win? |
Stuart Swales (8827) 1357 posts |
My fault for an insufficiently deep dive into the DPIngScan source. I just saw that it was still passing a f.p. arg down to the proc that sheared each scanline of the image. I’d still be tempted here to pass in the corresponding fixed-point integer scaled fractional value as you can then be damn sure the called proc isn’t mucking about pushing/pulling FP temporaries and reloading FP from stack.
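In code terms the suggestion is roughly this (names and values hypothetical, not the actual DPIngScan routines): scale the fraction to 16.16 fixed point once in the caller, so everything below it works in integer registers.

#include <stdint.h>
#include <stdio.h>

/* Convert a fractional value to 16.16 fixed point, rounding to nearest. */
int32_t to_fix16(double x)
{
    return (int32_t)(x * 65536.0 + (x >= 0.0 ? 0.5 : -0.5));
}

int main(void)
{
    int32_t shear_fx = to_fix16(0.0174551);      /* roughly tan(1 degree) */
    /* the per-scanline code then shifts row y by this many pixels,
       with no FP anywhere near it: */
    for (int y = 1000; y <= 3000; y += 1000)
        printf("row %4d shifts by %ld pixels\n", y,
               (long)(((int64_t)shear_fx * y + 0x8000) >> 16));
    return 0;
}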
You don’t think there’s loss of precision for general calculation by using fixed point??? My Pentium 4 box is a … room heater ;-) [My first P4 died when one of the heatsink clips broke the plastic tab on the CPU socket, and the heatsink and fan assembly, being mounted vertically, torqued away slightly. It was running a climate model at full tilt, so it overheated very quickly. I think the thermal throttle on the chip was up the other end, which was still being cooled a bit.] |
Paul Sprangers (346) 525 posts |
So sorry. My max memory also defaulted to about 9 MB. After changing that to about 250 MB, the very same picture rotated instantaneously on the Pi4. Garlands for DPScan! |
Rick Murray (539) 13850 posts |
;) It’s been a learning experience for us, so it’s all good. Just shows it pays to fiddle in the settings.
Indeed. I’m impressed that C code outperforms something in assembler. I wonder what the algorithm differences are. I was using bi-cubic interpolation. There’s not a lot of difference between that and bi-linear, but both are better than nearest neighbour.
Generally my PC runs fairly cool, but when running OpenDuke (the old Duke Nukem open source runtime) it kicks up to vacuum cleaner mode in a matter of seconds. It’s worth noting that Redneck Rampage and SimCopter (etc) do not do this, so I wonder if it’s using busy loops or something wildly inefficient like that?
Found myself a copy of this as a PDF. The start talks about Derived Objects and says stuff like:

|M|   determinant of M

and then we’re on to Basic Expressions and Functions with things like:

(n over i)   binomial coefficient = n! / ((n-i)! i!)

WTF? I don’t think I’m going to understand |
Stuart Swales (8827) 1357 posts |
Binomial coefficient. That’s more easily understood as n choose i: given n things, how many (distinct) ways are there to choose i of them? It’s easy to grok concrete examples for small n and i; after that it’s just doing the arithmetic. n! is factorial: it’s just the product of all the whole numbers up to and including n. |
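A concrete small case: C(5, 2) = 5!/(3! 2!) = 120/(6 × 2) = 10 ways to pick 2 things out of 5. And a minimal sketch of computing it without letting the factorials blow up (illustrative, not taken from the Graphics Gems code):

#include <stdio.h>

/* n choose i, built up term by term so intermediates stay small.
   After k steps the running value is C(n-i+k, k), always an integer,
   so each division is exact. */
unsigned long long binomial(unsigned n, unsigned i)
{
    unsigned long long result = 1;
    if (i > n) return 0;
    if (i > n - i) i = n - i;                 /* C(n,i) == C(n,n-i) */
    for (unsigned k = 1; k <= i; k++)
        result = result * (n - i + k) / k;
    return result;
}

int main(void)
{
    printf("C(5,2)  = %llu\n", binomial(5, 2));    /* 10 */
    printf("C(52,5) = %llu\n", binomial(52, 5));   /* 2598960 poker hands */
    return 0;
}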
Rick Murray (539) 13850 posts |
Thanks but… I briefly skipped through the PDF and… it got increasingly scarier. Some of the diagrams were pretty. The maths was straight up nightmare fuel. |
David Pilling (8394) 96 posts |
Graphics Gems was a series of five books of academic papers published year by year in the early 1990s. A bit like conference proceedings, somewhere for people to publish research papers. Each volume consists of maybe 40 papers, and source code was available for them. You’d have to be keen to read the 1000s of pages of dense text from start to finish; more likely you’d pick out papers of interest. The rotation-by-shearing one is page 179 of volume 1. No binomials are involved. But matrices…

DPScan: as I recall, you set the max amount of RAM it is to use; beyond that it puts stuff on disc. If ‘disc’ is an SD card, you probably don’t want it hammering it with writes.
I am an optimist: two people together can do more than twice what an individual can. Teamwork, management. So why not cores? |
Clive Semmens (2335) 3276 posts |
The reason a lot of (programmers’) time was spent avoiding floating point in the 1990s is that using fixed point saves a lot of computer time. This was especially true in FPA days, but it’s still true even if you’ve got VFP, assuming you’re looking for the same degree of precision (or looked at another way, the same rate of accumulation of errors). The only time FP is essential (as distinct from merely saving programmer effort) is when you’re multiplying or dividing quantities whose size is many orders of magnitude different. |
Rick Murray (539) 13850 posts |
To put this into context, given Acorn’s reticence to include FP hardware… I did a test on my Pi1 back in 2015.

Performing FPA MUL 4,096,000 times: 80779.853376 in 388cs
Performing VFP MUL 4,096,000 times: 80779.853376 in 7cs

So, a lot of time saved not using FP. |
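For reference, a rough sketch of that sort of benchmark in C (the original test was presumably hand-coded; whether you get FPA, VFP or soft-float multiplies is decided entirely by how the binary is built):

#include <stdio.h>
#include <time.h>

int main(void)
{
    volatile double x = 1.0000002, y = 1.0;
    clock_t t0 = clock();
    for (long i = 0; i < 4096000; i++)
        y = y * x;                           /* 4,096,000 FP multiplies */
    clock_t t1 = clock();
    printf("%f in %ldcs\n", y,
           (long)((t1 - t0) * 100 / CLOCKS_PER_SEC));
    return 0;
}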
Clive Semmens (2335) 3276 posts |
Interesting, thanks Rick. That’s an even bigger difference than I expected. Next question: how long for 4,096,000 integer MULs? For 4,096,000 64-bit integer MULs? (64-bit integer MUL requires 4 MULs + 3 ADCs on Arch32…) |
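The four MULs are the partial products. A minimal sketch in C of what the compiler (or hand-written code) has to do on a 32-bit machine for the full 64×64 product (only three of the multiplies are needed if you just want the low 64 bits):

#include <stdint.h>
#include <stdio.h>

/* 64x64 -> 128-bit multiply built from four 32x32 -> 64 partial products;
   the extra additions carry the middle terms into the high word. */
void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
    uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);

    uint64_t p00 = (uint64_t)a0 * b0;
    uint64_t p01 = (uint64_t)a0 * b1;
    uint64_t p10 = (uint64_t)a1 * b0;
    uint64_t p11 = (uint64_t)a1 * b1;

    uint64_t mid1 = p01 + (p00 >> 32);        /* cannot overflow */
    uint64_t mid2 = p10 + (uint32_t)mid1;     /* cannot overflow */

    *lo = ((uint64_t)(uint32_t)mid2 << 32) | (uint32_t)p00;
    *hi = p11 + (mid1 >> 32) + (mid2 >> 32);
}

int main(void)
{
    uint64_t hi, lo;
    mul64x64(0xFFFFFFFFFFFFFFFFull, 0xFFFFFFFFFFFFFFFFull, &hi, &lo);
    printf("%016llx%016llx\n", (unsigned long long)hi, (unsigned long long)lo);
    return 0;    /* prints fffffffffffffffe0000000000000001 */
}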
Graeme (8815) 106 posts |
If I understand correctly, the VFP unit is a co-processor. So it can do the calculations for some instructions and at the same time let the ARM processor continue on. That is a big time advantage. |