Ticket #483 (Fixed)Sat May 02 16:09:27 UTC 2020
Kernel 6.09 and later makes ABC compile things very slowly
Reported by: | Sprow (202) | Severity: | Major |
Part: | RISC OS: Module | Release: | |
Milestone: | Status | Fixed |
Details by Sprow (202):
Following a discussion with someone using ABC (not sure which version), it was mentioned that the project compiles very slowly.
Here, on a Titanium running RISC OS 5.24, the attached test project which comprises ~7000 PRINT statements compiles with !ABC 4.20 in ~5s. With RISC OS 5.27 it takes ~15s to get to 1% complete, scaling linearly that’d mean it’s now 300x slower.
I also tried !ABC 4.18 since that would be contemporary with RISC OS 5.24, but it was the same (ie. doesn’t appear to be caused by any new or fixed bugs in !ABC).
Going on a hunch, I’ve chopped it down to Kernel 6.08 (fast) to Kernel 6.09 (slow).
Rather than be sensible and start with a 5.24 snapshot and work forwards, I started from today’s RISC OS 5.27 sources and went backwards. That does mean the following components need regressing:
- Switcher to 1.51 (to avoid needing a define for a newer OS_Memory API)
- HAL_Titanium to 0.10 (to avoid needing a define for HAL offset flag of XHCI’s registers)
- PCI to 0.15 (to avoid needing a define for a newer OS_Memory API)
- XHCIDriver to 0.27 (to avoid needing a define for HAL offset flag of XHCI’s registers)
- Debugger to 2.02 (to avoid needing OS_MapInIO64 define)
- ScrModes to 0.72 (to avoid needing SyncPol flag define)
- SATADriver to 0.08 (to avoid needing a define for a newer OS_Memory API)
In fact, SATADriver 0.08 goes with Kernel 6.08, and SATADriver 0.10 goes with Kernel 6.09, unless the compile time switch is used instead.
Changelog:
Modified by Sprow (202) Sat, May 02 2020 - 16:10:01 GMT
- Attachment added: justprints.zip
Test program with 7000+ PRINT statements.
Modified by Jeffrey Lee (213) Sun, May 03 2020 - 15:49:57 GMT
It looks like ABC is making poor use of OS_SynchroniseCodeAreas. When compiling that test program, some simple instrumentation shows that 87860 global OS_SynchoniseCodeAreas operations are performed. That’s over 12 per PRINT statement!
On multi-core CPUs, global OS_SynchroniseCodeAreas calls will incur a significant performance hit, because there’s no (sane) SMP global IMB operation available. So instead the OS does a ranged clean of application space and the RMA (the two locations that code is traditionally likely to appear).
https://gitlab.riscosopen.org/RiscOS/Sources/Ke…
ABC 4.20 compiles the code in 8.8s on my single-core BB-xM, and Jonathan Abbot has previously reported that global OS_SynchroniseCodeAreas can take up to 2 seconds per 100 calls on a Pi 3, which about matches with your 15 seconds per 1% measurement.
https://www.riscosopen.org/forum/forums/4/topic…
(Large ranged OS_SynchoniseCodeAreas calls will also be slower, again because there’s no global IMB it can fall back on once the size of the area crosses a certain threshold. So it might be worth checking for abuse of those as well)
Modified by John Ballance (21) Thu, May 07 2020 - 18:55:45 GMT
A quick look shows the culprit. The Library source, around 1894, does an unbounded OS_SynchroniseCodeAreas. There are 5 or so other locations that use bounded calls.
A quick tweak of the code here confirms this is the call culprit, so we need to provide the correct bounds.
This bit of code is responsible for providing the USR and CALL dispatching. I’m not at present sure how to bound it though.
(proof: creating an arbitary bound of call address to call address +16k provided a compile time of 13 seconds, against many minutes with no bounds. The arbitary bounding isn’t sufficient though as the resultant code won’t self compile)
Modified by John Ballance (21) Thu, May 07 2020 - 21:21:44 GMT
- Status changed from Open to Fixed
Fix found empirically:
Bounding the offending OS_SynchroniseCodeAreas to R9 to R9 +&20 is sufficient to successfully recompile the compiler.
The diff file below [ADMIN EDIT: Redacted; closed source component; if you need more info, please contact ROOL for details] details ALL changes made, including up issuing the ABCLib module to 4.21. At present I do not appear to have write access to this source so cannot commit this mod.
Fixed version passed on to original complainant for verification.
Sprow: Could you please confirm it resolves in your environment…
FYI the minimum ode sync that reliably built the compiler was 0×20. I use the fact that the compiler produces the same ooutput with original basic compiler source and with compiled compiler as a reasonable validation.
Modified by John Ballance (21) Thu, May 07 2020 - 21:42:28 GMT
Update: Code now committed to CVS
Modified by John Ballance (21) Fri, May 08 2020 - 11:38:39 GMT
Following feedback from Paul Reuvers – who uses ABC commercially and exhaustively – I have updated !ABC to v4.22. It also became clear when I revisited the compiler compilation that the range synchronised needed a slight increase. It is now 0×3c rather than 0×20, which allows for the largest ‘frame’ that I can see inspecting the copiled compiler code.
The plus side is feedback from Paul suggesting the resultant code, run against ABCLibrary 4.22 is up to 10x faster than before!
The resultant !ABC 4.22 is attached. The ABCLibrary v4.22 can be found within the app [ADMIN EDIT: Compiler distro Zip removed; if you need more info, please contact ROOL for details]
Modified by Sprow (202) Thu, May 14 2020 - 17:58:40 GMT
Unfortunately the library provided (by email 08-May-2020 12:43) causes 3 of 4 of the ABC applications I have to crash during use.
I’m a little alarmed to see the phrase “Fix found empirically” followed by a sequence of incrementing guesses; the cache on my Titanium is well characterised, so it should be possible to know the correct solution.
Subsequently, an update from ROOL (by email 11-May-2020 08:44) works with 4 of 4 of the ABC applications I have, and the compilation time seem is equivalent (within 0.1s anyway!) on a recent SMP Kernel. So hopefully ROOL will push that out with the next DDE update.