Crash when dividing by 0 with signals disabled
Charles Ferguson (8243) 427 posts |
Hiya, In the xml/xslt tools I produced recently I have code that generates a NaN and Infinities by turning off signals and then doing a 0/0 and 1/0 respectively. This is a pretty normal way to generate these values. It always worked for me. However, that was because I was using RISC OS 4. On RISC OS 3.7 and RISC OS 5, it crashes badly. Details can be found here: |
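For readers following along, the technique described in the opening post can be sketched roughly as below. This is a reconstruction from the description, not the code from the xml/xslt tools; the function names are hypothetical, and the `volatile` operands are an assumption added so that the compiler does not fold the division away at compile time.

```c
#include <signal.h>

/* Hypothetical helper: produce a NaN by computing 0/0 with SIGFPE ignored. */
static double nan_by_division(void)
{
    volatile double zero = 0.0;   /* volatile: force a run-time divide */
    void (*previous)(int) = signal(SIGFPE, SIG_IGN);
    double result = zero / zero;
    signal(SIGFPE, previous);     /* put the earlier handler back */
    return result;
}

/* Hypothetical helper: produce +infinity by computing 1/0 the same way. */
static double inf_by_division(void)
{
    volatile double one = 1.0, zero = 0.0;
    void (*previous)(int) = signal(SIGFPE, SIG_IGN);
    double result = one / zero;
    signal(SIGFPE, previous);
    return result;
}
```

Whether the divide completes quietly depends on the C library actually honouring SIG_IGN for floating point exceptions, which is the question the rest of the thread investigates.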
Julie Stamp (8365) 474 posts |
I don’t know much about floating point numbers, but here’s what I get on my Raspberry Pi:
I get the same result on RPCEmu (except there’s no CEL in the FPEmulator version code). |
Stuart Swales (8827) 1357 posts |
Working OK for me too on ARMX6, RISC OS 5.29 (02 Nov 2020), SCL 6.08, FPE 4.37; compiled with Norcroft 5.86 [10 Feb 2021]. Has gcc got the optimisation for comparisons with f.p. constant values wrong when targeting arm-riscos? [Aside to others: when handling f.p. values that may have NaN/Inf, usually best to use the comparison macros in math.h] |
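As an illustration of the aside about the math.h comparison macros, here is a minimal sketch using the C99 macros `isunordered`, `isless` and `isgreater`. The function name and the return codes are hypothetical, chosen only for this example; availability of the macros depends on the C library providing the C99 math.h.

```c
#include <math.h>

/* Hypothetical helper: three-way compare that reports NaN operands explicitly.
 * Returns -1 if a < b, 0 if equal, 1 if a > b, and 2 if either operand is NaN. */
static int compare_fp(double a, double b)
{
    if (isunordered(a, b))   /* true when at least one operand is a NaN */
        return 2;
    if (isless(a, b))        /* quiet comparison: does not raise invalid */
        return -1;
    if (isgreater(a, b))
        return 1;
    return 0;
}
```

The point of the macros is that they are defined not to raise the invalid-operation exception on quiet NaN operands, whereas the ordinary relational operators may do so.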
Charles Ferguson (8243) 427 posts |
Ok, so that means that when Julie Stamp and Stuart Swales compile with their compilers and Stubs on their machines there’s no problem. What about when you run the binary that was supplied? User reporting the issue was running it on RPCEmu on RISC OS 5.29. My tests were on RPCEmu using the RISC OS 5 from the easy start bundle (RISC OS 5.27 (19 Mar 2020)). Stuart:
gcc doesn’t come into it. This is compiled with Norcroft 5.18+a bit and stubsg. |
Steve Pampling (1551) 8172 posts |
Perhaps if you compile with something other than Stubsg and maybe rule in (or rule out) the age of the Norcroft package since Stuart has no visibility of an issue when using Norcroft 5.86 |
Stuart Swales (8827) 1357 posts |
It appears to be ‘faulting’ trying to continue from the FP exception somewhere in the guts of the SCL. Gerph’s binary fails on my system at the DVFD f0,f0,#0. I can reproduce the fault using my newer-than-thou compiler if I break out the divide by zero to a separate routine (newer Norcroft otherwise just computes the NaN (and the two booleans) at compile time). If I then replace the SIGFPE handler with one which prints “SIGFPE” and continues, then on the FP exception the SCL does call that handler, but still postmortems … quite deliberately. See https://gitlab.riscosopen.org/RiscOS/Sources/Lib/RISC_OSLib/-/blob/master/clib/s/cl_body#L353 I think here you’d have to wrap the divide by zero in (C99) feholdexcept & feclearexcept/feupdateenv to try to prevent the FPEmulator raising the exception. |
Charles Ferguson (8243) 427 posts |
It’s somewhat odd that it’s failing for the user when it worked for you – and I can only assume it’s some factor of the compiler or stubs that’s affecting it there. That said, I’ve tried the RISC OS 5 CLib on Pyromaniac with a full trace, and it failed with an error, reporting an invalid operation, but which didn’t crash in the way I described earlier. I’ve included the findings on the repo’s README.md, together with the full log of what the RO 5 CLib did. |
Stuart Swales (8827) 1357 posts |
Bizarrely my suggested feclearexcept(FE_DIVBYZERO) didn’t clear the /0 exception which is still then raised by feupdateenv(&env). If I change it to feclearexcept(FE_ALL_EXCEPT), that does the trick – it will happily divide by zero yielding NaN/Inf.
[Edit: That’s because 0/0 on the FPA gives an Invalid Operation exception (FP_INVALID_OP), not divide by zero (FE_DIVBYZERO).] |
Charles Ferguson (8243) 427 posts |
Nice that it optimises that into a NaN… BUT the side effect should still have been there! There should be an equivalent of a `__rt_divtest` call to raise the side effects of the optimised away calculation. The optimised away code and the lack of the side effect then explains why Julie and Stuart got versions that worked. Yay! One mystery solved.
Which I guess implies that it’s not honouring the SIG_IGN – which is what it looks like from the trace I’ve just done of the code. The SCL never even tries to disable the signals from being generated. Looking through git logs, in the ancient sources the signal handler disabling exceptions was addressed in 2001 in my version of CLib… Coo, over 20 years ago. Now I feel old.
commit 1000c2c3e4994ce9dff991b7438dae691ce4a5b9
Author: justin <>
Date: Sun Aug 5 02:23:44 2001 +0000
Summary: Added support for SIGFPE ignoring.
Detail:
* At present, if a SIGFPE happens we call our handler. It is run and on return we produce a postmortem request. This isn't useful. As a first stage to allowing a handler on SIGFPE, we allow SIG_IGN to disable all the exceptions associated with the FPE. This means that if we have code that does :
signal(SIGFPE,SIG_IGN);
subsequently, all FPE exceptions will be ignored (where previously a postmortem would have been produced).
signal(SIGFPE,&handler);
will restore the signal state and allow us to report exceptions in the normal manner.
Admin: Tested with Galaxy doing an explicit overflow on Virginia; seems to work and continues as you might expect. In theory, this fixes the Galaxy crashes and should allow other users to perform their normal signal(SIGFPE,...) operations as they would under unix.
Tag: RISC_OSLib-4_92 |
Charles Ferguson (8243) 427 posts |
For reference, in the actual change, I’ve reduced this to just setting the value of NaN and the infinities directly by encoding the 64bit values: https://github.com/gerph/libxml2/commit/b1909005f496e0d08664501ec35c9c7649de5ab4 |
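Setting the values directly from their encodings, as described above, could be sketched as follows. This is not the code from the linked commit; the function names are hypothetical. It assumes IEEE 754 binary64 doubles and that a double and a 64-bit integer share the same byte order in memory. Where the two 32-bit words of a double are stored in a different order from a 64-bit integer (as I understand is the case for FPA-format doubles), the two words would need to be set separately instead.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a 64-bit pattern as a double.
 * Assumes sizeof(double) == 8 and matching byte order (see the note above). */
static double double_from_bits(uint64_t bits)
{
    double value;
    memcpy(&value, &bits, sizeof value);  /* avoids pointer-cast aliasing problems */
    return value;
}

/* Quiet NaN: exponent all ones, top mantissa bit set. */
static double make_nan(void)
{
    return double_from_bits(UINT64_C(0x7FF8000000000000));
}

/* +infinity: exponent all ones, mantissa zero. */
static double make_pos_inf(void)
{
    return double_from_bits(UINT64_C(0x7FF0000000000000));
}

/* -infinity: as +infinity with the sign bit set. */
static double make_neg_inf(void)
{
    return double_from_bits(UINT64_C(0xFFF0000000000000));
}
```

No floating point arithmetic is performed while building the values, so no exception can be raised and the result does not depend on how the C library treats SIGFPE.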
Charles Ferguson (8243) 427 posts |
I have updated the write up at https://github.com/gerph/riscos-nantest to describe the findings using the CLib on Pyromaniac. |
Stuart Swales (8827) 1357 posts |
Presumably it borks on vanilla RISC OS 4.02 as well? |
Charles Ferguson (8243) 427 posts |
I was going to say ‘but I know it can’t be the problem, ‘cos I wrote it and it can’t affect this’, but that’s silly – I wrote it, so I know it could be a problem. So let’s try it with a 32bit stubs. Admittedly a 32bit stubs that I updated, but meh… it’s one less thing.
charles@laputa ~/projects/RO/nantest (master)> riscos-amu BUILD32=1
riscos-cc -c -Wc -fa -IC: -za1 -apcs 3/32/fpe2/swst/fp -D__CONFIG=32 -o o32/nantest c/nantest
Norcroft RISC OS ARM C vsn 5.18 (JRF:5.18.119) [Nov 13 2020]
"c/nantest", line 21: Warning: floating point constant overflow: '/'
"c/nantest", line 23: Warning: actual type 'long' mismatches format '%08x'
"c/nantest", line 23: Warning: actual type 'long' mismatches format '%08x'
"c/nantest", line 38: Warning: floating point constant overflow: '/'
"c/nantest", line 40: Warning: actual type 'long' mismatches format '%08x'
"c/nantest", line 40: Warning: actual type 'long' mismatches format '%08x'
c/nantest: 6 warnings, 0 errors, 0 serious errors
riscos-link -rescan -C++ -aif -o aif32.nantest-stubsg o32.nantest C:o.stubsGS
nantest-stubsg: All built {Disc}
charles@laputa ~/projects/RO/nantest (master)> riscos-amu BUILD32=1 -f MakefileStubs,fe1
riscos-link -rescan -C++ -aif -o aif32.nantest-stubs32 o32.nantest <Lib$Dir>.CLib.o.stubs-32
nantest-stubs32: All built {Disc}
charles@laputa ~/projects/RO/nantest (master)> pyrodev --common --command aif32.nantest-stubs32
NaN test
nan = nan
nan = 7ff80000/e0000000
same = 1, different = 0
INF test
inf = inf
inf = 7ff00000/00000000
same = 1, different = 0
charles@laputa ~/projects/RO/nantest (master)>
Tested, new aif posted (and the old one renamed to nantest-stubsg). Same problem, so it’s not stubsg that causes the problem. Or it is stubsg that causes the problem AND it’s in my stubs as well. And Stuart has repro’d with his own build too, so I’m pretty sure I’m in the clear here. |
Charles Ferguson (8243) 427 posts |
Going by that code comment, I’d assume so. Yup, just tested it and I get an exception at 3ffffffc. So basically this is something that I’ve fixed in Select. Yay. |
Stuart Swales (8827) 1357 posts |
Was the fix there to turn off FP exceptions at source when you do signal(SIGFPE, SIG_IGN), or let them run through the exception handling and then handle SIG_IGN being raised without calling postmortem()?
You do get a nice compiler warning about invalid operations ;-) |
Julie Stamp (8365) 474 posts |
I don’t believe that the LDMIB is aborting at all. The value a2 = &FFFFFFFD = SIG_IGN is exactly what you’d expect: the LDMIB is loading registers from the register dump, and that was the value of a2 when the DVFD F0, F0, #0 was executed because it was the second argument to the signal() immediately before. I don’t follow the relevant piece of Thanks for letting us know about the not ignoring signals. |
Stuart Swales (8827) 1357 posts |
It looks like the reason that you get |
Charles Ferguson (8243) 427 posts |
Stuart:
Yeah, if you don’t raise the exceptions you avoid going through a lot of extra hoops – and of course you /can’t/ have exceptions happen if you’re in FP code in SVC mode, so allowing them and hoping that the environment handler will catch them and recover wasn’t a solution. I’m not sure that that was actually in my mind at the time, but that’s certainly a reason why you shouldn’t try to leave it to the environment handlers to deal with it.
Julie:
I don’t believe that the LDMIB is aborting at all. The value a2 = &FFFFFFFD = SIG_IGN is exactly what you’d expect: the LDMIB is loading registers from the register dump, and that was the value of a2 when the DVFD F0, F0, #0 was executed because it was the second argument to the signal() immediately before.
Oh, that’s interesting. I had assumed that that was the case when it crashed.
Stuart:
Tut… I had intended to check what the comparison code did, but was dealing with the actual crash first. That I might be able to fix in the compiler. Maybe. My compiler-fu is not great for features. Making it work on 64 bit systems – easy… changing behaviour – scary. |