RISC OS Open: Forum: Debugging locked Application

Mar 28, 2021 11:28am

Martin Avison (27) 1494 posts

I am trying to investigate an application which, under some weird circumstances and only on a remote user machine, ‘locks’ the machine with a never-ending hourglass. It is recoverable by using Alt-Break to Stop the offending task, but I am unable to diagnose the problem without some clues where the app is looping.

Any suggestions how I could get any debug information welcome!

Mar 28, 2021 11:31am

Chris Hall (132) 3559 posts

If the application is written in BASIC then add some TRACE statements and PRINT statements and do a *Spool filename before running it.

If it is not written in BASIC then someone else may help.

Mar 28, 2021 12:58pm

Martin Avison (27) 1494 posts

It is not Basic – that I can debug. It is a C program. And yes, I could put some debugging in that, but it might take several iterations as it is 60k lines – and rather tedious for me & user.

Mar 28, 2021 1:14pm

Julie Stamp (8365) 474 posts

I think you can do it, with a little bit of work.

Take your Report button for Alt-Break, and modify it to set a non-transient callback. After the user presses cancel, it will fall into your callback environment handler, where you have full access to the task’s user mode registers. Then send them to Reporter, or whatever else you would do with them, and return from the callback in the usual way.

(I think this isn’t entirely foolproof, as if a callback is already set it will be missed, and C does use callbacks, but this hack will at least let you log the task state, even if it might crash after that.)

Mar 28, 2021 4:50pm

Sprow (202) 1158 posts

Any suggestions how I could get any debug information welcome!

If the application is written in BASIC then add some TRACE statements

If you wanted to get a big list of all the functions that were called in the order they were called, like TRACE PROC in BASIC you could hook up the profiling that the C compiler can output to Reporter. That might give you a clue what’s happening.

Proceed as follows:

1. Build the application with profiling and embedded function names (switches -p -fn).
2. On entry to every function it’ll have added a BL _count1 followed by a 0 word followed by a special encoded value.
3. Write a replacement _count1 function that does whatever Reporter magic you want but that follows the same calling convention by returning to R14+8 with flags preserved.
4. The format of the special encoded value can be deduced from armprof.c and gives an index into a table of function names…
   …or just work backwards from R14 to find the APCS function signature, which might be easier.

You might need to sneakily edit your stubs to rename _count1 in there to say _zzzzz1 to avoid the linker whingeing about duplicate symbols.
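
The "work backwards from R14" idea can be sketched in portable C so it can be tried on a host machine first. This is only a sketch under an assumption: that with embedded function names each function entry is preceded by the NUL-padded name and then a marker word of the form 0xFF0000nn, where nn is the padded name length in bytes. The function name `find_function_name` and its parameters are hypothetical, and the code scans a snapshot of code words rather than live memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout (not confirmed by the thread):
 *   [name bytes, NUL-padded to a word multiple][0xFF0000nn][function entry ...]
 * where nn is the padded length of the name in bytes. */
#define NAME_MARKER_MASK  0xFFFFFF00u
#define NAME_MARKER_VALUE 0xFF000000u

/* Scan backwards from code[index] (for example the word R14 points at) for
 * at most max_back words. Returns the embedded name, or NULL if none found. */
const char *find_function_name(const uint32_t *code, size_t index, size_t max_back)
{
    for (size_t back = 0; back <= max_back && back <= index; back++) {
        size_t i = index - back;
        uint32_t word = code[i];

        if ((word & NAME_MARKER_MASK) != NAME_MARKER_VALUE)
            continue;

        size_t name_bytes = word & 0xFFu;
        /* Reject markers whose length is implausible or runs off the start. */
        if (name_bytes == 0 || (name_bytes % 4) != 0 || name_bytes / 4 > i)
            continue;

        return (const char *)(code + (i - name_bytes / 4));
    }
    return NULL;
}
```

A replacement _count1 could pass the returned string to whatever logging is in use. Literal data in the code area could in principle look like a marker word, so a result from this kind of scan is a hint rather than proof.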

Mar 28, 2021 5:47pm

Rick Murray (539) 13851 posts

You might need to sneakily edit your stubs

I note the uncertainty. ;-)

The last time I checked (around Y2K), the profiler stuff needed to be linked with ANSIlib as it wasn’t part of Stubs.
In that case, a user supplied _count1 and some magical fiddling with R14 ought to work.

Alternatively… How does Alt-Break kill a task? Is it something that can be trapped (say, by atexit() ?). I’ve never tried, but perhaps put together a simple backtrace handler to output a list of functions?

as it is 60k lines

Well, if it locks up the machine with an hourglass, there’s your first clue. All the bits of code that turn on the hourglass. If you have four or less, set the LEDs to a pattern (off, top, bottom, or both) which will tell you which hourglass part is the problem.

And, yes, it might take a few iterations. Unfortunately it’s like that if something works for you but fails elsewhere.
Trust me, I know. I just can’t get my Manga cache to crash on me. So I can’t see what others are seeing.

Mar 29, 2021 5:21pm

Martin Avison (27) 1494 posts

code that turn on the hourglass. If you have four or less

Unfortunately 43 >>> 4 !

Thanks for all the suggestions – some interesting ideas, but still evaluating.
I have also tried Jeffrey’s fiqprof, but I don’t think it works on a Titanium (or the remote Mini.m)

Mar 29, 2021 6:21pm

Steve Pampling (1551) 8172 posts

Unfortunately 43 >>> 4 !

Rick’s suggestion, with the obvious stated:
Pattern 1 for instances 1 – 21
Pattern 2 for instances 22 – 43

If fault with pattern 1
then
Pattern 1 for instances 1-10
Pattern 2 for instances 11-21

If fault with pattern 2
Then
Pattern 1 for instances 11-15
Pattern 2 for instances 16-21

repeat binary search until one instance…
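
The halving scheme above can be written down as a small helper, so that each test build assigns the two patterns mechanically. This is a minimal sketch: the names `site_range`, `pattern_for_site`, `narrow` and `rounds_needed` are made up for illustration, and the split point is chosen so that the ranges match the figures quoted (1-21 / 22-43, then 1-10 / 11-21, then 11-15 / 16-21).

```c
#include <stddef.h>

/* Inclusive range of instrumented hourglass call sites still under suspicion. */
typedef struct {
    int lo;
    int hi;
} site_range;

/* Last site of the lower half; the lower half gets floor(n / 2) sites. */
static int split_point(site_range r)
{
    return r.lo + (r.hi - r.lo + 1) / 2 - 1;
}

/* 1 = show pattern 1, 2 = show pattern 2, 0 = site already ruled out. */
int pattern_for_site(site_range r, int site)
{
    if (site < r.lo || site > r.hi)
        return 0;
    return site <= split_point(r) ? 1 : 2;
}

/* Shrink the range according to the pattern seen when the machine hung. */
site_range narrow(site_range r, int observed_pattern)
{
    int mid = split_point(r);
    if (observed_pattern == 1)
        r.hi = mid;
    else
        r.lo = mid + 1;
    return r;
}

/* Worst-case number of test builds needed to isolate one site. */
int rounds_needed(int sites)
{
    int rounds = 0;
    for (int span = 1; span < sites; span *= 2)
        rounds++;
    return rounds;
}
```

For 43 sites `rounds_needed` gives 6, which is the number of round trips to the remote user this approach would cost in the worst case.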

Mar 29, 2021 6:27pm

Martin Avison (27) 1494 posts

Thanks – but I am well aware of the beauties and uses of a binary search.

Mar 29, 2021 7:27pm

Steve Pampling (1551) 8172 posts

uses of a binary search.

Telling a third party’s external support how to locate the source of a loop in their network? :)

Mar 29, 2021 7:39pm

Rick Murray (539) 13851 posts

What I would do in this situation… Prefix every Hourglass call with something to output a message to DADebug (I’m a big believer in DADebug!).
Run the program.
Let it freeze up.
Kill it.
Go to command line and type *DADPrint, and Bob’s your cousin’s brother’s oh whatever the hell… the last line reported will be as far as the program got.

[but, you know, could use Reporter too! ;-)]
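
One way to prefix every Hourglass call without editing 43 sites by hand is to route them through a macro that records the source position first. The sketch below is only an illustration: `trail_note`, `trail_last`, `trail_total`, `app_hourglass_on` and `HOURGLASS_ON` are hypothetical names, and the in-memory ring buffer is a stand-in for whatever call actually sends a line to DADebug or Reporter.

```c
#include <stdio.h>

#define TRAIL_SLOTS 8   /* how many recent notes to keep */
#define TRAIL_TEXT  64  /* maximum length of one note, including NUL */

static char trail[TRAIL_SLOTS][TRAIL_TEXT];
static unsigned trail_count;

/* Record "file:line" in the ring buffer. In a real build this is where the
 * line would be handed to the debug logger instead. */
void trail_note(const char *file, int line)
{
    snprintf(trail[trail_count % TRAIL_SLOTS], TRAIL_TEXT, "%s:%d", file, line);
    trail_count++;
}

/* Most recent note, or "" if nothing has been recorded yet. */
const char *trail_last(void)
{
    return trail_count ? trail[(trail_count - 1) % TRAIL_SLOTS] : "";
}

/* Total number of notes recorded so far. */
unsigned trail_total(void)
{
    return trail_count;
}

/* Stand-in for the application's real hourglass-on call. */
void app_hourglass_on(void)
{
}

/* Drop-in replacement for each hourglass-on call site. */
#define HOURGLASS_ON() (trail_note(__FILE__, __LINE__), app_hourglass_on())
```

After the hang and the kill, the last note recorded names the call site that turned the hourglass on most recently, which is the same information the *DADPrint approach yields.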

Mar 31, 2021 9:36pm

Martin Avison (27) 1494 posts

I had been hoping to get some clues without providing a recompiled debug version to the user, but all the responses required recompilation anyway. So I opted for simply adding some Reporter debugging at strategic points, and progress is now being made. Thanks for the suggestions.

But it would be very useful if a watchdog Alt-Break Stop could provide some clues what was being executed, even if in a largish loop.

Apr 1, 2021 10:33am

Alan Adams (2486) 1149 posts

But it would be very useful if a watchdog Alt-Break Stop could provide some clues what was being executed, even if in a largish loop.

And if any changes are made to Alt-Break, please add a “previous” button. Skipping through the long list to find the one to kill, then pressing next once too often, is frustrating.

Apr 1, 2021 6:10pm

Chris Hall (132) 3559 posts

Not half! I suppose it is out of the question for the list of active tasks to be shown with radio buttons to select the task or tasks to be ended?

Apr 1, 2021 7:54pm

Julie Stamp (8365) 474 posts

I think it might be possible.

I’ve thought before that you could have a Kill option alongside the Quit option for a task in the task manager. The Quit option would be the same as now, i.e. sends Message_Quit to the task, but Kill would do the same as Alt-Breaking a non-active¹ task, i.e. does Wimp_CloseDown, which means that the task never gets control back (so for example no atexit() routines would be run). I didn’t think too much about that though because I don’t know when you’d want to use that?

But this wouldn’t provide all the functionality of the current Alt-Break error box for killing non-active tasks; the one extra thing Alt-Break does is it stops the Wimp delivering messages to tasks.

So if you want a dialogue box with the list of tasks on Alt-Break, you could make it so that the usual error box offers two options: kill current task, or kill other task. If you click ‘kill other task’, the current task is allowed to run until it calls Wimp_Poll, but all tasks except one have been marked as ‘suspended’ – the one exception is the new task started to run the dialogue box that appears. A DDT-style no-entry sign could be used to show that everything else is frozen, or perhaps the screenshot could be cross-hatched or similar.

Personally though, when I Alt-Break it is to kill the current task, so I’d be interested to hear what use people make of it to kill other tasks, to see which of the above two options is more appropriate.

Another small improvement to the Alt-Break is that at the moment if you use it during a single-tasking task, it calls it ‘Unknown’. More helpful might be to put the command that started the task, as returned by OS_GetEnv.

¹ That is to say, a task that did not have control (‘paged out’) when Alt-Break was pressed.

Apr 1, 2021 9:24pm

Rick Murray (539) 13851 posts

I think usually when one uses Alt-Break, it is an attempt to recover a machine that has hung up, so typically it will be the current task being ended.

I have, once or twice, used it to kill off a different task (when I do dumb things like forget to make a program respond to Quit messages), but it’s not a particularly friendly interface and I’m just as likely to accidentally drop a nuke on the wrong task.

A small improvement to Alt-Break would be if you are stopping the task that is currently paged in, and the task has a valid stack frame, offer a Backtrace button that will output a backtrace, dump the registers, and also about ten or so instructions either side of PC at the point where the program was killed.
That way, somebody stands a chance of seeing how the program got into that state. Won’t help so much for BASIC, but it’s a lot more than the current “hasta la vista, babeeee” method that will end a task and tell you nothing…
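
To make the backtrace idea concrete, here is a rough sketch of walking APCS stack frames over a captured snapshot of memory. It assumes the usual APCS backtrace structure, in which fp points at the saved code pointer, with the return link at fp-4, the saved sp at fp-8 and the caller's fp at fp-12. The names `frame_info` and `walk_apcs_frames` are hypothetical, and nothing here is how Alt-Break itself works.

```c
#include <stddef.h>
#include <stdint.h>

/* What is recorded for each stack frame found. */
typedef struct {
    uint32_t save_code_ptr; /* word at fp: points a little past the function entry */
    uint32_t return_link;   /* word at fp-4: where the function will return to */
} frame_info;

/* Read one word from the snapshot; mem[0] corresponds to address 'base'.
 * Returns 0 if the address is misaligned or outside the snapshot. */
static int read_word(const uint32_t *mem, size_t words, uint32_t base,
                     uint32_t addr, uint32_t *out)
{
    if (addr < base || (addr & 3u) != 0)
        return 0;
    size_t idx = (addr - base) / 4;
    if (idx >= words)
        return 0;
    *out = mem[idx];
    return 1;
}

/* Follow the chain of frames starting at fp. Stops at a zero fp, an unreadable
 * frame, a frame that points at itself, or after max_frames entries.
 * Returns the number of frames written to out[]. */
size_t walk_apcs_frames(const uint32_t *mem, size_t words, uint32_t base,
                        uint32_t fp, frame_info *out, size_t max_frames)
{
    size_t n = 0;

    while (fp != 0 && n < max_frames) {
        uint32_t pc, lr, prev_fp;

        if (!read_word(mem, words, base, fp, &pc) ||
            !read_word(mem, words, base, fp - 4, &lr) ||
            !read_word(mem, words, base, fp - 12, &prev_fp))
            break;

        out[n].save_code_ptr = pc;
        out[n].return_link = lr;
        n++;

        if (prev_fp == fp)
            break;
        fp = prev_fp;
    }
    return n;
}
```

The walker deliberately does not assume that caller frames sit at higher addresses, since a chunked stack may not keep them in order; the `max_frames` limit is the only protection against a corrupt chain that loops.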

Apr 2, 2021 8:51am

Chris Hall (132) 3559 posts

It is a pity that Alt-Break does not return control to BASIC (for an application written in BASIC, that is) so that the errant line and the variables can be examined. It kills off the whole app slot, including the run-time stuff for BASIC.

Apr 2, 2021 9:05am

Rick Murray (539) 13851 posts

It kills off the whole app slot,

That’s the idea. It’s the sledgehammer approach to task management.

Apr 2, 2021 10:22am

Alan Adams (2486) 1149 posts

Personally though, when I Alt-Break it is to kill the current task, so I’d be interested to hear what use people make of it to kill other tasks, to see which of the above two options is more appropriate

A recent example was trying to debug some software running off vectors, which locked up NetFetch. Alt-Break offered an apparently randomly chosen task to kill, and experience showed that the one which recovered control of the machine was NetFetch. (It turned out that NetFetch’s fetch entered the file vectors about 45000 times, even if there was nothing to fetch. This resulted in extreme slowdown, and did eventually stop on its own, after about ten minutes.)

Apr 2, 2021 5:58pm

Andrew Rawnsley (492) 1445 posts

Hi Alan, could you clarify this please. NetFetch itself shouldn’t really be doing anything weird like that – it is userland C. Does Hermes trigger it if run separately from !NetFetch.Apps please? Basically, more clarity on exactly what you click to trigger the issue would be super-helpful to understand what’s going on, and where the bug is (assuming there is one!). NetFetch itself just calls the Hermes module via a SWI (documented in the Hermes help files) to trigger a fetch.

One thing, though – if it is Hermes, check in some of the !Hermes.MailDir files. If you’re doing things like “leave mail on server”, and you have a ton of mail on server, it might be checking headers against a known list which could (I guess) result in a lot of disc access (in which case there may be some logical optimisation that could be done).

Apr 2, 2021 8:00pm

Alan Adams (2486) 1149 posts

Hi Alan, could you clarify this please.

Just for clarity I’m not blaming Hermes/NetFetch for this, just explaining what I was seeing when my code got in the way.

I’m working on a disc activity indicator for the ARMX6. It intercepts the 6 filing system vectors, decides whether the activity is read or write, and shows indications accordingly. During development there was a lot of code active, and this considerably increased the time consumed by my code. As part of my attempts to work out what was happening, I counted the activity on the combined vectors and displayed it, along with all the other output. I’ve checked back over the online discussion. That count reached somewhere in the region of 18000 during the fetch.

It was triggered in either of two ways. If NetFetch wasn’t running, I started it. It is configured to fetch immediately. If it was already running, a right-click on its icon triggered the same activity.

Without my code in the way, the return was, visually at least, immediate. With my code it took around a minute. (I said earlier in this thread that it was ten minutes. I was wrong. It was around 1 minute.) During this time the pointer, whose colours I was changing, seemed to flicker at a very high rate. It turned out that despite attempting to limit the number of changes, a coding error meant that I was changing the colour for at least half of the vector entries. Changing the colour is slow, in relative terms, at least if using Wimp_SetPalette. I’m using a different method now.
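
The intent of limiting the number of changes can be illustrated with a small state tracker that only asks for a redraw when the displayed state actually changes. The real indicator code is not shown in this thread, so everything here (`disc_state`, `indicator`, `indicator_note`) is a hypothetical sketch of the idea rather than the fix that was applied.

```c
#include <stdbool.h>

typedef enum { DISC_IDLE, DISC_READ, DISC_WRITE } disc_state;

typedef struct {
    disc_state shown;            /* state currently displayed */
    unsigned long vector_calls;  /* every vector entry seen */
    unsigned long redraws;       /* times the display really had to change */
} indicator;

/* Called once per vector entry. Returns true only when the visible state
 * changes, so the caller can skip the slow update on every other entry. */
bool indicator_note(indicator *ind, disc_state now)
{
    ind->vector_calls++;
    if (now == ind->shown)
        return false;
    ind->shown = now;
    ind->redraws++;
    return true;
}
```

With a burst of 18000 consecutive read entries this asks for a single redraw, which is the behaviour the original rate limiting was presumably meant to achieve.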

Before I discovered that it would eventually complete, I thought something had crashed. Alt-Break usually first offered me “unknown application”, and after stopping that, offered NetFetch. Within a second or so of stopping NetFetch the machine responded normally, and my code was still running.

I think the “unknown application” might be a helper being started by NetFetch, but there’s no way for me to confirm that.

I have NetFetch configured to fetch and delete from the server, and on all these test occasions, nothing was debatched into Messenger Pro. That leads me to assume there was nothing to fetch. I could, I suppose, have used the webmail interface to see whether something too big to fetch was present, but when that happens the fetch is normally pretty slow before quitting. This was fast, if my code wasn’t present.

The discussion is here https://www.riscosopen.org/forum/forums/11/topics/16274?page=5. There’s rather a lot of it.

Apr 2, 2021 8:23pm

Alan Adams (2486) 1149 posts

If you follow the discussion in the thread mentioned you’ll see that later on I found that I could reliably crash Messenger Pro with my code running. This was traced to my use of Wimp_SetPalette in a callback, which is unsafe. It sends a sequence of VDU codes which, it is suspected, can land in the middle of a sequence of codes from the interrupted application. I’ve changed the method of indication to avoid this call. Again, not the fault of Messenger Pro. The reason these two showed up the problem is simply that they are the most active background tasks on the machine.

In the early development stages, I also crashed Filer_Action in a similar way.
