Debugging locked Application
Martin Avison (27) 1494 posts |
I am trying to investigate an application which, under some weird circumstances and only on a remote user machine, ‘locks’ the machine with a never-ending hourglass. It is recoverable by using Alt-Break to Stop the offending task, but I am unable to diagnose the problem without some clues where the app is looping. Any suggestions how I could get any debug information welcome! |
Chris Hall (132) 3554 posts |
If the application is written in BASIC then add some TRACE statements and PRINT statements and do a *Spool filename before running it. If it is not written in BASIC then someone else may help. |
Martin Avison (27) 1494 posts |
It is not Basic – that I can debug. It is a C program. And yes, I could put some debugging in that, but it might take several iterations as it is 60k lines – and rather tedious for me & user. |
Julie Stamp (8365) 474 posts |
I think you can do it, with a little bit of work. Take your Report button for Alt-Break, and modify it to set a non-transient callback. After the user presses cancel, it will fall into your callback environment handler, where you have full access to the task user mode registers. Send them then to Reporter, or whatever else you would do with them, and then return from the callback in the usual way. (I think this isn’t entirely foolproof, as if a callback is already set it will be missed, and C does use callbacks, but this hack will at least let you log the task state, even if it might crash after that.) |
Sprow (202) 1158 posts |
Any suggestions how I could get any debug information welcome! If you wanted to get a big list of all the functions that were called in the order they were called, like Proceed as follows:
You might need to sneakily edit your stubs to rename |
Rick Murray (539) 13840 posts |
I note the uncertainty. ;-) The last time I checked (around Y2K), the profiler stuff needed to be linked with ANSIlib as it wasn’t part of Stubs. Alternatively… How does Alt-Break kill a task? Is it something that can be trapped (say, by atexit() ?). I’ve never tried, but perhaps put together a simple backtrace handler to output a list of functions?
Well, if it locks up the machine with an hourglass, there’s your first clue. All the bits of code that turn on the hourglass. If you have four or less, set the LEDs to a pattern (off, top, bottom, or both) which will tell you which hourglass part is the problem. And, yes, it might take a few iterations. Unfortunately it’s like that if something works for you but fails elsewhere. |
Martin Avison (27) 1494 posts |
Unfortunately 43 >>> 4 ! Thanks for all the suggestions – some interesting ideas, but still evaluating. |
Steve Pampling (1551) 8170 posts |
Rick’s suggestion with the obvious stated: If fault with pattern 1 If fault with pattern 2 repeat binary search until one instance… |
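As a rough worked example of the search described above: each test build narrows the candidates by a factor equal to the number of distinguishable outcomes (four with two LEDs, two with a plain pass/fail). The helper below is only an illustrative sketch; the name `rounds_needed` is made up for this example and is not taken from any tool mentioned in the thread.

```c
#include <assert.h>

/* Number of test builds needed to isolate one faulty site out of `sites`
 * candidates, when each build can distinguish `patterns` outcomes
 * (e.g. 4 for two LEDs: off, top, bottom, both). */
static int rounds_needed(int sites, int patterns)
{
    int rounds = 0;
    long covered = 1;   /* how many sites `rounds` builds can tell apart */

    assert(sites >= 1 && patterns >= 2);
    while (covered < sites) {
        covered *= patterns;
        rounds++;
    }
    return rounds;
}
```

For the 43 hourglass sites mentioned earlier, `rounds_needed(43, 4)` gives 3 builds with the four LED patterns, against `rounds_needed(43, 2)` giving 6 with a simple halving search.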
Martin Avison (27) 1494 posts |
Thanks – but I am well aware of the beauties and uses of a binary search. |
Steve Pampling (1551) 8170 posts |
Telling a third party’s external support how to locate the source of a loop in their network? :) |
Rick Murray (539) 13840 posts |
What I would do in this situation… Prefix every Hourglass call with something to output a message to DADebug (I’m a big believer in DADebug!). [but, you know, could use Reporter too! ;-)] |
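A minimal sketch of what prefixing every hourglass call with a debug message might look like in C. Everything here is an assumption made for illustration: `debug_log`, `hourglass_on_at`, `hourglass_off_at` and the two macros are hypothetical names, the logging goes to `stderr` as a stand-in for DADebug or Reporter output, and the real hourglass calls are represented only by a counter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static char last_site[128];      /* most recent call site, kept for inspection */
static int  hourglass_depth = 0; /* stand-in for the real hourglass state */

/* Assumption: replace the body with your DADebug or Reporter output call. */
static void debug_log(const char *text)
{
    fprintf(stderr, "%s\n", text);
}

static void hourglass_on_at(const char *file, int line)
{
    assert(file != NULL);
    snprintf(last_site, sizeof last_site, "hourglass on at %s:%d", file, line);
    debug_log(last_site);        /* log first, so the message survives a hang */
    hourglass_depth++;           /* the real hourglass-on call would go here */
}

static void hourglass_off_at(const char *file, int line)
{
    assert(file != NULL);
    snprintf(last_site, sizeof last_site, "hourglass off at %s:%d", file, line);
    debug_log(last_site);
    if (hourglass_depth > 0)
        hourglass_depth--;       /* the real hourglass-off call would go here */
}

/* Drop-in macros: each use records the file and line it was called from. */
#define HOURGLASS_ON()  hourglass_on_at(__FILE__, __LINE__)
#define HOURGLASS_OFF() hourglass_off_at(__FILE__, __LINE__)
```

With the existing calls replaced by `HOURGLASS_ON()` and `HOURGLASS_OFF()`, the last "hourglass on" line in the log with no matching "off" points at the region where the program is stuck.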
Martin Avison (27) 1494 posts |
I had been hoping to get some clues without providing a recompiled debug version to the user, but all the responses required recompilation anyway. So I opted for simply adding some Reporter debugging at strategic points, and progress is now being made. Thanks for the suggestions. But it would be very useful if a watchdog Alt-Break Stop could provide some clues as to what was being executed, even if in a largish loop. |
Alan Adams (2486) 1149 posts |
And if any changes are made to Alt-Break, please add a “previous” button. Skipping through the long list to find the one to kill, then pressing next once too often is frustrating. |
Chris Hall (132) 3554 posts |
Not half! I suppose it is out of the question for the list of active tasks to be shown with radio buttons to select the task or tasks to be ended? |
Julie Stamp (8365) 474 posts |
I think it might be possible. I’ve thought before that you could have a Kill option alongside the Quit option for a task in the task manager. The Quit option would be the same as now, i.e. sends Message_Quit to the task, but Kill would do the same as Alt-Breaking a non-active [1] task, i.e. does Wimp_CloseDown, which means that the task never gets control back (so for example no atexit() routines would be run). I didn’t think too much about that though because I don’t know when you’d want to use that? But this wouldn’t provide all the functionality of the current Alt-Break error box for killing non-active tasks; the one extra thing Alt-Break does is it stops the Wimp delivering messages to tasks. So if you want a dialogue box with the list of tasks on Alt-Break, you could make it so that the usual error box offers two options, kill current task, or kill other task. If you click ‘kill other task’ the current task is allowed to run until it calls Wimp_Poll but all tasks except one have been marked as ‘suspended’ – the one exception is the new task started to run the dialogue box that appears. A DDT-style no-entry sign could be used to show that everything else is frozen, or perhaps the screenshot could be cross-hatched or similar. Personally though, when I Alt-Break it is to kill the current task, so I’d be interested to hear what use people make of it to kill other tasks, to see which of the above two options is more appropriate. Another small improvement to the Alt-Break is that at the moment if you use it during a single-tasking task, it calls it ‘Unknown’. More helpful might be to put the command that started the task, as returned by OS_GetEnv. [1] That is to say, a task that did not have control (‘paged out’) when Alt-Break was pressed. |
Rick Murray (539) 13840 posts |
I think usually when one uses Alt-Break, it is an attempt to recover a machine that has hung up, so typically it will be the current task being ended. I have, once or twice, used it to kill off a different task (when I do dumb things like forget to make a program respond to Quit messages), but it’s not a particularly friendly interface and I’m just as likely to accidentally drop a nuke on the wrong task. A small improvement to Alt-Break would be if you are stopping the task that is currently paged in, and the task has a valid stack frame, offer a Backtrace button that will output a backtrace, dump the registers, and also about ten or so instructions either side of PC at the point where the program was killed. |
Chris Hall (132) 3554 posts |
It is a pity that ALT-Break does not return control to BASIC (for an application written in BASIC that is) so that the errant line and the variables can be examined. It kills off the whole app slot, including the run-time stuff for BASIC. |
Rick Murray (539) 13840 posts |
That’s the idea. It’s the sledgehammer approach to task management. |
Alan Adams (2486) 1149 posts |
A recent example was trying to debug some software running off vectors, which locked up NetFetch. Alt-Break offered an apparently randomly chosen task to kill, and experience showed that the one which recovered control of the machine was NetFetch. (It turned out that NetFetch’s fetch entered the file vectors about 45000 times, even if there was nothing to fetch. This resulted in extreme slowdown, and did eventually stop on its own, after about ten minutes.) |
Andrew Rawnsley (492) 1445 posts |
Hi Alan, could you clarify this please. NetFetch itself shouldn’t really be doing anything weird like that – it is userland C. Does Hermes trigger it if run separately from !NetFetch.Apps please? Basically, more clarity on exactly what you click to trigger the issue would be super-helpful to understand what’s going on, and where the bug is (assuming there is one!). NetFetch itself just calls the Hermes module via a SWI (documented in the Hermes help files) to trigger a fetch. One thing, though – if it is Hermes, check in some of the !Hermes.MailDir files. If you’re doing things like “leave mail on server”, and you have a ton of mail on server, it might be checking headers against a known list which could (I guess) result in a lot of disc access (in which case there may be some logical optimisation that could be done). |
Alan Adams (2486) 1149 posts |
Just for clarity I’m not blaming Hermes/NetFetch for this, just explaining what I was seeing when my code got in the way. I’m working on a disc activity indicator for the ARMX6. It intercepts the 6 filing system vectors, decides whether the activity is read or write, and shows indications accordingly. During development there was a lot of code active, and this considerably increased the time consumed by my code. As part of my attempts to work out what was happening, I counted the activity on the combined vectors and displayed it, along with all the other output. I’ve checked back over the online discussion. That count reached somewhere in the region of 18000 during the fetch. It was triggered in either of two ways. If NetFetch wasn’t running, I started it. It is configured to fetch immediately. If it was already running, a right-click on its icon triggered the same activity. Without my code in the way, the return was, visually at least, immediate. With my code it took around a minute. (I said earlier in this thread that it was ten minutes. I was wrong. It was around 1 minute.) During this time the pointer, whose colours I was changing, seemed to flicker at a very high rate. It turned out that despite attempting to limit the number of changes, a coding error meant that I was changing the colour for at least half of the vector entries. Changing the colour is slow, in relative terms, at least if using Wimp_SetPalette. I’m using a different method now. Before I discovered that it would eventually complete I thought something had crashed. Alt-Break usually first offered me “unknown application”, and after stopping that, offered NetFetch. Within a second or so of stopping NetFetch the machine responded normally, and my code was still running. I think the “unknown application” might be a helper being started by NetFetch, but there’s no way for me to confirm that.
I have NetFetch configured to fetch and delete from the server, and on all these test occasions, nothing was debatched into MessengerPro. That leads me to assume there was nothing to fetch. I could, I suppose, have used the webmail interface to see whether something too big to fetch was present, but when that happens the fetch is normally pretty slow before quitting. This was fast, if my code wasn’t present. The discussion is here https://www.riscosopen.org/forum/forums/11/topics/16274?page=5. There’s rather a lot of it. |
Alan Adams (2486) 1149 posts |
If you follow the discussion in the thread mentioned you’ll see that later on I found that I could reliably crash Messenger Pro with my code running. This was traced to my use of Wimp_SetPalette in a callback, which is unsafe. It sends a sequence of VDU codes, which it is suspected can occur in the middle of a sequence of codes from the interrupted application. I’ve changed the method of indication to avoid this call. Again, not the fault of Messenger Pro. The reason these two showed up the problem is simply that they are the most active background tasks on the machine. In the early development stages, I also crashed Filer_Action in a similar way. |