Memory allocation woes
Rick Murray (539) 13840 posts |
Does anybody have some pointers as to how I might track down what’s going wrong with a failing memory allocation that results in: * unrecoverable error in run time system: Not enough memory, malloc failed, (heap overwritten) I’ve wrapped the claims in some logging code and there doesn’t seem to be anything spurious claiming a lot of memory. Knowing how things are, all of the string routines that take arguments (instead of simple stuff like writing four bytes into a larger buffer) use the ‘n’ form calls (strNcpy, sNprintf, etc). Is this correct for calling the Squash module? __asm { MOV R0, 0 // New operation, all data provided MOV R1, workspace // char * -> allocated (temp) MOV R2, data // struct -> allocated (input) MOV R3, size // int MOV R4, sqbuffer // char * -> allocated (output) MOV R5, size // int SWI (SWI_Squash_Compress + XOS_Bit), {R0-R5}, {R5}, {R0-R4, PSR} MOV remaining, R5 } I could understand it blowing up if, say, I used ‘&sqbuffer’ or whatever as that would point to where the sqbuffer value is, rather than where sqbuffer points to… Also, is there a CLib internal routine I can press into service to drop in some calls to verify the heap integrity at various points to see if I can narrow down where it’s going wrong? That being said, it looks like _primitive_alloc() calls RELEASEANDRETURN for several reasons. Hmmm…? Finally, is this error catchable? Does the library throw in the towel and nuke the program, or is it possible to perform a sanitised exit (like, not leaving files open etc)? It doesn’t seem to be doing anything with trapping SIGOSERROR, the program just bombs out. |
Charles Ferguson (8243) 427 posts |
https://www.riscosopen.org/forum/forums/2/topics/17301
No, you’re using assembler in C code for a SWI call. SWI calls are going to be slow. Assembler is for optimising code that needs to be fast. There is little to no reason to be using it here. If you’re got assembler in the code, assume that that’s wrong first. You’re corrupting memory, so you need to remove all the likely culprits. Having assembler involved is one case. You’ve not checked whether the X bit was set, but assumed that if it is the value in R5 is valid, so the assembler is wrong from just a cursory glance. You’ve not stated the context in which it was called – whether you allocated sufficient memory for the workspace block, your input block, or your output block. So it’s impossible to know whether what you’ve used is correct. You’re called R5 ‘remaining’ – to be clear that’s the number of bytes remaining in the output buffer – but you’re ignoring the number of bytes remaining in the input buffer. If you want a way of debugging things, rip out your assembler and write it with regular C calls using either Or just lift the compression code and test that independantly, before introducing it back into the made code base. |
Paolo Fabio Zaino (28) 1882 posts |
As suggested by Charles, I’d avoid using inline asm to call a SWI. MOV R2, data // struct -> allocated (input) Given you have indicated which registers contain a pointer (on other comments) while for this one your comment mention a struct, I’d double check this one and make sure it’s a pointer too (unfortunately you did not share where you have defined “data” and the other variables). Squash_Compress requires a pointer in R2 as well when bit 3 in R0 is clear. I’d also add a MOV status, R0 after it returns from Squash_Compress, to check the status of the operation, beside checking if R5 is less than 12, it should return 1 if R3 = 0 after the call and, if so, stop your loop (if you’re using a loop I’d imagine). Without more context it’s hard to see what is wrong. Sorry. |
Rick Murray (539) 13840 posts |
It’s a call to an assembler-based API (the OS). The libraries I use don’t have support for Squash, and while I could add it, it’s mostly going to be the exact same code, just in a .s file rather than inline…though, as you rightly point out, with error trapping. ;)
Hmm, a hundred instructions to do the job of eight… Thing is, no matter what method I use to call the SWI (kernel, swix, inline), if I pass in the same parameters I’d expect the same results, no? It is documented (StrongHelp and here) as only using R0-R5, so there shouldn’t be any register corruption as I’m telling the compiler what it’s using.
The output buffer is as large as the data being passed in, and given the data format (lots of nothing) it’s extremely unlikely to suffer pathological cases where the output is larger.
All in the form
It’s currently a one-shot. I think the next stage will be to #ifdef around the compression code to knock it out entirely, retest everything to ensure that I haven’t done anything dumb along the way… …and then to make a little test harness for testing the compression by itself. A bit of pointer faffery ought to allow me to allocate two extra words either side of the desired allocations, to use as guard words to ensure that nothing is being overwritten. Ah, the joy of debugging… :/ |
Rick Murray (539) 13840 posts |
I’m on break at work and… Not enough memory, malloc failed, (heap overwritten)What actually went wrong here? A failure to allocate memory? If so, is that why the heap got messed up? The slot should be big enough (it’s an alloc of a couple of megabytes on top of an ~18MB block within a 32MB slot). |
Julie Stamp (8365) 474 posts |
I think the error means something has corrupted the heap. When malloc() was called, I wonder if that’s a mistake in the messages? Maybe for your dropping in checks you could call malloc and immediately free it? |
Charles Ferguson (8243) 427 posts |
How it’s implemented is irrelevant. You have no reason to use assembler there. You’ve created something that’s corrupting memory (and the error is telling you that), in something that is overly complex. Remove the complexity. You have applied a premature optimisation to the code and now that code has gone wrong. Whilst it might be ‘mostly… the exact same thing’, it will be a simpler and less error-prone thing. You asked how to debug the problem, and I’m telling you what you have done badly, and should do about it. There is little defence in using assembler where safer interfaces exist to do the same thing in this instance. Create the library routine, then you’ve got it and you can use it again in the future, knowing that it works.
“The efficiency of incorrect code is zero”, so that doesn’t fly from the outset, never mind the fact that you’re calling a SWI, which will have a large overhead, to compress data, which will have an even larger overhead. You’re using assembler where it isn’t inappropriate. You’ve got your assembler wrong. You’re corrupting the heap, and you’re using overly complex code. These are all things that you can fix, and make the code more maintainable, easier to read and easier to debug. It will make the code less fragile, and mean that you can stop worrying about calling conventions and instead worry about the actual problem of where you’re corrupting memory. To reiterate what I’ve said in more general terms, you want to optimise the code that matters – where it’s important to have speed. For the other places, optimising code with assembler, or other complex methods, will increase the complexity, decrease the readability of the code, and increase the risk that something will go wrong. Remove the complexity. Don’t write ‘clever’ code, write working and maintainable code. In this specific case, you’ve got hung up on the fact that you’re probably calling the code wrongly. So call it more clearly, with a more obvious way of handling it.
In your original post you asked whether you’re doing it right, and now you’re stating that the call is right, with the implication that there’s no benefit in the clearer and less error-prone code. I don’t know what those parameters are, because you’ve not given the relevant information about the code. However changing the code may show you that you’ve done it wrong, and as a side effect gives you better code that is more maintainable. Making the code more readable and easier to debug when something goes wrong is the first order of business. More generally, arguing that you’re doing things right, when you’ve asked for help because it’s not working, and advice has been given to make it easier to work with and help do things better, is unhelpful to your cause.
Where things are ‘unlikely’ and you’ve got a problem, you need to reduce ‘unlikely’ to ‘not possible’. If you assume that your input data is correct, then when that assumption goes wrong you’ll find that the result is a breakage which results in bad behaviour – which might include corruption of the heap as you’re seeing. Just because you assume it won’t happen, doesn’t mean that it won’t. If there is a possibility that the call can fail, and you haven’t handled the error case that it does fail, and the failure might result in heap corruption then it is a case where the circumstances that you are seeing might happen. It might be utterly unrelated to the problem that you’re seeing but fix the problem anyhow so that you at least handle that case.
Well that’s great, but you’ve got memory corruption happening, so merely asserting that those are all fine is clearly not going to help with the process of providing advice for debugging the code – clearly something is wrong, and you’ve hand-waved away the things that are possible causes. As stated, it’s unclear whether you’ve done that right – have you passed in the wrong value to calloc, perhaps? I don’t know, but if you’d done
Or use the fortify library that I suggested, which does that?
As stated in the responses on this thread, it’s telling you that the heap is corrupt. It’s failed to allocate memory, and malloc has failed because the heap is corrupt. Use something like Fortify to wrap the memory allocations and help to identify the problem. |
Colin Ferris (399) 1814 posts |
A good accouce for Rick to try fortify :-) |
David Pilling (8394) 96 posts |
Crumbs, my life has been a lie. I was always inclined to use my own memory managers both sliding heap (flex) and malloc style. Keeps damage away from the OS and you’re in control. Of course they have ended up with code that checks if there is any corruption, so I can see when it happened. If there was any doubt about how the Squash SWI works, produce a small bit of Basic that just does the problem call. |
Stuart Swales (8827) 1357 posts |
It’s the only way to write non-trivial applications, since year dot. There are too many times you can call into the C runtime and it just doesn’t come back. Program ended, game over. Same with the original library that became RISC_OSLib, routines needed sanitising to deal with resource-limited (and other) failure paths rather than wimpt_die(). In ye olden days, I’d take a copy of AnsiLib, patch its symbol table entries for malloc/free as zalloc/zree which forced all allocations, including those made by the C runtime, to go though my replacement malloc/free. I’d also warn about reading very carefully the API for strncpy and strncat. For many years, I’ve used strcpy/cat/ncpy/ncat equivalents that take dst_buffer and dst_buffer_bytes/elements as first arguments and ensure NUL-termination (within the buffer) of dst for consistency. |
Rick Murray (539) 13840 posts |
Hmm… Don’t write assembler in the program, write it in a library routine instead. ;)
As it turns out, it was nothing whatsoever to do with that code. As a lot of tracing behaviour with DADebug showed, as well as extending my own memory tracing to add and check guard words (was simpler than downloading and setting up Fortify, though thanks for the reference, it’ll be useful in the future).
Actually, sorry, all wrong. When I posted the assembler, it wasn’t for a critique of my coding style, it was to check that using it in that way was syntaxtually correct and that something wasn’t blowing up because it needed the &pointer syntax or something like that. I’d already checked all the obvious (to me) things. As it turns out, the little bit of assembler that you’re getting worked up about was nothing to do with the problem. As for overly complex code, there’s a reason why I didn’t use the example presented in the Squash application source (and not just because it’s sometimes passing bogus flags).
Good grief, calm down already. Paolo said that I hadn’t defined what the variable definitions were, so I mentioned it.
Sorry, no cookie for you. Error: undeclared name, inventing 'extern int calloc()' Error: '=': implicit case of non-0 int to pointer While we are clearly going to have to disagree on the use of inline assembler, one of my quirks as a programmer is that a release version of a project should build with ZERO compiler messages. No errors, no warnings, nothing. So you’d think I’d notice if the compiler was telling me something’s amiss.
Yup, working (almost right now) on improving the code to not ignore what’s returned. ;)
Because I don’t come to the forum as a first point of call. I do try to look at what’s going on myself first. So not only are the sizes passed to calloc known to be correct, they also worked in earlier incarnations of the program.
Yeah, the standard C library is a bit of a mess, isn’t it? Is that file handle the first or last parameter (fputc(), I’m looking at you!)? The string length goes where? Does it terminate if it overruns or does it return bollocks? It’s surprising such inconsistencies made it to an actual ISO standard. How can any string routine given valid (non-null) input justify returning something that isn’t a valid string?
Which isn’t really what you want in a library, but I guess a program hard-aborting and potentially leaving half-written (and thus damaged) files is “the way we roll”.
Uh-hu, this tickled me. God forbid a program might want to shut down sensibly… I guess one would have to abuse the atexit handlers for that. :/ switch (e->data.msg.hdr.action) { case wimp_MCLOSEDOWN: tracef0("closedown message: for the moment, just do it.\n"); exit(0); Now, as it turns out, some code somewhere completely different that was being called just prior to the file saving routine was checking elements within its claimed memory block, only it was overshooting by one element (was checking up to and including MAXOBJECTS rather than only up to (as array counts from zero). It wasn’t corrupting all the time as it checks that it’s the right type of object before messing with it, but if the type word that it was seeing had the expected value, then it would modify and… …a little later on another attempt at claiming memory would blow up as something internal to the heap would have been trampled on. That assembler? Not the issue at all. A bit of a red herring, but flagged as it was added at around the time I noticed the problems, but ultimately no show-stopping problem occurring there. Just exercised the saving a bit and nothing has blown up on me. ;) |
Colin Ferris (399) 1814 posts |
It is interesting that when asking for help – Ding the answer pops up :-) |
James Pankhurst (8374) 126 posts |
Does that make us rubber ducks? |
Charles Ferguson (8243) 427 posts |
You explicitly asked for ways of pointers in tracking down what was going wrong with a memory corruption problem, and posted a piece of assembler. The responses I have given were to advise how you would track down the problems. By removing bad code smells like unnecessarily complex code – the use of assembler – and suggesting ways of isolating the problems. And I stated that the problem lay elsewhere and that you’d not provided enough information. I’m sorry you don’t like the advice given, but that advice still stands. However, I’m glad that you’ve managed to integrate Fortify with your code and now have another tool in your belt to help with these problems. |
David J. Ruck (33) 1635 posts |
That might not be what you want, but that’s what you are going to get! Everyone has to suck up the code reviews these days. |
Steve Pampling (1551) 8170 posts |
It was ever thus. 1 Very, very severely limited. |