Automated testing and CI
Rick Murray (539) 13850 posts
Sure they are. Testing is what happens when the user runs the program. :-)

The question I have about automated testing is that, really, you are only ever testing known knowns. For instance, if there is an FP test suite, it can verify that FPE generates good results. If there’s an FP implementation that hands processing off to the VFP or NEON units, it can verify that the results are good. But if there’s a new 128 bit super extended FP format, well, such a set of tests would have to be written and validated before the code in question can be validated.

Which would suggest that, provided there aren’t a load of tweakers and tinkerers involved with a project, shouldn’t a bit of code behave in the same way once written and then left alone? It might not be necessary to test a function to convert a set of characters to lower case 1, but surely if you’re compacting a linked list you’ll have cobbled together something to exercise the code prior to putting it into a released application? And then it ought to continue working as designed.

That’s not to say that no testing should be done; that would be quite dumb. I just feel that automated testing is only as good as the tests it performs, and a lot of edge cases might slip by because the person who devised the tests just didn’t think of those cases. Real world, that’s where people get involved, and people are better than any test suite at breaking stuff. ;-)

What might be better, given the, um, specifics of RISC OS, is to code defensively and not just blindly accept any old crap handed to your routine. I mean, if the address you receive is 0 (or anything less than &8000) then there’s a pretty good chance that accessing it will cause something to fail. So at the times where it matters, a few simple checks could avert a world of pain.

I think, and this is where the Pyro project could be interesting, as well as emulation, that testing on RISC OS itself is harder than it needs to be.
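As an illustration of the kind of defensive check being described (a sketch, not from the original post: `safe_copy` and `MIN_VALID_ADDR` are made-up names; &8000 is where RISC OS application space starts, which is why anything below it is suspect):

```c
#include <stddef.h>
#include <stdint.h>

/* On RISC OS, application space starts at &8000; any pointer below
 * that is almost certainly bogus, so reject it up front rather than
 * crashing somewhere deep inside the routine. */
#define MIN_VALID_ADDR 0x8000u

/* Hypothetical routine: copy 'len' bytes, refusing obviously bad input. */
int safe_copy(void *dest, const void *src, size_t len)
{
    if (dest == NULL || (uintptr_t)dest < MIN_VALID_ADDR) return -1;
    if (src == NULL || (uintptr_t)src < MIN_VALID_ADDR) return -1;
    if (len == 0) return 0;

    const unsigned char *s = src;
    unsigned char *d = dest;
    while (len--) *d++ = *s++;
    return 0;
}
```

A couple of comparisons at the top of the routine cost almost nothing, and turn a machine-stiffing wild write into a polite error return.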
There’s not a lot of help testing BASIC. As for C and assembler, DDT is a painful mess of interacting bugs 2 that has far too many quirks and limitations to be useful in day to day use. And if you’re doing anything more complicated, all too often the result is that the machine just stiffs with no clear indication of how it got to be in that state.

So one of the best testing tools that could happen to RISC OS is something that would allow the system to crash, and then let the user unwind it. Maybe then those little annoyances like pushing six registers to the system stack and pulling five (since we’re still writing so much in assembler, even these days) might become clear if we can see it happening, as opposed to muttering “motherf….” and prodding the reset button. Again.

1 I trust you’re using Territory, people, and not just ORing in the ‘lowercase’ bit…
2 My dislike of DDT is well known, though it must be said that debugging by spewing data to either DADebug or the serial port is positively Neolithic.
Steve Fryatt (216) 2105 posts
Which leads us to an observation made recently, and apparently seriously, in another place that people shouldn’t complain about their web browser crashing because complex software like a web browser always crashes.
Well, yes… that’s the point. Generally you’d write the suite to test the code being tested, and prove that it does what is expected for all the edge cases that you can think of. And then add more tests if reality intervenes, to pick up the things that were overlooked. Either way, if the module (not in a RISC OS sense) being tested changes, your test suite has also changed.
First, the overhead of the tests is probably minimal. If the code doesn’t change, then neither do the tests. In a RISC OS context, you could probably “mothball” them from daily use, and just run them when making releases. But how often does code remain unchanged forever? What about that small tweak that gets done, which is too small to be a problem? Well, you can test it easily now, can’t you… and probably find that it actually breaks something unexpected… :-)
I think I’d beg to disagree, having spent the past decade testing stuff (although not necessarily software, I must admit). First, there’s degrees of automated. What Gerph spoke about last week wasn’t by necessity automated: it was just test suites which could be built and run at the CLI, and by extension be thrown at an automated system. Even if you don’t automate it, it’s still a hell of a lot easier to run a test suite from the CLI and check that none of the cases fail, than it is to test that comprehensively in a live Wimp application. Because it’s a lot easier to write tests that exercise all of the awkward bits that you can think of, quickly and easily. And easier to debug a failed test if it’s just testing one function in isolation.
Well, yes, but you probably then want to try passing
Which brings us back to the whole point of last Monday’s talk… This stuff is hard, so if you can, you do it in an easier way. You test the building blocks simply, in a way that you have some control of, so that when you come to test the whole application you have
:-)

ETA: And yes, I’m well aware that I don’t test my RISC OS stuff anywhere near well enough: most of my stuff on the platform dates back to well before I had any idea about this kind of thing. I started putting unit tests into SFLib at the back end of last year, and wish I’d done it sooner as it has made things so much easier to work on.

PPS: It should be fairly easy to do this kind of testing in BASIC, too. In fact, IIRC, some of the examples last Monday were in BASIC (or was I imagining that?).
Charles Ferguson (8243) 427 posts
Quoting Theo:
Interesting; I suspect it’s less useful from that side, but I would guess at the very least being able to confirm that what was built was executable would be a step in the right direction. As discussed privately, Pyromaniac isn’t really suitable for heavy use, like GCC builds – it’s just too slow really. Building on a sane system, and then executing the results makes more sense.
Other than the form of results being reported, I don’t think it makes too much sense to have a single definition of the tests and how to run them. Tests may be executed in many forms, and the environment in which they execute matters a lot. You can’t really provide an all-encompassing format that expresses all the dimensions that the tests might take without – essentially – building a programming language that has the ability to do everything. Having some frameworks available for people to use is great, but you cannot settle on just one to cover everything, I believe. Within a given domain, you can certainly provide definitions of tests, but I don’t know that it even makes sense to do that more generically.

As for the manner of reporting test results, I’ve pretty much settled on using the JUnit XML representation, in ‘whatever format works with the system you are trying to use it with’ (because oh dear gosh it’s a terribly inconsistent format, with different versions mutually incompatible with each other). My reasoning is that JUnit XML is the most widely supported format, and – for me – it integrates with all the tooling I use (GitHub, GitLab, Jenkins and my own tools as consumers, and the testing tools I’ve written as generators).
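For readers who haven’t met the format, a minimal JUnit XML report looks something like this (a sketch: the `testsuite`/`testcase`/`failure` elements are the common core, but the exact attribute set varies between consumers, which is exactly the inconsistency mentioned above):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ringbuffer" tests="3" failures="1" time="0.042">
  <testcase classname="buffer" name="wrap_at_end" time="0.010"/>
  <testcase classname="buffer" name="empty_message" time="0.008"/>
  <testcase classname="buffer" name="off_by_one" time="0.024">
    <failure message="read past buffer end">expected 64 bytes, got 65</failure>
  </testcase>
</testsuite>
```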
If we assume that this is going to be entirely domain specific, I would suggest that you should not be controlling the tests in quite that way. ‘all’ may implicitly include tests that never complete, or that take huge lengths of time. It is important to be aware of what gate 1 you are trying to pass to be able to execute the tests. Simple developer tests should (generally) not take a long time, whilst tests for a pull request/merge request/review may want to be heavier, and pre-release testing heavier still. So any definition that you provide may want to be specified in terms of the gate as one of the parameters.

Similarly, the scope of the tests is often used to describe the sequence of testing, as the narrower-scoped tests will report problems more usefully than the wider-scoped tests (eg run your unit tests first, before moving on to integration, etc). And of course some people will want different policies on the ordering and what goes into each sequenced set of tests. All of which makes me think that you cannot create any form of specification which satisfies all of those dimensions. Which is kinda why I suggest that it’s domain specific. For an individual use case, defining how the tests are to be executed and what exists is useful.

It’s only tangentially related, but I use not a definition, but a declaration of what tests are present, in a header at the top of my test description files. These don’t say how to run them, just what they are, like this (https://github.com/gerph/riscos-tests/blob/master/testcode/tests-core.txt):

# SUT: Kernel: Core SWIs
# Area: Output, conversions, execution
# Class: Functional
# Type: Integration test

Then I can use a little tool to produce reports on what is available (even if they’re not used) from the source tree (SVG: https://share.gerph.org/s/L0dTeIB51bGolbc PNG: https://share.gerph.org/s/zkYbwhMPdlbDjrN). So I can definitely see advantages in being able to declare what things are available.
I guess as I think about this more I’m liking the idea, but I’m unsure how you might do it. Maybe I’m being put off by the whole ’there’s too much to think about here’, and should just shut up and let someone try to solve it instead of saying why I don’t think it can be perfect… ‘cos that’s not helpful.
This was one of the key points in my presentation – I know you were there, but I’m responding to things after the event, for future readers. The point being that you shouldn’t use the ‘it doesn’t exist so I won’t do it’ excuse. I agree it’s a disincentive, but… you create some tests – they don’t have to be hard – and then you build on them. What you say is true, that not having anything makes it an extra hurdle to get over, but honestly, if that hurdle is ‘write a program that returns 0 when it was successful or crashes if not’… that’s actually not that bad. Then you can start raising your game.

1 I’m using gate in the engineering sense of ‘what conditions must be met to allow the code to pass through it’. Developers will want tests at the ‘interactive’ gate to pass, generally, then move on to maybe ‘commit’ gating tests, before passing ‘mainline’ tests to be accepted into the main tree, and ‘release’ gated tests for the generation of the release.
Charles Ferguson (8243) 427 posts
Quoting Rick Murray:
Yes it ought to – and that’s /why/ you keep the test around and you run it all the time. Because when it breaks, and it will, you want to know about it. You never assume that the code is left alone – because at some point someone will want to change it, or it will become broken because of something you just couldn’t envisage happening.
Of course they will. But automated testing is about trying to reduce the chance of those failures reaching the outside world. Any failure that reaches a user is an embarrassment, and you should be ashamed of it. Not cripplingly, obviously, ‘cos you’d give up and crawl into a hole for 15 years if that happened, but enough to make you want to be better. Any failure that reaches a user that you see a report from is probably just the tip of the iceberg of users that saw a problem and said nothing. And how many of those users just walked away because everything was just too hard?

As for the more serious point that the tests are only as good as what you write… yes, that’s true. This is one reason why developers and testers tend to be different people in serious software companies – professional testers have a mindset that is somewhat warped and intentionally adversarial, whilst developers are inherently optimistic that the code that they wrote will work 1. You can always find more problems and more cases where things don’t work, because as a developer you suck (that’s just the way it is… you’ll miss things and you’ll cry at how stupid you were). But the more testing you have, the more light you shine on the code, and the fewer places there are for the bugs to hide.
{facetious} DDT is not a testing tool. It’s barely a debugging tool. {/facetious}

1 Ok, not all of them, but humility is important :-)
Rick Murray (539) 13850 posts
Yeah, that’s a bit of a touchy subject right now. As I was writing thoughts on the Eurovision final (live, so no ability to pause), I was also looking at Vince’s thoughts on Twitter on my older phone.
You and I both know that such a thing is akin to “what could possibly go wrong” in terms of tempting fate.
Certainly, and it gives you the opportunity to examine, store, and compare the output of the function. Not something that is necessarily possible when it’s embedded into a Wimp program. This comes into part of the “testing as it’s being written” idea. I also make fairly extensive use of DADebug to allow tracing through a live function, though this is usually omitted from release software. While I may remove the test code once stuff has been tested, I make a habit of wrapping my debug messages in #ifdef so they can be linked in easily at any time. Because, yes, that little tweak that won’t change anything… been there, seen that. Yes, I’m conflating testing and debugging. Because to my mind, code that doesn’t work needs fixing. That’s the basic definition of debugging.
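The #ifdef wrapping mentioned above typically looks something like this (a sketch: the `dprintf` macro name is illustrative, and the output goes to stderr here where a RISC OS build might route it to DADebug instead):

```c
#include <stdio.h>

/* Build with -DDEBUG to link the tracing in; without it, the calls
 * compile away to nothing, so the release binary pays no cost. */
#ifdef DEBUG
#define dprintf(...) fprintf(stderr, __VA_ARGS__)
#else
#define dprintf(...) ((void)0)
#endif

/* Example routine with tracing left in place, ready to re-enable. */
int add_checked(int a, int b)
{
    dprintf("add_checked(%d, %d)\n", a, b);
    return a + b;
}
```

Because the macro collapses to `((void)0)` in release builds, the trace statements can stay in the source permanently and be linked back in with a single compile switch.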
I think the first four words are superfluous in my case. :-)
Charles Ferguson (8243) 427 posts
Quoting Steve Fryatt:
I’m not sure I said this in the talk – I did point out that I’d made mistakes in my own assumptions, etc, but I don’t think I said it as clearly as Steve has… so here’s my statement too: I don’t test enough either. I’ve got whole libraries that don’t get any love when it comes to testing, and that’s sad. I should try taking some of my own advice on ways to do things, ’cos like everyone else, I suck too :-)
Rick Murray (539) 13850 posts
I wanted to write a test harness for DeskLib. Then I realised the scope of what that might actually involve, understood that I’m not getting any younger, and that I have a life.
Charles Ferguson (8243) 427 posts
Thank you everyone for the comments so far. I’ll post the results of the survey later in the week; they’re not especially surprising, but as I said I’d do them, I should! For anyone who missed the presentation, the slides and full speaker notes can be found here:
Steve Fryatt (216) 2105 posts
Oh, absolutely. I only wrote the basis of one for SFLib when I became aware (thanks to feedback from – ahem – “field testing”) that some of the routines that I use for copying strings in and out of icons might, possibly, contain some out-by-one errors in their bounds checking. It would have been a nightmare to have tested all of the conditions in the finished applications (yes, plural) where the problems were showing up, but after a week of writing a test harness able to test routines which wanted to use Wimp_GetIconState 1, I was able to see for myself, on demand, what my users were unhappy with. And come away with some confidence that the subsequent fixes were correct, which was another point made last week.

I should probably make the code available on GitHub for others to laugh at… perhaps when I’ve got the May issue of The WROCC to bed, as I notice that it still contains some signs of having been copied hurriedly from Launcher. :-)

1 So a Wimp application calling unit tests on null events and dumping the output to Reporter, in effect. I’ve since pondered linking to a “replacement OSLib” which does simple memory allocation for those calls, but that’s a big job and the quick-and-dirty Wimp application works fine so long as there are some functioning bounds checks in the code being tested to stop the whole OS falling over in a heap.
Alan Adams (2486) 1149 posts
I have two different software problems, and I have problems testing both.

One is the disc activity light project discussed extensively here a few weeks ago. The basic problem there is that the application seems to work perfectly. It’s just that other software crashes around it when it’s running. There doesn’t seem to be ANY way to track down that problem. (The only things I can see are 1: failing to preserve registers through vector intercepts, or 2: altering something outside my memory area. I’ve spent days poring over the code to look for both, without success.)

The second is a 20-year development, which is now in the form of a database server with a large number of clients running on a network of other RISC OS computers. It’s used for scoring a sports competition, so needs to be reliable and reasonably fast. In use it’s highly likely that three of the computers will be updating the same area of the database, and four others will be reading data from that general area too. The only time that degree of concurrency gets tested is when the system goes live, under pressure, with non-RISC OS users using it. I cannot exercise that on my own – I would need 4 arms and as many brains.

I do have a number of fallback options built in – fortunately, as the last time it was used, a debug feature was left in by mistake, which meant the server stalled after about ten minutes of activity. The fallback didn’t exercise that part of the code. Post-event it took me a couple of days to find a way to cause the crash at home in order to find the problem.

Ring buffer code has a remarkable number of possible edge cases, especially when the whole thing is written in BASIC. Most of them are in the “off by one” category. That particular one had existed undetected for at least 8 years.
(Possibilities: the message ends one byte before the end of the buffer; it ends at the end of the buffer; it ends at the first byte of the buffer after wrapping; in each of the above, the next byte is the start of a valid message left behind earlier, or is the end-of-message marker, i.e. an empty message; or, at the wrap point, one byte is missed (written past the buffer end) or duplicated.)

Changing the message protocol to include a length and/or CRC check would make detecting errors here more likely, but that degree of re-writing would likely introduce more errors than it would remove.
Charles Ferguson (8243) 427 posts
I don’t know anything about it; reference to the source or discussion?

There are always ways to track down the problems… You could stick code around your vector traps to check the vector values before and after, in all modes. You could move the whole thing to running in user mode, and simulate the entries so that they perform the right actions but just don’t change modes, so that you can see that they’re doing the right things. You could allocate much more memory, if you think that there’s a crash in that code, to see if it stops happening, which would tell you that you might be overrunning, or introduce boundaries on the code to see that you’re doing the right things.

However, the closer you are to the OS interfaces, the more likely it is that you’ll make a mistake in the interfacing, because there’s not any other code to test against. Essentially you’re making basic mistakes in the integration, which is the hardest part to test :-(

I wrote something that sounds similar, some years ago. It trapped the FS vectors and changed the pointer colour on entry, and again on exit. It’s not especially clever, but it did the job pretty reliably at the time. I’ve just made that repository public if it’s useful:
Steve Fryatt (216) 2105 posts
This is exactly why unit testing stuff is useful, though. Exercise all of the edge cases in the ring buffer, which should be fairly simple in a bit of BASIC, and check that they work. Some ideas off the top of my head:
I’m sure that there are plenty more, too. In each case, you check that the values read back match what the spec says they should be given the (distinct) values that you put in. Then, when you know that your buffer is fairly solid, you can use it as a building block in the larger project. Rinse and repeat for the other simple modular chunks.
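As a sketch of what exercising those wrap-point edge cases might look like (illustrative code, not the actual competition buffer: a trivial byte ring buffer in C standing in for the BASIC original, with the edge-case checks alongside it):

```c
#include <assert.h>

#define BUFSIZE 8

/* A trivial byte ring buffer, as a stand-in for the real thing. */
struct ring {
    unsigned char data[BUFSIZE];
    int head, tail, count;      /* write pos, read pos, bytes stored */
};

/* Both return 0 on success, -1 when full/empty: the tests rely on this. */
static int ring_put(struct ring *r, unsigned char c)
{
    if (r->count == BUFSIZE) return -1;
    r->data[r->head] = c;
    r->head = (r->head + 1) % BUFSIZE;   /* the wrap everyone gets wrong */
    r->count++;
    return 0;
}

static int ring_get(struct ring *r, unsigned char *c)
{
    if (r->count == 0) return -1;
    *c = r->data[r->tail];
    r->tail = (r->tail + 1) % BUFSIZE;
    r->count--;
    return 0;
}

/* Edge cases: fill to exactly the end, refuse the overflowing byte,
 * drain to exactly empty, then run a message across the wrap point. */
static void test_wrap(void)
{
    struct ring r = { {0}, 0, 0, 0 };
    unsigned char c;
    int i;

    for (i = 0; i < BUFSIZE; i++) assert(ring_put(&r, (unsigned char)i) == 0);
    assert(ring_put(&r, 99) == -1);              /* exactly full */
    for (i = 0; i < BUFSIZE; i++) {
        assert(ring_get(&r, &c) == 0 && c == i);
    }
    assert(ring_get(&r, &c) == -1);              /* exactly empty */

    /* Offset the pointers so the next message straddles the wrap point. */
    for (i = 0; i < 5; i++) ring_put(&r, 0);
    for (i = 0; i < 5; i++) ring_get(&r, &c);
    for (i = 0; i < 6; i++) assert(ring_put(&r, (unsigned char)(i + 100)) == 0);
    for (i = 0; i < 6; i++) {
        assert(ring_get(&r, &c) == 0 && c == i + 100);
    }
}
```

The same shape works in BASIC: a small PROC per edge case, each checking that what comes out matches what went in, run from a plain command-line harness.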
Out of curiosity, without solid testing behind it, what would you do if someone contested the scores produced by your system?
Chris Hall (132) 3558 posts
what would you do if someone contested the scores

This was sorted out for cricket centuries ago: the umpire was armed. Disputes were rare and quickly resolved in the umpire’s favour.
Alan Adams (2486) 1149 posts
Go to the paper originals. The judges write down the results. Then they phone them through to the clerks who put them on the computer system. (Previously the clerks put them on cards, and from there the summary was put on computer.) I have no plans to put computers on the river bank, not least because they are more than 100 metres away, so needing ethernet repeaters (don’t even think of using wireless). Then waterproofing everything… Incidentally I discovered a while back that the touch screen on an iPhone doesn’t work in the rain. Fortunately I wasn’t relying on OS Mapping for my navigation.
Jeffrey Lee (213) 6048 posts
On the subject of writing tests, “The lazy programmer’s guide to writing thousands of tests” gives some useful advice on how to write effective tests: https://www.youtube.com/watch?v=IYzDFHx6QPY
Rick Murray (539) 13850 posts
It’s not your iPhone, my S9 does the same thing. Capacitive touch screens are not at all good at telling the difference between fat bits of flesh and raindrops. It’s quite annoying when your phone is waterproof and you’re recording something in the rain and suddenly the phone is like focus on this, on this, zoom in, out, in, out, shake it all about, screenshot, screenshot, focus, exposure dark, pause! I’m like 😡 and wishing there was a setting to “just record and ignore anything that happens on the screen”. Use the power button or something to stop the recording.
The Japanese took this to its illogical conclusion and made the film Battlefield Baseball.
What’s the point of the computer then, exactly? A fancy score display?
Rick Murray (539) 13850 posts
Nice selection. I would add:
Alan Adams (2486) 1149 posts
Score display, prizegiving lists, final results. The recent revisions cut out one step in the process, and two people from Control. The judges are on the riverbank, and not to be disturbed by competitors wanting to know their results. The paddlers must be shown their scores in a short time after the run, to allow the time window for submitting protests. The latest revision provides a live results display over WiFi, eliminating crowding round the results – for reasons that should be apparent.
Charles Ferguson (8243) 427 posts
A while back I started this discussion with a survey. I wanted to get a feel for what people were missing and why they weren’t taking the need to do testing seriously – or at least why that hadn’t been expressed by people using the build service. At the time I said I would give the results of the survey once it was complete. I’ll give the results I received (in aggregate) and some of the comments (none that might identify people), and I’ll try to give a small interpretation of each, where I can.

The survey was, unsurprisingly, low in submissions: only 14. There were 21 records where people started the survey but didn’t complete it; in all those 21 cases, the survey recorded no answers. I’m not sure there’s anything you can draw from that, other than that it’s a relatively niche area (despite the fact that all developers should care), and I expect that there will be a lower response rate than readership rate.

The questions in the survey were…
Before taking this survey, have you heard of the RISC OS Build Service?
That’s not unreasonable; it’s not hugely publicised, and the name changed from the April 1st release, so fair enough. Plus, that’s at least 2 more people that have been reached by the survey and the forum post. Were you aware that it can be used for automated building and testing of RISC OS software?
Again, this is not unreasonable for similar reasons. I can assume that probably the same 11 people that said yes here said yes to the prior question. So the proportion of people that were aware of it, and looked at it enough to understand its use as an automated testing tool is high. To me that means that the users who looked at the system understood it – the message that the site and documentation sends, once you get to it, is effective. Have you tried to use the build service?
This is interesting; although many of the respondents understood what it could be used for, they didn’t try it.

Do you have any comments to share about the build service?

This question had a number of canned answers which I thought were likely (as multiple choice, because people are likely to have multiple thoughts), and an ‘Other’ response for cases where people wanted to give a longer comment.
The 4 people who said they didn’t understand it is disappointing, but probably not surprising. There could be better documentation of what it could do or how it could do its job. Maybe that’s a worthwhile thing to try out with some worked examples. There are followup questions to try to help guide what’s needed, so I’ll not get into this here.

‘Looks useful’ and ‘I might use it’ are in the range I kinda expect – I’d reckon that ‘I might use it’ plus ‘I don’t really understand it’ cover most of the respondents. The lack of time is what I was expecting out of the survey, so that’s reassuring that I’m thinking the right thing. I worried that it might not work for people, or that they might think that it wasn’t suitable, so those responses were included. That they got no response is pretty neat: nobody believes that those were reasons. Maybe that’s because they didn’t have the time, so I can’t be complacent there. The next response was that there were 3 people who tried it but needed help…

Now ‘help’ can come in a number of forms: direct responses, reading tutorials and examples, following guides and other things. As I created the survey I feared that people wouldn’t have looked at the examples that I’d given (there are about 10 git repos with examples using the service) or the documentation (there is documentation which describes both the protocol used, with an example exchange, and a guide on how to integrate with different services). The written responses in the forums kinda backed up that people hadn’t looked at, or hadn’t found, the documentation and examples. Which would potentially explain the ‘need some help’ response, and may help with the ‘don’t really understand it’ responses too.

But what can I take away from this? Maybe that the layout of the site doesn’t take you to the information you want very easily? Maybe what’s in the documentation isn’t written in a way that people can find what they want?
Maybe they just don’t have the time to dig into that stuff, so they ask questions that are already answered? None of those reasons are to the detriment of people who might feel that way: if they can’t find the right things, I’ve not done the job well; if they don’t want to look hard for things, it needs to be made easier; if they don’t want to dig into things, maybe they don’t need to know. So there are a couple of insights there into where the system falls down.

Do you have any interest in testing your software?

Here, I wanted to understand what the respondents thought about testing. It seems like a bit of a leading question, because obviously if you’re answering the survey you’re going to care, but I wanted to gauge the degree to which this mattered. Because to some people it may not matter: either because they’re not a developer, because they don’t see a need, because they don’t have time, because they’re awesome, or some other reason. Again this was a multiple choice question, as there may be multiple comments.
This sort of question is, I think, difficult to judge, because some people will answer what they think they ought to say rather than what they do. But we can only take the responses we have. If I were to answer it, I’d probably find myself saying ‘Yes and I’m doing automated testing’, BUT many of the other ‘Yes’ answers are true, and so is the ‘too much hassle’. And if I can answer this so ambiguously, what do I hope to get from respondents? In retrospect I should have given this a framing to make it easier for people to concentrate on – “Thinking about a recent project you worked on…” or some other phrase like that, to narrow it. But that’s fine; I’m learning how to get useful feedback, and what’s working and what’s not. I just have to go with what I’ve got.

And what I’ve got isn’t too bad. There’s one ‘What’s testing?’ response, which I included largely because it’s fun. I’m going to take it as someone who felt the same way… but maybe they’re being serious, and they don’t have a concern about it. I could feel sad if that were the case, but actually if someone doesn’t feel it’s necessary then maybe they’ve made an appropriate choice. I can’t tell if they were serious or not, but it’s only 1 person, and I’ve done a presentation on the subject which might interest them (or others that feel that way) or not. I can only hope to give people information and tools.

Some people said they were already doing automated testing; the GCC guys have at least build tests happening, so that’s not unreasonable, and Jason Tribbeck said in his presentation last year that he was doing some automated testing. So this seems reasonable. If at least two instances I know of have automated testing, there’s got to be more. What that tells me is that my dim view that ‘people don’t do it’ might not be an absolute, but maybe just true of a good proportion of them. Good to know!

Eight people wanting to start doing automated testing is great. That implies that they’re keen to try some things out.
If they want to try things but haven’t, that means that there’s something preventing them, which I partially address in the next question (but only partially – I should have had an intermediary question about automated testing in general, rather than focusing on the build service, but hey-ho, it’s easy to see in retrospect). It fits into the narrative of needing help, too.

The people who say their code ‘only runs in the desktop’ are where I thought a lot more of the answers would be, so my view of why people don’t do testing is skewed: that’s about a third of respondents, where I would have expected the figure to be closer to two thirds. I’d tried to give examples of how to address this in the testing presentation, and in a recent Iconbar article, so those at least speak to those people. It’s still reasonable to say that it’s hard to test in the desktop, but it’s rare to find things that run in the desktop that you cannot test at all; it’s just a question of how hard you want to work at it.

We also have two written answers in the ‘Other’ section.
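As a sketch of that ‘you can usually test more than you think’ point: if the desktop front end only gathers events and hands them to a core routine, that core can be exercised by a plain program with no desktop at all. This is a hypothetical Python illustration (the principle is the same in C or BASIC); the function names and behaviour are invented for the example, not taken from any real application.

```python
# Hypothetical core logic that a desktop application might use: building
# a human-readable window title from a RISC OS style path. Nothing here
# touches the Wimp, so a plain script can exercise it directly.

def leafname(path):
    """Return the leaf of a RISC OS path (components separated by '.')."""
    return path.rsplit('.', 1)[-1]

def window_title(path, modified):
    """Title as the app would show it: the leaf, plus ' *' when modified."""
    title = leafname(path)
    if modified:
        title += ' *'
    return title

if __name__ == '__main__':
    # The 'cobbled together something to exercise the code' style of check.
    assert leafname('ADFS::HardDisc4.$.Apps.!Edit') == '!Edit'
    assert window_title('ADFS::HardDisc4.$.Docs.Report', modified=True) == 'Report *'
    assert window_title('Report', modified=False) == 'Report'
    print('core logic tests passed')
```

The design point is the separation itself: once the logic lives in functions that take plain values and return plain values, any environment that can call a function can test it.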
Excellent. This person clearly knows the principles and practices. “My work here is done!” Well, not really, but there will be people that do this sort of thing for a living. It’s my day-to-day work, but I still don’t put as much effort into testing RISC OS things as I probably should.
(sigh) I don’t know how to help this directly. However, I try to be inclusive of everyone in how I explain and write things. I include examples of where things have gone wrong and mistakes that I’ve made. There are many reasons for this (not least getting in the criticism before others can, and trying to humanise the subject matter and make it conversational), but there is one that is particularly pertinent here, both as a technique in how I write and because it’s entirely real. When I talk about something that went wrong, or a mistake that I made, it highlights the point being made, and shows that I’m just as fallible. Writing that way is – I think – amusing, and encouraging: “If he can make that mistake, then I don’t feel bad doing so”. It’s not so much what you do wrong as how you deal with the results, and if you can make it ok to make mistakes without being fearful, then that helps.

Some people find coding hard; they don’t think in the procedural way that means they can do it as easily (or for other reasons, but that’s the main one I’ve come across).

What keeps you from using the build service for testing?

Lots of answers on this one, which is good, as this was the main thing I wanted to understand. There’s also an ‘Other’ comment section which I’ll include at the end.
Ok, so not knowing about it is a really good reason to not use the system; can’t fault that. As with the previous answer, a little more communication of the intent would help, but still, I can’t expect 100% awareness, so I don’t think I’ll stress over this.

‘Code not being public’ is a fair comment. I can only say that the service doesn’t publish the source, nor even hold on to it for more than the time it takes to execute. Even then, that only mitigates such fears, and it’s entirely reasonable to not want to send your code (or binaries) to a public service. The only thing I’m taking away from this is that explaining what is and isn’t made available to others is important, because I don’t do that in the documentation. If I had, they might have had a different opinion; but even if they felt the same way, it would be with the correct information. On the other hand, ‘I don’t trust it’ got 0 responses, so maybe it was just a blanket ‘my code doesn’t leave me’… which I can’t argue with!

‘Code isn’t testable’ applies to many things: hardware drivers, desktop components, things that need more of the system than the service provides. There are many that aren’t appropriate. In the presentation and the recent Iconbar article I tried to give some examples of systems that seem inappropriate, and to show that even for those, basic crash tests can be written. The question is always whether it’s worth it, so fair comment really, and I’ve made some attempt to explain how things could still be tested when thought untestable.

‘I don’t know how to write test code’ is one that I expected to be a bit higher (I expected about 50%). My view is that ‘test code’ can be many things, and it’s very easy to get bogged down in what the code is, rather than what it’s trying to do. That’s why I spent a lot more time in the presentation trying to explain that simple programs can still be tests. Given the lower response here, maybe I over-did that.
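To illustrate the ‘simple programs can still be tests’ point: even a test with no clever assertions has value if it just runs the code over awkward inputs and fails loudly on a crash. A minimal, hypothetical sketch (the routine under test is a stand-in, invented for this example):

```python
# A bare-bones 'does it crash?' test: feed a routine a pile of awkward
# inputs and only check that it fails politely rather than blowing up.

def parse_colour(spec):
    """Toy routine under test: parse '#RRGGBB' into an (r, g, b) tuple."""
    if not spec.startswith('#') or len(spec) != 7:
        raise ValueError('bad colour: %r' % spec)
    return tuple(int(spec[i:i + 2], 16) for i in (1, 3, 5))

def crash_test():
    """Return the number of inputs that caused an unexpected failure."""
    awkward = ['#000000', '#FFFFFF', '', '#12345', 'red', '#GGGGGG', '#ff00ff']
    failures = 0
    for spec in awkward:
        try:
            parse_colour(spec)
        except ValueError:
            pass           # rejecting bad input with a clean error is fine
        except Exception:
            failures += 1  # anything else is the kind of crash we care about
    return failures

if __name__ == '__main__':
    assert crash_test() == 0
    print('no crashes')
```

That’s the whole test: no framework, no mocking, just “run it and see if it dies”, which is often the first useful step for code that has never been tested at all.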
‘I don’t know how to use the build service’ is (sadly) about where I expected. I had documented it with worked examples, and created git repos with the example integrations in. I responded to this on the forum thread, and the presentation talked more about it. Maybe once I release the FanController source it’ll make that easier, because people’ll see the (odd) way that I’ve done it. I also gave the worked example of Rick Murray’s ErrorCancel module, and that’s up on GitHub as a PR to make it clear exactly which bits changed. I still think that some sort of article walking through how you actually add build service tests to a repo would help.

‘I write my code on RISC OS’… yeah, I’m not sure what to do with that as an answer. Maybe I shouldn’t have included it as an option. ‘I … (need) … a native git client’ might be the same respondent, but it’s not obvious from the aggregate data. Either way, it’s basically the same answer as before – it needs something that I can’t give right now.

‘Time and energy are lacking’ and ‘I haven’t got around to it’ – these are pretty much where I thought they’d be. There are always things competing for time, especially with families and other commitments, so this is pretty much where you’d expect it to be for a niche tool in an enthusiasts’ community.
Oh so little time…
Neat. I think it matters more on larger-scale projects because you’ve got so many moving pieces. It still worries me, though, that the build service isn’t really used in anger by anything. It’s likely to fall over because it hasn’t been hardened by use. Of course, I could improve that by putting testing in place to check its behaviour… Hmmm….
Well, output can be piped to the system from other builds – so if you use gcc cross-compiling you could send the results to the service, etc. I think a little more discussion on this would be good, if it’s needed. I already had an exchange with one person who was working in an automated build environment, and there are some things I can probably do to make interworking better.

What factors might influence your use of automated testing for RISC OS software?

So this question comes to the secondary point of the survey… where there’s a lack of things. I largely included this as a general informational question, because I’m unlikely to be able to fulfill any of the needs alone. But it’s certainly interesting.
‘Examples of how to write tests’ and ‘Tutorials explaining how to test’ are the top responses. Which is interesting, because the survey was before I set out to do the presentation, and long before I wrote what became the Iconbar article. Maybe later I’ll do something more to explain how to write the tests – but there are a number of examples out there that I’ve done, so I’m feeling not too bad about that.

‘Better tooling’… well, I was thinking about libraries and possibly aids for processing results, but someone (Theo?) mentioned a more standardised way of invoking tests. That’s an interesting approach too, and not one that I’d been actively considering outside of my own collections. One vote, so I guess actually that’s not so important. Not what I had expected.

‘Better testing environment’ is pretty self-explanatory, ‘cos RISC OS sucks for testing, albeit Pyromaniac makes that less dangerous. Ideally the system shouldn’t be so fragile, and should allow forms of introspection. What that means is kinda up to others. I did my bit to make things less ugly and provide better information when things go wrong. Only 1 vote for that, though? That surprises me, and possibly indicates why there was so little enthusiasm for any stability improvements and the Diagnostic Dumps – what I felt was important is not important to others.

‘More collaboration’… Yeah, I think I mentioned this in the presentation, about ways that people can try to encourage better testing and better engineering in general. That really needs work from reviewers and the people you’re working with to make it happen. Still, only one vote, so… maybe not such a big deal.

Those 1-vote cases are interesting, ‘cos I think they’re all pretty important, but I guess less so to other people. I’m not sure if that’s just that my priorities are different, or that I understand what I mean and what those things can do for people. Should I worry about the fact that there were 0 votes for ‘feeling it would improve quality’?
That factor wouldn’t influence people? Huh. Maybe they’re already convinced it would, so that feeling isn’t an influence. Or maybe that was just a useless answer. Hey-ho. Then there were the ‘Other’ comments:
Yup, they’re similar, and they’re big influences. You want to do what’s most useful to you – or most fun, or most lucrative, or most educational, or whatever it is that you value – if you’re pressed for time, so I absolutely understand that.

Do you have any other feedback on automated testing on RISC OS?

This was an open question and it got a few answers.
Yup, examples are up on GitHub under ‘riscos-ci’, in many different forms, and there are specific examples on the build.riscos.online site, under the menu item ‘CI configuration’. Further links to the examples are listed on the Pyromaniac resource site under the link for ‘CI Examples’.
You’re welcome. I would love for others to produce a guide to automating tests of Wimp applications. Paolo’s very short demo of using Keystroke reminded me how it was back when I used it at school (really? that long ago?).
Feel free to let me know if you need more help. I can try to guide you through it or set up something for you to use, if that’s easier. At least, with Github.
Thanks; I know how hard it can be to find time! I have lots of little things that I really want to play with but… well something has to give ‘cos you can’t do everything!
I feel your pain. Those differences (except the crashing!) can be smoothed somewhat, but you won’t find many people wanting to do away with them, ‘cos if you wanted to run a Windows/Unix system, you’d do so. Hopefully the service, or at least the guidance for testing that I’ve given, helps. If not then… well, I tried.
It would be fun to work with. As I’ve said to others, Pyromaniac can be used in a more integrated way if people want to do that. The build service is kind of a ‘quick project to show what you can do’, which is actually useful (in my opinion), but there are a lot of other ways it might be useful.
Thank you both for your kind words.

Conclusions?

So it’s confirmed my belief that there needs to be more education (in different forms), although the survey was started before I did the presentation, which hopefully addresses some of the questions. Maybe a guide to using the build service might be useful; there are words on the build service site, but maybe an article would help. The ‘how to write test code’ issue is a common theme, which I tried to address, but it could certainly do with more work as well.

Thank you if you filled in the survey! I hope that the results and my comments about them have been useful. |
Charlotte Benton (8631) 168 posts |
While I don’t dispute the need for more sophisticated development tools, I think that Pyromaniac is by far the more interesting development, and perhaps one of the most interesting developments in the post-Acorn era. Being able to run RISC OS programs without whole-hog emulation of the entire OS is a formidable achievement. |
Andrew McCarthy (3688) 605 posts |
Wholeheartedly agree; it joins a list that includes Iris, 4K widescreen support, RPi compatibility, updated C tools, an AI library, Python, Zap, … |
Rick Murray (539) 13850 posts |
Two quick notes (I’m on break at work). Firstly, the “I don’t trust it” could be taken in two different ways: the security/confidentiality of uploaded code (which is the interpretation you’re using), or the overall validity of the results.

And secondly, I wonder if those who find coding hard are trying to use assembler? In 202x there should be little reason to create any new projects in assembler, because that ties you to a whole host of issues (register assignment, stack balancing, indexing data, blah blah, not to mention being extremely sensitive to changes in the processor itself) that people invented compilers to do for us. Given that a number of modern ARM cores don’t even execute the code that you’ve written in the way that you’ve written it, all I can see in creating new projects in assembler is a complicated maintenance nightmare. Of course, that isn’t to say that C is easy; that too has plenty of its own quirks. But at least if you have a compiler and libraries to match your machine, rebuilding the project ought to be most of the job done. Though, as you say, it does require a certain logical mindset that operates purely in the realm of yes and no logic (no “maybe”; there’s no binary value for maybe, except in quantum computing where your logic may be yes or may be no… :-) ). |
Charles Ferguson (8243) 427 posts |
I honestly hadn’t thought of it that way. It’s a pretty fair interpretation too, and actually it’s one that applies to most testing in general, although we don’t phrase it that way. As your system gets more complex, your ability to actually find real problems decreases, as does your confidence that you’re testing what you think you’re testing, which reduces your trust in the tests’ usefulness. The build service isn’t exactly the system that you’ll be working with, so you can certainly say that. Still, nobody responded with that option; maybe because there was an ambiguity that I didn’t see.

However, there are advantages in testing on fake systems beyond the fact that they let you test in different ways. They can find stupid mistakes because the system isn’t the way it’s meant to be, essentially exposing where error handling or assumptions are invalid. That’s indirect testing, I guess, but it’s still valuable if what you get out at the end is a more robust system. |
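A sketch of that ‘fake systems find stupid mistakes’ idea, with entirely invented names: hand the code under test a stub interface whose calls deliberately fail, and see whether the error paths cope. This is a hypothetical Python illustration of the technique, not any real API.

```python
# Hypothetical illustration: run a save routine against a deliberately
# hostile fake of a file interface, so the error-handling path is
# exercised without needing a real (or broken) filesystem.

class FakeOS:
    """Stands in for the real file API; every operation fails."""
    def open_out(self, name):
        raise IOError('Disc drive empty')

def save_document(os_iface, name, data):
    """Code under test: returns True on success, False (not a crash) on error."""
    try:
        handle = os_iface.open_out(name)
    except IOError:
        return False   # a robust app reports the error rather than dying
    handle.write(data)
    handle.close()
    return True

if __name__ == '__main__':
    # On the fake system the save must fail *gracefully*, not crash.
    assert save_document(FakeOS(), 'Doc', b'hello') is False
    print('error path handled')
```

On the real system that error path might be hit once a year, when a disc is full or missing; on the fake it is hit every run, which is exactly the ‘system isn’t the way it’s meant to be’ value described above.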
Steve Pampling (1551) 8172 posts |
The thing is, Rick, a test is a test. Whether it answers all the questions you may have is a different issue: it’s still a valid test, but perhaps not a test that answers your specific question. In a network setup, I can try to use a browser to connect to a web server. It fails. From that result, what do I know? You see, the moment you start testing, you need to put all the questions there and strike off the ones that are answered by the test. A list of what kinds of answers the system can give, and what it can’t, might help. |
Rick Murray (539) 13850 posts |
No, it’s very much not. Let us consider a potentially faulty memory decode on a Beeb. Working with software isn’t really that much different: you have an expected behaviour and you have an actual behaviour. When they don’t match, it is necessary to devise tests to try to work out why. But, note, depending on how the code was written, a number of things may behave when run in the debugger but fail when run on a real system. Before anybody says “it’s the compiler! the debugger code isn’t optimised so it works!”, one must remember that a debugger presents a sanitised world, with such aids as clearing memory before use (making it easy to see what was written where). But… if the bug was an unset pointer, under the debugger it may hold the value null, so all the null testing will work. On a real system, on the other hand, the unset pointer will have the value of whatever happened to be in memory. The null testing will fail because the pointer isn’t null. Bang. That latter part, the memory zeroing, comes down to trust. Do you trust the debugger to run the same code in the same way? You shouldn’t, because it doesn’t. ;-) So “a test is a test” only if you’re looking at it from very far away.
How much faith (another word for trust) do you have in the accuracy of this list? What is its provenance? The manufacturer? StackOverflow? A blog post? How up to date is it? How up to date is the thing you’re testing, for that matter? |