Automated testing and CI
Rick Murray (539) 13850 posts
Sure they are. Testing is what happens when the user runs the program. :-)

The question I have about automated testing is that, really, you are only ever testing known knowns. For instance, if there is an FP test suite, it can verify that FPE generates good results. If there’s an FP implementation that hands processing off to the VFP or NEON units, it can verify that the results are good. But if there’s a new 128 bit super extended FP format, well, such a set of tests would have to be written and validated before the code in question can be validated.

Which would suggest that, provided there aren’t a load of tweakers and tinkerers involved with a project, shouldn’t a bit of code behave in the same way once written and then left alone? It might not be necessary to test a function to convert a set of characters to lower case 1, but surely if you’re compacting a linked list you’ll have cobbled together something to exercise the code prior to putting it into a released application? And then it ought to continue working as designed.

That’s not to say that no testing should be done; that would be quite dumb. I just feel that automated testing is only as good as the tests it performs, and a lot of edge cases might slip by because the person who devised the tests just didn’t think of those cases. Real world, that’s where people get involved, and people are better than any test suite at breaking stuff. ;-)

What might be better, given the, um, specifics of RISC OS, is to code defensively and not just blindly accept any old crap handed to your routine. I mean, if the address you receive is 0 (or anything less than &8000) then there’s a pretty good chance that accessing it will cause something to fail. So at the times where it matters, a few simple checks could avert a world of pain.

I think, and this is where the Pyro project could be interesting, as well as emulation, that testing on RISC OS itself is harder than it needs to be.
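As an illustration of the kind of defensive check being described (a sketch, not from the original post: `safe_copy` and `MIN_VALID_ADDR` are made-up names; &8000 is where RISC OS application space starts, which is why anything below it is suspect):

```c
#include <stddef.h>
#include <stdint.h>

/* On RISC OS, application space starts at &8000; any pointer below
 * that is almost certainly bogus, so reject it up front rather than
 * crashing somewhere deep inside the routine. */
#define MIN_VALID_ADDR 0x8000u

/* Hypothetical routine: copy 'len' bytes, refusing obviously bad input. */
int safe_copy(void *dest, const void *src, size_t len)
{
    if (dest == NULL || (uintptr_t)dest < MIN_VALID_ADDR) return -1;
    if (src == NULL || (uintptr_t)src < MIN_VALID_ADDR) return -1;
    if (len == 0) return 0;

    const unsigned char *s = src;
    unsigned char *d = dest;
    while (len--) *d++ = *s++;
    return 0;
}
```

A couple of comparisons at the top of the routine cost almost nothing, and turn a machine-stiffing wild write into a polite error return.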
There’s not a lot of help testing BASIC. As for C and assembler, DDT is a painful mess of interacting bugs 2 that has far too many quirks and limitations to be useful in day to day use. And if you’re doing anything more complicated, all too often the result is that the machine just stiffs with no clear indication of how it got to be in that state.

So one of the best testing tools that could happen to RISC OS is something that would allow the system to crash, and then let the user unwind it. Maybe then those little annoyances like pushing six registers to the system stack and pulling five (since we’re still writing so much in assembler, even these days) might become clear if we can see it happening, as opposed to muttering “motherf….” and prodding the reset button. Again.

1 I trust you’re using Territory, people, and not just ORing in the ‘lowercase’ bit…
2 My dislike of DDT is well known, though it must be said that debugging by spewing data to either DADebug or the serial port is positively Neolithic.
Steve Fryatt (216) 2105 posts
Which leads us to an observation made recently, and apparently seriously, in another place that people shouldn’t complain about their web browser crashing because complex software like a web browser always crashes.
Well, yes… that’s the point. Generally you’d write the suite to test the code being tested, and prove that it does what is expected for all the edge cases that you can think of. And then add more tests if reality intervenes, to pick up the things that were overlooked. Either way, if the module (not in a RISC OS sense) being tested changes, your test suite has also changed.
First, the overhead of the tests is probably minimal. If the code doesn’t change, then neither do the tests. In a RISC OS context, you could probably “mothball” them from daily use, and just run them when making releases. But how often does code remain unchanged forever? What about that small tweak that gets done, which is too small to be a problem? Well, you can test it easily now, can’t you… and probably find that it actually breaks something unexpected… :-)
I think I’d beg to disagree, having spent the past decade testing stuff (although not necessarily software, I must admit). First, there’s degrees of automated. What Gerph spoke about last week wasn’t by necessity automated: it was just test suites which could be built and run at the CLI, and by extension be thrown at an automated system. Even if you don’t automate it, it’s still a hell of a lot easier to run a test suite from the CLI and check that none of the cases fail, than it is to test that comprehensively in a live Wimp application. Because it’s a lot easier to write tests that exercise all of the awkward bits that you can think of, quickly and easily. And easier to debug a failed test if it’s just testing one function in isolation.
Well, yes, but you probably then want to try passing
Which brings us back to the whole point of last Monday’s talk… This stuff is hard, so if you can, you do it in an easier way. You test the building blocks simply, in a way that you have some control of, so that when you come to test the whole application you have
:-)

ETA: And yes, I’m well aware that I don’t test my RISC OS stuff anywhere near well enough: most of my stuff on the platform dates back to well before I had any idea about this kind of thing. I started putting unit tests into SFLib at the back end of last year, and wish I’d done it sooner as it has made things so much easier to work on.

PPS: It should be fairly easy to do this kind of testing in BASIC, too. In fact, IIRC, some of the examples last Monday were in BASIC (or was I imagining that?).
Charles Ferguson (8243) 427 posts
Quoting Theo:
Interesting; I suspect it’s less useful from that side, but I would guess at the very least being able to confirm that what was built was executable would be a step in the right direction. As discussed privately, Pyromaniac isn’t really suitable for heavy use, like GCC builds – it’s just too slow really. Building on a sane system, and then executing the results makes more sense.
Other than the form of results being reported, I don’t think it makes too much sense to have a single definition of the tests and how to run them. Tests may be executed in many forms, and the environment in which they execute matters a lot. You can’t really provide an all-encompassing format that expresses all the dimensions that the tests might take without – essentially – building a programming language that has the ability to do everything. Having some frameworks available for people to use is great, but you cannot settle on just one to cover everything, I believe. Within a given domain, you can certainly provide definitions of tests, but I don’t know that it even makes sense to do that more generically.

As for the manner of reporting test results, I’ve pretty much settled on using the JUnit XML representation, in ‘whatever format works with the system you are trying to use it with’ (because oh dear gosh it’s a terribly inconsistent format, with different versions mutually incompatible with each other). My reasoning is that JUnit XML is the most widely supported format, and – for me – it integrates with all the tooling I use (GitHub, GitLab, Jenkins and my own tools as consumers, and the testing tools I’ve written as generators).
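For readers who haven’t met the format, a minimal JUnit XML report looks something like this (a sketch: the `testsuite`/`testcase`/`failure` elements are the common core, but the exact attribute set varies between consumers, which is exactly the inconsistency mentioned above):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ringbuffer" tests="3" failures="1" time="0.042">
  <testcase classname="buffer" name="wrap_at_end" time="0.010"/>
  <testcase classname="buffer" name="empty_message" time="0.008"/>
  <testcase classname="buffer" name="off_by_one" time="0.024">
    <failure message="read past buffer end">expected 64 bytes, got 65</failure>
  </testcase>
</testsuite>
```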
If we assume that this is going to be entirely domain specific, I would suggest that you should not be controlling the tests in quite that way. ‘all’ may implicitly include tests that never complete, or that take huge lengths of time. It is important to be aware of what gate 1 you are trying to pass to be able to execute the tests. Simple developer tests should (generally) not take a long time, whilst tests for a pull request/merge request/review may want to be heavier, and pre-release testing heavier still. So any definition that you provide may want to be specified in terms of the gate as one of the parameters.

Similarly, the scope of the tests is often used to describe the sequence of testing, as the narrower-scoped tests will report problems more usefully than the wider-scoped tests (eg run your unit tests first, before moving on to integration, etc). And of course some people will want different policies on the ordering and what goes into each sequenced set of tests. All of which makes me think that you cannot create any form of specification which satisfies all of those dimensions. Which is kinda why I suggest that it’s domain specific. For an individual use case, defining how the tests are to be executed and what exists is useful.

It’s only tangentially related, but I use not a definition, but a declaration of what tests are present, in a header at the top of my test description files. These don’t say how to run them, just what they are, like this (https://github.com/gerph/riscos-tests/blob/master/testcode/tests-core.txt):

# SUT: Kernel: Core SWIs
# Area: Output, conversions, execution
# Class: Functional
# Type: Integration test

Then I can use a little tool to produce reports on what is available (even if they’re not used) from the source tree (SVG: https://share.gerph.org/s/L0dTeIB51bGolbc PNG: https://share.gerph.org/s/zkYbwhMPdlbDjrN). So I can definitely see advantages in being able to declare what things are available.
I guess as I think about this more I’m liking the idea, but I’m unsure how you might do it. Maybe I’m being put off by the whole ’there’s too much to think about here’, and should just shut up and let someone try to solve it instead of saying why I don’t think it can be perfect… ‘cos that’s not helpful.
This was one of the key points in my presentation – I know you were there, but I’m responding to things after the event, for future readers. The point being that you shouldn’t use the ‘it doesn’t exist so I won’t do it’ excuse. I agree it’s a disincentive, but… you create some tests – they don’t have to be hard – and then you build on them. What you say is true, that not having anything makes it an extra hurdle to get over, but honestly, if that hurdle is ‘write a program that returns 0 when it was successful or crashes if not’… that’s actually not that bad. Then you can start raising your game.

1 I’m using gate in the engineering sense of ‘what conditions must be met to allow the code to pass through it’. Developers will want tests at the ‘interactive’ gate to pass, generally, then move on to maybe ‘commit’ gating tests, before passing ‘mainline’ tests to be accepted into the main tree, and ‘release’ gated tests for the generation of the release.
Charles Ferguson (8243) 427 posts
Quoting Rick Murray:
Yes it ought to – and that’s /why/ you keep the test around and you run it all the time. Because when it breaks, and it will, you want to know about it. You never assume that the code is left alone – because at some point someone will want to change it, or it will become broken because of something you just couldn’t envisage happening.
Of course they will. But automated testing is about trying to reduce the chance of those failures reaching the outside world. Any failure that reaches a user is an embarrassment, and you should be ashamed of it. Not cripplingly, obviously, ‘cos you’d give up and crawl into a hole for 15 years if that happened, but enough to make you want to be better. Any failure that reaches a user that you see a report from is probably just the tip of the iceberg of users that saw a problem and said nothing. And how many of those users just walked away because everything was just too hard?

As for the more serious point that the tests are only as good as what you write… yes, that’s true. This is one reason why developers and testers tend to be different people in serious software companies – professional testers have a mindset that is somewhat warped and intentionally adversarial, whilst developers are inherently optimistic that the code that they wrote will work 1. You can always find more problems and more cases where things don’t work, because as a developer you suck (that’s just the way it is… you’ll miss things and you’ll cry at how stupid you were). But the more testing you have, the more light you shine on the code, and the fewer places there are for the bugs to hide.
{facetious} DDT is not a testing tool. It’s barely a debugging tool. {/facetious}

1 Ok, not all of them, but humility is important :-)
Rick Murray (539) 13850 posts
Yeah, that’s a bit of a touchy subject right now. As I was writing thoughts on the Eurovision final (live, so no ability to pause), I was also looking at Vince’s thoughts on Twitter on my older phone.
You and I both know that such a thing is akin to “what could possibly go wrong” in terms of tempting fate.
Certainly, and it gives you the opportunity to examine, store, and compare the output of the function. Not something that is necessarily possible when it’s embedded into a Wimp program. This comes into part of the “testing as it’s being written” idea. I also make fairly extensive use of DADebug to allow tracing through a live function, though this is usually omitted from release software. While I may remove the test code once stuff has been tested, I make a habit of wrapping my debug messages in #ifdef so they can be linked in easily at any time. Because, yes, that little tweak that won’t change anything… been there, seen that. Yes, I’m conflating testing and debugging. Because to my mind, code that doesn’t work needs fixing. That’s the basic definition of debugging.
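The #ifdef wrapping mentioned above typically looks something like this (a sketch: the `dprintf` macro name is illustrative, and the output goes to stderr here where a RISC OS build might route it to DADebug instead):

```c
#include <stdio.h>

/* Build with -DDEBUG to link the tracing in; without it, the calls
 * compile away to nothing, so the release binary pays no cost. */
#ifdef DEBUG
#define dprintf(...) fprintf(stderr, __VA_ARGS__)
#else
#define dprintf(...) ((void)0)
#endif

/* Example routine with tracing left in place, ready to re-enable. */
int add_checked(int a, int b)
{
    dprintf("add_checked(%d, %d)\n", a, b);
    return a + b;
}
```

Because the macro collapses to `((void)0)` in release builds, the trace statements can stay in the source permanently and be linked back in with a single compile switch.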
I think the first four words are superfluous in my case. :-)
Charles Ferguson (8243) 427 posts
Quoting Steve Fryatt:
I’m not sure I said this in the talk – I did point out that I’d made mistakes in my own assumptions, etc, but I don’t think I said it as clearly as Steve has… so here’s my statement too: I don’t test enough either. I’ve got whole libraries that don’t get any love when it comes to testing, and that’s sad. I should try taking some of my own advice on ways to do things, ’cos like everyone else, I suck too :-)
Rick Murray (539) 13850 posts
I wanted to write a test harness for DeskLib. Then I realised the scope of what that might actually involve, understood that I’m not getting any younger, and that I have a life.
Charles Ferguson (8243) 427 posts
Thank you everyone for the comments so far. I’ll post the results of the survey later in the week; they’re not especially surprising, but as I said I’d do them, I should! For anyone who missed the presentation, the slides and full speaker notes can be found here:
Steve Fryatt (216) 2105 posts
Oh, absolutely. I only wrote the basis of one for SFLib when I became aware (thanks to feedback from – ahem – “field testing”) that some of the routines that I use for copying strings in and out of icons might, possibly, contain some out-by-one errors in their bounds checking. It would have been a nightmare to have tested all of the conditions in the finished applications (yes, plural) where the problems were showing up, but after a week of writing a test harness able to test routines which wanted to use Wimp_GetIconState 1, I was able to see for myself, on demand, what my users were unhappy with. And come away with some confidence that the subsequent fixes were correct, which was another point made last week.

I should probably make the code available on GitHub for others to laugh at… perhaps when I’ve got the May issue of The WROCC to bed, as I notice that it still contains some signs of having been copied hurriedly from Launcher. :-)

1 So a Wimp application calling unit tests on null events and dumping the output to Reporter, in effect. I’ve since pondered linking to a “replacement OSLib” which does simple memory allocation for those calls, but that’s a big job and the quick-and-dirty Wimp application works fine so long as there are some functioning bounds checks in the code being tested to stop the whole OS falling over in a heap.
Alan Adams (2486) 1149 posts
I have two different software problems, and I have problems testing both.

One is the disc activity light project discussed extensively here a few weeks ago. The basic problem there is that the application seems to work perfectly. It’s just that other software crashes around it when it’s running. There doesn’t seem to be ANY way to track down that problem. (The only things I can see are 1: failing to preserve registers through vector intercepts, or 2: altering something outside my memory area. I’ve spent days poring over the code to look for both, without success.)

The second is a 20-year development, which is now in the form of a database server with a large number of clients running on a network of other RISC OS computers. It’s used for scoring a sports competition, so needs to be reliable and reasonably fast. In use it’s highly likely that three of the computers will be updating the same area of the database, and four others will be reading data from that general area too. The only time that degree of concurrency gets tested is when the system goes live, under pressure, with non-RISC OS users using it. I cannot exercise that on my own – I would need 4 arms and as many brains.

I do have a number of fallback options built in – fortunately, as the last time it was used, a debug feature was left in by mistake, which meant the server stalled after about ten minutes of activity. The fallback didn’t exercise that part of the code. Post-event it took me a couple of days to find a way to cause the crash at home in order to find the problem.

Ring buffer code has a remarkable number of possible edge cases, especially when the whole thing is written in BASIC. Most of them are in the “off by one” category. That particular one had existed undetected for at least 8 years.
(Possibilities: the message ends one byte before the end of the buffer; it ends at the end of the buffer; it ends at the first byte of the buffer after wrapping; in each of the above, the next byte is the start of a valid message left behind earlier, or is the end-of-message marker, i.e. an empty message; or, at the wrap point, one byte is missed (written past the buffer end) or duplicated.)

Changing the message protocol to include a length and/or CRC check would make detecting errors here more likely, but that degree of re-writing would likely introduce more errors than it would remove.
Charles Ferguson (8243) 427 posts
I don’t know anything about it; reference to the source or discussion?

There are always ways to track down the problems… You could stick code around your vector traps to check the vector values before and after, in all modes. You could move the whole thing to running in user mode, and simulate the entries so that they perform the right actions but just don’t change modes, so that you can see that they’re doing the right things. You could allocate much more memory, if you think that there’s a crash in that code, to see if it stops happening, which would tell you that you might be overrunning, or introduce boundaries on the code to see that you’re doing the right things.

However, the closer you are to the OS interfaces, the more likely it is that you’ll make a mistake in the interfacing, because there’s not any other code to test against. Essentially you’re making basic mistakes in the integration, which is the hardest part to test :-(

I wrote something that sounds similar, some years ago. It trapped the FS vectors and changed the pointer colour on entry, and again on exit. It’s not especially clever, but it did the job pretty reliably at the time. I’ve just made that repository public if it’s useful:
Steve Fryatt (216) 2105 posts
This is exactly why unit testing stuff is useful, though. Exercise all of the edge cases in the ring buffer, which should be fairly simple in a bit of BASIC, and check that they work. Some ideas off the top of my head:
I’m sure that there are plenty more, too. In each case, you check that the values read back match what the spec says they should be given the (distinct) values that you put in. Then, when you know that your buffer is fairly solid, you can use it as a building block in the larger project. Rinse and repeat for the other simple modular chunks.
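As a sketch of what exercising those wrap-point edge cases might look like (illustrative code, not the actual competition buffer: a trivial byte ring buffer in C standing in for the BASIC original, with the edge-case checks alongside it):

```c
#include <assert.h>

#define BUFSIZE 8

/* A trivial byte ring buffer, as a stand-in for the real thing. */
struct ring {
    unsigned char data[BUFSIZE];
    int head, tail, count;      /* write pos, read pos, bytes stored */
};

/* Both return 0 on success, -1 when full/empty: the tests rely on this. */
static int ring_put(struct ring *r, unsigned char c)
{
    if (r->count == BUFSIZE) return -1;
    r->data[r->head] = c;
    r->head = (r->head + 1) % BUFSIZE;   /* the wrap everyone gets wrong */
    r->count++;
    return 0;
}

static int ring_get(struct ring *r, unsigned char *c)
{
    if (r->count == 0) return -1;
    *c = r->data[r->tail];
    r->tail = (r->tail + 1) % BUFSIZE;
    r->count--;
    return 0;
}

/* Edge cases: fill to exactly the end, refuse the overflowing byte,
 * drain to exactly empty, then run a message across the wrap point. */
static void test_wrap(void)
{
    struct ring r = { {0}, 0, 0, 0 };
    unsigned char c;
    int i;

    for (i = 0; i < BUFSIZE; i++) assert(ring_put(&r, (unsigned char)i) == 0);
    assert(ring_put(&r, 99) == -1);              /* exactly full */
    for (i = 0; i < BUFSIZE; i++) {
        assert(ring_get(&r, &c) == 0 && c == i);
    }
    assert(ring_get(&r, &c) == -1);              /* exactly empty */

    /* Offset the pointers so the next message straddles the wrap point. */
    for (i = 0; i < 5; i++) ring_put(&r, 0);
    for (i = 0; i < 5; i++) ring_get(&r, &c);
    for (i = 0; i < 6; i++) assert(ring_put(&r, (unsigned char)(i + 100)) == 0);
    for (i = 0; i < 6; i++) {
        assert(ring_get(&r, &c) == 0 && c == i + 100);
    }
}
```

The same shape works in BASIC: a small PROC per edge case, each checking that what comes out matches what went in, run from a plain command-line harness.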
Out of curiosity, without solid testing behind it, what would you do if someone contested the scores produced by your system?
Chris Hall (132) 3558 posts
what would you do if someone contested the scores

This was sorted out for cricket centuries ago: the umpire was armed. Disputes were rare and quickly resolved in the umpire’s favour.
Alan Adams (2486) 1149 posts
Go to the paper originals. The judges write down the results. Then they phone them through to the clerks who put them on the computer system. (Previously the clerks put them on cards, and from there the summary was put on computer.) I have no plans to put computers on the river bank, not least because they are more than 100 metres away, so needing ethernet repeaters (don’t even think of using wireless). Then waterproofing everything… Incidentally I discovered a while back that the touch screen on an iPhone doesn’t work in the rain. Fortunately I wasn’t relying on OS Mapping for my navigation.
Jeffrey Lee (213) 6048 posts
On the subject of writing tests, “The lazy programmer’s guide to writing thousands of tests” gives some useful advice on how to write effective tests: https://www.youtube.com/watch?v=IYzDFHx6QPY
Rick Murray (539) 13850 posts
It’s not your iPhone, my S9 does the same thing. Capacitive touch screens are not at all good at telling the difference between fat bits of flesh and raindrops. It’s quite annoying when your phone is waterproof and you’re recording something in the rain and suddenly the phone is like focus on this, on this, zoom in, out, in, out, shake it all about, screenshot, screenshot, focus, exposure dark, pause! I’m like 😡 and wishing there was a setting to “just record and ignore anything that happens on the screen”. Use the power button or something to stop the recording.
The Japanese took this to its illogical conclusion and made the film Battlefield Baseball.
What’s the point of the computer then, exactly? A fancy score display?
Rick Murray (539) 13850 posts
Nice selection. I would add:
Alan Adams (2486) 1149 posts
Score display, prizegiving lists, final results. The recent revisions cut out one step in the process, and two people from Control. The judges are on the riverbank, and not to be disturbed by competitors wanting to know their results. The paddlers must be shown their scores in a short time after the run, to allow the time window for submitting protests. The latest revision provides a live results display over WiFi, eliminating crowding round the results – for reasons that should be apparent.
Charles Ferguson (8243) 427 posts
A while back I started this discussion with a survey. I wanted to get a feel for what people were missing and why they weren’t taking the need to do testing seriously – or at least why that hadn’t been expressed by people using the build service. At the time I said I would give the results of the survey once it was complete. I’ll give the results I received (in aggregate) and some of the comments (none that might identify people), and I’ll try to give a small interpretation of each, where I can.

The survey was, unsurprisingly, low in submissions: only 14. There were 21 records where people started the survey but didn’t complete it; in all those 21 cases, the survey recorded no answers. I’m not sure there’s anything you can draw from that, other than that it’s a relatively niche area (despite the fact that all developers should care), and I expect that there will be a lower response rate than readership rate.

The questions in the survey were…
Before taking this survey, have you heard of the RISC OS Build Service?
That’s not unreasonable; it’s not hugely publicised, and the name changed from the April 1st release, so fair enough. Plus, that’s at least 2 more people that have been reached by the survey and the forum post. Were you aware that it can be used for automated building and testing of RISC OS software?
Again, this is not unreasonable for similar reasons. I can assume that probably the same 11 people that said yes here said yes to the prior question. So the proportion of people that were aware of it, and looked at it enough to understand its use as an automated testing tool is high. To me that means that the users who looked at the system understood it – the message that the site and documentation sends, once you get to it, is effective. Have you tried to use the build service?
This is interesting; although many of the respondents understood what it could be used for, they didn’t try it.

Do you have any comments to share about the build service?

This question had a number of canned answers which I thought were likely (as multiple choice, because people are likely to have multiple thoughts), and an ‘Other’ response for cases where people wanted to give a longer comment.
The 4 people who said they didn’t understand it is disappointing, but probably not surprising. There could be better documentation of what it could do or how it could do its job. Maybe that’s a worthwhile thing to try out with some worked examples. There are followup questions to try to help guide what’s needed, so I’ll not get into this here.

‘Looks useful’ and ‘I might use it’ are in the range I kinda expect – I’d reckon that ‘I might use it’ plus ‘I don’t really understand it’ cover most of the respondents. The lack of time is what I was expecting out of the survey, so that’s reassuring that I’m thinking the right thing. I worried that it might not work for people, or that they might think that it wasn’t suitable, so those responses were included. That they got no response is pretty neat: nobody believes that those were reasons. Maybe that’s because they didn’t have the time, so I can’t be complacent there. The next response was that there were 3 people who tried it but needed help…

Now ‘help’ can come in a number of forms: direct responses, reading tutorials and examples, following guides and other things. As I created the survey I feared that people wouldn’t have looked at the examples that I’d given (there are about 10 git repos with examples using the service) or the documentation (there is documentation which describes both the protocol used, with an example exchange, and a guide on how to integrate with different services). The written responses in the forums kinda backed up that people hadn’t looked at, or hadn’t found, the documentation and examples. Which would potentially explain the ‘need some help’ response, and may help with the ‘don’t really understand it’ responses too.

But what can I take away from this? Maybe that the layout of the site doesn’t take you to the information you want very easily? Maybe what’s in the documentation isn’t written in a way that people can find what they want?
Maybe they just don’t have the time to dig into that stuff, so they ask questions that are already answered? None of those reasons are to the detriment of people who might feel that way: if they can’t find the right things, I’ve not done the job well; if they don’t want to look hard for things, it needs to be made easier; if they don’t want to dig into things, maybe they don’t need to know. So there are a couple of insights there into where the system falls down.

Do you have any interest in testing your software?

Here, I wanted to understand what the respondents thought about testing. It seems like a bit of a leading question, because obviously if you’re answering the survey you’re going to care, but I wanted to gauge the degree to which this mattered. Because to some people it may not matter: either because they’re not a developer, because they don’t see a need, because they don’t have time, because they’re awesome, or some other reason. Again this was a multiple choice question, as there may be multiple comments.
This sort of question is, I think, difficult to judge, because some people will answer what they think they ought to say rather than what they do. But we can only take the responses we have. If I were to answer it, I’d probably find myself saying ‘Yes and I’m doing automated testing’, BUT many of the other ‘Yes’ answers are true, and so is the ‘too much hassle’. And if I can answer this so ambiguously, what do I hope to get from respondents? In retrospect I should have given this a framing to make it easier for people to concentrate on – “Thinking about a recent project you worked on…” or some other phrase like that, to narrow it. But that’s fine; I’m learning how to get useful feedback, and what’s working and what’s not. I just have to go with what I’ve got.

And what I’ve got isn’t too bad. There’s one ‘What’s testing?’ response, which I included largely because it’s fun. I’m going to take it as someone who felt the same way… but maybe they’re being serious, and they don’t have a concern about it. I could feel sad if that were the case, but actually if someone doesn’t feel it’s necessary then maybe they’ve made an appropriate choice. I can’t tell if they were serious or not, but it’s only 1 person, and I’ve done a presentation on the subject which might interest them (or others that feel that way) or not. I can only hope to give people information and tools.

Some people said they were already doing automated testing; the GCC guys have at least build tests happening, so that’s not unreasonable, and Jason Tribbeck said in his presentation last year that he was doing some automated testing. So this seems reasonable. If at least two instances I know of have automated testing, there’s got to be more. What that tells me is that my dim view that ‘people don’t do it’ might not be an absolute, but maybe just true of a good proportion of them. Good to know!

Eight people wanting to start doing automated testing is great. That implies that they’re keen to try some things out.
If they want to try things but haven’t, that means that there’s something preventing them, which I partially address in the next question (but only partially – I should have had an intermediary question about automated testing in general, rather than focusing on the build service, but hey-ho, it’s easy to see in retrospect). It fits into the narrative of needing help, too.

The people who say their code ‘only runs in the desktop’ are where I thought a lot more of the answers would be, so my view of why people don’t do testing is skewed: that’s about a third of respondents, where I would have expected the figure to be closer to two thirds. I’d tried to give examples of how to address this in the testing presentation, and in a recent Iconbar article, so those at least speak to those people. It’s still reasonable to say that it’s hard to test in the desktop, but it’s rare to find things that run in the desktop that you cannot test at all; it’s just a question of how hard you want to work at it.

We also have two written answers in the ‘Other’ section.
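As a sketch of that ‘you can usually test more than you think’ point: if the desktop front end only gathers events and hands them to a core routine, that core can be exercised by a plain program with no desktop at all. This is a hypothetical Python illustration (the principle is the same in C or BASIC); the function names and behaviour are invented for the example, not taken from any real application.

```python
# Hypothetical core logic that a desktop application might use: building
# a human-readable window title from a RISC OS style path. Nothing here
# touches the Wimp, so a plain script can exercise it directly.

def leafname(path):
    """Return the leaf of a RISC OS path (components separated by '.')."""
    return path.rsplit('.', 1)[-1]

def window_title(path, modified):
    """Title as the app would show it: the leaf, plus ' *' when modified."""
    title = leafname(path)
    if modified:
        title += ' *'
    return title

if __name__ == '__main__':
    # The 'cobbled together something to exercise the code' style of check.
    assert leafname('ADFS::HardDisc4.$.Apps.!Edit') == '!Edit'
    assert window_title('ADFS::HardDisc4.$.Docs.Report', modified=True) == 'Report *'
    assert window_title('Report', modified=False) == 'Report'
    print('core logic tests passed')
```

The design point is the separation itself: once the logic lives in functions that take plain values and return plain values, any environment that can call a function can test it.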
Excellent. This person clearly knows the principles and practices. “My work here is done!” Well, not really, but there will be people that do this sort of thing for a living. It’s my day-to-day work, but I still don’t put as much effort into testing RISC OS things as I probably should.
(sigh) I don’t know how to help this directly. However, I try to be inclusive of everyone in how I explain and write things. I include examples of where things have gone wrong and mistakes that I’ve made. There are many reasons for this (not least getting in the criticism before others can, and trying to humanise the subject matter and make it conversational), but there is one that is particularly pertinent here, both as a technique in how I write and because it’s entirely real. When I talk about something that went wrong, or a mistake that I made, it highlights the point being made, and shows that I’m just as fallible. Writing that way is – I think – amusing, and encouraging: “If he can make that mistake, then I don’t feel bad doing so”. It’s not so much what you do wrong as how you deal with the results, and if you can make it ok to make mistakes without being fearful, then that helps.

Some people find coding hard; they don’t think in the procedural way that means they can do it as easily (or for other reasons, but that’s the main one I’ve come across).

What keeps you from using the build service for testing?

Lots of answers on this one, which is good, as this was the main thing I wanted to understand. There’s also an ‘Other’ comment section which I’ll include at the end.
Ok, so not knowing about it is a really good reason to not use the system; can’t fault that. As with the previous answer, a little more communication of the intent would help, but still, I can’t expect 100% awareness, so I don’t think I’ll stress over this.

‘Code not being public’ is a fair comment. I can only say that the service doesn’t publish the source, nor even hold on to it for more than the time it takes to execute. Even then, that only mitigates such fears, and it’s entirely reasonable to not want to send your code (or binaries) to a public service. The only thing I’m taking away from this is that explaining what is and isn’t made available to others is important, because I don’t do that in the documentation. If I had, they might have had a different opinion; but even if they felt the same way, it would be with the correct information. On the other hand, ‘I don’t trust it’ got 0 responses, so maybe it was just a blanket ‘my code doesn’t leave me’… which I can’t argue with!

‘Code isn’t testable’ applies to many things: hardware drivers, desktop components, things that need more of the system than the service provides. There are many that aren’t appropriate. In the presentation and the recent Iconbar article I tried to give some examples of systems that seem inappropriate, and to show that even for those, basic crash tests can be written. The question is always whether it’s worth it, so fair comment really, and I’ve made some attempt to explain how things could still be tested when thought untestable.

‘I don’t know how to write test code’ is one that I expected to be a bit higher (I expected about 50%). My view is that ‘test code’ can be many things, and it’s very easy to get bogged down in what the code is, rather than what it’s trying to do. That’s why I spent a lot more time in the presentation trying to explain that simple programs can still be tests. Given the lower response here, maybe I over-did that.
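To illustrate the ‘simple programs can still be tests’ point: even a test with no clever assertions has value if it just runs the code over awkward inputs and fails loudly on a crash. A minimal, hypothetical sketch (the routine under test is a stand-in, invented for this example):

```python
# A bare-bones 'does it crash?' test: feed a routine a pile of awkward
# inputs and only check that it fails politely rather than blowing up.

def parse_colour(spec):
    """Toy routine under test: parse '#RRGGBB' into an (r, g, b) tuple."""
    if not spec.startswith('#') or len(spec) != 7:
        raise ValueError('bad colour: %r' % spec)
    return tuple(int(spec[i:i + 2], 16) for i in (1, 3, 5))

def crash_test():
    """Return the number of inputs that caused an unexpected failure."""
    awkward = ['#000000', '#FFFFFF', '', '#12345', 'red', '#GGGGGG', '#ff00ff']
    failures = 0
    for spec in awkward:
        try:
            parse_colour(spec)
        except ValueError:
            pass           # rejecting bad input with a clean error is fine
        except Exception:
            failures += 1  # anything else is the kind of crash we care about
    return failures

if __name__ == '__main__':
    assert crash_test() == 0
    print('no crashes')
```

That’s the whole test: no framework, no mocking, just “run it and see if it dies”, which is often the first useful step for code that has never been tested at all.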
‘I don’t know how to use the build service’ is (sadly) about where I expected. I had documented it with worked examples, and created git repos with the example integrations in. I responded to this on the forum thread, and the presentation talked more about it. Maybe once I release the FanController source it’ll make that easier, because people’ll see the (odd) way that I’ve done it. I also gave the worked example of Rick Murray’s ErrorCancel module, and that’s up on GitHub as a PR to make it clear exactly which bits changed. I still think that some sort of article walking through how you actually add build service tests to a repo would help.

‘I write my code on RISC OS’… yeah, I’m not sure what to do with that as an answer. Maybe I shouldn’t have included it as an option. ‘I … (need) … a native git client’ might be the same respondent, but it’s not obvious from the aggregate data. Either way, it’s basically the same answer as before – it needs something that I can’t give right now.

‘Time and energy are lacking’ and ‘I haven’t got around to it’ – these are pretty much where I thought they’d be. There are always things competing for time, especially with families and other commitments, so this is pretty much where you’d expect it to be for a niche tool in an enthusiasts’ community.
Oh so little time…
Neat. I think it matters more on larger-scale projects because you’ve got so many moving pieces. It still worries me, though, that the build service isn’t really used in anger by anything. It’s likely to fall over because it hasn’t been hardened by use. Of course, I could improve that by putting testing in place to check its behaviour… Hmmm….
Well, output can be piped to the system from other builds – so if you use gcc cross-compiling you could send the results to the service, etc. I think a little more discussion on this would be good, if it’s needed. I already had an exchange with one person who was working in an automated build environment, and there are some things I can probably do to make interworking better.

What factors might influence your use of automated testing for RISC OS software?

So this question comes to the secondary point of the survey… where there’s a lack of things. I largely included this as a general informational question, because I’m unlikely to be able to fulfill any of the needs alone. But it’s certainly interesting.
‘Examples of how to write tests’ and ‘Tutorials explaining how to test’ are the top responses. Which is interesting, because the survey was before I set out to do the presentation, and long before I wrote what became the Iconbar article. Maybe later I’ll do something more to explain how to write the tests – but there are a number of examples out there that I’ve done, so I’m feeling not too bad about that.

‘Better tooling’… well, I was thinking about libraries and possibly aids for processing results, but someone (Theo?) mentioned a more standardised way of invoking tests. That’s an interesting approach too, and not one that I’d been actively considering outside of my own collections. One vote, so I guess actually that’s not so important. Not what I had expected.

‘Better testing environment’ is pretty self-explanatory, ‘cos RISC OS sucks for testing, albeit Pyromaniac makes that less dangerous. Ideally the system shouldn’t be so fragile, and should allow forms of introspection. What that means is kinda up to others. I did my bit to make things less ugly and provide better information when things go wrong. Only 1 vote for that, though? That surprises me, and possibly indicates why there was so little enthusiasm for any stability improvements and the Diagnostic Dumps – what I felt was important is not important to others.

‘More collaboration’… Yeah, I think I mentioned this in the presentation, about ways that people can try to encourage better testing and better engineering in general. That really needs work from reviewers and the people you’re working with to make it happen. Still, only one vote, so… maybe not such a big deal.

Those 1-vote cases are interesting, ‘cos I think they’re all pretty important, but I guess less so to other people. I’m not sure if that’s just that my priorities are different, or that I understand what I mean and what those things can do for people. Should I worry about the fact that there were 0 votes for ‘feeling it would improve quality’?
That factor wouldn’t influence people? Huh. Maybe they’re already convinced it would, so that feeling isn’t an influence. Or maybe that was just a useless answer. Hey-ho. Then there were the ‘Other’ comments:
Yup, they’re similar, and they’re big influences. You want to do what’s most useful to you – or most fun, or most lucrative, or most educational, or whatever it is that you value – if you’re pressed for time, so I absolutely understand that.

Do you have any other feedback on automated testing on RISC OS?

This was an open question and it got a few answers.
Yup, examples are up on GitHub under ‘riscos-ci’, in many different forms, and there are specific examples on the build.riscos.online site, under the menu item ‘CI configuration’. Further links to the examples are listed on the Pyromaniac resource site under the link for ‘CI Examples’.
You’re welcome. I would love for others to produce a guide to automating tests of Wimp applications. Paolo’s very short demo of using Keystroke reminded me how it was back when I used it at school (really? that long ago?).
Feel free to let me know if you need more help. I can try to guide you through it or set up something for you to use, if that’s easier. At least, with Github.
Thanks; I know how hard it can be to find time! I have lots of little things that I really want to play with but… well something has to give ‘cos you can’t do everything!
I feel your pain. Those differences (except the crashing!) can be smoothed somewhat, but you won’t find many people wanting to do away with them, ‘cos if you wanted to run a Windows/Unix system, you’d do so. Hopefully the service, or at least the guidance for testing that I’ve given, helps. If not then… well, I tried.
It would be fun to work with. As I’ve said to others, Pyromaniac can be used in a more integrated way if people want to do that. The build service is kind of a ‘quick project to show what you can do’, which is actually useful (in my opinion), but there are a lot of other ways it might be useful.
Thank you both for your kind words.

Conclusions?

So it’s confirmed my belief that there needs to be more education (in different forms), although the survey was started before I did the presentation, which hopefully addresses some of the questions. Maybe a guide to using the build service might be useful; there are words on the build service site, but maybe an article would help. The ‘how to write test code’ issue is a common theme, which I tried to address, but it could certainly do with more work as well.

Thank you if you filled in the survey! I hope that the results and my comments about them have been useful. |
Charlotte Benton (8631) 168 posts |
While I don’t dispute the need for more sophisticated development tools, I think that Pyromaniac is by far the more interesting development, and perhaps one of the most interesting developments in the post-Acorn era. Being able to run RISC OS programs without whole-hog emulation of the entire OS is a formidable achievement. |
Andrew McCarthy (3688) 605 posts |
Wholeheartedly agree; it joins a list that includes Iris, 4K widescreen support, RPi compatibility, updated C tools, an AI library, Python, Zap, … |
Rick Murray (539) 13850 posts |
Two quick notes (I’m on break at work). Firstly, the “I don’t trust it” could be taken in two different ways: the security/confidentiality of uploaded code (which is the interpretation you’re using), or the overall validity of the results.

And secondly, I wonder if those who find coding hard are trying to use assembler? In 202x there should be little reason to create any new projects in assembler, because that ties you to a whole host of issues (register assignment, stack balancing, indexing data, blah blah, not to mention being extremely sensitive to changes in the processor itself) that people invented compilers to do for us. Given that a number of modern ARM cores don’t even execute the code that you’ve written in the way that you’ve written it, all I can see in creating new projects in assembler is a complicated maintenance nightmare. Of course, that isn’t to say that C is easy; that too has plenty of its own quirks. But at least if you have a compiler and libraries to match your machine, rebuilding the project ought to be most of the job done. Though, as you say, it does require a certain logical mindset that operates purely in the realm of yes and no logic (no “maybe”; there’s no binary value for maybe, except in quantum computing where your logic may be yes or may be no… :-) ). |
Charles Ferguson (8243) 427 posts |
I honestly hadn’t thought of it that way. It’s a pretty fair interpretation too, and actually it’s one that applies to most testing in general, although we don’t phrase it that way. As your system gets more complex, your ability to actually find real problems decreases, as does your confidence that you’re testing what you think you’re testing, which reduces your trust in the tests’ usefulness. The build service isn’t exactly the system that you’ll be working with, so you can certainly say that. Still, nobody responded with that option; maybe because there was an ambiguity that I didn’t see.

However, there are advantages in testing on fake systems beyond the fact that they let you test in different ways. They can find stupid mistakes because the system isn’t the way it’s meant to be, essentially exposing where error handling or assumptions are invalid. That’s indirect testing, I guess, but it’s still valuable if what you get out at the end is a more robust system. |
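A sketch of that ‘fake systems find stupid mistakes’ idea, with entirely invented names: hand the code under test a stub interface whose calls deliberately fail, and see whether the error paths cope. This is a hypothetical Python illustration of the technique, not any real API.

```python
# Hypothetical illustration: run a save routine against a deliberately
# hostile fake of a file interface, so the error-handling path is
# exercised without needing a real (or broken) filesystem.

class FakeOS:
    """Stands in for the real file API; every operation fails."""
    def open_out(self, name):
        raise IOError('Disc drive empty')

def save_document(os_iface, name, data):
    """Code under test: returns True on success, False (not a crash) on error."""
    try:
        handle = os_iface.open_out(name)
    except IOError:
        return False   # a robust app reports the error rather than dying
    handle.write(data)
    handle.close()
    return True

if __name__ == '__main__':
    # On the fake system the save must fail *gracefully*, not crash.
    assert save_document(FakeOS(), 'Doc', b'hello') is False
    print('error path handled')
```

On the real system that error path might be hit once a year, when a disc is full or missing; on the fake it is hit every run, which is exactly the ‘system isn’t the way it’s meant to be’ value described above.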
Steve Pampling (1551) 8172 posts |
The thing is, Rick, a test is a test. Whether it answers all the questions you may have is a different issue: it’s still a valid test, but perhaps not a test that answers your specific question. In a network setup, I can try to use a browser to connect to a web server. It fails. From that result, what do I know? You see, the moment you start testing, you need to put all the questions there and strike off the ones that are answered by the test. A list of what kinds of answers the system can give, and what it can’t, might help. |
Rick Murray (539) 13850 posts |
No, it’s very much not. Let us consider a potentially faulty memory decode on a Beeb. Working with software isn’t really that much different: you have an expected behaviour and you have an actual behaviour. When they don’t match, it is necessary to devise tests to try to work out why. But, note, depending on how the code was written, a number of things may behave when run in the debugger but fail when run on a real system. Before anybody says “it’s the compiler! the debugger code isn’t optimised so it works!”, one must remember that a debugger presents a sanitised world, with such aids as clearing memory before use (making it easy to see what was written where). But… if the bug was an unset pointer, under the debugger it may hold the value null, so all the null testing will work. On a real system, on the other hand, the unset pointer will have the value of whatever happened to be in memory. The null testing will fail because the pointer isn’t null. Bang. That latter part, the memory zeroing, comes down to trust. Do you trust the debugger to run the same code in the same way? You shouldn’t, because it doesn’t. ;-) So “a test is a test” only if you’re looking at it from very far away.
How much faith (another word for trust) do you have in the accuracy of this list? What is its provenance? The manufacturer? StackOverflow? A blog post? How up to date is it? How up to date is the thing you’re testing, for that matter? |