Strings in BASIC
Rick Murray (539) 13840 posts |
[extracted from “Hello all” thread]
It has? As far as I can see, there is just a brief mention of it in the really-techie stuff in the description of CALL.
The ARM evaluation reference (at Chris’ Acorns) says that this will only allocate 256 bytes:
(I dread to think what the Beeb does here!) Does it still hold true now?
Surely this is the time when string termination is an issue that will be exposed to the user? Nobody cares what BASIC uses internally, “it just works”, and passing regular strings to and from SWIs will “just work”. However, something like this will fail:
Actually, a lot of OS routines are tolerant of BASIC style line termination; however, it may catch you out in the reverse direction, like this:
The information is there, but BASIC won’t display anything. Why? It doesn’t recognise what is there as a string.
Sounds like useful content to add to your BASIC help file. ;-) That and the dozen other urban myths I’m sure BASIC has accumulated along the way. Contrary to what certain people have implied, I’m actually quite fond of BASIC. It might be “complicated” to write multitasking applications, and it might lack structuring (your mods excepted!), but it is always present and unlike some languages it doesn’t change horribly from one release to the next, it is just “reliably there” (like a favourite stuffed teddy bear) and it sports a rather nifty built in assembler for when you want to get that extra special something from the machine. As it is built into the OS itself, there’s absolutely no baggage whatsoever (back in the days when we all had to RMEnsure FPEmulator, SharedCLib, etc), BASIC programs just… you know… started. The end. Here’s BASIC. What more do I need to say? |
Martin Avison (27) 1494 posts |
With Reporter running, it is easy to see exactly what happens when that code is run…
… and the Reporter output is…
which shows the initial state: no variable space used, and the $size shows no free string blocks of any size.
After the REPEAT is run …
Still no free string blocks, but the variable space is now using 268 bytes, which is…
This variable memory contains…
4 bytes of the chain to the next variable starting with a (i.e. none).
2 bytes of the variable name (less the first character) and terminator.
5 bytes of SIB (addr 00008F70, 255 length)
1 byte pad to word boundary
255 bytes of string (starting at 8F70)
1 byte pad to word boundary
I seem to remember that BBC Basic would waste more space than this! |
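As a quick check of the arithmetic in that breakdown, the listed parts can simply be summed. This is only a tally of the figures quoted in the post, written in Python for convenience; the labels are paraphrases and the names `layout`, `total` and `header` are made up for the illustration.

```python
# Byte counts exactly as listed in the post; the labels are paraphrased.
layout = [
    ("link to next variable in chain", 4),
    ("rest of name plus terminator", 2),
    ("SIB: 4-byte address + 1-byte length", 5),
    ("pad to word boundary", 1),
    ("string data", 255),
    ("pad to word boundary", 1),
]

total = sum(size for _, size in layout)
header = total - 256  # everything except the string and its pad byte

print(total)   # 268
print(header)  # 12
```

So the 268 bytes reported are the 256-byte word-aligned string plus 12 bytes of variable bookkeeping.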
Phil Hanson (2558) 75 posts |
I am jealous of the knowledge of you individuals. |
Dave Higton (1515) 3526 posts |
It has been built up over the years by lots of trial and error. If you persist, you will amass lots of knowledge too. |
Steve Drain (222) 1620 posts |
The use of SIBs is documented on the 2nd page of CALL in a table and a full paragraph describing it. Over the page is a neat diagram showing how it works. That is not “just a brief mention” and it is not in the “really-techie stuff”, which occupies the next 10 pages. I cannot see how it could be better documented. Anyone who is going to make statements about how BASIC V stores strings should have read at least this. In addition, there have been documents from SW available for a very long time. They are archived on the web, although I had mine direct, and they have been mentioned several times in c.s.a.programmer. These describe how the string allocation system was designed and some possible problems that were highlighted by Minerva, I think.
REPEAT a$ = a$ + "*" : UNTIL LEN a$ = 255
As I said was possible, this is one of those rare examples of the way BASIC will release an allocation, which only occurs when it is the last item added to the heap. So the only allocation that is permanent is the last one of 255 bytes – word-aligned that is 256.
REPEAT a$ = a$ + "*" : b$ = b$ + "*" : UNTIL LEN a$ = 255
This does not hold here, and the allocations take over 16k. It is a somewhat pathological example, but is similar to the problems that were raised in the documents mentioned above.
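The difference between the two loops can be illustrated with a small model. This is a sketch in Python, not BASIC's actual allocator, and it rests on stated assumptions: allocations are rounded up to whole words, a string only moves when it outgrows its current block, a block can grow in place only if it is the last thing on the heap, and otherwise a fresh block is taken from the top while the old one is not reused (in these loops no later request matches a freed size). The names `heap_used` and `WORD` are invented for the illustration.

```python
WORD = 4  # bytes per word


def heap_used(names, final_len=255):
    """Bytes of heap consumed when each named string grows by one
    character per loop pass until it is final_len characters long."""
    heap_top = 0
    blocks = {}  # name -> (address, capacity in bytes)
    for length in range(1, final_len + 1):
        for name in names:
            needed = (length + WORD - 1) // WORD * WORD
            addr, cap = blocks.get(name, (None, 0))
            if needed <= cap:
                continue  # still fits in the current block
            if addr is not None and addr + cap == heap_top:
                # Last item on the heap: extend in place.
                heap_top += needed - cap
                blocks[name] = (addr, needed)
            else:
                # Not last: take a new block, abandon the old one.
                blocks[name] = (heap_top, needed)
                heap_top += needed
    return heap_top


print(heap_used(["a$"]))        # 256
print(heap_used(["a$", "b$"]))  # 16640
```

Under these assumptions the single-string loop ends with one 256-byte block, while interleaving two strings leaves every intermediate size behind for both, which is where a figure of just over 16k comes from.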
Exactly. It is just necessary for everyone to recognise this and not tell users that they need to terminate such strings with a NULL. I am in no way referring to data blocks containing strings, which is indeed a difficult problem for BASIC – “it will bite you”. ;-)
REM *VERY* contrived example!
And I expect I could contrive an example for C, or other language, that would fail as well. If you include control characters in a string you do so at your own peril. It is an interesting feature of BASIC V string variables that they can contain any control characters, including NULL and CR.
Not:
SYS "OS_Args", 7, 255, a%,,, 256
but:
SYS "OS_Args", 7, 255, a%,,, 256 TO ,,pathname$
Easy. Do not raise false problems through ignorance. Here is an old, and very well known ‘trick’ that you should have picked up if you were paying attention, even in the last few months. It means you do not have to DIM a% as a buffer:
SYS "OS_Args", 7, 255, STRING$(255," "),,, 256 TO ,,pathname$
I know that. Perhaps the bad thing about BBC BASIC is the flexibility that it gives programmers, and thus the scope to do things inefficiently.
In days gone past on c.s.a.programmer there were dozens of people to quash the myths, and it is from them that I picked up most of what I know. A lot of that has been distilled into the BASIC StrongHelp manual |
GavinWraith (26) 1563 posts |
One of the educational advantages of gaining acquaintance of several different programming languages is understanding that the meaning of words like string or function can differ radically. Strings in BASIC are not the same as strings in C, though in both these languages strings are updateable objects whose content is stored as an array of consecutive bytes. In Lua strings are not updateable and not necessarily held as an array. They can be of any length (subject to available memory) and can contain characters of any ASCII code. Their content is determined by where in memory they are stored, so you cannot have two copies of the same string! It is the old story: different datatypes do different jobs. The important job for Lua strings is comparison, which involves only comparing two addresses and so is independent of string-size. You rarely get something for nothing, so it is the creation of strings in Lua that has to work harder. By contrast, in C string-creation is fast, but the time taken to compare two strings must depend on their length. This leads to the need for rather different string-handling algorithms, as those appropriate for C may be very inefficient for Lua. The trick is not to concatenate substrings in Lua until you have to; that may be never if your program will be writing the result out to a file. So long as you keep track of the order in which the substrings will have to be written, this strategy can be efficient. Part of the job of a programming language is to furnish the programmer with a good mental picture of its operational semantics. The danger of sticking to one type of language is falling into habits of thinking that are unnecessarily constricted. |
Rick Murray (539) 13840 posts |
Well, three lines and a slightly more complicated diagram (as it is describing “a,a(),a$” on entry into CALL).
I’ve read through the “hello all” thread, and the first mention was Steve Pampling who said this:
Followed by myself who said this:
We are dealing with the VISIBLE side of strings. The internal representation is not of consequence (just as I’m sure not many people know that VisualBasic5 stores strings in UTF-16 internally). We were only looking at what can be seen and felt. Given that SWIs would be used heavily when interacting with the OS, would you care to show me where this is incorrect? At this stage, nobody was talking about internal representation; although it is an understandable mistake as the null termination in C strings is an integral part of the string itself… The only thing I recall saying at this point that was slightly off track is this:
I should have made it clearer that I was referring to strings in data blocks and not given to the SYS command (although, unnecessary as it is, it won’t hurt anything).
<sigh> I said: “Yes, I know you could replace a% with a$, but I’m keeping it a simple example. It isn’t so easy when it is a buffer of various data that is returned, such as reading the current directory’s catalogue.” Namely, yes I know you can get BASIC to return a result directly by using a string instead of a memory pointer. This is because I wanted an example that would work simply. Okay, then, I should have done this in the first place…
Now one can’t just “use a string instead”, you need to do something to make the string work…
Nope, I’ve never seen that before. I’m guessing STRING$() allocates space on the heap, that is used as the pointer to the SWI which BASIC then ignores and copies the data into the pathname$ string. Correct? |
Steve Drain (222) 1620 posts |
Should I? Shouldn’t I? What the hell! Ding! Ding! Round 3 ;-) The documentation on SIBs is clear, concise and comprehensive. It is sufficient to its purpose and is quite visible in the most appropriate part of the manual. Apart from an index entry, what more do you want?
Question from Phil H
This could be asking about the way information in arrays is stored, but it might just be seeking clarification of what the statement does.
Answer from Steve P
That is a clear answer and I think it would have been quite sufficient.
This, however, describes how the information is stored and it is wrong. First, string variables, which includes array elements, can be up to 255 characters, not 254. Second, string variables are not held in a terminated buffer, which is how I interpret “big enough to hold”.
Answer from Rick M
The only way this correction could be necessary is by assuming that strings are stored in terminated buffers. It reinforces the bad meme.
Then why did you say something about just that?
Statements from Rick M
Yes you should, because otherwise what you said was wrong and reinforces the bad meme.
We are entirely in agreement that strings in data blocks are a significant problem when programming in BASIC.
Indeed you do. Instead of:
>DIM buffer% 255
>SYS "OS_GBPB", 12, "$", buffer%, 1, 0, 256, 0
>PRINT $(buffer% + 24)
Try:
>DIM buffer% 255
>SYS "OS_GBPB", 12, "$", buffer%, 1, 0, 256, 0,,,buffer%+24 TO ,,,,,,,,,filename$
>PRINT filename$
That is a ‘trick’ of my own invention, but I think I have probably only mentioned it once, on c.s.a.p a long while ago. I think it is generally safe, because it uses R9, but I am prepared to be gunned down. Anyway, I have much better ways to deal with data blocks these days. ;-)
My sincere apologies. I cannot find a mention of it in this forum, but I have mentioned it in c.s.a.p recently. I plead my age. ;-(
Not quite. The STRING$ expression is a string parameter to SYS, so it is copied to the stack, NULL terminated, and a pointer to it is passed to the SWI. It therefore acts as a buffer. On return, the NULL-terminated string in that buffer is copied to the string variable after TO before the stack is emptied of the SYS call information. Edited to add: Just to avoid confusion, the STRING$ ‘trick’ can only be used when returning a single string. It cannot be used to return a data block unless you take the risks on your own head. |
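The sequence described in that reply (copy the string, NUL-terminate it, pass its address, then read a NUL-terminated string back into the variable after TO) can be mimicked in a few lines. This is a Python sketch of the described behaviour only. `sys_with_string` and `fake_read_path` are invented names, the "SWI" is a stand-in function, and the pathname it writes is a made-up example.

```python
def sys_with_string(arg, swi):
    """Model of passing a string to SYS and reading one back via TO."""
    # The string argument is copied (this bytearray stands in for the
    # stack) and NUL-terminated; the SWI is handed that copy.
    stack_copy = bytearray(arg.encode("latin-1")) + b"\0"
    swi(stack_copy)
    # On return, the NUL-terminated contents become the TO variable.
    end = stack_copy.index(0)
    return stack_copy[:end].decode("latin-1")


def fake_read_path(buffer):
    """Hypothetical stand-in for a SWI that writes a pathname."""
    path = b"ADFS::HardDisc4.$.File\0"
    buffer[:len(path)] = path


# 255 spaces plus the terminator give the SWI a 256-byte buffer.
pathname = sys_with_string(" " * 255, fake_read_path)
print(pathname)  # ADFS::HardDisc4.$.File
```

Because only the text up to the first NUL is copied back, the model also shows why this suits a single returned string and not a block of mixed data.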
Steve Drain (222) 1620 posts |
Rat-a-tat-tat. It was safe when I invented it, but it relied on a bug and, since RO4 and RO5, R9 is not preserved over calls to the SWI despatch code. It may be safe for specific SWIs, but it is not future proof. |
Steve Drain (222) 1620 posts |
Clearly the indirected strings problem has irked a lot of people for a very long time, so I thought I would see what needed to be done in BASIC itself to make such strings readable with any CTRL character terminator, rather than just CR. It should be trivial to produce a new version of BASIC to do this, but that would not apply to any previous ones, so it would not be possible to write programs that could be used universally. It turns out to be fairly straightforward to patch any BASIC module to do this: just 3 instructions need to be changed. I have written some code to do this which you can download from: http://kappa.me.uk/Miscellaneous/swIndirected.zip I would be interested to hear what pitfalls there might be, but if it is sound it could be included with any application. |
Steve Pampling (1551) 8170 posts |
Aw, now look what you’ve done: I’m not wrong about the termination anymore. :) |
Rick Murray (539) 13840 posts |
I wouldn’t say any control character – Tab probably ought to go through unmolested. But certainly stopping on nulls would be useful. Is it possible to make indirected strings automatically null terminated, or would that break everything? |
Steve Pampling (1551) 8170 posts |
Hmmm, what happens to a BASIC string containing TABs for spacing?
If the alteration is a simple &0D or &00 surely that covers the common versions. |
GavinWraith (26) 1563 posts |
For purposes of comparison, in RiscLua:
$(address, n) gives the string of n bytes starting at address.
$(address) gives the string from address up to, but not including, the first byte with ASCII code less than 32.
$[address] gives the string from address up to, but not including, the first ASCII NUL.
A string (without ASCII NULs) given as a register-argument to sys is stored as a NUL-terminated string in a dedicated buffer (one buffer per register) and the buffer’s address is passed to the SWI. A nil argument is also replaced by the buffer’s address, so the sys-command uses the same string that the last sys-command used. There are a number of circumstances where this is useful, thanks to RISC OS’s sometimes sensible register-allocation policies. |
Steve Drain (222) 1620 posts |
Have either of you looked at the !ReadMe? If you are to patch a module you can only substitute instructions, not add them. So, it would be possible to change from CR to NULL for both reading and writing, but I think that is out of the question. There are likely to be so many bits of existing code that assume the CR that something is bound to break. Hence, writing anything other than CR is out, so you must still catch that when reading. With only one instruction to change, if you also want NULL, you will have to have all CTRL characters, or at least those less than or equal to &0D, which includes TAB. The main question then is whether any NULL-terminated strings are passed to or received from the OS as part of a data block, that contain CTRL characters other than the terminator. Rick’s PrettyPrint example is disqualified, because that is covered by the normal passing of strings by SYS. |
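The trade-off being discussed (stop on CR only, or on anything up to and including &0D) can be shown with a small model of reading an indirected string. This is a Python sketch, not the patch itself. `read_indirected`, `cr_only` and `up_to_cr` are invented names, the sample data is made up, and returning an empty string when no terminator is found within 255 characters is an assumption chosen to match the "BASIC won't display anything" behaviour mentioned earlier in the thread.

```python
MAX_LEN = 255  # longest BASIC string


def read_indirected(mem, addr, is_terminator):
    """Read the string at addr, stopping at the first terminator byte."""
    for n in range(MAX_LEN + 1):
        if addr + n >= len(mem):
            break
        if is_terminator(mem[addr + n]):
            return mem[addr:addr + n].decode("latin-1")
    return ""  # assumption: no terminator found gives an empty string


def cr_only(byte):
    """Unpatched rule: only CR (&0D) ends the string."""
    return byte == 0x0D


def up_to_cr(byte):
    """Patched rule as described: any byte <= &0D ends the string."""
    return byte <= 0x0D


nul_block = b"ADFS::4.$\x00" + b"\xff" * 300  # NUL-terminated, no CR nearby
tab_block = b"col1\tcol2\r"                   # CR-terminated, contains TAB

print(repr(read_indirected(nul_block, 0, cr_only)))   # ''
print(repr(read_indirected(nul_block, 0, up_to_cr)))  # 'ADFS::4.$'
print(repr(read_indirected(tab_block, 0, cr_only)))   # 'col1\tcol2'
print(repr(read_indirected(tab_block, 0, up_to_cr)))  # 'col1'
```

The second pair of results is the TAB concern raised above: the wider rule makes NUL-terminated strings readable, but it also cuts a CR-terminated string short at the first TAB.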
Steve Drain (222) 1620 posts |
Apologies to Gavin ;-) For the purpose of comparison, in Basalt:
RETURN$(address%,end%) gives the string of end%-address% bytes starting at address%.
RETURN$(address%) gives the string from address% up to, but not including, the first byte with ASCII code less than 32.
At one time it did:
RETURN$(address%,char%) gives the string from address% up to, but not including, the first byte with ASCII code char%
It would be simple to reinstate that. At one time there was also $$ as a synonym for RETURN$, so $$(address%). |
Steve Pampling (1551) 8170 posts |
Yup, and I understood the difficulty before commenting. If I’m reading things correctly, the problem is that the patch is a slightly blunt instrument in changing from &0D only to everything less than &20. I believe the requirement is not a simple patch, rather a re-write of that portion of the BASIC implementation to allow more than one specific terminator. |
GavinWraith (26) 1563 posts |
None needed at all. I think it is good to know about other possibilities. The whole business of choice of syntax, what operations to build in, and what to be left out, and what compromises have to be made, fascinates me. The aesthetic judgments matter. Programming languages evolve over time, under pressure from users for more (or even for fewer!) features, mostly. There was a book (sorry I have forgotten the author) which gave an annotated disassembly of BBC BASIC that I found very interesting. It is a shame that there have not been subsequent books covering more recent incarnations of BASIC. |
Steve Drain (222) 1620 posts |
Here is another possibility. I addressed the specific problem of SWI blocks slightly differently. Basalt reserves a permanent 256-byte block, primarily for use with the Wimp. Its address is returned by the keyword BLOCK, but data is written to it and returned from it in a more controlled manner than just using block%. Writing is done using the pseudo-variable syntax:
BLOCK!(offset%)=integer%
An error is generated if an integer is not word-aligned or the offset or end of a string falls outside the block. Reading is a function:
integer%=BLOCK!(offset%) |
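The checks described for that block (word alignment for integers, and nothing allowed to fall outside the 256 bytes) are easy to picture in code. The following is a Python sketch of those rules only and says nothing about how Basalt implements them. `CheckedBlock` and its method names are invented, and ending a written string with CR is an assumption borrowed from BASIC's usual convention.

```python
import struct


class CheckedBlock:
    """A fixed 256-byte block with bounds and alignment checks."""

    SIZE = 256

    def __init__(self):
        self.data = bytearray(self.SIZE)

    def _check_word(self, offset):
        if offset % 4 != 0:
            raise ValueError("integer offset is not word-aligned")
        if not 0 <= offset <= self.SIZE - 4:
            raise IndexError("offset falls outside the block")

    def write_int(self, offset, value):
        self._check_word(offset)
        struct.pack_into("<i", self.data, offset, value)

    def read_int(self, offset):
        self._check_word(offset)
        return struct.unpack_from("<i", self.data, offset)[0]

    def write_str(self, offset, text):
        raw = text.encode("latin-1") + b"\r"  # assumed CR terminator
        if offset < 0 or offset + len(raw) > self.SIZE:
            raise IndexError("string falls outside the block")
        self.data[offset:offset + len(raw)] = raw


block = CheckedBlock()
block.write_int(20, 1234)
print(block.read_int(20))  # 1234
```

A misaligned write such as `block.write_int(2, 1)` raises an error here instead of silently corrupting neighbouring data, which is the point of routing access through checked calls.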
Steve Drain (222) 1620 posts |
The source is available, but despite the efforts of many brave souls, it is by no means fully documented. Those who knew their way around the code backwards are more or less absent from the scene now. |
Steve Drain (222) 1620 posts |
That would be easy, say NULL, LF and CR which BASIC does elsewhere. However, it would be pointless writing code to take advantage of it, because it would not work on any version except the latest, and the problem of dynamically changing the BASIC module has been done to death. It is really a no-go. |