Transitioning to Unicode alphabet
Ben Avison (25) 445 posts |
This post was prompted by the discussion of !Chars, but I’ve split this off into a separate topic. I think it’s being a little overlooked that there’s a wider problem here – with everyone focusing on Wimp_ProcessKey and the Key_Pressed event, remember that there are a load of other places where text which is (at least potentially) user-visible is passed around the OS. These are things like error messages, filenames, disc names, device names, interactive help messages, and so on. You might argue with hindsight that all those APIs should have featured an encoding along with the text, but it’s far too late to change that now; you’d break almost everything by doing so.
Acorn’s solution was to choose a system-wide alphabet, and let that be the implied encoding of all such strings. Now, this was a far from perfect solution, but it was the most practical one IMHO. And actually, with the advent of UTF-8, choosing a single system-wide alphabet has fewer drawbacks than ever. I hope it’s uncontentious that in all my examples, it’s the meaning of the text which is important, not any typesetting attributes, so pure Unicode is a good fit.
A number of East Asian Territories/Countries were defined to use the UTF-8 alphabet a decade ago – after all, that was the driving force behind the development of the Unicode font manager, keyboard handlers, Wimp, IMEs etc. It’s a little embarrassing that after all this time, most of us are still stuck using Latin-1. A large part of the problem is that to make such a switch, all the applications you use have to have their user-visible text converted into UTF-8. But with people’s expectations from other OSes increasingly being that full Unicode support is available everywhere, isn’t it time we thought about starting to make the transition? UTF-8 support in applications really isn’t that hard, you know – most of the tricky stuff is handled for you transparently by the Font Manager.
Converting an application’s user-visible text from Latin-1 to UTF-8 can even be done without access to the source code if the application uses Messages and/or Toolbox Res files. Where an application edits text, the main change is that you need to handle cursor navigation and character deletion differently, to ensure that you deal in whole characters rather than just bytes – there’s a good reference for how to do this in the Wimp in the form of two routines called skipcharL and skipcharR. These are coded defensively so as to follow the same rules for malformed byte sequences as are used in the Font Manager. (The Font Manager uses the Unicode replacement character glyph for malformed sequences, which may not be immediately apparent if you don’t have a font that defines it.) The Font Manager’s string split calls are the same as they ever were, returning byte indexes, so no code change is needed there. Character input code can also be left alone: the multiple Key_Pressed events used for Unicode characters above 127 have higher priority than redraw events, so any intermediate states won’t be visible to the end user, and applications don’t need to amalgamate them into individual Unicode characters themselves unless they want to.
More of a problem is that your average application author isn’t going to want their application to display incorrectly on older OSes like RISC OS 3 or 4, or even RISC OS Six, and won’t want to rely on users having all the ROOL Unicode modules softloaded. So your typical application is going to need to contain both Latin-1 and UTF-8 versions of its resources. The logical way to permit this is by using the pre-existing Territory system to define a mirror set of Territories which differ only by mandating the use of UTF-8. If this increases awareness amongst application authors of how to support other languages in their applications, as a by-product, then that would be no bad thing!
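To illustrate what "dealing in whole characters" means for cursor movement and deletion, here is a minimal sketch in Python. It is not the Wimp's skipcharL/skipcharR code and does not reproduce the Font Manager's rules for malformed sequences; the function names are hypothetical, and it assumes sequences of at most four bytes.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which have the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def skip_char_right(buf: bytes, index: int) -> int:
    """Return the byte index just past the character that starts at index."""
    if index >= len(buf):
        return len(buf)
    index += 1
    steps = 0
    # Assumption: at most three continuation bytes follow a lead byte.
    while index < len(buf) and steps < 3 and is_continuation(buf[index]):
        index += 1
        steps += 1
    return index


def skip_char_left(buf: bytes, index: int) -> int:
    """Return the byte index of the start of the character before index."""
    if index <= 0:
        return 0
    index -= 1
    steps = 0
    # Walk back over continuation bytes until a lead byte (or the start) is found.
    while index > 0 and steps < 3 and is_continuation(buf[index]):
        index -= 1
        steps += 1
    return index
```

Deleting the character before the caret then becomes removing the bytes between skip_char_left(buf, caret) and caret, rather than a single byte.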
However, this does present us with another problem to overcome: the allocations for Territory and related numbers are already creaking at the seams, and can’t cope with such a doubling of allocations as it currently stands. A quick recap of related quantities:
You might well ask why the concepts of Country and Territory are separately configurable. I’m not aware of any official justification for this. But thinking about it, I reckon that in the days before the keyboard handlers were alphabet-agnostic, it didn’t make any sense to configure an alphabet which didn’t match the one your keyboard layout expected, so in practice you’d almost always just want to configure a Country and not a separate Keyboard and Alphabet (and note that Keyboard and Alphabet don’t have CMOS allocations). And just because you were typing on a (say) French keyboard wouldn’t necessarily mean that you wanted your user interface to be presented in French – and the user interface should always be chosen based on Territory, not on Country. As an aside, it’s worth noting that the Toolbox documentation is out of date regarding the suffixes used for Messages and Res files – it says that it uses the Country number, whereas since Toolbox 1.37, it uses the Territory number instead (and it searches for the suffixed file before the unsuffixed one). Judging by its date, I suspect this feature was first released with RISC OS 4. I note that the allocations header suggests that the limitations in OS_Byte 71 could be addressed “by doing something constructive with R2”. But I think I’d argue that since up till now, it had no defined value if called via the SWI (you can’t assume it was called as a result of a *FX command, which would have zeroed R2) and as a result that’s not a terribly useful idea. Besides, anyone monitoring ByteV for changes of country, alphabet or keyboard isn’t going to be reading R2. I think the least worst option is to simply use the upper bits of R1 in OS_Byte 70/71 to expand the range of Country and Keyboard codes. All current callers will implicitly be zeroing these, and there isn’t anything in the implementation of *FX that prevents R1 from being passed values greater than 255. 
Yes, we’d break BBC micro compatibility, but I think that’s a small price to pay to gain system-wide Unicode compatibility. A sensible choice would be to allocate the mirror copies of Country/Territory numbers offset by 256, thereby steering clear of all existing allocations. And if someone does happen to be ANDing R1 with &FF in ByteV, it should degrade as gracefully as can be expected. We’d also need to sort out how to store larger Country and Territory numbers in CMOS. Although spare bits in CMOS are as rare as proverbial hen’s teeth, this feels like one of those very rare cases which could justify a new allocation. I’m torn between allocating one additional bit on each of Country and Territory, and lavishing an entire byte between them (4 bits each) – this would potentially allow for the allocation of an additional 768 Countries and Territories. Given that there are currently 204 sovereign states in the world, and many of those will use multiple languages, that feels like a properly future-proof number. Of course, all the new mirrored Country and Territory allocations would need distinct, snappy names, and preferably something language- and locale-independent. My best idea so far is to use the original name with a ‘+’ suffix, so where Country 1 is “UK”, Country 257 would be “UK+”. The plus should be reminiscent of the “U+nnnn” naming used for Unicode character identification. One good thing about this scheme is that it could be phased in – the default configuration for normal users could stick with traditional Territories for the time being, while developers could test their applications by reconfiguring to the new value. Eventually we could switch the default over to UTF-8 Territories when enough key applications have been converted. Users would always have the option to configure their machines back to Latin-1 if absolutely necessary. 
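The numbering and naming proposal above can be sketched as follows. This only illustrates the arithmetic of the proposal; the helper names are hypothetical and nothing here corresponds to an existing RISC OS API.

```python
MIRROR_OFFSET = 256  # proposed gap between a Country/Territory and its UTF-8 mirror


def mirror_number(number: int) -> int:
    """Return the proposed UTF-8 mirror of an existing 8-bit allocation."""
    if not 0 <= number < MIRROR_OFFSET:
        raise ValueError("expected an existing allocation in the range 0-255")
    return number + MIRROR_OFFSET


def mirror_name(name: str) -> str:
    """Return the proposed name of the mirror: the original plus a '+' suffix."""
    return name + "+"


def legacy_view(number: int) -> int:
    """What code sees if it ANDs R1 with &FF: the original allocation."""
    return number & 0xFF
```

With these definitions, Country 1 ("UK") maps to 257 ("UK+"), and code that masks the value to 8 bits still sees Country 1.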
One other thing while I remember it: when converting menu icon strings to UTF-8, note that to represent the “Shift” arrow icon for shortcuts, you need to use its proper Unicode character U+21D1 (UTF-8 sequence E2 87 91), and similarly for the other WimpSymbol characters if you happen to use them. This gets rid of the long-standing kludge in the Wimp where it appropriates otherwise unused code points in other alphabets for its own uses. The Wimp will still switch the font to WimpSymbol for you when it encounters them if (as will usually be the case) your Desktop font doesn’t define them – though this was always intended to be a temporary measure: IIRC the plan was for the Font Manager to take care of such font substitutions in general, much like it appears the RUFL library does. I still think this would be desirable, as it would permit arbitrary Unicode characters to be used anywhere, not only in applications which use RUFL to do their rendering. |
Chris (121) 472 posts |
What needs to be done to make the Wimp work properly in a UTF-8 system alphabet (i.e. menu text isn’t corrupted, etc.)? Is it just a case of working through Messages files, Templates and menu/error text and converting it to UTF-8 strings? Or is coding required in Wimp components? |
Ben Avison (25) 445 posts |
For all constant text, yes – it’s just a matter of processing Messages, Templates and Res files. This should be fairly easy to automate: in most cases it’s just an Acorn Latin-1 to UTF-8 conversion. Where a string is displayed in the desktop font, a little care needs to be taken of the extra code points in Acorn Latin-1 which the Wimp reuses for certain symbols:
Code | Character | Unicode replacement | UTF-8 codes
&80 | ✔ | U+2714 HEAVY CHECK MARK | &E2 &9C &94 (older software)
&80 | € | U+20AC EURO SIGN | &E2 &82 &AC (newer software)
&84 | ✘ | U+2718 HEAVY BALLOT X | &E2 &9C &98
&88 | ⇐ | U+21D0 LEFTWARDS DOUBLE ARROW | &E2 &87 &90
&89 | ⇒ | U+21D2 RIGHTWARDS DOUBLE ARROW | &E2 &87 &92
&8A | ⇓ | U+21D3 DOWNWARDS DOUBLE ARROW | &E2 &87 &93
&8B | ⇑ | U+21D1 UPWARDS DOUBLE ARROW | &E2 &87 &91
It’s not the Wimp that’s responsible for Euro sign insertion, but I’ve included it there for reference, as if you encounter &80 in a text string, you’ll need to judge for yourself which character was intended. I definitely recall that Draw needs to understand how to do deletion when entering text in UTF-8 – however, when you edit an existing string, it uses a dialogue box, so the Wimp handles that for it. I can’t remember offhand whether Edit is fully UTF-8 aware or not. |
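As a rough idea of how the conversion of desktop-font strings might be automated, here is a minimal Python sketch using the symbol codes listed above. The function and parameter names are hypothetical. One loud assumption: every byte not in the symbol table is decoded with Python's ISO 8859-1 codec, which is only an approximation of Acorn Latin-1 for the remaining codes in the &80 to &9F range.

```python
# Codes the Wimp reuses for symbols, mapped to their Unicode replacements.
WIMP_SYMBOLS = {
    0x80: "\u2714",  # HEAVY CHECK MARK (older software)
    0x84: "\u2718",  # HEAVY BALLOT X
    0x88: "\u21D0",  # LEFTWARDS DOUBLE ARROW
    0x89: "\u21D2",  # RIGHTWARDS DOUBLE ARROW
    0x8A: "\u21D3",  # DOWNWARDS DOUBLE ARROW
    0x8B: "\u21D1",  # UPWARDS DOUBLE ARROW
}


def convert_desktop_string(data: bytes, x80_is_euro: bool = False) -> bytes:
    """Convert a desktop-font string to UTF-8, replacing the Wimp's symbol codes."""
    out = []
    for byte in data:
        if byte == 0x80 and x80_is_euro:
            # The caller has judged that &80 means the Euro sign in this string.
            out.append("\u20AC")
        elif byte in WIMP_SYMBOLS:
            out.append(WIMP_SYMBOLS[byte])
        else:
            # Assumption: treat everything else as ISO 8859-1.
            out.append(bytes([byte]).decode("latin-1"))
    return "".join(out).encode("utf-8")
```

The x80_is_euro flag reflects the point that &80 is ambiguous, so a human (or a per-file setting) has to decide which character was intended.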
Andrew Hodgkinson (6) 465 posts |
ISTR that Kevin used either Edit or SrcEdit as a testbed for much of the UTF-8 work, so I suspect it (or rather, RISCOSLib) is OK in that respect. |
Chris (121) 472 posts |
I’m trying to get a working test version of Chars ready for testing in a few days’ time. In the meantime, I thought a little to-do list in the Wiki might be useful as a summary of this discussion. It’s here – comments/edits welcome. This assumes we go with Ben’s scheme for mirror territories, as mentioned above. I propose to work on steps 1 & 2 as a first stage – anyone willing to help out is most welcome to get in touch. |
W P Blatchley (147) 247 posts |
Can anyone tell me if Wimp_TextOp is UTF8 aware? If the desktop font happened to be xxx/EUTF8, would the call be able to split text correctly? |
Ben Avison (25) 445 posts |
Yes, I’m pretty sure that should already work if the system alphabet is UTF-8. |
Sprow (202) 1155 posts |
Um, if presently there are 8 bits (for 256) surely you only need 2 bits each to get to 1024? I looked at ScanRes (which finds common stuff in resources when generating the Messages module) and it seems to all revolve around using 2 digit decimal directory names. The docs say that’s because territories are numbered 0-99, so I guess it’s already on the back foot, but it’s partly to defer a 10 letter directory limitation (thus up to 5 territories can be in a ROM). Do we assume long filenames are available these days? I guess so, some of the sources have filenames longer than 10 letters now (eg. SndSetupVIDC). I note that the Toolbox does sprintf(name, "Message%d", territory) so can do 3 digit territory numbers on a 10 letter filing system. So territories 1000-1023 might have to be written off. |
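The arithmetic in this post can be checked with a short Python sketch. The helper names are hypothetical; the format string mirrors the sprintf call quoted above, and the 10-letter limit is the filing system restriction mentioned in the post.

```python
MAX_LEAF_LENGTH = 10  # the 10-letter filename limit discussed above


def extended_range(extra_bits: int) -> int:
    """Number of codes available when extra_bits are added to the existing 8."""
    return 2 ** (8 + extra_bits)


def messages_leafname(territory: int) -> str:
    """Build the leafname the same way as the quoted sprintf call."""
    return "Message%d" % territory


def fits_short_filenames(territory: int) -> bool:
    """True if the leafname fits within the 10-letter limit."""
    return len(messages_leafname(territory)) <= MAX_LEAF_LENGTH
```

Two extra bits give 1024 codes, and "Message" plus three digits is exactly ten letters, so territory 999 fits while 1000 to 1023 do not.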
WPB (1391) 352 posts |
Resurrecting a reasonably old thread…
Is there another way? Could perhaps the APIs that load Messages, Templates and Res files do what they currently do to work out the filename of what they’re looking for, then try looking for a file called that but with “+” appended if the system alphabet is UTF-8? It seems like a waste of Country and Territory allocations to double up everything if the only difference is the use of UTF-8. On non RO5 systems, the normal files get loaded. On RO5, if the system alphabet is UTF-8, the “+” versions get loaded automatically, with no changes necessary to program code. This would also work (I think), with ResFind – even if it wouldn’t, ResFind could be updated to cope. There’s bound to be a caveat or two. What are they, folks? EDIT: I realise my suggestion is a bit circular: How can the configured alphabet be UTF-8 without the Territory specifying it as such? There would have to be an “alphabet override” configuration option that forces a particular alphabet irrespective of what the configured territory says it should be. Perhaps this just makes an already complicated system even more complicated! Still, there may be some merit in the idea… |
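The lookup rule suggested here might look something like the following Python sketch. It is purely illustrative: the function name is hypothetical, and the file-existence check is passed in as a callable so that the rule can be shown without touching a real filing system.

```python
from typing import Callable


def choose_resource(leafname: str, alphabet_is_utf8: bool,
                    exists: Callable[[str], bool]) -> str:
    """Pick the '+' variant of a resource file when the alphabet is UTF-8.

    Falls back to the normal file if no '+' variant is present, so
    applications that have not been converted keep working.
    """
    if alphabet_is_utf8:
        candidate = leafname + "+"
        if exists(candidate):
            return candidate
    return leafname
```

For example, with both "Messages" and "Messages+" present, a UTF-8 system would load "Messages+" and any other system would load "Messages".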
Benoit Gilon (259) 14 posts |
Before migrating applications to adopt UTF-8, either the hard way or through resource files, would it be possible for the limitations on Wimp message lengths to be removed, or at least for a workaround to be implemented? AFAICR, at least some Wimp messages are limited to 256 bytes, and that includes Interactive Help messages, which contain plain text sent by applications to the IH application. I guess that this contributed to the use of escape sequences within exchanged messages, like ‘\Smerge the paths’ (sent from the Draw application with a UK locale), which is subsequently expanded to ‘Click SELECT to merge the path’ by the IH application. Thank you for reading, |
Rick Murray (539) 13806 posts |
I thought ALL messages were limited to 256 bytes (including message header)? This is something that I have in mind for my It’llProbablyNeverHappenCosImNotThatSmart modification to the Wimp. In addition to enabling Unicode for capable applications, there will be a modification to the message protocol as follows: If a message length (+0) is -1, then the word at +20 will be a pointer to a block. This block, which can be up to 4K (total), will begin with the standard twenty bytes of a wimp message (length, sender handle, my_ref, your_ref, message code), which may then be followed by up to 4076 (4K – 20) bytes of data. Well, that is what I have in mind. Again, as with the Unicode/Latin1 duality I mentioned a short while back, one of the main criteria here is to offer enhanced facilities without compromising backwards compatibility (this is why I am against just Whether it would work in practice is a different question; I think !Help would be a brilliant idea for a test-bed for it; though I suspect !Help is written in BASIC. <looks in sources> Oooh, if it is “Help2”, that is C. Sweet! |
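The block layout described in this post could be modelled roughly as below, in Python. This is a sketch of a proposal, not of any existing Wimp behaviour. The function names are hypothetical, and two details are assumptions because the post does not specify them: that the large block's own length word holds its total size, and that the short proxy message repeats the sender, reference and message code fields.

```python
import struct

# Standard 20-byte header: length, sender handle, my_ref, your_ref, message code.
HEADER = struct.Struct("<iIIII")
MAX_BLOCK = 4096  # proposed upper limit for the whole extended block


def build_extended_block(sender: int, my_ref: int, your_ref: int,
                         code: int, payload: bytes) -> bytes:
    """Build the large block: a 20-byte header followed by up to 4076 bytes."""
    if len(payload) > MAX_BLOCK - HEADER.size:
        raise ValueError("payload exceeds 4076 bytes")
    # Assumption: the length word of the large block is its total size.
    total = HEADER.size + len(payload)
    return HEADER.pack(total, sender, my_ref, your_ref, code) + payload


def build_proxy(sender: int, my_ref: int, your_ref: int,
                code: int, block_address: int) -> bytes:
    """Build the short message: length -1 at +0, pointer to the block at +20."""
    return (HEADER.pack(-1, sender, my_ref, your_ref, code)
            + struct.pack("<I", block_address))
```

A recipient that understands the extension would test the length word for -1 and then follow the pointer at +20; one that does not would have to be shielded from such messages, which is the compatibility concern raised in the replies.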
nemo (145) 2529 posts |
All Wimp messages are limited to 256 bytes. However, they can contain pointers to other memory and so this doesn’t have to be a limitation for protocol design. The length of the contextual help message is limited, but it is the equivalent of a ‘tooltip’ in other OSes, and should be short. Rick, long messages have been supported by various modules in that way for Quite Some Time. For example, I think CC’s Impact messages work like that. I also made a module that allows very long messages to be delivered fairly transparently – trapping SendMessage to divert long messages and send a short proxy in its place, which is fetched by the recipient by a call to the module. Long messages can’t be imposed on apps though, because the buffer passed to Wimp_Poll is limited to 256 bytes. |
Rick Murray (539) 13806 posts |
…which it was clearly marked as an extension only to be provided to compatible software. Arbitrarily extending the size of the Wimp block would break, like, everything so that’s obviously a non-starter… |