Wide char functions in SharedCLibrary
Cameron Cawley (3514) 158 posts |
The Shared C Library headers include wchar.h, wctype.h and uchar.h, which are all part of the ISO C standard. However, none of the function prototypes in those headers have corresponding implementations. Is there a major issue that prevents these functions from being implemented easily on RISC OS? |
Rick Murray (539) 13850 posts |
There’s no fixed definition of what a “wide character” actually is, and then there are the complications of using non-8-bit character sets on RISC OS. The Unicode standard says “The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text.” I think wchar was an early attempt at getting away from limited character sets, specifically things like Shift-JIS and the like for the East Asian languages. Because of this, the actual encoding of a wide character is intentionally undefined. These days one might assume it’s UTF-8, but it doesn’t have to be. Therefore, what should these functions actually do? RISC OS support for other character sets is somewhat lacking, which may be why nothing much has happened in this respect. |
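A minimal sketch of how a program can check what its own compiler decided, since the width really is compiler-specific. The function names `wchar_bits` and `wchar_holds_any_code_point` are made up for this illustration; only `wchar_t`, `CHAR_BIT` and `WCHAR_MAX` come from the standard headers.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: number of bits in wchar_t for this compiler. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: non-zero if a single wchar_t can hold every
 * Unicode code point (the highest is U+10FFFF). */
int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

On a compiler with a 16-bit wchar_t the second function returns 0, which is the situation the Unicode standard's warning is about.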
Chris Mahoney (1684) 2165 posts |
I think the logic went something like “you should be using Unicode, and wide chars aren’t fully compatible with Unicode, so we won’t bother with supporting them”. |
Cameron Cawley (3514) 158 posts |
The general impression that I get is that, with the exception of Windows and OS/2 (which both use UCS-2 for wide chars), almost all current platforms use UCS-4 for wide chars, and since both Norcroft and GCCSDK define wchar_t as an int, it would make sense for RISC OS to do so as well. For char16_t and char32_t, the C11 specification seems to strongly encourage the use of UTF-16 and UTF-32 respectively, so there isn’t much reason not to do so on RISC OS. For reference, UnixLib provides implementations of a number of wide char functions, but it seems to be a mixture of three things: functions from glibc that always assume UTF-8 multi-byte strings; original functions that convert strings by just changing the size of each character while ignoring the active code page; and stubs that cause the application to abort immediately, which interferes with build systems that attempt to detect the availability of functions before using them. The glibc code is also LGPL, which means that those functions are only available when building with UnixLib, leaving a very small number of functions usable in SharedCLibrary builds. To summarise, what I would expect if this were implemented is that all functions would use the active code page for multi-byte strings and UCS-4 for wide strings, and that functions that don’t deal with ctype functions or with converting to or from multi-byte strings would be independent of the encoding being used. Does that look reasonable, or is there something that I’ve overlooked? |
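The proposal above (active code page for multi-byte strings, UCS-4 for wide strings) could be sketched roughly as follows. This is an illustration only, not SharedCLibrary or UnixLib code: the name `sketch_mbstowcs`, the `ucs4_t` typedef and the `high_table` parameter are all hypothetical, and it assumes an 8-bit code page in which bytes below 0x80 are plain ASCII. Passing NULL for the table treats the input as ISO 8859-1, where each byte value equals its code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit wchar_t holding UCS-4. */
typedef uint32_t ucs4_t;

/* Sketch of an mbstowcs-like conversion for an 8-bit code page.
 * high_table: 128 code points for bytes 0x80..0xFF of the active
 *             code page, or NULL to assume ISO 8859-1.
 * Writes at most n wide characters to dst. Returns the number written,
 * not counting the terminator, which is only stored if there is room. */
size_t sketch_mbstowcs(ucs4_t *dst, const char *src, size_t n,
                       const ucs4_t *high_table)
{
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned char b = (unsigned char)src[i];

        if (b == 0) {
            dst[i] = 0;          /* terminator fits, so store it */
            return i;
        }
        if (b < 0x80 || high_table == NULL)
            dst[i] = b;          /* ASCII, or Latin-1 identity mapping */
        else
            dst[i] = high_table[b - 0x80];
    }
    return i;                    /* buffer full, no terminator written */
}
```

The point of the table parameter is that "just changing the size of each character" is only correct for Latin-1; any other code page needs a lookup for the top half. Functions such as wcslen or wcscpy would not need to know about the table at all.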
nemo (145) 2552 posts |
This is an API-forking work-around for the general problem of updating an API that implicitly assumes 8-bit characters to one that deals with other encodings in some defined platform-specific way – which is not itself defined by the C standard or even an accepted convention. Your OS Mileage Will Vary. The choice of UTF-8, UTF-16 or UCS to encode Unicodes is already a pretty outdated way of looking at Unicode-related API problems, as it continues to assume that one “char” (regardless of width) is in some way one “character”. But it is not. Embrace 32-bit ints if you wish, it won’t change the fact that this one character is seven Unicodes long – For one character. So the width of ‘char’ is really no panacea, and due to the sparse but clumped nature of Unicode ranges and their associated attributes and metrics, the necessary data structures will be some kind of segmented tree anyway, so you might as well use the UTF-8 bytes as input rather than subdividing the bits of your decoded Unicode. And all your input from the outside world will be coming in UTF-8 form in plain old bytes. |
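To put numbers on the "one char is not one character" point, here is a small sketch that counts code points in a UTF-8 string. The function `utf8_code_points` and the constant `family_emoji` are hypothetical names for this example, and the emoji chosen (man, woman, girl, boy joined by U+200D ZERO WIDTH JOINER) is just one well-known seven-code-point sequence, not necessarily the character shown in the post. The counter assumes well-formed UTF-8.

```c
#include <stddef.h>

/* Example data: U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466,
 * a family emoji that is normally drawn as a single glyph. */
const char family_emoji[] =
    "\xF0\x9F\x91\xA8" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA9" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA7" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA6";

/* Count code points in a NUL-terminated, well-formed UTF-8 string by
 * counting every byte that is not a continuation byte (10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For `family_emoji` this gives 7 code points stored in 25 bytes, for what a reader sees as one character. A 32-bit wchar_t would still need seven elements, so widening the type does not make the string indexable by character.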
Rick Murray (539) 13850 posts |
Fairly recentish Chrome (Android). I see a coloured boy face, hands holding hands, a caucasian boy face. ;)
Other benefit being that generic C code can more or less treat it as a string, which is not the case with UTF-AnythingElse. |
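Assuming "it" here means UTF-8, a rough illustration of the claim: the ordinary byte-oriented string functions keep working, because UTF-8 never contains a zero byte inside a character and a valid sequence cannot match starting in the middle of another character. The wrapper name `utf8_find` is made up for this example; it is nothing more than strstr.

```c
#include <string.h>

/* Hypothetical wrapper: find a UTF-8 needle in a UTF-8 haystack with
 * plain strstr. Returns the byte offset of the first match, or -1. */
long utf8_find(const char *haystack, const char *needle)
{
    const char *p = strstr(haystack, needle);

    return p ? (long)(p - haystack) : -1L;
}
```

For instance, searching "caf\xC3\xA9 au lait" (café au lait) for "\xC3\xA9" (é) returns byte offset 3. The same search over UTF-16 or UTF-32 data would stop at the first zero byte, which is why those encodings need the separate wide-string functions.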
nemo (145) 2552 posts |
I regret to announce that I have been at it again, and have caused UTF-16 to be defined as an Alphabet (117) partly for sound data-processing and labelling reasons, but partly just to underline that no one knows what the API of OS_WriteC and WrchV is. As for OS_WriteN, your guess is as good as mine. (And your emoji should be persons, not boys. That’s different Unicodes. Though we’re squinting at 12pt hairstyles now.) |
Clive Semmens (2335) 3276 posts |
I’ve completely given up with emojis. Even with my close work specs on I have no idea what the vast majority of emojis mean. Egyptian hieroglyphics are far less baffling. |
Steve Pampling (1551) 8172 posts |
Makes no great difference on size [1], more a case of what the specific application decides the codes should be rendered as right now. I emphasize the right now, as you can put an emoji into a Teams message in the full knowledge that the next unwanted feature modification [2] from MS will make it look different.
[1] For what it’s worth, the micro size colour blobs do seem to be representations of humanoid figures.
[2] Maybe they might consider fixing the bugs and the stuff-megabytes-in-your-profile-for-no-good-reason aspects and leave the whizz bang features unchanged for a while. Stupid idea, it’s MS under discussion. |
Rick Murray (539) 13850 posts |
R0 is the character to spew to all of the available output streams. The question isn’t “what’s the API?”, the question is “what’s a character?” [1]
Did it change at any point? They’re definitely faces, not people. My newer phone has it as two people of different colour holding hands.
Messing with the UI (all the time) is a fairly simple way to be seen to be doing something without actually having to do much of anything. Oh, and thanks to listening to an 80s traditional metal station (think AC/DC, Poison, and Whitesnake (last three played)), one of the excruciatingly awful American adverts explained what 👁️🗨️ means.
[1] Remember that WRCH is one of the entry points inherited from the MOS, so, I repeat, what’s “a character”? ;) |
Rick Murray (539) 13850 posts |
Well, we aren’t teenagers, so we can restrict ourselves to a useful subset, like: 👯♀️ Can’t believe there’s no teapot, but there are bunny girls (and boys 👯♂️ as an option). As for the rest, remember that emoji is from a Japanese word, so it’s not a surprise you’ll come across stuff like 👹👺🍡🍚🍱🍙🎴 (plus loads more). See here for details of all the weird and wonderful Japanese emoji: https://www.nippon.com/en/japan-topics/b00137/ To get back on topic, there might only be one RISC OS machine in existence capable of rendering this, and it demonstrates that maybe it’s better to think of input as a stream of bytes that have meaning, rather than trying to coerce them into some sort of poorly defined “wide” character. |
Clive Semmens (2335) 3276 posts |
My useful subset is MUCH shorter. :) :( ;) |
nemo (145) 2552 posts |
You managed to hit one I haven’t done though. |
Rick Murray (539) 13850 posts |
That’s… really quite impressive to see it as VDU text emoji. So, yup, I figured you’d have a way of doing it and you didn’t disappoint. 👍 |
nemo (145) 2552 posts |
Note that !UniEdit is a text editor so is showing separate Unicodes even when “the emoji” is actually a grapheme sequence (see ‘facepalm’). |