RISC OS Open: Forum: Wide char functions in SharedCLibrary

Dec 27, 2022 8:06pm

Cameron Cawley (3514) 158 posts

The Shared C Library headers include wchar.h, wctype.h and uchar.h, which are all part of the ISO C standard; however, none of the function prototypes included in those headers have corresponding implementations. Is there a major issue that prevents these functions from being implemented easily on RISC OS?

Dec 27, 2022 10:43pm

Rick Murray (539) 13850 posts

There’s no fixed definition of what a “wide character” actually is, and then there are the complications of using non-8-bit character sets on RISC OS.

The Unicode standard says “The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text.”
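As a rough illustration of the width point, here is a small sketch that reports what the building compiler actually provides. The helper names are made up for the example and are not part of the Shared C Library or any other library.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Width of wchar_t, in bits, for whichever compiler builds this file. */
static unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Non-zero if one wchar_t can hold any Unicode code point.
   The highest code point is U+10FFFF, which needs 21 bits. */
static int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

With a compiler that defines wchar_t as a 32-bit int, wchar_bits() should report 32 and the second helper should return non-zero; with a 16-bit wchar_t it returns zero, which is the portability trap the quote is warning about.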

I think wchar was an early attempt at getting away from limited character sets, specifically things like Shift-JIS and the like for the eastern Asian languages. Because of this, the actual encoding of a wide character is intentionally undefined. These days one might assume it’s UTF-8, but it doesn’t have to be. Therefore, what should these functions actually do?

RISC OS support for other character sets is somewhat lacking, which may be why nothing much has happened in this respect.

Dec 27, 2022 10:57pm

Chris Mahoney (1684) 2165 posts

I think the logic went something like “you should be using Unicode, and wide chars aren’t fully compatible with Unicode, so we won’t bother with supporting them”.

Dec 28, 2022 12:14am

Cameron Cawley (3514) 158 posts

The general impression that I get is that with the exception of Windows and OS/2 (which both use UCS-2 for wide chars), almost all current platforms these days use UCS-4 for wide chars, and since both Norcroft and GCCSDK define wchar_t as an int, it would make sense for RISC OS to do so as well. For char16_t and char32_t, the C11 specification seems to strongly encourage the use of UTF-16 and UTF-32 respectively, so there isn’t much reason not to do so on RISC OS.
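To make the UCS-2 versus UTF-16 distinction concrete, here is a minimal sketch of encoding a single code point as UTF-16. The function name utf16_encode is an assumption for illustration only; it is not one of the uchar.h functions.

```c
#include <stdint.h>

/* Encode one code point as UTF-16. Returns the number of 16-bit units
   written (1 or 2), or 0 if cp is not a valid Unicode scalar value. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* one unit, identical to UCS-2 */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For U+1F9D1 this produces the pair D83E DDD1, so anything above U+FFFF cannot fit in a single 16-bit wide char, whereas a 32-bit wchar_t holds it directly.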

For reference, UnixLib provides implementations of a number of wide char functions, but it seems to be a mixture of three things: functions from glibc that always assume UTF-8 multi-byte strings; original functions that convert strings by just changing the size of each character while ignoring the active code page; and stubs which cause the application to abort immediately, interfering with build systems that attempt to detect the availability of functions before using them. The glibc code is also LGPL, which means that those functions are only available when building with UnixLib, leaving a very small number of functions usable in SharedCLibrary builds.

To summarise, what I would expect if this were implemented is that all functions would use the active code page for multi-byte strings and UCS-4 for wide strings, and that functions that don’t deal with ctype functions or with converting to or from multi-byte strings would be independent of the encoding being used. Does that look reasonable, or is there something that I’ve overlooked?
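As a sketch of what "independent of the encoding" could mean in practice, here are two stand-ins for wcslen() and wcschr() working on 32-bit units. The names ucs4_len and ucs4_chr are hypothetical, and this is not how any existing library implements them; the point is only that neither needs to know what the values mean.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for wcslen(): counts units before the terminator. */
static size_t ucs4_len(const uint32_t *s)
{
    const uint32_t *p = s;
    while (*p != 0)
        p++;
    return (size_t)(p - s);
}

/* Hypothetical stand-in for wcschr(): first occurrence of c, or NULL.
   Searching for 0 returns a pointer to the terminator, as wcschr() does. */
static const uint32_t *ucs4_chr(const uint32_t *s, uint32_t c)
{
    for (;; s++) {
        if (*s == c)
            return s;
        if (*s == 0)
            return NULL;
    }
}
```

Only the conversion functions (the mbrtowc() family) and the wctype.h classification functions would need to care about the active code page or about Unicode properties.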

Jan 6, 2023 2:45pm

nemo (145) 2552 posts

This is an API-forking work-around for the general problem of updating an API that implicitly assumes 8-bit characters to one that deals with other encodings in some defined platform-specific way – which is not itself defined by the C standard or even an accepted convention. Your OS Mileage Will Vary.

The choice of UTF-8, UTF-16 or UCS to encode Unicodes is already a pretty outdated way of looking at Unicode-related API problems, as it continues to assume that one “char” (regardless of width) is in some way one “character”.

But it is not. Embrace 32-bit ints if you wish, it won’t change the fact that this one character is seven Unicodes long – 🧑🏾‍🤝‍🧑🏼 (People Holding Hands: Medium-Dark Skin Tone, Medium-Light Skin Tone, 1F9D1, 1F3FE, 200D, 1F91D, 200D, 1F9D1, 1F3FC). The current worst-case emoji is ten Unicodes IIRC – in UTF-8 that’s 35 bytes long. And when it comes to letters with attached accents it can be even worse.

For one character.
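The arithmetic can be checked with a short sketch that works out the UTF-8 size of the People Holding Hands sequence quoted above. The helper names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8 (0 if not encodable). */
static size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80)                    return 1;
    if (cp < 0x800)                   return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates */
    if (cp < 0x10000)                 return 3;
    if (cp <= 0x10FFFF)               return 4;
    return 0;                                     /* beyond Unicode */
}

/* Total UTF-8 size of a sequence of n code points. */
static size_t utf8_total(const uint32_t *cps, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += utf8_len(cps[i]);
    return total;
}

/* The seven code points of the People Holding Hands sequence. */
static const uint32_t holding_hands[7] = {
    0x1F9D1, 0x1F3FE, 0x200D, 0x1F91D, 0x200D, 0x1F9D1, 0x1F3FC
};
```

Five of the code points take four bytes each and the two zero width joiners (200D) take three each, so utf8_total(holding_hands, 7) comes to 26 bytes, or seven 32-bit units, for what the reader sees as a single character.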

So the width of ‘char’ is really no panacea, and due to the sparse but clumped nature of Unicode ranges and their associated attributes and metrics, the necessary data structures will be some kind of segmented tree anyway, so you might as well use the UTF-8 bytes as input rather than subdividing the bits of your decoded Unicode.

And all your input from the outside world will be coming in UTF-8 form in plain old bytes.
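For anyone who does want decoded values out of that byte stream, a minimal UTF-8 decoder might look like the sketch below. It is an illustration under the usual well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF); the name utf8_decode and its interface are assumptions, not an existing RISC OS or C library call.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the first n bytes at s.
   Returns the number of bytes consumed (1 to 4), or 0 if the input is
   empty, truncated or malformed. *cp is only written on success. */
static int utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest value that legitimately needs a sequence of each length. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    int len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)      { c = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; len = 4; }
    else
        return 0;                   /* stray continuation or invalid lead */

    if (n < (size_t)len)
        return 0;                   /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_for_len[len])
        return 0;                   /* overlong encoding */
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                   /* out of range, or a surrogate */
    *cp = c;
    return len;
}
```

Note that this still only yields one code point per call; grouping code points into what a user would call a character is a separate problem that no choice of char width solves.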

Jan 6, 2023 5:41pm

Rick Murray (539) 13850 posts

it won’t change the fact that this one character is seven Unicodes long

Fairly recentish Chrome (Android). I see a coloured boy face, hands holding hands, a caucasian boy face.
So that one character is being rendered as three characters.

;)

so you might as well use the UTF-8 bytes as input

Other benefit being that generic C code can more or less treat it as a string, which is not the case with UTF-AnythingElse.
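A small sketch of that benefit: the ordinary string.h functions keep working on UTF-8 because no byte of a multi-byte sequence is ever zero or an ASCII value. The helper names and the sample string are made up for the example, and the code point counter assumes well-formed input.

```c
#include <stddef.h>
#include <string.h>

/* Size in bytes is just strlen(); nothing UTF-8 specific is needed. */
static size_t utf8_bytes(const char *s)
{
    return strlen(s);
}

/* Count code points in a NUL-terminated UTF-8 string by skipping
   continuation bytes (10xxxxxx). Assumes the input is well-formed. */
static size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical sample: "A", ZERO WIDTH JOINER (U+200D), "B".
   The literal is split so that 'B' is not read as part of the \x escape. */
static const char sample[] = "A\xE2\x80\x8D" "B";
```

Here utf8_bytes(sample) is 5, utf8_code_points(sample) is 3, and strchr(sample, 'B') finds the 'B' at offset 4 without any knowledge of the encoding. The same code handed UTF-16 or UTF-32 data would stop at the first zero byte.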

Jan 6, 2023 6:01pm

nemo (145) 2552 posts

I regret to announce that I have been at it again, and have caused UTF-16 to be defined as an Alphabet (117) partly for sound data-processing and labelling reasons, but partly just to underline that no one knows what the API of OS_WriteC and WrchV is.

As for OS_WriteN, your guess is as good as mine.

(And your emoji should be persons, not boys. That’s different Unicodes. Though we’re squinting at 12pt hairstyles now.)

Jan 6, 2023 6:24pm

Clive Semmens (2335) 3276 posts

Though we’re squinting at 12pt hairstyles now.

I’ve completely given up with emojis. Even with my close work specs on I have no idea what the vast majority of emojis mean. Egyptian hieroglyphics are far less baffling.

Jan 6, 2023 8:08pm

Steve Pampling (1551) 8172 posts

(And your emoji should be persons, not boys. That’s different Unicodes. Though we’re squinting at 12pt hairstyles now.)

Makes no great difference on size¹, more a case of what the specific application decides the codes should be rendered as right now.

I emphasize the right now, as you can put an emoji into a Teams message in the full knowledge that the next unwanted feature modification² from MS will make it look different.

¹ For what it’s worth, the micro size colour blobs do seem to be representations of humanoid figures

² Maybe they might consider fixing the bugs and the stuff-megabytes-in-your-profile-for-no-good-reason aspects and leave the whizz bang features unchanged for a while. Stupid idea, it’s MS under discussion.

Jan 6, 2023 8:21pm

Rick Murray (539) 13850 posts

but partly just to underline that no one knows what the API of OS_WriteC and WrchV is.

R0 is the character to spew to all of the available output streams.

The question isn’t “what’s the API?”, the question is “what’s a character?”. ¹

And your emoji should be persons, not boys. That’s different Unicodes.

Did it change at any point? They’re definitely faces, not people.

My newer phone has it as two people of different colour holding hands.
My older phone is only about three and a half years old.

and leave the whizz bang features unchanged for a while.

Messing with the UI (all the time) is a fairly simple way to be seen to be doing something without actually having to do much of anything.
After all, it’s not the same bug ridden crap if it looks completely different, right?

Oh, and thanks to listening to an 80s traditional metal station (think AC/DC, Poison, and Whitesnake (last three played)), one of the excruciatingly awful American adverts explained what 👁️‍🗨️ means.

¹ Remember that WRCH is one of the entry points inherited from the MOS, so, I repeat, what’s “a character”? ;)

Jan 6, 2023 8:52pm

Rick Murray (539) 13850 posts

I have no idea what the vast majority of emojis mean

Well, we aren’t teenagers, so we can restrict ourselves to a useful subset, like:
😄 Happy
😂 Laughing arse off
🤔 Ummm… Let me think about that.
😓 Oh, FFS.
🤯 Blows my mind
😰 Oh, god…
😅 Yeah, that was embarrassing
☹️ One is NOT amused.
😠 Mildly miffed
🤬 Miffed
😱 Terror/fear/aaargh
😭 Waaaaah!
🤷‍♀️ Dunno
🤦🏻‍♀️ Facepalm
🙋 Hi/Bye
🙅 Aw hell no!
💩 Poop (quite versatile)
🐷 Me when there’s spaghetti ;)
🤒 Not well
🤮 Don’t eat the seafood
😷 Wear a mask, dammit
🥺 Seriously? Like for real?
😴 Sleepy
🤘 Listening to Aerosmith
🤞 Played the lottery
🤏 Small dick energy ;)

👯‍♀️ Can’t believe there’s no teapot, but there are bunny girls (and boys 👯‍♂️ as an option).

As for the rest, remember that emoji is from a Japanese word, so it’s not a surprise you’ll come across stuff like 👹👺🍡🍚🍱🍙🎴 (plus loads more). See here for details of all the weird and wonderful Japanese emoji: https://www.nippon.com/en/japan-topics/b00137/

To get back on topic, there might only be one RISC OS machine in existence capable of rendering this, and it demonstrates that maybe it’s better to think of input as a stream of bytes that have meaning, rather than trying to coerce them into some sort of poorly defined “wide” character.

Jan 6, 2023 9:04pm

Clive Semmens (2335) 3276 posts

My useful subset is MUCH shorter. :) :( ;)

Jan 6, 2023 9:11pm

nemo (145) 2552 posts

You managed to hit one I haven’t done though.

Jan 6, 2023 9:15pm

Rick Murray (539) 13850 posts

That’s… really quite impressive to see it as VDU text emoji. So, yup, I figured you’d have a way of doing it and you didn’t disappoint. 👍

Jan 6, 2023 9:16pm

nemo (145) 2552 posts

Note that !UniEdit is a text editor so is showing separate Unicodes even when “the emoji” is actually a grapheme sequence (see ‘facepalm’).
