RISC OS Open: Forum: Encodings

Mar 31, 2018 6:20pm

Rick Murray (539) 13840 posts

Why is France different?

Nobody else stands a chance of figuring it out.

Had it been written in France by a French guy, it would make sense (and probably use the BEPO keyboard layout), however this change was checked in by Kevin Bracey back in July 2000 (revision 1.3).

Mar 31, 2018 7:02pm

Steve Pampling (1551) 8170 posts

Had it been written in France by a French guy, it would make sense (and probably use the BEPO keyboard layout), however this change was checked in by Kevin Bracey back in July 2000 (revision 1.3).

That puts it 12 months after the release of RO4.02, so if the fault exists with RO4.02 then it’s something ROL did and Kevin was merely checking in code changes passed back by ROL to Pace. If it isn’t in RO4.02 then we still have a debate.

Mar 31, 2018 7:05pm

David Feugey (2125) 2709 posts

I presume it is doing that to have access to the Euro at &A4

Absolutely. Old behaviour of RISC OS 4 times.

So line 106 probably ought to be modified accordingly

Not sure this is needed. AltGr+E gives code &A4 with Latin9. But if I use Alphabet Latin1, AltGr+E switches to &80, with the current setup. So all that is needed is to change line 30 to Latin1.

Why is France different?

Ursula code.
https://www.riscosopen.org/viewer/view/castle/RiscOS/Sources/Internat/Territory/Module/s/France

“France and Canada1 territories converted to Latin9. Other minor corrections
to message files.” 07/25/2000

Here’s my quick hack:

Hum, loaded and… alphabet: Latin9 :)

Anyway, I’m not sure it’s in the territory module, as I just use the keyboard option in configure, and there is no sign of French in Rommodules or Modules.

Didn’t we sort of cover this when discussing keyboard layouts and the existence of a prime key in the layout for an accented character that is used in something like one word?

That’s not the case here. No additional characters in Latin9 are needed for French.

That puts it 12 months after the release of RO4.02 so if the fault exists with RO4.02

That’s what I have been saying from the beginning!!!

RISC OS 4 switched to Latin9 to provide a euro sign, which was not present in Latin1 at that time. This trick is not needed anymore.
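The difference at &A4 can be checked with a minimal Python sketch. It uses Python's standard ISO 8859-1 and ISO 8859-15 codecs as stand-ins for Latin1 and Latin9, which is an assumption: the ISO tables treat &80 to &9F as control codes, so they do not model the Acorn Latin1 alphabet where, per this thread, the Euro sits at &80. The helper name `describe` is purely illustrative.

```python
import unicodedata

# The single byte under discussion, &A4.
BYTE_A4 = bytes([0xA4])

def describe(byte_string, codec):
    """Decode one byte with the given codec and return 'U+XXXX NAME'."""
    char = byte_string.decode(codec)
    return f"U+{ord(char):04X} {unicodedata.name(char)}"

latin1_a4 = describe(BYTE_A4, "iso8859_1")    # ISO 8859-1 (stand-in for Latin1)
latin9_a4 = describe(BYTE_A4, "iso8859_15")   # ISO 8859-15 (stand-in for Latin9)

print("Latin1 &A4:", latin1_a4)
print("Latin9 &A4:", latin9_a4)
```

The same byte is therefore the international currency sign in one alphabet and the Euro in the other, which is consistent with the symptom David reports when the alphabet or font encoding changes underneath existing text.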

Mar 31, 2018 7:28pm

Rick Murray (539) 13840 posts

Hum, loaded and… alphabet: Latin9 :)

Okay, so… What is your configured Territory? Enter *Territory. Is it UK or France? If France, maybe the Latin9 is a remnant of the OS booting. If UK, then there’s more work to do – and I have a horrible suspicion that if you’re using bog-standard RISC OS with a French layout, then that’ll be the case…

My hack definitely specifies Latin1, it’s at offset +&37C of the unsqueezed module (!UnModSqz in the DDE tools will do that – or rip it out of RMA with Zap/StrongEd). So I wonder what’s setting Latin9?

But if I use Alphabet Latin1, AltGr+E switches to &80, with the current setup.

What Alt-E produces is a keyboard driver issue. What I’m fixing here is what the Territory module tells other applications the correct currency symbol is. ;-)

Hmm… Can you do a *Status and paste the results here? Also look inside !Boot.Choices.Boot.PreDesk for anything that looks like it might have to do with language/territory/keyboard.

Finally, if you are able/willing to reboot your machine, try rebooting while spamming the ESC key (to abort doing the normal boot stuff). What keyboard/alphabet does the machine start up as?

Mar 31, 2018 7:30pm

David Feugey (2125) 2709 posts

What is your configured Territory? Enter *Territory

I have no territory loaded, so UK.

Bad encoding is also in the International module. InterBody, line 969:
https://www.riscosopen.org/viewer/view/castle/RiscOS/Sources/Internat/Inter/s/InterBody?rev=4.22;content-type=text%2Fx-cvsweb-markup

Of course, we can assume that Territory French changes this value.
This time we have found (one of) the culprits.

The TTF2f bug is stranger, as it happens with some fonts and not with others. But it does not affect Latin1 configurations :)

Edit: I have posted ticket 450
https://www.riscosopen.org/tracker/tickets/450

Mar 31, 2018 7:46pm

David Feugey (2125) 2709 posts

Perhaps I should use the territory module too. Past issues make me cautious with it :)
Anyway, it would be good to have Territories modules in the RISC OS disc image.

Mar 31, 2018 8:53pm

Rick Murray (539) 13840 posts

Bad encoding is also in the International module. InterBody, line 969:

Well spotted!

Anyway, it would be good to have Territories modules in the RISC OS disc image.

Or, given that it’s a ROM image loaded into ooodles of memory, maybe RISC OS ought to contain a couple of territories built in for places where it is known to be in use and a territory exists – for example France and Germany. The module I built for you is 9K uncompressed, I’d imagine Germany would be similar. So built in support (of a manner, you know my thoughts on the half-ass implementation of territory support) could be extended to two non-English countries with a reasonably large RISC OS user base for, what? 19K all in? Why not?

QUESTION FOR JEFFREY: If I wanted to make, for instance, a ROM with this support in it, do I modify the Components file?

For instance, in the middle of the Pi’s Components file, it says:

TerritoryManager
Messages
MessageTrans
UK
WindowManager         -options OPTIONS=Ursula
TaskManager

Would it suffice to simply do this:

TerritoryManager
Messages
MessageTrans
UK
France
Germany
WindowManager         -options OPTIONS=Ursula
TaskManager

Apr 1, 2018 2:34pm

nemo (145) 2546 posts

David complained

Today, that’s a nightmare when you copy and paste things, or change the font. I write a € with Trinity under Draw, change it to FreeSerif, and – magic! – no € anymore, but the international currency sign.

This is complicated by a number of issues related to encodings… but that’s only half the problem.

Short version: Using UTF-8 everywhere will really help.

Long version: Encodings have been criminally ignored, which is why you are seeing these problems. Let us compare and contrast !Chars (any version) with my own humble offering, !IntChars (other solutions are available)

In addition to the Font, one ought to be able to choose the Encoding one is using…

Even without Unicode, an Encoding contains lots of very useful information, and allows mappings from one Encoding to another, case changes, ligature composition, and character grouping:

If you click on a character in !IntChars, it sends message &5327F:

This allows the program to use the Encoding as well as the Font. Failing that, dragging the little Draw icon from the corner produces a line of text in the selected Font and Encoding.

HOWEVER, what happens after that is up to the application in question. Here’s what RO4 !Draw does if you change the Font of the resulting text object:

Which really isn’t what anyone wanted.

(I did say this was the long version)

These problems were always avoidable. But were not avoided. C’est la vie

UTF-8 everywhere will help a bit. But not much, on past experience.

Apr 1, 2018 2:51pm

nemo (145) 2546 posts

Rick:

Twitter image does not work

I suppose that’s why everyone was so quiet when I said I had double-height VDU4 text in all modes, and also with a custom 8×16 font too? Yeah, that would explain it.

Apr 1, 2018 3:07pm

nemo (145) 2546 posts

David then claimed:

all Acorn Fonts switch to Latin 9 (not the system one, see below), but some fonts converted with TTF2f stay in Acorn_Latin1Encoding (while they can be printed as Latin 9 in Chars!).

This is why.

Once upon a time there was no FontManager.
Then there was a FontManager that used bitmapped fonts (drawn by a funky pen & skeleton program, pre Draw).
Then there was a FontManager that used outline fonts (bitmaps supported for compatibility).
Then there was a FontManager that used big outline fonts and Encodings (bitmaps and old outlines supported for compatibility).

The new outline fonts are called Language Fonts, the old outline fonts are Symbol Fonts. Yes, even though they have letters in them, they are technically symbolic, because they cannot adapt to an Encoding.

Is !TTF2f by EFF by any chance? If so, then it produces symbol fonts, not language fonts, so Encodings can’t be used with them. Also, it doesn’t actually use Latin1 or Acorn_Latin1, but what I refer to as EFF_Latin1… and that has two Euros. If it’s not by EFF, it may yet display one or more of these behaviours.

The way to tell the difference is to look at your font files. Those that contain files called “IntMetrics” are Symbol Fonts (ignores Encoding), those that contain “IntMetric0” (or some other digit) are Language Fonts (uses Encoding).
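The file-name test described above can be sketched in Python as follows. This is an illustration only: `classify_font` is a hypothetical helper, it works on a plain list of leaf names rather than talking to the filing system, and the case-insensitive match plus the rule that a numbered metrics file wins if both kinds are present are assumptions, not something stated in the thread.

```python
import re

# Hypothetical helper: names and return values are illustrative only.
# Matches "IntMetric0", "IntMetric1", ... but not the plain "IntMetrics".
NUMBERED_METRICS = re.compile(r"^IntMetric\d+$", re.IGNORECASE)

def classify_font(filenames):
    """Classify a font directory listing as 'language', 'symbol' or 'unknown'."""
    names = [name.lower() for name in filenames]
    if any(NUMBERED_METRICS.match(name) for name in names):
        return "language"   # IntMetric<digit>: the font follows an Encoding
    if "intmetrics" in names:
        return "symbol"     # plain IntMetrics: the Encoding is ignored
    return "unknown"        # no metrics file recognised

print(classify_font(["IntMetrics", "Outlines"]))
print(classify_font(["IntMetric0", "Outlines0"]))
```

On a real system the list would come from a directory listing of the font in question, for example `os.listdir(path)` on a machine where the font files are visible to Python.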

Apr 1, 2018 3:13pm

nemo (145) 2546 posts

Rick claimed:

I think ZapRedraw may be the same. Back when this stuff was written, nobody really thought much about anything like other encodings?

Zap is perfectly happy with whatever Encoding you want to use… but you have to bake it into the font. You already have a Teletext Zap font – that’s an Encoding. I have one or two…

Apr 1, 2018 3:13pm

nemo (145) 2546 posts

…

Apr 1, 2018 4:11pm

Rick Murray (539) 13840 posts

In addition to the Font, one ought to be able to choose the Encoding one is using…

Time to update your base applications. !Chars has supported encodings for a while, though clearly there remains the issue of “if Chars is in Latin2 and the OS in Latin1…” You can see where that’s going. :-)

Do you know how many programs support your encoding extension?

Zap is perfectly happy with whatever Encoding you want to use…

Yes. There is some sort of bug in ZapButtons (sets PC to about &00000002), and in twiddling the display options to look at the dump better, I came across the encoding options in the menu.
Darren Salt must have added it last week. I swear it wasn’t there before. Etc. ;-)

UTF-8 everywhere will help a bit. But not much, on past experience.

Indeed. The problem we would run into with a UTF-8 Wimp (etc) is the behaviour of older applications. Think of all those non-English Messages files that just assume Latin1 because, really, it’s a Latin1 system and…and…and…

[Acorn should have done this back with the RiscPC]

Apr 1, 2018 4:17pm

Chris (121) 472 posts

Let us compare and contrast !Chars (any version) with my own humble offering, !IntChars (other solutions are available) … In addition to the Font, one ought to be able to choose the Encoding one is using…

I did some work a while ago on Chars precisely to add this feature (plus display the full range of characters in UTF-8 fonts). Does the latest version in RISC OS 5.23 not display font encodings correctly? (I wouldn’t be too surprised if it didn’t – I didn’t find the issue at all easy to understand.)

That doesn’t touch on the related issue of transmitting characters to other apps: at present, Chars includes no font or encoding information at all when a character is clicked on. There was some discussion of how to handle this a while back, but AFAIK there’s no agreed protocol on how to handle this issue across the OS. If there were, Chars could be extended to be more helpful.

Apr 1, 2018 4:21pm

Chris (121) 472 posts

Is !TTF2f by EFF by any chance?

TTF2F was produced as part of the NetSurf project, I think. I used it to get hold of the Cyberbit font in order to test Chars’ ability to display the full range of >255 character fonts. As you say, the resulting fonts are Symbol fonts with no associated encoding.

Apr 2, 2018 11:07am

nemo (145) 2546 posts

Rick mused

The problem we would run into with a UTF-8 Wimp (etc) is the behaviour of older applications. Think of all those non-English Messages files that just assume Latin1

ACTUALLY that’s the problem I’ve solved.

Turns out it’s no problem at all.

Apr 2, 2018 11:13am

nemo (145) 2546 posts

Chris admitted

That doesn’t touch on the related issue of transmitting characters to other apps

It was to that I was referring, as well as the transcoding issue. When I put !IntChars into EBCDIC Encoding I can still type into its text box using the physical keyboard. Think about that for a bit.

there’s no agreed protocol on how to handle this issue across the OS

Apart from the long-defined Wimp message I just referred to?

There is an API blind-spot here. Allocations are ‘commercially sensitive’ by default, and there are plenty of protocols and APIs that are private, internal, and subject to change. But there are also APIs, modules, messages or strategies that are intended to be open, general purpose and pro bono.

It is a pity there is no ‘third-party public API’ part of the OS documentation.

Apr 2, 2018 11:15am

Rick Murray (539) 13840 posts

Turns out it’s no problem at all.

So if, in a modern UTF-8 aware application, one chooses to copy the text “La dernière chance pour l’humanité est celle qu’est la même, mais différente.”, an older application will see that and not “La derniÃ¨re chance pour l’humanitÃ© est celle qu’est la mÃªme, mais diffÃ©rente.”? (and vice versa, though it’s easier to detect non-standard UTF-8 and fall back to assuming Latin1 (not that FontManager does…))
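The garbling in that example, and the asymmetry Rick mentions, can be reproduced with a short Python sketch. It uses Python's ISO 8859-1 codec as a stand-in for Latin1, which is an assumption; the variable names are illustrative.

```python
# UTF-8 bytes misread as Latin1: every byte is a valid Latin1 character,
# so nothing fails and the result is silently garbled.
text = "La dernière chance"
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("latin-1")
print(ascii(misread))

# Latin1 bytes misread as UTF-8: the lone byte &E8 is not a valid UTF-8
# sequence here, so the mistake can be detected.
latin1_bytes = text.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    detected_invalid = False
except UnicodeDecodeError:
    detected_invalid = True
print("Latin1 rejected as UTF-8:", detected_invalid)
```

This is why falling back from UTF-8 to Latin1 is feasible, while going the other way needs outside knowledge of what the bytes were meant to be.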

Apr 2, 2018 11:19am

Rick Murray (539) 13840 posts

It is a pity there is no ‘third-party public API’ part of the OS documentation.

Couldn’t it be added to the Wiki?
Isn’t there a StrongHelp manual listing “other stuff”?
Above I see the message “Cerilica_StyledText” in StrongHelp format. Where is this help file available? It isn’t part of Wimp v1.25, which covers Wimp_Messages such as PopupHelp, ANT suite, and IRClient. Why’s your message not there also?

Apr 2, 2018 11:33am

nemo (145) 2546 posts

The UTF-8 support I have been working on features fallback – that is, if it isn’t UTF-8, then it gets treated transparently as though it is Latin1 (or your preferred Encoding).

Here is a mixed-mode text file, that has both valid UTF-8 and unadorned Latin1 in it, displayed in Zap, UniEdit and the command line:

Zap: Apologies for the ZapFont that has a tick at chr 128 for historical reasons. Regardless, this shows the byte content of the file.

UniEdit: This uses the System Font to display the character content of the file – that is, it uses the International Service to fetch Unicode character definitions, and this is the result that is displayed.

Command line: *TYPE just sends each byte to OS_WriteC, and once again, we get the ‘right’ result.
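The general idea of decoding with fallback can be sketched in Python. This is not nemo's implementation, whose details are not given in the thread. `decode_with_fallback` is a hypothetical function, and Python's ISO 8859-1 codec stands in for the fallback Encoding, so it will not match Acorn Latin1 in the &80 to &9F range.

```python
def decode_with_fallback(data, fallback="latin-1"):
    """Decode bytes as UTF-8, treating each invalid byte as one fallback character."""
    out = []
    i = 0
    while i < len(data):
        try:
            # The whole remainder is valid UTF-8: take it and finish.
            out.append(data[i:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # Everything before err.start decoded cleanly as UTF-8.
            out.append(data[i:i + err.start].decode("utf-8"))
            # Treat exactly one offending byte as a fallback character.
            bad = data[i + err.start:i + err.start + 1]
            out.append(bad.decode(fallback))
            i += err.start + 1
    return "".join(out)

# A mixed-mode buffer: "é" as UTF-8 (C3 A9), then "é" as Latin1 (E9).
mixed = "é".encode("utf-8") + "é".encode("latin-1") + b" ok"
decoded = decode_with_fallback(mixed)
print(ascii(decoded))
```

Note that this sketch returns an ordinary string, so it does not record which characters came from fallback bytes and which from genuine UTF-8 sequences.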

Apr 2, 2018 11:40am

nemo (145) 2546 posts

Rick asked

[if one pastes UTF-8 into] an older application [it] will see [Latin1] and not [Mojibake]?

In the absence of any additional cleverness, it will get the UTF-8 bytes. When it attempts to display them using an OS API, it will display the correct characters. The disparity between bytes and characters will cause it (mostly harmless) problems with cursor positioning if it is a text editor.
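The byte/character disparity behind the cursor positioning problem can be illustrated with a small Python sketch. Both function names are hypothetical, and the byte-to-character direction assumes the buffer is valid UTF-8.

```python
def char_to_byte_offset(text, char_index):
    """Byte offset in the UTF-8 form of text at which character char_index starts."""
    return len(text[:char_index].encode("utf-8"))

def byte_to_char_offset(data, byte_index):
    """Number of characters that start before byte_index in valid UTF-8 data."""
    # Continuation bytes have the bit pattern 10xxxxxx and do not start a character.
    return sum(1 for b in data[:byte_index] if (b & 0xC0) != 0x80)

word = "différente"
word_bytes = word.encode("utf-8")

print(len(word), "characters,", len(word_bytes), "bytes")
print(char_to_byte_offset(word, 5))        # byte offset just after the "é"
print(byte_to_char_offset(word_bytes, 6))  # characters before that byte offset
```

An editor that counts bytes would place the caret one position too far right for every two-byte character that precedes it on the line.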

Ideally, a clever clipboard holder can have a blacklist of applications for which transcoding is required, but that is a separate, orthogonal problem.

Apr 2, 2018 11:51am

nemo (145) 2546 posts

BTW, another important feature of proper fallback handling is that it allows UDGs to continue to be used. Strict UTF-8 compliance obviously disallows that.

The UnicodeSupport module I’m working on has many support functions including file IO (CPut/CGet) that support strict UTF-8, fallback, WUTF-8, MUTF-8 and CESU-8 as required, as well as chr/byte offset conversions.

There’s a lot of unpleasantness involved in supporting UTF-8 at the command line, including reimplementations of OS_ReadLine(32), OS_Byte135, and some star commands such as *Cat, *Info etc. Star commands/utilities that output tabulated text need to be UTF-8 aware to do their formatting correctly (it’s not the end of the world if they don’t, but it’s ugly).

To greatly ease that I have a new module called VDUTabs, which implements definable tabstops (just like the extension I wrote for Zap). Code can then do a VDU23 to define tab widths, then output columns trivially separated by tab characters, and then a final VDU23 to return to normal. Saves a lot of code.
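The tabstop idea can be imitated in Python to show why column output has to count characters rather than bytes. This is only an emulation of the concept: it does not use or describe the VDUTabs module or its VDU23 sequences, `expand_tabs` is a hypothetical helper, and it assumes one character occupies one column (no combining or double-width characters).

```python
def expand_tabs(line, stops):
    """Expand tab characters so each field starts at the next defined tab stop."""
    out = ""
    for index, field in enumerate(line.split("\t")):
        if index > 0:
            # First stop beyond the current width; one space if none is left.
            target = next((s for s in stops if s > len(out)), len(out) + 1)
            out += " " * (target - len(out))
        out += field
    return out

rows = ["abc\t1", "é\t1", "a\tb\tc"]
for row in rows:
    print(ascii(expand_tabs(row, [4, 8])))

# A byte-counting formatter would see "é" as two columns wide and misalign it.
print(len("é"), "character,", len("é".encode("utf-8")), "bytes")
```

With the stops defined once, the code producing each row only has to emit fields separated by tab characters, which is the saving nemo describes.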

Apr 2, 2018 12:00pm

Steve Pampling (1551) 8170 posts

It is a pity there is no ‘third-party public API’ part of the OS documentation.

702 more important things to do… :)
Old saying “Man stuck in hole should stop digging”.

Although when so many people like the hole it is nice that he continues.

Apr 2, 2018 12:01pm

nemo (145) 2546 posts

Rick suggested

Couldn’t [third party APIs] be added to the Wiki?

I don’t know, can they?

StrongHelp format

Well that’s another question entirely. I’m sure my StrongHelp OS manual doesn’t look much like yours! StrongHelp is marvellous, but the necessity for it to be maintained by a gatekeeper is a barrier. Here’s a project for someone: An online wiki-like version of StrongHelp, from which the app could periodically fetch (and push) updates. The granularity is currently far too coarse.

Apr 2, 2018 12:08pm

nemo (145) 2546 posts

One further clarification of the mixed-mode fallback handling demonstrated in the picture above.

Sending the bytes E2,82,AC emits the Euro character. Sending bytes C2,80 outputs character U+0080 which is a useless control character for which I have not defined a glyph… so you get the ‘tofu’ square.

It may appear that outputting the byte 80 also emits the Euro character… but that’s not quite correct. It looks like the Euro character, and the fallback Encoding of Acorn Latin1 defines it to be a Euro, but it is actually user defined graphic 128, and so may well look like a Space Invader.

There is an important distinction between U+20AC the Euro, U+0080 the control character, and 80 the UDG. They are all different characters, and (my) OS_WriteC does not get confused between them.
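The three-way distinction can be demonstrated with Python's built-in `surrogateescape` error handler, which keeps undecodable bytes separate from real code points. This is an analogy only, not a description of how nemo's OS_WriteC works: Python marks a raw byte &80 as the lone surrogate U+DC80, whereas the thread describes it as user defined graphic 128. The function name is illustrative.

```python
def decode_keeping_raw_bytes(data):
    """Decode UTF-8, preserving invalid bytes as distinct escape code points."""
    return data.decode("utf-8", errors="surrogateescape")

euro = decode_keeping_raw_bytes(bytes([0xE2, 0x82, 0xAC]))   # valid UTF-8 Euro
control = decode_keeping_raw_bytes(bytes([0xC2, 0x80]))      # valid UTF-8 U+0080
raw = decode_keeping_raw_bytes(bytes([0x80]))                # not UTF-8 at all

for label, value in (("euro", euro), ("control", control), ("raw", raw)):
    print(label, f"U+{ord(value):04X}")

# The raw byte survives a round trip unchanged.
round_trip = raw.encode("utf-8", errors="surrogateescape")
print(round_trip.hex())
```

All three inputs end up as different values, so a later stage can still tell the Euro, the control character and the raw byte apart.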

