Encodings
Rick Murray (539) 13840 posts |
Why is France different? Had it been written in France by a French guy, it would make sense (and probably use the BEPO keyboard layout); however, this change was checked in by Kevin Bracey back in July 2000 (revision 1.3). |
Steve Pampling (1551) 8170 posts |
That puts it 12 months after the release of RO4.02, so if the fault exists in RO4.02 then it’s something ROL did, and Kevin was merely checking in code changes passed back by ROL to Pace. If it isn’t in RO4.02, then we still have a debate. |
David Feugey (2125) 2709 posts |
Absolutely. Old behaviour of RISC OS 4 times.
Not sure this is needed. AltGr+E gives code &A4 with Latin9. But if I use Alphabet Latin1, AltGr+E switches to &80, with the current setup. So all that is needed is to change line 30 to Latin1.
Ursula code. “France and Canada1 territories converted to Latin9. Other minor corrections”
Hum, loaded and… alphabet: Latin9 :) Anyway, I’m not sure it’s in the territory module, as I just use the keyboard option in configure, and there is no sign of French in Rommodules or Modules.
That’s not the case here. No additional characters in Latin9 are needed for French.
That’s what I’ve been saying from the beginning!!! RISC OS 4 switched to Latin9 to provide a euro sign, which was not present in Latin1 at that time. This trick is not needed anymore. |
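The &A4 difference described above can be checked on any system that ships the ISO codecs. This is a minimal Python sketch, assuming the RISC OS Latin1 and Latin9 alphabets agree with ISO 8859-1 and ISO 8859-15 at this code point; the Acorn-specific &80 to &9F range (where the current setup puts the euro at &80) is not covered by these codecs.

```python
# Decode the single byte &A4 under the two ISO alphabets.
# Assumption: RISC OS Latin1/Latin9 match ISO 8859-1/8859-15 at &A4.
CODE = bytes([0xA4])

latin1_char = CODE.decode("iso-8859-1")    # generic currency sign
latin9_char = CODE.decode("iso-8859-15")   # euro sign

print(f"Latin1 &A4 -> U+{ord(latin1_char):04X}")  # Latin1 &A4 -> U+00A4
print(f"Latin9 &A4 -> U+{ord(latin9_char):04X}")  # Latin9 &A4 -> U+20AC
```

So the same keypress code yields the euro only if the text is later interpreted as Latin9.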
Rick Murray (539) 13840 posts |
Okay, so… What is your configured Territory? Enter My hack definitely specifies Latin1, it’s at offset +&37C of the unsqueezed module (!UnModSqz in the DDE tools will do that – or rip it out of RMA with Zap/StrongEd). So I wonder what’s setting Latin9?
What Alt-E specifies is a keyboard driver issue. What the Territory module says to other applications is the correct currency symbol is what I’m fixing here. ;-) Hmm… Can you do a Finally, if you are able/willing to reboot your machine, try rebooting while spamming the ESC key (to abort doing the normal boot stuff). What keyboard/alphabet does the machine start up as? |
David Feugey (2125) 2709 posts |
I have no territory loaded, so UK. The bad encoding is also in the International module. InterBody, line 969: Of course, we can assume that Territory French changes this value. The TTF2f bug is stranger, as it happens with some fonts and not others. But it does not affect Latin1 configurations :) Edit: I have posted ticket 450 |
David Feugey (2125) 2709 posts |
Perhaps I should use the territory module too. Past issues make me cautious about it :) |
Rick Murray (539) 13840 posts |
Well spotted!
Or, given that it’s a ROM image loaded into ooodles of memory, maybe RISC OS ought to contain a couple of territories built in for places where it is known to be in use and a territory exists – for example France and Germany. The module I built for you is 9K uncompressed, I’d imagine Germany would be similar. So built in support (of a manner, you know my thoughts on the half-ass implementation of territory support) could be extended to two non-English countries with a reasonably large RISC OS user base for, what? 19K all in? Why not?

QUESTION FOR JEFFREY: If I wanted to make, for instance, a ROM with this support in it, do I modify the Components file? For instance, in the middle of the Pi’s Components file, it says:

TerritoryManager
Messages
MessageTrans
UK
WindowManager -options OPTIONS=Ursula
TaskManager

Would it suffice to simply do this:

TerritoryManager
Messages
MessageTrans
UK
France
Germany
WindowManager -options OPTIONS=Ursula
TaskManager |
nemo (145) 2546 posts |
David complained
This is complicated by a number of issues related to encodings… but that’s only half the problem.

Short version: Using UTF-8 everywhere will really help.

Long version: Encodings have been criminally ignored, which is why you are seeing these problems. Let us compare and contrast !Chars (any version) with my own humble offering, !IntChars (other solutions are available).

In addition to the Font, one ought to be able to choose the Encoding one is using… Even without Unicode, an Encoding contains lots of very useful information, and allows mappings from one Encoding to another, case changes, ligature composition, and character grouping:

If you click on a character in !IntChars, it sends message &5327F:

This allows the program to use the Encoding as well as the Font. Failing that, dragging the little Draw icon from the corner produces a line of text in the selected Font and Encoding.

HOWEVER, what happens after that is up to the application in question. Here’s what RO4 !Draw does if you change the Font of the resulting text object:

Which really isn’t what anyone wanted.

(I did say this was the long version)

These problems were always avoidable. But were not avoided. C’est la vie.

UTF-8 everywhere will help a bit. But not much, on past experience. |
nemo (145) 2546 posts |
Rick:
I suppose that’s why everyone was so quiet when I said I had double-height VDU4 text in all modes, and also with a custom 8×16 font too? Yeah, that would explain it. |
nemo (145) 2546 posts |
David then claimed:
This is why. Once upon a time there was no FontManager. The new outline fonts are called Language Fonts, the old outline fonts are Symbol Fonts. Yes, even though they have letters in them, they are technically symbolic, because they cannot adapt to an Encoding. Is !TTF2f by EFF by any chance? If so, then it produces symbol fonts, not language fonts, so Encodings can’t be used with them. Also, it doesn’t actually use Latin1 or Acorn_Latin1, but what I refer to as EFF_Latin1… and that has two Euros. If it’s not by EFF, it may yet display one or more of these behaviours. The way to tell the difference is to look at your font files. Those that contain files called “IntMetrics” are Symbol Fonts (they ignore the Encoding), those that contain “IntMetric0” (or some other digit) are Language Fonts (they use the Encoding). |
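The file-name test described above is easy to automate. This is a rough Python sketch, not a FontManager API: `classify_font` is a hypothetical helper that takes the leaf names found in one font directory and applies only the naming rule stated in the post. Treating a directory that contains both kinds of file as a Language Font is an assumption.

```python
import re

def classify_font(filenames):
    """Classify a font directory from its metrics file names.

    Rule taken from the post: "IntMetrics" means Symbol Font,
    "IntMetric" followed by a digit means Language Font.
    Hypothetical helper for illustration only.
    """
    names = {name.lower() for name in filenames}
    # Assumption: if both kinds are present, report a Language Font.
    if any(re.fullmatch(r"intmetric\d", name) for name in names):
        return "language"
    if "intmetrics" in names:
        return "symbol"
    return "unknown"

print(classify_font(["IntMetrics", "Outlines"]))    # symbol
print(classify_font(["IntMetric0", "Outlines0"]))   # language
```

The directory listings in the usage lines are made up; on a real machine the names would come from the font's own directory.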
nemo (145) 2546 posts |
Rick claimed:
Zap is perfectly happy with whatever Encoding you want to use… but you have to bake it into the font. You already have a Teletext Zap font – that’s an Encoding. I have one or two… |
nemo (145) 2546 posts |
… |
Rick Murray (539) 13840 posts |
Time to update your base applications. !Chars has supported encodings for a while, though clearly there remains the issue of “if Chars is in Latin2 and the OS in Latin1…” You can see where that’s going. :-) Do you know how many programs support your encoding extension?
Yes. There is some sort of bug in ZapButtons (sets PC to about &00000002), and in twiddling the display options to look at the dump better, I came across the encoding options in the menu.
Indeed. The problem we would run into with a UTF-8 Wimp (etc) is the behaviour of older applications. Think of all those non-English Messages files that just assume Latin1 because, really, it’s a Latin1 system and…and…and… [Acorn should have done this back with the RiscPC] |
Chris (121) 472 posts |
I did some work a while ago on Chars precisely to add this feature (plus display the full range of characters in UTF-8 fonts). Does the latest version in RISC OS 5.23 not display font encodings correctly? (I wouldn’t be too surprised if it didn’t – I didn’t find the issue at all easy to understand.) That doesn’t touch on the related issue of transmitting characters to other apps: at present, Chars includes no font or encoding information at all when a character is clicked on. There was some discussion of how to handle this a while back, but AFAIK there’s no agreed protocol on how to handle this issue across the OS. If there were, Chars could be extended to be more helpful. |
Chris (121) 472 posts |
TTF2F was produced as part of the NetSurf project, I think. I used it to get hold of the Cyberbit font in order to test Chars’ ability to display the full range of >255 character fonts. As you say, the resulting fonts are Symbol fonts with no associated encoding. |
nemo (145) 2546 posts |
Rick mused
ACTUALLY that’s the problem I’ve solved. Turns out it’s no problem at all. |
nemo (145) 2546 posts |
Chris admitted
It was to that I was referring, as well as the transcoding issue. When I put !IntChars into EBCDIC Encoding I can still type into its text box using the physical keyboard. Think about that for a bit.
Apart from the long-defined Wimp message I just referred to? There is an API blind-spot here. Allocations are ‘commercially sensitive’ by default, and there are plenty of protocols and APIs that are private, internal, and subject to change. But there are also APIs, modules, messages or strategies that are intended to be open, general purpose and pro bono. It is a pity there is no ‘third-party public API’ part of the OS documentation. |
Rick Murray (539) 13840 posts |
So if, in a modern UTF-8 aware application, one chooses to copy the text “La dernière chance pour l’humanité est celle qu’est la même, mais différente.”, an older application will see that and not “La dernière chance pour l’humanité est celle qu’est la même, mais différente.”? (and vice versa, though it’s easier to detect non-standard UTF-8 and fall back to assuming Latin1 (not that FontManager does…)) |
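The mismatch being asked about can be reproduced outside RISC OS. This is a small Python sketch, assuming the older application simply treats every byte as ISO 8859-1 (Python's `latin-1` codec, which is not identical to Acorn Latin1 in the &80 to &9F range). It also shows the reverse direction, where Latin1 bytes are detectably not valid UTF-8.

```python
original = "La dernière chance"

# A UTF-8 aware application puts these bytes on the clipboard...
utf8_bytes = original.encode("utf-8")

# ...and a Latin1-only application reads each byte as one character.
seen_by_latin1_app = utf8_bytes.decode("latin-1")   # 'La derniÃ¨re chance'

# The reverse case: Latin1 bytes handed to a strict UTF-8 decoder.
latin1_bytes = original.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    looks_like_utf8 = True
except UnicodeDecodeError:
    # &E8 announces a three-byte sequence but is followed by plain ASCII.
    looks_like_utf8 = False
```

The failed strict decode is what makes a fall back to Latin1 practical: text with bare high bytes rarely happens to be well-formed UTF-8.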
Rick Murray (539) 13840 posts |
Couldn’t it be added to the Wiki? |
nemo (145) 2546 posts |
The UTF-8 support I have been working on features fallback – that is, if it isn’t UTF-8, then it gets treated transparently as though it is Latin1 (or your preferred Encoding). Here is a mixed-mode text file, that has both valid UTF-8 and unadorned Latin1 in it, displayed in Zap, UniEdit and the command line:

Zap: Apologies for the ZapFont that has a tick at chr 128 for historical reasons. Regardless, this shows the byte content of the file.

UniEdit: This uses the System Font to display the character content of the file – that is, it uses the International Service to fetch Unicode character definitions, and this is the result that is displayed.

Command line: |
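The fallback idea can be illustrated with a short decoder. This is only a sketch of the general technique in Python, not the implementation described above: `decode_with_fallback` is a hypothetical name, and the fallback here is ISO 8859-1, so a bare byte &80 comes out as U+0080 instead of a user defined graphic.

```python
def decode_with_fallback(data: bytes, fallback: str = "latin-1") -> str:
    """Decode well-formed UTF-8 sequences; treat any other byte as `fallback`."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Expected sequence length from the lead byte (0 = not a valid lead).
        if lead < 0x80:
            length = 1
        elif 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            length = 0
        chunk = data[i:i + length]
        if length > 1 and len(chunk) == length:
            try:
                # Strict decoding rejects overlongs and surrogates.
                out.append(chunk.decode("utf-8"))
                i += length
                continue
            except UnicodeDecodeError:
                pass
        # Not part of a valid sequence: one byte, one fallback character.
        out.append(data[i:i + 1].decode(fallback))
        i += 1
    return "".join(out)

# Mixed-mode input: "é" as UTF-8 (C3 A9), a space, then "é" as Latin1 (E9).
mixed = bytes([0xC3, 0xA9, 0x20, 0xE9])
decoded = decode_with_fallback(mixed)
```

Both forms of "é" decode to the same character, which is the transparent behaviour the post describes for a mixed-mode file.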
nemo (145) 2546 posts |
Rick asked
In the absence of any additional cleverness, it will get the UTF-8 bytes. When it attempts to display them using an OS API, it will display the correct characters. The disparity between bytes and characters will cause it (mostly harmless) problems with cursor positioning if it is a text editor. Ideally, a clever clipboard holder can have a blacklist of applications for which transcoding is required, but that is a separate, orthogonal problem. |
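The byte/character disparity behind the cursor-positioning problem is easy to see in numbers. This is a small Python sketch; `char_to_byte_offset` is a hypothetical helper, not part of any RISC OS API.

```python
def char_to_byte_offset(text: str, char_index: int) -> int:
    """Byte offset in the UTF-8 form of `text` of the character at `char_index`."""
    return len(text[:char_index].encode("utf-8"))

word = "différente"
char_count = len(word)                    # 10 characters
byte_count = len(word.encode("utf-8"))    # 11 bytes, because "é" takes two

# A byte-oriented editor places the caret after "diffé" at offset 6, not 5.
caret = char_to_byte_offset(word, 5)
print(char_count, byte_count, caret)      # 10 11 6
```

An application that counts bytes will therefore drift one position to the right for every two-byte character before the caret.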
nemo (145) 2546 posts |
BTW, another important feature of proper fallback handling is that it allows UDGs to continue to be used. Strict UTF-8 compliance obviously disallows that.

The UnicodeSupport module I’m working on has many support functions including file IO (CPut/CGet) that support strict UTF-8, fallback, WUTF-8, MUTF-8 and CESU-8 as required, as well as chr/byte offset conversions. There’s a lot of unpleasantness involved in supporting UTF-8 at the command line, including reimplementations of OS_ReadLine(32), OS_Byte135, and some star commands such as

To greatly ease that I have a new module called VDUTabs, which implements definable tabstops (just like the extension I wrote for Zap). Code can then do a VDU23 to define tab widths, then output columns trivially separated by tab characters, and then a final VDU23 to return to normal. Saves a lot of code. |
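For readers unfamiliar with the variants named above, here is a rough Python sketch of how two of them differ from strict UTF-8. It follows the usual definitions: CESU-8 encodes characters above U+FFFF as a surrogate pair of two three-byte sequences, and Modified UTF-8 does the same and also encodes U+0000 as C0 80. The function names are hypothetical and this is not the UnicodeSupport module's interface; WUTF-8 is not covered.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode as CESU-8: supplementary characters become surrogate pairs."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets Python emit the 3-byte surrogate form.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)

def mutf8_encode(text: str) -> bytes:
    """Encode as Modified UTF-8: CESU-8 plus U+0000 written as C0 80."""
    # A zero byte in CESU-8 output can only come from U+0000.
    return cesu8_encode(text).replace(b"\x00", b"\xc0\x80")

emoji = "\U0001F600"
print(emoji.encode("utf-8").hex())        # f09f9880
print(cesu8_encode(emoji).hex())          # eda0bdedb880
print(mutf8_encode("a\x00").hex())        # 61c080
```

Strict UTF-8 decoders reject both the surrogate forms and the C0 80 overlong, which is why each variant has to be requested explicitly.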
Steve Pampling (1551) 8170 posts |
702 more important things to do… :) Although when so many people like the hole it is nice that he continues. |
nemo (145) 2546 posts |
Rick suggested
I don’t know, can they?
Well that’s another question entirely. I’m sure my StrongHelp OS manual doesn’t look much like yours! StrongHelp is marvellous, but the necessity for it to be maintained by a gatekeeper is a barrier. Here’s a project for someone: An online wiki-like version of StrongHelp, from which the app could periodically fetch (and push) updates. The granularity is currently far too coarse. |
nemo (145) 2546 posts |
One further clarification of the mixed-mode fallback handling demonstrated in the picture above. Sending the bytes E2,82,AC emits the Euro character. Sending bytes C2,80 outputs character U+0080 which is a useless control character for which I have not defined a glyph… so you get the ‘tofu’ square. It may appear that outputting the byte 80 also emits the Euro character… but that’s not quite correct. It looks like the Euro character, and the fallback Encoding of Acorn Latin1 defines it to be a Euro, but it is actually user defined graphic 128, and so may well look like a Space Invader. There is an important distinction between U+20AC the Euro, U+0080 the control character, and 80 the UDG. They are all different characters, and (my) OS_WriteC does not get confused between them. |
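The three cases can be told apart mechanically. This is a minimal Python sketch that only classifies the byte sequences; `classify` is a hypothetical name, and what a bare byte then looks like on screen depends on the fallback Encoding or UDG definitions, which this code does not model.

```python
def classify(seq: bytes) -> str:
    """Report the Unicode code point(s) for valid UTF-8, else flag raw bytes."""
    try:
        text = seq.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8: raw byte(s) " + seq.hex().upper()
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

print(classify(bytes([0xE2, 0x82, 0xAC])))  # U+20AC
print(classify(bytes([0xC2, 0x80])))        # U+0080
print(classify(bytes([0x80])))              # not UTF-8: raw byte(s) 80
```

Only the first sequence is the Euro character; the second is the C1 control character, and the third is left for the fallback path, where it is user defined graphic 128.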