UTF8
Dave Higton (1515) 3525 posts |
There’s been a lot of discussion recently about this in another thread. It makes me want to ask: what is necessary to handle UTF8 in RISC OS, doing the job properly? This is a naive question, because Font Manager isn’t something I use explicitly.

If we do the job properly, is it something that can be broken down into stages, or does it have to be all-or-nothing? Are there apps that can benefit from some early stage(s)?

(And here’s a question with some more general applicability.) Is it possible to start at the top, in C, and call out to chunks of assembler from C until (or while) the translation from assembler to C is extended? |
Rick Murray (539) 13840 posts |
I am not overly familiar with the requirements of a UTF-8 system, so I cannot comment much on that. I think nemo is the person to ask about that, and his advice would likely be “tear out FontManager and burn it”.

What I am looking at is to have sensible behaviour for old (8-bit character set) applications when the system itself is in the UTF-8 alphabet (which is probably the wrong way to handle such a thing, but I digress). To be honest, I think FontManager is the wrong place to be fiddling with this, but it’s the easiest place. Doing it properly, though, implies a lot of fundamental changes inside the Wimp, which is quite unlikely to ever actually happen.

So my sticky-plaster method is simply to detect invalid UTF-8 sequences in strings and treat them as Latin1 characters. The idea is that if the system is in UTF-8 mode, something useful will happen when the user runs a Latin1 program. It’s not an ideal solution, so it should only ever appear in FontManager (if ever) as a configurable option. It’s aimed more at giving people a push to transition over to using RISC OS in UTF-8 mode. There’s little impetus at the moment, as anything at all outside the basic ASCII characters simply doesn’t work. This doesn’t affect English “that much”, but languages with accented characters will suffer. |
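For the curious, here is a minimal sketch of the kind of invalid-sequence fallback described above. Nothing here reflects how FontManager would actually implement it; the function names and buffer handling are illustrative only.

    /* Walk a string and decide, byte by byte, whether it forms part of a
     * valid UTF-8 sequence.  Bytes that do not are treated as Latin1 and
     * re-encoded (Latin1 maps directly onto U+0080..U+00FF).  Overlong
     * sequences are not checked; this is a sketch, not a validator. */
    #include <stddef.h>

    static size_t utf8_seq_len(unsigned char b)
    {
        if (b < 0x80) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        return 0;                    /* continuation or invalid lead byte */
    }

    /* Copy src to dst, re-encoding any byte that is not valid UTF-8.
     * dst must be at least twice the length of src, plus a terminator. */
    void latin1_fallback(const unsigned char *src, unsigned char *dst)
    {
        while (*src) {
            size_t need = utf8_seq_len(*src), i;
            int valid = (need != 0);
            for (i = 1; valid && i < need; i++)
                if ((src[i] & 0xC0) != 0x80) valid = 0;  /* not a continuation */
            if (valid) {
                for (i = 0; i < need; i++) *dst++ = *src++;
            } else {
                *dst++ = 0xC0 | (*src >> 6);      /* emit the byte as two-byte */
                *dst++ = 0x80 | (*src & 0x3F);    /* UTF-8, i.e. as Latin1     */
                src++;
            }
        }
        *dst = '\0';
    }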
nemo (145) 2546 posts |
Hi Rick,
Aye.
Quite so.
An excellent idea (and you should download the UTF8Alphabet module from my website). I’m no longer a denizen so I won’t be following up this thread, but you’re one of the very few people who will fully understand everything that is happening here… and there’s a lot happening here: https://twitter.com/nemo20000/status/1577753718560276482?s=20&t=uVuwyjZnai7V4xmfdlt9qg Anyone wishing to ask me things should do so on Twitter or by email which I check very rarely. |
Clive Semmens (2335) 3276 posts |
A matter of great regret to some of us. |
Rob Andrews (112) 164 posts |
I totally agree. We may all be old farts, but it would be great to welcome Nemo back into the fold; we need everyone in this community to function, and the place feels bare without him. |
GavinWraith (26) 1563 posts |
RiscLua has Lua’s UTF-8 library. Because Lua is 8-bit clean it can read, write and store UTF-8 strings just like other strings. utf8.len returns the number of UTF-8 characters in a string and performs validation. utf8.char and utf8.codepoint are the equivalents of string.char and string.byte. utf8.offset converts a character position to a byte position. utf8.codes lets you iterate over the characters in a UTF-8 string, yielding the byte position and codepoint of each character. |
Paul Sprangers (346) 524 posts |
A bit of a shock to me, really. |
Rick Murray (539) 13840 posts |
Probably got fed up and stayed away. |
Chris (121) 472 posts |
I don’t know nearly as much about this as many other people, but it’s seemed to me for a while that the current partial implementation of Unicode in RISC OS relies too much on the system’s Alphabet setting. At present, AIUI, on a UCS version of RISC OS, the Alphabet configuration is used for two separate things:

1. Determining the encoding used by Wimp textual resources (menus, icons, error messages, etc.)
2. Determining the encoding assumed for all of the user’s own text operations (keyboard input, text passed between applications, documents being edited)

The problem is with (2). It’s entirely possible I’d like my desktop to run in Latin1 encoding, but be able to send UTF8-encoded text from, say, Chars to Draw. Or run my desktop in UTF8 and edit a textfile in Latin1. Currently, I can’t do this – the few apps that are aware of UTF8 stuff will assume that all text operations are governed by the global encoding.

I think it would be better if the Alphabet setting really only controlled (1), and the Wimp didn’t try to universally control the encoding of a user’s text operations. That should be up to the applications. If I have a UTF8-aware document processor, ideally it should keep track of each individual document’s selected encoding, and cope with keypresses and imported text accordingly. Similarly, things like Chars should send encoding information along with any transmitted characters (a Wimp User Message?), and let the app decide what to do with it. As long as there’s appropriate fallback for apps that aren’t UTF8-aware, that seems to me a better basis for more flexible UTF8 support.

Though that may all be nonsense, and even if it isn’t I’m sure there are complexities I haven’t considered. |
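Chris’s “Wimp User Message?” idea is only a suggestion, but purely as an illustration it might look something like the block below. The standard Wimp message header (offsets +0 to +16) is real; the message name and everything from +20 onwards are hypothetical, not an existing protocol.

    /* Hypothetical "here is a character, and its encoding" broadcast of the
     * kind Chris suggests Chars might send.  Invented for illustration. */
    typedef struct {
        int size;       /* +0  length of the block, a multiple of 4         */
        int sender;     /* +4  task handle, filled in by the Wimp           */
        int my_ref;     /* +8  filled in by the Wimp                        */
        int your_ref;   /* +12 0 for an original (broadcast) message        */
        int action;     /* +16 hypothetical Message_CharacterWithEncoding   */
        int codepoint;  /* +20 Unicode codepoint being offered              */
        int fallback;   /* +24 nearest Latin1 byte, or '?' if none exists   */
    } char_with_encoding_msg;

A receiver that understands the message could acknowledge it, and the sender could fall back to ordinary KeyPressed events otherwise; this is essentially the handshake Matthew describes further down the thread.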
Rick Murray (539) 13840 posts |
This probably comes under one of my various complaints about the mess that is internationalisation – territory, country, keyboard, alphabet… They’re all related but at the same time not.
RISC OS isn’t that clever, I’m afraid. It’s more like “determining the encoding” (full stop). It’s probably better to conceptualise UTF-8 as just another character set that can be chosen, only it just happens to have a billion extra characters (and that’s just CJK ☺).
Yes, maybe. The problem there is twofold. The first part is that there’s no sensible way to handle such a scenario. There’s no technical reason why applications cannot use Unicode if they like – Ovation does it if you hold both Alt keys while it starts up [1], and you do not need to be in UTF-8 mode for it to work (that was the point). The problem is that there’s no obvious way for any mixed setup to work, as there are way too many potential assumptions that will cause problems. Say your desktop is Latin1. Okay, you receive a Wimp message with a filename. A DataLoad or something. What character set is it?
What would be the most logical is for the desktop to actually be agnostic, and applications can declare that they are UTF-8 aware, and all those that don’t will be treated as Latin1.
Yes, but if the user inputs ñ, how does this get sent to the app? Remember, also, that there is no proper method of sending UTF-8 characters to an app. The current “method” (note the scare quotes) is to send repeated Keypress events to the app, each one carrying a byte of the UTF-8 sequence. This means that no existing app is broken by getting weird wide-character information, it can coexist with function keys (which hijack all the &1xx codes), and all keypress information will be received (regardless of whether or not the app knows what to do with it).

A better way? Well, yes. Provide one keypress event with the character converted to Latin1 in the regular data (or ‘?’ if no conversion is possible), and then provide an additional word giving the Unicode number of the character. If it was plain ASCII or a function key, just set this word to zero.

There are plenty of ideas, good ideas even; however, these do tend to require numerous changes to how the system works. We’re back to the “developers?” question again. And, it would seem, also saddled with the baggage of an incomplete implementation that was aimed at “working on this one specific setup”.

[1] Very simply, it just tells Ovation to apply UTF-8 encoding to all fonts it tries to open; it was basically a ten-minute hack. I’d like to have it work automatically on a document-by-document basis, but I never got around to figuring out the file format. Maybe a job for OvationPro? I mean, the Windows version handles this just fine… |
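To make the byte-at-a-time Keypress delivery concrete, here is a minimal sketch of how a UTF-8-aware application might reassemble a codepoint from successive Key_Pressed character codes. The helper name and structure are ours, not an existing API; it assumes the character code is the one read from offset +24 of the Wimp_Poll block for event 8.

    /* Feed each Key_Pressed character code (the word at +24 of the poll
     * block) in turn.  Passes function keys (&100 and up) and plain ASCII
     * straight through, returns a completed Unicode codepoint once a
     * multi-byte sequence has been assembled, and -1 while still waiting. */
    long utf8_feed(int key)
    {
        static long cp = 0;
        static int  pending = 0;          /* continuation bytes still expected */

        if (key >= 0x100 || key < 0x80) { /* function key or plain ASCII */
            pending = 0;
            return key;
        }
        if ((key & 0xC0) == 0x80) {       /* continuation byte */
            if (pending == 0) return -1;  /* stray continuation: ignore */
            cp = (cp << 6) | (key & 0x3F);
            return (--pending == 0) ? cp : -1;
        }
        if      ((key & 0xE0) == 0xC0) { cp = key & 0x1F; pending = 1; }
        else if ((key & 0xF0) == 0xE0) { cp = key & 0x0F; pending = 2; }
        else if ((key & 0xF8) == 0xF0) { cp = key & 0x07; pending = 3; }
        else pending = 0;                 /* invalid lead byte: drop it */
        return -1;
    }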
Steve Pampling (1551) 8170 posts |
Indeed so, but totally understandable given the combative (and worse) behaviour of some. |
Clive Semmens (2335) 3276 posts |
All that, yes. I remember the stooshie; I don’t remember the (other) names involved. I hope I wasn’t a guilty party; I know I can sometimes be somewhat socially clumsy – although I think probably not on quite the scale of some others. |
Paul Sprangers (346) 524 posts |
This is absolutely impressive! It gives me hope that a functional Unicode system on RISC OS will become a reality one day. |
Rick Murray (539) 13840 posts |
Yup. Switching to top-down, right-to-left was pretty cool. Especially the selective rotation. And… nemoBasic ???! |
Rick Murray (539) 13840 posts |
I may have been, and I am. I’m just some random arsehole on a forum. If you don’t like what I say, ignore me. Simple! But don’t bother starting an argument. Been there, done that, not going to bother any more. The older I get the more time is important. So somebody else can write a wall of text arguing against GPL (or whatever). I have better things to do, like watching this snail on its epic journey across the windowsill. |
Clive Semmens (2335) 3276 posts |
I suspect that you and I get away with a bit of social clumsiness, Rick, because (a) we’re so obviously socially clumsy, and (b) reasonably obviously not intending to be horrible to anyone who isn’t being an unconscionable a**ehole themselves. |
Steve Pampling (1551) 8170 posts |
I find being particularly horrible is an effort [1], and I’m not prepared to make that effort most of the time.
[1] Easily done for some. |
Steve Pampling (1551) 8170 posts |
Now there are two things that don’t belong in the same sentence, never mind the same compound word. |
Matthew Phillips (473) 721 posts |
I’ve been away from the forum for a while, but if Dave is still interested I may be able to provide a few answers from the perspective of someone who has actually implemented some support for UTF8 in applications. For example, Nominatim, which you can download from our website, looks up places and features in the world. If you pop in a search term for “restaurants in Athens” or “hotels in Belgrade” and make sure you’re allowing it to look in Greece or Serbia, you will probably get some results showing addresses in non-Roman alphabets.

The Font Manager can render text using UTF8, whatever the alphabet or territory settings, so an application can behave consistently regardless of what the user may have set the machine to use (a minimal sketch of this appears after this post). For rendering Unicode glyphs the key thing is the fonts you are using. The “ROM” fonts only cover the various Latin alphabets, and perhaps Greek and Cyrillic — I forget. The easiest way, in my experience, to support a wider range of characters in your application is to use the RUFL library, which was developed for NetSurf. It scans all the fonts on the machine and will substitute glyphs from other fonts when the font you’re painting in does not support the characters you need. It has some deficiencies: no support for right-to-left switching, as needed for Arabic and Hebrew. It could also be more intelligent: I have pondered the idea of getting it to work out whether a font is mono-spaced, sans-serif or with serifs when it scans, so that it can substitute a glyph from a font in a similar style.

RiscOSM uses RUFL to some extent, but we have not got round to implementing full support in the map rendering. That would require using callbacks from RUFL to start a new Draw object and switch to a different font, adding the required font for the substitute glyph to the font table. Hard to do dynamically when you have to have the Draw file font table at the start!

So, you’ve rendered stuff to the screen, possibly made a Draw file or printed. Note that various other applications will get confused. ArtWorks, for example, does not like Draw files that specify the encoding, so we provide an option in RiscOSM to avoid using Unicode in the maps if possible, as some users like to take the maps and tweak them in ArtWorks.

User input is another issue. Here it would be useful for the Wimp to know that you support Unicode, in order to receive the keyboard input in UTF-8. But bear in mind that character codes beyond the 8-bit range are already used for special meanings, which makes keyboard handling a bit of a mess. And we don’t have very good keyboard drivers anyway. I think Rick made a nice utility that allows you to pick which accented character you want, a bit like a smartphone user interface. My idea, not implemented yet, would be for the keyboard assistant app to broadcast a special message offering the Unicode character directly, and if the app with input focus understood it, it could acknowledge. Otherwise the assistant would revert to sending KeyPressed events.

Then you want to communicate with other applications! Is a file with filetype FFF meant to be a text file in the current alphabet (usually true), or could it have a variety of encodings on one machine? How is another app meant to tell? Perhaps before sending Message_DataLoad there could be a little handshake to agree to use UTF8 in defiance of the current alphabet, if both applications are UTF8-aware.
At the moment, for example, NetSurf will (possibly) avoid exporting exotic characters when saving parts of a page as text, because these may upset other applications. Or perhaps it just dumps it all as UTF8 — I cannot remember. The inter-application communication and the interpretation of existing files are among the knottiest problems. Anyhow, you asked whether things can be broken down into stages, and yes, they can. The first stage is writing more applications that are using the existing Unicode facilities of the OS. Then there will be more of an impetus to agree how to tackle the rest of the issues. |
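As a concrete illustration of Matthew’s point that the Font Manager can render UTF-8 regardless of the configured alphabet, here is a minimal sketch using the “\E” encoding specifier on the font identifier. Error handling is omitted, and the choice of Homerton is arbitrary; being a ROM font it only covers the Latin ranges, which is exactly the gap RUFL fills by substituting glyphs from other fonts.

    /* Paint a UTF-8 string at OS coordinates (x, y), forcing the UTF-8
     * encoding with the "\E" specifier on the font name.  Uses _swix()
     * from swis.h; errors are ignored for brevity. */
    #include "swis.h"

    #define Font_FindFont 0x40081
    #define Font_LoseFont 0x40082
    #define Font_Paint    0x40086

    void paint_utf8(const char *text, int x, int y)
    {
        int handle;

        /* 12pt Homerton.Medium in the UTF-8 encoding (sizes are point * 16) */
        _swix(Font_FindFont, _INR(1,5) | _OUT(0),
              "Homerton.Medium\\EUTF8", 12 * 16, 12 * 16, 0, 0, &handle);

        /* Flags: bit 4 = coordinates in OS units, bit 8 = handle passed in R0 */
        _swix(Font_Paint, _INR(0,4),
              handle, text, (1 << 4) | (1 << 8), x, y);

        _swix(Font_LoseFont, _IN(0), handle);
    }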