UTF-8 in the OS / outside the desktop
Pages: 1 2
nemo (145) 2552 posts |
UTF-8 inside the desktop is slightly wobbly in RO5, and unavailable in any other version of the OS. UTF-8 outside the desktop does not exist, other than here in the Nautilus.

Disappointingly (for me, at least) I’ve realised that no emulator is going to be happy with Unicode filenames until updated. This is due to the oft-mentioned HostFS strangeness.

RPCEmu will be easiest to fix. Its sin is that it uses the ANSI versions of the Win API, so even if one creates a correctly-formed UTF-8 filename, it will get encoded on the host as a series of ANSI characters and not the Cyrillic (say) one intended – Mojibake baked-in, so to speak. Conversely, any Unicode filenames on the host are simply ignored by RPCEmu. But if you don’t look at the host filing system, it works OK from the emulator’s point of view.

VirtualRPC is weirder. It tries to map the filename characters from Unicode to Latin1. Unfortunately, any it can’t map are replaced by ?.

The various HostFS implementations all do slightly different things with the mapping of filenames and file types; they’re incompatible with each other and none of them is complete or symmetrical. This is bad regardless. UTF-8 requires a more robust solution instead of the “that’ll do” knocked-up stuff we have at the moment.

Particular care must be taken with mixed-mode text, where some is UTF-8 and some a legacy encoding such as Acorn Latin1. It would be appropriate, for example, for a HostFS to send valid UTF-8 filenames through the Windows …W APIs, while sending malformed UTF-8 (or plain old Latin1) through the …A APIs. Unicode offers a reliable way of dealing with ‘illegal characters’ (from the FS API point of view) by using other characters instead (resorting to Private Use if necessary), but the ANSI route will need reversible escaping of bad characters.

Again, twas always thus. Native RO filesystems have no problem of course. |
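The routing idea in this post (valid UTF-8 through the …W APIs, everything else through the …A APIs with reversible escaping) can be sketched roughly as below. This is only an illustration, not code from any emulator: the function names, the `%XX` escape scheme and the set of host-illegal characters are all assumptions made for the example.

```python
import re

# Characters assumed (for this sketch) to be illegal in Windows filenames.
HOST_ILLEGAL = '<>:"/\\|?*'


def is_well_formed_utf8(name: bytes) -> bool:
    """True if the raw filename bytes decode as strict UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def choose_host_route(name: bytes) -> str:
    """Return 'W' (wide API) for valid UTF-8, 'A' (ANSI API) otherwise.

    Caveat: a legacy 8-bit name can be valid UTF-8 by coincidence,
    so this test is a heuristic, not a proof of the author's intent.
    """
    return "W" if is_well_formed_utf8(name) else "A"


def escape_bad(name: str) -> str:
    """Hypothetical reversible escape: illegal characters become %XX.

    '%' itself is escaped too, which is what makes the mapping reversible
    for names that were created through this function.
    """
    return "".join(
        f"%{ord(ch):02X}" if ch in HOST_ILLEGAL or ch == "%" else ch
        for ch in name
    )


def unescape_bad(name: str) -> str:
    """Inverse of escape_bad for names produced by it."""
    return re.sub(r"%([0-9A-F]{2})", lambda m: chr(int(m.group(1), 16)), name)
```

A name created on the host side that happens to contain something like `%41` would be misread by `unescape_bad`, which is exactly the kind of asymmetry a real design would have to rule out.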
||||||||||
Rick Murray (539) 13850 posts |
The problem as I see it is that there is no Unicode option. If the machine is set to Latin# then UTF-8 won’t work, and if the machine is set to UTF-8 then it’ll affect non-English versions of all existing applications. Hence, there’s not much impetus to even try to support UTF-8.
This I find strange. I used a Unicode FontManager with 3.7 and Fresco, so it existed while Acorn was still a thing. Given that, I don’t know why there was no support provided with RISC OS 4. That said, the usual question arises – should development of new features make the best use of a developing version of RISC OS, or should it be held back by versions that will never be further developed? The emulator is something of a special case, given its need to translate to and from an alien filesystem. There have been times (I forget which emulator) that renaming a text file to something/txt (for putting in a zip file) goes haywire because while it’s a name with an extension under RISC OS, it’s already been converted as such for the underlying filesystem; so it seems to sort of find the file while at the same time sort of not find it. |
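For what it’s worth, the something/txt confusion is easier to picture with a sketch of the usual separator swap. This assumes (it is not confirmed in this thread for any particular emulator) that the HostFS simply exchanges `/` and `.` in leafnames; the function name is made up for the example.

```python
# Assumed mapping: RISC OS uses '/' where the host uses '.', and vice versa.
_SWAP = str.maketrans("./", "/.")


def swap_separators(leaf: str) -> str:
    """Map a leafname between RISC OS and host form (the swap is its own inverse)."""
    return leaf.translate(_SWAP)
```

Because the swap is symmetrical, applying it twice gives the original name back. The haywire behaviour described above is what you would expect if one layer applies the mapping and another layer assumes it has not been applied yet.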
||||||||||
Steve Pampling (1551) 8172 posts |
I thought ? was one of the wildcards and not legal in filenames. Why would anyone code in something that produced files with illegal characters in the name? |
||||||||||
Steffen Huber (91) 1953 posts |
? is not allowed in Windows FSes, but is no problem on RISC OS. Not sure which side Nemo refers to; I guess he means any non-mappable character in a Windows filename gets replaced by ? on the RISC OS HostFS side. I recently looked into RPCEmu’s HostFS. It does not map all (see https://www.riscosopen.org/wiki/documentation/show/FileSwitch%20Key%20Features) illegal RISC OS filename characters, only some. Irritatingly, the files containing those unmapped illegal characters can be accessed without problems on the RISC OS side – I am still struggling to understand this! Are those characters like $ and @ only a problem in the CLI but not in the desktop? |
||||||||||
nemo (145) 2552 posts |
The problem isn’t so much the substitution, but that the substitution will never be matched. So if you create a file called
The definition of ‘illegal character’ is filing system dependent and is affected by how the FS is implemented. As you know, there are multiple levels of abstraction, for example:
Filing systems can be implemented at any of those levels. An FS that sits on the FS vectors can do anything it likes – there are no illegal characters at that layer, and in fact even the dot directory separator could be up for grabs. Admittedly, desktop software would fail to extract leafnames – a required part of the desktop save protocol. Filing systems that are clients of FileSwitch inherit quite a bit of filename syntax – spaces are a separator, the quote and vertical bar character are illegal, hash and star are wildcards, colon and dot have special meanings, and the dollar, ampersand, backslash, percent and circumflex special directory symbols are defined. It also uses GSTrans, so less than and greater than symbols can have special meaning under some circumstances (this is a known asymmetric bug). Then there’s the I think FileCore adds restrictions on special fields, but I can’t think of any additional filename restrictions from there. Finally, we have image filing systems, which can sit at any point in the directory hierarchy of another FS, but then change the parsing rules for everything within it. Then we have the filing system commands, which are a particular class of star command. They can implement their own behaviour when dealing with ‘their’ filing system – so a perfect DFS implementation would allow you to continually |
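The FileSwitch-level syntax listed in this post can be turned into a small checker. The table below is taken only from the description above, not from the PRM, so treat it as an incomplete sketch; the names `FILESWITCH_SPECIAL` and `special_chars_in` are invented for the example.

```python
# Roles as described in the post above; not an authoritative list.
FILESWITCH_SPECIAL = {
    " ": "separator",
    '"': "illegal",
    "|": "illegal",
    "#": "wildcard",
    "*": "wildcard",
    ":": "special meaning",
    ".": "special meaning",
    "$": "special directory symbol",
    "&": "special directory symbol",
    "\\": "special directory symbol",
    "%": "special directory symbol",
    "^": "special directory symbol",
    "<": "GSTrans (special in some circumstances)",
    ">": "GSTrans (special in some circumstances)",
}


def special_chars_in(leaf: str) -> list[tuple[int, str, str]]:
    """Return (index, character, role) for each FileSwitch-special character."""
    return [
        (i, ch, FILESWITCH_SPECIAL[ch])
        for i, ch in enumerate(leaf)
        if ch in FILESWITCH_SPECIAL
    ]
```

A filing system sitting directly on the FS vectors would not be bound by this table, which is the point being made above.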
||||||||||
Steffen Huber (91) 1953 posts |
I am not someone who is well-versed wrt the deeper philosophical ideas of RISC OS, but the PRM quite clearly states that
Does it make sense to ignore these rules just because we can think of other clever ways to provide something like a filing system? After all, applications and the Filer etc. will all follow those things specified in the PRM (like e.g. the rename field in the Filer, the SaveAs writable fields). |
||||||||||
nemo (145) 2552 posts |
All built-in FSes, yes. All that have ever been written? No. For an example of the general point, Jason Tribbeck’s UnixTrans allows filenames to be in Unix format. |
||||||||||
Steffen Huber (91) 1953 posts |
The PRM defines that only something based on FileSwitch is a filing system. Seems to be a sensible definition, since FileSwitch provides all the SWIs that code uses to talk to any filing system, no matter if built-in or 3rd party. That you can possibly somehow work around that fact does not make other ways of implementing an FS sensible, especially if they do not follow the rules set out in the PRM about the allowed characters in filenames on filing systems. At least in my book.
Never seen that one. |
||||||||||
nemo (145) 2552 posts |
FileSwitch does not provide any SWIs.
FileSwitch sits on the various FS vectors. Anything else is free to do so also, and many things do. FileSwitch is convenient, it is not mandatory. |
||||||||||
nemo (145) 2552 posts |
Alpha/Beta release
I am starting to release bits of the complete UTF-8 support. Don’t get excited, it’s not the pretty bits yet. My UTF-8 page will contain the parts as they come. I’m releasing two today.

The Alpha/Beta releases are unlikely to be much use to you unless you understand what UTF-8 and Unicode are (probably). Please don’t download anything if you’re just going to ask why you can’t read War and Peace in Swahili already.

Most importantly, I am not releasing the Unicode font yet. One step at a time. |
||||||||||
nemo (145) 2552 posts |
UTF8Alphabet 1.05
Now updated to actually work on 5.24. Grrrr.

This module provides the
It does some ‘interesting’ things, including rearranging the module service handler lists in order to fix/augment various bits of behaviour. Sorry, but in the absence of VectorExtend there’s no other way of fixing RO4’s Japan alphabet or providing the correct fallback behaviour.

The Fallback Alphabet is used to interpret text which clearly isn’t UTF-8. Instead of disallowing such text, as the current RO5 implementation enforces (in the desktop), the fallback strategy interprets such malformed sequences as being text in an 8-bit alphabet. This module manages the configuration of that. This has an interesting effect on a couple of APIs:

Service_International,5 – Define character set
When you switch to UTF8 (sic) on RO5, the VDU character set is redefined so that the top-bit-set characters are hexadecimal numbers. This module changes that, so the character set is redefined to be that of the fallback alphabet, as software will expect if it isn’t UTF-8 aware.

Service_International,8 – Return Unicode table
In RO5’s UTF-8 alphabet, this call is ignored. This module returns the Unicode table for the fallback alphabet, so that code that assumes it must be running under an 8-bit alphabet will work correctly.

The almost-bound-to-change Service_International reason codes 260 and 264 allow code to write and read (respectively) the fallback alphabet. 0=None, 255=Auto.

The zip file contains an extensive (ie long) ReadMe that explains… well, probably too much. Note that the module’s help string is in UTF-8 – this is deliberate! |
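As a rough illustration of the fallback strategy described above (well-formed UTF-8 sequences decode as UTF-8, anything malformed is read as text in an 8-bit alphabet), here is a minimal sketch. It is not the module’s code. ISO 8859-1 stands in for the fallback alphabet, which is an assumption: Acorn Latin1 differs from it in the 0x80–0x9F range. The handler name `fallback-latin1` and `decode_mixed` are invented for the example.

```python
import codecs


def _fallback_latin1(err):
    """Decode only the offending bytes as ISO 8859-1, then resume after them."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end


# Register the handler under a made-up name so bytes.decode() can use it.
codecs.register_error("fallback-latin1", _fallback_latin1)


def decode_mixed(data: bytes) -> str:
    """Decode mixed-mode text: UTF-8 where valid, 8-bit fallback elsewhere."""
    return data.decode("utf-8", errors="fallback-latin1")
```

With this, `b"caf\xe9 \xd0\x9f"` comes out as `café П`: the lone `0xE9` is not a valid UTF-8 sequence so it is read from the fallback alphabet, while `D0 9F` is a valid two-byte sequence and stays Cyrillic.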
||||||||||
nemo (145) 2552 posts |
Elastic 0.06
This is a fun one, and may even be generally useful
Elastic Tabstops automatically calculates the necessary column widths to display text using tabs in nice neat columns. Each consecutive run of text lines that use tabs is formatted as a section (or the whole file if the

It auto-senses line endings, and has configurable minimum column width and intra-column gaps. You can even choose to make tabs visible in a couple of ways. It’s quite swish.

However, the reason I’m releasing this here (and as a Beta) is that it is also UTF-8 aware, and its behaviour changes somewhat when run under a UTF-8 alphabet:

Obviously, it counts characters not bytes, so unless you can display the results using a UTF-8 supporting Unicode font, the formatting won’t look right. If you understand UTF-8 sequences you will be able to confirm that they are correct, I hope.

It uses my proposed mixed-mode fallback strategy that I think must be employed for RISC OS. See the UTF8Alphabet ReadMe for detailed discussions. This means it’ll display anything, theoretically.

It also gives special properties to the Line Separator and Paragraph Separator characters in Unicode – Line Separator starts a new line without starting a new section, and Paragraph Separator starts a new section immediately.

It can also suppress certain zero-width characters if the

The zip file contains a ReadMe and some test files – some of which feature UTF-8. Enjoy |
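The section-based column sizing described above can be sketched as follows. This is a simplified illustration, not the Elastic program itself: `format_elastic` and its `min_width` and `gap` parameters are hypothetical names, line endings are assumed to be `\n`, and Line Separator / Paragraph Separator handling is left out. Widths are counted in code points rather than bytes, which still is not the same thing as display width for combining or wide characters.

```python
def format_elastic(text: str, min_width: int = 1, gap: int = 2) -> str:
    """Expand tabs to spaces so that each run of tabbed lines forms neat columns."""
    lines = text.split("\n")
    out = []
    i = 0
    while i < len(lines):
        if "\t" not in lines[i]:
            out.append(lines[i])          # untabbed lines pass through unchanged
            i += 1
            continue
        # Collect one section: a consecutive run of lines that use tabs.
        j = i
        while j < len(lines) and "\t" in lines[j]:
            j += 1
        rows = [ln.split("\t") for ln in lines[i:j]]
        ncols = max(len(r) for r in rows)
        widths = [0] * ncols
        for r in rows:
            for c, cell in enumerate(r[:-1]):   # last cell is never padded
                widths[c] = max(widths[c], len(cell))  # code points, not bytes
        for r in rows:
            padded = [
                cell.ljust(max(widths[c], min_width) + gap)
                for c, cell in enumerate(r[:-1])
            ]
            out.append("".join(padded) + r[-1])
        i = j
    return "\n".join(out)
```

Because column widths are computed per section, a change in one table does not shift the columns of an unrelated table further down the file.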
||||||||||
Colin (478) 2433 posts |
I downloaded U8Alphabet and it gives an ‘abort on data transfer’ on my pi and armx6 when I try to load it. |
||||||||||
nemo (145) 2552 posts |
Ah crap. Something important moved between 5.22 and 5.24. I’ve said it before and I’ll say it again. PLEASE stop moving things. There are simply too many Kernel locations to give every one of them a unique ReadSysInfo,6 identifier, but that’s OK because there’s one within 36 bytes of the one I need NO YOU’VE MOVED IT. So the only way to deal with this is to have a hard coded address for every version. I don’t mind doing that for legacy ROMs, but it shouldn’t be happening in RO5. Damn and, indeed, blast. Give me some minutes. |
||||||||||
Colin (478) 2433 posts |
5.25 (11-May-18) on the pi and 5.23 (18-Feb-18) on the Armx6. I also had to get Otter working to download it. Whilst your web pages look very nice it would have been handy to be able to download in netsurf. |
||||||||||
nemo (145) 2552 posts |
I don’t know what’s up with Netsurf. I don’t believe Google have broken the internet though, so it’s probably Netsurf at fault, don’t you think? |
||||||||||
Clive Semmens (2335) 3276 posts |
Netsurf is certainly a bit cranky. I make an effort to make the RISCOS area of my website Netsurf friendly, but the rest of it, not so much. Life is too short. Some of it quite coincidentally works okay, some doesn’t. Ho hum. |
||||||||||
nemo (145) 2552 posts |
It’s a toss-up which one 5.23 is. I’m going to guess it’s 5.24ish, not 5.22ish in this respect.

OK, so

Unfortunately, in 5.24 whilst

This is why we can’t have nice things. I shall have to bake-in a hard-wired address for each version. This is extremely aggravating and completely avoidable. Don’t move things, there is no shortage of bytes.

OK, version 1.06 is now on the webpage – this means I have to build another test rig, because 5.22 and 5.23 are significantly different. I don’t mind 3.10 and 3.50 being slightly different, there’s a good reason. But 5.22 and 5.23? Sheesh. |
||||||||||
Colin (478) 2433 posts |
I’ve no doubt that netsurf isn’t good enough to display many websites but it’s fine for browsing this site. A direct link to the download would suffice. |
||||||||||
nemo (145) 2552 posts |
Dear National Grid, I know you’re quite keen on standards and everything, but I’ve got a TV from a company no one else in the world has heard of and its plug doesn’t quite fit your sockets. Please can you change the sockets that everyone else uses so that my strange TV works. Thanks, |
||||||||||
Colin (478) 2433 posts |
That version loaded ok on both machines. |
||||||||||
Colin (478) 2433 posts |
Dear Advertising Agency Why aren’t you generating more custom. Thanks. |
||||||||||
nemo (145) 2552 posts |
Thanks Colin. What does Netsurf struggle with? This is one of the links. Does that work? |
||||||||||
Colin (478) 2433 posts |
No just a blank page. |
||||||||||
nemo (145) 2552 posts |
Netsurf can’t cope with Google Drive then. Here’s an alternative. Can Netsurf cope with this link ? |