nemo (145) 2552 posts

UTF-8 inside the desktop is slightly wobbly in RO5, and unavailable in any other version of the OS. UTF-8 outside the desktop does not exist, other than here in the Nautilus.

Disappointingly (for me, at least) I’ve realised that no emulator is going to be happy with Unicode filenames until updated. This is due to the oft-mentioned HostFS strangeness.

RPCEmu will be easiest to fix. Its sin is that it uses ANSI versions of the Win API, so even if one creates a correctly-formed UTF-8 filename, it will get encoded on the host as a series of ANSI characters and not the Cyrillic (say) one intended – Mojibake baked-in so to speak. Conversely any Unicode filenames on the host are simply ignored by RPCEmu. But if you don’t look at the host filing system, it works OK from the emulator’s point of view.
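
To make the baked-in mojibake concrete, here is a minimal Python sketch. It assumes the host's ANSI code page is Windows-1252 (the post does not say which one), and the function names are made up for illustration. It reinterprets the UTF-8 bytes of a Cyrillic name the way an ANSI API call would.

```python
# Sketch: what an ANSI (...A) Windows API call makes of a UTF-8 filename.
# Assumption: the host ANSI code page is Windows-1252 ("cp1252").
# ansi_view and ansi_undo are hypothetical helper names.

def ansi_view(name: str, codepage: str = "cp1252") -> str:
    """Return the name as stored on the host when each UTF-8 byte is
    taken to be one character of the ANSI code page."""
    return name.encode("utf-8").decode(codepage)

def ansi_undo(garbled: str, codepage: str = "cp1252") -> str:
    """Reverse the damage; this only works while every byte maps."""
    return garbled.encode(codepage).decode("utf-8")

intended = "СТОП"
on_host = ansi_view(intended)
print(len(intended), len(on_host))  # four characters become eight
```

The round trip through `ansi_undo` is why things look fine from inside the emulator: the same bytes come back, even though the host shows a different name.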

VirtualRPC is weirder. It tries to map the filename characters from Unicode to Latin1. Unfortunately, any it can’t map are replaced by ?, which can never match anything. So you can see there are files, but you can’t interact with them. It also does weird things with spaces – hard space on RISC OS is mapped to space on the host… but it doesn’t handle hard-space in the host name, and that’s what you will get if you encode certain characters in UTF-8, which means even in ANSI mode, some legal RISC OS filenames will be completely inaccessible (but twas always thus).

The various HostFS implementations all do slightly different things with the mapping of filenames and file types; they’re incompatible with each other, and none of them is complete or symmetrical. This is bad regardless.

UTF-8 requires a more robust solution instead of the “that’ll do” knocked-up stuff we have at the moment. Particular care must be taken with mixed-mode text, where some is UTF-8 and some a legacy encoding such as Acorn Latin1. It would be appropriate, for example, for a HostFS to send valid UTF-8 filenames through the Windows …W APIs, while sending malformed UTF-8 (or plain old Latin1) through the …A APIs.
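
A rough sketch of that routing decision, in Python. The labels "W" and "A" only name the two API families; nothing here calls Windows, and `choose_api` is a hypothetical helper rather than code from any existing HostFS.

```python
# Sketch: pick a host API family per filename, as suggested above.
# "W" and "A" are labels only; no Windows API is called here.

def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes form well-formed UTF-8 (strict decoding)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def choose_api(raw_name: bytes) -> str:
    """Valid UTF-8 goes to the wide APIs, anything else to the ANSI ones."""
    return "W" if is_valid_utf8(raw_name) else "A"

print(choose_api("СТОП".encode("utf-8")))  # well-formed UTF-8
print(choose_api(b"caf\xe9"))              # a lone Latin1 e-acute byte
```

Pure ASCII names are valid UTF-8, so they take the wide route, where the result is the same either way. A legacy name can occasionally happen to be well-formed UTF-8 by accident, which is part of why mixed-mode text needs care.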

Unicode offers a reliable way of dealing with ‘illegal characters’ (from the FS API point of view) by using other characters instead (resorting to Private Use if necessary), but the ANSI route will need reversible escaping of bad characters. Again, twas always thus.
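
As an illustration of the Unicode route, the sketch below swaps characters the host refuses for Private Use code points and swaps them back on the way in. The illegal set and the U+F000 base are assumptions chosen for the example, not anything an existing HostFS is known to use.

```python
# Sketch: replace host-illegal characters with Private Use code points.
# Assumptions: HOST_ILLEGAL and PUA_BASE are illustrative choices only.

HOST_ILLEGAL = '\\/:*?"<>|'
PUA_BASE = 0xF000

def to_host(name: str) -> str:
    """Map each illegal character to PUA_BASE plus its code."""
    return "".join(
        chr(PUA_BASE + ord(c)) if c in HOST_ILLEGAL else c for c in name
    )

def from_host(name: str) -> str:
    """Undo to_host, touching only code points that it could have made."""
    out = []
    for c in name:
        code = ord(c) - PUA_BASE
        if 0 <= code < 0x80 and chr(code) in HOST_ILLEGAL:
            out.append(chr(code))
        else:
            out.append(c)
    return "".join(out)

stored = to_host("what?*")
print(from_host(stored))
```

The mapping is reversible as long as the original name does not itself contain one of the substituted Private Use code points.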

Native RO filesystems have no problem of course.

May 6, 2018 2:56pm

Rick Murray (539) 13850 posts

UTF-8 inside the desktop is slightly wobbly in RO5,

The problem as I see it is that there is no Unicode option. If the machine is set to Latin# then UTF-8 won’t work, and if the machine is set to UTF-8 then it’ll affect non-English versions of all existing applications. Hence, there’s not much impetus to even try to support UTF-8.
The sad thing is that for the most part it’s a font encoding and rendering issue so ought to be possible on an application by application basis.

and unavailable in any other version of the OS.

This I find strange. I used a Unicode FontManager with 3.7 and Fresco, so it existed while Acorn was still a thing. Given that, I don’t know why there was no support provided with RISC OS 4.

That said, the usual question arises – should development of new features make the best use of a developing version of RISC OS, or should it be held back by versions that will never be further developed?

The emulator is something of a special case, given its need to translate to and from an alien filesystem. There have been times (I forget which emulator) that renaming a text file to something/txt (for putting in a zip file) goes haywire because while it’s a name with an extension under RISC OS, it’s already been converted as such for the underlying filesystem; so it seems to sort of find the file while at the same time sort of not find it.
And, God, what a mess it is changing the type of an open file. Not a problem with RISC OS, plenty of problem with an emulator (as it alters the extension making it an entirely different file).
Like I said, emulation is a special case…

May 6, 2018 3:51pm

Steve Pampling (1551) 8172 posts

VirtualRPC is weirder. It tries to map the filename characters from Unicode to Latin1. Unfortunately, any it can’t map are replaced by ?, which can never match anything.

I thought ? was one of the wildcards and not legal in filenames. Why would anyone code in something that produced files with illegal characters in the name?

May 6, 2018 5:27pm

Steffen Huber (91) 1953 posts

? is not allowed in Windows FSes, but is no problem on RISC OS. I’m not sure which side Nemo refers to; I guess he means any non-mappable character in a Windows filename gets replaced by ? on the RISC OS HostFS side.

I recently looked into RPCEmu’s HostFS. It does not map all (see https://www.riscosopen.org/wiki/documentation/show/FileSwitch%20Key%20Features) illegal RISC OS filename characters, only some. Irritatingly, the files containing those unmapped illegal characters can be accessed without problems on the RISC OS side – I am still struggling to understand this! Are those characters like $ and @ only a problem in the CLI but not in the desktop?

May 6, 2018 6:12pm

nemo (145) 2552 posts

The problem isn’t so much the substitution, but that the substitution will never be matched. So if you create a file called СТОП for example, it will appear in the RO world as ????, but if you try to access the file ???? you get File not found.
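
A small Python sketch of why the substituted name can never be matched: the replacement is lossy, so different host names collapse to the same string and there is no way back. Python's Latin-1 codec stands in here for whatever mapping VirtualRPC actually performs, which is an assumption.

```python
# Sketch: lossy substitution of unmappable characters with "?".
# Assumption: Python's "latin-1" codec approximates the emulator's mapping.

def to_latin1_lossy(name: str) -> bytes:
    """Encode to Latin-1, replacing anything unmappable with '?'."""
    return name.encode("latin-1", errors="replace")

names = ["СТОП", "ДОМА"]
mapped = [to_latin1_lossy(n) for n in names]
print(mapped[0] == mapped[1])  # two different files, one visible name
```

Once both names have become the same four question marks, a lookup for that string cannot tell which host file was meant.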

Are those characters like $ and @ only a problem in the CLI but not in the desktop?

The definition of ‘illegal character’ is filing system dependent and is affected by how the FS is implemented. As you know, there are multiple levels of abstraction, for example:

OS SWIs -> Vectors -> FileSwitch -> FileCore -> ADFS

Filing systems can be implemented at any of those levels.

An FS that sits on the FS vectors can do anything it likes – there are no illegal characters at that layer, and in fact even the dot directory separator could be up for grabs. Admittedly, desktop software would fail to extract leafnames – a required part of the desktop save protocol.

Filing systems that are clients of FileSwitch inherit quite a bit of filename syntax – spaces are a separator, the quote and vertical bar character are illegal, hash and star are wildcards, colon and dot have special meanings, and the dollar, ampersand, backslash, percent and circumflex special directory symbols are defined. It also uses GSTrans, so less than and greater than symbols can have special meaning under some circumstances (this is a known asymmetric bug). Then there’s the Something$Path convention, which means that commas can’t be used in directory names.
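
The list above can be turned into a simple checker. This is only a sketch built from the characters named in this post; the table layout and the name `problems` are hypothetical, and it is not a statement of what FileSwitch itself enforces.

```python
# Sketch: flag characters that the post says carry meaning for
# FileSwitch clients. The table is transcribed from the post only.

RESERVED = {
    " ": "separator",
    '"': "illegal",
    "|": "illegal",
    "#": "wildcard",
    "*": "wildcard",
    ":": "special meaning",
    ".": "special meaning",
    "$": "special directory symbol",
    "&": "special directory symbol",
    "\\": "special directory symbol",
    "%": "special directory symbol",
    "^": "special directory symbol",
    "<": "GSTrans, in some circumstances",
    ">": "GSTrans, in some circumstances",
    ",": "clashes with Something$Path lists in directory names",
}

def problems(leaf: str) -> list:
    """Return (character, reason) pairs for each flagged character."""
    return [(c, RESERVED[c]) for c in leaf if c in RESERVED]

print(problems("Read*Me"))
```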

I think FileCore adds restrictions on special fields, but I can’t think of any additional filename restrictions from there.

Finally, we have image filing systems, which can sit at any point in the directory hierarchy of another FS, but then change the parsing rules for everything within it.

Then we have the filing system commands, which are a particular class of star command. They can implement their own behaviour when dealing with ‘their’ filing system – so a perfect DFS implementation would allow you to continually *DIR A, *DIR B, and ought to allow you to *DIR # even though FileSwitch would have kittens.

May 6, 2018 8:53pm

Steffen Huber (91) 1953 posts

Filing systems can be implemented at any of those levels.

I am not someone who is well-versed wrt the deeper philosophical ideas of RISC OS, but the PRM quite clearly states that

all RISC OS filing systems are Fileswitch-based (2-3 and 2-9)
there are characters with specific meanings that must not be used in filenames (2-12)

Does it make sense to ignore these rules just because we can think of other clever ways to provide something like a filing system? After all, applications and the Filer etc. will all follow those things specified in the PRM (like e.g. the rename field in the filer, the SaveAs writable fields).

May 7, 2018 12:30pm

nemo (145) 2552 posts

all RISC OS filing systems are Fileswitch-based

All built-in FSes, yes. All that have ever been written? No.

For an example of the general point, Jason Tribbeck’s UnixTrans allows filenames to be in Unix format.

May 7, 2018 6:18pm

Steffen Huber (91) 1953 posts

All built-in FSes, yes. All that have ever been written? No.

The PRM defines that only something based on Fileswitch is a filing system. Seems to be a sensible definition, since Fileswitch provides all the SWIs that code uses to talk to any filing system, no matter if built-in or 3rd party.

That you can possibly somehow work around that fact does not make other ways of implementing an FS sensible, especially if they do not follow the rules set out in the PRM about the allowed characters in filenames on filing systems. At least in my book.

For an example of the general point, Jason Tribbeck’s UnixTrans allows filenames to be in Unix format.

Never seen that one.

May 8, 2018 5:43pm

nemo (145) 2552 posts

Fileswitch provides all the SWIs

FileSwitch does not provide any SWIs.

That you can possibly somehow work around that fact

FileSwitch sits on the various FS vectors. Anything else is free to do so also, and many things do. FileSwitch is convenient, it is not mandatory.

May 21, 2018 6:05pm

nemo (145) 2552 posts

Alpha/Beta release

I am starting to release bits of the complete UTF-8 support. Don’t get excited, it’s not the pretty bits yet.

My UTF-8 page will contain the parts as they come. I’m releasing two today.

The Alpha/Beta releases are unlikely to be much use to you unless you understand what UTF-8 and Unicode are (probably). Please don’t download anything if you’re just going to ask why you can’t read War and Peace in Swahili already.

Most importantly I am not releasing the Unicode font yet

One step at a time.

May 21, 2018 6:07pm

nemo (145) 2552 posts

UTF8Alphabet 1.05

Now updated to actually work on 5.24. Grrrr.

This module provides the UTF8 (sic) and UTF-8 alphabet for ‘all’ versions of RISC OS. It also:

Defines various Asian countries, and provides the correct alphabet for Japan in RO4
Provides the ISO 3166-1 country codes if they are not defined
Provides the alphabet Unicode tables if they are missing
Implements the Fallback Alphabet, via a new *FallbackAlphabet command

It does some ‘interesting’ things, including rearranging the module service handler lists in order to fix/augment various bits of behaviour. Sorry, but in the absence of VectorExtend there’s no other way of fixing RO4’s Japan alphabet or providing the correct fallback behaviour.

The Fallback Alphabet is used to interpret text which clearly isn’t UTF-8. Instead of disallowing such text, as the current RO5 implementation enforces (in the desktop), the fallback strategy interprets such malformed sequences as being text in an 8-bit alphabet. This module manages the configuration of that.
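
A minimal Python sketch of the fallback idea, not of the module's actual algorithm (the ReadMe is the authority on that). It decodes well-formed UTF-8 normally and reinterprets any malformed bytes in an 8-bit alphabet. Python has no Acorn Latin1 codec, so ISO Latin-1 stands in as an assumption, and the handler name is made up.

```python
# Sketch: decode mixed text, falling back to an 8-bit alphabet for
# bytes that are not well-formed UTF-8.
# Assumption: "latin-1" stands in for the configured fallback alphabet.
import codecs

def make_fallback_handler(fallback: str):
    """Build a decode error handler that reads bad bytes as `fallback`."""
    def handler(exc):
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad = exc.object[exc.start:exc.end]
        return bad.decode(fallback), exc.end  # resume after the bad bytes
    return handler

codecs.register_error("fallback-latin1", make_fallback_handler("latin-1"))

def decode_mixed(raw: bytes) -> str:
    return raw.decode("utf-8", errors="fallback-latin1")

mixed = b"caf\xe9 " + "СТОП".encode("utf-8")
print(decode_mixed(mixed))
```

The point of the design is that nothing is rejected: legacy text and UTF-8 can sit in the same string and both come out readable.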

This has an interesting effect on a couple of APIs:

Service_International,5 – Define character set

When you switch to UTF8 (sic) on RO5, the VDU character set is redefined so that the top-bit-set characters are hexadecimal numbers. This module changes that, so the character set is redefined to be that of the fallback alphabet, as software will expect if it isn’t UTF-8 aware.

Service_International,8 – Return Unicode table

In RO5’s UTF-8 alphabet, this call is ignored. This module returns the Unicode table for the fallback alphabet, so that code that assumes it must be running under an 8-bit alphabet will work correctly.

*FallbackAlphabet	Display current fallback alphabet
*FallbackAlphabet alphabetname	Change fallback alphabet
*FallbackAlphabet countryname	Change fallback alphabet
*FallbackAlphabet Auto	Guess the fallback alphabet dynamically
*FallbackAlphabet None	No fallback

*FallbackAlphabet Auto follows the current alphabet setting, so if you *Alphabet Cyrillic then *Alphabet UTF8 the fallback alphabet will remain Cyrillic.
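
One way to read the Auto behaviour is as a small piece of state that remembers the last 8-bit alphabet. The sketch below is that reading only; the class and its names are hypothetical and do not reflect how the module stores anything.

```python
# Sketch: Auto mode remembers the most recent non-UTF-8 alphabet.
# Assumption: this models the described behaviour, not the module's code.

class AlphabetState:
    UTF8_NAMES = ("UTF8", "UTF-8")

    def __init__(self, alphabet: str = "Latin1"):
        self.alphabet = alphabet
        self.fallback = alphabet

    def set_alphabet(self, name: str) -> None:
        """Model *Alphabet <name> while the fallback is set to Auto."""
        self.alphabet = name
        if name not in self.UTF8_NAMES:
            self.fallback = name  # follow the current 8-bit alphabet

state = AlphabetState()
state.set_alphabet("Cyrillic")
state.set_alphabet("UTF8")
print(state.alphabet, state.fallback)
```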

The almost-bound-to-change Service_International reason codes 260 and 264 allow code to write and read (respectively) the fallback alphabet. 0=None, 255=Auto.

The zip file contains an extensive (ie long) ReadMe that explains… well, probably too much.

Note that the module’s help string is in UTF-8 — this is deliberate!

May 21, 2018 6:27pm

nemo (145) 2552 posts

Elastic 0.06

This is a fun one, and may even be generally useful

*Elastic is a text file display utility not unlike *Print except it formats text using the Elastic Tabstops method probably invented by Nick Gravgaard (Hi Nick!).

Elastic Tabstops automatically calculates the necessary column widths to display text using tabs in nice neat columns. Each consecutive run of text lines that use tabs is formatted as a section (or the whole file if the -all switch is used). Sections separated by lines not containing tabs are formatted separately (again, unless -all). It also supports VDU sequences so the text can change colour, redefine characters or even draw graphics without messing up the tabulation.
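
For anyone unfamiliar with the method, here is a simplified Python sketch of the sectioning described above. It is not the code of *Elastic: it ignores VDU sequences and the -all switch, it sizes columns per section rather than per block of adjacent cells, and the parameter names `min_width` and `gap` are assumptions.

```python
# Sketch: format runs of tab-containing lines into aligned columns.
# Assumptions: simplified per-section widths; no VDU sequence handling.

def elastic(text: str, min_width: int = 0, gap: int = 2) -> str:
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        if "\t" not in lines[i]:
            out.append(lines[i])      # untabbed lines pass through
            i += 1
            continue
        j = i                         # gather one section of tabbed lines
        while j < len(lines) and "\t" in lines[j]:
            j += 1
        rows = [line.split("\t") for line in lines[i:j]]
        widths = [0] * max(len(r) for r in rows)
        for r in rows:
            for k, cell in enumerate(r[:-1]):  # last cell needs no padding
                widths[k] = max(widths[k], len(cell) + gap, min_width)
        for r in rows:
            padded = [c.ljust(widths[k]) for k, c in enumerate(r[:-1])]
            out.append("".join(padded) + r[-1])
        i = j
    return "\n".join(out)

print(elastic("a\tbb\nccc\td\nplain\nx\ty"))
```

Because `len` counts characters rather than bytes on a Python string, the widths stay correct for decoded UTF-8 text, which is the same distinction the utility has to make under a UTF-8 alphabet.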

It auto-senses line endings, and has configurable minimum column width and intra-column gaps. You can even choose to make tabs visible in a couple of ways. It’s quite swish.

However, the reason I’m releasing this here (and as a Beta) is that it is also UTF-8 aware, and its behaviour changes somewhat when run under a UTF-8 alphabet:

Obviously, it counts characters not bytes, so unless you can display the results using a UTF-8 supporting Unicode font, the formatting won’t look right. If you understand UTF-8 sequences you will be able to confirm that they are correct, I hope.

It uses my proposed mixed-mode fallback strategy that I think must be employed for RISC OS. See the UTF8Alphabet ReadMe for detailed discussions. This means it’ll display anything, theoretically.

It also gives special properties to the Line Separator and Paragraph Separator characters in Unicode – Line Separator starts a new line without starting a new section, and Paragraph Separator starts a new section immediately.

It can also suppress certain zero-width characters if the -zero switch is used.

The zip file contains a ReadMe and some test files – some of which feature UTF-8.

Enjoy

May 21, 2018 6:45pm

Colin (478) 2433 posts

I downloaded U8Alphabet and it gives an ‘abort on data transfer’ on my pi and armx6 when I try to load it.

May 21, 2018 8:29pm

nemo (145) 2552 posts

~~OS version?~~

Ah crap. Something important moved between 5.22 and 5.24.

I’ve said it before and I’ll say it again. PLEASE stop moving things.

There are simply too many Kernel locations to give every one of them a unique ReadSysInfo,6 identifier, but that’s OK because there’s one within 36 bytes of the one I need NO YOU’VE MOVED IT.

So the only way to deal with this is to have a hard coded address for every version.

I don’t mind doing that for legacy ROMs, but it shouldn’t be happening in RO5.

Damn and, indeed, blast. Give me some minutes.

May 21, 2018 8:45pm

Colin (478) 2433 posts

5.25 (11-May-18) on the pi and 5.23 (18-Feb-18) on the Armx6.

I also had to get Otter working to download it. Whilst your web pages look very nice it would have been handy to be able to download in netsurf.

May 21, 2018 8:48pm

nemo (145) 2552 posts

I don’t know what’s up with Netsurf. I don’t believe Google have broken the internet though, so it’s probably Netsurf at fault, don’t you think?

May 21, 2018 8:58pm

Clive Semmens (2335) 3276 posts

Netsurf is certainly a bit cranky. I make an effort to make the RISCOS area of my website Netsurf friendly, but the rest of it, not so much. Life is too short. Some of it quite coincidentally works okay, some doesn’t. Ho hum.

May 21, 2018 9:15pm

nemo (145) 2552 posts

It’s a toss-up which one 5.23 is. I’m going to guess it’s 5.24ish, not 5.22ish in this respect.

OK, so Serv_SysChains, for which there is no ReadSysInfo,6 identifier, has moved. It used to be &194. Well that’s OK, because DAList is at &170 and it does have a ReadSysInfo,6 identifier.

Unfortunately, in 5.24 whilst DAList is still at &170 (for which there is a way to read the address), Serv_SysChains has silently moved to &164.

This is why we can’t have nice things

I shall have to bake-in a hard-wired address for each version. This is extremely aggravating and completely avoidable.
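
The shape of that workaround, sketched in Python rather than the module's own code. The two values are the ones quoted in this post, on the assumption that "used to be" refers to 5.22. The table and the function name are hypothetical, and any other version deliberately fails rather than guessing.

```python
# Sketch: a per-version table of a location with no ReadSysInfo,6 id.
# Values are those quoted in the post; "5.22" for &194 is an assumption.

SERV_SYSCHAINS = {
    "5.22": 0x194,
    "5.24": 0x164,
}

def serv_syschains(version: str) -> int:
    """Look up the hard-wired location, refusing unknown versions."""
    try:
        return SERV_SYSCHAINS[version]
    except KeyError:
        raise LookupError(
            f"no known Serv_SysChains location for RISC OS {version}"
        ) from None

print(hex(serv_syschains("5.24")))
```

Failing loudly on an unlisted version is the safer choice here, since a wrong location is what produced the abort in the first place.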

Don’t move things, there is no shortage of bytes

OK, version 1.06 is now on the webpage – this means I have to build another test rig, because 5.22 and 5.23 are significantly different. I don’t mind 3.10 and 3.50 being slightly different, there’s a good reason. But 5.22 and 5.23? Sheesh.

May 21, 2018 9:45pm

Colin (478) 2433 posts

I’ve no doubt that netsurf isn’t good enough to display many websites but it’s fine for browsing this site.

A direct link to the download would suffice.

May 21, 2018 9:54pm

nemo (145) 2552 posts

Dear National Grid,

I know you’re quite keen on standards and everything, but I’ve got a TV from a company no one else in the world has heard of and its plug doesn’t quite fit your sockets.

Please can you change the sockets that everyone else uses so that my strange TV works.

Thanks,
Outraged of Leicester.

May 21, 2018 9:55pm

Colin (478) 2433 posts

version 1.06 is now on the webpage

That version loaded ok on both machines.

May 21, 2018 9:58pm

Colin (478) 2433 posts

Dear Advertising Agency

Why aren’t you generating more custom?

Thanks.
Puzzled Chairman

May 21, 2018 10:00pm

nemo (145) 2552 posts

That version loaded ok on both machines.

Thanks Colin.

What does Netsurf struggle with? This is one of the links. Does that work?

May 21, 2018 10:05pm

Colin (478) 2433 posts

No just a blank page.

May 21, 2018 10:40pm

nemo (145) 2552 posts

No

Netsurf can’t cope with Google Drive then.

Here’s an alternative. Can Netsurf cope with this link?

UTF-8 in the OS / outside the desktop
