Filename Translation
Steffen Huber (91) 1953 posts |
But Theo’s point was that, if you use UTF-8 encoding throughout, RISC OS can handle all those i18n filenames. Not that the names will look the same on both OSes, but the name mapping can be 1:1, because RISC OS can handle top-bit-set 8-bit characters in filenames without problems. So you can always convert from whatever encoding the host FS uses to UTF-8 and show the result on the RISC OS side. At least this allows access to all files. Then you can still allow the user to restrict the client’s FS to a subset, with e.g. encoding conversion to the RISC OS alphabet. |
Colin (478) 2433 posts |
Can you elaborate on how UTF-8 can be used? As I see it:
1) local storage filenames are encoded in the current territory
2) remote storage filenames are encoded in Unicode
3) to copy from local to remote, ignoring other filing system differences, you need to translate territory→unicode
4) to copy from remote to local, you need to translate unicode→territory
If in 4 you do unicode→utf8 instead of unicode→territory, then you have changed the filename: double-click a file, modify it, save it, and it would be saved as a different file. The only way I can see UTF-8 working is if the current territory uses UTF-8. Would love to be proven wrong. |
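A minimal sketch of the failure mode in point 4, assuming a Latin-1 territory (the filename is invented): the local byte string and its UTF-8 re-encoding differ, so a filing system comparing raw bytes sees two different files.

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "Café" as a Latin-1 territory stores it: e-acute is the byte 0xE9 */
    const char *latin1 = "Caf\xE9";

    /* the same name re-encoded as UTF-8: e-acute becomes 0xC3 0xA9 */
    const char *utf8 = "Caf\xC3\xA9";

    /* a filing system comparing raw bytes sees two different names */
    printf("same name? %s\n", strcmp(latin1, utf8) == 0 ? "yes" : "no");
    return 0;
}
```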
David J. Ruck (33) 1636 posts |
It’s got to be |
Colin (478) 2433 posts |
If that is a reply to my post, do you mean the current territory needs to be UTF-8, or something else? |
GavinWraith (26) 1563 posts |
Lua, and with it RiscLua, has a utf8 library. See the manual for what its functions do.
|
David J. Ruck (33) 1636 posts |
@Colin, the only sensible option is ubiquitous UTF-8 support throughout the operating system. As soon as you start converting between UTF-8 and 8-bit characters, you are going to run into display issues and characters which get translated back to something different. |
Chris Hall (132) 3558 posts |
A horrible fudge that will annoy everyone is to handle filenames ignoring whether they are UTF-8 or Latin-1 (it is not possible to tell) except when displaying on screen in a filer window. When displaying, any valid two-, three- or four-byte UTF-8 sequence can be translated to Latin-1 where there is a corresponding Latin-1 character. A lone top-bit-set byte in the Ax/Bx range cannot start a UTF-8 sequence, so it can be taken as Latin-1 directly. Unless all characters in the filename have a Latin-1 equivalent, leave it untranslated. Ducks to avoid brickbats. |
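A sketch of what that display-time translation could look like in C, assuming a Latin-1 territory; the function name is hypothetical, and overlong encodings are not rejected, to keep it short.

```c
#include <stddef.h>

/* If `name` is entirely valid UTF-8 and every code point has a Latin-1
 * equivalent (<= U+00FF), write the Latin-1 form to `out` and return 1;
 * otherwise return 0 and the caller displays the name untranslated. */
static int utf8_name_to_latin1(const unsigned char *name, char *out, size_t outsz)
{
    size_t o = 0;
    while (*name) {
        unsigned char b = *name++;
        unsigned cp, extra;
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return 0;                      /* lone Ax/Bx byte etc.: not UTF-8 */
        while (extra--) {
            if ((*name & 0xC0) != 0x80) return 0;   /* truncated sequence */
            cp = (cp << 6) | (*name++ & 0x3F);
        }
        if (cp > 0xFF || o + 1 >= outsz) return 0;  /* no Latin-1 equivalent */
        out[o++] = (char)cp;
    }
    out[o] = 0;
    return 1;
}
```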
Colin (478) 2433 posts |
@David Yes, a UTF-8 OS is the best solution, but let’s face it, it is never going to happen. |
Steffen Huber (91) 1953 posts |
You use the bytes of your UTF-8 source directly and let RISC OS represent them as characters (in the usual single-byte encoding, because we all know how well the UTF-8 alphabet works). This works because RISC OS can use all byte values 128-255 for filename characters, and a UTF-8 byte sequence is always top-bit-set (unless it is the single-byte-encoded ASCII stuff, which is no problem anyway). That covers reading, and writing back under the same name.

All applications (beyond the obviously broken ones) can work with those filenames, unless you run into the filename/pathname length limitations of course, which is more likely with a multi-byte encoding, but that is a generic, unsolvable problem.

To make it a better round-trip solution, you “just” need to additionally handle the bytes that are not valid in UTF-8 sequences. From RISC OS to server, you translate those bytes to “true” UTF-8 representations of that byte/character, and for all of those characters you do the same in reverse. Problem solved. I think. I hope. Maybe I should knock up some test code to prove the theory… |
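Here is roughly what the RISC OS → server direction of that scheme might look like in C. The function name is invented, and a real version would also need the agreed reverse mapping for the server → RISC OS direction.

```c
#include <stddef.h>

/* Valid UTF-8 sequences pass through untouched; any stray byte that
 * cannot be part of one is promoted to the two-byte UTF-8 encoding of
 * the code point U+0080..U+00FF. `out` is assumed large enough (worst
 * case twice the input length plus one). */
static size_t roname_to_utf8(const unsigned char *in, unsigned char *out)
{
    size_t o = 0;
    while (*in) {
        unsigned char b = in[0];
        unsigned extra =
            (b < 0x80)           ? 0 :
            ((b & 0xE0) == 0xC0) ? 1 :
            ((b & 0xF0) == 0xE0) ? 2 :
            ((b & 0xF8) == 0xF0) ? 3 : 255;
        unsigned i, ok = (extra != 255);
        for (i = 1; ok && i <= extra; i++)
            ok = (in[i] & 0xC0) == 0x80;
        if (ok) {                        /* already valid UTF-8: copy through */
            for (i = 0; i <= extra; i++)
                out[o++] = *in++;
        } else {                         /* stray byte: escape as U+00xx */
            out[o++] = 0xC0 | (b >> 6);
            out[o++] = 0x80 | (b & 0x3F);
            in++;
        }
    }
    out[o] = 0;
    return o;
}
```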
Rick Murray (539) 13850 posts |
Yes, the great thing about UTF-8 is that the byte sequences are self-documenting, so invalid things (like “this is actually Latin-1”) are easily found. So easily that I’m surprised nobody ever thought to give FontManager a fallback option when in UTF-8 mode, like “set this flag and if the character isn’t valid UTF-8, it’ll be treated as Latin-1”… |
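Such a fallback might amount to no more than this sketch (invented code, not anything FontManager actually offers): try to decode a UTF-8 sequence, and on any invalid byte consume a single byte as Latin-1.

```c
/* Decode one character at *s, advancing the pointer. Valid UTF-8 is
 * decoded normally; anything else falls back to one byte of Latin-1. */
static unsigned decode_with_fallback(const unsigned char **s)
{
    const unsigned char *p = *s;
    unsigned char b = p[0];
    unsigned cp, extra, i;

    if (b < 0x80)                { *s = p + 1; return b; }   /* plain ASCII */
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
    else                         { *s = p + 1; return b; }   /* Latin-1 fallback */

    for (i = 1; i <= extra; i++) {
        if ((p[i] & 0xC0) != 0x80) { *s = p + 1; return b; } /* Latin-1 fallback */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *s = p + 1 + extra;
    return cp;
}
```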
Colin (478) 2433 posts |
If you:
local→remote: validate that a node in the path is UTF-8; if it isn’t, convert it to UTF-8 – or in my case UTF-16.
remote→local: read as UTF-8.
Then the node name has changed. |
Stuart Swales (8827) 1357 posts |
@Colin: UTF-16 (using surrogates for non-BMP characters) supports all the same code points as UTF-8, so the transformation local ↔ remote would be lossless, were a local filing system to support UTF-8. |
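The surrogate step Stuart refers to is small; a sketch of the code point ↔ UTF-16 conversion, assuming well-formed input:

```c
#include <stdint.h>

/* Code points above U+FFFF use a high/low surrogate pair, so nothing is
 * lost going UTF-8 -> code point -> UTF-16 and back. */
static int cp_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {                         /* BMP: one 16-bit unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* non-BMP: surrogate pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}

static uint32_t utf16_to_cp(const uint16_t *u, int *used)
{
    if (u[0] >= 0xD800 && u[0] <= 0xDBFF) {     /* high surrogate leads */
        *used = 2;
        return 0x10000 + ((uint32_t)(u[0] - 0xD800) << 10)
                       + (uint32_t)(u[1] - 0xDC00);
    }
    *used = 1;
    return u[0];
}
```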
Colin (478) 2433 posts |
Yes, I understand that utf8→utf16 is lossless. I don’t know if I’m coming across as arguing against UTF-8 – I’m not. I just don’t see how I can use it. |
Stuart Swales (8827) 1357 posts |
I suppose one way to move forward would be to see if the remote name contains only characters from the current local encoding and, if so, present it in that*, whether that is Latin1/2/etc or even UTF-8. If it contains characters which are unrepresentable in the current local encoding, you may as well just present it as UTF-8 rather than coming up with some other escaping mechanism. Edit: * where such a local name wouldn’t contain a sequence of top-bit-set characters that would be valid UTF-8 (for a non-UTF-8 local encoding). |
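Pulling the earlier sketches together, the rule might look like this; utf8_name_to_latin1 is the hypothetical helper sketched above (standing in for “convert to the current local encoding”), and utf8_valid is an assumed plain validity check:

```c
#include <stddef.h>

/* Hypothetical helpers: utf8_name_to_latin1 is sketched earlier in the
 * thread; utf8_valid reports whether a byte string parses as UTF-8. */
int utf8_name_to_latin1(const unsigned char *name, char *out, size_t outsz);
int utf8_valid(const char *bytes);

/* Stuart's rule for a Latin-1 territory: present the remote name locally
 * only if every character converts AND the converted bytes could not be
 * mistaken for UTF-8 (his edit). A pure-ASCII result is valid UTF-8 too,
 * but then both forms are byte-identical, so nothing is lost. */
static const char *present_name(const unsigned char *remote,
                                char *buf, size_t bufsz)
{
    if (utf8_name_to_latin1(remote, buf, bufsz) && !utf8_valid(buf))
        return buf;
    return (const char *)remote;
}
```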
Rick Murray (539) 13850 posts |
Or just transition to UTF-8, and then the problem only exists for legacy things that can be handled specially, much as was the case when the assumption of a ten-character filename [1] was removed.

The benefit of UTF-8 is that it works with all existing string handling. Other Unicode encodings require more storage space and pad out the unused bytes with nulls. UTF-8, however, just looks like a regular C-style string, and the high-bit-set behaviour is extremely predictable, so adding support for it [2] shouldn’t present too much of a problem.

[1] That was only ever FileCore. DOSFS 8.3 was twelve characters, and so on.
[2] Handling cursor movement over UTF-8 in Ovation was really simple because of this predictability.
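A sketch of the kind of caret movement footnote [2] alludes to, relying only on continuation bytes matching 10xxxxxx (in the spirit of the Ovation code, not taken from it):

```c
#include <stddef.h>

/* "Move one character" is just "step one byte, then skip continuations". */
static size_t caret_right(const unsigned char *s, size_t pos)
{
    if (s[pos] == 0) return pos;              /* already at end */
    pos++;
    while ((s[pos] & 0xC0) == 0x80) pos++;    /* skip continuation bytes */
    return pos;
}

static size_t caret_left(const unsigned char *s, size_t pos)
{
    if (pos == 0) return 0;                   /* already at start */
    pos--;
    while (pos > 0 && (s[pos] & 0xC0) == 0x80) pos--;
    return pos;
}
```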