Filename Translation
Steffen Huber (91) 1953 posts |
But Theo’s point was that, if you use UTF-8 encoding throughout, RISC OS can handle all those i18n filenames. Not that the names will look the same on both OSes, but the name mapping can be 1:1, because RISC OS can handle top-bit-set 8-bit characters in filenames without problems. So you can always convert from whatever encoding the host FS uses to UTF-8 and show the result on the RISC OS side. At least this allows access to all files. Then you can still allow the user to restrict the client’s FS to a subset, with e.g. encoding conversion to the RISC OS alphabet. |
Colin (478) 2433 posts |
Can you elaborate on how UTF-8 can be used? As I see it:
1) local storage filenames are encoded in the current territory
2) remote storage filenames are encoded in Unicode
3) to copy from local to remote, ignoring other filing system differences, you need to translate territory→unicode
4) to copy from remote to local, you need to translate unicode→territory
If in 4 you do unicode→utf8 instead of unicode→territory, then you have changed the filename: double-click a file, modify it, save it, and it would be saved as a different file. The only way I can see UTF-8 working is if the current territory uses UTF-8. Would love to be proven wrong. |
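A minimal sketch of the failure mode in point 4, assuming a Latin-1 territory (the filename is invented): the local byte string and its UTF-8 re-encoding differ, so a filing system comparing raw bytes sees two different files.

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "Café" as a Latin-1 territory stores it: e-acute is the byte 0xE9 */
    const char *latin1 = "Caf\xE9";

    /* the same name re-encoded as UTF-8: e-acute becomes 0xC3 0xA9 */
    const char *utf8 = "Caf\xC3\xA9";

    /* a filing system comparing raw bytes sees two different names */
    printf("same name? %s\n", strcmp(latin1, utf8) == 0 ? "yes" : "no");
    return 0;
}
```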
David J. Ruck (33) 1636 posts |
It’s got to be |
Colin (478) 2433 posts |
If that is a reply to my post, do you mean the current territory needs to be UTF-8, or something else? |
GavinWraith (26) 1563 posts |
Lua, and with it RiscLua, has a utf8 library. See the manual for what its functions do.
|
David J. Ruck (33) 1636 posts |
@Colin, the only sensible option is ubiquitous UTF-8 support throughout the operating system. As soon as you start converting between UTF-8 and 8-bit characters, you are going to run into display issues and characters which get translated back to something different. |
Chris Hall (132) 3558 posts |
A horrible fudge that will annoy everyone is to handle filenames ignoring whether they are UTF-8 or Latin-1 (it is not possible to tell) except when displaying on screen in a filer window. When displaying, any valid two-, three- or four-byte UTF-8 sequence can be translated to Latin-1 where there is a corresponding Latin-1 character. A lone top-bit-set byte in the Ax/Bx range cannot start a UTF-8 sequence, so it can be taken as Latin-1 directly. Unless all characters in the filename have a Latin-1 equivalent, leave it untranslated. Ducks to avoid brickbats. |
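A sketch of what that display-time translation could look like in C, assuming a Latin-1 territory; the function name is hypothetical, and overlong encodings are not rejected, to keep it short.

```c
#include <stddef.h>

/* If `name` is entirely valid UTF-8 and every code point has a Latin-1
 * equivalent (<= U+00FF), write the Latin-1 form to `out` and return 1;
 * otherwise return 0 and the caller displays the name untranslated. */
static int utf8_name_to_latin1(const unsigned char *name, char *out, size_t outsz)
{
    size_t o = 0;
    while (*name) {
        unsigned char b = *name++;
        unsigned cp, extra;
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return 0;                      /* lone Ax/Bx byte etc.: not UTF-8 */
        while (extra--) {
            if ((*name & 0xC0) != 0x80) return 0;   /* truncated sequence */
            cp = (cp << 6) | (*name++ & 0x3F);
        }
        if (cp > 0xFF || o + 1 >= outsz) return 0;  /* no Latin-1 equivalent */
        out[o++] = (char)cp;
    }
    out[o] = 0;
    return 1;
}
```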
Colin (478) 2433 posts |
@David Yes, a UTF-8 OS is the best solution, but let’s face it, it is never going to happen. |
Steffen Huber (91) 1953 posts |
You use the bytes of your UTF-8 source directly and let RISC OS represent them as characters (in the usual single-byte encoding, because we all know how well the UTF-8 alphabet works). This works because RISC OS can use all byte values 128-255 for filename characters, and a UTF-8 byte sequence is always top-bit-set (unless it is the single-byte-encoded ASCII stuff, which is no problem anyway). That covers reading, and writing back under the same name.

All applications (beyond the obviously broken ones) can work with those filenames, unless you run into the filename/pathname length limitations of course, which is more likely with a multi-byte encoding, but that is a generic, unsolvable problem.

To make it a better round-trip solution, you “just” need to additionally handle the bytes that are not valid in UTF-8 sequences. From RISC OS to server, you translate those bytes to “true” UTF-8 representations of that byte/character, and for all of those characters you do the same in reverse. Problem solved. I think. I hope. Maybe I should knock up some test code to prove the theory… |
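Here is roughly what the RISC OS → server direction of that scheme might look like in C. The function name is invented, and a real version would also need the agreed reverse mapping for the server → RISC OS direction.

```c
#include <stddef.h>

/* Valid UTF-8 sequences pass through untouched; any stray byte that
 * cannot be part of one is promoted to the two-byte UTF-8 encoding of
 * the code point U+0080..U+00FF. `out` is assumed large enough (worst
 * case twice the input length plus one). */
static size_t roname_to_utf8(const unsigned char *in, unsigned char *out)
{
    size_t o = 0;
    while (*in) {
        unsigned char b = in[0];
        unsigned extra =
            (b < 0x80)           ? 0 :
            ((b & 0xE0) == 0xC0) ? 1 :
            ((b & 0xF0) == 0xE0) ? 2 :
            ((b & 0xF8) == 0xF0) ? 3 : 255;
        unsigned i, ok = (extra != 255);
        for (i = 1; ok && i <= extra; i++)
            ok = (in[i] & 0xC0) == 0x80;
        if (ok) {                        /* already valid UTF-8: copy through */
            for (i = 0; i <= extra; i++)
                out[o++] = *in++;
        } else {                         /* stray byte: escape as U+00xx */
            out[o++] = 0xC0 | (b >> 6);
            out[o++] = 0x80 | (b & 0x3F);
            in++;
        }
    }
    out[o] = 0;
    return o;
}
```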
Rick Murray (539) 13850 posts |
Yes, the great thing about UTF-8 is that the byte sequences are self-documenting, so invalid things (like “this is actually Latin-1”) are easily found. So easily that I’m surprised nobody ever thought to give FontManager a fallback option when in UTF-8 mode, like “set this flag and if the character isn’t valid UTF-8, it’ll be treated as Latin-1”… |
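Such a fallback might amount to no more than this sketch (invented code, not anything FontManager actually offers): try to decode a UTF-8 sequence, and on any invalid byte consume a single byte as Latin-1.

```c
/* Decode one character at *s, advancing the pointer. Valid UTF-8 is
 * decoded normally; anything else falls back to one byte of Latin-1. */
static unsigned decode_with_fallback(const unsigned char **s)
{
    const unsigned char *p = *s;
    unsigned char b = p[0];
    unsigned cp, extra, i;

    if (b < 0x80)                { *s = p + 1; return b; }   /* plain ASCII */
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
    else                         { *s = p + 1; return b; }   /* Latin-1 fallback */

    for (i = 1; i <= extra; i++) {
        if ((p[i] & 0xC0) != 0x80) { *s = p + 1; return b; } /* Latin-1 fallback */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *s = p + 1 + extra;
    return cp;
}
```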
Colin (478) 2433 posts |
If you:
local→remote: validate that a node in the path is UTF-8; if it isn’t, convert it to UTF-8 – or in my case UTF-16.
remote→local: read as UTF-8.
Then the node name has changed. |
Stuart Swales (8827) 1357 posts |
@Colin: UTF-16 (using surrogates for non-BMP characters) supports all the same code points as UTF-8, so the transformation local ↔ remote would be lossless, were a local filing system to support UTF-8. |
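The surrogate step Stuart refers to is small; a sketch of the code point ↔ UTF-16 conversion, assuming well-formed input:

```c
#include <stdint.h>

/* Code points above U+FFFF use a high/low surrogate pair, so nothing is
 * lost going UTF-8 -> code point -> UTF-16 and back. */
static int cp_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {                         /* BMP: one 16-bit unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* non-BMP: surrogate pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}

static uint32_t utf16_to_cp(const uint16_t *u, int *used)
{
    if (u[0] >= 0xD800 && u[0] <= 0xDBFF) {     /* high surrogate leads */
        *used = 2;
        return 0x10000 + ((uint32_t)(u[0] - 0xD800) << 10)
                       + (uint32_t)(u[1] - 0xDC00);
    }
    *used = 1;
    return u[0];
}
```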
Colin (478) 2433 posts |
Yes, I understand that utf8→utf16 is lossless. I don’t know if I’m coming across as arguing against UTF-8 – I’m not. I just don’t see how I can use it. |
Stuart Swales (8827) 1357 posts |
I suppose one way to move forward would be to see if the remote name contains only characters from the current local encoding and, if so, present it in that*, whether that is Latin1/2/etc or even UTF-8. If it contains characters which are unrepresentable in the current local encoding, you may as well just present it as UTF-8 rather than coming up with some other escaping mechanism. Edit: * where such a local name wouldn’t contain a sequence of top-bit-set characters that would be valid UTF-8 (for a non-UTF-8 local encoding). |
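Pulling the earlier sketches together, the rule might look like this; utf8_name_to_latin1 is the hypothetical helper sketched above (standing in for “convert to the current local encoding”), and utf8_valid is an assumed plain validity check:

```c
#include <stddef.h>

/* Hypothetical helpers: utf8_name_to_latin1 is sketched earlier in the
 * thread; utf8_valid reports whether a byte string parses as UTF-8. */
int utf8_name_to_latin1(const unsigned char *name, char *out, size_t outsz);
int utf8_valid(const char *bytes);

/* Stuart's rule for a Latin-1 territory: present the remote name locally
 * only if every character converts AND the converted bytes could not be
 * mistaken for UTF-8 (his edit). A pure-ASCII result is valid UTF-8 too,
 * but then both forms are byte-identical, so nothing is lost. */
static const char *present_name(const unsigned char *remote,
                                char *buf, size_t bufsz)
{
    if (utf8_name_to_latin1(remote, buf, bufsz) && !utf8_valid(buf))
        return buf;
    return (const char *)remote;
}
```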
Rick Murray (539) 13850 posts |
Or just transition to UTF-8, and then the problem only exists for legacy things that can be handled specially, much as was the case when the assumption of a ten-character filename [1] was removed.

The benefit of UTF-8 is that it works with all existing string handling. Other Unicode encodings require more storage space and pad out the unused bytes with nulls. UTF-8, however, just looks like a regular C-style string, and the high-bit-set behaviour is extremely predictable, so adding support for it [2] shouldn’t present too much of a problem.

[1] That was only ever FileCore. DOSFS 8.3 was twelve characters, and so on.
[2] Handling cursor movement over UTF-8 in Ovation was really simple because of this predictability.
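A sketch of the kind of caret movement footnote [2] alludes to, relying only on continuation bytes matching 10xxxxxx (in the spirit of the Ovation code, not taken from it):

```c
#include <stddef.h>

/* "Move one character" is just "step one byte, then skip continuations". */
static size_t caret_right(const unsigned char *s, size_t pos)
{
    if (s[pos] == 0) return pos;              /* already at end */
    pos++;
    while ((s[pos] & 0xC0) == 0x80) pos++;    /* skip continuation bytes */
    return pos;
}

static size_t caret_left(const unsigned char *s, size_t pos)
{
    if (pos == 0) return 0;                   /* already at start */
    pos--;
    while (pos > 0 && (s[pos] & 0xC0) == 0x80) pos--;
    return pos;
}
```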