Filename Translation
Andrew McCarthy (3688) 605 posts |
Naive question: does the displayed filename have to be identical to the filename given to and/or returned from the drive? Bump +1 For a single specification for the conversion of filenames and a single translation library. @Theo Markettos |
Colin (478) 2433 posts |
Using utf8 isn’t reversible. If you save a file with an unprintable char to riscos and then upload it back to the remote storage, you get a different file name: on upload the bytes are treated as top bit set chars in the local territory, and each individual char is converted to a utf16 char. So I think it’s better to use an escape character as I outlined. Also utf8 can include unprintable chars in the 0x80-0x9f range. |
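A minimal Python sketch of the round trip described above. It assumes the remote name is written locally as raw UTF-8 bytes and that, on upload, each byte is read as one character of an 8-bit alphabet. ISO 8859-1 stands in for the local territory's alphabet, and `round_trip` is a hypothetical helper, not part of any RISC OS API.

```python
def round_trip(remote_name: str) -> str:
    """Store a name as UTF-8 bytes, then read it back one byte per character."""
    local_bytes = remote_name.encode("utf-8")
    # ISO 8859-1 is an assumption standing in for the local territory's alphabet.
    return local_bytes.decode("latin-1")


# U+0085 is a C1 control character (the 0x80-0x9f range), so it has no glyph.
original = "report\u0085final"
uploaded = round_trip(original)

print(len(original), len(uploaded))
print(uploaded == original)
```

Under these assumptions the uploaded name comes back one character longer than the original, because the two-byte UTF-8 sequence for U+0085 is read as two separate top-bit-set characters. Plain ASCII names survive unchanged.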
Dave Higton (1515) 3526 posts |
A relocatable module to provide filename translation is an interesting idea. Give it a filename, tell it what format it’s in, tell it what format to translate to. That works for leafnames, but doesn’t deal with directory separators. Dunno what you do there, unless it’s as simple as doing a one-for-one substitution, requiring two more arguments. UTF8 appears to be able to represent all characters, so it looks like the Internet has largely embraced it. So one of the input/output formats has to be UTF8. SMB2/3 specify UTF16, so that has to be another. And of course we want ADFS with various national character sets. Which makes me realise that there are two aspects: one is translating, and the other is what characters aren’t allowed because they have significance to the OS. Some escaping mechanism is necessary. So, are there already too many difficulties for the idea to ever work, or are there enough creative solutions? |
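A rough Python sketch of the interface described above: a name, the format it is in, the format to translate to, and two more arguments for the directory separators. The function name `translate_path` and its argument order are assumptions made for illustration, and no escaping is attempted.

```python
def translate_path(raw: bytes, src_encoding: str, dst_encoding: str,
                   src_sep: str, dst_sep: str) -> bytes:
    """Re-encode a path and exchange the two directory separators one-for-one."""
    text = raw.decode(src_encoding)
    # Swap in both directions so a dst_sep inside a leafname is not
    # mistaken for a directory separator after translation.
    swap = str.maketrans({src_sep: dst_sep, dst_sep: src_sep})
    # encode() raises UnicodeEncodeError for characters the target cannot
    # represent; that is where an escaping scheme would have to step in.
    return text.translate(swap).encode(dst_encoding)


utf8_path = "docs/café.txt".encode("utf-8")
print(translate_path(utf8_path, "utf-8", "latin-1", "/", "."))
```

Here the UTF-8 path `docs/café.txt` becomes the Latin-1 bytes for `docs.café/txt`. Passing `"utf-16-le"` as the target encoding would cover the SMB2/3 case in the same way.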
David J. Ruck (33) 1635 posts |
The solution I have for files which are used cross platform is to stick to ASCII, and even then avoid almost everything that isn’t alphanumeric, underscore or dash. Otherwise you fall foul of differing sets of disallowed characters, and asymmetric mappings. |
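As a small illustration, a check for that conservative subset might look like the Python below. The exact character set is an assumed strict reading of the rule (letters, digits, underscore, dash, nothing else), and `is_portable` is a hypothetical helper name.

```python
import re

# Letters, digits, underscore and dash only: an assumed strict reading of the rule.
PORTABLE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_portable(leafname: str) -> bool:
    """Return True if the leafname uses only the conservative ASCII subset."""
    return PORTABLE_NAME.fullmatch(leafname) is not None


for name in ("my_file-01", "maison médicale", "notes.txt"):
    print(name, is_portable(name))
```

Note that under this strict reading even a dot fails, since it is a directory separator on RISC OS and an extension separator elsewhere.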
Dave Higton (1515) 3526 posts |
Unfortunately, “stick to ASCII” is unhelpful to the many people who use a language other than English. |
Chris Hall (132) 3554 posts |
other than English. or French (no one in their right mind is going to include accents in filenames!) or German (ditto umlauts) etc. |
Stuart Swales (8827) 1357 posts |
Or spaces… However, I personally do as druck! |
Colin (478) 2433 posts |
Couldn’t agree more but it doesn’t solve the problem of accessing existing files which use non ascii chars. Non ASCII chars are a problem on ADFS discs too as the filenames are territory encoded so the names may change with territory. |
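To illustrate the territory point, here is a Python sketch in which the same stored bytes are shown under two different 8-bit alphabets. Python's ISO 8859-1 and ISO 8859-2 codecs are used as stand-ins for the RISC OS Latin1 and Latin2 alphabets, which is an assumption; the example name is made up.

```python
# The bytes as they might sit in a directory entry; 0xE8 is the top-bit-set byte.
stored = b"cr\xe8me"

# Decode the same bytes under two alphabets (stand-ins for two territories).
names = {alphabet: stored.decode(alphabet) for alphabet in ("latin-1", "iso8859_2")}

for alphabet, shown in names.items():
    print(alphabet, shown)
```

The bytes on disc never change, but byte 0xE8 is è in one alphabet and č in the other, so the displayed name differs.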
Chris Hall (132) 3554 posts |
I, too, stick to ASCII. |
Rick Murray (539) 13840 posts |
It’s probably easier to use accents when:
I’ve seen people save files with accents in the names. Much easier these days on Android where the swipe-type inserts the correct word (“maison médicale” for example) than some bizarre non-accented thing. |
Rick Murray (539) 13840 posts |
I should add, most of the rest of the world has dealt with the filename problem by restricting certain characters that are special to the filing systems (like ‘/’ or ‘\’) and simply adding a method of referencing all the potential characters that one might want. As such, under Windows, iOS, and Android the following works:
I’m guessing there’s a bunch of you that can’t even see that on RISC OS. ;-) Making filename translations (except for the necessary stuff like ‘.’ and ‘$’) is just a sticky plaster. The filesystem (and by extension the OS) needs to embrace Unicode. Probably UTF-8 because it’s the one least likely to break legacy code, but whatever, as long as we’re stuck trying to cope with an eight bit character set and using outdated advice from the eighties like “stick to plain ASCII”, we’ll be at a disadvantage. Plain ASCII? Quite a lot of countries don’t even use what ASCII depicts (as demonstrated in my example filename) and only have it at all because of the legacy of the very early computers (like MS-DOS and such) and the fading “everything is in English” mentality. |
Chris Hall (132) 3554 posts |
And they can actually change the meaning of things Oh no! I hadn’t come across that. In French, German and Latin it is just to do with pronunciation (beta and ss in German can be ss). If you are searching for role you want to pick up both role and rôle (same for naïve). |
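A search that picks up both role and rôle can be sketched in Python by comparing accent-stripped keys instead of the names themselves. This assumes the names are already Unicode strings; `search_key` is a hypothetical helper, and the stored filenames are left untouched.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a comparison key: case-folded, with combining accents removed."""
    folded = text.casefold()  # also maps the German sharp s to "ss"
    decomposed = unicodedata.normalize("NFD", folded)
    # Drop the combining marks that NFD split off from their base letters.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


for word in ("rôle", "naïve", "Straße"):
    print(word, search_key(word))
```

A key like this is only suitable for matching. Applied to the names themselves it would merge words that differ only in their accents.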
David J. Ruck (33) 1635 posts |
You could just say that for the whole of RISC OS. But my point, which I made earlier, is that there has to be consistent handling of file naming conventions across native local, non-native local, and remote filing systems, or you have no choice but to stick to the subset of ASCII which is supported and two-way translatable on everything. |
Stuart Swales (8827) 1357 posts |
But it would help to have some consistency in how badly any RISC OS mapping of filenames stored on foreign systems has to work. |
Dave Higton (1515) 3526 posts |
The idea of “stick to ASCII” is not an idea worthy of us. It’s like the worst old days of MS-DOS – the whole world speaks English. Well, they don’t. They speak different languages; they use different characters. They are real people. There have been murmurings about how RISC OS should support UTF8. I honestly believe it will before too long. (My belief is entirely without insight or foundation, I grant you.) That will make things easier in terms of filename translation. But until then, can we think of a filename translation scheme that will work?
I don’t see why it shouldn’t be, if the app knows that it’s translating between UTF8 and (e.g.) Latin1. Can apps always know? Even if it works mostly, it would be better than not working at all. (Perfect is the enemy of good.) |
Colin (478) 2433 posts |
Because top bit characters are valid eg latin1 characters so the filing system can only treat them as such. Now if you want to create a unicode file system it would make things a lot easier but next to impossible to usefully do. |
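The "can apps always know?" question can be illustrated with a common guessing heuristic, sketched in Python below: try strict UTF-8 first and fall back to Latin-1. This is an assumed approach rather than anything RISC OS does, and `guess_decode` is a hypothetical name.

```python
from typing import Tuple


def guess_decode(raw: bytes) -> Tuple[str, str]:
    """Guess a filename's encoding: strict UTF-8 first, Latin-1 as the fallback."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte sequence is valid Latin-1, so this branch cannot fail.
        return raw.decode("latin-1"), "latin-1"


print(guess_decode(b"caf\xe9"))      # a lone 0xE9 is not valid UTF-8
print(guess_decode(b"caf\xc3\xa9"))  # valid UTF-8, but also valid Latin-1
```

The second call shows the objection: those bytes are also a perfectly legal Latin-1 name, so the guess works mostly but cannot be certain.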
Rick Murray (539) 13840 posts |
And they can actually change the meaning of things I trust that is sarcasm. In case it isn’t, let me point out that “pêcheur” and “pécheur” are entirely different things. It’s only acceptable to omit accents from capital letters, and this probably came about because original typewriters didn’t have a sensible way of putting accents on to capitals. Accents should never be omitted from lower case, as it not only changes pronunciation, it changes the meaning. Which is why it’s rather annoying that Android’s support for accents with a UK layout Bluetooth keyboard is so miserably poor. Hell, even RISC OS found an adequate way of addressing this problem back in ’92.
We probably need to come up with a new disc format that is exactly like the current format, except that UTF-8 is assumed. Because it would be nice to say that “the filing system shouldn’t be treating the characters at all, just matching the filename metadata with entries in the catalogue” which is technically accurate until you realise that shifting from one naming convention to the other is going to be a mess. You cannot really have both working on the same disc, as what would the filing system look for if you asked for opening the file “résumé”? It’s either going to look like “résumé” or like “résumé”. Either is good. Both at the same time is bad. So the media really ought to be one or the other only and something (FileSwitch?) provide character set translation as necessary (at least, until UTF-8 works everywhere on the OS).
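A short Python sketch of why both conventions on one disc is bad: the same name produces two different byte strings, so a catalogue keyed on one will not match a lookup made with the other. The dictionary is a toy stand-in for a directory catalogue, and Latin-1 is assumed as the 8-bit alphabet.

```python
name = "résumé"

latin1_key = name.encode("latin-1")  # one byte per accented letter
utf8_key = name.encode("utf-8")      # two bytes per accented letter

# Toy catalogue: the file was saved under the 8-bit naming convention.
catalogue = {latin1_key: "saved with the Latin1 name"}

print(latin1_key, utf8_key)
print(utf8_key in catalogue)
# How the UTF-8 name appears to software that assumes Latin-1:
print(utf8_key.decode("latin-1"))
```

The filing system is only matching bytes, as the post says, but the bytes it is asked for depend on which convention the caller used.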
Impossible? Everybody else has done it. Quite a lot of them went through various iterations of guessing how to translate from this to that (especially fun things like ShiftJIS), but in the end they settled down to some form of Unicode, and various filesystem reserved characters. Microsoft layered Unicode on top of 8.3 FAT naming when they introduced VFAT and LFN back with Windows 95. The translation between the two naming conventions isn’t perfect, but Micros~1 has gone to lengths to make it unnecessary to even need to see and/or use the older 8.3 style filename. |
Chris Hall (132) 3554 posts |
I trust that is sarcasm. Actually it wasn’t. I hadn’t come across this (apart from “a” and “à” and past tense acute e, which is the same word just pronounced differently) despite doing ‘O’ level French and getting an ‘A’. Omitting accents from lower case used to be acceptable, at least in England. |
Dave Higton (1515) 3526 posts |
That’s the Officer Crabtree school of French. The key thing that makes it “acceptable” is to not understand the language. |
Steve Pampling (1551) 8170 posts |
As you rightly point out, English does such things with additional letters. Interesting development paths really – the tilde n development in Spanish etc is a medieval development with monks/scholars transcribing and using a shorthand. An accent to allow the removal of letters.
That doesn’t account for the use of English variants of spelling and pronunciation of French derived words in France. Franglais is part of a cultural mixing. It won’t stop, no matter how much the grammar authoritarians on both sides of the channel might wish it to cease/wither/die. |
Rick Murray (539) 13840 posts |
It’s too late. My head just cannot cope with this at such a time of night.
Perhaps, however in France where people actually speak French…
A good one is the ^ accent indicating the removal of an “s”. Hôtel – hostel And so on.
No, because some things are just better. English economists talked about “tranches” while everybody here waits for “le weekend”. And Frenchies, always looking to optimise the language, wish you a nice weekend by saying “bon week”. ;-) And, of course, everybody knows that the problem with the French is that they don’t have a word for entrepreneur. |
Chris Hall (132) 3554 posts |
Omitting accents from lower case used to be acceptable, at least in England. But that was before the second world war. The English were more arrogant then with other people’s languages (not as arrogant as the French, obviously). Now we pander to the foreigners too much and start to rename places that have been well known for years – Napoli for Naples and Mumboy for Bombay. Places only used to get renamed after revolutions or wars (e.g. Stalingrad/Volgograd). It was silly enough calling it Porthmadog when it was named after Colonel Maddocks as Port Maddock. The silliest one (leaving out Tenby) was ber-woo-kle for Buckley! The Irish got it right – on their road signs the different languages are shown in different fonts so that they are not so distracting to drivers. Now it is trendy to rename something to make a political gesture, like Colston Hall in Bristol. Oh well, swings and roundabouts… |
Rick Murray (539) 13840 posts |
How painfully, embarrassingly, Little Englander. Mumbai was renamed by Indians, for Indians. I’m not entirely certain why, I think Bombay was of Portuguese origin and Mumbai is a goddess…or something along those lines. It’s their city, they can call it whatever they like. As for Napoli, that’s not a renaming, that’s just some English speaking people realising that they’ve actually been saying it wrong all these years.
Originally known as Tsaritsyn, don’t forget. The name change to Volgograd was to attempt to de-Stalinise things. Given what happened there, some see it as an attempt to whitewash.
Sometimes it’s for daft political reasons. A nearby town used to be called Pouancé. It’s been known by variations of that name since the year had three digits. Personally I find it depressing that these places need to join and rename in order for their regional government to recognise that they exist, as, well, history talks about a place called Pouancé, which has the ruins of a big castle right in the middle of town. History doesn’t talk about Ombrée d’Anjou because that didn’t exist until five years ago… |
Colin Ferris (399) 1814 posts |
I wonder which team Rick would have supported – Eleanor lot in the south or the lot in North based around Paris :-) (What we now call France) |
Steve Pampling (1551) 8170 posts |
I didn’t know about that aspect of French, unfortunately that just strengthens the argument that accents aren’t needed – just put the “s” back in. I’m now suspecting that the other accents are all a means of using a letter or two less. Like “aren’t” and “I’m” just now, except the French “s” is a single pen stroke replaced with another, while the apostrophe allows the removal of a letter with one replaced by the apostrophe (net -1) the usual drift is now removing the apostrophe.
It would be interesting to strip the borrowed words from English (and French) and see what is left.
Worse still the English insist on pronouncing it like it’s French, while the Americans have that wonderful “noo-er” ending :)
Complicated. I believe there was nothing much other than marshland when the Europeans swanned in. Go further back on a rename sequence – Istanbul. |