Encodings
Chris Hall (132) 3554 posts |
I must admit that I am bemused. Valiant efforts to make filenames work with top-bit-set characters seem a little pointless while we have a text editor, !Edit, that cannot handle simple things like eighths fractions. RISC OS is not alone in just giving up on many characters – Windows just renders ‘⅝’ as ‘?’ if, for example, you save a ‘csv’ file from Excel (which can handle lots of odd characters). There is also no way of telling the encoding used in a text file, beyond assuming Latin1 on RISC OS and CP-1252 on Windows. Thank goodness that the ‘£’ key on the keyboard works correctly and produces the right character code on whatever system you are using. I remember producing an ALT-nnn list of things like sexed quotes, ellipsis, etc. so that I could type them directly on the keyboard, and it was (and still is) frustrating that different software used different codes for the same thing. Let’s get a text editor that allows you to specify the encoding on input and output before worrying about characters in filenames! |
Steve Pampling (1551) 8170 posts |
It’s a little more than just filenames, wouldn’t you say? The thing to remember on these subjects is that you need the foundations in before building the rest. The proffered “best” solution is UTF-8 support in the OS. If the OS in general can’t handle UTF-8 there seems little point in piddling around with the applications. Now if the OS has a decent character handling system that falls back to the current behaviour(ish) then people can look at updating the applications that use it. nemo said:
So, people might more usefully ask what they have to do to use that UTF-8 support in various applications. |
Mike Freestone (2564) 131 posts |
Item number 2 on the list here, but a list doesn’t guarantee action if nobody looks at it. |
Steffen Huber (91) 1953 posts |
UTF-8 or not, a lot of applications need to be better aware of encodings. E.g. CDBurn takes a very simplistic approach to Joliet (which is UTF-16). If you try to use LanManFS or Sunfish or FTPc with different servers and clients in a mixed OS environment, there are many surprises. Even the last version of Messenger Pro that I saw had many weaknesses wrt e-mail and Usenet encodings. “Just stay with ASCII” is not a solution. |
nemo (145) 2546 posts |
Chris said
You are literally arguing that we shouldn’t implement the thing that would solve the limitation that you’re using to justify not implementing the thing. Edit will consequently work much like this editor: |
Rick Murray (539) 13840 posts |
Indeed. A lot of applications provide internationalised messages (French, German, etc). The problem with this is that it is somewhat hardwired to have an expectation of Latin1 (or something similar) as the default alphabet.
Yeah… Justification for doing nothing. :-/
Any projections on when this will see a wider release? |
nemo (145) 2546 posts |
Chris also claimed
It isn’t. You’re wrong. You’ve misunderstood. Even if it were discouraged, and it isn’t, “deprecated” would imply that it used to be OK and now isn’t. The reverse is true.
“What a man knows may fill a book. What he does not know fills the library.” Limitless though my ignorance no doubt is, I am probably least ignorant about how RISC OS works, sadly. Steffen pointed out
Oh yes indeed. Chris Mahoney said
Indeed, and any other modern platforms with access to those files would also be happy. It is only RISC OS (as distinct from WindowManager and FontManager) that would be blissfully unaware. That, I am fixing. And now I return to Chris:
Which has nothing to do with Windows, or Excel. The problem is that CSV has no way of declaring its encoding unambiguously… unless it is UTF-8, in which case it can use a Byte Order Mark. Arguably, starting a CSV with a BOM is not fully compliant, but it is usually respected by UTF-8-aware applications. Excel doesn’t make use of this… but that’s because (and this may shock you) Microsoft has not traditionally put much effort into allowing people to get data out of their products and into someone else’s. The pertinent point is that Excel’s preferred formats, including all modern replacements for the utterly outdated CSV, use UTF-8 by default.
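For the avoidance of doubt, the BOM is just the three bytes EF BB BF at the start of the file, so both emitting and detecting it are trivial – a minimal sketch in C, illustrative only and not taken from any particular application:

    #include <stdio.h>

    /* Returns 1 if the file starts with the UTF-8 byte order mark
       (EF BB BF), 0 otherwise. */
    static int has_utf8_bom(const char *path)
    {
        unsigned char head[3];
        size_t n;
        FILE *f = fopen(path, "rb");
        if (f == NULL) return 0;
        n = fread(head, 1, 3, f);
        fclose(f);
        return n == 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
    }

A writer that wants its CSV to survive contact with UTF-8-aware applications need only emit those same three bytes before the first record.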
…unless it begins with a BOM, in which case it is definitely¹ UTF-8. The more disturbing point is your assertion that “Windows” means “CP1252”… an assumption as US-centric as Microsoft themselves are frequently guilty of being. However, the facts are much more complex than that – here is Microsoft’s list, which prominently features this advice:
If you won’t take my ignorant word for it, will you take Microsoft’s? I suspect not.
Oh really? ¹ Don’t. Just don’t. |
nemo (145) 2546 posts |
Being pedantic, I would expect the translation to use the default alphabet for that language – cue much wrangling about region versus language versus alphabet versus user preference. And hence, Unicode.
Assuming you mean !UniEdit in the screenshots, probably never. Once UTF-8 is supported at the OS level, !Edit can trivially gain exactly this functionality. Theoretically it should already have it if using an outline font when the alphabet is UTF-8… but I can understand if KB didn’t bother (for it probably would have been he). As for the UTF-8 support, it is so very nearly shippable that I feel guilty typing this. So translating from programmer time to BST… months probably. I am being told to “just get on with it” by everyone already, don’t you start. ;-) |
Paul Sprangers (346) 524 posts |
Just get on with it, please. |
Steve Pampling (1551) 8170 posts |
So, modelling the prediction in Chief Engineer Scott¹ terms, that’s a few weeks then :) ¹ Trekkie mode. |
Colin (478) 2433 posts |
The biggest drawback I find when considering using encodings is that Font_Paint doesn’t display an ‘unknown’ glyph for code points a font doesn’t have a glyph for. I feel I should be able to paint a random block of memory in an encoding of my choice and have all code points display something, otherwise editing becomes a problem – is that an unreasonable expectation? Last time I looked, Font_Paint/Font_ScanString just ignored code points without a glyph. As a result, painting unknown glyphs is a non-trivial affair requiring caching of font data in every program.
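At the moment every caller has to do that substitution itself before calling Font_Paint. Something along these lines would do it – an untested sketch in which has_glyph() is hypothetical, standing in for whatever per-font coverage cache the program keeps, and which assumes the font does at least contain U+FFFD:

    #include <stddef.h>

    /* Hypothetical callback: does the current font have a glyph for cp? */
    typedef int (*has_glyph_fn)(unsigned int cp);

    /* Decode one UTF-8 sequence at s (NUL-terminated), storing the code
       point in *cp.  Returns the length in bytes; malformed input decodes
       as U+FFFD with length 1. */
    static size_t decode_utf8(const unsigned char *s, unsigned int *cp)
    {
        if (s[0] < 0x80) { *cp = s[0]; return 1; }
        if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
            *cp = (s[0] & 0x1Fu) << 6 | (s[1] & 0x3Fu);
            return 2;
        }
        if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
            (s[2] & 0xC0) == 0x80) {
            *cp = (s[0] & 0x0Fu) << 12 | (s[1] & 0x3Fu) << 6 | (s[2] & 0x3Fu);
            return 3;
        }
        if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
            (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
            *cp = (s[0] & 0x07u) << 18 | (s[1] & 0x3Fu) << 12 |
                  (s[2] & 0x3Fu) << 6 | (s[3] & 0x3Fu);
            return 4;
        }
        *cp = 0xFFFD;
        return 1;
    }

    /* Copy UTF-8 from in to out, replacing any code point the font lacks
       with U+FFFD (EF BF BD) so that Font_Paint shows *something*.
       out must hold at least 3 * strlen(in) + 1 bytes. */
    static void substitute_missing(const unsigned char *in,
                                   unsigned char *out,
                                   has_glyph_fn has_glyph)
    {
        size_t o = 0;
        while (*in) {
            unsigned int cp;
            size_t k, n = decode_utf8(in, &cp);
            if (has_glyph(cp)) {
                for (k = 0; k < n; k++) out[o++] = in[k];
            } else {
                out[o++] = 0xEF; out[o++] = 0xBF; out[o++] = 0xBD;
            }
            in += n;
        }
        out[o] = 0;
    }

|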
nemo (145) 2546 posts |
The FontManager is like that really incompetent person at work, who somehow manages to not get fired but still uses up the seat and the salary to no discernible effect. The offence is not only their uselessness, but the fact that their mere presence is preventing a better candidate being sought. |
Steve Pampling (1551) 8170 posts |
and somehow manages to give people the impression that it is actually other people who are at fault. |
Rick Murray (539) 13840 posts |
;-) Arguably, though, that keyboard is still compliant with what was said, due to the lack of a ‘£’ key.
<BOOM!> Ludicrous Gibs that were formerly the insides of heads.
May we then be more useful and offer to beta test when you’re ready?
The biggest drawback I find when using Font_Paint is that FontManager doesn’t recognise when the current font doesn’t have a glyph and automatically substitute a font which does, so writing about 鬼束 ちひろ or カラフィナ means handling all this yourself instead of just passing the text to FontManager and saying “render this”. |
nemo (145) 2546 posts |
Rick asked
More than 90% of Japanese users have a US keyboard and type in Romaji, in effect. 皮肉ですね? (Ironic, isn’t it?)
I can think of no one better qualified. There’s literally no one who types more than you.
The ability to ‘stack’ physical fonts into a composite virtual font is indeed one of the valuable and versatile features wot we have not got.
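The idea is easy enough to sketch: keep an ordered list of real font handles and paint each run of text with the first font that covers it. In this illustrative C fragment both font_covers() and paint_run() are hypothetical – the former stands in for exactly the coverage information FontManager won’t currently give you, the latter for a wrapper around Font_Paint that returns the advanced x position:

    #include <stddef.h>

    #define STACK_DEPTH 4

    /* A composite "virtual font": an ordered list of real font handles. */
    typedef struct { int handle[STACK_DEPTH]; } font_stack;

    /* Both hypothetical: a per-font coverage test, and a Font_Paint
       wrapper that paints n code points and returns the new x position. */
    extern int font_covers(int handle, unsigned int cp);
    extern int paint_run(int handle, const unsigned int *cps, size_t n,
                         int x, int y);

    /* Choose the first font in the stack that has a glyph for cp. */
    static int choose_font(const font_stack *fs, unsigned int cp)
    {
        int j;
        for (j = 0; j < STACK_DEPTH; j++)
            if (font_covers(fs->handle[j], cp)) return fs->handle[j];
        return fs->handle[0];   /* give up: first font shows the absence */
    }

    /* Paint an array of code points, switching fonts only where coverage
       changes, so each contiguous run costs one Font_Paint call. */
    static void paint_with_stack(const font_stack *fs,
                                 const unsigned int *cps, size_t n,
                                 int x, int y)
    {
        size_t i, start = 0;
        int cur = n ? choose_font(fs, cps[0]) : 0;
        for (i = 1; i <= n; i++) {
            int want = (i < n) ? choose_font(fs, cps[i]) : cur;
            if (i == n || want != cur) {    /* flush the finished run */
                x = paint_run(cur, cps + start, i - start, x, y);
                start = i;
                cur = want;
            }
        }
    }

|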
Clive Semmens (2335) 3276 posts |
YES!
I suspect the equivalent is true of Indian users. I don’t know how Japanese typewriters used to work, so I don’t know whether Romaji on a US keyboard is a satisfactory alternative. But I do know how Hindi typewriters used to work (and Panjabi, Bengali, Gujarati, Odia etc. ones worked in a similar fashion), and the US keyboard option is a very poor alternative. Ridiculously, I don’t know how French, Spanish or German computer keyboards work, although I do know how traditional typewriters work in those languages. If the computers aren’t the same, I suspect they’re a backwards step too. |
Chris Mahoney (1684) 2165 posts |
“Hunt and peck” with 2450 characters. I can’t say whether something “better” came along after that! As for computers, while there is a kana keyboard (as pictured previously in the thread), as pointed out a lot of people use a US-based layout. It’s technically slower (since you need to press, for example, K-A to get か) but does get the job done. Meanwhile, on phones, apparently the so-called “ten-key” kana input method is the most common. |
Clive Semmens (2335) 3276 posts |
I suspected that 8~) – almost anything must be better! The old Hindi typewriters, on the other hand, lent themselves beautifully to touch typing and used ~40% fewer keystrokes than US layout keyboards doing transliterated Hindi that’s converted to proper Hindi in software on the fly. I was reasonably good at it at one time – either on a typewriter or on an Acorn running my own rather hacky software. Can anyone tell me how you type accented characters on a French computer? Or is it the same as on an old French typewriter (just two keystrokes for an accented character)? |
Rick Murray (539) 13840 posts |
Clive: It depends upon the accented character. https://en.m.wikipedia.org/wiki/AZERTY Common ones are directly accessible – remember, contrary to practically every other layout on the planet, you need to shift to get numbers on the French keyboard, so é and the like are a simple keypress. Frankly I prefer the RISC OS and Windows UK International methods. While every character is preceded by an Alt modifier, it is regular and one doesn’t need to hunt down where keys are. |
Clive Semmens (2335) 3276 posts |
Not sure what the RISC OS and Windows UK International methods are, then. That description is exactly what my old French typewriter does, and it’s really quick for touch-typing. I made a variation on it for my own purposes on RISC OS, because I wanted a much wider range of languages to be reasonably usable (academic papers with authors’ names and addresses from all over the world – the papers themselves in English), but it was so idiosyncratic that I certainly wouldn’t release it into the wild! |
Chris Mahoney (1684) 2165 posts |
That’s what we have in NZ; the keyboards are physically ANSI (i.e. US) layout, but ` is a dead key for typing the so-called macronised vowels (ā, ē, ī, ō, ū) used in Māori. |
Rick Murray (539) 13840 posts |
Hmm… You know what they say about monkeys and typewriters, right? |
nemo (145) 2546 posts |
The MMK keyboard driver for RISC OS and my Windows driver support 16 dead-key accents, all with Alt:

` => ò grave
" => ő hungarumlaut or double acute
6 => ô circumflex
^ => ǒ caron
- => ō macron
_ => ẕ lowline
; => ö dieresis
: => ŏ breve
' => ó acute
@ => å ring
# => õ tilde
~ => ō macron again
, => ç cedilla
< => ą ogonek
. => ṡ dot above
> => ṇ dot below
/ => ħ bar

I didn’t bother with hook or any of the non-European diacritics. The dead-keys on the MMK driver were different, I’ve forgotten. I prefer these now. I also have sexed quotes on the brackets ‘’“” and en– and em—dash on N & M. Plus mathematical symbols ×÷± and other typographical niceties •™fifl. The hardest part is remembering what I’ve put where.
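Mechanically it’s just a two-level lookup: dead key first, base letter second. A toy table-driven sketch in C – the mappings shown are a tiny, purely illustrative subset, not the real driver tables:

    #include <stdio.h>

    /* One accent: the Alt+key that selects it, plus pairs of
       (base letter, composed UTF-8 string). */
    typedef struct { char base; const char *composed; } pair;
    typedef struct { char dead; const pair *map; } accent;

    static const pair grave[]    = { {'a', "à"}, {'e', "è"}, {'o', "ò"}, {0, 0} };
    static const pair acute[]    = { {'a', "á"}, {'e', "é"}, {'o', "ó"}, {0, 0} };
    static const pair dieresis[] = { {'a', "ä"}, {'o', "ö"}, {'u', "ü"}, {0, 0} };
    static const pair cedilla[]  = { {'c', "ç"}, {0, 0} };

    static const accent accents[] = {
        { '`',  grave    },
        { '\'', acute    },
        { ';',  dieresis },
        { ',',  cedilla  },
    };

    /* Alt+dead followed by base: return the composed character as a
       UTF-8 string, or NULL if the pair doesn't compose. */
    static const char *compose(char dead, char base)
    {
        size_t i;
        const pair *p;
        for (i = 0; i < sizeof accents / sizeof accents[0]; i++) {
            if (accents[i].dead != dead) continue;
            for (p = accents[i].map; p->base; p++)
                if (p->base == base) return p->composed;
            return NULL;
        }
        return NULL;
    }

    int main(void)
    {
        printf("Alt-; then o -> %s\n", compose(';', 'o'));   /* ö */
        return 0;
    }

|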
Clive Semmens (2335) 3276 posts |
I like that. It’s the exact same set I supported with my idiosyncratic system. (Calling the double acute “hungarumlaut” is confusing, since Hungarian has not only a double acute, but ALSO an ordinary umlaut.) I didn’t try to support Vietnamese, which uses the Latin alphabet with multiple diacritics. Not sure what I’d have done if we’d had a Vietnamese author, but we never did. I’d probably have made a special font, with just the things necessary for the particular case. I did support Hindi (and I suppose still do, if anyone wants it, but I’ve not used it myself for years), with dead key diacritics, possibly as many as four on a single character.
Keyboard diagrams… see http://clive.semmens.org.uk/RISCOS/index.php?JPhysiolKB for an example… |
Matthew Phillips (473) 721 posts |
Chris Hall wrote:
Not entirely true. It’s actually pretty easy to identify text files encoded in UTF-8, because every non-ASCII character becomes a multi-byte sequence with a very distinctive bit pattern (a lead byte followed by continuation bytes of the form 10xxxxxx) – a pattern that Latin 1 text is most unlikely to produce by accident. You have to use heuristics, but it’s pretty reliable.
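The core of such a check is only a few lines – this is an illustrative sketch of the heuristic, not the code we actually use:

    #include <stddef.h>

    /* Heuristic encoding sniff.  Returns 1 if buf is well-formed UTF-8
       that actually uses multi-byte sequences, 0 if it is plain ASCII,
       and -1 if it contains sequences that are not well-formed UTF-8
       (so it is probably Latin 1 or CP-1252). */
    static int sniff_utf8(const unsigned char *buf, size_t len)
    {
        int multibyte = 0;
        size_t i = 0, k, follow;
        while (i < len) {
            unsigned char b = buf[i];
            if (b < 0x80) { i++; continue; }           /* ASCII */
            else if ((b & 0xE0) == 0xC0) follow = 1;   /* 110xxxxx */
            else if ((b & 0xF0) == 0xE0) follow = 2;   /* 1110xxxx */
            else if ((b & 0xF8) == 0xF0) follow = 3;   /* 11110xxx */
            else return -1;        /* stray continuation byte etc. */
            if (i + follow >= len) return -1;          /* truncated */
            for (k = 1; k <= follow; k++)              /* 10xxxxxx? */
                if ((buf[i + k] & 0xC0) != 0x80) return -1;
            multibyte = 1;
            i += follow + 1;
        }
        return multibyte;
    }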
Rick wrote:

I realise this is not as helpful as having it built into the OS, but there is RUfl, which is quite easy to incorporate into applications written in C, as we have done with RiscOSM and Nominatim. Maybe the ideas used in RUfl could be added to an enhanced font manager some day. We’ve not yet done the bits needed to make the maps in RiscOSM support switching between different fonts to cover the glyphs, but it will support any glyphs found in a single font. The RUfl library does not support changing direction for Arabic and Hebrew, as far as I can remember, and the font substitution could be enhanced by some methods for identifying common styles like fixed-width, sans serif and so on. I have ideas about how to do this but have not got around to trying them out. |