RISC OS Open: Forum: Encodings

Apr 7, 2018 7:16am

Chris Hall (132) 3554 posts

I must admit that I am bemused. Valiant efforts to make filenames work with top-bit-set characters seem a little pointless while we have a text editor, !Edit, that cannot handle simple things like eighths fractions. RISC OS is not alone in just giving up on many characters – Windows just renders ‘5/8’ as ‘?’ if you save a ‘csv’ file for example from Excel (which can handle lots of odd characters). There is also no way of telling the encoding used in a text file apart from assuming Latin1 on RISC OS and CP-1252 on Windows. Thank goodness that the ‘£’ key on the keyboard works correctly and produces the right character code on whatever system you are using.

I remember producing an ALT-nnn list of things like sexed quotes, ellipsis, etc. so that I could type them directly on the keyboard and it was (and still is) frustrating that different software used different codes for the same thing. Let’s get a text editor that allows you to specify the encoding on input and output before worrying about characters in filenames!

Apr 7, 2018 8:18am

Steve Pampling (1551) 8170 posts

I must admit that I am bemused. Valiant efforts to make filenames work with top-bit-set characters seem a little pointless while we have a text editor, !Edit, that cannot handle simple things like eighths fractions.

It’s a little more than just filenames wouldn’t you say? The thing to remember on these subjects is that you need the foundations in before building the rest. The proffered “best” solution is UTF-8 support in the OS. If the OS in general can’t handle UTF-8 there seems little point in piddling around with the applications. Now if the OS has a decent character handling system that falls back to the current behaviour(ish) then people can look at updating the applications that use it.

nemo said:

The UTF-8 support I have been working on features fallback – that is, if it isn’t UTF-8, then it gets treated transparently as though it is Latin1 (or your preferred Encoding).

So, people might more usefully ask what they have to do to use that UTF-8 support in various applications.

Apr 7, 2018 9:34am

Mike Freestone (2564) 131 posts

Let’s get a text editor that allows you to specify the encoding on input and output before worrying about characters in filenames!

Item number 2 on the list here but a list doesn’t guarantee action if nobody looks at the list

Apr 7, 2018 11:12am

Steffen Huber (91) 1953 posts

UTF-8 or not, a lot of applications need to be better aware of encodings. E.g. CDBurn takes a very simplistic approach to Joliet (UTF-16). If you try to use LanManFS or Sunfish or FTPc with different servers and clients in a mixed OS environment, there are many surprises. Even the last version of Messenger Pro that I saw had many weaknesses wrt EMail and Usenet encoding.

It is not a solution to “just stay with ASCII”.

Apr 7, 2018 12:49pm

nemo (145) 2546 posts

Chris said

pointless while we have a text editor, !Edit, that cannot handle simple things like eighths fractions

You are literally arguing that we shouldn’t implement the thing that would solve the limitation that you’re using to justify not implementing the thing.

Edit will consequently work much like this editor:

Apr 7, 2018 12:55pm

Rick Murray (539) 13840 posts

a lot of applications need to be better aware of encodings

Indeed. A lot of applications provide internationalised messages (French, German, etc). The problem with this is that it is somewhat hardwired to have an expectation of Latin1 (or something similar) as the default alphabet.

You are literally arguing that we shouldn’t

Yeah… Justification for doing nothing. :-/

Edit will consequently work much like this editor:

Any projections on when this will see a wider release?

Apr 7, 2018 1:12pm

nemo (145) 2546 posts

Chris also claimed

it may have been a mistake for RISC OS to deprecate top-bit-set characters in filenames but it is a fact

It isn’t. You’re wrong. You’ve misunderstood. Even if it were discouraged, and it isn’t, “deprecated” would imply that it used to be OK and now isn’t. The reverse is true.

Glad to be able to correct your ignorance

“What a man knows may fill a book. What he does not know fills the library.”

Limitless though my ignorance no doubt is, I am probably least ignorant about how RISC OS works, sadly.

Steffen pointed out

VRPC HostFS is broken in so many respects that it is frightening

Oh yes indeed.

Chris Mahoney said

I’ve also tested [UTF-8] with one of my own apps. Filer was happy, Edit was happy, my !Run file was happy, the Toolbox SaveAs object was happy

Indeed, and any other modern platforms with access to those files would also be happy. It is only RISC OS (as distinct from WindowManager and FontManager) that would be blissfully unaware. That, I am fixing.

And now I return to Chris:

Windows just renders ‘5/8’ as ‘?’ if you save a ‘csv’ file for example from Excel (which can handle lots of odd characters).

Which is nothing to do with Windows, or Excel. The problem is that CSV has no way of declaring its Encoding unambiguously… unless it is UTF-8, in which case it can use a Byte Order Mark. Arguably, starting a CSV with BOM is not fully compliant, but is usually respected by UTF-8 aware applications. Excel doesn’t make use of this… but that’s because (and this may shock you), Microsoft has not traditionally put much effort into allowing people to get data out of their products and into someone else’s. The fact that Excel’s preferred formats, including all modern replacements for the utterly outdated CSV, use UTF-8 by default is the pertinent point.

There is also no way of telling the encoding used in a text file apart from assuming Latin1 on RISC OS and CP-1252 on Windows.

…unless it begins with a BOM, in which case it is definitely¹ UTF-8.

The more disturbing point is your assertion that “Windows” means “CP1252”… which is as US-centric as Microsoft themselves are frequently guilty of. However, the facts are much more complex than that – here is Microsoft’s list which prominently features this advice:

ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page.

If you won’t take my ignorant word for it, will you take Microsoft’s? I suspect not.

Thank goodness that the ‘£’ key on the keyboard works correctly and produces the right character code on whatever system you are using.

Oh really?

¹ Don’t. Just don’t.

Apr 7, 2018 1:19pm

nemo (145) 2546 posts

A lot of applications provide internationalised messages (French, German, etc). The problem with this is that it is somewhat hardwired to have an expectation of Latin1 (or something similar) as the default alphabet.

Being pedantic, I would expect the translation to use the default alphabet for that language – cue much wrangling about region versus language versus alphabet versus user preference. And hence, Unicode.

Any projections on when this will see a wider release?

Assuming you mean !UniEdit in the screenshots, probably never. Once UTF-8 is supported at the OS level, !Edit can trivially gain exactly this functionality. Theoretically it should already have it if using an outline font when the alphabet is UTF-8… but I can understand if KB didn’t bother (for it probably would have been he).

As for the UTF-8 support, it is so very nearly shippable that I feel guilty typing this. So translating from programmer time to BST… months probably. I am being told to “just get on with it” by everyone already, don’t you start. ;-)

Apr 7, 2018 1:28pm

Paul Sprangers (346) 524 posts

Just get on with it, please.

Apr 7, 2018 2:05pm

Steve Pampling (1551) 8170 posts

It might be worth stress testing this new version (although note that as of this moment the autobuilt ROM is not yet available).

So, modelling the prediction in Chief Enginneer Scott¹ terms thats a few weeks then :)

¹ Trekkie mode.

Apr 7, 2018 2:40pm

Colin (478) 2433 posts

The biggest drawback I find when considering using encodings is that Font_Paint doesn’t display an ‘unknown’ glyph for code points a font doesn’t have a glyph for. I feel I should be able to paint a random block of memory in an encoding of my choice and have all code points display something otherwise editing becomes a problem – is that an unreasonable expectation?. Last time I looked Font_Paint/Font_ScanString just ignored codepoints without a glyph. As a result painting unknown glyphs is a non trivial affair requiring caching of font data in every program.

Apr 8, 2018 12:17pm

nemo (145) 2546 posts

The FontManager is like that really incompetent person at work, who somehow manages to not get fired but still uses up the seat and the salary to no discernible effect. The offence is not only their uselessness, but the fact that their mere presence is preventing a better candidate being sought.

Apr 8, 2018 12:26pm

Steve Pampling (1551) 8170 posts

The FontManager is like that really incompetent person at work, who somehow manages to not get fired but still uses up the seat and the salary to no discernible effect. The offence is not only their uselessness, but the fact that their mere presence is preventing a better candidate being sought.

and somehow manages to give people the impression that is actually other people that are at fault.
On a personal note the initials are DK (as top candidate because “there can be more than one”)

Apr 8, 2018 7:33pm

Rick Murray (539) 13840 posts

Oh really?

;-)

Arguably, though, that keyboard is still compliant with that was said due to the lack of a ‘£’ key.
Oh, and shouldn’t there be a bunch of keys for language options?

I would expect the translation to use the default alphabet for that language

<BOOM!> Ludicrous Gibs that were formerly the insides of heads.

I am being told to “just get on with it” by everyone already, don’t you start. ;-)

May we then be more useful and offer to beta test when you’re ready?

The biggest drawback I find when considering using encodings is that Font_Paint doesn’t display an ‘unknown’ glyph for code points a font doesn’t have a glyph for.

The biggest drawback I find when using Font_Paint is that FontManager doesn’t recognise when the current font doesn’t have a glyph and automatically substitutes a font which does, so writing about 鬼束ちひろ or カラフィナ means handling all this by yourself instead of just passing the text to FontManager and saying “render this”.

Apr 8, 2018 10:21pm

nemo (145) 2546 posts

Rick asked

Shouldn’t there be

More than 90% of Japanese users have a US keyboard and type in Romaji, in effect. 皮肉ですね？

May we then be more useful

I can think of no one better qualified. There’s literally no one who types more than you.

instead of just passing the text to FontManager and saying “render this”

The ability to ‘stack’ physical fonts into a composite virtual font is indeed one of the valuable and versatile features wot we have not got.

Apr 9, 2018 12:59am

Clive Semmens (2335) 3276 posts

The ability to ‘stack’ physical fonts into a composite virtual font is indeed one of the valuable and versatile features wot we have not got.

YES!

More than 90% of Japanese users have a US keyboard and type in Romaji, in effect. 皮肉ですね？

I suspect the equivalent is true of Indian users. I don’t know how Japanese typewriters used to work, so I don’t know whether Romaji on a US keyboard is a satisfactory alternative. But I do know how Hindi typewriters used to work (and Panjabi, Bengali, Gujerati, Odiya etc. ones worked in a similar fashion) and the US keyboard option is a very poor alternative.

Ridiculously, I don’t know how French, Spanish or German computer keyboards work, although I do know how traditional typewriters work in those languages. If the computers aren’t the same, I suspect they’re a backwards step too.

Apr 9, 2018 1:22am

Chris Mahoney (1684) 2165 posts

I don’t know how Japanese typewriters used to work

“Hunt and peck” with 2450 characters.

http://historysanjose.org/wp/the-first-japanese-typewriter-a-100-year-old-mechanical-marvel-with-2450-characters/

I can’t say whether something “better” came along after that!

As for computers, while there is a kana keyboard (as pictured previously in the thread), as pointed out a lot of people use a US-based layout. It’s technically slower (since you need to press, for example, K-A to get か) but does get the job done.

Meanwhile, on phones, apparently the so-called “ten-key” kana input method is the most common.

Apr 9, 2018 5:45am

Clive Semmens (2335) 3276 posts

“Hunt and peck” with 2450 characters.

I suspected that 8~) – almost anything must be better!

The old Hindi typewriters, on the other hand, lent themselves beautifully to touch typing and used ~40% fewer keystrokes than US layout keyboards doing transliterated Hindi that’s converted to proper Hindi in software on the fly. I was reasonably good at it at one time – either on a typewriter or on an Acorn running my own rather hacky software.

Anyone tell me how you type accented characters on a French computer? Or is it the same as on an old French typewriter (just two keystrokes for an accented character).

Apr 9, 2018 6:35am

Rick Murray (539) 13840 posts

Clive: It depends upon the accented character.

https://en.m.wikipedia.org/wiki/AZERTY

Common ones are directly accessible – remember, contrary to practically every other layout on the planet, you need to shift to get numbers on the French keyboard, so é and the like are a simple keypress.
Other less common characters are available via a dead key press, either the backwards apostrophe or the circumflex/umlaut key to the right of P.

Frankly I prefer the RISC OS and Windows UK International methods. While every character is preceded by an Alt modifier, it is regular and one doesn’t need to hunt down where keys are.
Sadly Android never got the memo and it’s keyboard handling is extremely broken (to the point where a Bluetooth keyboard using UK layout is only just capable of sticking acute accents on letters).

Apr 9, 2018 7:08am

Clive Semmens (2335) 3276 posts

Not sure what the RISC OS and Windows UK international methods are, then. That description is exactly what my old French typewriter does, and it’s really quick for touch-typing. I made a variation on it for my own purposes on RISCOS, because I wanted a much wider range of languages to be reasonably usable (academic papers with authors’ names and addresses from all over the world – the papers themselves in English) but it was so idiosyncratic that I certainly wouldn’t release it into the wild!

Apr 9, 2018 8:50am

Chris Mahoney (1684) 2165 posts

Other less common characters are available via a dead key press, either the backwards apostrophe

That’s what we have in NZ; the keyboards are physically ANSI (i.e. US) layout, but ` is a dead key for typing the so-called macronised vowels (ā, ē, ī, ō, ū) used in Māori.

Apr 9, 2018 8:50am

Rick Murray (539) 13840 posts

There’s literally no one who types more than you.

Hmm… You know what they say about monkeys and typewriters, right?

Apr 9, 2018 11:26am

nemo (145) 2546 posts

The MMK keyboard driver for RISC OS and my Windows driver support 16 dead-key accents, all with Alt:

` => ò grave
" => ő hungarumlaut or double acute
6 => ô circumflex
^ => ǒ caron
- => ō macron
_ => ẕ lowline
; => ö dieresis
: => ŏ breve
' => ó acute
@ => å ring
# => õ tilde
~ => ō macron again
, => ç cedilla
< => ą ogonek
. => ṡ dot above
> => ṇ dot below
/ => ħ bar

I didn’t bother with hook or any of the non-European diacritics.

The dead-keys on the MMK driver were different, I’ve forgotten. I prefer these now. I also have sexed quote on the brackets ‘’“” and en– and em—dash on N & M. Plus mathematical symbols ×÷± and other typographical niceties •™ﬁﬂ.

The hardest part is remembering what I’ve put where.

Apr 9, 2018 11:38am

Clive Semmens (2335) 3276 posts

I like that. It’s the exact same set I supported with my idiosyncratic system. (Calling the double acute “hungarumlaut” is confusing, since Hungarian has not only a double acute, but ALSO an ordinary umlaut.)

I didn’t try to support Vietnamese, which uses the Latin alphabet with multiple diacritics. Not sure what I’d have done if we’d had a Vietnamese author, but we never did. I’d probably have made a special font, with just the things necessary for the particular case.

I did support Hindi (and I suppose still do, if anyone wants it, but I’ve not used it myself for years), with dead key diacritics, possibly as many as four on a single character.

The hardest part is remembering what I’ve put where.

Keyboard diagrams… see http://clive.semmens.org.uk/RISCOS/index.php?JPhysiolKB for an example…

Apr 15, 2018 7:57am

Matthew Phillips (473) 721 posts

Chris Hall wrote:

There is also no way of telling the encoding used in a text file apart from assuming Latin1 on RISC OS and CP-1252 on Windows.

Not entirely true. It’s actually pretty easy to identify text files encoded in UTF-8 because of the way the non-ASCII characters have a particular format to the binary representation which is quite distinctive and quite unlike Latin 1. You have to use heuristics, but it’s pretty reliable.

Rick wrote:

The biggest drawback I find when using Font_Paint is that FontManager doesn’t recognise when the current font doesn’t have a glyph and automatically substitutes a font which does, so writing about 鬼束ちひろ or カラフィナ means handling all this by yourself instead of just passing the text to FontManager and saying “render this”.

I realise this is not as helpful as having it built into the OS, but there is RUfl which is quite easy to incorporate into applications written in C, as we have done with RiscOSM and Nominatim. Maybe the ideas used in RUfl could be added to an enhanced font manager some day.

We’ve not yet done the bits needed to make the maps in RiscOSM support switching between different fonts to cover the glyphs, but it will support any glyphs found in a single font. The RUfl library does not support changing direction for Arabic and Hebrew, as far as I can remember, and the font substitution could be enhanced by some methods for identifying common styles like fixed-width, sans serif and so on. I have ideas about how to do this but have not got around to trying them out.

Encodings

Reply

Search forums

Social

ROOL Store

Donate! Why?

RISC OS IPR

Description

Voices

Options

Apr 7, 2018 7:16am Chris Hall (132) 3554 posts	I must admit that I am bemused. Valiant efforts to make filenames work with top-bit-set characters seem a little pointless while we have a text editor, !Edit, that cannot handle simple things like eighths fractions. RISC OS is not alone in just giving up on many characters – Windows just renders ‘5/8’ as ‘?’ if you save a ‘csv’ file for example from Excel (which can handle lots of odd characters). There is also no way of telling the encoding used in a text file apart from assuming Latin1 on RISC OS and CP-1252 on Windows. Thank goodness that the ‘£’ key on the keyboard works correctly and produces the right character code on whatever system you are using. I remember producing an ALT-nnn list of things like sexed quotes, ellipsis, etc. so that I could type them directly on the keyboard and it was (and still is) frustrating that different software used different codes for the same thing. Let’s get a text editor that allows you to specify the encoding on input and output before worrying about characters in filenames!

Apr 7, 2018 8:18am Steve Pampling (1551) 8170 posts	I must admit that I am bemused. Valiant efforts to make filenames work with top-bit-set characters seem a little pointless while we have a text editor, !Edit, that cannot handle simple things like eighths fractions. It’s a little more than just filenames wouldn’t you say? The thing to remember on these subjects is that you need the foundations in before building the rest. The proffered “best” solution is UTF-8 support in the OS. If the OS in general can’t handle UTF-8 there seems little point in piddling around with the applications. Now if the OS has a decent character handling system that falls back to the current behaviour(ish) then people can look at updating the applications that use it. nemo said: The UTF-8 support I have been working on features fallback – that is, if it isn’t UTF-8, then it gets treated transparently as though it is Latin1 (or your preferred Encoding). So, people might more usefully ask what they have to do to use that UTF-8 support in various applications.

Apr 7, 2018 9:34am Mike Freestone (2564) 131 posts	Let’s get a text editor that allows you to specify the encoding on input and output before worrying about characters in filenames! Item number 2 on the list here but a list doesn’t guarantee action if nobody looks at the list

Apr 7, 2018 11:12am Steffen Huber (91) 1953 posts	UTF-8 or not, a lot of applications need to be better aware of encodings. E.g. CDBurn takes a very simplistic approach to Joliet (UTF-16). If you try to use LanManFS or Sunfish or FTPc with different servers and clients in a mixed OS environment, there are many surprises. Even the last version of Messenger Pro that I saw had many weaknesses wrt EMail and Usenet encoding. It is not a solution to “just stay with ASCII”.

Apr 7, 2018 12:49pm nemo (145) 2546 posts	Chris said pointless while we have a text editor, !Edit, that cannot handle simple things like eighths fractions You are literally arguing that we shouldn’t implement the thing that would solve the limitation that you’re using to justify not implementing the thing. Edit will consequently work much like this editor:

Apr 7, 2018 12:55pm Rick Murray (539) 13840 posts	a lot of applications need to be better aware of encodings Indeed. A lot of applications provide internationalised messages (French, German, etc). The problem with this is that it is somewhat hardwired to have an expectation of Latin1 (or something similar) as the default alphabet. You are literally arguing that we shouldn’t Yeah… Justification for doing nothing. :-/ Edit will consequently work much like this editor: Any projections on when this will see a wider release?

Apr 7, 2018 1:12pm nemo (145) 2546 posts	Chris also claimed it may have been a mistake for RISC OS to deprecate top-bit-set characters in filenames but it is a fact It isn’t. You’re wrong. You’ve misunderstood. Even if it were discouraged, and it isn’t, “deprecated” would imply that it used to be OK and now isn’t. The reverse is true. Glad to be able to correct your ignorance “What a man knows may fill a book. What he does not know fills the library.” Limitless though my ignorance no doubt is, I am probably least ignorant about how RISC OS works, sadly. Steffen pointed out VRPC HostFS is broken in so many respects that it is frightening Oh yes indeed. Chris Mahoney said I’ve also tested [UTF-8] with one of my own apps. Filer was happy, Edit was happy, my !Run file was happy, the Toolbox SaveAs object was happy Indeed, and any other modern platforms with access to those files would also be happy. It is only RISC OS (as distinct from WindowManager and FontManager) that would be blissfully unaware. That, I am fixing. And now I return to Chris: Windows just renders ‘5/8’ as ‘?’ if you save a ‘csv’ file for example from Excel (which can handle lots of odd characters). Which is nothing to do with Windows, or Excel. The problem is that CSV has no way of declaring its Encoding unambiguously… unless it is UTF-8, in which case it can use a Byte Order Mark. Arguably, starting a CSV with BOM is not fully compliant, but is usually respected by UTF-8 aware applications. Excel doesn’t make use of this… but that’s because (and this may shock you), Microsoft has not traditionally put much effort into allowing people to get data out of their products and into someone else’s. The fact that Excel’s preferred formats, including all modern replacements for the utterly outdated CSV, use UTF-8 by default is the pertinent point. There is also no way of telling the encoding used in a text file apart from assuming Latin1 on RISC OS and CP-1252 on Windows. …unless it begins with a BOM, in which case it is definitely¹ UTF-8. The more disturbing point is your assertion that “Windows” means “CP1252”… which is as US-centric as Microsoft themselves are frequently guilty of. However, the facts are much more complex than that – here is Microsoft’s list which prominently features this advice: ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page. If you won’t take my ignorant word for it, will you take Microsoft’s? I suspect not. Thank goodness that the ‘£’ key on the keyboard works correctly and produces the right character code on whatever system you are using. Oh really? ¹ Don’t. Just don’t.

Apr 7, 2018 1:19pm nemo (145) 2546 posts	A lot of applications provide internationalised messages (French, German, etc). The problem with this is that it is somewhat hardwired to have an expectation of Latin1 (or something similar) as the default alphabet. Being pedantic, I would expect the translation to use the default alphabet for that language – cue much wrangling about region versus language versus alphabet versus user preference. And hence, Unicode. Any projections on when this will see a wider release? Assuming you mean !UniEdit in the screenshots, probably never. Once UTF-8 is supported at the OS level, !Edit can trivially gain exactly this functionality. Theoretically it should already have it if using an outline font when the alphabet is UTF-8… but I can understand if KB didn’t bother (for it probably would have been he). As for the UTF-8 support, it is so very nearly shippable that I feel guilty typing this. So translating from programmer time to BST… months probably. I am being told to “just get on with it” by everyone already, don’t you start. ;-)

Apr 7, 2018 1:28pm Paul Sprangers (346) 524 posts	Just get on with it, please.

Apr 7, 2018 2:05pm Steve Pampling (1551) 8170 posts	It might be worth stress testing this new version (although note that as of this moment the autobuilt ROM is not yet available). So, modelling the prediction in Chief Enginneer Scott¹ terms thats a few weeks then :) ¹ Trekkie mode.

Apr 7, 2018 2:40pm Colin (478) 2433 posts	The biggest drawback I find when considering using encodings is that Font_Paint doesn’t display an ‘unknown’ glyph for code points a font doesn’t have a glyph for. I feel I should be able to paint a random block of memory in an encoding of my choice and have all code points display something otherwise editing becomes a problem – is that an unreasonable expectation?. Last time I looked Font_Paint/Font_ScanString just ignored codepoints without a glyph. As a result painting unknown glyphs is a non trivial affair requiring caching of font data in every program.

Apr 8, 2018 12:17pm nemo (145) 2546 posts	The FontManager is like that really incompetent person at work, who somehow manages to not get fired but still uses up the seat and the salary to no discernible effect. The offence is not only their uselessness, but the fact that their mere presence is preventing a better candidate being sought.

Apr 8, 2018 12:26pm Steve Pampling (1551) 8170 posts	The FontManager is like that really incompetent person at work, who somehow manages to not get fired but still uses up the seat and the salary to no discernible effect. The offence is not only their uselessness, but the fact that their mere presence is preventing a better candidate being sought. and somehow manages to give people the impression that is actually other people that are at fault. On a personal note the initials are DK (as top candidate because “there can be more than one”)

Apr 8, 2018 7:33pm Rick Murray (539) 13840 posts	Oh really? ;-) Arguably, though, that keyboard is still compliant with that was said due to the lack of a ‘£’ key. Oh, and shouldn’t there be a bunch of keys for language options? I would expect the translation to use the default alphabet for that language <BOOM!> Ludicrous Gibs that were formerly the insides of heads. I am being told to “just get on with it” by everyone already, don’t you start. ;-) May we then be more useful and offer to beta test when you’re ready? The biggest drawback I find when considering using encodings is that Font_Paint doesn’t display an ‘unknown’ glyph for code points a font doesn’t have a glyph for. The biggest drawback I find when using Font_Paint is that FontManager doesn’t recognise when the current font doesn’t have a glyph and automatically substitutes a font which does, so writing about 鬼束ちひろ or カラフィナ means handling all this by yourself instead of just passing the text to FontManager and saying “render this”.

Apr 8, 2018 10:21pm nemo (145) 2546 posts	Rick asked Shouldn’t there be More than 90% of Japanese users have a US keyboard and type in Romaji, in effect. 皮肉ですね？ May we then be more useful I can think of no one better qualified. There’s literally no one who types more than you. instead of just passing the text to FontManager and saying “render this” The ability to ‘stack’ physical fonts into a composite virtual font is indeed one of the valuable and versatile features wot we have not got.

Apr 9, 2018 12:59am Clive Semmens (2335) 3276 posts	The ability to ‘stack’ physical fonts into a composite virtual font is indeed one of the valuable and versatile features wot we have not got. YES! More than 90% of Japanese users have a US keyboard and type in Romaji, in effect. 皮肉ですね？ I suspect the equivalent is true of Indian users. I don’t know how Japanese typewriters used to work, so I don’t know whether Romaji on a US keyboard is a satisfactory alternative. But I do know how Hindi typewriters used to work (and Panjabi, Bengali, Gujerati, Odiya etc. ones worked in a similar fashion) and the US keyboard option is a very poor alternative. Ridiculously, I don’t know how French, Spanish or German computer keyboards work, although I do know how traditional typewriters work in those languages. If the computers aren’t the same, I suspect they’re a backwards step too.

Apr 9, 2018 1:22am Chris Mahoney (1684) 2165 posts	I don’t know how Japanese typewriters used to work “Hunt and peck” with 2450 characters. http://historysanjose.org/wp/the-first-japanese-typewriter-a-100-year-old-mechanical-marvel-with-2450-characters/ I can’t say whether something “better” came along after that! As for computers, while there is a kana keyboard (as pictured previously in the thread), as pointed out a lot of people use a US-based layout. It’s technically slower (since you need to press, for example, K-A to get か) but does get the job done. Meanwhile, on phones, apparently the so-called “ten-key” kana input method is the most common.

Apr 9, 2018 5:45am Clive Semmens (2335) 3276 posts	“Hunt and peck” with 2450 characters. I suspected that 8~) – almost anything must be better! The old Hindi typewriters, on the other hand, lent themselves beautifully to touch typing and used ~40% fewer keystrokes than US layout keyboards doing transliterated Hindi that’s converted to proper Hindi in software on the fly. I was reasonably good at it at one time – either on a typewriter or on an Acorn running my own rather hacky software. Anyone tell me how you type accented characters on a French computer? Or is it the same as on an old French typewriter (just two keystrokes for an accented character).

Apr 9, 2018 6:35am Rick Murray (539) 13840 posts	Clive: It depends upon the accented character. https://en.m.wikipedia.org/wiki/AZERTY Common ones are directly accessible – remember, contrary to practically every other layout on the planet, you need to shift to get numbers on the French keyboard, so é and the like are a simple keypress. Other less common characters are available via a dead key press, either the backwards apostrophe or the circumflex/umlaut key to the right of P. Frankly I prefer the RISC OS and Windows UK International methods. While every character is preceded by an Alt modifier, it is regular and one doesn’t need to hunt down where keys are. Sadly Android never got the memo and it’s keyboard handling is extremely broken (to the point where a Bluetooth keyboard using UK layout is only just capable of sticking acute accents on letters).

Apr 9, 2018 7:08am Clive Semmens (2335) 3276 posts	Not sure what the RISC OS and Windows UK international methods are, then. That description is exactly what my old French typewriter does, and it’s really quick for touch-typing. I made a variation on it for my own purposes on RISCOS, because I wanted a much wider range of languages to be reasonably usable (academic papers with authors’ names and addresses from all over the world – the papers themselves in English) but it was so idiosyncratic that I certainly wouldn’t release it into the wild!

Apr 9, 2018 8:50am Chris Mahoney (1684) 2165 posts	Other less common characters are available via a dead key press, either the backwards apostrophe That’s what we have in NZ; the keyboards are physically ANSI (i.e. US) layout, but ` is a dead key for typing the so-called macronised vowels (ā, ē, ī, ō, ū) used in Māori.

Apr 9, 2018 8:50am Rick Murray (539) 13840 posts	There’s literally no one who types more than you. Hmm… You know what they say about monkeys and typewriters, right?

Apr 9, 2018 11:26am nemo (145) 2546 posts	The MMK keyboard driver for RISC OS and my Windows driver support 16 dead-key accents, all with Alt: ` => ò grave " => ő hungarumlaut or double acute 6 => ô circumflex ^ => ǒ caron - => ō macron _ => ẕ lowline ; => ö dieresis : => ŏ breve ' => ó acute @ => å ring # => õ tilde ~ => ō macron again , => ç cedilla < => ą ogonek . => ṡ dot above > => ṇ dot below / => ħ bar I didn’t bother with hook or any of the non-European diacritics. The dead-keys on the MMK driver were different, I’ve forgotten. I prefer these now. I also have sexed quote on the brackets ‘’“” and en– and em—dash on N & M. Plus mathematical symbols ×÷± and other typographical niceties •™ﬁﬂ. The hardest part is remembering what I’ve put where.

Apr 9, 2018 11:38am Clive Semmens (2335) 3276 posts	I like that. It’s the exact same set I supported with my idiosyncratic system. (Calling the double acute “hungarumlaut” is confusing, since Hungarian has not only a double acute, but ALSO an ordinary umlaut.) I didn’t try to support Vietnamese, which uses the Latin alphabet with multiple diacritics. Not sure what I’d have done if we’d had a Vietnamese author, but we never did. I’d probably have made a special font, with just the things necessary for the particular case. I did support Hindi (and I suppose still do, if anyone wants it, but I’ve not used it myself for years), with dead key diacritics, possibly as many as four on a single character. The hardest part is remembering what I’ve put where. Keyboard diagrams… see http://clive.semmens.org.uk/RISCOS/index.php?JPhysiolKB for an example…

Apr 15, 2018 7:57am Matthew Phillips (473) 721 posts	Chris Hall wrote: There is also no way of telling the encoding used in a text file apart from assuming Latin1 on RISC OS and CP-1252 on Windows. Not entirely true. It’s actually pretty easy to identify text files encoded in UTF-8 because of the way the non-ASCII characters have a particular format to the binary representation which is quite distinctive and quite unlike Latin 1. You have to use heuristics, but it’s pretty reliable. Rick wrote: The biggest drawback I find when using Font_Paint is that FontManager doesn’t recognise when the current font doesn’t have a glyph and automatically substitutes a font which does, so writing about 鬼束ちひろ or カラフィナ means handling all this by yourself instead of just passing the text to FontManager and saying “render this”. I realise this is not as helpful as having it built into the OS, but there is RUfl which is quite easy to incorporate into applications written in C, as we have done with RiscOSM and Nominatim. Maybe the ideas used in RUfl could be added to an enhanced font manager some day. We’ve not yet done the bits needed to make the maps in RiscOSM support switching between different fonts to cover the glyphs, but it will support any glyphs found in a single font. The RUfl library does not support changing direction for Arabic and Hebrew, as far as I can remember, and the font substitution could be enhanced by some methods for identifying common styles like fixed-width, sans serif and so on. I have ideas about how to do this but have not got around to trying them out.