Unicode support, пожалуйста.
Jess Hampshire (158) 865 posts |
I have been using the FreeSerif Unicode font as my system font for quite a few weeks as an experiment. I had to put “alphabet utf8” into an Obey file in PreDesk to make it work. With the addition of Paul Sprangers’ Keymap, I can enter other characters into aware programs. (As the subject line should show – entered from my Iyonix.) The only major issue with the desktop appears to be that it occasionally tries to display characters above 128 as non-Unicode (the hard space, and the up arrow on the shutdown menu entry, for example), where it displays a black diamond with a question mark. Could this be worked around by adding the characters which are presumably normally undefined to the font? (i.e. a hybrid Unicode/RISC OS 8-bit system.) Could the system be fixed so it displays properly? It seems that we are only a tiny way from a sensible working Unicode system. Is a real-time switch between Unicode and 8-bit possible? |
nemo (145) 2529 posts |
The incorrect up arrow is due to the TaskManager using Latin1 instead of UTF8 Messages – there is a defined Unicode code point for the arrow, and the hard space must be correctly encoded (it is character 160, not byte code 160).
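A minimal sketch of that distinction, assuming plain C: the character 160 (the no-break space) is a single byte in Latin1, but in UTF-8 it has to be sent as the two-byte sequence C2 A0 – a lone byte 160 is not valid UTF-8.

```c
/* Sketch only: re-encoding a Latin1 character as UTF-8.
 * Codes 0x80-0xFF become two bytes; a bare byte 0xA0 is not valid UTF-8. */
static int latin1_char_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {                    /* ASCII passes through unchanged */
        out[0] = c;
        return 1;
    }
    out[0] = 0xC0 | (c >> 6);          /* for 160 this gives 0xC2 */
    out[1] = 0x80 | (c & 0x3F);        /*                    0xA0 */
    return 2;
}
```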
No more than a real-time switch between English and French would be – no. |
Rick Murray (539) 13806 posts |
I manage it. ;-) The only unfortunate thing is that the good French I speak seems to be slightly incompatible with the French spoken by actual French people.
It is something I have proposed before, but it doesn’t seem to be a terribly popular idea; perhaps because of the restrictions/bodginess… In essence, my Wimp would be pure UTF-8 for apps that knew how to handle it, and the local alphabet (probably Latin1) for older apps; however, only the Wimp calls and mechanisms would support UTF8 – everything else would be vanilla ASCII. On the face of it, this is no big deal, as no self-respecting Wimp program should be using stuff like OS_Write0 these days; however, it brings some interesting questions into focus.
I’m writing all this again because my animé is downloading and it is slow; but also I hope that one day the right person will think “mmm, that’s not actually such a bad idea” – that being to bring proper support for Unicode to compliant applications without breaking everything with massive API changes, without requiring the UTF8 alphabet to be set (which would apply globally), and in the smallest amount of work. Adding functionality to the Wimp as an option is a lot less grief than rewriting core parts of RISC OS to deal with characters larger than a byte. I reckon in this respect the Wimp would be better off acting as an abstraction from the command line, and the world should just accept that Unicode in the CLI isn’t going to happen any more than it does in the DOS console (where my files look like “???.mp4”). |
nemo (145) 2529 posts |
Yes, but that’s because you didn’t have to choose between English and French brain templates when you woke up this morning. Some new protocol would have to signal the change and apps would have to be rewritten to act upon it. So that won’t happen.
The problem is that though it is easy to map Latin1 to Unicode, you can’t reasonably do the reverse. One could leave the UTF8 untranslated, but that introduces inconsistency. For example, what happens when you paste UTF8 into a writeable belonging to a Latin1 app? If the app is going to process the contents, they would have to be converted to Latin1, but if the app is just going to send them through some other Wimp interface then you’d want them to stay as Unicode. You can’t have it both ways. The 8bit/Unicode transition has haunted every OS at some stage – it split EPOC off from Symbian, in effect. You just can’t have it both ways.
But you forget the reverse angle – when the 8bit application creates a file with a name that is valid Latin1 but an invalid UTF8 encoded sequence. What then? You can’t require EVERY Unicode application to handle broken sequences at every interface. Nobody does that.
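This is the kind of check a Unicode-aware layer would need on every name arriving from an 8-bit application – a sketch only, which ignores overlong forms and surrogates:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the buffer is a structurally valid UTF-8 sequence.
 * A Latin1 name such as "fianc\xE9e" fails: 0xE9 is a lead byte that is
 * not followed by the continuation bytes it promises. */
static bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i++];
        size_t follow;
        if (c < 0x80)                follow = 0;   /* plain ASCII        */
        else if ((c & 0xE0) == 0xC0) follow = 1;   /* 110xxxxx           */
        else if ((c & 0xF0) == 0xE0) follow = 2;   /* 1110xxxx           */
        else if ((c & 0xF8) == 0xF0) follow = 3;   /* 11110xxx           */
        else return false;                         /* stray continuation */
        while (follow--)
            if (i >= len || (s[i++] & 0xC0) != 0x80)
                return false;                      /* truncated / broken */
    }
    return true;
}
```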
I’m amused by your choice of Japanese, as you’ll be much more aware than most of how Unicode fails to address the needs of Chinese and Japanese users, especially with regard to names. For example, if one examines Adobe’s Japan1 character set (which, as A-J1-6, is the definitive standard for Japanese PDFs) one will find over 8,700 characters that do not map to a single unique Unicode code point. That’s 34% of the repertoire! In other words, Unicode isn’t necessarily always the answer… but unfortunately we’re not even at the point of being able to discover that! As for ‘big’ codes in Poll_Keypress and Wimp_SendKey, the DeepKeys module already allows that… but as their names suggest they are for keys, not characters. It’s important not to confuse the two. |
Rick Murray (539) 13806 posts |
No, I have to change on the fly. Sometimes it doesn’t work and the person I’m speaking to has to wait while a backtrace scrolls across my eyeballs…
Easy. Unknown characters become question marks. If you’re using Firefox on Windows, copy this “ロボティクス・ノーツ” into Notepad. You’ll probably see “??????????”.
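A sketch of that lossy direction (assuming well-formed UTF-8 input): everything outside Latin1 becomes ‘?’, which is why the Japanese title above turns into a row of question marks.

```c
/* Decode UTF-8 and keep only what fits in Latin1; the rest becomes '?'. */
static void utf8_to_latin1(const unsigned char *in, char *out)
{
    while (*in) {
        if (*in < 0x80) {                      /* ASCII: copy as-is           */
            *out++ = (char)*in++;
        } else if ((*in & 0xE0) == 0xC0) {     /* two bytes: U+0080..U+07FF   */
            unsigned cp = ((in[0] & 0x1F) << 6) | (in[1] & 0x3F);
            *out++ = (cp <= 0xFF) ? (char)cp : '?';
            in += 2;
        } else {                               /* longer sequences: never Latin1 */
            *out++ = '?';
            while ((*++in & 0xC0) == 0x80)
                ;                              /* skip continuation bytes     */
        }
    }
    *out = '\0';
}
```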
Old (Latin1) apps will only have access to the old (non-Unicode) interface, so sending something through a non-Unicode application will be a potential loss of data. At this point, I will mention the forward-thinking Visual Basic 5 and 6, which used 16-bit (UTF-16) strings internally, running on a Unicode-capable OS (XP onwards, and W98/ME with the patches applied), yet only used the ANSI API – so getting Unicode working is… interesting. I accept that a loss of data will be inevitable, I just hope the end result isn’t quite as stupid as VB5/6!
Japanophile, that’s why. And it is better to say stuff you stand a chance of being able to read. The above, by the way, is “Robotikusu Nōtsu” (Robotics;Notes), the title of an animé series.
It seems quite common in animé for people to horribly misread names, or to have to specify which kanji is used to write the name. It is my impression that Japanese people grew up with the duality of a name not only being a moniker, but also having a meaning potentially separate from the name, almost as if they had two names. With that in mind, how many of these unsupported characters are typically used in day-to-day conversation? The Unicode guys might have a point in leaving out 34% of the kanji if fewer than 3.4% of people even have a clue how they are supposed to be read.
Unicode may not be the answer. I cannot say.
Exactly. That is why I propose to return the Unicode code point – what code, exactly, does a keyboard in kana mode send when the user presses ‘の’? As a character code, it would be 12398. As a UTF-8 sequence, it would be 227,129,174. |
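A small sketch of that arithmetic in C – just the standard UTF-8 encoding rules, nothing RISC OS specific:

```c
#include <stdio.h>

/* Encode one Unicode code point as UTF-8; returns the number of bytes. */
static int utf8_encode(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80)    { out[0] = cp; return 1; }
    if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                        out[1] = 0x80 | (cp & 0x3F); return 2; }
    if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                        out[1] = 0x80 | ((cp >> 6) & 0x3F);
                        out[2] = 0x80 | (cp & 0x3F); return 3; }
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}

int main(void)
{
    unsigned char b[4];
    int n = utf8_encode(12398, b);              /* U+306E, 'の' */
    for (int i = 0; i < n; i++)
        printf("%d ", b[i]);                    /* prints: 227 129 174 */
    printf("\n");
    return 0;
}
```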
nemo (145) 2529 posts |
Decomposition would probably be better, de pr?f?rence?!
Nope. Notepad is Unicode aware.
There are two problems of course, pronunciation is not obvious (hence ruby annotations) but more relevantly for this discussion, names often use archaic or alternate forms of glyphs that can be lost when going through Unicode. For example, the Adobe-Japan1-6 characters 7746 & 8422, which are Hanyo-Denshi characters JC1555 & IB1603 respectively, are both represented by Unicode U+585A (塚) or unified compatibility character U+FA10 (塚). Now, an Ideographic Variation Sequence can be defined (and of course has for those two cases, see this list for details – 塚󠄁 & 塚󠄂) but that doesn’t magically cause existing fonts to gain the correct glyphs (and RISC OS doesn’t support OpenType so we couldn’t display them anyway – the GSUB mechanism is required). This Unicode FAQ does a good job of explaining the use of variation sequences to tackle this problem, and why it isn’t a complete solution.
No it isn’t, not if you regard the UTF8 start and continuation codes as ‘modifier keys’ in some imaginary IME – similar to |!|? in old BBC Micro escaping (that’s character 255 by the way). This way the same mechanism can deliver character 12398 and key 12398 unambiguously – key 12398 perhaps being allocated as the “open a browser” virtual key. Poll_KeyPress and Wimp_SendKey already support multiple keys which are not characters. For example, when your program receives KeyPress 27 you naturally assume that the user pressed Escape, not that they want the [ESC] character inserted (DeepKeys helps disambiguate this kind of thing – it’s worst with character 13 of course). |
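One way to picture the ‘modifier key’ idea above – a hypothetical sketch, not an existing RISC OS interface: key-press bytes are fed in one at a time and a character is only delivered once the UTF-8 sequence completes.

```c
#include <stdint.h>

typedef struct { uint32_t cp; int pending; } utf8_accum;

/* Feed one byte; returns a completed code point, or -1 while more bytes
 * are expected.  Feeding 227, 129, 174 yields 12398 on the third call. */
static int32_t utf8_feed(utf8_accum *a, unsigned char byte)
{
    if (byte < 0x80) { a->pending = 0; return byte; }   /* plain ASCII       */
    if ((byte & 0xC0) == 0x80) {                        /* continuation byte */
        if (a->pending > 0) {
            a->cp = (a->cp << 6) | (byte & 0x3F);
            if (--a->pending == 0) return (int32_t)a->cp;
        }
        return -1;
    }
    if      ((byte & 0xE0) == 0xC0) { a->cp = byte & 0x1F; a->pending = 1; }
    else if ((byte & 0xF0) == 0xE0) { a->cp = byte & 0x0F; a->pending = 2; }
    else                            { a->cp = byte & 0x07; a->pending = 3; }
    return -1;
}
```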
Rick Murray (539) 13806 posts |
Isn’t this sort of what I was suggesting? In your example I am guessing we are treating an accented e ‘é’ as unknown, so yes – if this were the case then “pr?f?rence” would be the result. However, since ‘é’ exists in Latin1… ;-)
I guess you’re using Windows 7 or something. On this machine (XP SP3), it isn’t so.
Ah, but now we aren’t passing key codes, we’re passing metadata. That’s surely worse than just passing characters?
I’m tempted to say that there should be a related but separate mechanism defined for these – didn’t my earlier write-up specify bit 31 set for “functional keys” like browser open etc? I don’t remember right now.
I was thinking about this at work (I spent half my day on the production line; thinking about stuff is necessary to stop one going completely gaga). Anyway, the result of my long think was sadly not “42”, but rather that these “key” events are already abstracted into character codes. There is no “upper case A” key distinct from “lower case a”, nor is there a key for “±” etc. It is basically a translation into plain ASCII with some “special keys” thrown into the mix. Oh, and for what it is worth, nothing we’re talking about here uses the internal key numbers. The whole thing is an abstraction.
Therefore, it would seem to me to be logical to use bit 31 unset to indicate that what is provided is a character code, and bit 31 set to indicate that what is provided is a key that has no logical mapping – “F4” or “Select” for example.
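A hypothetical sketch of that convention – the names and values here are invented for illustration, not an existing Wimp interface:

```c
#include <stdbool.h>
#include <stdint.h>

#define KEY_FLAG    0x80000000u          /* bit 31: a key, not a character */
#define KEY_F4      (KEY_FLAG | 0x004u)  /* made-up allocations            */
#define KEY_SELECT  (KEY_FLAG | 0x100u)

static bool is_character(uint32_t code) { return (code & KEY_FLAG) == 0; }

static void handle_keypress(uint32_t code)
{
    if (is_character(code)) {
        /* bit 31 clear: a Unicode code point, e.g. 12398 for 'の' */
    } else {
        /* bit 31 set: an abstract key with no character mapping, e.g. KEY_F4 */
    }
}
```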
Having programmed that other system, I look at RISC OS as being friendlier and more obnoxious in equal measure. It is a much friendlier API to get to know and use – however, some things that seem like they ought to be fairly simple actually require jumping through hoops. An example being mouse clicks and redraw loops: sometimes you are looking at screen co-ordinates, sometimes at window co-ordinates, and both the window origin (which first needs to be calculated from the data block…) and the way of numbering differ between screen and window. Yet more calculations – a rough sketch of the conversion follows below.
The driving force, in case you hadn’t noticed, is to make the API do more stuff so the programmer doesn’t have to keep doing the same work over and over. And also to have RISC OS capable of displaying stuff in foreign languages – be it ελληνικά or 日本語 or even বাংলা – preferably at the same time and “natively” (not doing app-specific fontwork to get it to work). Is this too much to ask, given that you need to look hard these days to find a device that fails with (at least some) extended characters? |
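For what it’s worth, a rough sketch of the screen-to-work-area conversion described above, assuming the usual RISC OS convention that the work-area origin sits at (visible x0 − scroll x, visible y1 − scroll y); the struct here is illustrative, not a real Wimp data block layout.

```c
/* Illustrative window state: visible area in screen coordinates plus
 * scroll offsets, as reported for a window. */
typedef struct {
    int vis_x0, vis_y0, vis_x1, vis_y1;   /* visible area (screen coords) */
    int scroll_x, scroll_y;               /* scroll offsets               */
} window_state;

/* Convert a screen coordinate (e.g. from a mouse click) into work-area
 * coordinates relative to the window's own origin. */
static void screen_to_work(const window_state *w,
                           int screen_x, int screen_y,
                           int *work_x, int *work_y)
{
    *work_x = screen_x - (w->vis_x0 - w->scroll_x);
    *work_y = screen_y - (w->vis_y1 - w->scroll_y);
}
```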