Unicode support, пожалуйста.
Jess Hampshire (158) 865 posts |
I have been using the FreeSerif Unicode font as my system font for quite a few weeks as an experiment. I had to put “alphabet utf8” into an Obey file in PreDesk to make it work. With the addition of Paul Sprangers’ Keymap, I can enter other characters into aware programs. (As the subject line should show – entered from my Iyonix.) The only major issue with the desktop appears to be that it occasionally tries to display characters above 128 as non-Unicode (the hard space, and the up arrow on the shutdown menu entry, for example), where it displays a black diamond with a question mark. Could this be worked around by adding the characters which are presumably normally undefined to the font? (i.e. a hybrid Unicode/RISC OS 8-bit system.) Could the system be fixed so it displays properly? It seems that we are only a tiny way from a sensible working Unicode system. Is a real-time switch between Unicode and 8-bit possible? |
nemo (145) 2529 posts |
The incorrect up arrow is due to the TaskManager using Latin1 instead of UTF8 Messages – there is a defined Unicode code point for the arrow, and the hard space must be correctly encoded (it is character 160, not byte code 160).
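A minimal sketch of that distinction, assuming plain C: the character 160 (the no-break space) is a single byte in Latin1, but in UTF-8 it has to be sent as the two-byte sequence C2 A0 – a lone byte 160 is not valid UTF-8.

```c
/* Sketch only: re-encoding a Latin1 character as UTF-8.
 * Codes 0x80-0xFF become two bytes; a bare byte 0xA0 is not valid UTF-8. */
static int latin1_char_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {                    /* ASCII passes through unchanged */
        out[0] = c;
        return 1;
    }
    out[0] = 0xC0 | (c >> 6);          /* for 160 this gives 0xC2 */
    out[1] = 0x80 | (c & 0x3F);        /*                    0xA0 */
    return 2;
}
```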
No more than a real-time switch between English and French would be – no. |
Rick Murray (539) 13806 posts |
I manage it. ;-) The only unfortunate thing is that the good French I speak seems to be slightly incompatible with the French spoken by actual French people.
It is something I have proposed before, but it doesn’t seem to be a terribly popular idea; perhaps because of the restrictions/bodginess… In essence, my Wimp would be pure UTF-8 for apps that knew how to handle it, and the local alphabet (probably Latin1) for older apps; however, only the Wimp calls and mechanisms would support UTF8 – everything else would be vanilla ASCII. On the face of it, this is no big deal, as no self-respecting Wimp program should be using stuff like OS_Write0 these days; however, it brings some interesting questions into focus.
I’m writing all this again because my animé is downloading and it is slow; but also I hope that one day the right person will think “mmm, that’s not actually such a bad idea” – that being to bring proper support for Unicode to compliant applications without breaking everything with massive API changes, without requiring the UTF8 alphabet to be set (which would apply globally), and in the smallest amount of work. Adding functionality to the Wimp as an option is a lot less grief than rewriting core parts of RISC OS to deal with characters larger than a byte. I reckon in this respect the Wimp would be better off acting as an abstraction from the command line, and the world should just accept that Unicode in the CLI isn’t going to happen any more than it does in the DOS console (where my files look like “???.mp4”). |
nemo (145) 2529 posts |
Yes, but that’s because you didn’t have to choose between English and French brain templates when you woke up this morning. Some new protocol would have to signal the change and apps would have to be rewritten to act upon it. So that won’t happen.
The problem is that though it is easy to map Latin1 to Unicode, you can’t reasonably do the reverse. One could leave the UTF8 untranslated, but that introduces inconsistency. For example, what happens when you paste UTF8 into a writeable belonging to a Latin1 app? If the app is going to process the contents, they would have to be converted to Latin1, but if the app is just going to send them through some other Wimp interface then you’d want them to stay as Unicode. You can’t have it both ways. The 8bit/Unicode transition has haunted every OS at some stage – it split EPOC off from Symbian, in effect. You just can’t have it both ways.
But you forget the reverse angle – when the 8bit application creates a file with a name that is valid Latin1 but an invalid UTF8 encoded sequence. What then? You can’t require EVERY Unicode application to handle broken sequences at every interface. Nobody does that.
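This is the kind of check a Unicode-aware layer would need on every name arriving from an 8-bit application – a sketch only, which ignores overlong forms and surrogates:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the buffer is a structurally valid UTF-8 sequence.
 * A Latin1 name such as "fianc\xE9e" fails: 0xE9 is a lead byte that is
 * not followed by the continuation bytes it promises. */
static bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i++];
        size_t follow;
        if (c < 0x80)                follow = 0;   /* plain ASCII        */
        else if ((c & 0xE0) == 0xC0) follow = 1;   /* 110xxxxx           */
        else if ((c & 0xF0) == 0xE0) follow = 2;   /* 1110xxxx           */
        else if ((c & 0xF8) == 0xF0) follow = 3;   /* 11110xxx           */
        else return false;                         /* stray continuation */
        while (follow--)
            if (i >= len || (s[i++] & 0xC0) != 0x80)
                return false;                      /* truncated / broken */
    }
    return true;
}
```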
I’m amused by your choice of Japanese, as you’ll be much more aware than most of how Unicode fails to address the needs of Chinese and Japanese users, especially with regard to names. For example, if one examines Adobe’s Japan1 character set (which, as A-J1-6, is the definitive standard for Japanese PDFs) one will find over 8,700 characters that do not map to a single unique Unicode code point. That’s 34% of the repertoire! In other words, Unicode isn’t necessarily always the answer… but unfortunately we’re not even at the point of being able to discover that! As for ‘big’ codes in Poll_Keypress and Wimp_SendKey, the DeepKeys module already allows that… but as their names suggest they are for keys, not characters. It’s important not to confuse the two. |
Rick Murray (539) 13806 posts |
No, I have to change on the fly. Sometimes it doesn’t work and the person I’m speaking to has to wait while a backtrace scrolls across my eyeballs…
Easy. Unknown characters become question marks. If you’re using Firefox on Windows, copy this “ロボティクス・ノーツ” into Notepad. You’ll probably see “??????????”.
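A sketch of that lossy direction (assuming well-formed UTF-8 input): everything outside Latin1 becomes ‘?’, which is why the Japanese title above turns into a row of question marks.

```c
/* Decode UTF-8 and keep only what fits in Latin1; the rest becomes '?'. */
static void utf8_to_latin1(const unsigned char *in, char *out)
{
    while (*in) {
        if (*in < 0x80) {                      /* ASCII: copy as-is           */
            *out++ = (char)*in++;
        } else if ((*in & 0xE0) == 0xC0) {     /* two bytes: U+0080..U+07FF   */
            unsigned cp = ((in[0] & 0x1F) << 6) | (in[1] & 0x3F);
            *out++ = (cp <= 0xFF) ? (char)cp : '?';
            in += 2;
        } else {                               /* longer sequences: never Latin1 */
            *out++ = '?';
            while ((*++in & 0xC0) == 0x80)
                ;                              /* skip continuation bytes     */
        }
    }
    *out = '\0';
}
```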
Old (Latin1) apps will only have access to the old (non-Unicode) interface, so sending something through a non-Unicode application will be a potential loss of data. At this point, I will mention the forward-thinking Visual Basic 5 and 6, which used 16-bit (UTF-16) strings internally, running on a Unicode-capable OS (XP onwards, and W98/ME with the patches applied), yet only used the ANSI API – so getting Unicode working is… interesting. I accept that a loss of data will be inevitable, I just hope the end result isn’t quite as stupid as VB5/6!
Japanophile, that’s why. And it is better to say stuff you stand a chance of being able to read. The above, by the way, is “Robotikusu Nōtsu” (Robotics;Notes), the title of an animé series.
It seems quite common in animé for people to horribly misread names, or to have to specify which kanji is used to write the name. It is my impression that Japanese people grew up with the duality of a name not only being a moniker, but also having a meaning potentially separate from the name, almost as if they had two names. With that in mind, how many of these unsupported characters are typically used in day-to-day conversation? The Unicode guys might have a point in leaving out 34% of the kanji if fewer than 3.4% of people even have a clue how they are supposed to be read.
Unicode may not be the answer. I cannot say.
Exactly. That is why I propose to return the Unicode code point – what code, exactly, does a keyboard in kana mode send when the user presses ‘の’? As a character code, it would be 12398. As a UTF-8 sequence, it would be 227,129,174. |
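A small sketch of that arithmetic in C – just the standard UTF-8 encoding rules, nothing RISC OS specific:

```c
#include <stdio.h>

/* Encode one Unicode code point as UTF-8; returns the number of bytes. */
static int utf8_encode(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80)    { out[0] = cp; return 1; }
    if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                        out[1] = 0x80 | (cp & 0x3F); return 2; }
    if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                        out[1] = 0x80 | ((cp >> 6) & 0x3F);
                        out[2] = 0x80 | (cp & 0x3F); return 3; }
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}

int main(void)
{
    unsigned char b[4];
    int n = utf8_encode(12398, b);              /* U+306E, 'の' */
    for (int i = 0; i < n; i++)
        printf("%d ", b[i]);                    /* prints: 227 129 174 */
    printf("\n");
    return 0;
}
```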
nemo (145) 2529 posts |
Decomposition would probably be better, de pr?f?rence?!
Nope. Notepad is Unicode aware.
There are two problems of course, pronunciation is not obvious (hence ruby annotations) but more relevantly for this discussion, names often use archaic or alternate forms of glyphs that can be lost when going through Unicode. For example, the Adobe-Japan1-6 characters 7746 & 8422, which are Hanyo-Denshi characters JC1555 & IB1603 respectively, are both represented by Unicode U+585A (塚) or unified compatibility character U+FA10 (塚). Now, an Ideographic Variation Sequence can be defined (and of course has for those two cases, see this list for details – 塚󠄁 & 塚󠄂) but that doesn’t magically cause existing fonts to gain the correct glyphs (and RISC OS doesn’t support OpenType so we couldn’t display them anyway – the GSUB mechanism is required). This Unicode FAQ does a good job of explaining the use of variation sequences to tackle this problem, and why it isn’t a complete solution.
No it isn’t, not if you regard the UTF8 start and continuation codes as ‘modifier keys’ in some imaginary IME – similar to |!|? in old BBC Micro escaping (that’s character 255 by the way). This way the same mechanism can deliver character 12398 and key 12398 unambiguously – key 12398 perhaps being allocated as the “open a browser” virtual key. Poll_KeyPress and Wimp_SendKey already support multiple keys which are not characters. For example, when your program receives KeyPress 27 you naturally assume that the user pressed Escape, not that they want the [ESC] character inserted (DeepKeys helps disambiguate this kind of thing – it’s worst with character 13 of course). |
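One way to picture the ‘modifier key’ idea above – a hypothetical sketch, not an existing RISC OS interface: key-press bytes are fed in one at a time and a character is only delivered once the UTF-8 sequence completes.

```c
#include <stdint.h>

typedef struct { uint32_t cp; int pending; } utf8_accum;

/* Feed one byte; returns a completed code point, or -1 while more bytes
 * are expected.  Feeding 227, 129, 174 yields 12398 on the third call. */
static int32_t utf8_feed(utf8_accum *a, unsigned char byte)
{
    if (byte < 0x80) { a->pending = 0; return byte; }   /* plain ASCII       */
    if ((byte & 0xC0) == 0x80) {                        /* continuation byte */
        if (a->pending > 0) {
            a->cp = (a->cp << 6) | (byte & 0x3F);
            if (--a->pending == 0) return (int32_t)a->cp;
        }
        return -1;
    }
    if      ((byte & 0xE0) == 0xC0) { a->cp = byte & 0x1F; a->pending = 1; }
    else if ((byte & 0xF0) == 0xE0) { a->cp = byte & 0x0F; a->pending = 2; }
    else                            { a->cp = byte & 0x07; a->pending = 3; }
    return -1;
}
```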
Rick Murray (539) 13806 posts |
Isn’t this sort of what I was suggesting? In your example I am guessing we are treating an accented e ‘é’ as unknown, so yes – if this were the case then “pr?f?rence” would be the result. However, since ‘é’ exists in Latin1… ;-)
I guess you’re using Windows 7 or something. On this machine (XP SP3), it isn’t so.
Ah, but now we aren’t passing key codes, we’re passing metadata. That’s surely worse than just passing characters?
I’m tempted to say that there should be a related but separate mechanism defined for these – didn’t my earlier write-up specify bit 31 set for “functional keys” like browser open etc? I don’t remember right now.
I was thinking about this at work (I spent half my day on the production line; thinking about stuff is necessary to stop one going completely gaga). Anyway, the result of my long think was sadly not “42”, but rather that these “key” events are already abstracted into character codes. There is no “upper case A” key distinct from “lower case a”, nor is there a key for “±” etc. It is basically a translation into plain ASCII with some “special keys” thrown into the mix. Oh, and for what it is worth, nothing we’re talking about here uses the internal key numbers. The whole thing is an abstraction.
Therefore, it would seem to me to be logical to use bit 31 unset to indicate that what is provided is a character code, and bit 31 set to indicate that what is provided is a key that has no logical mapping – “F4” or “Select” for example.
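A hypothetical sketch of that convention – the names and values here are invented for illustration, not an existing Wimp interface:

```c
#include <stdbool.h>
#include <stdint.h>

#define KEY_FLAG    0x80000000u          /* bit 31: a key, not a character */
#define KEY_F4      (KEY_FLAG | 0x004u)  /* made-up allocations            */
#define KEY_SELECT  (KEY_FLAG | 0x100u)

static bool is_character(uint32_t code) { return (code & KEY_FLAG) == 0; }

static void handle_keypress(uint32_t code)
{
    if (is_character(code)) {
        /* bit 31 clear: a Unicode code point, e.g. 12398 for 'の' */
    } else {
        /* bit 31 set: an abstract key with no character mapping, e.g. KEY_F4 */
    }
}
```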
Having programmed that other system, I look at RISC OS as being friendlier and more obnoxious in equal measure. It is a much friendlier API to get to know and use – however, some things that seem like they ought to be fairly simple actually require jumping through hoops. An example being mouse clicks and redraw loops: sometimes you are looking at screen co-ordinates, sometimes at window co-ordinates, and both the window origin (which first needs to be calculated from the data block…) and the way of numbering differ between screen and window. Yet more calculations – a rough sketch of the conversion follows below.
The driving force, in case you hadn’t noticed, is to make the API do more stuff so the programmer doesn’t have to keep doing the same work over and over. And also to have RISC OS capable of displaying stuff in foreign languages – be it ελληνικά or 日本語 or even বাংলা – preferably at the same time and “natively” (not doing app-specific fontwork to get it to work). Is this too much to ask, given that you need to look hard these days to find a device that fails with (at least some) extended characters? |
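For what it’s worth, a rough sketch of the screen-to-work-area conversion described above, assuming the usual RISC OS convention that the work-area origin sits at (visible x0 − scroll x, visible y1 − scroll y); the struct here is illustrative, not a real Wimp data block layout.

```c
/* Illustrative window state: visible area in screen coordinates plus
 * scroll offsets, as reported for a window. */
typedef struct {
    int vis_x0, vis_y0, vis_x1, vis_y1;   /* visible area (screen coords) */
    int scroll_x, scroll_y;               /* scroll offsets               */
} window_state;

/* Convert a screen coordinate (e.g. from a mouse click) into work-area
 * coordinates relative to the window's own origin. */
static void screen_to_work(const window_state *w,
                           int screen_x, int screen_y,
                           int *work_x, int *work_y)
{
    *work_x = screen_x - (w->vis_x0 - w->scroll_x);
    *work_y = screen_y - (w->vis_y1 - w->scroll_y);
}
```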