Unicode, alphabets, etc.
Jeffrey Lee (213) 6048 posts |
I’m looking at implementing support for non-ASCII characters in VNCServ. Dealing with different alphabets and character encodings in RISC OS isn’t really something I’m familiar with, so I figured that asking a few questions here might be a good way of getting the answers I need. Possibly some of these questions are answered in the PRMs, but I haven’t had a chance to check them yet. In any case, this post might serve as a useful reference for anyone else who is taking their first look at character encoding support.
|
Sprow (202) 1158 posts |
I think the only thing in (current) ROMs using !Unicode is the more recent versions of Chars. A quick grep of the sources shows !Browse and the Korean IME do too, neither of which get used anywhere. The log messages for !Unicode have several mentions of set top box model numbers, so most likely there are closed components for STBs of years gone by.
UnicodeLib has some promising looking conversion functions, but I think the last time I tried to do what you’re doing it turned out some vital step was missing. If you can assume you only need to support RISC OS 5 then Service_International 8 will give you a 256-entry lookup table for a given alphabet. If you want to support other OS versions you’ll need to carry round your own copy, or at least a Latin1 fallback. |
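For anyone following along, here is a rough sketch in C of what the "256-entry table plus Latin1 fallback" approach might look like once you have a table in hand. It is not taken from VNCServ or any RISC OS source. The function names are made up for illustration, and the code assumes the table holds one Unicode code point per byte value (how you obtain it from Service_International 8 is left to the caller and not shown). When no table is available it falls back to treating each byte as the same-numbered code point, which is only correct for the ISO 8859-1 part of Latin1, not for the extra RISC OS characters. Mapping invalid entries to U+FFFD is also an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is not encodable
   (a surrogate, or above U+10FFFF). */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: convert a string in the current alphabet to UTF-8.
   table is a 256-entry array of code points (assumed to come from the
   caller, e.g. via Service_International 8), or NULL to use the
   ISO 8859-1 style identity fallback.
   Returns the number of bytes written; stops early if out is full. */
size_t alphabet_to_utf8(const uint32_t *table,
                        const unsigned char *in, size_t in_len,
                        unsigned char *out, size_t out_size)
{
    size_t used = 0;
    for (size_t i = 0; i < in_len; i++) {
        uint32_t cp = table ? table[in[i]] : (uint32_t)in[i];
        unsigned char buf[4];
        size_t n = utf8_encode(cp, buf);
        if (n == 0)
            n = utf8_encode(0xFFFD, buf); /* assumption: use U+FFFD */
        if (used + n > out_size)
            break; /* never write a partial sequence */
        memcpy(out + used, buf, n);
        used += n;
    }
    return used;
}
```

Passing NULL as the table gives the fallback behaviour, so the same call works on OS versions where no table is available. The output buffer needs at most 4 bytes per input byte.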
Jeffrey Lee (213) 6048 posts |
Thanks for the tips. A bit of experimentation with Iconv suggests that:
And all of the above is totally fair, considering that Iconv is meant to work with standards and the RISC OS alphabets aren’t a 1:1 match for any of the standards. So I’ll probably rely on Service_International 8 plus a hardcoded Latin1 fallback. However, Service_International 8 isn’t perfect either, since it doesn’t include the Wimp symbols. I can kind of understand why (it’s the Wimp that defines them, and the RISC OS 3 PRMs warn that their presence shouldn’t be relied upon), but surely the fact that apps are using them for their menus (and are causing trouble with the UTF-8 alphabet) means that we should have some official way of converting them. Have we actually decided on which Unicode code points should be used for the Wimp symbols? |
Sprow (202) 1158 posts |
I guess it’s trying to map from number to the name you would need with a *Alphabet command, with the possibility of extra modules being added that respond with even more alphabets. In the strictest of senses, since Latin1 is a superset of ISO8859-1, it would be a lie to return ISO8859-1 as the answer. When I sent in Appendix G of the User Guide with all the updated character sets I also dug up the standard numbers for the respective alphabets. I doubt we’ll ever add any new alphabets now that UTF-8 exists.
Yes – there’s a table in the Style Guide on page 98 in the section “Unicode support”. I think the table just comes from the Wimp sources (wimpsymbols_UTF8 and wimpsymbols_UCS4 in Wimp04.s). |
Jeffrey Lee (213) 6048 posts |
Some poking around in the InternationalKeyboard sources suggests that the answer is “yes”. |
Chris Mahoney (1684) 2165 posts |
(Ignore this – it was completely wrong) |
Rick Murray (539) 13840 posts |
Okay, I’ll ignore it – but Japanese filenames in the Filer, huh? I’m guessing you are in Alphabet = UTF8? |
Chris Mahoney (1684) 2165 posts |
Sneaky timing there! :) Yep, UTF8 alphabet with IPAex font. |
Michael Drake (88) 336 posts |
John-Mark’s Drobe article might help: http://www.drobe.co.uk/article.php?id=1319&hlt=unicode |