UTF-8?
nemo (145) 2546 posts |
I am trying to get an Alpha on the site. Rather busy with real life at the moment.
It’s a module, though obviously could be done in-kernel.
Katakana and Hiragana are in, as is Arabic. However, all I’m doing is arranging for UTF-8 sent to WriteC et al to produce the right 8×8 glyphs on screen. So any claimant of WrchV will just see the UTF-8 stream go past.
Well, one of the clever bits is the fallback alphabet, that ensures that stuff that spits out Latin1 (say) whilst in UTF-8 mode still produces the right glyphs. WrchV claimants will see the Latin1 codes – they’re compatibility glyphs, not replacement code sequences. (Such a substitution cannot be done earlier because it requires a retrospective change – if you see a valid UTF-8 start byte but then an invalid byte, then the earlier byte needs to be interpreted in the fallback encoding and then encoded as UTF-8, which would require a different start byte)
Then it’ll work fine. |
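[Editor's sketch] The fallback rule nemo describes (a plausible UTF-8 start byte followed by an invalid byte means the earlier byte is reinterpreted in the fallback alphabet) might look roughly like this in Python. This is not nemo's module: the function name `decode_with_fallback` is hypothetical, and ISO-8859-1 is assumed as a stand-in for the fallback alphabet, which differs from RISC OS Latin1 in the 0x80 to 0x9F range.

```python
def decode_with_fallback(data: bytes, fallback: str = "latin-1") -> str:
    """Decode UTF-8, reinterpreting any byte that is not part of a
    well-formed sequence in the fallback alphabet (assumed ISO-8859-1)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # How many bytes would a well-formed sequence starting here need?
        if lead < 0x80:
            need = 1
        elif 0xC2 <= lead <= 0xDF:
            need = 2
        elif 0xE0 <= lead <= 0xEF:
            need = 3
        elif 0xF0 <= lead <= 0xF4:
            need = 4
        else:
            need = 0  # stray continuation byte, or a byte UTF-8 never uses
        chunk = data[i:i + need]
        if need and len(chunk) == need:
            try:
                out.append(chunk.decode("utf-8"))
                i += need
                continue
            except UnicodeDecodeError:
                pass  # looked like a start byte, but what followed was invalid
        # Retrospective change: treat just this one byte as a fallback character.
        out.append(data[i:i + 1].decode(fallback))
        i += 1
    return "".join(out)
```

For example, the Latin1 bytes `b"caf\xe9"` and the UTF-8 bytes for the same word both come out as "café", which is the behaviour a program still emitting Latin1 in UTF-8 mode would need.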
WPB (1391) 352 posts |
Nemo, I wonder if your silence on this means you’ve decided to tackle 16×16 glyphs after all? ;) |
nemo (145) 2546 posts |
Gomen nasai. It means that since we won a new Japanese customer I’ve had no spare time AT ALL. I looked into it, and I’ve played with Unifont too, but as I’ve said before I need to get the 8×8 UTF-8 out there, even in Beta form, before I address the 16×16 idea. On the positive side, the reason I’m here to reply is because I am installing RPCEmu+RO5 again, having lost it with a machine change six months ago. Need to check that the UTF-8 module works under RO5. 4841 glyphs now! So it’ll definitely be ready soon. Maybe even THIS YEAR! :-( |
WPB (1391) 352 posts |
No apology necessary – I was only teasing. (But very aware that I’m in a house made of glass with an ample supply of stones!) 8×8 will be fantastic to experiment with and if 16×16 ever comes along, you’ll be bumped up to 天下一品 status… GOOD LUCK! |
Steve Pampling (1551) 8170 posts |
The install document is good for Windows XP and Win7, not tested on Win 8. Mac OSX hasn’t had many people use it (or at least very few have said they have used either successfully or not) but all who have reported using it had no great issues. |
Rick Murray (539) 13840 posts |
Only 天下一品? If 16×16 comes along, it’ll be 神様 without a doubt. |
WPB (1391) 352 posts |
Don’t go too far there, Rick. We want to hold something back in case we need him to implement anything else. ;) |
Galax (2465) 3 posts |
I’d love to see full system-wide UTF-8 support, with fall backs for non-updated apps. Are there already system APIs for splitting a stream of UTF-8 bytes into characters? It’s not too hard to do, but should be done consistently. I speak and write a bit of Chinese and from a personal point of view the lack of Chinese would be a major factor stopping me from using RISC OS as my only system. Don’t get distracted at all by thinking about vertical text; it doesn’t exist on Chinese computers outside of specialist areas such as DTP programs. There might be a few more apps that can do it in Taiwan, it’s only been added to DirectWrite in Windows 8, and this discussion is about catching up with Windows 98. |
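[Editor's sketch] On splitting a stream of UTF-8 bytes into characters: whether RISC OS offers an API for this is not answered here, but the technique itself is short. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so a new character starts at any byte that does not. The helper name `split_utf8` is hypothetical, and the sketch assumes the input is already well-formed UTF-8 (requires Python 3.9 or later for the type hints).

```python
def split_utf8(data: bytes) -> list[bytes]:
    """Split well-formed UTF-8 into one bytes object per code point."""
    chars = []
    start = 0
    for i in range(1, len(data) + 1):
        # A boundary is the end of the data or any non-continuation byte.
        if i == len(data) or (data[i] & 0xC0) != 0x80:
            chars.append(data[start:i])
            start = i
    return chars
```

Note that this splits into code points, not user-perceived characters: a base letter followed by a combining accent would come out as two pieces.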
Rick Murray (539) 13840 posts |
It would be nice, for the few of us that have squiggles alongside regular Latin characters. ;-) I posted, a while back, about a way that this might be possible to implement, because the Wimp needs to be able to handle both UTF-8 and Latin1 at the same time (this is why). A dearth of developers and a large number of legacy applications mean that a two-tier Wimp is the best we can hope for. 1, While we can “fairly easily” determine if a compatible app is UTF-8 or not by how it calls the Wimp during its initialisation, what do we do if we pass a filename to OS_Find? Is the filename Latin1 or UTF-8? Maybe that’s not the best example as a filing system ought to be fairly agnostic and attempt to open the filename passed; but the general point holds true for anything that receives a string input or returns a string. Thankfully this isn’t as painful as it could be. We’re using UTF-8 not UTF-16, so it may be doable so long as the API doesn’t make too many assumptions. It isn’t just Chinese/Japanese. There are others here who would like Greek, Cyrillic, and maybe even a proper range of accents – Ōsaka, Kyōto… You can do Chinese right now. Install the Cyberbit font, switch to UTF-8 language, then realise exactly how far we have to go. I’ll leave it to somebody else to teach Edit (and such) that one byte does not necessarily equal one character. I think that we could at least begin by making the Wimp environment better cope with multibyte characters, though I wonder – given that most people here are English speakers – how much desire there is to support such a thing, especially given the complications regarding older applications? Maybe you could start a bounty and see if there is any interest? |
Galax (2465) 3 posts |
I don’t think that’s a good way to represent it to a non-Unicode application, information has been lost. There are lots of possible solutions, the simplest might escape out nonstandard characters in all filenames being passed to/from non-Unicode applications, something like the URL escaping (%20 etc.). As you said about buffer sizes etc., it seems risky to force UTF-8 on any application that isn’t written (or at least tested) to expect it. I don’t think it’s realistic to expect applications that were written without Unicode/UTF-8 support in mind to just magically work correctly. A bigger problem for Chinese (and other non-alphabetic languages) might be getting an Input Method Editor to work everywhere. Actually just creating an IME is non-trivial. You can’t just type these languages directly, you need a system that takes what you type (usually phonetics) and converts it into the actual characters. I could explain more but I’ve probably already gone on too long. |
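[Editor's sketch] The URL-style escaping Galax suggests could be done along these lines. The function names are hypothetical and nothing here is an existing RISC OS or FileSwitch interface; the only design assumption is that `%` itself is escaped too, so the mapping stays reversible.

```python
def escape_name(name: str) -> str:
    """Replace every non-ASCII byte (and '%' itself) with %XX."""
    out = []
    for b in name.encode("utf-8"):
        if b < 0x80 and b != 0x25:  # 0x25 is '%'
            out.append(chr(b))
        else:
            out.append(f"%{b:02X}")
    return "".join(out)


def unescape_name(escaped: str) -> str:
    """Reverse escape_name; assumes the input was produced by it."""
    raw = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i] == "%":
            raw.append(int(escaped[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(escaped[i]))
            i += 1
    return raw.decode("utf-8")
```

No information is lost, but each escaped byte takes three characters, so the buffer-size concern still applies to the non-Unicode application receiving the name.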
Rick Murray (539) 13840 posts |
I don’t think it is necessarily possible to represent a Unicode entity to a non-Unicode one without some sort of information loss. However, the use of a row of question marks (albeit dumb) is exactly what XP does to a command line application, although there are differences in the short (8.3) filename. I would agree that attempting to make a unique filename of at least the first ten characters (if longer) might be a workable way to do it, but it would require the filing system (FileSwitch?) to be aware of this need. For what it’s worth:
2014/06/16 22:12 11,185,867 !76FF~1.MP4 ??!??????????????.mp4
That is showing the filename of the promo video for an animé series from the DOS console using the command
Of course, this might be moot if nemo’s unicode command line ever reaches a release point, then we’d be in the unique position of potentially having a fully unicode system right down to command line level.
My specific example was for a French phrase from a Latin1 application to a Unicode application. Going the other way, the result will be shorter. A two-byte accent can be converted to a single byte accent/character. And anything that can’t be represented can become a question mark. Either way, the sizes given (in, say, datasend) won’t match up.
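[Editor's sketch] The size mismatch in both directions is easy to see in Python. The phrases below are hypothetical stand-ins (the original French phrase is not quoted in this thread), and ISO-8859-1 is assumed for Latin1.

```python
# Latin1 application -> Unicode application: the text grows.
phrase = "déjà vu"  # hypothetical phrase, not the one from the thread
latin1_bytes = phrase.encode("latin-1")  # one byte per character
utf8_bytes = phrase.encode("utf-8")      # accented letters take two bytes

# Unicode application -> Latin1 application: the text shrinks, and
# anything Latin1 cannot represent becomes a question mark.
mixed = "月光 café"
downgraded = mixed.encode("latin-1", errors="replace")

print(len(latin1_bytes), len(utf8_bytes))
print(len(mixed.encode("utf-8")), len(downgraded), downgraded)
```

So a length announced by the sender only holds for the sender's encoding; whatever does the conversion would also have to rewrite the size.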
Given that we have potentially hundreds of applications that assume Latin1 is the current alphabet, and we have a great number that will not be further updated, and at the current time we have exactly zero Unicode Wimp applications (NetSurf manages its own font handling), I am afraid that we’re going to have to bend over backwards to support legacy applications.
Indeed. It is fairly trivial to write something to accept keypresses and convert to kana – typing in “kokoro” can easily become either こころ or ココロ, but the logic to go from there to 心 is rather complicated. Just using an IME is “interesting” as you can type something, it munges it phonetically to be hiragana, and every so often it will delete a bunch and replace it with a kanji, or if your writing could imply several (Japanese is full of homophones) it will open up a list for you to pick what you want. [BTW, if anybody is interested, 心 = ♥ :-) ] As for Chinese – do they even have a native phonetic way of writing? It all looks like Kangxi. Oh, and anybody who has read the backs of boxes of cereal packets might have noticed that there is “Traditional Chinese” and “Simplified Chinese”. They both look alike until you notice a fifty-stroke glyph in the Traditional that has become a vague squiggle in the Simplified. Chinese IMEs must be hellishly complicated. |
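[Editor's sketch] The "fairly trivial" half, keypresses to kana, can be done with a greedy longest-match lookup. The table below is a deliberately tiny, hypothetical subset, and the function names are invented for illustration. Katakana is derived by relying on the fact that the Unicode katakana block sits 0x60 code points above hiragana. The hard part, going from kana to kanji, is not attempted.

```python
# Partial romaji table, for illustration only.
HIRAGANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "na": "な", "nu": "ぬ", "ne": "ね", "ha": "は",
    "ra": "ら", "ri": "り", "ru": "る", "re": "れ", "ro": "ろ",
    "shi": "し", "chi": "ち", "tsu": "つ", "n": "ん",
}


def to_hiragana(romaji: str) -> str:
    """Convert romaji to hiragana, trying the longest syllable first."""
    out = []
    i = 0
    while i < len(romaji):
        for size in (3, 2, 1):
            chunk = romaji[i:i + size]
            if chunk in HIRAGANA:
                out.append(HIRAGANA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no kana for {romaji[i:]!r}")
    return "".join(out)


def to_katakana(romaji: str) -> str:
    """Same conversion, shifted into the katakana block."""
    return "".join(chr(ord(c) + 0x60) for c in to_hiragana(romaji))
```

With this table, `to_hiragana("kokoro")` gives こころ and `to_katakana("kokoro")` gives ココロ, matching the example above.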
Chris Hall (132) 3554 posts |
2, How do you represent a file called “「月光」.mp3” in Latin1? What’s wrong with E3 80 8C E6 9C 88 E5 85 89 E3 80 8D.mp3 (i.e. just use top bit set characters (it won’t display here properly if I show it in Latin 1 as it thinks it is UTF-8 as this is Firefox on Windows)) unless you meant to include the sexed quotes as part of the filename. |
WPB (1391) 352 posts |
At least if you use the UTF-8 byte sequence in hex, you get a unique filename, but they get pretty long and they’re horrible to work with. (Definitely need Tab completion of filenames at the command line for that!) “Yue guang” or “Gekkou” would be far friendlier, though might require a level of complexity in the Filer that no one could justify. ;) |
WPB (1391) 352 posts |
Simpler than JA at least. There’s pretty much a one-to-one mapping between pinyin (with tonal accents) and hanzi. Not one-to-many like romaji-to-kanji. It’s all tied up in Japanese history and how they pinched the Chinese’s writing system and bodged it onto their own language. ;) And no phonetic scripts (kana) to figure out. (Is that は a particle or is it the first character of はな? – that problem is unique to JA.) As for simplified versus traditional, I think each traditional character has at most one simplified counterpart, and you don’t mix the two unless the traditional character has no simplified form. So it doesn’t really complicate things from a computing perspective, but has made Chinese people’s lives a lot harder (and anyone learning Chinese). Now you need to learn both the traditional AND simplified characters if you want to read modern text and text written pre-simplification. Complicated! |
Chris Hall (132) 3554 posts |
“At least if you use the UTF-8 byte sequence in hex” That’s not what I meant. I meant just use the top bit set characters as a single-byte character string. After all that is what will be stored as the filename on disc. |
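[Editor's sketch] The distinction between the two readings can be shown in Python: the same twelve UTF-8 bytes can be written out as hex text, or simply viewed as twelve top-bit-set single-byte characters. ISO-8859-1 is assumed for the single-byte view; RISC OS Latin1 assigns different glyphs to 0x80 to 0x9F, but the byte values are the same.

```python
name = "「月光」.mp3"
raw = name.encode("utf-8")  # what would be stored on disc

# Reading 1: the bytes spelled out as hex text (long and unfriendly).
hex_form = " ".join(f"{b:02X}" for b in raw[:-4]) + ".mp3"

# Reading 2: the same bytes treated as a single-byte character string.
as_latin1 = raw.decode("latin-1")

# The single-byte view is lossless: converting back recovers the name.
round_trip = as_latin1.encode("latin-1").decode("utf-8")

print(hex_form)
print(len(name), len(raw), len(as_latin1))
```

The single-byte view is exactly as long as the stored name (16 bytes here), whereas the hex spelling is roughly three times longer for the non-ASCII part.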
Chris Mahoney (1684) 2165 posts |
Regarding “Latinisation” of filenames etc, I see that OS X manages to go from kana/kanji to rōmaji. For example, 猫 becomes neko in Terminal, and the example given above of 月光 becomes gekkou (and not something ridiculous like tsukihikari!). Of course, this would be fiendishly difficult to implement in RISC OS :) |
WPB (1391) 352 posts |
That’s amazing! I was saying it tongue-in-cheek above. Right, RISC OS better step up then! EDIT: Or I wonder if it isn’t as amazing as I thought – is the romaji stored somewhere as metadata with the file? That would be pretty cool, actually. As a test, you could try renaming the file to 犬 and seeing what it says in the terminal? |
Chris Hall (132) 3554 posts |
So far as priority goes, making RISC OS work in Chinese and various other exotic languages is hardly top priority… |
WPB (1391) 352 posts |
It all comes down to what the people willing to do the work are interested in and want to do, as always. It would open up RISC OS to a much wider audience, which would be no bad thing. |
Chris Hall (132) 3554 posts |
In principle, yes, but a stable build for the Pi (and Pandaboard) and a working build for the Compute Module are probably more important. |
Chris Mahoney (1684) 2165 posts |
I’m not actually touching files at all, just displaying kanji in a “Latin” window. It’s therefore not metadata on the filename. 犬 does display as expected (inu). The easiest place to test seems to be in System Preferences > Sharing; if you enter Japanese into the Computer Name box then you get a romaji representation of it underneath. |
WPB (1391) 352 posts |
That’s pretty smart. I’ve seen plenty of implementations of similar, but never at OS level. Kudos to OS X. |
Rick Murray (539) 13840 posts |
Exotic? We can’t even do European languages at the same time never mind anything fancy like an Asian IME… |
Rick Murray (539) 13840 posts |
An Irishman, a Hungarian, and a Greek walk into a bar. The Irish man, contrary to popular stereotype, was actually rather intelligent. He said: This joke... it can't possibly work. The Hungarian pondered this for a moment before replying: ?n nem hiszem, hogy a sz?mit?g?p k?pes erre. The Greek, with a more Mediterranean personality, waved his arms around a lot while exclaiming: ???? ??????????? ??? ??????? ???? ???????? ?? ???? ????? ??????????! |
Rick Murray (539) 13840 posts |
…meanwhile the pretty Asian girl in the corner of the room thought to herself: ? ? ? ? ? ? ? ? ? ? ? ? ? ? 1 ? 9 ? 8 ? 9 ? ? |