RISC OS Open: Forum: UTF-8?

May 6, 2014 5:10pm

nemo (145) 2546 posts

If I ask really nicely – would you consider hooking me up with an early copy of the command line UTF-8 handler?

I am trying to get an Alpha on the site. Rather busy with real life at the moment.

is it a module? kernel source mods?

It’s a module, though obviously could be done in-kernel.

I would like to verify that my OLED module is quite happy with UTF-8 sequences (I’d likely just test basic kana (if you have implemented it, that is!)

Katakana and Hiragana are in, as is Arabic. However, all I’m doing is arranging for UTF-8 sent to WriteC et al to produce the right 8×8 glyphs on screen. Any claimant of WrchV will just see the UTF-8 stream go past.

if that works okay, no reason to assume other stuff would fail) and that nothing unexpected happens when providing UTF-8 strings. If there are any specific requirements, let me know.

Well, one of the clever bits is the fallback alphabet, that ensures that stuff that spits out Latin1 (say) whilst in UTF-8 mode still produces the right glyphs. WrchV claimants will see the Latin1 codes – they’re compatibility glyphs, not replacement code sequences. (Such a substitution cannot be done earlier because it requires a retrospective change – if you see a valid UTF-8 start byte but then an invalid byte, then the earlier byte needs to be interpreted in the fallback encoding and then encoded as UTF-8, which would require a different start byte)
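To make the "retrospective change" concrete, here is a minimal sketch in Python of a decoder that falls back to Latin1 one byte at a time. It is not nemo's module: the names `sequence_length` and `decode_with_fallback` are made up for illustration, and the fallback alphabet is assumed to be Latin1.

```python
def sequence_length(start_byte):
    """Length of the UTF-8 sequence a start byte announces, or 0 if invalid."""
    if start_byte < 0x80:
        return 1
    if 0xC2 <= start_byte <= 0xDF:
        return 2
    if 0xE0 <= start_byte <= 0xEF:
        return 3
    if 0xF0 <= start_byte <= 0xF4:
        return 4
    return 0  # continuation byte or a byte that never starts a sequence


def decode_with_fallback(data, fallback="latin-1"):
    """Decode UTF-8, reinterpreting bytes that do not fit in the fallback alphabet."""
    out = []
    i = 0
    while i < len(data):
        length = sequence_length(data[i])
        chunk = data[i:i + length]
        try:
            if length == 0 or len(chunk) < length:
                raise ValueError("not a complete UTF-8 sequence")
            out.append(chunk.decode("utf-8"))  # strict: rejects bad continuations
            i += length
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            # The start byte looked plausible but the sequence was not valid,
            # so go back and treat that one byte as a fallback-alphabet character.
            out.append(data[i:i + 1].decode(fallback))
            i += 1
    return "".join(out)
```

Fed the Latin1 bytes for "café" followed by a real UTF-8 sequence, the 0xE9 looks like a three-byte start byte, fails, and is then reinterpreted as é, while the genuine sequence decodes normally.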

All of this takes place while VDU is redirected to a sprite

Then it’ll work fine.

Jul 6, 2014 9:42am

WPB (1391) 352 posts

Nemo, I wonder if your silence on this means you’ve decided to tackle 16×16 glyphs after all? ;)

Jul 19, 2014 10:48pm

nemo (145) 2546 posts

I wonder if your silence on this means you’ve decided to tackle 16×16 glyphs after all?

Gomen nasai. It means that since we won a new Japanese customer I’ve had no spare time AT ALL.

I looked into it, and I’ve played with Unifont too, but as I’ve said before I need to get the 8×8 UTF-8 out there, even in Beta form, before I address the 16×16 idea.

On the positive side, the reason I’m here to reply is because I am installing RPCEmu+RO5 again, having lost it with a machine change six months ago. Need to check that the UTF-8 module works under RO5. 4841 glyphs now!

So it’ll definitely be ready soon. Maybe even THIS YEAR! :-(

Jul 20, 2014 5:58am

WPB (1391) 352 posts

Gomen nasai.

No apology necessary – I was only teasing. (But very aware that I’m in a house made of glass with an ample supply of stones!)

8×8 will be fantastic to experiment with and if 16×16 ever comes along, you’ll be bumped up to 天下一品 status…

GOOD LUCK!

Jul 20, 2014 8:09am

Steve Pampling (1551) 8170 posts

the reason I’m here to reply is because I am installing RPCEmu+RO5 again

The install document is good for Windows XP and Win7, not tested on Win 8.
Notes on differences for Win 8 would be useful.

Mac OSX hasn’t had many people use it (or at least very few have said they have used it, either successfully or not) but all who have reported using it had no great issues.

Jul 20, 2014 11:06am

Rick Murray (539) 13840 posts

you’ll be bumped up to 天下一品 status…

Only 天下一品?

If 16×16 comes along, it’ll be 神様 without a doubt.

Jul 20, 2014 3:46pm

WPB (1391) 352 posts

Only 天下一品?

If 16×16 comes along, it’ll be 神様 without a doubt.

Don’t go too far there, Rick. We want to hold something back in case we need him to implement anything else. ;)

Aug 23, 2014 8:31am

Galax (2465) 3 posts

I’d love to see full system-wide UTF-8 support, with fallbacks for non-updated apps. Are there already system APIs for splitting a stream of UTF-8 bytes into characters? It’s not too hard to do, but should be done consistently.
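For illustration, splitting a UTF-8 byte stream into characters can be sketched in a few lines of Python. This is not a RISC OS API; `split_utf8` is a hypothetical helper, and it assumes the input is already valid UTF-8 (it does no error checking).

```python
def split_utf8(data):
    """Split valid UTF-8 bytes into a list of per-character byte strings."""
    chars = []
    for byte in data:
        if byte & 0xC0 == 0x80 and chars:
            # 10xxxxxx is a continuation byte: it belongs to the previous character.
            chars[-1] += bytes([byte])
        else:
            # Anything else (ASCII or a start byte) begins a new character.
            chars.append(bytes([byte]))
    return chars


pieces = split_utf8("\u014csaka".encode("utf-8"))  # "Ōsaka"
print(len(pieces), pieces[0])
```

The point about consistency still stands: every application that rolls its own version of this has to agree on what to do with malformed input, which this sketch deliberately ignores.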

I speak and write a bit of Chinese and from a personal point of view the lack of Chinese would be a major factor stopping me from using RISC OS as my only system.

Don’t get distracted at all by thinking about vertical text; it doesn’t exist on Chinese computers outside of specialist areas such as DTP programs. There might be a few more apps that can do it in Taiwan, it’s only been added to DirectWrite in Windows 8, and this discussion is about catching up with Windows 98.

Aug 23, 2014 11:01am

Rick Murray (539) 13840 posts

It would be nice, for the few of us that have squiggles alongside regular Latin characters. ;-)

I posted, a while back, about a way that this might be possible to implement, because the Wimp needs to be able to handle both UTF-8 and Latin1 at the same time (this is why).

A dearth of developers and a large number of legacy applications mean that a two-tier Wimp is the best we can hope for.
Unfortunately, this raises all sorts of complications:

1, While we can “fairly easily” determine if a compatible app is UTF-8 or not by how it calls the Wimp during its initialisation, what do we do if we pass a filename to OS_Find? Is the filename Latin1 or UTF-8? Maybe that’s not the best example as a filing system ought to be fairly agnostic and attempt to open the filename passed; but the general point holds true for anything that receives a string input or returns a string. Thankfully this isn’t as painful as it could be. We’re using UTF-8 not UTF-16, so it may be doable so long as the API doesn’t make too many assumptions.
2, How do you represent a file called “「月光」.mp3” in Latin1? Probably as “????.mp3”. Fine. Now how about a file called “トポロジ.mp3”?
3, Related to the above, it is feasible to have a shim to bodge UTF-8 into its closest Latin1 equivalent for inter-application messages, but this is not an easy task. What happens if you do a drag-save of a phrase written in French (“je suis très désolé, mon ami!”) from a Latin1 application to a UTF-8 one? It is clear that this will be invalid in the UTF-8 application. Should it be converted? What if the content is not in plain text? It might be a DrawFile. How do we handle sizes and buffers? I see three accents in there, so I can say that the UTF-8 string will be at least three bytes larger.
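The buffer-size point in item 3 can be checked directly. A small Python sketch (the helper name `encoded_sizes` is invented for this example) that encodes the same French phrase both ways and compares the byte counts:

```python
def encoded_sizes(text):
    """Return (Latin1 byte count, UTF-8 byte count) for the same text."""
    return len(text.encode("latin-1")), len(text.encode("utf-8"))


# "je suis très désolé, mon ami!" with the accents written as escapes
phrase = "je suis tr\u00e8s d\u00e9sol\u00e9, mon ami!"
latin1_size, utf8_size = encoded_sizes(phrase)
print(latin1_size, utf8_size, utf8_size - latin1_size)  # prints: 29 32 3
```

Each of the three accented letters takes one byte in Latin1 and two in UTF-8, so any size field passed alongside the data would need recalculating after conversion.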

It isn’t just Chinese/Japanese. There are others here who would like Greek, Cyrillic, and maybe even a proper range of accents – Ōsaka, Kyōto…

You can do Chinese right now. Install the Cyberbit font, switch to UTF-8 language, then realise exactly how far we have to go. I’ll leave it to somebody else to teach Edit (and such) that one byte does not necessarily equal one character.

I think that we could at least begin by making the Wimp environment better cope with multibyte characters, though I wonder – given that most people here are English speakers – how much desire there is to support such a thing, especially given the complications regarding older applications? Maybe you could start a bounty and see if there is any interest?

Aug 23, 2014 2:16pm

Galax (2465) 3 posts

2, How do you represent a file called “「月光」.mp3” in Latin1? Probably as “????.mp3”.

I don’t think that’s a good way to represent it to a non-Unicode application, information has been lost. There are lots of possible solutions, the simplest might escape out nonstandard characters in all filenames being passed to/from non-Unicode applications, something like the URL escaping (%20 etc.).
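As a rough sketch of the URL-style escaping idea, Python's standard `urllib.parse` can do it. The function names `escape_filename` and `unescape_filename` are hypothetical, and this says nothing about which characters a RISC OS filing system would actually accept.

```python
from urllib.parse import quote, unquote


def escape_filename(name):
    """Percent-escape everything outside unreserved ASCII, using UTF-8 bytes."""
    return quote(name, safe="")


def unescape_filename(escaped):
    """Reverse escape_filename, recovering the original Unicode name."""
    return unquote(escaped)


escaped = escape_filename("\u300c\u6708\u5149\u300d.mp3")  # 「月光」.mp3
print(escaped)
print(unescape_filename(escaped))
```

Nothing is lost, because a literal % is itself escaped as %25, but a 4-character stem turns into 36 characters of ASCII.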

As you said about buffer sizes etc., it seems risky to force UTF-8 on any application that isn’t written (or at least tested) to expect it. I don’t think it’s realistic to expect applications that were written without Unicode/UTF-8 support in mind to just magically work correctly.

A bigger problem for Chinese (and other non-alphabetic languages) might be getting an Input Method Editor to work everywhere. Actually just creating an IME is non-trivial. You can’t just type these languages directly, you need a system that takes what you type (usually phonetics) and converts it into the actual characters. I could explain more but I’ve probably already gone on too long.

Aug 23, 2014 5:16pm

Rick Murray (539) 13840 posts

I don’t think that’s a good way to represent it to a non-Unicode application, information has been lost.

I don’t think it is necessarily possible to represent a Unicode entity to a non-Unicode one without some sort of information loss.

However, the use of a row of question marks (albeit dumb) is exactly what XP does to a command line application, although there are differences in the short (8.3) filename. I would agree that attempting to make a unique filename of at least the first ten characters (if longer) might be a workable way to do it, but it would require the filing system (FileSwitch?) to be aware of this need.

For what it’s worth:

2014/06/16  22:12     11,185,867 !76FF~1.MP4  ??!??????????????.mp4

That is showing the filename of the promo video for an animé series from the DOS console using the command DIR /X. This is why video players that pass parameters on the console (MPlayer and the like) often have an option to force the use of 8.3 filenames. The short filename (!76FF~1.MP4) is unique. The row of question marks is not. Actually, this one is; however, in the list I have four that are exactly “???????.mp4”.
[I’m not holding XP/DOS up as an example, I’m just pointing out that a company with the resources of Microsoft couldn’t come up with anything better]

Of course, this might be moot if nemo’s unicode command line ever reaches a release point, then we’d be in the unique position of potentially having a fully unicode system right down to command line level.

it seems risky to force UTF-8 on any application that isn’t written

My specific example was for a French phrase from a Latin1 application to a Unicode application. Going the other way, the result will be shorter. A two-byte accent can be converted to a single byte accent/character. And anything that can’t be represented can become a question mark. Either way, the sizes given (in, say, datasend) won’t match up.
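A quick Python sketch of the lossy direction, assuming question-mark substitution for anything Latin1 cannot hold (`to_latin1_lossy` is a made-up name for illustration):

```python
def to_latin1_lossy(text):
    """Encode to Latin1, replacing each unrepresentable character with '?'."""
    return text.encode("latin-1", errors="replace")


french = "je suis tr\u00e8s d\u00e9sol\u00e9, mon ami!"
moonlight = "\u300c\u6708\u5149\u300d.mp3"    # 「月光」.mp3
topology = "\u30c8\u30dd\u30ed\u30b8.mp3"     # トポロジ.mp3

# The French phrase survives intact and shrinks compared with UTF-8.
print(len(french.encode("utf-8")), len(to_latin1_lossy(french)))

# Two different names collapse to the same bytes.
print(to_latin1_lossy(moonlight), to_latin1_lossy(topology))
```

Both filenames come out as `????.mp3`, which is the uniqueness problem in a nutshell.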

I don’t think it’s realistic to expect applications that were written without Unicode/UTF-8 support in mind to just magically work correctly.

Given that we have potentially hundreds of applications that assume Latin1 is the current alphabet, and we have a great number that will not be further updated, and at the current time we have exactly zero Unicode Wimp applications (NetSurf manages its own font handling), I am afraid that we’re going to have to bend over backwards to support legacy applications.

might be getting an Input Method Editor to work everywhere. Actually just creating an IME is non-trivial.

Indeed. It is fairly trivial to write something to accept keypresses and convert to kana – typing in “kokoro” can easily become either こころ or ココロ, but the logic to go from there to 心 is rather complicated. Just using an IME is “interesting” as you can type something, it munges it phonetically to be hiragana, and every so often it will delete a bunch and replace it with a kanji, or if your writing could imply several (Japanese is full of homophones) it will open up a list for you to pick what you want.
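The "fairly trivial" half can be sketched in Python. This is only a toy: the table covers a handful of syllables, the names `romaji_to_hiragana` and `to_katakana` are invented here, and it makes no attempt at the kana-to-kanji step. The katakana conversion relies on the Unicode katakana block sitting 0x60 above the hiragana block.

```python
# A deliberately tiny romaji table; a real one needs every syllable,
# doubled consonants, long vowels and so on.
HIRAGANA = {"ko": "こ", "ro": "ろ", "ne": "ね", "i": "い", "nu": "ぬ"}


def romaji_to_hiragana(romaji):
    """Greedy longest-match conversion of romaji to hiragana."""
    out = []
    i = 0
    while i < len(romaji):
        for size in (3, 2, 1):
            piece = romaji[i:i + size]
            if piece in HIRAGANA:
                out.append(HIRAGANA[piece])
                i += len(piece)
                break
        else:
            raise ValueError("no kana for input at position %d" % i)
    return "".join(out)


def to_katakana(hiragana):
    """Shift hiragana (U+3041..U+3096) up by 0x60 to the matching katakana."""
    return "".join(
        chr(ord(ch) + 0x60) if 0x3041 <= ord(ch) <= 0x3096 else ch
        for ch in hiragana
    )


print(romaji_to_hiragana("kokoro"), to_katakana(romaji_to_hiragana("kokoro")))
```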

[BTW, if anybody is interested, 心 = ♥ :-) ]

As for Chinese – do they even have a native phonetic way of writing? It all looks like Kangxi. Oh, and anybody who has read the backs of boxes of cereal packets might have noticed that there is “Traditional Chinese” and “Simplified Chinese”. They both look alike until you notice a fifty-stroke glyph in the Traditional that has become a vague squiggle in the Simplified. Chinese IMEs must be hellishly complicated.

Aug 23, 2014 6:06pm

Chris Hall (132) 3554 posts

2, How do you represent a file called “「月光」.mp3” in Latin1?

What’s wrong with E3 80 8C E6 9C 88 E5 85 89 E3 80 8D.mp3 (i.e. just use top bit set characters (it won’t display here properly if I show it in Latin 1 as it thinks it is UTF-8 as this is Firefox on Windows)) unless you meant to include the sexed quotes as part of the filename.

Aug 24, 2014 5:55am

WPB (1391) 352 posts

At least if you use the UTF-8 byte sequence in hex, you get a unique filename, but they get pretty long and they’re horrible to work with. (Definitely need Tab completion of filenames at the command line for that!)

“Yue guang” or “Gekkou” would be far friendlier, though might require a level of complexity in the Filer that no one could justify. ;)

Aug 24, 2014 6:02am

WPB (1391) 352 posts

Chinese IMEs must be hellishly complicated.

Simpler than JA at least. There’s pretty much a one-to-one mapping between pinyin (with tonal accents) and hanzi. Not one-to-many like romaji-to-kanji. It’s all tied up in Japanese history and how they pinched the Chinese’s writing system and bodged it onto their own language. ;) And no phonetic scripts (kana) to figure out. (Is that は a particle or is it the first character of はな? – that problem is unique to JA.)

As for simplified versus traditional, I think each traditional character has at most one simplified counterpart, and you don’t mix the two unless the traditional character has no simplified form. So it doesn’t really complicate things from a computing perspective, but has made Chinese people’s lives a lot harder (and anyone learning Chinese). Now you need to learn both the traditional AND simplified characters if you want to read modern text and text written pre-simplification. Complicated!

Aug 24, 2014 9:02pm

Chris Hall (132) 3554 posts

At least if you use the UTF-8 byte sequence in hex

That’s not what I meant. I meant just use the top bit set characters as a single-byte character string. After all that is what will be stored as the filename on disc.
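To show what that looks like in practice, here is a short Python sketch; the two helper names are hypothetical. The UTF-8 bytes are left untouched and simply viewed as one Latin1 character per byte, which is reversible.

```python
def utf8_as_latin1(name):
    """View a name's UTF-8 bytes as a single-byte (Latin1) character string."""
    return name.encode("utf-8").decode("latin-1")


def latin1_view_to_utf8(view):
    """Recover the Unicode name from its single-byte view."""
    return view.encode("latin-1").decode("utf-8")


name = "\u300c\u6708\u5149\u300d.mp3"  # 「月光」.mp3
view = utf8_as_latin1(name)
print(len(name), len(view))            # 8 characters become 16
print(latin1_view_to_utf8(view) == name)
```

One caveat with this approach: continuation bytes in the range 0x80 to 0x9F have no printable glyph in standard Latin1, so what a Latin1 application displays for them depends on the font or alphabet in use.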

Aug 24, 2014 11:45pm

Chris Mahoney (1684) 2165 posts

Regarding “Latinisation” of filenames etc, I see that OS X manages to go from kana/kanji to rōmaji. For example, 猫 becomes neko in Terminal, and the example given above of 月光 becomes gekkou (and not something ridiculous like tsukihikari!). Of course, this would be fiendishly difficult to implement in RISC OS :)

Aug 25, 2014 6:45am

WPB (1391) 352 posts

I see that OS X manages to go from kana/kanji to rōmaji. For example, 猫 becomes neko in Terminal, and the example given above of 月光 becomes gekkou (and not something ridiculous like tsukihikari!)

That’s amazing! I was saying it tongue-in-cheek above. Right, RISC OS better step up then!

EDIT: Or I wonder if it isn’t as amazing as I thought – is the romaji stored somewhere as metadata with the file? That would be pretty cool, actually. As a test, you could try renaming the file to 犬 and seeing what it says in the terminal?

Aug 25, 2014 8:40am

Chris Hall (132) 3554 posts

So far as priority goes, making RISC OS work in Chinese and various other exotic languages is hardly top priority…

Aug 25, 2014 8:42am

WPB (1391) 352 posts

So far as priority goes, making RISC OS work in Chinese and various other exotic languages is hardly top priority…

It all comes down to what the people willing to do the work are interested in and want to do, as always. It would open up RISC OS to a much wider audience, which would be no bad thing.

Aug 25, 2014 8:51am

Chris Hall (132) 3554 posts

In principle, Yes, but a stable build for the Pi (and Pandaboard) and a working build for the Compute Module are probably more important.

Aug 25, 2014 9:06am

Chris Mahoney (1684) 2165 posts

Or I wonder if it isn’t as amazing as I thought – is the romaji stored somewhere as metadata with the file? That would be pretty cool, actually. As a test, you could try renaming the file to 犬 and seeing what it says in the terminal?

I’m not actually touching files at all, just displaying kanji in a “Latin” window. It’s therefore not metadata on the filename. 犬 does display as expected (inu).

The easiest place to test seems to be in System Preferences > Sharing; if you enter Japanese into the Computer Name box then you get a romaji representation of it underneath.

Aug 25, 2014 9:33am

WPB (1391) 352 posts

The easiest place to test seems to be in System Preferences > Sharing; if you enter Japanese into the Computer Name box then you get a romaji representation of it underneath.

That’s pretty smart. I’ve seen plenty of implementations of similar, but never at OS level. Kudos to OS X.

Aug 25, 2014 10:33am

Rick Murray (539) 13840 posts

So far as priority goes, making RISC OS work in Chinese and various other exotic languages is hardly top priority…

Exotic? We can’t even do European languages at the same time never mind anything fancy like an Asian IME…

Aug 25, 2014 5:09pm

Rick Murray (539) 13840 posts

We can’t even do European languages at the same time

An Irishman, a Hungarian, and a Greek walk into a bar.

The Irish man, contrary to popular stereotype, was actually rather intelligent. He said:

This joke... it can't possibly work.

The Hungarian pondered this for a moment before replying:

?n nem hiszem, hogy a sz?mit?g?p k?pes erre.

The Greek, with a more Mediterranean personality, waved his arms around a lot while exclaiming:

???? ??????????? ??? ??????? ???? ???????? ?? ???? ????? ??????????!

Aug 25, 2014 5:15pm

Rick Murray (539) 13840 posts

…meanwhile the pretty Asian girl in the corner of the room thought to herself:

? ? ?
? ? ?
? ? ?
? ? ?
? ? 1
  ? 9
  ? 8
  ? 9
  ? ?
