Horizontal Scroll Bar for Boot/Run-at-startup
WPB (1391) 352 posts |
This really is missing the point somewhat. However, moving towards using UTF-8 as a default and global alphabet for the whole OS and its applications will bring you all the weird and obscure fractions you’d like and commonly used glyphs from all over the world as well. It’s the way to go. We should all get behind the change and make it happen. |
Chris Hall (132) 3554 posts |
We should all get behind the change and make it happen. So long as it is backwards compatible, I agree. For example, use a few of the most obscure and little-used top-bit-set characters to specify a different range of top-bit-set characters for applications that are aware – say, &F8 to &FF to select one of eight possible different sets of 120 characters. That should easily be enough. Unaware applications would simply display the default top-bit-set character instead of the special one. And &F8 to &FF could simply display themselves as a space. |
WPB (1391) 352 posts |
For the avoidance of doubt, I am absolutely NOT advocating this. The last thing we want to do is introduce yet another character encoding scheme into the world. The Unicode Consortium have worked so hard to come up with one encoding to rule them all (I know that’s a simplification; UTF-8 isn’t all things to all men, but it has a great many benefits), and a huge amount of research went into the spec of UTF-8 – we would have to be mad at this stage to go with some RISC OS-only character encoding over that. Chris, if you have any time, you should look into UTF-8 and what it was designed to achieve. The change would not be fully backwards compatible, but as Ben has outlined in detail, it doesn’t need to be terribly painful, either. Not a lot will break in the majority of apps. What we need are tools to help with the transition. I will start working on a few once I’ve got something else I’m working on out of the way. |
Jess Hampshire (158) 865 posts |
That sort of reminds me of the clichéd Englishman abroad, who talks LOUDLY and slowly so Johnny Foreigner can understand.
Hmm, the number of times I have wanted to do this is on a par with the number of pole dancers I have dated. (i.e. not many)
It was the same in my school (though “thicko” was obviously relative, since everyone had passed the 11+): we learnt Russian, which I hated at the time, but now, years later (having been somewhere where even a tiny bit of Russian was useful), I rather regret not trying harder, and am trying to learn it again. Ironically, the last few weeks I’ve been trying to learn a little Spanish, because I have a holiday booked to Tenerife. Unicode – shouldn’t there be a separate filetype for it? Then at least it would be possible to copy and paste between aware applications (allowing an on-screen keyboard to generate it). |
WPB (1391) 352 posts |
What do you mean? A separate filetype for Unicode text files or something? Then would we have separate filetypes for UTF-8 textfiles / UTF-16 textfiles / UTF-32 textfiles / etc.? I’m not sure this would help in any way. There should just be a way for textual data on the clipboard to have an associated encoding, and possibly even to be able to paste in a different encoding (that’s always handy to be able to do). |
Jess Hampshire (158) 865 posts |
That was the sort of thing I was thinking. So when you open a file, it opens with something that understands it. As far as I can see from a non-programmer POV, that would improve the situation, because any Unicode-aware program would have a means of loading and saving Unicode, and I don’t see it needing any big change to the underlying OS. |
Andrew Flegg (1574) 28 posts |
The only way I can think of doing this would be to have a BOM in front of any UTF-8 string. But that doesn’t give you backwards compatible transliteration to things which can’t handle UTF-8, and would mean two separate strings couldn’t be concatenated to form another valid UTF-8 string. Any attempt to do half of a multi-byte character encoding, or otherwise switching between different encodings in different applications will result in horrible bugs, broken behaviour and data loss. |
nemo (145) 2529 posts |
OMG stop, people, please stop. If you find yourself talking about Unicode without knowing what a BOM is, stop talking. ;-)
You do NOT have a “Unicode filetype”. It is possible to tell the difference between one of the 8-bit encodings I’ve mentioned and UTF8 with about 98% certainty even without a BOM, but a BOM is definitive. Consequently one can prefix a BOM to a text file or CSV file, and one can put the appropriate Content-Type in HTTP and XML files.
Fractions are an embarrassment. Unicode jumped the shark around 5.1, but it started in the right direction despite difficult starting conditions: Unicode is a universal character set, and hence seeks to represent every character. I stress that because it is important. Fractions aren’t characters, they’re combinations of characters… except… those difficult starting conditions. Being a universal character set means it needs a 1:1 mapping to all existing character sets. That meant inheriting the duplications and unfortunate choices of those existing character sets, and that means anachronisms like small fractions having their own Unicode code points.
However, that isn’t what Unicode is for. ½ is a hang-over from typewriters. 1/2 is what we should have in the text. Now, as a typographer, I’m not happy with “1/2” – ½ is much more appealing, but that’s presentation, not semantic meaning. Unicode is not supposed to represent presentation, and it is this founding principle that has now been gleefully set on fire and thrown out of the window (see U+1F534 and U+FE0F and weep).
Thankfully, other technologies fix the presentation aspect. Any good OpenType font can automatically replace 1/2 with ½ and, more to the point, will also happily represent 527/756 similarly, which no one could suggest be represented by an individual glyph. The fact that RISC OS doesn’t support OTs is another story.
There’s nothing magical about UTF8 in that respect; if you try concatenating two strings in any two different encodings you’re going to get a silly result. We’ve had multiple Alphabets forever. The fact that “little Acorners” never ventured outside Latin1 doesn’t alter the fact that this has always been the case.
More interestingly, ‘old’ (ie non-‘language’) Acorn fonts could have their own Encoding file… which didn’t have to conform to any defined Alphabet or even be a subset of any known character set (not even the AGL). My !IntChars program parsed these and automatically mapped keypresses in the configured Alphabet to appropriate character codes in the font’s Encoding (where possible), and also performed case-swapping in the font’s Encoding. As far as I know it’s the only program that did so. Such mapping is essential when mixing encodings; UTF-8 is absolutely no different in that regard.
* By which I mean one can pretty easily and unambiguously detect whether you have UTF8 or Latin1. UTF-16 is much more dangerous in this regard – see “Bush hid the facts” |
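A minimal sketch (in C, with an illustrative function name) of the detection nemo footnotes above: scan the data against the UTF-8 byte patterns, and if every top-bit-set byte forms a well-formed sequence, the text is almost certainly UTF-8 rather than Latin-1, since Latin-1 text that uses accented characters very rarely happens to form valid sequences by accident.

#include <stddef.h>

/* Return 1 if buf[0..len) parses as well-formed UTF-8, else 0.
   Latin-1 text containing top-bit-set characters almost never passes,
   so this works as a (not quite 100%) UTF-8 vs Latin-1 discriminator.
   A stricter check would also reject overlong forms and surrogates. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;

        if      (c < 0x80)           extra = 0;   /* plain ASCII                 */
        else if ((c & 0xE0) == 0xC0) extra = 1;   /* lead byte %110xxxxx         */
        else if ((c & 0xF0) == 0xE0) extra = 2;   /* lead byte %1110xxxx         */
        else if ((c & 0xF8) == 0xF0) extra = 3;   /* lead byte %11110xxx         */
        else                         return 0;    /* stray continuation or bad lead byte */

        if (i + 1 + extra > len) return 0;        /* sequence runs off the end   */

        for (size_t j = 1; j <= extra; j++)
            if ((buf[i + j] & 0xC0) != 0x80)      /* continuations are %10xxxxxx */
                return 0;

        i += 1 + extra;
    }
    return 1;
}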
nemo (145) 2529 posts |
I quipped:
prompting Rick and WPB to fall into my Ben trap:
I shall say two words and allow Rick or WPB to regain their Japanophile reputation by explaining:
Stares significantly at the FontManager. |
Chris Hall (132) 3554 posts |
Hmm, the number of times I have wanted to do this is on a par with the number of pole dancers I have dated. (i.e. not many) I have just finished publishing a book. The master is an HTML file generated from the ‘csv’ output of a spreadsheet, which is then turned into a PDF for printing. It is a reference work containing many imperial dimensions such as 4⅛″, which gave me enormous difficulty. First, Excel translated them to ‘4?” ’ when it saved the ‘.csv’ file, which I had to identify and convert in my processing, using an image for the fraction (as my reference book on HTML didn’t mention the code). Anyway, many thanks for drawing my attention to UTF-8, which I had never heard of – this has allowed me to find the HTML codes for the odd eighths fractions. |
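(For reference, the eighth fractions do have their own Unicode code points – ⅛, ⅜, ⅝ and ⅞ are U+215B to U+215E – so in HTML they can be written as the numeric character references &#8539;, &#8540;, &#8541; and &#8542;, with no image needed.)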
Eric Rucker (325) 232 posts |
And now you can see why UTF-8 support is a good idea. |
Rick Murray (539) 13806 posts |
I think it depends upon the people you work with. The very few people I’ve known (triple that if you include keyboards used in films) were more kana than romaji. While both methods work, and the various IMEs will accept either method, I suspect there are a large number of kana users – especially the younger ones. Think about it: why should the bar for working with a computer be set at the very artificial level of needing to know how to write your language in somebody else’s characters? It would be like “it’s okay to code for RISC OS, just so long as you first learn how to do it on a Cyrillic keyboard”! A young person, who still depends upon Furigana to be able to read stuff, is not going to appreciate having to battle Latin characters when they could just get on with Hiragana. |
Rick Murray (539) 13806 posts |
…or just use an internationally agreed and widely supported way of specifying characters that gets away from this problem entirely? |
Rick Murray (539) 13806 posts |
It is pretty easy for UTF-8. If the text is just plain English with no “extended” characters (anything over 127), then it is just plain text. If the file contains extended characters in UTF-8, each will begin with a specific lead byte – one of %110xxxxx, %1110xxxx or %11110xxx – which specifies the number of bytes used to represent the character (2 to 4; the original design allowed up to 6, but anything over 4 is no longer valid), followed by the corresponding number of continuation bytes of the form %10xxxxxx, each carrying six bits of the character.
Anything that does not fit this pattern is not a UTF-8 file. I believe UTF-16 and so on have other identifying attributes, including an optional marker to indicate endianness. |
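As a sketch of how those patterns decode in practice (plain C, illustrative function name; a full decoder would also reject overlong forms and UTF-16 surrogate values):

#include <stddef.h>

/* Decode one UTF-8 sequence starting at s (assumes avail >= 1), following
   the patterns above: a lead byte of %0xxxxxxx, %110xxxxx, %1110xxxx or
   %11110xxx, followed by continuation bytes of %10xxxxxx each carrying six
   payload bits.  Returns the code point and stores the sequence length in
   *used, or returns -1 (with *used = 1) for a malformed sequence. */
long utf8_decode_one(const unsigned char *s, size_t avail, size_t *used)
{
    unsigned char c = s[0];
    long cp;
    size_t extra;

    if      (c < 0x80)           { *used = 1; return c; }          /* ASCII           */
    else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }     /* %110xxxxx       */
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }     /* %1110xxxx       */
    else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }     /* %11110xxx       */
    else                         { *used = 1; return -1; }         /* not a lead byte */

    if (avail < 1 + extra)       { *used = 1; return -1; }         /* truncated       */

    for (size_t i = 1; i <= extra; i++) {
        if ((s[i] & 0xC0) != 0x80) { *used = 1; return -1; }       /* bad continuation */
        cp = (cp << 6) | (s[i] & 0x3F);                            /* append six bits  */
    }
    *used = 1 + extra;
    return cp;
}

So the pound sign £ (U+00A3), for example, arrives as the two bytes &C2 &A3 – %11000010 %10100011 – rather than the single Latin-1 byte &A3.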
Rick Murray (539) 13806 posts |
Why? If you are creating a newer protocol that accepts UTF-8 encoded text, how hard is it to reserve a flags bit to say “actually, this isn’t UTF-8”? I don’t mean adding it to Wimp_CreateIcon and everywhere, but more in sending/receiving messages and places where it might be desired to have non-UTF-8 strings. Doesn’t BOM mean Byte Orientation Marker or something? It applies more to UTF-16 than our UTF-8, right?
Rubbish! If you have a marker byte (of some description) at the start of your strings, you just increment the pointer to the second string to point beyond its marker (preserving the first one so the result is valid) and just use strcat() as normal. |
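A sketch of that idea in C, assuming – purely for illustration – a convention where UTF-8 strings may carry the three-byte BOM &EF &BB &BF at the front (the function names are made up):

#include <string.h>

/* Return a pointer just past a leading UTF-8 BOM (EF BB BF), if present. */
static const char *skip_bom(const char *s)
{
    if ((unsigned char)s[0] == 0xEF &&
        (unsigned char)s[1] == 0xBB &&
        (unsigned char)s[2] == 0xBF)
        return s + 3;
    return s;
}

/* Concatenate b onto a (a's buffer must be big enough), keeping a's
   marker and skipping b's, so the result is still one validly-marked
   UTF-8 string rather than having a stray U+FEFF in the middle. */
void concat_marked(char *a, const char *b)
{
    strcat(a, skip_bom(b));
}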
Rick Murray (539) 13806 posts |
;-) Can HTML manage vertical yet? Yeah… I tend to gloss over this point, but thankfully most animé credits are horizontal (so much so that Chihayafuru’s “backwards sideways scrolling vertical credits” were quite jarring), so it’ll do to use horizontal for now… Works for ja.wikipedia.org! Punctuation? You’re asking a programmer? Have you seen my nested brackets in written stuff? Just be glad I don’t wrap paragraphs in curly braces! (^_^) However, what you perhaps mean is that a Japanese comma looks like 、 and a full stop looks like 。 and there are kinda cool quotes like 『this』 (and a single-line version), etc. I’m not sure why this is a FontManager issue, though – shouldn’t the input method convert “.” to “。” and so forth? |
Rick Murray (539) 13806 posts |
A fellow troper?
How about 1/2? That was:
<flippant> If we want the real typewriter feel, don’t forget “l” for one and “O” for zero. Backspace-slash is optional. ;-) BTW, sorry for the mass of posts. Site appeared to be down while I was on break at work, so I’m catching up. |
Steve Pampling (1551) 8155 posts |
I would say no. Leave translation of a language to something like Google Translate, and leave FontManager to make the right marks on the screen when presented with the right byte combination. Otherwise you run into context issues when swapping our favourite Latin set, used by the non-English inhabitants of the UK, to/from something like Greek. |
Rick Murray (539) 13806 posts |
Huh? I’m not sure what Google Translate is doing here, but isn’t this what I am saying? The input method (not FontManager) should be the one responsible for noticing that you have pressed a full stop, and when in Japanese mode, should offer a Japanese-style one, perhaps¹? The task of FontManager, as you say, should be to only display what it has been directed to display.
¹ The question is moot anyway – I just looked at my kana keys and the punctuation is Japanese-specific on the Japanese layout (duh): 、 on <, 。 on >, ・ on ?, and 「 」 on { and }. So forget what I said about converting “.” to “。”; just switch keyboard layouts and do it that way.
Bootnote: Ironically, having looked again, shift-comma is the JP-comma and shift-period is the JP-period; handled by the input method (the keyboard driver), so it is sort of what I was suggesting anyway. ;-) |
WPB (1391) 352 posts |
A “Ben trap”, eh? I don’t think I’ve ever fallen into one of them before… I wouldn’t say vertical text is a prerequisite for saying the FM supports Japanese. Sorry, but apart from in newspapers, very little printed Japanese is written vertically. In the various offices where I worked, pretty much all printed material was written horizontally. The reason is simple – software support on most OSes is geared towards horizontal display of text. If you told a Japanese person they wouldn’t be able to use vertical text in a word processor, they’d likely say, “Yeah, so?” I don’t know what the problem is with punctuation. Unicode contains plenty of normal JP punctuation characters. There’s nothing special about them. RO can display them fine. When kids hand-write Japanese on the little bits of squared paper they use (yes, usually vertically, but that’s because it’s hand-written), bits of punctuation occupy a space on their own. That’s the same as what FM does – treating punctuation as separate characters. Perhaps I’ve missed the point?
I think nemo’s switched the logic here. What Andrew Flegg was suggesting is that you can’t concatenate two UTF-8 strings if they have BOMs on the front. That’s true. He wasn’t talking about concatenating strings in different encodings. Of course that won’t work. Equally, I don’t think Rick was suggesting putting a BOM on the front of strings in memory. Probably what he meant was, add some metadata to the transfer of strings to indicate the UTF-8 encoding, leaving the string itself untouched.
I agree it seems a little odd, but that’s the practice in Japan. I’ve taught Japanese kids, and they all learn to type in romaji. I think probably the Japanese see it as a good way to get more familiar with the alphabet. Learning to type at all takes a considerable amount of time. Learning to type in two totally different ways takes twice as long. IMEs allow you to learn just one way, but input anything you like. Perhaps that’s the rationale.
You could if you wanted to. But where would it end? That would imply having a filetype for text files in every different encoding supported by the system. That’s really why it’s not a good idea.
Yes, this is exactly what happens on pretty much every IME I’ve ever used. Or you can type it in directly, as Rick points out. This is nothing to do with linguistic translation. It’s to do with input.
A Welsh IME could legitimately change “l” followed by “l” into the Welsh “ll” (I think it has its own codepoint – U+1EFA maybe? No doubt nemo will jump in here and tell me this is orthogonal to something – perhaps an equals sign ;). |
Jess Hampshire (158) 865 posts |
I finally managed to type some Russian on RISC OS. I used Paul Sprangers’ Keymap, StrongED and the following HTML file:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<pre>
Typing here
Setting the keyboard to Russian Unicode and typing between the tags produced gobbledegook, but clicking on Run displayed what I had typed in NetSurf. Am I right in thinking that there is no way of entering text and viewing it as you write? From what has been said, if RISC OS were to use UTF-8, everything would work fine until you tried typing a non-ASCII character (like a pound sign) into a non-Unicode program. Keymap shows that the keyboard can be switched, so isn’t it possible to switch between UTF-8 and the current system, depending on which program has focus? |
WPB (1391) 352 posts |
You might have had better luck if you set the system alphabet to UTF-8. Then, if you managed to tell StrongED to use anti-aliased fonts, and selected a font containing the Russian glyphs, you might have seen them. It depends how StrongED is coded. If it doesn’t mess about with the characters it gets given by the Wimp, and doesn’t specify an encoding on opening its fonts for display, the Font Manager would have used the system alphabet as the encoding and should have painted the string correctly. As you can see, that’s quite a lot of "if"s. Really, in all but the most trivial of cases, it’s not just as simple as “making RISC OS use UTF-8”. First of all what you type needs to be got into an application. That happens in the system alphabet usually. (I think Keymap forces UTF-8, regardless of the system alphabet). Then the application needs to know it’s expecting UTF-8, so it can let the caret move properly within the string, and calculate string lengths sensibly. If everything’s done in icons, the Wimp handles much of this for you (again, if the system alphabet is correctly set). Otherwise it’s up to application authors. It’s not hard, but it doesn’t just happen magically.
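To make the caret-movement and string-length point concrete, here is a minimal C sketch (illustrative names) of two operations a UTF-8-aware application has to do differently, both relying on the fact that continuation bytes always match %10xxxxxx:

#include <stddef.h>

/* Character count differs from byte count in UTF-8: count only the
   bytes that are NOT continuation bytes. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Move a caret (a byte offset into s) one character to the left by
   stepping back over any continuation bytes. */
size_t utf8_caret_left(const char *s, size_t pos)
{
    if (pos == 0) return 0;
    do { pos--; } while (pos > 0 && ((unsigned char)s[pos] & 0xC0) == 0x80);
    return pos;
}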
Yes, it’s possible. But it doesn’t solve many of the problems. IMHO, Ben is right – the only sensible way is to create mirror territories and switch the system alphabet to UTF-8. |
Chris Hall (132) 3554 posts |
There’s more work to do for UTF-8 than there appears at first sight. At present typing … More tricky is … |
WPB (1391) 352 posts |
UTF-8 support in BASIC is a whole other topic! And one left well alone for now probably! Note that under RISC OS (5), only the Font Manager has Unicode support. The CLI/system font has none.
No, in UTF-8 encoding, you can’t use any single byte code above &7F to mean anything other than what it was intended to mean in UTF-8. Mixing two encodings at the same time is not going to work. &A3 does mean something – it means it’s a byte from the middle or end of a sequence. If you start trying to interpret codes differently depending on where they occur in a sequence (in this case, you’re saying if it comes at the beginning of a sequence you know it’s not UTF-8), you break many of the great things about the UTF-8 encoding – like being able to recover from errors in the stream, or being able to move quickly about in the stream.
There is no ambiguity, because you must specify the encoding you’re talking about. If you’re talking about UTF-8, you know exactly what these codes mean. |
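The self-synchronising property described above can be shown in a couple of lines of C (illustrative name): from any byte offset – including one landing in the middle of a character, or just after a corrupted byte – you can find the start of the next character simply by skipping continuation bytes, losing at most the one damaged character.

#include <stddef.h>

/* Advance pos to the start of the next UTF-8 character (or to len).
   At most three continuation bytes are ever skipped. */
size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    while (pos < len && (buf[pos] & 0xC0) == 0x80)
        pos++;
    return pos;
}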
Chris Hall (132) 3554 posts |
and so the pound sign should appear correctly under both systems In a sense you are right. So long as the user can type the pound sign and it appears correctly, it doesn’t matter what goes on ‘under the bonnet’. I would extend this to typing ALT-163 and to displaying correctly existing strings containing “£”. You just run into difficulty when concatenating strings of different encodings, when one string will have to be forced into UTF-8 as part of the concatenation process. Mixed encodings in a single string would not be permitted, and the encoding – whether ASCII, UTF-8, invalid [mixed] or neutral [no top-bit-set characters] – can be determined by examining the string contents. It is just fortunate [if Word could get it wrong then so, probably, could competent programmers] that &A3 (in everything except Word) corresponds to U+00A3 and
UTF-8 support in BASIC is a whole other topic! An essential one, if wider support for UTF-8 within RISC OS is being seriously considered. |