Further thinking on Internationalisation
Pages: 1 2
Steve Drain (222) 1620 posts |
I thought I would add that the single module could do multiple territories. Initially I implemented these as instantiations, but realised that this was an elaboration too far. ;-)
Do you have an API for these? If so, it might encourage me to return to Region. My versions would have been pretty limited, although I had adopted the use of ISO codes. I made quite a bit of use of ResourceFS and Messages files. Although the Territory system could best be replaced wholesale, I think it is going to be essential to support the current implementation as well. So, what about territory/country numbers? The method I used was to have two-letter codes that would also be two-byte numbers. Comments? |
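A minimal sketch of the "two-letter codes that would also be two-byte numbers" idea, for illustration only. The function names are hypothetical, and the byte order (first letter in the low byte, as if the two characters were read from memory as a little-endian halfword) is an assumption; the post does not say which order was used.

```python
def code_to_number(code: str) -> int:
    """Pack a two-letter code such as 'GB' into a 16-bit number."""
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError("expected a two-letter ASCII code")
    first, second = code.upper().encode("ascii")
    # Assumed layout: first letter in the low byte, second in the high byte.
    return first | (second << 8)


def number_to_code(number: int) -> str:
    """Unpack a 16-bit number back into its two-letter code."""
    if not 0 <= number <= 0xFFFF:
        raise ValueError("expected a 16-bit number")
    return bytes([number & 0xFF, number >> 8]).decode("ascii")


print(hex(code_to_number("GB")))                # 0x4247
print(number_to_code(code_to_number("pl")))     # PL
```

Numbers built this way need 16 bits, so they would not fit a single CMOS byte or the existing 7-bit Country number; some mapping onto the current numbering would still be needed.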
nemo (145) 2546 posts |
Firstly

Let’s start with a healthy dose of truth – existing code adopts one of these four strategies:
There’s a few outliers that do weirder things, but that’s broadly it. Whatever one does, the above has to be respected. Number 2 is particularly frustrating because the Messages file will usually be accessed directly or via Country/Territory I don’t think it has ever been defined what it would mean if these two things were different. Country is limited to “7bit” by an OS_Byte, but both it and Territory have 8bits of CMOS. So you can configure your machine to have different values. As I say, I have no idea what that would mean. However, the above fundamental truth means that in reality they must always be the same (or broadly compatible – I’ll come back to the 7bit issue later).

Language is not a binary choice

I am using a Language Preference String in a SysVar because language is not an all or nothing thing – “Give me Patagonian Spanish or give me death” (in which “death” means British English). So it is essential that a range of languages be selectable, and individually configurable by the user. The intention is to make MessageTrans_OpenFile, Wimp_OpenTemplate (and whatever the Toolbox SWI is) adaptable, so even where the program has made a decision on which resource to access via the Country/Territory number, the system is smarter and will select a set of resources compatible with the Language Preference String. Although I have multiple merging resources accessed via LPS working, I have not prototyped the MessageTrans/Wimp (and probably SpriteOp) smarts yet, so unexpected difficulties may remain.

Alphabet

Territory has traditionally mandated Alphabet, and changing Alphabet means that Territory then gets things wrong. Separating Alphabet from Territory is broadly speaking the right thing to do now that the UTF-8 alphabet exists – the idea of having two versions of every Territory so you can have one with the ‘old’ 8-bit Alphabet and one with UTF-8 (which has been proposed) is daft.
Things like Collation cannot be done by Alphabet alone – there are Language/Script/Locale issues too that change alphabetic ordering. So really we have to get away from the tight Alphabet/Territory bind, and Unicode solves that. In the majority of cases Territory becomes very much simpler, but in a few cases it will have to impose local rules on collation (take Spanish for example, which has ‘traditional sort order’ and ‘international sort order’ – the Spanish Territory of course doesn’t give RISC OS the choice, and it absolutely should).

Territory

So the existing Territory API will continue to exist, will get better at doing what it has always claimed to do (eg by adapting to Alphabet where currently it doesn’t), and will mostly delegate actual functionality to other systems such as Alphabet, UnicodeSupport and so on. There’s no reason why the “Worldbook” stuff can’t be built-in to Territory for all known territories and only require additional specific territory modules to provide the exceptions to the otherwise universal rules. I have a plan for a GUI that allows the user to choose a country/timezone/language without having to resort to country codes or territory numbers, but there will have to be a decision made about what Territory numbers are for, going forward.

Territory/Country numbers

On the one hand, these are mostly used for language selection, and we have an alternative to that (LPS) and should be able to adapt to any existing code patterns that employ those numbers. However, we still need to be able to record where somebody is. Last time I looked, ISO3166 listed 248 different ‘countries’, and that doesn’t include timezones or significant cultural differences. And if you look at the IANA Language/Script/Region classification there’s well over 9,000 of them, so numbering languages is a silly idea. There’s about 180 different timezones too (by which I mean timezone per country).
Territory numbers have been coined which cannot easily be used as Country numbers. OS_Byte,70 (Read/Write Country Number) would seem to reserve only 0 and 127 to mean “default” and “read” respectively, but unfortunately OS_Byte,71 causes difficulty.

OS_Byte,71

This call is for setting the Alphabet or selecting the Keyboard Handler “by number”. Unfortunately, though Alphabets have numbers, OS_Byte,71 always treats the number as being a Country number first. So really this call should be called “Select Alphabet by Country Number” and “Select Keyboard Handler by Country Number” (the latter when b7 is set). This means that Country Numbers are not only limited to 7bits but, with 0 and 127 reserved, to 126 minus the number of Alphabet numbers. Alphabet numbers certainly stretch from 95-121, and there’s good reason to think 122-126 should be included too. So that means there can only be 93 countries, and Territory numbers that have already been coined won’t work with this OS_Byte, such as Hungary, Ukraine or Poland. Poland! This clearly needs fixing. Alphabets, Countries and Territories are provided via open interfaces (eg the International service) so there is no need to limit OS_Byte, but we do need a way to map valid numbers onto that existing interface, and I suppose that It would have been a lot neater if b7 had been avoided in all Country/Territory numbers, but that would have reduced the CMOS bytes to a definite 7b.

TL;DR

Language is best selected by a Language Preference String containing multiple acceptable options. Note that being able to select regional languages makes much of the regionality of Territory redundant – if you select “en-US” you’ll want “tire”, “curb” and “mm/dd/yyyy” regardless of your Territory setting, probably. |
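To make the OS_Byte,71 number squeeze described in the post concrete, here is a rough sketch that only encodes the ranges stated there: b7 selects the keyboard handler, 0 and 127 are reserved, and 95-126 is taken as Alphabet territory. The function names are hypothetical and this is not how the OS itself decides anything; it is just the arithmetic of the post written down.

```python
# Ranges as described in the post; treat them as assumptions, not a spec.
RESERVED = {0: "default", 127: "read"}
ALPHABET_NUMBERS = range(95, 127)  # 95-121 certain, 122-126 arguably too


def split_os_byte_71(value: int) -> tuple[bool, int]:
    """Split the 8-bit argument into (selects_keyboard, 7-bit number)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("argument must fit in one byte")
    return bool(value & 0x80), value & 0x7F


def classify_number(number: int) -> str:
    """Say which namespace a 7-bit number falls into."""
    if not 0 <= number <= 0x7F:
        raise ValueError("number must fit in seven bits")
    if number in RESERVED:
        return RESERVED[number]
    if number in ALPHABET_NUMBERS:
        return "alphabet"
    return "country"


print(split_os_byte_71(0x80 | 28))  # (True, 28): keyboard handler, number 28
print(classify_number(28))          # country
print(classify_number(101))         # alphabet
```

Any Territory number that lands at 95 or above collides with the Alphabet range under this scheme, which is the problem the post describes for the numbers already coined.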
nemo (145) 2546 posts |
BTW. There is a Keyboard Handler function to “select keyboard (handler) by dial code”. This has always been bad because it requires every keyboard handler to know the dial code and country number of every other keyboard handler. I’ve spun that out into an International service (convert dial code to country number) so that keyboard handlers don’t have to be clairvoyant. In fact, “dial code” is inaccurate, as there are already codes to select keyboards that are not associated with a real dial code, and no dial code in existence can tell the difference between the USA and Canada, for example. So it’s better to think of it as “select keyboard by some kind of number” and generalise it. |
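One way to picture the "convert dial code to country number" service, sketched under heavy assumptions: each keyboard handler registers only its own selector number, and a shared lookup resolves it, so no handler needs to know about any other. The class and method names are hypothetical, and the selector and country values in the usage lines are illustrative rather than taken from any real handler.

```python
from typing import Optional


class SelectorRegistry:
    """Shared lookup from 'some kind of number' to a country number."""

    def __init__(self) -> None:
        self._countries: dict[int, int] = {}

    def register(self, selector: int, country: int) -> None:
        """Called by a keyboard handler for its own selector only."""
        existing = self._countries.get(selector)
        if existing is not None and existing != country:
            # Two handlers claiming one selector cannot both be reached by it.
            raise ValueError(f"selector {selector} already maps to {existing}")
        self._countries[selector] = country

    def country_for(self, selector: int) -> Optional[int]:
        """Return the country number, or None if nobody claimed the selector."""
        return self._countries.get(selector)


registry = SelectorRegistry()
registry.register(44, 1)   # illustrative: selector 44 -> country number 1
registry.register(33, 6)   # illustrative: selector 33 -> country number 6
print(registry.country_for(44))   # 1
print(registry.country_for(999))  # None
```

The conflict check is where the USA/Canada problem shows up: a real dial code shared by two countries cannot serve as a unique selector, which is why treating it as an arbitrary number is the more honest model.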
nemo (145) 2546 posts |
Chris teased
Have you? Or have you seen a Maori Territory number? If there’s an actual Territory module I’d love to get hold of a copy. |
Chris Mahoney (1684) 2165 posts |
It was some time ago so I can’t say for sure, but it was probably only the number. |
Rick Murray (539) 13840 posts |
Yes it is. And I think I’m the one that proposed it. Until Territory is torn up and done better, until the Wimp is able to cope with clients that may be Latin-something or UTF-8 at the same time, or until FontManager is smart enough to fall back to reading invalid UTF-8 sequences as Latin1… We’re stuck with a mess. English language people don’t really have a need for UTF-8 as all their stuff is in Latin1. Non-English speakers might, but it’s currently a binary choice and all the many types of accents are broken whichever way you look at it.
Honestly? I’d say to hijack “Default” to mean “look elsewhere” where you can list your better thought out options with a new identification system. Anything old that doesn’t understand that will probably just assume Default means English. It’s better than trying to sort out the mess that is the current numbering, where you can consider a Dvorak keyboard to be its own independent democratic country… You don’t need to store any of this in CMOS these days. It’ll need resources in Boot, so might as well be set up there too.
It’s a mess.
CET/CEST? Or do the French write it backwards? ;-) [aside: amuses me how the Americans don’t like to refer to Europe, so they come up with weird things like Romance Standard Time. Romance time is apparently 5.48am. Not for me it ain’t, I’m so not a morning person. That’s probably why I’m still single.]
Is this really possible without making restrictions? I’d still say “support the old numbering as usefully as possible” (so Peru looks like Spanish…) and use Default to mean “look it up with the new API or assume English if you’re too old to know about that”. Another quirky little language issue: Google Docs grammar check, knowing my setup is British English, still faults me putting my punctuation outside of quotation (where it bloody belongs). Sometimes it even tries to tell me that I don’t know how to spell colour or realise.
Plus you can easily support Ireland-post-reunification and Scotland (as an independent country) later on without rejigging every single keyboard handler. |
nemo (145) 2546 posts |
💩
That sort of thing, though “LatinAm” (28) is defined… but which immediately runs into the fault that nothing will have Resources for 28, but may have for 5, which I take to be your point. I’m very hopeful (which is as far as I can go without having coded it) that cleverness in a few existing SWIs will mean the Country/Territory number can be largely ignored. But we’ll see. It may be prudent to define 50 of them to cover most of the planet and leave it at that.
American practices. Here in the civilised world the punctuation only belongs inside the quotation if the punctuation is part of that quotation. “I love you very much, my dear Beaver.” is a valid quotation, but not “We hold these truths to be self evident.” |
Clive Semmens (2335) 3276 posts |
Yay for sanity! |
Rick Murray (539) 13840 posts |
Lucky I saw that earlier on my phone, or I’d only have “f4 a9” to look at. I’d like you to point out what is in UTF-8 that is necessary for the English language. Answer? Nothing. That’s why, until recently, a lot of websites were ISO 8859-1 (but were actually CP1252, annoying for those of us who actually knew the difference). Until recently-ish, a lot of Windows programs used the ANSI 8-bit character set API. Even today, I’ll bet that most people around here will have their RISC OS running in Latin1 rather than UTF-8.
<clunk> That’s the sound of you falling into my trap. I know “LatinAm” is an available country, but it is worth noting that it defines neither a country (it basically reaches from Texas to Antarctica and Google tells me there are 33 countries), nor does it define a language (they don’t all speak Spanish), nor does it define a timezone or…
Almost. My point is that your whizzy version (I’m not being facetious, anything will be better) should understand that Peru and Nicaragua are different (as any coffee drinker will tell you), but “for the existing API” it should probably just fake being Spanish. Think about it – with the current existing API, the following are the only countries that need to be understood and recognised: I know, you’ll be asking about all the others. Why not Ireland? Why not Korea? Why not USA? You don’t need to worry about Russia or Israel. Yes, they have defined country numbers. I don’t see anything for them in the Territory modules source directory. Hence, never used. Really, to be honest, I think I’d try to distil everything down to English, French, Spanish, German, or Italian (I note that there’s no Portuguese Territory!) when it comes to the old API. Which is what I meant. The new API can know and understand all about Peru. The old API can just say “Spanish”. But the user will probably have their software in English because there are not so many Spanish language resources around inside apps. Hopefully with the new Debugger, this might start to change (or would they prefer Euskadi?). That said, it’s better dealt with in a new manner so they have a choice of Basque → Spanish → something else → English depending on language support in the program. Not something the current API is up to.
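The Basque → Spanish → English chain, and the Language Preference String idea from earlier in the thread, can be sketched roughly as follows. The comma-separated format, the tag spellings and the function names are all assumptions made for illustration; the thread does not define the actual LPS syntax.

```python
def expand(tag: str) -> list[str]:
    """Turn 'es-PE' into ['es-PE', 'es'], most specific first."""
    parts = tag.split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def choose_language(preferences: str, available: set[str],
                    default: str = "en-GB") -> str:
    """Pick the first preferred language the program has resources for.

    `preferences` is a hypothetical comma-separated preference string,
    listed from most to least wanted. `default` is the fallback of last
    resort ("death", in the wording used earlier in the thread).
    """
    for raw in preferences.split(","):
        tag = raw.strip()
        if not tag:
            continue
        for candidate in expand(tag):
            if candidate in available:
                return candidate
    return default


# A program that ships only Spanish, French and English resources:
print(choose_language("eu, es-PE, en", {"es", "fr", "en"}))  # es
# A program with nothing the user asked for:
print(choose_language("eu", {"fr"}))                         # en-GB
```

The point of trying each tag's less specific forms before moving on is that a Peruvian Spanish user gets plain Spanish resources, where they exist, ahead of English.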
Exactly that. But, then, I understand British quoting is also known as “logical” quoting. Which means American is surely….. ;-) |