Further thinking on Internationalisation

34 posts, 8 voices


Nov 13, 2019 10:18am Steve Drain (222) 1620 posts	“a single client module … could reproduce what individual territory modules do”
I thought I would add that the single module could do multiple territories. Initially I implemented these as instantiations, but realised that this was an elaboration too far. ;-)
“I have an Alphabet module”
“I’ve put in an allocation for the ‘Language’ module”
Do you have an API for these? If so, it might encourage me to return to Region. My versions would have been pretty limited, although I had adopted the use of ISO codes. I made quite a bit of use of ResourceFS and Messages files.
Although the Territory system could best be replaced wholesale, I think it is going to be essential to support the current implementation as well. So, what about territory/country numbers? The method I used was to have two-letter codes that would also be two-byte numbers. Comments?
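The "two-letter codes that would also be two-byte numbers" idea can be illustrated roughly as follows. This is only a sketch: the function names are hypothetical, and the little-endian byte order is an assumption, since the post does not say how the bytes were packed.

```python
def code_to_number(code: str) -> int:
    """Pack a two-letter code such as 'GB' into a 16-bit integer.

    Assumption: first letter in the low byte (little-endian).
    """
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError("expected exactly two ASCII letters")
    return int.from_bytes(code.upper().encode("ascii"), "little")


def number_to_code(number: int) -> str:
    """Recover the two-letter code from its 16-bit number."""
    return number.to_bytes(2, "little").decode("ascii")
```

For instance, `code_to_number("GB")` gives `0x4247` under this byte order, and `number_to_code` reverses it. Note that such values need 16 bits of storage.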

Nov 13, 2019 4:42pm nemo (145) 2546 posts	Firstly
Let’s start with a healthy dose of truth – existing code adopts one of these four strategies:
1. No attempt at internationalisation whatsoever. Especially modules.
2. “I put my text in a Messages file, what more do you want?”
3. Reads the Country and uses the number or name as a file or directory name.
4. Reads the Territory and uses the number or name as a file or directory name.
There’s a few outliers that do weirder things, but that’s broadly it. Whatever one does, the above has to be respected. Number 2 is particularly frustrating because the Messages file will usually be accessed directly or via `<Obey$Dir>`, and not a path variable that could be redirected. This is ‘cargo-cult’ programming sadly, caused by lack of guidance in the early days (I do have a strategy for it).
Country/Territory
I don’t think it has ever been defined what it would mean if these two things were different. Country is limited to “7bit” by an OS_Byte, but both it and Territory have 8bits of CMOS. So you can configure your machine to have different values. As I say, I have no idea what that would mean. However, the above fundamental truth means that in reality they must always be the same (or broadly compatible – I’ll come back to the 7bit issue later).
Language is not a binary choice
I am using a Language Preference String in a SysVar because language is not an all or nothing thing – “Give me Patagonian Spanish or give me death” (in which “death” means British English). So it is essential that a range of languages be selectable, and individually configurable by the user. The intention is to make MessageTrans_OpenFile, Wimp_OpenTemplate (and whatever the Toolbox SWI is) adaptable, so even where the program has made a decision on which resource to access via the Country/Territory number, the system is smarter and will select a set of resources compatible with the Language Preference String.
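To make the Language Preference String idea concrete, here is a minimal sketch of how a resource might be picked from one. The thread does not define the string's format, so a comma-separated list of language tags, most preferred first, is assumed; the function name and the subtag-trimming rule are likewise hypothetical. The British English fallback follows the “death” remark above.

```python
from typing import Iterable


def choose_language(lps: str, available: Iterable[str], default: str = "en-GB") -> str:
    """Return the best available language tag for a preference string.

    Assumed format: comma-separated tags, most preferred first,
    e.g. "es-AR, es, fr".
    """
    # Map lower-cased tags back to the spelling used by the resource set.
    have = {tag.lower(): tag for tag in available}
    for wanted in lps.split(","):
        parts = wanted.strip().lower().split("-")
        # Try the full tag first, then progressively drop trailing subtags.
        while parts and parts[0]:
            candidate = "-".join(parts)
            if candidate in have:
                return have[candidate]
            parts.pop()
    return default
```

With resources for `en-GB`, `fr` and `es`, the string `"es-AR, es, fr"` selects `es`; a string naming only unavailable languages falls through to the default. Trimming a tag before moving on to the next preference is a design choice, not something the thread specifies.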
Although I have multiple merging resources accessed via LPS working, I have not prototyped the MessageTrans/Wimp (and probably SpriteOp) smarts yet, so unexpected difficulties may remain.
Alphabet
Territory has traditionally mandated Alphabet, and changing Alphabet means that Territory then gets things wrong. Separating Alphabet from Territory is broadly speaking the right thing to do now that the UTF-8 alphabet exists – the idea of having two versions of every Territory so you can have one with the ‘old’ 8-bit Alphabet and one with UTF-8 (which has been proposed) is daft. Things like Collation cannot be done by Alphabet alone – there are Language/Script/Locale issues too that change alphabetic ordering. So really we have to get away from the tight Alphabet/Territory bind, and Unicode solves that. In the majority of cases Territory becomes very much simpler, but in a few cases it will have to impose local rules on collation (take Spanish for example, which has ‘traditional sort order’ and ‘international sort order’ – the Spanish Territory of course doesn’t give RISC OS the choice, and it absolutely should).
Territory
So the existing Territory API will continue to exist, will get better at doing what it has always claimed to do (eg by adapting to Alphabet where currently it doesn’t), and will mostly delegate actual functionality to other systems such as Alphabet, UnicodeSupport and so on. There’s no reason why the “Worldbook” stuff can’t be built-in to Territory for all known territories and only require additional specific territory modules to provide the exceptions to the otherwise universal rules. I have a plan for a GUI that allows the user to choose a country/timezone/language without having to resort to country codes or territory numbers, but there will have to be a decision made about what Territory numbers are for, going forward.
Territory/Country numbers
On the one hand, these are mostly used for language selection, and we have an alternative to that (LPS) and should be able to adapt to any existing code patterns that employ those numbers. However, we still need to be able to record where somebody is. Last time I looked, ISO3166 listed 248 different ‘countries’, and that doesn’t include timezones or significant cultural differences. And if you look at the IANA Language/Script/Region classification there’s well over 9,000 of them, so numbering languages is a silly idea. There’s about 180 different timezones too (by which I mean timezone per country). Territory numbers have been coined which cannot easily be used as Country numbers. OS_Byte,70 (Read/Write Country Number) would seem to reserve only 0 and 127 to mean “default” and “read” respectively, but unfortunately OS_Byte,71 causes difficulty.
OS_Byte,71
This call is for setting the Alphabet or selecting the Keyboard Handler “by number”. Unfortunately, though Alphabets have numbers, OS_Byte,71 always treats the number as being a Country number first. So really this call should be called “Select Alphabet by Country Number” and “Select Keyboard Handler by Country Number” (the latter when b7 is set). This means that Country Numbers are not only limited to 7bits but, with 0 and 127 reserved, to 126 minus the number of Alphabet numbers. Alphabet numbers certainly stretch from 95-121, and there’s good reason to think 122-126 should be included too. So that means there can only be 93 countries, and Territory numbers that have already been coined won’t work with this OS_Byte, such as Hungary, Ukraine or Poland. Poland! This clearly needs fixing. Alphabets, Countries and Territories are provided via open interfaces (eg the International service) so there is no need to limit OS_Byte, but we do need a way to map valid numbers onto that existing interface, and I suppose that `R1=((num<<1)ANDNOT255)OR(num AND 127)` will have to do.
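The bit manipulation in that mapping can be checked with a few lines of Python. This is a sketch of the arithmetic only, not of OS_Byte itself: the function names are made up, and the inverse is one reading of how the original number would be recovered. The low seven bits stay where they are, any higher bits move up by one, and bit 7 is left free for the keyboard-handler flag.

```python
from typing import Tuple


def encode_country(num: int, keyboard: bool = False) -> int:
    """Apply R1 = ((num << 1) AND NOT 255) OR (num AND 127).

    Bits 0-6 of num are kept, bits 7 and above shift up by one,
    leaving bit 7 of the result free as the keyboard-handler flag.
    """
    r1 = ((num << 1) & ~0xFF) | (num & 0x7F)
    if keyboard:
        r1 |= 0x80
    return r1


def decode_country(r1: int) -> Tuple[int, bool]:
    """Recover (num, keyboard_flag) from an encoded R1 value."""
    num = ((r1 >> 1) & ~0x7F) | (r1 & 0x7F)
    return num, bool(r1 & 0x80)
```

Numbers below 128 pass through unchanged, so existing callers would see no difference, while a number such as 130 becomes 258 and decodes back to 130 whether or not the flag is set.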
OS_Byte can then recreate the actual number having switched functionality by b7 as specified. It would have been a lot neater if b7 had been avoided in all Country/Territory numbers, but that would have reduced the CMOS bytes to a definite 7 bits.
TL;DR
Language is best selected by a Language Preference String containing multiple acceptable options.
TimeZone is usually selected per-country, but that’s a UI detail. Actual functionality is simple.
Country/Territory will become a cipher for Region – encapsulating cultural differences that are not switched by language.
Note that being able to select regional languages makes much of the regionality of Territory redundant – if you select “en-US” you’ll want “tire”, “curb” and “mm/dd/yyyy” regardless of your Territory setting, probably.

Nov 13, 2019 4:49pm nemo (145) 2546 posts	BTW. There is a Keyboard Handler function to “select keyboard (handler) by dial code”. This has always been bad because it requires every keyboard handler to know the dial code and country number of every other keyboard handler. I’ve spun that out into an International service (convert dial code to country number) so that keyboard handlers don’t have to be clairvoyant. In fact, “dial code” is inaccurate, as there are already codes to select keyboards that are not associated with a real dial code, and no dial code in existence can tell the difference between the USA and Canada, for example. So it’s better to think of it as “select keyboard by some kind of number” and generalise it.
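The shape of such a lookup service might be sketched like this. It is not the actual International service interface, which is not shown in the thread: the class and method names are hypothetical, and the selector 999 is an invented placeholder for a keyboard with no real dial code. Country numbers 5 and 28 are the ones mentioned elsewhere in this thread.

```python
from typing import Dict, Optional


class SelectorRegistry:
    """Central table mapping a keyboard selector number to a country number.

    Each keyboard handler registers only its own entry, so no handler
    needs to know about any other handler.
    """

    def __init__(self) -> None:
        self._by_selector: Dict[int, int] = {}

    def register(self, selector: int, country: int) -> None:
        """Record that this selector number picks this country's keyboard."""
        self._by_selector[selector] = country

    def country_for(self, selector: int) -> Optional[int]:
        """Return the country number for a selector, or None if unclaimed."""
        return self._by_selector.get(selector)


registry = SelectorRegistry()
registry.register(34, 5)    # 34 is Spain's dial code; 5 as used in this thread
registry.register(999, 28)  # hypothetical selector for "LatinAm" (28)
```

A handler asked to switch by number would call `registry.country_for(number)` and act on the result, treating `None` as "nobody claims this number".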

Nov 13, 2019 4:53pm nemo (145) 2546 posts	Chris teased “I’ve seen a Maori territory”
Have you? Or have you seen a Maori Territory number? If there’s an actual Territory module I’d love to get hold of a copy.

Nov 13, 2019 7:50pm Chris Mahoney (1684) 2165 posts	It was some time ago so I can’t say for sure, but it was probably only the number.

Nov 13, 2019 8:50pm Rick Murray (539) 13840 posts	“the idea of having two versions of every Territory so you can have one with the ‘old’ 8-bit Alphabet and one with UTF-8 (which has been proposed) is daft”
Yes it is. And I think I’m the one that proposed it. Until Territory is torn up and done better, until the Wimp is able to cope with clients that may be Latin-something or UTF-8 at the same time or until FontManager is smart enough to fall back to Latin1 for invalid UTF sequences… We’re stuck with a mess. English language people don’t really have a need for UTF-8 as all their stuff is in Latin1. Non-English speakers might, but it’s currently a binary choice and all the many types of accents are broken whichever way you look at it.
“So that means there can only be 93 countries”
Honestly? I’d say to hijack “Default” to mean “look elsewhere” where you can list your better thought out options with a new identification system. Anything old that doesn’t understand that will probably just assume Default means English. It’s better than trying to sort out the mess that is the current numbering, where you can consider a Dvorak keyboard to be its own independent democratic country… You don’t need to store any of this in CMOS these days. It’ll need resources in Boot, so might as well be set up there too.
“Poland! This clearly needs fixing.”
It’s a mess.
“TimeZone is usually selected per-country,”
CET/CEST? Or do the French write it backwards? ;-) [aside: amuses me how the Americans don’t like to refer to Europe, so they come up with weird things like Romance Standard Time. Romance time is apparently 5.48am. Not for me it ain’t, I’m so not a morning person. That’s probably why I’m still single.]
“but we do need a way to map valid numbers onto that existing interface,”
Is this really possible without making restrictions?
I’d still say “support the old numbering as usefully as possible” (so Peru looks like Spanish…) and use Default to mean “look it up with the new API or assume English if you’re too old to know about that”.
Another quirky little language issue: Google Docs grammar check, knowing my setup is British English, still faults me putting my punctuation outside of quotation (where it bloody belongs). Sometimes it even tries to tell me that I don’t know how to spell colour or realise. So if Google, with their infamous you-gotta-be-a-genius recruitment tests can’t get this right… just shows that coping with different countries/regions involves a little more than switching language tokens and formatting numbers differently (wasn’t there a recent bug in SciCalc not coping with ‘,’ as a decimal separator in French?).
“so that keyboard handlers don’t have to be clairvoyant.”
Plus you can easily support Ireland-post-reunification and Scotland (as an independent country) later on without rejigging every single keyboard handler.
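The FontManager fallback wished for in this post (reading invalid UTF-8 sequences as Latin1) can be illustrated in a few lines. This is a sketch of the decoding rule only, using Python's standard `codecs` error-handler hook; it says nothing about how FontManager itself would do it, and the handler name is made up.

```python
import codecs


def _latin1_fallback(err):
    """Decode the bytes that failed as UTF-8 using Latin-1 instead."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Resume decoding as UTF-8 immediately after the offending bytes.
    return bad.decode("latin-1"), err.end


codecs.register_error("latin1-fallback", _latin1_fallback)


def decode_mixed(data: bytes) -> str:
    """Decode as UTF-8, treating any invalid bytes as Latin-1 characters."""
    return data.decode("utf-8", errors="latin1-fallback")
```

Text from an "old" Latin1 client such as `b"caf\xe9"` and text from a "new" UTF-8 client both come out as `café`. The approach is heuristic: a Latin1 byte pair that happens to form valid UTF-8 would be read as UTF-8.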

Nov 13, 2019 9:31pm nemo (145) 2546 posts	“English language people don’t really have a need for UTF-8”
💩
“so Peru looks like Spanish…”
That sort of thing, though “LatinAm” (28) is defined… but which immediately runs into the fault that nothing will have Resources for 28, but may have for 5, which I take to be your point. I’m very hopeful (which is as far as I can go without having coded it) that cleverness in a few existing SWIs will mean the Country/Territory number can be largely ignored. But we’ll see. It may be prudent to define 50 of them to cover most of the planet and leave it at that.
“still faults me putting my punctuation outside of quotation (where it bloody belongs)”
American practices. Here in the civilised world the punctuation only belongs inside the quotation if the punctuation is part of that quotation. “I love you very much, my dear Beaver.” is a valid quotation, but not “We hold these truths to be self evident.”

Nov 14, 2019 6:45am Clive Semmens (2335) 3276 posts	“Here in the civilised world the punctuation only belongs inside the quotation if the punctuation is part of that quotation.”
Yay for sanity!

Nov 14, 2019 7:37pm Rick Murray (539) 13840 posts	<poop emoji>
Lucky I saw that earlier on my phone, or I’d only have “f4 a9” to look at. I’d like you to point out what is in UTF-8 that is necessary for the English language. Answer? Nothing. That’s why, until recently, a lot of websites were ISO 8859-1 (but were actually CP1252, annoying for those of us who actually knew the difference). Until recently-ish, a lot of Windows programs used the ANSI 8 bit character set API. Even today, I’ll bet that most people around here will have their RISC OS running in Latin1 rather than UTF-8. I’m not saying UTF-8 is useless, I’m simply saying that there is not a pressing need to have it for English (we don’t use macrons, carons, the-Basque-A, etc) which is probably why not so much has been done. Heck, the Wimp can’t even handle “old” apps (Latin-something) and “new” apps (UTF-8). FontManager can’t even sensibly switch fonts to support text that isn’t part of the current font (like switching to Cyberbit for some Hebrew, for instance). Hell, FontManager can’t even fall back to an old mapping when it meets invalid UTF sequences. And these aren’t the weird things you’ve posted about in the past, like… was it Farsi? These are basic things.
“That sort of thing, though ‘LatinAm’ (28) is defined…”
<clunk> That’s the sound of you falling into my trap. I know “LatinAm” is an available country, but it is worth noting that it defines neither a country (it basically reaches from Texas to Antarctica and Google tells me there are 33 countries), nor does it define a language (they don’t all speak Spanish), nor does it define a timezone or… Basically, Wales (that isn’t really a “country”) has two country allocations, a keyboard mapping has two, and “LatinAm” points at an entire continent (or the lower half, depending on whether you see it as “The Americas” or a separate north and south). Which is probably only useful if your country is a continent. Hello Australia.
:-)
“but which immediately runs into the fault that nothing will have Resources for 28, but may have for 5, which I take to be your point.”
Almost. My point is that your whizzy version (I’m not being facetious, anything will be better) should understand that Peru and Nicaragua are different (as any coffee drinker will tell you), but “for the existing API” it should probably just fake being Spanish. Think about it – with the current existing API, the following are the only countries that need to be understood and recognised: Australia, France, Germany, Italy, Norway (maybe), South Africa (maybe), Spain, and UK.
I know, you’ll be asking about all the others. Why not Ireland? Why not Korea? Why not USA? I’ll tell you why: Ireland was never updated to use the euro, so clearly either nobody uses it or nobody cares. Korea… IME? Font support? It was probably used once, not these days. And as for America, the link I gave a few days ago was to fix a bug in CLib that got the timezone handling fundamentally wrong in all C programs – so either nobody uses the USA Territory, or they all live on the East Coast. You don’t need to worry about Russia or Israel. Yes, they have defined country numbers. I don’t see anything for them in the Territory modules source directory. Hence, never used.
Really, to be honest, I think I’d try to distil everything down to English, French, Spanish, German, or Italian (I note that there’s no Portuguese Territory!) when it comes to the old API. Which is what I meant. The new API can know and understand all about Peru. The old API can just say “Spanish”. But the user will probably have their software in English because there aren’t so many Spanish language resources around inside apps. Hopefully with the new Debugger, this might start to change (or would they prefer Euskadi?). That said, it’s better dealt with in a new manner so they have a choice of Basque → Spanish → something else → English depending on language support in the program.
Not something the current API is up to. So just fake it for the old API, and those of us writing programs can embrace something that is more useful.
“if the punctuation is part of that quotation.”
Exactly that. But, then, I understand British quoting is also known as “logical” quoting. Which means American is surely….. ;-)
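The "fake it for the old API" suggestion from this post could look something like the sketch below. Everything here is illustrative: the table, the names and the use of ISO 3166 two-letter region codes as keys are assumptions. Country number 5 is the Spanish one referred to earlier in the thread, and 0 is the value OS_Byte,70 treats as “default”.

```python
LEGACY_DEFAULT = 0  # "default": old code will most likely assume English
LEGACY_SPAIN = 5    # the country number that existing Spanish resources use

# Hypothetical table: regions the new scheme understands, mapped to the
# nearest country number that old software is likely to have resources for.
LEGACY_FOR_REGION = {
    "ES": LEGACY_SPAIN,
    "PE": LEGACY_SPAIN,  # Peru presents itself as Spanish to the old API
    "NI": LEGACY_SPAIN,  # Nicaragua likewise
}


def legacy_country(region: str) -> int:
    """Return the country number to report through the old API."""
    return LEGACY_FOR_REGION.get(region.upper(), LEGACY_DEFAULT)
```

New-style clients would keep the full region ("PE"), while anything reading the old country number would see 5 for Peru, or 0 for a region with no sensible legacy equivalent.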