Chars
Steffen Huber (91) 1949 posts |
You might be able to do that if there are illegal sequences (i.e. determine that it is not UTF-8), but in the generic case you cannot determine if a file is UTF-8-encoded or a single byte encoding. |
Rick Murray (539) 13806 posts |
You can also determine that a file is UTF-8 if you come across legal sequences. However, as I said – if the file has no high bit set characters (such as plain English with normal punctuation), you cannot tell any difference between UTF-8 and Latin1. That said, for such a file there is no difference… Having markers in the file adds complication. What should Edit do with such a thing? Show it? Hide it? Allow it to be edited? Insert it upon saving? The thing I like about Edit is that you see what is there – even if you’re looking at binary files. Speaking of binary files, the reason I installed Notepad++ on my PC was because I got fed up of various bits of Windows (such as the RTF handler) “deciding” that a binary file was some sort of Unicode and thus displaying the file in bits of random Chinese. At least Notepad++ can be told what sort of file it is, so I see what is there and not what some algorithm thinks ought to be there… Which is circular. We’re back to Edit. Showing what’s there. It can be useful, looking in a data file, you know. Days when magazine cover discs would unhelpfully provide files in Impression format. I’m an Ovation guy. So I used to dump them into Edit or Zap and just read the content straight out of the file. ;-) |
Frederick Bambrough (1372) 837 posts |
Chris,
Yes.
Haven’t seen that. I do get File ‘<Chars$Dir>.!Help’ not found at line 1800 on selecting Help from the icon bar menu. Chars exits. Desktop is using the standard Homerton font, though with an altered theme for the sprites. |
Steffen Huber (91) 1949 posts |
Every legal UTF-8 sequence is also a legal single byte encoding sequence. Just witness encoding auto detection in browsers – they often get it wrong, because it is an unsolvable problem. |
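To illustrate the point above, here is a minimal Python sketch. The sample word is an arbitrary choice, not something taken from the thread. The same bytes decode without error under both UTF-8 and Latin-1; they just come out as different characters.

```python
# The same bytes, read under two different encodings.
text = "café"
data = text.encode("utf-8")          # b'caf\xc3\xa9'

as_utf8 = data.decode("utf-8")       # round-trips to the original text
as_latin1 = data.decode("latin-1")   # also succeeds: every byte maps to some character

print(as_utf8)     # café
print(as_latin1)   # cafÃ©

# Pure ASCII is byte-for-byte identical in both encodings.
plain = "plain English".encode("ascii")
assert plain.decode("utf-8") == plain.decode("latin-1")
```

Neither decode raises an error, so the bytes alone cannot say which reading was intended.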
Chris (121) 472 posts |
Thanks Frederick.
OK, I’ve spoken to ROOL who also can’t reproduce the problems with running out of memory, etc. Could you report the results of these commands:
Are you using a standard ROM download from the site, rather than building your own? |
Rick Murray (539) 13806 posts |
While this is correct, you need to keep looking and not just judge based upon the first sequence found. I think if you encounter, say, ten UTF-8 sequences and no invalid high bit stuff, you may be able to have confidence in the file being UTF-8. It would surely be a very rare file that wasn’t UTF-8…while only containing valid UTF-8 sequences. |
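The heuristic described above might be sketched roughly as follows in Python. The function names, the result labels and the use of ten as a threshold are assumptions made for illustration; this is not code from Edit or any RISC OS component. It counts valid multi-byte UTF-8 sequences and invalid high-bit bytes, leaning on Python's strict decoder to reject overlong forms, surrogates and truncated sequences.

```python
from typing import Tuple


def utf8_evidence(data: bytes) -> Tuple[int, int]:
    """Return (valid multi-byte sequences, invalid high-bit bytes)."""
    valid = invalid = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # plain ASCII: no evidence either way
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 1                 # lead byte of a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2                 # lead byte of a 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3                 # lead byte of a 4-byte sequence
        else:
            invalid += 1             # stray continuation byte or illegal lead
            i += 1
            continue
        chunk = data[i:i + need + 1]
        try:
            chunk.decode("utf-8")    # strict decode of this one candidate
        except UnicodeDecodeError:
            invalid += 1
            i += 1
        else:
            valid += 1
            i += len(chunk)
    return valid, invalid


def guess(data: bytes, threshold: int = 10) -> str:
    """Hypothetical labels; 'probably' is a judgement, not a determination."""
    valid, invalid = utf8_evidence(data)
    if invalid:
        return "not UTF-8"
    if valid == 0:
        return "ASCII only"
    if valid >= threshold:
        return "probably UTF-8"
    return "undecided"
```

Only the "not UTF-8" answer is certain. A file that passes could still be a single byte encoding whose high-bit characters happen to form legal sequences, so the other answers express confidence at best.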
Steffen Huber (91) 1949 posts |
Experience says: no, not rare. Especially if your decision is not only “UTF-8 or ISO-8859-1(5)”, but also includes other single byte encodings. You can try to make an educated guess. It can be “judged”. But it cannot be determined. |
Rick Murray (539) 13806 posts |
That’s why I said confidence rather than absolute. It’s like science – it only takes one experiment to disprove something, but any number of “proofs” only increase confidence by virtue of the theory not having been disproven. ;-) |
Steffen Huber (91) 1949 posts |
You said “determine”. That’s why I responded at all. “Determine” is – according to my dictionary – not the same as “guess with some confidence”. |
Paul Sprangers (346) 523 posts |
But, cough… how does Windows do it then? Firefox, Thunderbird, Word – even the humblest notepad displays Unicode and I never noticed any failure. |
Steffen Huber (91) 1949 posts |
You are trying the wrong things :-) Firefox has no problem if proper HTML is used – after all, specifying the correct encoding is part of “proper HTML”. Now place a plain text file on your server, with a single byte encoding of your choice using high-bit characters. There are very good chances that Firefox “guesses” UTF-8 content. Thunderbird usually has no problem because modern emails usually carry the correctly specified encoding (or something like “quoted printable”). Give it an email with unspecified encoding, again single byte encoding with high-bit, and watch it fail miserably. Word always knows which encoding to use because it is either a default (old binary format) or explicitly specified (XML formats). Bottom line: guessing the encoding is difficult. |
Paul Sprangers (346) 523 posts |
Then only one conclusion seems to be left over: RISC OS should be rewritten so that it expects specified encodings in text files. This also seems to contradict Rick’s statement, which actually was mine too. But again, grappa and all that… |
Frederick Bambrough (1372) 837 posts |
Chris
*show Chars*
*Ex Resources:$.Apps.!Chars
Dir. Resources:$.Apps.!Chars Option 02 (Run)
CSD Resources:"Unset"
Lib. Resources:"Unset"
URD Resources:"Unset"
!Help WR/ Text 10:27:26 09-Jul-2016 5 kbytes
!Run WR/ Obey 10:27:23 09-Jul-2016 235 bytes
*Show Wimp*
Wimp$IconTheme : Bluberry.
Wimp$Scrap : SDFS::HardDisc0.$.!BOOT.Resources.!Scrap.ScrapDirs.ScrapDir.ScrapFile
Wimp$ScrapDir : SDFS::HardDisc0.$.!BOOT.Resources.!Scrap.ScrapDirs.ScrapDir
Wimp$State : desktop
*
After running Chars I get;
*Show Chars*
Chars$Dir : SDFS::HardDisc0.$.Public
Chars$Path : SDFS::HardDisc0.$.Public.,Resources:$.Resources.Chars.
Public being the dir I’m using for the altered !Run. Yup, I’m using the standard ROM. I wouldn’t know how to build one! |
Frederick Bambrough (1372) 837 posts |
Doh! It eventually occurred to me you want the results after a clean boot and without the changed !Run. Here it is.
*show Chars*
Chars$Dir : Resources:$.Apps.!Chars
Chars$Path : Resources:$.Apps.!Chars.,Resources:$.Resources.Chars.
*Ex Resources:$.Apps.!Chars
Dir. Resources:$.Apps.!Chars Option 02 (Run)
CSD Resources:"Unset"
Lib. Resources:"Unset"
URD Resources:"Unset"
!Help WR/ Text 10:27:26 09-Jul-2016 5 kbytes
!Run WR/ Obey 10:27:23 09-Jul-2016 235 bytes
*Show Wimp*
Wimp$IconTheme : Bluberry.
Wimp$Scrap : SDFS::HardDisc0.$.!BOOT.Resources.!Scrap.ScrapDirs.ScrapDir.ScrapFile
Wimp$ScrapDir : SDFS::HardDisc0.$.!BOOT.Resources.!Scrap.ScrapDirs.ScrapDir
Wimp$State : desktop
*
I thought cycling was supposed to improve one’s wits. |
Rick Murray (539) 13806 posts |
Yes. It does. And there are often very good reasons why – sniffing the index page of this site (which I note requests two cookies to be set, but doesn’t pop up the obligatory annoying notice ;-) ), the first line is:
Content-Type: text/html; charset=utf-8
If you serve a text file and your server is set to include that within the HTTP header, then Firefox is only doing what it was told… I ran into this myself, which is why my site doesn’t specify any encoding in the HTTP header. I used http://web-sniffer.net to look at the headers.
I don’t know about newer versions of Thunderbird. Older ones never seemed to suffer too badly from receiving Latin1 emails from a RISC OS application. It would be the usual stuff (fancy quotes in a different place in CP-1252) but nothing extraordinary. Given that I sometimes received mangled address labels, with my “é” turned into some gibberish, I’m wondering if this whole problem isn’t being made harder than it ought to be.
Guessing the encoding with any level of confidence is harder, but then anybody who attempts to determine UTF-8 by looking only at the first sequence found needs a kick in the goolies. There may well be some obscure Polish word in Latin5 that actually contains a valid UTF-8 sequence, so you really need to scan through to find a few sequences to make any sort of judgement. That said, we are really getting off the topic of how the Wimp can be expected to cater for both older applications (by older, I mean “every one thus far written”) and Unicode applications. Being in the UTF-8 alphabet is a non-starter as it breaks everything else for non-English users… |
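On the Content-Type header mentioned above: a receiving program can read the declared charset before it resorts to guessing. This is a small Python sketch using the standard library; the helper name `declared_charset` is made up for illustration, and it says nothing about how Firefox actually does it.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(declared_charset("text/html; charset=utf-8"))  # utf-8
print(declared_charset("text/plain"))                # None
```

When the result is None, the client has nothing to go on but the bytes themselves, which is the case where the guessing starts.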
Steve Pampling (1551) 8155 posts |
Ah, the joys of misunderstanding the language, even born and bred English speakers get that one wrong. In the context given “proof” is the result of the test and “prove” is “test” so multiple tests giving the same result do imply the theory is correct but they do not categorically rule any other option out. |
Doug Webb (190) 1158 posts |
Chris
Here are my results after deleting EasyFonts from the start up menu.
*show Chars*
*ex Resources:$.Apps.!Chars
Dir. Resources:$.Apps.!Chars Option 02 (Run)
CSD Resources:"Unset"
Lib. Resources:"Unset"
URD Resources:"Unset"
!Help WR/ Text 10:27:26 09-Jul-2016 5 kbytes
!Run WR/ Obey 10:27:23 09-Jul-2016 235 bytes
*show Wimp*
Wimp$Font : Homerton.Medium
Wimp$IconTheme : PandaLand2.
Wimp$Scrap : SDFS::ARMiniX.$.!BOOT.Resources.!Scrap.ScrapDirs.ScrapDir.ScrapFile
Wimp$ScrapDir : SDFS::ARMiniX.$.!BOOT.Resources.!Scrap.ScrapDirs.ScrapDir
Wimp$State : desktop
Then after attempting to run !Chars
*show Chars*
Chars$Dir : Resources:$.Apps.!Chars
Chars$Path : Resources:$.Apps.!Chars.,Resources:$.Resources.Chars.
*
I would do it in a nice textual way if the help file was any use whatsoever :-) |
Chris (121) 472 posts |
Frederick: I’m the one whose wits are slow :) The reason you’re getting the error when selecting Help from the menu is that you’ve moved the !Run file, thus setting So that’s one thing solved. But I’m no closer to understanding why Chars on your/Doug’s system runs out of memory. I suppose it would be useful to know if it’s running as it should on OMAP3/4 ROMs generally, or whether this is something that affects all Beagle/Pandaboards. |
Rick Murray (539) 13806 posts |
Which is why it was put in quotes. A “proof” (layman’s definition) doesn’t really prove anything other than “here’s one more test that doesn’t disprove the theory”. |
Steffen Huber (91) 1949 posts |
It is the job of whatever application is showing the text file to support different encodings and, if it cannot be determined, let the user choose the correct encoding. It would be a good idea if the OS would support conversion between different common encodings. Apart from that, the OS should be encoding agnostic. All IMHO of course. |
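The conversion step suggested above is simple once the source encoding is known. Here is a minimal Python sketch, with a hypothetical helper name, in which the caller (ultimately the user) states the encoding rather than having it guessed.

```python
def convert(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode data from a caller-specified source encoding to the target."""
    return data.decode(source).encode(target)


# Latin-1 "é" is the single byte 0xE9; in UTF-8 it becomes two bytes.
print(convert(b"\xe9", "latin-1"))   # b'\xc3\xa9'

# The same input byte means different things under different encodings,
# which is why the choice has to come from outside the data.
print(convert(b"\x93", "cp1252"))    # b'\xe2\x80\x9c' (a left double quote)
```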
Doug Webb (190) 1158 posts |
Chris, I think I know what the issue is, and it seems to be related to the number of fonts in the !Fonts directory in Resources. I installed a clean !Boot and then rebooted so all the choices were set up as new, ran !Chars, and it worked. I then reintroduced all of the added fonts I had in !Fonts, rebooted, tried !Chars and got the failure. I deleted them gradually, testing each time after a reboot, until I had 23 different font folders in !Fonts, at which point !Chars worked. To ensure it didn’t relate to a particular font I altered the fonts that made up the 24th entry (though I only tried another 10 different fonts, not all of them), and on each occasion !Chars gave the error. So it does seem to be related to the number of fonts, at least on this setup. Hope that helps |
Frederick Bambrough (1372) 837 posts |
This was easy for me to confirm. I keep two font directories, one for the default fonts (5) and another for fonts I’ve added (68). This made it easy for me to move the second dir out of Resources temporarily and reboot. Result same as Doug’s – Chars works. |
Chris (121) 472 posts |
In the source in CVS it looks like the code that creates the fontlist, which should grow the wimpslot to accommodate long lists, doesn’t. Not sure why – it used to. I think when I did some tidying of the source for submission I must have had an idiot moment and mangled the code. I’ll take a look at it tonight and should be able to send a fix in. Apologies for the inconvenience, many thanks for your detective work! |
Rick Murray (539) 13806 posts |
That’s okay. That’s why this is not the “stable” release. Think of it as crowd sourced bug bashing. ;-) While I’m here – is there somebody with a large font collection willing to zip up and mail me a copy? I ought to test Ovation with lots of fonts. |
Andrew Conroy (370) 725 posts |
Drop me an email to a.m.conroy (at) owlart.co.uk and I can send you tons of them! |