UTF-8 and ISO-8859-1 00-FF
|
I believe Unicode was designed to display Latin-1 text without conversion. The first entries, 00-7F, coincide with ASCII, and the following range, 80-FF, named “Latin-1 Supplement”, contains the most common special and accented characters used in the West. I publish a blog in Spanish and the lack of UTF-8 support makes updating difficult. New entries to the blog are written as Textile in StrongED and converted to UTF-8 compliant HTML with a Python program. I then add the new entry to the existing HTML blog. The problem comes when I discover an error in the main HTML and want to correct it. For example, I discovered yesterday that I had spelled sábado without an accent. In order to correct it I loaded the blog into StrongED and changed about 25 errors. Now the problem occurs: as the corrections are ISO-8859-1, the file is no longer UTF-8 only but contains a mixture of encodings. The display in any web browser is ugly. |
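For what it’s worth, here is a minimal Python sketch (the filename is made up) that locates that sort of pollution by decoding strictly and reporting every byte that breaks UTF-8:

    # Walk the file, reporting each byte that fails strict UTF-8 decoding.
    def find_non_utf8(path):
        data = open(path, 'rb').read()
        pos = 0
        while pos < len(data):
            try:
                data[pos:].decode('utf-8')
                break                          # the rest decodes cleanly
            except UnicodeDecodeError as e:
                bad = pos + e.start            # e.start is relative to the slice
                print('offset %d: byte %#04x near %r'
                      % (bad, data[bad], data[max(0, bad - 10):bad + 10]))
                pos = bad + 1                  # skip the bad byte and carry on

    find_non_utf8('blog.html')                 # hypothetical filename

Each reported offset is a character that was saved as ISO-8859-1 rather than UTF-8.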
|
Sadly, you got it wrong in the first sentence. Unicode (specifically, UTF-8) was designed to display 7-bit ASCII text without conversion, not Latin-1. Characters in the Latin-1 set outside the 00-7F range are converted to multi-byte sequences in UTF-8. The Wikipedia article does a reasonable job of explaining the situation. |
|
What he meant was that the accented characters in the upper, top-bit-set range keep the same code points, only encoded as multibyte values. It’s not as if é suddenly becomes character 1023 or something… |
|
á in Latin-1 is E1; é in Latin-1 is E9. There are the makings of a similarity (é is 8 characters after á in both encodings) but it’s not as if UTF-8 just bungs an extra byte on the front: there is some mangling going on. |
|
The first sentence is correct. Latin 1¹ maps exactly to Unicode code points (a fancy name for characters, because a character can be more than one ‘character’). However, UTF-8 is an encoding (as are Latin 1, Latin 2, UTF-16…) which maps Unicode code points to bytes. UTF-8 uses one byte for code points <= 7F (ie. ASCII), two bytes for code points <= 7FF (ie. including the Latin 1 top-bit characters), three bytes for code points <= FFFF, and four bytes beyond that. It uses the top bit of the first byte to flag a multi-byte character and, if set, the subsequent bits to flag how many bytes follow. See the table at https://en.wikipedia.org/wiki/UTF-8#Description

However, back to the problem with StrongED: I can’t think of an easy solution within the editor. You can’t expect an editor that doesn’t support an encoding to edit that encoding without some form of post-processing – just as you can’t use a Latin 1 editor to write Latin 2 text and expect it to display correctly. Can’t you store and edit the original text, and use the Python program to do the conversion from the original? Or perhaps the HTML page could explicitly state it is Latin 1 (though note that HTML specifies that Latin 1 is actually Windows-1252, so it differs from RISC OS in the middle few characters).

¹ That is, Latin 1 according to the ISO specification, not as extended by Acorn or MS. ie. the middle range 80-9F is unspecified, but both vendors chose to populate it differently. |
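By way of illustration, running a few code points through Python’s own encoder shows exactly those byte counts:

    # One byte up to U+007F, two up to U+07FF, three up to U+FFFF, four beyond.
    for ch in ('A',            # U+0041, plain ASCII
               '\u00e1',       # U+00E1, a Latin 1 top-bit char (á)
               '\u20ac',       # U+20AC, euro sign
               '\U0001f600'):  # U+1F600, outside the BMP
        encoded = ch.encode('utf-8')
        print('U+%04X -> %d bytes: %s' % (ord(ch), len(encoded), encoded.hex()))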
|
E1 is 1110,0001. Wikipedia refers to this as yyyyzzzz. UTF-8 two-byte sequences are (from the Wikipedia table) 110xxxyy, 10yyzzzz. There are no x’s, so substituting in the four y bits and four z bits, the output will be 11000011, 10100001, which is C3A1. E9 is 1110,1001, ie. only the last nibble has changed, which is ‘zzzz’, so only that will have changed in the output: C3A9. |
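The same arithmetic can be done by hand in Python for any code point in the two-byte range, which makes the ‘mangling’ explicit:

    # Build 110xxxyy 10yyzzzz by hand rather than calling str.encode.
    def utf8_two_byte(cp):
        assert 0x80 <= cp <= 0x7FF
        b1 = 0b11000000 | (cp >> 6)            # leading byte: 110 + top 5 bits
        b2 = 0b10000000 | (cp & 0b00111111)    # continuation: 10 + low 6 bits
        return bytes([b1, b2])

    print(utf8_two_byte(0xE1).hex())   # c3a1, matching the worked example
    print(utf8_two_byte(0xE9).hex())   # c3a9 - only the low bits differ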
|
It isn’t just a display issue. UTF-8 uses a variable-length encoding, so whatever handles cursors and clicks needs some idea of where to move or place the insertion point in the text. You can’t insert characters in the middle of a sequence, and deleting a character should remove the whole sequence, not just its last byte. |
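The boundary test itself is simple, because every continuation byte matches the pattern 10xxxxxx. A sketch of the caret logic an editor would need (function names are illustrative):

    # Continuation bytes are 10xxxxxx; caret movement skips past them.
    def is_continuation(byte):
        return byte & 0b11000000 == 0b10000000

    def next_char_offset(data, pos):
        pos += 1
        while pos < len(data) and is_continuation(data[pos]):
            pos += 1
        return pos

    text = 'sábado'.encode('utf-8')    # á occupies two bytes, C3 A1
    print(next_char_offset(text, 1))   # 3: the caret jumps over both bytes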
|
On the internet, that fight has been lost. The official position of HTML 5 is to treat ISO 8859/1 as a synonym for CP-1252… Unfortunately, things like sexed quote marks and so on aren’t part of 8859/1. They are in CP-1252, but in a different place to where RISC OS puts them. |
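The difference is easy to demonstrate by decoding the same bytes both ways:

    # 80-9F is where they differ: CP-1252 puts curly quotes there,
    # ISO-8859-1 proper leaves that range as C1 control characters.
    smart = bytes([0x93, 0x48, 0x69, 0x94])
    print(smart.decode('windows-1252'))   # “Hi” - sexed quotes appear
    print(smart.decode('iso-8859-1'))     # same bytes, invisible C1 controls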
|
Sadly that is no longer possible. Because I did not anticipate the need to recreate the HTML, I have subsequently modified it, and in particular tweaked the CSS quite a bit. The program used to create a new entry is different from the one that created the original HTML, which I considered to be scaffolding. |
|
There was a discussion about this at a ROUGOL Zoom meeting last night, at which a UTF-8 capable editor was demonstrated. It is able to handle emoticons and kanji and a lot more besides. According to the author he just needs to encode Korean and it will be ready for release. Needless to say, I can’t wait. |
|
I did mention that. I don’t see it as a fight – there was a chunk in the middle that had no mapping. It’s better to use something that defines those characters than not, and clearly there are more Windows computers than RISC OS ones, so it was the obvious choice.
Ah, yes, I hadn’t considered that. Presumably that would solve the problem (albeit with rather ugly source code), with a quick search & replace. |
|
nemo?
There, fixed that for you. ;) Yes, CP-1252 is a superset of 8859/1 so it is a valid translation, but let’s be fair, back in the day there were plenty of pages with sexed quotes and the like that rendered in unexpected ways because they declared the wrong character set.
I guess if you do it often enough, you gain the ability to mentally translate it as you’re reading. I’d rather see glyphs than the sort of mess that happens when one translates between wide and eight-bit character sets¹. Especially given that I might be editing in an Android editor (which may or may not be UTF-8) and uploading via a browser (which may or may not cock up pasting text from another app), or writing in Zap (8859/1) and uploading via NetSurf. Given the selection of variables, better to write something that is “plain ASCII with glyphs” so it’ll be editable on whatever.

¹ The number of broken addresses mom² used to get on labels from Amazon marketplace sellers… Amazon clearly is UTF-8 and the software the sellers were using clearly wasn’t, so I’d have stuff like …

² I say mom and not me because I rarely use marketplace, but mom got plenty of secondhand books that way.

Maybe it’s better now? I wouldn’t hold my breath… |
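That sort of mess is easy to reproduce: encode as UTF-8, then decode as an eight-bit set:

    # The classic wide/eight-bit mangling behind those broken labels.
    original = 'José'
    mangled = original.encode('utf-8').decode('iso-8859-1')
    print(mangled)                                        # JosÃ© - the accent grows a Ã
    print(mangled.encode('iso-8859-1').decode('utf-8'))   # round-trips back to José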
|
Isn’t that a perfect contradiction? :-) Acorn incorrectly called theirs Latin 1 (8859/1) despite adding extra characters. Windows, at least, chose to give it a new name when they did the same. |
|
Yup. You got me. It was getting cold rather rapidly (I was sitting outside) and I wanted to get the message finished, so I skipped over some obvious stuff like “8859/1ish”. The point still stands: using ASCII+glyphs means it doesn’t matter what any particular machine wants to do. Heck, I could even use DOS. ;)
Perhaps we ought to retroname it “Acorn 1”? At least it was almost standard and not the old Master/BFont. |
|
@ John I built two desktop utilities, based on TextConv (builder), which allow me to make these conversions. I especially use UTF8_latin1 because the text contained in PDF files contains UTF-8 characters.

Note: the TextConv executable is inside the applications; it can make other conversions and can be used on the command line, so you could create a button in StrongED that uses these possibilities.

Usage: TextConv [options] [<inputfile> [<outputfile>]]
Options:
  -from <charset>   Set the source charset (default = System alphabet)
  -to <charset>     Set the destination charset (default = System alphabet)
Charsets may have one or more appended modifiers:
  /le        Little-endian (eg UTF-16)
  /encopt    Encode optionally encoded characters (eg UTF-7)
  /noheader  No header (eg UTF-16 byte order mark)
Known charsets include:
  US-ASCII, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5,
  ISO-8859-7, ISO-8859-8, ISO-8859-10, ISO-8859-14, ISO-8859-15, ISO-2022,
  ISO-2022-JP, ISO-2022-JP-1, ISO-2022-JP-2, EUC-JP, Shift_JIS, ISO-2022-CN,
  ISO-2022-CN-EXT, EUC-CN, Big5, ISO-2022-KR, EUC-KR, Johab, KOI8-R, CP866,
  Windows-1250, Windows-1251, Windows-1252, Mac-Roman, Mac-CentralEurRoman,
  Mac-Cyrillic, Mac-Ukrainian, ISO-IR-182, ISO-IR-197, x-Acorn-Latin1,
  x-Acorn-Fuzzy, x-Current, UTF-7, UTF-8, UTF-16, UCS-4, SCSU |
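For anyone without TextConv to hand, the Latin 1 to UTF-8 case at least is a few lines of Python (a rough stand-in, not a replacement for the real tool):

    # Hypothetical stand-in: read bytes in one charset, write them in another.
    import sys

    def convert(infile, outfile, src='iso-8859-1', dst='utf-8'):
        with open(infile, 'rb') as f:
            text = f.read().decode(src)
        with open(outfile, 'wb') as f:
            f.write(text.encode(dst))

    if __name__ == '__main__':
        convert(sys.argv[1], sys.argv[2])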
|
I’ve seen stuff like that, in the last few months, in content generated in the last 12 months. |
|
@Jean-Michel
I was looking for a way to edit a large HTML file on RISC OS. The file is encoded UTF-8 and contains Spanish text. I wanted to make changes using StrongED which did not introduce any 8859/1 pollution. I have now found that there is no problem inputting accented characters, as the HTML modefile automatically substitutes &aacute; for á, &eacute; for é, etc.
Thanks for the offer, but I have a similar script in Lua which can be dropped on the StrongED Process button to do this. I am just a bit wary of applying such changes in bulk. |
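For reference, the same substitution is a few lines of Python using the standard library’s entity table (a sketch, not the actual modefile mechanism):

    # Replace each top-bit character that has a named entity, so those
    # characters survive as plain ASCII in the source.
    from html.entities import codepoint2name

    def to_entities(text):
        return ''.join(
            '&%s;' % codepoint2name[ord(c)]
            if ord(c) > 0x7F and ord(c) in codepoint2name else c
            for c in text)

    print(to_entities('sábado'))   # s&aacute;bado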
|
The problem I’ve just had with sexed quotes on a web page is that some web creation software automatically converts normal quotes into them. I was pasting a command line into bash and got lots of errors, then noticed it wasn’t using normal quotes. At least on RISC OS, pasting it would have revealed the different characters (as most apps don’t interpret UTF-8). |
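A small de-smartening pass before pasting avoids that; the mapping below covers only the common curly forms:

    # Map curly quotes back to their ASCII equivalents before pasting.
    SMART = {0x2018: "'", 0x2019: "'", 0x201C: '"', 0x201D: '"'}

    cmd = '\u201cecho \u2018hi\u2019\u201d'
    print(cmd.translate(SMART))    # "echo 'hi'" - safe to paste into bash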
|
FTR I’ve supplied John with suitable conversion progs. |
|
And much appreciated. I now have them integrated into StrongED and have what is, to me, a satisfactory solution.