UTF-8 and ISO-8859-1 00-FF
|
I believe Unicode was designed to display Latin-1 text without conversion. The first entries, 00-7F, coincide with ASCII, and the following range, 80-FF, named “Latin-1 Supplement”, contains the most common special and accented characters used in the West. I publish a blog in Spanish and the lack of UTF-8 support makes updating difficult. New entries to the blog are written as Textile in StrongED and converted to UTF-8 compliant HTML with a Python program. I then add the new entry to the existing HTML blog. The problem comes when I discover an error in the main HTML and want to correct it. For example, I discovered yesterday that I had spelled sábado without an accent. In order to correct it I loaded the blog into StrongED and changed about 25 errors. Now the problem occurs: as the corrections are ISO-8859-1, the file is no longer UTF-8 only but contains a mixture of encodings. The display in any web browser is ugly. |
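For what it’s worth, here is a minimal Python sketch (the filename is made up) that locates that sort of pollution by decoding strictly and reporting every byte that breaks UTF-8:

    # Walk the file, reporting each byte that fails strict UTF-8 decoding.
    def find_non_utf8(path):
        data = open(path, 'rb').read()
        pos = 0
        while pos < len(data):
            try:
                data[pos:].decode('utf-8')
                break                          # the rest decodes cleanly
            except UnicodeDecodeError as e:
                bad = pos + e.start            # e.start is relative to the slice
                print('offset %d: byte %#04x near %r'
                      % (bad, data[bad], data[max(0, bad - 10):bad + 10]))
                pos = bad + 1                  # skip the bad byte and carry on

    find_non_utf8('blog.html')                 # hypothetical filename

Each reported offset is a character that was saved as ISO-8859-1 rather than UTF-8.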
|
Sadly, you got it wrong in the first sentence. Unicode (specifically, UTF-8) was designed to display 7-bit ASCII text without conversion, not Latin-1. Characters in the Latin-1 set outside the 00-7F range are converted to multi-byte sequences in UTF-8. The Wikipedia article does a reasonable job of explaining the situation. |
|
What he meant was that the accented characters in the upper, top-bit-set range keep the same code points, only encoded as multibyte values. It’s not as if é suddenly becomes character 1023 or something… |
|
á in Latin-1 is E1; é in Latin-1 is E9. There are the makings of a similarity (é is 8 characters after á in both encodings) but it’s not as if UTF-8 just bungs an extra byte on the front: there is some mangling going on. |
|
The first sentence is correct. Latin 1¹ maps exactly to Unicode code points (a fancy name for characters, because a character can be more than one ‘character’). However, UTF-8 is an encoding (as are Latin 1, Latin 2, UTF-16…) which maps Unicode code points to bytes. UTF-8 uses one byte for code points <= 7F (ie. ASCII), two bytes for code points <= 7FF (ie. including the Latin 1 top-bit characters), three bytes for code points <= FFFF, and four bytes beyond that. It uses the top bit of the first byte to flag a multi-byte character and, if set, the subsequent bits to flag how many bytes follow. See the table at https://en.wikipedia.org/wiki/UTF-8#Description

However, back to the problem with StrongED: I can’t think of an easy solution within the editor. You can’t expect an editor that doesn’t support an encoding to edit that encoding without some form of post-processing – just as you can’t use a Latin 1 editor to write Latin 2 text and expect it to display correctly. Can’t you store and edit the original text, and use the Python program to do the conversion from the original? Or perhaps the HTML page could explicitly state it is Latin 1 (though note that HTML specifies that Latin 1 is actually Windows-1252, so it differs from RISC OS in the middle few characters).

¹ That is, Latin 1 according to the ISO specification, not as extended by Acorn or MS. ie. the middle range 80-9F is unspecified, but both vendors chose to populate it differently. |
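By way of illustration, running a few code points through Python’s own encoder shows exactly those byte counts:

    # One byte up to U+007F, two up to U+07FF, three up to U+FFFF, four beyond.
    for ch in ('A',            # U+0041, plain ASCII
               '\u00e1',       # U+00E1, a Latin 1 top-bit char (á)
               '\u20ac',       # U+20AC, euro sign
               '\U0001f600'):  # U+1F600, outside the BMP
        encoded = ch.encode('utf-8')
        print('U+%04X -> %d bytes: %s' % (ord(ch), len(encoded), encoded.hex()))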
|
E1 is 1110,0001. Wikipedia refers to this as yyyyzzzz. UTF-8 two-byte sequences are (from the Wikipedia table) 110xxxyy, 10yyzzzz. There are no x’s, so substituting in the four y bits and four z bits, the output will be 11000011, 10100001, which is C3A1. E9 is 1110,1001, ie. only the last nibble has changed, which is ‘zzzz’, so only that will have changed in the output: C3A9. |
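The same arithmetic can be done by hand in Python for any code point in the two-byte range, which makes the ‘mangling’ explicit:

    # Build 110xxxyy 10yyzzzz by hand rather than calling str.encode.
    def utf8_two_byte(cp):
        assert 0x80 <= cp <= 0x7FF
        b1 = 0b11000000 | (cp >> 6)            # leading byte: 110 + top 5 bits
        b2 = 0b10000000 | (cp & 0b00111111)    # continuation: 10 + low 6 bits
        return bytes([b1, b2])

    print(utf8_two_byte(0xE1).hex())   # c3a1, matching the worked example
    print(utf8_two_byte(0xE9).hex())   # c3a9 - only the low bits differ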
|
It isn’t just a display issue. UTF-8 uses a variable-length encoding, so whatever handles cursors and clicks needs some idea of where to move or place the insertion point in the text. You can’t insert characters in the middle of a sequence, and deleting a character should remove the whole sequence, not just its last byte. |
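The boundary test itself is simple, because every continuation byte matches the pattern 10xxxxxx. A sketch of the caret logic an editor would need (function names are illustrative):

    # Continuation bytes are 10xxxxxx; caret movement skips past them.
    def is_continuation(byte):
        return byte & 0b11000000 == 0b10000000

    def next_char_offset(data, pos):
        pos += 1
        while pos < len(data) and is_continuation(data[pos]):
            pos += 1
        return pos

    text = 'sábado'.encode('utf-8')    # á occupies two bytes, C3 A1
    print(next_char_offset(text, 1))   # 3: the caret jumps over both bytes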
|
On the internet, that fight has been lost. The official position of HTML 5 is to treat ISO 8859/1 as a synonym for CP-1252… Unfortunately, things like sexed quote marks and so on aren’t part of 8859/1. They are in CP-1252, but in a different place to where RISC OS puts them. |
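The difference is easy to demonstrate by decoding the same bytes both ways:

    # 80-9F is where they differ: CP-1252 puts curly quotes there,
    # ISO-8859-1 proper leaves that range as C1 control characters.
    smart = bytes([0x93, 0x48, 0x69, 0x94])
    print(smart.decode('windows-1252'))   # “Hi” - sexed quotes appear
    print(smart.decode('iso-8859-1'))     # same bytes, invisible C1 controls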
|
Sadly that is no longer possible. Because I did not anticipate the need to recreate the HTML, I have subsequently modified it, and in particular tweaked the CSS quite a bit. The program used to create a new entry is different from the one that created the original HTML, which I considered to be scaffolding. |
|
There was a discussion about this at a ROUGOL Zoom meeting last night, at which a UTF-8 capable editor was demonstrated. It is able to handle emoticons and kanji and a lot more besides. According to the author he just needs to encode Korean and it will be ready for release. Needless to say, I can’t wait. |
|
I did mention that. I don’t see it as a fight – there was a chunk in the middle that had no mapping. It’s better to use something that defines those characters than not, and clearly there are more Windows computers than RISC OS ones, so it was the obvious choice.
Ah, yes, I hadn’t considered that. Presumably that would solve the problem (albeit with rather ugly source code), with a quick search & replace. |
|
nemo?
There, fixed that for you. ;) Yes, CP-1252 is a superset of 8859/1 so it is a valid translation, but let’s be fair, back in the day there were plenty of pages with sexed quotes and the like that rendered in unexpected ways because they declared the wrong character set.
I guess if you do it often enough, you gain the ability to mentally translate it as you’re reading. I’d rather see glyphs than the sort of mess that happens when one translates between wide and eight-bit character sets¹. Especially given that I might be editing in an Android editor (which may or may not be UTF-8) and uploading via a browser (which may or may not cock up pasting text from another app), or writing in Zap (8859/1) and uploading via NetSurf. Given the selection of variables, better to write something that is “plain ASCII with glyphs” so it’ll be editable on whatever.

¹ The number of broken addresses mom² used to get on labels from Amazon marketplace sellers… Amazon clearly is UTF-8 and the software the sellers were using clearly wasn’t, so I’d have stuff like …

² I say mom and not me because I rarely use marketplace, but mom got plenty of secondhand books that way.

Maybe it’s better now? I wouldn’t hold my breath… |
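That sort of mess is easy to reproduce: encode as UTF-8, then decode as an eight-bit set:

    # The classic wide/eight-bit mangling behind those broken labels.
    original = 'José'
    mangled = original.encode('utf-8').decode('iso-8859-1')
    print(mangled)                                        # JosÃ© - the accent grows a Ã
    print(mangled.encode('iso-8859-1').decode('utf-8'))   # round-trips back to José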
|
Isn’t that a perfect contradiction? :-) Acorn incorrectly called theirs Latin 1 (8859/1) despite adding extra characters. Windows, at least, chose to give it a new name when they did the same. |
|
Yup. You got me. It was getting cold rather rapidly (I was sitting outside) and I wanted to get the message finished, so I skipped over some obvious stuff like “8859/1ish”. The point still stands: using ASCII+glyphs means it doesn’t matter what any particular machine wants to do. Heck, I could even use DOS. ;)
Perhaps we ought to retroname it “Acorn 1”? At least it was almost standard and not the old Master/BFont. |
|
@ John I built two desktop utilities, based on TextConv (builder), which allow me to make these conversions. I especially use UTF8_latin1 because the text contained in PDF files contains UTF-8 characters.

Note: the TextConv executable is inside the applications; it can make other conversions and can be used on the command line, so you could create a button in StrongED that uses these possibilities.

Usage: TextConv [options] [<inputfile> [<outputfile>]]
Options:
  -from <charset>   Set the source charset (default = System alphabet)
  -to <charset>     Set the destination charset (default = System alphabet)
Charsets may have one or more appended modifiers:
  /le        Little-endian (eg UTF-16)
  /encopt    Encode optionally encoded characters (eg UTF-7)
  /noheader  No header (eg UTF-16 byte order mark)
Known charsets include:
  US-ASCII, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5,
  ISO-8859-7, ISO-8859-8, ISO-8859-10, ISO-8859-14, ISO-8859-15, ISO-2022,
  ISO-2022-JP, ISO-2022-JP-1, ISO-2022-JP-2, EUC-JP, Shift_JIS, ISO-2022-CN,
  ISO-2022-CN-EXT, EUC-CN, Big5, ISO-2022-KR, EUC-KR, Johab, KOI8-R, CP866,
  Windows-1250, Windows-1251, Windows-1252, Mac-Roman, Mac-CentralEurRoman,
  Mac-Cyrillic, Mac-Ukrainian, ISO-IR-182, ISO-IR-197, x-Acorn-Latin1,
  x-Acorn-Fuzzy, x-Current, UTF-7, UTF-8, UTF-16, UCS-4, SCSU |
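For anyone without TextConv to hand, the Latin 1 to UTF-8 case at least is a few lines of Python (a rough stand-in, not a replacement for the real tool):

    # Hypothetical stand-in: read bytes in one charset, write them in another.
    import sys

    def convert(infile, outfile, src='iso-8859-1', dst='utf-8'):
        with open(infile, 'rb') as f:
            text = f.read().decode(src)
        with open(outfile, 'wb') as f:
            f.write(text.encode(dst))

    if __name__ == '__main__':
        convert(sys.argv[1], sys.argv[2])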
|
I’ve seen stuff like that, in the last few months, in content generated in the last 12 months. |
|
@Jean-Michel
I was looking for a way to edit a large HTML file on RISC OS. The file is encoded UTF-8 and contains Spanish text. I wanted to make changes using StrongED which did not introduce any 8859/1 pollution. I have now found that there is no problem inputting accented characters, as the HTML modefile automatically substitutes &aacute; for á, &eacute; for é, etc.
Thanks for the offer, but I have a similar script in Lua which can be dropped on the StrongED Process button to do this. I am just a bit wary of applying such changes in bulk. |
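For reference, the same substitution is a few lines of Python using the standard library’s entity table (a sketch, not the actual modefile mechanism):

    # Replace each top-bit character that has a named entity, so those
    # characters survive as plain ASCII in the source.
    from html.entities import codepoint2name

    def to_entities(text):
        return ''.join(
            '&%s;' % codepoint2name[ord(c)]
            if ord(c) > 0x7F and ord(c) in codepoint2name else c
            for c in text)

    print(to_entities('sábado'))   # s&aacute;bado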
|
The problem I’ve just had with sexed quotes on a web page is that some web creation software automatically converts normal quotes into them. I was pasting a command line into bash and got lots of errors, then noticed it wasn’t using normal quotes. At least on RISC OS, pasting it would have revealed the different characters (as most apps don’t interpret UTF-8). |
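A small de-smartening pass before pasting avoids that; the mapping below covers only the common curly forms:

    # Map curly quotes back to their ASCII equivalents before pasting.
    SMART = {0x2018: "'", 0x2019: "'", 0x201C: '"', 0x201D: '"'}

    cmd = '\u201cecho \u2018hi\u2019\u201d'
    print(cmd.translate(SMART))    # "echo 'hi'" - safe to paste into bash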
|
FTR I’ve supplied John with suitable conversion progs. |
|
And much appreciated. I now have them integrated into StrongED and have what is, to me, a satisfactory solution.