Non-ASCII text files
Chris Hall (132) 3558 posts |
Text files encoded in ASCII, Latin-1 or UTF-8 all leave ASCII characters unchanged, so if they contain just a few ‘special’ characters they should be relatively easy to interpret. I have a particular case in mind – a GEDCOM file, which is defined as a text file containing certain code words, including one to specify the encoding of the text that follows. So far all is well. Some bright spark has created a GEDCOM file that is itself encoded as UTF-16, so that even ASCII characters are coded as 16-bit words with the zero byte following, i.e. little-endian. The file starts FF, FE – is this an indication that it is UTF-16 little-endian encoded? If so then I can work out how to translate such a file. If I load it into Notepad under Windows and then save it with ‘normal’ encoding (ANSI), that gets it into the right form. Do we have a RISC OS file type for UTF-encoded text files?
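For instance, something like this minimal sketch would do the translation, assuming the file really is UTF-16LE with a leading FF FE byte order mark (the ‘?’ substitution and the command-line shape are illustrative, not any existing tool):

    /* Sketch: convert a UTF-16LE text file (with FF FE BOM) to 8-bit text.
       Code units above 0xFF become '?'; surrogate pairs are not handled,
       which is fine for a GEDCOM file whose content is really just ASCII. */
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        FILE *in, *out;
        int lo, hi;

        if (argc != 3) { fprintf(stderr, "usage: %s in out\n", argv[0]); return 1; }
        in  = fopen(argv[1], "rb");
        out = fopen(argv[2], "wb");
        if (in == NULL || out == NULL) { perror("fopen"); return 1; }

        /* Check and consume the byte order mark */
        lo = getc(in); hi = getc(in);
        if (lo != 0xFF || hi != 0xFE) {
            fprintf(stderr, "no UTF-16LE byte order mark\n");
            return 1;
        }
        /* Each pair of bytes is one little-endian 16-bit code unit */
        while ((lo = getc(in)) != EOF && (hi = getc(in)) != EOF) {
            unsigned int unit = (unsigned)lo | ((unsigned)hi << 8);
            putc(unit <= 0xFFu ? (int)unit : '?', out);
        }
        fclose(in); fclose(out);
        return 0;
    }
 |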
Stuart Swales (8827) 1357 posts |
It’s UTF-16, which has a byte order mark at the front – you can tell from FF FE that it’s little-endian data; if it were FE FF then it would be big-endian data. Fireworkz will happily detect byte order marks to permit loading of text encoded in UTF-16 and UTF-32 in either byte order.
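In outline the detection amounts to no more than this sort of thing (a sketch, not Fireworkz’s actual code; note that the UTF-32LE mark has to be tested before UTF-16LE, because FF/FE/00/00 begins with FF/FE):

    /* Sketch of BOM detection; returns an encoding name, or NULL if no BOM.
       The ordering matters: UTF-32LE (FF FE 00 00) must be checked before
       UTF-16LE (FF FE), which it begins with. Names are illustrative only. */
    #include <stddef.h>

    const char *sniff_bom(const unsigned char *b, size_t n)
    {
        if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return "UTF-32LE";
        if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return "UTF-32BE";
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return "UTF-8";
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return "UTF-16LE";
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return "UTF-16BE";
        return NULL;   /* no BOM: could still be UTF-8 or an 8-bit set */
    }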
Please $deity no |
Chris Hall (132) 3558 posts |
If the FF/FE or FE/FF prefix specifies little- or big-endian UTF-16 then we don’t need another file type, agreed. Is it FF/FF/FF/FE or FE/FF/FF/FF to specify UTF-32? And should !StrongED, !Edit and !Zap load a text file starting FF/FE or FE/FF as text rather than data, display it sensibly (perhaps substituting £, or an escape such as &#x0163;, or even just mapping to Latin-1 as far as possible), and offer to save it in Latin-1, code page 437/852/1252, ANSI, UTF-8, UTF-16 or UTF-32? |
Stuart Swales (8827) 1357 posts |
The BOM is the character U+FEFF, encoded as per the rest of the encoding (reading the byte-swapped value U+FFFE – i.e. -2 – tells you the byte order is the other way round), so the UTF-32 representations would be FF/FE/00/00 (LE) and 00/00/FE/FF (BE). My personal preferences would be for a plain text editor to (a) attempt to detect the encoding rather than just loading junk and (b) save files using the detected encoding by default, with an option to save in a different encoding. Whether it has to transform the data to and from a multi-byte form so it can be handled by the existing core editor code is another matter! [For instance, current Fireworkz releases do have to create Acorn Latin-1 (or Windows-1252) single-byte representations of characters where possible when importing UTF-whatever encoded text files; Unicode characters also turn up in RTF and various Excel strings. Fireworkz has to represent the remaining characters as Unicode inline sequences that it knows how to render on platforms with Unicode rendering (Windows, modern RISC OS). Fireworkz can be built to handle UTF-8 internally, in which case everything is more straightforward. Sadly we still can’t rely on a Unicode-handling Font Manager being present…]
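Schematically the import fallback is of this shape (a sketch only – the real Fireworkz inline-sequence format is private to the application, and the \u{…} escape here is purely illustrative):

    /* Sketch: write a Unicode code point as a Latin-1 byte where possible,
       otherwise as an illustrative U+XXXX-style escape. Not Fireworkz's
       actual inline-sequence format, which is private to that application. */
    #include <stdio.h>

    static void put_codepoint(unsigned long cp, FILE *out)
    {
        if (cp <= 0xFF)
            putc((int)cp, out);             /* Latin-1 range passes straight through */
        else
            fprintf(out, "\\u{%04lX}", cp); /* placeholder escape for the rest */
    }
 |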
David J. Ruck (33) 1636 posts |
Whilst UTF-16 is common on Windows (due to it being the first wide character set supported by the MFC class library), I’ve never come across a UTF-32 file in the wild. Using 4 bytes per character was considered wasteful, so the rest of the world standardised on UTF-8 or XML – which is well known for its compact representation – NOT! |
Stuart Swales (8827) 1357 posts |
Hazy memory but I think someone did send me a UTF-32 file, which is why Fireworkz ended up being updated to import it. |
Matthew Phillips (473) 721 posts |
It’s worth saying that in practice it is quite straightforward to distinguish between UTF-8 and 7- or 8-bit character sets like Latin-1, because top-bit-set characters only appear in UTF-8 in very specific sequences which are easy to spot.
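A sketch of that check (sequence shapes only; overlong forms and the U+10FFFF limit are deliberately glossed over, which is usually good enough to tell UTF-8 from Latin-1):

    /* Sketch: return 1 if the buffer is plausibly UTF-8, 0 if it must be an
       8-bit set such as Latin-1. Checks multi-byte sequence shapes only;
       overlong encodings and range limits are deliberately ignored. */
    #include <stddef.h>

    int looks_like_utf8(const unsigned char *b, size_t n)
    {
        size_t i = 0;
        while (i < n) {
            unsigned char c = b[i];
            size_t extra;
            if (c < 0x80)                { i++; continue; }  /* plain ASCII */
            else if ((c & 0xE0) == 0xC0) extra = 1;          /* 110xxxxx */
            else if ((c & 0xF0) == 0xE0) extra = 2;          /* 1110xxxx */
            else if ((c & 0xF8) == 0xF0) extra = 3;          /* 11110xxx */
            else return 0;                /* stray continuation or invalid lead */
            if (i + extra >= n) return 0; /* sequence truncated at buffer end */
            for (size_t k = 1; k <= extra; k++)
                if ((b[i + k] & 0xC0) != 0x80) return 0; /* must be 10xxxxxx */
            i += extra + 1;
        }
        return 1;
    }
 |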