Non-ASCII text files
Chris Hall (132) 3558 posts |
Text files encoded in ASCII, Latin-1 or UTF-8 all leave ASCII characters unchanged, so if they contain just a few ‘special’ characters they should be relatively easy to interpret. I have a particular case in mind – a GEDCOM file, which is defined as a text file containing certain code words, including one to specify the encoding of the text that follows. So far all is well. Some bright spark has created a GEDCOM file that is itself encoded as UTF-16, so that even ASCII characters are coded as 16-bit words with the zero byte following, i.e. little-endian. The file starts FF, FE – is this an indication that it is UTF-16 little-endian encoded? If so then I can work out how to translate such a file. If I load it into Notepad under Windows and then save it with ‘normal’ encoding (ANSI), that gets it into the right form. Do we have a RISC OS file type for UTF-encoded text files?
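For instance, something like this minimal sketch would do the translation, assuming the file really is UTF-16LE with a leading FF FE byte order mark (the ‘?’ substitution and the command-line shape are illustrative, not any existing tool):

    /* Sketch: convert a UTF-16LE text file (with FF FE BOM) to 8-bit text.
       Code units above 0xFF become '?'; surrogate pairs are not handled,
       which is fine for a GEDCOM file whose content is really just ASCII. */
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        FILE *in, *out;
        int lo, hi;

        if (argc != 3) { fprintf(stderr, "usage: %s in out\n", argv[0]); return 1; }
        in  = fopen(argv[1], "rb");
        out = fopen(argv[2], "wb");
        if (in == NULL || out == NULL) { perror("fopen"); return 1; }

        /* Check and consume the byte order mark */
        lo = getc(in); hi = getc(in);
        if (lo != 0xFF || hi != 0xFE) {
            fprintf(stderr, "no UTF-16LE byte order mark\n");
            return 1;
        }
        /* Each pair of bytes is one little-endian 16-bit code unit */
        while ((lo = getc(in)) != EOF && (hi = getc(in)) != EOF) {
            unsigned int unit = (unsigned)lo | ((unsigned)hi << 8);
            putc(unit <= 0xFFu ? (int)unit : '?', out);
        }
        fclose(in); fclose(out);
        return 0;
    }
 |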
Stuart Swales (8827) 1357 posts |
It’s UTF-16, which has a byte order mark at the front – you can tell from FF FE that it’s little-endian data; if it were FE FF then it would be big-endian data. Fireworkz will happily detect byte order marks to permit loading of text encoded in UTF-16 and UTF-32 in either byte order.
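In outline the detection amounts to no more than this sort of thing (a sketch, not Fireworkz’s actual code; note that the UTF-32LE mark has to be tested before UTF-16LE, because FF/FE/00/00 begins with FF/FE):

    /* Sketch of BOM detection; returns an encoding name, or NULL if no BOM.
       The ordering matters: UTF-32LE (FF FE 00 00) must be checked before
       UTF-16LE (FF FE), which it begins with. Names are illustrative only. */
    #include <stddef.h>

    const char *sniff_bom(const unsigned char *b, size_t n)
    {
        if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return "UTF-32LE";
        if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return "UTF-32BE";
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return "UTF-8";
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return "UTF-16LE";
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return "UTF-16BE";
        return NULL;   /* no BOM: could still be UTF-8 or an 8-bit set */
    }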
Please $deity no |
Chris Hall (132) 3558 posts |
If the FF/FE or FE/FF prefix specifies little- or big-endian UTF-16 then we don’t need another file type, agreed. Is it FF/FF/FF/FE or FE/FF/FF/FF to specify UTF-32? And should !StrongED, !Edit and !Zap load a text file starting FF/FE or FE/FF as text rather than data, display it sensibly (perhaps substituting £, or an escape such as &#x0163;, or even just mapping to Latin-1 as far as possible), and offer to save it in Latin-1, code page 437/852/1252, ANSI, UTF-8, UTF-16 or UTF-32? |
Stuart Swales (8827) 1357 posts |
The BOM is the character U+FEFF, encoded as per the rest of the encoding (reading the byte-swapped value U+FFFE – i.e. -2 – tells you the byte order is the other way round), so the UTF-32 representations would be FF/FE/00/00 (LE) and 00/00/FE/FF (BE). My personal preferences would be for a plain text editor to (a) attempt to detect the encoding rather than just loading junk and (b) save files using the detected encoding by default, with an option to save in a different encoding. Whether it has to transform the data to and from a multi-byte form so it can be handled by the existing core editor code is another matter! [For instance, current Fireworkz releases do have to create Acorn Latin-1 (or Windows-1252) single-byte representations of characters where possible when importing UTF-whatever encoded text files; Unicode characters also turn up in RTF and various Excel strings. Fireworkz has to represent the remaining characters as Unicode inline sequences that it knows how to render on platforms with Unicode rendering (Windows, modern RISC OS). Fireworkz can be built to handle UTF-8 internally, in which case everything is more straightforward. Sadly we still can’t rely on a Unicode-handling Font Manager being present…]
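Schematically the import fallback is of this shape (a sketch only – the real Fireworkz inline-sequence format is private to the application, and the \u{…} escape here is purely illustrative):

    /* Sketch: write a Unicode code point as a Latin-1 byte where possible,
       otherwise as an illustrative U+XXXX-style escape. Not Fireworkz's
       actual inline-sequence format, which is private to that application. */
    #include <stdio.h>

    static void put_codepoint(unsigned long cp, FILE *out)
    {
        if (cp <= 0xFF)
            putc((int)cp, out);             /* Latin-1 range passes straight through */
        else
            fprintf(out, "\\u{%04lX}", cp); /* placeholder escape for the rest */
    }
 |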
David J. Ruck (33) 1636 posts |
Whilst UTF-16 is common on Windows (due to it being the first wide character set supported by the MFC class library), I’ve never come across a UTF-32 file in the wild. Using 4 bytes per character was considered wasteful, so the rest of the world standardised on UTF-8 or XML – which is well known for its compact representation – NOT! |
Stuart Swales (8827) 1357 posts |
Hazy memory but I think someone did send me a UTF-32 file, which is why Fireworkz ended up being updated to import it. |
Matthew Phillips (473) 721 posts |
It’s worth saying that in practice it is quite straightforward to distinguish between UTF-8 and 7- or 8-bit character sets like Latin-1, because top-bit-set characters only appear in UTF-8 in very specific sequences which are easy to spot.
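A sketch of that check (sequence shapes only; overlong forms and the U+10FFFF limit are deliberately glossed over, which is usually good enough to tell UTF-8 from Latin-1):

    /* Sketch: return 1 if the buffer is plausibly UTF-8, 0 if it must be an
       8-bit set such as Latin-1. Checks multi-byte sequence shapes only;
       overlong encodings and range limits are deliberately ignored. */
    #include <stddef.h>

    int looks_like_utf8(const unsigned char *b, size_t n)
    {
        size_t i = 0;
        while (i < n) {
            unsigned char c = b[i];
            size_t extra;
            if (c < 0x80)                { i++; continue; }  /* plain ASCII */
            else if ((c & 0xE0) == 0xC0) extra = 1;          /* 110xxxxx */
            else if ((c & 0xF0) == 0xE0) extra = 2;          /* 1110xxxx */
            else if ((c & 0xF8) == 0xF0) extra = 3;          /* 11110xxx */
            else return 0;                /* stray continuation or invalid lead */
            if (i + extra >= n) return 0; /* sequence truncated at buffer end */
            for (size_t k = 1; k <= extra; k++)
                if ((b[i + k] & 0xC0) != 0x80) return 0; /* must be 10xxxxxx */
            i += extra + 1;
        }
        return 1;
    }
 |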