Safeguarding the past, present and future of RISC OS for everyone

UCS Font Manager (changes)

Showing changes from revision #0 to #1: Added | ~~Removed~~ | ~~Chan~~ged

» UCS Font Manager

The UCS Font Manager is a development of the RISC OS Font Manager whose primary purpose is to allow access to more than 256 characters in a font by using UCS character codes.

Unlike the earlier Bitstream-based prototypes, the UCS Font Manager is fully backwards compatible with older versions of the Font Manager and supports almost all existing API calls. The standard RISC OS font format is used.

For backwards compatibility, font encodings continue to be specified in terms of PostScript glyph names. This is slightly space inefficient, but allows a pleasing elegance of design.

2. Functional description

2.1 Internal changes

The major internal difference between the UCS Font Manager (version 3.41+) and earlier versions is that bitmaps are held in the cache in file order.

Since RISC OS 3 the Font Manager has been capable of using fonts with glyphs in an arbitrary order and remapping – this is used in the standard Corpus, Homerton and Trinity ROM fonts to make them available in any of Latin 1-4.

The remapping from external (e.g. Latin1) character codes to internal (eg /Base0) glyph codes was performed as outlines were turned into bitmaps, and as metrics were scaled. Hence slave fonts in the cache contained data in, say, Latin1 order, while master fonts stored data in /Base0 order.

This made sense because external codes always contained 256 characters, allowing several arrays to be fixed size, irrespective of the size of the font file. Unfortunately, the new external code UCS contains 2,147,483,648 characters, scuppering that idea. Hence the Font Manager now stores everything in file order, including bitmaps, metrics and kerning information.

This now leads to some inefficiencies in the font cache. Using the two fonts “\FHomerton.Medium\ELatin1” and “\FHomerton.Medium\ELatin2” will result in two master fonts and two slave fonts, each pair containing the same outlines and same bitmaps, differing only in the stored encoding tables. It may be worth looking at this at some stage.

Another minor change is that metrics are no longer scaled and stored in slave fonts. Only the unscaled master metrics are stored, and scaled on demand.

2.2 Default encodings

As of Font Manager 3.40, the behaviour of the Font Manager when presented with a font identifier with no encoding has changed subtly, but significantly.

Previously, asking for “Homerton.Medium” would cause the Font Manager to call Territory_AlphabetIdentifier to find the default alphabet for the system territory. Hence, on a UK machine you would get “\FHomerton.Medium\ELatin1”. Asking for “Selwyn” would not apply any encoding.

As of Font Manager 3.40, the behaviour is now to check the current alphabet using OS_Byte 71, and to get the name via Service_International. This means that the command *Alphabet now affects both the system font and future outline font claims.

If a font is a “symbol” font – i.e. contains an “IntMetrics” file rather than an “IntMetric<N>” file, then the default encoding is “Glyph”, rather than the system encoding, unless the system encoding is “UTF8”, in which case this is treated as the default encoding for all fonts.

2.3 Encodings

To support the UCS, a number of changes have been made to the RISC OS font encoding file scheme. Firstly, the %RISCOS_BasedOn directive is now ignored. Up to now, asking for a font with encoding Latin1 would always try to find IntMetric0 and Outlines0 files, because the file Encodings.Latin1 contained the directive "%RISCOS_BasedOn 0".

Now the font manager will find the first file fitting the template “IntMet*”, and use it and its associated Outlines file. (This means that it is no longer possible to have more than one set of metrics/outlines per font.) It will then deduce the base encoding of that file from its suffix. If the file is, say, IntMetri32, it will use the encoding file “Font:Encodings./Base32”. If the file is just IntMetrics, it will use the file “Encoding” in the font directory. If that isn’t present, it will use the encoding file “Font:Encodings./Default”.

This change means that to claim a font with Latin1 encoding no longer requires that the font file be in /Base0 order.

Next, the encoding generation has been relaxed:

Target encoding files (eg Latin1) no longer need to have exactly 256 entries.
If a glyph specified in a target file is not present in a base file, then that glyph is treated as if it were /.notdef, rather than an error being generated.

This permits us to claim MaruGothic.Medium (the Funai 3 Japanese font) with encoding \ELatin1, even though it doesn’t contain all the Latin1 glyphs. Possibly more importantly, it allows us to claim any font with encoding \EUTF8. No font contains all UCS/UTF-8 glyphs :)

The final change to the encoding scheme is an “extension” to the encoding file format that prevents us having to have a 2 billion line long UTF8 encoding file.

The font manager now supports a format compatible with the Adobe Glyph List, which, unlike the straight list of glyph names, allows a “sparse” encoding.

For example, the following encoding file

0005;fred;SYMBOL FOR FRED
0002;Jim;SYMBOL FOR JIM
0007;Harry;SYMBOL FOR HARRY
# Comment
22;000C;Tom;SYMBOL FOR TOM

is equivalent to the old style encoding file

/.notdef /.notdef /Jim     /.notdef
/.notdef /fred    /.notdef /Harry
/.notdef /.notdef /.notdef /.notdef
/Tom

The line format is [XX;]XXXX;NNN;CCCC – the first field is ignored if less than 4 digits long. The CCCC field is ignored.

This format allows the UTF8 encoding file to be a direct concatenation of the main Adobe Glyph List and ZapfDingbats Glyph List.

2.4 UTF8 Encoding

In many respects, the encoding “UTF8” is just like any other encoding. All encodings may now contain up to 2^31 characters, and the format used for the UTF8 file can be used with any encoding. However, a number of special cases are invoked by the encoding being named “UTF8”.

2.4.1 UTF8 encoding file

The UTF8 file in Funai 3 was based on version 1.1 of the Adobe Glyph List. This has now been updated to version 1.2, and all font encoding files have been updated in line.

2.4.2 /uniXXXX identifiers

If the target encoding file being processed is called “UTF8”, then any
identifiers in the base encoding file fitting the template /uniXXXX (where XXXX is a 4 digit upper-case hexadecimal number) will be mapped in at character code XXXX. This covers the majority of UCS characters, which don’t have PostScript glyph names.

Also, as of version 3.43, the form /uniXXXXYYYY is accepted, when XXXX is a high-surrogate (D800-DBFF) and YYYY is a low-surrogate (DC00-DFFF), to represent UCS characters in the range U+00010000 to U+0010FFFF, as specified in version 1.1 of Adobe’s “Unicode and Glyph Names”

There is currently no way to specify codes above U+00110000.

The prioritisation of double-mappings described in “Unicode and Glyph Names” is operative. For example, if a font’s encoding contains “/mu”, the glyph will be given UCS codes 00B5 and 03BC. If a font’s encoding contains both “/mu” and “/uni03BC”, then /mu will be mapped at 00B5, and /uni03BC at 03BC.

However, for non-UTF8 encodings, names must currently match exactly. So if a Greek encoding file contains “/mu”, then it will only match a glyph called “/mu”, not one called “/uni03BC”, and vice-versa.

2.4.3 UTF-8/UTF-16 encoding

If the target encoding for a font handle is “UTF8”, then UTF-8 and UTF-16 processing will be invoked for 8 and 16-bit forms of Font_Paint, etc. respectively. (See below).

2.5 Glyph Encoding

A new encoding “Glyph” is available. This is a special case which requests a font with no encoding applied. Thus the font “\FHomerton.Medium\EGlyph” would contain all 300+ Homerton.Medium glyphs, in the order they are in the font file.

2.6 8/16/32 bit string forms

The SWIs Font_Paint and Font_ScanString now accept strings in 8, 16 or 32-bit forms. (All other SWIs only accept 8-bit strings). This is invoked by bits in R2:

bit 12 set	R1 points to a 16-bit string (bit 13 must be clear)
bit 13 set	R1 points to a 32-bit string (bit 12 must be clear)

A 16-bit string must be half-word aligned, and consists of a number of 16-bit halfwords. A 32-bit string must be word aligned, and consists of a number of 32-bit words.

For example, the string “Hello” might be represented as follows:

8-bit	48 65 6C 6C 6F 00
16-bit	0048 0065 006C 006C 006F 0000
32-bit	00000048 00000065 0000006C 0000006C 0000006F 00000000

This allows access to characters outside the first 256. Control codes work as before, but have to be introduced by a character of the appropriate width. The parameters of control sequences are handled differently in 16 and 32-bit modes:

16-bit parameter forms:

9,11:	dx/dy comprised of 2 16-bit words
17:	1 16-bit word
18:	3 16-bit words
19:	2 x (RR00, BBGG word pairs) + 1 16-bit word
21:	sequence of 16-bit words, terminated by a word < &0020
25:	2 16-bit words
26:	1 16-bit word
27:	4 32-bit words (word-aligned)
28:	6 32-bit words (word-aligned)

32-bit parameter forms:

9,11:	1 32-bit words
17:	1 32-bit word
18:	3 32-bit words
19:	3 32-bit words
21:	sequence of 32-bit words, terminated by a word < &00000020
25:	2 32-bit words
26:	1 32-bit word
27:	4 32-bit words
28:	6 32-bit words

Regardless of the setting of the 16/32-bit bits, string lengths continue to be in bytes (such as R7 on entry to Font_Paint).

2.7 UTF-8/UTF-16 string forms

If a font’s encoding is “UTF8”, then rather than taking “straight” 8 and 16 bit strings, strings are decoded according to UTF-8 or UTF-16. For example, the byte sequence <C2 A3> in UTF-8 represents character &A3. The halfword sequence <D800 DC00> in UTF-16 represents character &10000. 32-bit strings are treated normally (hence the string is treated as UCS-4).

UTF-8 allows access to all 31 bits of the UCS in 8-bit mode. UTF-16 allows access to between 20 and 21 bits of the UCS in 16-bit mode.

Again, string lengths continue to be in bytes. Illegal or incomplete
multibyte/multiword sequences are treated as character &FFFD. This includes “long” UTF-8 forms such as <C0 80>.

2.8 File format extensions

The basic file formats remain unaltered. However, various bugs/limitations in the Font Manager and FontEd’s handling of large files have been fixed.

16-bit composite characters are now supported (but see note below about dependency bytes).
FontEd now handles an arbitrary number of dependency bytes. However the Font Manager only handles 4 dependency bytes (32 bits). Thus composite character inclusions must be within the first 32 chunks (1024 characters).
Bitmap files with > 256 characters are now supported. Font_MakeBitmap now produces version 8 files. Bitmap files are expected to be in the same order and directory as the Outlines file (so latin1 etc subdirectories are no longer used.)

Some limitations (as of 3.43) still remain; for example scaffold data is
still limited to 64K.

3. API changes

3.1 SaveFontCache/LoadFontCache

*LoadFontCache now checks the base address of a saved font cache before loading it and reports a sensible error if not correct. Note that the font cache format for the UCS Font Manager is totally incompatible with previous versions.

3.2 Font_ReadFontMetrics

As of version 3.42, Font_ReadFontMetrics call will no longer return kerning information into the buffer in R5. Further, the bbox, xwidth and ywidth information is only returned for the first 256 characters of the font.

This call is of doubtful use for fonts with up to 2 billion characters, so you should instead use Font_CharBBox, or Font_ScanString, instead of using R1 to R3 in this call.

To obtain kerning information, you can now pass a buffer in R6 – this will return kerning information in a different (but similar) format. Note that unlike the other data returned by this call, the kerning data is unscaled (in 1/1000ths em, rather than millipoints), and character codes are internal codes. You can use Font_EnumerateCharacters to map external codes to internal codes.

The kerning information is indexed by a hash table. The hash function used is:

(first letter) EOR (second letter ROR 4)

where the rotate happens in 8 bits.

Size	Description
256×4	hash table giving offset from table start of first kern pair for each possible value (0-255) of hash function
4	offset of end of all kern pairs from table start
4	flag word: bit 1 set => no x offsets bit 2 set => no y offsets bit 31 set => ‘short’ kern pairs
?	kern pair data

Each kern pair consists of the internal code of the first letter of the kern
pair, followed by the internal code of the second letter of the kern pair,
followed by the x offset in 1/1000 em, followed by the y offset in 1/1000 em.

If bit 31 of the flag word is clear, then each kern pair is held in 2 words, each field being 16 bits:

word 0	bits 0-15	first character code
word 0	bits 15-31	second character code
word 1	bits 0-15	x offset
word 1	bits 16-31	y offset

If bit 31 of the flag word is set, then each kern pair is held in 1 word. This can only happen if all kern pairs apply to the first 256 characters, and kerning is only in one direction:

bits 0-7	first character code
bits 8-15	second character code
bits 16-31	x or y offset

3.3 Font_EnumerateCharacters (SWI &400A9)

This call can be used to determine which characters are present in a font, and which glyphs in the underlying font file characters map to.

On entry
R0	Font handle (0 for current)
R1	Character code (0 to start enumeration)

On exit
R0	Corrupted
R1	Next character (-1 if this character is the last)
R2	Internal character code of this character (-1 if unmapped)

Interrupts

Interrupt status is undefined
Fast interrupts are enabled

Processor Mode

Processor is in SVC mode

Re-entrancy

SWI is not re-entrant

Use

This call works only by looking at encoding files – it cannot guarantee that a given character is actually defined in a font file, but it can say which characters definitely aren’t, by returning with R2 set to -1.

For example, for the font “\FHomerton.Medium\EUTF8”, a call sequence might be:

On entry	On exit
R1 = 0	R1 = &20 (space) R2 = -1
..
R1 = &10F (d-caron)	R1 = &112 (E macron)	R2 = &151
R1 = &112 (E-macron)	R1 = &113 (e macron)	R2 = &185
R1 = &113 (e-macron)	R1 = &116 (E dot)	R2 = &195
..
R1 = &FB02 (fl ligature)	R1 = -1	R2 = &FF

So, in this example, we see that the fl ligature character (Unicode FB02), is character &FF within the Homerton.Medium font file.

This call may be of use in conjuction with Font_ReadFontMetrics’s kerning data, and for font manipulation by MakePSFont. Also, it would allow a character map program to scan through the UCS space to find defined characters.

Related SWIs

None

Related vectors

None

3.4 *Configure FontMax

As of version 3.42, the size of FontMax is set in units of 64K, rather than 4K. The same byte of CMOS is used, but it is now possible to configure FontMax up to a maximum of 16,320K. See *Configure FontMax.

4. Future enhancements

There are still a number of outstanding features that would be desirable:

Glyph substitution, eg “A” + “combining acute” = “Aacute”, or “f” + “i” = “fi ligature”.
Allowing more than 64K of scaffolding data.
Glyph aliases (eg a glyph might match both “Dstroke” and “Eth”).
Compression.
Dependencies outside the first 1024 characters.
More scaffold lines per character (for CJK fonts)
Displaying a symbol for a missing character.
PostScript printing (the current PostScript printer driver can’t handle the UTF8 encoding).

5. References

Adobe, “Unicode and Glyph Names” version 1.1.
http://partners.adobe.com/supportservice/devrelations/typeforum/unicodegn.html (Copy lodged in ROMFonts’ Doc directory.)
ISO/IEC 10646-1:1993, “Universal Multiple-Octet Coded Character Set (UCS)”
The Unicode Standard, Version 2.1

Created on April 21, 2014 11:42:44 by Chris (121)? (62.30.208.154)

Search the Wiki

Social

and

ROOL Store

Buy RISC OS Open merchandise here, including SD cards for Raspberry Pi and more.

Donate! Why?

Help ROOL make things happen – please consider donating!

RISC OS IPR

RISC OS is an Open Source operating system owned by RISC OS Developments Ltd and licensed primarily under the Apache 2.0 license.