The UCS Font Manager is a development of the RISC OS Font Manager whose primary purpose is to allow access to more than 256 characters in a font by using UCS character codes.
Unlike the earlier Bitstream-based prototypes, the UCS Font Manager is fully backwards compatible with older versions of the Font Manager and supports almost all existing API calls. The standard RISC OS font format is used.
For backwards compatibility, font encodings continue to be specified in terms of PostScript glyph names. This is slightly space inefficient, but allows a pleasing elegance of design.
The major internal difference between the UCS Font Manager (version 3.41+) and earlier versions is that bitmaps are held in the cache in file order.
Since RISC OS 3 the Font Manager has been capable of using fonts with glyphs in an arbitrary order and remapping – this is used in the standard Corpus, Homerton and Trinity ROM fonts to make them available in any of Latin 1-4.
The remapping from external (e.g. Latin1) character codes to internal (e.g. /Base0) glyph codes was performed as outlines were turned into bitmaps, and as metrics were scaled. Hence slave fonts in the cache contained data in, say, Latin1 order, while master fonts stored data in /Base0 order.
This made sense because external codes always contained 256 characters, allowing several arrays to be fixed size, irrespective of the size of the font file. Unfortunately, the new external code UCS contains 2,147,483,648 characters, scuppering that idea. Hence the Font Manager now stores everything in file order, including bitmaps, metrics and kerning information.
This now leads to some inefficiencies in the font cache. Using the two fonts “\FHomerton.Medium\ELatin1” and “\FHomerton.Medium\ELatin2” will result in two master fonts and two slave fonts, each pair containing the same outlines and same bitmaps, differing only in the stored encoding tables. It may be worth looking at this at some stage.
Another minor change is that metrics are no longer scaled and stored in slave fonts. Only the unscaled master metrics are stored, and scaled on demand.
As of Font Manager 3.40, the behaviour of the Font Manager when presented with a font identifier with no encoding has changed subtly, but significantly.
Previously, asking for “Homerton.Medium” would cause the Font Manager to call Territory_AlphabetIdentifier to find the default alphabet for the system territory. Hence, on a UK machine you would get “\FHomerton.Medium\ELatin1”. Asking for “Selwyn” would not apply any encoding.
As of Font Manager 3.40, the behaviour is now to check the current alphabet using OS_Byte 71, and to get the name via Service_International. This means that the command *Alphabet now affects both the system font and future outline font claims.
If a font is a “symbol” font – i.e. contains an “IntMetrics” file rather than an “IntMetric<N>” file, then the default encoding is “Glyph”, rather than the system encoding, unless the system encoding is “UTF8”, in which case this is treated as the default encoding for all fonts.
To support the UCS, a number of changes have been made to the RISC OS font encoding file scheme. Firstly, the %RISCOS_BasedOn directive is now ignored. Up to now, asking for a font with encoding Latin1 would always try to find IntMetric0 and Outlines0 files, because the file Encodings.Latin1 contained the directive "%RISCOS_BasedOn 0".
Now the font manager will find the first file fitting the template “IntMet*”, and use it and its associated Outlines file. (This means that it is no longer possible to have more than one set of metrics/outlines per font.) It will then deduce the base encoding of that file from its suffix. If the file is, say, IntMetri32, it will use the encoding file “Font:Encodings./Base32”. If the file is just IntMetrics, it will use the file “Encoding” in the font directory. If that isn’t present, it will use the encoding file “Font:Encodings./Default”.
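The lookup rules above can be sketched as a small function. This is only an illustration of the described behaviour, not the Font Manager's own code: the function name, the `font_has_encoding_file` flag, and the assumption that the suffix is simply the trailing decimal digits of the leafname are all hypothetical.

```python
import re

def base_encoding_file(metrics_leafname, font_has_encoding_file):
    """Pick the base encoding file for a metrics file matching IntMet*.

    Assumption: the suffix is the run of decimal digits at the end of
    the leafname (IntMetric0 -> 0, IntMetri32 -> 32).
    """
    match = re.match(r"IntMet\D*(\d+)$", metrics_leafname)
    if match:
        # A numbered file uses the matching /Base<N> encoding.
        return "Font:Encodings./Base" + match.group(1)
    if font_has_encoding_file:
        # Plain IntMetrics: use "Encoding" in the font directory.
        return "Encoding"
    # Otherwise fall back to the default encoding file.
    return "Font:Encodings./Default"
```

For example, `base_encoding_file("IntMetri32", False)` gives `"Font:Encodings./Base32"`.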
This change means that to claim a font with Latin1 encoding no longer requires that the font file be in /Base0 order.
Next, the encoding generation has been relaxed:
This permits us to claim MaruGothic.Medium (the Funai 3 Japanese font) with encoding \ELatin1, even though it doesn’t contain all the Latin1 glyphs. Possibly more importantly, it allows us to claim any font with encoding \EUTF8. No font contains all UCS/UTF-8 glyphs :)
The final change to the encoding scheme is an “extension” to the encoding file format that prevents us having to have a 2 billion line long UTF8 encoding file.
The font manager now supports a format compatible with the Adobe Glyph List, which, unlike the straight list of glyph names, allows a “sparse” encoding.
For example, the following encoding file
0005;fred;SYMBOL FOR FRED
0002;Jim;SYMBOL FOR JIM
0007;Harry;SYMBOL FOR HARRY
# Comment
22;000C;Tom;SYMBOL FOR TOM
is equivalent to the old style encoding file
/.notdef /.notdef /Jim /.notdef
/.notdef /fred /.notdef /Harry
/.notdef /.notdef /.notdef /.notdef
/Tom
The line format is [XX;]XXXX;NNN;CCCC – the first field is ignored if less than 4 digits long. The CCCC field is ignored.
This format allows the UTF8 encoding file to be a direct concatenation of the main Adobe Glyph List and ZapfDingbats Glyph List.
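As a rough illustration of the sparse format, the following sketch parses the example file above and expands it into the equivalent old-style list. The function names are invented for this example, and treating lines starting with `#` and blank lines as ignorable is an assumption based on the example rather than a documented rule.

```python
EXAMPLE = """\
0005;fred;SYMBOL FOR FRED
0002;Jim;SYMBOL FOR JIM
0007;Harry;SYMBOL FOR HARRY
# Comment
22;000C;Tom;SYMBOL FOR TOM
"""

def parse_sparse_encoding(text):
    """Return {character code: glyph name} from [XX;]XXXX;NNN;CCCC lines."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumed: comments and blank lines are skipped
        fields = line.split(";")
        if len(fields[0]) < 4:
            fields = fields[1:]  # short leading field is ignored
        mapping[int(fields[0], 16)] = fields[1]  # CCCC field is ignored
    return mapping

def to_dense(mapping):
    """Expand to an old-style list, filling gaps with /.notdef."""
    size = max(mapping) + 1
    return ["/" + mapping.get(code, ".notdef") for code in range(size)]

dense = to_dense(parse_sparse_encoding(EXAMPLE))
```

Here `dense` has 13 entries, with `/Jim` at index 2, `/fred` at 5, `/Harry` at 7 and `/Tom` at 12.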
In many respects, the encoding “UTF8” is just like any other encoding. All encodings may now contain up to 2^31 characters, and the format used for the UTF8 file can be used with any encoding. However, a number of special cases are invoked by the encoding being named “UTF8”.
The UTF8 file in Funai 3 was based on version 1.1 of the Adobe Glyph List. This has now been updated to version 1.2, and all font encoding files have been updated in line.
If the target encoding file being processed is called “UTF8”, then any identifiers in the base encoding file fitting the template /uniXXXX (where XXXX is a 4 digit upper-case hexadecimal number) will be mapped in at character code XXXX. This covers the majority of UCS characters, which don’t have PostScript glyph names.
Also, as of version 3.43, the form /uniXXXXYYYY is accepted, where XXXX is a high-surrogate (D800-DBFF) and YYYY is a low-surrogate (DC00-DFFF), to represent UCS characters in the range U+00010000 to U+0010FFFF, as specified in version 1.1 of Adobe’s “Unicode and Glyph Names”.
There is currently no way to specify codes above U+0010FFFF.
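A minimal sketch of how the two /uni forms map to character codes follows. The function name is hypothetical, and returning `None` for names that fit neither form is a choice made for this example only.

```python
import re

# Four upper-case hex digits, optionally followed by four more.
_UNI_NAME = re.compile(r"^/uni([0-9A-F]{4})([0-9A-F]{4})?$")

def uni_name_to_code(name):
    """Map /uniXXXX or /uniXXXXYYYY to a UCS code, else None."""
    match = _UNI_NAME.match(name)
    if match is None:
        return None
    first = int(match.group(1), 16)
    if match.group(2) is None:
        return first  # plain /uniXXXX maps in at code XXXX
    second = int(match.group(2), 16)
    # The long form is only valid as a high/low surrogate pair.
    if 0xD800 <= first <= 0xDBFF and 0xDC00 <= second <= 0xDFFF:
        return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)
    return None
```

The largest pair, `/uniDBFFDFFF`, yields &10FFFF, which is why higher codes cannot be expressed this way.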
The prioritisation of double-mappings described in “Unicode and Glyph Names” is operative. For example, if a font’s encoding contains “/mu”, the glyph will be given UCS codes 00B5 and 03BC. If a font’s encoding contains both “/mu” and “/uni03BC”, then /mu will be mapped at 00B5, and /uni03BC at 03BC.
However, for non-UTF8 encodings, names must currently match exactly. So if a Greek encoding file contains “/mu”, then it will only match a glyph called “/mu”, not one called “/uni03BC”, and vice-versa.
If the target encoding for a font handle is “UTF8”, then UTF-8 and UTF-16 processing will be invoked for 8 and 16-bit forms of Font_Paint, etc. respectively. (See below).
A new encoding “Glyph” is available. This is a special case which requests a font with no encoding applied. Thus the font “\FHomerton.Medium\EGlyph” would contain all 300+ Homerton.Medium glyphs, in the order they are in the font file.
The SWIs Font_Paint and Font_ScanString now accept strings in 8, 16 or 32-bit forms. (All other SWIs only accept 8-bit strings). This is invoked by bits in R2:
bit 12 set | R1 points to a 16-bit string (bit 13 must be clear) |
bit 13 set | R1 points to a 32-bit string (bit 12 must be clear) |
A 16-bit string must be half-word aligned, and consists of a number of 16-bit halfwords. A 32-bit string must be word aligned, and consists of a number of 32-bit words.
For example, the string “Hello” might be represented as follows:
8-bit | 48 65 6C 6C 6F 00 |
16-bit | 0048 0065 006C 006C 006F 0000 |
32-bit | 00000048 00000065 0000006C 0000006C 0000006F 00000000 |
This allows access to characters outside the first 256. Control codes work as before, but have to be introduced by a character of the appropriate width. The parameters of control sequences are handled differently in 16 and 32-bit modes:
16-bit parameter forms:
9,11: | dx/dy comprised of 2 16-bit words |
17: | 1 16-bit word |
18: | 3 16-bit words |
19: | 2 x (RR00, BBGG word pairs) + 1 16-bit word |
21: | sequence of 16-bit words, terminated by a word < &0020 |
25: | 2 16-bit words |
26: | 1 16-bit word |
27: | 4 32-bit words (word-aligned) |
28: | 6 32-bit words (word-aligned) |
32-bit parameter forms:
9,11: | 1 32-bit word |
17: | 1 32-bit word |
18: | 3 32-bit words |
19: | 3 32-bit words |
21: | sequence of 32-bit words, terminated by a word < &00000020 |
25: | 2 32-bit words |
26: | 1 32-bit word |
27: | 4 32-bit words |
28: | 6 32-bit words |
Regardless of the setting of the 16/32-bit bits, string lengths continue to be in bytes (such as R7 on entry to Font_Paint).
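To make the three string forms and their byte lengths concrete, here is a small sketch that builds each form of “Hello” together with the R2 bit that selects it. The names are invented for this example, and little-endian storage of halfwords and words is an assumption (the usual ARM arrangement under RISC OS), not something stated above.

```python
import struct

R2_STRING_16BIT = 1 << 12  # bit 12: R1 points to a 16-bit string
R2_STRING_32BIT = 1 << 13  # bit 13: R1 points to a 32-bit string

def pack_string(text, width):
    """Pack text as a zero-terminated 8, 16 or 32-bit string.

    Assumption: halfwords and words are stored little-endian.
    """
    codes = [ord(ch) for ch in text] + [0]
    unit = {8: "B", 16: "H", 32: "I"}[width]
    return struct.pack("<%d%s" % (len(codes), unit), *codes)

hello8 = pack_string("Hello", 8)
hello16 = pack_string("Hello", 16)
hello32 = pack_string("Hello", 32)
```

Including the terminator, the three buffers are 6, 12 and 24 bytes long, and it is these byte counts that a length such as R7 would be expressed in.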
If a font’s encoding is “UTF8”, then rather than taking “straight” 8 and 16 bit strings, strings are decoded according to UTF-8 or UTF-16. For example, the byte sequence <C2 A3> in UTF-8 represents character &A3. The halfword sequence <D800 DC00> in UTF-16 represents character &10000. 32-bit strings are treated normally (hence the string is treated as UCS-4).
UTF-8 allows access to all 31 bits of the UCS in 8-bit mode. UTF-16 allows access to between 20 and 21 bits of the UCS in 16-bit mode.
Again, string lengths continue to be in bytes. Illegal or incomplete multibyte/multiword sequences are treated as character &FFFD. This includes “long” UTF-8 forms such as <C0 80>.
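The decoding rules can be sketched as below. This is an illustrative decoder, not the Font Manager's implementation: the function names are invented, and the choice to emit one &FFFD per bad sequence and resume at the next unconsumed unit is an assumption, since the text does not say how resynchronisation works.

```python
REPLACEMENT = 0xFFFD

# Smallest code that genuinely needs each UTF-8 sequence length.
_MIN_FOR_LENGTH = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}

def _sequence_length(lead):
    """Length implied by a lead byte; 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    for length, limit in ((2, 0xE0), (3, 0xF0), (4, 0xF8), (5, 0xFC), (6, 0xFE)):
        if 0xC0 <= lead < limit:
            return length
    return 0  # stray continuation byte, or &FE/&FF

def decode_utf8(data):
    """Decode bytes to a list of UCS codes (up to 31 bits)."""
    out, i = [], 0
    while i < len(data):
        n = _sequence_length(data[i])
        if n <= 1:
            out.append(data[i] if n == 1 else REPLACEMENT)
            i += 1
            continue
        code = data[i] & (0x7F >> n)
        j = i + 1
        while j < i + n and j < len(data) and (data[j] & 0xC0) == 0x80:
            code = (code << 6) | (data[j] & 0x3F)
            j += 1
        # Incomplete or over-long ("long form") sequences are rejected.
        if j - i < n or code < _MIN_FOR_LENGTH[n]:
            code = REPLACEMENT
        out.append(code)
        i = j
    return out

def decode_utf16(halfwords):
    """Decode 16-bit units to UCS codes, pairing surrogates."""
    out, i = [], 0
    while i < len(halfwords):
        h = halfwords[i]
        nxt = halfwords[i + 1] if i + 1 < len(halfwords) else 0
        if 0xD800 <= h <= 0xDBFF and 0xDC00 <= nxt <= 0xDFFF:
            out.append(0x10000 + ((h - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        else:
            out.append(REPLACEMENT if 0xD800 <= h <= 0xDFFF else h)
            i += 1
    return out
```

With these definitions, `<C2 A3>` decodes to &A3, `<D800 DC00>` to &10000, and `<C0 80>` to &FFFD, matching the examples in the text.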
The basic file formats remain unaltered. However, various bugs/limitations in the Font Manager and FontEd’s handling of large files have been fixed.
Some limitations (as of 3.43) still remain; for example scaffold data is still limited to 64K.
*LoadFontCache now checks the base address of a saved font cache before loading it and reports a sensible error if not correct. Note that the font cache format for the UCS Font Manager is totally incompatible with previous versions.
As of version 3.42, the Font_ReadFontMetrics call no longer returns kerning information in the buffer in R5. Further, the bbox, xwidth and ywidth information is only returned for the first 256 characters of the font.
This call is of doubtful use for fonts with up to 2 billion characters, so you should use Font_CharBBox or Font_ScanString rather than R1 to R3 in this call.
To obtain kerning information, you can now pass a buffer in R6 – this will return kerning information in a different (but similar) format. Note that unlike the other data returned by this call, the kerning data is unscaled (in 1/1000ths em, rather than millipoints), and character codes are internal codes. You can use Font_EnumerateCharacters to map external codes to internal codes.
The kerning information is indexed by a hash table. The hash function used is:
(first letter) EOR (second letter ROR 4)
where the rotate happens in 8 bits.
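As a sketch, the hash can be written as follows. The function name is invented, and using only the low 8 bits of each internal code is an assumption made here so that the result always falls in the range 0-255; the text only states that the rotate is 8 bits wide.

```python
def kern_hash(first, second):
    """(first letter) EOR (second letter ROR 4), rotating within 8 bits.

    Assumption: only the low byte of each internal code takes part.
    """
    a = first & 0xFF
    b = second & 0xFF
    rotated = ((b >> 4) | (b << 4)) & 0xFF  # 8-bit rotate right by 4
    return a ^ rotated
```

For instance, codes &41 and &56 hash to &41 EOR &65 = &24.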
Size | Description |
256×4 | hash table giving offset from table start of first kern pair for each possible value (0-255) of hash function |
4 | offset of end of all kern pairs from table start |
4 | flag word: bit 1 set => no x offsets; bit 2 set => no y offsets; bit 31 set => ‘short’ kern pairs |
? | kern pair data |
Each kern pair consists of the internal code of the first letter of the kern pair, followed by the internal code of the second letter of the kern pair, followed by the x offset in 1/1000 em, followed by the y offset in 1/1000 em.
If bit 31 of the flag word is clear, then each kern pair is held in 2 words, each field being 16 bits:
word 0 | bits 0-15 | first character code |
word 0 | bits 16-31 | second character code |
word 1 | bits 0-15 | x offset |
word 1 | bits 16-31 | y offset |
If bit 31 of the flag word is set, then each kern pair is held in 1 word. This can only happen if all kern pairs apply to the first 256 characters, and kerning is only in one direction:
bits 0-7 | first character code |
bits 8-15 | second character code |
bits 16-31 | x or y offset |
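The two kern pair layouts can be unpacked with a few shifts and masks, as in this sketch. The function names are hypothetical, and treating the 16-bit offsets as signed two's complement values is an assumption (kerning offsets are commonly negative), not something stated above.

```python
def _signed16(value):
    """Interpret a 16-bit field as two's complement (assumed)."""
    return value - 0x10000 if value & 0x8000 else value

def decode_long_pair(word0, word1):
    """Two-word form: returns (first, second, x offset, y offset)."""
    first = word0 & 0xFFFF           # word 0, bits 0-15
    second = (word0 >> 16) & 0xFFFF  # word 0, bits 16-31
    x_offset = _signed16(word1 & 0xFFFF)          # word 1, bits 0-15
    y_offset = _signed16((word1 >> 16) & 0xFFFF)  # word 1, bits 16-31
    return first, second, x_offset, y_offset

def decode_short_pair(word):
    """One-word 'short' form: returns (first, second, offset).

    Whether the offset is x or y depends on the flag word.
    """
    first = word & 0xFF           # bits 0-7
    second = (word >> 8) & 0xFF   # bits 8-15
    offset = _signed16((word >> 16) & 0xFFFF)  # bits 16-31
    return first, second, offset
```

The offsets returned are in 1/1000 em, as for the table as a whole.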
This call can be used to determine which characters are present in a font, and which glyphs in the underlying font file characters map to.
On entry | |
---|---|
R0 | Font handle (0 for current) |
R1 | Character code (0 to start enumeration) |
On exit | |
---|---|
R0 | Corrupted |
R1 | Next character (-1 if this character is the last) |
R2 | Internal character code of this character (-1 if unmapped) |
Interrupt status is undefined
Fast interrupts are enabled
Processor is in SVC mode
SWI is not re-entrant
This call works only by looking at encoding files – it cannot guarantee that a given character is actually defined in a font file, but it can say which characters definitely aren’t, by returning with R2 set to -1.
For example, for the font “\FHomerton.Medium\EUTF8”, a call sequence might be:
On entry | On exit | |
---|---|---|
R1 = 0 | R1 = &20 (space) | R2 = -1 |
.. | | |
R1 = &10F (d-caron) | R1 = &112 (E-macron) | R2 = &151 |
R1 = &112 (E-macron) | R1 = &113 (e-macron) | R2 = &185 |
R1 = &113 (e-macron) | R1 = &116 (E dot) | R2 = &195 |
.. | | |
R1 = &FB02 (fl ligature) | R1 = -1 | R2 = &FF |
So, in this example, we see that the fl ligature character (Unicode FB02) is character &FF within the Homerton.Medium font file.
This call may be of use in conjunction with the kerning data from Font_ReadFontMetrics, and for font manipulation by MakePSFont. Also, it would allow a character map program to scan through the UCS space to find defined characters.
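The enumeration loop a character map program might use can be sketched as follows. The `enumerate_swi` parameter is a hypothetical stand-in for however the caller issues Font_EnumerateCharacters: it is assumed to take (R0, R1) on entry and return (R1, R2) on exit. No real binding is implied.

```python
def enumerate_characters(enumerate_swi, handle=0):
    """Walk a font's characters, returning {external code: internal code}.

    enumerate_swi is a hypothetical callable wrapping the SWI:
    enumerate_swi(handle, code) -> (next_code, internal_code).
    """
    mapping = {}
    code = 0  # R1 = 0 starts the enumeration
    while code != -1:
        next_code, internal = enumerate_swi(handle, code)
        if internal != -1:
            # R2 = -1 means this character is unmapped.
            mapping[code] = internal
        code = next_code  # R1 = -1 on exit marks the last character
    return mapping
```

Inverting the resulting dictionary gives the internal-to-external mapping needed to interpret kerning data, subject to the caveat above that the call only consults encoding files.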
None
None
As of version 3.42, the size of FontMax is set in units of 64K, rather than 4K. The same byte of CMOS is used, but it is now possible to configure FontMax up to a maximum of 16,320K. See *Configure FontMax.
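The arithmetic behind the new limit is simple, as this sketch shows. The function name is invented, and it assumes the stored byte is multiplied directly by the unit size.

```python
FONTMAX_UNIT_K = 64  # units of 64K as of version 3.42 (previously 4K)

def fontmax_kilobytes(cmos_byte, unit_k=FONTMAX_UNIT_K):
    """Convert the single CMOS byte (0-255) to a FontMax size in K."""
    if not 0 <= cmos_byte <= 255:
        raise ValueError("FontMax is stored in a single byte")
    return cmos_byte * unit_k
```

The largest byte value, 255, gives 255 × 64K = 16,320K, compared with 255 × 4K = 1,020K under the old units.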
There are still a number of outstanding features that would be desirable: