URL_ParseURL

(SWI &83E07)

Parse URLs to / from their constituent parts.

Entry
R0	Flags:
	Bit 0 =>	If set, R5 contains number of words in data block, else a default of 10 words is assumed.
	Bit 1 =>	If set, character codes 0 to 31 and 127 in the URL will be escaped (hex encoded, e.g. space becomes ‘%20’) – only available in URL 0.42 or later. URL 0.38 through to 0.41 inclusive always escape these characters. Versions prior to 0.38 never do this.
	Bits 31..2	Reserved (0)
R1	Reason code:
	0 =>	Return component buffer requirements
	1 =>	Return component data in specified buffers
	2 =>	Construct full URL from component buffers
	3 =>	‘Quick parse’
R2	Pointer to base URL
R3	Pointer to URL relative to base URL (or NULL if none)
R4	Pointer to data block of R5 words if R0:0 is set, or of at least 10 words if R0:0 is clear (when R1=3, R4 instead points to the output buffer; see below)
R5	If R0:0 set, size of R4 block in words

If R3 is non-NULL, it is assumed to point to a partial URL which needs to be resolved with respect to the base URL pointed to by R2. If R3 is NULL, then R2 is assumed to point to a full URL.
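As a minimal sketch of the entry conditions above, the following Python fragment builds the R0 flags word and works out how many words the data block is taken to hold. It runs on any host and does not call the SWI; the constant and function names are this sketch's own and do not come from any official header.

```python
# Illustrative names only; these are assumptions of this sketch.
FLAG_BLOCK_SIZE_IN_R5 = 1 << 0    # R0 bit 0: R5 holds the block size in words
FLAG_ESCAPE_CONTROL = 1 << 1      # R0 bit 1: escape codes 0-31 and 127 (URL 0.42+)
RESERVED_MASK = 0xFFFFFFFC        # R0 bits 31..2 must be zero

DEFAULT_BLOCK_WORDS = 10          # assumed by the SWI when bit 0 is clear

REASON_RETURN_LENGTHS = 0
REASON_RETURN_DATA = 1
REASON_COMPOSE = 2
REASON_QUICK_PARSE = 3


def entry_flags(block_words=None, escape=False):
    """Return (R0 value, effective number of words in the data block)."""
    flags = 0
    words = DEFAULT_BLOCK_WORDS
    if block_words is not None:
        flags |= FLAG_BLOCK_SIZE_IN_R5
        words = block_words
    if escape:
        flags |= FLAG_ESCAPE_CONTROL
    # Reserved bits are never set by this helper.
    assert flags & RESERVED_MASK == 0
    return flags, words


print(entry_flags())           # default block of 10 words, no flags set
print(entry_flags(4, True))    # caller supplies a 4-word block and asks for escaping
```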

Exit
R0	Flags:
	Bits 31..0	Reserved (0)
All other registers preserved
SWI is not re-entrant
Interrupt status undefined
Data block at R4 is updated in line with entry reason code

Use

This SWI is used to parse URLs into their constituent parts, enabling clients to extract the various fields from the URL in a reliable manner. The call is also capable of resolving a relative URL to produce a fully-qualified URL, and of reconstructing a full URL from a set of components.
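The idea of resolving a relative URL against a base can be illustrated on a host machine with Python's standard `urllib.parse.urljoin`. This is only an analogy for the R2/R3 convention: the URL module's own resolution and canonicalisation rules may differ in detail, and the `resolve` helper and the relative URLs used here are hypothetical.

```python
from urllib.parse import urljoin


def resolve(base, relative=None):
    """Model of the R2/R3 convention: no relative URL means base is used as is."""
    if relative is None:
        return base
    return urljoin(base, relative)


print(resolve("http://www.acorn.com/browser/"))
print(resolve("http://www.acorn.com/browser/", "docs/index.html"))
print(resolve("http://www.acorn.com/browser/", "../pub/"))
```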

The data block referred to above is either a block of integers which will be updated to contain the size of the required buffer for each element, or a block containing pointers to buffers for the actual data.

All strings are zero-terminated and all lengths include space for the zero terminator.

The number of entries in the block is specified in R5 if R0:0 is set on entry. If R0:0 is clear, then the default value of 10 is assumed. The format of the data block is:

Offset	Usage
+0	Fully canonicalised URL
+4	URL protocol (e.g. “http”, “ftp”) forced to lower-case
+8	Hostname (e.g. “www.acorn.com”) forced to lower-case
+12	Port (e.g. “80”)
+16	Username – used for FTP authentication and mailto
+20	Password – for FTP
+24	Account – for FTP
+28	Path (e.g. “pub/riscos/releases”) [See note]
+32	Query – for HTTP, things after a query character
+36	Fragment – for HTTP, things after a hash character
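The table above maps naturally onto ten consecutive 32-bit words. The sketch below labels the words of a raw block by field, assuming little-endian words as on ARM under RISC OS. The field names are this sketch's own, and treating a shorter block as simply the first few fields is an assumption.

```python
import struct

# Field names are illustrative; the order follows the offset table.
FIELDS = ("full_url", "protocol", "hostname", "port", "username",
          "password", "account", "path", "query", "fragment")

# Each entry is one 32-bit word, so offsets go up in steps of 4.
OFFSETS = {name: 4 * index for index, name in enumerate(FIELDS)}


def unpack_block(raw, words=10):
    """Decode `words` little-endian 32-bit values and label them by field."""
    values = struct.unpack_from("<%dI" % words, raw)
    return dict(zip(FIELDS, values))


# A made-up block whose words are simply 0..9, to show the labelling.
example = struct.pack("<10I", *range(10))
print(OFFSETS["path"], unpack_block(example)["path"])
```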

It is anticipated that this SWI will be called twice: first to find the lengths of the buffers, then to copy the data into them. The URLs pointed to by R2 and R3 (if used) need not be fully-qualified. For example, R2 may point to “www.acorn.com/browser/”. The entry at block+0 is the fully-qualified, canonicalised form of the URL, which in this example would be “http://www.acorn.com/browser/”.

During canonicalisation, the port number will be elided if possible. See the discussion under SWI URL_ProtocolRegister for details of how URL discovers whether this is possible or not.

[Note] The path will not start with a ‘/’ unless the URL being parsed explicitly specified one, in keeping with the URL specification. For example, given the URL “http://www.acorn.com/browser/”, the path component is “browser/”, not “/browser/”: the slash between the hostname and path is a separator only, not a part of either component.
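Many host-side URL libraries report the path with the separator slash included, so the convention in the note is worth seeing side by side. The sketch below uses Python's `urllib.parse.urlsplit` and drops exactly one leading slash. The helper name is hypothetical, and reading "explicitly specified" as a second slash after the separator is an assumption of this sketch.

```python
from urllib.parse import urlsplit


def swi_style_path(url):
    """Return the path with the host/path separator slash removed."""
    path = urlsplit(url).path
    # Drop only the single separator; any further slash belongs to the path.
    return path[1:] if path.startswith("/") else path


print(urlsplit("http://www.acorn.com/browser/").path)     # host library: "/browser/"
print(swi_style_path("http://www.acorn.com/browser/"))    # this SWI's convention: "browser/"
```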

The entry reason codes are described below.

URL_ParseURL_ReturnLengths (R1 = 0)

Work out space required for URL components.

When R1 is 0 on entry to the SWI, the data block is treated as a block of unsigned 32-bit integers. The contents of the block are ignored on entry, but on exit are filled in with the lengths of the individual components of the URL. A value of zero is stored for a field which does not exist; non-zero values include space for a zero-byte terminator.
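The length convention can be modelled as follows: an absent field yields zero, and a present field yields its length plus one for the terminator. The sketch then allocates one buffer per present field, ready for a second call. Both helper names are hypothetical, the component values are made up for illustration, and nothing here calls the SWI.

```python
def lengths_for(components):
    """Model the reason code 0 output: 0 if absent, else byte length plus terminator."""
    return [0 if c is None else len(c.encode("latin-1")) + 1 for c in components]


def allocate_buffers(lengths):
    """One zero-filled buffer per present field; None where the length is zero."""
    return [bytearray(n) if n else None for n in lengths]


lengths = lengths_for(["http", None, "browser/"])
buffers = allocate_buffers(lengths)
print(lengths)
print([None if b is None else len(b) for b in buffers])
```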

URL_ParseURL_ReturnData (R1 = 1)

Split a URL into its component parts.

When R1 is 1 on entry to the SWI, the data block is treated as a block of pointers to buffers to receive the components of the URL. Each of the pointers in the data block must be either zero, indicating that the caller is not interested in that field, or point to a buffer which is sufficiently long to receive the field. The client can ensure this by having previously used reason code 0 to determine the length required.
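Once the buffers have been filled, each holds a zero-terminated string. A minimal host-side sketch of reading them back is shown below, with `None` standing in for a zero pointer (a field the caller did not ask for). The helper names, the Latin-1 decoding and the sample buffer contents are assumptions of this sketch.

```python
def read_c_string(buffer):
    """Return the text up to, but not including, the first zero byte."""
    return bytes(buffer).split(b"\0", 1)[0].decode("latin-1")


def collect(names, buffers):
    """Pair field names with decoded strings, skipping fields with no buffer."""
    return {name: read_c_string(buf)
            for name, buf in zip(names, buffers) if buf is not None}


# Made-up buffer contents, as they might look after a reason code 1 call.
filled = [bytearray(b"http\0"), None, bytearray(b"browser/\0")]
print(collect(["protocol", "port", "path"], filled))
```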

URL_ParseURL_ComposeFromComponents (R1 = 2)

Combine the components of a URL.

When R1 is 2 on entry to the SWI, the data block is treated as containing the broken-down fields of a URL. Each of the pointers in the data block must be either zero or point to a buffer containing the zero-terminated value of the component, with the exception of the full URL field at offset +0, which is a pointer to a buffer to receive the fully canonicalised URL. This buffer is filled in on exit.
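To make the idea of composition concrete, here is a rough host-side model that joins a few components back into a URL, with the slash acting purely as a separator between hostname and path. It is not the module's algorithm: the `compose` function is hypothetical, it ignores the username, password and account fields, and it makes no attempt to elide default ports.

```python
def compose(protocol, hostname, port=None, path=None, query=None, fragment=None):
    """Join components into a URL; an absent field is passed as None."""
    url = protocol.lower() + "://" + hostname.lower()
    if port:
        url += ":" + port
    # The slash is a separator only, so the path is stored without it.
    url += "/" + (path or "")
    if query:
        url += "?" + query
    if fragment:
        url += "#" + fragment
    return url


print(compose("HTTP", "www.acorn.com", path="browser/"))
```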

URL_ParseURL_QuickResolve (R1 = 3)

Quickly obtain a fully resolved URL.

When R1 is 3 on entry to the SWI, R4 points to a buffer for receiving the fully resolved URL. R5 is the length of the buffer. On exit, the buffer is filled in with the fully resolved URL obtained, and R5 is decreased by the length of the URL (including terminating zero byte). Hence R5 will be negative on exit if the buffer wasn’t large enough. There is no fixed rule for calculating the minimum buffer length required for the answer. To guarantee that the buffer is large enough, it should be calculated as:

length(base URL) + length(relative URL) + 4

If R0:1 is set on entry, potentially the entire URL could be hex encoded, so the above figure should be multiplied by three. URL 0.37 and earlier never hex encode URLs. URL 0.38, 0.39, 0.40 and 0.41 always do so; control through R0:1 was introduced in v0.42. Clients that do not know about this bit (and therefore leave R0:1 unset) will find that 0.42 and later do not automatically escape URLs, which is a more sensible default on the whole.
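The sizing rule and the meaning of R5 on exit can be sketched as simple arithmetic. The function names are hypothetical, and taking "length" to mean the string length without its terminator is an assumption of this sketch.

```python
def quick_resolve_buffer_size(base, relative=None, escaping=False):
    """length(base URL) + length(relative URL) + 4, tripled if escaping may occur."""
    size = len(base) + (len(relative) if relative else 0) + 4
    return size * 3 if escaping else size


def r5_on_exit(r5_on_entry, resolved_url):
    """R5 is decreased by the URL length including its zero terminator."""
    return r5_on_entry - (len(resolved_url) + 1)


base, relative = "http://www.acorn.com/browser/", "index.html"
size = quick_resolve_buffer_size(base, relative)
remaining = r5_on_exit(size, "http://www.acorn.com/browser/index.html")
print(size, remaining, "too small" if remaining < 0 else "fits")
```

Passing `escaping=True` is appropriate when R0:1 is set, and also when the code may run on URL 0.38 to 0.41, which always escape.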

Characters which are already hex encoded in URLs are left alone in all versions of the URL module.
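The factor of three follows from each escaped character becoming three characters. A small model of the escaping described for R0:1 is given below, covering character codes 0 to 31 and 127 only. The function name is hypothetical and the use of upper-case hex digits is an assumption; existing escapes such as "%20" pass through untouched because "%" itself is outside the escaped range.

```python
def escape_control_codes(url):
    """Hex encode character codes 0 to 31 and 127; leave everything else alone."""
    return "".join(
        "%%%02X" % ord(ch) if ord(ch) < 32 or ord(ch) == 127 else ch
        for ch in url
    )


print(escape_control_codes("a\tb%20c"))          # the tab becomes %09, %20 is unchanged
print(len(escape_control_codes("\x01\x02")))     # worst case: three times the input length
```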

Clients are strongly recommended to use this reason code if they wish to resolve a relative URL or canonicalise a URL and are only interested in the fully resolved and canonicalised form of the URL, since it is significantly faster than using reason code 0 and then reason code 1. To help reduce the chances of wildly over-allocating buffer space, setting of R0:1 is not recommended unless full hex escaping is definitely required.

Revised on October 27, 2024 17:51:52 by Dave Higton (1515)? (109.157.36.54)
