Wilbur: XML Parser

  1. Using the XML Parser
  2. XML Parser API
  3. Producer/Consumer API ("SAX-like")
  4. Token Classes
  5. Condition Classes
  6. Auxiliary Parser Functionality
  7. Simple String Dictionaries

1. Using the XML Parser

To invoke the XML parser, call either the function parse-from-file or the function parse-from-stream. Underneath, the parser itself is implemented by an instance of the class xml-parser. The parser uses a "SAX-like" consumer/producer API. During parsing, the parser may signal various conditions to indicate that something unexpected has happened (a syntax error, for example). These errors can be caught, but are typically not continuable. Certain warnings are signaled using the class xml-warning.

parse-from-file (file parser-class &rest options) [Function]

This function will read a file and parse its contents. Parsing is done using a parser instantiated from the class parser-class, with options passed to make-instance. Internally, the file is opened and parse-from-stream is called on the resulting stream. The parameter file can be anything accepted by the Common Lisp function open.

parse-from-stream (stream locator parser-class &rest options) [Function]

This function will read a document from an open stream and parse its contents, using a parser created by passing parser-class and options to make-instance. The parameter locator is the URL of the source of the stream (if applicable, or nil otherwise).

2. XML Parser API

xml-parser [Class]
:expand-namespaces-p [Initarg]

This is the base class of XML parsers. The initarg expand-namespaces-p, when true (the default), will cause all namespace references to be expanded (by concatenating the namespace URI with the name in question); if false, names are left unexpanded. This class is a subclass of sax-producer.

get-entity (parser name) [Generic function]

This function will return the definition of an entity named name, or nil if the entity has not been defined. It is called by the parser when it encounters entity references.

(setf get-entity) (definition parser name) [Generic function]

This function will assign the string definition as the definition of the entity named name. It is called by the parser when it encounters entity definitions.
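
As an illustration of the accessor pair above, here is a minimal, self-contained sketch of how a setf-able entity table might be structured. The class entity-table-sketch and the function sketch-get-entity are hypothetical stand-ins backed by a hash table; they are not Wilbur's implementation, and only the argument order mirrors the signatures documented here.

```lisp
;; Hypothetical stand-in for a parser that stores entity definitions.
(defclass entity-table-sketch ()
  ((entities :initform (make-hash-table :test #'equal)
             :reader sketch-entities)))

(defgeneric sketch-get-entity (parser name)
  (:documentation "Return the definition of entity NAME, or NIL if undefined."))

(defgeneric (setf sketch-get-entity) (definition parser name)
  (:documentation "Store the string DEFINITION under the entity NAME."))

(defmethod sketch-get-entity ((parser entity-table-sketch) name)
  ;; VALUES drops GETHASH's second return value.
  (values (gethash name (sketch-entities parser))))

(defmethod (setf sketch-get-entity) (definition (parser entity-table-sketch) name)
  (setf (gethash name (sketch-entities parser)) definition))

;; Usage: define an entity, then look it up.
(let ((parser (make-instance 'entity-table-sketch)))
  (setf (sketch-get-entity parser "copy") "(c)")
  (list (sketch-get-entity parser "copy")      ; "(c)"
        (sketch-get-entity parser "unknown"))) ; NIL
```

Note that in a setf function the new value comes first in the lambda list, which is why definition precedes parser and name.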

get-canonical-uri (parser uri) [Generic function]

This function will return the "canonical" form of the URI uri. It can be used to map old or alternate namespace URIs (for example) to new or canonical ones.

(setf get-canonical-uri) (canonical-uri parser uri) [Generic function]

This function will assign canonical-uri (a string) as the "canonical" version of the URI uri (a string). See the definition of the function get-canonical-uri for a discussion of "canonical" URIs.

parse (parser stream locator) [Generic function]

This function will initiate the parsing process, reading from the open input stream stream. The parameter locator is the URL of the source of the stream.

expand-name-with-namespace (string namespaces) [Function]

This function will expand string (of the format prefix:name) into a full URI, using the a-list namespaces of prefix-to-URI mappings.

reverse-expand-name (string namespaces) [Function]

This function will perform the opposite of expand-name-with-namespace, taking the URI string and converting it back into the prefix:name format.
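
The two directions of name expansion can be sketched in plain Common Lisp as follows. This is only an illustration of the concatenation scheme described above, not Wilbur's code: the functions sketch-expand-name and sketch-reverse-expand-name are hypothetical, and returning the input unchanged when no mapping applies is an assumption of this sketch (the real parser may signal a condition instead).

```lisp
(defun sketch-expand-name (string namespaces)
  "Expand \"prefix:name\" by concatenating the URI mapped to the prefix
with the local name. NAMESPACES is an a-list of (prefix . uri) pairs."
  (let ((colon (position #\: string)))
    (if (null colon)
        string                          ; assumption: no prefix, leave as-is
        (let ((uri (cdr (assoc (subseq string 0 colon) namespaces
                               :test #'string=))))
          (if uri
              (concatenate 'string uri (subseq string (1+ colon)))
              string)))))               ; assumption: unknown prefix, leave as-is

(defun sketch-reverse-expand-name (string namespaces)
  "Turn a full URI back into \"prefix:name\" using the first namespace
whose URI is a leading substring of STRING."
  (dolist (entry namespaces string)
    (let ((prefix (car entry))
          (uri (cdr entry)))
      (when (and (<= (length uri) (length string))
                 (string= uri string :end2 (length uri)))
        (return (concatenate 'string prefix ":"
                             (subseq string (length uri))))))))

;; Usage with an illustrative namespace a-list:
(let ((namespaces '(("dc" . "http://purl.org/dc/elements/1.1/"))))
  (list (sketch-expand-name "dc:title" namespaces)
        (sketch-reverse-expand-name "http://purl.org/dc/elements/1.1/title"
                                    namespaces)))
;; => ("http://purl.org/dc/elements/1.1/title" "dc:title")
```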

xml-formatter [Class]
:stream [Initarg]

This class, a subclass of sax-consumer, will take a stream of consumer/producer events and create a textual representation of the XML document represented by the events. The XML text is written into the output stream initialized using the initarg stream.

replay (formatter events) [Generic function]

This function will take a list of events (token subclasses) and use an xml-formatter instance formatter to create a textual representation of the XML document represented by the events.

3. Producer/Consumer API ("SAX-like")

The Producer/Consumer API is a bit like the SAX API for Java. The general idea is that an XML parser, as the "producer", generates "events" during parsing which are communicated to a "consumer" of these events. The events form a stream, and this stream can pass through "filters" (objects which are both consumers and producers). In the implementation, the events are communicated using function (method) calls, passing "tokens" to the consumer(s). The token classes are described in a later section of this document. The figure below shows the entire class hierarchy of the consumer and producer classes.

[Figure: consumer/producer class hierarchy]

sax-producer [Class]
:consumer [Initarg]

Base class for event producers. The initarg consumer initializes the consumer of the producer.

sax-producer-consumer (producer) [Generic function]

Accesses the consumer of the producer.

(setf sax-producer-consumer) (consumer producer) [Generic function]

Assigns the consumer of the producer. The method implementation of this generic function will also set the "inverse" relation, i.e., assign the producer of the consumer.

sax-consumer [Class]
:producer [Initarg]

Base class for event consumers. The initarg producer initializes the producer of the consumer.

sax-consumer-producer (consumer) [Generic function]

Accesses the producer of the consumer.

start-document (consumer locator) [Generic function]

This event is generated when the parsing of a document begins. The parameter locator is the URL of the document.

end-document (consumer mode) [Generic function]

This event is generated when the parsing of a document has been completed. See the function sax-consumer-mode for a discussion of parser modes and the parameter mode.

start-element (consumer tag mode) [Generic function]

This event is generated for every start tag in an XML document. The parameter tag is an open-tag instance. See the function sax-consumer-mode for a discussion of parser modes and the parameter mode.

end-element (consumer tag mode) [Generic function]

This event is generated for every end tag in an XML document. The parameter tag is the corresponding open-tag instance (not the close-tag instance which you might have expected). See the function sax-consumer-mode for a discussion of parser modes and the parameter mode.

char-content (consumer content mode) [Generic function]

This event is generated for every occurrence of character content in an XML document (in simple terms, the stuff between start and end tags that is not another tag). The parameter content is a string. See the function sax-consumer-mode for a discussion of parser modes and the parameter mode.

proc-instruction (consumer instruction mode) [Generic function]

This event is generated for every processing instruction in an XML document. The parameter instruction is an instance of proc-instruction. See the function sax-consumer-mode for a discussion of parser modes and the parameter mode.

find-first-producer (consumer) [Generic function]

Given a chain of producers and consumers, this method will return the first producer of the chain (the source of events, that is).
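
To make the producer/consumer linkage concrete, here is a small self-contained sketch of a chain in which assigning a consumer also records the inverse link, and in which the source of events is found by walking upstream. All names (chain-producer, chain-consumer, chain-filter, chain-first-producer) are hypothetical stand-ins for illustration; the real classes and functions are the ones documented in this section.

```lisp
(defclass chain-producer ()
  ((consumer :initform nil :accessor chain-producer-consumer)))

(defclass chain-consumer ()
  ((producer :initform nil :accessor chain-consumer-producer)))

;; A filter is both a consumer and a producer.
(defclass chain-filter (chain-consumer chain-producer) ())

;; Setting the consumer of a producer also sets the "inverse" relation.
(defmethod (setf chain-producer-consumer) :after
    ((consumer chain-consumer) (producer chain-producer))
  (setf (chain-consumer-producer consumer) producer))

(defun chain-first-producer (node)
  "Walk upstream from NODE and return the source of events, or NIL."
  (let ((upstream (and (typep node 'chain-consumer)
                       (chain-consumer-producer node))))
    (cond (upstream (chain-first-producer upstream))
          ((typep node 'chain-producer) node)
          (t nil))))

;; Usage: source -> filter -> sink
(let ((source (make-instance 'chain-producer))
      (filter (make-instance 'chain-filter))
      (sink (make-instance 'chain-consumer)))
  (setf (chain-producer-consumer source) filter
        (chain-producer-consumer filter) sink)
  (eq (chain-first-producer sink) source)) ; => T
```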

sax-consumer-mode (parser) [Generic function]

Every event method (except start-document) will accept a mode parameter. The value of this parameter is calculated by calling the sax-consumer-mode method of the consumer in question. In other words, a call (in the parser implementation) to, say, start-element, might look like this:
(start-element consumer tag (sax-consumer-mode consumer))
The idea is that any consumer can specialize the event methods on this parameter (typically using eql specializers) and thus easily implement a state machine, for example. The implementation (i.e., any method) for sax-consumer-mode is sax-consumer subclass-dependent. The default method returns nil.
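
The state-machine idea can be sketched without the parser at all. In the following self-contained example, a hypothetical consumer collects the character content of "title" elements by specializing its event methods on the mode with eql specializers. The names mode-demo-consumer, demo-start-element, demo-char-content, demo-end-element and demo-run are invented for this illustration, and tags are represented by plain strings rather than open-tag instances.

```lisp
(defclass mode-demo-consumer ()
  ((mode :initform :outside :accessor demo-consumer-mode)
   (titles :initform '() :accessor demo-consumer-titles)))

(defgeneric demo-start-element (consumer tag-name mode))
(defgeneric demo-char-content (consumer content mode))
(defgeneric demo-end-element (consumer tag-name mode))

;; Default methods: ignore the event in any mode not handled below.
(defmethod demo-start-element ((c mode-demo-consumer) tag-name mode)
  (declare (ignore tag-name mode))
  nil)
(defmethod demo-char-content ((c mode-demo-consumer) content mode)
  (declare (ignore content mode))
  nil)
(defmethod demo-end-element ((c mode-demo-consumer) tag-name mode)
  (declare (ignore tag-name mode))
  nil)

;; State transitions, selected by EQL specializers on the mode.
(defmethod demo-start-element ((c mode-demo-consumer) tag-name
                               (mode (eql :outside)))
  (when (string= tag-name "title")
    (setf (demo-consumer-mode c) :in-title)))

(defmethod demo-char-content ((c mode-demo-consumer) content
                              (mode (eql :in-title)))
  (push content (demo-consumer-titles c)))

(defmethod demo-end-element ((c mode-demo-consumer) tag-name
                             (mode (eql :in-title)))
  (when (string= tag-name "title")
    (setf (demo-consumer-mode c) :outside)))

(defun demo-run (consumer events)
  "Feed EVENTS, a list of (kind data) pairs, to CONSUMER; return the titles."
  (dolist (event events (reverse (demo-consumer-titles consumer)))
    (destructuring-bind (kind data) event
      ;; The mode is computed afresh for every event, as in the call above.
      (let ((mode (demo-consumer-mode consumer)))
        (ecase kind
          (:start (demo-start-element consumer data mode))
          (:chars (demo-char-content consumer data mode))
          (:end (demo-end-element consumer data mode)))))))

(demo-run (make-instance 'mode-demo-consumer)
          '((:start "doc") (:chars "skipped")
            (:start "title") (:chars "Hello") (:end "title")
            (:end "doc")))
;; => ("Hello")
```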

4. Token Classes

The XML parser passes instances of various subclasses of token to the "event" methods of the producer/consumer interface. The table below lists these classes and indicates what kind of processing (if any) the parser performs for each type of token (remember, this is an XML parser built to enable RDF parsing; it is not a general XML parser).

Note that char-content is shown as a subclass of token; this is true only conceptually. In the implementation, the char-content "tokens" are Common Lisp strings.

The figure below shows the hierarchy of token classes.

[Figure: token class hierarchy]

The following table describes the classes in terms of their processing and whether they are instantiable.
 

Class                 Class Type            Processing                                      External?
token                 abstract              n/a                                             yes
tag                   abstract              n/a                                             no
open-tag              concrete              processed                                       yes
close-tag             concrete              processed                                       yes
proc-instruction      concrete              ignored                                         yes
declaration           abstract              n/a                                             no
entity-declaration    concrete              processed                                       yes
element-declaration   concrete              ignored (xml-warning signaled)                  no
attlist-declaration   concrete              ignored (xml-warning signaled)                  no
comment               concrete              ignored                                         yes
(char-content)        (Common Lisp string)  processed                                       yes
dtd-bracket           abstract              n/a                                             no
dtd-start             concrete              ignored (xml-warning signaled if external DTD)  no
dtd-end               concrete              ignored                                         no

token [Class]

This is the abstract base class of all tokens.

token-string (token) [Generic function]

Accesses a string which names the token (for tags, it is the tag name; for others, some descriptive, useful representation of the token).

open-tag [Class]

Class of start tags.

close-tag [Class]

Class of end tags.

tag-counterpart (tag) [Generic function]

Accesses the corresponding open-tag instance of a close-tag instance.

tag-empty-p (open-tag) [Generic function]

True if an open-tag instance has no content (i.e., there is no end tag).

tag-attribute (open-tag attribute) [Generic function]

Accesses the attribute (named using attribute) of an open-tag instance.

tag-namespaces (open-tag) [Generic function]

Accesses the list (an a-list) of all namespace definitions for the element to which the open-tag instance belongs (this includes all namespaces declared with this particular tag, as well as all namespaces of enclosing elements).

proc-instruction [Class]

Class of processing instructions (token-string accesses the name of the processing instruction). The current implementation generates these but then ignores them.

entity-declaration [Class]

Class of entity declarations. Note that token-string accesses the definition of the entity (the "expanded" form, that is).

entity-name (entity-declaration) [Generic function]

Accesses the name of the entity declaration (the "unexpanded" form, that is, minus the ampersand and semicolon).

comment [Class]

Class of comments (token-string accesses the entire text of the comment).

5. Condition Classes

All condition classes (except xml-warning) are subclasses of xml-error. The figure below shows the hierarchy of condition classes, including the RDF conditions.

[Figure: xml-error condition class hierarchy]

xml-error [Condition class]
:thing [Initarg]

This is the abstract base class of all XML parser condition classes. The initarg thing is an error-dependent object useful for the processing of the error.

error-thing (error) [Generic function]

Accesses the thing object of the error (its semantics are subclass-dependent).

syntax-error [Condition class]

This is the class of all syntax errors. The function error-thing accesses some descriptive object useful in characterizing the error.

pi-termination-problem [Condition class]

Signaled for unterminated or improperly terminated processing instructions (error-thing accesses the name of the PI).

dtd-termination-problem [Condition class]

Signaled for improperly terminated DTDs.

unexpected-end-tag [Condition class]
:expectation [Initarg]

Signaled for those end tags which we did not expect (i.e., the current corresponding start tag has some other name). The function error-thing accesses the end tag, and the function error-expectation (initialized using the initarg expectation) accesses the expected corresponding start tag.

error-expectation (unexpected-end-tag) [Generic function]

Accesses the corresponding start tag of the expected end tag.
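
The relationship between the offending end tag and the expected start tag can be illustrated with a tiny nesting checker. This is a self-contained sketch, not the parser's logic: the condition sketch-unexpected-end-tag, its readers, and the function sketch-check-nesting are hypothetical, and tags are represented by plain name strings.

```lisp
(define-condition sketch-unexpected-end-tag (error)
  ((thing :initarg :thing :reader sketch-error-thing)
   (expectation :initarg :expectation :reader sketch-error-expectation))
  (:report (lambda (condition stream)
             (format stream "Unexpected end tag ~S (open element: ~S)"
                     (sketch-error-thing condition)
                     (sketch-error-expectation condition)))))

(defun sketch-check-nesting (events)
  "EVENTS is a list of (:start name) and (:end name) pairs.
Return T if every element is closed, NIL if some remain open;
signal SKETCH-UNEXPECTED-END-TAG on a mismatched end tag."
  (let ((stack '()))
    (dolist (event events (null stack))
      (destructuring-bind (kind name) event
        (ecase kind
          (:start (push name stack))
          (:end
           (let ((expected (pop stack)))
             (unless (and expected (string= expected name))
               (error 'sketch-unexpected-end-tag
                      :thing name :expectation expected)))))))))

;; Usage: catch the error and inspect both slots.
(handler-case
    (sketch-check-nesting '((:start "a") (:start "b") (:end "a")))
  (sketch-unexpected-end-tag (condition)
    (list (sketch-error-thing condition)
          (sketch-error-expectation condition))))
;; => ("a" "b")
```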

unknown-declaration [Condition class]

Signaled for any unrecognized declaration.

unknown-character-reference [Condition class]

Signaled when an unrecognized character reference is encountered. The function error-thing accesses the reference.

malformed-url [Condition class]

Signaled when the parsing of a URL string fails. The function error-thing accesses the ill-formed URL string.

feature-not-supported [Condition class]

Signaled when some feature not supported by the current implementation is encountered. The function error-thing accesses some descriptive representation of the feature.

missing-definition [Condition class]
:type [Initarg]

This abstract condition class represents the situation where a reference is made to something which has not been defined yet. The concrete subclasses of this class are used with missing entity and namespace definitions. The initarg type initializes the definition type.

error-definition-type (missing-definition) [Generic function]

Accesses the definition type of a missing-definition instance.

missing-entity-definition [Condition class]

Signaled when a reference is made to an entity which has not been defined. Continuing from this error will cause the "unexpanded" form to be used instead. The function error-thing accesses the unknown entity name.
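
"Continuing" from an error relies on the standard Common Lisp restart mechanism. The following self-contained sketch shows one way such behavior could be arranged with cerror and the continue restart. The condition sketch-missing-entity and both functions are hypothetical, and it is an assumption of this sketch (not something this document states) that the parser establishes its restart in this particular way.

```lisp
(define-condition sketch-missing-entity (error)
  ((thing :initarg :thing :reader sketch-missing-thing))
  (:report (lambda (condition stream)
             (format stream "Undefined entity ~S"
                     (sketch-missing-thing condition)))))

(defun sketch-resolve-entity (name entities)
  "Return the definition of NAME from the a-list ENTITIES. If NAME is
undefined, signal a continuable error; continuing yields the unexpanded form."
  (or (cdr (assoc name entities :test #'string=))
      (progn
        (cerror "Use the unexpanded form." 'sketch-missing-entity :thing name)
        (format nil "&~A;" name))))

(defun sketch-resolve-leniently (name entities)
  "Like SKETCH-RESOLVE-ENTITY, but always continue past missing entities."
  (handler-bind ((sketch-missing-entity
                   (lambda (condition)
                     ;; Invoke the CONTINUE restart established by CERROR.
                     (continue condition))))
    (sketch-resolve-entity name entities)))

(list (sketch-resolve-leniently "amp" '(("amp" . "&")))
      (sketch-resolve-leniently "foo" '(("amp" . "&"))))
;; => ("&" "&foo;")
```

A handler established with handler-bind runs before the stack is unwound, which is what allows it to invoke the restart; handler-case would unwind first and make continuing impossible.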

missing-namespace-definition [Condition class]

Signaled when a reference is made to a namespace which has not been defined. Continuing from this error will cause the "unexpanded" form to be used instead. The function error-thing accesses the unknown namespace prefix.

xml-warning [Condition class]

This condition class is used whenever the system needs to issue a warning.

xml-warning (message &rest args) [Macro]

This macro is used in place of warn to issue warnings of the type xml-warning.

6. Auxiliary Parser Functionality

*current-parser* [Variable]

This variable is bound to the current XML parser instance during parsing. The purpose of this variable is to provide access to the parser instance from reader-macro functions (which are not methods).

read-using (readtable stream &optional recursivep) [Function]

This function will bind the readtable passed in readtable to *readtable* and call read in the following manner:
(read stream t nil recursivep)
This function is useful when used with the XML name readtable (bound to variable *name-reader*) when XML-style names have to be read outside the parser.
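
The behavior described for read-using amounts to a dynamic rebinding of *readtable* around a call to read. The sketch below restates that in a self-contained form; sketch-read-using is a hypothetical re-implementation written from the description above, and the case-preserving readtable is only a stand-in for the XML name readtable, whose actual configuration is not described here.

```lisp
(defun sketch-read-using (readtable stream &optional recursivep)
  "Bind *READTABLE* to READTABLE for the duration of one call to READ."
  (let ((*readtable* readtable))
    (read stream t nil recursivep)))

;; Usage: read a mixed-case name with a case-preserving copy of the
;; standard readtable (COPY-READTABLE with NIL copies the standard one).
(let ((readtable (copy-readtable nil)))
  (setf (readtable-case readtable) :preserve)
  (with-input-from-string (stream "fooBar")
    (symbol-name (sketch-read-using readtable stream))))
;; => "fooBar"
```

Because the binding is dynamic, the caller's own *readtable* is restored automatically when the function returns.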

*name-reader* [Variable]

This variable is bound to the XML name readtable, and can be used to read in XML-style names.

make-file-url (pathname) [Function]

This function will turn a Common Lisp pathname (or a string) into a file-URL string.

make-http-url (host port path) [Function]

This function will construct an http-URL from a host, a port and a path (all strings, except that port can also be an integer).

parse-url (url) [Function]

This function will parse a URL string and return two values: the URL type (either :file or :http) and a property list of URL components (these are :host for the host name, :port for the TCP port, and :path for the path component of the URL; file-URLs only have the path component). If the URL string cannot be parsed, a malformed-url condition is signaled.
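
The shape of the two return values can be illustrated with a deliberately simplified, self-contained sketch. The function sketch-parse-url and the condition sketch-malformed-url are hypothetical; details such as returning the port as an integer, defaulting it to 80, and defaulting the path to "/" are assumptions of this sketch and may not match the actual parse-url.

```lisp
(define-condition sketch-malformed-url (error)
  ((thing :initarg :thing :reader sketch-url-thing)))

(defun sketch-parse-url (url)
  "Return two values: the URL type and a property list of components."
  (flet ((prefixp (prefix)
           (and (>= (length url) (length prefix))
                (string-equal prefix url :end2 (length prefix)))))
    (cond ((prefixp "file://")
           (values :file (list :path (subseq url 7))))
          ((prefixp "http://")
           (let* ((rest (subseq url 7))
                  (slash (or (position #\/ rest) (length rest)))
                  (authority (subseq rest 0 slash))
                  (colon (position #\: authority)))
             (values :http
                     (list :host (subseq authority 0 colon)
                           :port (if colon
                                     (parse-integer authority :start (1+ colon))
                                     80) ; assumption: default HTTP port
                           :path (if (< slash (length rest))
                                     (subseq rest slash)
                                     "/")))))
          (t (error 'sketch-malformed-url :thing url)))))

;; Usage: destructure both values, then pick components with GETF.
(multiple-value-bind (type components)
    (sketch-parse-url "http://example.org:8080/data/test.rdf")
  (list type
        (getf components :host)
        (getf components :port)
        (getf components :path)))
;; => (:HTTP "example.org" 8080 "/data/test.rdf")
```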

-whitespace-chars- [Variable]

This constant is a string containing all those characters which the parser considers to be whitespace.

7. Simple String Dictionaries

String dictionaries are an abstraction which allows string keys to be mapped to string values. They are used, for example, in the implementation of tag attributes and namespace mappings. String comparisons are case-sensitive (they use string=).

string-dict-get (keys&values key) [Function]

Get the value from the string dictionary keys&values corresponding to key, or nil if not found.

string-dict-get-by-value (keys&values value) [Function]

Get the first key from the string dictionary keys&values corresponding to value, or nil if not found.

string-dict-add (keys&values key value) [Function]

Return a list where an association from key to value has been added to the string dictionary keys&values. This function will not destructively modify keys&values.

string-dict-del (keys&values key) [Function]

Return a list where the association from key (to its corresponding value) in the string dictionary keys&values has been deleted. This function destructively modifies keys&values.

do-string-dict ((key value dict) &body body) [Macro]

This macro will execute the forms in body in a loop where, in turn, variables key and value are bound to the consecutive mappings of the string dictionary dict.
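
For readers who want a mental model of these operations, here is a self-contained sketch that treats a string dictionary as an a-list of (key . value) pairs compared with string=. The names sd-get, sd-get-by-value, sd-add, sd-del and sd-do are hypothetical, and the a-list representation itself is an assumption of this sketch; the functions documented above may be implemented differently (for example, in how an already-present key is handled by string-dict-add).

```lisp
(defun sd-get (dict key)
  "Return the value stored under KEY, or NIL if not found."
  (cdr (assoc key dict :test #'string=)))

(defun sd-get-by-value (dict value)
  "Return the first key whose value is VALUE, or NIL if not found."
  (car (rassoc value dict :test #'string=)))

(defun sd-add (dict key value)
  "Return a dictionary with KEY mapped to VALUE; DICT is not modified."
  (acons key value dict))

(defun sd-del (dict key)
  "Return a dictionary without KEY. May destructively modify DICT,
so callers should always use the returned list."
  (delete key dict :key #'car :test #'string=))

(defmacro sd-do ((key value dict) &body body)
  "Run BODY once per mapping, with KEY and VALUE bound in turn."
  (let ((entry (gensym "ENTRY")))
    `(dolist (,entry ,dict)
       (let ((,key (car ,entry))
             (,value (cdr ,entry)))
         ,@body))))

;; Usage: build a small attribute dictionary and iterate over it.
(let ((attributes (sd-add (sd-add '() "id" "x1") "lang" "en"))
      (seen '()))
  (sd-do (key value attributes)
    (push (format nil "~A=~A" key value) seen))
  (list (sd-get attributes "id")    ; "x1"
        (sd-get attributes "ID")    ; NIL, since comparison is case-sensitive
        (reverse seen)))            ; ("lang=en" "id=x1")
```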


Copyright © 2001 Nokia. All Rights Reserved.
Subject to the NOKOS License version 1.0
Author: Ora Lassila (ora.lassila@nokia.com)