XML Validation

1. Introduction

The database supports implicit and explicit validation of XML documents. Implicit validation is performed automatically when documents are inserted into the database; explicit validation is performed using XQuery extension functions.

2. Implicit validation

To enable this feature the eXist-db configuration must be changed by editing the file conf.xml. The following items must be configured:

Example: Default configuration

    <validation mode="auto">
        <entity-resolver>
            <catalog uri="${WEBAPP_HOME}/WEB-INF/catalog.xml" />
        </entity-resolver>
    </validation>

2.1. Validation mode

The mode attribute switches the validation capabilities of the (Xerces) XML parser on or off. The possible values are:

yes

Switch on validation. All XML documents will be validated. When the grammar (XML schema, DTD) documents cannot be resolved, the document is rejected.

no

Switch off validation. All well-formed XML documents will be accepted.

auto (default)

Validation of an XML document is performed based on the contents of the document. When a document contains a reference to a grammar (XML schema or DTD), the XML parser tries to resolve this grammar and validates the XML document against it, just as if mode="yes" were configured. If the grammar cannot be resolved, the XML document is rejected. When the XML document does not contain a reference to a grammar, it is parsed as if mode="no" were configured.
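
As a minimal sketch, strict validation could be enabled by changing only the mode attribute of the default configuration shown above. The catalog entry is left unchanged; whether this setting suits a given installation is an assumption to verify, since every document without a resolvable grammar would then be rejected.

Example: Strict validation (illustrative)

    <validation mode="yes">
        <entity-resolver>
            <!-- Same catalog as in the default configuration -->
            <catalog uri="${WEBAPP_HOME}/WEB-INF/catalog.xml" />
        </entity-resolver>
    </validation>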

2.2. Catalog Entity Resolver

All grammar (XML schema, DTD) files that are to take part in the implicit validation process must be registered with the database using OASIS catalog files. These catalog files can be stored on disk or in the database.

In the example above, ${WEBAPP_HOME} is substituted by a file:// URL pointing to the 'webapp' directory of eXist (e.g. '$EXIST_HOME/webapp/') or to the equivalent directory of a deployed WAR file when eXist is deployed in a servlet container (e.g. '${CATALINA_HOME}/webapps/exist/').

A catalog that is stored in the database can be addressed by a URL like 'xmldb:exist:///db/mycollection/catalog.xml' (note the three slashes) or the shorter equivalent '/db/mycollection/catalog.xml'.

Example: Default OASIS catalog file

    <?xml version="1.0" encoding="UTF-8"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
        <public publicId="-//PLAY//EN" uri="entities/play.dtd"/>
        <system systemId="play.dtd" uri="entities/play.dtd"/>
        <system systemId="mondial.dtd" uri="entities/mondial.dtd"/>

        <uri name="http://exist-db.org/samples/shakespeare" uri="entities/play.xsd"/>

        <uri name="http://www.w3.org/XML/1998/namespace" uri="entities/xml.xsd"/>
        <uri name="http://www.w3.org/2001/XMLSchema" uri="entities/XMLSchema.xsd"/>

        <uri name="urn:oasis:names:tc:entity:xmlns:xml:catalog" uri="entities/catalog.xsd" />
    </catalog>

Any number of catalog entries can be configured in the entity-resolver section of conf.xml.
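
For illustration, the following sketch registers two catalogs: the default one on disk and a second one stored in the database. The collection name /db/mycollection is the placeholder used earlier in this section, not a collection that exists by default.

Example: Multiple catalog entries (illustrative)

    <validation mode="auto">
        <entity-resolver>
            <!-- Default catalog on disk -->
            <catalog uri="${WEBAPP_HOME}/WEB-INF/catalog.xml" />
            <!-- Additional catalog stored in the database (placeholder path) -->
            <catalog uri="xmldb:exist:///db/mycollection/catalog.xml" />
        </entity-resolver>
    </validation>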

2.3. Collection validation configuration

The validation mode for each individual collection can be configured using collection.xconf documents, in the same way as these are used for configuring indexes. These documents need to be stored in '/db/system/config/db/....'.

Example: collection.xconf

    <?xml version='1.0'?>
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <validation mode="yes"/>
    </collection>

3. Explicit validation

The database provides two extension functions to perform XML validation from an XQuery script. The first function returns a simple true or false, while the second generates an XML validation report. Both functions accept either one or two parameters.

Explanation of parameters:

$a

XML document as

  • xs:anyURI pointing to an XML resource (e.g. 'xmldb:exist:///db/mycollection/doc.xml')

  • node (element or document node)

$b

xs:anyURI pointing to

  • an OASIS catalog XML file (URI ends with ".xml")

  • a grammar file (URI ends with ".xsd" or ".dtd")

  • a collection (URI ends with "/") inside the database; XSDs are queried directly using the appropriate namespace ("http://www.w3.org/2001/XMLSchema"), while DTDs are resolved by querying OASIS XML catalog documents (namespace "urn:oasis:names:tc:entity:xmlns:xml:catalog").

Example: Validation report

    <report>
        <status>invalid</status>
        <time>62</time>
        <message level="Error" line="12" column="15">cvc-complex-type.2.4.a: Invalid content was 
        found starting with element 'name'. One of '{"http://jmvanel.free.fr/xsd/addressBook":cname}' is expected.</message>
    </report>

4. Grammar management

The XML parser (Xerces) compiles all grammar files (DTD, XSD) when they are used. For efficiency, these compiled grammars are cached for reuse, which dramatically increases validation speed. Under certain conditions (e.g. during grammar development), however, this cache must be cleared. Two grammar management functions are available.

Example: Cached grammars Report

    <?xml version='1.0'?>
    <report>
    <grammar type="http://www.w3.org/2001/XMLSchema">
        <Namespace>http://www.w3.org/XML/1998/namespace</Namespace>
        <BaseSystemId>file:/Users/guest/existdb/trunk/webapp//WEB-INF/entities/XMLSchema.xsd</BaseSystemId>
        <LiteralSystemId>http://www.w3.org/2001/xml.xsd</LiteralSystemId>
        <ExpandedSystemId>http://www.w3.org/2001/xml.xsd</ExpandedSystemId>
    </grammar>
    <grammar type="http://www.w3.org/2001/XMLSchema">
        <Namespace>http://www.w3.org/2001/XMLSchema</Namespace>
        <BaseSystemId>file:/Users/guest/existdb/trunk/schema/collection.xconf.xsd</BaseSystemId>
    </grammar>
    </report>

Note: the element BaseSystemId typically does not provide useful information.

5. Interactive Client

The interactive shell mode of the Java client provides a simple validate command that accepts arguments similar to those used for explicit validation.

6. XML instance examples

This section provides a number of XML fragments demonstrating the required format of the XML documents. Note that a root element should always have a reference to a namespace.

Example: namespace

The simplest reference to an XML schema. The xmlns declaration is used by the parser to resolve the grammar document.

    <?xml version='1.0'?>
    <addressBook xmlns="http://jmvanel.free.fr/xsd/addressBook">
        .....
    </addressBook>

Example: schemaLocation

xsi:schemaLocation provides additional information to the parser on how to resolve the grammar file. According to the XML schema specifications, this information is considered to be a hint and may be ignored. eXist ignores this information; the grammar is resolved from the namespace, as in the namespace example above.

    <?xml version='1.0'?>
    <addressBook xsi:schemaLocation="http://jmvanel.free.fr/xsd/addressBook http://myshost/schema.xsd" 
                 xmlns="http://jmvanel.free.fr/xsd/addressBook" 
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        .....
    </addressBook>
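
Because the grammar is resolved from the namespace in both addressBook examples, implicit validation presumably requires the namespace to be registered in a catalog, in the same way as the sample namespaces in the default catalog file. A sketch of such an entry follows; the path entities/addressBook.xsd is a hypothetical location, not a file shipped with eXist.

Example: Catalog entry for a namespace (illustrative)

    <?xml version="1.0" encoding="UTF-8"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
        <!-- Maps the namespace of the root element to a schema file (hypothetical path) -->
        <uri name="http://jmvanel.free.fr/xsd/addressBook" uri="entities/addressBook.xsd"/>
    </catalog>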

Example: noNamespaceSchemaLocation

Taken from conf.xml. The xsi:noNamespaceSchemaLocation attribute is honoured by the parser during implicit validation.

    <?xml version='1.0'?>
    <exist xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
           xsi:noNamespaceSchemaLocation="schema/conf.xsd">
        .....
    </exist>

Example: DTD DOCTYPE

Taken from 'samples/validation/dtd'. eXist resolves the grammar by searching catalog files for the PUBLIC identifier.

    <?xml version='1.0'?>
    <!DOCTYPE PLAY PUBLIC "-//VALIDATION//EN" "hamlet.dtd">
    <PLAY>
        .....
    </PLAY>
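
A catalog entry matching this DOCTYPE might look like the following sketch, modelled on the public entry in the default catalog file. The publicId is copied from the document above; the uri value is a hypothetical relative path and would need to point to the actual location of the DTD.

Example: Catalog entry for a PUBLIC identifier (illustrative)

    <?xml version="1.0" encoding="UTF-8"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
        <!-- publicId must match the PUBLIC identifier in the DOCTYPE; uri is a hypothetical path -->
        <public publicId="-//VALIDATION//EN" uri="hamlet.dtd"/>
    </catalog>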

7. Special notes

Dannes Wessels
dizzzz at exist-db.org