Saturday, May 15, 2010

XML Namepool

Following a conversation in the Open Discussion forum of Saxon, I've started to think about different aspects of a XML namepool again. I have implemented a quick@dirty one for LivCos quite a while ago already, but always kept in mind to optimize it, once I'll find the time. Well, current weather conditions in Switzerland tell me not to go hang gliding, so...

First a deeper study of the net.saxon.om.Namepool. Here some summary notes:
- int name code
- 10bit prefix per URI index | 10bit bucket index | 10bit local name hash
- special treatment for "StandardNames"
- name code 0-1023
- 3bit URI | 7bit localname
- special codes for URIs and prefixes
- namespace code: 16bit prefix code | 16bit URI code
- URI code: index to "uri" array
- prefix code: index to "prefixes" array
- NameEntry object for every localName/nsURI
- local name hash table to lookup NameEntry
- linked NameEntry list per hash slot
- QNames and clark-names are built on demand
- int[] for every URI to hold the prefixes-per-URI
- bit mask for fingerprints 0xfffff
- max. fingerprints: ~1024..~1000000 (depending on the hash distribution)
- max. URI codes: 32000
- max. prefix codes: 32000
- max. prefixes per URI: 1023
- String.intern() on NameEntry.localname
- String.intern() on clark names

I guess, the implementation focuses mostly on solving the concrete requirements of Saxon. Namespace URI lookups for XSLT, XML Schema,... are handled via switch cases, while others are scanned via equals (probably less important in the contexts used). The hash code / bucket index part of a non-standard name-code seems to be a compact way of addressing the NameEntries and also seems to be able to handle thousands of fingerprints efficiently. A simple String.intern() helps to compare clark-names quickly. Caches for qualified names don't seem necessary.
I like the idea of encoding the prefix index per namespace URI.

My older LivCos Namepool implementation is focusing on qualified names (for some reason the SAX interface asks for them a lot). The name code consist of 12bits for the namespace URI index and 20 bits for the qualified name index. So it can only handle 4000 namespaces. But it's fast to compare, lookup qualified names. For the fingerprint it needs to index one step deeper.

The QualifiedName namepool implementation, suggested in the Saxon discussion, models every name in a uri-localname-prefix Java object. Here the drawbacks are difficulty to integrate into Saxon, the memory usage (specially on 64bit systems) and the large number of small objects for the JVM to handle.

Yet another new Idea
We start with an int name code. We focus on the fingerprints and encode a simple index to a fingerprint array into the name code. Then we use 10 bits for the prefix-per-URI index. => 4 million fingerprints, 1000 prefixes per URI. A fingerprint entry contains an index to the local-name array and one to the URI array. The URI index is also used to address the list of prefix codes. Lookup and compares should be quite fast.

The fingerprint index could also be used to address a possible clark-name cache. Qualified name caches would become more difficult.

Wait for my findings and some comparison data in the next post.

No comments:

Post a Comment