Tuesday, June 1, 2010

XML Namepool 3

In the previous 2 posts about XML namepools I have tried to find the optimal implementation for the data model. In this post I'm gonna collect my results while optimizing the performance of the StringArray (=> Cosmos4) and Saxon2 (=> Saxon5) implementation for concurrent name allocation.

Double-check
Since XML often contains a large number of elements with the same name (and type), the lookup of existing names is successful many times. For these "allocations" of already added names, no synchronization is necessary. A second, synchronized lookup ensures to add every name only once.

Pre-allocation
The resizing of the int and String arrays is probably a big time consumer. Also when one thread needs to resize an array, all the following threads need to wait until this thread has allocated a new, bigger array and has copied the old values to their new destination.
At least while the allocation of the new array, the following threads could still add to the old one, when the resizing is triggered before the old array is really full. The triggering thread would still be slowed down, but some of the later thread would not have to wait.
Also the copying of the part before the trigger could be done without the need to sync with the following threads and the synchronized block would become even shorter.

The resizing process could also be placed into a separate "resizing" thread. Any allocation thread would then only be slowed down while copying the running part of the array (added since the pre-allocation trigger) in the worst case.

The actual gain in performance of this approach is difficult to measure and depends strongly on the concrete usage scenario.

Concurrency Test
For a quick&easy test of the concurrent allocation of XML names I have simply extended the tests with 16 threads, each allocating all the names.
Protocol           SAXON2       SAXON5       STRARRAY    COSMOS4 
                 [ms]  [kB]   [ms]  [kB]   [ms]  [kB]  [ms]  [kB]
ALLOCATE         4250  2623   1122  2623    948 11387   343  2761
READ_PREFIX        95     0    106     0     57     0    57 -8906
READ_URI           81     0     83     0     57     0    57     0
READ_LOCALNAME     87     0     84     0     52     0    52     0
READ_QNAME        905  6235    918  6235    136  6235    84  6235
READ_CLARKNAME    161  2671    175  2671    N/A   N/A   N/A   N/A
GET_URI_CODE      827     0    782     0    N/A   N/A   N/A   N/A
GET_NS_CODE      1512     0   1407     0    N/A   N/A   N/A   N/A
GET_NS_CODE2       65     0     85     0    N/A   N/A   N/A   N/A
GET_PREFIX_CODE   604     0    554     0    N/A   N/A   N/A   N/A
COMPARE_FP         15     0     15     0     78     0    89     0
COMPARE_NAME       31     0     77     0     82     0    82     0
COMPARE_QNAME      99     0    108     0     89     0    89     0
While this is not really a realistic scenario, it shows the improvement of the double-check approach.

Why does Saxon5 need double the time of Saxon2 to compare local names?

Outcome
While the improvement by replacing the name pool for LivCos (Cosmos4) seems only minor (concurrent access is minimal), the implementation is simpler and thread safe.
A Saxon 9.2.1.1 version, patched with a Saxon5 implementation, shows a gain in performance of about 5% when used in LivCos.

Thursday, May 20, 2010

XML Namepool 2

Five different new NamePool implementations try to handle XML names most efficiently and with the smallest memory footprint possible.

Note, that these implementations are mostly prove-of-concept and have not yet been tested really!

Saxon 9.2
The current Saxon implementation of a name pool.

LivCos 0.7
The current, quick&practical name pool implementation for LivCos.

QualifedName Object
Strait forward approach, modeling the qualified name as simple Java objects.

LivCos2
An optimized implementation with a fingerprint array and hash map.

Saxon2
The optimized implementation for the Saxon interface.

Int-Wrapper for QualifiedName
Int index for an array of QualifiedNames to reduce DOM memory foot print on 64bit systems. Would allow to be integrated into Saxon.

String Array
Implementation of QualifiedName objects with a "simple" String array.

Evaluation
To evaluate the performance of these approaches, I've written a small, simple console application, stopping the time between tasks and roughly measuring memory usage. For the first run, the application generates 100'000 qualified names in 500 namespaces, each with 1 prefix (=> 200 local names per namespace) and adds them to the chosen namepool. Then it reads the different name parts, gets codes and compares parts of the name.
Protocol           SAXON           QNAME           COSMOS 
                 [ms]  [kB]      [ms]  [kB]      [ms]  [kB]
ALLOCATE         1644  2374       146 22713     13841 10178
READ_PREFIX      3365     0        53     0        35 -8593
READ_URI         3284     0        53     0        23     0
READ_LOCALNAME   3296     0        56     0        34     0
READ_QNAME       7615  6235       191  6235        22     0
READ_CLARKNAME  11545 -6235        53     0      1599  8437
GET_URI_CODE    53640     0       N/A   N/A     17262     0
GET_NS_CODE     95210     0       N/A   N/A       N/A   N/A
GET_NS_CODE2     3318     0       N/A   N/A       N/A   N/A
GET_PREFIX_CODE 32992     0       N/A   N/A       N/A   N/A
COMPARE_FP         14     0        71     0        26     0
COMPARE_NAME     3352     0        91     0        30     0
COMPARE_QNAME    3390     0        81     0        24     0
Every name is read and compared 100 times, the heap is on 1.5GB and it runs with the "-server" option.

Since Saxon handles "StandardNames" differently, a second run measures this set of standard names (~170).
Protocol           SAXON           STANDARD        SAXON2          STRARRAY
[ms]  [kB]      [ms]  [kB]      [ms]  [kB]      [ms]  [kB]
ALLOCATE          765    49       579     0       174   108       775 40049
READ_PREFIX        22     0        18   -20        35     0         9     0
READ_URI           24     0        18     0        26     0         9     0
READ_LOCALNAME     22     0        15     0        26     0         9     0
READ_QNAME        201    10       175    10       203    10        10    10
READ_CLARKNAME    328    10       321    10        29    10        10     0
GET_URI_CODE       71     0       N/A   N/A        57     0       N/A   N/A
GET_NS_CODE       125     0       N/A   N/A        74     0       N/A   N/A
GET_NS_CODE2       30     0       N/A   N/A        25     0       N/A   N/A
GET_PREFIX_CODE    58     0       N/A   N/A        36     0       N/A   N/A
COMPARE_FP         44     0        41     0        36     0        53     0
COMPARE_NAME      166     0        44     0        64     0        52     0
COMPARE_QNAME     182     0        56     0        81     0        50     0
In this run the names are "allocated" and read 10'000 times and compared 100'000 times.

The values for StandardNames differ less for the different implementations. Maybe there is a chance for Saxon to simplify, optimize the handling for StandardNames.

Optimize Saxon 9.2
The namepool in Saxon focuses on fingerprint comparison. I guess, it is important for processing XSLT efficiently. The optimized implementation for Saxon also uses a fingerprint bit mask on the name code. It uses some additional arrays and hash maps.
Protocol           SAXON           SAXON2          STRARRAY
[ms]  [kB]      [ms]  [kB]      [ms]  [kB]
ALLOCATE         1644  2374       240  2620       143 14822
READ_PREFIX      3365     0        94     0        25     0
READ_URI         3284     0        79     0        25     0
READ_LOCALNAME   3296     0        76     0        26     0
READ_QNAME       7615  6235       923  6235       119  6235
READ_CLARKNAME  11545 -6235       148  2671        27     0
GET_URI_CODE    53640     0       786     0       N/A   N/A
GET_NS_CODE     95210     0      1385     0       N/A   N/A
GET_NS_CODE2     3318     0        58     0       N/A   N/A
GET_PREFIX_CODE 32992     0       553     0       N/A   N/A
COMPARE_FP         14     0        14     0        34     0
COMPARE_NAME     3352     0        29     0        38     0
COMPARE_QNAME    3390     0        90     0        35     0
The StringArray uses the index as the name code. Still the compares seem to be really fast. Theoretically it would also allow to replace the Saxon namepool (int name codes), but it needs double the time to compare fingerprints (how big of a drop in performance for Saxon would that be?).

To try the Saxon2 implementation with the Saxon processor you also need to replace the StandardNames class with this code.

QualifiedName Objects
The major drawback for the direct approach, modeling the names as individual objects, is the need for every DOM node to hold an object reference (64bit systems). Another disadvantage is the difficulty to integrate into the existing Saxon processor. Both could be worked around by putting the objects into an array and working with the array indexes as the name codes.
Protocol           QNAME           QNAMEINT        STRARRAY
[ms]  [kB]      [ms]  [kB]      [ms]  [kB]
ALLOCATE          146 22713        82 10890       143 14822
READ_PREFIX        53     0        46     0        25     0
READ_URI           53     0        45     0        25     0
READ_LOCALNAME     56     0        45     0        26     0
READ_QNAME        191  6235       113  6235       119  6235
READ_CLARKNAME     53     0       118  8906        27     0
GET_URI_CODE      N/A   N/A       N/A   N/A       N/A   N/A
GET_NS_CODE       N/A   N/A       N/A   N/A       N/A   N/A
GET_NS_CODE2      N/A   N/A       N/A   N/A       N/A   N/A
GET_PREFIX_CODE   N/A   N/A       N/A   N/A       N/A   N/A
COMPARE_FP         71     0        68     0        34     0
COMPARE_NAME       91     0        66     0        38     0
COMPARE_QNAME      81     0        66     0        35     0
While the String Array is not really stores QualifiedName objects, it similarly compounds the five strings of a name in an block within the array.

As expected the large number of objects use a lot of memory, but the large java.util.HashMaps are somehow missing in the measurement...

Optimize LivCos
In the context of LivCos we have one big namepool, we don't need to compare fingerprint a lot and there can be many local names in the Cosmos namespace. For the SAX interface the QName needs to be read a lot.
Protocol           COSMOS          COSMOS3         STRARRAY        SAXON2       
[ms]  [kB]      [ms]  [kB]      [ms]  [kB]      [ms]  [kB]     
ALLOCATE        13841 10178       272  2578       143 14822       240  2620     
READ_PREFIX        35 -8593        75     0        25     0        94     0     
READ_URI           23     0        35     0        25     0        79     0     
READ_LOCALNAME     34     0        34     0        26     0        76     0     
READ_QNAME         22     0       793  6235       119  6235       923  6235     
READ_CLARKNAME   1599  8437       828  2671        27     0       148  2671     
GET_URI_CODE    17262     0       792     0       N/A   N/A       786     0     
GET_NS_CODE       N/A   N/A       N/A   N/A       N/A   N/A      1385     0     
GET_NS_CODE2      N/A   N/A       N/A   N/A       N/A   N/A        58     0     
GET_PREFIX_CODE   N/A   N/A       564     0       N/A   N/A       553     0     
COMPARE_FP         26     0        14     0        34     0        14     0     
COMPARE_NAME       30     0        29     0        38     0        29     0     
COMPARE_QNAME      24     0        40     0        35     0        90     0     
While COSMOS3 is an adapted SAXON2 copy, the StringArray seems to be the way to go. The key for the main hash map could be optimized and the clark name, since not really needed, could be removed.

Notes
- optimal fingerprint compare requires encoded name codes
- name code constants for StandardNames to be used in switch statements
- index or reference to the full qualified name allows simple caching of qNames and clarkNames
- self-made hash-to-int maps need improvement
- server HotSpot JVM optimizes the test noticeably

some values for the "-client" JVM
Protocol           SAXON           QNAME           STRARRAY        SAXON2          COSMOS3 
                 [ms]  [kB]      [ms]  [kB]      [ms]  [kB]      [ms]  [kB]      [ms]  [kB]
ALLOCATE         1756  2374        96 22713        68 14822       163  2620       164  2578
READ_PREFIX      3517     0       155     0        61     0       122     0       115     0
READ_URI         3445     0       153     0        64     0       102     0        75     0
READ_LOCALNAME   3421     0       158     0        59     0        75     0        68     0
READ_QNAME       8094  6235       241  6235       122  6235      1211  6235      1230  6235
READ_CLARKNAME  11983 -6235       150     0       138     0       157  2671      1372  2671
GET_URI_CODE    78621     0       N/A   N/A       N/A   N/A       967     0       968     0
GET_NS_CODE    106389     0       N/A   N/A       N/A   N/A      1486     0       N/A   N/A
GET_NS_CODE2     3457     0       N/A   N/A       N/A   N/A        89     0       N/A   N/A
GET_PREFIX_CODE 26850     0       N/A   N/A       N/A   N/A       390     0       392     0
COMPARE_FP         30     0       130     0        54     0        30     0        30     0
COMPARE_NAME     3515     0       130     0        53     0        50     0        52     0
COMPARE_QNAME    3485     0       139     0        93     0        91     0        91     0

Saturday, May 15, 2010

XML Namepool

Following a conversation in the Open Discussion forum of Saxon, I've started to think about different aspects of a XML namepool again. I have implemented a quick@dirty one for LivCos quite a while ago already, but always kept in mind to optimize it, once I'll find the time. Well, current weather conditions in Switzerland tell me not to go hang gliding, so...

First a deeper study of the net.saxon.om.Namepool. Here some summary notes:
- int name code
- 10bit prefix per URI index | 10bit bucket index | 10bit local name hash
- special treatment for "StandardNames"
- name code 0-1023
- 3bit URI | 7bit localname
- special codes for URIs and prefixes
- namespace code: 16bit prefix code | 16bit URI code
- URI code: index to "uri" array
- prefix code: index to "prefixes" array
- NameEntry object for every localName/nsURI
- local name hash table to lookup NameEntry
- linked NameEntry list per hash slot
- QNames and clark-names are built on demand
- int[] for every URI to hold the prefixes-per-URI
- bit mask for fingerprints 0xfffff
- max. fingerprints: ~1024..~1000000 (depending on the hash distribution)
- max. URI codes: 32000
- max. prefix codes: 32000
- max. prefixes per URI: 1023
- String.intern() on NameEntry.localname
- String.intern() on clark names

I guess, the implementation focuses mostly on solving the concrete requirements of Saxon. Namespace URI lookups for XSLT, XML Schema,... are handled via switch cases, while others are scanned via equals (probably less important in the contexts used). The hash code / bucket index part of a non-standard name-code seems to be a compact way of addressing the NameEntries and also seems to be able to handle thousands of fingerprints efficiently. A simple String.intern() helps to compare clark-names quickly. Caches for qualified names don't seem necessary.
I like the idea of encoding the prefix index per namespace URI.

My older LivCos Namepool implementation is focusing on qualified names (for some reason the SAX interface asks for them a lot). The name code consist of 12bits for the namespace URI index and 20 bits for the qualified name index. So it can only handle 4000 namespaces. But it's fast to compare, lookup qualified names. For the fingerprint it needs to index one step deeper.

The QualifiedName namepool implementation, suggested in the Saxon discussion, models every name in a uri-localname-prefix Java object. Here the drawbacks are difficulty to integrate into Saxon, the memory usage (specially on 64bit systems) and the large number of small objects for the JVM to handle.

Yet another new Idea
We start with an int name code. We focus on the fingerprints and encode a simple index to a fingerprint array into the name code. Then we use 10 bits for the prefix-per-URI index. => 4 million fingerprints, 1000 prefixes per URI. A fingerprint entry contains an index to the local-name array and one to the URI array. The URI index is also used to address the list of prefix codes. Lookup and compares should be quite fast.

The fingerprint index could also be used to address a possible clark-name cache. Qualified name caches would become more difficult.

Wait for my findings and some comparison data in the next post.

Wednesday, April 14, 2010

Open XSLT Processing 2

Intro
To understand the context of the problem, please read the previous Open XSLT Processing post first. Mr. Kay has given me the hint of taking the DOMSource out of the equation. Thanks!

New Tests
The new implementation adds a DOM-to-TinyTree conversion step before the actual runs. In addition it now handles 3 namespaces. The source still can be found at
http://code.google.com/p/livcos/source/browse/proto/OpenXslt/src/proto/.

JVM options: -Xms1500M -Xmx1500M -XX:NewRatio=1

Here the new results:
0 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   281ms |    17ms (-94%) |    35ms (-88%) |    50ms (-83%)
    2 |   161ms |     8ms (-95%) |    18ms (-89%) |    48ms (-70%)
    4 |   146ms |     7ms (-95%) |    17ms (-89%) |    51ms (-65%)
    6 |   137ms |     9ms (-94%) |    21ms (-85%) |    57ms (-58%)
    8 |   118ms |    10ms (-92%) |    26ms (-78%) |    74ms (-38%)
   10 |   100ms |    11ms (-89%) |    29ms (-71%) |    85ms (-15%)
   12 |    99ms |    13ms (-87%) |    34ms (-66%) |   104ms (  5%)
   15 |   104ms |    16ms (-85%) |    41ms (-61%) |   128ms ( 22%)
   20 |   106ms |    20ms (-81%) |    53ms (-50%) |   151ms ( 42%)
   40 |   119ms |    36ms (-70%) |    97ms (-19%) |   281ms (135%)
   60 |   135ms |    53ms (-61%) |   145ms (  7%) |   426ms (215%)
   80 |   152ms |    68ms (-55%) |   188ms ( 23%) |   559ms (266%)
  100 |   165ms |    84ms (-50%) |   230ms ( 39%) |   710ms (330%)
  200 |   247ms |   169ms (-32%) |   457ms ( 84%) |  1384ms (458%)
  400 |   417ms |   331ms (-21%) |   908ms (117%) |  2791ms (568%)
  600 |   578ms |   498ms (-14%) |  1372ms (137%) |  4192ms (624%)
  800 |   752ms |   664ms (-12%) |  1815ms (141%) |  5573ms (640%)
 1000 |   911ms |   835ms ( -9%) |  2269ms (148%) |  6937ms (660%)
 2000 |  1746ms |  1652ms ( -6%) |  4518ms (158%) | 13865ms (693%)
 3000 |  2571ms |  2449ms ( -5%) |  6738ms (162%) | 20759ms (707%)
 4000 |  3405ms |  3301ms ( -4%) |  9068ms (166%) | 27701ms (713%)
 6000 |  5061ms |  4932ms ( -3%) | 13570ms (168%) | 41572ms (721%)
 8000 |  6755ms |  6597ms ( -3%) | 18067ms (167%) | 55494ms (721%)
 9999 |  8368ms |  8234ms ( -2%) | 22611ms (170%) | 68887ms (723%)

1 content element, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   286ms |    16ms (-95%) |    35ms (-88%) |    51ms (-83%)
    2 |   162ms |     9ms (-95%) |    26ms (-84%) |    49ms (-70%)
    4 |   145ms |     9ms (-94%) |    24ms (-84%) |    52ms (-64%)
    6 |   138ms |    11ms (-92%) |    32ms (-77%) |    60ms (-57%)
    8 |   123ms |    12ms (-90%) |    40ms (-68%) |    77ms (-38%)
   10 |   103ms |    14ms (-87%) |    45ms (-56%) |    90ms (-13%)
   12 |   102ms |    16ms (-84%) |    53ms (-49%) |   111ms (  8%)
   15 |   107ms |    20ms (-82%) |    67ms (-38%) |   142ms ( 32%)
   20 |   110ms |    25ms (-78%) |    87ms (-21%) |   147ms ( 33%)
   40 |   128ms |    46ms (-64%) |   155ms ( 21%) |   296ms (131%)
   60 |   149ms |    67ms (-55%) |   228ms ( 52%) |   431ms (189%)
   80 |   165ms |    83ms (-50%) |   296ms ( 79%) |   576ms (249%)
  100 |   184ms |   104ms (-44%) |   366ms ( 98%) |   722ms (290%)
  200 |   286ms |   206ms (-29%) |   735ms (156%) |  1453ms (407%)
  400 |   496ms |   408ms (-18%) |  1465ms (195%) |  2898ms (484%)
  600 |   694ms |   611ms (-13%) |  2200ms (216%) |  4371ms (529%)
  800 |   899ms |   813ms (-10%) |  2903ms (222%) |  5767ms (541%)
 1000 |  1103ms |  1022ms ( -8%) |  3618ms (228%) |  7189ms (551%)
 2000 |  2118ms |  2035ms ( -4%) |  7262ms (242%) | 14417ms (580%)
 3000 |  3144ms |  3069ms ( -3%) | 10871ms (245%) | 21687ms (589%)
 4000 |  4186ms |  4085ms ( -3%) | 14523ms (246%) | 28788ms (587%)
 6000 |  6221ms |  6152ms ( -2%) | 21792ms (250%) | 43101ms (592%)
 8000 |  8293ms |  8174ms ( -2%) | 29106ms (250%) | 57651ms (595%)
 9999 | 10363ms | 10247ms ( -2%) | 36428ms (251%) | 71705ms (591%)

2 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   284ms |    15ms (-95%) |    38ms (-87%) |    53ms (-82%)
    2 |   166ms |     9ms (-95%) |    26ms (-84%) |    50ms (-70%)
    4 |   145ms |     9ms (-94%) |    31ms (-79%) |    53ms (-64%)
    6 |   136ms |    12ms (-91%) |    44ms (-68%) |    60ms (-56%)
    8 |   122ms |    14ms (-88%) |    53ms (-57%) |    82ms (-33%)
   10 |   107ms |    17ms (-84%) |    63ms (-41%) |   105ms ( -2%)
   12 |   104ms |    18ms (-83%) |    71ms (-33%) |   110ms (  5%)
   15 |   108ms |    23ms (-79%) |    87ms (-20%) |   148ms ( 35%)
   20 |   110ms |    28ms (-74%) |   108ms ( -3%) |   153ms ( 38%)
   40 |   136ms |    54ms (-60%) |   209ms ( 53%) |   302ms (122%)
   60 |   161ms |    80ms (-51%) |   311ms ( 92%) |   448ms (177%)
   80 |   179ms |   100ms (-44%) |   406ms (126%) |   600ms (234%)
  100 |   203ms |   127ms (-38%) |   506ms (148%) |   745ms (266%)
  200 |   329ms |   246ms (-26%) |  1003ms (204%) |  1503ms (356%)
  400 |   567ms |   491ms (-14%) |  2005ms (253%) |  3010ms (430%)
  600 |   807ms |   732ms (-10%) |  3002ms (271%) |  4528ms (460%)
  800 |  1052ms |   972ms ( -8%) |  3974ms (277%) |  6034ms (473%)
 1000 |  1301ms |  1220ms ( -7%) |  5001ms (284%) |  7548ms (479%)
 2000 |  2510ms |  2434ms ( -4%) | 10007ms (298%) | 15280ms (508%)
 3000 |  3754ms |  3660ms ( -3%) | 14989ms (299%) | 22667ms (503%)
 4000 |  4968ms |  4874ms ( -2%) | 20279ms (308%) | 30199ms (507%)
 6000 |  7394ms |  7320ms ( -2%) | 30054ms (306%) | 45115ms (510%)
 8000 |  9827ms |  9713ms ( -2%) | 40095ms (307%) | 59919ms (509%)
 9999 | 12290ms | 12150ms ( -2%) | 49756ms (304%) | 74072ms (502%)

4 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   289ms |    17ms (-95%) |    41ms (-86%) |    52ms (-82%)
    2 |   163ms |    11ms (-94%) |    35ms (-79%) |    54ms (-67%)
    4 |   146ms |    12ms (-92%) |    46ms (-69%) |    55ms (-62%)
    6 |   140ms |    14ms (-90%) |    64ms (-55%) |    69ms (-51%)
    8 |   129ms |    18ms (-86%) |    76ms (-41%) |    80ms (-38%)
   10 |   109ms |    20ms (-81%) |    94ms (-14%) |   105ms ( -4%)
   12 |   109ms |    24ms (-78%) |   106ms ( -4%) |   129ms ( 17%)
   15 |   130ms |    32ms (-75%) |   148ms ( 13%) |   149ms ( 14%)
   20 |   117ms |    36ms (-69%) |   167ms ( 42%) |   163ms ( 39%)
   40 |   154ms |    72ms (-54%) |   329ms (113%) |   323ms (109%)
   60 |   180ms |    96ms (-47%) |   479ms (166%) |   486ms (170%)
   80 |   208ms |   123ms (-41%) |   632ms (204%) |   630ms (202%)
  100 |   239ms |   151ms (-37%) |   770ms (221%) |   813ms (238%)
  200 |   394ms |   304ms (-23%) |  1538ms (290%) |  1586ms (302%)
  400 |   689ms |   609ms (-12%) |  3143ms (355%) |  3189ms (362%)
  600 |   994ms |   905ms ( -9%) |  4686ms (371%) |  4713ms (374%)
  800 |  1303ms |  1235ms ( -6%) |  6312ms (384%) |  6367ms (388%)
 1000 |  1627ms |  1463ms (-11%) |  7711ms (373%) |  8002ms (391%)
 2000 |  3188ms |  3005ms ( -6%) | 15561ms (388%) | 15803ms (395%)
 3000 |  4743ms |  4587ms ( -4%) | 23475ms (394%) | 23901ms (403%)
 4000 |  6169ms |  6157ms ( -1%) | 31316ms (407%) | 31437ms (409%)
 6000 |  9391ms |  9285ms ( -2%) | 46588ms (396%) | 47771ms (408%)
 8000 | 12467ms | 12348ms ( -1%) | 62862ms (404%) | 63416ms (408%)
 9999 | 15567ms | 15487ms ( -1%) | 78131ms (401%) | 78577ms (404%)

10 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   297ms |    19ms (-94%) |    57ms (-81%) |    54ms (-82%)
    2 |   167ms |    15ms (-91%) |    60ms (-65%) |    56ms (-67%)
    4 |   157ms |    18ms (-89%) |    87ms (-45%) |    62ms (-61%)
    6 |   148ms |    23ms (-85%) |   121ms (-19%) |    75ms (-50%)
    8 |   135ms |    28ms (-80%) |   152ms ( 12%) |    95ms (-30%)
   10 |   127ms |    33ms (-74%) |   183ms ( 44%) |   116ms ( -9%)
   12 |   126ms |    38ms (-70%) |   220ms ( 74%) |   137ms (  8%)
   15 |   136ms |    46ms (-67%) |   258ms ( 89%) |   151ms ( 11%)
   20 |   145ms |    60ms (-59%) |   333ms (128%) |   187ms ( 28%)
   40 |   186ms |   102ms (-45%) |   644ms (245%) |   371ms ( 99%)
   60 |   233ms |   152ms (-35%) |   971ms (315%) |   553ms (136%)
   80 |   286ms |   204ms (-29%) |  1275ms (344%) |   725ms (153%)
  100 |   335ms |   248ms (-26%) |  1598ms (377%) |   913ms (172%)
  200 |   583ms |   495ms (-15%) |  3174ms (444%) |  1841ms (215%)
  400 |  1075ms |   984ms ( -9%) |  6344ms (489%) |  3670ms (241%)
  600 |  1563ms |  1479ms ( -6%) |  9543ms (510%) |  5498ms (251%)
  800 |  2076ms |  1974ms ( -5%) | 12732ms (513%) |  7333ms (253%)
 1000 |  2580ms |  2486ms ( -4%) | 15952ms (518%) |  9159ms (254%)
 2000 |  5115ms |  5029ms ( -2%) | 32031ms (526%) | 18285ms (257%)
 3000 |  7576ms |  7508ms ( -1%) | 47963ms (533%) | 27429ms (262%)
 4000 | 10128ms | 10075ms ( -1%) | 64225ms (534%) | 36588ms (261%)
 6000 | 15168ms | 15140ms ( -1%) | 96086ms (533%) | 54834ms (261%)
 8000 | 20246ms | 20219ms ( -1%) |129007ms (537%) | 73107ms (261%)
 9999 | 25382ms | 25264ms ( -1%) |161209ms (535%) | 91117ms (258%)

100 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   337ms |    53ms (-85%) |   272ms (-20%) |    97ms (-72%)
    2 |   222ms |    70ms (-69%) |   393ms ( 76%) |   117ms (-48%)
    4 |   242ms |   105ms (-57%) |   635ms (162%) |   159ms (-35%)
    6 |   268ms |   140ms (-48%) |   970ms (262%) |   179ms (-34%)
    8 |   263ms |   159ms (-40%) |  1106ms (320%) |   232ms (-12%)
   10 |   276ms |   186ms (-33%) |  1338ms (384%) |   278ms (  0%)
   12 |   308ms |   221ms (-29%) |  1584ms (414%) |   328ms (  6%)
   15 |   350ms |   270ms (-23%) |  1962ms (460%) |   407ms ( 16%)
   20 |   438ms |   355ms (-20%) |  2603ms (493%) |   529ms ( 20%)
   40 |   751ms |   665ms (-12%) |  5112ms (580%) |  1039ms ( 38%)
   60 |  1094ms |  1000ms ( -9%) |  7612ms (595%) |  1552ms ( 41%)
   80 |  1440ms |  1323ms ( -9%) | 10147ms (604%) |  2085ms ( 44%)
  100 |  1798ms |  1718ms ( -5%) | 12636ms (602%) |  2585ms ( 43%)
  200 |  3539ms |  3418ms ( -4%) | 25423ms (618%) |  5249ms ( 48%)
  400 |  7085ms |  7086ms (  0%) | 51728ms (630%) | 10561ms ( 49%)
  600 | 10931ms | 10822ms ( -1%) | 77671ms (610%) | 15795ms ( 44%)
  800 | 14680ms | 14577ms ( -1%) |103519ms (605%) | 21147ms ( 44%)
 1000 | 18560ms | 18817ms (  1%) |129679ms (598%) | 26345ms ( 41%)
 2000 | 36419ms | 36735ms (  0%) |258503ms (609%) | 52402ms ( 43%)
 3000 | 55917ms | 55606ms ( -1%) |384785ms (588%) | 77508ms ( 38%)
 4000 | 73902ms | 73556ms ( -1%) |519067ms (602%) |103688ms ( 40%)
 6000 |108664ms |109314ms (  0%) |781070ms (618%) |152491ms ( 40%)
 8000 |145420ms |143874ms ( -2%) |1039416ms (614%) |208028ms ( 43%)
 9999 |184118ms |184185ms (  0%) |1340946ms (628%) |259675ms ( 41%)

nodes: Number of input test elements 1..9999 (per namespace => 2..19998 child nodes).
composed: Approach 1 with one XSLT importing 2 others. Scanning the input for namespaces. (100%).
pre-compiled: Run 1a with one static, pre-compiled XSLT for comparison. Still scanning the input for namespaces.
process pipe: Approach 2 with 4-stage XSLTs. Scanning the input for namespaces.
extension: Approach 3 with one XSLT for the root namespace and with a separate transformation for every test element.

Noticed
The use of DOMSource slows things down, especially with large documents. Since the per-element transformation gets the input element as a TinyTree directly, the approach 3 has had an advantage over the others, having to deal with wrapped DOM nodes in the last implementation.

The results for large documents now make more sense. The per-element transformation does not suddenly outperform the one-stylesheet approach anymore and the pre-compiled approaches only score with smaller documents in all the tests.

As expected approach 3 benefits from a low foreign/root namespace ratio.

Conclusions
A good general solution for our problem is now the approach 3. Even though it performs worse than the approach 1 for large documents, it is faster on smaller ones.

The best solution is a combination of approach 1 (for large documents) and approach 3 (for smaller ones). Again a pre-compiled cache for often-used namespace sets improves the process of small documents further.

Tuesday, April 13, 2010

Open XSLT Processing

The Problem
Imaging a problem ;) You have hundreds of XML documents, containing elements of different types, defined in different namespaces. On average a document contains about 100 elements in 5 namespaces, while 50 elements belong to the namespace of the root element. Each of the namespaces is one out of 10000.
Then you want to transform these documents into an output, lets say into HTML. You have XSLT template-match definitions for every element in every namespace (~50 * 10000 = 50000 matches).
One XSLT stylesheet defining all the templates, importing all the definitions, probably wouldn't perform very good and won't even fit in memory, so you have to break it down somehow...

Solutions
1) The process first scans the input document for all the namespaces contained.
Then it dynamically builds a stylesheet, importing the according stylesheets
per namespace. The result gets compiled and the input transformed.

2) The process scans the input document for the namespaces contained and builds
a pipeline with the pre-compiled stylesheets, handling these namespaces. The
input gets transformed step-by-step.

3) The process starts with the stylesheet for the root namespace. For all the
unknown namespaces transforms the node with the pre-compiled stylesheet for
the according namespace via the saxon:transform() extension. Every "foreign"
node gets transformed separately.

Tests
To compare these approaches, I've written a small test:
http://code.google.com/p/livcos/source/browse/proto/OpenXslt/src/proto/
(tested with 2 namespaces, extended to 3 namespaces)

The input document contains 1..9999 elements per namespace for 2 namespaces. The root node is in another namespace and holds 0..100 content elements (with the same namespace) in addition to the "foreign" elements. These "foreign" elements also contain content elements with the same namespace.
Here the results:
0 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   231ms |    16ms (-93%) |    41ms (-82%) |    33ms (-86%)
    2 |   130ms |     9ms (-93%) |    19ms (-86%) |    38ms (-71%)
    4 |   110ms |     8ms (-93%) |    13ms (-88%) |    46ms (-58%)
    6 |   102ms |     8ms (-92%) |    14ms (-87%) |    40ms (-62%)
    8 |    97ms |     9ms (-91%) |    15ms (-84%) |    46ms (-53%)
   10 |    85ms |    10ms (-88%) |    17ms (-80%) |    55ms (-36%)
   12 |    85ms |    12ms (-86%) |    20ms (-76%) |    64ms (-26%)
   15 |    85ms |    13ms (-85%) |    24ms (-72%) |    78ms ( -9%)
   20 |    79ms |    16ms (-80%) |    29ms (-63%) |   106ms ( 33%)
   40 |    89ms |    29ms (-68%) |    55ms (-39%) |   208ms (133%)
   60 |   101ms |    42ms (-58%) |    84ms (-17%) |   303ms (199%)
   80 |   117ms |    57ms (-52%) |   107ms ( -9%) |   400ms (241%)
  100 |   127ms |    68ms (-47%) |   132ms (  3%) |   503ms (293%)
  200 |   194ms |   135ms (-31%) |   258ms ( 32%) |   995ms (411%)
  400 |   330ms |   266ms (-20%) |   514ms ( 55%) |  1998ms (504%)
  600 |   460ms |   397ms (-14%) |   774ms ( 67%) |  2992ms (549%)
  800 |   590ms |   532ms (-10%) |  1029ms ( 74%) |  3985ms (575%)
 1000 |   725ms |   658ms (-10%) |  1295ms ( 78%) |  5000ms (589%)
 2000 |  1387ms |  1315ms ( -6%) |  2596ms ( 87%) | 10009ms (621%)
 3000 |  2037ms |  1992ms ( -3%) |  3857ms ( 89%) | 15088ms (640%)
 4000 |  2716ms |  2643ms ( -3%) |  5175ms ( 90%) | 20128ms (640%)
 6000 |  4043ms |  3973ms ( -2%) |  7768ms ( 92%) | 30148ms (645%)
 8000 |  5354ms |  5285ms ( -2%) | 10362ms ( 93%) | 40199ms (650%)
 9999 |  6675ms |  6594ms ( -2%) | 12897ms ( 93%) | 50212ms (652%)

1 content element, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   236ms |    20ms (-92%) |    44ms (-82%) |    32ms (-87%)
    2 |   135ms |     9ms (-93%) |    18ms (-87%) |    38ms (-72%)
    4 |   110ms |    10ms (-91%) |    17ms (-84%) |    46ms (-58%)
    6 |   105ms |    10ms (-90%) |    20ms (-81%) |    42ms (-60%)
    8 |    99ms |    12ms (-88%) |    25ms (-75%) |    50ms (-50%)
   10 |    90ms |    14ms (-84%) |    28ms (-69%) |    60ms (-34%)
   12 |    89ms |    16ms (-82%) |    32ms (-64%) |    71ms (-21%)
   15 |    89ms |    19ms (-79%) |    38ms (-58%) |    89ms ( -1%)
   20 |    87ms |    23ms (-74%) |    46ms (-47%) |   117ms ( 35%)
   40 |   105ms |    43ms (-59%) |    88ms (-16%) |   228ms (117%)
   60 |   122ms |    62ms (-49%) |   131ms (  7%) |   332ms (171%)
   80 |   145ms |    82ms (-43%) |   176ms ( 21%) |   441ms (204%)
  100 |   163ms |   103ms (-37%) |   212ms ( 30%) |   548ms (235%)
  200 |   265ms |   202ms (-24%) |   423ms ( 59%) |  1111ms (318%)
  400 |   464ms |   398ms (-15%) |   836ms ( 79%) |  2227ms (379%)
  600 |   665ms |   602ms (-10%) |  1252ms ( 88%) |  3342ms (402%)
  800 |   863ms |   801ms ( -8%) |  1676ms ( 94%) |  4430ms (413%)
 1000 |  1073ms |  1004ms ( -7%) |  2087ms ( 94%) |  5524ms (414%)
 2000 |  2065ms |  1994ms ( -4%) |  4182ms (102%) | 11119ms (438%)
 3000 |  3065ms |  2999ms ( -3%) |  6286ms (105%) | 16660ms (443%)
 4000 |  4073ms |  3966ms ( -3%) |  8363ms (105%) | 22165ms (444%)
 6000 |  6071ms |  6003ms ( -2%) | 12598ms (107%) | 33311ms (448%)
 8000 |  8086ms |  8006ms ( -1%) | 16718ms (106%) | 44384ms (448%)
 9999 | 10052ms |  9989ms ( -1%) | 20959ms (108%) | 55413ms (451%)

2 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   240ms |    19ms (-93%) |    47ms (-81%) |    32ms (-87%)
    2 |   130ms |    10ms (-92%) |    19ms (-85%) |    39ms (-70%)
    4 |   111ms |    11ms (-90%) |    21ms (-81%) |    47ms (-58%)
    6 |   107ms |    12ms (-89%) |    25ms (-77%) |    44ms (-59%)
    8 |   103ms |    14ms (-86%) |    30ms (-71%) |    52ms (-50%)
   10 |    93ms |    16ms (-83%) |    35ms (-62%) |    63ms (-33%)
   12 |    91ms |    19ms (-79%) |    40ms (-56%) |    74ms (-19%)
   15 |    94ms |    21ms (-77%) |    48ms (-49%) |    91ms ( -3%)
   20 |    90ms |    28ms (-69%) |    60ms (-33%) |   123ms ( 36%)
   40 |   113ms |    55ms (-51%) |   116ms (  2%) |   237ms (109%)
   60 |   139ms |    82ms (-42%) |   172ms ( 23%) |   353ms (153%)
   80 |   169ms |   108ms (-36%) |   230ms ( 36%) |   473ms (179%)
  100 |   198ms |   137ms (-31%) |   290ms ( 46%) |   596ms (201%)
  200 |   381ms |   314ms (-18%) |   609ms ( 59%) |  1241ms (225%)
  400 |  1007ms |   943ms ( -7%) |  1531ms ( 52%) |  2683ms (166%)
  600 |  1887ms |  1846ms ( -3%) |  2742ms ( 45%) |  4407ms (133%)
  800 |  3244ms |  3159ms ( -3%) |  4333ms ( 33%) |  6326ms ( 95%)
 1000 |  4798ms |  4735ms ( -2%) |  6216ms ( 29%) |  8440ms ( 75%)
 2000 | 30837ms | 30734ms ( -1%) | 33919ms (  9%) | 21383ms (-31%)
 3000 | 70254ms | 70443ms (  0%) | 75177ms (  7%) | 38438ms (-46%)
 4000 |126738ms |126845ms (  0%) |131896ms (  4%) | 59297ms (-54%)
 6000 |285009ms |286287ms (  0%) |295802ms (  3%) |112712ms (-61%)
 8000 |506940ms |508315ms (  0%) |520498ms (  2%) |181728ms (-65%)
 9999 |794292ms |793994ms ( -1%) |810903ms (  2%) |267628ms (-67%)

4 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   248ms |    17ms (-94%) |    51ms (-80%) |    32ms (-88%)
    2 |   133ms |    12ms (-91%) |    23ms (-83%) |    41ms (-69%)
    4 |   114ms |    14ms (-88%) |    30ms (-74%) |    51ms (-56%)
    6 |   110ms |    16ms (-85%) |    36ms (-68%) |    49ms (-56%)
    8 |   107ms |    19ms (-82%) |    43ms (-60%) |    57ms (-48%)
   10 |   101ms |    23ms (-78%) |    51ms (-50%) |    69ms (-32%)
   12 |   102ms |    26ms (-75%) |    60ms (-42%) |    85ms (-18%)
   15 |   103ms |    29ms (-72%) |    71ms (-32%) |   102ms ( -1%)
   20 |    99ms |    39ms (-61%) |    91ms ( -9%) |   135ms ( 35%)
   40 |   141ms |    74ms (-48%) |   179ms ( 26%) |   263ms ( 86%)
   60 |   170ms |   116ms (-32%) |   261ms ( 53%) |   391ms (129%)
   80 |   206ms |   147ms (-29%) |   346ms ( 67%) |   517ms (150%)
  100 |   258ms |   186ms (-28%) |   440ms ( 70%) |   656ms (154%)
  200 |   475ms |   419ms (-12%) |   914ms ( 92%) |  1361ms (186%)
  400 |  1223ms |  1157ms ( -6%) |  2152ms ( 75%) |  2923ms (138%)
  600 |  2196ms |  2159ms ( -2%) |  3655ms ( 66%) |  4635ms (111%)
  800 |  3503ms |  3423ms ( -3%) |  5398ms ( 54%) |  6578ms ( 87%)
 1000 |  5079ms |  5295ms (  4%) |  7791ms ( 53%) |  8668ms ( 70%)
 2000 | 25708ms | 23913ms ( -7%) | 28935ms ( 12%) | 22370ms (-13%)
 3000 | 65818ms | 67420ms (  2%) | 74858ms ( 13%) | 40017ms (-40%)
 4000 |125688ms |126088ms (  0%) |136408ms (  8%) | 61669ms (-51%)
 6000 |287168ms |288365ms (  0%) |303169ms (  5%) |117042ms (-60%)
 8000 |512868ms |513733ms (  0%) |533504ms (  4%) |188468ms (-64%)
 9999 |799576ms |802145ms (  0%) |828192ms (  3%) |275584ms (-66%)

10 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   255ms |    18ms (-93%) |    63ms (-76%) |    35ms (-86%)
    2 |   137ms |    17ms (-88%) |    37ms (-73%) |    46ms (-66%)
    4 |   123ms |    23ms (-82%) |    52ms (-58%) |    59ms (-53%)
    6 |   122ms |    28ms (-78%) |    66ms (-47%) |    61ms (-50%)
    8 |   122ms |    34ms (-73%) |    80ms (-35%) |    72ms (-42%)
   10 |   117ms |    40ms (-66%) |    97ms (-17%) |    87ms (-26%)
   12 |   119ms |    47ms (-61%) |   113ms ( -6%) |   104ms (-13%)
   15 |   126ms |    54ms (-57%) |   138ms (  9%) |   128ms (  1%)
   20 |   138ms |    72ms (-48%) |   183ms ( 32%) |   170ms ( 23%)
   40 |   199ms |   138ms (-31%) |   340ms ( 70%) |   324ms ( 62%)
   60 |   265ms |   204ms (-24%) |   514ms ( 93%) |   495ms ( 86%)
   80 |   334ms |   275ms (-18%) |   677ms (102%) |   651ms ( 94%)
  100 |   402ms |   345ms (-15%) |   854ms (112%) |   819ms (103%)
  200 |   797ms |   734ms ( -8%) |  1735ms (117%) |  1693ms (112%)
  400 |  1884ms |  1809ms ( -4%) |  3854ms (104%) |  3634ms ( 92%)
  600 |  3247ms |  3177ms ( -3%) |  6219ms ( 91%) |  5827ms ( 79%)
  800 |  4934ms |  4835ms ( -3%) |  8930ms ( 80%) |  8146ms ( 65%)
 1000 |  7037ms |  6977ms ( -1%) | 12088ms ( 71%) | 10806ms ( 53%)
 2000 | 32551ms | 32032ms ( -2%) | 42344ms ( 30%) | 27523ms (-16%)
 3000 | 86523ms | 86692ms (  0%) |102437ms ( 18%) | 49170ms (-44%)
 4000 |157235ms |157528ms (  0%) |178602ms ( 13%) | 75523ms (-52%)
 6000 |355111ms |355885ms (  0%) |387521ms (  9%) |142279ms (-60%)
 8000 |628937ms |630111ms (  0%) |703281ms ( 11%) |228502ms (-64%)
 9999 |979335ms |980668ms (  0%) |1037671ms (  5%) |334994ms (-66%)

100 content elements, 10 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |    75ms |    10ms (-86%) |    48ms (-36%) |    11ms (-86%)
    2 |    33ms |    10ms (-70%) |    25ms (-25%) |    14ms (-58%)
    4 |    36ms |    15ms (-58%) |    39ms ( 10%) |    22ms (-38%)
    6 |    39ms |    20ms (-49%) |    54ms ( 37%) |    32ms (-19%)
    8 |    43ms |    25ms (-42%) |    70ms ( 59%) |    41ms ( -7%)
   10 |    47ms |    29ms (-38%) |    82ms ( 73%) |    47ms (  0%)
   12 |    50ms |    33ms (-34%) |    96ms ( 89%) |    54ms (  7%)
   15 |    57ms |    40ms (-30%) |   115ms ( 99%) |    63ms ( 10%)
   20 |    67ms |    52ms (-23%) |   152ms (124%) |    73ms (  7%)
   40 |   116ms |    99ms (-14%) |   292ms (152%) |   128ms ( 11%)
   60 |   164ms |   153ms ( -7%) |   434ms (163%) |   189ms ( 14%)
   80 |   219ms |   200ms ( -9%) |   571ms (159%) |   255ms ( 16%)
  100 |   269ms |   253ms ( -6%) |   711ms (164%) |   320ms ( 19%)
  200 |   534ms |   528ms ( -2%) |  1449ms (171%) |   642ms ( 20%)
  400 |  1161ms |  1157ms ( -1%) |  2994ms (157%) |  1329ms ( 14%)
  600 |  1870ms |  1878ms (  0%) |  4667ms (149%) |  2044ms (  9%)
  800 |  2699ms |  2690ms ( -1%) |  6400ms (137%) |  2789ms (  3%)
 1000 |  3630ms |  3643ms (  0%) |  8303ms (128%) |  3578ms ( -2%)
 2000 | 10629ms | 10900ms (  2%) | 20206ms ( 90%) |  7980ms (-25%)
 3000 | 21130ms | 21548ms (  1%) | 35574ms ( 68%) | 13244ms (-38%)
 4000 | 34769ms | 35397ms (  1%) | 54171ms ( 55%) | 19251ms (-45%)
 6000 | 70734ms | 72451ms (  2%) |100466ms ( 42%) | 33602ms (-53%)
 8000 |119603ms |122070ms (  2%) |160432ms ( 34%) | 51418ms (-58%)
 9999 |182144ms |185494ms (  1%) |231385ms ( 27%) | 72163ms (-61%)

nodes: Number of input test elements 1..9999 (per namespace => 2..19998 child nodes).
composed: Approach 1 with one XSLT importing 2 others. Scanning the input for namespaces. (100%).
pre-compiled: Run 1a with one static, pre-compiled XSLT for comparison. Still scanning the input for namespaces.
process pipe: Approach 2 with 3-stage XSLTs. Scanning the input for namespaces.
extension: Approach 3 with one XSLT for the root namespace and with a separate transformation for every test element.

Noticed
As expected the static, pre-compiled XSLT performs generally best.

As expected the pre-compiled savings become relative as the input data size grows.

Multi-stage pipeline processing performs good for small documents and ok for bigger ones. As expected more namespaces -> more stylesheets -> more stages -> more time (test with the implementation, extended to 3 namespaces).

Compared to the other approaches the per-element transformations (approach 3) becomes expensive, when all the contained elements get transformed (> 100 "foreign" elements, 0 content elements), but improves rapidly with content elements.

What happens in the process with 2 content elements? With >2000 elements, suddenly the per-element transformation is faster, than a transformation with a pre-compiled stylesheet? Even though the per-namespaces stylesheets only need to process one element per transformation, the initial stylesheet still needs to match all the elements, much like the pre-compiled, all-in-one stylesheet. The tree walk to collect the namespaces (not needed in approach 3) cannot be the issue here, can it?

Also while it takes about 7s to process 9999 elements with no content and about 10s with 1 content element, it takes more than 13min for 9999 elements with 2 content elements!
Maybe with this size of the input document the garbage collector kicks in... But GC time seems to be constant and also there no significant change on a 64bit-VM with 6GB of heap.
Also, I guess, 9999 elements with one child element should need roughly the same or even a bigger amount of memory as 5000 elements with 2 child elements. To process the second it takes more than 10 times longer.

Conclusions
To generally solve our problem, the approach 2 seems to be most suitable, with a good performance also with larger documents.

Dynamically compound stylesheets would serve documents with more than 200 nodes better. Combined with a smart cache of pre-compiled stylesheets for often used namespace sets, it could also perform best for smaller documents.

The per-element transformation approach (3) works good for small documents and strangely for large documents (> 3000 nodes, foreign/root < 1/3) as well. Due to it's flexibility (individual stylesheet selection not only by namespace), it's also worth to be offered.

Update: please see Open XSLT Processing 2 to answer the open questions.