Wednesday, April 14, 2010

Open XSLT Processing 2

Intro
To understand the context of the problem, please read the previous Open XSLT Processing post first. Mr. Kay has given me the hint of taking the DOMSource out of the equation. Thanks!

New Tests
The new implementation adds a DOM-to-TinyTree conversion step before the actual runs. In addition it now handles 3 namespaces. The source still can be found at
http://code.google.com/p/livcos/source/browse/proto/OpenXslt/src/proto/.

JVM options: -Xms1500M -Xmx1500M -XX:NewRatio=1

Here the new results:
0 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   281ms |    17ms (-94%) |    35ms (-88%) |    50ms (-83%)
    2 |   161ms |     8ms (-95%) |    18ms (-89%) |    48ms (-70%)
    4 |   146ms |     7ms (-95%) |    17ms (-89%) |    51ms (-65%)
    6 |   137ms |     9ms (-94%) |    21ms (-85%) |    57ms (-58%)
    8 |   118ms |    10ms (-92%) |    26ms (-78%) |    74ms (-38%)
   10 |   100ms |    11ms (-89%) |    29ms (-71%) |    85ms (-15%)
   12 |    99ms |    13ms (-87%) |    34ms (-66%) |   104ms (  5%)
   15 |   104ms |    16ms (-85%) |    41ms (-61%) |   128ms ( 22%)
   20 |   106ms |    20ms (-81%) |    53ms (-50%) |   151ms ( 42%)
   40 |   119ms |    36ms (-70%) |    97ms (-19%) |   281ms (135%)
   60 |   135ms |    53ms (-61%) |   145ms (  7%) |   426ms (215%)
   80 |   152ms |    68ms (-55%) |   188ms ( 23%) |   559ms (266%)
  100 |   165ms |    84ms (-50%) |   230ms ( 39%) |   710ms (330%)
  200 |   247ms |   169ms (-32%) |   457ms ( 84%) |  1384ms (458%)
  400 |   417ms |   331ms (-21%) |   908ms (117%) |  2791ms (568%)
  600 |   578ms |   498ms (-14%) |  1372ms (137%) |  4192ms (624%)
  800 |   752ms |   664ms (-12%) |  1815ms (141%) |  5573ms (640%)
 1000 |   911ms |   835ms ( -9%) |  2269ms (148%) |  6937ms (660%)
 2000 |  1746ms |  1652ms ( -6%) |  4518ms (158%) | 13865ms (693%)
 3000 |  2571ms |  2449ms ( -5%) |  6738ms (162%) | 20759ms (707%)
 4000 |  3405ms |  3301ms ( -4%) |  9068ms (166%) | 27701ms (713%)
 6000 |  5061ms |  4932ms ( -3%) | 13570ms (168%) | 41572ms (721%)
 8000 |  6755ms |  6597ms ( -3%) | 18067ms (167%) | 55494ms (721%)
 9999 |  8368ms |  8234ms ( -2%) | 22611ms (170%) | 68887ms (723%)

1 content element, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   286ms |    16ms (-95%) |    35ms (-88%) |    51ms (-83%)
    2 |   162ms |     9ms (-95%) |    26ms (-84%) |    49ms (-70%)
    4 |   145ms |     9ms (-94%) |    24ms (-84%) |    52ms (-64%)
    6 |   138ms |    11ms (-92%) |    32ms (-77%) |    60ms (-57%)
    8 |   123ms |    12ms (-90%) |    40ms (-68%) |    77ms (-38%)
   10 |   103ms |    14ms (-87%) |    45ms (-56%) |    90ms (-13%)
   12 |   102ms |    16ms (-84%) |    53ms (-49%) |   111ms (  8%)
   15 |   107ms |    20ms (-82%) |    67ms (-38%) |   142ms ( 32%)
   20 |   110ms |    25ms (-78%) |    87ms (-21%) |   147ms ( 33%)
   40 |   128ms |    46ms (-64%) |   155ms ( 21%) |   296ms (131%)
   60 |   149ms |    67ms (-55%) |   228ms ( 52%) |   431ms (189%)
   80 |   165ms |    83ms (-50%) |   296ms ( 79%) |   576ms (249%)
  100 |   184ms |   104ms (-44%) |   366ms ( 98%) |   722ms (290%)
  200 |   286ms |   206ms (-29%) |   735ms (156%) |  1453ms (407%)
  400 |   496ms |   408ms (-18%) |  1465ms (195%) |  2898ms (484%)
  600 |   694ms |   611ms (-13%) |  2200ms (216%) |  4371ms (529%)
  800 |   899ms |   813ms (-10%) |  2903ms (222%) |  5767ms (541%)
 1000 |  1103ms |  1022ms ( -8%) |  3618ms (228%) |  7189ms (551%)
 2000 |  2118ms |  2035ms ( -4%) |  7262ms (242%) | 14417ms (580%)
 3000 |  3144ms |  3069ms ( -3%) | 10871ms (245%) | 21687ms (589%)
 4000 |  4186ms |  4085ms ( -3%) | 14523ms (246%) | 28788ms (587%)
 6000 |  6221ms |  6152ms ( -2%) | 21792ms (250%) | 43101ms (592%)
 8000 |  8293ms |  8174ms ( -2%) | 29106ms (250%) | 57651ms (595%)
 9999 | 10363ms | 10247ms ( -2%) | 36428ms (251%) | 71705ms (591%)

2 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   284ms |    15ms (-95%) |    38ms (-87%) |    53ms (-82%)
    2 |   166ms |     9ms (-95%) |    26ms (-84%) |    50ms (-70%)
    4 |   145ms |     9ms (-94%) |    31ms (-79%) |    53ms (-64%)
    6 |   136ms |    12ms (-91%) |    44ms (-68%) |    60ms (-56%)
    8 |   122ms |    14ms (-88%) |    53ms (-57%) |    82ms (-33%)
   10 |   107ms |    17ms (-84%) |    63ms (-41%) |   105ms ( -2%)
   12 |   104ms |    18ms (-83%) |    71ms (-33%) |   110ms (  5%)
   15 |   108ms |    23ms (-79%) |    87ms (-20%) |   148ms ( 35%)
   20 |   110ms |    28ms (-74%) |   108ms ( -3%) |   153ms ( 38%)
   40 |   136ms |    54ms (-60%) |   209ms ( 53%) |   302ms (122%)
   60 |   161ms |    80ms (-51%) |   311ms ( 92%) |   448ms (177%)
   80 |   179ms |   100ms (-44%) |   406ms (126%) |   600ms (234%)
  100 |   203ms |   127ms (-38%) |   506ms (148%) |   745ms (266%)
  200 |   329ms |   246ms (-26%) |  1003ms (204%) |  1503ms (356%)
  400 |   567ms |   491ms (-14%) |  2005ms (253%) |  3010ms (430%)
  600 |   807ms |   732ms (-10%) |  3002ms (271%) |  4528ms (460%)
  800 |  1052ms |   972ms ( -8%) |  3974ms (277%) |  6034ms (473%)
 1000 |  1301ms |  1220ms ( -7%) |  5001ms (284%) |  7548ms (479%)
 2000 |  2510ms |  2434ms ( -4%) | 10007ms (298%) | 15280ms (508%)
 3000 |  3754ms |  3660ms ( -3%) | 14989ms (299%) | 22667ms (503%)
 4000 |  4968ms |  4874ms ( -2%) | 20279ms (308%) | 30199ms (507%)
 6000 |  7394ms |  7320ms ( -2%) | 30054ms (306%) | 45115ms (510%)
 8000 |  9827ms |  9713ms ( -2%) | 40095ms (307%) | 59919ms (509%)
 9999 | 12290ms | 12150ms ( -2%) | 49756ms (304%) | 74072ms (502%)

4 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   289ms |    17ms (-95%) |    41ms (-86%) |    52ms (-82%)
    2 |   163ms |    11ms (-94%) |    35ms (-79%) |    54ms (-67%)
    4 |   146ms |    12ms (-92%) |    46ms (-69%) |    55ms (-62%)
    6 |   140ms |    14ms (-90%) |    64ms (-55%) |    69ms (-51%)
    8 |   129ms |    18ms (-86%) |    76ms (-41%) |    80ms (-38%)
   10 |   109ms |    20ms (-81%) |    94ms (-14%) |   105ms ( -4%)
   12 |   109ms |    24ms (-78%) |   106ms ( -4%) |   129ms ( 17%)
   15 |   130ms |    32ms (-75%) |   148ms ( 13%) |   149ms ( 14%)
   20 |   117ms |    36ms (-69%) |   167ms ( 42%) |   163ms ( 39%)
   40 |   154ms |    72ms (-54%) |   329ms (113%) |   323ms (109%)
   60 |   180ms |    96ms (-47%) |   479ms (166%) |   486ms (170%)
   80 |   208ms |   123ms (-41%) |   632ms (204%) |   630ms (202%)
  100 |   239ms |   151ms (-37%) |   770ms (221%) |   813ms (238%)
  200 |   394ms |   304ms (-23%) |  1538ms (290%) |  1586ms (302%)
  400 |   689ms |   609ms (-12%) |  3143ms (355%) |  3189ms (362%)
  600 |   994ms |   905ms ( -9%) |  4686ms (371%) |  4713ms (374%)
  800 |  1303ms |  1235ms ( -6%) |  6312ms (384%) |  6367ms (388%)
 1000 |  1627ms |  1463ms (-11%) |  7711ms (373%) |  8002ms (391%)
 2000 |  3188ms |  3005ms ( -6%) | 15561ms (388%) | 15803ms (395%)
 3000 |  4743ms |  4587ms ( -4%) | 23475ms (394%) | 23901ms (403%)
 4000 |  6169ms |  6157ms ( -1%) | 31316ms (407%) | 31437ms (409%)
 6000 |  9391ms |  9285ms ( -2%) | 46588ms (396%) | 47771ms (408%)
 8000 | 12467ms | 12348ms ( -1%) | 62862ms (404%) | 63416ms (408%)
 9999 | 15567ms | 15487ms ( -1%) | 78131ms (401%) | 78577ms (404%)

10 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   297ms |    19ms (-94%) |    57ms (-81%) |    54ms (-82%)
    2 |   167ms |    15ms (-91%) |    60ms (-65%) |    56ms (-67%)
    4 |   157ms |    18ms (-89%) |    87ms (-45%) |    62ms (-61%)
    6 |   148ms |    23ms (-85%) |   121ms (-19%) |    75ms (-50%)
    8 |   135ms |    28ms (-80%) |   152ms ( 12%) |    95ms (-30%)
   10 |   127ms |    33ms (-74%) |   183ms ( 44%) |   116ms ( -9%)
   12 |   126ms |    38ms (-70%) |   220ms ( 74%) |   137ms (  8%)
   15 |   136ms |    46ms (-67%) |   258ms ( 89%) |   151ms ( 11%)
   20 |   145ms |    60ms (-59%) |   333ms (128%) |   187ms ( 28%)
   40 |   186ms |   102ms (-45%) |   644ms (245%) |   371ms ( 99%)
   60 |   233ms |   152ms (-35%) |   971ms (315%) |   553ms (136%)
   80 |   286ms |   204ms (-29%) |  1275ms (344%) |   725ms (153%)
  100 |   335ms |   248ms (-26%) |  1598ms (377%) |   913ms (172%)
  200 |   583ms |   495ms (-15%) |  3174ms (444%) |  1841ms (215%)
  400 |  1075ms |   984ms ( -9%) |  6344ms (489%) |  3670ms (241%)
  600 |  1563ms |  1479ms ( -6%) |  9543ms (510%) |  5498ms (251%)
  800 |  2076ms |  1974ms ( -5%) | 12732ms (513%) |  7333ms (253%)
 1000 |  2580ms |  2486ms ( -4%) | 15952ms (518%) |  9159ms (254%)
 2000 |  5115ms |  5029ms ( -2%) | 32031ms (526%) | 18285ms (257%)
 3000 |  7576ms |  7508ms ( -1%) | 47963ms (533%) | 27429ms (262%)
 4000 | 10128ms | 10075ms ( -1%) | 64225ms (534%) | 36588ms (261%)
 6000 | 15168ms | 15140ms ( -1%) | 96086ms (533%) | 54834ms (261%)
 8000 | 20246ms | 20219ms ( -1%) |129007ms (537%) | 73107ms (261%)
 9999 | 25382ms | 25264ms ( -1%) |161209ms (535%) | 91117ms (258%)

100 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   337ms |    53ms (-85%) |   272ms (-20%) |    97ms (-72%)
    2 |   222ms |    70ms (-69%) |   393ms ( 76%) |   117ms (-48%)
    4 |   242ms |   105ms (-57%) |   635ms (162%) |   159ms (-35%)
    6 |   268ms |   140ms (-48%) |   970ms (262%) |   179ms (-34%)
    8 |   263ms |   159ms (-40%) |  1106ms (320%) |   232ms (-12%)
   10 |   276ms |   186ms (-33%) |  1338ms (384%) |   278ms (  0%)
   12 |   308ms |   221ms (-29%) |  1584ms (414%) |   328ms (  6%)
   15 |   350ms |   270ms (-23%) |  1962ms (460%) |   407ms ( 16%)
   20 |   438ms |   355ms (-20%) |  2603ms (493%) |   529ms ( 20%)
   40 |   751ms |   665ms (-12%) |  5112ms (580%) |  1039ms ( 38%)
   60 |  1094ms |  1000ms ( -9%) |  7612ms (595%) |  1552ms ( 41%)
   80 |  1440ms |  1323ms ( -9%) | 10147ms (604%) |  2085ms ( 44%)
  100 |  1798ms |  1718ms ( -5%) | 12636ms (602%) |  2585ms ( 43%)
  200 |  3539ms |  3418ms ( -4%) | 25423ms (618%) |  5249ms ( 48%)
  400 |  7085ms |  7086ms (  0%) | 51728ms (630%) | 10561ms ( 49%)
  600 | 10931ms | 10822ms ( -1%) | 77671ms (610%) | 15795ms ( 44%)
  800 | 14680ms | 14577ms ( -1%) |103519ms (605%) | 21147ms ( 44%)
 1000 | 18560ms | 18817ms (  1%) |129679ms (598%) | 26345ms ( 41%)
 2000 | 36419ms | 36735ms (  0%) |258503ms (609%) | 52402ms ( 43%)
 3000 | 55917ms | 55606ms ( -1%) |384785ms (588%) | 77508ms ( 38%)
 4000 | 73902ms | 73556ms ( -1%) |519067ms (602%) |103688ms ( 40%)
 6000 |108664ms |109314ms (  0%) |781070ms (618%) |152491ms ( 40%)
 8000 |145420ms |143874ms ( -2%) |1039416ms (614%) |208028ms ( 43%)
 9999 |184118ms |184185ms (  0%) |1340946ms (628%) |259675ms ( 41%)

nodes: Number of input test elements 1..9999 (per namespace => 2..19998 child nodes).
composed: Approach 1 with one XSLT importing 2 others. Scanning the input for namespaces. (100%).
pre-compiled: Run 1a with one static, pre-compiled XSLT for comparison. Still scanning the input for namespaces.
process pipe: Approach 2 with 4-stage XSLTs. Scanning the input for namespaces.
extension: Approach 3 with one XSLT for the root namespace and with a separate transformation for every test element.

Noticed
The use of DOMSource slows things down, especially with large documents. Since the per-element transformation gets the input element as a TinyTree directly, the approach 3 has had an advantage over the others, having to deal with wrapped DOM nodes in the last implementation.

The results for large documents now make more sense. The per-element transformation does not suddenly outperform the one-stylesheet approach anymore and the pre-compiled approaches only score with smaller documents in all the tests.

As expected approach 3 benefits from a low foreign/root namespace ratio.

Conclusions
A good general solution for our problem is now the approach 3. Even though it performs worse than the approach 1 for large documents, it is faster on smaller ones.

The best solution is a combination of approach 1 (for large documents) and approach 3 (for smaller ones). Again a pre-compiled cache for often-used namespace sets improves the process of small documents further.

Tuesday, April 13, 2010

Open XSLT Processing

The Problem
Imaging a problem ;) You have hundreds of XML documents, containing elements of different types, defined in different namespaces. On average a document contains about 100 elements in 5 namespaces, while 50 elements belong to the namespace of the root element. Each of the namespaces is one out of 10000.
Then you want to transform these documents into an output, lets say into HTML. You have XSLT template-match definitions for every element in every namespace (~50 * 10000 = 50000 matches).
One XSLT stylesheet defining all the templates, importing all the definitions, probably wouldn't perform very good and won't even fit in memory, so you have to break it down somehow...

Solutions
1) The process first scans the input document for all the namespaces contained.
Then it dynamically builds a stylesheet, importing the according stylesheets
per namespace. The result gets compiled and the input transformed.

2) The process scans the input document for the namespaces contained and builds
a pipeline with the pre-compiled stylesheets, handling these namespaces. The
input gets transformed step-by-step.

3) The process starts with the stylesheet for the root namespace. For all the
unknown namespaces transforms the node with the pre-compiled stylesheet for
the according namespace via the saxon:transform() extension. Every "foreign"
node gets transformed separately.

Tests
To compare these approaches, I've written a small test:
http://code.google.com/p/livcos/source/browse/proto/OpenXslt/src/proto/
(tested with 2 namespaces, extended to 3 namespaces)

The input document contains 1..9999 elements per namespace for 2 namespaces. The root node is in another namespace and holds 0..100 content elements (with the same namespace) in addition to the "foreign" elements. These "foreign" elements also contain content elements with the same namespace.
Here the results:
0 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   231ms |    16ms (-93%) |    41ms (-82%) |    33ms (-86%)
    2 |   130ms |     9ms (-93%) |    19ms (-86%) |    38ms (-71%)
    4 |   110ms |     8ms (-93%) |    13ms (-88%) |    46ms (-58%)
    6 |   102ms |     8ms (-92%) |    14ms (-87%) |    40ms (-62%)
    8 |    97ms |     9ms (-91%) |    15ms (-84%) |    46ms (-53%)
   10 |    85ms |    10ms (-88%) |    17ms (-80%) |    55ms (-36%)
   12 |    85ms |    12ms (-86%) |    20ms (-76%) |    64ms (-26%)
   15 |    85ms |    13ms (-85%) |    24ms (-72%) |    78ms ( -9%)
   20 |    79ms |    16ms (-80%) |    29ms (-63%) |   106ms ( 33%)
   40 |    89ms |    29ms (-68%) |    55ms (-39%) |   208ms (133%)
   60 |   101ms |    42ms (-58%) |    84ms (-17%) |   303ms (199%)
   80 |   117ms |    57ms (-52%) |   107ms ( -9%) |   400ms (241%)
  100 |   127ms |    68ms (-47%) |   132ms (  3%) |   503ms (293%)
  200 |   194ms |   135ms (-31%) |   258ms ( 32%) |   995ms (411%)
  400 |   330ms |   266ms (-20%) |   514ms ( 55%) |  1998ms (504%)
  600 |   460ms |   397ms (-14%) |   774ms ( 67%) |  2992ms (549%)
  800 |   590ms |   532ms (-10%) |  1029ms ( 74%) |  3985ms (575%)
 1000 |   725ms |   658ms (-10%) |  1295ms ( 78%) |  5000ms (589%)
 2000 |  1387ms |  1315ms ( -6%) |  2596ms ( 87%) | 10009ms (621%)
 3000 |  2037ms |  1992ms ( -3%) |  3857ms ( 89%) | 15088ms (640%)
 4000 |  2716ms |  2643ms ( -3%) |  5175ms ( 90%) | 20128ms (640%)
 6000 |  4043ms |  3973ms ( -2%) |  7768ms ( 92%) | 30148ms (645%)
 8000 |  5354ms |  5285ms ( -2%) | 10362ms ( 93%) | 40199ms (650%)
 9999 |  6675ms |  6594ms ( -2%) | 12897ms ( 93%) | 50212ms (652%)

1 content element, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   236ms |    20ms (-92%) |    44ms (-82%) |    32ms (-87%)
    2 |   135ms |     9ms (-93%) |    18ms (-87%) |    38ms (-72%)
    4 |   110ms |    10ms (-91%) |    17ms (-84%) |    46ms (-58%)
    6 |   105ms |    10ms (-90%) |    20ms (-81%) |    42ms (-60%)
    8 |    99ms |    12ms (-88%) |    25ms (-75%) |    50ms (-50%)
   10 |    90ms |    14ms (-84%) |    28ms (-69%) |    60ms (-34%)
   12 |    89ms |    16ms (-82%) |    32ms (-64%) |    71ms (-21%)
   15 |    89ms |    19ms (-79%) |    38ms (-58%) |    89ms ( -1%)
   20 |    87ms |    23ms (-74%) |    46ms (-47%) |   117ms ( 35%)
   40 |   105ms |    43ms (-59%) |    88ms (-16%) |   228ms (117%)
   60 |   122ms |    62ms (-49%) |   131ms (  7%) |   332ms (171%)
   80 |   145ms |    82ms (-43%) |   176ms ( 21%) |   441ms (204%)
  100 |   163ms |   103ms (-37%) |   212ms ( 30%) |   548ms (235%)
  200 |   265ms |   202ms (-24%) |   423ms ( 59%) |  1111ms (318%)
  400 |   464ms |   398ms (-15%) |   836ms ( 79%) |  2227ms (379%)
  600 |   665ms |   602ms (-10%) |  1252ms ( 88%) |  3342ms (402%)
  800 |   863ms |   801ms ( -8%) |  1676ms ( 94%) |  4430ms (413%)
 1000 |  1073ms |  1004ms ( -7%) |  2087ms ( 94%) |  5524ms (414%)
 2000 |  2065ms |  1994ms ( -4%) |  4182ms (102%) | 11119ms (438%)
 3000 |  3065ms |  2999ms ( -3%) |  6286ms (105%) | 16660ms (443%)
 4000 |  4073ms |  3966ms ( -3%) |  8363ms (105%) | 22165ms (444%)
 6000 |  6071ms |  6003ms ( -2%) | 12598ms (107%) | 33311ms (448%)
 8000 |  8086ms |  8006ms ( -1%) | 16718ms (106%) | 44384ms (448%)
 9999 | 10052ms |  9989ms ( -1%) | 20959ms (108%) | 55413ms (451%)

2 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   240ms |    19ms (-93%) |    47ms (-81%) |    32ms (-87%)
    2 |   130ms |    10ms (-92%) |    19ms (-85%) |    39ms (-70%)
    4 |   111ms |    11ms (-90%) |    21ms (-81%) |    47ms (-58%)
    6 |   107ms |    12ms (-89%) |    25ms (-77%) |    44ms (-59%)
    8 |   103ms |    14ms (-86%) |    30ms (-71%) |    52ms (-50%)
   10 |    93ms |    16ms (-83%) |    35ms (-62%) |    63ms (-33%)
   12 |    91ms |    19ms (-79%) |    40ms (-56%) |    74ms (-19%)
   15 |    94ms |    21ms (-77%) |    48ms (-49%) |    91ms ( -3%)
   20 |    90ms |    28ms (-69%) |    60ms (-33%) |   123ms ( 36%)
   40 |   113ms |    55ms (-51%) |   116ms (  2%) |   237ms (109%)
   60 |   139ms |    82ms (-42%) |   172ms ( 23%) |   353ms (153%)
   80 |   169ms |   108ms (-36%) |   230ms ( 36%) |   473ms (179%)
  100 |   198ms |   137ms (-31%) |   290ms ( 46%) |   596ms (201%)
  200 |   381ms |   314ms (-18%) |   609ms ( 59%) |  1241ms (225%)
  400 |  1007ms |   943ms ( -7%) |  1531ms ( 52%) |  2683ms (166%)
  600 |  1887ms |  1846ms ( -3%) |  2742ms ( 45%) |  4407ms (133%)
  800 |  3244ms |  3159ms ( -3%) |  4333ms ( 33%) |  6326ms ( 95%)
 1000 |  4798ms |  4735ms ( -2%) |  6216ms ( 29%) |  8440ms ( 75%)
 2000 | 30837ms | 30734ms ( -1%) | 33919ms (  9%) | 21383ms (-31%)
 3000 | 70254ms | 70443ms (  0%) | 75177ms (  7%) | 38438ms (-46%)
 4000 |126738ms |126845ms (  0%) |131896ms (  4%) | 59297ms (-54%)
 6000 |285009ms |286287ms (  0%) |295802ms (  3%) |112712ms (-61%)
 8000 |506940ms |508315ms (  0%) |520498ms (  2%) |181728ms (-65%)
 9999 |794292ms |793994ms ( -1%) |810903ms (  2%) |267628ms (-67%)

4 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   248ms |    17ms (-94%) |    51ms (-80%) |    32ms (-88%)
    2 |   133ms |    12ms (-91%) |    23ms (-83%) |    41ms (-69%)
    4 |   114ms |    14ms (-88%) |    30ms (-74%) |    51ms (-56%)
    6 |   110ms |    16ms (-85%) |    36ms (-68%) |    49ms (-56%)
    8 |   107ms |    19ms (-82%) |    43ms (-60%) |    57ms (-48%)
   10 |   101ms |    23ms (-78%) |    51ms (-50%) |    69ms (-32%)
   12 |   102ms |    26ms (-75%) |    60ms (-42%) |    85ms (-18%)
   15 |   103ms |    29ms (-72%) |    71ms (-32%) |   102ms ( -1%)
   20 |    99ms |    39ms (-61%) |    91ms ( -9%) |   135ms ( 35%)
   40 |   141ms |    74ms (-48%) |   179ms ( 26%) |   263ms ( 86%)
   60 |   170ms |   116ms (-32%) |   261ms ( 53%) |   391ms (129%)
   80 |   206ms |   147ms (-29%) |   346ms ( 67%) |   517ms (150%)
  100 |   258ms |   186ms (-28%) |   440ms ( 70%) |   656ms (154%)
  200 |   475ms |   419ms (-12%) |   914ms ( 92%) |  1361ms (186%)
  400 |  1223ms |  1157ms ( -6%) |  2152ms ( 75%) |  2923ms (138%)
  600 |  2196ms |  2159ms ( -2%) |  3655ms ( 66%) |  4635ms (111%)
  800 |  3503ms |  3423ms ( -3%) |  5398ms ( 54%) |  6578ms ( 87%)
 1000 |  5079ms |  5295ms (  4%) |  7791ms ( 53%) |  8668ms ( 70%)
 2000 | 25708ms | 23913ms ( -7%) | 28935ms ( 12%) | 22370ms (-13%)
 3000 | 65818ms | 67420ms (  2%) | 74858ms ( 13%) | 40017ms (-40%)
 4000 |125688ms |126088ms (  0%) |136408ms (  8%) | 61669ms (-51%)
 6000 |287168ms |288365ms (  0%) |303169ms (  5%) |117042ms (-60%)
 8000 |512868ms |513733ms (  0%) |533504ms (  4%) |188468ms (-64%)
 9999 |799576ms |802145ms (  0%) |828192ms (  3%) |275584ms (-66%)

10 content elements, 100 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |   255ms |    18ms (-93%) |    63ms (-76%) |    35ms (-86%)
    2 |   137ms |    17ms (-88%) |    37ms (-73%) |    46ms (-66%)
    4 |   123ms |    23ms (-82%) |    52ms (-58%) |    59ms (-53%)
    6 |   122ms |    28ms (-78%) |    66ms (-47%) |    61ms (-50%)
    8 |   122ms |    34ms (-73%) |    80ms (-35%) |    72ms (-42%)
   10 |   117ms |    40ms (-66%) |    97ms (-17%) |    87ms (-26%)
   12 |   119ms |    47ms (-61%) |   113ms ( -6%) |   104ms (-13%)
   15 |   126ms |    54ms (-57%) |   138ms (  9%) |   128ms (  1%)
   20 |   138ms |    72ms (-48%) |   183ms ( 32%) |   170ms ( 23%)
   40 |   199ms |   138ms (-31%) |   340ms ( 70%) |   324ms ( 62%)
   60 |   265ms |   204ms (-24%) |   514ms ( 93%) |   495ms ( 86%)
   80 |   334ms |   275ms (-18%) |   677ms (102%) |   651ms ( 94%)
  100 |   402ms |   345ms (-15%) |   854ms (112%) |   819ms (103%)
  200 |   797ms |   734ms ( -8%) |  1735ms (117%) |  1693ms (112%)
  400 |  1884ms |  1809ms ( -4%) |  3854ms (104%) |  3634ms ( 92%)
  600 |  3247ms |  3177ms ( -3%) |  6219ms ( 91%) |  5827ms ( 79%)
  800 |  4934ms |  4835ms ( -3%) |  8930ms ( 80%) |  8146ms ( 65%)
 1000 |  7037ms |  6977ms ( -1%) | 12088ms ( 71%) | 10806ms ( 53%)
 2000 | 32551ms | 32032ms ( -2%) | 42344ms ( 30%) | 27523ms (-16%)
 3000 | 86523ms | 86692ms (  0%) |102437ms ( 18%) | 49170ms (-44%)
 4000 |157235ms |157528ms (  0%) |178602ms ( 13%) | 75523ms (-52%)
 6000 |355111ms |355885ms (  0%) |387521ms (  9%) |142279ms (-60%)
 8000 |628937ms |630111ms (  0%) |703281ms ( 11%) |228502ms (-64%)
 9999 |979335ms |980668ms (  0%) |1037671ms (  5%) |334994ms (-66%)

100 content elements, 10 runs
nodes |composed |  pre-compiled  |  process pipe  | extension
    1 |    75ms |    10ms (-86%) |    48ms (-36%) |    11ms (-86%)
    2 |    33ms |    10ms (-70%) |    25ms (-25%) |    14ms (-58%)
    4 |    36ms |    15ms (-58%) |    39ms ( 10%) |    22ms (-38%)
    6 |    39ms |    20ms (-49%) |    54ms ( 37%) |    32ms (-19%)
    8 |    43ms |    25ms (-42%) |    70ms ( 59%) |    41ms ( -7%)
   10 |    47ms |    29ms (-38%) |    82ms ( 73%) |    47ms (  0%)
   12 |    50ms |    33ms (-34%) |    96ms ( 89%) |    54ms (  7%)
   15 |    57ms |    40ms (-30%) |   115ms ( 99%) |    63ms ( 10%)
   20 |    67ms |    52ms (-23%) |   152ms (124%) |    73ms (  7%)
   40 |   116ms |    99ms (-14%) |   292ms (152%) |   128ms ( 11%)
   60 |   164ms |   153ms ( -7%) |   434ms (163%) |   189ms ( 14%)
   80 |   219ms |   200ms ( -9%) |   571ms (159%) |   255ms ( 16%)
  100 |   269ms |   253ms ( -6%) |   711ms (164%) |   320ms ( 19%)
  200 |   534ms |   528ms ( -2%) |  1449ms (171%) |   642ms ( 20%)
  400 |  1161ms |  1157ms ( -1%) |  2994ms (157%) |  1329ms ( 14%)
  600 |  1870ms |  1878ms (  0%) |  4667ms (149%) |  2044ms (  9%)
  800 |  2699ms |  2690ms ( -1%) |  6400ms (137%) |  2789ms (  3%)
 1000 |  3630ms |  3643ms (  0%) |  8303ms (128%) |  3578ms ( -2%)
 2000 | 10629ms | 10900ms (  2%) | 20206ms ( 90%) |  7980ms (-25%)
 3000 | 21130ms | 21548ms (  1%) | 35574ms ( 68%) | 13244ms (-38%)
 4000 | 34769ms | 35397ms (  1%) | 54171ms ( 55%) | 19251ms (-45%)
 6000 | 70734ms | 72451ms (  2%) |100466ms ( 42%) | 33602ms (-53%)
 8000 |119603ms |122070ms (  2%) |160432ms ( 34%) | 51418ms (-58%)
 9999 |182144ms |185494ms (  1%) |231385ms ( 27%) | 72163ms (-61%)

nodes: Number of input test elements 1..9999 (per namespace => 2..19998 child nodes).
composed: Approach 1 with one XSLT importing 2 others. Scanning the input for namespaces. (100%).
pre-compiled: Run 1a with one static, pre-compiled XSLT for comparison. Still scanning the input for namespaces.
process pipe: Approach 2 with 3-stage XSLTs. Scanning the input for namespaces.
extension: Approach 3 with one XSLT for the root namespace and with a separate transformation for every test element.

Noticed
As expected the static, pre-compiled XSLT performs generally best.

As expected the pre-compiled savings become relative as the input data size grows.

Multi-stage pipeline processing performs good for small documents and ok for bigger ones. As expected more namespaces -> more stylesheets -> more stages -> more time (test with the implementation, extended to 3 namespaces).

Compared to the other approaches the per-element transformations (approach 3) becomes expensive, when all the contained elements get transformed (> 100 "foreign" elements, 0 content elements), but improves rapidly with content elements.

What happens in the process with 2 content elements? With >2000 elements, suddenly the per-element transformation is faster, than a transformation with a pre-compiled stylesheet? Even though the per-namespaces stylesheets only need to process one element per transformation, the initial stylesheet still needs to match all the elements, much like the pre-compiled, all-in-one stylesheet. The tree walk to collect the namespaces (not needed in approach 3) cannot be the issue here, can it?

Also while it takes about 7s to process 9999 elements with no content and about 10s with 1 content element, it takes more than 13min for 9999 elements with 2 content elements!
Maybe with this size of the input document the garbage collector kicks in... But GC time seems to be constant and also there no significant change on a 64bit-VM with 6GB of heap.
Also, I guess, 9999 elements with one child element should need roughly the same or even a bigger amount of memory as 5000 elements with 2 child elements. To process the second it takes more than 10 times longer.

Conclusions
To generally solve our problem, the approach 2 seems to be most suitable, with a good performance also with larger documents.

Dynamically compound stylesheets would serve documents with more than 200 nodes better. Combined with a smart cache of pre-compiled stylesheets for often used namespace sets, it could also perform best for smaller documents.

The per-element transformation approach (3) works good for small documents and strangely for large documents (> 3000 nodes, foreign/root < 1/3) as well. Due to it's flexibility (individual stylesheet selection not only by namespace), it's also worth to be offered.

Update: please see Open XSLT Processing 2 to answer the open questions.