Three types of averages are used by statisticians: the mean (layperson’s average), the median, and the mode.
The mean is trivial—simply sum using Recipe 2.6 and divide by the count.
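For small node sets, the built-in XPath `sum()` function may be all you need. The following is a minimal sketch rather than the Recipe 2.6 approach; the template name `math:mean` and the namespace URI are placeholders chosen for illustration.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:math="http://www.example.com/math">
  <!-- The math namespace URI above is a hypothetical placeholder -->

  <!-- Mean of the numeric values of $nodes.
       An empty node set gives 0 div 0, which is NaN. -->
  <xsl:template name="math:mean">
    <xsl:param name="nodes" select="/.."/>
    <xsl:value-of select="sum($nodes) div count($nodes)"/>
  </xsl:template>

</xsl:stylesheet>
```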
The median is the number that falls in the middle of the set of numbers when they are sorted. If the count is even, then the mean of the two middle numbers is generally taken:
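<xsl:template name="math:median">
  <xsl:param name="nodes" select="/.."/>
  <xsl:variable name="count" select="count($nodes)"/>
  <xsl:variable name="middle" select="ceiling($count div 2)"/>
  <xsl:variable name="even" select="not($count mod 2)"/>
  <!-- Value at the middle position in sorted order -->
  <xsl:variable name="lo">
    <xsl:for-each select="$nodes">
      <xsl:sort data-type="number"/>
      <xsl:if test="position() = $middle"><xsl:value-of select="."/></xsl:if>
    </xsl:for-each>
  </xsl:variable>
  <!-- Value at the next sorted position when the count is even (the same
       position when it is odd). following-sibling cannot be used here
       because it follows document order, not sorted order. -->
  <xsl:variable name="hi">
    <xsl:for-each select="$nodes">
      <xsl:sort data-type="number"/>
      <xsl:if test="position() = $middle + $even"><xsl:value-of select="."/></xsl:if>
    </xsl:for-each>
  </xsl:variable>
  <!-- The middle value, or the sum of the two middle values -->
  <xsl:variable name="m1" select="$lo + $even * $hi"/>
  <!-- The median -->
  <xsl:value-of select="$m1 div ($even + 1)"/>
</xsl:template>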
Handling the even case relies on the Boolean-to-number conversion trick used in several other examples in this book. If the number of nodes is odd, $m1 ends up being equal to the middle node, and you divide by 1 to get the answer. On the other hand, if the number of nodes is even, $m1 ends up being the sum of the two middle nodes, and you divide by 2 to get the answer.
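The conversion trick can be seen in isolation with a small stand-alone stylesheet. This is only an illustrative sketch: the helper template `show` is hypothetical, and the stylesheet ignores its input document and simply prints the values that the median template derives from a count of 5 and a count of 6.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:call-template name="show">
      <xsl:with-param name="count" select="5"/>
    </xsl:call-template>
    <xsl:call-template name="show">
      <xsl:with-param name="count" select="6"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Hypothetical helper: prints the middle position, the even flag,
       and the divisor computed from $count -->
  <xsl:template name="show">
    <xsl:param name="count"/>
    <!-- not(0) is true, so $even is true exactly when $count is even -->
    <xsl:variable name="even" select="not($count mod 2)"/>
    <!-- In arithmetic, true converts to 1 and false to 0 -->
    <xsl:value-of select="concat('count=', $count,
                                 ' middle=', ceiling($count div 2),
                                 ' even=', $even,
                                 ' divisor=', $even + 1, '&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

For a count of 5 the divisor is 1, and for a count of 6 it is 2; the middle position is 3 in both cases.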
The mode is the most frequently occurring element(s) in a set of elements that need not be numbers. If identical nodes compare with equality on their string values, then the following solution does the trick:
<xsl:template name="math:mode">
  <xsl:param name="nodes" select="/.."/>
  <xsl:param name="max" select="0"/>
  <xsl:param name="mode" select="/.."/>
  <xsl:choose>
    <xsl:when test="not($nodes)">
      <xsl:copy-of select="$mode"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:variable name="first" select="$nodes[1]"/>
      <xsl:variable name="try" select="$nodes[. = $first]"/>
      <xsl:variable name="count" select="count($try)"/>
      <!-- Recurse with nodes not equal to first -->
      <xsl:call-template name="math:mode">
        <xsl:with-param name="nodes" select="$nodes[not(. = $first)]"/>
        <!-- If we have found a node that is more frequent, then pass
             its count; otherwise, pass the old max count -->
        <xsl:with-param name="max"
            select="($count > $max) * $count + not($count > $max) * $max"/>
        <!-- Compute the new mode as ... -->
        <xsl:with-param name="mode">
          <xsl:choose>
            <!-- the first element in try if we found a new max -->
            <xsl:when test="$count > $max">
              <xsl:copy-of select="$try[1]"/>
            </xsl:when>
            <!-- the old mode union the first element in try if we
                 found an equivalent count to current max -->
            <xsl:when test="$count = $max">
              <!-- Caution: you will need to convert $mode to a -->
              <!-- node set if you are using a version of XSLT -->
              <!-- that does not convert automatically -->
              <xsl:copy-of select="$mode | $try[1]"/>
            </xsl:when>
            <!-- otherwise the old mode stays the same -->
            <xsl:otherwise>
              <xsl:copy-of select="$mode"/>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:with-param>
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
If string-value equality is not the right test for your nodes, then replace the comparisons with an appropriate test. For example, if equality is contingent on an attribute called age, the test would be ./@age = $first/@age.
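To see what the attribute-based test selects, the sketch below performs a single partition step of the kind math:mode makes on each recursion, keyed on @age. It is not the full mode computation, and the input shape (a `people` element containing `person` elements with an `age` attribute) is an assumption made for illustration.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed input: <people><person age="30"/><person age="41"/>...</people> -->
  <xsl:template match="/people">
    <xsl:variable name="nodes" select="person"/>
    <xsl:variable name="first" select="$nodes[1]"/>
    <!-- Nodes considered equal to the first one, compared on @age -->
    <xsl:variable name="try" select="$nodes[./@age = $first/@age]"/>
    <!-- Nodes that would be passed to the next recursive call -->
    <xsl:variable name="rest" select="$nodes[not(./@age = $first/@age)]"/>
    <xsl:value-of select="concat('age ', $first/@age, ': ',
                                 count($try), ' matching, ',
                                 count($rest), ' remaining&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

Note that a `person` without an `age` attribute never matches the test, so it always lands in the remaining set.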
The variance and standard deviation are common statistical measures of dispersion, or the spread of the values about the average. The easiest way to compute a variance is to obtain three values: sum = the sum of the numbers, sum-sq = the sum of each number squared, and count = the size of the set of numbers. The variance is then (sum-sq - sum² / count) / (count - 1). You can compute them all in one shot with the following tail-recursive template:
<xsl:template name="math:variance">
  <xsl:param name="nodes" select="/.."/>
  <xsl:param name="sum" select="0"/>
  <xsl:param name="sum-sq" select="0"/>
  <xsl:param name="count" select="0"/>
  <xsl:choose>
    <xsl:when test="not($nodes)">
      <xsl:value-of select="($sum-sq - ($sum * $sum) div $count) div ($count - 1)"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:variable name="value" select="$nodes[1]"/>
      <xsl:call-template name="math:variance">
        <xsl:with-param name="nodes" select="$nodes[position() != 1]"/>
        <xsl:with-param name="sum" select="$sum + $value"/>
        <xsl:with-param name="sum-sq" select="$sum-sq + ($value * $value)"/>
        <xsl:with-param name="count" select="$count + 1"/>
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
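As a quick check of the three-value computation, consider the sample 2, 4, 4, 4, 5, 5, 7, 9. These numbers are chosen here purely for illustration and do not come from the recipe.

```latex
\begin{align*}
\text{sum}      &= 2 + 4 + 4 + 4 + 5 + 5 + 7 + 9 = 40 \\
\text{sum-sq}   &= 4 + 16 + 16 + 16 + 25 + 25 + 49 + 81 = 232 \\
\text{count}    &= 8 \\
\text{variance} &= \frac{\text{sum-sq} - \text{sum}^2 / \text{count}}{\text{count} - 1}
                 = \frac{232 - 1600 / 8}{7}
                 = \frac{32}{7} \approx 4.571
\end{align*}
```

The same result follows from the definition: the mean is 5, and the squared differences 9, 1, 1, 1, 0, 0, 4, 16 also sum to 32.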
You may recognize this template as a variation of math:sum that was extended to compute the other two components of the variance calculation. As such, an XSLT implementation without support for tail recursion runs into trouble on large sets. In that case, you must take an alternative piecewise strategy based on the standard definition of variance: ∑(mean - xᵢ)² / (count - 1). First, compute the mean by using the divide-and-conquer or batch forms of sum and dividing by the count. Then use a divide-and-conquer or batch template that computes the sum of the squares of the differences between the mean and each number. Finally, divide the result by count - 1.
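One way to sketch the piecewise strategy is shown below. It is an illustration under stated assumptions rather than the recipe's own code: the template names `math:sum-sq-diff` and `math:variance-dvc` and the namespace URI are hypothetical, and the mean is computed with the built-in `sum()` function where the text suggests the sum templates.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:math="http://www.example.com/math">
  <!-- The math namespace URI above is a hypothetical placeholder -->

  <!-- Divide and conquer: sum of (mean - x) squared over $nodes.
       Recursion depth grows with log2 of the node count. -->
  <xsl:template name="math:sum-sq-diff">
    <xsl:param name="nodes" select="/.."/>
    <xsl:param name="mean" select="0"/>
    <xsl:variable name="count" select="count($nodes)"/>
    <xsl:choose>
      <xsl:when test="$count = 0">0</xsl:when>
      <xsl:when test="$count = 1">
        <xsl:value-of select="($mean - $nodes[1]) * ($mean - $nodes[1])"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="half" select="floor($count div 2)"/>
        <!-- Process each half separately, then add the partial sums -->
        <xsl:variable name="left">
          <xsl:call-template name="math:sum-sq-diff">
            <xsl:with-param name="nodes" select="$nodes[position() &lt;= $half]"/>
            <xsl:with-param name="mean" select="$mean"/>
          </xsl:call-template>
        </xsl:variable>
        <xsl:variable name="right">
          <xsl:call-template name="math:sum-sq-diff">
            <xsl:with-param name="nodes" select="$nodes[position() &gt; $half]"/>
            <xsl:with-param name="mean" select="$mean"/>
          </xsl:call-template>
        </xsl:variable>
        <xsl:value-of select="$left + $right"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Variance from the definition: sum of squared differences
       divided by (count - 1) -->
  <xsl:template name="math:variance-dvc">
    <xsl:param name="nodes" select="/.."/>
    <xsl:variable name="count" select="count($nodes)"/>
    <xsl:variable name="mean" select="sum($nodes) div $count"/>
    <xsl:variable name="ss">
      <xsl:call-template name="math:sum-sq-diff">
        <xsl:with-param name="nodes" select="$nodes"/>
        <xsl:with-param name="mean" select="$mean"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:value-of select="$ss div ($count - 1)"/>
  </xsl:template>

</xsl:stylesheet>
```

Splitting the node set in half keeps the call stack shallow even when the processor does not optimize tail calls, at the cost of evaluating a positional predicate at every level.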
Once you can compute the variance, the standard deviation follows as the square root of the variance. See Recipe 2.5 for square root.
Statistical functions are common tools for analyzing numerical data, and these templates can be a useful addition to your toolkit. However, XSLT was never intended as a tool for statistical analysis. An alternate approach would use XSLT as a frontend for converting XML data to comma- or tab-delimited data and then import this data into a spreadsheet or statistics package.
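A minimal sketch of such a frontend follows. The input shape (a `data` element containing `row` elements whose children hold the values) is an assumption made for illustration, and values containing commas or quotes are not escaped.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed input: <data><row><a>1</a><b>2</b></row>...</data> -->
  <xsl:template match="/data">
    <xsl:for-each select="row">
      <!-- One output line per row, child values separated by commas -->
      <xsl:for-each select="*">
        <xsl:value-of select="."/>
        <xsl:if test="position() != last()">,</xsl:if>
      </xsl:for-each>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For tab-delimited output, replace the comma with `<xsl:text>&#9;</xsl:text>`.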