tag:blogger.com,1999:blog-64947659340880147612024-03-14T03:08:51.331-04:00Entropy ReductionFinding order and meaning in code.Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.comBlogger12125tag:blogger.com,1999:blog-6494765934088014761.post-65520660025535547122016-12-07T16:21:00.000-05:002016-12-07T16:25:30.391-05:00Heimdall, He Who Watches the Event Log<p>I have a tiny server, sitting in the cloud, running cygwin’s OpenSSH. Logging in requires an SSH key, but this doesn't stop people from trying all sorts of ways to get in.</p><p>Normally, this does nothing whatsoever, other than fill up my event logs with “invalid user” messages. However, I thought it might be nice to filter these users out at the firewall level. That’s what this program does: it monitors the event log for invalid SSH connection attempts, and adds the offending IP to the Windows Firewall list of blocked IP addresses.</p><p>Since my server is named Bifröst, I've named this program Heimdall. Heimdall is designed to be run from the command line, or as a scheduled task. While Heimdall is not a terribly complex program, I did learn a few things as I wrote it. This post is both to document what I found, and to serve as a reminder to me, should I ever need this information again.</p><p>The first thing Heimdall does is scan through the event log, looking for events matching a specific pattern. This is done using the <a href="https://msdn.microsoft.com/en-us/library/system.diagnostics.eventing.reader.eventlogreader.aspx">EventLogReader</a> and <a href="https://msdn.microsoft.com/en-us/library/system.diagnostics.eventing.reader.eventlogquery.aspx">EventLogQuery</a> classes. In my case, the relevant code followed this pseudo-code:</p><pre class="brush: csharp">public static IEnumerable<EventEntry> GetEvents(int entriesToScan)
{
    const string queryString = "*[System[Provider[@Name='sshd'] and EventID=0]]";
    var eventsQuery =
        new EventLogQuery("Application", PathType.LogName, queryString)
        {
            ReverseDirection = true
        };
    // The "EventEntry" class is just a model for holding information about
    // a single event. Keep reading for further details.
    var events = new List<EventEntry>();
    entriesToScan = Math.Max(entriesToScan, 1);
    try
    {
        using (var logReader = new EventLogReader(eventsQuery))
        {
            EventRecord eventInstance;
            int currentEvent;
            for (
                eventInstance = logReader.ReadEvent(), currentEvent = 1;
                eventInstance != null && currentEvent <= entriesToScan;
                eventInstance = logReader.ReadEvent(), currentEvent += 1)
            {
                EventEntry entry;
                try
                {
                    entry = EventEntry.From(eventInstance);
                }
                finally
                {
                    eventInstance.Dispose();
                }
                if (entry != null)
                {
                    events.Add(entry);
                }
            }
            eventInstance?.Dispose();
        }
    }
    catch (EventLogNotFoundException e)
    {
        Console.WriteLine("Failed to query the log! " + e);
        return null;
    }
    return events;
}</pre><p>I learned that the event log messages emitted from logReader.ReadEvent() implement <span class="keyword1">IDisposable</span> and should be disposed of.</p><p>I did not find any way to limit the number of items returned by the query, other than manually counting them. Since the query looks like a standard XPath query, I tried experimenting with <code>position()</code>, but I could not make it work.</p><p>In my case, I needed information out of the <code>EventData</code> section of the event object. I could not find a way to access this information using any of the convenience methods, but fortunately, the complete event information is available as XML. I was able to access the <code>EventData</code> by parsing the XML from the <code>EventRecord</code> object:</p><pre class="brush: csharp">public static string GetData(EventRecord eventInstance)
{
    const string namespaceName =
        "http://schemas.microsoft.com/win/2004/08/events/event";
    var eventDataName = XName.Get("EventData", namespaceName);
    var dataName = XName.Get("Data", namespaceName);
    return XDocument.Parse(eventInstance.ToXml())
        .Descendants(eventDataName).FirstOrDefault()?
        .Descendants(dataName).FirstOrDefault()?
        .Value;
}</pre><p>Once I had the event data, I was able to parse it using a regular expression, looking for a username and IP address. The username gets compared to a white-list of allowed usernames, and the IP address is checked to make sure it isn't coming from a private block. This is partially for my own protection: I figure that if I ever accidentally lock myself out, I can always try again from a private IP address.</p><p>If the attempted username is not on the white-list, and if the IP address is not private, then Heimdall adds the IP address to the list of addresses blocked by the Windows Firewall. Working with the Windows Firewall is done by way of the <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa365309.aspx"><code>INetFwPolicy2</code></a> interface.</p><p>For the purposes of this program, I make no attempt to create new firewall rules. Instead, I modify an existing Firewall rule, which I manually created beforehand. A useful addition to Heimdall would be the ability to create its own rule, to reduce the amount of manual configuration necessary. However, as of right now, Heimdall assumes that this rule already exists:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaLpQAKp8c08E0PGYr2vRLTVri3MpYs30yJQnLxMXWuz2_Sx0oR3BeJ8fzmqeyICdJ39xp_QPrsvnAZ2LsPf_68msHQHqJ-xZT8j9sJYNQcKonmA6622jiLiWv1kBjKVeZOkjdfIFslMRm/s1600/BlockSpecificIPs.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaLpQAKp8c08E0PGYr2vRLTVri3MpYs30yJQnLxMXWuz2_Sx0oR3BeJ8fzmqeyICdJ39xp_QPrsvnAZ2LsPf_68msHQHqJ-xZT8j9sJYNQcKonmA6622jiLiWv1kBjKVeZOkjdfIFslMRm/s320/BlockSpecificIPs.png" width="243" /></a></div><p>This is a simple blocking rule, which Heimdall is able to access by name:</p><pre class="brush: csharp">private static INetFwRule GetBlockingRule() =>
    ((INetFwPolicy2)
        Activator.CreateInstance(Type.GetTypeFromProgID("HNetCfg.FwPolicy2")))
        .Rules
        .OfType<INetFwRule>()
        .FirstOrDefault(r => r.Name == "Block Specific IPs");
</pre><p>Once we have a reference to the firewall rule, we just need to add the target IP address to the list of remote addresses affected by this rule.</p><p>I took some care to handle ranges of IP addresses; that is, if two or more adjacent IP addresses are blocked, they are added to the firewall as a range, instead of individual entries. I found the <a href="https://github.com/jsakamoto/ipaddressrange">IPAddressRange</a> project to be very useful in assisting with this.</p><pre class="brush: csharp">public static void BlockIp(IPAddress ip)
{
    var rule = GetBlockingRule();
    var addresses = rule.RemoteAddresses
        .Split(',')
        .Where(s => string.IsNullOrWhiteSpace(s) == false)
        .Select(IPAddressRange.Parse)
        .ToList();
    addresses.Add(new IPAddressRange(ip));
    // Details of ConsolidateRanges are not relevant to this post; check the
    // source on BitBucket if you are curious.
    addresses = ConsolidateRanges(addresses);
    rule.RemoteAddresses = string.Join(",", addresses.Select(r => r.ToString()));
}</pre><p>Finally, Heimdall sends me an email whenever a new IP address is blocked. I've had this program running for about a week now, and I've enjoyed seeing some of the creative usernames that people try to log in with. Over the past week alone, there have been 35 separate IP addresses blocked, which have attempted 205 unique usernames.</p><p>If you are curious enough to have read this far, you may be interested in seeing the <a href="https://bitbucket.org/ChrisNielsen/heimdall">source code for Heimdall</a>!</p>Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.com0tag:blogger.com,1999:blog-6494765934088014761.post-17541594789126448162013-10-23T19:00:00.000-04:002013-10-23T19:00:57.364-04:00Visual Studio as a Diffing Tool<p>I have recently learned how to make use of Visual Studio 2012's built-in diffing tool. I'm happy to report that it does a reasonably decent job at this. While diffing, you have access to Visual Studio's superior syntax highlighting and intellisense capabilities. Also, it reuses existing windows, which can be handy if your target file is already open in your IDE. I have always used external diffing tools in the past, but there is a certain appeal to having an integrated tool. Just being able to diff and edit in a consistent color scheme is nice.</p>
<p>Sadly, Visual Studio does <strong>not</strong> support three-way merging or directory comparison, so I won't be putting KDiff3 down just yet. Also, unless you have set up a Team Foundation source control server, this capability is rather obscure and difficult to invoke. It involves some command line incantations:</p>
<pre><code>devenv.exe /Diff Source Target [SourceName] [TargetName]</code></pre>
<p>That isn't too bad, as command line arguments go. Still, I do not want to specify the complete path for every source and target file that I want to compare. Fortunately, it is quite possible to set up Visual Studio as a merge tool option for TortoiseHg. To do so, edit your global mercurial.ini file, and add this to your [merge-tools] section (go ahead and create that section if it does not exist):</p>
<pre><code style="white-space: nowrap;">[merge-tools]<br />
vs.executable = ${ProgramFiles(x86)}/Microsoft Visual Studio 11.0/Common7/IDE/devenv.exe<br />
vs.gui = True<br />
vs.diffargs = /diff $parent $child "$plabel1" "$clabel"<br />
vs.priority = 1
</code></pre>
<p>Now go up to your [tortoisehg] section. Create or edit your vdiff setting to point to your newly defined merge tool. It should look like this (along with whatever other items you have in that section):</p>
<pre><code>[tortoisehg]
vdiff = vs</code></pre>
<p>Restart TortoiseHg. Now, when you ask TortoiseHg to diff a single file for you, it should open that file in Visual Studio. If you ask for a three-way merge or a directory diff, TortoiseHg is smart enough to know that Visual Studio can't handle it, and will pick a different tool.</p>Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.com0tag:blogger.com,1999:blog-6494765934088014761.post-4738059950155859802011-10-11T22:50:00.000-04:002016-12-07T16:17:54.988-05:00Introducing MinCat for MSBuild / Visual Studio<p>I am happy to release a new project today. Introducing <a href="https://bitbucket.org/ChrisNielsen/mincat">MinCat</a>: the JavaScript and CSS minimizer and concatenation utility for MSBuild and (optionally) Microsoft MVC.</p><h3>The Problem</h3><p>JavaScript presents an interesting problem to many web developers. As a client-side scripting language, it is very forgiving about how it is used, and how it is included in the context of a larger web document. There is a sloppy way to serve JavaScript, and a professional way to serve JavaScript.</p><p>The "sloppy" method covers many possibilities: inline amongst the HTML, buried inside the href of an anchor, or heaped into the <code><head></code> in a stack of script-tags that comes up to your eyeballs. All of these methods work, but they also have issues.</p><p>The current best practice recommends a different approach. Instead of including JavaScript inline, we should serve it as external files. Moreover, these files should be minimized to reduce the page load time. If multiple files are necessary, we should concatenate them to reduce the number of HTTP requests required to load them.</p><p>However beneficial this is to the end-user, the act of minimizing and concatenating script files adds an additional burden to the developer. Fortunately, it is quite possible to automate this process, and numerous tools exist to facilitate this. 
MinCat is such a tool.</p><h3>The Solution</h3><p>MinCat exists to address exactly the problem described above. It is a small, simple tool that lets you control the <b>min</b>imization and con<b>cat</b>enation of your JavaScript files. It is designed to integrate directly with MSBuild. Because Visual Studio uses MSBuild under the hood, this means it provides easy integration with your existing Visual Studio web projects.</p><p>Its usage is simple and straightforward. By adding <Minimize> or <Concatenate> commands to your project file, you can control how your JavaScript is prepared from your source files. For instance, this command would minimize all of the JavaScript files in your "Scripts" folder, placing the resulting files in "Scripts\min": <p><pre class="brush: xml"><Minimize Input="Scripts\*.js" Outpath="Scripts\min\" /></pre></p><p>For MVC projects, MinCat also provides an MVC Script helper extension. At a basic level, this extension will facilitate the switch between the minimized or development version of a particular file. However, it really begins to shine when combined with the new "directives" that MinCat offers (more on those later!)</p><p><pre class="brush: csharp">@Html.Script(Url.Content("~/Scripts/MyScript.js"))</pre></p><p>Since minimizing JavaScript is not always a quick operation, MinCat will only re-minimize files if they have changed since the last time they were minimized. It accomplishes this by comparing the last modified date of the source files with the corresponding date on the minimized files. This helps keep the overall build time down to a minimum.</p><p>MinCat comes bundled with the excellent <a href="http://developer.yahoo.com/yui/compressor/">YUICompressor</a>. 
However, it is designed so that the compressor is a modular component, and it could easily be reworked to use a different compression engine.</p><h3>New JavaScript Directives</h3><p>As if all that weren't enough, MinCat additionally provides support for a number of directives, designed to be placed in the actual source of individual JavaScript files. These provide further control over how each file is treated by the minimization process. Because these directives are fully supported by the (optional) MinCat MVC Script helper extension, they provide some exciting new ways to organize and structure your JavaScript.</p><dl><dt>/* @skip minimize */</dt>
<dd>This directive will cause the source file to never be minimized, even if it is otherwise included by a wildcard in the input path. This has no effect when used with the MVC helper extension.</dd>
<dt>/* @require "filename.js" */</dt>
<dd>This directive instructs the minimizer that the current file depends on code from another file. When this file is minimized, all required external files will be collected (recursively, so a required file can also require a file), sorted into the correct order, reduced to a distinct list, and concatenated in front of the file that specifies them. Therefore, the resulting minimized file will contain everything it needs to function. When used with the MVC Script helper extension, each required file will be loaded in its own <code><script /></code> tag, in the correct order, before the <code><script /></code> tag that contains your target file.</dd>
<dt>/* @include "filename.js" */</dt>
<dd>This directive causes the minimizer to include the complete contents of the specified file directly inline, wherever that directive is found. When used with the MVC Script helper extension, the loaded <code><script /></code> file will also include the desired content directly inline.</dd> </dl><p>Note the differences between <code>@require</code> and <code>@include</code>. The <code>@include</code> directive is primarily exciting because it allows you to create assembly-like structures:</p><br />
<pre class="brush: js">var Global = (function () {
    "use strict";
    var internal = {};
    function Global() {
        var instance = {};
        /* @include "Access to Internal, Instance, and Global Variables.js" */
    }
    /* @include "Access to Internal and Global Variables.js" */
    /* @require "Access to Global Variables Only.js" */
    return Global;
}());</pre><h3>Learn More!</h3><p>MinCat is open source software, released under a BSD License. It is <a href="https://bitbucket.org/ChrisNielsen/mincat">available on BitBucket</a>, and as a NuGet package.</p><p>Complete documentation is available on its <a href="https://bitbucket.org/ChrisNielsen/mincat/wiki/Home">wiki page</a>.</p><p>As always, I am excited to learn how people are using my tools. If you try this out, please feel free to drop me a note with your impressions. And of course, always let me know if you <a href="https://bitbucket.org/ChrisNielsen/mincat/issues?status=new&status=open">find a bug</a>!</p>Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.com0tag:blogger.com,1999:blog-6494765934088014761.post-60005028250400418562011-07-20T10:27:00.000-04:002011-07-20T10:27:38.553-04:00JSLint, Licensing, and JSON Visualization<p>The <a href="https://github.com/douglascrockford/JSLint/blob/1ebebf3313cfeede287120a79fd652df0d70af35/jslint.js">JSLint license</a> specifies that it should be used for Good, not Evil.</p>
<p>Suppose I construct some program that I believe to be Good, and use JSLint to improve the quality of that program. Further suppose that I release my code under a permissive license, such as a <a href="http://en.wikipedia.org/wiki/BSD_licenses">BSD license</a>. Now suppose that somebody else takes my program and uses it for Evil.</p>
<p>In this case, the evil-doer has not actually violated their license agreement with me, since BSD licenses permit Evil. However, the evil-doer has now profited from the increased code quality gained from using JSLint.</p>
<p>Is the evil-doer now in violation of the JSLint license, despite never actually using JSLint themselves? Alternatively, perhaps I am now in violation of the JSLint license? Does the JSLint license require me to use the "Good" clause in my license as well, to prevent this scenario?</p>
<p>These are the questions that trouble me this morning. By the way, I've released the <a href="http://chris.photobooks.com/json/default.js">source</a> to my <a href="http://chris.photobooks.com/json/">JSON visualizer</a> under a BSD license.</p>Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.com1tag:blogger.com,1999:blog-6494765934088014761.post-10756205258389936702010-12-29T09:59:00.001-05:002010-12-29T10:03:40.482-05:00Notes on GetSchemaTable<p>I recently found myself using the <a href="http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqldatareader.getschematable.aspx">SqlDataReader.GetSchemaTable</a> method. I made some notes detailing what it will return for various data types in SQL Server 2005.</p>
<table border="1" cellspacing="0" cellpadding="2">
<caption>Notes on GetSchemaTable</caption>
<thead>
<tr><td colspan="3"> </td><th colspan="2">Numeric</th><td> </td></tr>
<tr><th>ColumnName</th><th>DataType</th><th>ColumnSize</th><th>Precision</th><th>Scale</th><th>Notes</th></tr>
</thead>
<tbody>
<tr><td>bigint</td><td>Int64</td><td align="right">8</td><td align="right">19</td><td align="right">255</td><td> </td></tr>
<tr><td>bit</td><td>Boolean</td><td align="right">1</td><td align="right">255</td><td align="right">255</td><td> </td></tr>
<tr><td>decimal</td><td>Decimal</td><td align="right">17</td><td align="right">18</td><td align="right">0</td><td> </td></tr>
<tr><td>int</td><td>Int32</td><td align="right">4</td><td align="right">10</td><td align="right">255</td><td> </td></tr>
<tr><td>money</td><td>Decimal</td><td align="right">8</td><td align="right">19</td><td align="right">255</td><td> </td></tr>
<tr><td>numeric</td><td>Decimal</td><td align="right">17</td><td align="right">18</td><td align="right">0</td><td> </td></tr>
<tr><td>smallint</td><td>Int16</td><td align="right">2</td><td align="right">5</td><td align="right">255</td><td> </td></tr>
<tr><td>smallmoney</td><td>Decimal</td><td align="right">4</td><td align="right">10</td><td align="right">255</td><td> </td></tr>
<tr><td>tinyint</td><td>Byte</td><td align="right">1</td><td align="right">3</td><td align="right">255</td><td> </td></tr>
<tr><td>float</td><td>Double</td><td align="right">8</td><td align="right">15</td><td align="right">255</td><td> </td></tr>
<tr><td>real</td><td>Single</td><td align="right">4</td><td align="right">7</td><td align="right">255</td><td> </td></tr>
<tr><td>datetime</td><td>DateTime</td><td align="right">8</td><td align="right">23</td><td align="right">3</td><td> </td></tr>
<tr><td>smalldatetime</td><td>DateTime</td><td align="right">4</td><td align="right">16</td><td align="right">0</td><td> </td></tr>
<tr><td>char</td><td>String</td><td align="right">20</td><td align="right">255</td><td align="right">255</td><td> </td></tr>
<tr><td>varchar</td><td>String</td><td align="right">20</td><td align="right">255</td><td align="right">255</td><td> </td></tr>
<tr><td>text</td><td>String</td><td align="right">2147483647</td><td align="right">255</td><td align="right">255</td><td>isLong</td></tr>
<tr><td>nchar</td><td>String</td><td align="right">20</td><td align="right">255</td><td align="right">255</td><td> </td></tr>
<tr><td>nvarchar</td><td>String</td><td align="right">20</td><td align="right">255</td><td align="right">255</td><td> </td></tr>
<tr><td>varchar(max)</td><td>String</td><td align="right">2147483647</td><td align="right">255</td><td align="right">255</td><td>isLong</td></tr>
<tr><td>nvarchar(max)</td><td>String</td><td align="right">2147483647</td><td align="right">255</td><td align="right">255</td><td>isLong</td></tr>
<tr><td>ntext</td><td>String</td><td align="right">1073741823</td><td align="right">255</td><td align="right">255</td><td>isLong</td></tr>
<tr><td>timestamp</td><td>Byte[]</td><td align="right">8</td><td align="right">255</td><td align="right">255</td><td>IsRowVersion</td></tr>
<tr><td>xml</td><td>String</td><td align="right">2147483647</td><td align="right">255</td><td align="right">255</td><td>isLong</td></tr>
</tbody>
</table>
<p>I created these notes through the simple method of creating a table with every possible data type. Or rather, every data type that I am interested in, which is very nearly the same thing--I mean, who uses varbinary, really? I then used the GetSchemaTable method on my new table, and inspected the results.</p>
<p>There is a <a href="http://msdn.microsoft.com/en-us/library/ms131092(v=SQL.90).aspx">page on MSDN</a> that appears to contain this information, but upon closer inspection, it must be talking about something else entirely. For instance, it claims that the SQL Server data type "varchar" has no mapping whatsoever, not even as a System.String.</p>
<p>Certainly, this makes sense upon further reflection: it's possible to put any varchar value into a string, but not the other way around. Still, it's not terribly useful if you just want to know what sort of data type you can reasonably expect to get out of a given column.</p>
<p>I further note that the MSDN documentation states that the NumericPrecision column should be null for non-numeric data types, but this is simply not true.</p>Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.com0tag:blogger.com,1999:blog-6494765934088014761.post-65901583764706540972010-07-10T13:09:00.001-04:002016-12-07T16:13:12.302-05:00Conditional Operators and Bracket Notation<p>Did you know it is valid JavaScript to use conditional operators inside bracket notation to access object properties? For whatever reason, I have only just now realized this.</p><p>In other words, this is a perfectly valid fragment:</p><pre class="brush: js">var obj = {
    valid: [],
    invalid: []
};
items.forEach(function (item) {
    obj[item.isValid() ? "valid" : "invalid"].push(item);
});</pre><p>The conditional operator, which takes the form <code>condition ? ifTrue : ifFalse</code>, is a shorthand version of an if statement. In this case, it results in the name of the object property that I wish to access.</p><p>I am yet undecided on whether this is more or less readable than the verbose version:</p><pre class="brush: js">items.forEach(function (item) {
    if (item.isValid()) {
        obj.valid.push(item);
    } else {
        obj.invalid.push(item);
    }
});</pre><p>Perhaps I will look back on this post in a few months and marvel at how obvious it all seems in hindsight. In the meantime, it is nice that I am still learning new things about JavaScript.</p>Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.com0tag:blogger.com,1999:blog-6494765934088014761.post-31520554770851042832010-02-12T20:34:00.002-05:002016-12-07T16:11:25.843-05:00Wacky Code<p>This has got to be the most wacky code I have written all week:</p><pre class="brush: js">range1.setEndPoint('StartToStart', range2);
range1.setEndPoint('StartToStart', range2);</pre><p>Yes, the same line is there twice, and yes, it is supposed to be like that. The ranges in question are IE-only <a href="http://msdn.microsoft.com/en-us/library/ms535872%28VS.85%29.aspx">TextRange</a> objects, which I am using to manage the position of the cursor within a rich text editor. The line has to be repeated because the first execution doesn't quite work, but the second one does:</p><pre class="brush: xml"><p>widget{cursor}</p> <-- position after first line -->
<p>{cursor}gadget</p> <-- requested position,
                          position after second setting --></pre><p>Wacky. Sadly, this is just one of many TextRange eccentricities that I've run into.</p>Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.com0tag:blogger.com,1999:blog-6494765934088014761.post-25995830011873175232010-02-09T16:42:00.001-05:002016-12-07T16:02:40.526-05:00Spell Checking in HTML<p>I recently completed a spell checking module as part of a larger project for an HTML-based editor. Some of the challenges of the project were interesting; here are my experiences, the problems I encountered, and how I solved them.</p><p>The first step of spell checking a block of text is determining what the text actually is. For a simple text document, this is quite easy: it's no more than a string of characters. However, for an HTML document, it is not so simple. Markup introduces a lot of extra items to deal with: we want to check the text as it would appear to a human, not to a computer.</p><p>In other words, it is one thing to detect the word "html" in a string of words: "blah blah blah, html, blah blah blah," but quite another thing to detect the same word as it would appear to a human if embedded in a complex DOM structure: <code><p><b>h</b><i>t</i><em>m</em><strong>l</strong></p></code>.</p><p>One obvious way of solving this problem is to derive the textual value of a given parent element. This is quite easy: innerText works in IE, and textContent works in Mozilla. Thus, a human version of the text of a DOM element can be obtained thusly:</p><pre class="brush: js">var parent = document.body;
var text = parent.textContent; // for Mozilla
if (text === undefined) {
    text = parent.innerText; // for IE
}</pre><p>However, this leaves a more significant problem to deal with: once we determine which words are misspelled, how will we re-associate those words with the correct DOM structures? If "html" comes back as being incorrect, we will need to know which text nodes originally produced the word, for how else will we correct it? I have a suspicion that this could be solved with a clever application of TextRange objects in IE and Range objects in Mozilla, but that isn't the route I took (working with IE's TextRange is a headache all by itself anyway).</p><p>My basic strategy is to walk through the DOM and derive two simultaneous structures that represent each text node: one version contains the text as it would appear to a user, and the other version contains a map between the "human" version and the original DOM.</p><p>Rendering the HTML as text is not exactly straightforward either; one cannot simply concatenate the values of each individual text node. There are a number of HTML elements which introduce white space for humans, such as <code><p></code> or <code><div></code> or even some inline ones, like <code><br></code>. Similarly, there are some HTML elements which do not introduce white space, such as <code><b></code>, or <code><i></code>, or <code><em></code>. Other rules apply as well: if you have <code><b style="display: block;"></code> then you have a <code><b></code> that adds white space even when <code><b></code> normally wouldn't.</p><p>To resolve this, I spent some time empirically testing each suspicious HTML element to determine whether it would cause visible white space to be inserted inside a word. Using this list and a variety of other rules (floating nodes always make white space, display: block nodes always make white space, etc), I construct a text string out of the DOM that more-or-less resembles the words as a human would see them. 
Notably, it is not necessary to obtain an exact representation of white space; I don't particularly need to render tables as tables, I just need to know that a <code></td><td></code> causes a break in a word but a <code></span><span></code> doesn't.</p><p>As I construct this string, I keep track of each text node that I encounter in the DOM, and meticulously record the starting and ending indices of that node's textual content in my generated string.</p><p>As an example, consider the following DOM fragment, with text nodes shown explicitly:</p><pre class="brush: html"><p>
    <b>
        <textNode 1>h</textNode 1>
    </b>
    <i>
        <textNode 2>t</textNode 2>
    </i>
    <em>
        <textNode 3>m</textNode 3>
    </em>
    <strong>
        <textNode 4>l</textNode 4>
    </strong>
    <textNode 5> is complex</textNode 5>
</p></pre><p>After parsing the above, I would end up with a structure similar to this:</p><pre class="brush: js">result = {
    html: [
        { node: <textNode 1>, start: 0, end: 1 },
        { node: <textNode 2>, start: 1, end: 2 },
        { node: <textNode 3>, start: 2, end: 3 },
        { node: <textNode 4>, start: 3, end: 4 },
        { node: <textNode 5>, start: 4, end: 15 }
    ],
    text: 'html is complex'
}</pre><p>Once this process is complete, I split the <code>result.text</code> value into a list of unique individual words. Locating each word is easily accomplished through regular expressions, although special provisions must be made for hyphenated words and contractions. Unicode "smart quote" characters and their brethren are also normalized into regular ASCII quotes. This list is finally sent up to the server, where all the real work happens.</p><p>Given any list of words, there are several methods to perform spell checking upon it. For my purposes, I've implemented a dictionary-based approach, as this gives me a greater sense of confidence and considerably more control than statistical analysis. Locating misspelled words is therefore a trivial matter of determining whether each provided word is in the dictionary or not.</p><p>However, it is not enough to know which words are incorrect: I must also offer spelling suggestions. I accomplished this through implementing the Damerau-Levenshtein Distance algorithm, which has been more fun than I have had in code in quite some time. While an in-depth explanation can be found on Wikipedia, put simply, the algorithm can compare any two words and generate a numeric score that indicates how similar those two words are.</p><p>Armed with this capability, I am able to generate spelling suggestions by calculating the <a href="http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance">Damerau-Levenshtein Distance</a> between each misspelled word, and each word in the dictionary. The algorithm is quite speedy, but as the dictionary is quite large, some optimizations are in order: for instance, only words of a comparable length to the original word are tested.</p><p>While I am given to understand that a relational database is not ideal for this, I have implemented this algorithm as a C# assembly on SQL Server 2005. 
Although I feel that additional optimizations are no doubt possible, I am already able to process comparatively large documents in under a second using this method.</p><p>As a result of all this, the server sends back a list of misspelled words. The JavaScript code takes these words and locates each one within the <code>result.text</code> string that it previously constructed. For each word's matching indices in result.text, it scans through the list of nodes in result.html until it finds the text node or nodes that made up that word.</p><p>Finally! The code now knows which words are misspelled AND which nodes correspond to those words. From here, it is a relatively simple matter of presenting this information to the user and letting them decide how to handle it. For the time being, I've chosen to provide a dialog box, similar to how MS Word's spellchecker operates. This gives the user familiar buttons for things like "Ignore, Ignore All, Add to Dictionary, etc." I am rather tempted to go back and implement a Google-style interface, wherein the incorrect words are highlighted and the user is given a context menu instead. However, there are many more features left to complete, and I have to move on at some point.</p><p>As soon as the user decides on a new spelling for a word, I alter each text node that composed the original word, updating it to contain the new values instead. Should a word be split across multiple text nodes, I fill in the letters one at a time, from left to right, letting the final node be the one to expand or contract if there is a difference in the number of letters.</p><p>And that's all there is to it!</p>Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.com0tag:blogger.com,1999:blog-6494765934088014761.post-32471142245927309042009-12-31T20:45:00.005-05:002016-12-07T15:55:41.965-05:00XPath Axis Selectors Implemented in JavaScript<p>Here is a quick implementation of the XPath axis selectors in JavaScript. 
Each function accepts three parameters: a context node, a filtering function, and a stop function. The two function arguments are optional and may be omitted.</p><p>If the filter function is present, it will be executed for each node encountered along the requested axis. It must return true for the node to be included in the output.</p><p>If the stop function is present, it will also be executed for each node encountered. When and if it returns true, the search along the axis will stop, and whatever nodes have been accumulated up to (and including) that point will be returned.</p><p><a href="http://chris.photobooks.com/Other/blog/axis.js">Here it is for download</a>. <a href="http://chris.photobooks.com/Other/blog/axis-min.js">Here is a minified version</a>, only 2k!</p><p>This would be used like so:</p><pre class="brush: js">var node = document.getElementById('myTable'), results;
// Find all descendant nodes
results = AXIS.descendant(node);
// Find all descendant text nodes
results = AXIS.descendant(node, function (n) {
    return (n.nodeType === 3);
});
// Find the closest FORM ancestor:
function isForm(n) {
    return (n.nodeName.toLowerCase() === 'form');
}
results = AXIS.ancestor(node, isForm, isForm)[0];
// Locate all H1 items between this node and the next table:
results = AXIS.following(node, function (n) {
    return (n.nodeName.toLowerCase() === 'h1');
}, function (n) {
    return (n.nodeName.toLowerCase() === 'table');
});</pre><br />
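<p>The full implementation is in the download above. Purely as an illustration of the filter/stop contract described here, one axis walker might look like the following sketch. This is hypothetical code, not the actual contents of axis.js, and it uses plain objects with <code>childNodes</code> arrays in place of real DOM nodes so it can run anywhere:</p>

```javascript
// Sketch of a descendant-axis walker with optional filter and stop
// callbacks. Hypothetical: the real axis.js differs in detail, and
// plain objects stand in for DOM nodes here.
var AXIS = {
    // Depth-first, document-order walk of every descendant of ctx.
    // filter: include a node only if it returns true (optional).
    // stop:   halt the walk after this node is examined (optional).
    descendant: function (ctx, filter, stop) {
        var results = [];
        var done = false;
        function walk(node) {
            var i, children = node.childNodes || [];
            for (i = 0; i < children.length && !done; i += 1) {
                if (!filter || filter(children[i])) {
                    results.push(children[i]);
                }
                if (stop && stop(children[i])) {
                    done = true;
                    return;
                }
                walk(children[i]);
            }
        }
        walk(ctx);
        return results;
    }
};

// A tiny stand-in tree: div > (p > b), span
var tree = {
    name: 'div',
    childNodes: [
        { name: 'p', childNodes: [ { name: 'b', childNodes: [] } ] },
        { name: 'span', childNodes: [] }
    ]
};
var all = AXIS.descendant(tree);   // p, b, span (document order)
var bold = AXIS.descendant(tree, function (n) { return n.name === 'b'; });
```

<p>The real library operates on live DOM nodes (text nodes included), but the traversal and the filter/stop contract are the same idea: the stop node is still tested by the filter, which is why the <code>ancestor(node, isForm, isForm)</code> example above includes the form itself.</p>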
<p>These are specifically written to operate on nodes, not elements. In other words, text nodes will be included as potential return values. This is great for me, as my first use for these is to assist in determining what text the user has selected. If the inclusion of text nodes were not a requirement, one might consider optimizing the "descendant" axis to use <code>querySelectorAll('*')</code> instead, at least in modern browsers.</p><p>This should work in all browsers, although I did run into one snag with IE6. Apparently, if you have a <code><base></code> tag in the source, then the resulting tree structure ends up looking something like this:</p><pre class="brush: xml"><html>
<head>
<base>
<body>...</body> <-- Same body element!
</base>
</head>
<body>...</body> <-- Same body element!
</html></pre><br />
<p>That <code><body></code> tag there is not a duplicate; it's really the same element, just included in two places in the DOM. It is both a descendant and a sibling of the <code><head></code> element. This causes an infinite loop when crawling the "following" axis, as the code crawls out of the <code><body></code> into the <code><base></code>, then into the <code><head></code>, then enters the <code><body></code> again.</p><p>Conditional compilation is used to fix this specifically for IE6, so the performance of other browsers should not be affected.</p>Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.com0tag:blogger.com,1999:blog-6494765934088014761.post-57354471683062900482009-12-07T21:41:00.005-05:002016-12-07T15:49:30.761-05:00Flattening recursive XML data in SQL Server 2005<p>I recently had to flatten an XML data set that looked something like this:</p><br />
<pre class="brush: xml"><topic id="1" name="planets">
<topicList>
<topic id="2" name="jupiter">
<topicList>
<topic id="3" name="ganymede" />
<topic id="18" name="callisto" />
<topic id="92" name="io">
<topicList>
<topic id="21" name="europa" />
</topicList>
</topic>
</topicList>
</topic>
<topic id="7" name="saturn">
<topicList>
<topic id="11" name="titan" />
</topicList>
</topic>
</topicList>
</topic></pre><br />
<p>I needed to produce a result set that preserved the parent/child relationship between each topic. In other words, I needed to know that Callisto belonged to Jupiter when I finished. At first, I thought this was clearly a job for recursion. I envisioned leaping deftly from each topic to each of its children in turn. Indeed, with SQL Server 2005 and common table expressions, this is quite possible:</p><br />
<pre class="brush: sql">Declare @xml XML;
Set @xml = '.... (see above) ....';
With topics (id, parent, name, children)
As (
Select
@xml.value('/topic[1]/@id', 'integer') As id,
Cast(null As integer) As parent,
@xml.value('/topic[1]/@name', 'varchar(50)') As name,
@xml.query('/topic/topicList') As children
Union All
Select
child.node.value('@id', 'integer') As id,
topics.id As parent,
child.node.value('@name', 'varchar(50)') As name,
child.node.query('topicList') As children
From topics
Cross Apply topics.children.nodes('/topicList/topic') As child(node))
Select id, parent, name
From topics;</pre><br />
<p>As with all recursive <abbr title="Common Table Expressions">CTEs</abbr>, this one is broken into two parts. The rows returned by the first half are used as starting points for the recursion in the second half. In this case, the first half returns only a single row:</p><br />
<pre class="brush: sql">Select
@xml.value('/topic[1]/@id', 'integer') As id,
Cast(null As integer) As parent,
@xml.value('/topic[1]/@name', 'varchar(50)') As name,
@xml.query('/topic/topicList') As children</pre><br />
<table><caption>Result:</caption><thead><tr><th>id</th><th>parent</th><th>name</th><th>children</th></tr>
</thead><tbody><tr><td>1</td><td><i>NULL</i></td><td>planets</td><td><i>{xml}</i></td></tr></tbody></table><br />
<p>The second half handles the recursion part. The secret here is in the final column. Each topic row includes a reference to its own children. The recursive portion of the CTE works on each of these children, one by one. Each child further includes a reference to its own children, and so the cycle continues. Thus, each recursion builds on the rows produced by the previous recursion.</p><br />
<table><caption>Results:</caption><thead>
<tr><th>cycle</th><th>id</th><th>parent</th><th>name</th></tr>
</thead><tbody>
<tr><td>1</td><td>1</td><td><i>NULL</i></td><td>planets</td></tr>
<tr><td>2</td><td>2</td><td>1</td><td>jupiter</td></tr>
<tr><td>2</td><td>7</td><td>1</td><td>saturn</td></tr>
<tr><td>3</td><td>18</td><td>2</td><td>callisto</td></tr>
<tr><td>3</td><td>3</td><td>2</td><td>ganymede</td></tr>
<tr><td>3</td><td>92</td><td>2</td><td>io</td></tr>
<tr><td>3</td><td>11</td><td>7</td><td>titan</td></tr>
<tr><td>4</td><td>21</td><td>92</td><td>europa</td></tr>
</tbody></table><br />
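<p>For comparison, the same flattening can be reproduced outside the database. The sketch below is not part of the original solution; it is a hypothetical JavaScript version that models the document as nested objects rather than parsed XML and walks them recursively, producing the same (id, parent, name) rows:</p>

```javascript
// Hypothetical cross-check of the flattening, outside the database.
// Nested object literals stand in for the XML document above.
var planets = {
    id: 1, name: 'planets', topics: [
        { id: 2, name: 'jupiter', topics: [
            { id: 3, name: 'ganymede' },
            { id: 18, name: 'callisto' },
            { id: 92, name: 'io', topics: [ { id: 21, name: 'europa' } ] }
        ] },
        { id: 7, name: 'saturn', topics: [ { id: 11, name: 'titan' } ] }
    ]
};

// Walk the tree, recording each topic's id, parent id, and name.
function flatten(topic, parent, out) {
    out = out || [];
    out.push({ id: topic.id, parent: parent, name: topic.name });
    (topic.topics || []).forEach(function (child) {
        flatten(child, topic.id, out);
    });
    return out;
}

var rows = flatten(planets, null);
// e.g. callisto's row is { id: 18, parent: 2, name: 'callisto' }
```

<p>If the hierarchy changes, only the object literal needs updating; the walk itself is independent of depth, just like the recursive CTE.</p>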
<p>As interesting as this is, it is not terribly performant. But fear not: a better solution exists! Recall that each XML node contains information concerning its position within its document; it has references to its children, siblings, and parent nodes. Rather than trying to recursively walk the XML tree, then, it is possible to simply get a list of all "topic" nodes and derive the parent information from the node itself. This results in a much simpler query:</p><br />
<pre class="brush: sql">Select
topics.topic.value('@id', 'integer') As id,
topics.topic.value('../../@id', 'integer') As parent,
topics.topic.value('@name', 'varchar(50)') As name
From @xml.nodes('//topic') As topics(topic)</pre><br />
<p>Yup, that's all. The <code>@xml.nodes('//topic')</code> finds every <code><topic/></code> node, no matter how deeply nested it is. Then the parent id is found simply by looking at each topic's ancestors: <code>topics.topic.value('../../@id', 'integer')</code>.</p><p>Quick, easy, and simple!</p>Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.com0tag:blogger.com,1999:blog-6494765934088014761.post-62027943311103574132009-08-12T10:10:00.003-04:002009-08-12T10:20:33.571-04:00XML VisualizationI've created a new tool to assist with developing in XML. This is capable of taking an arbitrary XML document, and performing a variety of tasks on it.<br /><br />At the moment, it can beautify, or pretty-print, a file, apply an arbitrary xPath and show the results, and apply an arbitrary XSL transformation.<br /><br />All processing is done by the browser (there is no server-side component to this at all). Because of this, results will vary slightly by browser, since each browser flavor has idiosyncrasies involving their respective XML / XSL engines.<br /><br />It works in IE6+, Firefox, Opera, Safari, and Chrome. It was quite fun to make, and I'd love to hear suggestions for future improvements!<br /><br />The tool itself can be found here: <a href="http://chris.photobooks.com/xml/default.htm">http://chris.photobooks.com/xml/default.htm</a>Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.com1tag:blogger.com,1999:blog-6494765934088014761.post-57209134439155501002009-06-10T15:13:00.008-04:002016-12-07T15:39:40.281-05:00SQL Server Subquery QuirkHere's an interesting quirk of SQL Server 2005 that bit me today.<br />
<br />
Consider this fictional query:<br />
<br />
<pre class="brush: sql">Select pageID, givenName
From phy.Contacts
Where pageID In (
Select pageID
From (
Select id
From oth.Pages
Where [Hidden] = 0) As [subquery])</pre><br />
There are two levels of subqueries here. The innermost level returns a set of <code>[id]</code> values. The next level up asks the innermost level for a set of <code>[pageID]</code> values. Alas, the innermost query does not contain a <code>[pageID]</code> column.<br />
<br />
However, this query executes anyway, without any errors, and appears to return decent results. This surprised me: the middle-level query is looking for a column that doesn't exist. I would have expected an error message in this case, not a result set.<br />
<br />
This behavior is explained in a tiny "Caution" box in <a href="http://msdn.microsoft.com/en-us/library/ms178050.aspx">MSDN</a>:<br />
<br />
<blockquote class="tr_bq">If a column is referenced in a subquery that does not exist in the table referenced by the subquery's FROM clause, but exists in a table referenced by the outer query's FROM clause, the query executes without error. SQL Server implicitly qualifies the column in the subquery with the table name in the outer query.</blockquote><br />
So apparently, the middle-level query is pulling the <code>pageID</code> column from the OUTER query instead: SQL Server silently treats it as <code>Select phy.Contacts.pageID</code>. In this example, that means the <code>In</code> condition holds for every contact whenever <code>oth.Pages</code> contains at least one non-hidden row, yielding plausible-looking results instead of the error I expected. Fascinating.Anonymoushttp://www.blogger.com/profile/16784290449277155558noreply@blogger.com0