Tuesday, February 9, 2010

Spell Checking in HTML

I recently completed a spell checking module as part of a larger project for an HTML-based editor. Some of the challenges of the project were interesting; here are my experiences, the problems I encountered, and how I solved them.

The first step of spell checking a block of text is determining what the text actually is. For a simple text document, this is quite easy: it's no more than a string of characters. However, for an HTML document, it is not so simple. Markup introduces a lot of extra items to deal with: we want to check the text as it would appear to a human, not to a computer.

In other words, it is one thing to detect the word "html" in a string of words: "blah blah blah, html, blah blah blah," but quite another thing to detect the same word as it would appear to a human if embedded in a complex DOM structure: <p><b>h</b><i>t</i><em>m</em><strong>l</strong></p>.

One obvious way of solving this problem is to derive the textual value of a given parent element. This is quite easy: innerText works in IE, and textContent works in Mozilla. Thus, a human version of the text of a DOM element can be obtained thusly:

var parent = document.body,
text = parent.textContent; // for Mozilla
if (text === undefined) {
    text = parent.innerText; // for IE
}

However, this leaves a more significant problem to deal with: once we determine which words are misspelled, how will we re-associate those words with the correct DOM structures? If "html" comes back as being incorrect, we will need to know which text nodes originally produced the word, for how else will we correct it? I have a suspicion that this could be solved with a clever application of TextRange objects in IE and Range objects in Mozilla, but that isn't the route I took (working with IE's TextRange is a headache all by itself anyway).

My basic strategy is to walk through the DOM and derive two simultaneous structures that represent each text node: one version contains the text as it would appear to a user, and the other version contains a map between the "human" version and the original DOM.

Rendering the HTML as text is not exactly straightforward either; one cannot simply concatenate the values of each individual text node. There are a number of HTML elements which introduce white space for humans, such as <p> or <div> or even some inline ones, like <br>. Similarly, there are some HTML elements which do not introduce white space, such as <b>, or <i>, or <em>. Other rules apply as well: if you have <b style="display: block;"> then you have a <b> that adds white space even when <b> normally wouldn't.

To resolve this, I spent some time empirically testing each suspicious HTML element to determine whether it would cause visible white space to be inserted inside a word. Using this list and a variety of other rules (floating nodes always make white space, display: block nodes always make white space, etc), I construct a text string out of the DOM that more-or-less resembles the words as a human would see them. Notably, it is not necessary to obtain an exact representation of white space; I don't particularly need to render tables as tables, I just need to know that a </td><td> causes a break in a word but a </span><span> doesn't.

As I construct this string, I keep track of each text node that I encounter in the DOM, and meticulously record the starting and ending indices of that node's textual content in my generated string.

As an example, consider the following DOM fragment, with text nodes shown explicitly:

<p>
    <b>
        <textNode 1>h</textNode 1>
    </b>
    <i>
        <textNode 2>t</textNode 2>
    </i>
    <em>
        <textNode 3>m</textNode 3>
    </em>
    <strong>
        <textNode 4>l</textNode 4>
    </strong>
    <textNode 5> is complex</textNode 5>
</p>

After parsing the above, I would end up with a structure similar to this:

result = {
    html: [
        { node: <textNode 1>, start: 0, end: 1 },
        { node: <textNode 2>, start: 1, end: 2 },
        { node: <textNode 3>, start: 2, end: 3 },
        { node: <textNode 4>, start: 3, end: 4 },
        { node: <textNode 5>, start: 4, end: 15 }
    ],
    text: 'html is complex'
}

Once this process is complete, I split the result.text value into a list of unique individual words. Locating each word is easily accomplished through regular expressions, although special provisions must be made for hyphenated words and contractions. Unicode "smart quote" characters and their brethren are also normalized into regular ASCII quotes. This list is finally sent up to the server, where all the real work happens.

Given any list of words, there are several methods to perform spell checking upon it. For my purposes, I've implemented a dictionary-based approach, as this gives me a greater sense of confidence and considerably more control than statistical analysis. Locating misspelled words is therefore a trivial matter of determining whether each provided word is in the dictionary or not.

However, it is not enough to know which words are incorrect: I must also offer spelling suggestions. I accomplished this through implementing the Damerau-Levenshtein Distance algorithm, which has been more fun than I have had in code in quite some time. While an in-depth explanation can be found on Wikipedia, put simply, the algorithm can compare any two words and generate a numeric score that indicates how similar those two words are.

Armed with this capability, I am able to generate spelling suggestions by calculating the Damerau-Levenshtein Distance between each misspelled word, and each word in the dictionary. The algorithm is quite speedy, but as the dictionary is quite large, some optimizations are in order: for instance, only words of a comparable length to the original word are tested.

While I am given to understand that a relational database is not ideal for this, I have implemented this algorithm as a C# assembly on SQL Server 2005. Although I feel that additional optimizations are no doubt possible, I am already able to process comparatively large documents in under a second using this method.

As a result of all this, the server sends back a list of misspelled words. The JavaScript code takes these words and locates each one within the result.text string that it previously constructed. For each word's matching indices in result.text, it scans through the list of nodes in result.html until it finds the text node or nodes that made up that word.

Finally! The code now knows which words are misspelled AND which nodes correspond to those words. From here, it is a relatively simple matter of presenting this information to the user and letting them decide how to handle it. For the time being, I've chosen to provide a dialog box, similar to how MS Word's spellchecker operates. This gives the user familiar buttons for things like "Ignore, Ignore All, Add to Dictionary, etc." I am rather tempted to go back and implement a Google-style interface, wherein the incorrect words are highlighted and the user is given a context menu instead. However, there are many more features left to complete, and I have to move on at some point.

As soon as the user decides on a new spelling for a word, I alter each text node that composed the original word, updating it to contain the new values instead. Should a word be split across multiple text nodes, I fill in the letters one at a time, from left to right, letting the final node be the one to expand or contract if there is a difference in the number of letters.

And that's all there is to it!

Thursday, December 31, 2009

XPath Axis Selectors Implemented in JavaScript

Here is a quick implementation of the XPath axis selectors in JavaScript. Each function accepts a context node as a parameter, a filtering function, and a stop function. The two function arguments are optional and may be omitted.

If the filter function is present, it will be executed for each node encountered along the requested axis. It must return true for the node to be included in the output.

If the stop function is present, it will also be executed for each node encountered. When and if it returns true, the search along the axis will stop, and whatever nodes have been accumulated up to (and including) that point will be returned.

Here it is for download. Here is a minified version, only 2k!

This would be used like so:

var node = document.getElementById('myTable'), results;

// Find all descendant nodes
results = AXIS.descendant(node);

// Find all descendant text nodes
results = AXIS.descendant(node, function (n) {
    return (n.nodeType === 3);
});

// Find the closest FORM ancestor:
function isForm(n) {
    return (n.nodeName.toLowerCase() === 'form');
}
results = AXIS.ancestor(node, isForm, isForm)[0];

// Locate all H1 items between this node and the next table:
results = AXIS.following(node, function (n) {
    return (n.nodeName.toLowerCase() === 'h1');
}, function (n) {
    return (n.nodeName.toLowerCase() === 'table');
});

These are specifically written to operate on nodes, not elements. In other words, text nodes will be included as potential return values. This is great for me, as my first use for these is to assist in determining what text the user has selected. If the inclusion of text nodes were not a requirement, then one might consider optimizing the "descendant" axis to use querySelectorAll('*') instead--at least, in modern browsers.

This should work in all browsers. I did run into one snag with IE6. Apparently, if you have a <base> tag in the source, then the resulting tree structure ends up looking something like this:

<html>
    <head>
        <base>
            <body>...</body> <-- Same body element!
        </base>
    </head>
    <body>...</body>         <-- Same body element!
</html>

That <body> tag there is not a duplicate; it's really the same element, just included in two places in the DOM. It is both a descendant and a sibling of the <head> element. This causes an infinite loop when crawling the "following" axis, as the code crawls out of the <body> into the <base>, then into the <head>, then enters the <body> again.

Conditional compilation is used to fix this specifically for IE6, so the performance of other browsers should not be affected.

Monday, December 7, 2009

Flattening recursive XML data in SQL Server 2005

I recently had to flatten an XML data set that looked something like this:

<topic id="1" name="planets">
<topicList>
<topic id="2" name="jupiter">
<topicList>
<topic id="3" name="ganymede" />
<topic id="18" name="callisto" />
<topic id="92" name="io">
<topicList>
<topic id="21" name="europa" />
</topicList>
</topic>
</topicList>
</topic>
<topic id="7" name="saturn">
<topicList>
<topic id="11" name="titan" />
</topicList>
</topic>
</topicList>
</topic>

I needed to produce a result set that preserved the parent/child relationship between each topic. In other words, I needed to know that Callisto belonged to Jupiter when I finished. At first, I thought this was clearly a job for recursion. I envisioned leaping deftly from each topic to each of its children in turn. Indeed, with SQL Server 2005 and common table expressions, this is quite possible:

Declare @xml XML;
Set @xml = '.... (see above) ....';
With topics (id, parent, name, children)
As (
Select @xml.value('/topic[1]/@id', 'integer') As id
,Cast(null As integer) As parent
,@xml.value('/topic[1]/@name', 'varchar(50)') As name
,@xml.query('/topic/topicList') As children

Union All

Select child.node.value('@id', 'integer') As id
,topics.id As parent
,child.node.value('@name', 'varchar(50)') As name
,child.node.query('topicList') As children

From topics
Cross Apply topics.children.nodes('/topicList/topic') As child(node))
Select id, parent, name
From topics;

As with all recursive CTEs, this one is broken into two parts. The rows returned by the first half are used as starting points for the recursion in the second half. In this case, the first half returns only a single row:


Select @xml.value('/topic[1]/@id', 'integer') As id
,Cast(null As integer) As parent
,@xml.value('/topic[1]/@name', 'varchar(50)') As name
,@xml.query('/topic/topicList') As children
Result:
idparentnamechildren
1NULLplanets{xml}

The second half handles the recursion part. The secret here is in the final column. Each topic row includes a reference to its own children. The recursive portion of the CTE works on each of these children, one by one. Each child further includes a reference to its own children, and so the cycle continues. Thus, each recursion builds on the rows produced by the previous recursion.

Results:
cycleidparentname
11NULLplanets
221jupiter
271saturn
3182callisto
332ganymede
3922io
3117titan
42192europa

As interesting as this is, it is not terribly performant. But fear not: a better solution exists! Recall that each XML node contains information concerning its position within its document; it has references to its children, siblings, and parent nodes. Rather than trying to recursively walk the XML tree, then, it is possible to simply get a list of all "topic" nodes and derive the parent information from the node itself. This results in a much simpler query:

Select topics.topic.value('@id', 'integer') As id
,topics.topic.value('../../@id', 'integer') As parent
,topics.topic.value('@name', 'varchar(50)') As name
From @xml.nodes('//topic') As topics(topic)

Yup, that's all. The @xml.nodes('//topic') finds every <topic/> node, no matter how deeply nested it is. Then the parent id is found simply by looking at each topic's ancestors: topics.topic.value('../../@id', 'integer').

Quick, easy, and simple!

Wednesday, August 12, 2009

XML Visualization

I've created a new tool to assist with developing in XML. This is capable of taking an arbitrary XML document, and performing a variety of tasks on it.

At the moment, it can beautify, or pretty-print, a file, apply an arbitrary xPath and show the results, and apply an arbitrary XSL transformation.

All processing is done by the browser (there is no server-side component to this at all). Because of this, results will vary slightly by browser, since each browser flavor has idiosyncrasies involving their respective XML / XSL engines.

It works in IE6+, Firefox, Opera, Safari, and Chrome. It was quite fun to make, and I'd love to hear suggestions for future improvements!

The tool itself can be found here: http://chris.photobooks.com/xml/default.htm

Wednesday, June 10, 2009

SQL Server Subquery Quirk

Here's an interesting quirk of SQL Server 2005 that bit me today.

Consider this fictional query:

Select pageID, givenName
From phy.Contacts
Where pageID In (
Select pageID
From (
Select id
From oth.Pages
Where [Hidden] = 0)
As [subquery])

Here there are two levels of subqueries here. The innermost level returns a set of [id] values. The next level up is asking the innermost level for a set of [pageID] values. Alas, the innermost query does not contain a [pageID] column.

However, this query executes anyway, without any errors, and appears to return decent results. This surprised me: the middle-level query is looking for a column that doesn't exist. I would have expected an error message in this case, not a result set.

This behavior is explained in a tiny "Caution" box in MSDN:

If a column is referenced in a subquery that does not exist in the table referenced by the subquery's FROM clause, but exists in a table referenced by the outer query's FROM clause, the query executes without error. SQL Server implicitly qualifies the column in the subquery with the table name in the outer query.

So apparently, the middle-level query is pulling the pageID column from the OUTER query instead. Fascinating.