Wednesday, October 23, 2013

Visual Studio as a Diffing Tool

I have recently learned how to make use of Visual Studio 2012's built-in diffing tool. I'm happy to report that it does a reasonably decent job. While diffing, you have access to Visual Studio's superior syntax highlighting and IntelliSense capabilities. Also, it reuses existing windows, which can be handy if your target file is already open in the IDE. I have always used external diffing tools in the past, but there is a certain appeal to having an integrated tool. Just being able to diff and edit in a consistent color scheme is nice.

Sadly, Visual Studio does not support three-way merging or directory comparison, so I won't be putting KDiff3 down just yet. Also, unless you have set up a Team Foundation source control server, this capability is rather obscure and difficult to invoke. It involves some command line incantations:

devenv.exe /Diff Source Target [SourceName] [TargetName]
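
For instance, a hypothetical invocation comparing two copies of a file, with friendly labels for each pane, might look like this:

devenv.exe /Diff "C:\old\Widget.cs" "C:\new\Widget.cs" "Old Widget" "New Widget"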

That isn't too bad, as command line arguments go. Still, I do not want to specify the complete path for every source and target file that I want to compare. Fortunately, it is quite possible to set up Visual Studio as a merge tool option for TortoiseHg. To do so, edit your global mercurial.ini file, and add this to your [merge-tools] section (go ahead and create that section if it does not exist):

[merge-tools]
vs.executable = ${ProgramFiles(x86)}/Microsoft Visual Studio 11.0/Common7/IDE/devenv.exe
vs.gui = True
vs.diffargs = /diff $parent $child "$plabel1" "$clabel"
vs.priority = 1

Now go up to your [tortoisehg] section. Create or edit your vdiff setting to point to your newly defined merge tool. It should look like this (along with whatever other items you have in that section):

[tortoisehg]
vdiff = vs

Restart TortoiseHg. Now, when you ask TortoiseHg to diff a single file for you, it should open that file in Visual Studio. If you ask for a three-way merge or a directory diff, TortoiseHg is smart enough to know that Visual Studio can't handle it, and will pick a different tool.

Tuesday, October 11, 2011

Introducing MinCat for MSBuild / Visual Studio

I am happy to release a new project today. Introducing MinCat: the JavaScript and CSS minimization and concatenation utility for MSBuild and (optionally) Microsoft MVC.

The Problem

JavaScript presents an interesting problem to many web developers. As a client-side scripting language, it is very forgiving about how it is used, and how it is included in the context of a larger web document. There is a sloppy way to serve JavaScript, and a professional way to serve JavaScript.

The "sloppy" method covers many possibilities: inline amongst the HTML, buried inside the href of an anchor, or heaped into the <head> in a stack of script-tags that comes up to your eyeballs. All of these methods work, but they also have issues.

The current best practice recommends a different approach. Instead of embedding JavaScript inline in the HTML, we should serve it as external files. Moreover, these files should be minimized to reduce page load time. If multiple files are necessary, we should concatenate them to reduce the number of HTTP requests required to load them.

However beneficial this is to the end-user, the act of minimizing and concatenating script files adds an additional burden to the developer. Fortunately, it is quite possible to automate this process, and numerous tools exist to facilitate this. MinCat is such a tool.

The Solution

MinCat exists to address exactly the problem described above. It is a small, simple tool that lets you control the minimization and concatenation of your JavaScript files. It is designed to integrate directly with MSBuild, and because Visual Studio uses MSBuild under the hood, it fits neatly into your existing Visual Studio web projects.

Its usage is simple and straightforward. By adding <Minimize> or <Concatenate> commands to your project file, you can control how your JavaScript is prepared from your source files. For instance, this command would minimize all of the JavaScript files in your "Scripts" folder, placing the resulting files in "Scripts\min":

<Minimize Input="Scripts\*.js" Outpath="Scripts\min\" />

For MVC projects, MinCat also provides an MVC Script helper extension. At a basic level, this extension facilitates switching between the minimized and development versions of a particular file. However, it really begins to shine when combined with the new "directives" that MinCat offers (more on those below).

@Html.Script(Url.Content("~/Scripts/MyScript.js"))

Since minimizing JavaScript is not always a quick operation, MinCat will only re-minimize files that have changed since the last time they were minimized. It accomplishes this by comparing the last modified date of each source file with that of its minimized counterpart. This helps keep the overall build time to a minimum.
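
MinCat itself runs inside MSBuild, but the up-to-date check it performs amounts to a simple timestamp comparison. Here is a minimal sketch of the idea in Node-style JavaScript, purely for illustration:

var fs = require("fs");

// Re-minimize only when the minimized copy is missing or is older
// than its source file.
function needsMinimizing(sourcePath, minifiedPath) {
    if (!fs.existsSync(minifiedPath)) {
        return true;
    }
    return fs.statSync(sourcePath).mtime > fs.statSync(minifiedPath).mtime;
}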

MinCat comes bundled with the excellent YUICompressor. However, it is designed so that the compressor is a modular component, and it could easily be reworked to use a different compression engine.

New JavaScript Directives

As if all that weren't enough, MinCat additionally provides support for a number of directives, designed to be placed in the actual source of individual JavaScript files. These provide further control over how each file is treated by the minimization process. Because these directives are fully supported by the (optional) MinCat MVC Script helper extension, they provide some exciting new ways to organize and structure your JavaScript.

/* @skip minimize */
This directive will cause the source file to never be minimized, even if it is otherwise included by a wildcard in the input path. This has no effect when used with the MVC helper extension.
/* @require "filename.js" */
This directive instructs the minimizer that the current file depends on code from another file. When this file is minimized, all required external files will be collected (recursively, so a required file can also require a file), sorted into the correct order, reduced to a distinct list, and concatenated in front of the file that specifies them. Therefore, the resulting minimized file will contain everything it needs to function. When used with the MVC Script helper extension, each required file will be loaded in its own <script /> tag, in the correct order, before the <script /> tag that contains your target file.
/* @include "filename.js" */
This directive causes the minimizer to include the complete contents of the specified file directly inline, wherever the directive is found. When used with the MVC Script helper extension, the loaded <script /> file will likewise include the specified content directly inline.

Note the differences between @require and @include. The @include directive is primarily exciting because it allows you to create assembly-like structures:

var Global = (function () {
    "use strict";
    var internal = {};
    function Global() {
        var instance = {};
        /* @include "Access to Internal, Instance, and Global Variables.js" */
    }
    /* @include "Access to Internal and Global Variables.js" */
    /* @require "Access to Global Variables Only.js" */
    return Global;
}());
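
Under the hood, resolving @require directives amounts to a depth-first walk over the dependency graph. The following is a conceptual sketch, not MinCat's actual code; readRequires stands in for a hypothetical helper that parses the @require comments out of a single file:

// Collect a file's dependencies depth-first, emitting each file once,
// with every dependency appearing before the files that require it.
function collectRequires(file, readRequires, seen, ordered) {
    seen = seen || {};
    ordered = ordered || [];
    if (seen[file]) {
        return ordered; // already collected; keeps the list distinct
    }
    seen[file] = true;
    readRequires(file).forEach(function (required) {
        collectRequires(required, readRequires, seen, ordered);
    });
    ordered.push(file); // a file always follows everything it requires
    return ordered;
}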

Learn More!

MinCat is open source software, released under a BSD License. It is available on BitBucket, and as a NuGet package.

Complete documentation is available on its wiki page.

As always, I am excited to learn how people are using my tools. If you try this out, please feel free to drop me a note with your impressions. And of course, always let me know if you find a bug!

Wednesday, July 20, 2011

JSLint, Licensing, and JSON Visualization

The JSLint license specifies that it should be used for Good, not Evil.

Suppose I construct some program that I believe to be Good, and use JSLint to improve the quality of that program. Further suppose that I release my code under a permissive license, such as a BSD license. Now suppose that somebody else takes my program and uses it for Evil.

In this case, the evil-doer has not actually violated their license agreement with me, since BSD licenses permit Evil. However, the evil-doer has now profited from the increased code quality gained from using JSLint.

Is the evil-doer now in violation of the JSLint license, despite never actually using JSLint themselves? Alternatively, perhaps I am now in violation of the JSLint license? Does the JSLint license require me to use the "Good" clause in my license as well, to prevent this scenario?

These are the questions that trouble me this morning. By the way, I've released the source to my JSON visualizer under a BSD license.

Wednesday, December 29, 2010

Notes on GetSchemaTable

I recently found myself using the SqlDataReader.GetSchemaTable method. I made some notes detailing what it returns for various data types in SQL Server 2005.

Column Name      DataType   ColumnSize   Precision   Scale   Notes
bigint           Int64      8            19          255
bit              Boolean    1            255         255
decimal          Decimal    17           18          0
int              Int32      4            10          255
money            Decimal    8            19          255
numeric          Decimal    17           18          0
smallint         Int16      2            5           255
smallmoney       Decimal    4            10          255
tinyint          Byte       1            3           255
float            Double     8            15          255
real             Single     4            7           255
datetime         DateTime   8            23          3
smalldatetime    DateTime   4            16          0
char             String     20           255         255
varchar          String     20           255         255
text             String     2147483647   255         255     IsLong
nchar            String     20           255         255
nvarchar         String     20           255         255
varchar(max)     String     2147483647   255         255     IsLong
nvarchar(max)    String     2147483647   255         255     IsLong
ntext            String     1073741823   255         255     IsLong
timestamp        Byte[]     8            255         255     IsRowVersion
xml              String     2147483647   255         255     IsLong

I created these notes through the simple method of creating a table with every possible data type. Or rather, every data type that I am interested in, which is very nearly the same thing (I mean, who uses varbinary, really?). I then used the GetSchemaTable method on a reader over my new table and inspected the results.

There is a page on MSDN that appears to contain this information, but upon closer inspection, it must be talking about something else entirely. For instance, it claims that the SQL Server data type "varchar" has no mapping whatsoever, not even as a System.String.

Certainly, this makes sense upon further reflection: it's possible to put any varchar value into a string, but not the other way around. Still, it's not terribly useful if you just want to know what sort of data type you can reasonably expect to get out of a given column.

I further note that the MSDN documentation states that the NumericPrecision column should be null for non-numeric data types, but this is simply not true.

Saturday, July 10, 2010

Conditional Operators and Bracket Notation

Did you know it is valid JavaScript to use conditional operators inside bracket notation to access object properties? For whatever reason, I have only just now realized this.

In other words, this is a perfectly valid fragment:

var obj = {
    valid:   [],
    invalid: []
};

items.forEach(function (item) {
    obj[item.isValid() ? "valid" : "invalid"].push(item);
});

The conditional operator, which takes the form condition ? ifTrue : ifFalse, is a shorthand, expression-level version of an if/else statement: it evaluates to one of its two branch values. In this case, it evaluates to the name of the object property that I wish to access.

I am yet undecided on whether this is more or less readable than the verbose version:

items.forEach(function (item) {
    if (item.isValid()) {
        obj.valid.push(item);
    } else {
        obj.invalid.push(item);
    }
});

Perhaps I will look back on this post in a few months and marvel at how obvious it all seems in hindsight. In the meantime, it is nice that I am still learning new things about JavaScript.

Friday, February 12, 2010

Wacky Code

This has got to be the most wacky code I have written all week:

range1.setEndPoint('StartToStart', range2);
range1.setEndPoint('StartToStart', range2);

Yes, the same line is there twice, and yes, it is supposed to be like that. The ranges in question are IE-only TextRange objects, which I am using to manage the position of the cursor within a rich text editor. The line has to be repeated because the first execution doesn't quite work, but the second one does:

<p>widget{cursor}</p> <!-- position after first line -->
<p>{cursor}gadget</p> <!-- requested position;
                           position after second setting -->

Wacky. Sadly, this is just one of many TextRange eccentricities that I've run into.

Tuesday, February 9, 2010

Spell Checking in HTML

I recently completed a spell checking module as part of a larger project for an HTML-based editor. Some of the challenges of the project were interesting; here are my experiences, the problems I encountered, and how I solved them.

The first step of spell checking a block of text is determining what the text actually is. For a simple text document, this is quite easy: it's no more than a string of characters. However, for an HTML document, it is not so simple. Markup introduces a lot of extra items to deal with: we want to check the text as it would appear to a human, not to a computer.

In other words, it is one thing to detect the word "html" in a string of words: "blah blah blah, html, blah blah blah," but quite another thing to detect the same word as it would appear to a human if embedded in a complex DOM structure: <p><b>h</b><i>t</i><em>m</em><strong>l</strong></p>.

One obvious way of solving this problem is to derive the textual value of a given parent element. This is quite easy: innerText works in IE, and textContent works in Mozilla. A human-readable version of the text of a DOM element can thus be obtained like so:

var parent = document.body,
    text = parent.textContent; // for Mozilla

if (text === undefined) {
    text = parent.innerText; // for IE
}

However, this leaves a more significant problem to deal with: once we determine which words are misspelled, how will we re-associate those words with the correct DOM structures? If "html" comes back as being incorrect, we will need to know which text nodes originally produced the word, for how else will we correct it? I have a suspicion that this could be solved with a clever application of TextRange objects in IE and Range objects in Mozilla, but that isn't the route I took (working with IE's TextRange is a headache all by itself anyway).

My basic strategy is to walk through the DOM and derive two simultaneous structures that represent each text node: one version contains the text as it would appear to a user, and the other version contains a map between the "human" version and the original DOM.

Rendering the HTML as text is not exactly straightforward either; one cannot simply concatenate the values of each individual text node. There are a number of HTML elements which introduce white space for humans, such as <p> or <div> or even some inline ones, like <br>. Similarly, there are some HTML elements which do not introduce white space, such as <b>, or <i>, or <em>. Other rules apply as well: if you have <b style="display: block;"> then you have a <b> that adds white space even when <b> normally wouldn't.

To resolve this, I spent some time empirically testing each suspicious HTML element to determine whether it would cause visible white space to be inserted inside a word. Using this list and a variety of other rules (floating nodes always make white space, display: block nodes always make white space, etc.), I construct a text string out of the DOM that more or less resembles the words as a human would see them. Notably, it is not necessary to obtain an exact representation of white space; I don't particularly need to render tables as tables, I just need to know that a </td><td> causes a break in a word but a </span><span> doesn't.
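
As a rough sketch of the kind of test involved (the tag list here is abbreviated; the real list came from the empirical testing described above):

// Does this element introduce visible white space between the text
// on either side of it?
function introducesWhiteSpace(element) {
    var breaking = { P: true, DIV: true, BR: true, LI: true, TD: true, TR: true },
        style = element.currentStyle || // IE
            document.defaultView.getComputedStyle(element, null); // Mozilla
    if (breaking[element.nodeName]) {
        return true;
    }
    // display: block and floated nodes always make white space.
    return style.display === "block" ||
        (style.styleFloat || style.cssFloat) !== "none";
}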

As I construct this string, I keep track of each text node that I encounter in the DOM, and meticulously record the starting and ending indices of that node's textual content in my generated string.

As an example, consider the following DOM fragment, with text nodes shown explicitly:

<p>
    <b>
        <textNode 1>h</textNode 1>
    </b>
    <i>
        <textNode 2>t</textNode 2>
    </i>
    <em>
        <textNode 3>m</textNode 3>
    </em>
    <strong>
        <textNode 4>l</textNode 4>
    </strong>
    <textNode 5> is complex</textNode 5>
</p>

After parsing the above, I would end up with a structure similar to this:

result = {
    html: [
        { node: <textNode 1>, start: 0, end: 1 },
        { node: <textNode 2>, start: 1, end: 2 },
        { node: <textNode 3>, start: 2, end: 3 },
        { node: <textNode 4>, start: 3, end: 4 },
        { node: <textNode 5>, start: 4, end: 15 }
    ],
    text: 'html is complex'
}

Once this process is complete, I split the result.text value into a list of unique individual words. Locating each word is easily accomplished through regular expressions, although special provisions must be made for hyphenated words and contractions. Unicode "smart quote" characters and their brethren are also normalized into regular ASCII quotes. This list is finally sent up to the server, where all the real work happens.
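
As an illustration of that extraction step (the real expressions differ, and this sketch assumes English-only text):

// Split human-readable text into a distinct list of words, allowing
// internal hyphens and apostrophes.
function extractWords(text) {
    var normalized = text.replace(/[\u2018\u2019]/g, "'"), // smart quotes
        matches = normalized.match(/[A-Za-z]+(?:['-][A-Za-z]+)*/g) || [],
        seen = {};
    return matches.filter(function (word) {
        var key = word.toLowerCase();
        if (seen[key]) {
            return false;
        }
        seen[key] = true;
        return true;
    });
}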

Given any list of words, there are several methods to perform spell checking upon it. For my purposes, I've implemented a dictionary-based approach, as this gives me a greater sense of confidence and considerably more control than statistical analysis. Locating misspelled words is therefore a trivial matter of determining whether each provided word is in the dictionary or not.

However, it is not enough to know which words are incorrect: I must also offer spelling suggestions. I accomplished this by implementing the Damerau-Levenshtein distance algorithm, which has been more fun than I have had in code in quite some time. While an in-depth explanation can be found on Wikipedia, put simply, the algorithm compares any two words and generates a numeric score indicating how similar they are.

Armed with this capability, I generate spelling suggestions by calculating the Damerau-Levenshtein distance between each misspelled word and each word in the dictionary. The algorithm is quite speedy, but as the dictionary is quite large, some optimizations are in order: for instance, only words of a comparable length to the original are tested.
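
For reference, here is a compact JavaScript rendition of the restricted ("optimal string alignment") variant of the algorithm; my production implementation is C# on SQL Server, so treat this strictly as a sketch:

// Edit distance between two words, counting insertions, deletions,
// substitutions, and transpositions of adjacent characters.
function editDistance(a, b) {
    var d = [], i, j, cost;
    for (i = 0; i <= a.length; i += 1) {
        d[i] = [i]; // deleting i characters from a
    }
    for (j = 0; j <= b.length; j += 1) {
        d[0][j] = j; // inserting j characters into an empty string
    }
    for (i = 1; i <= a.length; i += 1) {
        for (j = 1; j <= b.length; j += 1) {
            cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
            d[i][j] = Math.min(
                d[i - 1][j] + 1,        // deletion
                d[i][j - 1] + 1,        // insertion
                d[i - 1][j - 1] + cost  // substitution
            );
            if (i > 1 && j > 1 &&
                    a.charAt(i - 1) === b.charAt(j - 2) &&
                    a.charAt(i - 2) === b.charAt(j - 1)) {
                d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
            }
        }
    }
    return d[a.length][b.length];
}

With this, editDistance("kitten", "sitting") is 3, and the transposed "teh" scores just 1 against "the", where plain Levenshtein would charge 2.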

While I am given to understand that a relational database is not ideal for this, I have implemented this algorithm as a C# assembly on SQL Server 2005. Although I feel that additional optimizations are no doubt possible, I am already able to process comparatively large documents in under a second using this method.

As a result of all this, the server sends back a list of misspelled words. The JavaScript code takes these words and locates each one within the result.text string that it previously constructed. For each word's matching indices in result.text, it scans through the list of nodes in result.html until it finds the text node or nodes that made up that word.
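
That scan is just an interval-overlap test against the indices recorded earlier. A sketch, reusing the result structure from the example above:

// Find the entries in result.html whose text overlaps the word
// occupying [start, end) within result.text.
function nodesForWord(result, start, end) {
    return result.html.filter(function (entry) {
        return entry.start < end && entry.end > start;
    });
}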

Finally! The code now knows which words are misspelled AND which nodes correspond to those words. From here, it is a relatively simple matter of presenting this information to the user and letting them decide how to handle it. For the time being, I've chosen to provide a dialog box, similar to how MS Word's spellchecker operates. This gives the user familiar buttons for things like "Ignore, Ignore All, Add to Dictionary, etc." I am rather tempted to go back and implement a Google-style interface, wherein the incorrect words are highlighted and the user is given a context menu instead. However, there are many more features left to complete, and I have to move on at some point.

As soon as the user decides on a new spelling for a word, I alter each text node that composed the original word, updating it to contain the new values instead. Should a word be split across multiple text nodes, I fill in the letters one at a time, from left to right, letting the final node be the one to expand or contract if there is a difference in the number of letters.
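
A sketch of that replacement step, assuming entries holds the result.html records (in document order) that overlap the word at [wordStart, wordEnd):

function replaceAcrossNodes(entries, wordStart, wordEnd, correction) {
    var consumed = 0;
    entries.forEach(function (entry, index) {
        var text = entry.node.nodeValue,
            localStart = Math.max(wordStart - entry.start, 0),
            localEnd = Math.min(wordEnd, entry.end) - entry.start,
            isLast = index === entries.length - 1,
            // Every node but the last keeps its original share of letters;
            // the final node absorbs any difference in length.
            piece = isLast
                ? correction.slice(consumed)
                : correction.slice(consumed, consumed + (localEnd - localStart));
        entry.node.nodeValue =
            text.slice(0, localStart) + piece + text.slice(localEnd);
        consumed += piece.length;
    });
}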

And that's all there is to it!