Woozle Wuzzle
String Edit Distance

One commonly used approach for measuring similarity between strings is the string edit distance. The basic premise of the string edit distance is to find the minimum number of character edit operations (copy, replace, add, and remove) needed to transform one string into the other. For example, the edit distance between parks and spark is two: one edit is needed to add a leading s to parks and a second edit is needed to delete the trailing s. Not suprisingly, string edit distance is used in spell-checking applications.

It is convenient to look at string edit distance as a string alignment problem. The example below should provide some insight into this:

 String 1:     p a r k s 
 String 2:   s p a r k   
 Operation:  i c c c c d 

The operation is the character operation needed to align or transform string 1 into string 2. i is insert, c is copy, r is replace, and d is delete. (You may find other references using s (substitute) instead of r (replace).) The space at the beginning of string 1 or at the end of string 2 is called a gap.

For a given set of strings, there may be many ways to align the strings. For example, the strings abc123 and abcqwertabc123 have two obvious alignments:

 a b c                 1 2 3
 a b c q w e r t a b c 1 2 3

and

                 a b c 1 2 3
 a b c q w e r t a b c 1 2 3

The particular application would determine which alignment is preferred.

Different distance metrics are used to select the desired alignment. For example, the measurement used in the original example of the cost between parks and spark is the Levenshtein distance. In this metric the cost of replacing, adding or deleting a character is one whereas the cost of equal (or copied) characters is zero. (In the conversion of parks to spark, the cost of copying letters p a r k is zero.) The second alignment above (where abc123 has no gap) can be obtained using the Smith-Waterman distance.

For simple strings it is relatively easy to understand how to transform one string into another but with longer strings or very dissimilar strings it is not so clear. For example:

 See Spot run
 We work and play

There are brute force techniques that consider all possible alignments and find the one(s) with the minimum cost. I refer you to a few good references on string edit distance and the associated dynamic programming algorithms for more information.

URL Similarity

In order to make template extraction as palatable as possible I am going to start by walking through how one discovers the templates used to generate a set of URLs. I'm going to use Amazon.com as an example.

Search for "book" on Amazon.com. You should see a page with a number of results / data records. Placing your mouse over each of the titles shows many similar urls, some of which I have pasted below:

http://www.amazon.com/Harry-Potter-Half-Blood-Prince-Book/dp/0439785960/ref=pd_bbs_1?ie=UTF8&s=books&qid=1196364228&sr=8-1
http://www.amazon.com/Little-Green-Book-Getting-Your/dp/0131576070/ref=pd_bbs_2?ie=UTF8&s=books&qid=1196364228&sr=8-2
http://www.amazon.com/Learning-Curve-LC97727-Lamaze-Caterpillar/dp/B00009IMD8/ref=pd_bbs_3?ie=UTF8&s=baby-products&qid=1196364228&sr=8-3
http://www.amazon.com/Book-General-Ignorance-John-Mitchinson/dp/0307394913/ref=pd_bbs_sr_4?ie=UTF8&s=books&qid=1196364228&sr=8-4
http://www.amazon.com/The-Daring-Book-for-Girls/dp/B000UZJQNM/ref=sr_1_13?ie=UTF8&s=books&qid=1196364228&sr=8-13
http://www.amazon.com/Inconvenient-Book-Solutions-Biggest-Problems/dp/B000WJVLLG/ref=sr_1_16?ie=UTF8&s=books&qid=1196364228&sr=8-16

We can easily see a pattern (the template) in those urls:

http://www.amazon.com/<title>/dp/<book id>/ref=<ref id>?ie=UTF8&s=<product type>&qid=1196364228&sr=8-<index>

Our job is to find an algorithm that will allow us to automatically discover that template.

(I should mention that if one continues to other pages in the search results that there are more differences in the URLs that leads to a different pattern. In order to keep the discussion simple, I will only use the first page's URLs. This demonstrates that a single page, even one that contains multiple records, might not contain enough information to deduce a complete template.

I also should mention that the ability to label the fields in the template as I did in the example above will not be covered (in the near future) in this series of postings. I will only state that there are techniques available in the literature on web wrappers for labeling fields.)

We are going to take two approaches to extract templates from URLs. The first will use traditional string edit distance algorithms and the second will exploit the fact that there is underlying structure in the url (i.e. the scheme, authority, path, query, and fragment).

Template Extraction

The goal of template extraction is to discover the template(s) (if there are any) used to generate a page. There are two major categories of templates: those that are used within a single web page (such as the individual search results entries on a Google search result page) and those that are used across web pages (such as a story template used on CNN.com). In the former case, the goal is to be able to extract the template(s) from any page that has at least two similar data records on it. In the latter case, the goal is to be able to extract the template(s) from any two pages that have similar data records. As more similar records or pages are found we expect our precision to increase.

By visually looking at a rendered Google search results page, for example, a human can easily see similar, repeated sections. Although it is a bit more difficult, someone familiar with HTML can look at the source to that same Google page and find the repeated sections. Similarly, there are two general approaches to automatic template discovery -- one that relies on the rendered representation of the page and one that relies on the source of the page. A survey of the research literature does not show a distinct advantage of one technique over the other and in many cases the underlying algorithms are very similar. I am going to focus on the latter approach since that is where my experience lies.

Template Extraction and Web Wrapper Generation

With the recent movement towards mashups, the semantic web and market intelligence, there is a large need to get at the data and information that is stored in web pages. Data extraction startups are popping up like weeds (e.g. InfoSquire and QL2). Many of these startups focus on services where you specify what sites you want scraped and they provide you with the resulting data feed. The technologies that they use are primarily rules-based (e.g. regex). Rules-based systems are highly brittle given the dynamic nature of the web. Specifically, there is a high maintenance cost to maintaining and monitoring the rules to ensure that they are up to date with any changes made to the underlying web pages. The ability to automatically generate data extractors with high precision would be a vast improvement over a rules-based system.

Much research has gone into extracting data (either structured or unstructured) from a web page using web wrappers. A web wrapper is a tool for "converting information implicitly stored as an HTML document into information explicitly stored as a data-structure for further processing" [W4F]. One particular type of automatically generated web wrapper uses template extraction. Template extraction is the inverse of creating a web page from a template -- for a given web page, attempt to deduce the template that was used to generate the page. If a template can be generated for any (template derived) web page, then the data that populates that template can be easily extracted.

Over the next few weeks I am going to focus on machine learing and other automatic web wrapper technologies in a series of postings.

Amen
"Testing by itself does not improve software quality. Test results are an indicator of quality, but in and of themselves, they don't improve it. Trying to improve Software quality by increasing the amount of testing is like trying to lose weight by weighing yourself more often. What you eat before you step onto the scale determines how much you will weigh, and the software development techniques you use determine how many errors testing will find. If you want to lose weight, don't buy a new scale; change your diet. If you want to improve your software, don't test more; develop better."
Steve McConnell, Code Complete
Sold!

I just stumbled on Live Clipboard and I'm sold. The screencasts are quite enlightening (though for type-A people like myself very painful to sit through).

For interested parties, the discussion archives are located here rather than the broken link from the Live Clipboard site.

Dapper

I have been doing some research on the Enterprise 2.0 landscape and web mashup tools. I stumbled across Dapper and their demo videos. It would be an understatement to say that I was impressed with what I saw. I have no knowledge about the stability, performance, and actual usability of their product but I am certainly excited about what they're promising. Their blog has some information about new features and possible uses for those features (such as the Login Dapps). This company is obviously ripe for the buying and I hope that someone that has a history of realizing acquired companies brings this product to full maturity.

StreamCruncher 1.0, a lightweight Event Processing Kernel
StreamCruncher is an Event Processor. It supports a language based on SQL which allows you to define Event Processing constructs like Sliding Windows, Time Based Windows, Partitions and Aggregates. Queries can be written using this language, which are used to monitor streams of incoming Events. StreamCruncher is a multi-threaded Kernel that runs on Java™.

Check out StreamCruncher for more info.

C to MIPS to Java bytecode

I refer you to binkley's BLOG for information regarding translating C/C++ code to Java.

Java IAQ

I stumbled across the Java IAQ (Infrequently Answered Questions). It's a bit dated but there are still some interesting tidbits in there.

SQL Injection Inference

I never put too much thought into how one mines a database via SQL injection especially when a web page is designed for only a certain type of output. This paper has quite a bit of information about mining through inference. Much of the paper is directed at MS SQL Server but there is information about other databases as inference is a general attack.

JFace's CellEditors

If for some reason you can't get your CellEditor to show up on a JFace table then make sure that you have set the column properites (TableViewer.setColumnProperties(String[])). This is not entirely explicit especially in ICellModifier.

Java class vs. variable access

Here's an interesting question for all of you Java-ites out there. Given the following code:

class Test {
    public Test() {
        final Foo Bar = new Foo();
        Bar.go();
    }
}

class Foo {
    public static void go() { /*does something Foo-y*/ }
}

class Bar {
    public static void go() { /*does something Bar-y*/ }
}

which go() is called from Test's constructor?

The answer is that Foo's go() is called. So now the next question is, how does one call Bar's go()? The obvious answer is "fully qualify Bar" (i.e. specify it's package name). But what if Bar is in the default pacakge?

This is obviously a contrived example and most people would say "use cAmEl case for you variable names so that this is never a problem" and you're right, that's a perfectly cromulent approach. I actually ran into this in practice when working with JavaScript which is a whole different problem.

Java Performance Tuning

I ran across this before and forgot the link. So rather than forgetting again, here is a link about Java 1.5 and GC performance.

Java Performance Tuning has oodles of tuning tips on a variety of topics. There are also references to interesting articles (such as the Commando Pattern).

Can't find dependent libraries

If you've done any JNI work or have worked with a 3rd party library with .dlls on Windows then you may have run into the dreaded "Can't find dependent libraries" error. This is the extra credit problem that goes one step beyond java.library.path.

The root of the problem is that even with java.library.path set correctly, Windows will not look in anything other than its PATH for dependent libraries. This posting covers much of the problem, cause and solution. (I should point out that this is a Java problem not an Eclipse problem.) You might need to use something such as Dependency Walker to trace the set of DLL dependencies.

In case you were wondering, the dependency order for BDBXML 2.2.13 is:

System.loadLibrary("msvcp71");
System.loadLibrary("msvcr71");
System.loadLibrary("libdb43");
System.loadLibrary("libdb_java43");

System.loadLibrary("xerces-c_2_7");
System.loadLibrary("Pathan_7.1");
System.loadLibrary("libxquery12");
System.loadLibrary("libdbxml22");
System.loadLibrary("libdbxml_java22");

And the dependency order for BDBXML 2.3.10 is:

System.loadLibrary("msvcp71");
System.loadLibrary("msvcr71");
System.loadLibrary("libdb45");
System.loadLibrary("libdb_java45");

System.loadLibrary("xerces-c_2_7");
System.loadLibrary("Pathan_7.1");
System.loadLibrary("xqilla10");
System.loadLibrary("libdbxml23");
System.loadLibrary("libdbxml_java23");
Constructor exceptions

Here's a real noodle scratcher when you first encounter it. This makes a good interview question since it forces candidates to talk through the construction sequence. This isn't a question that I would necessarily expect someone to answer correctly but I would expect someone to be able to talk through the various cases and the effects of those cases.

/**
 * <p>Tests that refernces can be made to an member even though its constructor 
 * fails.</p>
 */
public class ConstructorExceptionTest extends TestCase
{
    /**
     * <p>Tests that refernces can be made to an member even though its 
     * constructor fails.</p>
     */
    public void testException()
    {
        final ArrayList<ConstructorException> objects = new ArrayList<ConstructorException>();
        ConstructorException failedObject = null;  
        try
        {
            failedObject = new ConstructorException(objects);
        } catch(final Exception e)
        {
            // ignore since it is required to occur
        }
        assertNull("The newly created member should be null.", failedObject);
        assertTrue("The list should not be empty.", !objects.isEmpty());

        final ConstructorException listObject = objects.get(0);
        assertNotNull("References to the object still exist.", listObject);
        assertNull("The object's member should be null.", listObject.member);
    }

    /**
     * <p>An member that throws an exception on construction to ascertain what
     * ooccurs to references to it that are made before the construction 
     * completes.</p>
     */
    private class ConstructorException
    {
        /**
         * <p>Some member that is "created" after the constructor should have
         * failed.</p>
         */
        final Object member;

        /**
         * <p>Adds itself to the specified list of objects before it throws an
         * exception.</p>
         * 
         * @param  objects a list of objects to which this member adds itself
         * @throws Exception always
         */
        public ConstructorException(final List<ConstructorException> objects)
            throws Exception
        {
            objects.add(this);
            if(true)
                throw new Exception();
            /* else -- cannot occur */

            member = new Object();
        }
    }
}

I should point out to all of you doubting Thomas' out there that this test runs green. I should also point out that the if(true) is a trivial replacement for anything that can go wrong during construction.

Logging and throwing exceptions

I was getting tired of remembering to have to log my exceptions and I wanted something to do it for me. I remembered something about generics and exceptions. This is what I came up with:

/**
 * <p>A convenience method for 
 * {@link org.apache.log4j.Logger#error(java.lang.Object, java.lang.Exception) logging}
 * and throwing the specified {@link java.lang.Exception}.</p>
 *
 * <p>Example usage:</p>
 * 
 * <pre>
     private void someMethod()
         throws SomeException
     {
         ...
         LogException.logThrow(log, new SomeException("This will be logged and thrown."));
         ...
     }
 * <pre>
 * 
 * @param  log the <code>Logger</code> to which the exception is logged.
 * @param  exception the <code>Exception</code> to be logged and <code>throw</code>n
 * @throws E the specified exception
 */
public static <E extends Exception> void logThrow(final Logger log, 
                                                  final E exception)
    throws E
{
    log.error(exception, exception);
    throw exception;
}

The only obvious problem with this technique is that the compiler doesn't know the method never returns normally. So in a case such as:

public int returnInt()
    throws SomeException
{
    final int value;
    try
    {
        value = ....
    } catch(final OtherException e)
    {
        LogException.logThrow(log, e);
    }
    return value;
}

the compiler will complain that value may not have been initialized. Oh well. I can dream can't I?

BDBXML and document modifications

There are a number of problems with updating XML documents in an XML store. The primary problem is that XQuery 1.0 does not have a facility for updating documents. The XQuery Update Facility Requirements exists but it clearly states "the WG does not intend to produce a Recommendation from this Working Draft" which leaves me with a big fat question mark over my head. This has caused each XML DB vendor to provide their own update mechanism. And this leads to the topic of this post:

BDBXML is a decent XML DB but it's somewhat rough around the edges. When updating documents it is common to get the following error:

Cannot perform a modification on an XmlValue that isn't either Node or Document type, errcode = INVALID_VALUE

To make a long story short, you cannot specify a doc('...') in the XmlQueryExpression that you specify to any of the XmlModify modification methods. The documentation implies this but does not drive the point home. Given that all non-modification uses of XmlQueryExpression require some sort of "navigation function" (collection('...'), doc('...'), etc) it feels odd not to specify it for modfications.

More on XSD, PSVI and non-native attributes

If you have been following along with the trials and tribulations of XSD, PSVI and non-native attributes then you have been left with wondering about case where you have a non-native attribute and no <xsd:annotation>. For example:

<xsd:element name="something" myNS:name="Something" ...>
  <xsd:simpleType>
    ..
  </xsd:simpleType>
</xsd:element>

Since there is no <xsd:annotation> one might expect that there is no XSAnnotation object. This is exactly what occurs. So then, how does one access the non-native attribute? After a brief session of splunking through Xerces I stumbled across the notion of a "synthetic annotation". A quick hop over to Google and one quicly finds out about the generate synthetic annotations feature which purports to "[generate a synthetic annotation] when a schema component has non-schema attributes but no child annotation".

So we're done, right? Well, no. The page from which the "generate synthetic annotations" is taken is actually for the SAX parser which is not what we use. A quick search of the intersection between XSModel, XSLoader (which is used to parse the XSD into the XSModel) and "feature" reveals nothing. Broadening the search to include all of the synonyms of "feature" (such as "parameter", "attribute" and "property") finally reveals a hit. XSLoader exposes a DOMConfiguration which allows one to set "parameters". Listing all parameter names (via getParameterNames()) shows the sought after "generate-synthetic-annotations". Whew!

To round this out, the synthetic annotation appears as:

<xsd:annotation myNS:name="Something" ...>
  <xsd:documentation>SYNTHETIC_ANNOTATION</xsd:documentation>
</xsd:annotation>

and the setup code to get an XSModel from an XSD is:

System.setProperty(DOMImplementationRegistry.PROPERTY, 
                   "org.apache.xerces.dom.DOMXSImplementationSourceImpl");
final DOMImplementationRegistry registry = 
        DOMImplementationRegistry.newInstance();
final XSImplementation xsImpl = 
        (XSImplementation)registry.getDOMImplementation("XS-Loader");
final XSLoader schemaLoader = 
        xsImpl.createXSLoader(null/*all XML Schema Versions*/);

// NOTE:  synthetic annotation nodes MUST be created for non-native 
//        attributes to be parsed and added to the XSAnnotation object
//        (for cases where there is no )
final DOMConfiguration config = schemaLoader.getConfig();
    config.setParameter("http://apache.org/xml/features/generate-synthetic-annotations", 
                        true);

final XSModel xsModel = schemaLoader.loadURI(xsdURI.toString());
The problem with XML Schema

Let's just clear the air up front: I like XML Schema. It's convient. It's solves about 90% of all XML validation concerns. It's concise.

I have been doing work lately that leverages the validation provided by XSD by supplimenting it with JavaScript and Java. The problem with XSD is that it's not just simple XML. Sure, it's written in XML, but that's not the point. You can't just rip through an XSD with XPath an extract out the information that you want. This becomes obvious when you think about the structure that XSD represents: there's inheritance and references and all kinds of things that go "Bump" in the night. So the primary way that you can get at the guts behind what XSD is providing is via the Post-Schema-Validation Infoset (PSVI). But there's a problem: it appears that there is no way to access non-native (i.e. non-xsd) attributes. It appears that if you have extended XSD in any way then there's no way to access this information. Why allow it to be specified if one cannot access it?

Problem solved

I have been banging my head looking for a way to access non-native attributes of an XSD via PSVI. The problem I kept hitting was that the XSD API defines a seemingly limited XSAnnotation. There is only a annotationString method for retrieving the annotation. Taken literally (which is what I was doing) this will return the information within the <xsd:annotation> and that's it.

I started looking for alternative techniques. I found some interesting information regarding XML Beans, SchemaAnnotation and non-native attributes. And this got me to thinking. I started to do some code splunking in Xerces and found that the member that backs getAnnotationString() looks like:

// the content of the annotation node, including all children, along
// with any non-schema attributes from its parent
private String fData = null;

Well that certainly does not match the PSVI description of "A text representation of the annotation". The key is the "along with any non-schema attributes from its parent". If your XSD looks like:

<xsd:element name="something" myNS:name="Something">
  <xsd:annotation>
    <xsd:appinfo>Stuff</xsd:appinfo>
  <xsd:annotation>
  ...
</xsd:element>

then getAnnotationString() will return:

<xsd:annotation myNS:name="Something" ... >
  <xsd:appinfo>Stuff</xsd:appinfo>
<xsd:annotation>

No, really.

Now those of you that are careful readers are likely sitting there wondering how this is even possible since it's inconsistent. You're wondering what happens when your XSD looks like:

<xsd:element name="something" myNS:name="Something">
  <xsd:annotation myNS:name="Something else">
    <xsd:appinfo>Stuff</xsd:appinfo>
  <xsd:annotation>
  ...
</xsd:element>

Well, I'm sure you've already guessed the answer:

<xsd:annotation myNS:name="Something else" ... >
  <xsd:appinfo>Stuff</xsd:appinfo>
<xsd:annotation>

Yup, the attribute from the <xsd:element> is "overwritten" and lost. And people wonder why I get bitter about these things. Always follow the rule of thumb: anytime that you do something "crafty" (such as updating the annotation to include the "annotations" (non-native attributes) from the parent element) you're shooting yourself in the foot.

Why doesn't XSObject have a XSObjectList getNonNativeAttributes() where the XSObjectList contains perhaps XSNonNativeAttribute objects, I don't know. But that would have certainly saved me a few hours.

This is continued in More on XSD, PSVI and non-native attributes.

NamespaceContext and XPath

I am using Xerces to parse an XML Schema annotation. I stumbled across two intesting situations when dealing with Xerces PSVI:

  1. The contents of an annotation can only be retrieved as a string. It would have been nice to have access to a Node object instead.
  2. When using XPath, one must provide their own NamespaceContext object when using namespaces. Why a trivial implementation that was backed with a Map of strings was not provided I cannot guess. (This isn't specific to PSVI but this is the first time that I'm using Xerces rather than dom4j which provides SimpleNamespaceContext via jaxen.) Stefan Podkowinski has felt my pain. The O'Reilly Network has code for an implementation.

As a side note to the NamespaceContext: if you have a default namespace in the XML that you are parsing then you must have a blank namespace (not null but "") registered with the NamespaceContext.

BDBXML and default namespaces

If you're looking to do some simple XQuery work, Berkeley DB XML (BDBXML) seems to be a good answer.

The XML that I was loading into the DB has a default namespace. For example:

<?xml version="1.0" encoding="UTF-8"?>
<item xmlns="http://example.org/item"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://example.org/item item.xsd">
    <size>...</size>
    ...
<item>

Unfortunately BDBXML does not allow one to declare the default namespace:

xmlQueryContext.setNamespace("", "http://example.org/item");

You must bind a prefix to the namespace and use that in any XQuery / XPath:

[code]
xmlQueryContext.setNamespace("i", "http://example.org/item");

[query]
/i:item/i:size
Know that you don't know

I have often said in the past:

You have to know what you don't know.

Often my test of a good manager or developer is to query to find out if they know what they don't know. If they have no idea -- if they can't shape out or draw a box around what they don't know -- then that's not a good thing in my book.

I was talking to a colleague of mine, Greg Della-Croce, and he said:

You have to first know that you don't know.

He's absolutely right.

Developers Needed!

I am working with a small start-up that is focusing on a business-centric approach to information security. We need proven Java developers to design and build the core systems of the product.

If you have the qualities listed below (or know of someone that possesses them), send me an email.

  • Agressive and driven
  • Lives for the challenge
  • Does what it takes to get it done but understands what is needed to support it
  • Works best in very small teams
  • Understands what a "team of one working in a group of peers" entails
  • Self lead, self directed and self motivated
  • Soup-to-nuts mentality and has the experience to back it up
  • Has a clear definition of success
  • Has something to prove and needs a place to do it
  • Not looking for a "job"
  • Experience writing COTS products
Searching Code: Jungloid Mining

I was recently looking for an effective and programmatic way to search through an API for method invocations when I stumbled on Prospector. After a little more searching I found this paper Jungloid Mining: Helping to Navigate the API Jungle and the main page for Prospector. While it's not what I was originally looking for, it is an interesting technology. For my personal experience, finding how to convert from one class to another is a common task and requires much time. I hope to see this and its children find its way into my IDE.

Bindmark

I recently stumbled on Bindmark which is purported to be "a comparison of the existing open-source and commercial (when available for free evaluation download) libraries for binding XML data to Java classes."

Does anyone know of a similar comparison for JTS implementations?

Make your apps Sparkle

Microsoft introduced Sparkle last week at the PDC. There is an excellent video covering its capabilities over at Channel9.

For reasons that I don't completely understand, the industry is limiting its vision and only seeing this as a "Flash killer" or "MS Flash". I see this is a necessary and much needed combination of two traditionally disparate development steps: design of a user interface by a graphical designer and laying out that interface by a developer. I can't tell you how much time I have spent switching back and forth between Photoshop creating a particular interface look and then a visual interface designer (or even just directly in code *shudder*) implementing the design wondering if there was a better way. There are few times when you can point to a paradigm shift -- this is one of those times.

I am anxiously anticipating the companies that embrace this paradigm and move it to the next level!

Development Metrics

I am commonly asked "What metrics do you use to measure developer productivity?". There are a of common metrics that people use such as lines of code written (*shudder*) to number of unit tests written. Both of these suffer from the same root cause: they are measuring something artificial. The number of lines of code that I produce is completely artificial. I can have a style that separates declaring a variable from setting the default value. This makes me "twice" as productive as someone that sets the value in the declaration (the latter of which may be more maintainable code). The number of unit tests that I write is also completely artificial. I can write seven unit tests for a form with seven fields or I can write a single unit test that tests all seven fields at once. (I'm obviously using contrived examples to elucidate a point.)

So what is a non-artificial unit that one can measure for developer productivity? David J. Anderson calls it "customer valued functionality". If your organization is already up to speed on an agile development methodology (FDD obviously being the best fit) then you are likely already defining units in terms of something that the customer values. You measure the rate at which your developers produce and the customer accepts this customer valued functionality.

(This is an interesting article on metrics and agility.)

This notion of "non-artificial" is especially poignant for me in the recent months. I have been spending my time working on an approach to enterprise security that integrates decision support with a business process centric view. Currently most security decisions are made based on particular threats and vulnerabilities along with best practices. What's interesting about this is that measuring a company's risk from the standpoint of threats and vulnerabilities is akin to measuring developer productivity by measuring the number of lines of code that they produce. Instead of measuring something "artificial", risk should be understood from the standpoint of "how will this impact my business". From an IT person's view the answer is always death and destruction and the business owners just react to that. But the real question for the business owners should be: what is your tolerance to this particular "bad thing" and what impact will it have on your business? I'll be talking about this more in the future.

New Testing Forum Available

Over the past few months I have been swamped with questions concerning testing and quality assurance. I never believed that a blog was a good mechanism for asking and responding to questions so I have made a new forum available.

Software Engineering

Currently there is only a forum for QA and Testing. As the need arises or if people ask, I will create new forums for software engineering topics.

A huge thank you goes out to Igor who set up the forum and does all of my system administration. Thanks Igor!

I stumbled on when I was doing some unrelated research on π-calculus. Cω home page is located here. There are some interesting ideas in there. When I have more time I'll look into it more.

More on Code Comments

I have written many times before about code comments. I recently read Mike Clark's article titled Write Sweet-Smelling Comments. Just as before, I disagree with a goodly chunk of what he writes. (If you want to read some excellent tips on code comments, I recommend Code Complete, Second Edition.)

Since I could ramble on for hours about what I don't like about Mr. Clark's article, I'll limit it to just his "eye-popping comments". He got the "eye-popping" part right since my eyes bugged out of my head and I nearly puked on my keyboard.

You may be tempted to remove the read lock here, but don't because...

This can be "refactored" simply to be "This read lock ..." where suitable "what"'s and (most importantly) "why"'s are stated.

The following code needs to be refactored, and I'm embarrassed
to call it my own.  However, we must ship this code now!  I promise
to clean it up before anyone starts thinking smelly code is acceptable.

For the love of all that is good and pure in this world, don't write editorials in code. To be more succinct as to why this is so downright bad:

  • Who is "I", "we", and "my"? Writing comments in the first person is just plain wrong. After the code is checked in and is pertubated for a while, there is no effective way to know who "I", "me", and "my" are! So don't use them. If there's a reason to include someone's name in a comment then base it on the author field or use initials or something so that if someone reads the comment in one week or one year, they know what's going on.
  • It's one thing to state that "The following code needs to be refactored" and it's quite another to state why the code needs to be refactored. All of the time wasted editorializing could have been more fruitfully spent quickly bulleting out the "smelly" points and possible resolution steps. The scariest thing to see in code is a comment that says "Refactor!" followed by perfectly sane looking code that works. Why should the code be refactored? Has it been duplicated? Are there bad practices in use?
  • When is "now"? The moment after the code is checked in, relative times (just like using the first person) lose their context.
  • Don't editorialize in comments. We have blogs now. Save it for a nice blog entry.
  • In case I missed mentioning it earlier, don't editorialize in comments.

So that I don't appear to be a "Negative Nelly", I will say that Mr. Clark's "Use of a published algorithm" is an excellent comment that I wish that more developers would use. I'll broaden that statement further to say that I wish that more developers would research the algorithms that they use. I can't even count any more the number of times that I've come across home grown algorithms that have left me scratching my heading questioning if the developer didn't know there are fairly standard ways of implementing a piece of functionality or they were just winging it.

So as not to ramble incessantly about bad comments I will close this entry by kindly reminding developers that the code is really all that they have to show for in their life's work (Yes, yes. The finished product. We'll talk about that some other time.) Take some pride and show some professionalism when you write code and its associated comments.

Use a Rubric to choose a tool

"What tool should we use?"

How many times have we hear those words? When a group of developers is presented with that question most will scurry off and start searching the web for information. A chunk of those developers will eventually enter a spin cycle and will never report back a result. The rest of them will come back with an answer and say "Here's what we should use and why."

Move forward about a year.

  • The project was successful. Everyone's happy. The team moves on to the next project.
  • The project was unsuccessful (regardless of whether or not it was the "fault" of the tool or not).
  • Some key developers leave the project. New developers are brought on who weren't part of the original decision process.

Now the questions are: were those original "why's" valid to the problem at hand and how do they affect projects moving forward?

In most environments the answers are easy: "we don't know". Most teams don't record the decisions made and they don't have a way to measure success at any level (save, perhaps, for the project itself).

Enter the rubric.

When the "What tool should we use?" question is first asked do this: spend an hour or so with the group and define the major business and technical requirements as they are viewed at the time and write up a rough rubric -- a set of criteria and a scale for determining relative importance of the criteria. Agile teams should be able to plow through this requirements definition by using one of the standard planning methods.

The rubric is used to ensure that each tool is evaluated in a consistent manner. It ensures that the group is all pointed in the same direction. You may have to change your rubric as you go along to meet new concerns. (Make sure that all tool candidates are rescored against the new rubric.) Once a decision is made you can be assured that the reasons for choosing that tool are known and understood.

Will a rubric ensure that you never make a bad decision? Certainly not. So why do it? Let's examine a case where a rubric is not used:

A tool is chosen by standard means (i.e. someone just picked it). The project fails. During the lessons learned meeting the following is heard:

  Manager:  "Why did we choose this tool?"
  Developer:  "I don't know.  Joe chose it."
  Manager:  "Where's Joe?"
  Developer:  "He left the project a few months back."

In many environments the blame will be placed solely on Joe. Without some way to compare against what was orginally expected of the tool, how can the success of the tool be rated? More importantly, how can you be sure that the same mistakes wont be made on the next project?

  Developer:  "We're not going to use that same tool that we used on the previous project!"
  Manager:  "Why?"
  Developer:  "Don't you remember?  That project crashed and burned!"
  Manager:  "You're right!"

It's easy to confuse the cause-and-effect reasons for the failure of a project. It may be that the tool was the reason for the project's failure or it may be that the tool was the reason for the project's success. But the root cause of the project's success or failure may be unknown since the tools were chosen in an ad hoc manner.

Had a rubric been used in the above case, the team would be able to go back and look at the requirements for the tool and how those requirements were weighted. Let's say that the project failed because the tool had a number of critical bugs that were not patched in a timely manner. Looking back at the rubric one could check to see how the product was rated in the categories of "time between releases", "relative stability and maturity", "number of outstanding bugs", etc. If none of those qualities were investigated then for the next project the team knows that they need to look at those and rate them appropriately.

Through the proper use of a rubric, the consistency and quality of projects can be systematically improved.

SWT OpenGL Binding

As a follow on to the article entitled Using OpenGL with SWT I would like to mention that I have a SWT OpenGL binding available at:

http://www.realityinteractive.com/software/oss/index.html#SWTOGL

Happy SWT and OpenGL Coding!

Java's Nested Classes

Most everyone knows that a nested class can access private memebers of its enclosing class. But did you know that the reverse is true? The enclosing class can access private members of the nested class. No, really.

For those doubting Thomas' out there, I'll provide you with a reference.

Does anyone know the motivation behind why the enclosing class would be able to access private members of the nested class?

Archetypes, Color, and the Domain-Neutral Component

My parents were always apt to say "you learn something new each day". For the most part they've been right from the little things (turn the door knob before trying to go through the door) to the big things (when a sign says "Do not feed the bears," man, you better not feed the bears).

Today's new thing was stumbling on the paper Archetypes, Color, and the Domain-Neutral Component. As I've often said: when you sit down to solve a differential equation you don't just try to figure it out, you use the set of known patterns to guide you to a solution. I believe that the same is true with software architecture and development.

Here are some complimentary links on color modeling and the DNC:

Emergence

I had the privilege of seeing a lecture by Dr. Robert Laughlin (1998 Nobel Prize winner in Physics) at the Adler Planetarium. "[Dr. Laughlin argued] that the true frontier of physics may not lie in studying tiny individual particles, but rather by studying the properties that emerge when large collections of particles are taken together as a whole." This notion of "emergence" flies in the face of centuries of reductionism. (This is also an interesting reductionism link.)

Dr. Laughlin demonstrated that some (all?) sufficiently complex systems show emergent or collective behaviours -- behaviors that are greater than the sum of their parts. What's facinating about these behaviors is that they do not follow reduction-based models. Take for example a rigid aluminimum bar. What causes its rigidity? If you reduce the problem down to a nano scale, say, a single layer of aluminimum atoms, you find that the system is no longer rigid; it actually demonstrates fluid behaviors. As you increase the number of atoms the system begins to display more and more of the rigid behaviors. There is no single point at which, like a light switch, rigidity can be on or off.

Another interesting example is classical (Newtonian) mechanics. As you attempt to reduce the problem of mechanics down father and father you enter the realm of quantum mechanics where probabilities rule. There is no point at which the transition from one model to the other takes place. Classical mechanics emerges from quantum mechanics when a large set of particles is observed. If there is a 100M ton freight train coming straight at you, you don't use the probabilites of quantum mechanics to compute whether or not the train is going to hit you -- you're going to get hit!

When listening to this lecture my mind began to spin and the wheels began to turn. Coming from a physics background I am naturally a reductionist -- I tend to believe that every problem can be reduced down to first principles. This belief has bled over into my software engineering work. I began to wonder: what if the significantly complex interactions of software in modern systems begin to display emergent behavior? How could we possibly begin to systematically analyze and understand the nature of this behavior?

It seems that in general modern software engineering has a reductionist approach. Most agile methodologies, for example, advocate unit testing. Unit tests are nothing more than a firm belief in and practice of reductionism. Even without emergent behaviors most software engineers know that the transition from unit testing to integration and system testing is non-trivial. Simply understanding the interaction of various software components and understanding the failure modes that are introduced is a hard and not well understood problem.

So then what if non-reducible behaviors do occur? Are there going to be cases where we throw our propositional logic and lambda calculus out the window in favor of representations that model the emergent behavior? Interesting times are ahead!

Negative Testing

I have been bit more than once when developing an application where all of my tests ran green yet the application still failed. This situation is inherent in basic Code Unit Test First methodology when you gloss over the details. I have met many developers that will recite "Write a unit test for the desired new capability. Write the code for the functionality. Run the test and if it succeeds then you're done." but will miss the very important "run the test first before the functionality code is written and make sure that it fails". (Never Writea Line Of Code Withouta Failing Test)

But ensuring that the test fails without the desired functionality does not ensure that the test will fail when the desired functionality is broken in some other manner. Let's look at an example:

When a particular piece of functionality is missing an exception is thrown. A unit test is written to explicitly check for this thrown exception and fail when it is found. When the new functionality is added the exception is not thrown and the test succeeds.

Is this particular unit test meaningful or useful? If a common failure mode is known to be that the functionality is not present (say, that the functionality is loaded dynamically at runtime) then the test may be deemed useful. If the functionality's presence can be determined by other means such as compilation errors, then the test is not particularly useful.

What is important to note in this case is that other failure modes of the functionality must be exploited. Enter negative testing. Negative testing (based on the methodology that you follow) is "showing an error when not supposed to and not showing an error when supposed to" or "showing that software will fail and that the failure is handled in a specified manner". Rather than having a lengthly posting about negative testing, the paper entitled A Positive View of Negative Testing is a great read. Understanding the subtleties of positive and negative testing is important to ensure that your unit tests are providing the sturdy safety net that you're relying on.

Regression Tests

How many times have we heard:

Unit tests provide a sort of safety net which "[allows] you to refactor at any time without fear of breaking existing code, so you can constantly improve the design of your program"

But those of you that are familiar with testing know that this is nothing but a regression test:

A regression test is "a testing technique consisting of the repetition of a test after the work product under test has been iterated. Regression testing is used to identify any defec