Woozle Wuzzle
String Edit Distance

One commonly used approach for measuring similarity between strings is the string edit distance. The basic premise of the string edit distance is to find the minimum number of character edit operations (copy, replace, add, and remove) needed to transform one string into the other. For example, the edit distance between parks and spark is two: one edit is needed to add a leading s to parks and a second edit is needed to delete the trailing s. Not surprisingly, string edit distance is used in spell-checking applications.

It is convenient to look at string edit distance as a string alignment problem. The example below should provide some insight into this:

 String 1:     p a r k s 
 String 2:   s p a r k   
 Operation:  i c c c c d 

The operation is the character operation needed to align or transform string 1 into string 2. i is insert, c is copy, r is replace, and d is delete. (You may find other references using s (substitute) instead of r (replace).) The space at the beginning of string 1 or at the end of string 2 is called a gap.
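The operation row above can be derived mechanically from two aligned strings. Here is a minimal sketch, assuming that gaps are written as '-' (my own convention for this example, not something the alignment literature mandates) and that both aligned strings have the same length:

```java
final class AlignmentOps {
    /** Gap marker used in the aligned strings (an assumption of this sketch). */
    static final char GAP = '-';

    /**
     * Returns one operation letter per aligned column, describing how to
     * transform {@code from} into {@code to}: i (insert), d (delete),
     * c (copy) or r (replace).
     */
    static String operations(final String from, final String to) {
        if (from.length() != to.length()) {
            throw new IllegalArgumentException("aligned strings must have equal length");
        }
        final StringBuilder ops = new StringBuilder();
        for (int i = 0; i < from.length(); i++) {
            final char a = from.charAt(i);
            final char b = to.charAt(i);
            if (a == GAP && b == GAP) {
                throw new IllegalArgumentException("a column cannot be a gap in both strings");
            } else if (a == GAP) {
                ops.append('i'); // nothing in string 1: the character is inserted
            } else if (b == GAP) {
                ops.append('d'); // nothing in string 2: the character is deleted
            } else if (a == b) {
                ops.append('c'); // same character: copied
            } else {
                ops.append('r'); // different characters: replaced
            }
        }
        return ops.toString();
    }

    public static void main(final String[] args) {
        System.out.println(operations("-parks", "spark-")); // prints "iccccd"
    }
}
```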

For a given set of strings, there may be many ways to align the strings. For example, the strings abc123 and abcqwertabc123 have two obvious alignments:

 a b c                 1 2 3
 a b c q w e r t a b c 1 2 3

and

                 a b c 1 2 3
 a b c q w e r t a b c 1 2 3

The particular application would determine which alignment is preferred.

Different distance metrics are used to select the desired alignment. For example, the measurement used in the original example of the cost between parks and spark is the Levenshtein distance. In this metric the cost of replacing, adding or deleting a character is one whereas the cost of equal (or copied) characters is zero. (In the conversion of parks to spark, the cost of copying letters p a r k is zero.) The second alignment above (where abc123 has no gap) can be obtained using the Smith-Waterman distance.

For simple strings it is relatively easy to understand how to transform one string into another but with longer strings or very dissimilar strings it is not so clear. For example:

 See Spot run
 We work and play

There are brute force techniques that consider all possible alignments and find the one(s) with the minimum cost. I refer you to a few good references on string edit distance and the associated dynamic programming algorithms for more information.
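For the Levenshtein metric described earlier (replace, add and delete cost one; copy costs zero), the dynamic programming approach can be sketched as follows. This is a minimal illustration that only computes the distance, not the alignment itself; the class and method names are my own.

```java
final class Levenshtein {
    /**
     * Computes the Levenshtein distance between two strings using the
     * classic dynamic programming recurrence, keeping only two rows of
     * the cost table in memory.
     */
    static int distance(final String source, final String target) {
        int[] previous = new int[target.length() + 1];
        int[] current = new int[target.length() + 1];

        // Transforming the empty prefix of source into a prefix of target
        // takes one insertion per character.
        for (int j = 0; j <= target.length(); j++) {
            previous[j] = j;
        }

        for (int i = 1; i <= source.length(); i++) {
            // Transforming a prefix of source into the empty string takes
            // one deletion per character.
            current[0] = i;
            for (int j = 1; j <= target.length(); j++) {
                final int copyOrReplace = previous[j - 1]
                        + (source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1);
                final int delete = previous[j] + 1;
                final int insert = current[j - 1] + 1;
                current[j] = Math.min(copyOrReplace, Math.min(delete, insert));
            }
            final int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[target.length()];
    }

    public static void main(final String[] args) {
        System.out.println(distance("parks", "spark")); // prints 2
    }
}
```

The table has one cell per pair of prefixes, so the running time is proportional to the product of the two string lengths rather than to the number of possible alignments.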

URL Similarity

In order to make template extraction as palatable as possible I am going to start by walking through how one discovers the templates used to generate a set of URLs. I'm going to use Amazon.com as an example.

Search for "book" on Amazon.com. You should see a page with a number of results / data records. Placing your mouse over each of the titles shows many similar URLs, some of which I have pasted below:

http://www.amazon.com/Harry-Potter-Half-Blood-Prince-Book/dp/0439785960/ref=pd_bbs_1?ie=UTF8&s=books&qid=1196364228&sr=8-1
http://www.amazon.com/Little-Green-Book-Getting-Your/dp/0131576070/ref=pd_bbs_2?ie=UTF8&s=books&qid=1196364228&sr=8-2
http://www.amazon.com/Learning-Curve-LC97727-Lamaze-Caterpillar/dp/B00009IMD8/ref=pd_bbs_3?ie=UTF8&s=baby-products&qid=1196364228&sr=8-3
http://www.amazon.com/Book-General-Ignorance-John-Mitchinson/dp/0307394913/ref=pd_bbs_sr_4?ie=UTF8&s=books&qid=1196364228&sr=8-4
http://www.amazon.com/The-Daring-Book-for-Girls/dp/B000UZJQNM/ref=sr_1_13?ie=UTF8&s=books&qid=1196364228&sr=8-13
http://www.amazon.com/Inconvenient-Book-Solutions-Biggest-Problems/dp/B000WJVLLG/ref=sr_1_16?ie=UTF8&s=books&qid=1196364228&sr=8-16

We can easily see a pattern (the template) in those URLs:

http://www.amazon.com/<title>/dp/<book id>/ref=<ref id>?ie=UTF8&s=<product type>&qid=1196364228&sr=8-<index>

Our job is to find an algorithm that will allow us to automatically discover that template.

(I should mention that if one continues to other pages in the search results, there are more differences in the URLs, which lead to a different pattern. In order to keep the discussion simple, I will only use the first page's URLs. This demonstrates that a single page, even one that contains multiple records, might not contain enough information to deduce a complete template.

I also should mention that the ability to label the fields in the template as I did in the example above will not be covered (in the near future) in this series of postings. I will only state that there are techniques available in the literature on web wrappers for labeling fields.)

We are going to take two approaches to extract templates from URLs. The first will use traditional string edit distance algorithms and the second will exploit the fact that there is underlying structure in the URL (i.e. the scheme, authority, path, query, and fragment).
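To give a feel for the second approach, here is a rough sketch that uses java.net.URI to pull a URL apart and then compares the path segments of two URLs position by position. The "<*>" wildcard marker and the requirement that both paths have the same number of segments are simplifying assumptions of this example; it is not the algorithm that this series will arrive at.

```java
import java.net.URI;

final class UrlTemplateSketch {
    /** Placeholder for a segment that differs between the URLs (hypothetical marker). */
    static final String WILDCARD = "<*>";

    /**
     * Builds a naive path template from two URLs: segments that are equal
     * are kept and segments that differ are replaced by the wildcard.
     */
    static String pathTemplate(final String first, final String second) {
        final String[] a = URI.create(first).getPath().split("/");
        final String[] b = URI.create(second).getPath().split("/");
        if (a.length != b.length) {
            throw new IllegalArgumentException("paths have different segment counts");
        }
        final StringBuilder template = new StringBuilder();
        for (int i = 0; i < a.length; i++) {
            if (i > 0) {
                template.append('/');
            }
            template.append(a[i].equals(b[i]) ? a[i] : WILDCARD);
        }
        return template.toString();
    }

    public static void main(final String[] args) {
        final String first = "http://www.amazon.com/Harry-Potter-Half-Blood-Prince-Book"
                + "/dp/0439785960/ref=pd_bbs_1?ie=UTF8&s=books&qid=1196364228&sr=8-1";
        final String second = "http://www.amazon.com/Little-Green-Book-Getting-Your"
                + "/dp/0131576070/ref=pd_bbs_2?ie=UTF8&s=books&qid=1196364228&sr=8-2";

        // The structural pieces that java.net.URI exposes.
        final URI uri = URI.create(first);
        System.out.println("scheme:    " + uri.getScheme());
        System.out.println("authority: " + uri.getAuthority());
        System.out.println("path:      " + uri.getPath());
        System.out.println("query:     " + uri.getQuery());
        System.out.println("fragment:  " + uri.getFragment());

        System.out.println(pathTemplate(first, second)); // prints "/<*>/dp/<*>/<*>"
    }
}
```

Note that whole-segment comparison is too coarse to recover the "ref=<ref id>" part of the template; that is where character-level techniques such as edit distance come back into play.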

Template Extraction

The goal of template extraction is to discover the template(s) (if there are any) used to generate a page. There are two major categories of templates: those that are used within a single web page (such as the individual search results entries on a Google search result page) and those that are used across web pages (such as a story template used on CNN.com). In the former case, the goal is to be able to extract the template(s) from any page that has at least two similar data records on it. In the latter case, the goal is to be able to extract the template(s) from any two pages that have similar data records. As more similar records or pages are found we expect our precision to increase.

By visually looking at a rendered Google search results page, for example, a human can easily see similar, repeated sections. Although it is a bit more difficult, someone familiar with HTML can look at the source to that same Google page and find the repeated sections. Similarly, there are two general approaches to automatic template discovery -- one that relies on the rendered representation of the page and one that relies on the source of the page. A survey of the research literature does not show a distinct advantage of one technique over the other and in many cases the underlying algorithms are very similar. I am going to focus on the latter approach since that is where my experience lies.

Template Extraction and Web Wrapper Generation

With the recent movement towards mashups, the semantic web and market intelligence, there is a large need to get at the data and information that is stored in web pages. Data extraction startups are popping up like weeds (e.g. InfoSquire and QL2). Many of these startups focus on services where you specify what sites you want scraped and they provide you with the resulting data feed. The technologies that they use are primarily rules-based (e.g. regex). Rules-based systems are highly brittle given the dynamic nature of the web. Specifically, there is a high cost to maintaining and monitoring the rules to ensure that they are up to date with any changes made to the underlying web pages. The ability to automatically generate data extractors with high precision would be a vast improvement over a rules-based system.

Much research has gone into extracting data (either structured or unstructured) from a web page using web wrappers. A web wrapper is a tool for "converting information implicitly stored as an HTML document into information explicitly stored as a data-structure for further processing" [W4F]. One particular type of automatically generated web wrapper uses template extraction. Template extraction is the inverse of creating a web page from a template -- for a given web page, attempt to deduce the template that was used to generate the page. If a template can be generated for any (template derived) web page, then the data that populates that template can be easily extracted.

Over the next few weeks I am going to focus on machine learning and other automatic web wrapper technologies in a series of postings.

Amen
"Testing by itself does not improve software quality. Test results are an indicator of quality, but in and of themselves, they don't improve it. Trying to improve Software quality by increasing the amount of testing is like trying to lose weight by weighing yourself more often. What you eat before you step onto the scale determines how much you will weigh, and the software development techniques you use determine how many errors testing will find. If you want to lose weight, don't buy a new scale; change your diet. If you want to improve your software, don't test more; develop better."
Steve McConnell, Code Complete
Sold!

I just stumbled on Live Clipboard and I'm sold. The screencasts are quite enlightening (though for type-A people like myself very painful to sit through).

For interested parties, the discussion archives are located here rather than the broken link from the Live Clipboard site.

Dapper

I have been doing some research on the Enterprise 2.0 landscape and web mashup tools. I stumbled across Dapper and their demo videos. It would be an understatement to say that I was impressed with what I saw. I have no knowledge about the stability, performance, and actual usability of their product but I am certainly excited about what they're promising. Their blog has some information about new features and possible uses for those features (such as the Login Dapps). This company is obviously ripe for the buying and I hope that someone that has a history of realizing acquired companies brings this product to full maturity.

StreamCruncher 1.0, a lightweight Event Processing Kernel
StreamCruncher is an Event Processor. It supports a language based on SQL which allows you to define Event Processing constructs like Sliding Windows, Time Based Windows, Partitions and Aggregates. Queries can be written using this language, which are used to monitor streams of incoming Events. StreamCruncher is a multi-threaded Kernel that runs on Java™.

Check out StreamCruncher for more info.

C to MIPS to Java bytecode

I refer you to binkley's BLOG for information regarding translating C/C++ code to Java.

Java IAQ

I stumbled across the Java IAQ (Infrequently Answered Questions). It's a bit dated but there are still some interesting tidbits in there.

SQL Injection Inference

I never put too much thought into how one mines a database via SQL injection especially when a web page is designed for only a certain type of output. This paper has quite a bit of information about mining through inference. Much of the paper is directed at MS SQL Server but there is information about other databases as inference is a general attack.

JFace's CellEditors

If for some reason you can't get your CellEditor to show up on a JFace table then make sure that you have set the column properties (TableViewer.setColumnProperties(String[])). This is not entirely explicit, especially in ICellModifier.

Java class vs. variable access

Here's an interesting question for all of you Java-ites out there. Given the following code:

class Test {
    public Test() {
        final Foo Bar = new Foo();
        Bar.go();
    }
}

class Foo {
    public static void go() { /*does something Foo-y*/ }
}

class Bar {
    public static void go() { /*does something Bar-y*/ }
}

which go() is called from Test's constructor?

The answer is that Foo's go() is called. So now the next question is, how does one call Bar's go()? The obvious answer is "fully qualify Bar" (i.e. specify its package name). But what if Bar is in the default package?

This is obviously a contrived example and most people would say "use cAmEl case for your variable names so that this is never a problem" and you're right, that's a perfectly cromulent approach. I actually ran into this in practice when working with JavaScript which is a whole different problem.
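The rule at work is what the Java Language Specification calls obscuring: when a simple name could be either a variable or a type, the variable wins. Here is a small runnable sketch of it. Note that I have made Foo and Bar nested classes (my own change so that the example fits in one file), which also gives Bar a qualified name to fall back on. It therefore does not answer the default package question above.

```java
final class ObscuringDemo {
    static final class Foo {
        static String go() { return "Foo"; }
    }

    static final class Bar {
        static String go() { return "Bar"; }
    }

    /** The local variable named Bar obscures the class Bar. */
    static String viaVariable() {
        final Foo Bar = new Foo();
        // Bar is resolved as the variable, so this is the static Foo.go().
        return Bar.go();
    }

    /** Qualifying the name forces it to be resolved as the type. */
    static String viaQualifiedName() {
        final Foo Bar = new Foo();
        return ObscuringDemo.Bar.go();
    }

    public static void main(final String[] args) {
        System.out.println(viaVariable());      // prints "Foo"
        System.out.println(viaQualifiedName()); // prints "Bar"
    }
}
```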

Java Performance Tuning

I ran across this before and forgot the link. So rather than forgetting again, here is a link about Java 1.5 and GC performance.

Java Performance Tuning has oodles of tuning tips on a variety of topics. There are also references to interesting articles (such as the Commando Pattern).

Can't find dependent libraries

If you've done any JNI work or have worked with a 3rd party library with .dlls on Windows then you may have run into the dreaded "Can't find dependent libraries" error. This is the extra credit problem that goes one step beyond java.library.path.

The root of the problem is that even with java.library.path set correctly, Windows will not look in anything other than its PATH for dependent libraries. This posting covers much of the problem, cause and solution. (I should point out that this is a Java problem not an Eclipse problem.) You might need to use something such as Dependency Walker to trace the set of DLL dependencies.

In case you were wondering, the dependency order for BDBXML 2.2.13 is:

System.loadLibrary("msvcp71");
System.loadLibrary("msvcr71");
System.loadLibrary("libdb43");
System.loadLibrary("libdb_java43");

System.loadLibrary("xerces-c_2_7");
System.loadLibrary("Pathan_7.1");
System.loadLibrary("libxquery12");
System.loadLibrary("libdbxml22");
System.loadLibrary("libdbxml_java22");

And the dependency order for BDBXML 2.3.10 is:

System.loadLibrary("msvcp71");
System.loadLibrary("msvcr71");
System.loadLibrary("libdb45");
System.loadLibrary("libdb_java45");

System.loadLibrary("xerces-c_2_7");
System.loadLibrary("Pathan_7.1");
System.loadLibrary("xqilla10");
System.loadLibrary("libdbxml23");
System.loadLibrary("libdbxml_java23");

Constructor exceptions

Here's a real noodle scratcher when you first encounter it. This makes a good interview question since it forces candidates to talk through the construction sequence. This isn't a question that I would necessarily expect someone to answer correctly but I would expect someone to be able to talk through the various cases and the effects of those cases.

import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

/**
 * <p>Tests that references can be made to a member even though its constructor 
 * fails.</p>
 */
public class ConstructorExceptionTest extends TestCase
{
    /**
     * <p>Tests that references can be made to a member even though its 
     * constructor fails.</p>
     */
    public void testException()
    {
        final ArrayList<ConstructorException> objects = new ArrayList<ConstructorException>();
        ConstructorException failedObject = null;  
        try
        {
            failedObject = new ConstructorException(objects);
        } catch(final Exception e)
        {
            // ignore since it is required to occur
        }
        assertNull("The newly created member should be null.", failedObject);
        assertTrue("The list should not be empty.", !objects.isEmpty());

        final ConstructorException listObject = objects.get(0);
        assertNotNull("References to the object still exist.", listObject);
        assertNull("The object's member should be null.", listObject.member);
    }

    /**
     * <p>A member that throws an exception on construction to ascertain what
     * occurs to references to it that are made before the construction 
     * completes.</p>
     */
    private class ConstructorException
    {
        /**
         * <p>Some member that is "created" after the constructor should have
         * failed.</p>
         */
        final Object member;

        /**
         * <p>Adds itself to the specified list of objects before it throws an
         * exception.</p>
         * 
         * @param  objects a list of objects to which this member adds itself
         * @throws Exception always
         */
        public ConstructorException(final List<ConstructorException> objects)
            throws Exception
        {
            objects.add(this);
            if(true)
                throw new Exception();
            /* else -- cannot occur */

            member = new Object();
        }
    }
}

I should point out to all of you doubting Thomases out there that this test runs green. I should also point out that the if(true) is a trivial replacement for anything that can go wrong during construction.

Logging and throwing exceptions

I was getting tired of remembering to have to log my exceptions and I wanted something to do it for me. I remembered something about generics and exceptions. This is what I came up with:

/**
 * <p>A convenience method for 
 * {@link org.apache.log4j.Logger#error(java.lang.Object, java.lang.Throwable) logging}
 * and throwing the specified {@link java.lang.Exception}.</p>
 *
 * <p>Example usage:</p>
 * 
 * <pre>
     private void someMethod()
         throws SomeException
     {
         ...
         LogException.logThrow(log, new SomeException("This will be logged and thrown."));
         ...
     }
 * </pre>
 * 
 * @param  log the <code>Logger</code> to which the exception is logged.
 * @param  exception the <code>Exception</code> to be logged and <code>throw</code>n
 * @throws E the specified exception
 */
public static <E extends Exception> void logThrow(final Logger log, 
                                                  final E exception)
    throws E
{
    log.error(exception, exception);
    throw exception;
}

The only obvious problem with this technique is that the compiler doesn't know the method never returns normally. So in a case such as:

public int returnInt()
    throws SomeException
{
    final int value;
    try
    {
        value = ....
    } catch(final OtherException e)
    {
        LogException.logThrow(log, e);
    }
    return value;
}

the compiler will complain that value may not have been initialized. Oh well. I can dream can't I?
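One way around this (a variation on the method above, not what I originally wrote) is to have the helper return the exception instead of throwing it, and to put the throw at the call site. The compiler can then see that the catch block never completes normally. The sketch below uses java.util.logging rather than log4j purely so that it is self-contained; the names logged and parse are made up for the example.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class LogAndThrow {
    private static final Logger LOG = Logger.getLogger(LogAndThrow.class.getName());

    /** Logs the specified exception and hands it back so the caller can throw it. */
    static <E extends Exception> E logged(final Logger log, final E exception) {
        log.log(Level.SEVERE, exception.getMessage(), exception);
        return exception;
    }

    static int parse(final String text) {
        final int value;
        try {
            value = Integer.parseInt(text);
        } catch (final NumberFormatException e) {
            // The explicit throw tells the compiler that control never
            // continues past this block.
            throw logged(LOG, new IllegalArgumentException("not a number: " + text, e));
        }
        return value; // compiles: value is definitely assigned here
    }

    public static void main(final String[] args) {
        System.out.println(parse("42")); // prints 42
    }
}
```

The cost is that one has to remember to write the throw, which is exactly the kind of remembering that I was trying to avoid.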

BDBXML and document modifications

There are a number of problems with updating XML documents in an XML store. The primary problem is that XQuery 1.0 does not have a facility for updating documents. The XQuery Update Facility Requirements exists but it clearly states "the WG does not intend to produce a Recommendation from this Working Draft" which leaves me with a big fat question mark over my head. This has caused each XML DB vendor to provide their own update mechanism. And this leads to the topic of this post:

BDBXML is a decent XML DB but it's somewhat rough around the edges. When updating documents it is common to get the following error:

Cannot perform a modification on an XmlValue that isn't either Node or Document type, errcode = INVALID_VALUE

To make a long story short, you cannot specify a doc('...') in the XmlQueryExpression that you specify to any of the XmlModify modification methods. The documentation implies this but does not drive the point home. Given that all non-modification uses of XmlQueryExpression require some sort of "navigation function" (collection('...'), doc('...'), etc) it feels odd not to specify it for modifications.

More on XSD, PSVI and non-native attributes

If you have been following along with the trials and tribulations of XSD, PSVI and non-native attributes then you have been left wondering about the case where you have a non-native attribute and no <xsd:annotation>. For example:

<xsd:element name="something" myNS:name="Something" ...>
  <xsd:simpleType>
    ..
  </xsd:simpleType>
</xsd:element>

Since there is no <xsd:annotation> one might expect that there is no XSAnnotation object. This is exactly what occurs. So then, how does one access the non-native attribute? After a brief session of spelunking through Xerces I stumbled across the notion of a "synthetic annotation". A quick hop over to Google and one quickly finds out about the "generate synthetic annotations" feature which purports to "[generate a synthetic annotation] when a schema component has non-schema attributes but no child annotation".

So we're done, right? Well, no. The page from which the "generate synthetic annotations" is taken is actually for the SAX parser which is not what we use. A quick search of the intersection between XSModel, XSLoader (which is used to parse the XSD into the XSModel) and "feature" reveals nothing. Broadening the search to include all of the synonyms of "feature" (such as "parameter", "attribute" and "property") finally reveals a hit. XSLoader exposes a DOMConfiguration which allows one to set "parameters". Listing all parameter names (via getParameterNames()) shows the sought after "generate-synthetic-annotations". Whew!

To round this out, the synthetic annotation appears as:

<xsd:annotation myNS:name="Something" ...>
  <xsd:documentation>SYNTHETIC_ANNOTATION</xsd:documentation>
</xsd:annotation>

and the setup code to get an XSModel from an XSD is:

System.setProperty(DOMImplementationRegistry.PROPERTY, 
                   "org.apache.xerces.dom.DOMXSImplementationSourceImpl");
final DOMImplementationRegistry registry = 
        DOMImplementationRegistry.newInstance();
final XSImplementation xsImpl = 
        (XSImplementation)registry.getDOMImplementation("XS-Loader");
final XSLoader schemaLoader = 
        xsImpl.createXSLoader(null/*all XML Schema Versions*/);

// NOTE:  synthetic annotation nodes MUST be created for non-native 
//        attributes to be parsed and added to the XSAnnotation object
//        (for cases where there is no <xsd:annotation>)
final DOMConfiguration config = schemaLoader.getConfig();
config.setParameter("http://apache.org/xml/features/generate-synthetic-annotations", 
                    true);

final XSModel xsModel = schemaLoader.loadURI(xsdURI.toString());

The problem with XML Schema

Let's just clear the air up front: I like XML Schema. It's convenient. It solves about 90% of all XML validation concerns. It's concise.

I have been doing work lately that leverages the validation provided by XSD by supplementing it with JavaScript and Java. The problem with XSD is that it's not just simple XML. Sure, it's written in XML, but that's not the point. You can't just rip through an XSD with XPath and extract the information that you want. This becomes obvious when you think about the structure that XSD represents: there's inheritance and references and all kinds of things that go "Bump" in the night. So the primary way that you can get at the guts behind what XSD is providing is via the Post-Schema-Validation Infoset (PSVI). But there's a problem: it appears that there is no way to access non-native (i.e. non-xsd) attributes. It appears that if you have extended XSD in any way then there's no way to access this information. Why allow it to be specified if one cannot access it?

Problem solved

I have been banging my head looking for a way to access non-native attributes of an XSD via PSVI. The problem I kept hitting was that the XSD API defines a seemingly limited XSAnnotation. There is only a getAnnotationString() method for retrieving the annotation. Taken literally (which is what I was doing) this will return the information within the <xsd:annotation> and that's it.

I started looking for alternative techniques. I found some interesting information regarding XML Beans, SchemaAnnotation and non-native attributes. And this got me to thinking. I started to do some code spelunking in Xerces and found that the member that backs getAnnotationString() looks like:

// the content of the annotation node, including all children, along
// with any non-schema attributes from its parent
private String fData = null;

Well that certainly does not match the PSVI description of "A text representation of the annotation". The key is the "along with any non-schema attributes from its parent". If your XSD looks like:

<xsd:element name="something" myNS:name="Something">
  <xsd:annotation>
    <xsd:appinfo>Stuff</xsd:appinfo>
  </xsd:annotation>
  ...
</xsd:element>

then getAnnotationString() will return:

<xsd:annotation myNS:name="Something" ... >
  <xsd:appinfo>Stuff</xsd:appinfo>
</xsd:annotation>

No, really.

Now those of you that are careful readers are likely sitting there wondering how this is even possible since it's inconsistent. You're wondering what happens when your XSD looks like:

<xsd:element name="something" myNS:name="Something">
  <xsd:annotation myNS:name="Something else">
    <xsd:appinfo>Stuff</xsd:appinfo>
  </xsd:annotation>
  ...
</xsd:element>

Well, I'm sure you've already guessed the answer:

<xsd:annotation myNS:name="Something else" ... >
  <xsd:appinfo>Stuff</xsd:appinfo>
</xsd:annotation>

Yup, the attribute from the <xsd:element> is "overwritten" and lost. And people wonder why I get bitter about these things. Always follow the rule of thumb: anytime that you do something "crafty" (such as updating the annotation to include the "annotations" (non-native attributes) from the parent element) you're shooting yourself in the foot.

Why XSObject doesn't have an XSObjectList getNonNativeAttributes(), where the XSObjectList contains perhaps XSNonNativeAttribute objects, I don't know. But that would have certainly saved me a few hours.

This is continued in More on XSD, PSVI and non-native attributes.

NamespaceContext and XPath

I am using Xerces to parse an XML Schema annotation. I stumbled across two interesting situations when dealing with Xerces PSVI:

  1. The contents of an annotation can only be retrieved as a string. It would have been nice to have access to a Node object instead.
  2. When using XPath, one must provide their own NamespaceContext object when using namespaces. Why a trivial implementation that was backed with a Map of strings was not provided I cannot guess. (This isn't specific to PSVI but this is the first time that I'm using Xerces rather than dom4j which provides SimpleNamespaceContext via jaxen.) Stefan Podkowinski has felt my pain. The O'Reilly Network has code for an implementation.

As a side note to the NamespaceContext: if you have a default namespace in the XML that you are parsing then you must have a blank namespace (not null but "") registered with the NamespaceContext.
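For reference, a trivial Map-backed NamespaceContext along the lines described in the list above might look like the sketch below. The class name MapNamespaceContext and the bind and evaluate helpers are made up for this example. It also skips the handling of the reserved xml and xmlns prefixes that the NamespaceContext contract asks for, so treat it as a starting point only. The example binds an explicit prefix (i) rather than relying on a blank prefix.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class MapNamespaceContext implements NamespaceContext {
    private final Map<String, String> uriByPrefix = new HashMap<String, String>();

    /** Registers a prefix and returns this context so that calls can be chained. */
    MapNamespaceContext bind(final String prefix, final String uri) {
        uriByPrefix.put(prefix, uri);
        return this;
    }

    public String getNamespaceURI(final String prefix) {
        if (prefix == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        final String uri = uriByPrefix.get(prefix);
        return uri != null ? uri : XMLConstants.NULL_NS_URI;
    }

    public String getPrefix(final String uri) {
        final Iterator<String> prefixes = getPrefixes(uri);
        return prefixes.hasNext() ? prefixes.next() : null;
    }

    public Iterator<String> getPrefixes(final String uri) {
        final List<String> matches = new ArrayList<String>();
        for (final Map.Entry<String, String> entry : uriByPrefix.entrySet()) {
            if (entry.getValue().equals(uri)) {
                matches.add(entry.getKey());
            }
        }
        return matches.iterator();
    }

    /** Parses the XML with namespace awareness and evaluates the XPath as a string. */
    static String evaluate(final String xml, final String expression,
                           final NamespaceContext context) {
        try {
            final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            final Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            final XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(context);
            return xpath.evaluate(expression, document);
        } catch (final Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(final String[] args) {
        final String xml = "<item xmlns=\"http://example.org/item\"><size>42</size></item>";
        final NamespaceContext context =
                new MapNamespaceContext().bind("i", "http://example.org/item");
        System.out.println(evaluate(xml, "/i:item/i:size", context)); // prints 42
    }
}
```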

BDBXML and default namespaces

If you're looking to do some simple XQuery work, Berkeley DB XML (BDBXML) seems to be a good answer.

The XML that I was loading into the DB has a default namespace. For example:

<?xml version="1.0" encoding="UTF-8"?>
<item xmlns="http://example.org/item"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://example.org/item item.xsd">
    <size>...</size>
    ...
</item>

Unfortunately BDBXML does not allow one to declare the default namespace:

xmlQueryContext.setNamespace("", "http://example.org/item");

You must bind a prefix to the namespace and use that in any XQuery / XPath:

[code]
xmlQueryContext.setNamespace("i", "http://example.org/item");

[query]
/i:item/i:size

Know that you don't know

I have often said in the past:

You have to know what you don't know.

Often my test of a good manager or developer is to query to find out if they know what they don't know. If they have no idea -- if they can't shape out or draw a box around what they don't know -- then that's not a good thing in my book.

I was talking to a colleague of mine, Greg Della-Croce, and he said:

You have to first know that you don't know.

He's absolutely right.

Developers Needed!

I am working with a small start-up that is focusing on a business-centric approach to information security. We need proven Java developers to design and build the core systems of the product.

If you have the qualities listed below (or know of someone that possesses them), send me an email.

  • Aggressive and driven
  • Lives for the challenge
  • Does what it takes to get it done but understands what is needed to support it
  • Works best in very small teams
  • Understands what a "team of one working in a group of peers" entails
  • Self-led, self-directed and self-motivated
  • Soup-to-nuts mentality and has the experience to back it up
  • Has a clear definition of success
  • Has something to prove and needs a place to do it
  • Not looking for a "job"
  • Experience writing COTS products

Searching Code: Jungloid Mining

I was recently looking for an effective and programmatic way to search through an API for method invocations when I stumbled on Prospector. After a little more searching I found the paper Jungloid Mining: Helping to Navigate the API Jungle and the main page for Prospector. While it's not what I was originally looking for, it is an interesting technology. In my personal experience, finding how to convert from one class to another is a common task and requires much time. I hope to see this and its children find their way into my IDE.

Bindmark

I recently stumbled on Bindmark which is purported to be "a comparison of the existing open-source and commercial (when available for free evaluation download) libraries for binding XML data to Java classes."

Does anyone know of a similar comparison for JTS implementations?

Make your apps Sparkle

Microsoft introduced Sparkle last week at the PDC. There is an excellent video covering its capabilities over at Channel9.

For reasons that I don't completely understand, the industry is limiting its vision and only seeing this as a "Flash killer" or "MS Flash". I see this as a necessary and much needed combination of two traditionally disparate development steps: design of a user interface by a graphical designer and laying out that interface by a developer. I can't tell you how much time I have spent switching back and forth between Photoshop creating a particular interface look and then a visual interface designer (or even just directly in code *shudder*) implementing the design wondering if there was a better way. There are few times when you can point to a paradigm shift -- this is one of those times.

I am anxiously anticipating the companies that embrace this paradigm and move it to the next level!

Development Metrics

I am commonly asked "What metrics do you use to measure developer productivity?". There are a number of common metrics that people use, ranging from lines of code written (*shudder*) to number of unit tests written. Both of these suffer from the same root cause: they are measuring something artificial. The number of lines of code that I produce is completely artificial. I can have a style that separates declaring a variable from setting the default value. This makes me "twice" as productive as someone that sets the value in the declaration (the latter of which may be more maintainable code). The number of unit tests that I write is also completely artificial. I can write seven unit tests for a form with seven fields or I can write a single unit test that tests all seven fields at once. (I'm obviously using contrived examples to elucidate a point.)

So what is a non-artificial unit that one can measure for developer productivity? David J. Anderson calls it "customer valued functionality". If your organization is already up to speed on an agile development methodology (FDD obviously being the best fit) then you are likely already defining units in terms of something that the customer values. You measure the rate at which your developers produce and the customer accepts this customer valued functionality.

(This is an interesting article on metrics and agility.)

This notion of "non-artificial" is especially poignant for me in the recent months. I have been spending my time working on an approach to enterprise security that integrates decision support with a business process centric view. Currently most security decisions are made based on particular threats and vulnerabilities along with best practices. What's interesting about this is that measuring a company's risk from the standpoint of threats and vulnerabilities is akin to measuring developer productivity by measuring the number of lines of code that they produce. Instead of measuring something "artificial", risk should be understood from the standpoint of "how will this impact my business". From an IT person's view the answer is always death and destruction and the business owners just react to that. But the real question for the business owners should be: what is your tolerance to this particular "bad thing" and what impact will it have on your business? I'll be talking about this more in the future.

New Testing Forum Available

Over the past few months I have been swamped with questions concerning testing and quality assurance. I never believed that a blog was a good mechanism for asking and responding to questions so I have made a new forum available.

Software Engineering

Currently there is only a forum for QA and Testing. As the need arises or if people ask, I will create new forums for software engineering topics.

A huge thank you goes out to Igor who set up the forum and does all of my system administration. Thanks Igor!

I stumbled on Cω when I was doing some unrelated research on π-calculus. The Cω home page is located here. There are some interesting ideas in there. When I have more time I'll look into it more.

More on Code Comments

I have written many times before about code comments. I recently read Mike Clark's article titled Write Sweet-Smelling Comments. Just as before, I disagree with a goodly chunk of what he writes. (If you want to read some excellent tips on code comments, I recommend Code Complete, Second Edition.)

Since I could ramble on for hours about what I don't like about Mr. Clark's article, I'll limit it to just his "eye-popping comments". He got the "eye-popping" part right since my eyes bugged out of my head and I nearly puked on my keyboard.

You may be tempted to remove the read lock here, but don't because...

This can be "refactored" simply to be "This read lock ..." where suitable "what"'s and (most importantly) "why"'s are stated.

The following code needs to be refactored, and I'm embarrassed
to call it my own.  However, we must ship this code now!  I promise
to clean it up before anyone starts thinking smelly code is acceptable.

For the love of all that is good and pure in this world, don't write editorials in code. To be more succinct as to why this is so downright bad:

  • Who is "I", "we", and "my"? Writing comments in the first person is just plain wrong. After the code is checked in and has been perturbed for a while, there is no effective way to know who "I", "we", and "my" are! So don't use them. If there's a reason to include someone's name in a comment then base it on the author field or use initials or something so that if someone reads the comment in one week or one year, they know what's going on.
  • It's one thing to state that "The following code needs to be refactored" and it's quite another to state why the code needs to be refactored. All of the time wasted editorializing could have been more fruitfully spent quickly bulleting out the "smelly" points and possible resolution steps. The scariest thing to see in code is a comment that says "Refactor!" followed by perfectly sane looking code that works. Why should the code be refactored? Has it been duplicated? Are there bad practices in use?
  • When is "now"? The moment after the code is checked in, relative times (just like using the first person) lose their context.
  • Don't editorialize in comments. We have blogs now. Save it for a nice blog entry.
  • In case I missed mentioning it earlier, don't editorialize in comments.
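
As a rough illustration of the points above, here is a sketch of a read-lock comment that states the "what" and the "why" with no first person and no relative time. The class, its fields, and the scenario are hypothetical; they are not taken from Mr. Clark's article.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical example class; the names and the scenario are illustrative only.
class PriceTable {
    private final Map<String, Integer> pricesInCents = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    void put(String sku, int cents) {
        lock.writeLock().lock();
        try {
            pricesInCents.put(sku, cents);
        } finally {
            lock.writeLock().unlock();
        }
    }

    int total(String firstSku, String secondSku) {
        // This read lock keeps the two lookups consistent with each other.
        // Without it, put() could run between the two reads and the total
        // would mix an old price with a new one.
        lock.readLock().lock();
        try {
            return pricesInCents.getOrDefault(firstSku, 0)
                 + pricesInCents.getOrDefault(secondSku, 0);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The comment names the lock, says what it protects, and says what breaks if it is removed. Nobody has to guess who wrote it or when.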

So that I don't appear to be a "Negative Nelly", I will say that Mr. Clark's "Use of a published algorithm" is an excellent comment that I wish that more developers would use. I'll broaden that statement further to say that I wish that more developers would research the algorithms that they use. I can't even count any more the number of times that I've come across home grown algorithms that have left me scratching my head, wondering whether the developer didn't know that there are fairly standard ways of implementing a piece of functionality or was just winging it.

So as not to ramble incessantly about bad comments I will close this entry by kindly reminding developers that the code is really all that they have to show for their life's work. (Yes, yes. The finished product. We'll talk about that some other time.) Take some pride and show some professionalism when you write code and its associated comments.

Use a Rubric to choose a tool

"What tool should we use?"

How many times have we heard those words? When a group of developers is presented with that question most will scurry off and start searching the web for information. A chunk of those developers will eventually enter a spin cycle and will never report back a result. The rest of them will come back with an answer and say "Here's what we should use and why."

Move forward about a year.

  • The project was successful. Everyone's happy. The team moves on to the next project.
  • The project was unsuccessful (regardless of whether or not it was the "fault" of the tool).
  • Some key developers leave the project. New developers are brought on who weren't part of the original decision process.

Now the questions are: were those original "why's" valid to the problem at hand and how do they affect projects moving forward?

In most environments the answers are easy: "we don't know". Most teams don't record the decisions made and they don't have a way to measure success at any level (save, perhaps, for the project itself).

Enter the rubric.

When the "What tool should we use?" question is first asked do this: spend an hour or so with the group and define the major business and technical requirements as they are viewed at the time and write up a rough rubric -- a set of criteria and a scale for determining relative importance of the criteria. Agile teams should be able to plow through this requirements definition by using one of the standard planning methods.

The rubric is used to ensure that each tool is evaluated in a consistent manner. It ensures that the group is all pointed in the same direction. You may have to change your rubric as you go along to meet new concerns. (Make sure that all tool candidates are rescored against the new rubric.) Once a decision is made you can be assured that the reasons for choosing that tool are known and understood.
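
To make the idea concrete, here is a minimal sketch of a rubric as a weighted score. It assumes each criterion gets a numeric weight and each candidate gets a numeric score per criterion; the class, the weights, and the scores are all made up for illustration, and a real rubric may well use a different scale or scoring rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical rubric: the criterion names, weights, and scores are made up.
class Rubric {
    // Criterion name -> relative importance (any positive scale).
    private final Map<String, Double> weights = new LinkedHashMap<>();

    Rubric criterion(String name, double weight) {
        weights.put(name, weight);
        return this;
    }

    // Weighted average of a candidate's scores. Every criterion must be
    // scored so that all candidates are evaluated in a consistent manner.
    double score(Map<String, Double> candidateScores) {
        if (weights.isEmpty()) {
            throw new IllegalStateException("The rubric has no criteria");
        }
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            Double candidateScore = candidateScores.get(entry.getKey());
            if (candidateScore == null) {
                throw new IllegalArgumentException("Unscored criterion: " + entry.getKey());
            }
            weighted += entry.getValue() * candidateScore;
            totalWeight += entry.getValue();
        }
        return weighted / totalWeight;
    }
}

class RubricDemo {
    public static void main(String[] args) {
        Rubric rubric = new Rubric()
            .criterion("relative stability and maturity", 3)
            .criterion("time between releases", 1);
        // Scores on an assumed 1 (poor) to 5 (excellent) scale.
        Map<String, Double> toolA = Map.of(
            "relative stability and maturity", 4.0,
            "time between releases", 2.0);
        System.out.println(rubric.score(toolA));
    }
}
```

Keeping the weights and the scores in the project records is the real point: a year later the team can see what was asked of the tool and how much each answer counted.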

Will a rubric ensure that you never make a bad decision? Certainly not. So why do it? Let's examine a case where a rubric is not used:

A tool is chosen by standard means (i.e. someone just picked it). The project fails. During the lessons learned meeting the following is heard:

  Manager:  "Why did we choose this tool?"
  Developer:  "I don't know.  Joe chose it."
  Manager:  "Where's Joe?"
  Developer:  "He left the project a few months back."

In many environments the blame will be placed solely on Joe. Without some way to compare against what was originally expected of the tool, how can the success of the tool be rated? More importantly, how can you be sure that the same mistakes won't be made on the next project?

  Developer:  "We're not going to use that same tool that we used on the previous project!"
  Manager:  "Why?"
  Developer:  "Don't you remember?  That project crashed and burned!"
  Manager:  "You're right!"

It's easy to confuse the cause-and-effect reasons for the failure of a project. It may be that the tool was the reason for the project's failure or it may be that the tool was the reason for the project's success. But the root cause of the project's success or failure may be unknown since the tools were chosen in an ad hoc manner.

Had a rubric been used in the above case, the team would be able to go back and look at the requirements for the tool and how those requirements were weighted. Let's say that the project failed because the tool had a number of critical bugs that were not patched in a timely manner. Looking back at the rubric one could check to see how the product was rated in the categories of "time between releases", "relative stability and maturity", "number of outstanding bugs", etc. If none of those qualities were investigated then for the next project the team knows that they need to look at those and rate them appropriately.

Through the proper use of a rubric, the consistency and quality of projects can be systematically improved.

SWT OpenGL Binding

As a follow on to the article entitled Using OpenGL with SWT I would like to mention that I have a SWT OpenGL binding available at:

http://www.realityinteractive.com/software/oss/index.html#SWTOGL

Happy SWT and OpenGL Coding!

Java's Nested Classes

Most everyone knows that a nested class can access private members of its enclosing class. But did you know that the reverse is true? The enclosing class can access private members of the nested class. No, really.

For those doubting Thomases out there, I'll provide you with a reference.
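
A small compilable sketch makes the same point. The class names here are hypothetical. The enclosing class reads a private field, writes it, and calls a private method of its nested class, and the compiler accepts all three.

```java
// Minimal sketch: the enclosing class touches private members of its nested class.
class Outer {
    private static class Nested {
        private int secret = 42;

        private String describe() {
            return "secret=" + secret;
        }
    }

    static int readNestedSecret() {
        Nested nested = new Nested();
        return nested.secret;        // compiles: private field of the nested class
    }

    static String callNestedMethod() {
        Nested nested = new Nested();
        nested.secret = 7;           // writing the private field also compiles
        return nested.describe();    // and so does calling the private method
    }
}
```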

Does anyone know the motivation behind why the enclosing class would be able to access private members of the nested class?

Archetypes, Color, and the Domain-Neutral Component

My parents were always apt to say "you learn something new each day". For the most part they've been right from the little things (turn the door knob before trying to go through the door) to the big things (when a sign says "Do not feed the bears," man, you better not feed the bears).

Today's new thing was stumbling on the paper Archetypes, Color, and the Domain-Neutral Component. As I've often said: when you sit down to solve a differential equation you don't just try to figure it out, you use the set of known patterns to guide you to a solution. I believe that the same is true with software architecture and development.

Here are some complementary links on color modeling and the DNC:

Emergence

I had the privilege of seeing a lecture by Dr. Robert Laughlin (1998 Nobel Prize winner in Physics) at the Adler Planetarium. "[Dr. Laughlin argued] that the true frontier of physics may not lie in studying tiny individual particles, but rather by studying the properties that emerge when large collections of particles are taken together as a whole." This notion of "emergence" flies in the face of centuries of reductionism. (This is also an interesting reductionism link.)

Dr. Laughlin demonstrated that some (all?) sufficiently complex systems show emergent or collective behaviors -- behaviors that are greater than the sum of their parts. What's fascinating about these behaviors is that they do not follow reduction-based models. Take for example a rigid aluminum bar. What causes its rigidity? If you reduce the problem down to a nano scale, say, a single layer of aluminum atoms, you find that the system is no longer rigid; it actually demonstrates fluid behaviors. As you increase the number of atoms the system begins to display more and more of the rigid behaviors. There is no single point at which, like a light switch, rigidity can be on or off.

Another interesting example is classical (Newtonian) mechanics. As you attempt to reduce the problem of mechanics down farther and farther you enter the realm of quantum mechanics where probabilities rule. There is no point at which the transition from one model to the other takes place. Classical mechanics emerges from quantum mechanics when a large set of particles is observed. If there is a 100M ton freight train coming straight at you, you don't use the probabilities of quantum mechanics to compute whether or not the train is going to hit you -- you're going to get hit!

When listening to this lecture my mind began to spin and the wheels began to turn. Coming from a physics background I am naturally a reductionist -- I tend to believe that every problem can be reduced down to first principles. This belief has bled over into my software engineering work. I began to wonder: what if the significantly complex interactions of software in modern systems begin to display emergent behavior? How could we possibly begin to systematically analyze and understand the nature of this behavior?

It seems that in general modern software engineering has a reductionist approach. Most agile methodologies, for example, advocate unit testing. Unit tests are nothing more than a firm belief in and practice of reductionism. Even without emergent behaviors most software engineers know that the transition from unit testing to integration and system testing is non-trivial. Simply understanding the interaction of various software components and understanding the failure modes that are introduced is a hard and not well understood problem.

So then what if non-reducible behaviors do occur? Are there going to be cases where we throw our propositional logic and lambda calculus out the window in favor of representations that model the emergent behavior? Interesting times are ahead!

Negative Testing

I have been bitten more than once when developing an application where all of my tests ran green yet the application still failed. This situation is inherent in basic Code Unit Test First methodology when you gloss over the details. I have met many developers that will recite "Write a unit test for the desired new capability. Write the code for the functionality. Run the test and if it succeeds then you're done." but will miss the very important "run the test first before the functionality code is written and make sure that it fails". (Never Write a Line of Code Without a Failing Test)

But ensuring that the test fails without the desired functionality does not ensure that the test will fail when the desired functionality is broken in some other manner. Let's look at an example:

When a particular piece of functionality is missing an exception is thrown. A unit test is written to explicitly check for this thrown exception and fail when it is found. When the new functionality is added the exception is not thrown and the test succeeds.

Is this particular unit test meaningful or useful? If a common failure mode is known to be that the functionality is not present (say, that the functionality is loaded dynamically at runtime) then the test may be deemed useful. If the functionality's presence can be determined by other means such as compilation errors, then the test is not particularly useful.

What is important to note in this case is that other failure modes of the functionality must be exploited. Enter negative testing. Negative testing (based on the methodology that you follow) is "showing an error when not supposed to and not showing an error when supposed to" or "showing that software will fail and that the failure is handled in a specified manner". Rather than having a lengthy posting about negative testing, the paper entitled A Positive View of Negative Testing is a great read. Understanding the subtleties of positive and negative testing is important to ensure that your unit tests are providing the sturdy safety net that you're relying on.
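
Here is a minimal sketch of a positive check sitting next to a few negative checks. It is written without a test framework so that it stands alone. The function under test and its rules are hypothetical, and "handled in a specified manner" is assumed here to mean "throws IllegalArgumentException".

```java
// Hypothetical function under test; it is not from any library.
class Percent {
    static int parse(String text) {
        int value;
        try {
            value = Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
        if (value < 0 || value > 100) {
            throw new IllegalArgumentException("Out of range: " + value);
        }
        return value;
    }
}

class PercentChecks {
    // Negative check: true only if the input is rejected in the specified manner.
    static boolean rejects(String input) {
        try {
            Percent.parse(input);
            return false;   // no error was shown when one was expected
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Positive check: valid input yields the value.
        System.out.println("accepts 50: " + (Percent.parse("50") == 50));
        // Negative checks: each invalid input must be rejected.
        System.out.println("rejects 150: " + rejects("150"));
        System.out.println("rejects -1: " + rejects("-1"));
        System.out.println("rejects abc: " + rejects("abc"));
    }
}
```

A suite with only the positive check would stay green even if the range validation were deleted. The negative checks are what catch that.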

Regression Tests

How many times have we heard:

Unit tests provide a sort of safety net which "[allows] you to refactor at any time without fear of breaking existing code, so you can constantly improve the design of your program"

But those of you that are familiar with testing know that this is nothing but a regression test:

A regression test is "a testing technique consisting of the repetition of a test after the work product under test has been iterated. Regression testing is used to identify any defects were inadvertently introduced (i.e., to determine if the work product has regressed) since the previous test."

I cannot honestly say that I have ever heard "safety net" tests called regression tests. Why is this? Are the more agile methodologies trying to separate themselves from the more heavy-weight methodologies by using a different nomenclature? Are there people out there believing that they've discovered something new?

Send your thoughts!

Managing expectations

It is common for the relationship between development and management to be tenuous. Both are like good fighters circling each other, testing for weaknesses. The moment that a weakness is found: whammo! And it's usually the developers that are down for the count.

As a developer the way that I typically win the fight with management is by effectively managing their expectations. When a project starts out, I set very clear expectations on an initial deliverable. This is not what most would consider to be a milestone. This is something typically much smaller. It is this piece of documentation or it is showing them a few buttons and fields working on a web page. Now here's the important part: I meet that expectation. After the first expectation is met, another is set, and, guess what? I meet that expectation too. This goes on again and again and again. Within a few weeks management has become more amicable when it comes to negotiating terms.

(What's ironic, for me at least, is this is what I believe that XP is trying to promote with its tight involvement with the clients and its short development cycles. Unfortunately for me, I had to develop this methodology on my own over the years.)

For those that may have missed it, let me emphasize again what is going on. By setting expectations and by consistently meeting those expectations, the tenuous relationship between management and development becomes more amicable. Effectively, development builds and earns management's trust. Only through this earned trust can development begin to dictate its own terms. (I call this "Managing the Managers".)

A quick illustration: when developers and management first meet, there is no trust between them. Management tends to be the dominant force since it tends to perceive development to be "working for" management. As a developer how can I say "well, we think that your time estimates are off by a factor of 'n'" without any proven trust or track record? You must also factor in the fact that the management team may have been bitten in the past by developers so there's already an uphill battle. (Yes, it may be that management has been bitten as a result of their own mis- or over-management.) Only after consistently and continually setting and meeting expectations and by establishing a degree of trust is it much more palatable to go to management and say "We believe that the timelines you have established are invalid. This new timeline is what we believe is more realistic. We can back up this estimate with the fact that we have defined timelines in the past and have consistently met them."

You may have also noticed by now that this stream of small deliverables shows management clear progress. There's no confusion as to the state of the project. This does wonders for reducing management's anxieties.

A few tricks that I use to gain trust faster (I hope that there are no managers around and I hope that people will understand my intentions and not misconstrue them):

  • It is possible to placate management by delivering some of their "dream" ideas early in the project. If you have a manager that likes purple hippos, then put some purple hippos on the web page. If you have some manager that really wants something that you know is dumb, just add it. (Remember: we're talking about small deliverables. We're not talking about adding that new e-commerce functionality that they've wanted!) This will go a long way to gaining their favor. Loosen up the relationship early if you can and as necessary.
  • Work a little harder on the first deliverables and pad the time a bit more so that on just a few of them you deliver early. It's just a cheap and dirty way to quickly provide the perception that the development team is trustworthy. This will certainly help with a very icy relationship.
  • I hesitate to mention this trick because I know it will be misconstrued by some, but a few of these deliverables can be orthogonal to the actual development efforts if there is a tactical situation that can be exploited. If there is a particularly icy relationship that needs to be warmed up quickly and if there is an opportunity to make a quick win then weigh the risks and try it as necessary. I'll leave it at that.

I now need to stress that this does add quite a bit of management overhead on the part of the development team. Accurately defining small deliverables and ensuring that these deliverables are met while ensuring that the deliverables are in line with the goal of finishing the project can be quite taxing on the development lead. It can consume quite a lot of their time.

I also need to stress that it's very easy to "fall from glory" in the early phases of a project by missing a single expectation. One item that I cannot emphasize enough to teams is clear and early communication about a missed deliverable. Telling management that you're going to miss a deliverable that is expected either the same day or the day before does nothing to paint you in a good light. It simply shows them what they already believe: that you have no ability to effectively plan and anticipate problems. If you know that you're going to miss a deliverable then tell them! Just don't say: "Hey Bob! We're going to miss the deliverable" and leave it at that. Say: "Hey Bob! We're going to miss the deliverable for these reasons, here's when we're going to deliver on the deliverable [or, alternatively, here's the items that we're going to deliver], and here's what we're going to do next time to not miss the deliverable."

Finally, you must continue to manage the relationship even after it has become amicable. I have seen a few projects where the developers have achieved a decent relationship by following what I've outlined here and then they stopped setting and meeting expectations. The relationship can quickly return to its original state. Management is very much about "what have you done for me today". If there's no "today" to frame some request then you quickly fall out of favor.

Persistence frameworks

(Please note: "Persistence tools / frameworks" used throughout this entry refers to Hibernate, JDO, iBATIS, SDO, etc. I'm not considering JDBC to be a persistence tool for this entry.)

The risk associated with the persistence tools is very high given the state of flux that the industry is in (especially with EJB 3.0 coming out in the "near" future). Remember that choosing a technology isn't just based on the cost to implement it today. It is based on the total cost which also includes all of the facets of maintenance and change.

If I had a project that had a persistence component and I knew that I was going to be actively developing and maintaining it for at least 2 years, I would likely use JDBC. Let's look at some of the factors involved in my decision:

  • The persistence frameworks are all going to change significantly over the next two years especially with EJB 3.0 coming out. The same goes for the various query languages that they expose.
  • New persistence technologies may come out that replace the incumbents, rendering them effectively dead. (I'm concerned less about this with Hibernate than I am with, say, JDO. But that's just a gut feeling based on the number of people using it, exposure, etc.) If IBM does what I think it will do with SDO, for example, then the space will change dramatically.
  • The learning curve with most of these persistence tools is very high. Some may poo-poo this, but if you do anything beyond "just the basics" then you're investing a LOT of time learning, debugging, and posting on forums. *grin*
  • The maintenance efforts and risks for the persistence tools are not well understood. Given that the tools themselves are still in flux, the techniques and skills needed to maintain the tools and associated code are also changing.
  • JDBC, for the most part, will remain unchanged. It is a known quantity with known risks and the maintenance efforts and associated risks are well understood.
  • There is no "specialized training" associated with JDBC. I don't have the "hit by a bus" problem with JDBC that I have with the other persistence tools.

My job as a development manager is to produce software with the lowest total cost and the lowest risk. To me, JDBC is still the clear winner for the cases that I stated earlier.

I should point out that if you're prototyping an application, working on certain types of one-off applications, or working on an application with a severely limited lifetime then the RAD aspects of the persistence tools may work to your benefit.

If you disagree, let's hear it!

GEF Command creation and debugging tip

My particular GEF application has very complex and coordinated EditPolicy's. For example, on a LayoutEditPolicy's getAddCommand(), I have to dispatch other Requests and accumulate their resulting Commands. To be more concrete: when a child figure (and its associated component (model)) is moved from one figure to another there may be a set of other Requests that get fired off to perform functions such as changing the look of the child based on the new parent and adding and removing other children which provide context.

I followed the advice of the GEF gurus and modeled my application after the logic example. Unfortunately, the paradigm presented therein is not suitable for my situation for one primary reason: when "creating" Commands (I put that in quotes only to emphasize the fact that I'm talking about both construction and the calling of the appropriate setters) some of the elements may change at execution time due to other Commands that have been called.

For example:

LogicFlowEditPolicy.createAddCommand(...) {
    AddCommand command = new AddCommand();
    command.setChild((LogicSubpart)child.getModel());
    // The parent is fixed here, when the Command is "created", not when it is executed.
    command.setParent((LogicFlowContainer)getHost().getModel());
    int index = getHost().getChildren().indexOf(after);
    command.setIndex(index);
    return command;
}

In my particular application the parent cannot be set when "creating" the Command since the component may be reparented (by other Commands) between the time that the Command is "created" and added to the CommandStack and the time that it is executed. Only on Command.execute() can the parent be retrieved from the child and be stored. (Note that the parent must be stored on Command.execute() for Command.undo() to function correctly. If Command.undo() was to retrieve the parent when it was called then that would obviously be the incorrect parent.)
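
The idea can be sketched without any GEF dependency. Everything below is a hypothetical stand-in: Node is a toy model and ReparentCommand is not a GEF type (a real Command would extend GEF's Command class and would likely also track the child's index). It only shows the old parent being read and stored in execute() rather than at "creation".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the model; none of these types come from GEF.
class Node {
    final String name;
    Node parent;
    final List<Node> children = new ArrayList<>();

    Node(String name) {
        this.name = name;
    }

    void moveTo(Node newParent) {
        if (parent != null) {
            parent.children.remove(this);
        }
        parent = newParent;
        if (newParent != null) {
            newParent.children.add(this);
        }
    }
}

// Sketch of a reparenting command that defers reading the old parent.
class ReparentCommand {
    private final Node child;
    private final Node newParent;
    private Node oldParent;   // deliberately NOT set in the constructor

    ReparentCommand(Node child, Node newParent) {
        this.child = child;
        this.newParent = newParent;
    }

    void execute() {
        // Other commands may have reparented the child since construction,
        // so the current parent is read and stored here for undo().
        oldParent = child.parent;
        child.moveTo(newParent);
    }

    void undo() {
        // Uses the parent stored by execute(), not the parent at undo time.
        child.moveTo(oldParent);
    }
}
```

If the constructor had captured child.parent instead, undo() would put the child back under whatever parent it had at "creation" time, which may no longer be the right one.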

Tracking down this situation was non-trivial since there are two different factors that must be considered: when / where the Command was "created" and when / where the Command was executed. In my case, the "created" part was the most important. But as most of you know, the various EditPolicy methods are called quite frequently and it's very, very hard to determine which Command "create" is the one that is actually added to the CommandStack and executed. (If you don't already know, the various EditPolicy methods are also called to make sure that you can do something. For example, when components are selected, ComponentEditPolicy.createDeleteCommand(..) is called to ensure that the component is deletable (i.e. Command.canExecute() is called). In this case the Command retrieved from ComponentEditPolicy.createDeleteCommand(..) is never actually executed; it is only executed when the DeleteAction occurs. This can be very confusing the first time you stumble across it.)

To get around this problem of knowing which Command is actually executed and where it was "created" from, I used the following trick:

In the constructor of my Command I did the following:

public SomeConcreteCommand() {
    ...
    this.constructionException = new Exception();
}

(where constructionException is declared as private Exception constructionException). Then in execute() I did the following:

public void execute() {
    ...
    constructionException.printStackTrace(...);
    ...
}

What this does is allow me to see the stacktrace of where the Command was "created" only when it is executed.

(I should mention for completeness that this technique for storing an Exception() is very heavy weight and will affect performance. This is only suitable for development / debugging and should not be kept for production code.)
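
For anyone who wants to try the trick outside of GEF, here is a self-contained sketch. TracedCommand and createFromPolicy are hypothetical names; in a real application the class would extend GEF's Command and the "creation" site would be an EditPolicy method.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Self-contained sketch; TracedCommand is a hypothetical class, not a GEF type.
class TracedCommand {
    // Captured at construction: its stack trace records where "creation" happened.
    private final Exception constructionException = new Exception("Command created here");
    private String lastTrace = "";

    void execute() {
        StringWriter buffer = new StringWriter();
        PrintWriter writer = new PrintWriter(buffer);
        constructionException.printStackTrace(writer);
        writer.flush();
        lastTrace = buffer.toString();
        System.err.print(lastTrace);
        // ... the real work of the command would go here ...
    }

    // Exposed only so that the captured trace can be inspected.
    String lastTrace() {
        return lastTrace;
    }
}

class TracedCommandDemo {
    // Hypothetical "creation" site, standing in for an EditPolicy method.
    static TracedCommand createFromPolicy() {
        return new TracedCommand();
    }

    public static void main(String[] args) {
        TracedCommand command = createFromPolicy();
        // Prints the stack as it was at construction, including createFromPolicy.
        command.execute();
    }
}
```

Commands that are "created" but never executed print nothing, which is exactly the filtering that makes the trick useful.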

In summary: There may be cases in which some elements of a Command cannot be set when "created". These elements can be retrieved and stored when Command.execute() is called. To facilitate debugging Commands it may be useful to store an Exception created on construction of the Command and display the stacktrace of the Exception when Command.execute() is called.

How Not to Write FORTRAN in Any Language

I recently read the article How Not to Write FORTRAN in Any Language. It "emphasize[s] form and style ... particularly those features that apply across programming languages". It's an interesting read.

The ACM Queue has a number of good articles that are worth the time to read.

Early return

While doing my morning blog walk I discovered, much to my surprise, that the "early return" is considered "bad practice".

Rather than reiterate all of the dialog surrounding the "single function exit point" debate, I direct you to the following links:

If my primary language was C/C++ (or any language without automatic garbage collection) then I would agree that an early return is bad practice; it is simply too difficult in most cases to prevent memory leaks with early returns. Since I have shed the albatross of manually managing memory, I have fewer constraints on when I may safely exit a method.

One of my favorite features of Eclipse (and, as usual, I don't care if IDE XYZ supports it) is that I can click on the return type of a method and it will highlight all points at which the method returns (either due to throwing an exception, an exception bubbled up, or a return). See Highlight method exit points for more information. This feature certainly simplifies examining code with early returns.

My personal viewpoint on coding is: reducing the number of factors that one has to be concerned about, in general, reduces the complexity. Reduced complexity tends to lead to code that is less obfuscated, easier to maintain and less likely to contain defects. In order to reduce the number of factors that one has to be concerned about, I liberally employ the Guard Clause which (in most forms) is contrary to abolishing early returns.
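
Here is a small sketch of what I mean by guard clauses. The method, its rates, and its rules are made up purely for illustration.

```java
// Hypothetical example; the method and its rules are illustrative only.
class Shipping {
    // Each guard clause handles one special case and exits early, so the
    // code after the guards only has to deal with the normal case.
    static int costInCents(String country, int weightInGrams) {
        if (country == null || country.isEmpty()) {
            throw new IllegalArgumentException("country is required");
        }
        if (weightInGrams <= 0) {
            throw new IllegalArgumentException("weight must be positive");
        }
        if (weightInGrams < 100) {
            return 100;   // flat rate for very light items
        }

        // Normal case: a per-kilogram rate, rounded up to a whole kilogram.
        int centsPerKilogram = country.equals("US") ? 500 : 1500;
        int kilograms = (weightInGrams + 999) / 1000;
        return centsPerKilogram * kilograms;
    }
}
```

The single-exit version of the same method needs a result variable and three levels of nested if/else. By the time the reader reaches the normal case they are still carrying every earlier condition in their head.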

But then again, what do I know? I also put comments in my code. *grin*

Software theorists and experimentalists

The Java Information Group asked the question: if you could pick anyone in the Java community to be your personal mentor for one year, who would that be and why?

Here's my response:

I do not believe that there is a single individual that possesses both the Java and software engineering skills necessary to be my personal mentor. There are those that possess unique insight into Java and there are those that have mastered the mechanics of writing, managing and maintaining Java software but it seems that never the twain shall meet. Though if I had to choose a mentor from the two groups it would certainly be someone that has mastered software engineering.

I have two follow up questions of my own:

  1. Do you agree with my observation that it seems that people either excel at being theorists (those that possess unique Java insight) or at being experimentalists (those that have mastered mechanics of writing, managing and maintaining Java software)?
  2. Who are the well-known experimentalists today?

Architecture roles and responsibilities

In the past I have talked about the difficulties in succinctly defining what a software architecture is. It seems that this difficulty spans over to the role of a software architect.

Below are some links to show how widely the roles and responsibilities of a software architect vary:

It was all a lie -- Part 2

A comment on Javalobby caught my eye and expresses the root of what It was all a lie was all about:

In this day and age I should be profiting from two decades of code written by brilliant and talented programmers who came before me. I should be able to take advantage of the billions of line of code written to do everything I could possibly think of doing. That was the promise of object oriented programming right? Code reuse remember that?

Instead I have to re-invent the wheel in every language I use for every project I start.

...

I want a better language and better tools. Being a programmer has nothing to do with useless mind numbing work that passes for programming these days. I want my language to enable me to express myself freely and not get in my way, not punish me for making the wrong decision too early....

(From Javalobby)

It was all a lie

As I embark on my next project that will require me to duplicate efforts that I know have already been made, I think back to one of the initial promises of OO: reusable objects. When I first started learning about OO many moons ago I was told that there would be this wonderful and vast trove of objects that I could simply pick up, augment and reuse for my own purposes. The unfortunate reality is that there are a precious few objects available.

But why is this?

Off the top of my head a few reasons come to mind.

  • Licensing. I have worked in what could be considered to be the "staff augmentation" business for quite some time. All that this boils down to is that I typically have very little say or sway when it comes to licensing potentially useful technology. Compounding that is the much feared GPL. Companies run shrieking away from GPL'd software.
  • Language constraints. A number of widely used programming languages are severely limited in their ability to augment existing objects. For example, in Java I cannot add a new member to Object that would then be present in all objects. (In theory I actually can do this with AOP but the JVM goes out of its way to make this a nightmare.)
  • YAGNI. The whole religion of "you aren't gonna need it" mandates that the objects you create should only have the functionality that they need. By not providing those "well, maybe" hooks, these objects are limiting their own lifetimes and usefulness.
  • Lack of imagination. The ability for a developer to think out of their domain is severely limited (and is only going to get worse as each domain becomes more complex). This will limit the potential usefulness of produced objects to other domains.
  • Different standards. Different frameworks, different approaches, etc. all contribute to objects that are fundamentally incompatible.

This entry was not written to bring about change or to criticize but only to point out some of the road blocks that exist.

Continue reading It was all a lie -- Part 2.

Monkey testing

I have read a number of articles in the past few days on what is commonly called "monkey testing". The name comes from the old saying: "A thousand monkeys at a thousand typewriters will eventually type out the entire works of Shakespeare."

I felt compelled to do a little research into the origins of the monkey saying and found the Infinite monkey theorem. From this I found The Monkey Shakespeare Simulator which appears to be a better use for my spare CPU cycles than checking for extraterrestrial intelligence (grin).

Technical Debt

The concept of Technical Debt or Complexity as Debt was introduced by Ward Cunningham. There are a number of articles on it such as this one by Martin Fowler, this one and this one on Technical Dividends (the benefits of not incurring technical debt). I certainly will admit that I have this pattern and have been attempting to explain its consequences for years. Providing a name for this phenomenon will help me communicate the situation both up and down the management chain.

A thank you goes out to Brian Sletten for introducing me to the concept and providing me with the necessary background information.

To prefix or not to prefix: that is the question.

I've been mucking around with Eclipse lately (specifically, the GEF). The predominant paradigm is subclassing classes provided by the framework to provide for extended functionality.

The question is: if the class you're extending is called ContextMenuProvider for example, do you call the subclass ContextMenuProvider or do you call it BlahContextMenuProvider where Blah is some fabricated name (such as the project name or My (shudder))?

Java provides this wonderful namespace delineator (i.e. the package) but we seem to be afraid to use it. I'm perhaps the worst class namer that there is, so if I can get out of naming a class then I'm happy! What do you think? Use a prefix or use the package defined namespace?

If everyone does it incorrectly, then don't do it

It seems that Bruce Tate is falling for the "if people do it incorrectly then it should be avoided" argument. Mike Clark was using the same argument with respect to comments: "how many times have you been bitten by inconsistent or wrong comments?". (I should mention that Mr. Clark didn't advocate dropping comments completely -- only that they be used when the code isn't clear enough.)

Like Mr. Tate, I too have reviewed much code. There are two mistakes that are quite common in the code I have seen:

  • The effects of multiple threads on code are not well understood. Specifically, synchronized is sprinkled apparently randomly over the code.
  • Inheritance is used as a code-reuse mechanism rather than an is-a relationship.

Following Mr. Tate's and Mr. Clark's logic, we should eliminate inheritance and synchronized (and probably even threads!).

As I've said before, to me this is all a training issue. I honestly don't expect developers to use exceptions, comments, inheritance and synchronized correctly since they're never trained to do so. Most "figure it out" (coding or "software engineering" that is) as they go along. There are a number of books out there with the necessary information, but unfortunately, there are no "problems to solve at the end of the chapter". Until we address this training issue, I'm afraid that most developers will continue to incorrectly use these constructs.
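To make the second mistake concrete, here is a minimal sketch (the class names are invented for illustration). The first stack inherits from ArrayList purely to reuse its storage; the second holds a list as an implementation detail.

```java
import java.util.ArrayList;
import java.util.List;

class InheritanceDemo {

    // Misuse: inheriting from ArrayList only to reuse its storage. A stack
    // is not a list, yet callers can now call add(0, x) or remove(int) and
    // break the last-in, first-out ordering.
    static class InheritedStack<T> extends ArrayList<T> {
        void push(T item) {
            add(item);
        }

        T pop() {
            return remove(size() - 1);
        }
    }

    // Composition: the list is an implementation detail, so only the
    // stack operations are visible to callers.
    static class ComposedStack<T> {
        private final List<T> items = new ArrayList<>();

        void push(T item) {
            items.add(item);
        }

        T pop() {
            return items.remove(items.size() - 1);
        }

        int size() {
            return items.size();
        }
    }

    public static void main(String[] args) {
        InheritedStack<String> leaky = new InheritedStack<>();
        leaky.push("a");
        leaky.push("b");
        leaky.add(0, "intruder"); // compiles, but violates the stack ordering
        System.out.println(leaky);

        ComposedStack<String> safe = new ComposedStack<>();
        safe.push("a");
        safe.push("b");
        System.out.println(safe.pop());
    }
}
```

The inherited version is not wrong because inheritance is bad; it is wrong because the is-a relationship does not hold, which is exactly the kind of thing training should cover.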

Combined fragment

I'm all for UML to help communicate and serialize ideas in a consistent fashion. There have always been a number of pain points with UML that made me angry. One of them has been in-line guards -- representing conditionals in a sequence diagram. It has been nearly impossible to succinctly represent complex conditionals (e.g. multiple else ifs). UML 2 has introduced the "combined fragment". IBM Developer Works has an article that demonstrates them in action. If you're not up for reading the lengthy article at this time then just check out figure 8. This article uses them in a sequence diagram.

If you use Visio, then these UML 2.0 stencils may come in handy.

Unit test pitfalls

I have been reading a number of articles on unit tests and test driven development (TDD). I even attended a few unit test specific talks at the NFJS Java Symposium to make sure that I was up to date on all of the latest and greatest. Unit tests and TDD are all the craze these days and everyone is touting their benefits. But there are also a number of pitfalls that must be understood before entrusting your life with these safety nets.

Pitfall 1: Parallel Concerns

While writing a unit test for class Foo you think of cases A through F. These six cases represent the whole of what you are concerned about for Foo. Feeling confident in the tests you begin coding. As you are coding the implementation, you are thinking about cases A through F since that's the whole of the concerns that you know about.

In the middle of coding you realize that there is a case G. G is not in the tests you already wrote so you add a unit test for it. So now, as far as you know, the functionality is completely covered by cases A through G.

But what if there are cases H and I? Since you didn't think of them initially, they're not in the unit tests. Since they didn't become evident as code was written they are likely not accounted for. If you're lucky, cases H and I will be implicitly tested and/or coded for. But what if they're not?

The concerns of testing and coding tend to be parallel to each other; if you thought of it, then it will be coded and tested. The test and the code will typically approach the problem in the same way. What about orthogonal concerns -- things that you didn't think of or aspects that are only uncovered when approached from a different angle?

With coding experience, one gains a sense of knowing what you don't know. These aspects can usually be boxed into a corner, documented and possibly explicitly tested for. We're all human and this knowing what you don't know certainly isn't quantifiable so there are going to be orthogonal concerns that are missed.

Pitfall 2: Invalid or not Useful

Some time ago I was working with a third party library. I wanted to provide another implementation for some custom functionality. I was in luck; the code was structured in such a way that there was an interface right where I needed it. And to top things off, there were even unit tests provided with the interface. This was going to be a good day!

I started off by running the unit tests on the existing (functional) implementation. I was rewarded with green bars; my safety net is in place. Sure, I don't know what the unit tests are actually testing for, but there are a number of them and the current developers use them. I will learn what the tests do as need arises and I will add my own as necessary.

I started to stub out my own implementation. After a few hours, I have all of the basics in place. Now's a good time to see how far off base I am. I run the tests and all shows up green! How good can this day get?

But wait! I notice an error in my logic that should not produce a correct result. Well, maybe the unit tests don't explicitly test for that case. I'll make a note to write a test.

But something just doesn't feel right so I quickly stub out another implementation that just returns default hardcoded values. I run the unit tests and all is green. What the ....?!?

After devoting a hefty chunk of time to weeding through the existing unit tests I discovered the problem: the unit tests were all negative tests. In other words, each test checked to ensure that something wouldn't happen. Since my code never returned null or threw an exception, they all passed. Fifteen tests that made up about one thousand lines of code served to tell me nothing useful in my case.
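The effect is easy to reproduce. In the sketch below, the Sorter interface and both implementations are hypothetical stand-ins (the actual library is not named here); the point is only that a negative test cannot tell a hardcoded stub from a working implementation.

```java
import java.util.Arrays;

class NegativeTestDemo {

    // Hypothetical interface standing in for the third-party extension point.
    interface Sorter {
        int[] sort(int[] input);
    }

    // A stub that returns a hardcoded default, like the throwaway
    // implementation described above.
    static class HardcodedSorter implements Sorter {
        public int[] sort(int[] input) {
            return new int[0];
        }
    }

    // A working implementation, for comparison.
    static class RealSorter implements Sorter {
        public int[] sort(int[] input) {
            int[] copy = Arrays.copyOf(input, input.length);
            Arrays.sort(copy);
            return copy;
        }
    }

    // Negative test: only checks that bad things do not happen
    // (no null result, no exception).
    static boolean negativeTest(Sorter sorter) {
        try {
            return sorter.sort(new int[] {3, 1, 2}) != null;
        } catch (RuntimeException e) {
            return false;
        }
    }

    // Positive test: checks that the expected result is actually produced.
    static boolean positiveTest(Sorter sorter) {
        return Arrays.equals(new int[] {1, 2, 3}, sorter.sort(new int[] {3, 1, 2}));
    }

    public static void main(String[] args) {
        Sorter stub = new HardcodedSorter();
        System.out.println("negative test passes for stub: " + negativeTest(stub));
        System.out.println("positive test passes for stub: " + positiveTest(stub));
    }
}
```

The negative test is green for the stub and for the real implementation alike; only the positive test separates them.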

Pitfall 3: Requirements Match

The business logic for computing the results of some portfolio requires the use of a spatial partitioning tree structure. You approach the problem by first writing a suite of tests. These tests are robust and cover all corner cases. Due to the complexity of the algorithm, the tests even go as far as to parse the internal tree structure to provide that extra degree of certainty.

After a week of development time, you and the rest of the engineering staff are confident in the logic. The software is pushed off to quality control and the results come back: zero defects. The obligatory pizza and beer party is thrown for this extraordinary feat and the engineering staff goes home for a well deserved weekend of rest.

On Monday, still riding the high of the previous week, you arrive early only to find a post-it on your monitor telling you that the client rejected the software. You feel your heart sink. How could this be?

After some investigation, it was determined that your understanding of the tree structure and that of the client differed drastically. The software was certainly free from defects but the results did not match the requirements.

Conclusion

All of the pitfalls have the common theme: don't let the green light lull you into a sense of false confidence. Unit tests may provide a safety net but make sure that the net is directly under you and that it isn't just lying on the ground. Also make sure that if the tightrope is frayed and does break that the safety net isn't made of the same faulty material.

Maybe we're approaching this the wrong way

I have been thinking a bit lately about the process of coding. At this point I believe that there are two distinct phases: blank page and non-blank page. The "blank page" phase is when a new component is being added -- essentially, you have a "blank" editor window. The "non-blank page" is slightly harder to narrow down, but in essence, it is the point at which you have something coded. (I'll leave something undefined at this point.) The act of typing out language statements (i.e. "coding") appears to be most effective during the blank-page phase. You will be hard pressed to find a WYSIWYG, drag and drop editor that allows a more rapid entry of logic and expressions. (Please note that I'm not saying that a drag and drop editor would be slower in cases of building up a class stub, for example. I'm referring to for loops or general if logic and the like.)

Code and coding appear to be less effective once something (still undefined) has been written -- the non-blank-page phase. Attempting to understand large-scale structure and to perform change analysis is extremely difficult when just looking at code. Comments combined with code help facilitate change analysis quite a bit, but still fall short of a magic bullet.

Let's approach this from a different angle. The primary reason for comments is to inform you or another developer of the intent, purpose, concerns, etc. of the code so that when modifications or debugging are performed, all concerns and implications are known. (Please note that I'm only concerned with non-API defining comments in this discussion. API comments are those only used by "external" developers to use an API.) In other words, comments aren't useful in any way during the blank-page phase. Comments are only added to be useful in the non-blank-page phase.

If you think about a standard development hierarchy, more senior developers are doing the blank-page phase work as well as some early non-blank-page work (this is where the ill-defined "coded something" will eventually derive its definition), whereas junior developers are typically thrown into maintenance (certainly non-blank-page work). This creates an environment whereby those that are least able to write effective comments are the ones that must be writing the comments. I say that senior developers are least able because they have had more time away from code maintenance, and only by doing regular code maintenance does one learn which comments are the most effective to write.

I've just thrown out a whole bunch of ideas that may seem orthogonal. I'll summarize:

  • There appear to be two distinct phases of the process of coding: blank-page and non-blank-page.
  • The act of coding (writing language statements) appears to be best suited for the blank-page phase.
  • Code (the output of the blank-page phase) is ill-suited for the non-blank-page phase. Coding itself is of dubious use during the non-blank-page phase unless new features are being added (in which case, one could argue that this is a form of "blank-page").
  • Comments provide important information that is not expressible in code.
  • Comments are of little to no use in the blank-page phase but are paramount in the non-blank-page phase. (Please note that I am omitting comments used for describing an API (e.g. javadocs) in this discussion.)
  • Comments must be specified in the blank-page phase and maintained during the non-blank-page phase (as part of the code maintenance).

All of this screams dichotomy. Code and comments, the output of the blank-page phase, appear to be the least suitable inputs to the non-blank-page phase. Code ("language statements") is written since it appears to be the fastest way of "data entry" for the blank-page phase but code is not suitable for change analysis during the non-blank-page phase.

Couple this dichotomy with the average times spent in each phase. If you look at "pre-1.0" vs. "post-1.0" times (development time vs. maintenance time) you can see numbers like 1:2, or 1:10. In other words, maintenance time (non-blank-page phase) far outweighs the initial development (blank-page phase'ish) time. Yet all of our techniques, our "tools", are best suited for the blank-page phase.

Perhaps we're approaching code the wrong way.

For those familiar with the concept of Intentional Programming (IP), a light bulb may have lit up. I'm actually talking about a level above IP. One can make the argument that Eclipse (with the AST) combined with the EMF (well, some yet-to-be-written instance of the EMF) provides for the IP paradigm. The problem is that an AST view of IP does not effectively incorporate comments.

Eliminate comments?

While I was at the NFJS Symposium I heard from a number of speakers stories about how they were bitten by inaccurate comments to the point that they now use comments sparingly. As one may be able to tell from my other postings, this goes contrary to everything that I believe in. To me, comments are a training issue. If developers are trained effectively so that comments and code are really one and the same then there's never a problem of having comments be out of date. But given that comments cannot be checked for accuracy, there's always the possibility of a refactoring (especially one done automatically through an IDE) that produces inaccurate comments. Does this mean that comments should be eradicated? No. It simply means that more diligence is needed. For example, before an automatic refactoring is checked in, each change should be checked for accuracy. This is good practice in any case since there is always the possibility that the refactoring had implications that were not initially understood. But how many developers do that? Again, solved by training.

Art or not?

My background is in physics and math. I can remember my first advanced mechanics course. The professor was solving a rather innocuous looking problem. The first black board was filled with equations quickly followed by the second, third and fourth. The professor appeared to be a magician waving his wand over the board as technique after incredible technique flowed from him. Integration by parts followed a deft use of completing the square. On and on it went until forty-five minutes and six black boards later the problem was solved. I was left breathless. Clearly the black arts had their hand in this weaving of advanced mathematics and only artists such as Jackson Pollock could see the beauty in what was created.

Was this a display of art or well practiced skill?

Before we move towards an answer, let's define succinctly what is meant by art and what is meant by skill. Using Merriam-Webster as a reference, art implies a personal, unanalyzable creative power whereas skill stresses technical knowledge and proficiency.

At the time I witnessed that display of mathematical prowess, given my limited experiences, I will admit that I believed I was watching an art. There was simply no way for me to imagine that the succession of techniques coupled with a vision of the desired solution could be known a priori.

While taking the mechanics course I was also taking differential equations and another advanced mathematics course. Over the period of the three courses I began to see that there was no man behind a curtain controlling a mystical presence. Practicing technique after technique hundreds of times in different progressions slowly built up a type of vision. Initially I could not see step B from step A unless I was told the technique to use and actually worked through to the result. Eventually, I was able to see step B without being told which technique to use. I then progressed to seeing step C and D from A without ever writing down a single equation. Directly from step A I was able to see if applying one technique would lead me to a dead end three or four steps ahead. Through the systematic practice of techniques I was able to develop a skill.

Had I taken the mathematical courses before the mechanics course, I would have never believed that I was seeing an art. It was necessary for me to first build up a skill before I could comprehend what was going on (thereby ingraining the meaning of the word prerequisite indelibly into my brain).

Software refactoring is a relatively new technique. One could argue that it has only been recently with the introduction of appropriate tools (i.e. IDEs with refactoring support) that refactoring could be effectively practiced. With such a short existence, it is no wonder that refactoring appears more like an art than the systematic practice (a skill) that it actually is. What's more is that most software engineers do not have the knowledge of the desired end result.

Let us contrast this with the mechanics story from above. Most academic mechanics problems are stated in the following way:

Determine the equations of motion for X to move along surface defined by Y under the influence of force Z.

"equations of motion" have a succinctly defined form -- you know exactly what the form of the results should look like. One simply has to take the givens and manipulate them until they are in the desired form. But what is the desired form in software engineering?

Using quality as a basis, it is clear to me what the desired end result of refactoring looks like. Unfortunately, it is not a simple expression as in mechanics. It is a careful weighting of many factors including coupling, clarity, size and so on. To compound the problem even further, each factor is not well defined. What is "clarity"? How is "coupling" measured? What is good coupling and what is bad? None of these questions have a single answer and attempting to find the composite of these multivalued solutions may just be NP-hard.

Is there a solution to this refactoring problem? I believe that there is. But before we can begin looking for solutions, we need to better understand and define the problems. Once we finally make the jump from:

Refactoring is an art

to:

Refactoring is a skill that requires much practice and experience with maintaining the end result

then we can begin to find objective means from which solutions are derived.

At the core I believe that this is a training issue. Whereas I spent years manipulating equations so that I could see the results before ever writing a single equation, CS students do not spend years writing and maintaining code. Unless you have spent considerable time understanding and working with the implications of various results how can you begin to refactor code?

What is a Software Architecture?

As I have mentioned previously, I am in the process of writing a technology presentation for the physics department faculty and graduate students at my alma mater. I have been continually running into road blocks when trying to find the best way to describe common concepts in software engineering to people that do not work with software all day.

The topic for today is: software architecture. I did the usual "define: software architecture" on Google for a baseline and then I queried a number of colleagues for their thoughts. I ended up with many variations on the theme: an arrangement of components to meet a particular objective. The problem with this is that for someone who does not work with software, components is completely undefined. Throwing in the word "software" into the definition does little to elucidate the matter.

I stumbled on this link and from that I was directed to a link at the SEI. (If you've read some of my other postings you know that I'm enamored with the CMM and by proxy the SEI). The SEI link provides a trove of information including a section that has user specified definitions. I haven't settled on my own concise definition as of yet, but the SEI has certainly given me more to go on.

Intent through code is lossy

Expressing intent only through code is lossy.

If it is accepted that a reason to weed through someone else's code is to find and remove bugs, how can the argument "the code is self describing" stand? The original intent has clearly become partially lost due to a defect.

There are at least two cases that exist in the presence of a defect:

  1. The original intent is described through the code.
  2. The original intent was improperly transcribed into code.

Without comments or some other mechanism to state intent, the two cases are indistinguishable. The first case is a much larger concern (i.e. risk) than is the second since it has the implication that other decisions were based on a flawed intent.

When a "code is self commenting" programmer is also a "crafty" programmer, you have the recipe for disaster. A crafty programmer is one that attempts to use small tricks such as overusing a variable (e.g. a null value implies some other value does not exist) to achieve a goal. The problem with the resulting code is that the intent is obfuscated. Anyone else reading the code typically has to scratch their head and wonder "is this intentional or a bug?". Information about the intent is clearly lost in the translation to code.
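A small sketch of the overused-variable trick, using an invented subscription example (none of these names come from real code). In the first class a null endDate silently doubles as "still active"; in the second the state has its own field and the intent is written down.

```java
class IntentDemo {

    // "Crafty" version: endDate does double duty. A null endDate is used
    // to mean "the subscription is still active", but nothing in the code
    // says so; a reader cannot tell a deliberate null from a forgotten one.
    static class CraftySubscription {
        String endDate;

        boolean isActive() {
            return endDate == null;
        }
    }

    // Explicit version: the status has its own field and the intent is
    // stated, so a cancelled subscription without an end date is
    // recognizable as a defect rather than a signal.
    static class ExplicitSubscription {
        // Intent: 'active' is the single source of truth for status.
        // 'endDate' is only meaningful when 'active' is false.
        boolean active = true;
        String endDate;

        void cancel(String date) {
            if (date == null) {
                throw new IllegalArgumentException("a cancelled subscription needs an end date");
            }
            active = false;
            endDate = date;
        }

        boolean isActive() {
            return active;
        }
    }

    public static void main(String[] args) {
        CraftySubscription crafty = new CraftySubscription();
        System.out.println(crafty.isActive());

        ExplicitSubscription explicit = new ExplicitSubscription();
        explicit.cancel("2005-01-31");
        System.out.println(explicit.isActive());
    }
}
```

Both classes behave the same on the happy path. The difference only shows up when someone has to decide whether a null is a bug, which is exactly when the original author is not around to ask.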

Mandatory Access Control for Java

This paper presents mandatory access control (MAC) in Java. Below is an excerpt:

... we have extended the JVM with functionality to do mandatory access control at the granularity of objects. Our implementation strictly separates the enforcement mechanism from the specification of polices. This allows flexible specification and enforcement of a wide range of policies. Moreover, we show that these techniques are implementable in current JVMs with minimal modifications to other JVM subsystems, while maintaining full backwards compatibility.

We have implemented this by adding an access control tag to each object, and modifying the virtual machine to check that tag at every data access to an object. Policies will take the form of predicates over these access control tags. Since mechanism and policy are strictly separated, various policies can be plugged in to the VM.
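The separation of mechanism and policy can be illustrated at the application level. The sketch below is not the paper's JVM implementation: the Tag values, the Guarded wrapper, the idea of a subject tag, and the NO_READ_UP policy are all assumptions made up for this example. It only shows a check on every read (the mechanism) consulting a pluggable predicate over tags (the policy).

```java
import java.util.function.BiPredicate;

class MacSketch {

    // Hypothetical access control tag; the excerpt does not say what a
    // tag contains.
    enum Tag { PUBLIC, SECRET }

    // Mechanism: every read goes through a check against the object's tag.
    // The policy is passed in, so the wrapper knows nothing about the rules.
    static final class Guarded<T> {
        private final Tag tag;
        private final T value;

        Guarded(Tag tag, T value) {
            this.tag = tag;
            this.value = value;
        }

        T read(Tag subject, BiPredicate<Tag, Tag> policy) {
            if (!policy.test(subject, tag)) {
                throw new SecurityException("access denied: " + subject + " -> " + tag);
            }
            return value;
        }
    }

    // Policy (an assumption for this sketch): a subject may read objects
    // tagged at or below its own level. Enum order defines the levels.
    static final BiPredicate<Tag, Tag> NO_READ_UP =
            (subject, object) -> subject.compareTo(object) >= 0;

    public static void main(String[] args) {
        Guarded<String> plans = new Guarded<>(Tag.SECRET, "launch plans");
        System.out.println(plans.read(Tag.SECRET, NO_READ_UP));
        try {
            plans.read(Tag.PUBLIC, NO_READ_UP);
        } catch (SecurityException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Swapping NO_READ_UP for a different predicate changes the rules without touching Guarded, which is the property the excerpt describes for the VM.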

Frameworks and Toolkits

I'm in the process of writing a presentation that I will be giving in early October entitled Computing to Support Scientific Research: How to stay focused on the science, not the software for the physics department at my alma mater. Given the audience, I want to ensure that all terms and concepts that I would typically use willy-nilly are well defined and concisely used.

One focus of the presentation is tools that have enabled or simplified development of complex applications. In the abstract, frameworks and toolkits have provided enormous traction toward this goal. I quickly realized that I did not have a concise definition for either term (my thinking tends to be more visual which typically does not allow me to readily translate thoughts into succinct terms). A little searching has resulted in the following:

  • Framework: A specification or implementation (code) that provides a general solution to some problem or aspect of applications.
  • Toolkit: A collection of programming subroutine libraries that can be used to make development easier.

If the code path is taken, a framework is-a toolkit.

AOP and fault injection

I have voiced my concerns in the past about maintaining and growing code that uses AOP. I am not one to quickly shun a technology and never look back. I constantly review technologies.

I was reading this paper on recovery oriented computing (ROC) which is in line with my autonomic pursuits and they mentioned FIG. FIG is fault injection in glibc. If you've ever tried to set up a networking test where write() would block long enough to see the impact without massively mutating your code, FIG (or its concept) would be something to look into.

When I read about FIG, the little 10W light bulb in my head flicked to a dim glow. Use AOP to do fault injection!

Unit testing, even if the developer of the unit test is not the developer of the code being tested, is of limited use. No matter how hard we might try, unit tests tend to be parallel to the concerns of the code rather than orthogonal. In other words, unit tests tend to miss many of the bugs that aren't directly in line with the use / purpose of the code. For example, if you have a class that reads data from the network and writes it back then the unit tests are typically going to align themselves with that -- they will ensure that stuff in matches stuff out. But what about the orthogonal concerns, most of which involve difficult to trigger network concerns? (Yes, I'm horrible at giving motivational examples.) Attempting to write code that will cause exceptions to occur is likely to require a harness that's bigger than the original code and test combined. And how do you guarantee the validity of the harness? In other words, who tests the tests? The coast guard? I don't think so!

Using AOP (or just java.lang.reflect.Proxy) to inject faults will allow you to test the corner cases without a massive harness and without changes to your existing code or unit tests.
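Here is a minimal sketch of the java.lang.reflect.Proxy route. The Channel interface, RecordingChannel and failOnCall are hypothetical names invented for the example; the only real API used is the standard dynamic proxy. The proxy throws an IOException on a chosen call and delegates everything else to the untouched real object.

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

class FaultInjectionDemo {

    // Hypothetical interface standing in for the network-facing code.
    interface Channel {
        void write(String data) throws IOException;
    }

    // The real implementation; it is not modified for the test.
    static class RecordingChannel implements Channel {
        final StringBuilder sent = new StringBuilder();

        public void write(String data) {
            sent.append(data);
        }
    }

    // Wraps a Channel in a dynamic proxy that throws an IOException on the
    // n-th call to a Channel method and delegates every other call.
    static Channel failOnCall(Channel target, int failingCall) {
        InvocationHandler handler = new InvocationHandler() {
            private int calls = 0;

            public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                // Only count methods declared by Channel, not toString() and friends.
                if (method.getDeclaringClass() == Channel.class) {
                    calls++;
                    if (calls == failingCall) {
                        throw new IOException("injected fault on call " + calls);
                    }
                }
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause(); // rethrow the real exception unwrapped
                }
            }
        };
        return (Channel) Proxy.newProxyInstance(
                Channel.class.getClassLoader(), new Class<?>[] {Channel.class}, handler);
    }

    public static void main(String[] args) throws IOException {
        RecordingChannel real = new RecordingChannel();
        Channel flaky = failOnCall(real, 2);
        flaky.write("first");
        try {
            flaky.write("second");
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
        flaky.write("third");
        System.out.println(real.sent);
    }
}
```

The code under test only ever sees a Channel, so the same unit tests can be run against the flaky proxy to see how the caller copes with a failure in the middle of a sequence of writes.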

On a parallel note, this article brings up some interesting uses of AOP for exception handling.

Autonomic computing

For reasons still left unstated I have been doing research in autonomic computing. For those unfamiliar, here are some interesting links:

Class invariants

I'm a constructor-based dependency injection kinda guy but with everyone always talking about setter-based dependency injection I started to question my approach. When Dave Thomas reminded me about class invariants I knew that my constructor-based approach was the right one.

From Wikipedia:

...a class does not allow use of all possible values for the state of the object, only those that are well-defined by the semantics of the intended use...

and

The main purpose of a constructor is to establish the invariant of the class, failing if the invariant isn't valid.

You know that you should use constructor-based dependency injection when the answer to the question "is this dependency a class invariant?" is yes.
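A minimal sketch of the difference, with a hypothetical Clock dependency (all names here are invented for illustration). With constructor injection the invariant "clock is never null" is established once; with setter injection there is a window in which the object exists but the invariant does not hold.

```java
import java.util.Objects;

class InvariantDemo {

    // Hypothetical dependency.
    interface Clock {
        long now();
    }

    // Constructor injection: the invariant "clock is never null" is
    // established at construction and holds for the object's lifetime.
    static final class ConstructorInjected {
        private final Clock clock;

        ConstructorInjected(Clock clock) {
            this.clock = Objects.requireNonNull(clock, "clock is required");
        }

        long timestamp() {
            return clock.now();
        }
    }

    // Setter injection: between construction and setClock() the object
    // exists in a state that violates the invariant.
    static final class SetterInjected {
        private Clock clock;

        void setClock(Clock clock) {
            this.clock = clock;
        }

        long timestamp() {
            return clock.now(); // NullPointerException if setClock was skipped
        }
    }

    public static void main(String[] args) {
        ConstructorInjected ok = new ConstructorInjected(() -> 42L);
        System.out.println(ok.timestamp());

        SetterInjected fragile = new SetterInjected();
        try {
            fragile.timestamp();
        } catch (NullPointerException e) {
            System.out.println("invariant violated before setClock()");
        }
    }
}
```

The constructor version fails at the point of the mistake (construction with a null dependency); the setter version fails later, at first use, which is usually further from the cause.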

Creative Commons License Unless otherwise expressly stated, all original material of whatever nature created by Rob Grzywinski and included in this weblog and any related pages, including the weblog's archives, is licensed under a Creative Commons License.