Tuesday, January 31, 2012

How to Reduce External Dependencies for your Java Libraries

For people who write or contribute to Java open source products, external dependencies are a blessing and a curse. They are a blessing in that they provide needed functionality that shortens development tremendously. I couldn't imagine writing code without the benefit of Apache Commons Lang, Commons Collections, Commons IO, Commons BeanUtils, and many more. For open source libraries, however, these dependencies also present problems.

The first problem is that external dependencies can cause class conflicts. For example, it's possible that the library you release works perfectly fine under Commons Lang 2.6 but doesn't run properly with Commons Lang 2.1. Yes, you can run your unit tests against previous versions of your dependencies, but there's no guarantee that this will catch everything. Furthermore, it takes time and effort that is often better spent adding new features and enhancements to your library.
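To make that failure mode concrete, here's a hypothetical illustration; the class below is invented for the example, and I'm not claiming this particular method is missing from any specific Commons Lang release:

import org.apache.commons.lang.StringUtils;

public class MyLibraryClass {

    public String summarize(String input) {
        // This compiles cleanly against the Commons Lang jar on our build path.
        // If the user's application deploys an older Commons Lang jar that
        // predates this method, the call fails at runtime with
        // java.lang.NoSuchMethodError -- long after our unit tests passed.
        return StringUtils.abbreviate(input, 20);
    }
}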

The second problem is that new releases of your external dependencies can cause runtime problems in the future. There's no way to test against unreleased versions of these products: just because you work fine with Commons Lang 3.1 doesn't mean that you will run properly with upcoming releases. This is also a problem for the users of your library. Typically, web applications have a vast assortment of libraries they depend on, each with its own dependency list, and it's possible for these lists to conflict. Yes, there are tools to help you identify these conflicts. Yes, we try to choose dependencies wisely and favor products with a good history of maintaining backward compatibility. But none of this will completely keep users out of trouble.

With an open source product I'm involved with, Admin4J, we took a different approach. Yes, we leverage other products, but we do so differently: we repackage most of the products we use. That is, we slightly refactor their underlying source to give it a unique package structure. For example, Apache Commons Lang's main package is org.apache.commons.lang3; we refactor that so that the package Admin4J relies upon is net.admin4j.deps.commons.lang3. We make no other changes.
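To illustrate what this looks like from Admin4J's own source, here's a minimal sketch (the class is invented for the example, and it assumes the repackaged classes are on the build path):

import net.admin4j.deps.commons.lang3.StringUtils; // the repackaged copy bundled with Admin4J
// instead of:
// import org.apache.commons.lang3.StringUtils;    // the stock Apache jar

public class SomeAdmin4jClass {
    public boolean hasText(String s) {
        // Behaves identically to the Apache original; only the package differs,
        // so it can never collide with whatever Commons Lang version the
        // host application has on its classpath.
        return !StringUtils.isBlank(s);
    }
}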
The advantages of this approach are the following:
  • We benefit from the functionality provided by these other products.
  • We have a more consistent runtime environment; we don't need to worry about users running dependency versions that differ from the ones we develop and test with.
  • Our users don't have to be concerned that our dependency list conflicts with the list of one of their other dependent products.
The disadvantages are the following:
  • We consume additional memory (PermGen space) for additional copies of classes that might already be in the user's classpath.
  • Some products don't work well with this strategy; it didn't work for Freemarker and SLF4J, so we still list those two products as external dependencies.
To give credit where credit is due, we borrowed this technique from Tomcat, which uses it quite successfully to repackage Apache Commons Logging and Commons DBCP. The secret sauce that accomplishes this refactoring is Ant's replace task. We use Ant to perform the package refactoring, compile the resulting code, and package it either for development or as part of our deployed runtime jar. An excerpt from our build script illustrates:

<!-- Perform the package refactoring on a copy of the dependency source.
     Each token matches both package declarations and import statements,
     so the copied source compiles under its new package structure. -->
<replace dir="${temp.src.dir}/net/admin4j/deps/commons" >
     <replacefilter token="org.apache.commons.lang3"
           value="net.admin4j.deps.commons.lang3" />
     <replacefilter token="org.apache.commons.mail"
           value="net.admin4j.deps.commons.mail" />
     <replacefilter token="org.apache.commons.fileupload"
           value="net.admin4j.deps.commons.fileupload" />
     <replacefilter token="org.apache.commons.io"
           value="net.admin4j.deps.commons.io" />
     <replacefilter token="org.apache.commons.dbutils"
           value="net.admin4j.deps.commons.dbutils" />
     <replacefilter token="org.apache.commons.beanutils"
           value="net.admin4j.deps.commons.beanutils" />
     <replacefilter token="org.apache.commons.collections"
           value="net.admin4j.deps.commons.collections" />
     <replacefilter token="org.apache.commons.logging"
           value="net.admin4j.deps.commons.logging" />
</replace>

For those of you who write or contribute to open source libraries, I'm interested in any other strategies you might have encountered and how they worked out.

Monday, January 16, 2012

The Benefits of a Standardized Application Architecture

There is much literature on software architecture and design. Most of it focuses on coding patterns and best practices; that is, it focuses on an application's internal structure and on improving quality at the code level, usually with a single application as the intended scope. In fact, most application architectures are created and deployed for a small number of applications. It's time we looked at the larger picture and considered the benefits of deploying a standardized software architecture across multiple custom applications across the enterprise. Over the past several years, I've had an opportunity to do just that and have observed several benefits that are worth documenting.
Let's start by defining terms so we're all on the same page. I define application architecture as the internal structure of an application. That is, an application architecture details the software products an application uses internally and the programming patterns it employs. For example, application architecture specifies whether the application uses a layered architecture or one of the other architecture patterns, the exception, logging, and transaction management strategies used, and so on. This is different from “applications architecture” (with an “s”), which is used in some EA circles to describe what business data is produced and consumed by each application in business terms.
Most organizations will standardize the hardware platform and operating system to be used. Most will also standardize on the significant vendors involved, such as Microsoft (in the case of .NET), or on which J2EE container vendor and relational database will be used. Some organizations will provide base services and methods for securing, deploying, and monitoring applications. Many won't standardize much more than that, leaving the application architecture up to individual teams.
Over the past several years, I've had the opportunity to guide the development of a standard application architecture in the Java/J2EE space that is used for over a dozen applications. The application architecture defines the entire technical stack, including the web framework and ORM used. The architecture specifies a package structure and a specific software layering paradigm, along with coding conventions. It provides a standard method for instrumenting applications for performance, memory, logging, and exception management purposes. It also provides base build and deployment procedures.
This standardization of application architecture provided a consistency between the applications that allows for several benefits:
  • All applications can easily consume architectural improvements. For example, we developed strategies (with several open source products) to measure the performance of each application, provide run-time log-level management, provide memory shortage alerts and low-watermark history, and much more. All custom applications in the enterprise were easily configured to consume these features. This also lowers the price tag should it be necessary to replace one of the products in the tech stack; the solution and migration procedures are developed once and merely reused for all applications. (A minimal sketch of this kind of shared instrumentation appears after this list.)
  • It is much easier to switch developers between applications. In most organizations, J2EE applications are different enough that there is significant ramp-up time for new developers or for developers moving from one application to another. Because these applications are written in very similar ways, having experience with one application equates to having experience with all of them.
  • Developer time is optimized in several ways. First, a standard technical stack puts limits on the list of products a developer needs to stay current in; in most organizations, the combined technical stack for all applications is much larger. Second, development speed for new applications improves because all basic technical decisions have already been made.
  • The common architecture leads to a significant base of code that is shared between applications. This puts the DRY principle into practice and keeps developers from having to re-invent the wheel.
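To make the shared-instrumentation point concrete, here is a minimal sketch of the kind of component a standard architecture can hand to every application: a servlet filter that times each request. The class name and the logging choice are hypothetical, not our actual production code.

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Hypothetical shared component: because every application registers this one
// filter, an improvement to it is consumed by all applications at once.
public class PerformanceTimingFilter implements Filter {

    public void init(FilterConfig config) throws ServletException { }

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        long start = System.currentTimeMillis();
        try {
            chain.doFilter(request, response);
        } finally {
            long elapsed = System.currentTimeMillis() - start;
            String uri = ((HttpServletRequest) request).getRequestURI();
            // A real implementation would feed a shared metrics and alerting
            // component; printing is enough to illustrate the pattern.
            System.out.println(uri + " took " + elapsed + " ms");
        }
    }

    public void destroy() { }
}

Each application registers the filter once in its web.xml; after that, improvements to performance measurement ship with the next architecture version rather than requiring per-application work.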
The benefits are numerous and have definitely resulted in lower support costs. While I can't divulge exact numbers publicly, the number of support personnel needed for this collection of applications is far less than at other organizations developing J2EE applications that I'm aware of. It's clear to me that the benefits stem from the consistency between the applications, not from any specific technical advantages of the individual products used.
There are some challenges to this approach. We've become very careful about what code makes it into the 'common' code base that's shared across the applications. Once something is published and used, the impact of changing it can be widespread. As a result, nothing is published to the 'common' code base until more than one application needs it.
We're also a bit slower to consume newly released versions of the open source products we use, since any new product release carries the potential to break something. We've mitigated this risk by effectively “versioning” the technical stack and common code base so that applications can consume new versions of the architecture at different times.
Some developers consider themselves artists, and this type of developer won't like the idea of standardizing the application architecture, as it limits their creativity. In this world, creativity (or artistry) comes into play when a new type of technical problem or need surfaces that hasn't already been addressed by the standard application architecture. After several years, the rate at which new technical problems and needs surface has decreased substantially.
While my primary experience with deploying a standard architecture of this type is in the J2EE space, there's every reason to believe that other development platforms would see similar benefits from standardizing the application architecture and deploying it across multiple applications.
I'm curious about your thoughts and experiences. I'm particularly interested if you've experienced benefits or costs to an approach like this that aren't already listed.

Monday, January 2, 2012

With Java: 'private' does not mean private.

It turns out that you can get access to fields and methods in JDK classes declared as 'private' through reflection.  I don't recommend this and prefer not to use what I'm about to describe.  In fact, I consider it an option of last resort.

Recently, I ran into bug 6526376, which describes a memory leak in the JDK that impacts products using java.io.File.deleteOnExit().  This method tracks the file to be deleted in a statically defined Set in the class java.io.DeleteOnExitHook.  The Set grows but never shrinks during the life of the JVM, creating a memory leak.
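To illustrate how the leak accrues, here's a minimal, contrived sketch: every call to deleteOnExit() adds an entry to that private Set, and the entry is never removed, even if the file itself is deleted long before shutdown.

import java.io.File;
import java.io.IOException;

public class DeleteOnExitLeakDemo {
    public static void main(String[] args) throws IOException {
        // A long-running server that creates temporary files this way
        // accumulates one Set entry per file for the life of the JVM.
        for (int i = 0; i < 100000; i++) {
            File temp = File.createTempFile("work", ".tmp");
            temp.deleteOnExit(); // adds the path to DeleteOnExitHook's private Set
            // ... use the file ...
            temp.delete();       // deletes the file, but the Set entry remains
        }
    }
}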

While this bug is marked as fixed in the latest release of the JDK, upgrading to the latest release wasn't an option.  Furthermore, the Set is declared 'private', with no way to access its contents and clear it (after doing its work), or so I thought.  Upon reading a post from Cedrik's weblog, I realized that there was a way to access the files listed in this private Set and perform its work through a batch job.

The secret ingredient for this solution is the following few lines (imports included for completeness):

import java.lang.reflect.Field;
import java.util.LinkedHashSet;

// Load the package-private JDK class by name; it can't be referenced directly.
Class<?> exitHookClass = Class.forName("java.io.DeleteOnExitHook");
Field privateField = exitHookClass.getDeclaredField("files");
privateField.setAccessible(true); // bypass the 'private' access check
@SuppressWarnings("unchecked")
LinkedHashSet<String> fileSet = (LinkedHashSet<String>) privateField.get(null); // static field, so no instance needed
// You then have full access to the privately defined Set and can modify its contents.


With this particular solution, you do want to take care to synchronize access to the Set and not delete files that are currently in use.  I haven't tried this trick to execute private methods yet, but I see no reason why it wouldn't work.  Furthermore, these few lines could easily be abstracted into a utility class and re-used in other situations.
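To illustrate, here's a minimal sketch of such a utility class; the class and method names are hypothetical:

import java.lang.reflect.Field;

// Hypothetical helper that consolidates the reflection boilerplate shown above.
public final class PrivateAccessUtils {

    private PrivateAccessUtils() { }

    // Returns the value of a private static field from the named class.
    public static Object getPrivateStaticField(String className, String fieldName)
            throws ClassNotFoundException, NoSuchFieldException, IllegalAccessException {
        Class<?> targetClass = Class.forName(className);
        Field field = targetClass.getDeclaredField(fieldName);
        field.setAccessible(true); // bypass the access check
        return field.get(null);    // static field, so no instance is needed
    }
}

The DeleteOnExitHook workaround then collapses to a single call:

LinkedHashSet<String> fileSet = (LinkedHashSet<String>)
        PrivateAccessUtils.getPrivateStaticField("java.io.DeleteOnExitHook", "files");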

I've mixed feelings about this solution.  I'm glad that I was able to work around the memory leak with a small amount of code; otherwise, we would have been stuck recycling our containers periodically.  I've run into bugs in open source products where this solution could conceivably be used as well.

On the other hand, this breaks encapsulation and leaves developers no way to really guard against unauthorized access to private fields and methods.  An architecture principle I try to follow is 'Swim with the stream'.  That is, use products as they are intended to be used.  This hack definitely breaks this principle.  Should we really be able to do this?

There is no question that this is a dangerous tactic.  Private methods and fields are effectively unpublished from the perspective of the developers who wrote the code in question.  Those developers can and should feel completely free to refactor (e.g. change or even remove) those private methods and fields with the understanding that they aren't directly affecting calling code.  These private fields and methods may disappear in future releases of the code being accessed this way, or their intended use may change drastically.  My solution is fragile and will likely break with a future release of the JDK.

It's possible that developers using this tactic to break into code they weren't given access to might not understand the full context of the fields and methods being used; there may be unintended runtime consequences to accessing and changing private data that haven't been considered.  In my case, how can I really tell that a File listed in the shutdown hook isn't still being used?  I elected to address this issue by not deleting any file until at least an hour after its last update, but this isn't foolproof.  It's possible that by deleting these files early and removing their Set entries, I've created a derivative bug somewhere that will now be much harder to find and fix.

On further consideration, the creators of the JDK were wise to allow this; it gives us options when dealing with bugs like this one.  If I have a developer on staff who routinely uses fragile solutions like this, I've got larger issues.  Changing the JDK to prevent this type of unauthorized access to private fields and methods won't save me from the other damage a developer like that can cause.