Archive for August, 2007

I recently went through a software architecture evaluation for one of my projects. What follows is a technical summary of the evaluation and resulting decisions. I’ve also posted a sample application that demonstrates many of the important features.

Project Goal
The goal of the project was to choose the system architecture and software development environment for a large-scale web application. The choices made at this stage will have far-reaching consequences in terms of expenses, staffing and schedule (see platform peril). Some of the key decisions:

  • Programming language
  • Development frameworks
  • Scalability architecture
  • Database technology
  • Third party applications, tools and components
  • Build environment

Some Requirements
The project is a large-scale web application that operates in a homogeneous computing environment completely within the operator’s control (e.g. hosted). The application should be able to support tens of millions of unique visitors per month on hundreds of servers. The development team is small and is skilled in Java programming with Hibernate and Spring, but they can easily switch to Ruby on Rails or PHP if necessary. The key design factors are (in order of importance):

  1. Low hosting costs = Support a high-traffic web site with a low number of CPUs per million unique visitors per month.
  2. High developer productivity = Approach the speed of Ruby on Rails. Focus on object-oriented and test-driven development.
  3. Low complexity = The fewer disparate parts, the better.
  4. Fast learning curve = New developers should only need programming language and web development skills.
  5. Use stable, popular third party tools and components = Maximize the available choices of UI widgets, AJAX libraries, etc.

Results
I chose Java over Ruby on Rails and PHP. The total solution includes: Java, MySQL, Hibernate, Spring, Spring MVC convention over configuration, Yahoo UI/Ajax, Ant/Ivy, MyEclipse with hot-deploy, JSP page.tag for layout, Memcached (or equivalent) for session management.

Here is a sample application with all of these things working, except for Yahoo UI and memcached.

Details
For languages, I considered PHP, Ruby on Rails and Java. I chose Java.

Java: Java has the lowest hosting costs, and best scalability options, but the solution is more complex. Developer productivity varies drastically depending on which platforms, third party tools and development environments are chosen. Poor decisions have far reaching consequences.

Ruby: I like Ruby on Rails a lot. It offers the highest developer productivity and least complexity. Because Rails is a complete solution, there are far fewer decisions to make and it is much faster to get started. Some people object to Ruby because it is too easy to change the core behavior of the language (e.g. override the methods on Object), and thus make an application unintelligible to any new developer. According to my friend Billy, if you make that “loophole” available, someone is going to take advantage of it, especially on larger projects.

PHP: PHP is popular, scales well, and offers many third party components. There are a number of MVC (model view controller) frameworks (including Cake and Zend), but none seems to be a de facto standard. Another reported advantage of PHP over other languages is the availability of a large pool of developers. In this case, the job market advantage is minimal because most PHP developers would not qualify. Like Ruby, PHP has a very fast development cycle.

MySQL vs Postgres: I chose MySQL because of general adoption and because the developers are already experienced with it. Some very large sites use MySQL extensively.

Scalability: The web tier will use load balancing routers (e.g. Big IP) configured without sticky allocation. Any HTTP request can be routed to any web server. This means avoiding the default implementation of the Java servlet HttpSession, which keeps session state in the memory of a single server, and instead storing sessions in something like memcached or an equivalent shared store. The database will be split up into separate instances, segregated by user group and application function.
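To make the scalability idea concrete, here is a minimal sketch in Python (chosen only for brevity; the project itself is Java). Everything in it is an assumption for illustration: `SharedSessionStore` is a plain dictionary standing in for memcached, and the shard map, instance names and hashing rule are hypothetical, not the project's actual layout.

```python
import hashlib


class SharedSessionStore:
    """Stand-in for memcached: one store that every web server talks to."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        # Return a copy so callers cannot mutate stored state by accident.
        return dict(self._data.get(session_id, {}))

    def set(self, session_id, session):
        self._data[session_id] = dict(session)


class WebServer:
    """A web server that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id, key=None, value=None):
        session = self.store.get(session_id)
        if key is not None:
            session[key] = value
            self.store.set(session_id, session)
        return session


# Hypothetical shard map: application function -> database instances.
SHARDS = {
    "profiles": ["profiles-db-0", "profiles-db-1"],
    "messages": ["messages-db-0", "messages-db-1", "messages-db-2"],
}


def pick_database(function, user_id):
    """Pick an instance by application function, then by a stable hash of the user."""
    instances = SHARDS[function]
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return instances[int(digest, 16) % len(instances)]


store = SharedSessionStore()
server_a = WebServer("web-a", store)
server_b = WebServer("web-b", store)

# The load balancer may send consecutive requests to different servers.
server_a.handle("session-123", "user_id", 42)
seen_by_b = server_b.handle("session-123")
print(seen_by_b)  # the session written through web-a is visible on web-b
print(pick_database("profiles", 42))
```

Because neither server holds session state, the router is free to send each request anywhere, which is the point of dropping sticky allocation.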

Java Architecture
If you are developing a large Java application, there is a sickening number of choices out there. Here are a few of the technologies I looked at:

Integrated Development Environment (IDE): Evaluated MyEclipse and IntelliJ IDEA. Chose MyEclipse. I’ve been using IDEA for four years. IDEA provides a slightly better coding environment, but MyEclipse has hot-deploy to Tomcat and other tools. The ability to hot-deploy and easily debug webapps makes a huge difference in developer productivity.

Persistence / Object Relational Mapping (ORM): Evaluated EJB3, Hibernate 3 with Annotations. Chose Hibernate with Annotations. The developers already know Hibernate and it works great. In fact, the annotations layer conforms to the EJB3 JPA specification.

Frameworks: Evaluated JBoss Seam/JSF, Spring Framework 2. Chose Spring. Seam looks like a good tool, but it is wrapped up in the EJB3 specification which is designed for heterogeneous enterprise environments. The learning curve seems steep. Spring 2.0 has added some convention over configuration features that reduce the amount of irritating and unproductive XML configuration files. The development team already knows Spring, including all of the good and bad parts.

Build Tools: Evaluated Ant, Maven + Ant, Ivy + Ant. Chose Ivy + Ant. For big projects, Ant needs some sort of dependency management add-on. Maven has both a loyal following and many detractors. Maven’s integration with Ant and Eclipse is awkward. For our purposes, Maven is overkill. Ivy does a good job of managing dependencies and works with Maven repositories.

J2EE server: Evaluated Tomcat5, JBoss, Jetty. Chose Tomcat for now. Any one of these (and more) will do, but the development team is more familiar with Tomcat.

View / Page Layout: Evaluated JavaServer Faces, Facelets, Velocity, Freemarker, Struts Tiles, JSP page.tag. Chose JSP page.tag. JSF has a steep learning curve and seems to abstract away a lot of the session management, which could be a problem when it comes to scaling. Velocity and Freemarker provide a nice way of removing some of the JSP irritations but they don’t work with other tag libraries like displaytag. JSP page.tag is mindlessly simple and much better than Struts Tiles.



I’m interested in how people use Wikipedia, so I analyzed the Top 100 articles in the English Wikipedia for June and July 2007. Some observations:

  1. You cannot extend this analysis by inference to characterize all of Wikipedia because it represents only the most popular 0.2% of the traffic of around 50 million visitors per month.
  2. 48% of articles are purely popular culture. Top categories include Pokemon, Anime, Movies, TV, Music, but there are also
  3. 14% of articles are biographies. Most of these are related to popular culture, including Princess Diana, Pop Singers, Pro Wrestlers
  4. 11% of articles are voyeuristic. These include the articles on Sex, erotic art, etc.
  5. In the month of June, Science, History and Politics accounted for about 28% of the top 100, but that number dropped to 23% in July. Perhaps this is a reflection of how much Wikipedia is used for school work, since summer vacation starts somewhere in that time frame for many primary school kids.
  6. I filtered out certain articles such as the home page from this analysis. After filtering, the top 100 articles in June accounted for only about .2% of the total US traffic to Wikipedia (1,636,000/816,000,000).
  7. Overall about 70% of the top 100 articles are about popular culture (This certainly does not mean that 70% of all Wikipedia articles or 70% of all Wikipedia traffic is about popular culture).
  8. For one sample, I stretched the analysis from the top 100 to the top 167. The voyeuristic share went from 4% to 2.4% and other categories also changed slightly. This indicates that an analysis of the top 10,000 articles may yield different results.
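The traffic share quoted in observation 6 is a single division. Here is a quick check in Python, using only the two figures given in that observation; the variable names are mine.

```python
# Figures quoted in observation 6 (June, after filtering).
top_100_traffic = 1_636_000
total_us_traffic = 816_000_000

share_percent = top_100_traffic / total_us_traffic * 100
print(f"Top 100 share of US traffic: {share_percent:.2f}%")  # prints 0.20%
```

A sample that small is why the results cannot be stretched to cover Wikipedia as a whole.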

One note about the data: the total article count data for July 07 is sparse for some reason. I worked around this by checking the Top 100 on July 7, July 11 and July 31. The percentage breakdown for July was pretty much the same for all three readings.

Here’s a summary data table:

wikipedia-top-100-06-07.png


Marc Andreessen has a fascinating post titled Age and the entrepreneur, part 1: Some data, based on the research of a professor of psychology at the University of California, Davis named Dean Simonton. Among Marc’s many observations is the startling statement that:

Quality of output does not vary by age… which means, of course, that attempting to improve your batting average of hits versus misses is a waste of time as you progress through a creative career. Instead you should just focus on more at-bats — more output. Think about that one.

If this sounds insane to you, Dr. Simonton points out that the periods of Beethoven’s career that had the most hits also had the most misses — works that you never hear. As I am always fond of asking in such circumstances, if Beethoven couldn’t increase his batting average over time, what makes you think you can?

The odds of a hit versus a miss do not increase over time. The periods of one’s career with the most hits will also have the most misses. So maximizing quantity — taking more swings at the bat — is much higher payoff than trying to improve one’s batting average.

This type of Calvinistic determinism is unfortunate for several reasons. First, creative people have much less control over their number of swings at bat than they do over their actions while at bat. If you work for internet startups, it usually takes at least a few years to find out whether you’ve struck out. Three years is more typical. You might choose to only work on small projects to increase your at-bats, but that is not a good strategy because some of the best things take time to create.

Secondly, the study focuses on “outstanding achievement” (see his paper) – people like Beethoven. These people have less room to improve their average because it is already quite high. In pro baseball the top batters average between 30% and 50%. If your average is 48%, there isn’t much room to improve.

Thirdly, the ratio as defined by Dr. Simonton is overly simplistic. I’ve written on this topic before (engineering goodness); the relationship of success to failure is not a single ratio. It is a bell curve:

goodness-graph1.png

The bell curve provides at least two possibilities for improvement: narrow the curve (decrease the standard deviation) and change the average.

goodness-graph3.png

I believe most creative people have a better chance at improving their average than increasing their swings at bat. The bell curve shows the way.
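Here is a rough sketch of those two levers in Python. The numbers are invented purely for illustration: I assume outcomes follow a normal distribution on an arbitrary “goodness” scale, that anything above `HIT_THRESHOLD` counts as a hit, and that anything below `MISS_THRESHOLD` counts as a miss. None of these values come from Dr. Simonton’s data.

```python
from statistics import NormalDist

# Hypothetical cut-offs on an arbitrary "goodness" scale.
HIT_THRESHOLD = 1.0
MISS_THRESHOLD = -1.0


def outcome_rates(mean, stdev):
    """Return the fraction of outcomes that are misses and hits for a bell curve."""
    curve = NormalDist(mu=mean, sigma=stdev)
    return {
        "miss": curve.cdf(MISS_THRESHOLD),
        "hit": 1.0 - curve.cdf(HIT_THRESHOLD),
    }


baseline = outcome_rates(mean=0.0, stdev=1.0)
narrower = outcome_rates(mean=0.0, stdev=0.5)        # decrease the standard deviation
higher_average = outcome_rates(mean=0.5, stdev=1.0)  # change the average

for label, rates in [
    ("baseline", baseline),
    ("narrower curve", narrower),
    ("higher average", higher_average),
]:
    print(f"{label}: misses {rates['miss']:.1%}, hits {rates['hit']:.1%}")
```

With these made-up thresholds, narrowing the curve mostly removes misses, while raising the average both removes misses and adds hits. Different thresholds would give different proportions, so treat the output as a picture of the idea rather than a measurement.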


This essay is about the use of platforms within software development organizations.

Platform, Schmatform
Whenever I hear someone talk about building a new software platform for their organization I am instantly skeptical. My reaction has nothing to do with lack of confidence in the speaker. It is just that most platforms are failures. The problem is that the supply of frameworks, platforms and other reusable code projects far exceeds demand. More specifically, there are many, many people who would just love to create the platform that everyone else uses, while the market just wants one platform. The poster child for platform wannabes is Microsoft, which won the first rounds of the desktop operating system game and now owns a huge percentage of that market. Some more recent hopefuls are Facebook, Ning and Google Maps. Geoffrey Moore, in The Gorilla Game, does a great job of explaining the risks and rewards of trying to be a platform.

But not all platforms are created with the goal of being the gorilla. Large companies tend to breed platforms like rabbits and open source frameworks grow like weeds. At one point in 2004, it got so bad that someone created a framework-framework, the Keel Framework. Happily, it didn’t last long (see bile blog’s writeup if you don’t mind strong language). There is so much platform proliferation that simple economic factors do not suffice as an explanation.

Platform Proliferation
I think the ultimate cause of platform proliferation is simply that software developers love to build platforms. For many, building a platform represents the pinnacle of technical achievement. As I’ve said elsewhere on this blog, creative technical people are in this business because they like to make things that are valued by others. What can be more self-affirming than creating something that your peers use to build even better things? And other aspiring platform developers often reinforce this feeling by buying into a particular platform ideology. For instance, there were people who actually used the Keel framework-framework. Go figure.

While in principle this tendency toward platforms is neither good nor bad, in practice it can lead to disastrous results. This is because architectural decisions regarding platform choices stay with you for a long time and are very difficult to reverse. Here are some patterns of failure that are particularly damaging:

  1. You don’t need it. Platforms are expensive and complex. Often the disadvantages of using or creating a re-usable code framework far outweigh the rewards. Unfortunately once you make a decision, it takes a long time to find out whether the decision was a good one or a bad one.
  2. You made it yourself when you should have used someone else’s. This is so common that there is an acronym for it: NIH (Not Invented Here).
  3. You chose the wrong one. Everyone who wants to use a platform has their favorite choice among many. The selection of a platform is based on a number of factors, some of which are only partially related to the business problem at hand. For instance, don’t choose a platform if you can’t find good developers for it.

choose-wisely.jpg
Choose Wisely
In any software development organization, there is a constant tension around where to put your money: in the platform or the application. Software developers love to make platforms so there is always plenty of pressure to either use or make a new platform. Yet the selection of a platform is fraught with peril. As the ancient knight said in Indiana Jones and the Last Crusade, “choose wisely”.

Later I will write about specific examples of platform peril from my career.


Call me a data geek, but I can’t help myself. I’ve updated my Wikipedia contributor map based on my recent discoveries.

The problem with my earlier post was that I discounted the contributions of people who don’t make a large number of individual edits but add a lot of content. Aaron Swartz (see link above) suggests that there are different types of contributors. First, there is a small group of contributors who make a lot of edits but don’t add a lot of words. For example, they might revert vandalism, fix grammar, reorganize or categorize. Second, there is a larger group of contributors who don’t make a lot of edits overall, but add a lot of words each time they edit. Aaron believes that this group creates the bulk of Wikipedia content.

Of course both types are critical to the success of Wikipedia and the data below indicates to me that it is more of a continuum than a statistical grouping of contributor types.

Another problem with the original post is that the total number of worldwide visitors per month (according to comScore, May 07) is actually 217 million. I was using the number of US visitors, which is 48 million.

So here’s the update:

Here are some interesting factoids culled from Wikipedia contributor statistics.

Compare the population of world countries to the Wikipedia contributors. In the hierarchy of users, the vast majority of visitors to Wikipedia, 217 million of them, are readers; for the most part they don’t edit articles. Next are the Regular Contributors who have contributed more than 10 times ever. There are about 340,000 of those. Next are the 105,000 Active Editors who contribute between 5 and 100 times per month. Finally, there are the 10,000 Very Active Editors who contribute more than 100 times per month.

wikipedia-contributor-math-update.png

So if Wikipedia readers are like China, then the contributors are like Macedonia, Montenegro and Grenada. To extend this analogy to absurd extremes, Macedonia, Montenegro and Grenada do all of the work, have the highest GDP and provide humanitarian aid to China!

Some background math:

The most recent total contributor data on the Wikipedia stats page is from Oct 2006. I applied a 41% growth rate to all numbers to arrive at estimates for May 2007, based on growth in overall traffic as reported by comScore. The ratio of users to contributors is:

  1. Regular Contributors: 217M/338k = 642:1
  2. Active Editors: 217M/105k = 2055:1
  3. Very Active Editors: 217M/14k = 15,585:1
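The same arithmetic can be reproduced in a few lines of Python. This sketch uses the rounded inputs shown in the list, so its ratios land close to, but not exactly on, the figures above, which presumably came from unrounded spreadsheet values. The `apply_growth` helper is my own illustration of the 41% adjustment; the Oct 2006 source counts are not reproduced here, so it is shown on a nominal 100 contributors.

```python
# Rounded May 2007 estimates as quoted in the list above.
readers = 217_000_000
contributors = {
    "Regular Contributors": 338_000,
    "Active Editors": 105_000,
    "Very Active Editors": 14_000,
}


def apply_growth(count, rate=0.41):
    """Scale an Oct 2006 count to a May 2007 estimate using the assumed growth rate."""
    return count * (1 + rate)


# Readers per contributor for each group.
ratios = {name: readers / count for name, count in contributors.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.0f}:1")

print(f"100 contributors in Oct 2006 -> about {apply_growth(100):.0f} in May 2007")
```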

Here’s the spreadsheet I used to calculate this data. The inspiration to make this map came from the Strangemaps blog.
