Archive for the ‘Infovark’ Category

Getting It Out of Our System

On the main Infovark blog — the “business” blog — I talked about how we threw the first version of Infovark away. Not the core idea, of course, but we dumped our initial database schema and restructured most of the code. Here’s the technical story behind that tough call.

Fighting the last war

We’d originally intended to build Infovark on top of an existing Enterprise Content Management platform. At the time, Gordon and I thought of Infovark mostly as an alternative ECM interface targeted at knowledge workers.

When we decided to strike out on our own, we realized that we couldn’t take the underlying ECM services for granted anymore. We’d have to build things like object storage and version control ourselves. And if we were going to create a mini-platform for our own use, we were going to do it right, by golly.

Drawing on my background with real estate MLS systems, database reporting tools, and electronic records management, I began to work on an amazingly flexible data storage tier. Gordon dove into the data access classes. We began creating the sort of system we’d always wanted to use as highfalutin IT consultants.

The system allowed us to do neat things like define metadata for an object on the fly and roundtrip it to the database. We could access those objects in code, or via XML or JSON web services. We could define multiple views for each object. Every object was searchable using a full text index or via SQL. We were quite proud of ourselves.

It was an utter waste of time.

Deep magic

Technically, what we’d done was construct an entity-attribute-value database model. EAV systems are technically complex beasts designed to solve one tricky problem: client domains that don’t have well-defined metadata.

The EAV model was initially developed for clinical records systems. These needed to accommodate a huge array of possible symptoms, complaints, effects, diagnoses, and interactions. A particular patient is unlikely to have more than a handful of issues at a time, however. It’s a rich set of metadata with a sparse array of values.

Think of those questionnaires you get at your doctor’s office. They normally give you a long form with lots of checkboxes on it. You tick a few boxes here and there to record your or your family’s medical history.

An EAV model is designed to store that type of information efficiently in a standard relational database, at the cost of doing some major code gymnastics.
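
To make the shape of an EAV table concrete, here is a minimal sketch in Python with SQLite. It is an illustration only, not Infovark's actual schema (which this post doesn't show). The table name patient_facts, the attribute names, and the load_patient helper are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One narrow table holds every fact as (entity, attribute, value).
    CREATE TABLE patient_facts (
        patient_id INTEGER NOT NULL,
        attribute  TEXT    NOT NULL,
        value      TEXT    NOT NULL,
        PRIMARY KEY (patient_id, attribute)
    );
""")

# Only the boxes that were ticked are stored; unticked boxes take no space.
conn.executemany(
    "INSERT INTO patient_facts VALUES (?, ?, ?)",
    [
        (1, "allergy_penicillin", "yes"),
        (1, "family_history_diabetes", "yes"),
        (2, "smoker", "yes"),
    ],
)


def load_patient(conn, patient_id, attributes):
    """Rebuild one 'row' by pivoting attribute/value pairs back into fields."""
    stored = dict(
        conn.execute(
            "SELECT attribute, value FROM patient_facts WHERE patient_id = ?",
            (patient_id,),
        )
    )
    # Attributes with no stored fact come back as None, like a NULL column.
    return {name: stored.get(name) for name in attributes}


# Hypothetical questionnaire fields; a real form would have hundreds.
FORM_FIELDS = ["allergy_penicillin", "family_history_diabetes", "smoker"]
patient_1 = load_patient(conn, 1, FORM_FIELDS)
```

The storage side is compact, but every read has to reassemble an object from scattered rows, and the database can no longer enforce per-field types or constraints. That reassembly is the "code gymnastics" part.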

Except in very limited circumstances, an EAV design is considered a database smell. Some go so far as to list it as a SQL design error. Joe Celko, author of several books on SQL, has an article on how to avoid the EAV of destruction.

EAV remains popular though, perhaps because of its close ties to Steve Yegge’s universal Properties pattern.

In fact, there’s a whole slew of alternative databases designed specifically to help with EAV problems: column-oriented databases like Vertica, meant for data warehousing; XML databases like Mark Logic, for structured documents; and document stores like CouchDB and MongoDB, for loosely structured content. But I digress.

All of the hoops we jumped through to help store arbitrary data were overkill. We were making the problem harder than it needed to be.

The only positive thing I can say about the effort was that it was an itch we had to scratch. We had to get the old stuff from our previous jobs out of our system before we could focus on Infovark.

What we really needed

While the features we built were exactly the sorts of things a consultant or systems integrator might want, end users couldn’t care less about them. We’d unconsciously built a product for ourselves, not for our customers.

Our customers don’t want to define their own data structures. They don’t want to learn about metadata or record types. They just want a product that helps them remember stuff. Figuring out what data to store or columns to index was our job.

So while the Alpha build was incredibly cool from a techie perspective, it wasn’t easy or fun for the typical knowledge worker to use.

We needed to do our homework. What do our customers need to get out of a personal information wiki? What items will they want to reference later?

How we manage that information under the hood should be completely invisible to them. As far as they’re concerned, Infovark is an actual animal that lives inside their computer that helps them find interesting things.

Back to the drawing board

Once we started looking at the problem from the user’s perspective, things got much simpler. We threw out the EAV approach and went with a much simpler data model. We gathered requirements to figure out the bare minimum number of data types that a typical knowledge worker would need. Then we began defining templates that gave users the ability to interact with these types in what we hope will be natural ways.

I guess it’s another example of the write big to write small principle. We built a general framework at first, capable of handling nearly any sort of object we threw at it, then drastically edited it back to hold the bare minimum needed.

Conclusion

It wasn’t that the EAV approach was wrong. It worked. We could have built on it. But it was a huge framework and it consumed a lot of our engineering effort. That’s time much better spent on things that our customers actually care about.

I wish we’d started with the simple solution. But I’m not sure we would have understood or appreciated it without trying the EAV approach first. We needed to get it out of our system.

And then we needed to get it out of our system.

The Curse of the Singleton

It took us six weeks to break the curse of the singleton. Six weeks! By the end of it, we’d rewritten most of our data access layer.

We began the process of removing singletons innocently enough. I thought I was well prepared for the task. I’d just finished reading The Pragmatic Programmer (my review of The Pragmatic Programmer) and Working Effectively with Legacy Code (my review of Legacy Code). I remember telling Gordon I’d tackle the problem over the weekend…

What’s a singleton?

The Singleton Design Pattern is one of the first patterns introduced in many software design books. But don’t let this fool you like it did me. Its prominent position has nothing to do with its importance. The Singleton is usually listed first because it’s the easiest pattern to explain and implement. It made a convenient place for the author to start, but the Singleton’s real uses are very limited.

Which is appropriate, actually, since the real use of the singleton is to limit usage. A class that implements the Singleton pattern allows only one object to be instantiated at a time. There are a few cases where this is desirable. For example, classes that control access to a single hardware device or that set up global variables. But the danger of the Singleton is that there are many cases where you’ll want to misuse it.
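
For readers who haven't met the pattern, here is a minimal Python sketch of a Singleton. It is illustrative only: the Database class, its queries list, and the save_note function are hypothetical names, not code from Infovark (which was written in C#), and this simple version is not thread-safe.

```python
class Database:
    """Hypothetical class that only ever hands out one instance."""

    _instance = None

    def __new__(cls):
        # Create the object on first use; return the same one afterwards.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.queries = []
        return cls._instance

    def execute(self, statement):
        # Stand-in for real database work: just record what was asked.
        self.queries.append(statement)


def save_note(text):
    # Any code anywhere can reach the database this way, and nothing in
    # the function signature reveals that it does.
    Database().execute(f"save note: {text}")


first = Database()
second = Database()
save_note("hello")
```

Both names refer to the same object, so the call inside save_note shows up in first.queries. That hidden, global reach is what makes the pattern so easy to misuse.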

Why are they bad?

Scott Densmore lists the four key characteristics of the Singleton and how each can get you into trouble in his Why Singletons are Evil blog post.

For another cautionary tale of the cycle of attraction, infatuation, disappointment, and rejection, read Singleton, I love you, but you’re bringing me down.

In our case, we’d gleefully implemented Singletons for database access, content indexing, security and access control, and in a few other places where we thought we needed just one instance. If Steve Yegge were here, he’d call what we’d done an instance of the Simpleton pattern — a failure to grasp basic principles of object-oriented programming. You can read more about Yegge’s thoughts on the singleton and design patterns for dummies.

Our automated tests were running slowly because we had to set up and tear down the database for every test. Making a change to one component would frequently cause several tests to fail. Everything was tied together at the hip — at the Singleton classes — and it was impossible to disentangle our code to test particular items in isolation. We had tests, but not unit tests. They were integration tests, and the points of integration were the handful of singleton classes we’d built.

Worse, our database performance was lousy. Since we had a global variable for our database object, we could sprinkle database access code throughout the rest of our object model. We discovered that we were opening and closing database connections all the time. And we’d had to implement tricky locking code to guarantee that our SQL statements would get executed in the right order.

What did we do about them?

The Singleton let us be lazy about our programming habits. It allowed us to make assumptions we shouldn’t have. You can call it premature optimization or a retreat into procedural programming techniques from an earlier era. Ultimately, we’d found that it allowed us to cut too many corners.

So we slowly rooted out each Singleton class from our API and reimplemented the functionality in other ways. Fortunately, we had a large battery of integration tests to help guide us. And luckily, we’d decided to tackle the problem during our first Alpha test, when we could still afford to make sweeping changes. But correcting bad design takes much longer than avoiding it in the first place — even if you’ve read all the right books.
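
This post doesn't spell out what replaced each Singleton, so the following is only a sketch of one common alternative: passing the dependency in through the constructor. NoteRepository, RecordingConnection, and the notes table are hypothetical names chosen for the example.

```python
import sqlite3


class NoteRepository:
    """Hypothetical data-access class that is handed its connection."""

    def __init__(self, connection):
        self._connection = connection

    def add(self, text):
        self._connection.execute(
            "INSERT INTO notes (body) VALUES (?)", (text,)
        )


class RecordingConnection:
    """Test double: remembers statements instead of touching a database."""

    def __init__(self):
        self.calls = []

    def execute(self, sql, params=()):
        self.calls.append((sql, params))


# Unit test style: no database to set up or tear down.
fake = RecordingConnection()
NoteRepository(fake).add("hello")

# Integration test style: the same class against a real in-memory database.
real = sqlite3.connect(":memory:")
real.execute("CREATE TABLE notes (body TEXT)")
NoteRepository(real).add("hello")
stored = real.execute("SELECT body FROM notes").fetchall()
```

Because the repository no longer reaches for a global, a test can swap in the recording double and check a single class in isolation, while the caller that owns the connection decides when it is opened and closed.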

Six weeks later, we finally sorted out the mess we’d made for ourselves. There’s a handful of odds and ends left to do, but the design feels better. My gut tells me it’s an improvement, and our tests — now we have both unit and integration tests — show that we’ve almost tripled the speed of the data access layer.

It was worth our time to break the Curse of the Singleton. Beware lest ye, too, fall under its spell!

It Doesn’t Get Better Than This

Our latest revision got high marks from our source control tools.

We are totally 1337.

Tools: ReSharper 4.0

We just finished our trial period for ReSharper from JetBrains. We’re buying licenses right now. It’s become indispensable to us. It’s that good.

ReSharper is like pair programming for introverts. It’s like a real-time FxCop, offering refactorings and best practices advice while you type.

Gordon had used ReSharper in its 2.0 days. I’d heard many positive things about ReSharper, but hadn’t tried it myself. The recently released 4.0 version offers support for C# 3.0, including the var keyword, object and collection initializers, and lambda expressions. Check out the in-depth review by Simon Hart if you want more details. Or just try it yourself.

How to avoid Visual Studio Help

For what seems like the thirteen-thousandth time, I just accidentally pushed the F1 key while I was writing some code. It’s pretty close to the escape key. I didn’t mean to push it. I guess I just have fat fingers.

I really, really hate pressing F1 in Visual Studio. Usually, it takes about a minute to display Microsoft’s help documentation thingy, which is impossible to navigate, frequently wrong, and generally not very helpful. This afternoon, the document explorer decided it had to go and update itself, which took about five minutes before it could take its usual minute to load the irrelevant non-help that I didn’t even want in the first place!

During this time, Visual Studio was COMPLETELY unusable. The help dialog blocks the main Visual Studio thread, and all attempts to get back to work were greeted with a friendly, informative “This may take several minutes” dialog.

Time Passes…
Time Passes…
Time Passes…

Arggh! Gord Mad!… And it turns out it’s not just me. This annoys other folks, too!

Right. That’s it Visual Studio. You’ve made me go through this song and dance for THE LAST TIME!

For starters, where do we all go for help? To Google, that’s where. So, I added an external tool using the Tools > External Tools method:

Adding an External Tool

I set up my command to point to Firefox, and passed as the arguments:

http://www.google.com/search?site=&hl=en&q=$(CurText)+c%23&

(The +c%23& part of the command appends “C#” to whatever is highlighted in the IDE. If you’re not using C#, you could leave it out, or substitute it with whatever else you usually search for)
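
If the encoding in that URL looks cryptic, this small Python sketch shows where it comes from. It is only an illustration of URL encoding, not part of the Visual Studio setup, and the google_search_url helper and its language_hint parameter are names invented for the example. A raw # in a URL starts the fragment, so “c#” has to be sent as c%23, and the space before it becomes a +.

```python
from urllib.parse import urlencode


def google_search_url(selected_text, language_hint="c#"):
    """Build a search URL for the highlighted text plus a language hint."""
    query = f"{selected_text} {language_hint}".strip()
    # urlencode escapes reserved characters: space -> +, # -> %23, < -> %3C.
    return "http://www.google.com/search?" + urlencode({"hl": "en", "q": query})


url = google_search_url("List<T>")
```

Here url comes out as http://www.google.com/search?hl=en&q=List%3CT%3E+c%23, which has the same +c%23 ending as the arguments above.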

Then, I flipped over to the Keyboard bindings screen (Tools > Options > Keyboard):

VS 2008 Keyboard Binding Screen

And I re-mapped the F1 key to my new ExternalCommand1.

There! Now, whenever I press F1, Visual Studio opens a new tab on my web browser, and searches Google for whatever I have highlighted in the IDE.

Purposefully punishing developers with a minute or two of waiting every time they press a certain key is just plain unforgivable. They get really distracted trying to work around the “functionality”, and then further distracted writing ranty blog posts about it…

Firebird 2.1 Released

Firebird News just announced the release of Firebird 2.1.

We’ve been using Firebird as Infovark’s SQL database for several months now. It’s an open source fork of Borland’s InterBase SQL server. So far, we’ve been extremely impressed by the database itself. Its companion database administration tool, FlameRobin, is shaping up nicely too. It’s currently in version 0.86, but it handles most of what we need.

Why did we choose Firebird? I’ve worked with a variety of databases during my career, including Microsoft’s SQL Server, Oracle, and MySQL. While we were strongly tempted to use one of those, none of them met our criteria. We needed an embeddable, scalable, ANSI-compliant database that we could deploy easily. It also had to have liberal licensing arrangements and low (or no) royalties. That limited our available choices.

Two features persuaded us to give Firebird a try: It was free to use and the entire database is contained within a single file. I’m glad we did; it’s absolutely the right choice for us.

I wouldn’t recommend it for everyone, however. If you haven’t done much work with relational databases, Firebird is not the place to start. While the Firebird community is working on documentation and tools, they’re far behind what you’ll get from a commercial vendor or from some other open source products. The Firebird website is tricky to navigate and lacks a search feature. Trying to get pointers from Google is tough, too, due to a certain American muscle car with the same name. (Tip: Search for Firebird SQL to get the database.) You’ll most likely need to rely on a combination of the Firebird website, InterBase documentation, and developer blogs to get specific information.

If you can look past those issues, and they’re common ones in the open source space, you’ll find Firebird to be a solid database for development across a variety of platforms.

But keep Lorenzo Alberton’s cheat sheet handy, and bookmark Stefan Heymann’s reference site.

REST for the Weary

Those of you with a technical background may have noticed a close correspondence between the Web 2.0 principles I described in our design series and Representational State Transfer, or REST. This is no coincidence. Gordon’s been a backer of RESTful approaches to web application design for some time now; I’m a more recent convert. More importantly, the REST architectural pattern fit what we were trying to do with our Infovark project.

REST is a design pattern used to create Internet applications. It’s been growing in popularity, but hasn’t been fully adopted by any of the major vendors yet. (Microsoft’s efforts to lump REST into the Windows Communication Foundation notwithstanding.) This is probably due to the fact that the World Wide Web Consortium, or W3C, put its weight behind an earlier, competing design philosophy called SOAP. (SOAP used to stand for Simple Object Access Protocol, but the “simple” part was dropped long ago.)

SOAP was designed to help loosely connected computer systems communicate with each other. Many previous frameworks and standards had attempted to do the same thing, but with so many different hardware and software vendors building systems, most were doomed to fail. SOAP is likely to stick around for a while due to its close association with Web Services and Service Oriented Architectures. As a practical matter, however, the class of problems that SOAP solves is actually rather limited. Strike that; the class of problems that only SOAP can solve is rather limited. For most applications, there’s an easier web services alternative: REST.

An Illustration

Pardon me while I geek out for a moment.

When I was a kid, me and my friend Rajeev would dial each other up — yes, literally dial each other — with our 300 baud modems. I know, I know, you young ‘uns are thinking, “What’s baud? What’s a modem?” Suffice it to say that it was a slow way to get two computers to talk with each other. And when I say slow, I mean S… L… O… W. You could literally watch the letters appear one by one in your monochrome terminal window. If you can imagine sending a message via Twitter one letter at a time, you’ve got the idea.

It was so painfully slow that there really wasn’t much point in sending messages back and forth. Other than the nerd-cool factor of making two computers located in different parts of town communicate, there wasn’t much to do. So Rajeev and I hit upon an idea. We’d play a game online. Being nerds, we naturally picked Chess.

Chess was actually a great application for modem-to-modem communication. There was a well-known initial starting state in the traditional arrangement of the pieces. There was an established protocol: white moves, then black moves. And there was even a short messaging format: chess notation.

So we started playing Chess online. Each of us kept a small chess board by the computer. We slowly took turns typing out our moves to each other and updating our game boards: “P-K4”, “P-K4”, “Kt-KB3”, “Kt-QB3” and so on. Not exactly riveting entertainment, but hey, we were doing something new and different.

Every now and then, we’d run into a problem. I’d get a message from Rajeev with a nonsensical move, or he’d get a message from me that moved a nonexistent piece. Then one or the other of us would see these letters slowly print across the screen: P… I… C… K… U… P… T… H… E… P… H… O… N… E.

We’d then try to figure out what had happened, based on the log of all the messages that went back and forth and the current position of the pieces on each of our game boards. Sometimes we were able to figure out the mistake. Sometimes we agreed to go back to the last time we picked up the phone to reconcile our respective chess boards. Sometimes we started over.

As you can imagine, this sort of troubleshooting got old fast. And it happened all the time. Eventually we gave up on trying to play by computer, and we just bugged our Moms to drive us over so we could play using the same board.

REST in a Nutshell

The point of the story above is not to establish my geek cred, but to offer an analogy.

In the early days of network computing, bandwidth was low, latency was high, and it was vitally important to make your messages as concise as possible. A wide variety of message formats and protocols evolved to respect the limits of early computers and networking technologies, all designed to get as much useful data packed into as little space as possible. It’s exactly like the chess notation Rajeev and I used. We could have sent snapshots of the chess board after each turn, but it would have taken hours to transmit a single move that way. All we really needed to know was which piece needed to move where. Using the shorthand, a single move — one procedure — could be described in just a few letters.

Though network computing technology has come a long way since then, the most common way for computers to talk with each other is still via Remote Procedure Calls, or RPC. Rather than describe the entire gameboard, computers just tell each other how to move the pieces.

If you’re wondering how computers handle mistakes or lost transmissions, well, a staggering amount of effort in computer science has focused on error detection and correction algorithms and secure transaction processing. Believe me, the last thing your credit card company and your bank want to do is pick up the phone to work out whose set of accounts is more accurate.

This is why data replication is such a huge problem. If you only transmit the moves to each other, you have to start from a known initial state. The chessboard at Rajeev’s house and my house had to match at the start of the game. It’s also why synchronization is a big deal. If the moves are sent out of order, all sorts of problems occur.
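
Here is a rough Python sketch of that failure mode. It is a toy model, not a real chess engine or network protocol: the board is just a dictionary of squares, the squares use modern algebraic names (e2, e4) rather than the descriptive notation from the story, and apply_move is a helper invented for the example.

```python
def apply_move(board, move):
    """Return a new board with the piece moved from one square to another."""
    source, target = move
    updated = dict(board)
    updated[target] = updated.pop(source)
    return updated


# Both players start from the same known state (only three pieces shown).
START = {"e2": "white pawn", "e7": "black pawn", "g1": "white knight"}
moves = [("e2", "e4"), ("e7", "e5"), ("g1", "f3")]

# My board receives every move in order.
my_board = START
for move in moves:
    my_board = apply_move(my_board, move)

# Rajeev's modem drops the second message, so he applies only two moves.
rajeev_board = START
for move in [moves[0], moves[2]]:
    rajeev_board = apply_move(rajeev_board, move)

boards_agree = my_board == rajeev_board
```

After one lost message the two boards disagree (my black pawn is on e5, his is still on e7), and nothing in the later moves reveals that. Each message only makes sense on top of all the ones before it.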

If, on the other hand, you could send pictures of the game board back and forth, a seasoned player could probably reassemble the images in something close to the right order. Better yet, if both players could look at the same board at the same time, then you’d never get out of synch.

This is the essence of the REST architectural pattern. It’s a little less respectful of network resources, but by transmitting the current state of the game at any point in time, you can reduce the amount of work you need to do to get two players to agree. Most of the transaction issues and handshaking protocols become unnecessary. The World Wide Web (hypertext over HTTP) works in a RESTful way, and it’s the most successful computer application ever built.
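
As a rough sketch of the state-transfer idea, assume a single shared board that players read and replace whole. BoardResource and its get and put methods are hypothetical names that loosely mirror HTTP GET and PUT; this is a toy in-memory model, not a web service.

```python
class BoardResource:
    """Hypothetical shared resource holding the entire board."""

    def __init__(self, squares):
        self._squares = dict(squares)

    def get(self):
        # Hand out a copy of the full current state.
        return dict(self._squares)

    def put(self, squares):
        # Replace the state wholesale; repeating the same put is harmless.
        self._squares = dict(squares)


shared = BoardResource({"e2": "white pawn", "e7": "black pawn"})

# White moves: read the whole board, change it, write the whole board back.
state = shared.get()
state["e4"] = state.pop("e2")
shared.put(state)
shared.put(state)  # a duplicate delivery changes nothing

# Rajeev missed every message so far; one read catches him up completely.
rajeev_view = shared.get()
```

Each message is bigger than a single move, but it carries everything needed to understand it. A lost or repeated message no longer leaves the two players permanently out of step.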

Enterprise 2.0 is about applying the lessons from the Web to the enterprise, so it makes sense that we should start with the core design principles.

Welcome Underground

Thanks for stopping by!

The Infovark Underground is a new blog where Dean and I can unleash our inner nerds and share some of the technology and experiences we run into as we build Infovark. (Yes, we call our product Infovark. It’s got the same name as our company, because we’re all about making things easy to remember and share.)

Unlike our Infovark blog, which details what we’re doing, the underground will get into much more technical detail about how we’re doing it — discussing programming, development and tools.

If that sounds like the kind of stuff you might be interested in, feel free to add our feed to your readers!