Information Architecture

Information architecture is how we structure data. It’s the data and metadata we use to describe the things that matter to our systems. We also need to keep in mind that it’s not reality, or even a picture of reality. It’s a set of related models that capture some facets of reality. Our job is to choose which facets to model, what to leave out, and how concrete to be.

When you’re embedded in a paradigm, it’s hard to see its limits. Many of us got started in the era of relational databases and object-oriented programming, so we tend to view the world in terms of related objects and their states. Relational databases are good at answering, “What is the value of attribute A on entity E right now?” But they’re somewhat less good at keeping track of the history of attribute A on entity E. They’re pretty awkward with graphs or hierarchies, and they’re downright terrible at images, sound, or video.

Other database models are good at other questions.

Take the question, “Who wrote Hamlet?” In a relational model, that question has one answer: Shakespeare, William. Your schema might allow coauthors, but it surely wouldn’t allow for the theory that Kit Marlowe wrote Shakespeare’s plays. That’s because the tables in a relational database are meant to represent facts. On the other hand, statements in an RDF triple store are assertions rather than facts. Every statement there comes with an implicit, “Oh yeah, who says?” attached to it.

Another perspective: In most databases, the act of changing the database is a momentary operation that has no long-lived reality of its own. In a few, however, the event itself is primary. Events are preserved as a journal or log. Asking for the current state really means asking, “What’s the cumulative effect of everything that’s ever happened?”

Each of these embeds a way of modeling the world. Each paradigm defines what you can and cannot express. None of them are the whole reality, but each of them can represent some knowledge about reality.

Your job in building systems is to decide what facets of reality matter to your system, how you are going to represent those, and how that representation can survive over time. You also have to decide what concepts will remain local to an application or service, and what concepts can be shared between them. Sharing concepts increases expressive power, but it also creates coupling that can hinder change.

In this section, we’ll look at the most important aspects of information architecture as it affects adaptation. This is a small look at a large subject. For much more on the subject, see Foundations of Databases [AHV94] and Data and Reality [Ken98].

Messages, Events, and Commands

In “What Do You Mean by ‘Event-Driven’?”[91] Martin Fowler points out the unfortunate overloading of the word “event.” He and his colleagues identified three main ways events are used, plus a fourth term that is often conflated with events:

  • Event notification: A fire-and-forget, one-way announcement. No response is expected or used.

  • Event-carried state transfer: An event that replicates entities or parts of entities so other systems can do their work.

  • Event sourcing: When all changes are recorded as events that describe the change.

  • Command-query responsibility segregation (CQRS): Reading and writing with different structures. Not the same as events, but events are often found on the “command” side.

Event sourcing has gained support thanks to Apache Kafka,[92] which is a persistent event bus. It blends the character of a message queue with that of a distributed log. Events stay in the log forever, or at least until you run out of space. With event sourcing, the events themselves become the authoritative record. But since it can be slow to walk through every event in history to figure out the value of attribute A on entity E, we often keep views to make it fast to answer that question. See the following figure for illustration.

images/adaptation/journal-and-views.png

With an event journal, several views can each project things in a different way. None of them is more “true” than others. The event journal is the only truth. The others are caches, optimized to answer a particular kind of question. These views may even store their current state in a database of their own, as shown with the “snapshot” in the previous diagram.
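The relationship between a journal and its views can be sketched in a few lines. This is a minimal illustration, not how Kafka or any particular product works: the journal is an in-memory list, and the event fields (`seq`, `entity`, `attribute`, `value`) and the names `record`, `CurrentValues`, and `history` are all assumptions made for the example.

```python
# In-memory stand-in for an event journal. The event fields ("seq",
# "entity", "attribute", "value") are assumptions made for this sketch.
journal = []

def record(entity, attribute, value):
    """Append an event; existing events are never updated or deleted."""
    journal.append({"seq": len(journal) + 1, "entity": entity,
                    "attribute": attribute, "value": value})

class CurrentValues:
    """View that answers: what is attribute A on entity E right now?"""
    def __init__(self):
        self.snapshot = {}   # (entity, attribute) -> latest value
        self.last_seq = 0    # how far into the journal this view has read

    def catch_up(self, events):
        # Apply only events newer than the snapshot, so the view never
        # has to walk the whole history again.
        for event in events:
            if event["seq"] > self.last_seq:
                key = (event["entity"], event["attribute"])
                self.snapshot[key] = event["value"]
                self.last_seq = event["seq"]

def history(events, entity, attribute):
    """A second view: every value the attribute has ever held, in order."""
    return [e["value"] for e in events
            if e["entity"] == entity and e["attribute"] == attribute]

record("track-42", "price", 0.99)
record("track-42", "title", "Hamlet Overture")
view = CurrentValues()
view.catch_up(journal)
record("track-42", "price", 0.89)
view.catch_up(journal)

print(view.snapshot[("track-42", "price")])   # 0.89
print(history(journal, "track-42", "price"))  # [0.99, 0.89]
```

Both views are derived entirely from the journal. Either one could be thrown away and rebuilt by replaying the events from the beginning.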

Versioning can be a real challenge with events, especially once you have years’ worth of them. Stay away from closed formats like serialized objects. Look toward open formats like JSON or self-describing messages. Avoid frameworks that require code generation based on a schema. Likewise avoid anything that requires you to write a class per message type or use annotation-based mapping. Treat the messages like data instead of objects and you’re going to have a better time supporting very old formats.
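As a rough sketch of what “treat the messages like data” can look like, the reader below handles two generations of a hypothetical `price_changed` message as plain JSON. The message name, both field layouts, and the `version` field are invented for illustration. The point is only that old and new formats pass through the same small function, with no generated classes involved.

```python
import json

def read_price_changed(raw):
    """Normalize any known version of a hypothetical 'price_changed' message."""
    msg = json.loads(raw)
    version = msg.get("version", 1)   # assume the oldest messages had no version field
    if version == 1:
        # v1 (assumed layout): price as a float in dollars under "price"
        cents = round(msg["price"] * 100)
    else:
        # v2 (assumed layout): integer cents under "amount_cents"
        cents = msg["amount_cents"]
    # Fields this reader does not know about are ignored, not rejected.
    return {"item": msg["item"], "amount_cents": cents}

years_old = '{"item": "track-42", "price": 0.89}'
recent = '{"version": 2, "item": "track-42", "amount_cents": 89, "currency": "USD"}'

print(read_price_changed(years_old))  # {'item': 'track-42', 'amount_cents': 89}
print(read_price_changed(recent))     # {'item': 'track-42', 'amount_cents': 89}
```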

You’ll want to apply some of the versioning principles discussed in Chapter 14, Handling Versions. In a sense, a message sender is communicating with a future (possibly not-yet-written) interface. A message reader is receiving a call from the distant past. So data versioning is definitely a concern.

Using messages definitely brings complexity. People tend to express business requirements in an inherently synchronous way. It requires some creative thinking to transform them to be asynchronous.

Services Control Their Identifiers

Suppose you work for an online retailer and you need to build a “catalog” service. You’ll see in Embrace Plurality that one catalog will never be enough. A catalog service should really handle many catalogs. Given that, how should we identify which catalog goes with which user?

The first, most obvious approach is to assign an owner to each catalog, as shown in the following figure. When a user wants to access a particular catalog, the owner ID is included in the request.

images/adaptation/catalog-embedded-owner.png

This has two problems:

  1. The catalog service must couple to one particular authority for users. This means that the caller and the provider have to participate in the same authentication and authorization protocol. That protocol certainly stops at the edge of your organization, so it automatically makes it hard to work with partners. But it also increases the barrier to use of the new service.

  2. One owner can only have one catalog. If a consuming application needs more than one catalog, it has to create multiple identities in the authority service (multiple account IDs in Active Directory, for example).

We should remove the idea of ownership from the catalog service altogether. It should be happy to create many, many fine catalogs for anyone who wants one. That means the protocol looks more like the next figure. Any user can create a catalog. The catalog service issues an identifier for that specific catalog. The user provides that catalog ID on subsequent requests. Of course, a catalog URL is a perfectly adequate identifier.

images/adaptation/catalog-external-owner.png
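One way to picture that protocol is the sketch below. It is an in-memory toy under stated assumptions: the class name `CatalogService`, its methods, and the `catalogs.example.com` base URL are all hypothetical. What matters is that the service mints the identifier and never asks who the owner is.

```python
import uuid

class CatalogService:
    """Hypothetical in-memory catalog service with no notion of ownership."""

    def __init__(self, base_url="https://catalogs.example.com"):
        self.base_url = base_url
        self._catalogs = {}   # catalog URL -> list of items

    def create_catalog(self):
        # The service issues the identifier; the caller supplies no owner ID.
        catalog_url = f"{self.base_url}/catalogs/{uuid.uuid4()}"
        self._catalogs[catalog_url] = []
        return catalog_url

    def add_item(self, catalog_url, item):
        self._catalogs[catalog_url].append(item)

    def items(self, catalog_url):
        return list(self._catalogs[catalog_url])

service = CatalogService()

# Ownership lives with the caller, which may hold as many catalogs as it likes.
my_catalogs = {
    "main": service.create_catalog(),
    "spring-promo": service.create_catalog(),
}
service.add_item(my_catalogs["spring-promo"], "discounted turntable")
print(service.items(my_catalogs["spring-promo"]))  # ['discounted turntable']
print(service.items(my_catalogs["main"]))          # []
```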

In effect, the catalog service acts like a little standalone SaaS business. It has many customers, and the customers get to decide how they want to use that catalog. Some users will be busy and dynamic. They will change their catalogs all the time. Other users may be limited in time, maybe just building a catalog for a one-time promotion. That’s totally okay. Different users may even have different ownership models.

You probably still need to ensure that callers are allowed to access a particular catalog. This is especially true when you open the service up to your business partners. As shown in the figure, a “policy proxy” can map from a client ID (whether that client is internal or external makes no difference) to a catalog ID. This way, questions of ownership and access control can be factored out of the catalog service itself into a more centrally controlled location.

images/adaptation/catalog-policy-proxy.png
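A policy proxy of this kind might look roughly like the following. The `PolicyProxy` class, its `grant` and `fetch` methods, and the dictionary standing in for the catalog service are assumptions for illustration. A real proxy would sit on the network path and authenticate the client before consulting its grants.

```python
class PolicyProxy:
    """Hypothetical proxy mapping client IDs to the catalog IDs they may use."""

    def __init__(self, backend):
        self._backend = backend   # any callable: catalog_id -> response
        self._grants = {}         # client_id -> set of catalog IDs

    def grant(self, client_id, catalog_id):
        self._grants.setdefault(client_id, set()).add(catalog_id)

    def fetch(self, client_id, catalog_id):
        # Access control is decided here, not inside the catalog service.
        if catalog_id not in self._grants.get(client_id, set()):
            raise PermissionError(f"{client_id} may not access {catalog_id}")
        return self._backend(catalog_id)

# Stand-in for the catalog service, which knows nothing about clients.
catalogs = {"cat-1": ["guitar"], "cat-2": ["amplifier"]}
proxy = PolicyProxy(backend=catalogs.__getitem__)
proxy.grant("partner-a", "cat-1")

print(proxy.fetch("partner-a", "cat-1"))   # ['guitar']
try:
    proxy.fetch("partner-a", "cat-2")
except PermissionError as err:
    print(err)                             # partner-a may not access cat-2
```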

Services should issue their own identifiers. Let the caller keep track of ownership. This makes the service useful in many more contexts.

URL Dualism

We can use quotation marks when we want to talk about a word, rather than using the word itself. For example, we can say the word “verbose” means “using too many words.” It’s a bit like the difference between a pointer and a value. We understand that the pointer stands in as a way to refer to the value.

URLs have the same duality. A URL is a reference to a representation of a value. You can exchange the URL for that representation by resolving it, just like dereferencing a pointer. Like a pointer, you can also pass the URL around as an identifier. A program may receive a URL, store it as a text string, and pass it along without ever attempting to resolve it. Or your program might store the URL as an identifier for some thing or person, to be returned later when a caller presents the same URL.

If we truly make use of this dualism, we can break a lot of dependencies that otherwise seem impossible.

Here’s another example drawn from the world of online retail. A retailer has a spiffy site to display items. The typical way to get the item information is shown in the figure. An incoming request contains an item ID. The front end looks up that ID in the database, gets the item details, and displays them.

images/adaptation/url-dualism-1.png

Obviously this works. A lot of business gets done with this model! But consider the chain of events when our retailer acquires another brand. Now we have to get all of the acquired brand’s items into our database. That’s usually very hard, so we decide to have the front end look at the item ID and decide which database to hit, as shown in the figure that follows.

images/adaptation/url-dualism-2.png

The problem is that we now have exactly two databases of items. In computer systems, “two” is a ridiculous number. The only numbers that make sense are “zero,” “one,” and “many.” We can use URL dualism to support many databases by using URLs as both the item identifier and a resolvable resource. That model is shown in the following figure.

images/adaptation/url-dualism-3.png
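Here is a small sketch of both halves of the dualism. To keep it runnable without a network, a dictionary named `fake_web` stands in for HTTP, and `fetch` is whatever callable exchanges a URL for JSON text. The hosts, the `render_item` function, and the JSON fields are all made up for the example.

```python
import json

def render_item(item_url, fetch):
    """Resolve an item URL and render whatever representation comes back.

    The front end never inspects the URL to decide which database to hit.
    """
    item = json.loads(fetch(item_url))
    return f'{item["name"]}: ${item["price"]:.2f}'

# Stand-in for the network: two unrelated sources of item representations.
fake_web = {
    "https://items.example.com/items/1001":
        '{"name": "Turntable", "price": 129.0}',
    "https://acquired-brand.example.net/products/ab-7":
        '{"name": "Headphones", "price": 59.5, "color": "black"}',
}
fetch = fake_web.__getitem__

# Identifier use: the cart stores URLs as keys and never resolves them.
cart = {
    "https://items.example.com/items/1001": 1,
    "https://acquired-brand.example.net/products/ab-7": 2,
}

# Address use: resolve each URL only when it is time to display the cart.
lines = [f"{qty} x {render_item(url, fetch)}" for url, qty in cart.items()]
print("\n".join(lines))
```

Adding a third source means publishing representations at new URLs. Nothing in `render_item` or the cart has to change.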

It might seem expensive to resolve every URL to a source system on every call. That’s fine; introduce an HTTP cache to reduce latency.
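To show the effect rather than the mechanism, here is a deliberately naive time-based cache wrapped around a fetch function. It is only a stand-in: a real HTTP cache honors the caching headers the origin sends, which this sketch ignores. The names `cached` and `slow_fetch` and the 60-second lifetime are assumptions.

```python
import time

def cached(fetch, ttl_seconds=60.0, clock=time.monotonic):
    """Wrap `fetch` so repeat lookups of a URL within the TTL skip the source."""
    store = {}   # url -> (expires_at, body)

    def cached_fetch(url):
        now = clock()
        hit = store.get(url)
        if hit is not None and hit[0] > now:
            return hit[1]
        body = fetch(url)
        store[url] = (now + ttl_seconds, body)
        return body

    return cached_fetch

calls = []

def slow_fetch(url):
    # Pretend this is an expensive round trip to the source system.
    calls.append(url)
    return f"representation of {url}"

fetch_item = cached(slow_fetch)
first = fetch_item("https://items.example.com/items/1001")
second = fetch_item("https://items.example.com/items/1001")
print(len(calls))  # 1
```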

The beautiful part of this approach is that the front end can now use services that didn’t even exist when it was created. As long as the new service returns a useful representation of that item, it will work.

And who says the item details have to be served by a dynamic, database-backed service? If you’re only ever looking these up by URL, feel free to publish static JSON, HTML, or XML documents to a file server. For that matter, nothing says these item representations even have to come from inside your own company. The item URL could point to an outbound API gateway that proxies a request to a supplier or partner.

You might recognize this as a variation of “Explicit Context.” (See Explicit Context.) We use URLs because they carry along the context we need to fetch the underlying representation. It gives us much more flexibility than plugging item ID numbers into a URL template string for a service call.

You do need to be a bit careful here. Don’t go making requests to any arbitrary URL passed in to you by an external user. See Chapter 11, Security, for a shocking array of ways attackers could use that against you. In practice, you need to encrypt the URLs that you send out to users, using a scheme that also detects tampering, or at least sign them. That way you can verify that whatever you receive back is something you generated.
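The verification step can be sketched with a keyed signature. Note that this swaps in an HMAC rather than encryption: it proves a URL came from you but does not hide it. The token layout (`tag:url`), the function names, and the hard-coded key are assumptions for illustration only. A real key belongs in secure configuration.

```python
import hashlib
import hmac

SECRET = b"replace-with-a-real-secret"   # hypothetical key for the sketch

def _tag(url):
    return hmac.new(SECRET, url.encode("utf-8"), hashlib.sha256).hexdigest()

def issue(url):
    """Return a token to hand out to users in place of the bare URL."""
    return f"{_tag(url)}:{url}"

def verify(token):
    """Return the URL only if this system issued the token."""
    tag, _, url = token.partition(":")
    # compare_digest avoids leaking how many characters matched.
    if not hmac.compare_digest(tag.encode("utf-8"), _tag(url).encode("utf-8")):
        raise ValueError("URL was not issued by this system")
    return url

token = issue("https://items.example.com/items/1001")
print(verify(token))   # https://items.example.com/items/1001

forged = token.replace("items.example.com", "attacker.example.org")
try:
    verify(forged)
except ValueError as err:
    print(err)         # URL was not issued by this system
```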

Embrace Plurality

One of the basic enterprise architecture patterns is the “Single System of Record.” The idea is that any particular concept should originate in exactly one system, and that system will be the enterprise-wide authority on entities within that concept.

The hard part is getting all parts of the enterprise to agree on what those concepts actually are.

Pick an important noun in your domain, and you’ll find a system that should manage every instance of that noun. Customer, order, account, payment, policy, patient, location, and so on. A noun looks simple. It fools us. Across your organization, you’ll collect several definitions of every noun. For example:

  • A customer is a company with which we have a contractual relationship.

  • A customer is someone entitled to call our support line.

  • A customer is a person who owes us money or has paid us money in the past.

  • A customer is someone I met at a trade show once that might buy something someday in the future.

So which is it? The truth is that a customer is all of these things. Bear with me for a minute while I get into some literary theory. Nouns break down. Being a “customer” isn’t the defining trait of a person or company. Nobody wakes up in the morning and says, “I’m happy to be a General Mills customer!” “Customer” describes one facet of that entity. It’s about how your organization relates to that entity. To your sales team, a customer is someone who might someday sign another contract. To your support organization, a customer is someone who is allowed to raise a ticket. To your accounting group, a customer is defined by a commercial relationship. Each of those groups is interested in different attributes of the customer. Each applies a different life cycle to the idea of what a customer is. Your support team doesn’t want its “search by name” results cluttered up with every prospect your sales team ever pursued. Even the question, “Who is allowed to create a customer instance?” will vary.

This challenge was the bane of enterprise-wide shared object libraries, and it’s now the bane of enterprise-wide shared services.

As if those problems weren’t enough, there’s also the “dark matter” issue. A system of record must pick a model for its entities. Anything that doesn’t fit the model can’t be represented there. Either it’ll go into a different (possibly covert) database or it just won’t be represented anywhere.

Instead of creating a single system of record for any given concept, we should think in terms of federated zones of authority. We allow different systems to own their own data, but we emphasize interchange via common formats and representations. Think of this like duck-typing for the enterprise. If you can exchange a URL for a representation that you can use like a customer, then as far as you care, it is a customer service, whether the data came from a database or a static file.
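The duck-typing idea can be illustrated with a consumer that checks only for the attributes it needs. Everything here is hypothetical: the `NEEDED` fields, the `as_customer` function, and the sample representations. Each consuming system would choose its own list of required fields.

```python
NEEDED = ("name", "email")   # what this particular consumer needs from a customer

def as_customer(representation):
    """Accept any representation that can be used like a customer."""
    missing = [field for field in NEEDED if field not in representation]
    if missing:
        raise ValueError(f"not usable as a customer, missing: {missing}")
    # Extra attributes belong to the source system and are ignored here.
    return {field: representation[field] for field in NEEDED}

from_crm_database = {"name": "Ada Lovelace", "email": "ada@example.com",
                     "lead_score": 87}
from_static_file = {"name": "Ada Lovelace", "email": "ada@example.com",
                    "support_tier": "gold"}
incomplete = {"name": "Ada Lovelace"}

print(as_customer(from_crm_database) == as_customer(from_static_file))  # True
try:
    as_customer(incomplete)
except ValueError as err:
    print(err)   # not usable as a customer, missing: ['email']
```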

Avoid Concept Leakage

An electronics retailer was late to the digital music party. But it wanted to start selling tracks on its website. The project presented many challenges to its data model. One of the tough nuts was about pricing. The company’s existing systems were set up to price every item individually. But with digital music, the company wanted the ability to price and reprice items in very large groups. Hundreds of thousands of tracks might go from $0.99 to $0.89 overnight. None of its product management or merchandising tools could handle that.

Someone created a concept of a “price point” as an entity for the product management database. That way, every track record could have a field for its specific price point. Then all the merchant would need to do is change the “amount” field on the price point and all related tracks would be repriced.

This was an elegant solution that directly matched the users’ conceptual model of pricing these new digital tracks. The tough question came when we started talking about all the other downstream systems that would need to receive a feed of the price points.

Until this time, items had prices. The basic customer-visible concepts of category, product, and item were very well established. The internal hierarchy of department, class, and subclass was also well understood. Essentially every system that received item data also received these other concepts.

But would they all need to receive the “price point” data as well?

Introducing price point as a global concept across the retailer’s entire constellation of systems was a massive change. The ripple effect would be felt for years. Coordinating all the releases needed to introduce that concept would make Rube Goldberg shake his head in sadness. But it looked like that was required because every other system certainly needed to know what price to display, charge, or account for on the tracks.

But price point was not a concept that other systems needed for their own purposes. They just needed it because the item data was now incomplete thanks to an upstream data model change.

That was a concept leaking out across the enterprise. Price point was a concept the upstream system needed for leverage. It was a way to let the humans deal with complexity in that product master database. To every system downstream it was incidental complexity. The retailer would’ve been just as well served if the upstream system flattened out the price attribute onto the items when it published them.
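Flattening at the publishing boundary might look something like this. The record layouts, the `pp-single` identifier, and the `publish` function are invented for the sketch. The retailer’s actual schema is not described in enough detail to reproduce.

```python
# Internal model of the upstream product master: tracks point at a price point.
price_points = {"pp-single": {"amount": 0.99}}
tracks = [
    {"id": "t1", "title": "Track One", "price_point": "pp-single"},
    {"id": "t2", "title": "Track Two", "price_point": "pp-single"},
]

def publish(track):
    """Flatten the internal price point into a plain price for downstream feeds."""
    item = {k: v for k, v in track.items() if k != "price_point"}
    item["price"] = price_points[track["price_point"]]["amount"]
    return item

# The merchant reprices every related track with one change...
price_points["pp-single"]["amount"] = 0.89

# ...and downstream systems still see only items with prices.
feed = [publish(track) for track in tracks]
print(feed)
```

Merchants keep their leverage inside the product master, while the published feed has the same shape it always had.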

There’s no such thing as a natural data model. There are only choices we make about how to represent things, relationships, and change over time. We need to be careful about exposing internal concepts to other systems. It creates semantic and operational coupling that hinders future change.

Summary

We don’t capture reality; we only model some aspects of it. There’s no such thing as a “natural” data model, only choices that we make. Every paradigm for modeling data makes some statements easy, others difficult, and others impossible. It’s important to make deliberate choices about when to use relational, document, graph, key-value, or temporal databases.

We always need to think about whether we should record the new state or the change that caused the new state. Traditionally, we built systems to hold the current state because there just wasn’t enough disk space in the world. That’s not our problem today!

Use and abuse of identifiers causes lots of unnecessary coupling between systems. We can invert the relationship by making our service issue identifiers rather than receiving an “owner ID.” And we can take advantage of the dual nature of URLs, which can act either as an opaque token or as an address we can dereference to get an entity.

Finally, we must be careful about exposing concepts to other systems. We may be forcing them to deal with more structure and logic than they need.
