The first versions of most projects are self-contained applications. They work as-is, without any connection to other applications. It often isn’t until a later release that the focus shifts to interoperability: developers build import/export functionality into their applications, or add web services that allow other applications to interact with them.

While these are important steps toward application interoperability, one step is often still missing. Most interoperable applications lack one final feature that allows fully seamless integration: data sourcing, or the ability of an application to get the data it needs from elsewhere.

With data sourcing, we are not just importing data into our applications; we are using an outside system as the source of that data, without creating redundancy. A simple example is user information. Most applications have their own user table. Applications that feature data sourcing of users make it possible to tell the system to get the user data not from its internal database but from a different source, for example the database of another application, an LDAP server or a web service that provides the user data. If you have five applications that each have a database of users, it would be a lot simpler to integrate those applications if you could use one of them as the master source for the user data and configure the others to refer to it.

The principle is applicable to more than just users. Groups come to mind (particularly groups within an organization that you may want to use within your applications), and friends are another common case (aren’t you tired of befriending all your friends on every new social website?). In the case of ecommerce systems it would be great if you could use data sourcing to get the actual product data from different systems. Magento, the popular ecommerce application, has product import/export functionality, but there is no easy way to tell it to connect to a web service to get the data for products instead of looking at its own database. This makes it hard to plug Magento as an ecommerce module into a larger system; most implementations you will find have Magento at their core and other, more flexible, systems plugging into it.

Data sourcing can help applications such as Magento reach a wider audience, and in particular helps them to be used in enterprise scenarios where many components together form one big system.

The concept of data sourcing is comparable to Dependency Injection; instead of hardwiring the dependencies within the software, you tell it its dependencies so that at runtime it can connect to the correct components and get its data.

Two Flavours of Data Sourcing

When you want to use data sourcing, there are in essence two ways to do this.

Synchronisation

The first one is synchronisation. This means that the data is still local to the application, but it is (periodically or on the fly) synchronised with external applications. For the application being plugged into a system, this generally means that hardly any modifications are necessary; a script needs to be written that simply synchronises the data between the sources.

While this works in some situations, it is undesirable in most. If you have multiple applications you will end up with multiple copies of the data, and run the risk of having data out of sync between the sources. It can also lead to ownership or privacy issues. One application should be the owner of a data set; if that data set is exported to other applications then you lose a certain amount of control over it.

Federation

The better option is federation. Federation basically means that you get the data from its source at the moment you need it. You could still cache for performance reasons, but there is no mass synchronisation going on. This is the method we will be looking at in the rest of this article, as it is the most interesting one, but also the one that requires work in the applications that want to make use of it.
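As a rough illustration of fetching on demand with an optional cache, here is a minimal Python sketch. The `FederatedUsers` class, the `fetch_user` callable and the 60-second lifetime are assumptions made for this example, and `fake_remote` merely stands in for a call to the application that owns the data:

```python
import time
from typing import Callable, Dict, Tuple


class FederatedUsers:
    """Fetch users from their owning source on demand, with a short-lived cache."""

    def __init__(self, fetch_user: Callable[[int], dict], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch_user = fetch_user      # hypothetical call into the owning application
        self._ttl = ttl_seconds            # assumed cache lifetime
        self._clock = clock                # injectable so expiry can be tested
        self._cache: Dict[int, Tuple[float, dict]] = {}

    def get(self, user_id: int) -> dict:
        now = self._clock()
        cached = self._cache.get(user_id)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]               # still fresh: no remote call
        user = self._fetch_user(user_id)   # go back to the source of the data
        self._cache[user_id] = (now, user)
        return user


# Stand-in for the remote source; it records every call it receives.
calls = []


def fake_remote(user_id: int) -> dict:
    calls.append(user_id)
    return {"id": user_id, "name": f"user{user_id}"}


users = FederatedUsers(fake_remote, ttl_seconds=60.0)
users.get(42)
users.get(42)   # second lookup is answered from the cache
```

Only single records are fetched, and only when they are asked for; nothing is copied in bulk, which is the difference from synchronisation.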

Implementing Data Sourcing

Imagine you have built an application that shows you a person’s wishlist. You may have a query in there somewhere that joins the wishlist table with the user table and a few auxiliary tables with categorisation information. If the application is built like that, and you decide to install the application within an environment with multiple applications, it will be hard to make the application use user accounts from another application; you will have to not only rewrite all the queries, you will also have to find a way to connect the application data with the external user accounts. There are a few things that will help you make your application ready for data sourcing, so let’s take a look at them.

Models

If your application is set up according to the Model/View/Controller (MVC) paradigm, then your business logic is already isolated within your models. This will make it easier to source the data from outside sources. It’s a matter of taking the user model and changing that to get the users from elsewhere. Using models within an application is a recommendation that I would make regardless of whether you plan on doing data sourcing, but it is particularly helpful when abstracting data away from your core application. Here is a simple code snippet that illustrates the idea, without giving detail:
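Sketched in Python, with a stand-in for the SOAP client so that it runs without a network. The `FakeSoapClient` class, its `getUser` and `getUsers` operations and the WSDL URL are assumptions made for illustration; a real SOAP library and service would look different:

```python
class FakeSoapClient:
    """Stand-in for a real SOAP client so the sketch runs without a network.

    The getUser/getUsers operations are hypothetical; a real service
    defines its own operations in its WSDL.
    """

    def __init__(self, wsdl_url: str):
        self.wsdl_url = wsdl_url
        self._users = {42: {"id": 42, "name": "Alice"}}

    def getUser(self, user_id: int) -> dict:
        return self._users[user_id]

    def getUsers(self) -> list:
        return list(self._users.values())


class UserModel:
    """User model that asks a remote service for users instead of a local table."""

    def __init__(self, client):
        self._client = client

    def get_user(self, user_id: int) -> dict:
        # A database-backed model would run a SELECT on the user table here.
        return self._client.getUser(user_id)

    def get_users(self) -> list:
        return self._client.getUsers()


# The URL is a placeholder, not a real service.
model = UserModel(FakeSoapClient("http://example.com/users?wsdl"))
print(model.get_user(42)["name"])
```

The rest of the application keeps calling `get_user` and `get_users` and does not need to know where the data now comes from.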

In this example, the application will no longer fetch its data from the database, but use a SOAP client to retrieve the data. Of course it is very easy if the SOAP server has the exact same interface as our model, but that will rarely be the case when you’re talking to external services. If it differs, you may have to do some parameter and result set transformations, but you get the idea.

Protocol Abstraction

In the above example we are connecting to a SOAP service, which allows us to get the users from another application if it has a SOAP service, but what happens if your application is used in a scenario where there is no SOAP server, or the SOAP server uses a different set of methods? If we want to be flexible, we have to abstract the data source from our model. This is where the ‘Data Mapper’ design pattern, presented by Martin Fowler in his book Patterns of Enterprise Application Architecture, comes in handy. This pattern abstracts the data source away from the model. Our above example, adapted to include this idea, could look like this:
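A minimal Python sketch of that arrangement. `UserMapper`, `InMemoryUserMapper` and the `find` method are hypothetical names chosen for this illustration:

```python
from typing import Optional, Protocol


class UserMapper(Protocol):
    """Anything that can find users, whatever the underlying source."""

    def find(self, user_id: int) -> Optional[dict]: ...


class UserModel:
    """The model only knows the mapper interface, not the protocol behind it."""

    def __init__(self, mapper: UserMapper):
        self._mapper = mapper

    def get_user(self, user_id: int) -> Optional[dict]:
        return self._mapper.find(user_id)


class InMemoryUserMapper:
    """Trivial mapper, used here only to make the sketch runnable."""

    def __init__(self, rows: dict):
        self._rows = rows

    def find(self, user_id: int) -> Optional[dict]:
        return self._rows.get(user_id)


# The mapper is handed to the model, so swapping the source means swapping one object.
model = UserModel(InMemoryUserMapper({42: {"id": 42, "name": "Alice"}}))
```

Because the mapper is passed in rather than created inside the model, the choice of data source can be made at runtime, in the same spirit as dependency injection.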

You could then provide several data mapper implementations for all the external sources your application supports:
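For instance, a sketch with one mapper for the local database and one for a particular SOAP service. The table layout, the `fetchPerson` operation and the `personId`/`fullName` fields are invented for illustration, and `StubSoapClient` stands in for a real SOAP client:

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class UserMapper(ABC):
    """Common interface that every data source has to satisfy."""

    @abstractmethod
    def find(self, user_id: int) -> Optional[dict]: ...


class DatabaseUserMapper(UserMapper):
    """Reads users from the application's own database."""

    def __init__(self, connection: sqlite3.Connection):
        self._db = connection

    def find(self, user_id: int) -> Optional[dict]:
        row = self._db.execute(
            "SELECT id, name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return {"id": row[0], "name": row[1]} if row else None


class SomeSoapServiceUserMapper(UserMapper):
    """Maps one specific (hypothetical) SOAP service onto our user structure."""

    def __init__(self, client):
        self._client = client  # any object exposing the service's operations

    def find(self, user_id: int) -> Optional[dict]:
        # This service names its operation and fields differently from our model.
        person = self._client.fetchPerson(user_id)
        if person is None:
            return None
        return {"id": person["personId"], "name": person["fullName"]}


class StubSoapClient:
    """Stand-in for a real SOAP client, so the sketch runs offline."""

    def fetchPerson(self, person_id: int) -> Optional[dict]:
        if person_id == 42:
            return {"personId": 42, "fullName": "Alice"}
        return None


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users VALUES (42, 'Alice')")

mappers = [DatabaseUserMapper(db), SomeSoapServiceUserMapper(StubSoapClient())]
results = [mapper.find(42) for mapper in mappers]
```

Both mappers hand back the same structure, so the model cannot tell which source answered.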

Note the use of ‘SomeSoapService’ in the name and not just ‘SOAP’. SOAP is a standard, but every SOAP service can implement its own methods, so we would need a separate mapper for each service. To avoid reinventing the wheel across different SOAP services, we can derive from a general abstract SOAP class that takes care of the SOAP bits that are standard across all SOAP services. Later we will see how REST and the use of standards make this easier.

REST

REST is often a nicer fit than SOAP: it is easier to implement, has a lot less overhead and is easier to standardise. Because REST basically uses resource URLs and the HTTP verbs to operate on them, it makes data sourcing easier. For example, consider the following resource URLs:

  • http://example.com/users
  • http://example.com/users/42

The first resource is a collection of users. We can retrieve the users using an HTTP GET request, we can add a user using HTTP POST, etc. The second resource is an individual user, which we can retrieve by performing a GET request, update by performing a PUT request, delete by performing a DELETE request, etc.
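To make the pairing of URL and verb concrete, the following Python sketch builds (but does not send) three such requests with the standard library. The variable names are illustrative, and `example.com` is the placeholder host used above:

```python
from urllib.request import Request

BASE = "http://example.com/users"

# Each operation is just a verb applied to a resource URL.
list_users = Request(BASE, method="GET")             # the whole collection
get_user = Request(f"{BASE}/42", method="GET")       # one individual user
delete_user = Request(f"{BASE}/42", method="DELETE")

requests = [list_users, get_user, delete_user]
for request in requests:
    print(request.get_method(), request.full_url)
```

Passing one of these objects to `urllib.request.urlopen` would actually perform the call.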

Applying this to your application might mean that in the end, all you need to do is add some configuration into your application, rather than needing to modify code, for example:
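A hypothetical configuration, written as a Python dictionary, might look like the following. The `DATA_SOURCES` layout and the `resource_url` helper are assumptions for illustration, not an existing format:

```python
# Hypothetical configuration: which source backs each kind of data.
DATA_SOURCES = {
    "users": {"mapper": "rest", "url": "http://example.com/users"},
    "groups": {"mapper": "database"},
}


def resource_url(kind: str, item_id=None) -> str:
    """Build the collection or item URL for a REST-backed kind of data."""
    source = DATA_SOURCES[kind]
    if source["mapper"] != "rest":
        raise ValueError(f"{kind} is not sourced over REST")
    base = source["url"].rstrip("/")
    return base if item_id is None else f"{base}/{item_id}"


print(resource_url("users"))
print(resource_url("users", 42))
```

Pointing the application at a different user source then means changing one URL in the configuration.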

Your REST data mapper then performs simple REST calls to retrieve its data through these services. Of course, you would still have to know what results to expect and how to map the results to your specific implementation, so you still might end up using multiple REST data mappers. The use of finer grained standards helps here. REST makes things easy; knowing that the response will be JSON will make things even easier; knowing that the JSON response is formatted according to the OpenSocial standard makes it peanuts. Which brings us to the final topic.

Standards

Federating access to applications for centrally managed users is very common. Large organisations deal with this all the time and nowadays, with so many social networking applications, you can see many websites using it too. Services allow you to log in using your Twitter or Facebook account, for example.

Standards have been created to make it easy to work with centrally managed users. Here are a few that will be interesting for you to look at if you are looking to improve your applications in this area:

  • OpenID; a standard for having users log in using an account from one provider to access another provider. In essence it allows users to have one username/password for multiple services. Zend Devzone has a nice tutorial on OpenID to get you started.
  • OAuth; a protocol for delegated authorisation across applications. OAuth allows a user to grant an application access to particular resources held by another web site, without handing over a password. There is a library on Google Code to make it easier to integrate OAuth in your applications.
  • SAML; a more enterprise grade way to deal with authenticating users, which includes things such as identity verification and digital signatures, so services know that the person that’s logging in is actually the person they think they are dealing with. If you want to work with SAML, you will want to look at simpleSAMLphp, which implements the most important parts of the SAML protocol in a relatively easy to understand PHP wrapper.
  • OpenSocial; a standard targeted at social networking, which provides interoperability for users, groups, status updates etc. The OpenSocial PHP client on Google Code is a good starting point.

For users, groups, privileges and general user data there are plenty of standards to choose from, and all of them have PHP implementations that can help get you started. For other data you may want to develop your own protocols and interfaces, and I hope to have given you enough information to get you started building applications that can get their data from external sources. If more applications are built this way, then the web will become much more pluggable!