In the previous post about speedy sites, we discussed the static assets of a site, and how providing them to users can be optimised through selective use of aggregation and HTML. Let’s move away from considering what to give the user, and instead look at how we provide it to them: the stack.

How do we define “the stack”? For the purposes of this article, the stack is everything between the user and the PHP generating the content they see. It will often be a mixture of several technologies, which I affectionately refer to as the “Speed Sandwich”, with each layer performing a specific role to improve the overall performance of the user experience, while aiming to limit complexity to a few distinct components. As with all performance tuning work, you will need a sandbox in which to try out these techniques and see what works for you. We want to give the end user the page they requested as fast as possible. So how can we achieve this?

Measuring Stack Performance

Before we even look at what makes up the stack, we need to think about benchmarking. Measuring the effect of your stack on performance is key: if we don’t measure, we cannot justify anything more than the simplest of application servers. Most stacks are a combination of many layers, each with specific tasks and benefits, but these benefits must be weighed against the complexity and latency that they themselves add. While TCP communication on the same host is fast, it is not instant (ping is a far simpler exchange between machines than HTTP traffic, and even pinging localhost still takes greater than zero time). This latency adds up (more so if the stack is distributed across many machines), so be sure to measure that the benefits of your stack are still what you believe them to be. There are many measures you can use for determining stack performance: time to first byte (TTFB) and page load time are good measures to use, but the reality is that the most important measure is the end user’s experience, so look at the HAR (HTTP ARchive) recording in your browser and see whether the user experience is being enhanced or degraded by your choices.
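As a rough illustration, a minimal sketch in PHP (using the cURL extension, against a placeholder URL) that samples connect time, TTFB and total time might look like this:

    <?php
    // Rough TTFB sampler using PHP's cURL extension; the URL is a placeholder.
    $ch = curl_init('https://www.example.com/');
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,  // keep the response body out of stdout
        CURLOPT_FOLLOWLOCATION => true,
    ));
    curl_exec($ch);

    printf(
        "connect: %.3fs, first byte: %.3fs, total: %.3fs\n",
        curl_getinfo($ch, CURLINFO_CONNECT_TIME),
        curl_getinfo($ch, CURLINFO_STARTTRANSFER_TIME),
        curl_getinfo($ch, CURLINFO_TOTAL_TIME)
    );
    curl_close($ch);

A browser’s HAR export or developer-tools waterfall gives a far richer picture than a single sample like this, but a small script of this kind is handy for repeated measurements while tweaking the stack.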

Core Principles of the Stack

Before we look at various solutions, we need to identify what we’re trying to achieve. Our choices in components will then complement these goals. What are these goals?

  1. to serve content fast
  2. to have a robust, fault-tolerant solution
  3. to ensure users feel safe and have confidence in their data
  4. to have each component deal only with the jobs (and traffic) we want it to

With these goals in mind, we can move on to consider the responsibilities we need the components of the stack to cover:

  • application server, to serve dynamic pages (in our case, with PHP)
  • SSL termination, to protect user data
  • static asset serving, for both HTTP and HTTPS traffic
  • caching, to decrease both the workload of the application server and the time a user has to wait for a request to be served.

Most modern PHP applications aimed at end users will have all of these responsibilities covered by components within the stack. The question is: which components to use, and how to arrange them in the stack?

SSL Termination

Clearly some responsibilities within the stack are immovable: if the site is going to support HTTPS traffic, then you need SSL termination, and for simplicity it should be the first layer. When we speak of SSL termination, note that in practice we are talking about SSL’s successor, Transport Layer Security (TLS), which most browsers support. SSL is an area with many considerations and strategies, including asymmetric key size, the algorithms chosen, and the CPU available to the component responsible for termination, and these factors can really hurt performance if overlooked. To learn more about this area, I strongly recommend the blog of Baptiste Assman, which covers a lot of the general principles as well as how to benchmark performance.

Before we decide on a component to cover this responsibility, we need to consider our architecture as a whole. Essentially there are two common designs: a single gateway such as an SSL load balancer, or distributed nodes as in a cloud architecture using DNS load balancing.

SSL termination - load balanced
SSL termination with the single gateway model: a load balancer supporting SSL distributes requests upstream via HTTP.

The gateway model, where SSL requests are terminated by a single component, has the advantage that the gateway communicates directly with the components behind it and can establish their state. This means it can add or remove them from the pool of nodes available to serve requests. There can be any number of nodes behind this layer awaiting requests, which are now made using HTTP rather than HTTPS. There is some overhead between the client and server during SSL handshakes, which can slow down performance; most components capable of SSL termination have an SSL session cache, which makes subsequent negotiations with a returning client faster, provided the client supports session resumption. With a single gateway, it is guaranteed that the client will request secure content through the gateway again, making it far more likely that an SSL session (or “ticket” – there is a very good overview of the relevant parlance in this article by Vincent Bernat) can be reused, and thus not require the full handshake again. Note that when looking at fulfilling responsibilities, a component doesn’t have to be a separate and distinct piece of software or hardware within the stack. If you’ve created a private SSL certificate for a local sandbox instance of Apache, then this is essentially the same arrangement: a single piece of your stack, in this case the mod_ssl Apache module, is solely responsible for all HTTPS requests.
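As a rough sketch, a single nginx gateway terminating TLS and passing plain HTTP to a pool of upstream nodes might be configured along these lines (the hostname, addresses and certificate paths are placeholders):

    upstream app_nodes {
        server 10.0.0.11:80;
        server 10.0.0.12:80;
    }

    server {
        listen 443 ssl;
        server_name www.example.com;

        ssl_certificate     /etc/ssl/example.com.crt;
        ssl_certificate_key /etc/ssl/example.com.key;

        # Cache SSL sessions so returning clients can resume rather than
        # performing the full handshake again.
        ssl_session_cache   shared:SSL:10m;
        ssl_session_timeout 10m;

        location / {
            # Upstream traffic is plain HTTP once SSL has been terminated here.
            proxy_pass http://app_nodes;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-Proto https;
        }
    }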

SSL termination - DNS round robin
Distributed SSL termination: each node is responsible for SSL termination, and load balancing for users is via DNS round robin.

In the distributed stack, each node handles SSL termination itself, and requests are load balanced by DNS. Each new DNS lookup is given a different node’s IP address, generally by round robin, so that traffic is distributed across the nodes within the pool. This generally scales well: there’s no single point of failure as in the gateway model, and it becomes much easier to scale across geographic locations. However, it is less obvious how to remove a failing node from the pool, so unless you roll your own DNS service, you are reliant on your DNS provider’s implementation to define how nodes are added to or removed from the pool. Additionally, sessions are only “semi-sticky”: most modern browsers will cache the DNS lookup and so return to the previous node, but there is no guarantee that they will. To mitigate the increased latency of creating a new SSL session ticket on a different node, there are SSL implementations that support distributing tickets to peers: Stud, the Scalable TLS Unwrapping Daemon, supports passing these tickets to other nodes via UDP, so that a returning client can resume its session regardless of which node it lands on.

Variants exist between these two models, where multiple load balancers are each responsible for SSL termination for a set pool, and the load balancers themselves are balanced by DNS round robin. Load balancing and SSL termination are not necessarily closely coupled, so the load balancing itself could be done by DNS or by simple round-robin hardware; however, at the point where you need to know something about the request, you need to have terminated SSL. Load balancing without termination has the disadvantage that sessions cannot be sticky, so nodes must either share tickets or the user will suffer the performance penalty of a handshake on every request.

There are many different implementations for terminating SSL, from a dedicated proxy such as Pound or Stud, to hardware implementations with dedicated accelerator cards, to simply handling it in the Web server. Each solution has its own benefits and drawbacks, so there is no one-size-fits-all implementation. We’ll be looking at the benefits of using SSL for all requests (and the SPDY protocol) in a future article.

Static Assets

Unless the Web application is purely an API, there will undoubtedly be static assets that need to be served to the user. Common lore immediately says “use a Content Delivery Network!”, but before running off to find a CDN provider, consider a few things first:

If your site serves HTTPS traffic, then gaining the full padlock icon within the browser requires ALL assets to be served over HTTPS; otherwise end users may be warned that the page contains insecure items. At the very worst, the site may become unusable if pages rely on JavaScript and the browser blocks those scripts for being insecure and/or served from a separate domain. While there are workarounds, it is a situation that can be avoided entirely if the assets themselves are also available via HTTPS. Many CDN providers support serving assets via HTTPS, but pricing for this varies wildly, so keep it in mind when shopping around.

It is always worth considering whether a CDN is needed at all. The rationale for using one is largely to do with providing assets to users in separate geographical locales: massively beneficial if you are hosting a site in London with a large user base in São Paulo, but less useful if the remotest user of the site is in Watford. There’s another excellent techportal article on multi-region deployments which will help you work out whether this would help your application.

A simple and expedient method of providing assets under HTTPS, while still gaining the benefits of serving them from a separate domain, is simply to have them served by your standard nodes. Many companies have “vanity domains” that simply redirect traffic to the main domain; these could be used to serve assets (so, for example, serving assets for foo.org from foo.org.uk), while requests for the base URL still redirect browsers to the main site. Again, whether or not to shard content onto separate domains is not a straightforward decision (when we discuss SPDY later, this will become clearer), because browsers differ in how many parallel connections they will open per host.

Another factor to consider with domain sharding is cookies: these are sent only with requests to the domain they were set for and its subdomains (so, in our example, cookies for foo.org would not be sent to foo.org.uk but would be sent to www.foo.org). With some mobile devices only having upload speeds similar to dial-up, the size of the request itself can be as important as the server response, so it may be better to serve assets from a domain where cookies won’t be sent and aren’t needed.

Caching Dynamic Content

While PHP is sufficiently performant for serving dynamic content, it can become a bottleneck for your users. This is not a weakness in PHP: taking the average Web page, even the fastest super-accelerated C application would be orders of magnitude slower than serving the same content statically. Add in the near-ubiquitous need to communicate with a persistence layer, and at high load the application can grind to a halt. There are numerous in-application points that we can cache with a variety of tools such as memcached, Redis or APC, but to use them we’ve already hit the application layer. Enter reverse proxies.

There are a variety of vendors for reverse proxy solutions, but they all essentially work in the same way: the servers generating dynamic content sit behind (often described as “upstream of”) the proxy, and requests are transparently passed to them. If the response from the upstream server is deemed cacheable by the proxy, then it stores a copy (generally in memory) and subsequent requests receive the cached response directly from the proxy cache.
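The simplest way of telling the proxy what it may keep is with standard HTTP caching headers emitted by the application. A minimal sketch might look like the following (the lifetimes are arbitrary, and render_news_listing() is a hypothetical view function):

    <?php
    // Mark this response as shareable: any reverse proxy in front of us may
    // cache it for two minutes, while browsers revalidate after 30 seconds.
    header('Cache-Control: public, max-age=30, s-maxage=120');
    header('Vary: Accept-Encoding');

    echo render_news_listing(); // hypothetical view-rendering function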

Not only can a reverse proxy save application latency (in that users get content faster), but load on the application itself will also be reduced. While the majority of queries to a persistence layer are generally reads (and so lighter than write operations), they can still overwhelm databases, Redis clusters, and in-memory caches. If your application is written in a way that is “friendly” to reverse proxies, then not only are these issues alleviated, but the application can actually be improved.

A well-planned integration with a reverse proxy can allow the application to be brought down briefly without the majority of users noticing, because the site is being served entirely by the proxy. More ambitious integrations can use the advanced features of reverse proxies that support Edge Side Includes (ESI) to effectively “widgetise” the site into separate parts, each with its own lifetime. This allows a page to consist of a mixture of cached and uncached content, without the whole page having to be treated as uncacheable.

Tripping Up Reverse Proxies

One of the common pitfalls with reverse proxies is that, unless told otherwise, they generally treat requests that carry cookies as uncacheable, regardless of whether the application server needs those cookies to produce the response. One of the first things to configure in a reverse proxy is therefore a set of rules to ignore all but the most essential cookies, keeping as a minimum the one used by your application to hold the user’s session ID. However, because sessions are a handy place to store transient data, they can be misused. Needless session creation means that slow pages cannot be cached, which harms the user experience; yet in many cases sessions are absolutely necessary to ensure user Y sees content designated for them and not private content for user X (a common occurrence when first experimenting with caching is seeing a welcome message with the name of the first developer to log in).

Taking a simplified sketch of a common session implementation, a front controller that starts a session on every request, the problem becomes clear:
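    <?php
    // Simplified sketch of a front controller: a session is started on
    // every request...
    session_start();

    // ...even though most pages only read from it, for example to decide
    // whether to show a login link or the visitor's name.
    if (isset($_SESSION['user_name'])) {
        echo 'Welcome back, ' . htmlspecialchars($_SESSION['user_name']);
    } else {
        echo '<a href="/login">Log in</a>';
    }

Every response now sets a session cookie, so every subsequent request carries one, and the reverse proxy will refuse to cache any of those pages, even for the majority of visitors who never log in.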

There are a few basic ways to fix this issue:

  • Create sessions only when you have to.
  • Create sessions lazily, so that they are only created when data is set in them.
  • Delete sessions if they contain no data.

An updated sketch along the following lines doesn’t automatically create a session cookie:
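    <?php
    // Only resume a session that already exists (the cookie name defaults to
    // PHPSESSID); never create one as a side effect of simply reading from it.
    if (isset($_COOKIE[session_name()])) {
        session_start();
    }

    if (isset($_SESSION['user_name'])) {
        echo 'Welcome back, ' . htmlspecialchars($_SESSION['user_name']);
    } else {
        echo '<a href="/login">Log in</a>';
    }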

In this snippet, by checking for the existence of the session cookie and only firing up session_start() if it exists, we avoid the side effect of creating a session when we only want to check a variable.

The Application Server

Whatever your personal taste in application server, the core performance gains come from bytecode optimisation and output minification. The Doctrine documentation contains a great quote from Stas Malyshev, a PHP core contributor:

“If you care about performance and don’t use a bytecode cache then you don’t really care about performance. Please get one and start using it.”

Whether the bytecode cache is APC, XCache, or Zend OPcache in newer versions of PHP, or perhaps the application server is HipHop and the code is precompiled, these solutions can improve performance enormously. Poorly planned include_once and require_once operations can mean a lot of stat calls against the disk, so it is good practice to resolve include paths logically. One approach is simply to run composer with the --optimize-autoloader flag in production, which turns your PSR-0 classloader into a class map. Another is to write your own autoloader that, after resolving a class, caches the resolved path, or even populates a file of includes to be used by the application.
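As a rough illustration of the latter approach, a sketch of an autoloader that memoises resolved paths in APC might look like this (the cache key prefix is arbitrary, and the path resolution is deliberately naive):

    <?php
    // Sketch: cache resolved class paths in APC so the include path is only
    // searched once per class per server, rather than on every request.
    spl_autoload_register(function ($class) {
        $key  = 'classpath:' . $class;   // arbitrary cache key prefix
        $path = apc_fetch($key, $success);

        if (!$success) {
            // Naive PSR-0 style resolution against the include path.
            $path = stream_resolve_include_path(
                str_replace(array('\\', '_'), '/', $class) . '.php'
            );
            apc_store($key, $path);
        }

        if ($path !== false) {
            require $path;
        }
    });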

The final part that will help performance is minification: the bytes-on-the-wire savings from taking verbose HTML output and removing whitespace, comments and optional elements can be considerable, especially for mobile users. Whether this is done as an output filter by the application, by the application server, or further downstream is a matter of preference; however, if it happens behind a caching layer then the overhead is again reduced, as the minified content will be cached and presented to the user without further optimisation being required. We’ll be looking at one possible solution, Google PageSpeed, in the next article.
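As a rough illustration of the output-filter approach in the application itself, a deliberately naive buffer callback that strips comments and collapses whitespace between tags might look like this (real minifiers are far more careful around <pre>, <textarea> and inline scripts):

    <?php
    // Naive HTML minification as an output buffer callback; a sketch only.
    ob_start(function ($html) {
        // Drop HTML comments, but leave IE conditional comments alone.
        $html = preg_replace('/<!--(?!\[if).*?-->/s', '', $html);
        // Collapse runs of whitespace between tags down to a single space.
        $html = preg_replace('/>\s+</', '> <', $html);
        return $html;
    });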

Laying Out The Stack

After identifying all the responsibilities we want our stack to cover, the temptation is to jump straight to particular solutions, but before we look at those, we need to think about the order of our components.

The “Speed Sandwich”: the layers of the stack, from SSL termination at the front to the application server at the back.

As already mentioned, some things simply have to happen in a certain order. For our stack it is a fairly obvious sandwich, with SSL termination on the downstream side and the application server on the upstream side. Whether the layer immediately behind SSL is a Web server or a caching layer depends on several factors: if the entirety of your static content is served by a CDN, all of it is referenced by domain-sharded markup, and you see no need for any post-processing, then there is no advantage in putting the Web server in front of the caching layer. However, if you are looking to do pre- or post-processing, support SPDY, or serve some static elements, then having a Web server in front of the caching layer is a must. A caching layer really should only be concerned with dynamic content, so having the reverse proxy behind a Web server makes sense, as it will only receive requests that, by their very nature, are for content the Web server cannot serve itself. Additionally, maintaining the caching layer is easier, as its rule sets need not be concerned with static assets at all. Readers who have used the Varnish reverse proxy will recognise the sort of block shown below (a simplified example in Varnish 3 syntax):
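    sub vcl_recv {
        # Simplified example (Varnish 3 syntax): strip cookies from requests
        # for static assets so they are always served from the cache.
        if (req.url ~ "\.(css|js|png|gif|jpg|jpeg|ico|woff)$") {
            unset req.http.Cookie;
            return (lookup);
        }
    }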

Blocks like these are no longer needed if the static assets are served in front of the cache and only dynamic content can reach the caching layer. Here’s a sketch of an nginx configuration in which PHP pages, and content not present in the docroot, are passed upstream (here assumed to be a caching layer listening locally on port 6081):
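    server {
        listen 80;
        root /var/www/static;

        # Anything that exists in the docroot is served directly; everything
        # else falls through to the caching layer upstream.
        location / {
            try_files $uri @dynamic;
        }

        # PHP is never executed or served here; it is always passed upstream
        # (this assumes the caching layer is listening on port 6081).
        location ~ \.php$ {
            proxy_pass http://127.0.0.1:6081;
        }

        location @dynamic {
            proxy_pass http://127.0.0.1:6081;
        }
    }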

With the above Web server in front of the caching layer, you can be fairly certain that the traffic reaching the reverse proxy is for dynamic content. Additionally, we can do some interesting things with post-processing to preserve caching, and this is a topic we’ll return to as part of this article series.

What’s next?

Hopefully by now you have a clearer picture of the responsibilities that need to be covered by the application stack, and an idea of where in the user request-response flow these need to be addressed. We asked earlier which components we should use to address these; the answer is very much down to your own needs and preferences. So, in the next group of articles about performance, we’ll look through some possibilities for the application layer, caching layer, and Web layer, starting with Apache and mod_pagespeed.