Connection Sharing with CURL in PHP: How to re-use HTTP connections to knock 70% off REST network time.

Jun 18 2012

The PHP CURL library is the most robust library for working with HTTP (and other protocols) from within PHP. As a maintainer of the HPCloud-PHP library, which makes extensive use of REST services, I've been tinkering around with ways of speeding up REST interactions.

What I've found is a way to cut off nearly 70% of the processing time for a typical usage scenario. For example, our unit tests used to take four minutes to run, and we're now down to just over a minute, while our Drupal module's network time has been cut by over 75%.

This article explains how we accomplished this with a surprisingly simple (and counter-intuitive) modification. The typical way to work with CURL is to use the curl_init() call to create a new CURL handle. After suitable configuration has been done, that CURL handle is typically executed with curl_exec(), and then closed with curl_close(). As might be expected, this builds a new connection for each CURL handle. If you create two handles (calling curl_init() twice, and executing each handle with curl_exec()), the unsurprising result is that two connections to the remote server are created.

But for our library, a pattern emerges quickly: Many requests are sent to the same servers. In some cases, several hundred requests may got to the same server, and even the same URL (though with different headers and bodies). If we use the method above of creating one CURL handle per request, the network overhead for each request can really slow things down. This is compounded by the fact that almost all request are done over SSL, which means each request has not only network overhead, but also SSL negotiation overhead.

This is hardly a new problem, and HTTP has several methods for dealing with this. Unfortunately, CURL, as I've used it above, cannot make use of any of these. Why not? Because each CURL handle does its own connection management. But there were hints in the PHP manual that there may be ways to share connections. And when looking at CURL's raw output, I could see it leaving connections open for re-use. But how could I make use of those?

Reuse a CURL handle

The first method was to call curl_init() once, and then call curl_exec() multiple times before calling curl_close(). This method is described (sparsely) in a Stack Overflow discussion.

I gave this method a try, but immediately ran up against issues. While I suspect that this method works for simple configurations, our library is not simple. It makes deep use of CURL's configuration API, passing input and output streams around, and conditionally setting many options depending on the type of operation being performed. We use GET, HEAD, POST, PUT, and COPY requests, sometimes in rapid succession. Sometimes we provide only scant data to the server, while other times we are working with large objects. Re-using the same CURL handle did not work well in this situation. While it is easy to set an option, it is not possible to unset or reset an option.

After trying several methods of resetting options, I forwent this approach and began digging again.

CURL Multi is not just for parallel processing

The hint that changed everything came from this entry in the CURL FAQ:

"curl and libcurl have excellent support for persistent connections when transferring several files from the same server. Curl will attempt to reuse connections for all URLs specified on the same command line/config file, and libcurl will reuse connections for all transfers that are made using the same libcurl handle. When you use the easy interface, the connection cache is kept within the easy handle. If you instead use the multi interface, the connection cache will be kept within the multi handle and will be shared among all the easy handles that are used within the same multi handle. "

It took me a moment to realize that the easy interface was curl_exec, but once I caught on, I knew what I needed to do.

The CURL multi library is typically used for running several requests in parallel. But as you can see from the FAQ entry above, it has another virtue: It caches connections. As long as the CURL multi handle is re-used, CURL connections will automatically be re-used as long as possible.

This method provides the ability to set different options on each CURL handle, but then to run each CURL handle through the CURL multi handler, which provides the connection caching. While this particular chunk of code never executes requests in parallel, CURL multi still provides a huge performance boost.

A quick test of this revealed instant results. Running a series of requests that took 14 seconds on the original configuration took only five seconds with CURL multi. (How does all of this compare to the built-in PHP HTTP Stream? It took 22 seconds to run the same tests, and it takes over seven minutes to run the same batch of tests that takes CURL multi 1.5 minutes.)

An Example in Code

While the HP Cloud library is object oriented, here is a simple procedural example that shows (basically) what my starting code looked like and what the finished code looked like.

Initially, we were using a simple method of executing CURL like this:

    <?php
    function get($url) {
      // Create a handle.
      $handle = curl_init($url);

      // Set options...

      // Do the request.
      $ret = curl_exec($handle);

      // Do stuff with the results...

      // Destroy the handle.
      curl_close($handle);
    }

    ?>

While our actual code does a lot of options configuring and then does a substantial amount with $handle after the curl_exec() call, this code illustrates the basic idea.

Refactoring to make use of CURL multi, the final code looked more like this:

    <?php
    function get2($url) {
      // Create a handle.
      $handle = curl_init($url);

      // Set options...

      // Do the request.
      $ret = curlExecWithMulti($handle);

      // Do stuff with the results...

      // Destroy the handle.
      curl_close($handle);

    }

    function curlExecWithMulti($handle) {
      // In real life this is a class variable.
      static $multi = NULL;

      // Create a multi if necessary.
      if (empty($multi)) {
        $multi = curl_multi_init();
      }

      // Add the handle to be processed.
      curl_multi_add_handle($multi, $handle);

      // Do all the processing.
      $active = NULL;
      do {
        $ret = curl_multi_exec($multi, $active);
      } while ($ret == CURLM_CALL_MULTI_PERFORM);

      while ($active && $ret == CURLM_OK) {
        if (curl_multi_select($multi) != -1) {
          do {
             $mrc = curl_multi_exec($multi, $active);
          } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        }
      }

      // Remove the handle from the multi processor.
      curl_multi_remove_handle($multi, $handle);

      return TRUE;
    }

    ?>

Now, instead of using curl_exec(), we supply a method called curlExecWithMulti(). This function keeps a single static $multi instance (again, our actual implementation is more nuanced and less... Singleton-ish). This $multi instance is shared for all requests, and doing this allows us to make use of CURL multi's connection caching.

In each call to curlExecWithMulti(), we add $handle to the $multi request handler, execute it using CURL multi's execution style, and then remove the handle once we are done.

There is nothing particularly fancy about this implementation. It is actually even more complicated than it needs to be (I eventually want to make curlExecWithMulti() be able to take an array of handles for parallel processing). But it certainly does the trick.

Using that pattern for the HPCloud PHP library, I re-ran our unit tests. The unit test run typically takes between four and five minutes to handle several hundred REST requests. But with this pattern, the same tests took under a minute and a half -- and made over 300 requests over the same connection.

We will continue to evolve the HPCloud PHP library to improve performance even more. Parallel and asynchronous processing is one performance item on the roadmap. And we have others as well. If you've got some tricks you'd like to share, feel free to drop them in the issue queue at GitHub and let us know.



comments powered by Disqus