Ten steps towards native devops for your Python 3 application

The word devops was coined to unify the roles of tech-ops and development. Since then we've seen the appearance of tools which claim to automate devops tasks.

I have encountered unnecessary complexity in these 'solutions'. They're not always a good match for the Python ecosystem. They add new failure modes which aren't caught in testing because as tooling they fall outside our development cycle.

So I'd like to suggest that Python developers start designing for devops. We need to own the tasks of installation and configuration. We need to test the way they work in the same way we test the rest of our code.

With the features now available in Python 3, there has never been a better time to review how we do this. So here are my ten suggestions for getting there.

1. Make deployment a use case

Any code which is regularly modified and deployed needs to be designed with that in mind.

Upgrading software is more than just delivering packages. If there is a change to the schema or the business logic, data migrations become necessary. And migrations are risky operations; they require testing. So for those reasons:

  • Devops code goes in version control
  • Devops code specifies its dependencies
  • Devops code is packaged
  • Devops code has a release number

Sound familiar? It's time to apply some rigour to the software we use to deploy and maintain our systems.

I'm of the opinion now that any deployed Python project should be a namespace package. With PEP 420, Python 3.3 makes namespace packages easier than ever to define. You should make the devops functionality of your project a subpackage of your namespace and treat it as part of your product.
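As an illustration only (these names are hypothetical, not a prescription), such a project might be laid out with the devops functionality alongside the product code:

myorg/                  # PEP 420 namespace package: no __init__.py here
    product/            # the application itself
        __init__.py
    ops/                # the devops subpackage, released with the product
        __init__.py
        deploy.py       # a Change on Command module like the one in section 5
        doc/            # source for the ops manual described in section 2
setup.py

The point is that deploy.py and the manual get the same version control, packaging and release number as everything else.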

If devops where you work is performed by sysadmins, you will need to begin educating them about the standards of quality you expect. There are ways to accommodate contributors to a namespace project, as I described recently.

2. Write an ops manual in Sphinx

Whilst we aspire to automation, it's inevitable in a changing environment that ad-hoc tasks are necessary.

It's a simple matter to jot down these commands in a text file. It's even better to maintain a proper operations manual under version control which can be compiled by Sphinx.

In tricky situations it's reassuring to have clear instructions to follow. Sphinx can help you organise your devops notes into an orderly set of processes, complete with syntax highlighting and hyperlinks to code modules.

So if your company dumps all its tech-ops snippets on a wiki (what happens when that VPN goes down?) you might like to float the idea of an offline manual, defined as reStructuredText in your devops package, and maintained in version control.

3. One file defines the deploy

No matter how you organise your computing assets, ultimately your configuration of them is expressed as a bunch of attributes which apply to nodes or groups of nodes. These attributes parameterise the scripts you run to manage them.

It helps very much if those attributes are all in one place, and can be verified easily by the human eye.

The config file should contain data, not code. That means no logic at all; at most some simple variable substitution. You should be able to evaluate and view the parameters of any node from the command line.

I find .ini style files easy to read, and they are well supported by the Python standard library. For me, this is an advantage over alternatives like JSON (less easy on the eye) and YAML (needs an external library).
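As a sketch of what I mean (the file name, section names and keys here are invented for illustration), the standard library's configparser will read such a file and let you print a node's attributes from the command line:

import configparser
import sys

# nodes.ini is a hypothetical deployment file: one section per node,
# with shared values in [DEFAULT] and only simple interpolation allowed.
#
# [DEFAULT]
# sshd.port = 22
#
# [web01]
# sshd.ip = 192.0.2.10
# admin = deploy
# bundle = bundle
# venv = app-venv
# op_key = web01

cfg = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation())
cfg.read("nodes.ini")

node = sys.argv[1]              # e.g. python3 show_node.py web01
ncd = dict(cfg[node])           # a flat dictionary of that node's attributes
for key, value in sorted(ncd.items()):
    print("{:<12} {}".format(key, value))

A dictionary like this ncd is what the later snippets assume has been read from the config file.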

4. No big bangs

Don't be tempted by tools which promise you hands-free operation. There's no magic about them; they are made of software too, and when they break you will wish you understood how they worked.

Whenever the configuration process is inaccessible to you, it is out of your control. And you are the guy who's supposed to have control.

Beneath the veneer of one-click deployment should be an ordered sequence of steps which you understand very well. You should be able to halt the process of configuration of a node at any point you wish and continue it by hand.

5. Change on command

Most devops tools either favour ad-hoc modifications to systems (where you make isolated changes in support of correct operation) or a convergent model (hands-free mutation until a reference state is reached).

I'd like to suggest a third approach, which I'll call Change on Command. A ConC script is code which has access to the business logic of the application it delivers. It defines and performs a sequence of operations which will result in transition to a new working configuration of that application.

ConC is different from ad-hoc because it is part of the application codebase and it is tested as part of the release.

To show how simple this can be, I'll sketch out a basic ConC module for your project's devops subpackage.

We will use Holger Krekel's library execnet to do the remote invocation. This elegant little package has been around for a while, but fully supports Python 3. Its purpose is to run code in a Python interpreter on a remote machine and send back results. It is all we need to create our own devops framework.

We'll begin by importing the modules these snippets rely on, then defining some simple classes for control and reporting.

import collections
import getpass
import ipaddress
import logging
import os
import shutil
import subprocess
import sys
import time
import venv

import execnet

class Host(object):

    local = "local"
    remote = "remote"

class Status(object):

    ok = "OK"
    blocked = "BLOCKED"
    failed = "FAILED"
    stopped = "STOPPED"
    error = "ERROR"
    timedout = "TIMED OUT"

Job = collections.namedtuple("Job", ["host", "op"])

... and make a list of jobs we need to do. In this case, it's delivering and installing some Python packages. Each job's op attribute is a function or a function object. We'll go into some of the more interesting ones later.

jobs = [
    Job(Host.remote, open_bundle),
    Job(Host.remote, create_venv),
    Job(Host.local, open_bundle),
    Job(Host.local, mount_SSHFS),
    Job(Host.local, copy_product),
    Job(Host.remote, UnTar("setuptools-*.tar.gz")),
    Job(Host.remote, SetupInstall("setuptools-*")),
    Job(Host.remote, UnTar("pip-*.tar.gz")),
    Job(Host.remote, SetupInstall("pip-*")),
    Job(Host.remote, PipInstall("SQLAlchemy-0.8.1.tar.gz")),
    Job(Host.local, unmount_SSHFS),
    Job(Host.remote, close_bundle),
    Job(Host.local, close_bundle),
]
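The lower-case entries (open_bundle, create_venv and so on) are plain functions which appear in the sections that follow. The capitalised entries (UnTar, SetupInstall, PipInstall) are function objects: instances of classes whose __call__ takes the same (ncd, sudoPass) arguments. They aren't spelled out in this article, so here is a minimal sketch of how UnTar might look, assuming the archive has already been copied into the bundle directory and that Status is the class defined above:

import glob
import os
import tarfile

class UnTar(object):
    """Unpack the first archive in the bundle matching a glob pattern."""

    def __init__(self, pattern):
        self.pattern = pattern

    def __call__(self, ncd, sudoPass):
        bundle = os.path.expanduser(os.path.join("~", ncd["bundle"]))
        try:
            archive = glob.glob(os.path.join(bundle, self.pattern))[0]
            with tarfile.open(archive) as tar:
                tar.extractall(path=bundle)
        except (IndexError, OSError, tarfile.TarError):
            return Status.failed
        else:
            return Status.ok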

Execnet lets us invoke a Python script twice at the same time; once on our workstation and once on the remote node. The two running modules know which is which and can send data to each other in the form of Python primitives.

The script on our workstation runs under the name __main__. We'll set up some logging and grab a reference to this module we're running.

if __name__ == "__main__":
    logging.basicConfig(
    format="%(asctime)s %(levelname)-7s %(host)-10s %(name)-10s %(message)s")

    module = sys.modules[__name__]

Throughout this example, let's assume we have read the configuration file and that the data for a particular node is in the dictionary ncd.

...
    user = ncd["admin"]
    host = ipaddress.IPv4Address(ncd["sshd.ip"])
    port = ncd["sshd.port"]
    keyPath = os.path.expanduser(
        os.path.join("~", ".ssh", "id_rsa-{}".format(ncd["op_key"])))

It's not a good idea to store passwords in plain text, so we'll prompt for those interactively.

...
    sudoPass = getpass.getpass("Enter sudo password for {}:".format(user))

Here's the execnet bit. We create an execution group and launch the same module via ssh on to the remote node's Python 3 interpreter.

...
    execGroup = execnet.Group()
    gw = execGroup.makegateway(
        "ssh=-i {} -p {} {}@{}//python=python3".format(
        keyPath, port, user, host))

    channel = gw.remote_exec(module)

Then we'll send that node's configuration data and the password required for the superuser. After this setup, we'll call a loop which runs our jobs in order.

...
    channel.send(ncd)
    channel.send(sudoPass)

    rv = work_loop(channel, ncd, sudoPass)
    sys.exit(rv)

The work loop visits each job in sequence. If it's a local task, it gets invoked on our workstation. If not, we send the index of the job over the channel to the remotely operating module. Then we wait for a response.

def work_loop(chan, ncd, sudoPass):
    rv = 0
    for n, job in enumerate(jobs):
        lgr = logging.getLogger(
            getattr(job.op, "__name__", job.op.__class__.__name__))

        host = ncd["host"] = (ncd["name"] if job.host == Host.remote
                        else "localhost")

        if job.host == Host.remote:
            chan.send(n)
            m, status = chan.receive()
        else:
            m, status = n, job.op(ncd, sudoPass)

        lgr.info(status, extra={"host": host})
        if status not in (Status.ok, Status.stopped):
            rv = 1
            break
    return rv

Remember that the same module is running on the node, and that the list of jobs is defined there too. When it starts up remotely, it should receive the configuration data and the sudo password. Execnet runs the remote module under the name __channelexec__.

if __name__ == "__channelexec__":
    ncd = channel.receive()
    sudoPass = channel.receive()

Then we enter a loop, awaiting the instruction to run a job. When we've invoked the defined operation, we return the result back across the channel.

...
    while True:
        n = channel.receive()
        job = jobs[n]
        status = job.op(ncd, sudoPass)
        channel.send((n, status))

That's all we need to define an ordered sequence of operations which can be coordinated between a local and a remote node.

6. Lock operations with the bundle

When we're deploying software, we have to transfer our files to the remote node. But it's usually not just one file. If our project has dependencies, we'll need to supply them too; they are our vendor packages. I call this collection of packages the bundle.

So the first job of a deployment is to create a directory on the node to hold those files. And actually, this is a useful thing to do even if there are no files to transfer at all. The bundle directory acts as a lock, telling us that the node is in the process of reconfiguration.

Here's the first function referenced in the job list. It simply creates a directory in the home path of the user. Remember, this is a job which runs on both the local and the remote node.

def open_bundle(ncd, sudoPass):
    try:
        locn = os.path.expanduser(os.path.join("~", ncd["bundle"]))
        os.mkdir(locn)
    except OSError:
        return Status.blocked
    else:
        return Status.ok

And here's the function we use to remove the bundle. It's always the last thing we do:

def close_bundle(ncd, sudoPass):
    try:
        locn = os.path.expanduser(os.path.join("~", ncd["bundle"]))
        shutil.rmtree(locn, ignore_errors=False)
    except OSError:
        return Status.failed
    else:
        return Status.ok

7. SSHFS simplifies delivery

With the bundle in place, we can start to move our files across. I always used to do this with scp, but I've recently discovered another solution which is much neater: sshfs.

SSHFS works over SFTP. You don't have to install anything extra on the node, but you'll want to put the sshfs package on your workstation. Then you can mount and unmount the remote bundle with a single command.

The advantage of this is that if you have to pause the deploy for any reason, you can then manually modify the bundle while it is mounted. It's your bridge to the node while the deploy is under way.

Here's the function I use to mount the bundle locally:

def mount_SSHFS(ncd, sudoPass):
    keyPath = os.path.expanduser(
        os.path.join("~", ".ssh", "id_rsa-{}.pub".format(ncd["op_key"])))
    sshCommand = "ssh_command='ssh -i {keyPath}' ".format(keyPath=keyPath)

    node = ipaddress.IPv4Address(ncd["sshd.ip"])
    port = ncd["sshd.port"]
    tgt = os.path.expanduser(os.path.join("~", ncd["bundle"]))
    cmd = ("sshfs -o {ssh_command}"
          "-p {port} {admin}@{node}:{bundle} {tgt}").format(
          admin=ncd["admin"], bundle=ncd["bundle"], node=node, port=port,
          ssh_command=sshCommand, tgt=tgt)

    p = subprocess.Popen(cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            shell=True)
    out, err = p.communicate()
    if out:
        return Status.failed
    else:
        return Status.ok

The unmounting operation is a subprocess call of fusermount -u ~/bundle.
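For completeness, here's a sketch of that unmount job written in the same style as the other operations (the original only describes it, so treat the details as an assumption):

def unmount_SSHFS(ncd, sudoPass):
    tgt = os.path.expanduser(os.path.join("~", ncd["bundle"]))
    p = subprocess.Popen(
            ["fusermount", "-u", tgt],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
    out, err = p.communicate()
    return Status.failed if p.returncode else Status.ok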

8. Use the venv module

Apart from the obvious syntax changes, the most exciting developments in Python 3 for me are those which reinforce its role as a web language. I sneaked in some use of the new ipaddress module earlier on. The proposal for pluggable event loops is very welcome too, and I'm looking forward to seeing it in Python 3.4.

The standard library now has a module, venv, for creating isolated Python environments (virtualenvs). This really means they are now the officially sanctioned way of deploying your app.

Here's a function which creates the virtual environment on a remote node.

def create_venv(ncd, sudoPass):
    then = time.time()
    time.sleep(0.2)
    locn = os.path.expanduser(os.path.join("~", ncd["venv"]))
    bldr = venv.EnvBuilder(
            system_site_packages=False,
            clear=True)
    bldr.create(locn)
    if os.path.getmtime(locn) > then:
        return Status.ok
    else:
        return Status.failed

9. Keep build tools out of production

From time to time I hear that I should be using Debian packages to deploy my code. In case you think that's true, here's a perfect example of why it's not.

If you install the Ubuntu packages python-pip or python-setuptools, they will pull in python-dev, and after several minutes you will discover you have the entire gcc toolchain on your production server.

This is not what we want.

Rather, you should include source packages of setuptools and pip in your bundle. Install them into your virtualenv with setup.py install.
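This is what the SetupInstall and PipInstall function objects from the job list in section 5 are there to do. The article doesn't spell them out, so here is a hedged sketch of SetupInstall, assuming UnTar has already unpacked the source tree into the bundle and that the virtual environment from section 8 lives under the home directory:

import glob
import os
import subprocess

class SetupInstall(object):
    """Run setup.py install from an unpacked source tree, using the venv's python."""

    def __init__(self, pattern):
        self.pattern = pattern

    def __call__(self, ncd, sudoPass):
        bundle = os.path.expanduser(os.path.join("~", ncd["bundle"]))
        python = os.path.expanduser(
            os.path.join("~", ncd["venv"], "bin", "python"))
        try:
            src = glob.glob(os.path.join(bundle, self.pattern))[0]
        except IndexError:
            return Status.failed
        p = subprocess.Popen(
                [python, "setup.py", "install"],
                cwd=src,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT)
        out, err = p.communicate()
        return Status.ok if p.returncode == 0 else Status.failed

Because the command runs the interpreter inside the virtual environment, the package lands there and nothing is installed system-wide.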

So long as your application is pure Python it can be installed from a source package. You should prefer this way over eggs, since their days are numbered in the Python 3.4 timeframe. In 2014 we should begin to see adoption of the new wheel format instead.

10. Write once, test everywhere

I mentioned testing earlier on. Testing is the greater part of Engineering. Here are some of the levels of testing we need to be aware of:

  • Unit tests
  • Integration tests
  • Migration tests
  • Functional tests
  • Load tests
  • Monitoring

The unittest module is useful in many of these scenarios, although confusion sets in when the 'unit tests' each take several minutes to run. I try to acknowledge the purpose I'm using unittest for when I write a module, so if it's really a functional test I'll write:

import unittest as functest

Remember execnet? If it can execute a module on a node, it can run a test on a node.

import unittest as nodetest

I've started to do this a lot recently and it's very liberating. You can run checks against a newly paved node, or on a regular basis as part of a credentialed scanning regime.

Here's a quick example, which is useful because it's the first to make use of the sudo password. It happens to check that an iptables module has been loaded with the correct options.

class FirewallChecks(nodetest.TestCase):

    .
    .

    def test_xt_recent_module_loaded(self):
        p = subprocess.Popen(
                ["sudo", "-S", "cat",
                "/sys/module/xt_recent/parameters/ip_pkt_list_tot"],
                stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL)
        out, err = p.communicate("{}\n".format(self.sudoPass).encode("utf-8"))
        val = out.decode("utf-8").strip()
        self.assertEqual("35", val)

It's a moderately simple task to write a test runner which will send the results back down the execnet channel to a controlling host.
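For instance, under the test module's __channelexec__ branch (the dictionary keys and the way sudoPass reaches the test case are my assumptions, not the article's runner), such a runner might look like this:

if __name__ == "__channelexec__":
    # Assumes the elided parts of FirewallChecks arrange for sudoPass,
    # e.g. the controlling host sends it and it is attached to the class.
    loader = nodetest.TestLoader()
    suite = loader.loadTestsFromTestCase(FirewallChecks)
    result = nodetest.TestResult()
    suite.run(result)

    # Send back primitives only; execnet channels carry simple Python types.
    channel.send({
        "testsRun": result.testsRun,
        "failures": [str(test) for test, tb in result.failures],
        "errors": [str(test) for test, tb in result.errors],
    })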

Notes

The patch required to apply ssh options for port numbers and key files is not yet available in execnet trunk. But you can apply it at run-time with this code:

def patch_execnet():
    """ Patch execnet to accept extended ssh arguments """
    import execnet
    import execnet.gateway_io

    def ssh_args(spec):
        remotepython = spec.python or "python"
        args = ["ssh", "-C"]
        args.extend(spec.ssh.split())
        remotecmd = '{} -c "{}"'.format(
            remotepython, execnet.gateway_io.popen_bootstrapline)
        args.append(remotecmd)
        return args

    execnet.gateway_io.ssh_args = ssh_args
    return execnet

Thanks to Holger Krekel and everyone who develops execnet.

Commercial

I am currently available for consulting work. You can hire me through Thuswise Ltd.