
OpenStack instances, IPs and reliability

rzzzwilson edited this page Jun 28, 2011 · 4 revisions

This page documents problems encountered deploying reliable code onto an OpenStack instance and how they were handled.

The Amazon code used as the basis for the OpenStack implementation used the Python boto module to do things like start and stop instances. Ideally, boto should be used with OpenStack too, but in the mad rush to complete I couldn't get boto to work with OpenStack. It is undoubtedly something simple that I have overlooked, but I ended up calling the command-line tools from within the Python code.

Whether we use boto or not, we probably wouldn't get around the problem of commands intermittently failing. They throw up their hands with an UnknownError message and the helpful direction to "try again". This probably isn't something introduced by using the command-line tools, as they are themselves implemented with boto.
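Since the only advice on an UnknownError is to "try again", the obvious workaround is a retry wrapper around each command-line call. The following is a minimal sketch, not the code actually used: the function name `run_with_retry`, the retry count and the delay are all hypothetical, and it assumes the failure shows up as a non-zero exit status or as the text UnknownError in the command's output.

```python
import subprocess
import time


def run_with_retry(cmd, retries=5, delay=2.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd (a list of strings), retrying on failure.

    Returns the command's stdout on success and raises RuntimeError
    after `retries` failed attempts. All defaults are illustrative.
    """
    last_error = ''
    for attempt in range(1, retries + 1):
        result = runner(cmd, capture_output=True, text=True)
        output = result.stdout + result.stderr
        # Assumption: a non-zero exit status or an UnknownError in the
        # output marks a transient failure that is worth retrying.
        if result.returncode == 0 and 'UnknownError' not in output:
            return result.stdout
        last_error = output.strip()
        if attempt < retries:
            sleep(delay)
    raise RuntimeError('%r failed after %d attempts: %s'
                       % (cmd, retries, last_error))
```

The `runner` and `sleep` parameters exist only so the retry logic can be exercised without a real cloud; normal callers would pass just the command list.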

The other problem encountered in the OpenStack environment is the difficulty in mounting an NFS share. The instance needs this to access:

  • 600GiB of MUX data
  • The data files and code supplied by the UI

The problem with an NFS share in the OpenStack environment arises because an instance doesn't automatically have a publicly accessible IP address. This means that any attempt to automatically mount an NFS share through an entry in /etc/fstab cannot succeed. It must be mounted dynamically by instance code.

Special instance code was written to:

  • Obtain a public IP
  • Associate that IP with the instance
  • Mount the NFS share

This is slightly tricky to write, both because of the UnknownError problem and because multiple instances compete for IPs and preempt each other (i.e., an IP one instance obtains can't be associated with that instance because another instance associated it first). The code is here.
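The overall shape of such preparation code can be sketched as follows. This is an illustration under stated assumptions, not the linked code: `allocate_ip`, `associate_ip` and `mount_share` are hypothetical callables supplied by the caller, and `TransientError` is a hypothetical exception standing in for both an UnknownError and a lost race for an IP.

```python
import time


class TransientError(Exception):
    """Hypothetical: an UnknownError-style failure or a lost race for an IP."""


def prepare_instance(allocate_ip, associate_ip, mount_share,
                     max_attempts=10, delay=5.0, sleep=time.sleep):
    """Obtain an IP, associate it with this instance, then mount the share.

    allocate_ip() returns an IP string; associate_ip(ip) and mount_share()
    raise TransientError on failure. Returns the associated IP.
    """
    for _ in range(max_attempts):
        try:
            ip = allocate_ip()
            # Another instance may associate this IP first. If so, start
            # over with a fresh allocation instead of retrying the same IP.
            associate_ip(ip)
        except TransientError:
            sleep(delay)
            continue
        break
    else:
        raise RuntimeError('could not associate a public IP')

    for _ in range(max_attempts):
        try:
            mount_share()
            return ip
        except TransientError:
            sleep(delay)
    raise RuntimeError('could not mount the NFS share')
```

Re-allocating after a failed association is a design choice made for this sketch; it avoids repeatedly fighting over an IP that another instance has already claimed.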

This preparation code is not bullet-proof. There is some sort of problem in the public IP system that prevents an NFS mount because the public IP is "not really public" (my interpretation, Michael knows the details). When this happens the instance will never get a successful NFS mount and it is useless.

This special instance code is run as root from /etc/rc.local. The actual simulation code is run from the ubuntu user's crontab via an @reboot directive, i.e., it runs at instance start. This gives rise to another problem - the special instance code and the simulation start together, but it can take an appreciable time for the NFS share to be mounted. The simulation needs to wait until the mount is successful. This code runs very early in bootstrap.py to wait for the mount:

# wait here until /data mounted
while not os.path.isdir(CheckMountPath):
    print('Waiting for /data mount (checking %s, sleep %ds)'
          % (CheckMountPath, CheckMountSleep))
    time.sleep(CheckMountSleep)

where CheckMountPath is the path to any top-level directory in the NFS share, and CheckMountSleep is the time to sleep between checks, typically 30 seconds.
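The loop above waits forever if the mount never happens. A variant with an upper bound on the wait might look like the sketch below. It is not part of bootstrap.py: the function name `wait_for_mount` and the `timeout_secs` default are hypothetical, chosen only for illustration.

```python
import os
import time


def wait_for_mount(check_path, sleep_secs=30, timeout_secs=1800):
    """Poll until check_path exists as a directory or the timeout expires.

    Returns True if the directory appeared, False on timeout.
    """
    deadline = time.monotonic() + timeout_secs
    while not os.path.isdir(check_path):
        if time.monotonic() >= deadline:
            return False
        print('Waiting for /data mount (checking %s, sleep %ds)'
              % (check_path, sleep_secs))
        time.sleep(sleep_secs)
    return True
```

A caller could treat a False return as the signal that the instance is useless and act on it, rather than leaving the simulation blocked indefinitely.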

Remaining problems

The problems with NFS mounts and public IPs remain.

This is not just a matter of getting a reliable NFS mount, since we have seen one case of what looks like an NFS failure (dismount) after a successful mount. The instance was running a simulation, which means a mount had occurred, but at the end of the simulation it crashed because it couldn't write to the share. Running df -h on the instance hung on the mount.

I'm inclined to ignore that one problem, though. It only happened once and I didn't have a lot of time to analyse it.

An Idea

As the system currently stands an instance may have trouble mounting the NFS share and thereby becomes useless. This is handled by:

It may be possible to automate the above process with a particularly nasty kludge. The mount_shares.py program could try a configurable number of mount requests and then:

  • send a RESTART message to the server
  • terminate the instance

The sending of the message must be tested, but it should work. The server message handling would need to be extended to handle the RESTART message, which would include enough information to find the restart information and start another instance to rerun the simulation.

The server code handling a STOP message from a successfully terminating instance could also delete the appropriate restart file.
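The instance side of this idea, as it might appear in mount_shares.py, can be sketched as follows. Everything here is hypothetical: `try_mount`, `send_restart` and `terminate` are stand-in callables for the real mount, messaging and instance-termination code, and the try count and delay are placeholder values for the configurable settings.

```python
import time


def mount_or_restart(try_mount, send_restart, terminate,
                     max_tries=5, delay=30.0, sleep=time.sleep):
    """Try the mount up to max_tries times. If every attempt fails, ask
    the server for a restart and terminate this instance.

    try_mount() returns True on success. Returns True if the mount
    succeeded, False if the restart path was taken.
    """
    for attempt in range(1, max_tries + 1):
        if try_mount():
            return True
        if attempt < max_tries:
            sleep(delay)
    # Send the message before terminating; afterwards nothing is left
    # running to send it.
    send_restart()
    terminate()
    return False
```

The RESTART message itself would still need to carry enough information for the server to find the restart information, which this sketch leaves to `send_restart`.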