Open Infrastructure Technical Overview¶
The OpenDev system administration team strives to run the services behind the OpenDev Collaboratory as an open source project; we term this open infrastructure.
Our infrastructure is code and contributions to it are handled just like the rest of OpenDev. This means that anyone can contribute to the installation and long-running maintenance of systems without shell access, and anyone who is interested can provide feedback and collaborate on code reviews. There are no permissions or special privileges required to contribute to the OpenDev infrastructure project.
Below is a short guide to the major pieces of the project. Some knowledge of Zuul job configuration, Ansible, interaction with the Gerrit code-review system and general Linux administration is assumed; however, expertise is not required.
Operating environment¶
The OpenDev production systems run on resources (compute, network, storage) donated by companies that support the project.
Our standard production system is based on the latest Ubuntu LTS release.
Production systems are deployed by Ansible. Most production applications run from containers; some are custom built and others we use unmodified from upstream sources.
Zuul handles the testing and deployment of all changes. Current trends would refer to this as a gitops model – all production changes are ultimately driven by a change proposed to the code-review system. This means we do not have bespoke production systems and any modifications we make are reviewed by peers and logged with change history.
We have a bastion host, or bridge, which is a static host with permissions to deploy to the production systems. Zuul will run Ansible on the production systems via this host to deploy new changes into production.
Getting started - CI¶
The configuration of every system operated by the OpenDev sysadmins is managed by Ansible and driven by continuous integration and deployment by Zuul. This is almost exclusively driven by code kept in the system-config repository.
All system configuration should be encoded in that repository so that anyone may propose a change in the running configuration to Gerrit.
Any change to the OpenDev infrastructure system is first proposed as a review to this repository at review.opendev.org, where the current open reviews can also be seen.
Zuul will first run CI on all incoming changes. Each service generally has its own CI job that runs when relevant files (configuration, Ansible roles, playbooks, etc.) are updated. These are generally called system-config-run-<service>; Zuul will post a comment when the change has been tested, or you can see in-flight testing at the status page.
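As a rough illustration of how such a job might be scoped to the files it covers, the sketch below defines a hypothetical system-config-run-example job. The service name, parent job, node names, node label and file paths are placeholders rather than entries copied from system-config; only the job, parent, nodeset and files keys are standard Zuul job attributes.

```yaml
# Hypothetical job definition; names, labels and paths are illustrative only.
- job:
    name: system-config-run-example
    parent: system-config-run          # assumed name for a shared base job
    description: |
      Deploy the example service onto ephemeral test nodes.
    nodeset:
      nodes:
        - name: bridge99.opendev.org   # ephemeral stand-in for the bastion host
          label: ubuntu-noble          # assumed node label
        - name: example99.opendev.org  # ephemeral stand-in for a production host
          label: ubuntu-noble
    # Zuul only runs the job when a change touches files matching these regexes.
    files:
      - ^playbooks/service-example\.yaml$
      - ^playbooks/roles/example/
      - ^testinfra/test_example\.py$
```

With a matcher like this, a change that only edits documentation would not trigger the job, while a change to the example role would.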
These jobs are crafted to replicate production as closely as possible. Reading the job definitions in system-config: zuul.d/system-config-run.yaml will give you a feel for the hosts that are set up with each job. When you view the job results in the Zuul UI, you will see many logs collected from a number of hosts that simulate the production environment. This has all the information you generally need to debug problems, but the best place to start is with the artifacts tab, which has some curated links to useful overviews.
One of the job artifacts is the ARA report. This is a graphical view of the nested Ansible run on the (ephemeral) bastion host against the (ephemeral) production-test nodes. This is generally the first stop for finding deployment issues.
Another artifact is the testinfra results. Testinfra allows us to define unit-test-like behaviour to test functionality such as service and API status, correct deployment of users and files and other interesting details. Failures here would indicate that the deployment steps worked, but some part of the operation of that system is not as we expect. The testinfra code driving this is kept in system-config: testinfra and test files are named for the service they test.
Finally there is a screenshots artifact, which is a link to a directory that some tests populate with image files. Tests that are bringing up interactive services will use a headless browser to take shots of important pages to verify correct operation.
The logs tab has links to the raw logs; this collects much more detail such as syslog, Apache logs, database dumps, etc. Once you have identified the general problem from the above steps, these logs provide the in-depth details for further analysis.
Playbooks and roles¶
The starting point for all services is generally the playbooks and roles kept in system-config: playbooks/. Most playbooks are named service-<name>.yaml and will indicate from their naming which production areas they drive.
During testing, these same playbooks are run against the test nodes. Note that the testing hosts are given names that match the group configuration in the jobs defined in system-config: zuul.d/system-config-run.yaml.
These playbooks are usually small and they call out to roles where most of the work is done. Roles are kept in system-config: playbooks/roles/. These roles are written to be as generic as possible, but they are not expected to be used outside the OpenDev production deployment system.
These playbooks and roles are the same for CI and deployment.
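A service playbook in this style might look something like the sketch below. The file name, the example group and the example role are all hypothetical, and the real playbooks may use different host patterns or include additional roles; the point is only that the playbook itself stays small and delegates to roles.

```yaml
# playbooks/service-example.yaml (hypothetical file name and contents)
- hosts: example          # inventory group this playbook drives
  name: Configure the example service
  roles:
    # Most of the work lives in roles under playbooks/roles/.
    - example
```

Because the playbook targets a group rather than named hosts, the same file can drive ephemeral test nodes in CI and real hosts in production.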
Hosts and variables¶
The playbooks above run on groups of hosts which are defined in system-config: inventory/service/groups.yaml.
The production hosts are kept in an inventory at system-config: inventory/base/hosts.yaml. In CI, the inventory is generated by Zuul (as it is allocating ephemeral nodes from the testing pool).
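To make the relationship between groups and hosts more concrete, here is a minimal sketch assuming a groups file that maps each group name to a list of hostname patterns. The exact schema used by system-config is not shown in this document, so treat the keys, pattern syntax, group names and host names below as assumptions.

```yaml
# Hypothetical group definitions; the real file's schema may differ.
groups:
  example:
    - example[0-9]*.opendev.org   # assumed pattern matching the service hosts
  backup:
    - example[0-9]*.opendev.org   # also place these hosts in the backup group
```

Under these assumptions, a CI node named example99.opendev.org would fall into the same group as a production host named example01.opendev.org, which is how one set of playbooks could target both.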
Public production and testing variables are kept under system-config: inventory/. The one difference between CI and production is secrets such as API keys, tokens and passwords; in production the nested Ansible will populate these variables for the deployment directly from values stored on the bastion host. In CI, dummy values should be populated into the templates under system-config: playbooks/zuul/templates/.
Production secrets are currently managed manually by OpenDev administrators on the bastion host.
Deployment¶
After review and approval of a change, Zuul will perform final gate testing and merge the change on your behalf.
Just as uploading a new change triggers Zuul to run CI tests in the check pipeline, and approving a change triggers Zuul to run gate tests and merge in the gate pipeline, the merge of a change triggers Zuul to run the deployment jobs in the deploy pipeline.
These jobs are named infra-prod-<service> and run the same playbooks and roles as in the CI system, except against the production services. Zuul will deploy the merged changes to the bastion host, and then trigger the bastion host to run a nested Ansible deployment against the production host.
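The three pipelines described above can be pictured with a Zuul project stanza along the following lines. This is a simplified sketch: the job names are hypothetical, and the real project configuration attaches many more jobs and options.

```yaml
# Hypothetical project stanza showing one service in each pipeline.
- project:
    check:
      jobs:
        - system-config-run-example      # CI when a change is uploaded
    gate:
      jobs:
        - system-config-run-example      # final testing before merge
    deploy:
      jobs:
        - infra-prod-service-example     # runs after merge, via the bastion host
```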
Since the production run logs may leak sensitive information, they are not published openly. You can add a GPG public key to system-config: playbooks/zuul/roles/encrypt-logs/defaults/main.yaml and then ensure the infra-prod-<service> production job has your name in its encrypt_logs_job_recipients variable. Once approved and committed, you will then be able to view the encrypted production log output provided via the Zuul build page for the production run.
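Assuming encrypt_logs_job_recipients is a list of names set as a job variable, the change to a production job might look roughly like this. The job name and recipient are placeholders, and the format of the key entry in the role defaults is not reproduced here because this document does not specify it.

```yaml
# Hypothetical fragment; only the variable name comes from the text above.
- job:
    name: infra-prod-service-example
    vars:
      encrypt_logs_job_recipients:
        - your_name   # assumed to match the name stored with your GPG public key
```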
Containers¶
Most services are containerised. When looking at the system-config-run-* and infra-prod-* jobs you may see dependencies on container build/upload/promote jobs; this indicates we have jobs that build a bespoke container for this environment.
The base Dockerfile for these containers is found under system-config: docker/. Most are straightforward, but some of the more complicated services have multiple steps and layers. Any changes to the Dockerfile will be tested as usual, and when approved the containers will be rebuilt, published and pulled onto the production systems automatically.
Certificates¶
We provision SSL certificates from Let’s Encrypt; see Let’s Encrypt Certificates.
DNS¶
DNS for opendev.org (and some other domains) is also handled through the review system; see the https://opendev.org/opendev/zone-opendev.org/ project.
Backups¶
Any host in the backup group will have backups to two geographically distinct locations set up by the deployment infrastructure. See the borg-backup role for details on including or excluding various data.
Remote access¶
Hosts are only configured by Ansible, but they can be set up for interactive access if required.
- Add your public key to system-config: inventory/base/group_vars/all.yaml and include a stanza like this in your server host_vars:

      extra_users:
        - your_user_name

See SSH Access for details on keys.
Documentation¶
Each service should have an RST file with documentation about the server and services in system-config: doc/source/.
Submitting Changes¶
If you are not familiar with submitting changes to Gerrit, you can start with any of the various developer guides such as
https://docs.opendev.org/opendev/infra-manual/latest/gettingstarted.html
https://docs.openstack.org/doc-contrib-guide/quickstart/first-timers.html
https://docs.opendev.org/opendev/infra-manual/latest/developers.html
The change description is very important and is the major source of historical information. A developer should be able to read the description of a change and have enough context to understand why it was introduced. Comments in the code-review system are useful for understanding the deeper history of each change, but each change should stand alone once committed. Only the most trivial, completely self-evident changes (e.g. typo fixes) would be expected to have less than a few sentences of context in their change log.
Lifecycle¶
We welcome all changes and contributions to the project.
Before starting work to deploy a new service that will require resources, you should do some preparation work. Putting an item on the weekly team meeting agenda is always welcome. Logs of previous meetings can be seen at https://meetings.opendev.org/#OpenDev_Meeting. More complicated changes may justify going through the spec process; see https://opendev.org/opendev/infra-specs. If the existing admins are aware of the details before reviews start appearing, the process is much smoother.
All preliminary work can be done in an iterative fashion using the CI jobs at your own pace. The #opendev IRC channel on OFTC is a good place to find help during this process. Alternatively, questions are welcome on the service-discuss list.
This change (or changes) will be reviewed and may take a few rounds before final approval (in Gerrit terms, a +2 vote). Most changes will receive a few -1 votes from reviewers during development. This is really just a flag to note that some further discussion is required; it is not a rejection.
You can set Workflow to -1 in Gerrit on changes you are working on, or some developers like to put [WIP] at the front of their change description to indicate to reviewers that they probably shouldn’t spend much time on the change yet, as it is still being worked on.
Small, stand-alone sequential changes are encouraged, and Zuul makes
testing such “stacks” of changes trivial.
We currently have admins manually deploy production virtual machines, storage attached to those machines and secrets to the bastion host. This will need to happen before changes are put into production. Discussion with the admins will help decide on the cloud provider, the VM size and storage, and other such matters.
Once resources are allocated and the new host is available in the inventory, the production jobs can deploy. After this the service moves into a maintenance phase; changes can be proposed and, after review, deployed.