:title: Open Infrastructure Technical Overview

.. _opendev-infra-overview:

Open Infrastructure Technical Overview
######################################

The OpenDev system administration team strives to run the services
behind the OpenDev Collaboratory as an open source project; we term
this *open infrastructure*.  Our infrastructure is code and
contributions to it are handled just like the rest of OpenDev.  This
means that anyone can contribute to the installation and long-running
maintenance of systems without shell access, and anyone who is
interested can provide feedback and collaborate on code reviews.

There are no permissions or special privileges required to contribute
to the OpenDev infrastructure project.

Below is a short guide to the major pieces of the project.  Some
knowledge of Zuul job configuration, Ansible, interaction with the
Gerrit code-review system and general Linux administration is assumed;
however, expertise is not required.

Operating environment
---------------------

The OpenDev production systems run in resources (compute, network,
storage) provided by donations from companies who support the project.

Our standard production system is based on the latest Ubuntu LTS
release.  Production systems are deployed by Ansible.  Most production
applications run from containers; some are custom built and others we
use unmodified from upstream sources.

Zuul handles the testing and deployment of all changes.  Current
trends would refer to this as a *gitops* model: all production changes
are ultimately driven by a change proposed to the code-review system.
This means we do not have bespoke production systems and any
modifications we make are reviewed by peers and logged with change
history.

We have a *bastion host*, or *bridge*, which is a static host with
permissions to deploy to the production systems.  Zuul will run Ansible
on the production systems via this host to deploy new changes into
production.

Getting started - CI
--------------------

The configuration of every system operated by the OpenDev sysadmins is
managed by Ansible and driven by continuous integration and deployment
by Zuul.

This is almost exclusively driven by code kept in the
``system-config`` repository, which can be browsed at:

  https://opendev.org/opendev/system-config

All system configuration should be encoded in that repository so that
anyone may propose a change in the running configuration to Gerrit.

Any change to the OpenDev infrastructure system is first proposed as a
review to this repository at ``review.opendev.org``.  The current open
reviews can be seen at

  https://review.opendev.org/q/project:opendev/system-config

Zuul will first run CI on all incoming changes.  Each service
generally has its own CI job that runs when relevant files
(configuration, Ansible roles, playbooks, etc.) are updated.  These
are generally called ``system-config-run-``; Zuul will post a comment
when the change has been tested, or you can see in-flight testing at
the status page

  https://zuul.opendev.org/t/openstack/status

These jobs are crafted to replicate production as closely as possible.
Reading the job definitions in
:git_file:`zuul.d/system-config-run.yaml` will give you a feel for the
hosts that are set up with each job.

When you view the job results in the Zuul UI, you will see many logs
collected from a number of hosts that simulate the production
environment.  This has all the information you generally need to debug
problems, but the best place to start is with the *artifacts* tab,
which has some curated links to useful overviews.

One of the job artifacts is the `ARA report `__.  This is a graphical
view of the *nested* Ansible run on the (ephemeral) bastion host
against the (ephemeral) production-test nodes.  This is generally the
first stop for finding deployment issues.

Another artifact is the ``testinfra results``.
`Testinfra `__ allows us to define unit-test-like behaviour to test
functionality such as service and API status, correct deployment of
users and files and other interesting details.  Failures here would
indicate that the deployment steps worked, but some part of the
operation of that system is not as we expect.  The ``testinfra`` code
driving this is kept in :git_file:`testinfra` and test files are named
for the service they test.

Finally there is a ``screenshots`` artifact, which is a link to a
directory that some tests populate with image files.  Tests that bring
up interactive services will use a headless browser to take shots of
important pages to verify correct operation.

The logs tab has links to the raw logs; this collects much more detail
such as ``syslog``, Apache logs, database dumps, etc.  Once you have
identified the general problem from the above steps, these logs
provide the in-depth details for further analysis.

Playbooks and roles
-------------------

The starting point for all services is generally the playbooks and
roles kept in :git_file:`playbooks/`.

Most playbooks are named ``service-.yaml`` and will indicate from their
naming which production areas they drive.  During testing, these same
playbooks are run against the test nodes.  Note that the testing hosts
are given names that match the group configuration in the jobs defined
in :git_file:`zuul.d/system-config-run.yaml`.

These playbooks are usually small and they call out to roles where
most of the work is done.  Roles are kept in
:git_file:`playbooks/roles/`.  These roles are written to be as generic
as possible, but they are not expected to be used outside the OpenDev
production deployment system.

These playbooks and roles are the same for CI and deployment.

Hosts and variables
-------------------

The playbooks above run on groups of hosts which are defined in
:git_file:`inventory/service/groups.yaml`.
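
As a rough illustration of how such group definitions can look, the
sketch below assumes the file is read by an inventory plugin that maps
each group name to a list of host-name patterns.  The plugin name, the
``example-service`` group and the host patterns are assumptions made
for illustration only; consult
:git_file:`inventory/service/groups.yaml` for the real layout::

  # Hypothetical sketch; the plugin name, group name and patterns are
  # assumptions and are not copied from the repository.
  plugin: yamlgroup
  groups:
    example-service:
      # Any host whose name matches a pattern joins the group.
      - example[0-9]*.opendev.org
    backup:
      - example[0-9]*.opendev.org

With a layout like this, a playbook that targets the
``example-service`` group would run against every inventory host whose
name matches one of the listed patterns.
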
The production hosts are kept in an inventory at
:git_file:`inventory/base/hosts.yaml`.  In CI, the inventory is
generated by Zuul (as it is allocating ephemeral nodes from the testing
pool).

Public production and testing variables are kept under
:git_file:`inventory/`.

The one difference between CI and production is *secrets* such as API
keys, tokens and passwords; in production the *nested* Ansible will
populate these variables for the deployment directly from values stored
on the bastion host.  In CI, dummy values should be populated into the
templates under :git_file:`playbooks/zuul/templates/`.  Production
secrets are currently managed manually by OpenDev administrators on the
bastion host.

Deployment
----------

After review and approval of a change, Zuul will perform final gate
testing and merge the change on your behalf.

Just as uploading a new change triggers Zuul to run CI tests in the
*check* pipeline, and approving a change triggers Zuul to run gate
tests and merge in the *gate* pipeline, the merge of a change triggers
Zuul to run the deployment jobs in the *deploy* pipeline.  These jobs
are named ``infra-prod-`` and run the same playbooks and roles as in
the CI system, except against the production services.

Zuul will deploy the merged changes to the bastion host, and then
trigger the bastion host to run a *nested* Ansible deployment against
the production host.

Since the production run logs may leak sensitive information, they are
not published openly.  You can add a GPG public key to
:git_file:`playbooks/zuul/roles/encrypt-logs/defaults/main.yaml` and
then ensure the ``infra-prod-`` production job has your name in its
``encrypt_logs_job_recipients`` variable.  Once approved and committed,
you will then be able to view the encrypted production log output
provided via the Zuul build page for the production run.

Containers
----------

Most services are containerised.
When looking at the ``system-config-run-*`` and ``infra-prod-*`` jobs
you may see dependencies on container build/upload/promote jobs; this
indicates we have jobs that build a bespoke container for this
environment.

The base ``Dockerfile`` for these containers is found under
:git_file:`docker/`.  Most are straightforward, but some of the more
complicated services have multiple steps and layers.

Any changes to the ``Dockerfile`` will be tested as usual, and when
approved the containers will be rebuilt, published and pulled onto the
production systems automatically.

Certificates
------------

We provision SSL certificates from LetsEncrypt; see
:ref:`letsencrypt`.

DNS
---

DNS for ``opendev.org`` (and some other domains) is also handled
through the review system; see the ``__ project.

Backups
-------

Any host in the ``backup`` group will have backups to two
geographically distinct locations set up by the deployment
infrastructure.  See the ``borg-backup`` role for details on including
or excluding various data.

Remote access
-------------

Hosts are only configured by Ansible, but they can be set up for
interactive access if required.  Add your public key to
:git_file:`inventory/base/group_vars/all.yaml` and include a stanza
like this in your server ``host_vars``::

  extra_users:
    - your_user_name

See :ref:`ssh-access` for details on keys.

Documentation
-------------

Each service should have an RST file with documentation about the
server and services in :git_file:`doc/source/`.

Submitting Changes
------------------

If you are not familiar with submitting changes to Gerrit, you can
start with any of the various developer guides such as ::

  https://docs.opendev.org/opendev/infra-manual/latest/gettingstarted.html
  https://docs.openstack.org/doc-contrib-guide/quickstart/first-timers.html
  https://docs.opendev.org/opendev/infra-manual/latest/developers.html

The change description is very important and the major source of
historical information.
A developer is expected to be able to read the description of a change
and have enough context to generally understand why it was introduced.
Comments in the code-review system are useful to understand the deeper
history of each change, but each change should stand alone once
committed.  Only the most trivial of changes that are completely
self-evident (e.g. typo fixes) would be expected to have fewer than a
few sentences of context in their change log.

Lifecycle
---------

We welcome all changes and contributions to the project.

Before starting work to deploy a new service that will require
resources, you should do some preparation work.  Putting an item on the
`weekly team meeting agenda `__ is always welcome.  Logs of previous
meetings can be seen at ``__.  More complicated changes may justify
going through the spec process; see ``_.  If the existing admins are
aware of the details before reviews start appearing it makes the
process much smoother.

All preliminary work can be done in an iterative fashion using the CI
jobs at your own pace.  The ``#opendev`` IRC channel on ``OFTC`` is a
good place to find help during this process.  Alternatively, questions
are welcome on the `service-discuss list `__.

This change (or changes) will be reviewed and may take a few rounds
before final approval (in Gerrit terms, a ``+2`` vote).  Most changes
will receive a few ``-1`` votes from reviewers during development.
This is really just a flag to note that some further discussion is
required; it is not a rejection.  You can set ``Workflow`` to ``-1`` in
Gerrit on changes you are working on, or some developers like to put
``[WIP]`` at the front of their change description to indicate to
reviewers they probably shouldn't spend much time on this yet, as you
are still working on it.  Small, stand-alone sequential changes are
encouraged, and Zuul makes testing such "stacks" of changes trivial.

We currently have admins manually deploy production virtual machines,
storage attached to those machines and secrets to the bastion host.
This will need to happen before changes are put into production.
Discussion with the admins will help decide on which cloud provider,
the VM storage/size and other such matters.

Once resources are allocated and the new host is available in the
inventory, the production jobs can deploy.  After this the service
moves into a maintenance phase; changes can be proposed and, after
review, deployed.
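
For illustration, making a new host available in a YAML-format Ansible
inventory might look like the sketch below.  The host name and address
are hypothetical (the address comes from a documentation-only range),
and the real :git_file:`inventory/base/hosts.yaml` may carry additional
per-host variables that are not shown here::

  # Hypothetical entry; the host name and address are placeholders.
  all:
    hosts:
      example01.opendev.org:
        # Address Ansible connects to when deploying to this host.
        ansible_host: 203.0.113.10

An entry like this would be proposed and reviewed as a normal change,
once the admins have created the underlying resources.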