Troubleshooting
This guide describes how to troubleshoot Terraform Enterprise.
Health check
Terraform Enterprise provides a `/_health_check` endpoint on the instance. If Terraform Enterprise is up, the health check returns a `200 OK` response.
The `/_health_check` endpoint operates in two modes:
- Full check
- Minimal check

With a full check, the service attempts to verify the status of internal components and PostgreSQL. In contrast, a minimal check returns `200 OK` automatically after a previous full check has succeeded.
The endpoint's default behavior is to perform a full check during startup of the instance, and minimal checks after Terraform Enterprise is active and running.
To force a full check, include the query parameter `?full=1`. This parameter causes every call to make requests to internal components and PostgreSQL, increasing system load and latency. Use it sparingly.
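For example, assuming your instance is reachable at `tfe.example.com` (a placeholder hostname), you can query the endpoint with `curl`:

```shell
# Minimal check (default once the instance is active)
curl -i https://tfe.example.com/_health_check

# Force a full check of internal components and PostgreSQL
curl -i "https://tfe.example.com/_health_check?full=1"
```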
Support bundles
A support bundle is a collection of logs and other information about your installation that you can then send to HashiCorp Support for further troubleshooting.
You can generate a support bundle using the `tfectl support bundle` command. See the `tfectl` documentation for more information.
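For example, on a Docker-based installation you can run the command inside the container; the container name `terraform-enterprise` below is a placeholder for your own deployment:

```shell
# Run tfectl inside the Terraform Enterprise container to generate a bundle
docker exec -it terraform-enterprise tfectl support bundle
```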
Support bundle contents
Support bundles contain the following information:
- Logs for all Terraform Enterprise services.
- License information for the installation.
- The Terraform Enterprise environment configuration with redacted secrets.
- Additional diagnostic information about the container, such as the contents of `/etc/hosts`, disk and memory usage, and network configuration.
Logs are available within the bundle under the `host` directory. All other information is available within the `results.json` file located at the root of the bundle.
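As an illustration, assuming you have copied the bundle out of the container and that it is a gzipped tar archive (the file name shown is hypothetical), you can extract it and review `results.json` with a JSON tool such as `jq`:

```shell
# Extract the bundle and review the diagnostic data at its root
tar -xzf support-bundle.tar.gz -C ./bundle
jq '.' ./bundle/results.json

# Service logs are collected under the host directory
ls ./bundle/host
```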
Terraform Enterprise services
When you encounter application issues, understanding the role of individual services in Terraform Enterprise will help you troubleshoot.
- `archivist` - Object storage API that simplifies the service architecture and minimizes inner-network cross talk by colocating the logical storage and front-end API handler pieces.
- `atlas` - The Terraform Enterprise API.
- `atlas-ui` - The Terraform Enterprise user interface.
- `backup-restore` - A tool that provides both an API to back up and restore a Terraform Enterprise backup snapshot and a command line tool to inspect a backup snapshot. A backup snapshot is an encrypted binary file containing the Archivist data, Vault transit keys, and PostgreSQL schema dumps for a given Terraform Enterprise instance.
- `licensing` - A library and service that provides enterprise license functionality for Terraform.
- `metrics` - A Terraform Enterprise component that aggregates metrics and exposes them over HTTP and HTTPS.
- `nginx` - The NGINX reverse proxy that facilitates access to the Terraform Enterprise services.
- `outbound-http-proxy` - Security control used to filter user-controlled network traffic (for example, Sentinel imports) and prevent it from accessing internal services directly.
- `postgres` - The PostgreSQL database holds relational data, such as workspace applies and where their state is stored in object storage. An internal PostgreSQL service is started when the operational mode is `disk` on the Docker runtime. A PostgreSQL server host configuration must be provided for the application on cloud-managed Kubernetes, or for the `external` and `active-active` operational modes on the Docker runtime.
- `redis` - An in-memory database used for caching and the `sidekiq` queue. An internal Redis service is started when the operational mode is `disk` or `external`. A Redis server host configuration must be provided for `active-active` mode for the application on cloud-managed Kubernetes, or for the `external` and `active-active` operational modes on the Docker runtime.
- `registry_api` - The Terraform private module registry API.
- `sidekiq` - Background job scheduler system.
- `slug-ingress` - Listens for VCS webhooks. Packages VCS repository data as a slug and sends it to `archivist`.
- `task-worker` - A service that manages asynchronous units of work in Terraform Enterprise.
- `terraform-registry-api` - The API to the Terraform Registry.
- `terraform-registry-worker` - Processes VCS slugs and prepares modules to be published on the Terraform private module registry.
- `terraform-state-parser` - Reads Terraform state files and parses important information out of them. Terraform state is consumed from a remote state URL, and the compiled data is sent in the payload of a callback to Atlas.
- `tfe-health-check` - Checks the health of, and connections to, PostgreSQL, Redis, Vault, storage, and other components so that you and HashiCorp Support can determine exactly why Terraform Enterprise has entered an unhealthy state.
- `vault` - HashiCorp Vault, which uses transit encryption for items such as sensitive workspace variables.
Service status
To check the status of the services, execute the following command within the Terraform Enterprise container.
$ supervisorctl status
You will receive output similar to the following.
$ supervisorctl status
logs                            RUNNING   pid 39, uptime 1:38:49
postgres                        RUNNING   pid 103, uptime 1:38:48
redis                           RUNNING   pid 77, uptime 1:38:49
tfe:archivist                   RUNNING   pid 199, uptime 1:38:46
tfe:atlas                       RUNNING   pid 200, uptime 1:38:46
tfe:atlas-ui                    RUNNING   pid 201, uptime 1:38:46
tfe:backup-restore              RUNNING   pid 203, uptime 1:38:46
tfe:licensing                   RUNNING   pid 205, uptime 1:38:46
tfe:metrics                     RUNNING   pid 211, uptime 1:38:46
tfe:nginx                       RUNNING   pid 215, uptime 1:38:46
tfe:outbound-http-proxy         RUNNING   pid 220, uptime 1:38:46
tfe:sidekiq                     RUNNING   pid 238, uptime 1:38:46
tfe:slug-ingress                RUNNING   pid 248, uptime 1:38:46
tfe:task-worker                 RUNNING   pid 257, uptime 1:38:46
tfe:terraform-registry-api      RUNNING   pid 265, uptime 1:38:46
tfe:terraform-registry-worker   RUNNING   pid 280, uptime 1:38:46
tfe:terraform-state-parser      RUNNING   pid 291, uptime 1:38:46
tfe:tfe-health-check            RUNNING   pid 298, uptime 1:38:46
tfe:vault                       RUNNING   pid 309, uptime 1:38:46
tfe-next                        RUNNING   pid 40, uptime 1:38:49
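If a service is not in the `RUNNING` state, you can attempt to restart it with `supervisorctl`; the service name below (`tfe:licensing`) is only an example:

```shell
# Restart a single service and re-check its status
supervisorctl restart tfe:licensing
supervisorctl status tfe:licensing
```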
Inspecting logs
To inspect the logs for a particular service, execute the following command within the Terraform Enterprise container, where `SERVICE_NAME` is the name of a Terraform Enterprise service.
$ cat /var/log/terraform-enterprise/SERVICE_NAME.log
For example, we can see why `tfe:licensing` exited.
$ cat /var/log/terraform-enterprise/licensing.log
{"@level":"info","@message":"initializing database","@module":"tfe-licensing","@timestamp":"2023-05-10T20:46:26.379084Z"}
{"@level":"error","@message":"error opening database connection","@module":"tfe-licensing","@timestamp":"2023-05-10T20:46:26.399064Z","error":"failed to connect to `host=/var/run/postgresql user=terraform-enterprise database=`: server error (FATAL: role \"terraform-enterprise\" does not exist (SQLSTATE 28000))"}
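Because the service logs are emitted as JSON lines, a tool such as `jq` (if available in your environment) can narrow the output to error entries, for example:

```shell
# Show only error-level entries from the licensing service log
jq -c 'select(."@level" == "error")' /var/log/terraform-enterprise/licensing.log
```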
Kubernetes
Debug mode
Terraform Enterprise dispatches plans and applies via jobs when running inside Kubernetes. These jobs are removed immediately after they execute, which can make it hard to determine whether a job failed due to cluster-specific errors.
To make troubleshooting easier in these scenarios, you can keep the Kubernetes jobs alive for a limited period of time, after which the cluster garbage collects them. To enable this, provide the following environment variables to the deployment, either via the `env.variables` entry in the `values.yaml` override or via the `ConfigMap` attached to the deployment that holds all of the environment variables. A sketch of a `values.yaml` override follows the list.
- `TFE_RUN_PIPELINE_KUBERNETES_DEBUG_ENABLED`: Boolean flag to enable debug mode. Set to `true`.
- `TFE_RUN_PIPELINE_KUBERNETES_DEBUG_JOBS_TTL`: (Optional) Time in seconds after which the jobs are deleted. Defaults to `86400`.
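For example, a `values.yaml` override enabling debug mode might look like the following sketch; the TTL value shown is illustrative:

```yaml
env:
  variables:
    TFE_RUN_PIPELINE_KUBERNETES_DEBUG_ENABLED: true
    # Keep finished run jobs for one hour before the cluster garbage collects them
    TFE_RUN_PIPELINE_KUBERNETES_DEBUG_JOBS_TTL: 3600
```

With debug mode enabled, the retained jobs can be listed with `kubectl get jobs -n <namespace>` and inspected with `kubectl describe job` until the TTL expires.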
Troubleshooting common application errors
This section covers common application errors that you may run into.
Kubernetes fails to pull image
Symptom
Kubernetes pods are failing to pull the container image with a `BackOff` error.
Signals
The output of `kubectl describe pod` shows the container stuck in the `Waiting` state with the `ErrImagePull` reason.
$ kubectl describe pod terraform-enterprise-7f649f6598-2k79b
...
Containers:
  terraform-enterprise:
    State:          Waiting
      Reason:       ErrImagePull
...
Solution
Update the image pull policy for the deployment to `Always`.
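How you set the pull policy depends on how you deploy Terraform Enterprise. As a sketch, a Helm `values.yaml` override could set an `image.pullPolicy` entry; the exact key name depends on your chart version, so check your chart's values file:

```yaml
image:
  # Always pull the image instead of reusing a stale or corrupted cached copy
  pullPolicy: Always
```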
Empty S3 static credentials
Symptom
Application fails to start.
Signals
Logs show the following S3 prefix detection error.
2023-05-10T23:38:18.100Z [ERROR] terraform-enterprise: startup: error="failed detecting s3 prefix: could not list objects: operation error S3: ListObjectsV2, failed to sign request: failed to retrieve credentials: failed to refresh cached credentials, static credentials are empty"
Solution
Set `TFE_OBJECT_STORAGE_S3_USE_INSTANCE_PROFILE` to `true` when using IAM authentication for S3.
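For illustration, in an environment-variable style configuration this looks like the following; `TFE_OBJECT_STORAGE_TYPE` is shown only for context and may already be set in your deployment:

```yaml
# Illustrative object storage settings when relying on an AWS instance profile
TFE_OBJECT_STORAGE_TYPE: s3
TFE_OBJECT_STORAGE_S3_USE_INSTANCE_PROFILE: true
```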
Unknown certificate with VCS integration
Symptom
You cannot configure a VCS connection within Terraform Enterprise.
Signals
Setting up VCS fails with an `unknown certificate issuer` error.
Solution
Include the CA certificate for your VCS server in the CA bundle. Ensure that `TFE_TLS_CA_BUNDLE_FILE` is set to a path pointing to your CA bundle.
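To find the certificate chain your VCS server presents, and therefore which CA certificate needs to be in the bundle, you can inspect it with `openssl`; `vcs.example.com` is a placeholder hostname:

```shell
# Print the certificate chain presented by the VCS server
openssl s_client -connect vcs.example.com:443 -showcerts </dev/null
```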
Unknown certificate with failing Terraform runs
Symptom
Terraform plans and applies fail.
Signals
Logs for the task worker and Archivist services show an x509 error.
Solution
Include the CA certificates for all hosts that Terraform must communicate with, including your Terraform Enterprise server itself, in the CA bundle. Ensure that `TFE_TLS_CA_BUNDLE_FILE` is set to a path pointing to your CA bundle.
Unable to fetch Terraform binary
Symptom
Terraform plans and applies fail with a `failed downloading terraform` error.
Signals
Terraform run logs contain the following error.
Operation failed: failed fetching Terraform: failed downloading terraform: failed downloading "https://releases.hashicorp.com/terraform/1.3.2/terraform_1.3.2_linux_amd64.zip": GET https://releases.hashicorp.com/terraform/1.3.2/terraform_1.3.2_linux_amd64.zip giving up after 5 attempt(s): failed making temp file: open /tmp/terraform/8c23e18ed1846a552fc22ed5ee80eec9.download-67d5219a-aa5c-cd41-3262-2b9d57c1bfe2: read-only file system
Solution
Ensure that the `TFE_DISK_CACHE_PATH` location is backed by a writable volume.
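To confirm the cache location is writable from inside the container, a quick check like the following can help; it assumes `TFE_DISK_CACHE_PATH` is exported in the container environment:

```shell
# Verify that the disk cache path exists and accepts writes
df -h "$TFE_DISK_CACHE_PATH"
touch "$TFE_DISK_CACHE_PATH/.write-test" && echo "writable" && rm "$TFE_DISK_CACHE_PATH/.write-test"
```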