troubleshooting.html.md.erb

---
title: Troubleshooting the IPsec Add-on for PCF
owner: Security Engineering
---

<strong><%= modified_date %></strong>

This topic provides instructions to verify that strongSwan-based IPsec works with your Pivotal Cloud Foundry (PCF) deployment and general recommendations for troubleshooting IPsec.

## <a id='verify-ipsec'></a>Verify that IPsec Works with PCF
To verify that IPsec works between two hosts, you can check that traffic is encrypted in the deployment with `tcpdump`, perform the ping test, and check the logs with the steps below.

1. Check traffic encryption and perform the ping test. Select two hosts in your deployment with IPsec enabled and note their IP addresses. These are referenced below as `IP-ADDRESS-1` and `IP-ADDRESS-2`.
	1. SSH into `IP-ADDRESS-1`.
	<pre class="terminal">$ ssh IP-ADDRESS-1</pre>
	1. On the first host, run the following, and allow it to continue running.
	<pre class="terminal">$ tcpdump host IP-ADDRESS-2</pre>
	1. From a separate command line, run the following:
	<pre class="terminal">$ ssh IP-ADDRESS-2</pre>
	1. On the second host, run the following:
	<pre class="terminal">$ ping IP-ADDRESS-1</pre>
	1. Verify that the packet type is ESP. If the traffic shows `ESP` as the packet type, traffic is successfully encrypted. The output from `tcpdump` will look similar to the following:
	<pre class="terminal">03:01:15.242731 IP IP-ADDRESS-2 > IP-ADDRESS-1: ESP(spi=0xcfdbb261,seq=0x3), length 100</pre>
1. Open the `/var/log/daemon.log` file to obtain a detailed report, including information pertaining to the type of certificates you use, and to verify an established connection exists.
1. Navigate to your Installation Dashboard, and click **Recent Install Logs** to view information regarding your most recent deployment. Search for "ipsec" and the status of the IPsec job.
1. Run `ipsec statusall` to return a detailed status report regarding your connections. The typical path for this binary: `/var/vcap/packages/strongswan-x.x.x/sbin`. `x.x.x` represents the version of strongSwan packaged into the IPsec.

If you experience symptoms that IPsec does not establish a secure connection, return to the [Installing the IPsec Add-on for PCF](./installing.html) topic and review your installation.

If you encounter issues with installing IPsec, refer to the [Troubleshooting IPsec](#troubleshoot-ipsec) section of this topic.

## <a id='troubleshoot-ipsec'></a>Troubleshoot IPsec

### <a id="ipsec-installation-issues"></a> IPsec Installation Issues

#### Symptom
Unresponsive applications or incomplete responses, particularly for large payloads

#### Explanation: Packet Loss
IPsec packet encryption increases the size of packet payloads on host VMs. If the size of the larger packets exceeds the maximum transmission unit (MTU) size of the host VM, packet loss may occur when the VM forwards those packets.

If your VMs were created with an Amazon PV stemcell, the default MTU value is 1500 for both host VMs and the application containers. If your VMs were created with Amazon HVM stemcells, the default MTU value is 9001. Garden containers default to 1500 MTU.

#### Solution
Implement a 100 MTU difference between host VM and the contained application container, using one of the following approaches:

* Decrease the MTU of the application containers to a value lower than the MTU of the VM for that container. In the Elastic Runtime tile configuration, click **Networking** and modify **Applications Network Maximum Transmission Unit (MTU) (in bytes)** before you deploy. Decrease it from the default value of 1454 to 1354.

* Increase the MTU of the application container VMs to a value greater than 1500. Pivotal recommends a headroom of 100. Run <code>ifconfig NETWORK-INTERFACE mtu MTU-VALUE</code> to make this change. Replace NETWORK-INTERFACE with the network interface used to communicate with other VMs For example:
<code>$ ifconfig NETWORK-INTERFACE mtu 1600</code>

<hr>
#### Symptom
Unresponsive applications or incomplete responses, particularly for large payloads

#### Explanation: Network Degradation
IPsec data encryption increases the size of packet payloads. If the number of requests and the size of your files are large, the network may degrade.

#### Solution
Scale your deployment by allocating more processing power to your VM CPU or GPUs, which, additionally, decreases the packet encryption time. One way to increase network performance is to compress the data prior to encryption. This approach increases performance by reducing the amount of data transferred.

### <a id="ipsec-upgrade-issues"></a>IPsec Upgrade Issues

#### Symptom

Ops Manager returns an error when you click **Apply Changes** to [Upgrading the IPsec Add-on for PCF](./upgrading.html).

#### Explanation: Consul cluster failure

When BOSH applies the IPsec Add-on the VMs in the Consul cluster, 
the VMs cannot communicate with each other, which causes the cluster to fail. 
This can result in many different errors. 
For instance, you might see an error in Ops Manager that says a Diego cell is failing. 
In this case, it is likely that the Diego cell is failing because Consul is not running. 

#### Solution

1. Scale down the number of Consul instances in Elastic Runtime to `1`. For more information, see [Scaling Elastic Runtime](http://docs.pivotal.io/pivotalcf/opsguide/scaling-ert-components.html).

1. Click **Apply Changes**. 

1. After the install succeeds, scale up the number of Consul instances to the previous value.

### <a id="ipsec-runtime-issues"></a>IPsec Runtime Issues

#### Symptom
Errors relating to IPsec, including symptoms of network partition. You may receive an error indicating that IPsec has stopped working.

For example, this error shows a symptom of IPsec failure, a failed `clock_global-partition`:

<pre class="terminal">
Failed updating job clock_global-partition-abf4378108ba40fd9a43 > clock_global-partition-abf4378108ba40fd9a43/0
(ddb1fbfa-71b1-4114-a82c-fd75867d54fc)
(canary): 	Action Failed
get_task: 	Task 044424f7-c5f2-4382-5d81-57bacefbc238
result: 	Stopping Monitored Services: Stopping service
ipsec: 		Sending stop request to Monit: Request failed,
response: 	Response{ StatusCode: 503, Status: '503 Service Unavailable' } (00:05:22)..
</pre>

#### Explanation: Asynchronous `monit` Job Priorities

When a monit stop command is issued to the NFS mounter job, it hangs, preventing a shutdown of the PCF cluster.

This is not a problem with the IPsec add-on release itself. Rather, it is a known issue with the NFS mounter job and the monit stop script that can manifest itself after IPsec is deployed with PCF v1.7.

This issue occurs when monit job priorities are asynchronous. Because the order of job shutdown is arbitrary, it is possible that the IPsec job will be stopped first. After this happens, the network connectivity for that VM goes away, and the NFS mounter job loses visibility to the associated storage. This causes the NFS mounter job to hang, and it blocks the monit stop from completing. See the [Monit job Github details](https://github.com/cloudfoundry/cf-release/blob/v231/jobs/nfs_mounter/monit) for further information.

<p class="note"><strong>Note</strong>: This issue affects deployments using CF v231 or earlier, but in CF v232 the release uses an nginx blobstore instead of the NFS blobstore. The error does not exist for PCF deployments using CF releases greater than CF v231. The error also does not apply to PCF deployments that use WebDAV as their Cloud Controller blobstore.</p>

#### Solution

1. BOSH `ssh` into the stuck instance by running one of the following commands:
  * **For Ops Manager v1.10 or earlier:** <br>
  `bosh ssh VM-INDEX`
  * **For Ops Manager v1.11 or later:** 
  `bosh2 -e BOSH-ENVIRONMENT -d DEPLOYMENT-NAME ssh VM-INDEX`

1. Authenticate as root and use the `sv stop agent` command to kill the BOSH Agent:

    <pre class="terminal">
    $ sudo su
    # sv stop agent
    </pre>

1. Run the following command to detect the missing monit job VM.
  * **For Ops Manager v1.10 or earlier:** <br>
  `bosh cloudcheck`
  * **For Ops Manager v1.11 or later:** 
  `bosh2 -e ENVIRONMENT-NAME -d DEPLOYMENT-NAME cloud-check`
   
    For example,

    <pre class="terminal">
    # bosh cloudcheck
      VM with cloud ID `vm-3e37133c-bc33-450e-98b1-f86d5b63502a' missing:
      <br>
      <span>-</span> Ignore problem
      <span>-</span> Recreate VM using last known apply spec
      <span>-</span> Delete VM reference (DANGEROUS!)
    </pre>

1. Choose `Recreate VM using last known apply spec`.

1. Continue with your deploy procedure.

<hr>

#### Symptom
* App fails to start with the following message:
<pre class="terminal">FAILED
Server error,
status code: 500,
error code: 10001,
message: An unknown error occurred.</pre>
The Cloud Controller log shows it is unable to communicate with Diego due to `getaddrinfo` failing.

* Deployment fails with a similar error message:
```
diego_database-partition-620982d595434269a96a/0 (a643c6c0-bc43-411b-b011-58f49fb61a6f)' is not running after update. Review logs for failed jobs: etcd
```


#### Explanation: Split Brain `consul`
This error indicates a “split brain” issue with Consul.

#### Solution
Confirm this diagnosis by checking the `peers.json` file from /var/vcap/store/consul\_agent/raft. If it is null, then there may be a split brain. To fix this problem, follow these steps:

1. Run `monit stop` on all Consul servers:
1. Run `rm -rf /var/vcap/store/consul_agent/` on all Consul servers.
1. Run `monit start consul_agent` on all Consul servers one at a time.
1. Restart the consul\_agent process on the Cloud Controller VM. You may need to restart consul\_agent on other VMs, as well.

<hr>
#### Symptom
You see that communication is not encrypted between two VMs.

#### Explanation: Error in Network Configuration
The IPsec BOSH job is not running on either VM. This problem could happen if both IPsec jobs crash, both IPsec jobs fail to start, or the subnet configuration is incorrect. There is a momentary gap between the time when an instance is created and when BOSH sets up IPsec. During this time, data can be sent unencrypted. This length of time depends on the instance type, IAAS, and other factors. For example, on a t2.micro on AWS, the time from networking start to IPsec connection was measured at 95.45 seconds.

#### Solution
Set up a networking restriction on host VMs to only allow IPsec protocol and block the normal TCP/UDP traffic. For example, in AWS, configure a network security group with the minimal networking setting as shown below and block all other TCP and UDP ports.

<table border='1' class='nice'>
	<caption>Additional AWS Configuration</caption>
<tr>
<th>Type</th>
<th>Protocol</th>
<th>Port Range</th>
<th>Source</th>
</tr>
<tr>
<td>Custom Protocol</td>
<td>AH (51)</td>
<td>All</td>
<td>10.0.0.0/16</td>
</tr>
<tr>
<td>Custom Protocol</td>
<td>ESP (50)</td>
<td>All</td>
<td>10.0.0.0/16</td>
</tr>
<tr>
<td>Custom UDP Rule</td>
<td>UDP</td>
<td>500</td>
<td>10.0.0.0/16</td>
</tr>
</table>

<p class="note"><strong>Note</strong>: When configuring a network security group, IPsec adds an additional layer to the original communication protocol. If a certain connection is targeting a port number, for example port 8080 with TCP, it actually uses IP protocol 50/51 instead. Due to this detail, traffic targeted at a blocked port may be able to go through.</p>

<hr>

#### Symptom
You see unencrypted app messages in the logs.

#### Explanation: `etcd` Split Brain

#### Solution
1. Check for split brain etcd by connecting with BOSH `ssh` into each etcd node: <pre class="terminal">$ curl localhost:4001/v2/members</p>
1. Check if the members are consistent on all of etcd. If a node has only itself as a member, it has formed its own cluster and developed "split brain." To fix this issue, SSH into the split brain VM and run the following commands:
	1. <pre class="terminal">$ sudo su -</p>
	1. <pre class="terminal"># monit stop etcd</p>
	1. <pre class="terminal"># rm -r /var/vcap/store/etcd</p>
	1. <pre class="terminal"># monit start etcd</p>

1. Check the logs to confirm the node rejoined the existing cluster.

<hr>

#### Symptom

IPsec deployment fails with <code>Error filling in template 'pre-start.erb'</code>

<pre class="terminal">
Error 100: Unable to render instance groups for deployment. Errors are:
   - Unable to render jobs for instance group 'consul_server-partition-f9c4b18fd83cf3114d7f'. Errors are:
     - Unable to render templates for job 'ipsec'. Errors are:
       - Error filling in template 'pre-start.erb' (line 12: undefined method `each_with_index' for #<String:0x0055e9e36d2c98>)
   - Unable to render jobs for instance group 'nats-partition-f9c4b18fd83cf3114d7f'. Errors are:
     - Unable to render templates for job 'ipsec'. Errors are:
       - Error filling in template 'pre-start.erb' (line 12: undefined method `each_with_index' for #<String:0x0055e9e36d2c98>)
</pre>

#### Explanation: Typographical or syntax error in deployment descritor YAML syntax

#### Solution
Check the deployment descriptor YAML syntax for the CA certificates entry:

<pre class="terminal">
releases:
- {name: ipsec, version: 1.0.0}

addons:
- name: ipsec-addon
  jobs:
  - name: ipsec
    release: ipsec
  properties:
    ipsec:
      ipsec_subnets:
      - 10.0.1.1/20
      no_ipsec_subnets:
      - 10.0.1.10/32  # bosh director
      instance_certificate: |
        -----BEGIN CERTIFICATE-----
        MIIEMDCCAhigAwIBAgIRAIvrBY2TttU/LeRhO+V1t0YwDQYJKoZIhvcNAQELBQAw
        ...
        -----END CERTIFICATE-----
      instance_private_key: |
        -----BEGIN EXAMPLE RSA PRIVATE KEY-----
        MIIEogIBAAKCAQEAtAkBjrzr5x9g0aWgyDEmLd7m9u/ZzpK7UScfANLaN7JiNz3c
        ...
        -----END EXAMPLE RSA PRIVATE KEY-----
      ca_certificates:
        - |
        -----BEGIN CERTIFICATE-----
        MIIEUDCCArigAwIBAgIJAJVLBeJ9Wm3TMA0GCSqGSIb3DQEBCwUAMB0xGzAZBgNV
        BAMMElBDRiBJUHNlYyBBZGRPbiBDQTAeFw0xNjA4MTUxNzQwNDVaFw0xOTA4MTUx
        ...
        -----END CERTIFICATE-----
</pre>

In the example above, the values that appear after the <code>ca_certificates</code>: key are contained within a list and are not just a single certificate.
This entry must be followed by a line starting with <code>-</code>, and ending with <code>|</code>.  The lines following this contain the PEM encoded certificate(s).

The error message shown above indicating a problem with the <code>each\_with\_index</code> method provides a hint that the <code>- |</code> YAML syntax sequence is missing.  Use this syntax even in situations where there is only one CA certificate, for example a list of one entry.

<hr>

#### Symptom
Complete system outage with no warning.

#### Explanation: IPsec Certificates Might Have Expired
Expired IPsec certificates can cause a sudden system outage.
For example, the self-signed certificates generated by the script provided in the installation instructions have a lifetime of 365 days. 
IPsec certificates expire if you do not rotate them within their lifetime.
 
#### Solution
Renew expired IPsec certificates.
To avoid future downtime due to expired IPsec certificates,
set a calendar reminder to rotate the certificates before they expire. 

For how to renew certificates, see [Renewing Expired IPsec Certificates](./renewing.html).
For how to rotate them, see [Rotating IPsec Certificates](./credentials.html).