Scalable, secure, always-on VPN solution for managed services clients
As we expanded our customer base and extended our monitoring to more customers, managing remote access for our engineers and monitoring systems towards customer infrastructure became cumbersome.
We already employed an OpenVPN solution, permitting remote access for staff to internal resources using per-device certificates. Engineers started building bespoke solutions per customer, involving additional OpenVPN instances, PPTP VPNs, Citrix gateways, or pinholed firewall rules. Concerns were raised regarding the manageability, scalability, and security of these solutions.
Requirements
I was asked to design a solution providing the necessary remote access which was both scalable and secure. The solution was required to:
- Provide connectivity for staff into customer infrastructure without requiring additional user VPNs or credentials (i.e. existing internal authentication had to be used).
- Solve the potential problem of overlapping IP ranges across multiple customers. For example, it should be possible for authorized staff to access CustomerA’s Exchange server on 192.168.0.200 and CustomerB’s database server, also on 192.168.0.200, at the same time.
- Isolate customers from each other, and isolate our own infrastructure from customers.
- Allow for revocation of connectivity to a customer.
- Provide per-user, per-destination access controls, managed both internally and by the customer.
- Provide a high level of availability, given the solution would carry critical connectivity for monitoring and remote access.
Design
Given we already employed a well-secured OpenVPN instance for staff, the design extended it: a customer-facing OpenVPN terminator was built in an isolated network segment, and our existing internal OpenVPN server was connected to this terminator as a client. An unused /14 block of RFC 1918 private address space was routed towards the terminator.
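As a rough sketch only, the terminator’s server configuration would have centred on directives like the following. The addresses and paths are illustrative, and 10.8.0.0/14 is assumed purely because it is consistent with the 10.8.x.x examples later in this write-up.

    # Customer-facing terminator config (sketch; TLS material omitted, addresses illustrative)
    port 1194
    proto udp
    dev tun0
    topology subnet
    server 10.254.0.0 255.255.255.0        # tunnel addresses handed to connecting client instances
    route 10.8.0.0 255.252.0.0             # route the whole RVPN /14 towards the tunnel interface
    client-config-dir /etc/openvpn/ccd     # per-customer fixed addresses and routes (see below)
    keepalive 10 60
    persist-key
    persist-tun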
Managed appliances / virtual machines were deployed on each customer network, configured to connect to the OpenVPN terminator as clients. On connection, each client would advertise a number of /24 routes from the /14 block into the OpenVPN terminator, providing connectivity from staff towards subnets “behind” each client instance.
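OpenVPN itself learns which networks sit behind a particular client from server-side iroute entries rather than from anything the client asserts, so one plausible way this per-customer “advertisement” was wired up is a client-config-dir file on the terminator along these lines (the customer name and ranges are hypothetical):

    # /etc/openvpn/ccd/customerA -- selected by the CN on customerA's certificate (illustrative)
    ifconfig-push 10.254.0.21 255.255.255.0   # fixed tunnel address for this customer's instance
    iroute 10.8.76.0 255.255.255.0            # the /24 carved from the /14 and reachable behind it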
Each client OpenVPN instance was configured with iptables NETMAP rules, performing 1:1 NAT translation from the /24 range to the real customer IP range within the customer network. The 1:1 NAT made it possible for an engineer to access 10.8.76.200 (translated to 192.168.0.200 at CustomerA), and simultaneously connect to 10.8.65.200 (translated to 192.168.0.200 at CustomerB).
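A minimal sketch of those rules on a customer instance, assuming tun0 faces the terminator, eth0 faces the customer LAN, and the illustrative ranges above:

    # Forwarding must be enabled for the instance to route between tun0 and eth0
    sysctl -w net.ipv4.ip_forward=1
    # Map the RVPN /24 onto the real customer LAN 1:1, preserving the host octet
    iptables -t nat -A PREROUTING -i tun0 -d 10.8.76.0/24 -j NETMAP --to 192.168.0.0/24
    # Assumption: hide engineer source addresses behind the instance's own LAN address,
    # so customer hosts can reply without needing a route back towards the VPN
    iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE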
Using configuration directives within OpenVPN, default client-to-client connectivity was restricted. Authorized staff connectivity towards approved customer ranges was then permitted and logged using iptables rules on the OpenVPN terminator.
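Keeping OpenVPN’s client-to-client switching disabled forces inter-client traffic out to the kernel, where it can be filtered. A per-user, per-destination allow with logging on the terminator could then look something like this (the engineer and customer addresses are hypothetical):

    # Default deny between VPN clients; allow return traffic for accepted flows
    iptables -P FORWARD DROP
    iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    # One engineer (static staff-VPN address) allowed to one customer's RVPN /24, logged
    iptables -A FORWARD -s 172.16.5.17/32 -d 10.8.76.0/24 -j LOG --log-prefix "rvpn-acl: "
    iptables -A FORWARD -s 172.16.5.17/32 -d 10.8.76.0/24 -j ACCEPT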
A further set of firewall rules could optionally be applied on each customer OpenVPN client instance, providing a mechanism for customer security oversight.
A DNS naming convention was developed to map “RVPN” IP addresses to customer IPs, in the form <customer><x>.<y>.rvpn.ourdomain.com, where x is the third octet of the customer’s real IP range and y is the host octet. For example, evilcorp0.200.rvpn.ourdomain.com might point to 10.8.76.200, the NAT’d address for EvilCorp’s 192.168.0.200, while goodcorp3.70.rvpn.ourdomain.com might point to 10.8.69.70, the NAT’d address for GoodCorp’s 192.168.3.70.
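In zone-file terms, the two examples above correspond to records like these (auto-generated in practice):

    ; rvpn.ourdomain.com -- example A records only
    evilcorp0.200   IN  A   10.8.76.200   ; EvilCorp 192.168.0.200 via its RVPN /24
    goodcorp3.70    IN  A   10.8.69.70    ; GoodCorp 192.168.3.70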
The VPN terminator itself consisted of a pair of hosts in a high-availability configuration.
VPN access for client instances was provisioned using a dedicated certificate authority: only a client presenting a certificate signed by that CA could connect, and a connecting client received only its pre-configured VPN address and configuration, so it was not possible for a customer to maliciously misconfigure their OpenVPN instance to gain access to another customer. Certificates were revoked upon cessation of work with a customer, without impacting other customers or leaving credentials exposed for potential future re-use.
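OpenVPN supports this model with a handful of server-side directives; on the terminator they would plausibly have looked something like this (paths illustrative):

    ca /etc/openvpn/rvpn-ca/ca.crt            # only certificates issued by the dedicated RVPN CA are accepted
    crl-verify /etc/openvpn/rvpn-ca/crl.pem   # revoked customer certificates are rejected at connect time
    ccd-exclusive                             # a client with no ccd entry cannot connect at all
    # Addresses and routes come only from the server-side ccd entry, so a customer cannot
    # reconfigure their own instance to obtain another customer's VPN address or routes.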
Challenges
Although elegantly simple when fully configured, the RVPN solution was challenging to debug. Connectivity between an engineer and a customer host was required to transit three independent OpenVPN instances, and at least as many firewalls. This complexity was minimized where possible by designing for simple management and careful change control.
The DNS records mapping customer addresses to their NAT’d equivalents were auto-generated. Monitoring was set up to test every step of the connection path, so that a failure in any one of the connectivity elements could be quickly identified.
MTU issues were challenging early on, due to the number of VPNs a packet had to traverse. These issues have largely disappeared now that UFB is ubiquitous and customers no longer use ADSL connections.
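At the OpenVPN layer, the usual mitigations for this class of problem are MSS clamping and in-tunnel fragmentation; a generic sketch (not necessarily the values used here) looks like this:

    # Applied on both ends of a tunnel; values depend on the underlying path MTU
    tun-mtu 1500
    mssfix 1300      # clamp TCP MSS so sessions fit inside the nested tunnels
    fragment 1300    # fragment and reassemble larger UDP datagrams inside OpenVPN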
Process
While the initial requirements of the solution (security, auditability, etc.) were met on day one, some of the more subtle design features evolved over several iterations as staff gained experience with the system. The initial design simply provided access to monitoring bastions deployed at each customer; fully NAT-routed access to arbitrary customer IPs was a later addition.
The solution was not highly available initially, but because the terminator needed to reboot roughly monthly for upstream kernel updates, HA was added as the criticality of the monitored customer infrastructure grew.
Initially, all customer instances were Linux-based VMs, but over time, as we rolled out IaaS and L3VPNs to customers, the instances were migrated to HA pairs of pfSense firewalls running on our own IaaS infrastructure.
Result
The VPN solution was originally designed to remove the need for engineers to build bespoke connections to remotely support customers. Over several years, the design was extended to include:
- Permitting RADIUS/LDAP authentication from devices we managed on customer sites towards a dedicated HA pair of auth servers, so staff could use internal credentials to remotely support managed network elements across all customers in all locations.
- Permitting VMware vCenter management of customer-located ESX hosts. As part of our IaaS offering, we provided a central vCenter management and reporting platform servicing ESX hosts across multiple customers. (Making VMware ESX heartbeats work towards vCenter from NAT’d IPs was non-trivial!)
- Providing remote backup services to managed VMs / platforms hosted at customer sites.
- Supporting the monitoring of customer infrastructure via central or distributed Icinga platforms, with all alerts and logs aggregated at a “master” Icinga instance.