Like many of you out there, we were suddenly in a position where we needed to ramp up our remote connectivity to cope with the demand driven by Covid-19. After some research, we decided the easiest path was to build some more RAS servers and load-balance them with a pair of Kemp LoadMasters.
We followed the guide provided by Kemp and used their templates to configure the virtual services as laid out here: https://support.kemptechnologies.com/hc/en-us/articles/360026123592
Our setup was fairly straightforward: a NAT rule on the firewall for UDP 500/4500 to the VIP on the load balancer, with the RAS servers' default gateway pointed at the same VIP. (See https://support.kemptechnologies.com/hc/en-us/articles/360002996552-Routing-Feature-Description and https://support.kemptechnologies.com/hc/en-us/articles/203126369-Transparency for a good explanation of routing and default gateways.) The RAS servers and the Kemps were on the same subnet.
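As a quick sanity check on each RAS server, you can confirm the default route from PowerShell (just a sketch; the NextHop you see should be your own Kemp VIP):
# Show the default route on the RAS server - NextHop should be the load-balancer VIP
Get-NetRoute -DestinationPrefix "0.0.0.0/0" | Select-Object InterfaceAlias, NextHop, RouteMetric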
All appeared well, but we could never get more than ~60 connections on a server, and many users were failing to connect with the dreaded 809 error: https://directaccess.richardhicks.com/2019/02/14/troubleshooting-always-on-vpn-error-code-809/ (By the way, Richard's site has a TON of really useful information on the subject; if you haven't checked it out yet, I would urge you to do so!)
We had ruled out blocked ports and IKEv2 packet fragmentation (we were running Server 2019 with the registry setting to enable support for IKEv2 packet fragmentation):
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\RemoteAccess\Parameters\Ikev2\" -Name EnableServerFragmentation -PropertyType DWORD -Value 1 -Force
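To confirm the setting has actually taken on each server (worth doing if you have a few of them), you can read it back:
# Read back the IKEv2 fragmentation setting - expect EnableServerFragmentation : 1
Get-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\RemoteAccess\Parameters\Ikev2\" -Name EnableServerFragmentation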
But we were still having issues. At this point, we called in Microsoft Premier Support, and they pointed out that the logs indicated “Max number of established MM SAs to peer exceeded”.
This was because the Kemp, configured per its recommendation that “It is best practice to enable the Subnet Originating Requests option globally”, was presenting the load-balanced VIP IP to the RAS servers as the source of every connection, so the RAS servers only ever saw the IP of the VIP, not the clients. They, in turn, hit a max-connections-per-IP limit and refused further connections.
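We only found that message through Microsoft's log analysis, but if you have RAS/IKE tracing enabled and writing text logs to the default C:\Windows\tracing folder (the location and path here are assumptions on my part; adjust for your environment), a simple sweep is one way to hunt for it yourself:
# Search RAS tracing output for the MM SA limit message
# (assumes tracing is enabled and writing to C:\Windows\tracing - adjust the path if not)
Select-String -Path "C:\Windows\tracing\*.log" -Pattern "Max number of established MM SAs" -List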
This could be fixed by upping the limit in the registry on the RAS servers (there is no upper limit):
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\IKEEXT\Parameters
New DWORD : IkeNumEstablishedForInitialQuery = 0x0000c350 [DEC 50000]
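If you would rather script that than click through regedit, the same New-ItemProperty approach as the fragmentation tweak works (a sketch; apply it on each RAS server):
# Raise the per-peer IKE MM SA limit to 50000 (0xC350) on the RAS server
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\IKEEXT\Parameters" -Name IkeNumEstablishedForInitialQuery -PropertyType DWORD -Value 0xC350 -Force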
However, they also pointed out that “you need to not use NAT for IKEv2 on the LB”.
So a cleaner way of doing this is to adjust the settings on the virtual services on the Kemp to enable transparency (caveat: this works for us because we have the load balancers and the RAS servers on the same subnet). Do this on both virtual services, and untick Subnet Originating Requests (we unticked this from the Network Options pane in the WUI).
This helped with the connection limit, and clients' public IPs were now being displayed in the RAS console.
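If you prefer PowerShell to the console for checking this, the RemoteAccess module's connection statistics cmdlet will list the connections; the property names below are from memory, so verify them with Get-Member on your own servers:
# List current VPN connections - ClientExternalAddress should now show the real
# client public IPs rather than the VIP (check property names with Get-Member)
Get-RemoteAccessConnectionStatistics | Select-Object UserName, ClientIPAddress, ClientExternalAddress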
The next issue we had was that if one of the RAS servers failed, clients would never reconnect until that failed server had been brought back online. This was fixed by enabling “System Configuration / Miscellaneous Options / Network Options :: Enable reset on close”.
We also verified our L7 configuration was correct, with Always Check Persist set to “Yes – Accept Changes”, and both “Drop Connections on RS Failure” and “Drop at drain time” checked.
Clients will now reconnect to another host should one go down, as expected.
We are now running in a stable environment with these tweaks to the original Kemp settings. A huge thanks to Richard Hicks for his help; hopefully this will help some of you!
Great article Jimmy. Whilst I implemented similar settings to the above on our setup, with the exception of the persistence timeout being 5 minutes, I am still getting the dreaded 809 error on IKEv2. I am working with Microsoft Support at the moment, but just checking to see if you have any other tips from your experience. I have IkeNumEstablishedForInitialQuery implemented even though we are doing Layer 7 transparency, but it is still happening. Would you be able to share some thoughts? Many thanks
Do you get the 809 error every time, or just for some clients? We also have a call open with Microsoft for this, as we have some clients that throw this error until we restart the IKEEXT service. They have escalated this internally to the engineering team as a bug. We are now considering using SSTP instead, as we are also aware that in Windows 10 2004, if you have a device tunnel and a user tunnel, the user tunnel fails to establish using IKEv2.
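For reference, the workaround on an affected RAS server is just to bounce that service; a quick sketch (note this may interrupt existing IKEv2 connections on that server):
# Restart the IKE and AuthIP IPsec Keying Modules service on the affected RAS server
Restart-Service -Name IKEEXT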
Once we restart a device we most likely get the 809, and eventually I noticed that after 2 hours it connects. We ended up with the user tunnel being SSTP first, but the device tunnel can only be IKEv2, and that causes a big problem. Restarting the service or the servers is not really an option as such, but rather a necessity. Would you be able to email me a reference to your ticket so I can pass it to the engineer that is dealing with ours? Thanks for your prompt reply.
If you do a Wireshark trace you will likely see that the client tries to initiate the IKE session on UDP 500 but the server never replies. I'll email you our ticket ref directly rather than paste it here.
Thanks Jimmy. Trust me, I have done trace after trace. I know there is an issue but I need to get past the big wall 🙂 Thanks again
Let us know how you get on – I'll post an update here when we get one.
Thanks appreciated