Max connection resource problem
Posted: Tue May 10, 2022 8:12 pm
We have a large number of IoT devices (running Ubuntu) spread out in random places. They sit behind various networks that we can't control, so to be able to manage them we have each device open a VPN connection to two VPN servers (two for redundancy). To do maintenance, authorized users can then connect to a VPN server and on to the IoT device.
This works well most of the time but as we grow we seem to hit some limit.
There are currently >5500 connections to each server, and sometimes a server seems to just collapse with a ton of "TLS Error: TLS key negotiation failed to occur within 60 seconds". The connection count dives to a few hundred before it recovers: the clients retry and the count goes back up again.
The VPN servers are in the cloud on dedicated CPU: quad-core machines with 8 GB of memory. Network traffic when idle seems to be <2 MB/s.
Whenever it has issues I of course check CPU, memory and I/O, and it seems like CPU might be the problem, because the (single-threaded - I know) openvpn process gets close to using a full CPU (18-23%, where 25% would be one full core out of four). OK, so it's CPU then - but why is the CPU utilization more like 5-10% when it is working?
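To make those percentages easier to compare, here is a minimal sketch that converts them into a share of one core. It assumes the 18-23% and 5-10% figures are measured against all four cores of the machine; `per_core_percent` is just an illustrative helper name, not part of any tool.

```python
def per_core_percent(total_percent: float, cores: int) -> float:
    """Convert a whole-machine CPU percentage into a percentage of one core.

    Assumption: total_percent is reported against all cores combined,
    so 100 / cores means one core is fully busy.
    """
    return total_percent * cores


CORES = 4  # quad core, as described in the post

# Figures quoted in the post: 5-10% when working, 18-23% when failing.
samples = [
    ("working, low", 5),
    ("working, high", 10),
    ("failing, low", 18),
    ("failing, high", 23),
]

for label, total in samples:
    share = per_core_percent(total, CORES)
    print(f"{label}: {total}% of the machine is about {share:.0f}% of one core")
```

Under that assumption the single-threaded process sits at roughly 20-40% of its core when things work and 72-92% when they fail, so the headroom on that one core is smaller than the whole-machine number suggests.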
What can cause it to suddenly start dropping connections and then - without us doing anything - recover all the connections and go back to all good?
As we grow, what is a better solution for getting access to each of these devices? I'm thinking of more, load-balanced VPN servers with a "backbone" VPN network and some smarts to keep track of which VPN server each device is connected to.
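The "smarts" part could be as small as a lookup table from device to server. Below is a rough in-memory sketch of that idea, not a finished design: `DeviceRegistry`, its method names and the 300-second staleness limit are all made up for illustration, and a real setup would need shared storage plus something on each VPN server to report connects and disconnects (for OpenVPN, possibly the `client-connect` / `client-disconnect` script hooks).

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Location:
    server: str      # hypothetical server name, e.g. "vpn-a"
    seen_at: float   # timestamp of the last connect report


class DeviceRegistry:
    """Tracks which VPN server each device was last seen on (illustrative only)."""

    def __init__(self, max_age_seconds: float = 300.0) -> None:
        self._locations: Dict[str, Location] = {}
        self._max_age = max_age_seconds

    def connected(self, device_id: str, server: str, now: Optional[float] = None) -> None:
        # Record (or overwrite) the server this device is currently attached to.
        ts = time.time() if now is None else now
        self._locations[device_id] = Location(server, ts)

    def disconnected(self, device_id: str, server: str) -> None:
        # Only forget the device if the record still points at this server;
        # a late disconnect report must not erase a newer connection elsewhere.
        current = self._locations.get(device_id)
        if current is not None and current.server == server:
            del self._locations[device_id]

    def lookup(self, device_id: str, now: Optional[float] = None) -> Optional[str]:
        # Return the server name, or None if the device is unknown or the record is stale.
        ts = time.time() if now is None else now
        current = self._locations.get(device_id)
        if current is None or ts - current.seen_at > self._max_age:
            return None
        return current.server


registry = DeviceRegistry()
registry.connected("device-0001", "vpn-a")
print(registry.lookup("device-0001"))
```

A maintenance user would first ask the registry where a device is and then route through that server; the staleness check is there so a server that dies without reporting disconnects does not leave stale entries behind forever.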
What other, smarter solutions exist?
/ps