Hi,
We have a haproxy setup consisting of a pair of nodes running keepalived, which use the proxy protocol to pass requests (roundrobin) to a second pair of haproxy nodes. The first pair mainly terminates SSL and serves as a highly available entry point, while the second pair does all the logic around routing to the correct application.
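To make the topology concrete, the traffic flow is roughly:

    clients / Akamai edge
            |
            v
    keepalived VIP -> haproxy pair 1 (active/passive, SSL termination)
            |
            |  PROXY protocol, roundrobin
            v
    haproxy pair 2 (routing to applications)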
Yesterday at 21:30:35 CET, the active node of the first pair suddenly accumulated thousands of orphaned sockets; historically this machine has had around 300 orphans at any given time. A few seconds later, CPU usage shot up from about 30% to 100%, and at that point most requests started timing out. Just before the incident we were handling 400-450 req/s. The second pair saw no increase in load during these problems.
As far as we can tell so far (investigations are ongoing), there was no change anywhere in the environment for several hours preceding this sudden activity. We are lucky in that a large part of the incoming traffic is diagnostic data, which we can live without for a while; when we shut down that application, the situation returned to normal. Before doing that, we tried failing over to the other node, restarting haproxy, and doing a full reboot of the node. None of these helped; the problem returned in less than a minute.
We had a lot of "kernel: [ 1342.944691] TCP: too many orphaned sockets" in our logs, as well as a few of these:
kernel: [2352025.865855] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
kernel: [2352068.014861] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
However, according to the Arbor system our hosting provider runs, there was no SYN attack. The servers are fronted by Akamai, which by default retries twice on connection failures or timeouts, so this may have amplified our problems; we will be turning that feature off.
So, some questions:
1. Does it seem reasonable that the orphaned sockets could cause this behaviour, or are they just a symptom?
2. What causes the orphaned sockets? Could haproxy start misbehaving when it is starved for resources?
3. We were speculating that it could somehow be related to keepalives not being terminated properly; is there any merit to that thought? (See the sketch right below these questions for what we were considering.)
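To frame question 3: one change we have been considering (not applied yet, so please treat it as a sketch with made-up values) is tightening the client-side keep-alive and half-close timeouts in the defaults section, something like:

defaults
    # wait at most this long for the next request on an idle keep-alive connection
    timeout http-keep-alive 10s
    # drop half-closed client connections after this long instead of holding them
    timeout client-fin 30s

Would something along those lines plausibly keep the orphan count down, or is it beside the point?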
The nodes are virtual machines running CentOS 7 and haproxy 1.6.7, single core with 2 GB of memory. As I mentioned, they have been handling this load without any hiccups for several months, but we are still considering increasing the specs. Would a few more cores have helped, or would it just have taken a few more seconds to chew through them as well?
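In case we do add cores, this is roughly what we had in mind for spreading the work over multiple processes (untested; the numbers are just an illustration for a 2-core VM):

global
    # one haproxy process per core, each pinned to its own CPU
    nbproc 2
    cpu-map 1 0
    cpu-map 2 1

Does that look like the right direction, or would it just have doubled the time until we hit the same wall?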
Below is the haproxy configuration.
Thankful for any insights! Best regards,
Carl Pettersson
global
    log 127.0.0.1 local0
    maxconn 4000
    log-tag haproxy-gateway
    server-state-file /var/lib/haproxy/gateway.state
    stats socket /var/run/haproxy-gateway.sock
    stats timeout 2m
    user haproxy
    group haproxy
    tune.ssl.default-dh-param 2048
    ssl-default-bind-options no-sslv3
    ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!3DES:!MD5:!PSK
    tune.ssl.cachesize 100000
    tune.ssl.lifetime 600
    tune.ssl.maxrecord 1460

defaults
    log global
    mode http
    option httplog
    option dontlognull
    load-server-state-from-file global
    option http-server-close
    option forwardfor
    # Redispatch if backend is down
    option redispatch
    retries 3
    # timeouts
    timeout http-request 5s
    timeout queue 1m
    timeout connect 10s
    timeout client 1m
    timeout server 1m
    # Set X-Request-Id on all incoming requests. Magic format taken from docs
    unique-id-format %{+X}o\ %ci:%cp_%fi:%fp_%Ts_%rt:%pid
    unique-id-header X-Request-ID
    # Set log format to the same as default, adding the request id on the end
    log-format %ci:%cp\ [%t]\ %ft\ %b/%s\ %Tq/%Tw/%Tc/%Tr/%Tt\ %ST\ %B\ %CC\ %CS\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %hr\ %hs\ %{+Q}r\ %ID

listen http
    bind 10.0.1.21:80
    default_backend backend_pair
    capture request header Host len 64
    capture request header X-Forwarded-For len 64
    capture request header User-Agent len 200
    capture request header True-Client-IP len 32

listen https
    bind 10.0.1.21:443 ssl crt /etc/haproxy/ssl/gateway/
    http-request set-header X-Forwarded-Proto https
    default_backend backend_pair
    capture request header Host len 64
    capture request header X-Forwarded-For len 64
    capture request header User-Agent len 200
    capture request header True-Client-IP len 32

backend backend_pair
    option tcp-check
    server 10_0_1_24_8082 10.0.1.24:8082 check send-proxy
    server 10_0_1_27_8082 10.0.1.27:8082 check send-proxy