This post dives into how we identified and fixed a sporadic storage performance anomaly we observed in one of our benchmarks.
At Qumulo, we build a high-performance file data platform and ship release updates every two weeks. Shipping enterprise-grade software that frequently requires an extensive test suite to ensure we've built a high-quality product. Our performance test suite runs continuously across all of our platform offerings and includes file performance tests that run against industry-standard benchmarks.
Enter the storage performance anomaly
Over a period of a few months, we observed variability in our multi-stream read and write benchmarks. These benchmarks use IOzone to generate concurrent reads and concurrent writes against the cluster, and measure the aggregate throughput across all connected clients. In particular, we saw a bimodal distribution in which most runs consistently hit a stable throughput target, while a second, smaller set of results sporadically came in around 200-300 MB/s slower, roughly 10% worse. Here is a graph showing the throughput results.
Characterizing the problem
When investigating any storage performance anomaly, the first step is to eliminate as many variables as possible. The sporadic results were first spotted across hundreds of software versions over a period of months. To simplify things, we kicked off a series of benchmark runs, all on the same hardware and on a single software version. This series of runs showed the same bimodal distribution, which meant the variability could not be explained by hardware differences or version-specific software regressions.
After reproducing the bimodal performance on a single version, we compared the detailed performance data collected from a fast run and a slow run. The first thing that jumped out was that the node-to-node RPC latencies were much higher for the bad runs than for the good ones. This could have had a number of causes, but it pointed to a network-related root cause.
Exploring TCP socket performance
With that in mind, we wanted more detailed data about TCP socket performance from the test runs, so we enabled our performance test harness to continuously collect data from ss. Each time ss runs, it outputs detailed statistics for every socket on the system:
> ss -tio6
State   Recv-Q   Send-Q   Local Address:Port                        Peer Address:Port
ESTAB   0        0        fe80::f652:14ff:fe3b:8f30%bond0:56252     fe80::f652:14ff:fe3b:8f60:42687
        sack cubic wscale:7,7 rto:204 rtt:0.046/0.01 ato:40 mss:8940 cwnd:10 ssthresh:87 bytes_acked:21136738172861 bytes_received:13315563865457 segs_out:3021503845 segs_in:2507786423 send 15547.8Mbps lastsnd:348 lastrcv:1140 lastack:348 pacing_rate 30844.2Mbps rcv_rtt:0 rcv_space:1540003
…
Each socket on the system corresponds to one entry in the output.
As you can see from the sample output, ss dumps its data in a way that isn't particularly easy to analyze. We took the data and plotted the various components to give a visual view of TCP socket performance across the cluster for a given performance test. With this graph we could easily compare fast tests against slow tests and look for anomalies.
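Our analyzer is internal, but the core of it is simple text processing. Here is a minimal C sketch (not our actual harness code) that reads ss -tio6 output on stdin and pulls out each socket's congestion window so it can be plotted over time:

/* Minimal sketch, not our actual harness: print each socket's cwnd
 * from `ss -tio6` output so the values can be graphed. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char line[4096];

    while (fgets(line, sizeof(line), stdin)) {
        const char *p = strstr(line, "cwnd:");
        if (p != NULL)
            printf("cwnd=%ld\n", strtol(p + strlen("cwnd:"), NULL, 10));
    }
    return 0;
}

Piping ss -tio6 through a filter like this on an interval is enough to build the kind of per-socket time series we graphed.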
The most interesting of these graphs was the size of the congestion window (in segments) over the course of the test. The congestion window (indicated by cwnd: in the output above) is crucially important to TCP performance, as it controls the amount of data outstanding in flight over the connection at any given time. The higher the value, the more data TCP can send on a connection in parallel. When we looked at the congestion windows from a node during a low-performance run, we saw two connections with reasonably high congestion windows and one with a very small window.

Looking back at the inter-node RPC latencies, the high latencies directly correlated with the socket with the tiny congestion window. This raised the question: why would one socket maintain a very small congestion window compared to the other sockets in the system?
Having identified that one RPC connection was experiencing significantly worse TCP performance than the others, we went back and looked at the raw output of ss. We noticed that this 'slow' connection had different TCP options than the rest of the sockets. In particular, it had the default TCP options. In the output below, note that the two connections have vastly different congestion windows, and that the entry with the smaller congestion window is missing sack and wscale:7,7.
ESTAB 0 0 ::ffff:10.120.246.159:8000 ::ffff:10.120.246.27:52312
sack cubic wscale:7,7 rto:204 rtt:0.183/0.179 ato:40 mss:1460 cwnd:293 ssthresh:291 bytes_acked:140908972 bytes_received:27065 segs_out:100921 segs_in:6489 send 18700.8Mbps lastsnd:37280 lastrcv:37576 lastack:37280 pacing_rate 22410.3Mbps rcv_space:29200
ESTAB 0 0 fe80::e61d:2dff:febb:c960%bond0:33610 fe80::f652:14ff:fe54:d600:48673
cubic rto:204 rtt:0.541/1.002 ato:40 mss:1440 cwnd:10 ssthresh:21 bytes_acked:6918189 bytes_received:7769628 segs_out:10435 segs_in:10909 send 212.9Mbps lastsnd:1228 lastrcv:1232 lastack:1228 pacing_rate 255.5Mbps rcv_rtt:4288 rcv_space:1131488
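As a rough sanity check on why that matters: ss's reported send rate is approximately cwnd × MSS × 8 / RTT. For the second socket above, that is 10 × 1440 bytes × 8 / 0.541 ms ≈ 213 Mbps, which lines up with the reported 212.9Mbps. The first socket, with cwnd:293 and an RTT of 0.183 ms, works out to roughly 18.7 Gbps.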
This was interesting, but looking at just one socket datapoint didn't give us much confidence that having default TCP options was highly correlated with our tiny congestion window issue. To get a better sense of what was going on, we gathered the ss data from our series of benchmark runs and observed that 100% of the sockets without the SACK (selective acknowledgment) option maintained a maximum congestion window 90-99.5% smaller than every socket with non-default TCP options. There was clearly a correlation between sockets missing the SACK option and the performance of those TCP sockets, which makes sense, since SACK and the other options are intended to increase performance.
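We gathered this data by scraping ss, but on Linux the same negotiated-option and congestion-window information is also available programmatically through the TCP_INFO socket option. A minimal sketch (error handling trimmed; fd is assumed to be an already-connected TCP socket):

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Report whether SACK and window scaling were negotiated on a connected
 * TCP socket, along with its current congestion window (in segments). */
static void report_tcp_options(int fd)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == -1) {
        perror("getsockopt(TCP_INFO)");
        return;
    }
    printf("sack=%d wscale=%d cwnd=%u\n",
           !!(info.tcpi_options & TCPI_OPT_SACK),
           !!(info.tcpi_options & TCPI_OPT_WSCALE),
           info.tcpi_snd_cwnd);
}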

How TCP options are set
TCP options on a connection are set by passing option values along with messages containing SYN flags. This is part of the TCP connection handshake (SYN, SYN+ACK, ACK) required to create a connection. Below is an example of an interaction where the MSS (maximum segment size), SACK, and WS (window scaling) options are set.
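In text form, such an exchange looks roughly like this (the option values are illustrative, not taken from our capture):

Client -> Server   SYN        options: [mss 1460, sackOK, wscale 7]
Server -> Client   SYN+ACK    options: [mss 1460, sackOK, wscale 7]
Client -> Server   ACK

If the server's SYN+ACK omits an option such as SACK or window scaling, that option is simply not used for the life of the connection.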

So where have our TCP options gone?
Although we had associated the missing SACK and window scaling options with smaller congestion windows and low-throughput connections, we still had no idea why these options were turned off for some of our connections. After all, every connection was created using the same code!
We decided to focus on the SACK option because it’s a simple flag, hoping that would be easier to debug. In Linux, SACK is controlled globally by a sysctl and can’t be controlled on a per-connection basis. And we had SACK enabled on our machines:
>sysctl net.ipv4.tcp_sack
net.ipv4.tcp_sack = 1
We were at a loss as to how our program could have missed setting these options on some connections. We started by capturing the TCP handshake during connection setup. We found that the initial SYN message had the expected options set, but the SYN+ACK removed SACK and window scaling.

We cracked open the Linux kernel’s TCP stack and started searching for how the SYN+ACK options are crafted. We found tcp_make_synack, which calls tcp_synack_options:
static unsigned int tcp_synack_options(const struct sock *sk,
                                       struct request_sock *req,
                                       unsigned int mss, struct sk_buff *skb,
                                       struct tcp_out_options *opts,
                                       const struct tcp_md5sig_key *md5,
                                       struct tcp_fastopen_cookie *foc)
{
    ...
    if (likely(ireq->sack_ok)) {
        opts->options |= OPTION_SACK_ADVERTISE;
        if (unlikely(!ireq->tstamp_ok))
            remaining -= TCPOLEN_SACKPERM_ALIGNED;
    }
    ...
    return MAX_TCP_OPTION_SPACE - remaining;
}
We saw that the SACK option is simply set based on whether the incoming request has the SACK option set, which was not very helpful. We knew that SACK was getting stripped from this connection between the SYN and SYN+ACK, and we still had to find where it was happening.
We took a look at the incoming request parsing in tcp_parse_options:
void tcp_parse_options(const struct net *net,
                       const struct sk_buff *skb,
                       struct tcp_options_received *opt_rx, int estab,
                       struct tcp_fastopen_cookie *foc)
{
    ...
    case TCPOPT_SACK_PERM:
        if (opsize == TCPOLEN_SACK_PERM && th->syn &&
            !estab && net->ipv4.sysctl_tcp_sack) {
            opt_rx->sack_ok = TCP_SACK_SEEN;
            tcp_sack_reset(opt_rx);
        }
        break;
    ...
}
We saw that, in order to positively parse a SACK option on an incoming request, the request must have the SYN flag (it did), the connection must not be established (it wasn’t), and the net.ipv4.tcp_sack sysctl must be enabled (it was). No luck here.
As part of our browsing, we happened to notice that when handling connection requests in tcp_conn_request, the kernel sometimes clears the options:
int tcp_conn_request(struct request_sock_ops *rsk_ops,
                     const struct tcp_request_sock_ops *af_ops,
                     struct sock *sk, struct sk_buff *skb)
{
    ...
    tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);

    if (want_cookie && !tmp_opt.saw_tstamp)
        tcp_clear_options(&tmp_opt);
    ...
    return 0;
}
We quickly found out that the want_cookie variable indicates that Linux wants to use the TCP SYN cookies feature, but we didn’t have any idea what that meant.
What are TCP SYN cookies?
To understand TCP SYN cookies, it helps to first understand the problem they solve: SYN flooding.
SYN flooding
TCP servers typically have a limited amount of space in the SYN queue for connections that aren’t yet established. When this queue is full, the server cannot accept more connections and must drop incoming SYN requests.
This behavior leads to a denial-of-service attack called SYN flooding. The attacker sends many SYN requests to a server, but when the server responds with SYN+ACK, the attacker ignores the response and never sends an ACK to complete connection setup. This causes the server to try resending SYN+ACK messages with escalating backoff timers. If the attacker never responds and continues to send SYN requests, it can keep the server's SYN queue full at all times, preventing legitimate clients from establishing connections with the server.
Resisting the SYN flood
TCP SYN cookies solve this problem by allowing the server to respond with SYN+ACK and set up a connection even when the SYN queue is full. What SYN cookies actually do is encode the options that would normally be stored in the SYN queue entry (plus a cryptographic hash of the approximate time and source/destination IPs and ports) into the initial sequence number value in the SYN+ACK. The server can then throw away the SYN queue entry and not waste any memory on this connection. When the (legitimate) client eventually responds with an ACK message, it will contain the same initial sequence number. The server can then check the hash of the time and, if it's valid, decode the options and complete connection setup without having used any SYN queue space.
Drawbacks of SYN cookies
Using SYN cookies to establish a connection has one drawback: there isn't enough space in the initial sequence number to encode all the options. The Linux TCP stack only encodes the maximum segment size (a required option) and sends a SYN+ACK that rejects all other options, including the SACK and window scaling options. This isn't usually a problem because SYN cookies are only used when the server's SYN queue is full, which isn't likely unless it's under a SYN flood attack.
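To make the space problem concrete, here is a simplified sketch of the classic SYN cookie layout. This is an illustration of the idea rather than the kernel's actual code, and cookie_hash is a stand-in for a proper keyed hash:

#include <stdint.h>

/* Stand-in for a keyed cryptographic hash over the connection 4-tuple,
 * the coarse time counter, and a server-side secret. */
static uint32_t cookie_hash(uint32_t saddr, uint32_t daddr,
                            uint16_t sport, uint16_t dport, uint32_t t)
{
    uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport) ^ t ^ 0x9e3779b9u;
    h ^= h >> 16;
    h *= 0x7feb352du;
    h ^= h >> 15;
    return h;
}

/* All 32 bits of the initial sequence number have to carry everything the
 * server wants to remember: a few bits of coarse time, 3 bits selecting an
 * entry in a small MSS table, and a 24-bit hash to validate the returning
 * ACK. There is simply no room left for SACK or window scaling. */
static uint32_t make_syn_cookie(uint32_t saddr, uint32_t daddr,
                                uint16_t sport, uint16_t dport,
                                uint32_t t /* slowly incrementing counter */,
                                uint32_t mss_index)
{
    return ((t % 32) << 27) |
           ((mss_index & 0x7) << 24) |
           (cookie_hash(saddr, daddr, sport, dport, t) & 0x00ffffff);
}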
Below is an example interaction that shows how a connection would be created with SYN cookies when a server’s SYN queue is full.

The Storage Performance Anomaly: Qumulo’s TCP problem
After studying TCP SYN cookies, we recognized that they were likely responsible for our connections periodically missing TCP options. Surely, we thought, our test machines weren’t under a SYN flood attack, so their SYN queues should not have been full.
We went back to reading the Linux kernel and discovered that the maximum SYN queue size was set in inet_csk_listen_start:
int inet_csk_listen_start(struct sock *sk, int backlog)
{
    ...
    sk->sk_max_ack_backlog = backlog;
    sk->sk_ack_backlog = 0;
    ...
}
From there, we traced through callers to find that the backlog value was set directly in the listen syscall. We pulled up Qumulo’s socket code and quickly saw that when listening for connections, we always used a backlog of size 5.
if (listen(fd, 5) == -1)
    return error_new(system_error, errno, "listen");
During cluster initialization we were creating a connected mesh network between all of the machines, so of course we had more than 5 connections created at once for any cluster of sufficient size. We were SYN flooding our own cluster from the inside!
We quickly made a change to increase the backlog size that Qumulo uses when listening, and all of the bad performance results disappeared. Case closed!
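Concretely, the shape of the fix is just to pass a backlog large enough for every peer in the cluster to connect at once. SOMAXCONN below is an illustrative choice rather than the exact value we shipped:

if (listen(fd, SOMAXCONN) == -1)
    return error_new(system_error, errno, "listen");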
Editor's Note: This post was published in December 2020.
Learn more
Qumulo’s engineering team is hiring and we have several job openings – check them out and learn about life at Qumulo.
Contact us
Take a test drive. Demo Qumulo in our new, interactive Hands-On Labs, or request a demo or free trial.
Subscribe to the Qumulo blog for customer stories, technical insights, industry trends and product news.