Windows Storage Offloaded Data Transfer (ODX) and the Case of the Missing Prefetch

How a customer experienced slower-than-expected performance when comparing standard READ/WRITE with ODX, and how the engineering team leveraged Qumulo Core's metrics, rapid release cycle, and seamless upgrade experience to uncover and solve the mystery.

What is ODX (Windows Storage Offloaded Data Transfer)?

Windows Storage Offloaded Data Transfer (ODX) is part of the Server Message Block (SMB) protocol that lets a client tell the Qumulo file system to copy a portion of one file to another file without sending the data through the client machine. ODX improves copy performance much like server-side copy (SSC): it eliminates network latency between the client and the server, and it lets Qumulo Core optimize IO sizes and buffering. Comparing ODX to SSC, our internal test results show ODX is 7x faster and SSC is 4x faster than the typical READ/WRITE sequence that occurs when copying a file to another location on the same server.
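
To make the difference concrete, here's a toy sketch (my own illustration with hypothetical names, not Qumulo's or Microsoft's implementation) of a client-driven READ/WRITE copy versus a token-based offloaded copy:

```python
# Toy model: a "server" holding files. The Server class and its method names
# are hypothetical stand-ins for the SMB operations involved.

class Server:
    def __init__(self):
        self.files = {}    # filename -> bytearray of contents
        self.tokens = {}   # token id -> (filename, offset, length)

    # Traditional path: data crosses the wire to the client and back.
    def read(self, name, off, size):
        return bytes(self.files[name][off:off + size])

    def write(self, name, off, data):
        f = self.files.setdefault(name, bytearray())
        if len(f) < off:
            f.extend(b"\0" * (off - len(f)))
        f[off:off + len(data)] = data

    # Offload path: the client only handles a small token; the server moves
    # the bytes itself and can pick its own IO sizes and buffering.
    def offload_read(self, name, off, length):
        token = len(self.tokens)
        self.tokens[token] = (name, off, length)
        return token

    def offload_write(self, dst, dst_off, token):
        src, off, length = self.tokens[token]
        self.write(dst, dst_off, bytes(self.files[src][off:off + length]))


server = Server()
server.write("a.bin", 0, b"x" * 1024)

# READ/WRITE copy: every byte travels server -> client -> server.
server.write("copy1.bin", 0, server.read("a.bin", 0, 1024))

# ODX-style copy: only a token travels; the data never leaves the server.
tok = server.offload_read("a.bin", 0, 1024)
server.offload_write("copy2.bin", 0, tok)
assert server.read("copy1.bin", 0, 1024) == server.read("copy2.bin", 0, 1024)
```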

The Read Latency Mystery

In 2021, we shipped ODX support in Qumulo Core version 3.0.0 and above. Soon after we shipped, a customer reported that ODX performance was slower than expected.

Knowing that ODX reads from one file and writes to another, we initially suspected the problem was write performance. Read operations usually take at most a few milliseconds (thanks, SSD caching and prefetch!). However, when we looked at the read latency, we saw it was a larger share of the ODX latency than we would expect: on the order of 25 ms, with spikes of 40 or 50 ms. What's going on with that? Here's a visualization of the lifetime of the read phase of an ODX operation:

Figure 1: Lifetime of an ODX Request – Read

We examined the low-granularity time-series performance data we collect through our cloud monitoring pipeline. On average, our RAM cache hit rate and latency metrics looked fine. So why were ODX-specific reads commonly slow?

The First Clue

As we closed in on the root cause, we needed higher-granularity metrics specific to a single ODX workload on this cluster, without the noise of other operations. Luckily, we have an internally built tool, called trigger, that collects this kind of distributed performance data when needed. We worked with our customer to profile some specific ODX sequences, and used trigger to collect data during those sequences. We learned a couple of things. (Refer to Figure 1, above.)

In each read operation, most blocks were in RAM cache, but each read was missing cache for a few blocks. When this happens, the file system transaction has to read the data from the storage devices themselves. In this case, the metrics told us the data was on spinning disks.

Note: In my mind I was thinking “of course, it had to be the spinning disks…it’s ALWAYS the spinning disks.” I should have known that seeks are the only thing that could have pushed the latency up to 50ms. Spinning disks and I don’t have the best relationship. They have an infuriating but useful habit of painfully exposing unintended behavior in performance sensitive code paths.

Spinning disks can stream data efficiently when requests are queued in contiguous order: starting at the beginning of a file and reading through to the end. In this case, however, each ODX read operation was only missing a few blocks from cache, which shows up at the spinning disk as a lot of small, discontiguous reads. That's bad news for spinning disk latency.
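
To get a feel for the magnitude, here's a back-of-envelope sketch with assumed (not measured) spinning disk numbers:

```python
# Illustrative back-of-envelope numbers (assumed, not measured) for why
# discontiguous reads are so much more expensive on spinning disk than one
# contiguous streaming read of the same data.

SEEK_MS = 10.0                  # assumed seek + rotational latency per repositioning
STREAM_BYTES_PER_MS = 150_000   # assumed ~150 MB/s sequential throughput

def contiguous_ms(total_bytes):
    # One repositioning, then stream everything.
    return SEEK_MS + total_bytes / STREAM_BYTES_PER_MS

def scattered_ms(total_bytes, chunk_bytes):
    # One repositioning per scattered chunk.
    chunks = total_bytes / chunk_bytes
    return chunks * (SEEK_MS + chunk_bytes / STREAM_BYTES_PER_MS)

print(contiguous_ms(8 << 20))           # ~66 ms to stream 8 MiB in one pass
print(scattered_ms(8 << 20, 64 << 10))  # ~1300 ms for the same 8 MiB in 64 KiB pieces
```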

Normally, file system read operations don’t experience this kind of spinning disk latency, because we have an effective prefetcher. The prefetcher guesses what’s going to be read next, and loads it into cache ahead of time. The prefetcher has an IO pattern optimized to read from spinning disks.

So, why wasn’t the prefetcher working?

Our prefetcher has metrics that describe how successful it is on average. These metrics told us that everything being prefetched was being read, but some data the file system read wasn’t getting prefetched…that’s really weird! Maybe the ODX operations are reading out of order or jumping around randomly? The prefetcher doesn’t do well with random IO.
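
Here's an illustrative way to think about those two signals (the metric names and numbers are my own, not Qumulo's):

```python
# "waste" is prefetched data that nobody read; "coverage" is the share of
# read data that the prefetcher had already fetched. Values are illustrative.

prefetched_and_read = 8 << 20    # bytes prefetched that the reader later hit
prefetched_unused   = 0          # bytes prefetched that were never read
read_not_prefetched = 512 << 10  # bytes the file system read that were never prefetched

waste    = prefetched_unused / (prefetched_and_read + prefetched_unused)
coverage = prefetched_and_read / (prefetched_and_read + read_not_prefetched)

print(f"waste={waste:.0%} coverage={coverage:.0%}")  # waste=0% coverage=94%
# Zero waste with less-than-full coverage is exactly the odd signature above:
# everything prefetched gets read, yet some reads still miss cache.
```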

We checked the operation logs to see the IO pattern. Operation logs are a text-based logging tool that records information about each operation, such as offset, size, and cache hit or miss. However, we hadn't anticipated that we'd need the ODX read sizes and offsets in our operation logs, and as a result that code hadn't been hooked up! So we had to write some more code, and wait for an upgrade…
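
For illustration, this is roughly the kind of entry we wanted the operation log to carry for ODX reads (the field names here are invented for illustration, not the real log format):

```python
# A hypothetical operation-log entry; the real log format is Qumulo-internal.
entry = {
    "op": "odx_read",
    "file_id": 4321,
    "offset": 0,
    "size": 1048576,   # bytes requested by this operation
    "cache": "miss",   # whether the data was already in RAM cache
}
print(entry)
```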

How Prefetch Works (Example Illustrations)

Figure 2: How Prefetch Should Work

Figure 3: How Prefetch Should Work (after the second read)

Figure 4: How Prefetch Should Work (after third read)

Figure 5: How Prefetch Should Work (contiguous read proceeds until the end of the file)

Finding the Culprit

Within a couple of weeks, we had shipped new code, and using our super awesome upgrades, the customer was able to upgrade their Qumulo storage system to the latest version soon after. They re-ran the test workload for us. Using the improved operation logging, we found another puzzle! By reading operation logs (I felt like I was in the Matrix, watching the green numbers go by), we found that our ODX protocol code was sending the file system contiguous read requests for 8650752 bytes each. When we did the math, that comes out to 8 mebibytes + 256 KiB. But! The operation log showed that each ODX request was actually only getting 8388608 bytes from cache (exactly 8 mebibytes).
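
For the record, the arithmetic from the logs:

```python
# Quick arithmetic check of the two numbers seen in the operation logs.
assert 8 * 1024 * 1024 == 8388608               # 8 MiB: what each request got from cache
assert 8 * 1024 * 1024 + 256 * 1024 == 8650752  # 8 MiB + 256 KiB: what each request asked for
```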

I remember at first my eyes glazed over…was I misreading the numbers? How is each request missing cache for only the last 256 KiB of what it was asking for? How is the prefetcher fetching only a subset of each request? That's impossible!

Note: The first stage of every bug is denial - believing that the data is wrong. Luckily, this wasn’t my first rodeo. I quickly stifled my denial, and started asking myself “how could this be possible”?

I started explaining how the prefetcher works to my coworker who was working with me on the investigation. The prefetcher should be getting ahead of the read workload, such that the IO size of the read workload doesn’t matter. The prefetcher always fetches data in large chunks and in such a way that the queueing on the spinning disks is well-ordered to avoid seeks. And then it hit me…large chunks…how big are the prefetch chunks? OH…they are 8 MiB…wait that’s spooky. And how does it get ahead of the read workload? OH, we kick off one prefetch per read request…wait…putting those two things together…if we only kick off a single prefetch per read request, and the prefetch chunk size is at most 8 MiB, and the reader is reading chunks of 8MiB + 256KiB…then yeah, obviously, the prefetcher will never get ahead…it will always prefetch the next 8MiB, leaving a tail of 256KiB uncached. WOW. (Fist pump). Ok. We found the smoking gun.
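
Here's a toy simulation of that smoking gun, assuming the simplified model "each read kicks off one prefetch of at most 8 MiB starting where the read ends" (my own sketch, not Qumulo's prefetcher code):

```python
PREFETCH_MAX = 8 * 1024 * 1024        # prefetch chunk size: at most 8 MiB

def simulate(read_size, num_reads=4):
    cached_upto = 0                   # everything below this offset is in cache
    offset = 0
    for i in range(num_reads):
        end = offset + read_size
        hit = max(0, min(cached_upto, end) - offset)
        print(f"read {i}: hit {hit} bytes, missed {read_size - hit} bytes")
        cached_upto = end + PREFETCH_MAX   # one prefetch per read, past the read's end
        offset = end

simulate(8 * 1024 * 1024)                # 8 MiB reads: after the first, every read hits
simulate(8 * 1024 * 1024 + 256 * 1024)   # 8 MiB + 256 KiB reads: a 256 KiB tail misses every time
```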

Example Illustrations of the Prefetch Problem

Illustration 1: Prefetch Problem (first read of a file)

Illustration 2: Prefetch Problem (after the second read)

Illustration 3: Prefetch Problem (after the third read)

Illustration 4: Prefetch Problem (read IO size is greater than prefetch IO size)

Illustration 5: Prefetch Problem (after the fourth read)

Mystery Explained

Figuring out the bug is just the beginning. Three questions immediately came to mind:

  1. Why doesn't this same problem happen in our standard file system read tests?
  2. Why is ODX using an IO size of 8 MiB + 256 KiB, larger than the prefetcher's max IO size?
  3. Why didn't our ODX-specific performance tests catch this?

1. Why doesn't this problem happen in our standard file system read performance tests?

Our file system protocol scheduler processes all protocol requests. A typical protocol request size is 1 MiB. When we receive protocol read requests, we process them and potentially combine requests to optimize read performance. We naturally have a maximum combined IO size for these file system requests: if you let the IO size grow too large, the request can become bottlenecked on some large single-threaded piece of work (for example, CPU or memory allocation). So our file system scheduler caps the maximum combined read request at 8 MiB, the same as our prefetcher's max IO size. This was by design. So in these workloads, the prefetcher is prefetching everything, although it isn't ever getting more than 8 MiB ahead if the combined IO size is also 8 MiB… (scratches head… that also wasn't intended behavior…).
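
Here's a sketch of that combining cap (a hypothetical helper assuming contiguous 1 MiB protocol reads, not the real scheduler code):

```python
MAX_COMBINED_READ = 8 * 1024 * 1024     # same as the prefetcher's max IO size

def combine_reads(requests):
    """Coalesce contiguous (offset, size) protocol reads, but never past the cap."""
    combined = []
    for offset, size in sorted(requests):
        if (combined
                and combined[-1][0] + combined[-1][1] == offset
                and combined[-1][1] + size <= MAX_COMBINED_READ):
            combined[-1] = (combined[-1][0], combined[-1][1] + size)
        else:
            combined.append((offset, size))
    return combined

# Eight contiguous 1 MiB requests coalesce into one 8 MiB read; a ninth one
# starts a new combined request instead of growing past the cap.
reqs = [(i * 1024 * 1024, 1024 * 1024) for i in range(9)]
print(combine_reads(reqs))   # [(0, 8388608), (8388608, 1048576)]
```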

2. Why is ODX using an IO size of 8 MiB + 256 KiB?

The ODX operation itself goes through the scheduler, but it doesn't put its internal read request through the scheduler. ODX was therefore setting the maximum read IO size it would request on its own, independent of the scheduler.

We found that the scheduler and the ODX code were deriving their max IO sizes from different functions:

fs_target_transaction_size()
fs_max_transaction_size()

The scheduler was using fs_target_transaction_size, which is the minimum of fs_max_transaction_size and 8 MiB! Whereas the ODX code was, intuitively enough, using fs_max_transaction_size - which is computed, and turned out to be 8 MiB + 256 KiB.

So there was our root cause. We fixed the ODX protocol code to use fs_target_transaction_size, and voilà, no more uncached tails to the read requests.
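
In sketch form, based on the description above (the function names come from the investigation; the bodies here are illustrative stand-ins, not Qumulo's source):

```python
def fs_max_transaction_size():
    # Computed from internal constraints; on this system it worked out to
    # 8 MiB + 256 KiB.
    return 8 * 1024 * 1024 + 256 * 1024

def fs_target_transaction_size():
    # Capped at the same 8 MiB the prefetcher and the scheduler use.
    return min(fs_max_transaction_size(), 8 * 1024 * 1024)

# Before the fix: ODX sized its reads with fs_max_transaction_size()    -> 8650752 bytes
# After the fix:  ODX sizes its reads with fs_target_transaction_size() -> 8388608 bytes
print(fs_max_transaction_size(), fs_target_transaction_size())
```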

3. Why didn’t our performance tests of ODX catch this?

Our ODX performance tests did, in fact, exhibit the behavior of uncached tails. But the performance wasn't that bad, because in this performance test the data was small enough to fit in the SSD cache!

Looking back at Figure 1 (above), you can see that the latency of reads from SSD is much lower than spinning disk. So it wasn’t as obvious from the top line performance test results that the prefetcher was doing the wrong thing. The test wasn’t explicitly checking the prefetcher’s metrics.
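
Here are some illustrative, assumed (not measured) numbers for why the uncached tail was easy to overlook when it came from SSD but obvious once it came from spinning disk:

```python
ram_hit_ms  = 2.0    # assumed time to serve the 8 MiB cache hit from RAM
ssd_tail_ms = 0.5    # assumed latency to fetch the small uncached tail from SSD
hdd_tail_ms = 15.0   # assumed seek-dominated latency to fetch it from spinning disk

print("tail served from SSD:", ram_hit_ms + ssd_tail_ms, "ms per read")  # 2.5 ms
print("tail served from HDD:", ram_hit_ms + hdd_tail_ms, "ms per read")  # 17.0 ms
# A ~25% slowdown can hide in a topline test number; an ~8x slowdown cannot.
```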

Once we changed the performance test to do a second step of evicting the data from SSD, and then reading it back from spinning disk, the topline performance was significantly worse than it should have been -- revealing the bug. We fixed the bug, observed this automated performance test improve, and shipped the improvement in 2 weeks. Once again leveraging Qumulo’s easy upgrades, our customers were able to upgrade soon after the fix and get the improved ODX performance.
