Byte Queue Limits: the unauthorized biography | by Tom Herbert

2025-01-14 01:25:00
medium.com

This post does not reflect the views of current, past, or future employers. The opinions in this article are my own.

Of the technologies I’ve worked on, Byte Queue Limits, or just BQL, seems the one for which I get the most comments! It’s rather ancient now dating back to 2011, however to my surprise it still seems quite relevant. Anyway, I thought it’d be fun today to look at it as part retrospective and part exposé.

To understand Byte Queue Limits we first need some foundation. For me, the journey started in the 1990s when I was working for Sun Microsystems fresh out of school. My first project was implementing PPP (Point-to-Point Protocol). I would run test after test of my code on a US Robotics modem with a whopping 28.8Kbps data rate! It might have been a crude setup, but I learned two key concepts:

#1 Don’t starve the transmitter

If there is send data enqueued then the transmitter should never be idle (I guess the technical term is to be work-conserving). When performing a large file transfer I’d watch the SD (Send Data) LED on the modem — if the LED was lit solid then I’d get good throughput, if it was flashing on and off that’s the transmitter being starved and throughput would drop.

#2 Don’t over-queue

When I set a large transmit queue limit I noticed that TCP retransmits would appear. Investigating, I discovered that TCP was retransmitting packets that hadn’t even left yet the system since they were still enqueued. It’s pretty obvious that that’s a really bad reason to retransmit!

Putting these together gives the core principle of Byte Queue Limits:

Enqueue enough data in a device queue to prevent starvation, but not any more than that!

**US Robotics modem circa 1995**. When doing a file transfer the goal was to keep SD lit solid.

Today, we face a major problem on the Internet in pervasive over-queuing. This problem has been coined bufferbloat. With respect to device queues, bufferbloat can cause excessive delays for sending high priority packets.

Byte Queue Limits (BQL) addresses bufferbloat head on. It is a feature in the Linux kernel that solves the problem by automatically adjusting the device queue limit. This is accomplished by adding a layer which enables and disables queuing to the device queue based on calculating the minimum buffer size required to avoid starvation under the current conditions. The smaller the amount of queued data, the lower the maximum latency experienced by queued packets.

An interesting aspect of BQL can be inferred from the first word in the name — byte. Unlike typical device queue limits that specify a number of packets, BQL operates on bytes. This is because the number of bytes has a more direct relationship with the time required to transmit to the physical medium than the number of packets since the latter are variably sized.

The implementation of Byte Queue Limits is based on Dynamic Queue Limits (DQL) library that is meant to be a general-purpose queue length controller; on top of that the Byte Queue Limits mechanism is used within the networking layer.

To support BQL a driver calls one of the following functions when queueing packets on a device queue:

void netdev_sent_queue(struct net_device *dev, unsigned int pkts,
unsigned int bytes);
void netdev_tx_sent_queue(struct netdev_queue *dev_queue, unsigned int pkts,
unsigned int bytes);

Either of these functions will note that the given number of bytes have been queued to the given device. If the underlying DQL code determines that the queue exceeds the limit after adding these bytes, it will tell the upper layers to pass no more data to the device for now.

In completion processing, a driver should call one of:

void netdev_completed_queue(struct net_device *dev, unsigned pkts,
unsigned bytes);
void netdev_tx_completed_queue(struct netdev_queue *dev_queue, unsigned pkts,
unsigned bytes);

These functions in turn call dql_completed which runs an algorithm to dynamically adjust the byte limit of the device queue (see below).

The crux of the BQL algorithm is implemented in dql_completed. The input is the number of bytes completed in a completion interrupt and a per queue data structure containing BQL parameters and state. If you’d like to look at the code it’s here, but be forewarned, the code is rather elegant and concise (the accompanying comments are really critical!).

The dql_completed function does two things: 1) It checks if the transmitter was starved and so the limit needs to be increased, and 2) if the transmitter wasn’t starved then it checks if there is over queuing and the limit can be decreased.

#1 Was the transmitter starved?

The queue is considered starved if one of the two following conditions is true:

The queue was over-limit in the last interval (the last time completion processing ran), and there is no more data in the queue (i.e. it’s empty)
The queue was over-limit in the previous interval and when enqueuing it was possible that all queued data had been consumed. This covers the case when queue may have become starved between completion processing running and next time enqueue was scheduled

When the queue is starved, increase the limit by the amount of bytes both sent and completed in the last interval, plus any previous over-limit. The diagram below shows an example of the limit being increased.

**Increasing the byte queue limit**. Data is first enqueued to a device queue that exceeds the limit (shown on the left). The amount of new data is indicated by “Enqueued last”. A completion interrupt is processed and all bytes are completed so that the queue is empty and the transmitter may have been starved (shown in the center). The byte queue limit is increased by the amount of the previous over-limit plus the number of bytes completed minus “Enqueued last”. New data can be enqueued with the new limit applied (shown on the right).

#2 Transmitter not starved, can we decrease the limit?

If the transmit queue was not starved then it may have been subject to over- queueing. dql_completed checks if the limit can be decreased. A decrease is only considered if the queue has been busy in the whole interval.

We define slack as the amount of excess data queued above the amount needed to prevent starvation. If calculated slack is greater than zero then the queue limit can be decreased. To avoid hysteresis we consider the minimum amount of slack found over several iterations of the completion processing.

Slack is the maximum of:

The queue limit plus previous over-limit minus twice the number of objects completed. Note that two times the number of completed bytes is a basis for an upper bound of the limit.
Portion of objects in the last queuing operation that was not part of non-zero previous over-limit. That is “round down” by the non-overlimit portion of the last queueing operation. This is needed in the case of enqueuing big packets like for TSO.

The diagram below shows an example of the limit being decreased.

**Decreasing the Byte Queue Limit**. Some amount of data is enqueued in a device queue (shown on the left). Completion processing happens and some number of bytes are completed shown by “Bytes completed” (second from right). The transmitter wasn’t starved so the slack is computed (second from left). The slack is greater than zero so the limit is decreased by subtracting the slack from the old limit as shown on the right.

As an example, suppose we have a single queue NIC with 10Gbps data rate. Let’s suppose that the packet limit is set to 256 (default on several NICs). Now a large file transfer, hundreds of Gigabytes, is sent. The packets for the file transfer will quickly fill the device queue and the queue will be kept full for the duration of the transfer. Now suppose a single high priority, latency sensitive packet is sent. The packet is enqueued on device queue behind 256 packets, so it won’t be transmitted until the 256 packets in front of it are sent (a sort of head-of-line blocking). Assuming a 1,500 byte packet size, that means 384,000 bytes need to be sent before the high priority packet is transmitted — for a 10Gbps NIC this takes 307 microseconds.

**Impact of over-queuing in a device queue**. This diagram shows the contents of a device queue across eight instances of completion processing. The blue regions indicate bulk data being transmitted that is not latency sensitive, and the orange region indicates a high priority, latency sensitive packet. The high priority packet is delayed by the low priority data enqueued in front of it. For instance, assuming that completion processing interrupts once a millisecond, transmitting the high priority packet is delayed by at least seven milliseconds

Now let’s enable BQL on the same 10Gbps NIC. BQL depends on periodic completion interrupts so let’s suppose we get an interrupt every ten microseconds (this period can be set by device configuration). On average about 12,500 bytes would be completed every ten microseconds. Plugging the numbers into the BQL algorithm (see below), a byte queue limit of around 30,000 bytes would be calculated. Now the high priority packet would be enqueued behind only 30,000 bytes which translates to a delay of 24 microseconds compared to 307 microseconds without BQL.

**Effect of Byte Queue Limits on latency**. The diagram shows the same scenario as in the diagram above except that Byte Queue Limits are used. In this case the high priority, latency sensitive packet is enqueued behind significantly less data. The result is that the delay of the high priority packet is reduced to about less than three milliseconds compared to seven with BQL enabled. The dashed line is the byte queue limit.

Both TSO and multi-queue have drastic effects on the number of bytes that can be enqueued and packet delay. The table below gives some numbers to show the effects of BQL in the presence of TSO and multi-queue. Note that BQL scales pretty well across the various scenarios.

**Table showing effects of BQL with TSO and multi-queue**. The baseline NIC in this example is a 10Gbps NIC with a 256 packet limit and is configured to generate a completion interrupt every ten microseconds. We consider a single queue, 16 queues, and 32 queues. For each queue configuration we show the maximum number of bytes enqueued and the packet delay for TSO enabled or disabled and BQL enabled or disabled.

We’ve seen BQL successfully deployed with good effect over the years. The benefits are:

Reduced Latency: By limiting the amount of data that can be queued in the NIC, BQL minimizes the time packets spend waiting in the device to be transmitted, leading to lower latency, especially in scenarios with fluctuating network traffic. We have seen upwards of 90% latency improvement with BQL.
Moves the problem (and that’s a good thing!): A side effect of BQL is that proportionally more packets are moved from the device queue, which is a simple FIFO, to the queueing discipline (QDisc) layer which is capable of implementing much more complicated queueing strategies.
It just works: This is probably my favorite feature! Unlike static queue limits and other networking techniques, BQL doesn’t require special configuration. It automatically and dynamically adjusts the queue limit based on real-time network conditions, making it more responsive to changing traffic patterns.

There are a couple of annoyances with BQL. First, it does require driver changes which is a bit of a pain. By my count it is supported by more than fifty drivers so it doesn’t seem to be a major impediment. Secondly, there is some processing overhead. I’ve measured it to be about 3% CPU utilization, IMO that’s not huge and in most cases the benefits outweigh the cost.

Hardware will obsolete BQL (eventually!). BQL is a pure software solution so it’s limited as to how precisely it can set the optimal limit. Some newer hardware devices have advanced device queue features like priority queues can offer near zero delay for high priority packets even in the face of TSO and multi-queue. My prediction is that hardware will obsolete BQL in the high end data center (if it hasn’t happened already), however BQL will still be useful to billions of low end devices for the foreseeable future!

Source Link

Keep your files stored safely and securely with the SanDisk 2TB Extreme Portable SSD. With over 69,505 ratings and an impressive 4.6 out of 5 stars, this product has been purchased over 8K+ times in the past month. At only $129.99, this Amazon’s Choice product is a must-have for secure file storage.

Help keep private content private with the included password protection featuring 256-bit AES hardware encryption. Order now for just $129.99 on Amazon!

Start your free Amazon Prime trial
today and unlock unlimited streaming and more!

Support Techcratic

If you find value in Techcratic’s insights and articles, consider supporting us with Bitcoin. Your support helps me, as a solo operator, continue delivering high-quality content while managing all the technical aspects, from server maintenance to blog writing, future updates, and improvements. Support Innovation! Thank you.

Bitcoin Address:

bc1qlszw7elx2qahjwvaryh0tkgg8y68enw30gpvge

Please verify this address before sending funds.