Taras Chornyi edited this page Nov 28, 2022 · 3 revisions

This section describes the Linux configuration of Quality of Service (QoS) on Prestera hardware.

Priority Assignment

Switch Priority (SP) of a packet can be derived from packet headers or assigned by default. The headers that are used when deriving the SP of a packet depend on the Trust Level of the port through which the packet ingressed. The Prestera driver supports two trust modes:

  • L2 trust mode / PCP trust (default, applicable for VLAN-aware bridge members)
  • L3 trust mode / DSCP trust

L2 Trust Mode

By default, all ports are in L2 trust mode. In this mode (if a port is a member of a VLAN-aware bridge), the switch prioritizes packets based on their IEEE 802.1p priority. This priority is specified in the packet's Priority Code Point (PCP) field, which is part of the packet's VLAN header. The mapping between PCP and SP is 1:1 and cannot be changed. The CFI/DEI bit is not supported, so it does not affect the SP.

If the packet does not have 802.1p tagging, or if the port is not a member of a VLAN-aware bridge, then the packet is assigned the port-default priority.

L3 Trust Mode

Each port can be configured to set the SP of packets based on the DSCP field of IP headers. The priority is assigned to individual DSCP values through the DCB-APP entries. After the first APP rule is added to a given port, this port's trust level is toggled to DSCP. It stays in this mode until all DSCP APP rules are removed again.

Use dcb app attribute dscp-prio to manage the DSCP-APP entries:

dcb app add dev <port> dscp-prio <DSCP>:<SP> # Insert rule.
dcb app del dev <port> dscp-prio <DSCP>:<SP> # Delete rule.
dcb app show dev <port> dscp-prio            # Show rules.

For more information see dcb-app(8).

The Linux DCB interface allows a very flexible configuration of DSCP-to-SP mapping, to the point of permitting configuration of two priorities for the same DSCP value. These conflicts are resolved in favor of the highest configured priority.

The following example maps the same DSCP to two different SPs:

dcb app add dev swp1 dscp-prio 24:3      # Configure 24->3.
dcb app add dev swp1 dscp-prio 24:2      # Keep 24->3.

To show the result:

dcb app show dev swp1 dscp-prio
dscp-prio CS3:2 CS3:3
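
The resolution rule can be modeled with a small shell helper. This is only a sketch of the rule stated above, not driver code: effective_sp is a hypothetical name, and it accepts numeric DSCP values only (dcb may print symbolic names such as CS3).

```shell
# effective_sp DSCP PAIR...
# Prints the highest SP configured for DSCP among the given <DSCP>:<SP> pairs.
# Returns non-zero and prints nothing if no pair matches.
effective_sp() {
    want=$1; shift
    best=""
    for pair in "$@"; do
        dscp=${pair%%:*}   # text before the colon
        sp=${pair##*:}     # text after the colon
        if [ "$dscp" = "$want" ]; then
            if [ -z "$best" ] || [ "$sp" -gt "$best" ]; then
                best=$sp
            fi
        fi
    done
    [ -n "$best" ] && echo "$best"
}

# Example: the two rules configured above for DSCP 24.
effective_sp 24 24:3 24:2
```

With the two rules from the example, the helper prints 3, matching the "Keep 24->3" comment.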

Note: due to a limitation, the dcb app del command clears all previously configured mappings on a port, not only the one specified. Due to the same limitation, the dcb app replace command is not supported.

L3 trust mode disables PCP prioritization even for non-IP packets that have no DSCP value, and even if they have 802.1p tagging. Such packets are assigned the port-default priority instead.

Default Priority

The default value for port-default priority is 0. As with the DSCP-APP rules, Linux allows configuration of several default priorities. The system uses the highest default value.

Use dcb app attribute default-prio to configure the default priority:

dcb app add dev <port> default-prio <SP>      # Insert default prio <SP>.
dcb app del dev <port> default-prio <SP>      # Delete rule for default <SP>.

Note: changing the default priority also changes the dscp-prio mapping (only relevant for L3 trust mode).

Mapping Profiles

Marvell devices have 12 mapping profiles, which are used for priority assignment. One of them is reserved for L2 mode and is statically configured, so the user can configure up to 11 unique dscp-prio mapping profiles (L3 mode). Whenever the user adds a new mapping using the dcb utility, a new profile is allocated. If the mapping matches an existing one, the profile is shared between all ports with the same mapping:

# This will only use 1 profile, which will be shared
# between ports swp1, swp2, and swp3
dcb app add dev swp1 dscp-prio 15:1 30:2 60:5
dcb app add dev swp2 dscp-prio 15:1 30:2 60:5
dcb app add dev swp3 dscp-prio 15:1 30:2 60:5

The user can specify multiple dscp-prio maps with a single dcb command:

dcb app add dev swp1 dscp-prio 10:1 20:1 30:5

But dcb handles each dscp-prio pair as if it was a separate command:

dcb app add dev swp1 dscp-prio 10:1
dcb app add dev swp1 dscp-prio 20:1
dcb app add dev swp1 dscp-prio 30:5

This has a side effect: if 11 unique profiles are already configured, a new mapping cannot be added even if the final mapping matches an existing one, because each intermediate dscp-prio map needs a profile of its own. To work around this, it is advised to use only 10 profiles, so that the last one remains available for the intermediate dscp-prio maps.
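
To estimate how many dscp-prio profiles a planned configuration needs, the sharing rule can be approximated in shell. count_profiles is a hypothetical helper. It assumes that two ports share a profile exactly when their sets of DSCP:SP pairs are identical, and it ignores default-prio and the intermediate profiles described above.

```shell
# count_profiles
# Reads one line per port from stdin: "<port> <DSCP>:<SP> [<DSCP>:<SP> ...]"
# Prints the number of distinct dscp-prio maps, which is an estimate of the
# number of profiles needed (assumption: identical maps share one profile).
count_profiles() {
    while read -r _port pairs; do
        # Ports without any mapping stay in L2 mode and need no profile.
        [ -n "$pairs" ] || continue
        # Sort the pairs so that their order does not make equal maps differ.
        # $pairs is intentionally unquoted to split it into words.
        printf '%s\n' $pairs | sort -t: -k1,1n -k2,2n | tr '\n' ' '
        echo
    done | sort -u | wc -l | tr -d ' '
}

# Example: swp1 and swp2 share a map, swp3 has its own.
printf '%s\n' \
    'swp1 15:1 30:2 60:5' \
    'swp2 60:5 15:1 30:2' \
    'swp3 10:1' | count_profiles
```

Keeping the result at 10 or below leaves the spare profile recommended above.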

When you run the dcb app del command, all the port mappings are removed from the hardware and the port switches to L2 mode, regardless of which mapping was specified. However, the remaining mappings are still visible in the dcb tool:

# configure some mapping
dcb app add dev swp1 dscp-prio 10:1 20:1 30:5

# verify that it is applied to hw
dcb app show dev swp1

> dscp-prio 10:1 20:1 30:5


# remove a single mapping
dcb app del dev swp1 dscp-prio 10:1

# the prio mapping is now fully removed from hw; port is in L2 mode
# but dcb will show that the mapping is still present
dcb app show dev swp1

> dscp-prio 20:1 30:5

Changing the default-prio also changes the dscp-prio mapping. If a new default-prio is configured on a port then dscp-prio will also be updated, which might require a new mapping profile. If there are no available profiles then an error is returned:

dcb app add dev swp1 dscp-prio 10:1 20:1 30:5

# this will also update dscp-prio map on hw
# might return an error if no profiles are available
dcb app add dev swp1 default-prio 4

When a default-prio configuration is removed, all dscp-prio mappings that were present on the port are also deleted:

# this will delete any dscp-prio map on hw and change port to L2 mode
dcb app del dev swp1 default-prio 4

Egress remarking

Egress remarking is done only when L3 trust mode is enabled. Remarking of user priority (UP) is not supported; only DSCP can be configured.

Packets that ingress the switch through a port in L3 trust mode have their DSCP value updated as they egress the switch. The same dcb app rules that are used for packet prioritization are used to configure the rewrite. If several DSCP values map to the same priority, the highest DSCP is favored.

Assumptions and Limitations

  • Remarking of user priority (UP) is not supported.
  • For packets that ingress through an L2 trust port and egress through an L3 trust port, the DSCP is unchanged.
  • For packets that ingress through an L3 trust port and egress through an L2 trust port, the DSCP is set to 0.
  • If no user DSCP mapping on the egress port matches the packet SP, the DSCP of that packet is set to 0.
  • If no rules are configured on the egress port, and the egress port is in L3 trust mode, all DSCP values are rewritten to 0.
  • The dcb app del command removes all dscp-prio mappings from the hardware and changes the port to L2 mode.
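
The rewrite rules for a single egress port in L3 trust mode can be sketched as a lookup from SP to DSCP. egress_dscp is a hypothetical helper, not part of dcb or the driver. It reads the conflict rule as "when several DSCP values map to the same SP, the highest DSCP is written", and it returns 0 when no mapping matches, as listed above.

```shell
# egress_dscp SP PAIR...
# Prints the DSCP that would be written for a packet with the given SP,
# based on the port's <DSCP>:<SP> pairs. Numeric DSCP values only.
egress_dscp() {
    want_sp=$1; shift
    out=0                      # no matching rule: DSCP is rewritten to 0
    for pair in "$@"; do
        dscp=${pair%%:*}
        sp=${pair##*:}
        if [ "$sp" = "$want_sp" ] && [ "$dscp" -gt "$out" ]; then
            out=$dscp          # keep the highest DSCP mapped to this SP
        fi
    done
    echo "$out"
}

# Example with the mapping 10:1 20:1 30:5 used earlier on this page.
egress_dscp 1 10:1 20:1 30:5
egress_dscp 3 10:1 20:1 30:5
```

The first call prints 20 (both 10 and 20 map to SP 1), and the second prints 0 because no rule covers SP 3.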

Traffic Shaping

In Linux, QoS traffic shaping is configured through the TC interface using queuing disciplines (qdiscs). There are two types of disciplines: classful and classless. Both types are described in the relevant Linux man pages and TC how-to pages.

Prestera hardware offloads the following queuing disciplines (qdiscs) to implement QoS:

  • ETS qdisc for scheduling configuration.
  • TBF qdisc for shaper configuration.
  • RED qdisc for congestion avoidance.

Scheduler (ETS qdisc)

In Linux, a scheduler with strict and WRR bands is implemented by using the ETS queuing discipline. For more information on this qdisc, see the man page at https://man7.org/linux/man-pages/man8/tc-ets.8.html.

Priomap

ETS supports several traffic classification algorithms, but only priomap is offloaded. priomap is composed of a list of numbers, one for each priority. The number indicates the band number that the packets with that priority should go to: 0 for the first band, 1 for the second, and so on:

                      p7 ----------------.
                      ..                 |
                      p2 ------.         |
                      p1 ----. |   ...   |
                      p0 --. | |         |
                           | | |         |
                           v v v         v
tc ... ets bands 8 priomap 7 6 5 4 3 2 1 0
                           | | |   ...   |
                           | | |         '-> band 0
                           | | |              ...
                           | | '-----------> band 5
                           | '-------------> band 6
                           '---------------> band 7

NOTES:

  • ETS supports up to 16 priorities in a priomap. For purposes of offloading, the only relevant priorities are 0-7. Priorities 8-15 are ignored and can be omitted when configuring ETS.
  • The mapping of a priority to a traffic class is static and cannot be changed. If you try to change the priority mapping, the qdisc is not offloaded and the offload flag is not set.

The tc -s qdisc show command lists the current ETS configuration, including the full priomap and statistics.

Band Number Mapping

The Prestera driver uses bands to denote logical traffic classes. Each band is mapped in the ASIC to a pair of TCs. The TC is derived from the band number as follows: band 0 maps to TC 7, band 1 maps to TC 6, and so on, down to band 7, which maps to TC 0.

For purposes of attaching a child to a band, the qdisc class ID of the band is its band number + 1.

The following table summarizes the band mapping described above:

Band no.  Class ID  Priority
0         X:1       Highest
1         X:2
2         X:3
3         X:4
4         X:5
5         X:6
6         X:7
7         X:8       Lowest
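
The two derivations above (TC = 7 - band, class ID = band + 1) are simple enough to express as shell helpers. The function names are hypothetical and exist only for illustration.

```shell
# band_to_tc BAND
# Prints the hardware TC for an ETS band: band 0 -> TC 7, ..., band 7 -> TC 0.
band_to_tc() {
    echo $((7 - $1))
}

# band_to_classid HANDLE BAND
# Prints the class ID to use as "parent" when attaching a child qdisc,
# for example handle 10 and band 0 give 10:1.
band_to_classid() {
    echo "$1:$(($2 + 1))"
}

# Example: print the mapping for all 8 bands of an ETS qdisc with handle 10.
band=0
while [ "$band" -le 7 ]; do
    echo "band $band -> class $(band_to_classid 10 "$band"), TC $(band_to_tc "$band")"
    band=$((band + 1))
done
```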

Example

Add an ETS qdisc with handle 10 and 8 bands, with quanta that sum to 10000:

  • 4 strict bands
  • 4 bands that split the traffic 40% : 30% : 20% : 10%.

Traffic is mapped to bands in a reversed 1:1 manner to make priority-0 traffic the least prioritized and priority-7 traffic the most prioritized. That means that priority 0 goes to TC 0, 1 goes to TC 1, and so on. (Except BUM traffic, which goes to TC 8, TC 9, and so on instead.)

tc qdisc replace dev sw1p1 root handle 10: \
     ets bands 8 strict 4 quanta 4000 3000 2000 1000 \
     priomap 7 6 5 4 3 2 1 0
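
The percentage split follows from each quantum divided by the sum of all quanta. A minimal sketch of that arithmetic, using a hypothetical helper name and integer division (so shares are rounded down):

```shell
# quanta_shares QUANTUM...
# Prints, one per line, the percentage of bandwidth each WRR band receives
# relative to the other WRR bands.
quanta_shares() {
    total=0
    for q in "$@"; do
        total=$((total + q))
    done
    for q in "$@"; do
        echo $((q * 100 / total))
    done
}

# Example: the quanta used in the command above.
quanta_shares 4000 3000 2000 1000
```

For the quanta 4000 3000 2000 1000 this prints 40, 30, 20 and 10, which is the 40% : 30% : 20% : 10% split described above.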

As indicated in the table above, band 0 has the class ID X:1, band 1 X:2, and so on. When creating a new qdisc, to attach a child qdisc to a band, use that ID as parent reference. For example, to attach RED to the first band and TBF to the second one:

tc qdisc replace dev sw1p1 parent 10:1 handle 101: \
     red limit 2M avpkt 1000 probability 0.1 min 500K max 1.5M
tc qdisc replace dev sw1p1 parent 10:2 handle 102: \
     tbf rate 400Mbit burst 128K limit 1M

Default Behavior

The Scheduler for all traffic classes on all ports is configured to be SDWRR by default (switch initialization) as shown in the following table:

Default TC sched configuration

TC band (Linux)  TC  SDWRR weight
1                7   5
2                6   5
3                5   5
4                4   5
5                3   1
6                2   1
7                1   1
8                0   1

If you configure different scheduler parameters (SDWRR, SP etc.) on a port, and then remove that configuration from the port (remove root ETS qdisc), the default scheduler profile (described in the table above) is applied.

Linux Default Scheduler Behavior

When a network interface is created in Linux, the OS applies the default scheduling profile (qdisc) to that interface. The same thing happens when the last TC qdisc is removed from the port. In other words, a Linux network interface always has at least one (default) queuing discipline (qdisc) set. Only some qdiscs can be used as a default qdisc, for example:

  • b/pfifo_fast (3x queue qdisc with fixed TOS mapping), default
  • b/pfifo (Packet limited First In, First Out queue)

The default qdisc can be obtained using one of the following commands:

cat /proc/sys/net/core/default_qdisc

> pfifo_fast


sysctl net.core.default_qdisc

> net.core.default_qdisc = pfifo_fast

The qdisc can be changed using the corresponding sysctl or echo command.
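
As a sketch of both variants, the helpers below wrap the sysctl and echo forms. The function names are hypothetical. The commands require root, and the new default is expected to apply only to default qdiscs attached after the change.

```shell
# show_default_qdisc
# Prints the name of the current default qdisc.
show_default_qdisc() {
    cat /proc/sys/net/core/default_qdisc
}

# set_default_qdisc NAME
# Sets the default qdisc through sysctl (requires root).
# Equivalent echo form: echo NAME > /proc/sys/net/core/default_qdisc
set_default_qdisc() {
    sysctl -w "net.core.default_qdisc=$1"
}
```

For example, set_default_qdisc pfifo_fast restores the usual Linux default.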

Limitations

  • The device supports 16 different scheduler profiles, with one profile reserved for a default configuration. So, a total of 15 different scheduler profiles are available to be configured by users. If the same profile is used for all ports, only one hardware scheduler profile is utilized, that is, the same scheduler profile is shared between ports.
  • It is not possible to change the priority to band mapping. This mapping is always static. See the priomap section for more information.
  • The user has to specify 8 queues when creating a qdisc.
  • Default qdiscs applied by Linux on network interfaces are not offloaded by Switchdev.

Shaper (TBF qdisc)

In Linux, the shaper is implemented using the Token Bucket Filter (TBF) queuing discipline. When the configuration is supported by the hardware, the qdisc is offloaded. For more information, see tc-tbf(8).

On hardware, shaping is implemented using a token bucket. TBF is configured on the ETS qdisc bands. The following TBF parameters are offloaded:

  • rate - The speed with which the queued traffic will be sent.
  • burst - The number of bytes of traffic that is dequeued before the shaper rate takes effect.

NOTE: other parameters are not offloaded.

Hardware statistics are also supported. To see them, use the show command, which displays the current configuration together with the TBF statistics:

tc -s qdisc show dev swp1

The following example attaches a TBF qdisc under band 0 of an ETS parent whose handle is 10. It configures a 400 Mbit/s shaper with a burst size of 128 KiB:

tc qdisc add dev sw1p1 parent 10:1 handle 101: tbf rate 400Mbit burst 128K limit 1M

Limitations

  • Max burst value that can be set is 16M (4M for AlleyCat5X) and should be a multiple of 4k.
  • The hardware may adjust the rate value set by the user. The actual value is not shown in the show command.
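
A burst value can be checked against the first limitation before it is passed to tc. The sketch below uses a hypothetical helper name and assumes that "4k" means 4096 bytes and that "16M" and "4M" mean 16 MiB and 4 MiB; these interpretations are not confirmed by this page.

```shell
# check_burst BYTES [MAX_BYTES]
# Succeeds if BYTES is positive, a multiple of 4096, and not above MAX_BYTES.
# MAX_BYTES defaults to 16 MiB; pass 4194304 (4 MiB) for AlleyCat5X.
check_burst() {
    burst=$1
    max=${2:-16777216}
    [ "$burst" -gt 0 ] &&
        [ "$burst" -le "$max" ] &&
        [ $((burst % 4096)) -eq 0 ]
}

# Example: 128K (131072 bytes), as used in the TBF example above.
if check_burst 131072; then
    echo "burst 131072 is acceptable"
fi
```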

Congestion Avoidance (RED qdisc)

Congestion avoidance is implemented by the RED queuing discipline. RED is configured on the ETS qdisc bands only. For more information see tc-red(8).

On the hardware, the algorithm is implemented using a WRED mechanism.

The following parameters of RED qdisc are offloaded:

  • min - The minimum queue size.
  • max - The maximum limit.
  • probability - The probability to drop a packet when the average queue size reaches the maximum limit. 1.0 means 100%.

NOTE: other parameters are not offloaded; however, the user may still be required to provide them.

To see RED statistics on each band (ETS bands and its child RED qdiscs), use the qdisc show command on the port. This command displays the current qdisc configuration including queue statistics.

tc -s qdisc show dev sw1p1

NOTE: Currently, only sent and drop counters are supported.

The following example attaches a RED qdisc under band 0 of an ETS parent whose handle is 10. Between the queue depths of 500KiB and 1.5MiB, the probability of dropping packets gradually rises from 0 to 10%.

tc qdisc add dev sw1p1 parent 10:1 handle 101: red limit 2M avpkt 1000 probability 0.1 min 500K max 1.5M
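
The gradual rise between min and max can be illustrated with a linear interpolation. This is a sketch of the textbook RED curve, not of the hardware WRED implementation: red_drop_prob is a hypothetical helper, and treating every queue depth above max as a certain drop is an assumption.

```shell
# red_drop_prob QUEUE_BYTES MIN_BYTES MAX_BYTES PROBABILITY
# Prints the modeled drop probability with three decimals.
red_drop_prob() {
    LC_ALL=C awk -v q="$1" -v min="$2" -v max="$3" -v p="$4" 'BEGIN {
        if (q <= min)     d = 0                              # below min: no drops
        else if (q > max) d = 1                              # assumption: drop all
        else              d = p * (q - min) / (max - min)    # linear ramp up to p
        printf "%.3f\n", d
    }'
}

# Example: min 500K (512000 bytes), max 1.5M (1572864 bytes), probability 0.1,
# evaluated halfway between min and max.
red_drop_prob 1042432 512000 1572864 0.1
```

Halfway between min and max the model gives half of the configured probability, that is 0.050 for the example values.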

Minimum and Maximum limits

The table below describes max/min value of the queue size for each supported platform:

Platform    Max value (bytes)                 Min value (bytes)                        Probability of dropping packets
AlleyCat3X  2560 buffers of 256 = 655360      Max value – 64 buffers of 256 = 16128    1.0
Aldrin2     36960 buffers of 256 = 9461760    Max value – 64 buffers of 256 = 16128    1.0
AlleyCat5X  22609 buffers of 256 = 5787904    0 up to Max value                        0 – 1.0

NOTE: Max and Min value are multiples of 256.
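
Because limits are expressed in 256-byte buffers, a byte value can be converted or aligned with simple integer arithmetic. The helper names below are hypothetical, and rounding down (rather than up) is a choice made for this sketch.

```shell
# buffers_to_bytes BUFFERS
# Converts a number of 256-byte buffers to bytes.
buffers_to_bytes() {
    echo $(($1 * 256))
}

# align_down_256 BYTES
# Rounds a byte value down to the nearest multiple of 256.
align_down_256() {
    echo $(($1 / 256 * 256))
}

# Example: the AlleyCat3X maximum of 2560 buffers, and an unaligned value.
buffers_to_bytes 2560
align_down_256 115000
```

The first call prints 655360, matching the AlleyCat3X row above, and the second prints 114944.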

Recommended values

The following table lists the recommended values. These were the values used for internal testing.

Platform    Max value (bytes)  Min value (bytes)  Probability
AlleyCat3X  131072             114944             1.0
Aldrin2     131072             114944             1.0
AlleyCat5X  131072             114944             0.73

Limitations

  • Only 16 different RED profiles (one per port) are available on the system.
  • Profiles are not shared among ports.
  • At least one profile is reserved for default configuration (RED qdisc is removed).
  • On some platforms, more profiles are reserved for internal purposes.
  • Congestion avoidance (CA) configuration for multicast traffic is not supported.