summaryrefslogtreecommitdiff
path: root/doc/guides/prog_guide/qos_framework.rst
blob: a7527b21ca4767de50a3b1bb9e89b73e8b0d3351 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

Quality of Service (QoS) Framework
==================================

This chapter describes the DPDK Quality of Service (QoS) framework.

Packet Pipeline with QoS Support
--------------------------------

An example of a complex packet processing pipeline with QoS support is shown in the following figure.

.. _figure_pkt_proc_pipeline_qos:

.. figure:: img/pkt_proc_pipeline_qos.*

   Complex Packet Processing Pipeline with QoS Support


This pipeline can be built using reusable DPDK software libraries.
The main blocks implementing QoS in this pipeline are: the policer, the dropper and the scheduler.
A functional description of each block is provided in the following table.

.. _table_qos_1:

.. table:: Packet Processing Pipeline Implementing QoS

   +---+------------------------+--------------------------------------------------------------------------------+
   | # | Block                  | Functional Description                                                         |
   |   |                        |                                                                                |
   +===+========================+================================================================================+
   | 1 | Packet I/O RX & TX     | Packet reception/ transmission from/to multiple NIC ports. Poll mode drivers   |
   |   |                        | (PMDs) for Intel 1 GbE/10 GbE NICs.                                            |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 2 | Packet parser          | Identify the protocol stack of the input packet. Check the integrity of the    |
   |   |                        | packet headers.                                                                |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 3 | Flow classification    | Map the input packet to one of the known traffic flows. Exact match table      |
   |   |                        | lookup using configurable hash function (jhash, CRC and so on) and bucket      |
   |   |                        | logic to handle collisions.                                                    |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 4 | Policer                | Packet metering using srTCM (RFC 2697) or trTCM (RFC2698) algorithms.          |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 5 | Load Balancer          | Distribute the input packets to the application workers. Provide uniform load  |
   |   |                        | to each worker. Preserve the affinity of traffic flows to workers and the      |
   |   |                        | packet order within each flow.                                                 |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 6 | Worker threads         | Placeholders for the customer specific application workload (for example, IP   |
   |   |                        | stack and so on).                                                              |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 7 | Dropper                | Congestion management using the Random Early Detection (RED) algorithm         |
   |   |                        | (specified by the Sally Floyd - Van Jacobson paper) or Weighted RED (WRED).    |
   |   |                        | Drop packets based on the current scheduler queue load level and packet        |
   |   |                        | priority. When congestion is experienced, lower priority packets are dropped   |
   |   |                        | first.                                                                         |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 8 | Hierarchical Scheduler | 5-level hierarchical scheduler (levels are: output port, subport, pipe,        |
   |   |                        | traffic class and queue) with thousands (typically 64K) leaf nodes (queues).   |
   |   |                        | Implements traffic shaping (for subport and pipe levels), strict priority      |
   |   |                        | (for traffic class level) and Weighted Round Robin (WRR) (for queues within    |
   |   |                        | each pipe traffic class).                                                      |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+

The infrastructure blocks used throughout the packet processing pipeline are listed in the following table.

.. _table_qos_2:

.. table:: Infrastructure Blocks Used by the Packet Processing Pipeline

   +---+-----------------------+-----------------------------------------------------------------------+
   | # | Block                 | Functional Description                                                |
   |   |                       |                                                                       |
   +===+=======================+=======================================================================+
   | 1 | Buffer manager        | Support for global buffer pools and private per-thread buffer caches. |
   |   |                       |                                                                       |
   +---+-----------------------+-----------------------------------------------------------------------+
   | 2 | Queue manager         | Support for message passing between pipeline blocks.                  |
   |   |                       |                                                                       |
   +---+-----------------------+-----------------------------------------------------------------------+
   | 3 | Power saving          | Support for power saving during low activity periods.                 |
   |   |                       |                                                                       |
   +---+-----------------------+-----------------------------------------------------------------------+

The mapping of pipeline blocks to CPU cores is configurable based on the performance level required by each specific application
and the set of features enabled for each block.
Some blocks might consume more than one CPU core (with each CPU core running a different instance of the same block on different input packets),
while several other blocks could be mapped to the same CPU core.

Hierarchical Scheduler
----------------------

The hierarchical scheduler block, when present, usually sits on the TX side just before the transmission stage.
Its purpose is to prioritize the transmission of packets from different users and different traffic classes
according to the policy specified by the Service Level Agreements (SLAs) of each network node.

Overview
~~~~~~~~

The hierarchical scheduler block is similar to the traffic manager block used by network processors
that typically implement per flow (or per group of flows) packet queuing and scheduling.
It typically acts like a buffer that is able to temporarily store a large number of packets just before their transmission (enqueue operation);
as the NIC TX is requesting more packets for transmission,
these packets are later on removed and handed over to the NIC TX with the packet selection logic observing the predefined SLAs (dequeue operation).

.. _figure_hier_sched_blk:

.. figure:: img/hier_sched_blk.*

   Hierarchical Scheduler Block Internal Diagram


The hierarchical scheduler is optimized for a large number of packet queues.
When only a small number of queues are needed, message passing queues should be used instead of this block.
See `Worst Case Scenarios for Performance`_ for a more detailed discussion.

Scheduling Hierarchy
~~~~~~~~~~~~~~~~~~~~

The scheduling hierarchy is shown in :numref:`figure_sched_hier_per_port`.
The first level of the hierarchy is the Ethernet TX port 1/10/40 GbE,
with subsequent hierarchy levels defined as subport, pipe, traffic class and queue.

Typically, each subport represents a predefined group of users, while each pipe represents an individual user/subscriber.
Each traffic class is the representation of a different traffic type with specific loss rate,
delay and jitter requirements, such as voice, video or data transfers.
Each queue hosts packets from one or multiple connections of the same type belonging to the same user.

.. _figure_sched_hier_per_port:

.. figure:: img/sched_hier_per_port.*

   Scheduling Hierarchy per Port


The functionality of each hierarchical level is detailed in the following table.

.. _table_qos_3:

.. table:: Port Scheduling Hierarchy

   +---+--------------------+----------------------------+---------------------------------------------------------------+
   | # | Level              | Siblings per Parent        | Functional Description                                        |
   |   |                    |                            |                                                               |
   +===+====================+============================+===============================================================+
   | 1 | Port               | -                          | #.  Output Ethernet port 1/10/40 GbE.                         |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  Multiple ports are scheduled in round robin order with    |
   |   |                    |                            |     all ports having equal priority.                          |
   |   |                    |                            |                                                               |
   +---+--------------------+----------------------------+---------------------------------------------------------------+
   | 2 | Subport            | Configurable (default: 8)  | #.  Traffic shaping using token bucket algorithm (one token   |
   |   |                    |                            |     bucket per subport).                                      |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  Upper limit enforced per Traffic Class (TC) at the        |
   |   |                    |                            |     subport level.                                            |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  Lower priority TCs able to reuse subport bandwidth        |
   |   |                    |                            |     currently unused by higher priority TCs.                  |
   |   |                    |                            |                                                               |
   +---+--------------------+----------------------------+---------------------------------------------------------------+
   | 3 | Pipe               | Configurable (default: 4K) | #.  Traffic shaping using the token bucket algorithm (one     |
   |   |                    |                            |     token bucket per pipe.                                    |
   |   |                    |                            |                                                               |
   +---+--------------------+----------------------------+---------------------------------------------------------------+
   | 4 | Traffic Class (TC) | 4                          | #.  TCs of the same pipe handled in strict priority order.    |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  Upper limit enforced per TC at the pipe level.            |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  Lower priority TCs able to reuse pipe bandwidth currently |
   |   |                    |                            |     unused by higher priority TCs.                            |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  When subport TC is oversubscribed (configuration time     |
   |   |                    |                            |     event), pipe TC upper limit is capped to a dynamically    |
   |   |                    |                            |     adjusted value that is shared by all the subport pipes.   |
   |   |                    |                            |                                                               |
   +---+--------------------+----------------------------+---------------------------------------------------------------+
   | 5 | Queue              | 4                          | #.  Queues of the same TC are serviced using Weighted Round   |
   |   |                    |                            |     Robin (WRR) according to predefined weights.              |
   |   |                    |                            |                                                               |
   +---+--------------------+----------------------------+---------------------------------------------------------------+

Application Programming Interface (API)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Port Scheduler Configuration API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The rte_sched.h file contains configuration functions for port, subport and pipe.

Port Scheduler Enqueue API
^^^^^^^^^^^^^^^^^^^^^^^^^^

The port scheduler enqueue API is very similar to the API of the DPDK PMD TX function.

.. code-block:: c

    int rte_sched_port_enqueue(struct rte_sched_port *port, struct rte_mbuf **pkts, uint32_t n_pkts);

Port Scheduler Dequeue API
^^^^^^^^^^^^^^^^^^^^^^^^^^

The port scheduler dequeue API is very similar to the API of the DPDK PMD RX function.

.. code-block:: c

    int rte_sched_port_dequeue(struct rte_sched_port *port, struct rte_mbuf **pkts, uint32_t n_pkts);

Usage Example
^^^^^^^^^^^^^

.. code-block:: c

    /* File "application.c" */

    #define N_PKTS_RX   64
    #define N_PKTS_TX   48
    #define NIC_RX_PORT 0
    #define NIC_RX_QUEUE 0
    #define NIC_TX_PORT 1
    #define NIC_TX_QUEUE 0

    struct rte_sched_port *port = NULL;
    struct rte_mbuf *pkts_rx[N_PKTS_RX], *pkts_tx[N_PKTS_TX];
    uint32_t n_pkts_rx, n_pkts_tx;

    /* Initialization */

    <initialization code>

    /* Runtime */
    while (1) {
        /* Read packets from NIC RX queue */

        n_pkts_rx = rte_eth_rx_burst(NIC_RX_PORT, NIC_RX_QUEUE, pkts_rx, N_PKTS_RX);

        /* Hierarchical scheduler enqueue */

        rte_sched_port_enqueue(port, pkts_rx, n_pkts_rx);

        /* Hierarchical scheduler dequeue */

        n_pkts_tx = rte_sched_port_dequeue(port, pkts_tx, N_PKTS_TX);

        /* Write packets to NIC TX queue */

        rte_eth_tx_burst(NIC_TX_PORT, NIC_TX_QUEUE, pkts_tx, n_pkts_tx);
    }

Implementation
~~~~~~~~~~~~~~

Internal Data Structures per Port
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A schematic of the internal data structures in shown in with details in.

.. _figure_data_struct_per_port:

.. figure:: img/data_struct_per_port.*

    Internal Data Structures per Port


.. _table_qos_4:

.. table:: Scheduler Internal Data Structures per Port

   +---+----------------------+-------------------------+---------------------+------------------------------+---------------------------------------------------+
   | # | Data structure       | Size (bytes)            | # per port          | Access type                  | Description                                       |
   |   |                      |                         |                     |                              |                                                   |
   |   |                      |                         |                     +-------------+----------------+---------------------------------------------------+
   |   |                      |                         |                     | Enq         | Deq            |                                                   |
   |   |                      |                         |                     |             |                |                                                   |
   +===+======================+=========================+=====================+=============+================+===================================================+
   | 1 | Subport table entry  | 64                      | # subports per port | -           | Rd, Wr         | Persistent subport data (credits, etc).           |
   |   |                      |                         |                     |             |                |                                                   |
   +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
   | 2 | Pipe table entry     | 64                      | # pipes per port    | -           | Rd, Wr         | Persistent data for pipe, its TCs and its queues  |
   |   |                      |                         |                     |             |                | (credits, etc) that is updated during run-time.   |
   |   |                      |                         |                     |             |                |                                                   |
   |   |                      |                         |                     |             |                | The pipe configuration parameters do not change   |
   |   |                      |                         |                     |             |                | during run-time. The same pipe configuration      |
   |   |                      |                         |                     |             |                | parameters are shared by multiple pipes,          |
   |   |                      |                         |                     |             |                | therefore they are not part of pipe table entry.  |
   |   |                      |                         |                     |             |                |                                                   |
   +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
   | 3 | Queue table entry    | 4                       | #queues per port    | Rd, Wr      | Rd, Wr         | Persistent queue data (read and write pointers).  |
   |   |                      |                         |                     |             |                | The queue size is the same per TC for all queues, |
   |   |                      |                         |                     |             |                | allowing the queue base address to be computed    |
   |   |                      |                         |                     |             |                | using a fast formula, so these two parameters are |
   |   |                      |                         |                     |             |                | not part of queue table entry.                    |
   |   |                      |                         |                     |             |                |                                                   |
   |   |                      |                         |                     |             |                | The queue table entries for any given pipe are    |
   |   |                      |                         |                     |             |                | stored in the same cache line.                    |
   |   |                      |                         |                     |             |                |                                                   |
   +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
   | 4 | Queue storage area   | Config (default: 64 x8) | # queues per port   | Wr          | Rd             | Array of elements per queue; each element is 8    |
   |   |                      |                         |                     |             |                | byte in size (mbuf pointer).                      |
   |   |                      |                         |                     |             |                |                                                   |
   +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
   | 5 | Active queues bitmap | 1 bit per queue         | 1                   | Wr (Set)    | Rd, Wr (Clear) | The bitmap maintains one status bit per queue:    |
   |   |                      |                         |                     |             |                | queue not active (queue is empty) or queue active |
   |   |                      |                         |                     |             |                | (queue is not empty).                             |
   |   |                      |                         |                     |             |                |                                                   |
   |   |                      |                         |                     |             |                | Queue bit is set by the scheduler enqueue and     |
   |   |                      |                         |                     |             |                | cleared by the scheduler dequeue when queue       |
   |   |                      |                         |                     |             |                | becomes empty.                                    |
   |   |                      |                         |                     |             |                |                                                   |
   |   |                      |                         |                     |             |                | Bitmap scan operation returns the next non-empty  |
   |   |                      |                         |                     |             |                | pipe and its status (16-bit mask of active queue  |
   |   |                      |                         |                     |             |                | in the pipe).                                     |
   |   |                      |                         |                     |             |                |                                                   |
   +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
   | 6 | Grinder              | ~128                    | Config (default: 8) | -           | Rd, Wr         | Short list of active pipes currently under        |
   |   |                      |                         |                     |             |                | processing. The grinder contains temporary data   |
   |   |                      |                         |                     |             |                | during pipe processing.                           |
   |   |                      |                         |                     |             |                |                                                   |
   |   |                      |                         |                     |             |                | Once the current pipe exhausts packets or         |
   |   |                      |                         |                     |             |                | credits, it is replaced with another active pipe  |
   |   |                      |                         |                     |             |                | from the bitmap.                                  |
   |   |                      |                         |                     |             |                |                                                   |
   +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+

Multicore Scaling Strategy
^^^^^^^^^^^^^^^^^^^^^^^^^^

The multicore scaling strategy is:

#.  Running different physical ports on different threads. The enqueue and dequeue of the same port are run by the same thread.

#.  Splitting the same physical port to different threads by running different sets of subports of the same physical port (virtual ports) on different threads.
    Similarly, a subport can be split into multiple subports that are each run by a different thread.
    The enqueue and dequeue of the same port are run by the same thread.
    This is only required if, for performance reasons, it is not possible to handle a full port with a single core.

Enqueue and Dequeue for the Same Output Port
""""""""""""""""""""""""""""""""""""""""""""

Running enqueue and dequeue operations for the same output port from different cores is likely to cause significant impact on scheduler's performance
and it is therefore not recommended.

The port enqueue and dequeue operations share access to the following data structures:

#.  Packet descriptors

#.  Queue table

#.  Queue storage area

#.  Bitmap of active queues

The expected drop in performance is due to:

#.  Need to make the queue and bitmap operations thread safe,
    which requires either using locking primitives for access serialization (for example, spinlocks/ semaphores) or
    using atomic primitives for lockless access (for example, Test and Set, Compare And Swap, an so on).
    The impact is much higher in the former case.

#.  Ping-pong of cache lines storing the shared data structures between the cache hierarchies of the two cores
    (done transparently by the MESI protocol cache coherency CPU hardware).

Therefore, the scheduler enqueue and dequeue operations have to be run from the same thread,
which allows the queues and the bitmap operations to be non-thread safe and
keeps the scheduler data structures internal to the same core.

Performance Scaling
"""""""""""""""""""

Scaling up the number of NIC ports simply requires a proportional increase in the number of CPU cores to be used for traffic scheduling.

Enqueue Pipeline
^^^^^^^^^^^^^^^^

The sequence of steps per packet:

#.  *Access* the mbuf to read the data fields required to identify the destination queue for the packet.
    These fields are: port, subport, traffic class and queue within traffic class, and are typically set by the classification stage.

#.  *Access* the queue structure to identify the write location in the queue array.
    If the queue is full, then the packet is discarded.

#.  *Access* the queue array location to store the packet (i.e. write the mbuf pointer).

It should be noted the strong data dependency between these steps, as steps 2 and 3 cannot start before the result from steps 1 and 2 becomes available,
which prevents the processor out of order execution engine to provide any significant performance optimizations.

Given the high rate of input packets and the large amount of queues,
it is expected that the data structures accessed to enqueue the current packet are not present
in the L1 or L2 data cache of the current core, thus the above 3 memory accesses would result (on average) in L1 and L2 data cache misses.
A number of 3 L1/L2 cache misses per packet is not acceptable for performance reasons.

The workaround is to prefetch the required data structures in advance. The prefetch operation has an execution latency during which
the processor should not attempt to access the data structure currently under prefetch, so the processor should execute other work.
The only other work available is to execute different stages of the enqueue sequence of operations on other input packets,
thus resulting in a pipelined implementation for the enqueue operation.

:numref:`figure_prefetch_pipeline` illustrates a pipelined implementation for the enqueue operation with 4 pipeline stages and each stage executing 2 different input packets.
No input packet can be part of more than one pipeline stage at a given time.

.. _figure_prefetch_pipeline:

.. figure:: img/prefetch_pipeline.*

    Prefetch Pipeline for the Hierarchical Scheduler Enqueue Operation


The congestion management scheme implemented by the enqueue pipeline described above is very basic:
packets are enqueued until a specific queue becomes full,
then all the packets destined to the same queue are dropped until packets are consumed (by the dequeue operation).
This can be improved by enabling RED/WRED as part of the enqueue pipeline which looks at the queue occupancy and
packet priority in order to yield the enqueue/drop decision for a specific packet
(as opposed to enqueuing all packets / dropping all packets indiscriminately).

Dequeue State Machine
^^^^^^^^^^^^^^^^^^^^^

The sequence of steps to schedule the next packet from the current pipe is:

#.  Identify the next active pipe using the bitmap scan operation, *prefetch* pipe.

#.  *Read* pipe data structure. Update the credits for the current pipe and its subport.
    Identify the first active traffic class within the current pipe, select the next queue using WRR,
    *prefetch* queue pointers for all the 16 queues of the current pipe.

#.  *Read* next element from the current WRR queue and *prefetch* its packet descriptor.

#.  *Read* the packet length from the packet descriptor (mbuf structure).
    Based on the packet length and the available credits (of current pipe, pipe traffic class, subport and subport traffic class),
    take the go/no go scheduling decision for the current packet.

To avoid the cache misses, the above data structures (pipe, queue, queue array, mbufs) are prefetched in advance of being accessed.
The strategy of hiding the latency of the prefetch operations is to switch from the current pipe (in grinder A) to another pipe
(in grinder B) immediately after a prefetch is issued for the current pipe.
This gives enough time to the prefetch operation to complete before the execution switches back to this pipe (in grinder A).

The dequeue pipe state machine exploits the data presence into the processor cache,
therefore it tries to send as many packets from the same pipe TC and pipe as possible (up to the available packets and credits) before
moving to the next active TC from the same pipe (if any) or to another active pipe.

.. _figure_pipe_prefetch_sm:

.. figure:: img/pipe_prefetch_sm.*

   Pipe Prefetch State Machine for the Hierarchical Scheduler Dequeue
   Operation


Timing and Synchronization
^^^^^^^^^^^^^^^^^^^^^^^^^^

The output port is modeled as a conveyor belt of byte slots that need to be filled by the scheduler with data for transmission.
For 10 GbE, there are 1.25 billion byte slots that need to be filled by the port scheduler every second.
If the scheduler is not fast enough to fill the slots, provided that enough packets and credits exist,
then some slots will be left unused and bandwidth will be wasted.

In principle, the hierarchical scheduler dequeue operation should be triggered by NIC TX.
Usually, once the occupancy of the NIC TX input queue drops below a predefined threshold,
the port scheduler is woken up (interrupt based or polling based,
by continuously monitoring the queue occupancy) to push more packets into the queue.

Internal Time Reference
"""""""""""""""""""""""

The scheduler needs to keep track of time advancement for the credit logic,
which requires credit updates based on time (for example, subport and pipe traffic shaping, traffic class upper limit enforcement, and so on).

Every time the scheduler decides to send a packet out to the NIC TX for transmission, the scheduler will increment its internal time reference accordingly.
Therefore, it is convenient to keep the internal time reference in units of bytes,
where a byte signifies the time duration required by the physical interface to send out a byte on the transmission medium.
This way, as a packet is scheduled for transmission, the time is incremented with (n + h),
where n is the packet length in bytes and h is the number of framing overhead bytes per packet.

Internal Time Reference Re-synchronization
""""""""""""""""""""""""""""""""""""""""""

The scheduler needs to align its internal time reference to the pace of the port conveyor belt.
The reason is to make sure that the scheduler does not feed the NIC TX with more bytes than the line rate of the physical medium in order to prevent packet drop
(by the scheduler, due to the NIC TX input queue being full, or later on, internally by the NIC TX).

The scheduler reads the current time on every dequeue invocation.
The CPU time stamp can be obtained by reading either the Time Stamp Counter (TSC) register or the High Precision Event Timer (HPET) register.
The current CPU time stamp is converted from number of CPU clocks to number of bytes:
*time_bytes = time_cycles / cycles_per_byte, where cycles_per_byte*
is the amount of CPU cycles that is equivalent to the transmission time for one byte on the wire
(e.g. for a CPU frequency of 2 GHz and a 10GbE port,*cycles_per_byte = 1.6*).

The scheduler maintains an internal time reference of the NIC time.
Whenever a packet is scheduled, the NIC time is incremented with the packet length (including framing overhead).
On every dequeue invocation, the scheduler checks its internal reference of the NIC time against the current time:

#. If NIC time is in the future (NIC time >= current time), no adjustment of NIC time is needed.
   This means that scheduler is able to schedule NIC packets before the NIC actually needs those packets, so the NIC TX is well supplied with packets;

#. If NIC time is in the past (NIC time < current time), then NIC time should be adjusted by setting it to the current time.
   This means that the scheduler is not able to keep up with the speed of the NIC byte conveyor belt,
   so NIC bandwidth is wasted due to poor packet supply to the NIC TX.

Scheduler Accuracy and Granularity
""""""""""""""""""""""""""""""""""

The scheduler round trip delay (SRTD) is the time (number of CPU cycles) between two consecutive examinations of the same pipe by the scheduler.

To keep up with the output port (that is, avoid bandwidth loss),
the scheduler should be able to schedule n packets faster than the same n packets are transmitted by NIC TX.

The scheduler needs to keep up with the rate of each individual pipe,
as configured for the pipe token bucket, assuming that no port oversubscription is taking place.
This means that the size of the pipe token bucket should be set high enough to prevent it from overflowing due to big SRTD,
as this would result in credit loss (and therefore bandwidth loss) for the pipe.

Credit Logic
^^^^^^^^^^^^

Scheduling Decision
"""""""""""""""""""

The scheduling decision to send next packet from (subport S, pipe P, traffic class TC, queue Q) is favorable (packet is sent)
when all the conditions below are met:

*   Pipe P of subport S is currently selected by one of the port grinders;

*   Traffic class TC is the highest priority active traffic class of pipe P;

*   Queue Q is the next queue selected by WRR within traffic class TC of pipe P;

*   Subport S has enough credits to send the packet;

*   Subport S has enough credits for traffic class TC to send the packet;

*   Pipe P has enough credits to send the packet;

*   Pipe P has enough credits for traffic class TC to send the packet.

If all the above conditions are met,
then the packet is selected for transmission and the necessary credits are subtracted from subport S,
subport S traffic class TC, pipe P, pipe P traffic class TC.

Framing Overhead
""""""""""""""""

As the greatest common divisor for all packet lengths is one byte, the unit of credit is selected as one byte.
The number of credits required for the transmission of a packet of n bytes is equal to (n+h),
where h is equal to the number of framing overhead bytes per packet.

.. _table_qos_5:

.. table:: Ethernet Frame Overhead Fields

   +---+--------------------------------+----------------+---------------------------------------------------------------------------+
   | # | Packet field                   | Length (bytes) | Comments                                                                  |
   |   |                                |                |                                                                           |
   +===+================================+================+===========================================================================+
   | 1 | Preamble                       | 7              |                                                                           |
   |   |                                |                |                                                                           |
   +---+--------------------------------+----------------+---------------------------------------------------------------------------+
   | 2 | Start of Frame Delimiter (SFD) | 1              |                                                                           |
   |   |                                |                |                                                                           |
   +---+--------------------------------+----------------+---------------------------------------------------------------------------+
   | 3 | Frame Check Sequence (FCS)     | 4              | Considered overhead only if not included in the mbuf packet length field. |
   |   |                                |                |                                                                           |
   +---+--------------------------------+----------------+---------------------------------------------------------------------------+
   | 4 | Inter Frame Gap (IFG)          | 12             |                                                                           |
   |   |                                |                |                                                                           |
   +---+--------------------------------+----------------+---------------------------------------------------------------------------+
   | 5 | Total                          | 24             |                                                                           |
   |   |                                |                |                                                                           |
   +---+--------------------------------+----------------+---------------------------------------------------------------------------+

Traffic Shaping
"""""""""""""""

The traffic shaping for subport and pipe is implemented using a token bucket per subport/per pipe.
Each token bucket is implemented using one saturated counter that keeps track of the number of available credits.

The token bucket generic parameters and operations are presented in :numref:`table_qos_6` and :numref:`table_qos_7`.

.. _table_qos_6:

.. table:: Token Bucket Generic Parameters

   +---+------------------------+--------------------+---------------------------------------------------------+
   | # | Token Bucket Parameter | Unit               | Description                                             |
   |   |                        |                    |                                                         |
   +===+========================+====================+=========================================================+
   | 1 | bucket_rate            | Credits per second | Rate of adding credits to the bucket.                   |
   |   |                        |                    |                                                         |
   +---+------------------------+--------------------+---------------------------------------------------------+
   | 2 | bucket_size            | Credits            | Max number of credits that can be stored in the bucket. |
   |   |                        |                    |                                                         |
   +---+------------------------+--------------------+---------------------------------------------------------+

.. _table_qos_7:

.. table:: Token Bucket Generic Operations

   +---+------------------------+------------------------------------------------------------------------------+
   | # | Token Bucket Operation | Description                                                                  |
   |   |                        |                                                                              |
   +===+========================+==============================================================================+
   | 1 | Initialization         | Bucket set to a predefined value, e.g. zero or half of the bucket size.      |
   |   |                        |                                                                              |
   +---+------------------------+------------------------------------------------------------------------------+
   | 2 | Credit update          | Credits are added to the bucket on top of existing ones, either periodically |
   |   |                        | or on demand, based on the bucket_rate. Credits cannot exceed the upper      |
   |   |                        | limit defined by the bucket_size, so any credits to be added to the bucket   |
   |   |                        | while the bucket is full are dropped.                                        |
   |   |                        |                                                                              |
   +---+------------------------+------------------------------------------------------------------------------+
   | 3 | Credit consumption     | As result of packet scheduling, the necessary number of credits is removed   |
   |   |                        | from the bucket. The packet can only be sent if enough credits are in the    |
   |   |                        | bucket to send the full packet (packet bytes and framing overhead for the    |
   |   |                        | packet).                                                                     |
   |   |                        |                                                                              |
   +---+------------------------+------------------------------------------------------------------------------+

To implement the token bucket generic operations described above,
the current design uses the persistent data structure presented in :numref:`table_qos_8`,
while the implementation of the token bucket operations is described in :numref:`table_qos_9`.

.. _table_qos_8:

.. table:: Token Bucket Persistent Data Structure

   +---+------------------------+-------+----------------------------------------------------------------------+
   | # | Token bucket field     | Unit  | Description                                                          |
   |   |                        |       |                                                                      |
   +===+========================+=======+======================================================================+
   | 1 | tb_time                | Bytes | Time of the last credit update. Measured in bytes instead of seconds |
   |   |                        |       | or CPU cycles for ease of credit consumption operation               |
   |   |                        |       | (as the current time is also maintained in bytes).                   |
   |   |                        |       |                                                                      |
   |   |                        |       | See  Section 26.2.4.5.1 "Internal Time Reference" for an             |
   |   |                        |       | explanation of why the time is maintained in byte units.             |
   |   |                        |       |                                                                      |
   +---+------------------------+-------+----------------------------------------------------------------------+
   | 2 | tb_period              | Bytes | Time period that should elapse since the last credit update in order |
   |   |                        |       | for the bucket to be awarded tb_credits_per_period worth or credits. |
   |   |                        |       |                                                                      |
   +---+------------------------+-------+----------------------------------------------------------------------+
   | 3 | tb_credits_per_period  | Bytes | Credit allowance per tb_period.                                      |
   |   |                        |       |                                                                      |
   +---+------------------------+-------+----------------------------------------------------------------------+
   | 4 | tb_size                | Bytes | Bucket size, i.e. upper limit for the tb_credits.                    |
   |   |                        |       |                                                                      |
   +---+------------------------+-------+----------------------------------------------------------------------+
   | 5 | tb_credits             | Bytes | Number of credits currently in the bucket.                           |
   |   |                        |       |                                                                      |
   +---+------------------------+-------+----------------------------------------------------------------------+

The bucket rate (in bytes per second) can be computed with the following formula:

*bucket_rate = (tb_credits_per_period / tb_period) * r*

where, r = port line rate (in bytes per second).

.. _table_qos_9:

.. table:: Token Bucket Operations

   +---+-------------------------+-----------------------------------------------------------------------------+
   | # | Token bucket operation  | Description                                                                 |
   |   |                         |                                                                             |
   +===+=========================+=============================================================================+
   | 1 | Initialization          | *tb_credits = 0; or tb_credits = tb_size / 2;*                              |
   |   |                         |                                                                             |
   +---+-------------------------+-----------------------------------------------------------------------------+
   | 2 | Credit update           | Credit update options:                                                      |
   |   |                         |                                                                             |
   |   |                         | *   Every time a packet is sent for a port, update the credits of all the   |
   |   |                         |     the subports and pipes of that port. Not feasible.                      |
   |   |                         |                                                                             |
   |   |                         | *   Every time a packet is sent, update the credits for the pipe and        |
   |   |                         |     subport. Very accurate, but not needed (a lot of calculations).         |
   |   |                         |                                                                             |
   |   |                         | *   Every time a pipe is selected (that is, picked by one                   |
   |   |                         |     of the grinders), update the credits for the pipe and its subport.      |
   |   |                         |                                                                             |
   |   |                         | The current implementation is using option 3.  According to Section         |
   |   |                         | `Dequeue State Machine`_, the pipe and subport credits are                  |
   |   |                         | updated every time a pipe is selected by the dequeue process before the     |
   |   |                         | pipe and subport credits are actually used.                                 |
   |   |                         |                                                                             |
   |   |                         | The implementation uses a tradeoff between accuracy and speed by updating   |
   |   |                         | the bucket credits only when at least a full *tb_period*  has elapsed since |
   |   |                         | the last update.                                                            |
   |   |                         |                                                                             |
   |   |                         | *   Full accuracy can be achieved by selecting the value for *tb_period*    |
   |   |                         |     for which  *tb_credits_per_period = 1*.                                 |
   |   |                         |                                                                             |
   |   |                         | *   When full accuracy is not required, better performance is achieved by   |
   |   |                         |     setting *tb_credits* to a larger value.                                 |
   |   |                         |                                                                             |
   |   |                         | Update operations:                                                          |
   |   |                         |                                                                             |
   |   |                         | *   n_periods = (time - tb_time) / tb_period;                               |
   |   |                         |                                                                             |
   |   |                         | *   tb_credits += n_periods * tb_credits_per_period;                        |
   |   |                         |                                                                             |
   |   |                         | *   tb_credits = min(tb_credits, tb_size);                                  |
   |   |                         |                                                                             |
   |   |                         | *   tb_time += n_periods * tb_period;                                       |
   |   |                         |                                                                             |
   +---+-------------------------+-----------------------------------------------------------------------------+
   | 3 | Credit consumption      | As result of packet scheduling, the necessary number of credits is removed  |
   |   |  (on packet scheduling) | from the bucket. The packet can only be sent if enough credits are in the   |
   |   |                         | bucket to send the full packet (packet bytes and framing overhead for the   |
   |   |                         | packet).                                                                    |
   |   |                         |                                                                             |
   |   |                         | Scheduling operations:                                                      |
   |   |                         |                                                                             |
   |   |                         | pkt_credits = pkt_len + frame_overhead;                                     |
   |   |                         | if (tb_credits >= pkt_credits){tb_credits -= pkt_credits;}                  |
   |   |                         |                                                                             |
   +---+-------------------------+-----------------------------------------------------------------------------+

Traffic Classes
"""""""""""""""

Implementation of Strict Priority Scheduling
''''''''''''''''''''''''''''''''''''''''''''

Strict priority scheduling of traffic classes within the same pipe is implemented by the pipe dequeue state machine,
which selects the queues in ascending order.
Therefore, queues 0..3 (associated with TC 0, highest priority TC) are handled before
queues 4..7 (TC 1, lower priority than TC 0),
which are handled before queues 8..11 (TC 2),
which are handled before queues 12..15 (TC 3, lowest priority TC).

Upper Limit Enforcement
'''''''''''''''''''''''

The traffic classes at the pipe and subport levels are not traffic shaped,
so there is no token bucket maintained in this context.
The upper limit for the traffic classes at the subport and
pipe levels is enforced by periodically refilling the subport / pipe traffic class credit counter,
out of which credits are consumed every time a packet is scheduled for that subport / pipe,
as described in :numref:`table_qos_10` and :numref:`table_qos_11`.

.. _table_qos_10:

.. table:: Subport/Pipe Traffic Class Upper Limit Enforcement Persistent Data Structure

   +---+-----------------------+-------+-----------------------------------------------------------------------+
   | # | Subport or pipe field | Unit  | Description                                                           |
   |   |                       |       |                                                                       |
   +===+=======================+=======+=======================================================================+
   | 1 | tc_time               | Bytes | Time of the next update (upper limit refill) for the 4 TCs of the     |
   |   |                       |       | current subport / pipe.                                               |
   |   |                       |       |                                                                       |
   |   |                       |       | See  Section `Internal Time Reference`_ for the                       |
   |   |                       |       | explanation of why the time is maintained in byte units.              |
   |   |                       |       |                                                                       |
   +---+-----------------------+-------+-----------------------------------------------------------------------+
   | 2 | tc_period             | Bytes | Time between two consecutive updates for the 4 TCs of the current     |
   |   |                       |       | subport / pipe. This is expected to be many times bigger than the     |
   |   |                       |       | typical value of the token bucket tb_period.                          |
   |   |                       |       |                                                                       |
   +---+-----------------------+-------+-----------------------------------------------------------------------+
   | 3 | tc_credits_per_period | Bytes | Upper limit for the number of credits allowed to be consumed by the   |
   |   |                       |       | current TC during each enforcement period tc_period.                  |
   |   |                       |       |                                                                       |
   +---+-----------------------+-------+-----------------------------------------------------------------------+
   | 4 | tc_credits            | Bytes | Current upper limit for the number of credits that can be consumed by |
   |   |                       |       | the current traffic class for the remainder of the current            |
   |   |                       |       | enforcement period.                                                   |
   |   |                       |       |                                                                       |
   +---+-----------------------+-------+-----------------------------------------------------------------------+

.. _table_qos_11:

.. table:: Subport/Pipe Traffic Class Upper Limit Enforcement Operations

   +---+--------------------------+----------------------------------------------------------------------------+
   | # | Traffic Class Operation  | Description                                                                |
   |   |                          |                                                                            |
   +===+==========================+============================================================================+
   | 1 | Initialization           | tc_credits = tc_credits_per_period;                                        |
   |   |                          |                                                                            |
   |   |                          | tc_time = tc_period;                                                       |
   |   |                          |                                                                            |
   +---+--------------------------+----------------------------------------------------------------------------+
   | 2 | Credit update            | Update operations:                                                         |
   |   |                          |                                                                            |
   |   |                          | if (time >= tc_time) {                                                     |
   |   |                          |                                                                            |
   |   |                          | tc_credits = tc_credits_per_period;                                        |
   |   |                          |                                                                            |
   |   |                          | tc_time = time + tc_period;                                                |
   |   |                          |                                                                            |
   |   |                          | }                                                                          |
   |   |                          |                                                                            |
   +---+--------------------------+----------------------------------------------------------------------------+
   | 3 | Credit consumption       | As result of packet scheduling, the TC limit is decreased with the         |
   |   | (on packet scheduling)   | necessary number of credits. The packet can only be sent if enough credits |
   |   |                          | are currently available in the TC limit to send the full packet            |
   |   |                          | (packet bytes and framing overhead for the packet).                        |
   |   |                          |                                                                            |
   |   |                          | Scheduling operations:                                                     |
   |   |                          |                                                                            |
   |   |                          | pkt_credits = pk_len + frame_overhead;                                     |
   |   |                          |                                                                            |
   |   |                          | if (tc_credits >= pkt_credits) {tc_credits -= pkt_credits;}                |
   |   |                          |                                                                            |
   +---+--------------------------+----------------------------------------------------------------------------+

Weighted Round Robin (WRR)
""""""""""""""""""""""""""

The evolution of the WRR design solution from simple to complex is shown in :numref:`table_qos_12`.

.. _table_qos_12:

.. table:: Weighted Round Robin (WRR)

   +---+------------+-----------------+-------------+----------------------------------------------------------+
   | # | All Queues | Equal Weights   | All Packets | Strategy                                                 |
   |   | Active?    | for All Queues? | Equal?      |                                                          |
   +===+============+=================+=============+==========================================================+
   | 1 | Yes        | Yes             | Yes         | **Byte level round robin**                               |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | *Next queue*  queue #i, i =  *(i + 1) % n*               |
   |   |            |                 |             |                                                          |
   +---+------------+-----------------+-------------+----------------------------------------------------------+
   | 2 | Yes        | Yes             | No          | **Packet level round robin**                             |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | Consuming one byte from queue #i requires consuming      |
   |   |            |                 |             | exactly one token for queue #i.                          |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | T(i) = Accumulated number of tokens previously consumed  |
   |   |            |                 |             | from queue #i. Every time a packet is consumed from      |
   |   |            |                 |             | queue #i, T(i) is updated as: T(i) += *pkt_len*.         |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | *Next queue* : queue with the smallest T.                |
   |   |            |                 |             |                                                          |
   |   |            |                 |             |                                                          |
   +---+------------+-----------------+-------------+----------------------------------------------------------+
   | 3 | Yes        | No              | No          | **Packet level weighted round robin**                    |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | This case can be reduced to the previous case by         |
   |   |            |                 |             | introducing a cost per byte that is different for each   |
   |   |            |                 |             | queue. Queues with lower weights have a higher cost per  |
   |   |            |                 |             | byte. This way, it is still meaningful to compare the    |
   |   |            |                 |             | consumption amongst different queues in order to select  |
   |   |            |                 |             | the next queue.                                          |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | w(i) = Weight of queue #i                                |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | t(i) = Tokens per byte for queue #i, defined as the      |
   |   |            |                 |             | inverse weight of queue #i.                              |
   |   |            |                 |             | For example, if w[0..3] = [1:2:4:8],                     |
   |   |            |                 |             | then t[0..3] = [8:4:2:1]; if w[0..3] = [1:4:15:20],      |
   |   |            |                 |             | then t[0..3] = [60:15:4:3].                              |
   |   |            |                 |             | Consuming one byte from queue #i requires consuming t(i) |
   |   |            |                 |             | tokens for queue #i.                                     |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | T(i) = Accumulated number of tokens previously consumed  |
   |   |            |                 |             | from queue #i. Every time a packet is consumed from      |
   |   |            |                 |             | queue #i, T(i) is updated as:  *T(i) += pkt_len * t(i)*. |
   |   |            |                 |             | *Next queue* : queue with the smallest T.                |
   |   |            |                 |             |                                                          |
   +---+------------+-----------------+-------------+----------------------------------------------------------+
   | 4 | No         | No              | No          | **Packet level weighted round robin with variable queue  |
   |   |            |                 |             | status**                                                 |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | Reduce this case to the previous case by setting the     |
   |   |            |                 |             | consumption of inactive queues to a high number, so that |
   |   |            |                 |             | the inactive queues will never be selected by the        |
   |   |            |                 |             | smallest T logic.                                        |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | To prevent T from overflowing as result of successive    |
   |   |            |                 |             | accumulations, T(i) is truncated after each packet       |
   |   |            |                 |             | consumption for all queues.                              |
   |   |            |                 |             | For example, T[0..3] = [1000, 1100, 1200, 1300]          |
   |   |            |                 |             | is truncated to T[0..3] = [0, 100, 200, 300]             |
   |   |            |                 |             | by subtracting the min T from T(i), i = 0..n.            |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | This requires having at least one active queue in the    |
   |   |            |                 |             | set of input queues, which is guaranteed by the dequeue  |
   |   |            |                 |             | state machine never selecting an inactive traffic class. |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | *mask(i) = Saturation mask for queue #i, defined as:*    |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | mask(i) = (queue #i is active)? 0 : 0xFFFFFFFF;          |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | w(i) = Weight of queue #i                                |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | t(i) = Tokens per byte for queue #i, defined as the      |
   |   |            |                 |             | inverse weight of queue #i.                              |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | T(i) = Accumulated numbers of tokens previously consumed |
   |   |            |                 |             | from queue #i.                                           |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | *Next queue*  : queue with smallest T.                   |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | Before packet consumption from queue #i:                 |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | *T(i) |= mask(i)*                                        |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | After packet consumption from queue #i:                  |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | T(j) -= T(i), j != i                                     |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | T(i) = pkt_len * t(i)                                    |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | Note: T(j) uses the T(i) value before T(i) is updated.   |
   |   |            |                 |             |                                                          |
   +---+------------+-----------------+-------------+----------------------------------------------------------+

Subport Traffic Class Oversubscription
""""""""""""""""""""""""""""""""""""""

Problem Statement
'''''''''''''''''

Oversubscription for subport traffic class X is a configuration-time event that occurs when
more bandwidth is allocated for traffic class X at the level of subport member pipes than
allocated for the same traffic class at the parent subport level.

The existence of the oversubscription for a specific subport and
traffic class is solely the result of pipe and
subport-level configuration as opposed to being created due
to dynamic evolution of the traffic load at run-time (as congestion is).

When the overall demand for traffic class X for the current subport is low,
the existence of the oversubscription condition does not represent a problem,
as demand for traffic class X is completely satisfied for all member pipes.
However, this can no longer be achieved when the aggregated demand for traffic class X
for all subport member pipes exceeds the limit configured at the subport level.

Solution Space
''''''''''''''

summarizes some of the possible approaches for handling this problem,
with the third approach selected for implementation.

.. _table_qos_13:

.. table:: Subport Traffic Class Oversubscription

   +-----+---------------------------+-------------------------------------------------------------------------+
   | No. | Approach                  | Description                                                             |
   |     |                           |                                                                         |
   +=====+===========================+=========================================================================+
   | 1   | Don't care                | First come, first served.                                               |
   |     |                           |                                                                         |
   |     |                           | This approach is not fair amongst subport member pipes, as pipes that   |
   |     |                           | are served first will use up as much bandwidth for TC X as they need,   |
   |     |                           | while pipes that are served later will receive poor service due to      |
   |     |                           | bandwidth for TC X at the subport level being scarce.                   |
   |     |                           |                                                                         |
   +-----+---------------------------+-------------------------------------------------------------------------+
   | 2   | Scale down all pipes      | All pipes within the subport have their bandwidth limit for TC X scaled |
   |     |                           | down by the same factor.                                                |
   |     |                           |                                                                         |
   |     |                           | This approach is not fair among subport member pipes, as the low end    |
   |     |                           | pipes (that is, pipes configured with low bandwidth) can potentially    |
   |     |                           | experience severe service degradation that might render their service   |
   |     |                           | unusable (if available bandwidth for these pipes drops below the        |
   |     |                           | minimum requirements for a workable service), while the service         |
   |     |                           | degradation for high end pipes might not be noticeable at all.          |
   |     |                           |                                                                         |
   +-----+---------------------------+-------------------------------------------------------------------------+
   | 3   | Cap the high demand pipes | Each subport member pipe receives an equal share of the bandwidth       |
   |     |                           | available at run-time for TC X at the subport level. Any bandwidth left |
   |     |                           | unused by the low-demand pipes is redistributed in equal portions to    |
   |     |                           | the high-demand pipes. This way, the high-demand pipes are truncated    |
   |     |                           | while the low-demand pipes are not impacted.                            |
   |     |                           |                                                                         |
   +-----+---------------------------+-------------------------------------------------------------------------+

Typically, the subport TC oversubscription feature is enabled only for the lowest priority traffic class (TC 3),
which is typically used for best effort traffic,
with the management plane preventing this condition from occurring for the other (higher priority) traffic classes.

To ease implementation, it is also assumed that the upper limit for subport TC 3 is set to 100% of the subport rate,
and that the upper limit for pipe TC 3 is set to 100% of pipe rate for all subport member pipes.

Implementation Overview
'''''''''''''''''''''''

The algorithm computes a watermark, which is periodically updated based on the current demand experienced by the subport member pipes,
whose purpose is to limit the amount of traffic that each pipe is allowed to send for TC 3.
The watermark is computed at the subport level at the beginning of each traffic class upper limit enforcement period and
the same value is used by all the subport member pipes throughout the current enforcement period.
illustrates how the watermark computed as subport level at the beginning of each period is propagated to all subport member pipes.

At the beginning of the current enforcement period (which coincides with the end of the previous enforcement period),
the value of the watermark is adjusted based on the amount of bandwidth allocated to TC 3 at the beginning of the previous period that
was not left unused by the subport member pipes at the end of the previous period.

If there was subport TC 3 bandwidth left unused,
the value of the watermark for the current period is increased to encourage the subport member pipes to consume more bandwidth.
Otherwise, the value of the watermark is decreased to enforce equality of bandwidth consumption among subport member pipes for TC 3.

The increase or decrease in the watermark value is done in small increments,
so several enforcement periods might be required to reach the equilibrium state.
This state can change at any moment due to variations in the demand experienced by the subport member pipes for TC 3, for example,
as a result of demand increase (when the watermark needs to be lowered) or demand decrease (when the watermark needs to be increased).

When demand is low, the watermark is set high to prevent it from impeding the subport member pipes from consuming more bandwidth.
The highest value for the watermark is picked as the highest rate configured for a subport member pipe.
:numref:`table_qos_14` and :numref:`table_qos_15` illustrates the watermark operation.

.. _table_qos_14:

.. table:: Watermark Propagation from Subport Level to Member Pipes at the Beginning of Each Traffic Class Upper Limit Enforcement Period

   +-----+---------------------------------+----------------------------------------------------+
   | No. | Subport Traffic Class Operation | Description                                        |
   |     |                                 |                                                    |
   +=====+=================================+====================================================+
   | 1   | Initialization                  | **Subport level**: subport_period_id= 0            |
   |     |                                 |                                                    |
   |     |                                 | **Pipe level**: pipe_period_id = 0                 |
   |     |                                 |                                                    |
   +-----+---------------------------------+----------------------------------------------------+
   | 2   | Credit update                   | **Subport Level**:                                 |
   |     |                                 |                                                    |
   |     |                                 | if (time>=subport_tc_time)                         |
   |     |                                 |                                                    |
   |     |                                 | {                                                  |
   |     |                                 |     subport_wm = water_mark_update();              |
   |     |                                 |                                                    |
   |     |                                 |     subport_tc_time = time + subport_tc_period;    |
   |     |                                 |                                                    |
   |     |                                 |     subport_period_id++;                           |
   |     |                                 |                                                    |
   |     |                                 | }                                                  |
   |     |                                 |                                                    |
   |     |                                 | **Pipelevel:**                                     |
   |     |                                 |                                                    |
   |     |                                 | if(pipe_period_id != subport_period_id)            |
   |     |                                 |                                                    |
   |     |                                 | {                                                  |
   |     |                                 |                                                    |
   |     |                                 |     pipe_ov_credits = subport_wm \* pipe_weight;   |
   |     |                                 |                                                    |
   |     |                                 |     pipe_period_id = subport_period_id;            |
   |     |                                 |                                                    |
   |     |                                 | }                                                  |
   |     |                                 |                                                    |
   +-----+---------------------------------+----------------------------------------------------+
   | 3   | Credit consumption              | **Pipe level:**                                    |
   |     | (on packet scheduling)          |                                                    |
   |     |                                 | pkt_credits = pk_len + frame_overhead;             |
   |     |                                 |                                                    |
   |     |                                 | if(pipe_ov_credits >= pkt_credits{                 |
   |     |                                 |                                                    |
   |     |                                 |    pipe_ov_credits -= pkt_credits;                 |
   |     |                                 |                                                    |
   |     |                                 | }                                                  |
   |     |                                 |                                                    |
   +-----+---------------------------------+----------------------------------------------------+

.. _table_qos_15:

.. table:: Watermark Calculation

   +-----+------------------+----------------------------------------------------------------------------------+
   | No. | Subport Traffic  | Description                                                                      |
   |     | Class Operation  |                                                                                  |
   +=====+==================+==================================================================================+
   | 1   | Initialization   | **Subport level:**                                                               |
   |     |                  |                                                                                  |
   |     |                  | wm = WM_MAX                                                                      |
   |     |                  |                                                                                  |
   +-----+------------------+----------------------------------------------------------------------------------+
   | 2   | Credit update    | **Subport level (water_mark_update):**                                           |
   |     |                  |                                                                                  |
   |     |                  | tc0_cons = subport_tc0_credits_per_period - subport_tc0_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc1_cons = subport_tc1_credits_per_period - subport_tc1_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc2_cons = subport_tc2_credits_per_period - subport_tc2_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc3_cons = subport_tc3_credits_per_period - subport_tc3_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc3_cons_max = subport_tc3_credits_per_period - (tc0_cons + tc1_cons +           |
   |     |                  | tc2_cons);                                                                       |
   |     |                  |                                                                                  |
   |     |                  | if(tc3_consumption > (tc3_consumption_max - MTU)){                               |
   |     |                  |                                                                                  |
   |     |                  |     wm -= wm >> 7;                                                               |
   |     |                  |                                                                                  |
   |     |                  |     if(wm < WM_MIN) wm =  WM_MIN;                                                |
   |     |                  |                                                                                  |
   |     |                  | } else {                                                                         |
   |     |                  |                                                                                  |
   |     |                  |    wm += (wm >> 7) + 1;                                                          |
   |     |                  |                                                                                  |
   |     |                  |    if(wm > WM_MAX) wm = WM_MAX;                                                  |
   |     |                  |                                                                                  |
   |     |                  | }                                                                                |
   |     |                  |                                                                                  |
   +-----+------------------+----------------------------------------------------------------------------------+

Worst Case Scenarios for Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Lots of Active Queues with Not Enough Credits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The more queues the scheduler has to examine for packets and credits in order to select one packet,
the lower the performance of the scheduler is.

The scheduler maintains the bitmap of active queues, which skips the non-active queues,
but in order to detect whether a specific pipe has enough credits,
the pipe has to be drilled down using the pipe dequeue state machine,
which consumes cycles regardless of the scheduling result
(no packets are produced or at least one packet is produced).

This scenario stresses the importance of the policer for the scheduler performance:
if the pipe does not have enough credits,
its packets should be dropped as soon as possible (before they reach the hierarchical scheduler),
thus rendering the pipe queues as not active,
which allows the dequeue side to skip that pipe with no cycles being spent on investigating the pipe credits
that would result in a "not enough credits" status.

Single Queue with 100% Line Rate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The port scheduler performance is optimized for a large number of queues.
If the number of queues is small,
then the performance of the port scheduler for the same level of active traffic is expected to be worse than
the performance of a small set of message passing queues.

.. _Dropper:

Dropper
-------

The purpose of the DPDK dropper is to drop packets arriving at a packet scheduler to avoid congestion.
The dropper supports the Random Early Detection (RED),
Weighted Random Early Detection (WRED) and tail drop algorithms.
:numref:`figure_blk_diag_dropper` illustrates how the dropper integrates with the scheduler.
The DPDK currently does not support congestion management
so the dropper provides the only method for congestion avoidance.

.. _figure_blk_diag_dropper:

.. figure:: img/blk_diag_dropper.*

   High-level Block Diagram of the DPDK Dropper


The dropper uses the Random Early Detection (RED) congestion avoidance algorithm as documented in the reference publication.
The purpose of the RED algorithm is to monitor a packet queue,
determine the current congestion level in the queue and decide whether an arriving packet should be enqueued or dropped.
The RED algorithm uses an Exponential Weighted Moving Average (EWMA) filter to compute average queue size which
gives an indication of the current congestion level in the queue.

For each enqueue operation, the RED algorithm compares the average queue size to minimum and maximum thresholds.
Depending on whether the average queue size is below, above or in between these thresholds,
the RED algorithm calculates the probability that an arriving packet should be dropped and
makes a random decision based on this probability.

The dropper also supports Weighted Random Early Detection (WRED) by allowing the scheduler to select
different RED configurations for the same packet queue at run-time.
In the case of severe congestion, the dropper resorts to tail drop.
This occurs when a packet queue has reached maximum capacity and cannot store any more packets.
In this situation, all arriving packets are dropped.

The flow through the dropper is illustrated in :numref:`figure_flow_tru_droppper`.
The RED/WRED algorithm is exercised first and tail drop second.

.. _figure_flow_tru_droppper:

.. figure:: img/flow_tru_droppper.*

   Flow Through the Dropper


The use cases supported by the dropper are:

*   *    Initialize configuration data

*   *    Initialize run-time data

*   *    Enqueue (make a decision to enqueue or drop an arriving packet)

*   *    Mark empty (record the time at which a packet queue becomes empty)

The configuration use case is explained in :ref:`Section2.23.3.1 <Configuration>`,
the enqueue operation is explained in  :ref:`Section 2.23.3.2 <Enqueue_Operation>`
and the mark empty operation is explained in :ref:`Section 2.23.3.3 <Queue_Empty_Operation>`.

.. _Configuration:

Configuration
~~~~~~~~~~~~~

A RED configuration contains the parameters given in :numref:`table_qos_16`.

.. _table_qos_16:

.. table:: RED Configuration Parameters

   +--------------------------+---------+---------+------------------+
   | Parameter                | Minimum | Maximum | Typical          |
   |                          |         |         |                  |
   +==========================+=========+=========+==================+
   | Minimum Threshold        | 0       | 1022    | 1/4 x queue size |
   |                          |         |         |                  |
   +--------------------------+---------+---------+------------------+
   | Maximum Threshold        | 1       | 1023    | 1/2 x queue size |
   |                          |         |         |                  |
   +--------------------------+---------+---------+------------------+
   | Inverse Mark Probability | 1       | 255     | 10               |
   |                          |         |         |                  |
   +--------------------------+---------+---------+------------------+
   | EWMA Filter Weight       | 1       | 12      | 9                |
   |                          |         |         |                  |
   +--------------------------+---------+---------+------------------+

The meaning of these parameters is explained in more detail in the following sections.
The format of these parameters as specified to the dropper module API
corresponds to the format used by Cisco* in their RED implementation.
The minimum and maximum threshold parameters are specified to the dropper module in terms of number of packets.
The mark probability parameter is specified as an inverse value, for example,
an inverse mark probability parameter value of 10 corresponds
to a mark probability of 1/10 (that is, 1 in 10 packets will be dropped).
The EWMA filter weight parameter is specified as an inverse log value,
for example, a filter weight parameter value of 9 corresponds to a filter weight of 1/29.

.. _Enqueue_Operation:

Enqueue Operation
~~~~~~~~~~~~~~~~~

In the example shown in :numref:`figure_ex_data_flow_tru_dropper`, q (actual queue size) is the input value,
avg (average queue size) and count (number of packets since the last drop) are run-time values,
decision is the output value and the remaining values are configuration parameters.

.. _figure_ex_data_flow_tru_dropper:

.. figure:: img/ex_data_flow_tru_dropper.*

   Example Data Flow Through Dropper


EWMA Filter Microblock
^^^^^^^^^^^^^^^^^^^^^^

The purpose of the EWMA Filter microblock is to filter queue size values to smooth out transient changes
that result from "bursty" traffic.
The output value is the average queue size which gives a more stable view of the current congestion level in the queue.

The EWMA filter has one configuration parameter, filter weight, which determines how quickly
or slowly the average queue size output responds to changes in the actual queue size input.
Higher values of filter weight mean that the average queue size responds more quickly to changes in actual queue size.

Average Queue Size Calculation when the Queue is not Empty
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

The definition of the EWMA filter is given in the following equation.

.. image:: img/ewma_filter_eq_1.*

Where:

*   *avg*  = average queue size

*   *wq*   = filter weight

*   *q*    = actual queue size

.. note::

    The filter weight, wq = 1/2^n, where n is the filter weight parameter value passed to the dropper module
	on configuration (see :ref:`Section2.23.3.1 <Configuration>` ).

Average Queue Size Calculation when the Queue is Empty
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The EWMA filter does not read time stamps and instead assumes that enqueue operations will happen quite regularly.
Special handling is required when the queue becomes empty as the queue could be empty for a short time or a long time.
When the queue becomes empty, average queue size should decay gradually to zero instead of dropping suddenly to zero
or remaining stagnant at the last computed value.
When a packet is enqueued on an empty queue, the average queue size is computed using the following formula:

.. image:: img/ewma_filter_eq_2.*

Where:

*   *m*   = the number of enqueue operations that could have occurred on this queue while the queue was empty

In the dropper module, *m* is defined as:

.. image:: img/m_definition.*

Where:

*   *time*  = current time

*   *qtime* = time the queue became empty

*   *s* = typical time between successive enqueue operations on this queue

The time reference is in units of bytes,
where a byte signifies the time duration required by the physical interface to send out a byte on the transmission medium
(see Section `Internal Time Reference`_).
The parameter s is defined in the dropper module as a constant with the value: s=2^22.
This corresponds to the time required by every leaf node in a hierarchy with 64K leaf nodes
to transmit one 64-byte packet onto the wire and represents the worst case scenario.
For much smaller scheduler hierarchies,
it may be necessary to reduce the parameter s, which is defined in the red header source file (rte_red.h) as:

.. code-block:: c

    #define RTE_RED_S

Since the time reference is in bytes, the port speed is implied in the expression: *time-qtime*.
The dropper does not have to be configured with the actual port speed.
It adjusts automatically to low speed and high speed links.

Implementation
""""""""""""""

A numerical method is used to compute the factor (1-wq)^m that appears in Equation 2.

This method is based on the following identity:

.. image:: img/eq2_factor.*


This allows us to express the following:

.. image:: img/eq2_expression.*


In the dropper module, a look-up table is used to compute log2(1-wq) for each value of wq supported by the dropper module.
The factor (1-wq)^m can then be obtained by multiplying the table value by *m* and applying shift operations.
To avoid overflow in the multiplication, the value, *m*, and the look-up table values are limited to 16 bits.
The total size of the look-up table is 56 bytes.
Once the factor (1-wq)^m is obtained using this method, the average queue size can be calculated from Equation 2.

Alternative Approaches
""""""""""""""""""""""

Other methods for calculating the factor (1-wq)^m in the expression for computing average queue size
when the queue is empty (Equation 2) were considered.
These approaches include:

*   Floating-point evaluation

*   Fixed-point evaluation using a small look-up table (512B) and up to 16 multiplications
    (this is the approach used in the FreeBSD* ALTQ RED implementation)

*   Fixed-point evaluation using a small look-up table (512B) and 16 SSE multiplications
    (SSE optimized version of the approach used in the FreeBSD* ALTQ RED implementation)

*   Large look-up table (76 KB)

The method that was finally selected (described above in Section 26.3.2.2.1) out performs all of these approaches
in terms of run-time performance and memory requirements and
also achieves accuracy comparable to floating-point evaluation.
:numref:`table_qos_17` lists the performance of each of these alternative approaches relative to the method that is used in the dropper.
As can be seen, the floating-point implementation achieved the worst performance.

.. _table_qos_17:

.. table:: Relative Performance of Alternative Approaches

   +------------------------------------------------------------------------------------+----------------------+
   | Method                                                                             | Relative Performance |
   |                                                                                    |                      |
   +====================================================================================+======================+
   | Current dropper method (see :ref:`Section 23.3.2.1.3 <Dropper>`)                   | 100%                 |
   |                                                                                    |                      |
   +------------------------------------------------------------------------------------+----------------------+
   | Fixed-point method with small (512B) look-up table                                 | 148%                 |
   |                                                                                    |                      |
   +------------------------------------------------------------------------------------+----------------------+
   | SSE method with small (512B) look-up table                                         | 114%                 |
   |                                                                                    |                      |
   +------------------------------------------------------------------------------------+----------------------+
   | Large (76KB) look-up table                                                         | 118%                 |
   |                                                                                    |                      |
   +------------------------------------------------------------------------------------+----------------------+
   | Floating-point                                                                     | 595%                 |
   |                                                                                    |                      |
   +------------------------------------------------------------------------------------+----------------------+
   | **Note**: In this case, since performance is expressed as time spent executing the operation in a         |
   | specific condition, any relative performance value above 100% runs slower than the reference method.      |
   |                                                                                                           |
   +-----------------------------------------------------------------------------------------------------------+

Drop Decision Block
^^^^^^^^^^^^^^^^^^^

The Drop Decision block:

*   Compares the average queue size with the minimum and maximum thresholds

*   Calculates a packet drop probability

*   Makes a random decision to enqueue or drop an arriving packet

The calculation of the drop probability occurs in two stages.
An initial drop probability is calculated based on the average queue size,
the minimum and maximum thresholds and the mark probability.
An actual drop probability is then computed from the initial drop probability.
The actual drop probability takes the count run-time value into consideration
so that the actual drop probability increases as more packets arrive to the packet queue
since the last packet was dropped.

Initial Packet Drop Probability
"""""""""""""""""""""""""""""""

The initial drop probability is calculated using the following equation.

.. image:: img/drop_probability_eq3.*

Where:

*   *maxp*  = mark probability

*   *avg*  = average queue size

*   *minth*  = minimum threshold

*   *maxth*  = maximum threshold

The calculation of the packet drop probability using Equation 3 is illustrated in :numref:`figure_pkt_drop_probability`.
If the average queue size is below the minimum threshold, an arriving packet is enqueued.
If the average queue size is at or above the maximum threshold, an arriving packet is dropped.
If the average queue size is between the minimum and maximum thresholds,
a drop probability is calculated to determine if the packet should be enqueued or dropped.

.. _figure_pkt_drop_probability:

.. figure:: img/pkt_drop_probability.*

   Packet Drop Probability for a Given RED Configuration


Actual Drop Probability
"""""""""""""""""""""""

If the average queue size is between the minimum and maximum thresholds,
then the actual drop probability is calculated from the following equation.

.. image:: img/drop_probability_eq4.*

Where:

*   *Pb*  = initial drop probability (from Equation 3)

*   *count* = number of packets that have arrived since the last drop

The constant 2, in Equation 4 is the only deviation from the drop probability formulae
given in the reference document where a value of 1 is used instead.
It should be noted that the value pa computed from can be negative or greater than 1.
If this is the case, then a value of 1 should be used instead.

The initial and actual drop probabilities are shown in :numref:`figure_drop_probability_graph`.
The actual drop probability is shown for the case where
the formula given in the reference document1 is used (blue curve)
and also for the case where the formula implemented in the dropper module,
is used (red curve).
The formula in the reference document results in a significantly higher drop rate
compared to the mark probability configuration parameter specified by the user.
The choice to deviate from the reference document is simply a design decision and
one that has been taken by other RED implementations, for example, FreeBSD* ALTQ RED.

.. _figure_drop_probability_graph:

.. figure:: img/drop_probability_graph.*

   Initial Drop Probability (pb), Actual Drop probability (pa) Computed Using
   a Factor 1 (Blue Curve) and a Factor 2 (Red Curve)


.. _Queue_Empty_Operation:

Queue Empty Operation
~~~~~~~~~~~~~~~~~~~~~

The time at which a packet queue becomes empty must be recorded and saved with the RED run-time data
so that the EWMA filter block can calculate the average queue size on the next enqueue operation.
It is the responsibility of the calling application to inform the dropper module
through the API that a queue has become empty.

Source Files Location
~~~~~~~~~~~~~~~~~~~~~

The source files for the DPDK dropper are located at:

*   DPDK/lib/librte_sched/rte_red.h

*   DPDK/lib/librte_sched/rte_red.c

Integration with the DPDK QoS Scheduler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RED functionality in the DPDK QoS scheduler is disabled by default.
To enable it, use the DPDK configuration parameter:

::

    CONFIG_RTE_SCHED_RED=y

This parameter must be set to y.
The parameter is found in the build configuration files in the DPDK/config directory,
for example, DPDK/config/common_linux.
RED configuration parameters are specified in the rte_red_params structure within the rte_sched_port_params structure
that is passed to the scheduler on initialization.
RED parameters are specified separately for four traffic classes and three packet colors (green, yellow and red)
allowing the scheduler to implement Weighted Random Early Detection (WRED).

Integration with the DPDK QoS Scheduler Sample Application
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The DPDK QoS Scheduler Application reads a configuration file on start-up.
The configuration file includes a section containing RED parameters.
The format of these parameters is described in :ref:`Section2.23.3.1 <Configuration>`.
A sample RED configuration is shown below. In this example, the queue size is 64 packets.

.. note::

    For correct operation, the same EWMA filter weight parameter (wred weight) should be used
    for each packet color (green, yellow, red) in the same traffic class (tc).

::

    ; RED params per traffic class and color (Green / Yellow / Red)

   [red]
   tc 0 wred min = 28 22 16
   tc 0 wred max = 32 32 32
   tc 0 wred inv prob = 10 10 10
   tc 0 wred weight = 9 9 9

   tc 1 wred min = 28 22 16
   tc 1 wred max = 32 32 32
   tc 1 wred inv prob = 10 10 10
   tc 1 wred weight = 9 9 9

   tc 2 wred min = 28 22 16
   tc 2 wred max = 32 32 32
   tc 2 wred inv prob = 10 10 10
   tc 2 wred weight = 9 9 9

   tc 3 wred min = 28 22 16
   tc 3 wred max = 32 32 32
   tc 3 wred inv prob = 10 10 10
   tc 3 wred weight = 9 9 9

With this configuration file, the RED configuration that applies to green,
yellow and red packets in traffic class 0 is shown in :numref:`table_qos_18`.

.. _table_qos_18:

.. table:: RED Configuration Corresponding to RED Configuration File

   +--------------------+--------------------+-------+--------+-----+
   | RED Parameter      | Configuration Name | Green | Yellow | Red |
   |                    |                    |       |        |     |
   +====================+====================+=======+========+=====+
   | Minimum Threshold  | tc 0 wred min      | 28    | 22     | 16  |
   |                    |                    |       |        |     |
   +--------------------+--------------------+-------+--------+-----+
   | Maximum Threshold  | tc 0 wred max      | 32    | 32     | 32  |
   |                    |                    |       |        |     |
   +--------------------+--------------------+-------+--------+-----+
   | Mark Probability   | tc 0 wred inv prob | 10    | 10     | 10  |
   |                    |                    |       |        |     |
   +--------------------+--------------------+-------+--------+-----+
   | EWMA Filter Weight | tc 0 wred weight   | 9     | 9      | 9   |
   |                    |                    |       |        |     |
   +--------------------+--------------------+-------+--------+-----+

Application Programming Interface (API)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Enqueue API
^^^^^^^^^^^

The syntax of the enqueue API is as follows:

.. code-block:: c

   int rte_red_enqueue(const struct rte_red_config *red_cfg, struct rte_red *red, const unsigned q, const uint64_t time)


The arguments passed to the enqueue API are configuration data, run-time data,
the current size of the packet queue (in packets) and a value representing the current time.
The time reference is in units of bytes,
where a byte signifies the time duration required by the physical interface to send out a byte on the transmission medium
(see Section 26.2.4.5.1 "Internal Time Reference" ).
The dropper reuses the scheduler time stamps for performance reasons.

Empty API
^^^^^^^^^

The syntax of the empty API is as follows:

.. code-block:: c

    void rte_red_mark_queue_empty(struct rte_red *red, const uint64_t time)

The arguments passed to the empty API are run-time data and the current time in bytes.

Traffic Metering
----------------

The traffic metering component implements the Single Rate Three Color Marker (srTCM) and
Two Rate Three Color Marker (trTCM) algorithms, as defined by IETF RFC 2697 and 2698 respectively.
These algorithms meter the stream of incoming packets based on the allowance defined in advance for each traffic flow.
As result, each incoming packet is tagged as green,
yellow or red based on the monitored consumption of the flow the packet belongs to.

Functional Overview
~~~~~~~~~~~~~~~~~~~

The srTCM algorithm defines two token buckets for each traffic flow,
with the two buckets sharing the same token update rate:

*   Committed (C) bucket: fed with tokens at the rate defined by the Committed Information Rate (CIR) parameter
    (measured in IP packet bytes per second).
    The size of the C bucket is defined by the Committed Burst Size (CBS) parameter (measured in bytes);

*   Excess (E) bucket: fed with tokens at the same rate as the C bucket.
    The size of the E bucket is defined by the Excess Burst Size (EBS) parameter (measured in bytes).

The trTCM algorithm defines two token buckets for each traffic flow,
with the two buckets being updated with tokens at independent rates:

*   Committed (C) bucket: fed with tokens at the rate defined by the Committed Information Rate (CIR) parameter
    (measured in bytes of IP packet per second).
    The size of the C bucket is defined by the Committed Burst Size (CBS) parameter (measured in bytes);

*   Peak (P) bucket: fed with tokens at the rate defined by the Peak Information Rate (PIR) parameter
    (measured in IP packet bytes per second).
    The size of the P bucket is defined by the Peak Burst Size (PBS) parameter (measured in bytes).

Please refer to RFC 2697 (for srTCM) and RFC 2698 (for trTCM) for details on how tokens are consumed
from the buckets and how the packet color is determined.

Color Blind and Color Aware Modes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For both algorithms, the color blind mode is functionally equivalent to the color aware mode with input color set as green.
For color aware mode, a packet with red input color can only get the red output color,
while a packet with yellow input color can only get the yellow or red output colors.

The reason why the color blind mode is still implemented distinctly than the color aware mode is
that color blind mode can be implemented with fewer operations than the color aware mode.

Implementation Overview
~~~~~~~~~~~~~~~~~~~~~~~

For each input packet, the steps for the srTCM / trTCM algorithms are:

*   Update the C and E / P token buckets. This is done by reading the current time (from the CPU timestamp counter),
    identifying the amount of time since the last bucket update and computing the associated number of tokens
    (according to the pre-configured bucket rate).
    The number of tokens in the bucket is limited by the pre-configured bucket size;

*   Identify the output color for the current packet based on the size of the IP packet
    and the amount of tokens currently available in the C and E / P buckets; for color aware mode only,
    the input color of the packet is also considered.
    When the output color is not red, a number of tokens equal to the length of the IP packet are
    subtracted from the C or E /P or both buckets, depending on the algorithm and the output color of the packet.