Skip to content

Latest commit

 

History

History
225 lines (137 loc) · 12.6 KB

faq.md

File metadata and controls

225 lines (137 loc) · 12.6 KB

DPVS Frequently Asked Questions (FAQ)

Contents


Please try to follow README.md and doc/tutorial.md first. And if you still have problem, possible reasons are:

  1. NIC does not support DPDK or flow control (rte_flow), please check this answer.
  2. DPDK is not compatible with Kernel Version, it cause build error, please refer to DPDK.org or consider upgrade the Kernel.
  3. CPU core (lcore) and NIC queue's configure is miss-match. Please read conf/*.sample, note worker-CPU/NIC-queue are 1:1 mapping and you need one more cpu for master.
  4. DPDK NIC's link is not up ? please check NIC cable first.
  5. curl VIP in FullNAT mode fails (or sometime fails)? Please check if NIC support rte_flow.
  6. curl still fails. Please check route and arp by dpip route show, dpip neigh show.
  7. The patchs in patch/ are not applied.

And you may find other similar issues and solutions from Github's issues list.

Actaully, it's the question about if the NIC support DPDK as well as "flow control(rte_flow)".

First, please make sure the NIC support DPDK, you can check the link. Second, DPVS's FullNAT/SNAT mode need flow control(rte_flow) feature, unless you configure only one worker. For rte_flow support, this link can be checked.

Please find the DPDK driver name according to your NIC by the first link. And check rte_flow support for each drivers from the matrix in the second link.

  1. https://core.dpdk.org/supported/
  2. http://doc.dpdk.org/guides/nics/overview.html#id1

The PMD of your NIC should support the following rte_flow items,

  • ipv4
  • ipv6
  • tcp
  • udp

and the following rte_flow actions at least.

  • queue
  • drop

If you are using only one worker, you can turn off dpvs flow control by setting sa_pool/flow_enable to off in dpvs.conf.

Like LVS, DPVS should be deployed in ECMP Cluster Mode. If one director went down, others still keep working, new connections are not affected. When using Cluster Mode (for both one-arm and two arms), keeplived can still be used for VIP (LIP) configuration and RS health check purpose. Note the keeplived's VIP backup (VRRP) feature is not used.

When upgrade directors (DPVS), we can "stop" ospfd, upgrade it, and start ospfd again. Yes, some existing connection may broken, however, applications may have fail-over (re-try) mechanism.

To address the issue director crash or getting upgraded, some implementation introduce session sharing/sync between directors. Honestly, this is not easy, because,

  1. LIPs are not same for each DPVS director.
  2. connection-table is per-lcore, when sync-ed to another DPVS, how to handle per-lcore table in another machine?
  3. "Session-Sharing" works well for "long-connection" but not "short-connection", since the connection's creation/destruction are too frequently to be sync-ed to other directors.

As I know, some L4 LB implemented "session sharing" or "session synchronization", they are configuring same LIPs for each LB director. And each LIP is configured for one CPU core. Both cases are quite different from DPVS's implementation and deployment.

On the other hand, for the high availability of Real Servers, DPVS leverage keepalived for health check on RS, both TCP/UDP services can be checked, you can also write your own checking scripts. For more info about health-check, please refer to LVS's document.

Yes, and DPVS's TOA derives from the open-sourced alibaba/LVS. The RS's TOA kernel module implementation is at kmod/toa. Compared to the original TOA, DPVS's TOA add supports for IPv6 and Nat64. TOA format on the side of DPVS is defined in proto_tcp.h, while option code is 254, option length is 8.

// include/ipvs/proto_tcp.h
enum {
   TCP_OPT_EOL         = 0,
   TCP_OPT_ADDR        = 254,
};
... ...
struct tcpopt_addr {
    uint8_t opcode;
    uint8_t opsize;
    __be16 port;
    uint8_t addr[16];
} __attribute__((__packed__));

In case of nat64, RS's codes need a few changes to get real client's IP/port. DPVS provide an example TCP server and a nginx patch for nat64 toa. Please refer to example_nat64.

Yes, it does support UDP. In order to get the real client IP/port in FullNAT mode, you need to install UOA module on RS, and make a few code changes. UOA supports two modes: opp for private protocol mode, which supports IPv4/IPv6/Nat64, and ipo for ip option mode, which support IPv4 only. Please refer to uoa.md and udp_serv.c.

Which UOA mode should we choose? Honestly speaking, choose the one that works, because unlike TCP option, either IP option or private IP protocol is often restricted by network devices or policies. Thus consider your network environments and choose the best suitable one.

No, since connection table is per-lcore (per-CPU), and RSS/rte_flow are used for FNAT. Assuming RSS mode is TCP and rte_flow uses L4 info <lip, lport>. Considered that IP fragment doesn't have L4 info, it needs reassembling first and re-schedule the packet to correct lcore which the 5-tuple flow (connection) belongs to.

May be someday in the future, we will support "packet re-schedule" on lcores or use L3 (IP) info only for RSS or flow control, then we may support fragment. But even we support fragment, it may hurt the performance (reassemble, re-schedule effort) or security.

Actually, IPv4 fragment is not recommended, while IPv6 even not support fragment by fixed header, and do not allow re-fragment on middle-boxes. The applications, especially for the datagram-oriented apps, like UDP-apps, should perform PMTU discover algorithm to avoid fragment. TCP is sending sliced segments, notifying MSS to peer side and PMTU discover is built-in, TCP-app should not need worry about fragment.

Please refer to the tutorial.md, there's an exmaple to run DPVS on Ubuntu. Basically, you may need to reduce memory usage. And for VM's NIC, rte_flow is not supported, so if you want to config FullNAT/SNAT mode, you have to configure only one worker (cpu), and another CPU core for master.

You can use ipvsadm and dpip tools. For example,

$ ipvsadm -ln --stats
$ dpip -s link show
$ dpip -s link show cpu

For example, to get the throughput for each VIP/RS, you can use ipvsadm -ln --stats, and ipvsadm -Z to clear the statistics.

Note --rate option of ipvsadm is not supported.

It may need to write scripts to parse the outputs or integrated with your local admin-system.

If any question, please make sure you have read the docs first, and LVS's documents are also helpful. It's better to have some experience about networking configuration (e.g., routing, neigbour, ...) since DPVS is kernel-bypass, basic routing, IP address configurations are needed to be setup from scratch.

We have Chinese QQ-Group or WeChat-Group, you can ask questions, raise issues, talk about design, help others or discuss everything else about DPVS. Here's the entry for WeChar-Group (微信群) and entry for QQ-Group (QQ群). Emailing to iig_cloud_qlb #at# qiyi #dot# com is another way to contact us.

At last, remember, you can find answer of all kind of questions from the codes, DPVS is open-sourced :).

We use wrk as HTTP client and f-stack/nginx as Real Server. When we testing the performance of DPVS, at least 6 physical machines are used as Client (wrk), and 4 physical machines for f-stack/nginx.

For the machine running wrk, IRQ affinity should be set to make sure all CPUs are used. You can find the scripts like set_irq_affinity_ixgbe from Internet.

To get the test result for QPS, please check the output of wrk. To calculate Packet Per-Second (pps), Bytes Per-Second (bps), you can use ipvsadm -ln --stats or dpip -s link show.

We have tested 10G NIC only, and the result shows DPVS can reach the line-speed of 10G NIC with small packets. We have not test 25G/40G/100G NIC yet. Although it's in plan.

The test result can be found in README.md. We have no chance to use professional instruments for test.

Yes. To use bonding device, please check conf/dpvs.bond.conf.sample. To set up VLAN/Tunnel device, you can refer to tutorial.md or dpip vlan help.

There're several ways:

  • /etc/dpvs.conf

You can modify dpvs.conf, and refer to conf/xxx.sample. Some parameters are configurable during run-time (on-fly), while others are configurable in initialization stage only. Refer to conf/dpvs.conf.items for all available parameters and corresponding type, default value and supported value range. Please use kill -SIGHUP <DPVS-PID> to reload the former on-fly kind of parameters in dpvs.conf.

  • ipvsadm, keepalived, quagga/ospfd/bgpd

You should read LVS's documents for ipvsadm/keepalived first, and note ipvsadm and keepalived of DPVS are slightly different, please check tutorial.md for details. For the configuration about quagga/zebra/ospfd/bgpd, please refer to quagga's documents.

  • dpip tool

dpip tool is developed to configure DPVS "on-fly". It's like ip command of Linux iproute2 suites. It can be used to configure IP address, neighbour (arp), routes, DPDK devices (link), virtual-devices (vlan/tunnel), and traffic control (qsch/cls). And it's also helpful to get statistics. Please check tutorial.md and dpip help for details.

DPVS's logging is using DPDK's RTE_LOG. By default, syslogd is used by DPDK RTE_LOG, so DPVS's log should be find in /var/log/message, with time-stamp printed. You may change syslogd's configure as you like, for example change the log file path, or put different programs' log to different files, or limit the log file size, etc.

DPVS log file path and level can also be changed by /etc/dpvs.conf, if the path changed, it means no time-stamp, no syslogd's feature.

Add more LIPs. Increase sapool's pool_hash_size config may also be helpful if RS count of a virtual server is greater than the count of LIPs. You can refer to #72 for more details.

It's normal, not issue. Since DPDK application is using busy-polling mode. Every CPU core configured to DPVS are 100% usage, including Master and Worker CPU cores.

Yes, DPDK is kernel-bypass solution, all forwarding traffic in data plane do not get into the Linux Kernel, it means iptables(Netfilter) won't work for that kind of traffic.