TCP flow control is implemented mainly by the sliding window mechanism, driven by the receiver's acknowledgements, and by the sender's congestion control algorithm, which reacts to network conditions.
TCP Traffic Control
TCP traffic control mainly involves the sliding window, congestion control algorithms, and the calculation of RTT (Round-Trip Time) and RTO (Retransmission Timeout).
- The sliding window is usually viewed as flow and retransmission control fed back from the receiver to the sender (the network itself is not considered);
- Congestion control algorithms are usually viewed as the sender's flow control (taking network delay, packet loss, etc. into account);
- RTT calculation affects the RTO calculation and the behaviour of some congestion algorithms;
- RTO calculation determines how the retransmission timer is set.
Sliding Window
The sliding window is the mechanism by which the receiver advertises a window to the sender during data transfer, and the sender then decides how much data to send at a time based on that window; it is essentially a form of ARQ (Automatic Repeat reQuest). The sliding window uses the window size carried in the receiver's acknowledgements to control how many packets are sent per round, and uses the acknowledgement numbers to decide which segments to retransmit. It performs flow control purely from the receiver's feedback and does not take network bandwidth or delay into account.
The two main ARQ variants are Go-Back-N (continuous ARQ) and Selective Repeat. TCP uses Go-Back-N, with Selective Repeat behaviour available as an option (SACK). The difference is that Go-Back-N keeps a single timer per send round, and when a segment in the middle is lost, that segment and everything after it are retransmitted, whereas Selective Repeat keeps one timer per outstanding segment and retransmits only the lost segment. Roughly speaking, Go-Back-N costs more network bandwidth, while Selective Repeat costs more computation and state.
Animations of both ARQ sliding-window sending processes can be found at:
https://www2.tkn.tu-berlin.de/teaching/rn/animations/gbn_sr/
The TCP sliding window mechanism can be summarized as follows: the sender maintains a send window and transmits segments by sequence number; the receiver acknowledges what it has received; and the sender uses the window advertised in those acknowledgements to resize its send window, slide it to the right, and send the next segments.
As described on the tcpipguide website, the sender's buffer is divided into four regions: sent and acknowledged, sent but not yet acknowledged, not yet sent but the receiver is ready, and not yet sent and the receiver is not ready. The receiver's buffer is divided into three regions: received and acknowledged, not yet received but permitted to be sent, and not yet received and possibly not yet sent by the sender.
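To make the window arithmetic concrete, here is a minimal, self-contained sketch (not kernel code; the field names snd_una, snd_nxt and snd_wnd simply mirror the usual TCP variable names, and the numbers are made up) of how a sender decides how much it may still send and how an incoming ACK slides the window:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical sender-side state, mirroring SND.UNA / SND.NXT / SND.WND. */
struct send_window {
    uint32_t snd_una;  /* oldest unacknowledged sequence number           */
    uint32_t snd_nxt;  /* next sequence number to send                    */
    uint32_t snd_wnd;  /* window most recently advertised by the receiver */
};

/* Bytes we may still put on the wire: usable window = UNA + WND - NXT. */
static uint32_t usable_window(const struct send_window *w)
{
    uint32_t in_flight = w->snd_nxt - w->snd_una;
    return in_flight >= w->snd_wnd ? 0 : w->snd_wnd - in_flight;
}

/* An ACK advances the left edge (the "sent and acknowledged" region grows)
 * and installs the newly advertised window.
 */
static void on_ack(struct send_window *w, uint32_t ack, uint32_t win)
{
    if ((int32_t)(ack - w->snd_una) > 0)   /* sequence-number wrap-safe compare */
        w->snd_una = ack;
    w->snd_wnd = win;
}

int main(void)
{
    struct send_window w = { .snd_una = 1000, .snd_nxt = 1000, .snd_wnd = 4000 };

    w.snd_nxt += 3000;                     /* 3000 bytes sent, unacknowledged */
    printf("usable before ACK: %u\n", usable_window(&w));   /* 1000 */

    on_ack(&w, 2000, 4000);                /* 1000 bytes ACKed, same window */
    printf("usable after ACK:  %u\n", usable_window(&w));   /* 2000 */
    return 0;
}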


A sliding-window code example from the Linux 4.x kernel; it largely matches the description on the tcpipguide website:
/* Update our send window.
 *
 * Window update algorithm, described in RFC793/RFC1122 (used in linux-2.2
 * and in FreeBSD. NetBSD's one is even worse.) is wrong.
 */
static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32 ack,
				 u32 ack_seq)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int flag = 0;
	u32 nwin = ntohs(tcp_hdr(skb)->window);

	if (likely(!tcp_hdr(skb)->syn))
		nwin <<= tp->rx_opt.snd_wscale;

	if (tcp_may_update_window(tp, ack, ack_seq, nwin)) {
		flag |= FLAG_WIN_UPDATE;
		tcp_update_wl(tp, ack_seq);

		if (tp->snd_wnd != nwin) {
			tp->snd_wnd = nwin;

			/* Note, it is the only place, where
			 * fast path is recovered for sending TCP.
			 */
			tp->pred_flags = 0;
			tcp_fast_path_check(sk);

			if (!tcp_write_queue_empty(sk))
				tcp_slow_start_after_idle_check(sk);

			if (nwin > tp->max_window) {
				tp->max_window = nwin;
				tcp_sync_mss(sk, inet_csk(sk)->icsk_pmtu_cookie);
			}
		}
	}

	tcp_snd_una_update(tp, ack);

	return flag;
}
RTT & RTO Calculation
RTT and RTO calculation is an important part of TCP traffic control. Any ARQ scheme needs to retransmit lost packets on timeout, and that timeout is governed by the RTO. If the RTO is set too short (< 2 RTT), unnecessary retransmissions may occur (the "lost" packet may in fact arrive a moment later); if it is set too long, TCP efficiency suffers (the connection just sits waiting for the retransmission to arrive). The RTO is derived from the RTT, so the measured RTT directly affects the RTO, and many congestion algorithms also depend on the RTT. Neither the end-to-end RTT nor the RTO of a TCP flow is a fixed value; both are estimates over a window of time and change as time and network conditions change.
A network is like a highway system, sometimes congested and sometimes wide open; in a network this shows up as RTT jitter. How RTT and RTO are calculated directly affects retransmission and therefore communication efficiency, and the calculation methods have evolved through a long process of theory and practice.
The classic algorithm (from the original TCP specification, RFC 793; Phil Karn and Craig Partridge later added the rule of ignoring RTT samples taken from retransmitted segments) computes SRTT (Smoothed RTT) and RTO as follows:
SRTT = (α * SRTT) + ((1 - α) * RTT)
RTO = min[UBOUND, max[LBOUND, (β * SRTT)]]
where α is typically 0.8 ~ 0.9, β is 1.3 ~ 2.0, and UBOUND and LBOUND are the upper and lower bounds on the timeout.
Jacobson and Karels later revised the algorithm, and the result was eventually standardized in RFC 6298
(https://datatracker.ietf.org/doc/html/rfc6298). The SRTT, DevRTT (RTT deviation, called RTTVAR in the RFC), and RTO formulas are:
SRTT = SRTT + α * (RTT - SRTT)
DevRTT = (1 - β) * DevRTT + β * |RTT - SRTT|
RTO = μ * SRTT + δ * DevRTT
where α = 0.125, β = 0.25, μ = 1, and δ = 4
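As a minimal illustration of these formulas (plain user-space C, not the kernel's fixed-point version; the struct and function names are invented for this sketch), an estimator following RFC 6298 could look like this:

#include <math.h>
#include <stdio.h>

/* Hypothetical RTT estimator state, per RFC 6298 (values in seconds). */
struct rtt_estimator {
    double srtt;       /* smoothed RTT           */
    double rttvar;     /* RTT deviation (DevRTT) */
    double rto;        /* retransmission timeout */
    int    has_sample;
};

#define RTO_MIN 1.0    /* RFC 6298 recommends clamping RTO to >= 1 s */

static void rtt_update(struct rtt_estimator *e, double rtt)
{
    const double alpha = 0.125, beta = 0.25;

    if (!e->has_sample) {
        /* First measurement: SRTT = RTT, RTTVAR = RTT/2. */
        e->srtt = rtt;
        e->rttvar = rtt / 2.0;
        e->has_sample = 1;
    } else {
        /* RTTVAR is updated before SRTT, as required by the RFC. */
        e->rttvar = (1.0 - beta) * e->rttvar + beta * fabs(rtt - e->srtt);
        e->srtt   = (1.0 - alpha) * e->srtt + alpha * rtt;
    }

    e->rto = e->srtt + 4.0 * e->rttvar;    /* mu = 1, delta = 4 */
    if (e->rto < RTO_MIN)
        e->rto = RTO_MIN;
}

int main(void)
{
    struct rtt_estimator e = { 0 };
    double samples[] = { 0.100, 0.120, 0.080, 0.300 };   /* RTT samples in s */

    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        rtt_update(&e, samples[i]);
        printf("rtt=%.3f srtt=%.3f rttvar=%.3f rto=%.3f\n",
               samples[i], e.srtt, e.rttvar, e.rto);
    }
    return 0;
}

The last sample (0.300 s) shows the intended behaviour: a sudden RTT spike inflates DevRTT, so the RTO backs off quickly instead of triggering a spurious retransmission.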
The SRTT calculation in the Linux 4.x kernel is shown below (note that the kernel works in scaled fixed point: tp->srtt_us holds 8 × SRTT and tp->mdev_us holds the scaled mean deviation):
/* Called to compute a smoothed rtt estimate. The data fed to this
 * routine either comes from timestamps, or from segments that were
 * known _not_ to have been retransmitted [see Karn/Partridge
 * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
 * piece by Van Jacobson.
 * NOTE: the next three routines used to be one big routine.
 * To save cycles in the RFC 1323 implementation it was better to break
 * it up into three procedures. -- erics
 */
static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
{
	struct tcp_sock *tp = tcp_sk(sk);
	long m = mrtt_us; /* RTT */
	u32 srtt = tp->srtt_us;

	/* The following amusing code comes from Jacobson's
	 * article in SIGCOMM '88. Note that rtt and mdev
	 * are scaled versions of rtt and mean deviation.
	 * This is designed to be as fast as possible
	 * m stands for "measurement".
	 *
	 * On a 1990 paper the rto value is changed to:
	 * RTO = rtt + 4 * mdev
	 *
	 * Funny. This algorithm seems to be very broken.
	 * These formulae increase RTO, when it should be decreased, increase
	 * too slowly, when it should be increased quickly, decrease too quickly
	 * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
	 * does not matter how to _calculate_ it. Seems, it was trap
	 * that VJ failed to avoid. 8)
	 */
	if (srtt != 0) {
		m -= (srtt >> 3);	/* m is now error in rtt est */
		srtt += m;		/* rtt = 7/8 rtt + 1/8 new */
		if (m < 0) {
			m = -m;		/* m is now abs(error) */
			m -= (tp->mdev_us >> 2);   /* similar update on mdev */
			/* This is similar to one of Eifel findings.
			 * Eifel blocks mdev updates when rtt decreases.
			 * This solution is a bit different: we use finer gain
			 * for mdev in this case (alpha*beta).
			 * Like Eifel it also prevents growth of rto,
			 * but also it limits too fast rto decreases,
			 * happening in pure Eifel.
			 */
			if (m > 0)
				m >>= 3;
		} else {
			m -= (tp->mdev_us >> 2);   /* similar update on mdev */
		}
		tp->mdev_us += m;		/* mdev = 3/4 mdev + 1/4 new */
		if (tp->mdev_us > tp->mdev_max_us) {
			tp->mdev_max_us = tp->mdev_us;
			if (tp->mdev_max_us > tp->rttvar_us)
				tp->rttvar_us = tp->mdev_max_us;
		}
		if (after(tp->snd_una, tp->rtt_seq)) {
			if (tp->mdev_max_us < tp->rttvar_us)
				tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
			tp->rtt_seq = tp->snd_nxt;
			tp->mdev_max_us = tcp_rto_min_us(sk);
		}
	} else {
		/* no previous measure. */
		srtt = m << 3;		/* take the measured time to be rtt */
		tp->mdev_us = m << 1;	/* make sure rto = 3*rtt */
		tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
		tp->mdev_max_us = tp->rttvar_us;
		tp->rtt_seq = tp->snd_nxt;
	}
	tp->srtt_us = max(1U, srtt);
}
The RTO calculation in the Linux 4.x kernel (because of the scaling above, __tcp_set_rto() effectively computes SRTT + 4 × mdev):
/* Calculate rto without backoff. This is the second half of Van Jacobson's
 * routine referred to above.
 */
void tcp_set_rto(struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	/* Old crap is replaced with new one. 8)
	 *
	 * More seriously:
	 * 1. If rtt variance happened to be less 50msec, it is hallucination.
	 *    It cannot be less due to utterly erratic ACK generation made
	 *    at least by solaris and freebsd. "Erratic ACKs" has _nothing_
	 *    to do with delayed acks, because at cwnd>2 true delack timeout
	 *    is invisible. Actually, Linux-2.4 also generates erratic
	 *    ACKs in some circumstances.
	 */
	inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);

	/* 2. Fixups made earlier cannot be right.
	 *    If we do not estimate RTO correctly without them,
	 *    all the algo is pure shit and should be replaced
	 *    with correct one. It is exactly, which we pretend to do.
	 */

	/* NOTE: clamping at TCP_RTO_MIN is not required, current algo
	 * guarantees that rto is higher.
	 */
	tcp_bound_rto(sk);
}

static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
{
	return usecs_to_jiffies((tp->srtt_us >> 3) + tp->rttvar_us);
}
Congestion Control Algorithms
The sliding window only considers the receiver's capacity, not the network. But network bandwidth is shared and network delay fluctuates, so ignoring the network is bound to end in congestion collapse. It is like shipping freight between two railway stations while only considering how much each station can hold and ignoring the capacity of the track: no matter how many trains you dispatch, they just pile up on the track and may bring the whole rail system down.
A congestion control algorithm is a flow control mechanism in which the sender adjusts its sending rate according to current network conditions. TCP congestion control has kept evolving, from the classic Reno to BBR, proposed by Google in recent years, with new algorithms appearing along the way. Congestion control is specified in the following RFC:
RFC5681
https://datatracker.ietf.org/doc/html/rfc5681
Taking the Linux kernel as an example, recent kernel versions implement more than a dozen congestion control algorithms, including some fairly new ones such as BBR. Early Linux kernels used Reno; from Linux 2.6.8 the default became BIC, and from 2.6.19 the default switched to CUBIC, which is still the default in Linux 5.x. Windows also supports multiple congestion control algorithms, and Windows 11 currently defaults to CUBIC as well.
Congestion Algorithm Principles (Using Reno as an Example)
TCP congestion algorithms share the same overall framework. Taking the classic TCP Reno algorithm (the version usually taught in textbooks) as a brief introduction: RFC 5681 defines the slow start, congestion avoidance, fast retransmit, and fast recovery phases, which involve a few basic parameters. cwnd is the congestion window, which determines how much data can be sent at once, and ssthresh is the slow start threshold.
- Slow start
Slow start comes in two forms: when a flow starts and cwnd < ssthresh, cwnd starts at 1 MSS (with faster hardware, recent Linux kernels start at 10 MSS) and grows exponentially per RTT; and after a retransmission timeout, cwnd is reset to its initial value and the flow re-enters slow start. "Slow" does not mean the rate grows slowly (it grows exponentially); my understanding is that it means probing the network's capacity from a fairly low starting rate. For slow start, the ABC (Appropriate Byte Counting) algorithm was proposed to curb bursts that exceed what the network can carry; the two ABC-related RFCs are:
RFC3465
https://datatracker.ietf.org/doc/html/rfc3465
RFC3742
https://datatracker.ietf.org/doc/html/rfc3742
Originally the initial ssthresh was set to a very large value, so the first slow start would not enter congestion avoidance until a loss occurred. HyStart (hybrid slow start) was later introduced to, for example, estimate a suitable ssthresh so that the first slow start can enter congestion avoidance before losing packets (losses are expensive).
- Congestion avoidance
When cwnd >= ssthresh, the flow enters congestion avoidance and cwnd grows linearly per RTT. This is a conservative probe toward the maximum capacity of the network that avoids driving it into collapse.
- Fast retransmit
If three duplicate ACKs arrive during congestion avoidance, a segment has probably been lost and the network is mildly congested. The sender enters fast retransmit: ssthresh is set to 0.5 * cwnd, cwnd is set to ssthresh + 3 MSS, the missing segment is retransmitted, and the flow enters fast recovery.
- Fast recovery
In fast recovery, if no ACK for new data arrives after the retransmission and the RTO expires, the network has not recovered: ssthresh is set to 0.5 * cwnd, cwnd is reset to its initial value, and the flow goes back to slow start. If an ACK for new data does arrive, the network has recovered: cwnd is set to ssthresh and the flow enters congestion avoidance.
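The following is a minimal sketch of the Reno window logic described above (user-space C, not the kernel implementation; all names are invented for illustration, and real stacks handle many more details such as SACK, partial ACKs, and timer management):

#include <stdint.h>

#define MSS           1460U          /* bytes per segment (illustrative) */
#define INIT_CWND     (10U * MSS)    /* IW10, as in recent Linux kernels */
#define DUPACK_THRESH 3

/* Hypothetical per-connection congestion state. */
struct reno_state {
    uint32_t cwnd;       /* congestion window, bytes      */
    uint32_t ssthresh;   /* slow start threshold, bytes   */
    uint32_t dup_acks;   /* consecutive duplicate ACKs    */
};

static void reno_init(struct reno_state *s)
{
    s->cwnd = INIT_CWND;
    s->ssthresh = UINT32_MAX;        /* "very large" initial ssthresh */
    s->dup_acks = 0;
}

/* ACK for new data: grow cwnd (slow start or congestion avoidance). */
static void reno_on_new_ack(struct reno_state *s)
{
    s->dup_acks = 0;
    if (s->cwnd < s->ssthresh)
        s->cwnd += MSS;                    /* exponential: +1 MSS per ACK  */
    else
        s->cwnd += MSS * MSS / s->cwnd;    /* linear: about +1 MSS per RTT */
}

static uint32_t halved_cwnd(const struct reno_state *s)
{
    uint32_t half = s->cwnd / 2;
    return half < 2 * MSS ? 2 * MSS : half;   /* RFC 5681 floor of 2 MSS */
}

/* Duplicate ACK: after the third one, fast retransmit / fast recovery. */
static void reno_on_dup_ack(struct reno_state *s)
{
    if (++s->dup_acks == DUPACK_THRESH) {
        s->ssthresh = halved_cwnd(s);
        s->cwnd = s->ssthresh + DUPACK_THRESH * MSS;
        /* ...retransmit the missing segment here... */
    }
}

/* Retransmission timeout: collapse the window and restart slow start. */
static void reno_on_rto(struct reno_state *s)
{
    s->ssthresh = halved_cwnd(s);
    s->cwnd = 1 * MSS;               /* loss window per RFC 5681 */
    s->dup_acks = 0;
}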
The Reno congestion control state diagram is shown below:


A Wireshark I/O graph of an iperf flow using the Reno algorithm is shown below:

Congestion Algorithm Comparison
Congestion control algorithms are mainly driven by packet loss and delay (RTT), so some are loss-sensitive, some are delay-sensitive, and some combine both. The main difference between algorithms is usually how, during congestion avoidance, they try to track the ideal sending-rate curve without causing losses. The following comparison is taken from
https://en.wikipedia.org/wiki/TCP_congestion_control:
Algorithm | Network feedback | Required changes | Use case | Fairness |
(New) Reno | Loss | — | — | Delay |
Vegas | Delay | Sender | Less loss | Proportional |
High Speed | Loss | Sender | High bandwidth | |
BIC | Loss | Sender | High bandwidth | |
CUBIC | Loss | Sender | High bandwidth | |
C2TCP | Loss/Delay | Sender | Ultra-low latency and high bandwidth | |
NATCP | Multi-bit signal | Sender | Near Optimal Performance | |
Elastic-TCP | Loss/Delay | Sender | High bandwidth/short & long-distance | |
Agile-TCP | Loss | Sender | High bandwidth/short-distance | |
H-TCP | Loss | Sender | High bandwidth | |
FAST | Delay | Sender | High bandwidth | Proportional |
Compound TCP | Loss/Delay | Sender | High bandwidth | Proportional |
Westwood | Loss/Delay | Sender | L | |
Jersey | Loss/Delay | Sender | L | |
BBR | Delay | Sender | BLVC, Bufferbloat | |
CLAMP | Multi-bit signal | Receiver, Router | V | Max-min |
TFRC | Loss | Sender, Receiver | No Retransmission | Minimum delay |
XCP | Multi-bit signal | Sender, Receiver, Router | BLFC | Max-min |
VCP | 2-bit signal | Sender, Receiver, Router | BLF | Proportional |
MaxNet | Multi-bit signal | Sender, Receiver, Router | BLFSC | Max-min |
JetMax | Multi-bit signal | Sender, Receiver, Router | High bandwidth | Max-min |
RED | Loss | Router | Reduced delay | |
ECN | Single-bit signal | Sender, Receiver, Router | Reduced loss | |
It is worth mentioning TCP flow fairness here. Some congestion algorithms can grab a disproportionate share of bandwidth, and RTT-dependent algorithms are especially prone to this, because bandwidth is claimed as a function of RTT: the smaller the RTT, the larger the share. For example, with two TCP flows, the flow with the lower latency grabs more bandwidth and may even take all of it, leaving the other flow unable to make progress. Reno and BIC both have this problem; CUBIC probes for the ideal bandwidth with a cubic function of time that does not depend on RTT, so its fairness is better.

In the figure above, with the Reno algorithm and iperf flows to two different servers, stream2 (the lower-latency flow) grabs most of the bandwidth.

In the figure above, with the CUBIC algorithm and the same iperf flows to two different servers, the bandwidth grab by the lower-latency stream2 is much less severe.
Congestion Algorithms in the Linux Kernel
For Linux 2.6.x kernels, see
https://linuxgazette.net/135/pfeiffer.html on selecting a congestion algorithm. The latest Linux 5.x kernels support several more, briefly described in the kernel's Kconfig files. Incidentally, the MPTCP kernel also ships several congestion algorithms for multipath scenarios.
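Besides the system-wide sysctl default, the congestion algorithm can also be chosen per socket with the TCP_CONGESTION socket option. A small sketch (error handling kept short; whether a given name is accepted depends on which algorithms the running kernel has built in or loaded):

#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_CONGESTION */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* Ask for CUBIC on this socket. */
    const char *algo = "cubic";
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)) < 0)
        perror("setsockopt(TCP_CONGESTION)");

    /* Read back what the kernel actually selected. */
    char buf[16] = { 0 };
    socklen_t len = sizeof(buf);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) == 0)
        printf("congestion control in use: %s\n", buf);

    close(fd);
    return 0;
}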
TCP congestion algorithms available in the Linux 4.x kernel, with CUBIC as the default:
config DEFAULT_TCP_CONG
	string
	default "bic" if DEFAULT_BIC
	default "cubic" if DEFAULT_CUBIC
	default "htcp" if DEFAULT_HTCP
	default "hybla" if DEFAULT_HYBLA
	default "vegas" if DEFAULT_VEGAS
	default "westwood" if DEFAULT_WESTWOOD
	default "veno" if DEFAULT_VENO
	default "reno" if DEFAULT_RENO
	default "dctcp" if DEFAULT_DCTCP
	default "cdg" if DEFAULT_CDG
	default "bbr" if DEFAULT_BBR
	default "cubic"
MPTCP congestion algorithms available in the Linux MPTCP v0.95 kernel:
config TCP_CONG_LIA
	tristate "MPTCP Linked Increase"
	depends on MPTCP
	default n
	---help---
	  MultiPath TCP Linked Increase Congestion Control
	  To enable it, just put 'lia' in tcp_congestion_control

config TCP_CONG_OLIA
	tristate "MPTCP Opportunistic Linked Increase"
	depends on MPTCP
	default n
	---help---
	  MultiPath TCP Opportunistic Linked Increase Congestion Control
	  To enable it, just put 'olia' in tcp_congestion_control

config TCP_CONG_WVEGAS
	tristate "MPTCP WVEGAS CONGESTION CONTROL"
	depends on MPTCP
	default n
	---help---
	  wVegas congestion control for MPTCP
	  To enable it, just put 'wvegas' in tcp_congestion_control

config TCP_CONG_BALIA
	tristate "MPTCP BALIA CONGESTION CONTROL"
	depends on MPTCP
	default n
	---help---
	  Multipath TCP Balanced Linked Adaptation Congestion Control
	  To enable it, just put 'balia' in tcp_congestion_control

config TCP_CONG_MCTCPDESYNC
	tristate "DESYNCHRONIZED MCTCP CONGESTION CONTROL (EXPERIMENTAL)"
	depends on MPTCP
	default n
	---help---
	  Desynchronized MultiChannel TCP Congestion Control. This is experimental
	  code that only supports single path and must have set mptcp_ndiffports
	  larger than one.
	  To enable it, just put 'mctcpdesync' in tcp_congestion_control
	  For further details see:
	  http://ieeexplore.ieee.org/abstract/document/6911722/
	  https://doi.org/10.1016/j.comcom.2015.07.010
New congestion algorithms in the Linux 4.x kernel are written against the following struct; an algorithm is implemented mainly by filling in its function hooks:
struct tcp_congestion_ops {
	struct list_head	list;
	u32			key;
	u32			flags;

	/* initialize private data (optional) */
	void (*init)(struct sock *sk);
	/* cleanup private data  (optional) */
	void (*release)(struct sock *sk);

	/* return slow start threshold (required) */
	u32 (*ssthresh)(struct sock *sk);
	/* do new cwnd calculation (required) */
	void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked);
	/* call before changing ca_state (optional) */
	void (*set_state)(struct sock *sk, u8 new_state);
	/* call when cwnd event occurs (optional) */
	void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
	/* call when ack arrives (optional) */
	void (*in_ack_event)(struct sock *sk, u32 flags);
	/* new value of cwnd after loss (required) */
	u32 (*undo_cwnd)(struct sock *sk);
	/* hook for packet ack accounting (optional) */
	void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
	/* override sysctl_tcp_min_tso_segs */
	u32 (*min_tso_segs)(struct sock *sk);
	/* returns the multiplier used in tcp_sndbuf_expand (optional) */
	u32 (*sndbuf_expand)(struct sock *sk);
	/* call when packets are delivered to update cwnd and pacing rate,
	 * after all the ca_state processing. (optional)
	 */
	void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
	/* get info for inet_diag (optional) */
	size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
			   union tcp_cc_info *info);

	char			name[TCP_CA_NAME_MAX];
	struct module		*owner;
};
The CUBIC algorithm in the Linux 4.x kernel implements the following hooks:
static struct tcp_congestion_ops cubictcp __read_mostly = {
	.init		= bictcp_init,
	.ssthresh	= bictcp_recalc_ssthresh,
	.cong_avoid	= bictcp_cong_avoid,
	.set_state	= bictcp_state,
	.undo_cwnd	= tcp_reno_undo_cwnd,
	.cwnd_event	= bictcp_cwnd_event,
	.pkts_acked	= bictcp_acked,
	.owner		= THIS_MODULE,
	.name		= "cubic",
};
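To illustrate how these hooks are wired up, here is a rough sketch of a minimal loadable module that registers a Reno-style algorithm under a made-up name ("demo_cc"); it reuses the kernel's own helpers (tcp_slow_start(), tcp_cong_avoid_ai(), tcp_reno_ssthresh(), tcp_reno_undo_cwnd()) and is meant as an outline against a 4.x tree, not production code:

#include <linux/module.h>
#include <net/tcp.h>

/* A trivial Reno-like cwnd update: slow start below ssthresh,
 * then additive increase of roughly one MSS per RTT.
 */
static void demo_cong_avoid(struct sock *sk, u32 ack, u32 acked)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (!tcp_is_cwnd_limited(sk))
		return;

	if (tcp_in_slow_start(tp)) {
		acked = tcp_slow_start(tp, acked);    /* exponential growth */
		if (!acked)
			return;
	}
	tcp_cong_avoid_ai(tp, tp->snd_cwnd, acked);   /* linear growth */
}

static struct tcp_congestion_ops demo_cc __read_mostly = {
	.ssthresh	= tcp_reno_ssthresh,      /* cwnd/2 on loss             */
	.cong_avoid	= demo_cong_avoid,
	.undo_cwnd	= tcp_reno_undo_cwnd,     /* restore cwnd on false loss */
	.owner		= THIS_MODULE,
	.name		= "demo_cc",              /* hypothetical algorithm name */
};

static int __init demo_cc_init(void)
{
	return tcp_register_congestion_control(&demo_cc);
}

static void __exit demo_cc_exit(void)
{
	tcp_unregister_congestion_control(&demo_cc);
}

module_init(demo_cc_init);
module_exit(demo_cc_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal demo TCP congestion control");

Once such a module is loaded, its name appears in net.ipv4.tcp_available_congestion_control and can be selected via the sysctl below or the TCP_CONGESTION socket option shown earlier.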
Important Linux System Parameters for Traffic Control
# Maximum ancillary (option memory) buffer size allowed per socket
net.core.optmem_max = 20480
# Default receive socket buffer size (bytes)
net.core.rmem_default = 229376
# Maximum receive socket buffer size (bytes)
net.core.rmem_max = 16777216
# Upper limit of the socket listen() backlog
net.core.somaxconn = 128
# Default TCP congestion control algorithm
net.ipv4.tcp_congestion_control = cubic
# Whether TCP window scaling is enabled
net.ipv4.tcp_window_scaling = 1
# TCP memory limits (in pages, typically 1 page = 4096 bytes); the three values are low, pressure, high
# low: below this many pages TCP does not bother freeing memory; pressure: above this TCP tries to push
# its memory use back below low; high: the maximum number of pages TCP may use to queue data; beyond it
# socket allocations fail and "TCP: too many of orphaned sockets" may appear
net.ipv4.tcp_mem = 381705 508942 763410
# TCP receive buffer (bytes); the three values are min, default, max
net.ipv4.tcp_rmem = 10240 87380 16777216
# TCP send buffer (bytes); the three values are min, default, max
net.ipv4.tcp_wmem = 10240 87380 16777216
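As a small illustration of how the per-socket buffer options interact with these sysctls, the sketch below requests a larger receive buffer with SO_RCVBUF and reads back what the kernel actually granted (Linux doubles the requested value to allow for bookkeeping overhead and caps it at net.core.rmem_max unless the privileged SO_RCVBUFFORCE option is used):

#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    int requested = 4 * 1024 * 1024;   /* ask for a 4 MiB receive buffer */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0)
        perror("setsockopt(SO_RCVBUF)");

    int granted = 0;
    socklen_t len = sizeof(granted);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) == 0)
        /* Granted value is twice the request, capped by net.core.rmem_max. */
        printf("requested %d bytes, kernel granted %d bytes\n", requested, granted);

    close(fd);
    return 0;
}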
Reference
https://tools.ietf.org/html/std7
https://datatracker.ietf.org/doc/html/rfc1122
https://datatracker.ietf.org/doc/html/rfc1323
https://datatracker.ietf.org/doc/html/rfc1379
https://datatracker.ietf.org/doc/html/rfc1948
https://datatracker.ietf.org/doc/html/rfc2018
https://datatracker.ietf.org/doc/html/rfc5681
https://datatracker.ietf.org/doc/html/rfc6247
https://datatracker.ietf.org/doc/html/rfc6298
https://datatracker.ietf.org/doc/html/rfc6824
https://datatracker.ietf.org/doc/html/rfc7323
https://datatracker.ietf.org/doc/html/rfc7414
https://en.wikipedia.org/wiki/Transmission_Control_Protocol
https://linuxgazette.net/135/pfeiffer.html
http://www.tcpipguide.com/free/t_TCPSlidingWindowDataTransferandAcknowledgementMech.htm
https://www2.tkn.tu-berlin.de/teaching/rn/animations/gbn_sr/
https://zhuanlan.zhihu.com/p/144273871
http://lxr.linux.no/linux+v3.2.8/Documentation/networking/ip-sysctl.txt#L464