Until last week I had only a vague understanding of how LCP keepalives work - it was just one of those underlying protocols you know in general terms but never really look at in detail. However, last week I ran into a problem that required a slightly deeper understanding of how they work, and since some of the important details are not emphasized in the sources I read, I decided to write a short note on the topic.
So, how does it work? Let's take a look at a capture taken between a PPPoE server and a Linux client:
Note there's no "master" who sends requests and "slave" who sends replies - both devices send requests as well as replies, independently of each other.
Each device sends requests at the configured interval (10 seconds on both sides in this case), each request carrying a unique identifier that increments with every request, and expects to receive an echo reply with the same identifier.
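To make the identifier matching concrete, here is a minimal sketch of what an echo request and its reply look like on the wire. The code values (9 for Echo-Request, 10 for Echo-Reply) and the header layout (code, identifier, length, magic number) come from RFC 1661; the function names and the magic numbers used below are made up for illustration and have nothing to do with how Junos or pppd implement this.

```python
import struct

# LCP code values defined in RFC 1661
ECHO_REQUEST = 9
ECHO_REPLY = 10


def build_echo(code, identifier, magic, data=b""):
    """Build an LCP echo packet: code (1 byte), identifier (1 byte),
    length (2 bytes), magic number (4 bytes), optional data."""
    length = 8 + len(data)
    return struct.pack("!BBHI", code, identifier, length, magic) + data


def parse_echo(packet):
    """Return (code, identifier, magic, data) from an LCP echo packet."""
    code, identifier, length, magic = struct.unpack("!BBHI", packet[:8])
    return code, identifier, magic, packet[8:length]


def reply_matches(request, reply):
    """True if `reply` is an Echo-Reply carrying the request's identifier."""
    req_code, req_id, _, _ = parse_echo(request)
    rep_code, rep_id, _, _ = parse_echo(reply)
    return (req_code == ECHO_REQUEST
            and rep_code == ECHO_REPLY
            and req_id == rep_id)


# Hypothetical magic numbers; each side sends its own negotiated value
request = build_echo(ECHO_REQUEST, 7, 0x11223344)
reply = build_echo(ECHO_REPLY, 7, 0x55667788)
print(reply_matches(request, reply))
```

Note that the identifier is a single byte, so it wraps around after 255.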
What's not so obvious about this is that a device decides whether the link is still alive based only on replies to its own requests, not on incoming requests from the other side.
In practice this means that if, for whatever reason (a bug, for example), your local device stops sending requests (and hence stops receiving replies), it will never find out that the link is down. It will keep replying to incoming requests from the remote device, but when the link goes down and the remote requests stop arriving, it won't pay any attention to that, because the only thing it cares about is replies to its own requests. If this happens on your PPPoE server, you'll end up with stuck PPPoE sessions.
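The behaviour described above can be sketched as a toy simulation. This is only a model of the logic, not of any real implementation: the `Peer` class, its field names and the `simulate` helper are all hypothetical, and the down-count of 3 is simply the value from my lab setup.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """Toy model of one end of a PPP link (all names are illustrative)."""
    name: str
    down_count: int = 3          # missed replies before the link is declared down
    sends_requests: bool = True  # False models the "stopped sending" bug
    next_id: int = 0
    missed: int = 0
    link_up: bool = True

    def handle_request(self, identifier):
        # Answering an incoming request does NOT touch `missed` or `link_up`:
        # incoming requests play no part in our own liveness decision.
        return identifier

    def tick(self, remote, wire_ok):
        """One keepalive interval: send a request and wait for its reply."""
        if not (self.link_up and self.sends_requests):
            return
        self.next_id = (self.next_id + 1) % 256
        reply = remote.handle_request(self.next_id) if wire_ok else None
        if reply == self.next_id:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.down_count:
                self.link_up = False


def simulate(server_sends_requests, intervals=5):
    """Cut the wire for `intervals` keepalive periods, report who noticed."""
    server = Peer("server", sends_requests=server_sends_requests)
    client = Peer("client")
    for _ in range(intervals):
        server.tick(client, wire_ok=False)
        client.tick(server, wire_ok=False)
    return server.link_up, client.link_up


print(simulate(server_sends_requests=True))   # both sides detect the failure
print(simulate(server_sends_requests=False))  # the server still thinks it's up
```

In the second run the server never sends a request, so it never misses a reply, and its session stays up even though the wire is cut.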
So how do we troubleshoot issues like this on Junos? show interfaces pp0.x extensive won't really help, because it doesn't show any detailed information about LCP keepalives:
root> show interfaces pp0.3221225475 extensive
Logical interface pp0.3221225475 (Index 536870921) (SNMP ifIndex 200000009)
(Generation 5)
Flags: Up Point-To-Point Encapsulation: PPPoE
PPPoE:
State: SessionUp, Session ID: 1,
Session AC name: TST-1, Remote MAC address: 00:50:00:00:0e:00,
Underlying interface: demux0.3221225472 (Index 536870917)
Ignore End-Of-List tag: Disable
Traffic statistics:
Input bytes : 21114
Output bytes : 6068
Input packets: 598
Output packets: 595
Local statistics:
Input bytes : 36
Output bytes : 0
Input packets: 1
Output packets: 0
Transit statistics:
Input bytes : 21078 0 bps
Output bytes : 6068 0 bps
Input packets: 597 0 pps
Output packets: 595 0 pps
Keepalive settings: Interval 10 seconds, Up-count 3, Down-count 3
LCP state: Opened
NCP state: inet: Opened, inet6: Not-configured, iso: Not-configured, mpls:
Not-configured
CHAP state: Success
PAP state: Closed
Protocol inet, MTU: 1492
Max nh cache: 0, New hold nh limit: 0, Curr nh cnt: 0, Curr new hold cnt: 0,
NH drop cnt: 0
Generation: 0, Route table: 0
Flags: Unnumbered
Donor interface: lo0.0 (Index 321)
Input Filters: ff-in_UID1007-pp0.3221225475-in (60)
Output Filters: ff-out_UID1008-pp0.3221225475-out (60)
Addresses, Flags: Is-Primary
Destination: Unspecified, Local: 100.64.0.1, Broadcast: Unspecified,
Generation: 0
(If you're wondering where LCP packets are counted here, it's the Transit section - packets handled by the PFE are listed there, while packets sent to the RE are counted in the Local section.)
This is where shell commands come in:
root> start shell
root@:~ # vty fpc0
VMX- platform (3067Mhz Intel(R) Atom(TM) CPU processor, 1792MB memory, 8192KB flash)
VMX-0( vty)# show jnh inline-ka session 0 ppp global-stats
PPP Inline keepalive Global Stats:
Total Rx PPP Echo Request pkts : 367
Total Rx PPP Echo Reply pkts : 338
Total Tx PPP Echo Request pkts : 340
Total Tx PPP Echo Reply pkts : 367
Total PPP Magic Number Mismatch pkts : 0
So in my case the indication that things were going wrong was the fact that the Tx Echo Request and Rx Echo Reply counters didn't change.
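If you want to automate this check, one option is to save the output twice, a few keepalive intervals apart, and compare the counters. Below is a rough sketch that assumes the output format shown above; the helper names are mine, and the second snapshot is hypothetical data made up to show the symptom, not something taken from a real box.

```python
import re

# Matches lines such as "Total Rx PPP Echo Request pkts : 367"
LINE_RE = re.compile(r"^\s*Total (.+?) pkts\s*:\s*(\d+)\s*$")


def parse_stats(text):
    """Turn the global-stats output into a {counter name: value} dict."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def stalled(before, after,
            names=("Tx PPP Echo Request", "Rx PPP Echo Reply")):
    """Return the watched counters that did not change between snapshots."""
    return [n for n in names
            if n in before and n in after and after[n] == before[n]]


FIRST = """\
Total Rx PPP Echo Request pkts : 367
Total Rx PPP Echo Reply pkts : 338
Total Tx PPP Echo Request pkts : 340
Total Tx PPP Echo Reply pkts : 367
Total PPP Magic Number Mismatch pkts : 0
"""

# Hypothetical second snapshot: the box still answers incoming requests,
# but no longer sends its own
SECOND = """\
Total Rx PPP Echo Request pkts : 370
Total Rx PPP Echo Reply pkts : 338
Total Tx PPP Echo Request pkts : 340
Total Tx PPP Echo Reply pkts : 370
Total PPP Magic Number Mismatch pkts : 0
"""

print(stalled(parse_stats(FIRST), parse_stats(SECOND)))
```

Keep in mind these are global counters, so on a box with many sessions they only stall if keepalives have stopped for all of them.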