networking n stuff

Thursday, January 11, 2018

PPP LCP Keepalives & Troubleshooting it on Junos

Until last week I had only a vague understanding of how LCP keepalives work - it was just one of those underlying protocols you have general knowledge of but never really dig into. However, last week I ran into a problem that required a slightly deeper understanding of how they work, and since some of the important details are not emphasized in the sources I read, I decided to write a short note on the topic.

So, how does it work? Let's take a look at a capture taken between a PPPoE server and a Linux client:


Note that there's no "master" who sends requests and "slave" who sends replies - both devices send requests as well as replies, independently of each other.

Each device sends requests at its configured interval (currently 10 seconds on both sides). Each request carries a unique identifier that increments with every request, and the sender expects to receive an echo reply with the same identifier.
What's not so obvious is that a device checks whether the link is still alive by relying only on replies to its own requests, not on incoming requests from the other side.
What this means is that in some specific situations (a bug, for example, or anything else that leads to the same result), if your local device stops sending requests (and hence stops receiving replies), you won't know when the link goes down. Your device will keep sending replies to incoming requests from the remote device, but when the link goes down and the remote device stops sending requests, your device won't pay any attention to it, because the only thing it cares about is replies to its own requests. If this happens on your PPPoE server, you'll end up with stuck PPPoE sessions.
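
To make this failure mode concrete, here is a minimal sketch in Python. It is a toy model, not how Junos or pppd implement keepalives: the `Peer` class, its `sends_requests` flag and the `simulate` helper are made-up names, and the down-count of 3 is simply the value configured in my lab. Each peer only counts replies to its own requests, and answering the other side's requests never touches its liveness state.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """One end of a PPP link running LCP echo keepalives (simplified toy model)."""
    name: str
    down_count: int = 3          # unanswered requests before the link is declared down
    sends_requests: bool = True  # False models the bug: this side stopped sending requests
    next_id: int = 0
    missed: int = 0
    link_up: bool = True

    def handle_request(self, ident):
        # Replying is stateless: it never touches self.missed or self.link_up.
        return ident

    def tick(self, remote, wire_ok):
        """One keepalive interval: send a request and check for the matching reply."""
        if not self.sends_requests or not self.link_up:
            return  # no request sent, so nothing can ever be counted as missed
        ident = self.next_id
        self.next_id = (self.next_id + 1) % 256  # the LCP identifier is a single byte
        reply = remote.handle_request(ident) if wire_ok else None
        if reply == ident:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.down_count:
                self.link_up = False


def simulate(server_sends_requests, intervals_before_cut=5, intervals_after_cut=5):
    """Run both peers, cut the wire halfway, return (server.link_up, client.link_up)."""
    server = Peer("server", sends_requests=server_sends_requests)
    client = Peer("client")
    for i in range(intervals_before_cut + intervals_after_cut):
        wire_ok = i < intervals_before_cut
        server.tick(client, wire_ok)
        client.tick(server, wire_ok)
    return server.link_up, client.link_up


print(simulate(server_sends_requests=True))   # both sides notice the cut
print(simulate(server_sends_requests=False))  # server still believes the link is up
```

In the second run the client tears the session down after three unanswered requests, while the server, which never sent a request of its own, keeps a session that no longer exists.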

So how do we troubleshoot issues like this on Junos? show interfaces pp0.x extensive won't really help, because it doesn't show detailed information about LCP keepalives:

root> show interfaces pp0.3221225475 extensive     
  Logical interface pp0.3221225475 (Index 536870921) (SNMP ifIndex 200000009)
   (Generation 5)
    Flags: Up Point-To-Point Encapsulation: PPPoE
    PPPoE:
      State: SessionUp, Session ID: 1,
      Session AC name: TST-1, Remote MAC address: 00:50:00:00:0e:00,
      Underlying interface: demux0.3221225472 (Index 536870917)
      Ignore End-Of-List tag: Disable 
    Traffic statistics:
     Input  bytes  :                21114
     Output bytes  :                 6068
     Input  packets:                  598
     Output packets:                  595
    Local statistics:
     Input  bytes  :                   36
     Output bytes  :                    0
     Input  packets:                    1
     Output packets:                    0
    Transit statistics:
     Input  bytes  :                21078                    0 bps
     Output bytes  :                 6068                    0 bps
     Input  packets:                  597                    0 pps
     Output packets:                  595                    0 pps
  Keepalive settings: Interval 10 seconds, Up-count 3, Down-count 3
  LCP state: Opened
  NCP state: inet: Opened, inet6: Not-configured, iso: Not-configured, mpls:
  Not-configured
  CHAP state: Success
  PAP state: Closed
    Protocol inet, MTU: 1492
    Max nh cache: 0, New hold nh limit: 0, Curr nh cnt: 0, Curr new hold cnt: 0,
    NH drop cnt: 0
    Generation: 0, Route table: 0
      Flags: Unnumbered
      Donor interface: lo0.0 (Index 321)
      Input Filters: ff-in_UID1007-pp0.3221225475-in (60)
      Output Filters: ff-out_UID1008-pp0.3221225475-out (60)
      Addresses, Flags: Is-Primary
        Destination: Unspecified, Local: 100.64.0.1, Broadcast: Unspecified,
        Generation: 0

(If you're wondering where LCP packets are counted here, it's the Transit section - packets handled by the PFE are counted there, while packets sent to the RE are counted in the Local section.)

This is where shell commands come in:

root> start shell          
root@:~ # vty fpc0

VMX- platform (3067Mhz Intel(R) Atom(TM) CPU processor, 1792MB memory, 8192KB flash)

VMX-0( vty)# show jnh inline-ka session 0 ppp global-stats
PPP Inline keepalive Global Stats:
Total Rx PPP Echo Request pkts        : 367
Total Rx PPP Echo Reply pkts          : 338
Total Tx PPP Echo Request pkts        : 340
Total Tx PPP Echo Reply pkts          : 367
Total PPP Magic Number Mismatch pkts  : 0

So in my case the indication that things were going wrong was that the Tx echo request and Rx echo reply counters didn't change.
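
If you'd rather not compare the counters by eye, a rough sketch like the following can diff two snapshots of that output. It assumes the output format is exactly as shown above. The function names are mine, and the second snapshot is invented for illustration (only the first one is real output from my lab).

```python
import re


def parse_ka_stats(text):
    """Parse 'Total <label> pkts : N' lines into a dict keyed by the label."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*Total (.+?) pkts\s*:\s*(\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats


def local_requests_stalled(before, after):
    """True if the peer is still sending us requests but we sent none in between."""
    tx_delta = after["Tx PPP Echo Request"] - before["Tx PPP Echo Request"]
    rx_delta = after["Rx PPP Echo Request"] - before["Rx PPP Echo Request"]
    return tx_delta == 0 and rx_delta > 0


# First snapshot: the output shown above.
FIRST = """\
PPP Inline keepalive Global Stats:
Total Rx PPP Echo Request pkts        : 367
Total Rx PPP Echo Reply pkts          : 338
Total Tx PPP Echo Request pkts        : 340
Total Tx PPP Echo Reply pkts          : 367
Total PPP Magic Number Mismatch pkts  : 0
"""

# Second snapshot: invented numbers, taken "a bit later" for illustration only.
SECOND = """\
PPP Inline keepalive Global Stats:
Total Rx PPP Echo Request pkts        : 370
Total Rx PPP Echo Reply pkts          : 338
Total Tx PPP Echo Request pkts        : 340
Total Tx PPP Echo Reply pkts          : 370
Total PPP Magic Number Mismatch pkts  : 0
"""

before, after = parse_ka_stats(FIRST), parse_ka_stats(SECOND)
print(local_requests_stalled(before, after))
```

Keep in mind these are global counters, so on a box with many subscribers a single stuck session can hide behind healthy ones. Take the snapshots at least one keepalive interval apart.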
