When traces lie…

Interesting issue pops into my email box – when calling India, the call goes through to local 911 emergency services instead.  Not surprisingly, this email is marked with high priority.

So diving in, I have the user make test calls and we prove that calls to England, France and other international destinations work splendidly.  Not the same story with calls to a certain number in India. Let’s just say local emergency dispatchers aren’t looking to be friends with voice engineers making test calls, even ones with charming Southern accents.

In this case, all the calls are dialed the same way: 9011[country code][number], but the number to India happens to be 90119111XXXXXXXX.  As you may have noticed, 911, the emergency services number in the US, is part of the dialed number.
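To make the overlap concrete, here is a minimal Python sketch. The helper name is my own, and the zeros stand in for the masked X digits; none of this comes from Call Manager itself. It only shows that once the 9011 access prefix is set aside, the digits left over begin with 911.

```python
# Illustration only: zeros are placeholders for the masked X digits.
ACCESS_PREFIX = "9011"   # outside line (9) + US international prefix (011)
EMERGENCY = "911"

def remainder_after_prefix(dialed: str, prefix: str = ACCESS_PREFIX) -> str:
    """Return the digits left once the access prefix is stripped."""
    return dialed[len(prefix):] if dialed.startswith(prefix) else dialed

dialed = "90119111" + "0" * 8            # 9011 + 91 (India) + 1..., zeros are placeholders
rest = remainder_after_prefix(dialed)
print(rest.startswith(EMERGENCY))        # True: the leftover digits begin with 911
print(dialed.find(EMERGENCY))            # 4: where the embedded 911 starts
```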

So what would make the Call Manager or the router- not sure where to lay the blame at this point since a PBX isn’t involved- ditch the 9011 and send 911 out to the PSTN? Good question.

Time for the Dialed Number Analyzer to save the day! Punch in the digits, click “Do Analysis” and get back 9@ as the matching route pattern. Cue icky feeling in stomach.  For those who aren’t familiar with why 9@ just sucks in your dial plan, please click on over to @networkingnerd’s blog post: http://networkingnerd.net/2011/05/26/9-must-die/ for a nice write up on the tawdry subject. If that doesn’t convince you, know that if you use it, I will hunt you down and…uh, let’s get back to the story…

In an attempt to thwart 9@, I create my own international dialing pattern the way god intended international route patterns to be, making sure my CSS/partition trumped that of the pathetic 9@ pattern.  Testing commences and the user’s test call goes through successfully! Huzzah! No more making crank calls to grumpy 911 operators.

Just for good measure, I have the user do one more test as I quietly pat myself on the back. This time, instead of hitting redial, which, unbeknownst to me, he had been doing on the previous calls, the user dials the number digit by digit.  I then hear the melodious “your call cannot be completed as dialed…” message. Huh?

Having put my self-congratulatory speech on hold, it’s time for more debug and log collections.  At this point things go from slightly askew to downright wonky.  DNA tool says I’m still matching 9@.  *Gasp* – the DNA tool is lying to me! Viewing the router debugs I can see that my pattern has changed what the router was sending out to the PSTN from 911 to 011911- which, while not actually routable, is solid proof my new route pattern in Call Manager is being hit.

Then TAC tells me the trace files show that Call Manager quits collecting digits after the 9 and 0 are dialed for calls placed to the 9011911XXXXXXXX destination, but that the Call Manager collects all the digits dialed for any other international destination. Wait, what?  How does it know after my 9 and 0 whether I am going to dial India or Timbuktu? According to the trace files, though, Call Manager can predict if I’m going to call India before I even dial it. I know the system is good, but I didn’t think it had progressed to mind reading yet.

And what about using redial?  The system apparently collects all the digits there too. Somehow Call Manager *knows* when I’m going to dial India using the keypad, but if you hit redial its magical predictive powers are somehow temporarily suspended and the call sneaks on by.

To quote one of my favorite shows of all time: “this is all making a kind of sense that’s… not.”*

Feeling betrayed by my trusty tools and trace files, I am left to conclude that the system is as utterly confused as we are about what is actually going on under the hood.  So it’s back to basics- call routing appears to be the issue, time to review the system’s infernal route patterns yet again.

At this point, I’ll note that in addition to 9@, there is also a 9011@ pattern present. Previously we all blew this pattern off because all the evidence indicated it wasn’t ever being matched by anything. Now that the evidence is suspect at best, a closer look is warranted. We proceed to put the 9011@ pattern in a partition nothing has access to. We test and, at last, we have true success.

So what to make of this?

Number one and most importantly: never, ever use @ in your route patterns if you can help it.  It’s just wrong, wrong, dirty and wrong.  Also, it appears to completely goof up the Dialed Number Analyzer, so keep that in mind when troubleshooting such patterns.

Number two: tools are useful, but not always accurate. Corollary: trust, but verify. Take output from as many sources as you can to build a full picture of the puzzle, especially if one or more of the tools at hand are spitting out results that defy logic.

Number three: some clues throw you off track. In this case, the redial working pointed to a digit timeout issue, but other international calls were fine, so we put this on the back burner. Turned out to be a good decision.

So one mystery still remains: why the heck did the redial work? Anyone with thoughts/theories please feel free to comment, I’d love to hear your ideas on the subject…I wouldn’t rule out black magic and powers of unspeakable darkness…

*In case you were wondering, the quote is from Buffy the Vampire Slayer, episode Becoming, Part 2, a series chock-full o’ excellent one-liners…

A tale of two phones

Manager comes in and asks me if I have any tasks the shiny new intern can observe. My response: I’m about to troubleshoot a Call Forward All issue, and if the intern doesn’t have anything more interesting, he’s welcome to watch.  To my complete dismay, he really doesn’t have anything more interesting, and I have an audience.

First step: gather the facts.  Call the user’s extension, the phone rings- super.  Set the call forward all- super.  Call the user’s extension- get busy signal.  Not so super. Okay, Houston, we’ve verified there is a problem.

Second step: check the obvious.  In this case, checking the CSS of the Call Forward All of the Directory Number is the place to start.  If no CSS or the wrong CSS is set, the call that’s supposed to be forwarded to Timbuktu isn’t going to get there no matter how much you will it to do so. So I verify the CSS is correct, and I give a lengthy explanation of CSSs to the intern who, once again to my dismay, has not fallen asleep yet.
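For anyone who missed the lecture, the idea boils down to a lookup. The sketch below is a toy model in Python, not how Call Manager is implemented, and the partition and CSS names are made up for illustration. It treats a CSS as a list of partitions and says a forward only works if the destination’s partition is in the Call Forward All CSS.

```python
# Toy model: a calling search space (CSS) is a list of partitions it can see.
# All partition names below are hypothetical.
# Simplification: ignores destinations in the <None> partition, which are
# reachable even without a CSS.
from typing import List, Optional

def can_forward(cfa_css: Optional[List[str]], target_partition: str) -> bool:
    """True if the forward destination's partition is visible to the CFA CSS."""
    if not cfa_css:                      # no CSS set on Call Forward All
        return False
    return target_partition in cfa_css

internal_css = ["Internal_PT", "Local_PT"]
print(can_forward(internal_css, "Internal_PT"))   # True
print(can_forward(internal_css, "Intl_PT"))       # False: wrong CSS
print(can_forward(None, "Internal_PT"))           # False: no CSS
```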

Third step: isolate the issue. This is also known as the simplify-the-issue stage or eliminate-all-the-factors-you-can stage.  In this situation, I whittle down the problem to its most basic parts, taking the directory number, which was a shared line, and isolating it to a known good test phone. Also, I confirm the number I’m forwarding to is an internal, reachable, working number. I repeat the tests. And for the love of Pete, it still doesn’t work.

Soooo, thinking I’d like the intern to think I’m smarter than a potato, I blow through the rest of my bag of tricks, including using the Dialed Number Analyzer to confirm my call is taking the expected path.  I also proceed to rule out any PBX conflicts.  As a side note, if you have customers that run a Cisco voice system integrated with a legacy PBX, it is perfectly acceptable to blame the PBX for all issues.  They are like telco carriers, it’s the right thing to do.

While I still have what is left of your attention, let’s move along.  At this point, I hang my head in shame, admit defeat, and do what all good mentors do – show the intern how to collect logs and open a TAC case.

And TAC’s findings on the source of my malcontent?  Bug ID CSCse19548.  For those who would rather I google that for them, the essence is that there’s a counter that is supposed to prevent looping calls, and it has been triggered.  This counter tells the Directory Number “you are a loop!! no call forward for you!!” (call forward Nazi…)

The fix: reset the counter.  You can do this two ways: the take-a-hammer-to-it approach, where you stop and start the CCM Service, temporarily taking down call processing on your CUCM server, OR the slightly milder approach, where you reset the max forwards counter using the phone itself:

-Go off hook on a phone that is registered to CCM and has the high counter; make sure CallFwdAll is set
-Enter **##*30 (enables codes)
-Go off hook again
-Enter **##*35 (clears the max hop counters)

By the way, if you are wondering what this issue looks like in the log files, you are looking for something like this:

06/17/2011 11:25:21.637 CCM|Forwarding - processCFA - ERROR -
Forward loop detected.  -- Clear the call with USER_BUSY. callKey=
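If you have a pile of trace files to dig through, a few lines of Python can do the hunting. This is a rough sketch, not a TAC tool: the function name is made up, and the sample text is adapted from the excerpt above. It just looks for the phrase “Forward loop detected” and reports the line numbers.

```python
# Sketch: find "Forward loop detected" errors in CCM trace text.
MARKER = "Forward loop detected"

def find_forward_loops(trace_text: str) -> list:
    """Return (line_number, line) pairs for lines containing the marker."""
    hits = []
    for number, line in enumerate(trace_text.splitlines(), start=1):
        if MARKER in line:
            hits.append((number, line.strip()))
    return hits

# Sample adapted from the excerpt above.
sample = (
    "06/17/2011 11:25:21.637 CCM|Forwarding - processCFA - ERROR -\n"
    "Forward loop detected.  -- Clear the call with USER_BUSY. callKey=\n"
)
for number, line in find_forward_loops(sample):
    print(number, line)
```

To run it against a real trace, read the file into a string first and pass that in.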

Final thoughts:

The last thing I learned from this situation is that users rarely tell you the whole story.  If you’ve ever seen House and heard his rant on “all patients lie,” it’s similar in concept. In this case, I later found that the directory number in question was once part of a pair of directory numbers whose owners, two very nice, very old ladies, kept call forwarding their phones to each other.  You guessed it: creating a call forwarding loop!  Shortly before this CallFwdAll issue was discovered, the hooligans had been assigned a single directory number to share, which put a stop to their shenanigans. It was a reminder to always, always ask questions.  And lots of them.

Customizing CAS…

Ever been troubleshooting an issue to find the problem was you left out a single line of code – code you never knew needed to be there?  I’m betting anyone who’s been in IT for any length of time has been there, done that- and nearly pulled his/her hair out in the process. (Why do you think there are so many balding guys in IT?)

Case in point – team lead and I are bringing up an E1 circuit in Mexico, resting comfortably in the knowledge that we’ve done this before. The fact that we have no idea how many digits the carrier is going to be sending doesn’t even faze us. We’ve got mad translating skills – bring it on.

That is until it’s clear that whatever digits the carrier is sending, the router is less than thrilled with. No modification of the incoming translation pattern appeases the angry stream of incoming digits- whatever they may be.

Fast forward about two hours and quite a number of debugs later and say hi to the Australian TAC engineer, who is now on the line with us, two IT guys at the site, and a Mexican telco engineer.  Only problem – no one but the telco engineer speaks Spanish – and we’re pretty sure everything that’s wrong is his fault.  He is the telco guy after all.

In the bizarre world of coincidences, the Australian TAC guy (with the really great accent, btw) pipes up with  “hey, the guy in the cube next to me happens to speak Spanish” – and with that our international summit gains traction.  Shortly after, we are staring in awe at the magic line of code that makes everything in this particularly odd universe super happy.

What was missing? This line: groupa-callerid-end. Yep, all this madness and mayhem over that one single line.  It goes here:

controller E1 0/0/0
framing NO-CRC4
ds0-group 1 timeslots 1-15,17-30 type r2-digital r2-compelled ani
cas-custom 1
country telmex
category 2
answer-signal group-b 1
groupa-callerid-end  

Now will you always need this command when bringing up E1s?  Nope.  Will this fix all your issues with telcos in Mexico?  Not likely.  But it’s definitely something to make note of.  Especially when you find yourself doing a bit of guesswork due to a certain lack of information and relatively huge language gap with the carrier.

As an added bonus and completely unrelated to the issue above – here are some dial-peers for common patterns in Mexico that might prove useful if you are planning on bringing up a site there.  Think of it as your treat for making it to the end of this post.

dial-peer voice 2 pots
description Local Dialing
destination-pattern 9[1-9].......
port 0/1/0:1
forward-digits 8
!
dial-peer voice 91 pots
description Long distance
destination-pattern 901..........
port 0/1/0:1
forward-digits 12
!
dial-peer voice 9011 pots
description International Dialing
destination-pattern 900T
port 0/1/0:1
prefix 00
!
dial-peer voice 44 pots
description Local Cell Phone
destination-pattern 9044..........$
port 0/1/0:1
forward-digits 13
!
dial-peer voice 45 pots
description Long Distance Cell Phone
destination-pattern 9045..........$
port 0/1/0:1
forward-digits 13
!
dial-peer voice 60 pots
description Emergency Services
destination-pattern 060
port 0/1/0:1
forward-digits all
!
dial-peer voice 9060 pots
description Emergency Services
destination-pattern 906.$
port 0/1/0:1
forward-digits 3
!
dial-peer voice 9070 pots
description Information & Electric Repairs
destination-pattern 907[01]
port 0/1/0:1
forward-digits 3
!
dial-peer voice 9050 pots
description Telephone Repair
destination-pattern 9050
port 0/1/0:1
forward-digits 3
!
dial-peer voice 9040 pots
description Information
destination-pattern 9040$
port 0/1/0:1
forward-digits 3
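If the forward-digits values look arbitrary, this small Python sketch may help. It assumes the documented IOS behavior that forward-digits N on a POTS dial-peer sends the rightmost N digits of the dialed string; the function name and the sample phone numbers are made up for illustration.

```python
def forwarded_digits(dialed: str, forward) -> str:
    """Digits sent to the carrier: the rightmost N, or everything for 'all'."""
    if forward == "all":
        return dialed
    # Slicing past the start of the string simply returns the whole string.
    return dialed[-forward:] if forward > 0 else ""

print(forwarded_digits("955551234", 8))         # 55551234 (local, 9 stripped)
print(forwarded_digits("9015512345678", 12))    # 015512345678 (long distance)
print(forwarded_digits("90445512345678", 13))   # 0445512345678 (local cell)
print(forwarded_digits("9060", 3))              # 060 (emergency)
```

In each case the leading 9 is the only digit dropped, which is why each forward-digits value is one less than the length of its pattern.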

Translating nothing into nothing…

Wanna confuse a just-starting-out voice engineer quickly? Just show them voice translation rules. Seemingly simple on the surface, black magic voodoo underneath.  At least it can seem that way to someone new to voice…

The most recent dark magic I learned to perform came about on an issue I was 90% sure was a carrier issue – I like to hold out a 10% chance that the carrier actually did get it right, it’s only fair.

So a user reports that international calls to Great Britain are failing- no other international calls are failing, just those.  Now, I don’t know about your users, but mine *often* have trouble even figuring out the digits to dial to make a long distance call, so my confidence in them being able to accurately enter an international access code is low. Okay, non-existent.

So we fire up the good ole “debug isdn q931” and to my surprise the user is actually right. Surprise being the appropriate emotion since, let’s face it, that doesn’t happen every day.  I take a capture of the call failure to Great Britain and a capture of the successful international call and conclude that the carrier must be goofing something up somewhere.

Now, I’m really not a blame-it-on-the-other-guy type of gal, but come on: the dial strings are hitting the same route pattern, sent to the same gateway, to the same dial-peer, and out the same voice port.  Only Great Britain numbers fail, and since everything international gets equal treatment on this end, it’s not likely to be my system. I reasonably conclude the carrier switch must have some super special, surely unintentional, non-routing going on.

Arming the user with debugs, I send him on his way to confront the carrier with the proof of their Anglophobic ways. That’s when I learn I have overlooked something significant in the debugs- something the lovely carrier technician pointed out – likely with a smirk on his I-know-I’m-right face.

The q931 debugs showed the Great Britain calls being marked with a type of “International,” whereas calls to other international destinations were being marked with a type of “Unknown.”  Why is this significant?  Well, the “International” designation, when received by a carrier switch, causes that switch to prepend 011 to the dialed string.  In this case, that’s extremely detrimental since 011 was already part of the digits placed on the line.
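Here is a toy model of what the carrier switch appears to be doing, based purely on the technician’s explanation. The function is hypothetical and real switch behavior will vary by carrier; the digit strings are the masked ones from the debugs.

```python
INTL_PREFIX = "011"

def carrier_view(called_digits: str, number_type: str) -> str:
    """Digits the carrier ends up routing on, per the behavior described above."""
    if number_type == "International":
        # The switch assumes the international prefix is missing and adds it.
        return INTL_PREFIX + called_digits
    return called_digits

print(carrier_view("01144XX80212223", "International"))  # 01101144XX80212223
print(carrier_view("01133XX2087574", "Unknown"))          # 01133XX2087574
```

The first result carries 011 twice, which is not a number anyone can route.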

There are many ways to fix this issue. The one I liked best, as you may have guessed, involves a voice translation rule and was suggested by one of my brilliant coworkers.

It goes like this:

voice translation-rule 1
  rule 1 // // type any unknown plan any unknown

This rule will take anything that hits it, change any “type” to Unknown and any “plan” to Unknown.
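If it helps to see that in something other than IOS syntax, here is a rough Python model of the rule’s effect as described above. The class and function names are invented for illustration and this is not how the router implements it: the digits pass through untouched, and only the type and plan are overwritten.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class CalledNumber:
    digits: str
    number_type: str
    plan: str

def set_unknown(number: CalledNumber) -> CalledNumber:
    """Mimic the rule: digits untouched, type and plan forced to Unknown."""
    return replace(number, number_type="Unknown", plan="Unknown")

before = CalledNumber("01144XX80212223", "International", "ISDN")
after = set_unknown(before)
print(before)
print(after)
```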

It then needs to be added to a translation profile that will catch the called number:

voice translation-profile SET_UNKNOWN
  translate called 1

This then gets applied to the outgoing international dial peer:

dial-peer voice 10000 pots
translation-profile outgoing SET_UNKNOWN
destination-pattern 9011T
prefix 011
port 0/0/0:23

And there you have it.  Calls to the Queen Mother can now commence and users can rejoice!

In case you are still reading this and are interested in the debugs, here are some pertinent excerpts:

From the unsuccessful call (X’s added to protect calling/called parties): Note, Plan:ISDN, Type: International

Bearer Capability i = 0x8090A2
Standard = CCITT
Transfer Capability = Speech
Transfer Mode = Circuit
Transfer Rate = 64 kbit/s
Channel ID i = 0xA98396
Exclusive, Channel 22
Calling Party Number i = 0x2181, 'XXXXXX3547'
Plan:ISDN, Type:National
Called Party Number i = 0x91, '01144XX80212223'
Plan:ISDN, Type:International

From the successful call (X’s added to protect calling/called parties) – Note, Plan: Unknown, Type:Unknown:

Bearer Capability i = 0x8090A2
Standard = CCITT
Transfer Capability = Speech
Transfer Mode = Circuit
Transfer Rate = 64 kbit/s
Channel ID i = 0xA98395
Exclusive, Channel 21
Calling Party Number i = 0x2181, 'XXXXXX3547'
Plan:ISDN, Type:National
Called Party Number i = 0x80, '01133XX2087574'
Plan:Unknown, Type:Unknown
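One last tidbit for the truly dedicated: the hex value after “i =” is where the plan and type come from. The Python sketch below decodes that first octet using the Q.931 layout for number information elements (bits 7-5 are the type of number, bits 4-1 the numbering plan). The function name is mine, and the labels are shortened to match the wording in the debugs.

```python
# Q.931 number IE, octet 3: bit 8 = extension, bits 7-5 = type, bits 4-1 = plan.
TYPE_OF_NUMBER = {
    0b000: "Unknown", 0b001: "International", 0b010: "National",
    0b011: "Network specific", 0b100: "Subscriber", 0b110: "Abbreviated",
}
NUMBERING_PLAN = {
    0b0000: "Unknown", 0b0001: "ISDN", 0b0011: "Data", 0b0100: "Telex",
    0b1000: "National standard", 0b1001: "Private",
}

def decode_type_plan(octet: int):
    """Return (type, plan) for the first octet of a called/calling number IE."""
    ton = (octet >> 4) & 0b111
    npi = octet & 0b1111
    return TYPE_OF_NUMBER.get(ton, "Reserved"), NUMBERING_PLAN.get(npi, "Reserved")

print(decode_type_plan(0x91))  # ('International', 'ISDN')  failed call
print(decode_type_plan(0x80))  # ('Unknown', 'Unknown')     successful call
print(decode_type_plan(0x21))  # ('National', 'ISDN')       first octet of 0x2181
```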