[sorta OT] LAN reliability ?
DComTalk.com
Discussion of VoIP, VPN, video conferencing, DSL and other data communications.
 
Al Dykes

Posted: Mon Apr 18, 2005 11:09 pm    Post subject: Re: [sorta OT] LAN reliability ?

In article <RxS8e.310$l45.6@newssvr12.news.prodigy.com>,
Robert Redelmeier <redelm@ev1.net.invalid> wrote:
Quote:
J. Clarke <jclarke.usenet@snet.net.invalid> wrote:
Whoa. The right approach is defense in depth.

When you need to defend. What to defend is a decision.

so guess what most of the traffic on the "fully open"
network was.

Through traffic, of course. The Internet was designed that way.
The boundary routers could have been easily configured to drop
non-source/dest packets and it would have stopped. A better
solution would have been to negotiate with the various ISPs for
peering and/or cost to carry traffic. But that may have been
beyond the administration's skills.

99% uptime is piss poor. That's one minute of outage every hour
and a half or so. Any server on which that happens is broken.

That isn't the usual granularity. It's more like 2 hours
every month or two, counting only core time. That includes
diagnosis time and is in addition to non-core hours server
routine maintenance and reboots to recover leaked memory.
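
The two readings of 99% above are the same downtime budget spent at different granularities. A minimal sketch of the arithmetic, assuming a 30-day month of round-the-clock service (the function names are illustrative, not from the thread):

```python
# Convert an availability percentage into allowed downtime.
# The 30-day, 24-hour month is an assumption made for illustration only.

def downtime_minutes(availability_pct: float, period_minutes: float) -> float:
    """Minutes of outage permitted in a period at the given availability."""
    return period_minutes * (1.0 - availability_pct / 100.0)


def minutes_per_one_minute_outage(availability_pct: float) -> float:
    """Length of the window that contains exactly one minute of outage."""
    return 1.0 / (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    month = 30 * 24 * 60  # minutes in the assumed 30-day month
    for pct in (99.0, 99.9, 99.99):
        budget = downtime_minutes(pct, month)
        window = minutes_per_one_minute_outage(pct)
        print(f"{pct}%: {budget:.1f} min/month, "
              f"or one minute lost every {window:.0f} min")
```

At 99% this works out to one minute in every 100, or about 7.2 hours per 30-day month. Counting only core hours shrinks the budget, which is why "2 hours every month or two" can still land near 99%.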

why your system is so unreliable. And if you're focussed on
"Windows" and think that eliminating Windows would solve
the problem then you're not really looking at the problem.

Well, yes. It's a long series of dependent chains. Any link
can break. I have no idea if Unix could be configured as
dependently as MS-ActiveDirectory becomes when large.

If you're running XP and your desktop machines in a place
of business are "out of action" "fairly frequently" you
need to find out why and fix it.

First, we do not use MS-WinXP. MS-Win2kPro is bad enough.
Second, failures are not global. People can usually work
at their desktops. But they lose access to some resources
like shared drives or email. Bizarrely, others are unaffected.

-- Robert




For Windows on desktops in business it isn't so much the MTBF as the
MTTR that makes a happy shop. With all the user data and profile on
the server we just drop in a fresh pre-imaged box when a user has a
problem, hardware or software. The sick box goes on the bench for a
hardware repair or reimage and reuse.
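
To put rough numbers on the MTBF-versus-MTTR point, here is a small sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The failure and repair figures are hypothetical, chosen only to show the shape of the effect:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the seat is usable."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    mtbf = 2000.0  # hypothetical: one desktop failure per 2000 working hours
    scenarios = (
        ("troubleshoot in place", 8.0),  # hypothetical: a working day lost
        ("swap pre-imaged box", 0.5),    # hypothetical: half an hour lost
    )
    for label, mttr in scenarios:
        print(f"{label}: {availability(mtbf, mttr):.4%}")
```

With the failure rate held fixed, cutting repair time is what moves the result, which is the case for the swap-and-reimage approach.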

IMO MS servers running mainstream Windows applications can be very
good if your expectations of scale and complexity are reasonable.



--
a d y k e s @ p a n i x . c o m

Don't blame me. I voted for Gore.

J. Clarke

Posted: Tue Apr 19, 2005 12:03 am    Post subject: Re: [sorta OT] LAN reliability ?

Al Dykes wrote:

Quote:
In article <d40luv$7a9$1@canopus.cc.umanitoba.ca>,
Walter Roberson <roberson@ibd.nrc-cnrc.gc.ca> wrote:
Thanks for the reply, Al.

In article <d40ae9$khj$1@panix5.panix.com>, Al Dykes <adykes@panix.com
wrote:

Rule 3 was addressed by a duplicate of the main data center in another
state and if the main DC burned down the whole branch system and ATM
machines would still operate unassisted for about a day. We had 4
hours (per banking regs) to get the backup data center up and
running. We did that on a regular basis.

If you have a moment, I'd appreciate an expansion on that "regular
basis". I'm not quite sure whether you are saying that:

1) it was not uncommon to need to fall back to the backup data center
in response to some trouble issue; or

2) you regularly tested the fallback procedures ("fire drills"); or

3) because of issues like scheduled maintenance, backups, and the like,
that it was not uncommon to activate the duplicate center as a routine
business continuity mechanism; or

4) on the relatively few occasions when it was necessary to fall back,
that you were repeatably able to do so comfortably within the
four-hour window ?

Or to put things another way, are you saying that even with all
the reliability planning that the backup data centre had to be kicked up
in response to a problem, or are you saying that failovers
were no big thing on the occasions they were needed?
--
"This was a Golden Age, a time of high adventure, rich living and
hard dying... but nobody thought so." -- Alfred Bester, TSMD


Policy said that we did a genuine drill every 6 months unless events
in the Real World caused us to use the backup. In the late 80's in
Manhattan there were enough little disasters that we switched data
centers on a regular basis and rarely had to do full fire drills.
After every event there was a post-mortem analysis to see what didn't
work and what we could have done better. In an operation this complex
some (hopefully) little thing doesn't go as expected. We switched to
the backup site whenever it made sense. It was a straightforward
operation. We had huge ringbinders with contingency plans for
different scenarios.

I'm out of this now but I understand that this newfangled thing called
the Internet and the experience of 9/11 show that the hot/standby
pair strategy is weak and both sites need to be working in production
capacity in parallel to be able to say to your Chairman that you're as
ready as you can be for the next disaster.

My working scenario when I had to explain disaster scenario planning
was that the Vogons would lift our main operations building (or our
backup site) off the planet, with data and staff, instantly with no
notice and we needed to continue to meet business obligations when
that happened. Once you've planned for this every other scenario is
covered and if you try to enumerate all the possible little disasters
and plan for them individually you're going to miss something and get
bit by reality someday.

But were you covered for the Dark Angel scenario?

Quote:
Business Contingency Planning is a recognized job description.

--
--John
to email, dial "usenet" and validate
(was jclarke at eye bee em dot net)

J. Clarke

Posted: Tue Apr 19, 2005 12:05 am    Post subject: Re: [sorta OT] LAN reliability ?

Robert Redelmeier wrote:

Quote:
J. Clarke <jclarke.usenet@snet.net.invalid> wrote:
Whoa. The right approach is defense in depth.

When you need to defend. What to defend is a decision.

so guess what most of the traffic on the "fully open"
network was.

Through traffic, of course. The Internet was designed that way.
The boundary routers could have been easily configured to drop
non-source/dest packets and it would have stopped.

But then it would not have been "completely open".

Quote:
A better
solution would have been to negotiate with the various ISPs for
peering and/or cost to carry traffic. But that may have been
beyond the administration's skills.

99% uptime is piss poor. That's one minute of outage every hour
and a half or so. Any server on which that happens is broken.

That isn't the usual granularity. It's more like 2 hours
every month or two, counting only core time. That includes
diagnosis time and is in addition to non-core hours server
routine maintenance and reboots to recover leaked memory.

why your system is so unreliable. And if you're focussed on
"Windows" and think that eliminating Windows would solve
the problem then you're not really looking at the problem.

Well, yes. It's a long series of dependent chains. Any link
can break. I have no idea if Unix could be configured as
dependently as MS-ActiveDirectory becomes when large.

If you're running XP and your desktop machines in a place
of business are "out of action" "fairly frequently" you
need to find out why and fix it.

First, we do not use MS-WinXP. MS-Win2kPro is bad enough.
Second, failures are not global. People can usually work
at their desktops. But they lose access to some resources
like shared drives or email. Bizarrely, others are unaffected.

That sounds like it might be a licensing issue.
Quote:

-- Robert

--
--John
to email, dial "usenet" and validate
(was jclarke at eye bee em dot net)

J. Clarke

Posted: Tue Apr 19, 2005 12:18 am    Post subject: Re: [sorta OT] LAN reliability ?

Walter Roberson wrote:

Quote:
In article <d40mgf04kp@news3.newsguy.com>,
J. Clarke <jclarke.usenet@snet.net.invalid> wrote:

:If you're running XP and your desktop machines in a place of business are
:"out of action" "fairly frequently" you need to find out why and fix it.

I've been isolated for some years [this city is blooming nicely
in biotechnology, but the nearest "high tech city" is ~900 miles away].

Perhaps I don't get around as much as I should... but as best I recall,
I don't think I've ever met anyone who was actually skilled in
configuring and debugging and repairing MS Windows. I've met a number
of good unix/linux hackers, who could repair just about any software
problem -- but with MS Windows, having a good clue about the Registry
has been about the upper limit, after which the standard problem
resolution stream seems to be "Reinstall the application. Reinstall
Windows. Re-Ghost from a known-good system."

There's a reason for that--it's faster than troubleshooting in most cases.
The major criticism of that approach is that the user loses data and
settings. On a corporate LAN neither of those should be the case.

I can generally get a Windows problem fixed or determine that it's a bug
that requires source access to fix and in that case come up with a
workaround, but I find that that's really _practical_ only for my home
system, where chasing the bug is recreation, and not for any situation in
which my time has dollar value. Cheaper to just reinstall or restore the
image.

By the way, you'll find Novell Zenworks a very useful tool for Windows
troubleshooting.

Quote:
I'm certainly not trying to provoke a Unix vs Windows war here:
I'm asking more: Has my sample been biased? Is there a good
representation in IT of people who can -fix- MS Windows problems
beyond "Search the Knowledgebase and check out the registry, and if you
don't find the answer, then re-install?" And I certainly don't mean
to cast stones at MS Windows specialists with this question: I'm
asking seriously whether MS Windows gurus are uncommon or if I've
just not noticed them.

--
--John
to email, dial "usenet" and validate
(was jclarke at eye bee em dot net)

J. Clarke

Posted: Tue Apr 19, 2005 12:19 am    Post subject: Re: [sorta OT] LAN reliability ?

Al Dykes wrote:

Quote:
In article <d40ng7$8v8$1@canopus.cc.umanitoba.ca>,
Walter Roberson <roberson@ibd.nrc-cnrc.gc.ca> wrote:
In article <d40ae9$khj$1@panix5.panix.com>, Al Dykes <adykes@panix.com
wrote:
In article <d3vrn2$6or$1@canopus.cc.umanitoba.ca>,

Walter Roberson <roberson@ibd.nrc-cnrc.gc.ca> wrote:

It seems to me that more than once I've been in a major bank and
been told "The network's down", and no-one, staff or customer, seemed
surprised.

It (a) wasn't a business-wide outage, (b) they had manual procedures
for essential tasks and (c) people could go to the next branch.

I think you are pointing out here that the observed failures fit
within the parameters of a well-planned business risk model.

My mention of banks was only partially contextual. I would have
predicted that for banks (and other major businesses) that most
customers would expect and tolerate near-zero failure. But that's
not what I actually observe in practice: instead I observe that
people sort of sigh a bit, but don't start raving about
"Why can't you people keep your computers up?!?" If the lineups
move noticeably more slowly than the customers are accustomed to,
some of them get frustrated at the extra time -- but I don't hear
them getting frustrated at the "incompetence" of the bank's systems.

Thus, what I seem to observe is that most people appear to be
"socialized" to think systems/network problems are a fact of life, an
inconvenience but something to be expected, like the way a traffic
accident can slow down a highway. I have heard the occasional
complaint ("I tried to pay my bills but I couldn't because the
bank computers were down") -- but I hear more people complain
(and more bitterly) about the busses being late or about traffic jams --
or about the power having failed and they have to go around and
reset all their VCR clocks.

And if people have become socialized to systems/network problems then
that suggests that network/server problems are "normal" in many businesses
-- as opposed to the mental model that networks/systems are rarely a
problem most places and any operation which falls short of that has
probably been designed or managed incorrectly.
--
"No one has the right to destroy another person's belief by
demanding empirical evidence." -- Ann Landers


Some good points here, and banks may be different from, say, booking and
taking an airline flight, in that (a) people's expectations of customer
service are already very low and (b) online banking and ATMs
have meant that there are fewer "gotta get to the bank by 3PM" events.

If you book a flight, show up and find they don't have you in the
computer or have been overbooked you're going to be _much_ madder than
the bank scenario. Stuck in traffic is similar.

Windows 95 taught people to be tolerant of computer problems at work.

OS/360 taught _me_ that.

--
--John
to email, dial "usenet" and validate
(was jclarke at eye bee em dot net)

T. Sean Weintz

Posted: Tue Apr 19, 2005 12:20 am    Post subject: Re: [sorta OT] LAN reliability ?

Walter Roberson wrote:
Quote:
Recently, I had it put to me that LANs (and firewalls) should be 100%
reliable (barring major equipment failure) -- that networks & security
should be about as reliable as the electrical mains (i.e., something
that can be taken for granted nearly all the time, and repairs should take
only a few minutes.)

A few thoughts -

Rarely does an ENTIRE network go down, right? It happens, but that is
pretty darn rare. PARTS of it usually go down from time to time. When
that "part" happens to be your gig Ethernet or ATM core for a particular
building such as your corporate headquarters, it can be rather dramatic,
tho. <grin>

The same could be said for the electrical mains. In fact power here is
problematic enough that we have diesel-powered backup generators at many
of our buildings. And we are in the middle of a city.

In all fairness, the electrical systems do not have the complexity and
constant need for change that most data networks have. Even voice
networks are quite simple by comparison.

For localized network outages, there are of course typically off-site
backup hosts, etc., that are utilized in such events. But the honchos
upstairs in the building that is down don't see that everything is fine
for customers. Web servers are up for the outside world to see, customer
transactions are being processed, but the CEO of the company can't get
to his favorite blog or check his in-house email. To him, "the network
is down."

Quote:

I was informed that "millions of businesses every day" have that
kind of LAN reliability.

I'd counter with "75% of all statistics are made up on the spot" <grin>
Seriously, where did this person get that statistic? My experience
certainly does not bear that out.

Quote:

Is that level of reliability the norm in real SMBs, with 500-ish
hosts, multiple subnets, and a mandatory deny-by-default firewall
policy?

The smaller the org, the smaller the network, the more reliable it is
IME. The less complex, the more reliable IME. I'm sure you'd consider
that common sense.

That said, what you describe above fits my network pretty well. I manage
it pretty aggressively/proactively - read the syslog server logs every
day looking for issues, etc. I DO have outages from time to time. Most
often these are down WAN links, however that is the fault of the local
telco. Things like that happen when you run DS1 circuits over copper
pairs that were put in place over 100 years ago (really - no kidding -
many of the pairs in this city ARE that old!) So I'd have to say
the core of my network is actually more reliable than the local PSTN and
the power mains here.

However, given what the users know and experience, "reliability" leaves
room for interpretation. For the average end user, having an email
message dropped because it came from a blacklisted server might be an
"unreliable network" in their mind. Execs telecommuting from home over
a cable modem on a congested, oversubscribed node that drops packets,
which in turn drops their Citrix MetaFrame sessions, have in my
experience blamed our network. Try explaining to the user
that "yes, I understand you have no problems going to any websites from
your home internet connection. However, the problem IS on your end, not
back here at the office."


Quote:

Which is the truer picture in a growing organization with fluid network
access requirements: that the network & security person has barely
anything to do because they set up the equipment "right" the
first time?

LOL. Even if set up "right the first time", it won't remain so... see below.

Quote:
Or that keeping up with the network & security
changes and failures and planning is more than a full-time job
that can involve many a late night (or marathon repair session)?

Full time job. The problem is the changes. New sites and offices open up,
old ones close. Topologies change. Access to new apps over the internet
(designed by folks who consider ease of integration into your
environment last or not at all), etc, etc. Reliability is much easier
when the goal is not a moving target.

When I worked for a large bank, that I won't mention by name, rather
than CHASE down our folks during big network changes, we rented hotel
rooms in MANHATTAN for weeks at a time, and would send our folks over to
the hotel for a few hours of sleep once in a while. I once witnessed my
boss staying at headquarters for over 72 hours without once leaving the
building. Sometimes it crosses over from being a mere "full time job" to
being a "way of life". I started to know I had a problem when I started
dreaming at night about PIX over IP tunnels.

Quote:

How much truth is there, in real organizations, to those old
cartoons of a skeleton with cobwebs in front of a computer terminal,
with the caption "The network's down again." ?

Depends. I never quite got that one - is the skeleton supposed to be the
user or the admin?

Quote:

It seems to me that more than once I've been in a major bank and been
told "The network's down",

Which could mean anything. Most likely it means the leased line from
that branch back to the main office is down. Hardly the same in my mind
as the network being down.

Quote:
and no-one, staff or customer, seemed
surprised. I also seem to recall hearing a number of casual
conversations along the lines of "Oh yeah, the network
went down again at work today"... and I don't recall
hearing anyone reply "Our network never goes down"... not
for anything short of a Service Provider.

There has long been a "blame the computer" component to our culture -
it's a common scapegoat. The network has been added into that. Folks
WANT to be able to have something to blame, real or not. 500 years ago
it was "the devil". Today it's "the network".

Quote:


Lastly: has anyone observed a network "freak out", with a series
of normally reliable devices getting confused and staying confused
all through hours of standard problem isolation procedures, with no
discernable reason for the multiple failures -- and for the devices
to eventually settle down, and start working properly with
configurations that didn't work before?

Sure. I think anyone who has helped manage a network of any size has
seen that at least once. Never fails that things settle down right when
the hour you can start intrusive testing pops up, too.

Keeping configs as simple as possible tends to minimize this IME. Never
seen it happen on, say, a network that had no vlans, all routing was
done via static routes, and no multicast stuff was used, etc.

Robert Redelmeier

Posted: Tue Apr 19, 2005 12:20 am    Post subject: Re: [sorta OT] LAN reliability ?

Al Dykes <adykes@panix.com> wrote:
Quote:
For Windows on desktops in business it isn't so much the MTBF
as the MTTR that makes a happy shop. With all the user data
and profile on the server we just drop in a fresh pre-imaged
box when a user has a problem, hardware or software. The sick
box goes on the bench for a hardware repair or reimage and reuse.

And with USB flashdrive sticks, maybe that's also the way
to go for laptops. No user data. Appliance computing.

Quote:
IMO MS servers running mainstream Windows applications can be
very good if your expectations of scale and complexity are
reasonable.

I think that's where we blow it. 100+k machines around
the world. Most data nearby, but all visible.

-- Robert

Walter Roberson

Posted: Tue Apr 19, 2005 12:20 am    Post subject: Re: [sorta OT] LAN reliability ?

In article <d40t40$cm8$1@panix5.panix.com>, Al Dykes <adykes@panix.com> wrote:
:For Windows on desktops in business it isn't so much the MTBF as the
:MTTR that makes a happy shop. With all the user data and profile on
:the server we just drop in a fresh pre-imaged box when a user has a
:problem, hardware or software.

:IMO MS servers running mainstream Windows applications can be very
:good if your expectations of scale and complexity are reasonable.

At the risk of becoming increasingly off-topic for "lans", but
still on the topic of "user-perceived reliability", I wonder if
someone might have some information on reliability of this sort
of configuration as a fileserver to replace Novell 5:

fileserver ~~ NFS3 ~~ Linux node ~~~ samba/SMB ~~~ Windows PC


When I was young and impressionable, I saw first hand that NFS
(probably v1 then) was incompatible with reliability, because *in
practice* a single crashed node that had mounted a filesystem
read-write could end up requiring that -all- the nodes be rebooted to
clear the problem. And having seen that happen all too often, When I
Had My Way in designing our services, NFS was banished. NFS didn't have
the best of reputations when lots of different systems were accessing
it [e.g., as would be the case for a fileserver.] But NFS has, I know,
grown up somewhat since then: has it made it to production grade for
reliability, locking, and speed?

Similarly, when I was examining earlier samba editions, locking was a
big issue for it as well... as was ability to handle a lot of
simultaneous activity. Perhaps that has been cleared up as well?

We don't do much database work, so we don't often work at the level of
record-level locks, but we would plausibly have multiple appenders to
single files, and we would certainly have multiple people attempting to
open the same document with write permissions. With Novell Netware 4/5,
the multiple-writer issue was handled acceptably for our purposes: a
message saying the file was locked was better than two people
overwriting each other's changes.

What's the word these days: "Don't worry, those are problems of
the past", or "Not there -yet- for NFS + samba?", or "Run Netware Services
for Linux to mediate the accesses rather than trusting samba" ?
--
Warning: potentially contains traces of nuts.

Walter Roberson

Posted: Wed Apr 20, 2005 4:20 pm    Post subject: Re: [sorta OT] LAN reliability ?

In article <d40de2$g7g$1@X31.networkingunlimited.com>,
Vincent C Jones <vcjones@X31.networkingunlimited.com> wrote:

:Warning: I have been called a "Network Management Bigot" for
:requesting all sorts of monitoring. However, my experience has been
:that if you look closely enough at how the network is ACTUALLY
:running, you will often spot problems before they are manifested
:as service outages. Examples range from marginal links which are
:reporting only brief intermittent hiccups on their way to total
:failure, to routing tables which indicate that the routes in use
:are not the routes you designed with the high probability that
:when something fails, the network will roll over and die rather
:than select an alternate route.

That sounds like much more sophisticated monitoring tools than
MRTG or even Fluke's OptiView (nee' Network Inspector). What is
available that can give useful information at that level, over
a range of manufacturers (e.g., Nortel BayStack, Nortel Accelar,
Cisco 3750, likely some Cisco 2960's)?

We presently have only about 25 local switch units (mostly in stacks
of 2-4); we might be getting more for redundancy.


More importantly than monitoring software: where can one learn what
events can be safely ignored; and what needs replacing "sooner or
later"; and what isn't failing now but it's time to tell management
firmly that a replacement is a priority? Would there be some good
books/documents about network risk assessment?

We are not in a situation where "The network MUST stay up" -- but the
longer access is down, the less work our people can do, so it would be
better if I could learn how observed problems translate into
risk probabilities, so we can do cost/benefit analyses that take into
account the then-current financial priorities.
--
I was very young in those days, but I was also rather dim.
-- Christopher Priest

Vincent C Jones

Posted: Thu Apr 21, 2005 1:23 pm    Post subject: Re: [sorta OT] LAN reliability ?

In article <d45ul8$pih$1@canopus.cc.umanitoba.ca>,
Walter Roberson <roberson@ibd.nrc-cnrc.gc.ca> wrote:
Quote:
In article <d40de2$g7g$1@X31.networkingunlimited.com>,
Vincent C Jones <vcjones@X31.networkingunlimited.com> wrote:

:Warning: I have been called a "Network Management Bigot" for
:requesting all sorts of monitoring. However, my experience has been
:that if you look closely enough at how the network is ACTUALLY
:running, you will often spot problems before they are manifested
:as service outages. Examples range from marginal links which are
:reporting only brief intermittent hiccups on their way to total
:failure, to routing tables which indicate that the routes in use
:are not the routes you designed with the high probability that
:when something fails, the network will roll over and die rather
:than select an alternate route.

That sounds like much more sophisticated monitoring tools than
MRTG or even Fluke's OptiView (nee' Network Inspector). What is

Not necessarily!!! For example, MRTG is extremely useful, as long
as you use it to monitor more than just usage. It does an excellent
job of monitoring other parameters that are exposed by SNMP, such
as link errors, CPU utilization, route changes, and the like.
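
As a rough illustration of turning raw SNMP counters into something worth graphing, the sketch below computes an error ratio from two successive samples. It assumes the readings have already been collected by whatever poller you use (for example IF-MIB's ifInErrors and ifInUcastPkts, which are 32-bit counters that wrap); the function names are made up for this example and are not part of MRTG:

```python
COUNTER32_MODULUS = 2 ** 32  # SNMP Counter32 values wrap around at 2^32


def counter_delta(previous: int, current: int,
                  modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two readings, allowing for at most one wrap."""
    return (current - previous) % modulus


def error_ratio(prev_errors: int, cur_errors: int,
                prev_packets: int, cur_packets: int) -> float:
    """Errored packets as a fraction of packets seen during the interval."""
    packets = counter_delta(prev_packets, cur_packets)
    if packets == 0:
        return 0.0  # idle link: nothing to divide by
    return counter_delta(prev_errors, cur_errors) / packets


if __name__ == "__main__":
    # Hypothetical samples taken five minutes apart on one interface.
    print(error_ratio(prev_errors=10, cur_errors=15,
                      prev_packets=4_294_967_000, cur_packets=704))
```

Note that a device reboot also resets the counters, and this sketch cannot tell a reset from a wrap; a real poller would check the device's uptime as well.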

It also means logging into each router (particularly after a change),
looking at the routing tables, routing topology database, etc.,
verifying that everything matches what your design says it should be,
then triggering various faults and verifying that the recovery is the
recovery you designed. Simple example: when routing with EIGRP, is the
alternate route a feasible successor? Bottom line: don't assume your
design and implementation are correct; prove they are. Then test them to
failure so you can be sure they behave correctly under stress.
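
For the EIGRP example, the check comes down to the feasibility condition: a neighbor qualifies as a feasible successor only if the distance it reports to the destination is strictly less than your own feasible distance. The sketch below is a deliberately simplified model (additive costs instead of the real composite metric, and the current best distance standing in for the feasible distance); the route data and names are hypothetical:

```python
from typing import List, NamedTuple, Tuple


class Route(NamedTuple):
    neighbor: str
    reported_distance: int  # metric the neighbor advertises to the destination
    link_cost: int          # simplified cost from this router to the neighbor


def classify(routes: List[Route]) -> Tuple[Route, List[Route]]:
    """Return (successor, feasible successors) under the simplified model."""
    successor = min(routes, key=lambda r: r.reported_distance + r.link_cost)
    feasible_distance = successor.reported_distance + successor.link_cost
    feasible = [r for r in routes
                if r is not successor
                and r.reported_distance < feasible_distance]
    return successor, feasible


if __name__ == "__main__":
    table = [Route("A", 10, 5), Route("B", 14, 10), Route("C", 20, 1)]
    best, backups = classify(table)
    print("successor:", best.neighbor)
    print("feasible successors:", [r.neighbor for r in backups])
```

In this made-up table, C offers a shorter total path than B but fails the condition, so losing A would send the router into an active query for the route rather than an immediate switch. That is the kind of surprise the manual check is meant to catch.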

Quote:
available that can give useful information at that level, over
a range of manufacturers (e.g., Nortel BayStack, Nortel Accelar,
Cisco 3750, likely some Cisco 2960's)?

Careful and routine perusal of the logs can be enlightening. The logs
are not for post mortem analysis only. A multivendor environment makes
automating the analysis more challenging, but most vendors of "business
grade" hardware provide logging of a wide range of interesting events.

Quote:
We presently have only about 25 local switch units (mostly in stacks
of 2-4); we might be getting more for redundancy.

Substitute spanning tree for my comments on routing :-) Is the spanning
tree what you think it is? Is the topology simple enough that the new
spanning tree can be quickly determined when an interswitch trunk fails?
Do the "fast convergence" features configured actually work?

Quote:
More importantly than monitoring software: where can one learn what
events can be safely ignored; and what needs replacing "sooner or
later"; and what isn't failing now but it's time to tell management
firmly that a replacement is a priority? Would there be some good
books/documents about network risk assessment?

Doubt that there is a book available; there is no market for one (I
tried, but the mass market is for "pass the cert" books). Part of the
problem is that the answer is the classic "it depends." If you really
understand how your network works, you will know what is important and
what is not. Some examples:

Review yesterday's logs: Can you explain every event reported? Every
report of a routing peer associating or disassociating? Every backup
call placed? Every spanning tree change?

Still in yesterday's logs: Was the reaction to every network
perturbation correct? If a link failed, did it have the proper
side effects such as routing peer change reports, dial backup,
HSRP active moves, etc.?

Still in yesterday's logs: Do the events reported in the log match up
with the events reported by your status monitoring system? Are there
failures reported by your monitoring system (even "WhatsUp Gold" can do
that job) that are not reflected in the log or vice versa?
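
The cross-check between the logs and the status monitor can be sketched as a set comparison. This assumes both sources have already been parsed into comparable tuples, here (device, event type, time bucket); that normalization is the hard part in practice, and the event data below is invented for illustration:

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, str, str]  # (device, event type, time bucket)


def unmatched_events(log_events: Iterable[Event],
                     monitor_events: Iterable[Event]
                     ) -> Tuple[List[Event], List[Event]]:
    """Return (seen only in the logs, seen only by the monitor), sorted."""
    log_set, monitor_set = set(log_events), set(monitor_events)
    return sorted(log_set - monitor_set), sorted(monitor_set - log_set)


if __name__ == "__main__":
    # Hypothetical events, already bucketed to the minute.
    from_logs = [("sw1", "link-down", "02:14"), ("rtr2", "peer-down", "09:30")]
    from_monitor = [("sw1", "link-down", "02:14"), ("sw7", "unreachable", "11:05")]
    only_logs, only_monitor = unmatched_events(from_logs, from_monitor)
    print("in logs but not seen by monitor:", only_logs)
    print("seen by monitor but not logged:", only_monitor)
```

Either list being non-empty is a prompt to go and look, since clock skew between devices can produce false mismatches at the bucket edges.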

Look for correlations between the events reported and the traffic, error
rates, and the like recorded by MRTG (or whatever). Note that MRTG only
gives you 5-minute resolution for the past 24 hours, so you may want to
look at preserving that data longer if you don't check at the same time
every day, including weekends :-)

While you're looking at the error rates, are there correspondences
between links which are supposed to be independent? This could imply a
common point of failure which needs to be eliminated. On WAN links, it
could mean that links which you are paying to have diversely routed have
been groomed back to one or more common trunks.
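
One way to look for such correspondences is to correlate the per-interval error counts of two links that are supposed to be independent. The sketch below uses a plain Pearson coefficient; the sample series are invented, and what counts as "suspiciously high" is a judgment call rather than anything stated in this thread:

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 samples")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical error counts per 5-minute interval on two WAN links.
    link_a = [0, 0, 5, 0, 7]
    link_b = [0, 0, 4, 0, 9]
    print(f"correlation: {pearson(link_a, link_b):.2f}")
```

Errors that rise and fall together on nominally diverse links are worth a call to the carrier about how the circuits are actually routed.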

Etc. etc. etc.

Quote:
We are not in a situation where "The network MUST stay up" -- but the
longer access is down, the less work our people can do, so it would be
better if I could learn how observed problems translate into
risk probabilities, so we can do cost/benefit analyses that take into
account the then-current financial priorities.
--
I was very young in those days, but I was also rather dim.
-- Christopher Priest

The biggest challenge is justifying the effort required to keep track of
what is happening. Note that there is significant expert analysis
required with little chance of an immediate payback. Be wary of getting
caught in the network management success trap. If you do too good a job,
users (and their management) will never see all the problems you cured
while still beneath their radar, and you'll be asked "Why spend all that
money on network management when the network never fails?"

Good luck and have fun!
--
Vincent C Jones, Consultant
Networking Unlimited, Inc.
Tenafly, NJ  Phone: 201 568-7810
http://www.networkingunlimited.com
Expert advice and a helping hand for those who want to manage and
control their networking destiny

Walter Roberson

Posted: Thu Apr 21, 2005 4:20 pm    Post subject: Re: [sorta OT] LAN reliability ?

In article <d489gh$9o5$1@X31.networkingunlimited.com>,
Vincent C Jones <vcjones@X31.networkingunlimited.com> wrote:

:Be wary of getting
:caught in the network management success trap. If you do too good a job,
:users (and their management) will never see all the problems you cured
:while still beneath their radar, and you'll be asked "Why spend all that
:money on network management when the network never fails?"

As Pooh would say, "Oh, Bother!".

I've been in that trap for years. People see the failures and not the
successes or the efforts; and they wonder what you -do-, since you
don't seem to be -producing- anything... and then they cut budgets.
Then when a disaster happens because you couldn't get the time or
money allocated to go redundant or build/buy the proper monitoring
and testing tools, it is due to "your Bad Planning". :(
--
'ignorandus (Latin): "deserving not to be known"'
-- Journal of Self-Referentialism

Walter Roberson
Guest

Posted: Thu Apr 21, 2005 11:14 pm    Post subject: Re: [sorta OT] LAN reliability ?

In article <d489gh$9o5$1@X31.networkingunlimited.com>,
Vincent C Jones <vcjones@X31.networkingunlimited.com> wrote:
:In article <d45ul8$pih$1@canopus.cc.umanitoba.ca>,
:Walter Roberson <roberson@ibd.nrc-cnrc.gc.ca> wrote:

:>More importantly than monitoring software: where can one learn what
:>events can be safely ignored;

:If you really
:understand how your network works, you will know what is important and
:what is not.

I think it would be difficult for me to justify tracking down every
CRC error -- test all the cables each time, swap to a different
copy of the same kind of NIC, swap to different NICs, SmartBits several
adjacent runs in case it is cross-talk related, and so on.

To give an analogy: when our desktop workstations panicked once every
few years with a memory error, we didn't worry, because we knew
that those events could be caused by cosmic rays (literally!)
or by natural radiation (the flower-garden borders are granite,
and granite is slightly radioactive.) We developed a feel for
how often was "normal", and how often suggested a more serious
problem (loose SIMM, need to clean the contacts.) There wasn't any
point in calling for maintenance on a stray memory error --
the incidence rate, impact, and reproducibility were too low.

With a network, I have not yet developed a good feel for "acceptable"
error rates. Retransmissions will happen, so single errors or bursts
are taken in stride at low levels; it takes a fair number to
interfere noticeably with the regular data (especially if the link
is not particularly under load.) A -stable- error rate is
not necessarily a problem.
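
To make "stable" a little more concrete, here is a minimal sketch that turns cumulative interface counters into per-interval error ratios and fits a straight line through them. The readings are invented, and the sampling interval and the idea of using a least-squares slope as the stability test are assumptions for illustration, not an established threshold.

```python
def error_ratios(samples):
    """Turn cumulative (frames, errors) counter readings into per-interval ratios."""
    ratios = []
    for (f0, e0), (f1, e1) in zip(samples, samples[1:]):
        frames = f1 - f0
        errors = e1 - e0
        if frames <= 0 or errors < 0:
            continue  # counter reset or idle interval: skip rather than guess
        ratios.append(errors / frames)
    return ratios

def slope(values):
    """Least-squares slope of values against their index (change per interval)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

# Hypothetical daily readings of cumulative (frames received, CRC errors).
stable = [(0, 0), (1_000_000, 10), (2_000_000, 21), (3_000_000, 30), (4_000_000, 41)]
rising = [(0, 0), (1_000_000, 10), (2_000_000, 30), (3_000_000, 70), (4_000_000, 150)]

for name, readings in (("stable", stable), ("rising", rising)):
    r = error_ratios(readings)
    print(name, [f"{x:.1e}" for x in r], f"slope={slope(r):.2e}")
```

A slope near zero matches the "stable" case above; a clearly positive slope is the kind of trend worth watching, though what counts as "clearly" still has to come from experience with the particular network.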

An error rate is, though, also a hint that the link might deteriorate,
perhaps rapidly -- and I don't know how to go from a particular error
rate to a likelihood of real failure. 'This link only has "two months
to live"' is going to be treated somewhat differently than 'There is
about a 3% chance of this link failing in the next two years, leading
to about a half-hour of slow DNS response when it does.' I cannot
just go on "gut instinct" to convince management to divert funds
to buy a new switch, especially not if any failure would most -likely-
happen in the next fiscal year instead of this one...
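
If a failure probability could be estimated, the arithmetic for putting it in front of management is simple. The sketch below uses the 3%-over-two-years and half-hour figures from the paragraph above; the cost per hour is a placeholder invented for the example, and the conversion to a per-year probability assumes a constant failure rate, which is a strong simplification.

```python
def annual_probability(p, years):
    """Convert a failure probability over `years` into a per-year probability,
    assuming a constant failure rate over the whole period."""
    if not 0 <= p < 1 or years <= 0:
        raise ValueError("need 0 <= p < 1 and years > 0")
    return 1 - (1 - p) ** (1 / years)

def expected_annual_loss(p, years, outage_hours, cost_per_hour):
    """Expected yearly cost of a single failure mode: probability times impact."""
    return annual_probability(p, years) * outage_hours * cost_per_hour

# 3% over two years and a half-hour outage come from the post;
# cost_per_hour is a made-up placeholder, not a figure from the thread.
loss = expected_annual_loss(p=0.03, years=2, outage_hours=0.5, cost_per_hour=2000.0)
print(f"per-year failure probability: {annual_probability(0.03, 2):.4f}")
print(f"expected loss per year: {loss:.2f}")
```

Comparing that expected loss with the price of the replacement switch gives a number to argue over, but the hard part remains the one described above: getting from an observed error rate to the probability in the first place.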

--
Entropy is the logarithm of probability -- Boltzmann

Walter Roberson
Guest

Posted: Wed May 04, 2005 12:20 am    Post subject: Re: [sorta OT] LAN reliability ?

In article <d3vrn2$6or$1@canopus.cc.umanitoba.ca>,
Walter Roberson <roberson@ibd.nrc-cnrc.gc.ca> wrote:
:Recently, I had it put to me that LANs (and firewalls) should be 100%
:reliable (barring major equipment failure)

:Is that level of reliability the norm in real SMBs, with 500-ish
:hosts, multiple subnets, and a mandatory deny-by-default firewall
:policy?

I received some good feedback to my earlier questions -- thanks,
everyone.

Today I came across an article that directly addresses real-world
IT reliability measures:

http://www.bcr.com/forum/viewtopic.php?t=271

The quick summary is that Savvis did a survey of 100 private firms.
[Abstracting], the average -monthly- downtime was:
- 4.5 hours for network hardware
- 4.7 hours for network transport
- 2.3 hours for security

More figures and number-of-incidents analysis are available from
the above article.


[The internal needs survey I've been doing suggests that the internal
-expectations- here are for close to four-nines reliability... on
a one-nines budget.]
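
For a sense of scale, the monthly figures above can be converted to availability percentages and compared with what each number of "nines" allows. This is a rough sketch that assumes downtime is measured against every hour of an average month (730.5 hours) rather than core hours only; the article may count differently.

```python
HOURS_PER_MONTH = 365.25 * 24 / 12  # 730.5 hours in an average month

def availability(downtime_hours, period_hours=HOURS_PER_MONTH):
    """Fraction of the period during which the service was up."""
    return 1 - downtime_hours / period_hours

def allowed_downtime_minutes(nines, period_hours=HOURS_PER_MONTH):
    """Downtime budget per period for a given number of nines (2 -> 99%)."""
    unavailability = 10 ** -nines
    return unavailability * period_hours * 60

# Average monthly downtime figures quoted from the survey summary.
survey = (("network hardware", 4.5), ("network transport", 4.7), ("security", 2.3))
for label, hours in survey:
    print(f"{label}: {availability(hours):.3%} available")

for n in (1, 2, 3, 4):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes of downtime per month")
```

On that basis each surveyed category sits between two and three nines, while four nines leaves only a few minutes of downtime per month.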
--
Studies show that the average reader ignores 106% of all statistics
they see in .signatures.

All times are GMT