Predicting Failure
Two tools help administrators catch subtle network problems caused by hardware faults or changing network usage patterns

By ALAN ZEICHICK
September 4, 2000

To the end user, the network is binary. It's up or it's down. The Web server is accessible or it's not. The printer is working or it's broken.

All too often, this binary view shapes how IT and network administrators work as well: If the network ain't broke, don't mess with it. If it's broke, fix it and wait for something else to break. If things break often, set up redundant or failover systems to eliminate, or at least reduce, downtime.

Administrators can do better than that. Managed devices such as hubs, switches, routers, servers and other systems can send SNMP traps when they experience partial failure, and can respond to queries about their current utilization. By exploiting both abilities, administrators can practice preventive maintenance. Server manufacturers have provided aspects of this service for years, most notably in SMART hard drives, which can alert administrators to an imminent disk failure.

Those capabilities aren't unique to hard drives. Most managed devices can alert administrators to what might appear to be minor problems or changes in operating status: a switch fan is running slowly, the internal temperature in a router is rising, a server's network interface card is sending erratic bad packets. None of those errors would cause major faults on its own, especially because modern network protocols, LAN designs and equipment are built to be fault-tolerant and resilient. However, when many of these minor faults are analyzed over time, they might signal that a component is ready to fail. Similarly, when devices are polled for their current utilization and other performance parameters, and that data is compared against a historical database, unexpected changes in their usage patterns might signal real problems ahead.
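The idea of watching minor faults accumulate can be illustrated with a short sketch. It is not taken from either product reviewed here; the class name, the device label and the thresholds are all hypothetical. It simply counts minor events per device inside a sliding time window and reports when a device has produced enough of them to deserve a closer look.

```python
from collections import defaultdict, deque


class MinorFaultTracker:
    """Counts minor events per device inside a sliding time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds  # how far back events still count
        self.threshold = threshold            # events in the window that raise suspicion
        self._events = defaultdict(deque)     # device name -> event timestamps

    def record(self, device, timestamp):
        """Store one event; return True if the device now looks suspect."""
        times = self._events[device]
        times.append(timestamp)
        # Drop events that have aged out of the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) >= self.threshold


# Hypothetical policy: three minor events within 24 hours is worth a look.
tracker = MinorFaultTracker(window_seconds=24 * 3600, threshold=3)
for seconds in (0, 3600, 7200):
    suspect = tracker.record("switch-12", seconds)
print(suspect)
```

A real tool would weigh event types differently (a slow fan is not a bad packet), but the windowed count captures the basic point: no single event matters much, while the pattern does.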
We examined two packages that help network managers and administrators analyze SNMP data. With luck, that analysis will not only provide a detailed view of the real behavior of the network but also help predict component failures.

Entuity Inc.'s Eye of the Storm (EotS), priced at about $25 per managed network element (such as a switch or router port), focuses on the physical well-being of the network. Because it passively monitors SNMP events in real time, it's the best tool we've seen for helping administrators identify components on the brink of failure.

Concord's product, the eHealth Suite, is performance-oriented. It uses active SNMP polling to build a baseline database of the utilization of each managed infrastructure element on the network, and then watches for exceptions. When utilization departs from what its complex rules and historical data predict, administrators are alerted in case the change represents an error condition or an unexpected shift in network utilization that must be addressed before it affects end-to-end performance.

Eye Of The Storm

The heart of EotS is its server application, which discovers network equipment, monitors and probes for SNMP messages, and maintains a database of events. A separate EotS client provides three tools: a Component Viewer, which displays a drill-down menu of all managed ports on the network; a Bulletin Board, where SNMP events can be watched as they happen, organized in such a way as to make patterns obvious; and a Report Center, which displays a network inventory, lists historical faults and provides metrics on port traffic and utilization. Targeted at very large networks with more than 10,000 managed ports, EotS is tightly focused on its goal of helping LAN administrators find problems before they become serious.

We installed two instances of EotS version 2.1.2 on two Windows NT servers.
The first was configured to use a sample database, provided by the company, that contained a large quantity of data--SNMP events and equipment inventory--from one of Entuity's customers. We used this instance to examine the product's report-generating capabilities. We configured the second instance to perform real-time discovery and event monitoring on our own, much smaller LAN.

On Windows NT, the server application runs as a set of services that communicate with a MySQL database, which also runs as a service on the same machine. Once the product is installed, it is accessed and administered via a Web interface; one of EotS's services is an HTTP server. Some functions, such as logging into the server, are handled from within the browser. The real EotS client, however, is a Java 2 application, which can either be launched from within the browser or be downloaded from the EotS server and run as a stand-alone desktop application. The Java application is based on Sun's Java 2 Standard Edition, and the EotS server software also includes the J2SE virtual machine code and installer for 32-bit Windows and Solaris. Users of other operating systems, such as Macintosh, cannot run the EotS client--we tried. Ultimately, we ran the client both on a Windows 2000 Professional desktop and on the Windows NT Server desktop itself.

Both servers, by the way, were dual-processor systems: the demo database ran on a Compaq ProLiant DL360 server with dual 800MHz Pentium III processors and 512MB RAM, and the live server was a Dell PowerEdge 2450 server with dual 733MHz Pentium III processors and 512MB RAM.

Our biggest complaint is that the Web-based interface is extremely slow, even when run on the server console itself. The Java application, however, is much more responsive. The documentation, though plentiful, is poorly organized and is little help to an administrator trying to learn how to use the product.
Because this project was about predictive failure analysis, we concentrated on EotS's Bulletin Board, which displays SNMP events. On the live system, we let the EotS server run for several days as it discovered the resources on our network; the discovery process is slow. According to the vendor, that's because the product is typically installed on networks so large that a discovery can create a heavy load on the network.

On the live system, the Bulletin Board displayed events as they happened. Unfortunately, we had trouble simulating alerts that the software could see, though we did manage a few, such as a thermal alert on a switch that we created by stopping its cooling and plugging its air vents. We then switched over to the provided event database, which included a small simulator that generated SNMP events the Bulletin Board could trap.

The program responds quickly when SNMP events happen, but out of the box, the only place you can see the events unfold is on the Bulletin Board console itself--the application contains no built-in functions for e-mail, pager or telephone alerts, or even for displaying pop-up windows when things go wrong. Because the types of faults that EotS is designed to watch for are subtle and may take several hours or even days to go from a minor problem to a full system failure, and because the company expects that its customers would have full-time network administrators monitoring the system, the lack of active alerting may not be a problem. In any case, the company says these features will be added to a future version.

EotS lends itself to use as a predictive-failure tool for two reasons. The first is that current and historical errors on devices can be collected and analyzed on the Bulletin Board in real time, and the Report Center can be used for more in-depth analysis. Because the MySQL database is open-ended, even events widely separated in time can be caught and analyzed.
The second reason is that the Bulletin Board alerting system is tightly integrated with the separate Component Viewer and its discovery database. When a device begins throwing off SNMP events, the Component Viewer can examine that device and its ports to determine which other ports depend upon it. For example, a failure on Router 14's port 3 might take down Workgroup Switch 12 as well as its users. Because EotS also maps VLANs, it gives administrators the opportunity to reshuffle users electronically by reconfiguring the VLAN to take users off affected ports. Alternatively, administrators can physically move those users' connections to isolate the problem device or port. A handy feature is that the Component Viewer can establish a telnet connection with a device right from the Java interface. Similarly, if users complain about intermittent errors, the Component Viewer helps trace their particular connectivity and isolate the fault.

Eye of the Storm is an excellent product, focused tightly on analyzing subtle problems that affect a network's LAN infrastructure equipment. If it included its own alerting capabilities and also managed the end nodes of the network--or at least the NIC--it would be near perfect. With pricing starting at $25 per managed port for a minimum installation of 1,000 ports and declining rapidly thereafter, it's also a bargain for a large distributed network, with its complex gear that's apt to warn administrators that a failure is imminent--if only they'd listen.

eHealth/Live Health

The eHealth Suite is a four-part, client/server-based suite. The foundation is Live Health, which polls devices in real time to gather performance metrics. It provides a Web-based interface, generates on-screen and Adobe Acrobat-based reports, and can also talk to external management applications such as HP OpenView.
The installation procedure for Live Health not only sets up Concord's application, but also adds the CERN Web server, the Open Ingress database engine and SCO's XVision PC X server, which is used to manage the software locally from the server console.

The other three packages, which must be purchased separately, are Network Health, which monitors the hubs, switches, routers and other network infrastructure; System Health, which monitors servers and their individual services and daemons; and Application Health, which performs end-to-end performance and availability monitoring of Web servers and applications like Microsoft Exchange. For this review, we focused on the Network Health module.

The software suite can be installed on an HP-UX, Solaris or Windows NT server; we chose to install it on the same Compaq ProLiant DL360 hardware used for the previous product test. After a complex scheme of entering license information (Concord uses the database server's network MAC address as a unique key for a generated license code), the software was ready to begin a discovery process on the network.

After we provided the software with an IP range, it quickly identified 67 SNMP-manageable devices on the LAN, ranging from hubs and switch ports, to a Network Appliance NetCache C1100 we're currently evaluating, to a number of servers that had SNMP agents active. Normally, the software performs the discovery process once per day, at midnight. This, like most health parameters, is extremely user-configurable--there are options for just about everything, which can be overwhelming until you decide to just trust the defaults. All configuration takes place from the server console.

Once the discovery is done, the application begins probing each of the managed devices at regular intervals (five minutes by default). According to Concord, Live Health is preprogrammed with the MIB definitions for more than 500 SNMP-manageable devices, so it can gather as much information as is relevant.
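To make the polling step concrete, here is a minimal sketch of how utilization is commonly derived from two successive readings of an interface octet counter, such as the standard ifInOctets object, together with the interface speed. The article does not say which objects Live Health actually reads or how it computes its figures, so the function names and sample numbers below are assumptions for illustration only.

```python
COUNTER32_MODULUS = 2 ** 32  # 32-bit SNMP counters wrap back to zero at this value


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Difference between two counter readings, allowing for a single wrap."""
    return (current - previous) % modulus


def utilization_percent(prev_octets, curr_octets, interval_seconds, speed_bps):
    """Average utilization over the polling interval, as a percentage of line speed."""
    bits_transferred = counter_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits_transferred / (interval_seconds * speed_bps)


# Hypothetical readings taken five minutes apart on a 10Mbps port.
print(utilization_percent(0, 37_500_000, 300, 10_000_000))
```

The modulo step matters on busy links: a 32-bit counter can wrap between polls, and a naive subtraction would then produce a negative result. A counter that wraps more than once within one interval cannot be recovered this way, which is one argument for shorter polling intervals on fast ports.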
The MIBs that Live Health uses are for performance- and availability-oriented metrics. Unlike EotS, it doesn't ask devices for their fan status or monitor their operating temperature. Nor does it determine the network's topology and, thus, device and port dependencies. All captured data is stored in the Open Ingress database, normally for six weeks.

After this initial survey has been completed, the company recommends letting the application run for a few days so it can learn normal traffic patterns and build a baseline. Live Health then comes into its own, and is primarily driven from its Web-based interface, which is very quick and responsive.

One important use of Live Health is real-time exception monitoring, using an interface tab that is itself named Live Health. Based on a complex set of rules, the interface shows when the results of its regular polling turn up something out of the ordinary: a port that's normally busy suddenly shows zero traffic, a CPU exceeds its normal utilization, or the e-mail application is running slower than expected. What sets Live Health apart from other products is that the rules can be quite complex. For example, a rule might fire only when a router's WAN port utilization not only exceeds normal parameters but also exceeds historical usage patterns for that day of the week and time of day. Those rules may be set explicitly, but most administrators would be content to let the system determine what constitutes an exception, based on its extensive database of "best practices" rules as well as the historical database.

Once an exception occurs, it's displayed on the Live Health screen; from there, administrators can tell the system to begin monitoring that device in "fast mode," polling as often as every 30 seconds, so as to track the unfolding situation. Like Eye of the Storm, Live Health doesn't offer any external alerting functions of its own, such as e-mail or paging, but relies on links to third-party programs to provide that feature.
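A rule that compares a reading against history for the same day of the week and time of day can be sketched as follows. This is only an illustration of the general technique, not Concord's rule engine: the class, the tolerance of three standard deviations, the minimum sample count and the sample readings are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


class TimeOfWeekBaseline:
    """Keeps past readings per (weekday, hour) slot and flags outliers."""

    def __init__(self, tolerance=3.0, min_samples=4, min_spread=1.0):
        self.tolerance = tolerance      # allowed deviation, in standard deviations
        self.min_samples = min_samples  # history needed before a slot is judged
        self.min_spread = min_spread    # floor so a flat history is not over-sensitive
        self._history = defaultdict(list)

    def add_sample(self, when, value):
        self._history[(when.weekday(), when.hour)].append(value)

    def is_exception(self, when, value):
        samples = self._history.get((when.weekday(), when.hour), [])
        if len(samples) < self.min_samples:
            return False  # not enough history for this slot yet
        spread = max(pstdev(samples), self.min_spread)
        return abs(value - mean(samples)) > self.tolerance * spread


# Hypothetical history: WAN port utilization (percent) on four Mondays at 09:00.
baseline = TimeOfWeekBaseline()
first_monday = datetime(2000, 9, 4, 9, 0)
for week, percent in enumerate([40, 42, 38, 40]):
    baseline.add_sample(first_monday + timedelta(weeks=week), percent)

next_monday = first_monday + timedelta(weeks=4)
print(baseline.is_exception(next_monday, 41))  # close to the usual Monday-morning load
print(baseline.is_exception(next_monday, 0))   # a normally busy port gone quiet
```

Keeping a separate history per time slot is what lets the same reading be normal on a Monday morning and suspicious on a Sunday night. It also explains why such a system needs to run for a while before its judgments mean much.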
By tracking these exceptions and using the database to research them, administrators can get a very realistic handle on what's happening on the network, as well as see changes in trends and behavior.

The other major part of the Live Health application is its extensive reporting capability. We were astonished at the depth of the application's reporting, which allows administrators to look at devices sliced by time, behavior or organization. There are preprogrammed reports, such as for devices that have recently had exceptions or are using certain applications. Reports may be viewed on the Web; Live Health also will generate them as easy-to-read Adobe Acrobat files, ready for printing for upper management, sending to a vendor or bringing up for discussion in a meeting.

We've only scratched the surface of what Live Health can do; not only does it monitor the infrastructure, but with the additional modules listed above, it also watches servers and their applications end-to-end, thanks to a small agent that can be installed on Linux, Unix and Windows servers and on Windows NT workstations. (We didn't evaluate that portion of the product.) At a cost of about $150 per managed element, it's much pricier than EotS, but its focus on performance and its extensive reporting make it a truly different solution.

Alan Zeichick is principal analyst with Camden Associates, which conducts independent technology research, and is a contributing editor to InternetWeek. He can be reached at zeichick@camdenassociates.com