User and maintenance manual for the Sun Microsystems X4140
Sun Microsystems, Inc. www.sun.com Submit comments about this document at: http://www.sun.com/hwdocs/feedback Sun Fire™ X4140, X4240, and X4440 Servers Diagnostics Guide Part No.
Please Recycle Copyright © 2008 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved. Unpublished - rights reserved under the Copyright Laws of the United States. THIS PRODUCT CONTAINS CONFIDENTIAL INFORMATION AND TRADE SECRETS OF SUN MICROSYSTEMS, INC.
Contents Preface vii 1. Initial Inspection of the Server 1 Service Troubleshooting Flowchart 1 Gathering Service Information 2 System Inspection 3 Troubleshooting Power Problems 3 Externally Inspecting the Server 3 Internally Inspecting the Server 4 2.
iv Sun Fire X4140, X4240, and X4440 Servers Diagnostics Guide • August 2008 Uncorrectable DIMM Errors 12 Correctable DIMM Errors 14 BIOS DIMM Error Messages 15 DIMM Fault LEDs 15 Isolating and Correcting DIMM ECC Errors 18 A.
Handling of Uncorrectable Errors 53 Handling of Correctable Errors 56 Handling of Parity Errors (PERR) 59 Handling of System Errors (SERR) 61 Handling Mismatching Processors 63 Hardware .
Preface The Sun Fire X4140, X4240, and X4440 Servers Diagnostics Guide contains information and procedures for using available tools to diagnose problems with the servers. Before You Read This Document It is important that you review the safety guidelines in the Sun Fire X4140, X4240, and X4440 Safety and Compliance Guide.
Related Documentation The document set for the Sun Fire X4140, X4240, and X4440 Servers is described in the Where To Find Sun Fire X4140, X4240, and X4440 Servers Documentation sheet that is packed with your system.
Typographic Conventions Third-Party Web Sites Sun™ is not responsible for the availability of third-party web sites mentioned in this document. Sun does not endorse and is not responsible or liable for any content, advertising, products, or other materials that are available on or through such sites or resources.
Sun Welcomes Your Comments Sun is interested in improving its documentation and welcomes your comments and suggestions. You can submit your comments by going to: http://www.
CHAPTER 1 Initial Inspection of the Server This chapter includes the following topics: ■ “Service Troubleshooting Flowchart” on page 1 ■ “Gathering Service Information” on page 2 ■ .
Gathering Service Information The first step in determining the cause of a problem with the server is to gather information from the service-call paperwork or the onsite personnel.
System Inspection Controls that have been improperly set and cables that are loose or improperly connected are common causes of problems with hardware components.
Internally Inspecting the Server To perform a visual inspection of the internal system: 1. Choose a method for shutting down the server from main power mode to standby power mode.
FIGURE 1-2 X4440 Server Front Panel 2. Remove the server cover. For instructions on removing the server cover, refer to your server’s service manual. 3. Inspect the internal status indicator LEDs. These can indicate component malfunction.
10. If the problem with the server is not evident, you can obtain additional information by viewing the power-on self test (POST) messages and BIOS event logs during system startup.
CHAPTER 2 Using SunVTS Diagnostic Software This chapter contains information about the SunVTS™ diagnostic software tool. Running SunVTS Diagnostic Tests The servers are shipped with a Bootable Diagnostics CD that contains the Sun Validation Test Suite (SunVTS) software.
■ QLogic Host Bus Adapter Test (qlctest) ■ RAM Test (ramtest) ■ Serial Port Test (serialtest) ■ System Test (sy.
Using the Bootable Diagnostics CD To use the diagnostics CD to perform diagnostics: 1. With the server powered on, insert the CD into the DVD-ROM drive. 2. Reboot the server, and press F2 during the start of the reboot so that you can change the BIOS setting for boot-device priority.
■ Solaris system message log is a log of all the general Solaris events logged by syslogd. The path name of this log file is /var/adm/messages. a. Click the Log button. The Log file window is displayed.
CHAPTER 3 Troubleshooting DIMM Problems This chapter describes how to detect and correct problems with the server’s Dual Inline Memory Modules (DIMMs).
DIMM Replacement Policy Replace a DIMM when one of the following events takes place: ■ The DIMM fails memory testing under BIOS due to Uncorrectable Memory Errors (UCEs).
3. BIOS reports this event in the service processor’s system event log (SEL) as shown in the sample IPMItool output below: # ipmitool -H 10.
The lines in the display start with event numbers (in hex), followed by a description of the event.
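The SEL can also be read from a script. The following is a minimal sketch, not the procedure from this guide. It assumes that ipmitool is installed, relies on its `-H`, `-U`, and `-P` options and its `sel list` subcommand, and assumes that each output line is pipe-separated with the hexadecimal event number in the first field. The function names and the host, user name, and password values are hypothetical.

```python
import subprocess


def build_sel_command(host, user, password):
    """Return the argument list that asks a remote SP for its SEL."""
    return ["ipmitool", "-H", host, "-U", user, "-P", password, "sel", "list"]


def parse_sel_line(line):
    """Split one SEL line into (event_number, description).

    Assumes pipe-separated fields with the event number, printed in
    hex, in the first field.
    """
    fields = [field.strip() for field in line.split("|")]
    event_number = int(fields[0], 16)
    description = " | ".join(fields[1:])
    return event_number, description


def read_sel(host, user, password):
    """Run ipmitool and return the parsed SEL entries."""
    result = subprocess.run(
        build_sel_command(host, user, password),
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_sel_line(line) for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Made-up sample line, used only to show the parsing step.
    print(parse_sel_line(" 1a | Memory | Uncorrectable ECC"))
```

Keeping the command builder separate from the parser lets the parsing be checked against saved SEL output without contacting a service processor.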
to view ECC errors ■ Linux: The HERD utility can be used to manage DIMM errors in Linux. See the x64 Servers Utilities Reference Manual for details. ■ If HERD is installed, it copies messages from /dev/mcelog to /var/log/messages.
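Once messages reach /var/log/messages, they can be filtered with a short script. The sketch below is an illustration only. The format of the messages that HERD writes is not shown in this excerpt, so the default search keyword `ECC` is an assumption, and the function names are hypothetical.

```python
def find_matching_lines(lines, keyword="ECC"):
    """Return the lines that contain keyword, ignoring case."""
    needle = keyword.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


def scan_log(path="/var/log/messages", keyword="ECC"):
    """Read a log file and return the lines that mention keyword."""
    # errors="replace" keeps the scan going if the log holds stray bytes.
    with open(path, errors="replace") as handle:
        return find_matching_lines(handle, keyword)


if __name__ == "__main__":
    for hit in scan_log():
        print(hit)
```

Reading /var/log/messages usually requires root privileges. On Solaris, the same function could be pointed at /var/adm/messages.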
Note – The DIMM Fault and Motherboard Fault LEDs operate on stored power for up to a minute when the system is powered down, even after the AC power is disconnected, and the motherboard (or mezzanine board) is out of the system.
FIGURE 3-1 DIMMs and LEDs on Motherboard.
FIGURE 3-2 DIMMs and LEDs on Mezzanine Board Isolating and Correcting DIMM ECC Errors If your log files report an ECC error or a problem with a DIMM, complete the steps below until you can isolate the fault.
3. Press the PRESS TO SEE FAULT button, and inspect the DIMM fault LEDs. See FIGURE 3-1 and FIGURE 3-2. A flashing LED identifies a component with a fault. ■ For CEs, the LEDs correctly identify the DIMM where the errors were detected.
11. Power on the server and run the diagnostics test again. 12. Review the log file.
APPENDIX A Event Logs and POST Codes This appendix contains information about the BIOS event log, the BMC system event log, the power-on self-test (POST), and console redirection.
Main Advanced PCIPnP Boot Security Chipset Exit ****************************************************************************** * Advanced Settings * Configure CPU.
b. From the Advanced Settings screen, select Event Log Configuration. The Advanced Menu Event Logging Details screen is displayed. c. From the Event Logging Details screen, select View Event Log. All unread events are displayed.
c. From the IPMI 2.0 Configuration screen, select View BMC System Event Log.
Power-On Self-Test (POST) The system BIOS provides a rudimentary power-on self-test. The basic devices required for the server to operate are checked, memor.
Redirecting Console Output Use the following instructions to access the service processor and redirect the console output so that the BIOS POST codes can be read.
10. Set the color depth for the redirection console at either 6 or 8 bits. 11. Click the Start Redirection button. 12. When you are prompted for a user name and password, type the following: ■ User Name: root ■ Password: changeme The current POST screen is displayed.
Changing POST Options These instructions are optional, but you can use them to change the operations that the server performs during POST testing. To change POST options: 1.
3. Select Boot Settings Configuration. The Boot Settings Configuration screen is displayed. 4. On the Boot Settings Configuration screen, there are several options that you can enable or disable: ■ Quick Boot – This option is disabled by default.
■ Boot Num-Lock – This option is On by default (keyboard Num-Lock is turned on during boot). If you set this to off, the keyboard Num-Lock is not turned on during boot.
POST Codes TABLE A-1 contains descriptions of each of the POST codes, listed in the same order in which they are generated. These POST codes appear as a four-digit string that is a combination of two-digit output from primary I/O port 80 and two-digit output from secondary I/O port 81.
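When POST codes are collected from a redirected console, it can help to split each four-digit string into its two two-digit halves. The sketch below shows one way to do this. This excerpt does not state which half comes from port 80 and which from port 81, so the function returns the halves as left and right pairs and leaves the port mapping to the reader. The function name is hypothetical.

```python
import string


def split_post_code(code):
    """Split a four-digit hex POST code into (left_pair, right_pair).

    The mapping of each pair to I/O port 80 or 81 is deliberately not
    decided here because the surrounding text does not state it.
    """
    if len(code) != 4 or any(ch not in string.hexdigits for ch in code):
        raise ValueError("expected a four-digit hexadecimal POST code")
    return code[:2].upper(), code[2:].upper()


if __name__ == "__main__":
    # de00 and 8613 are codes that appear in TABLE A-1.
    for sample in ("de00", "8613"):
        print(sample, split_post_code(sample))
```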
de00 Preparing CPU for booting to OS by copying all of the context of the BSP to all application processors present. NOTE: APs are left in the CLI HLT state.
8613 Initialize PM regs and PM PCI regs at Early-POST.
POST Code Checkpoints The POST code checkpoints are the largest set of checkpoints during the BIOS pre-boot process. TABLE A-2 describes the type of checkpoints that might occur during the POST portion of the BIOS.
0E Testing and initialization of different Input Devices. Also, update the Kernel Variables. Traps the INT09h vector, so that the POST INT09h handler gets control for IRQ1.
60 Initializes NUM-LOCK status and programs the KBD typematic rate.
75 Initialize Int-13 and prepare for IPL detection.
78 Initializes IPL devices controlled by BIOS and option ROMs.
7A Initializes remaining option ROMs.
B1 Save system context for ACPI.
00 Prepares CPU for booting to OS by copying all of the context of the BSP to all application processors present. NOTE: APs are left in the CLI HLT state.
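A checkpoint table of this kind maps naturally onto a lookup table. The sketch below is illustrative and holds only the checkpoints visible in this excerpt, with shortened descriptions. It is not the complete TABLE A-2, and the names `CHECKPOINTS` and `describe_checkpoint` are hypothetical.

```python
# Partial table: only the checkpoints quoted in this excerpt.
CHECKPOINTS = {
    "0E": "Testing and initialization of different input devices",
    "60": "Initializes NUM-LOCK status and programs the KBD typematic rate",
    "75": "Initialize Int-13 and prepare for IPL detection",
    "78": "Initializes IPL devices controlled by BIOS and option ROMs",
    "7A": "Initializes remaining option ROMs",
    "B1": "Save system context for ACPI",
    "00": "Prepares CPU for booting to OS",
}


def describe_checkpoint(code):
    """Return the description for a two-digit checkpoint code."""
    return CHECKPOINTS.get(code.strip().upper(), "unknown checkpoint")


if __name__ == "__main__":
    for code in ("7a", "B1", "ff"):
        print(code, "->", describe_checkpoint(code))
```

Normalizing the key to upper case lets codes be pasted from a console in either case.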
APPENDIX B Status Indicator LEDs This appendix contains information about the locations and behavior of the LEDs on the server. It describes the external LEDs that can be viewed on the outside of the server and the internal LEDs that can be viewed only with the main cover removed.
Front Panel LEDs FIGURE B-1 Front Panel LEDs (X4140 shown) Back Panel LEDs FIGURE B-2 Back Panel LEDs (X4140 shown) Figur.
Hard Drive LEDs FIGURE B-3 Hard Drive LEDs Internal Status Indicator LEDs The server has internal status indicators on the motherboard and on the mezzanine board. For motherboard locations, see FIGURE B-4. For mezzanine board locations, see FIGURE B-5.
Note – The mezzanine board, when present, obscures part of the motherboard, including the LEDs. The Motherboard Fault LED indicates that one or more of the LEDs on the motherboard is active.
FIGURE B-5 DIMMs and LEDs on Mezzanine Board.
APPENDIX C Using the ILOM Service Processor GUI to View System Information This appendix contains information about using the Integrated Lights Out Manager (ILOM) Service Processor (SP) GUI to view monitoring and maintenance information for your server.
Making a Serial Connection to the SP To make a serial connection to the SP: 1. Connect a serial cable from the RJ-45 Serial Management port on the server to a terminal device. 2.
Viewing ILOM SP Event Logs Events are notifications that occur in response to some actions. The IPMI system event log (SEL) provides status information about the server’s hardware and software to the ILOM software, which displays the events in the ILOM web GUI.
FIGURE C-1 System Event Logs Page 3. Select the category of event that you want to view in the log from the drop-down list box. You can select from the following types of events: ■ Sensor-specific events.
After you have selected a category of event, the Event Log table is updated with the specified events. The fields in the Event Log are described in TABLE C-1. 4. To clear the event log, click the Clear Event Log button.
■ ILOM web GUI operation; for example, from the Maintenance tab, selecting Reset SP ■ An SP firmware upgrade After an SP reboot, the SP clock is changed by the following events: ■ When the host is booted.
2. From the System Information tab, select Components. The Replaceable Component Information page is displayed. See FIGURE C-2. FIGURE C-2 Replaceable Component Information Page 3.
Viewing Sensors This section describes how to view the server temperature, voltage, and fan sensor readings. For a complete list of sensors, see Appendix D. To view sensor readings: 1.
FIGURE C-3 Sensor Readings Page 3. Click the Refresh button to update the sensor readings to their current status. 4. Click a sensor to display its thresholds. A display of properties and values appears.
FIGURE C-4 Sensor Details Page 5. If the problem with the server is not evident after viewing sensor readings information, continue with “Running SunVTS Diagnostic Tests” on page 7.
APPENDIX D Error Handling This appendix contains information about how the servers process and log errors. See the following sections: ■ “Handling of Uncorrectable Errors” on page 53 ■ “.
Note – If the error is on low 1MB, the BIOS freezes after rebooting. Therefore, no DMI log is recorded. ■ An example of the error reported by the SEL through IPMI 2.
FIGURE D-1 DMI Log Screen, Uncorrectable Error.
Handling of Correctable Errors This section lists facts and considerations about how the server handles correctable errors. ■ During BIOS POST: ■ The BIOS polls the MCK registers.
FIGURE D-2 DMI Log Screen, Correctable Error ■ If during any stage of memory testing the BIOS finds itself incapable of reading/writing to the DIMM, it takes the following actions: ■ The BIOS disables the DIMM as indicated by the Memory Decreased message in EXAMPLE D-1.
EXAMPLE D-1 DMI Log Screen, Correctable Error, Memory Decreased.
Handling of Parity Errors (PERR) This section lists facts and considerations about how the server handles parity errors (PERR). ■ The handling of parity errors works through NMIs. ■ During BIOS POST, the NMI is logged in the DMI and the SP SEL.
FIGURE D-3 DMI Log Screen, PCI Parity Error ■ The BIOS displays the following messages and freezes (during POST or DOS).
Note – The Linux system reboots, but does not inform the BIOS of this incident. Handling of System Errors (SERR) This section lists facts and considerations about how the server handles system errors (SERR). ■ System error handling works through the HyperTransport Synch Flood Error mechanism on 8111 and 8131.
■ FIGURE D-5 shows an example DMI log screen from the BIOS Setup Page with a system error.
Handling Mismatching Processors This section lists facts and considerations about how the server handles mismatching processors. ■ The BIOS performs a complete POST. ■ The BIOS displays a report of any mismatching CPUs, as shown in the following example: ■ No SEL or DMI event is recorded.
Hardware Error Handling Summary TABLE D-1 summarizes the most common hardware errors that you might encounter with these servers.
Single-bit DRAM ECC error With ECC enabled in the BIOS Setup, the CPU detects and corrects a single-bit error on the DIMM interface. The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware.
PCI SERR, PERR System or parity error on a PCI bus. Sync floods on HyperTransport links, the machine resets itself, and error information gets retained through reset. The BIOS reports, A Hyper Transport sync flood error occurred on last boot, press F1 to continue.
Multiple fan failure Fan failure is detected by reading tach signals. The Front Fan Fault, Service Action Required, and individual fan module LEDs are lit. SP SEL Fatal
Single power supply failure When any of the AC/DC PS_VIN_GOOD or PS_PWR_OK signals are deasserted.
Index B BIOS changing POST options, 28 event logs, 21 POST code checkpoints, 33 POST codes, 31 POST overview, 25 redirecting console output for POST, 26 Bootable Diagnostics CD, 8 C comments and .
external, 3 internal, 4 Integrated Lights-Out Manager Service Processor, See ILOM SP GUI internal inspection, 4 isolatin.
An important step after buying a device (or even before buying it) is to read its user manual. There are a few simple reasons to do so:
If you have not yet bought the Sun Microsystems X4140, now is a good time to familiarize yourself with the basic data about the product. First consult the opening pages of the user manual, which you will find above. They should contain the most important technical data for the Sun Microsystems X4140, so you can check whether the equipment meets your needs. By exploring the following pages of the Sun Microsystems X4140 user manual, you will learn about all of the product's features and how it operates. The information about the Sun Microsystems X4140 will certainly help you make a purchasing decision.
If you already own the Sun Microsystems X4140 but have not yet read the user manual, you should do so for the reasons described above. You will then know whether you have been using the available functions correctly, and whether you have made mistakes that could shorten the life of the Sun Microsystems X4140.
However, one of the most important roles a user manual plays is helping to resolve problems with the Sun Microsystems X4140. You will almost always find a Troubleshooting section covering the most frequent faults and failures of the Sun Microsystems X4140, along with instructions on how to resolve them. Even if you cannot resolve the problem yourself, the user manual will point you to the next step: contacting the customer service center or the nearest service point.