Maintwindows
From CoolSolutionsWiki
- HA TODO
- Dec 8 Maint. window checklist
- Primary objective - Move two MTAs off NW cluster and retire that 4 node cluster.
- Check list
- Down MTA agents and back up GW DB files
- Online mta agents and change paths, ensure syncs (connect to each Domain)
- Copy DB files to Linux servers
- configure/connect linux GW agents to DB files
- Configure HA load scripts for new MTAs
- Add HA monitor to GWIA and ensure each important HA sub-module has monitor enabled
- Verify each HA resource can go on each node
- chkconfig heartbeat off for tests, then turn back on after tests
- X Secondary Objective - Install and configure GW Monitor and alerting
- Third Objective - Move Webaccess and IM to gwmail4 (non-cluster enabled)
- MISC
Outages
- Feb 3, POA migrated and did NOT start up on its own. Cause was related to this evms issue
- Feb 27, the POA stopped listening, likely because of the permissions issue (self-inflicted)
Maint. window 1
COMPLETED DEC 22
correct small size cib.xml tasklist
- ensure customer has full backup
- correct the /etc/ha.d/ha.cf file so the private NIC is first - we suspect comms problem and it might be trying the public LAN first :(
- Backout plan: cp ha.cf ha.cf.back1 BEFORE modifying. So backout plan is: rcheartbeat stop, cp ha.cf.back1 ha.cf, rcheartbeat start
- Verification: Sniff the wire and ensure HB traffic is talking on the HB Lan
- bring the other two nodes into the cluster
- ensure gwmail1 and gwmail2 servers are in standby mode
- ensure all 3 nodes have chkconfig heartbeat off - to avoid any chance of multiple reboots.
- ensure stonith is OFF
- consider stopping all HB resources as hb_gui requests seem to be ignored currently and it would be nice to stop all the resources and rcheartbeat stop
- backup the cib.xml file on gwmail3 and snip the check file
- rcheartbeat start on gwmail3
- crm_mon -i2 to ensure it joins and takes on resources
- gwmail1 and gwmail2 - rcheartbeat start
- use crm_mon -i2 to ensure they join and say standby
- verification: cibadmin -Q to ensure they are getting the proper cib.xml
- bring gwmail1 and gwmail2 out of standby and into ONLINE mode
- verify migration of resources to gwmail1 and gwmail2 is successful
- correct pridom's need for manual start (Troy)
- ensure cib.xml for this resource is correct
- thing1
- thing2
- ensure pridom resource is able to move to each node
- enable HB monitor for appropriate resources (like GWIA)
- put non-DC nodes into standby mode
- use hb_gui to enable HB Monitor
- take nodes out of standby mode once hb_gui modifications are done.
- re-enable stonith
- Once everything checks out ok, then make sure all 3 nodes have the same GW version.
- gwmail1 has the hp1b/DST patch on it - so rpm -Uivh to back rev those to the same version as gwmail2, gwmail3
- ensure GroupWise startup script is HB aware and has the correct paths to 64bit
- VERIFY each resource can go successfully to each node.
- chkconfig heartbeat on on each node
- Time permitting move the secondary GW domain over to the Linux HA
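The ha.cf backout plan in the tasklist above can be sketched as a pair of shell functions. This is a minimal sketch, assuming ha.cf lives in /etc/ha.d as stated above and that rcheartbeat is the init wrapper on these nodes. The function names and the HA_DIR and HB_CMD variables are hypothetical, introduced only for illustration.

```shell
# Sketch only: function and variable names are illustrative, not part of the original plan.
HA_DIR="${HA_DIR:-/etc/ha.d}"      # directory holding ha.cf
HB_CMD="${HB_CMD:-rcheartbeat}"    # heartbeat init wrapper; override for a dry run

# Save a copy of ha.cf BEFORE modifying it.
backup_ha_cf() {
    cp -p "$HA_DIR/ha.cf" "$HA_DIR/ha.cf.back1"
}

# Backout: stop heartbeat, restore the saved copy, start heartbeat again.
backout_ha_cf() {
    [ -f "$HA_DIR/ha.cf.back1" ] || { echo "no backup found" >&2; return 1; }
    $HB_CMD stop &&
    cp -p "$HA_DIR/ha.cf.back1" "$HA_DIR/ha.cf" &&
    $HB_CMD start
}
```

Run backup_ha_cf once before editing ha.cf. backout_ha_cf refuses to do anything if the backup copy is missing, so heartbeat is never stopped without a file to restore.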
Customer requirements
- A list of names and expertise of who will be on-line during the maintenance windows.
- A list of names and expertise of who will be on-site for the maintenance windows.
- Thomas E., PSE comms, NOS(NW/Linux), eDir, GW, ZENworks
- Cameron C., PSE Linux/IDM
- Troy W., PSE former GW resources for PSEs
- Jason R., PSE resource for Linux NOS
- Trigger times for initiating a roll back for each major task.
- To be determined by customer
- The roll back steps for each major task.
- The testing verification steps that indicate a completed task.
Maint. window 2 - move secondary MTA to HB Cluster, bind former webaccess and IM ip addr. to gwmail4
Move secondary domain
COMPLETED DEC 29
- ensure chkconfig heartbeat off
- ensure stonith is disabled
- ssh into HB DC
- put other two nodes into standby mode
- offline all resources gracefully
- Create EVMS container for secondary Domain
- Create EVMS volume for secondary Domain
- Create hb_gui resource group and add the necessary hb_gui resources to mount the disk
- mount the new secondary domain partition and copy the DB files to it
- ensure /gwise/dom2 domain directory is samba enabled
- update the domain paths to the new location
- repair the domain so the paths get sync'd to the DB
- ensure the agent has the GW startup files - AKA install and configure the GW MTA agent with the correct domain name and paths
- add additional hb_gui group resource sub-modules to enable the agent to start properly.
- verify the agent is online and communicating
- ps aux | grep gwmta
- gw http monitor - check links
- send mail to diff. Post Office and to/from the Internet
- re-enable stonith
- Verify resources can migrate to each node
- chkconfig heartbeat on
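The "verify the agent is online" step above uses ps aux | grep gwmta, which can also match the grep process itself. A minimal sketch of the same check using pgrep follows, assuming pgrep is installed on the nodes. The function names are hypothetical; the agent names gwmta, gwpoa, and gwia are the ones used elsewhere on this page.

```shell
# Returns 0 if a process with exactly the given name is running.
agent_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

# Report the state of each agent named on the command line.
# Returns non-zero if any of them is not running.
check_agents() {
    rc=0
    for agent in "$@"; do
        if agent_running "$agent"; then
            echo "$agent: running"
        else
            echo "$agent: NOT running"
            rc=1
        fi
    done
    return $rc
}

# Example: check_agents gwmta gwpoa gwia
```

This only confirms that the process exists. The gw http monitor link check and the test mails listed above are still needed to confirm the agent is actually communicating.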
Bind former webaccess and IM ip addr. to gwmail4
COMPLETED DEC 22
- Customer to provide Public SSL Certificate
- unbind/offline former IP addr. holder
- add these two ip addresses to gwmail4
- restart IM/webaccess services and see if they bind to the new alias
- If not, simply re-run the config script for IM and enter the new IP addr
- Webaccess should bind to all addresses including alias addresses, but if not we can add it to the apache conf file
- verify IM clients can connect non-ssl and ssl
- verify webaccess users can connect
- ensure chkconfig grpwise on and IM services are on
- put in place scripts to auto-restart service if they go offline
- retire former Webaccess and IM servers
- check timesync
- check report sync
- check for obits - dsrepair -a | adv | ext ref
- load config.nlm /all put sys:\system\config.txt in a safe place
- dsrepair -rc put sys:system\dsr_dib in a safe place
- nwconfig | directory options | remove DS from this server
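The "scripts to auto-restart service if they go offline" item above could look something like the following. This is a minimal sketch, assuming the service's status command exits non-zero when the service is down. The function name ensure_up is hypothetical, and the rcgrpwise example in the comment assumes that wrapper accepts status and start, as it is used that way elsewhere on this page.

```shell
# Run the status command; if it fails, log the event and run the start command.
# Usage: ensure_up NAME "STATUS COMMAND" "START COMMAND"
ensure_up() {
    name=$1
    status_cmd=$2
    start_cmd=$3
    if $status_cmd >/dev/null 2>&1; then
        return 0
    fi
    echo "$(date '+%F %T') $name down, restarting"
    $start_cmd
}

# Example, e.g. from cron every few minutes:
#   ensure_up grpwise "rcgrpwise status" "rcgrpwise start"
```

On the HA cluster nodes the heartbeat monitor should do this job instead; a cron-driven script like this is only meant for the non-clustered gwmail4.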
Remove old NW Cluster
COMPLETED DEC 29
- Monitor period before removing NW Cluster (time between maint. window 1 and maint. window 2 should be long enough to know)
- Remove NW cluster
- Time in sync?
- Report sync clean?
- no obits with dsrepair -a | adv | check ext ref
- document all load/unload scripts
- offline each resource
- uldncs.ncf
- delete sub NCS objects - or just wait
- nwconfig | dir | remove directory services (CHECK other services like slpda/time provider)
Misc tasks
- snip old GW library reference from DB
- COMPLETED DEC 29
Maint window Feb 14 troubleshoot small cib.xml
OUTCOME
- Successfully have the full/correct cib.xml on each server
- Was able to observe in greater depth the EVMS worker thread issue
Plan A: (less impact)
- Disable stonith (follow our usual plan of ensuring backup and chkconfig heartbeat off etc etc)
- command line - migrate the PO resource from gwmail3 to one of the other nodes
- Then rcheartbeat stop, which will stop the heartbeat on the DC and the DC will go to one of the other nodes - then we'll check its cib.xml file.
Our impression is that it will then have a full sized cib.xml
Pros: The PO is migrated elsewhere and clients will have a good chance of not getting any errors from migrating the PO to another server.
Plan B: (more impact)
- Disable stonith (follow our usual plan of ensuring backup and chkconfig heartbeat off etc etc)
- command line - offline all resources
- stop heartbeat on all nodes
- bring up non-gwmail3 first and observe if it gets a large cib.xml
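Both plans end by checking whether a node has a full-sized cib.xml. A minimal sketch of that check follows. The path /var/lib/heartbeat/crm/cib.xml is an assumption (the usual Heartbeat 2 location; confirm it on these nodes), and the MIN_BYTES threshold is hypothetical and should be taken from a known-good copy of the file.

```shell
CIB_FILE="${CIB_FILE:-/var/lib/heartbeat/crm/cib.xml}"  # assumed location, verify on the node
MIN_BYTES="${MIN_BYTES:-4096}"                          # hypothetical threshold

# Returns 0 if cib.xml is at least MIN_BYTES, 1 if it is smaller, 2 if it is missing.
cib_size_ok() {
    [ -f "$CIB_FILE" ] || { echo "missing: $CIB_FILE" >&2; return 2; }
    size=$(wc -c < "$CIB_FILE")
    size=$((size + 0))    # strip any padding wc may add
    if [ "$size" -ge "$MIN_BYTES" ]; then
        echo "cib.xml OK ($size bytes)"
    else
        echo "cib.xml suspiciously small ($size bytes)"
        return 1
    fi
}
```

A size check only flags the symptom. Comparing the output of cibadmin -Q across nodes, as in the earlier tasklist, is still the way to confirm that each node holds the same configuration.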
Secondary Objectives
- Change directory permissions and verify GW agent restarts
TO DO
- Partition off Root
- Health Check
- Check timesync
- check report sync
- check for obits - dsrepair -a | check ext ref
- Take dibs on multiple servers
- Split partition Root
- Health Check
- Virtual IP for GW consoleone
- investigate EVMS worker threads not cleaning up
- chkconfig lkcd off (get all debug off lkcd-utils in software management).
March 12 maint window
Primary objectives
- Get poa status to show properly in hb_gui, and thus have HB monitor restart/migrate resource when needed
- change permissions to log file /var/run/novell/groupwise
- find ownership of files - export to a file
- stop poa via clm
- start poa via hb_gui
- move LAN switch power supply
- evms
- enable debug for evms
- change evms timeout from 10 min. to 12 seconds
- when finished
- re-enable chkconfig heartbeat
- re-enable stonith
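The "find ownership of files - export to a file" step above can be sketched as follows, so the original owners are on record before any permissions are changed. This is a minimal sketch, assuming GNU find and stat (standard on these Linux nodes). The function name and the output file path in the example are hypothetical.

```shell
# Write "owner:group mode path" for a directory and everything under it to a file.
# Usage: export_ownership DIRECTORY OUTPUT_FILE
export_ownership() {
    dir=$1
    out=$2
    find "$dir" -exec stat -c '%U:%G %a %n' {} + > "$out"
}

# Example: export_ownership /var/run/novell/groupwise /root/gw-ownership.txt
```

Keep the output file outside the directory being listed so it does not appear in its own listing.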
Secondary objectives
- umount restore1 on gwmail1
- add https for webmail
- add second spam box x.x.80.151 to outbound list in that file.
Mar 29
Primary Objective
- Get gwmta and gwpoa running as gwuser
- modify uid.run for each agent
- rcgrpwise stop
- chown all files to gwuser, including /var/log/novell/groupwise (or set the path to /gwise/agent/log)
- Update the DOA UNC path to .87 and not .8
Procedure
- modify the uid.run for each agent and replace root with gwuser
- chown -R gwuser:users /gwise
- rcgrpwise stop
- backup the wpdomain.db for primary and secondary domains
- rcgrpwise start
- rcgrpwise status and confirm all agents are running after "owner" change.
- Update proper log for MTAs as they currently go to /var/log/novell/groupwise
- update log size limit for all agents from 32 mb to 256 mb
- Connect to primary and update UNC path for MTA
- May require "rebuild repair"
- Once resolved, connect to secondary and check paths
- Setup sysstat for server utilization stats
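The uid.run change at the start of the procedure above can be sketched as a small helper. This is a minimal sketch, assuming each uid.run holds the user name on a line by itself; check the actual file contents before relying on it. The function name set_agent_user is hypothetical, and the location of each agent's uid.run is left as an argument because the page does not give it.

```shell
# Replace a line consisting solely of "root" with the new user name,
# keeping the original as FILE.bak.
# Usage: set_agent_user /path/to/uid.run [user]
set_agent_user() {
    file=$1
    user=${2:-gwuser}
    cp -p "$file" "$file.bak" &&
    sed "s/^root\$/$user/" "$file.bak" > "$file"
}

# Remaining steps from the procedure, run by hand once every uid.run is updated:
#   chown -R gwuser:users /gwise
#   rcgrpwise stop
#   (back up wpdomain.db for the primary and secondary domains)
#   rcgrpwise start
#   rcgrpwise status
```

The .bak copy gives a quick backout: copy it back over uid.run and restart the agents.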
Secondary Objective
- Disconnect link to Metro
- Update log limits to 256mb instead of 32 mb
- Added second "mail forward" to GWIA
Procedure
- Go into link configuration and remove Metro
- Mail forward
- Use the GWIA object and not the startup file. Separate the mail forward1 and forward2 with a space.
Next Maint window
Primary Objective
- Troubleshoot gwia, which may include reinstalling gwia. Issue 1: startup files point to the wrong binary, so gwia requires a manual start. Issue 2: gwia stops working after exactly 7 days. If we are reinstalling, we may want to move the gwia to mail2.
- Backout plan
- Tape backup of GW DB
- backup binary and GroupWise configuration directories
Secondary Objective
- Change daemon to run as gwuser instead of root
- Set GW monitor to restart agents
TO DO
- Power Path/EMC is being used instead of MPIO
- Consider FAN out of GW agents across multiple boxes. TE to send hard data, and pros/cons of each option, with snips of best practice.
- GWAVA Reload
- GWIA stops running after exactly 7 days
- set agents as gwuser
- PENDING sent email for Sam
- Internet outbound email with attachments. Not internal PO
- Large emails
- Send max is set at 10 MB, but an 8 MB message has about 2 MB added in transit, so an 8 MB message is blocked. Change the limit to 15 MB
- A 9.1 MB message going PO -> GWIA turns into 12 MB
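The growth noted above, with a 9.1 MB message becoming roughly 12 MB at the GWIA, is in line with base64 encoding, which turns every 3 bytes into 4. A minimal sketch of the arithmetic follows, assuming the attachment is base64 encoded and ignoring headers and line breaks, which add a little more. The function names are hypothetical.

```shell
# Estimate the encoded size in MB of a message after base64 encoding (4/3 growth).
encoded_mb() {
    awk -v mb="$1" 'BEGIN { printf "%.1f\n", mb * 4 / 3 }'
}

# Succeeds when the encoded size fits under the limit (both in MB).
# Usage: fits_limit SIZE_MB LIMIT_MB
fits_limit() {
    awk -v mb="$1" -v limit="$2" 'BEGIN { exit !(mb * 4 / 3 <= limit) }'
}

# encoded_mb 9.1   prints 12.1
# encoded_mb 8     prints 10.7
# fits_limit 8 10  fails; fits_limit 8 15 succeeds
```

By this estimate a 15 MB limit lets through attachments of up to about 11 MB before encoding.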
