Maintwindows
From CoolSolutionsWiki
-HA TODO
- Dec 8 Maint. window checklist
- Primary objective - Move two MTAs off NW cluster and retire that 4 node cluster.
- Check list
- Down the MTA agents and back up the GW DB files
- Online the MTA agents and change paths, ensure syncs (connect to each Domain)
- Copy DB files to Linux servers
- configure/connect Linux GW agents to DB files
- Configure HA load scripts for new MTAs
- Add HA monitor to GWIA and ensure each important HA sub-module has monitor enabled
- Verify each HA resource can go on each node
- chkconfig heartbeat off for tests, then turn back on after tests
- X Secondary Objective - Install and configure GW Monitor and alerting
- Third Objective - Move Webaccess and IM to gwmail4 (non-cluster enabled)
- MISC
Outages
- Feb 3, POA migrated and did NOT start up on its own. Cause was related to this evms issue
- Feb 27, the POA stopped listening, likely because of the permissions issue - self-inflicted
Maint. window 1
COMPLETED DEC 22
correct small size cib.xml tasklist
- ensure customer has full backup
- correct the /etc/ha.d/ha.cf file so the private NIC is first - we suspect comms problem and it might be trying the public LAN first :(
- Backout plan: cp ha.cf ha.cf.back1 BEFORE modifying. So backout plan is: rcheartbeat stop, cp ha.cf.back1 ha.cf, rcheartbeat start
- Verification: Sniff the wire and ensure HB traffic is talking on the HB Lan
- bring the other two nodes into the cluster
- ensure gwmail1 and gwmail2 servers are in standby mode
- ensure all 3 nodes have chkconfig heartbeat off - to avoid any chance of multiple reboots.
- ensure stonith is OFF
- consider stopping all HB resources as hb_gui requests seem to be ignored currently and it would be nice to stop all the resources and rcheartbeat stop
- backup the cib.xml file on gwmail3 and snip the check file
- rcheartbeat start on gwmail3
- crm_mon -i2 to ensure it joins and takes on resources
- gwmail1 and gwmail2 - rcheartbeat start
- use crm_mon -i2 to ensure they join and report standby
- verification: cibadmin -Q to ensure they are getting the proper cib.xml
- bring gwmail1 and gwmail2 out of standby and into ONLINE mode
- verify migration of resources to gwmail1 and gwmail2 are successful
- correct pridom's need for manual start (Troy)
- ensure cib.xml for this resource is correct
- thing1
- thing2
- ensure pridom resource is able to move to each node
- enable HB monitor for appropriate resources (like GWIA)
- put non-DC nodes into standby mode
- use hb_gui to enable HB Monitor
- take nodes out of standby mode once hb_gui modifications are done.
- re-enable stonith
- Once everything checks out ok, then make sure all 3 nodes have the same GW version.
- gwmail1 has the hp1b/DST patch on it - so use rpm -Uvh --oldpackage to back-rev those packages to the same version as gwmail2 and gwmail3
- ensure GroupWise startup script is HB aware and has the correct paths to 64bit
- VERIFY each resource can go successfully to each node.
- chkconfig heartbeat on (on each node)
- Time permitting move the secondary GW domain over to the Linux HA
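The ha.cf backout plan in the tasklist above could be sketched as two small shell functions. This is a minimal sketch, not a tested procedure: the function names and the HA_CF and HB_CMD variables are made up for illustration, and they are overridable so the logic can be tried against a scratch copy before touching /etc/ha.d/ha.cf.

```shell
# Hypothetical helper names; defaults follow the file and command named in the tasklist.
HA_CF="${HA_CF:-/etc/ha.d/ha.cf}"
HB_CMD="${HB_CMD:-rcheartbeat}"

save_ha_cf() {
    # Keep a pristine copy BEFORE modifying; refuse to overwrite an existing backup.
    if [ -e "$HA_CF.back1" ]; then
        echo "backup $HA_CF.back1 already exists" >&2
        return 1
    fi
    cp -p "$HA_CF" "$HA_CF.back1"
}

backout_ha_cf() {
    # Backout: stop heartbeat, restore the saved file, start heartbeat again.
    if [ ! -f "$HA_CF.back1" ]; then
        echo "no backup to restore" >&2
        return 1
    fi
    $HB_CMD stop || return 1
    cp -p "$HA_CF.back1" "$HA_CF" || return 1
    $HB_CMD start
}
```

Refusing to overwrite an existing ha.cf.back1 is a design choice of this sketch: it protects the known-good copy if the save step is run a second time after the file has already been edited.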
Customer requirements
- A list of names and expertise of who will be on-line during the maintenance windows.
- A list of names and expertise of who will be on-site for the maintenance windows.
- Thomas E., PSE comms, NOS(NW/Linux), eDir, GW, ZENworks
- Cameron C., PSE Linux/IDM
- Troy W., PSE former GW resources for PSEs
- Jason R., PSE resource for Linux NOS
- Trigger times for initiating a roll back for each major task.
- To be determined by customer
- The roll back steps for each major task.
- The testing verification steps that indicate a completed task.
Maint. window 2 - move secondary MTA to HB Cluster, bind former webaccess and IM ip addr. to gwmail4
Move secondary domain
COMPLETED DEC 29
- ensure chkconfig heartbeat off
- ensure stonith is disabled
- ssh into HB DC
- put other two nodes into standby mode
- offline all resources gracefully
- Create EVMS container for secondary Domain
- Create EVMS volume for secondary Domain
- Create hb_gui resource group and add the necessary hb_gui resources to mount the disk
- mount the new secondary domain partition and copy the DB files to it
- ensure /gwise/dom2 domain directory is samba enabled
- update the domain paths to the new location
- repair the domain so the paths get sync'd to the DB
- ensure the agent has the GW startup files - AKA install and configure the GW MTA agent with the correct domain name and paths
- add additional hb_gui group resource sub-modules to enable the agent to start properly.
- verify the agent is online and communicating
- ps aux | grep gwmta
- gw http monitor - check links
- send mail to a different Post Office and to/from the Internet
- re-enable stonith
- Verify resources can migrate to each node
- chkconfig heartbeat on
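For the "verify resources can migrate to each node" step, one option is to print every resource/node pair up front so no combination is skipped during the window. The sketch below only generates the tick list; it does not move anything. The function name is hypothetical, and the resource names in the example are illustrative (the node names are the three used in these notes).

```shell
# Print one checklist line per (resource, node) pair.
# $1: space-separated resource names, $2: space-separated node names.
migration_matrix() {
    for rsc in $1; do
        for node in $2; do
            echo "[ ] $rsc -> $node"
        done
    done
}

# Example: two resources across the three cluster nodes gives six lines to tick off.
migration_matrix "pridom gwia" "gwmail1 gwmail2 gwmail3"
```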
Bind former webaccess and IM ip addr. to gwmail4
COMPLETED DEC 22
- Customer to provide Public SSL Certificate
- unbind/offline former IP addr. holder
- add these two ip addresses to gwmail4
- restart IM/webaccess services and see if they bind to the new alias
- If not, simply re-run the config script for IM and enter the new IP addr
- Webaccess should bind to all addresses including alias addresses, but if not we can add it to the apache conf file
- verify IM clients can connect non-ssl and ssl
- verify webaccess users can connect
- ensure chkconfig grpwise on and IM services are on
- put in place scripts to auto-restart service if they go offline
- retire former Webaccess and IM servers
- check timesync
- check report sync
- check for obits - dsrepair -a | adv | ext ref
- load config.nlm /all, then put sys:\system\config.txt in a safe place
- dsrepair -rc, then put sys:system\dsr_dib in a safe place
- nwconfig | directory options | remove DS from this server
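The "scripts to auto-restart service if they go offline" item above could look roughly like this. It is a sketch under assumptions: the function name is made up, and it relies on the check command exiting non-zero when the service is down, which should be confirmed for the real init scripts before relying on it.

```shell
# restart_if_down CHECK START
#   CHECK: command line that exits 0 while the service is healthy
#   START: command line that brings the service back
restart_if_down() {
    if sh -c "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') check failed: $1; running: $2"
    sh -c "$2"
}

# Possible cron usage (assumes "rcgrpwise status" returns non-zero when an agent is down):
#   restart_if_down "rcgrpwise status" "rcgrpwise start" >> /var/log/gw-restart.log
```

Running it from cron every few minutes keeps the logic stateless; the log path in the comment is only an example.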
Remove old NW Cluster
COMPLETED DEC 29
- Monitor period before removing NW Cluster (time between maint. window 1 and maint. window 2 should be long enough to know)
- Remove NW cluster
- Time in sync?
- Report sync clean?
- no obits with dsrepair -a | adv | check ext ref
- document all load/unload scripts
- offline each resource
- uldncs.ncf
- delete sub NCS objects - or just wait
- nwconfig | dir | remove directory services (CHECK other services like slpda/time provider)
Misc tasks
- snip old GW library reference from DB
- COMPLETED DEC 29
Maint window Feb 14 troubleshoot small cib.xml
OUTCOME
- Successfully have the full/correct cib.xml on each server
- Was able to observe the EVMS worker thread issue in greater depth
Plan A: (less impact)
- Disable stonith (follow our usual plan of ensuring backup and chkconfig heartbeat off etc etc)
- command line - migrate the PO resource from gwmail3 to one of the other nodes
- Then rcheartbeat stop, which will stop the heartbeat on the DC and the DC will go to one of the other nodes - then we'll check its cib.xml file.
Our impression is that it will then have a full sized cib.xml
Pros: The PO is migrated elsewhere and clients will have a good chance of not getting any errors from migrating the PO to another server.
Plan B: (more impact)
- Disable stonith (follow our usual plan of ensuring backup and chkconfig heartbeat off etc etc)
- command line - offline all resources
- stop heartbeat on all nodes
- bring up non-gwmail3 first and observe if it gets a large cib.xml
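To make "observe if it gets a large cib.xml" less of a judgment call, the CIB from each node can be dumped to a file and the sizes compared. A minimal sketch: the function name and the byte threshold are assumptions, and the right threshold depends on how large the known-good cib.xml is.

```shell
# cib_size_report MIN_BYTES FILE...
# Prints the size of each file and flags any below MIN_BYTES (a hypothetical threshold).
# Returns non-zero if at least one file is flagged.
cib_size_report() {
    min=$1
    shift
    rc=0
    for f in "$@"; do
        size=$(wc -c < "$f" | tr -d ' ')
        if [ "$size" -lt "$min" ]; then
            echo "SMALL $size $f"
            rc=1
        else
            echo "ok $size $f"
        fi
    done
    return $rc
}

# Possible usage: on each node run "cibadmin -Q > cib.$(hostname).xml",
# copy the files to one place, then:
#   cib_size_report 10000 cib.gwmail1.xml cib.gwmail2.xml cib.gwmail3.xml
```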
Secondary Objectives
- Change directory permissions and verify GW agent restarts
TO DO
- Partition off Root
- Health Check
- Check timesync
- check report sync
- check for obits - dsrepair -a | check ext ref
- Take dibs on multiple servers
- Split partition Root
- Health Check
- Virtual IP for GW consoleone
- investigate EVMS worker threads not cleaning up
- chkconfig lkcd off (get all debug off lkcd-utils in software management).
March 12 maint window
Primary objectives
- Get poa status to show properly in hb_gui, and thus have HB monitor restart/migrate resource when needed
- change permissions on log file /var/run/novell/groupwise
- find ownership of files - export to a file
- stop poa via clm
- start poa via hb_gui
- move LAN switch power supply
- evms
- enable debug for evms
- change evms timeout from 10 min. to 12 seconds
- when finished
- re-enable chkconfig heartbeat
- re-enable stonith
Secondary objectives
- umount restore1 on gwmail1
- add https for webmail
- add second spam box x.x.80.151 to outbound list in that file.
Mar 29
Primary Objective
- Get gwmta and gwpoa running as gwuser
- modify uid.run for each agent
- rcgrpwise stop
- chown all files to gwuser, including /var/log/novell/groupwise (or set the path to /gwise/agent/log)
- Update the DOA UNC path to .87 and not .8
Procedure
- modify the uid.run for each agent and replace root with gwuser
- chown -R gwuser:users /gwise
- rcgrpwise stop
- backup the wpdomain.db for primary and secondary domains
- rcgrpwise start
- rcgrpwise status and confirm all agents are running after "owner" change.
- Update the log path for the MTAs, as their logs currently go to /var/log/novell/groupwise
- update the log size limit for all agents from 32 MB to 256 MB
- Connect to primary and update UNC path for MTA
- May require "rebuild repair"
- Once resolved, connect to secondary and check paths
- Setup sysstat for server utilization stats
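Around the chown step in the procedure above, it can help to record ownership beforehand and to confirm afterwards that nothing was missed. The two helpers below are a sketch; the function names and the output file path in the usage comment are hypothetical.

```shell
# Save "ls -ld" output for everything under a tree, for reference or backout.
# $1: directory, $2: file to write the listing to.
snapshot_owners() {
    find "$1" -exec ls -ld {} + > "$2"
}

# Print every path under $1 that is NOT owned by user $2; no output means the chown is complete.
not_owned_by() {
    find "$1" ! -user "$2" -print
}

# Possible usage around the chown step:
#   snapshot_owners /gwise /root/gwise-owners.before.txt
#   chown -R gwuser:users /gwise
#   not_owned_by /gwise gwuser
```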
Secondary Objective
- Disconnect link to Metro
- Update log limits to 256mb instead of 32 mb
- Added second "mail forward" to GWIA
Procedure
- Go into link configuration and remove Metro
- Mail forward
- Use the GWIA object and not the startup file. Separate the mail forward1 and forward2 with a space.
Next Maint window
Primary Objective
- Troubleshoot GWIA, which may include reinstalling it. Issue 1: the startup files point to the wrong binary, so GWIA requires a manual start. Issue 2: GWIA stops working after exactly 7 days. If we reinstall, we may want to move the GWIA to mail2.
- Backout plan
- Tape backup of GW DB
- backup binary and GroupWise configuration directories
Secondary Objective
- Change daemon to run as gwuser instead of root
- Set GW monitor to restart agents
TO DO
- Power Path/EMC is being used instead of MPIO
- Consider FAN out of GW agents across multiple boxes. TE to send hard data, and pros/cons of each option, with snips of best practice.
- GWAVA Reload
- GWIA stops running after 7 days exactly
- set agents as gwuser
- PENDING sent email for Sam
- Internet outbound email with attachments. Not internal PO
- Large emails
- send max is set at 10 MB, but an 8 MB message has about 2 MB added in transit, so an 8 MB message is blocked. Change to 15 MB
- a 9.1 MB message going PO->GWIA turns into 12 MB
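The size growth noted above is consistent with base64 MIME encoding of attachments, which turns every 3 bytes into 4 characters. That cause is an assumption here, not something these notes confirm. Under that assumption, a quick estimate of whether an attachment fits a given limit can be sketched as follows (function names are made up, sizes are in KB to stay in integer arithmetic, and MIME line breaks and headers are ignored).

```shell
# Size after base64 encoding, in the same unit as the input (KB here):
# every group of 3 units becomes 4, with the last partial group rounded up.
encoded_kb() {
    echo $(( ($1 + 2) / 3 * 4 ))
}

# fits_limit ATTACH_KB LIMIT_KB: succeeds when the encoded size is within the limit.
fits_limit() {
    [ "$(encoded_kb "$1")" -le "$2" ]
}

encoded_kb 8192
fits_limit 8192 10240 || echo "8 MB attachment exceeds a 10 MB limit"
fits_limit 8192 15360 && echo "8 MB attachment fits a 15 MB limit"
```

By the same estimate a 9.1 MB attachment (about 9318 KB) comes out at roughly 12.1 MB, in line with the 12 MB observed, and a 15 MB limit leaves room for attachments of up to about 11 MB.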
Oct 17, 2009 reload verification
For the purposes of making a TEST, I will recommend a procedure that will give you more downtime than actually doing it when needed - this just ensures a smoother return to normal production. What you're testing is the mechanism, after all.
- down the production POA
- make a Reload backup
- turn on DR mode
- run as long as you need to.
- turn OFF DR mode
- migrate back
- resume the production POA.
The downtime will be as long as it would take to run a standard backup.
In an actual unplanned downtime scenario, (production POA is down already)
If the file system of the production POA IS available, 1) do steps 2 and 3 as above.
IF the file system of the production POA is not available,
- turn ON DR mode. You will be as current as the last backup that ran. In other words, you will not have data that landed after the last backup.
- turn on DR mode
- run as long as you need to.
- When you're ready to switch back to the production POA
- do a pre-migrate-back
- down DR mode (there will be downtime now)
- run a final migrate back
- start the production POA.
The clients will be able to connect so long as either the production or DR POA is up. The idea is to do this to minimize downtime. There WILL be downtime, however.
outcome
QUESTIONS
- automate ip address switch or do DNS
- automate the reload server binding to live server's ip address? no, but you can manually bind them, and then teach gwava reload
- get becfield its own secondary ip addr
- GWIA automation
- no reload automation, just have a secondary gwia OR rsync data/config over and manually start gwia on reload as needed.
TIPS
- prior to going into DR mode, bind ip addresses so when the agent loads, it already binds to the correct address
- how to configure agent in DR mode to assume proper ip address
- go into | DR menu | select agent type | CONFIGURE DR mode main menu | enable DR mode settings | Go to the DR MTA config main menu | address |
TODO
- X document all ip addresses and ports for all agents
- get BT2 another IP address, and bind 79.5 as secondary...so 79.5 can move to the reload box.
- restore area - reload restore area configuration p. 61
- restore area IP addresses
completed
- SUCCESSFULLY put becfield into DR mode and verified internal/external email. Incoming needs refinement to bind to production IP address
- Successfully put primary and secondary domains in DR mode
- Successfully put MIA PO into DR mode
- Found where to "automate" binding of production IP addresses; further documentation for phase 2 verification.
- familiarized with process/methodology of implementing DR modes.
- Documented process for future reference.
Nov 7, 2009
- Planned to move all GW agents to reload box and simulate system wide outage.
- change the BT server so it has an additional IP address, 79.6, make 79.6 its primary, and allow 79.5 to move to the reload box
- turn off all agents so we can get all data
- use reload to backup all current data
- once all data sync'd, turn Reload DR agents on for all agents.
- all agents turned on and functioned except the primary. Contacted gwava support and got "OK" support
- Management decided to abort simulation due to the time window.
- resync'd all data back to live system
- loaded all agents up, but primary failed. Tried connecting with C1 to do repairs; that failed. Tried copying the DB locally to do repairs; still could not connect. Conclusion: corrupt primary DB. Found that the primary DB had not changed since Oct 23, so we restored the Oct 22 DB and did a secondary-to-primary sync in C1 to get data that may have changed since Oct 22nd.
- This may be the cause of the corruption, or a multi-faceted cause of losing power. 7002459
- Prevention: run c1 off GW linux server through X
TO DO
- Create a reload gwia dom profile so we can put the gwia dom in DR mode if needed.
- Once gwia dom is on reload, gwia should load up on reload server.
- Refine how BT uses its PO ip address
- setup and configure restore area
- Setup c1 to launch through X
- confirm sync secondary with primary completed, got memory error: memory function failure (likely workstation out of memory)
- Make sure all remote tools work effectively. Struggled with admin tools performing well.
