====== High availability with Heartbeat and DRBD on Debian ======

===== Author: Philipp Noack =====

Note: If you clone the first server after setup, you will NOT get it running. The second server has to be installed exactly like the first one!

----

1. Debian install:

My setup was:

  /boot  ext3 with 100 MB and boot flag
  /      ext3 with 5 GB
  swap   with 4 GB (depending on memory)
  /var   ext3 with 5 GB
  /var2  with the rest of the space; it was set up but not formatted yet!

2. Install the bigmem kernel (in case of more than 4 GB memory):

Run "apt-cache search linux-image bigmem" and choose the right kernel.

3. Network config: /etc/network/interfaces (eth1 will be for DRBD (RAID over TCP/IP) / heartbeat)

  auto lo eth0 eth1
  iface lo inet loopback

  iface eth0 inet static
      address 172.30.86.???
      netmask 255.255.255.0
      network 172.30.86.0
      broadcast 172.30.86.255
      gateway 172.30.86.1

  iface eth1 inet static
      address 192.168.1.???
      netmask 255.255.255.0
      network 192.168.1.0
      broadcast 192.168.1.255

4. Install heartbeat:

  aptitude install heartbeat

5. Install DRBD:

Add Debian Backports for DRBD8 to sources.list (DRBD8 is included in Debian since lenny):

  deb http://www.backports.org/debian etch-backports main contrib non-free

Install the packages drbd8-source and drbd8-utils:

  aptitude -t etch-backports install drbd8-source
  aptitude -t etch-backports install drbd8-utils

Create the kernel module (this has to be redone whenever the kernel is updated):

  module-assistant auto-install drbd8

Reboot with the new kernel.

6. Edit the DRBD config (official documentation: http://www.drbd.org/docs/install/).

Important: You will find the name of the var2 partition in /etc/fstab. Mine was /dev/cciss/c0d0p7.

Here is my config as an example:

  global { usage-count yes; }
  common { syncer { rate 700000K; } }
  resource r0 {
    protocol C;
    handlers {
      pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
      pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
      local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
      outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    }
    startup {
      degr-wfc-timeout 120;    # 2 minutes.
    }
    disk {
      on-io-error detach;
      no-disk-flushes;
    }
    net {
      after-sb-0pri disconnect;
      after-sb-1pri disconnect;
      after-sb-2pri disconnect;
      rr-conflict disconnect;
    }
    syncer {
      al-extents 257;
    }
    on mbops01 {
      device /dev/drbd0;
      disk /dev/cciss/c0d0p6;
      address 192.168.1.1:7788;
      meta-disk internal;
    }
    on mbops02 {
      device /dev/drbd0;
      disk /dev/cciss/c0d0p6;
      address 192.168.1.2:7788;
      meta-disk internal;
    }
  }

Adjust the permissions of the DRBD tools:

  chgrp haclient /sbin/drbdsetup
  chmod o-x /sbin/drbdsetup
  chmod u+s /sbin/drbdsetup
  chgrp haclient /sbin/drbdmeta
  chmod o-x /sbin/drbdmeta
  chmod u+s /sbin/drbdmeta

7. Initialize DRBD (on both machines):

  drbdadm create-md r0

Check it with "cat /proc/drbd".

Warning: Do this step on the master server ONLY!

  drbdadm -- --overwrite-data-of-peer primary r0

8. Create the filesystem on /dev/drbd0 (only on the master server again):

  mkfs -t ext3 /dev/drbd0

9. Install Opsview incl. apache2 (or see the official Debian documentation at http://docs.opsview.org/doku.php?id=opsview2.14:debian-installation):

Add the following lines to sources.list:

  deb http://apt.opsview.org/debian etch main
  deb http://ftp.debian.org/debian etch non-free

Then run "apt-get update" and "apt-get install opsview". To quote the original documentation: "Once Opsview has been installed, a Catalyst web server should be listening on port 3000. The Apache web server can then be used as a proxy to make Opsview available on port 80 (http) - this also provides a significant improvement in performance as static content is then served directly by apache rather than via the perl Catalyst web server."

  apt-get install libapache2-mod-proxy-html
  a2enmod proxy
  a2enmod proxy_http
  a2enmod proxy_html
  /etc/init.d/apache2 force-reload

10. Remove Opsview, MySQL and Apache from the runlevels so that they no longer start automatically at boot (heartbeat starts them for us now):

  update-rc.d -f opsview remove
  update-rc.d -f opsview-web remove
  update-rc.d -f mysql remove
  update-rc.d -f apache2 remove
  update-rc.d -f opsview-agent remove

11. Configure heartbeat:

The file /etc/ha.d/ha.cf:

  debugfile /var/log/ha-debug
  logfile /var/log/ha-log
  keepalive 2
  deadtime 30
  warntime 10
  initdead 120
  auto_failback off
  bcast eth1
  # This is a ping test in our network to check which server can ping it
  ping 172.30.86.4
  node mbops01
  node mbops02
  respawn hacluster /usr/lib/heartbeat/ipfail
  apiauth ipfail gid=haclient uid=hacluster

The file /etc/ha.d/haresources:

  mbops01 drbddisk::r0 Filesystem::/dev/drbd0::/var2::ext3 172.30.86.170 mysql opsview opsview-web apache2

The file /etc/ha.d/authkeys:

  auth 3
  3 md5 anypassword

Then set the file permissions:

  chmod 600 /etc/ha.d/authkeys

12. Moving data:

  cd /usr/local/
  tar cvzf nagios.tar.gz nagios
  mv nagios.tar.gz /var2
  rm -r nagios
  cd /var2
  tar xvzf nagios.tar.gz
  ln -s /var2/nagios /usr/local/nagios

Do the same with /usr/local/opsview-web and with /var/lib/mysql.

13. Replace NRPE agents (to be done on both machines in primary mode)

The opsview-agent needs the var2 partition to run, so you need to use another NRPE agent. Install the Nagios NRPE server and plugins:

  apt-get install nagios-nrpe-server nagios-plugins-basic
  rm /etc/nagios/nrpe.cfg
  cp /var2/nagios/etc/nrpe.cfg /etc/nagios/nrpe.cfg

Now you need to edit the paths in nrpe.cfg: /usr/local/nagios/libexec is replaced by /usr/lib/nagios/plugins.

  vim /etc/nagios/nrpe.cfg
  /etc/init.d/nagios-nrpe-server restart

14. Solving problems

- Disk-flush errors on RAID systems: add the line "no-disk-flushes;" to the disk section of drbd.conf:

  resource r0 {
    disk {
      no-disk-flushes;
      ...

- Apache2 proxy doesn't work:

  cp /usr/share/doc/opsview/apache2-proxy.conf /etc/apache2/sites-available/opsview
  ln -s /etc/apache2/sites-available/opsview /etc/apache2/sites-enabled/opsview

Customize the config file (remove comments and adjust the IPs). Do this on both machines, then do a takeover so that heartbeat restarts apache2.

- MySQL doesn't start/stop on a node: the passwords in /etc/mysql/debian.cnf have to match on both nodes.

- Delete the filesystem if there are problems with it:

  dd if=/dev/zero bs=1M count=1 of=/dev/cciss/????; sync

- Problems with resources r1, r2 ...: delete the line 'after "r2";' in drbd.conf.
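
Before a manual takeover it can help to confirm what "cat /proc/drbd" reports for the resource. The following is a minimal sketch and not part of the original setup: the function name drbd_summary is hypothetical, and the parsing assumes the DRBD 8 status line layout (cs: for the connection state, st: on older 8.0 releases or ro: on later ones for the roles, ds: for the disk states).

```shell
#!/bin/sh
# Hypothetical helper: summarise one DRBD minor from /proc/drbd-style
# text on stdin. Assumes a status line such as
#   " 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---"
drbd_summary() {
    minor=${1:-0}
    awk -v minor="$minor" '
        $1 == (minor ":") {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/)      cs   = substr($i, 4)
                if ($i ~ /^(st|ro):/) role = substr($i, 4)
                if ($i ~ /^ds:/)      ds   = substr($i, 4)
            }
            print "connection=" cs " roles=" role " disks=" ds
            found = 1
        }
        # Non-zero exit status when the requested minor is not listed.
        END { exit (found ? 0 : 1) }
    '
}

# Only read the real file when it exists, so the script is safe to source.
if [ -r /proc/drbd ]; then
    drbd_summary 0 < /proc/drbd
fi
```

A healthy pair would be expected to show a connected state and up-to-date disks on both sides; anything else is worth checking before switching nodes.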
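
The "Moving data" step repeats the same sequence for three directories, so it could be wrapped in a small function. This is only a sketch under assumptions: the name relocate_dir is hypothetical, it copies with "cp -a" instead of the tar round trip used above, and it assumes that /var2 is mounted on the current primary and that the affected services are stopped.

```shell
#!/bin/sh
# Hypothetical helper: move a directory onto the DRBD-backed filesystem
# and leave a symlink at the old location.
# Usage: relocate_dir /usr/local/nagios /var2
relocate_dir() {
    src=$1
    target_root=$2
    dest="$target_root/$(basename "$src")"

    # Refuse to run twice or to overwrite existing data.
    if [ ! -d "$src" ] || [ -L "$src" ]; then
        echo "relocate_dir: $src is not a plain directory" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "relocate_dir: $dest already exists" >&2
        return 1
    fi

    # cp -a keeps owners, modes and timestamps.
    cp -a "$src" "$dest" || return 1
    rm -r "$src" || return 1
    ln -s "$dest" "$src"
}
```

With that in place, the three moves would be "relocate_dir /usr/local/nagios /var2", "relocate_dir /usr/local/opsview-web /var2" and "relocate_dir /var/lib/mysql /var2". The original is removed only after the copy has succeeded.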
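
The path change in nrpe.cfg (from /usr/local/nagios/libexec to /usr/lib/nagios/plugins) can also be done without an editor. A minimal sketch, assuming that a plain search and replace is sufficient for your file; the function name fix_nrpe_paths and the ".bak" backup suffix are hypothetical choices.

```shell
#!/bin/sh
# Hypothetical helper: point an nrpe.cfg at the Debian plugin directory.
# Keeps a backup copy next to the file before rewriting it.
fix_nrpe_paths() {
    cfg=$1
    cp -p "$cfg" "$cfg.bak" || return 1
    sed 's#/usr/local/nagios/libexec#/usr/lib/nagios/plugins#g' "$cfg.bak" > "$cfg"
}
```

Run it as "fix_nrpe_paths /etc/nagios/nrpe.cfg" and restart nagios-nrpe-server afterwards, as in the manual procedure.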