Linux Data Management

From CoolSolutionsWiki

Page for Summarizing Linux Data Management and High Availability Features.

The change log is available in the Discussion tab (above)

RZondervan


Welcome to the Linux Data Management Wiki!!

As mentioned on the wiki main page, please feel free to join in. You can read anything here without logging in. To comment on something, use the Discussion tab (above), the User Comments section (below), or email. To start a new topic or edit a table, you need a Novell Login account (you will be prompted to create one if you don't already have one). If you are unfamiliar with wikis in general, please visit the Novell Wiki or the granddaddy wiki info site www.wiki.org for some background info.

Several areas of the Novell webspace provide information regarding the SUSE LINUX Enterprise Server (commonly abbreviated as SLES).


General Linux Learning Information

Data Management and High Availability Features

Data management refers to the software in charge of:

  • Administering and managing online storage
  • Providing increased operational efficiency and enhanced storage I/O path availability
  • Providing a fail-over mechanism whereby multiple servers can mount and simultaneously access a single file system in a storage array connected by a storage area network (SAN).

High availability refers to the mechanisms that provide resiliency: if a server running a particular application crashes, the application is immediately restarted on another system without requiring administrative intervention.

The main goal of this page is to summarize features. A summary of File Systems and File Access Protocols is on http://wiki.novell.com/index.php/File_System_Primer

SUSE Supported File Systems and Features Comparison

(2007 table)

File Systems and Features Comparison
| Feature Group | Feature | Ext3 | Reiserfs 3.6 | XFS |
| Journaling | Journaling: Data/Metadata | +/+ | -/+ | -/+ |
|  | Journal-Replay Kernel/Userspace | +/+ | +/- | +/- |
|  | Journal internal/external | +/+ | +/- | +/+ |
| Internal Structures | Inode-Allocation-Map | table | unified B*-tree | B+-tree |
|  | Dynamic Inode-Allocation-Map | - | + | + |
|  | Block-Allocation-Map | table | unified B*-tree | B+-trees |
|  | Extents | - | - | + |
|  | Inode Inline Data | + (sym-links<60chars) | + | + |
|  | Sparse Files | + | - see: ls -s;du;du --apparent-size | + |
|  | Tail Packing | - | + | - |
|  | Defrag | +(unstable) | (unnecessary) | + |
| Resizing | Off line extend/shrink | +/+ | +/+ | -/- |
|  | On line extend/shrink | (see comment)/- | +/- | +/- |
| ACLs, Quotas, ... | Extended Attributes/Posix-ACLs(SUSE) | +/+ | +/+ | +/+ |
32 64 64 64
|  | Quotas (SUSE) | + | + | + |
|  | Dump/Restore | + | - | + |
| Sizes/Limits | Blocksize(s): (range) default | (1,2,4KiB) 4KiB | (up to 64KiB) 4KiB | (up to 64KiB) 4KiB |
|  | max. FS size | 16TiB | 16TiB | 8EiB |
|  | max. File size | 2TiB | 1EiB | 8EiB |

Used symbols in this table:

  • + means available/supported
  • - means unsupported or unavailable
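
The Sparse Files row above refers to ls -s, du and du --apparent-size. As a rough illustration of what those tools compare, the following Python sketch creates a file with a hole and reads both the apparent size and the allocated size. It assumes a Linux system where st_blocks is counted in 512-byte units; the function name sparse_file_sizes is made up for this example, and the allocated value depends on the file system in use.

```python
import os
import tempfile


def sparse_file_sizes(hole_bytes=1024 * 1024):
    """Create a file with a hole and return (apparent_size, allocated_bytes)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        f.seek(hole_bytes)  # skip ahead without writing, which leaves a hole
        f.write(b"x")       # one real byte at the very end
    try:
        st = os.stat(path)
        # st_size is the apparent size (what `du --apparent-size` reports).
        # st_blocks counts allocated 512-byte units on Linux (what `du` sums).
        return st.st_size, st.st_blocks * 512
    finally:
        os.remove(path)


apparent, allocated = sparse_file_sizes()
print(f"apparent size: {apparent} bytes, allocated: {allocated} bytes")
```

On a file system that supports sparse files, the allocated figure should be far smaller than the apparent size; where the two are about equal, the hole was filled with real blocks.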

Novell recommends XFS especially for large-scale file systems, for example for file serving with Samba, NFS, etc.

ReiserFS is still the default File System in SLE10.

ReiserFS is still supported in SLE11, but Ext3 will be the default File System. Further information is available at the reiserfs project homepage http://www.namesys.com/.

A more detailed discussion of file system features is available at e.g.:

Online extending of ext3 file systems is technically possible on SLE 10 and newer; full support by Novell for this feature will be added in the future.

The aforementioned link shows the original table from 2007, but the table on this page is editable for comments.

LVM2 Limits

(Feb 2007)

LVM2 Limits
Feature Default Setting and Range
Max logical volumes Default set to 0, means unlimited
Max physical volumes Default set to 0, means unlimited
Physical Extent size Default 4MB, range from 8KB-16GB in powers of 2. There is no limit of number of extents in LVM2, therefore there is no need to adjust the extent size for large volumes
Bad Block Replacement Not yet implemented in LVM2. Solution:
  • Use EVMS, or
  • Wait for one of the next releases in SLES, or
  • Use SAN feature
LUN number limits The maximal number of LUNs can be defined on the kernel cmdline on boot.
  • Default: 512
  • Maximum: 2^32-1
  • Reference: /usr/src/linux/drivers/scsi/scsi_scan.c
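
As a small worked example of the Physical Extent size row, the sketch below checks whether a size is a power of 2 within the 8KB-16GB range given above and counts how many extents a volume of a given size needs. The helper names is_valid_extent_size and extents_needed are made up for illustration, and KB/MB/GB are assumed to be binary units (1024-based); this is not how LVM2 itself performs the check.

```python
KIB = 1024
MIN_EXTENT = 8 * KIB          # 8KB lower bound from the table (assumed binary)
MAX_EXTENT = 16 * KIB ** 3    # 16GB upper bound from the table
DEFAULT_EXTENT = 4 * KIB ** 2  # 4MB default from the table


def is_valid_extent_size(size_bytes):
    """True if the size is a power of 2 between 8KB and 16GB inclusive."""
    in_range = MIN_EXTENT <= size_bytes <= MAX_EXTENT
    power_of_two = size_bytes > 0 and size_bytes & (size_bytes - 1) == 0
    return in_range and power_of_two


def extents_needed(volume_bytes, extent_bytes=DEFAULT_EXTENT):
    """Number of extents required to hold a volume (rounded up)."""
    if not is_valid_extent_size(extent_bytes):
        raise ValueError("extent size must be a power of 2 in 8KB-16GB")
    return -(-volume_bytes // extent_bytes)  # ceiling division


# A 1 TiB volume with the default 4MB extent size:
print(extents_needed(KIB ** 4))  # 262144
```

Because the table states that LVM2 has no limit on the number of extents, a large count like this is not a reason to raise the extent size.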

LVM and EVMS Terms

Some EVMS features are:

  • A command line interface, ncurses menu and gui
  • Drive linking
  • Shrink and expand volumes
  • Create snapshots of volumes
  • Set up software RAID (redundant array of independent devices) features. EVMS provides a bad-block replacement (BBR) mechanism for md devices (LVM2 not yet).
  • EVMS can use many types of file systems and manipulate these storage pieces in ways that best meet the needs of the particular work environment.
  • EVMS also provides the capability to manage data on storage that is physically shared by nodes in a cluster. This shared storage allows data to be highly available from different nodes in the cluster
  • Before SLE10, AutoYaST supports only LVM; EVMS is supported since SLE10.
  • Initial activation after an AutoYaST installation: evms_activate (after every reboot via /etc/init.d/evmsd)

EVMS provides three interfaces:

  • a graphical interface (GUI)
  • an ncurses based interface
  • a command line interface, e.g. see [1]

Best practice: use evmsgui or the CLI. Do not mix EVMS and Logical Volume Manager (LVM): LVM commands are blocked by EVMS unless the block device is filtered by an EVMS exclude option.

EVMS is supported on SLES 8, 9 and 10, including heartbeat, ext2/3, ReiserFS, XFS, swap, OCFS2, NTFS and FAT.

LVM and EVMS Terms
Logical Volume Manager (LVM and LVM2) Enterprise Volume Management System (EVMS) EVMS Parameters
Partition Physical Volume (PV) Segment Size
Collection of Partitions Volume Groups (VG) Container Physical Extent Size, default=16MB
EVMS Specific Region Region name, size in MB or # of extents, striping and stripe_size
Logical Disk Logical Volume EVMS Volume

The combination of EVMS, DM (MPIO) over MD (host based RAID) is possible.

Volume Managers Matrix

(Feb 2007)

Overview (Cluster) Volume Managers
Feature LVM EVMS GFS VM VxVM
Costs free free $2,200 per node (Feb 2007) VFSHA starts at $10,200(Feb 2007)
Main Benefits flexible (resizeable, software raid support), all distros cluster support, quorum daemon instead of disk(2), EVMS provides a bad-block replacement mechanism for md devices cluster support cluster support, gui, host based raid
Main Disadvantages LVM does not provide a bad-block replacement mechanism for md devices (yet). LVM does not (yet) prevent concurrent mounts of a non-shared file system from cluster nodes. no autoyast feature before SLE10 RH only, costs, quorum disk requires third location with a SAN connection(2) costs, no mirrored coordinator disks (for quorum), third SAN location(2)
Snapshot yes via LVM VxFS provides an optional feature called "Storage Checkpoints" which allows for advanced file system snapshots
Support RH/SUSE RH/SUSE (not in combination with GFS) RH only Symantec (Veritas) for RH/SUSE only
max. volume size(1) 32-bit:16TB, 64-bit:8EB
Host based raid +, cluster support starting from RHEL4-U5, SUSE(6) +, dm mpio over md devices is supported starting from RHEL4-U5, RHEL5 based on LVM and CLVM +
Locking + (file system and os dependent) + distributed lock manager(4) + DLM (Distributed Lock Manager) or GULM (Grand Unified Lock Manager)(3) + distributed lock manager(4)
Third location SAN / LAN connection recommendation (quorum disk/daemon for cluster split brain detection)(2) - yes, quorumd daemon, SSL over TCP, or via Open Enterprise Server (OES Linux) and Business Continuity Cluster (BCC)(2) SAN SAN, at least 3 LU's required as coordinator disks ('quorum'/'voting') These LU's cannot be mirrored.

Notes

  • Note(1): Volume sizes above 1TB are not recommended because of backup/restore SLAs. Volume managers are restricted by the file system limits.
  • Note(2): A SAN (Fibre Channel) connection might not be available from a third location. NAS storage for the quorum disk (e.g. Veritas) would be simpler and more cost effective, and allows using existing TCP connections from any location.
  • Note(3): GULM (Grand Unified Lock Manager) will not be shipped with RHEL5. Oracle RAC certification currently (Feb 2007) requires GULM instead of DLM (Distributed Lock Manager). GULM requires at least 3 extra hosts.
  • Note(4): A distributed lock manager (versus a client server lock manager, such as GULM) does not require extra hosts.
  • Note(5): When enabled. An extent is a contiguous area of storage in a computer file system, reserved for a file. When starting to write to a file, a whole extent is allocated. When writing to the file again, possibly after doing other write operations, the data continues where the previous write left off. This reduces or eliminates file fragmentation.
  • Note(6): SUSE: dm mpio over md is available for Logical Volume Management (LVM) and EVMS.

SUSE: There is a heartbeat2 resource agent to take-over an md RAID group in heartbeat.

About SUSE and LVM2 OCFS2 cluster support: Nothing prevents e.g. OCFSv2 from working on LVM2 with (or even without!) cluster support.

Example: You can create a few LVM volumes (without any clusterware), mount them on a few servers and run e.g. OCFS on them – and it all will work until you want to change LVM data (then you need a special procedure OR cluster supported lvm). You can just build OCFSv2 over LVM and then bring all but one node down when you want to change LVM volumes (to make sure that all volumes see the change).

No third SAN Location Requirement:

Fail-over in a twin data center cluster is not an issue, but if the cause of a fail-over is a disaster between the data centers, then there is a chance of two identical services conflicting with each other. Although fail-over and split brain issues caused by a disaster between data centers can be automated using three SAN locations, it is generally recommended to let the administrator decide about the fail-over in case of broken connections between the data centers. An administrator decision should always be the case with only two data centers (two locations with SAN connections). The use of a third location for quorum decisions is therefore not recommended. In case of a twin data center solution, a redundant connection between the data centers is enough for High Availability (HA) purposes. The third location option is a High Availability (HA) 'failure on failure' scenario, which is not always a customer requirement.
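
The reasoning above can be sketched with a generic majority rule. This is only an illustration of why two sites cannot settle a split by themselves, not the algorithm used by heartbeat or any other product; the function names has_quorum and decide_failover are made up for this example.

```python
def has_quorum(visible_sites, total_sites):
    """A partition has quorum only if it sees a strict majority of all sites."""
    return visible_sites > total_sites // 2


def decide_failover(visible_sites, total_sites):
    """Return what a surviving partition should do after losing contact."""
    if has_quorum(visible_sites, total_sites):
        return "automatic fail-over"
    return "wait for administrator"


# Twin data centers, link between them broken: each side sees only itself.
print(decide_failover(visible_sites=1, total_sites=2))  # wait for administrator

# Three locations, one unreachable: the remaining two form a majority.
print(decide_failover(visible_sites=2, total_sites=3))  # automatic fail-over
```

With only two locations neither side can ever reach a majority after a split, which matches the recommendation to leave the decision to the administrator.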

NAS File Access Protocols

(Feb 2007)

Overview NAS File Access Protocols
| Feature | NFSv3 | CIFS/SMB | NFSv4 |
| Costs | - | - | - |
| Main Benefits | stateless | stateful, authenticated | stateful, security, rich management features, kerberos authentication, NFS v4 is the next key file access protocol based on industry standards to come, NFSv4 is implemented in SLE10 |
| Main Disadvantages | requires HA servers, security issues, authentication optional | requires HA servers, no Linux locking (smb) client available | young, under development |
| Support | RH/SUSE | RH/SUSE | NAS vendor, SUSE |

Overview File Systems

(Feb 2007)

File Systems Matrix
| Feature | Ext3 | XFS | Reiser | Ext4 |
| Costs | free | free | free | free (introduced since Linux 2.6.19, not yet SLES) |
| Main Benefits | RH and SUSE support, small file systems, and now htrees for scalability included | very fast mounts with large volumes, scalability, stability, performance also with extremely large volumes/files (such as video rendering), defrag, file change log | scalability and all-around performance, performance with large amount of files (e.g. mail & news services) | upgrade path from ext3, but see disadvantage |
| Main Disadvantages | not on line resizeable in SLES9, but is in SLE10 | no shrink option, no RH support (SUSE supports XFS) | less troubleshooting tools, development too much dependent on one person | There are compatibility questions that still need to be resolved between ext3 and ext4 (for example, 48-bit block numbers) |
| Extents(5) | - | + | v3: - (v4:+) | + |
| On Line Resizeable | + extend, - shrink, (not in SLES9 SP3, but it is in SLE10) | + extend, - shrink | + extend, - shrink | + extend |
| Support | RH/SUSE | SUSE support, no RH support | SUSE support, no RH support | ? |
| Max. File/Volume size(1) | 16GiB to 2TiB/2TiB to 32TiB | 8EiB/8EiB | 1EiB (8TiB:32bit)/16TiB (For 16TB: Use a blockdevice not LARGER than 16TiB - 1 Byte! Use an external log and use mkreiserfs --block-size 8192) | 16GiB to 2TiB/1024 PiB |
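
The volume limits in this table follow largely from how many blocks a file system can address. As a hedged illustration, the sketch below multiplies the number of addressable blocks by the block size. It assumes the simple model limit = 2^bits × block size; the 48-bit figure is the ext4 block number width mentioned in the table, the 32-bit figure for ext3 is an assumption added here, and max_volume_bytes is a made-up helper name.

```python
TIB = 2 ** 40
PIB = 2 ** 50


def max_volume_bytes(block_number_bits, block_size_bytes):
    """Largest volume addressable with the given block number width."""
    return (2 ** block_number_bits) * block_size_bytes


# Assumed 32-bit block numbers with 4KiB blocks (ext3 at its default block size):
print(max_volume_bytes(32, 4096) // TIB, "TiB")  # 16 TiB

# 48-bit block numbers with 4KiB blocks (ext4):
print(max_volume_bytes(48, 4096) // PIB, "PiB")  # 1024 PiB
```

Under this model the results agree with the 16TiB ext3 limit in the 2007 comparison table and the 1024 PiB ext4 volume size listed here.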

Overview Shared File Systems

(Feb 2007)

Shared File Systems Matrix
Feature OCFS2 GFS VxFS PolyServe CFS
Costs free e.g. $2,200 per node (Feb 2007) VFSHA starts at $10,200 (Feb 2007) Matrix Server starts at $1,500 per cpu
Main Benefits free, hasf support, third location quorumd via TCP exclusive write lock UNIX to Linux migrations, exclusive write lock, gui, file change log symmetrical cluster file system (CFS), Volume Manager and Matrix Manager, file locking with a distributed lock manager, focus on exporting to clients over CIFS or NFS as well as Microsoft SQL Server and Oracle 9i RAC and 10g
Main Disadvantages young, lack of features (exclusive write lock planned Q1 2008, no software raid support yet) costs, requires third location with SAN connection(2) costs, requires third location with SAN connection(2) costs, max 16 nodes
ACL - + +
POSIX permissions and file owner + + + +
Support SUSE/Oracle RH only Symantec (Veritas) for RH/SUSE only SLES9,RHEL3
Max. file/Volume size(1) 4PB 32-bit 16TB, 64-bit 8EB 16EB/?  ?/16TB volume
Host based RAID not supported yet Starting from RHEL4-U5 +

High Availability (HA) Features

More about HA: some learning topics in the link include:

  • Split-brain, Quorum, and Fencing
  • How to use a watchdog timer
  • Virtualization as High Availability (Disaster Recovery) or High-Availability as Virtualization?
  • Tools for monitoring services
  • Service Monitoring - basics of a key part of automated management
  • Automated Disaster Recovery - Data Replication

In its simplest form, a high-availability cluster consists of two servers with shared disk space between them and an extra connection between the two servers to provide a heartbeat between the two machines. One server (Server A in this example) is where the database runs. In the event of a problem with Server A:

  • The second server (Server B) will take over running the database.
  • The second server (Server B) will take over the IP address that the users are using to connect to server A.
  • The second server (Server B) will resume database operations with a minimum of disruption to the users.
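
The takeover sequence above can be sketched as a heartbeat timeout check on Server B. This is a minimal model under stated assumptions, not the heartbeat implementation: the class StandbyNode, the deadtime_s parameter and the action strings are hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```python
class StandbyNode:
    """Server B: watches heartbeats from Server A and takes over on silence."""

    def __init__(self, deadtime_s=30.0):
        self.deadtime_s = deadtime_s  # hypothetical: silence tolerated before takeover
        self.last_heartbeat = 0.0
        self.active = False

    def on_heartbeat(self, now):
        """Record that Server A was alive at time `now` (seconds)."""
        self.last_heartbeat = now

    def check(self, now):
        """Return the takeover actions to run, or an empty list if none."""
        if self.active or now - self.last_heartbeat <= self.deadtime_s:
            return []
        self.active = True
        return [
            "take over service IP address",
            "mount shared disk",
            "start database",
        ]


node_b = StandbyNode(deadtime_s=10.0)
node_b.on_heartbeat(now=100.0)
print(node_b.check(now=105.0))  # [] because Server A is still considered alive
print(node_b.check(now=111.0))  # the three takeover actions
```

Once the node is active, further calls to check return nothing, so the takeover runs only once.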

Some concepts of HA with databases are:

Shared Disk Clusters:

  • Option 1) Fail-over Cluster: Two computers, shared disk array, one DB
  • Option 2) Shared Data Cluster: Multiple computers, shared disk, DB on multiple nodes, but only 1 DB file

Shared Nothing Clusters:

  • Option 3) Standby Cluster: Two separate computers, own disks, 1 DB per computer
  • Option 4) Shared Nothing Cluster: Separate computers, own disks, 1 DB per computer, each holding a part of the data

Fail-over Cluster Pros Cons

Image:fail-over_cluster.jpg

This scenario is typically used as an active/passive (a/p) or active/active (a/a) fail-over cluster. One computer mounts the disk with the DB and runs it; only this computer has exclusive access to the disk. The other computer does nothing (a/p) or runs a different service on a different disk (a/a). This scenario is available for nearly all databases, e.g. Oracle, DB2, Informix, Sybase, MySQL, Postgres, ...

Fail-over Cluster Pros Cons
| Pros | Cons |
| Cheap, only one DB license | Second node not usable for the same DB |
| Included in Linux (heartbeat) | Loss of the shared storage is a total loss of both systems |
| Simple DB installation - like a single system | No scaling, only availability |
| Simple handling and administration |  |
| Popular and widely used hardware setup |  |
| KISS compliant (Keep It Simple and not Smart) |  |

Fail-over Cluster Solutions

Fail-over Cluster Solutions
| Open Source Solutions | Commercial Solutions |
| Heartbeat 1 and 2 | Steeleye Lifekeeper |
| Kimberlite (Missioncriticallinux), deprecated | PolyServe Matrix HA |
| Failsafe, deprecated | Convolo (Missioncriticallinux) |
|  | HP MC Serviceguard |

Shared Data Cluster Pros Cons

Image:shared_data_cluster.jpg

Two (or more) computers with a shared disk array. Every node has a local DB installation (multiple instances), but all nodes write onto the same disk (raw device/cluster file system).

Shared Data Cluster Pros Cons
| Pros | Cons |
| Real parallel usage of all computers | Oracle RAC is expensive |
| The only solution where the clients are not affected | Loss of the shared storage is a total loss of both systems |

Shared Data Cluster Solutions

Shared Data Cluster Solutions
| Open Source Solutions | Commercial Solutions |
| Nothing known | Oracle RAC |

This multiple DB instances scenario could be expanded with hard or soft mirrored disk sets. This would introduce a mix of scenarios 2 and 3, using best of both worlds.

Standby Cluster Pros Cons

Image:standby_cluster.jpg

One computer is running the live DB, the second gets all the transactions from the first, and is running in recovery mode.

Standby Cluster Pros Cons
| Pros | Cons |
| Could be different locations | Two licenses needed(1) |
| Different DBs | Only one active DB |
| Independent hardware/computer | Complex administration of two or more OSes and DBs |
| Second DB could be used for read only analysis | Potential loss of data (one log)(2) |

(1) Depends on vendor

(2) Depends on implementation

Standby Cluster Solutions

Standby Cluster Solutions
| Open Source Solutions | Commercial Solutions |
| PostgreSQL Replication | Oracle Dataguard |
| MySQL Replication | Informix HDR/SDR |
|  | DB2 UDB HADR |

Shared Nothing Cluster Pros Cons

Image:shared_nothing_cluster.jpg

Every computer is running a DB with a part of the Data, e.g. the first has A-D, second E-H, and so on...

Shared Nothing Cluster Pros Cons
| Pros | Cons |
| Good performance on reading | If a node crashes, the data of that node is not available |
|  | When adding another node all data must be repartitioned |
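
The A-D, E-H example and the repartitioning drawback can be sketched with a simple range lookup. This is an illustrative model only, not how MySQL cluster or DB/2 place data; the function node_for_key, the boundary lists and the sample keys are all made up for this example.

```python
import bisect


def node_for_key(key, upper_bounds):
    """Return the index of the node whose letter range contains the key.

    upper_bounds holds the last letter of each node's range, in order,
    e.g. ["D", "H", "Z"] means node 0 has A-D, node 1 has E-H, node 2 has I-Z.
    """
    idx = bisect.bisect_left(upper_bounds, key[0].upper())
    if idx == len(upper_bounds):
        raise KeyError(f"no node holds keys starting with {key[0]!r}")
    return idx


three_nodes = ["D", "H", "Z"]       # A-D, E-H, I-Z
four_nodes = ["D", "H", "P", "Z"]   # A-D, E-H, I-P, Q-Z after adding a node

keys = ["Adams", "Evans", "Miller", "Young"]
moved = [k for k in keys
         if node_for_key(k, three_nodes) != node_for_key(k, four_nodes)]
print(moved)  # ['Young']
```

Every key that lands on a different node after the boundaries change has to be copied, which is the repartitioning cost listed in the Cons column. If a node is down, lookups that resolve to it cannot be served by any other node.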

Shared Nothing Cluster Solutions

Shared Nothing Cluster Solutions
| Open Source Solutions | Commercial Solutions |
| MySQL cluster | DB/2 |

Some Heartbeat Features

Some heartbeat features are:

  • No SAN requirement
  • In SLES 9, heartbeat version 1 supports two nodes only
  • In SLES10, heartbeat version 2 supports an unlimited number of nodes (tested up to 16)
  • Optional integration into the OCFS2 cluster user space, where it replaces the OCFS2 disk-based heartbeat
  • EVMS support

Some Features of Cluster File Systems

Some features of Cluster File Systems are:

  • Allows all servers in a cluster to see all data within the Linux file system ('virtual server').
  • Applications on different servers can perform concurrent reads and writes to the same block of data with a small performance loss. Cluster File Systems are used e.g. in large web farms.
  • Allows for off line expansion of the file system in SLES10 SP1.

Some Cluster File System benefits are:

  • Alleviates the need to replicate data in order to make it accessible to many servers, e.g. database mining, web serving or file serving and moving applications or clients across servers (for capacity management and load balancing).
  • Alleviates the need to add a highly available setup of extra servers sharing a file system. This requires fewer components in an HA architecture and makes the solution more robust.

Some Open Source OCFS2 Features

Oracle Linux Certification matrix

OCFS2 project web site

OCFS2 Development Roadmap

Oracle Cluster File System v2 (OCFS2) is an open source cluster management and symmetrical cluster file system, shipped and supported in SLES10. As a symmetrical cluster file system, OCFS2 uses a metadata manager on every node, unlike a parallel file system or a Linux cluster file system with a single metadata manager. Some benefits of a symmetrical CFS are:

  • No single points of failure
  • Fewer performance bottlenecks

Some other OCFS2 features are:

  • Guarantees the block-level consistency of the data on shared OCFS2 file systems
  • No ACL capability yet
  • No exclusive write lock capability yet (currently every lock request returns: successful). This feature is a candidate for SLE10 SP2 (Q1 2008).
  • OCFS2 on top of a software mirror is not supported yet
  • Can be managed by EVMS
  • GUI interface: ocfs2console
  • Disk-based heartbeat, replaceable by the NIC-based heartbeat v2 of the linux-ha.org project
  • OCFS2 includes a hot standby distributed lock manager on every node, which does not require a third location for managing the locking requests
  • OCFS2 offers integration with heartbeat2. Heartbeat2 offers a Resource Agent 'md group take over' (which enables fail-over of host based mirroring of SAN volumes), but OCFS2 on top of a software mirror is not supported.

At this development stage of OCFS2, OCFS2 is recommended e.g. as shared image store in a cluster.

User Comments

Feel free to click the Discussion tab on the top of this page to fill in comments in the Talk page.