KNOW HOW
Cluster Filesystems
Grand Designs
Building a cluster with Linux is often the first idea when older PIII nodes are replaced on the desktop. Thanks to easy installation and readily available software packages, building a cluster computer has changed from an art performed only in the largest computation centers into common, everyday system installation and administration practice.
BY JOS VAN WEZEL
Most of the configuration work when building a cluster is similar to what is done on a single machine. Setting the IP address and changing a few startup settings are basically the same on all Linux machines; no specialized knowledge is required. Besides the standard Linux programs, some cluster-specific software is needed. One such program is responsible for handing out jobs and controlling the workload. Another package helps to install hundreds of machines at the same time. Management software that can control or adapt a single node or the whole cluster with a few simple commands is another necessary tool in a cluster.
This article is about the specialized system software that is used in clusters to write and read data. The software glue between the cluster nodes is the cluster filesystem. Highly optimized for clusters or large installations, a cluster filesystem is not found in ordinary Linux distributions. The choice of a particular cluster filesystem depends on the file access pattern and on cost. Applications that run on one node may run on another node the next time and need to see the same data space. Secondly, there are classes of highly parallel applications that need parallel access to the same data from many nodes at the same time. Data therefore needs to be available to every cluster node, and access to it has to follow the well-known UNIX paradigm of open(), read(), write() and close() calls.

A cluster filesystem provides global access, meaning 'visible at every node', as well as multiple simultaneous access to on-line storage.
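As a minimal illustration of what 'visible at every node' means, the following shell snippet (host names and the mount point are invented for the example) writes a file on one node and reads it on another through the very same path:

# on node01: write a result file into the shared cluster filesystem
echo "result of job 42" > /cluster/data/job42.out

# on node02: the identical path delivers the same data, no copying needed
ssh node02 cat /cluster/data/job42.out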
A filesystem is necessary

Clusters need background storage that is available to every node. Programs in a cluster usually work on partitioned work areas and need to share data through a common resource like a filesystem. For programs running in a cluster it is most convenient to use the filesystem interface offered by the underlying operating system. It is very difficult to build the storage sharing into the application. After all, the application has to run on different installations, different Linux versions or even different operating systems. Besides, it would severely limit performance.

A filesystem to store data is normally part of each operating system. Linux computers in a cluster have ext, JFS, ReiserFS or equivalent installed, but this storage is confined to the single node.
Blocks, files and connections

Files are written by the operating system in chunks of a certain size called blocks. The block size is tunable up to a maximum of 4 KB and is fixed when the filesystem is created. Consequently, small changes to a file are handled less efficiently on a filesystem with a 4 KB block size, but such a layout is better at handling large sequential IO.
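The block size is chosen when the filesystem is created and can be checked afterwards; the device name below is only a placeholder:

# create an ext3 filesystem with 4 KB blocks on a scratch partition
mkfs.ext3 -b 4096 /dev/sdb1

# verify which block size is actually in use
tune2fs -l /dev/sdb1 | grep "Block size"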
The kernel operates on blocks to improve speed and does so also for the IO subsystem. Disk drivers take care of the block transfers from host to disk. The filesystem takes care of the allocation and administration of disk blocks.

Filesystem data access can be block oriented or file oriented. There are solutions that use and extend the local storage, and systems that implement the actual block storage themselves. The first method makes implementation and portability a lot easier, but limits throughput. The other approach is to handle the block storage in the cluster filesystem itself. This allows optimization of data access and the addition of features for which normally a Logical Volume Manager is responsible. Such a system is also easier to manage because it has full control over the complete data path from disk to application.

To improve throughput, many solutions offer the possibility to read and write data in parallel to many disks. Files are striped over several disk platters or, alternatively, individual files are written to different disks (see Figure 1). The maximum IO throughput is limited by the disk transfer speed and can only increase when more disks are accessed in parallel.

Figure 1: Striped filesystem.
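A quick way to check whether striping pays off is a streaming test with dd. The mount point below is only an example of a directory on a striped cluster filesystem:

# write 1 GB sequentially; divide by the elapsed time for a rough MB/s figure
time dd if=/dev/zero of=/cluster/scratch/stripetest bs=1M count=1024

# read the file back
time dd if=/cluster/scratch/stripetest of=/dev/null bs=1M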
To access the local disks on other cluster nodes, an individual node can use the network connection. An IP/Ethernet link is already available to connect to the outside world. Sometimes a secondary Ethernet connection is used solely for the filesystem-originated data transfers. For Linux, the Enhanced Network Block Device (http://www.it.uc3m.es/~ptb/nbd) allows a disk to be attached over the network.
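As a rough sketch of the idea, the following commands attach a remote disk with the plain nbd tools (the Enhanced NBD variant ships its own utilities); host name, port and paths are made up for the example:

# on the server: export a file or partition on TCP port 2000
nbd-server 2000 /export/nbd/disk0.img

# on the client: load the driver and attach the remote export as /dev/nbd0
modprobe nbd
nbd-client storage01 2000 /dev/nbd0

# the network block device now carries an ordinary local filesystem
mkfs.ext3 /dev/nbd0
mount /dev/nbd0 /mnt/netdisk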
Ethernet is not optimal for block-oriented IO, and many filesystems support the use of additional inter-node connection hardware to demonstrate their superior throughput. The difference in speed is also reflected in the cost of the interconnect hardware. Examples of this so-called memory attached hardware are Myrinet, Infiniband or Fibre Channel. It is cheaper to run piggyback on Ethernet. The system then directs the IO transfers to a dedicated IO server. An IO server with high speed connections can deliver the actual data to the application more efficiently, and the meta-data server can continue with other tasks. Because meta-data is much smaller than the actual data, it can be completely cached. As the working set in a cluster can become very large, file server caches are never large enough to store all active IO data.

Metadata

A Linux filesystem stores data in files and directories and keeps records about these in i-nodes. The i-nodes contain, among other things, information about the stored files, for example the size, creation time, type or owner of a file. This information is called meta-data. Manipulations of the meta-data can be handled separately from the actual data input and output. This makes it possible to offload the meta-data handling to a dedicated meta-data server. In cluster filesystems the separate meta-data servers are usually also responsible for write locks on a file.

This separation is used to improve throughput. Read operations do not need the meta-data once the location and access rights of the actual data are established.
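The difference between meta-data and data operations is easy to see from the command line; the path is only an example:

# meta-data only: size, owner and timestamps come straight from the i-node
stat /cluster/data/results.dat

# actual data: the file contents themselves have to be read from storage
cat /cluster/data/results.dat > /dev/null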
Network Attached Storage

Well-known types of cluster filesystems are network filesystems such as NFS or SMB. NFS is used in many installations to connect to Network Attached Storage (NAS) servers (see Figure 2).

Figure 2: Network attached storage.

High throughput rates are achieved on specialized NAS servers like those by NetApp, Exanet, Panasas or Zambeel. These are proprietary, but highly optimized, NFS solutions with their own operating system. With Linux you can take a dual CPU machine, a RAID controller on a PCI card and a Gigabit Ethernet interface (if not on-board already), put in some EIDE disks, install Linux as an NFS server, and your NAS box is ready.
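A do-it-yourself NAS of this kind needs little more than an entry in /etc/exports on the server and a mount on each node. Host names, network and paths below are examples:

# /etc/exports on the NFS server: export the RAID volume to the cluster subnet
/export/data  192.168.1.0/255.255.255.0(rw,sync,no_root_squash)

# activate the export list
exportfs -ra

# on every cluster node: mount the share at a common location
mount -t nfs nas01:/export/data /data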
NAS as a cluster filesystem has its drawbacks. Throughput for small files and non-sequential access is slow because of the high latency: all data has to travel through the server and the TCP/IP stacks. Linux also has limited storage capacity because of its 32-bit block pointers, although with the 2.6 kernel this scales to 16 TB. The scalability of a self-constructed network attached filesystem ends at its single network connection. Even a modern Gigabit link does not suffice to deliver IO for more than a few cluster nodes. A possible solution is Ethernet bonding, which combines two or more devices to increase bandwidth.
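Bonding is set up in a few steps; the exact configuration files differ per distribution, so the following (using the ifenslave tool, with example addresses) is only a sketch:

# load the bonding driver with round-robin mode and link monitoring
modprobe bonding mode=balance-rr miimon=100

# give the bonded interface an address and enslave two physical NICs
ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1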
Depending on the requirements regarding throughput, cost and, to a lesser extent, security, a network attached storage system can function as a cluster filesystem.
Well known cluster filesystems

We will limit ourselves to some of the most interesting cluster filesystem solutions for Linux. Not all of these solutions were designed for use in a cluster, but they are very capable of doing the job.

OpenAFS

This is both a network and a distributed filesystem. It offers file sharing, even over a WAN, and a global name space. The filesystem is completely virtual and kept on (replicated) data servers and meta-data servers. Clients build up a cache of recently used files which is regularly flushed to the data servers. Applications use the cached data and continue to work when the connection to the data server is broken, but on closing an open file they have to wait until the server is on line again.

AFS has been branched into the Distributed File System (DFS), which was maintained and marketed by an IBM subsidiary. DFS is no longer supported and its development has stopped. AFS is an excellent shared filesystem for a campus or even world-wide use, because it also has a well-established security model. An instance or administrative domain is called a cell. Users authenticate themselves and are sent a security token with a limited lifetime. The token allows the cache manager on the local machine to talk to the AFS file server.

With OpenAFS an administrator can add and replace disk space without service interruption. Because the cache manager sits between user and server, data migration to other servers can happen transparently. A special version of OpenAFS called MultiResident-AFS interfaces to tape and allows automatic data migration to offline storage.

Use of OpenAFS in a cluster is not recommended where high data throughput is needed. The benefit of a cache, which may enhance stability, is obviated by the slow file server access, which runs in user space.
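From a user's point of view, working in a cell typically looks like the following (cell name and volume path are invented; sites using Kerberos 5 authenticate with aklog instead of the classic klog):

# authenticate against the cell and receive a time-limited token
klog jdoe

# show the tokens the cache manager currently holds
tokens

# query the quota of the home volume inside the global /afs name space
fs listquota /afs/example.org/user/jdoe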
NFS

NFS is a network based, file oriented filesystem. Because NFS, when run over UDP, is stateless, clients experience only a short stall if the network is unavailable or if an NFS server is rebooted. Clients may disappear without notice and the server does not have to do anything to recover. This is in contrast with all the other systems mentioned below, which do need to clean up after (contact to) a client or cluster member is lost.

Directory hierarchies local to the server are made available to others by exporting them. Clients mount the exported directories at any location in their own filesystem. A client cannot re-export a mounted NFS filesystem. As NFS uses a weak security model, it cannot safely be shared over a WAN.

Exported filesystems are usually maintained in a database, which can be NIS or LDAP. The autofs system uses this database to automatically mount the proper file hierarchy. The trigger for the automated mount is a program entering the directory that is defined as the mount point. NFS mounts within NFS mounts are allowed.

The combination of autofs and a network-wide database makes NFS a very good candidate for use in a cluster. Perhaps the only limitation is its speed, but this depends heavily on the access pattern.
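Autofs maps can also live in plain files; a minimal example with an invented server name looks like this (the same map format can be served from NIS or LDAP):

# /etc/auto.master: mount home directories on demand under /home
/home  /etc/auto.home  --timeout=300

# /etc/auto.home: the wildcard key turns /home/jdoe into an NFS mount
*  -rw,hard,intr  nfsserver:/export/home/&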
Direct attached storage

Usually these disks are located next to the computer, in the same housing or in the near vicinity. This is therefore called direct attached storage, as opposed to network attached (see Figure 3). The distance is limited by the specifications of the copper-based connections. Fibre Channel has opened the possibility of connecting disks at larger distances. FC can connect hosts to storage devices directly or via an FC switch. Switches can connect to other switches to build a storage network or SAN (see Figure 4).

Figure 3: Direct attached storage.

Where SCSI or ATA is limited to connections between host and storage, Fibre Channel is used to build a storage area network (SAN) that connects hundreds of hosts and storage devices. The Fibre Channel protocol is optimized for storage devices. Its features are low latency and protocol offloading, which reduces the interrupt and processing load on the host.

Figure 4: Storage area network storage.

Network attached direct storage

Storage and networking are increasingly integrating on the hardware side. iSCSI is a standard that defines the SCSI protocol over IP. Conversely there is FC-IP, which defines IP over Fibre Channel. Infiniband offers both IP and FC on the same connection hardware. The high reliability of modern networks makes a Local Area Network a good candidate for the block transfers that are usually handled by direct attached storage.
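As an illustration of this convergence, attaching an iSCSI disk takes only a couple of commands. The listing uses the open-iscsi userland with an invented target name and portal address, so treat it as a sketch:

# discover the targets offered by an iSCSI storage box (address is an example)
iscsiadm -m discovery -t sendtargets -p 192.168.1.50

# log in to a target; the LUN then appears as an ordinary /dev/sd* disk
iscsiadm -m node -T iqn.2004-01.com.example:storage.disk1 -p 192.168.1.50 --login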
GPFS

The General Parallel File System is a commercial product of IBM. GPFS is a truly parallel filesystem: data can be striped over many disks and any node can access the same file at the same time.

There are two possible access configurations, either via a SAN or via direct attached storage. In the SAN configuration each node sees each block on all storage elements that are made available to GPFS. Files are assembled from the blocks distributed in the SAN and are directly available on the node. The direct attached configuration relies on a high speed inter-node network, which can be Ethernet or Myrinet. Blocks from local disks are shipped to other nodes, and files are assembled by gathering blocks over the IP network from local disks on several nodes. This is similar to PVFS.

Management is very simple and commands can be issued from any node. The system has the capability of adding and removing disks, rebalancing data access, and changing the block size and the number of possible i-nodes. It overcomes the maximum filesystem size limitation because it allows a configurable block size as large as 1 MB.

There is one node per open file for handling the metadata. All nodes can access the same file, but changes to the meta-data are handled by the meta-data node. Locking is distributed over the nodes accessing the file. All data is written and read in parallel, and throughput scales linearly with the number of disks and nodes. GPFS filesystems can be exported with AFS or NFS from dedicated servers. The compute nodes then mount the exported filesystems.

GPFS depends on a very expensive SAN infrastructure to achieve high performance. The configuration where an IP network is used to assemble disk stripes can become an early bottleneck for many data access patterns.
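The mm* administration commands give a flavour of this. Disk descriptor files and most options differ between GPFS releases, so the listing below is indicative only, with invented names and files:

# create a filesystem from a list of disk descriptors with a 1 MB block size
# (option syntax varies between GPFS releases)
mmcrfs /gpfs gpfs0 -F diskdesc.txt -B 1024K

# later: add disks and rebalance the existing data over them
mmadddisk gpfs0 -F newdisks.txt
mmrestripefs gpfs0 -b

# show the filesystem parameters, including block size and i-node limits
mmlsfs gpfs0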
LUSTRE

Lustre is new and is being actively developed. Although Lustre is marketed by HP, the project is committed to the open source license model. Lustre has excellent documentation. For configuration and logging, Lustre relies on the open standards LDAP and XML. Lustre is a file (object) based system.

Everything stored in Lustre is considered an object. The objects of the filesystem are (special) files and directories. The attributes, or meta-data, of these objects, such as size, creation time, symbolic link pointers or backup flags, are stored on metadata servers (MDS). The meta-data is kept separate from the actual content. The MDS takes care of file creation and attribute manipulation and is responsible for the namespace handling: a file is found by asking the MDS. After opening, the MDS relays the actual IO to Object Storage Targets (OSTs), which take care of the data exchange. The MDS keeps track of the data exchange in a journal. Creating and writing a file involves the creation of an i-node on the MDS, which then contacts an OST to allocate storage. The allocation can be striped across several OSTs to enhance performance.

The throughput achieved in some published tests is impressive. At the moment, Lustre still lacks important maintenance tools for use in a production environment: there is no filesystem recovery utility and there is not yet an automatic failover for the single MDS. The original approach and the development from the ground up make Lustre a solution that could grow into a very powerful and elegant system.
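Striping across OSTs is controlled per file or per directory with the lfs tool. The options below follow the newer lfs syntax, and the mount point is an example:

# stripe new files created in this directory over four OSTs
lfs setstripe -c 4 /mnt/lustre/scratch

# show how an existing file is spread over the OSTs
lfs getstripe /mnt/lustre/scratch/bigfile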
PVFS

The Parallel Virtual File System is block based and provides high performance for I/O intensive parallel or distributed applications. The usual application environment is a small (< 50 nodes) cluster, but there are no inherent limits. Parts of the internal disks on the IO nodes are made available to PVFS.

The file space of these disks is then distributed to the complete cluster and accessed via Ethernet, a kernel module and the libpvfs library installed on the clients. Clients can be IO nodes (IOD) themselves, and one node or client has to be configured as the meta-data node (MDS).

Files are striped over the IO nodes. After the initial administrative data exchange with the MDS, all data traffic with the IODs is handled by the clients individually via libpvfs. The library orchestrates the assembly of the files from the stripes distributed over the IODs. PVFS supports Myrinet and Infiniband for intra-node communication.

PVFS currently contains no means for data redundancy, nor is it possible to recover from a failed node. There is a potential bottleneck at the manager level as the number of client nodes increases. PVFS cannot go beyond the restrictions introduced by TCP/IP on Linux, such as limits on the number of simultaneously open sockets and the TCP/IP protocol overhead. PVFS must be installed on the cluster nodes since it does not allow export via NFS or AFS.

PVFS2 is a code rewrite based on the experience gathered with PVFS1. It has structural enhancements such as user controlled striping and distributed meta-data. The latter allows the installation of more than one meta-data controller, which relieves this bottleneck.

OpenGFS

OpenGFS or OGFS also implements a journaled, block based filesystem that provides read and write access from multiple nodes. The dreaded 'pool' code was changed to allow OGFS to use any logical volume manager; ELVM is preferred. Most recently, the memexp locking was replaced by the OpenDLM module. The old memexp was a single point of failure and very compute intensive. OGFS supports growing filesystems and the addition of disks (through the separate LVM). Node failures are handled by log recovery and by isolating the failed node.

Final remarks

None of the presented systems is ideal; none of the discussed variants of cluster-capable filesystems is perfect for all purposes. The open source packages Lustre and OpenGFS are in early development and not production ready. OpenAFS lacks throughput capacity. Clearly the Linux and open source community still has some way to go in developing stable, scalable cluster storage. The commercial packages GPFS and Sistina's GFS are reportedly more stable and scalable.

After you have made your choice it is not a matter of running rpm -i and forgetting about it. The software has to be tuned for the specific environment; the default values are never the optimal values. Proper configuration forces you to think about data access patterns, optimal IO paths, possible bottlenecks, block sizes, stripe sizes, cache usage and so on. After installation of the cluster filesystem the fun just starts.
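What tuning means depends on the filesystem. For a plain NFS based setup, for example, it can start with the transfer sizes in the mount options; the values and names below are just a starting point to measure against:

# larger read/write sizes and hard mounts often help sequential cluster IO
mount -t nfs -o rsize=32768,wsize=32768,hard,intr nas01:/export/data /data

# check which options are actually in effect
mount | grep /data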
■

INFO
[1] OpenGFS: http://www.sourceforge.net/projects/opengfs
[2] GFS: http://www.sistina.com