POSIX I/O High Performance Computing Extensions


Transcripts
Slide 1

POSIX I/O High Performance Computing Extensions
Brent Welch (Speaker), Panasas
www.pdl.cmu.edu/posix/
November 17, 2005

Slide 2

APIs for HPC IO
POSIX IO APIs (open, close, read, write, stat) have semantics that can make it hard to achieve high performance when large clusters of machines access shared storage.
A working group (see next slide) of HPC users is drafting some proposed API enhancements for POSIX that will provide standard ways to achieve higher performance.
The primary approach is either to relax semantics that can be expensive, or to provide more information to inform the storage system about access patterns.

Slide 3

Contributors
Lee Ward - Sandia National Lab
Bill Lowe, Tyce McLarty - Lawrence Livermore National Lab
Gary Grider, James Nunez - Los Alamos National Lab
Rob Ross, Rajeev Thakur, William Gropp - Argonne National Lab
Roger Haskin - IBM
Brent Welch, Marc Unangst - Panasas
Garth Gibson - CMU/Panasas
Alok Choudhary - Northwestern U
Tom Ruwart - U of Minnesota/IO Performance
Others
www.pdl.cmu.edu/posix/

Slide 4

POSIX Introduction
POSIX is the IEEE Portable Operating System Interface for Computing Environments.
"POSIX defines a standard way for an application program to obtain basic services from the operating system" (The Open Group, http://www.opengroup.org/)
POSIX was created when a single computer owned its own file system.
Network file systems like NFS chose not to implement strict POSIX semantics in all cases (e.g., lazy access time propagation).
Heavily shared files (e.g., from clusters) can be very expensive for file systems that provide POSIX semantics, or have undefined contents for file systems that bend the rules.
The goal is to create a standard way to provide high performance and good semantics.

Slide 5

Current HPC POSIX Enhancement Areas
Ordering (the stream-of-bytes idea needs to move towards distributed vectors of units): readx(), writex()
Coherence (last writer wins and other such things can be optional): lazyio_propagate(), lazyio_synchronize()
Metadata (lazy attributes issues): statlite()
Locking schemes for cooperating processes: lockg()
Shared file descriptors (group file opens): openg(), sutoc()
Portability of hinting for layouts and other information (file system provides optimal access strategy in standard call): ? (no API yet)

Slide 6

statlite, fstatlite, lstatlite
Syntax
  int statlite(const char *file_name, struct statlite *buf);
  int fstatlite(int filedes, struct statlite *buf);
  int lstatlite(const char *file_name, struct statlite *buf);
Description
This family of stat calls, the lite family, is provided so that file I/O performance is not compromised by frequent lookups of stat information. Some information can be expensive to obtain when a file is busy. They all return a statlite structure, which has all the normal fields from the stat family of calls, but some of the fields (e.g., file size, modify time) are optionally not guaranteed to be correct. There is a litemask field that can be used to specify which of the optional fields must be returned with completely correct values.

Slide 7

statlite, fstatlite, lstatlite (cont.)
Syntax
  int statlite(const char *file_name, struct statlite *buf);
  int fstatlite(int filedes, struct statlite *buf);
  int lstatlite(const char *file_name, struct statlite *buf);
Description
statlite stats the file pointed to by file_name and fills in buf.
lstatlite is identical to statlite, except in the case of a symbolic link, where the link itself is statlite-ed, not the file that it refers to.
fstatlite is identical to statlite, only the open file pointed to by filedes (as returned by open(2)) is statlite-ed in place of file_name.
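The link-versus-target distinction described for lstatlite mirrors the existing stat(2)/lstat(2) pair, so it can be illustrated with calls that exist today. The following is a minimal sketch under that assumption; the function name demo_link_vs_target and the /tmp scratch paths are made up for illustration and are not part of the proposal.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if stat() sees a regular file through the
 * link while lstat() sees the link itself, 0 if not, -1 on setup error.
 * Creates and removes its own scratch files under /tmp. */
int demo_link_vs_target(void)
{
    char dir[] = "/tmp/statdemoXXXXXX";
    char target[64], link_path[64];
    struct stat via_stat, via_lstat;
    int result = -1;

    assert(sizeof target > sizeof dir + 8);   /* room for "/target" */
    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(target, sizeof target, "%s/target", dir);
    snprintf(link_path, sizeof link_path, "%s/link", dir);

    FILE *f = fopen(target, "w");
    if (f != NULL) {
        fclose(f);
        if (symlink(target, link_path) == 0) {
            /* stat() follows the link; lstat() describes the link. */
            if (stat(link_path, &via_stat) == 0 &&
                lstat(link_path, &via_lstat) == 0)
                result = S_ISREG(via_stat.st_mode) &&
                         S_ISLNK(via_lstat.st_mode);
            unlink(link_path);
        }
        unlink(target);
    }
    rmdir(dir);
    return result;
}
```

Calling demo_link_vs_target() from a test program is enough to exercise it; a proposed lstatlite would presumably report the link in the same way, with the optional fields governed by the litemask.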

Slide 8

struct statlite
struct statlite {
    dev_t         st_dev;      /* device */
    ino_t         st_ino;      /* inode */
    mode_t        st_mode;     /* protection */
    nlink_t       st_nlink;    /* number of hard links */
    uid_t         st_uid;      /* user ID of owner */
    gid_t         st_gid;      /* group ID of owner */
    dev_t         st_rdev;     /* device type (if inode device) */
    unsigned long st_litemask; /* bit mask for optional field accuracy */
    /* Fields below here are optionally provided and are guaranteed to be
       correct only if their corresponding bit is set to 1 in the mandatory
       st_litemask field, with the lite versions of the stat family of calls */
    off_t         st_size;     /* total size, in bytes */
    blksize_t     st_blksize;  /* blocksize for filesystem I/O */
    blkcnt_t      st_blocks;   /* number of blocks allocated */
    time_t        st_atime;    /* time of last access */
    time_t        st_mtime;    /* time of last modification */
    time_t        st_ctime;    /* time of last change */
    /* End of optional fields */
};

Slide 9

POSIX ACLs
Legitimize NFSv4 ACLs in POSIX, allowing users to choose a mechanism; over time perhaps POSIX ACLs will fade away.
Note that "POSIX ACLs" are really only a proposed part of the standard and are not widely implemented or used.
NFSv4 ACLs are aligned with the Windows ACL model, which is more widely used and more sensible.
The two models differ in how ACLs are inherited, and in the rules for processing a long list of ACEs (access control entries).
draft-falkner-nfsv4-acls-00.txt is an Internet Draft from Sun that explains how they are exposing NFSv4 ACLs for Solaris 10.

Slide 10

NFSv4 ACLs
Permission letter mapping:
  r - NFS4_ACE_READ_DATA
  w - NFS4_ACE_WRITE_DATA
  a - NFS4_ACE_APPEND_DATA
  x - NFS4_ACE_EXECUTE
  d - NFS4_ACE_DELETE
  l - NFS4_ACE_LIST_DIRECTORY
  f - NFS4_ACE_ADD_FILE
  s - NFS4_ACE_ADD_SUBDIRECTORY
  n - NFS4_ACE_READ_NAMED_ATTRS
  N - NFS4_ACE_WRITE_NAMED_ATTRS
  D - NFS4_ACE_DELETE_CHILD
  t - NFS4_ACE_READ_ATTRIBUTES
  T - NFS4_ACE_WRITE_ATTRIBUTES
  c - NFS4_ACE_READ_ACL
  C - NFS4_ACE_WRITE_ACL
  o - NFS4_ACE_WRITE_OWNER
  y - NFS4_ACE_SYNCHRONIZE
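The letter mapping is a simple lookup table, and a short sketch may help show how a tool could expand a permission string such as "rwx". The names demo_perm, demo_perm_name and demo_count_perms are hypothetical helpers, and the table carries only the permission names from the slide, not their numeric bit values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table type: one permission letter and its NFSv4 ACE name. */
struct demo_perm {
    char letter;
    const char *name;
};

static const struct demo_perm demo_perms[] = {
    {'r', "NFS4_ACE_READ_DATA"},        {'w', "NFS4_ACE_WRITE_DATA"},
    {'a', "NFS4_ACE_APPEND_DATA"},      {'x', "NFS4_ACE_EXECUTE"},
    {'d', "NFS4_ACE_DELETE"},           {'l', "NFS4_ACE_LIST_DIRECTORY"},
    {'f', "NFS4_ACE_ADD_FILE"},         {'s', "NFS4_ACE_ADD_SUBDIRECTORY"},
    {'n', "NFS4_ACE_READ_NAMED_ATTRS"}, {'N', "NFS4_ACE_WRITE_NAMED_ATTRS"},
    {'D', "NFS4_ACE_DELETE_CHILD"},     {'t', "NFS4_ACE_READ_ATTRIBUTES"},
    {'T', "NFS4_ACE_WRITE_ATTRIBUTES"}, {'c', "NFS4_ACE_READ_ACL"},
    {'C', "NFS4_ACE_WRITE_ACL"},        {'o', "NFS4_ACE_WRITE_OWNER"},
    {'y', "NFS4_ACE_SYNCHRONIZE"},
};

/* Returns the permission name for a letter, or NULL if it is unknown.
 * The lookup is case-sensitive: 'n' and 'N' are different permissions. */
const char *demo_perm_name(char letter)
{
    for (size_t i = 0; i < sizeof demo_perms / sizeof demo_perms[0]; i++)
        if (demo_perms[i].letter == letter)
            return demo_perms[i].name;
    return NULL;
}

/* Returns how many letters in spec are valid, or -1 at the first bad one. */
int demo_count_perms(const char *spec)
{
    int n = 0;

    assert(spec != NULL);
    for (; *spec != '\0'; spec++, n++)
        if (demo_perm_name(*spec) == NULL)
            return -1;
    return n;
}
```

For example, demo_perm_name('c') yields the READ_ACL name while demo_perm_name('C') yields WRITE_ACL, which is why the lookup must not fold case.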

Slide 11

lockg
Syntax
  int lockg(int fd, int cmd, lgid_t *lgid);
Description
Apply, test, remove, or join a POSIX group lock on an open file. Group locks are exclusive, whole-file locks that limit file access to a specified group of processes. The file is specified by fd, a file descriptor open for writing, and the action by cmd.
The first process to call lockg() passes a cmd of F_LOCK and an initialized value for lgid. Obtaining the lock is performed exactly as though a lockf() with pos of 0 and len of 0 were used (i.e., defining a lock section that encompasses a region from byte position zero to present and future end-of-file positions). An opaque lock group id is returned in lgid. This lgid may be passed to other processes for the purpose of allowing them to join the group lock.

Slide 12

lockg (Continued)
Description (Continued)
Processes wishing to join the group lock call lockg() with a cmd of F_LOCK and the lgid returned to the first process. On success this process has registered itself as a member of the group of the group lock.
Valid operations are given below:
  F_LOCK   Set an exclusive lock
  F_TLOCK  Same as F_LOCK but the call never blocks
  F_ULOCK  Unlock the indicated file
  F_TEST   Test the lock
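Since lockg() is only a proposal, the sketch below shows just the part the description ties to existing behavior: the whole-file lock that lockf(3) takes when called at offset 0 with a length of 0. It is an illustration under that assumption, with made-up demo_ helper names; it does not implement group membership or the lgid hand-off, because a lockf() lock belongs to a single process.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Lock from byte 0 to present and future end-of-file. lockf() measures
 * the section from the current offset, so seek to 0 first; a length of 0
 * extends the section through any later growth of the file. */
int demo_lock_whole_file(int fd)
{
    assert(fd >= 0);
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return -1;
    return lockf(fd, F_LOCK, 0);
}

/* Release the same section. */
int demo_unlock_whole_file(int fd)
{
    assert(fd >= 0);
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return -1;
    return lockf(fd, F_ULOCK, 0);
}

/* Hypothetical round trip on a scratch file; returns 0 on success. */
int demo_lock_roundtrip(void)
{
    char path[] = "/tmp/lockdemoXXXXXX";
    int fd = mkstemp(path);   /* opened read/write, as lockf() requires */
    int rc;

    if (fd < 0)
        return -1;
    rc = demo_lock_whole_file(fd);
    if (rc == 0)
        rc = demo_unlock_whole_file(fd);
    close(fd);
    unlink(path);
    return rc;
}
```

In the proposed interface, the first caller would additionally receive an opaque lgid and pass it to its peers (for instance over MPI) so that they could join the same lock instead of blocking on it.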

Slide 13

readdirplus & readdirlite
Syntax
  struct dirent_plus *readdirplus(DIR *dirp);
  int readdirplus_r(DIR *dirp, struct dirent_plus *entry, struct dirent_plus **result);
  struct dirent_lite *readdirlite(DIR *dirp);
  int readdirlite_r(DIR *dirp, struct dirent_lite *entry, struct dirent_lite **result);
Description
readdirplus(2) and readdirplus_r(2) return a directory entry plus lstat(2) results (like the NFSv3 READDIRPLUS command).
readdirlite(2) and readdirlite_r(2) return a directory entry plus lstatlite(2) results.

Slide 14

readdirplus & readdirlite (Continued)
Description (Continued)
Results are returned as a dirent_plus or dirent_lite structure:
struct dirent_plus {
    struct dirent   d_dirent;   /* dirent struct for this entry */
    struct stat     d_stat;     /* attributes for this entry */
    int             d_stat_err; /* errno for d_stat, or 0 */
};
struct dirent_lite {
    struct dirent   d_dirent;   /* dirent struct for this entry */
    struct statlite d_stat;     /* attributes for this entry */
    int             d_stat_err; /* errno for d_stat, or 0 */
};
If d_stat_err is 0, the d_stat field contains the lstat(2)/lstatlite(2) results.
If the readdir(2) phase succeeds but lstat(2) or lstatlite(2) fails (file deleted, unavailable, etc.), the d_stat_err field contains the errno from the stat call.
The readdirplus_r(2)/readdirlite_r(2) variants provide a thread-safe API, similar to readdir_r(2).

Slide 15

Lazy I/O data integrity
Specify O_LAZY in the flags argument to open(2).
Requests lazy I/O data integrity.
Allows a network filesystem to relax data coherency requirements to improve performance for shared-write files.
Writes may not be visible to other processes or clients until lazyio_propagate(2), fsync(2), or close(2) is called.
Reads may come from the local cache (ignoring changes to the file on backing storage) until lazyio_synchronize(2) is called.
Does not provide synchronization across processes or nodes; the program must use external synchronization (e.g., pthreads, XSI message queues, MPI) to coordinate actions.
This is a hint only: if the filesystem does not support lazy I/O integrity, it does not have to do anything differently.
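Because O_LAZY is only a hint, a program written against it should still work on a system that has never heard of the flag. The following sketch shows one possible fallback arrangement: O_LAZY is defined as 0 when the headers lack it, a stand-in for propagation uses fsync(2) (which the slide lists as one of the calls that makes writes visible), and the stand-in for synchronization does nothing. The demo_ function names are hypothetical, and the single-process round trip only illustrates the order of calls, not real cross-node behavior.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_LAZY
#define O_LAZY 0   /* flag unavailable here: the hint requests nothing */
#endif

/* Fallback stand-in for propagation: push this descriptor's writes out. */
int demo_propagate(int fd)
{
    assert(fd >= 0);
    return fsync(fd);
}

/* Fallback stand-in for synchronization: without lazy caching there is
 * no stale local copy to discard, so there is nothing to do. */
int demo_synchronize(int fd)
{
    assert(fd >= 0);
    return 0;
}

/* Writer publishes, reader refreshes, then reads. Returns 0 if the reader
 * sees the data. Program order stands in for the external synchronization
 * that separate processes would need. */
int demo_lazy_roundtrip(void)
{
    char path[] = "/tmp/lazydemoXXXXXX";
    const char msg[] = "hello";
    char got[sizeof msg] = {0};
    int rc = -1;
    int wfd = mkstemp(path);
    int rfd;

    if (wfd < 0)
        return -1;
    rfd = open(path, O_RDONLY | O_LAZY);
    if (rfd >= 0) {
        if (write(wfd, msg, sizeof msg) == (ssize_t)sizeof msg &&
            demo_propagate(wfd) == 0 &&
            demo_synchronize(rfd) == 0 &&
            read(rfd, got, sizeof got) == (ssize_t)sizeof got &&
            memcmp(got, msg, sizeof msg) == 0)
            rc = 0;
        close(rfd);
    }
    close(wfd);
    unlink(path);
    return rc;
}
```

With real lazy I/O support, the writer and reader would typically sit on different nodes and exchange a message (for example an MPI barrier) between the propagate and synchronize steps.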

Slide 16

lazyio_{propagate,synchronize}
Syntax
  int lazyio_propagate(int fd, off_t offset, size_t count);
  int lazyio_synchronize(int fd, off_t offset, size_t count);
Description
lazyio_propagate(2) ensures that any cached writes in the specified region have been propagated to the shared copy of the backing file.
lazyio_synchronize(2) ensures that the effects of completed propagations in the specified region from other processes or nodes, on any file descriptor of the backing file, will be reflected in subsequent read(2) and stat(2) calls on this node. Some implementations may accomplish this by invalidating all cached d