From nobody@FreeBSD.org Sun Feb 5 11:42:53 2012
Return-Path:
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E64781065670
	for ; Sun, 5 Feb 2012 11:42:53 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from red.freebsd.org (red.freebsd.org [IPv6:2001:4f8:fff6::22])
	by mx1.freebsd.org (Postfix) with ESMTP id BB03C8FC16
	for ; Sun, 5 Feb 2012 11:42:53 +0000 (UTC)
Received: from red.freebsd.org (localhost [127.0.0.1])
	by red.freebsd.org (8.14.4/8.14.4) with ESMTP id q15BgrA0041310
	for ; Sun, 5 Feb 2012 11:42:53 GMT
	(envelope-from nobody@red.freebsd.org)
Received: (from nobody@localhost)
	by red.freebsd.org (8.14.4/8.14.4/Submit) id q15Bgrh6041302;
	Sun, 5 Feb 2012 11:42:53 GMT
	(envelope-from nobody)
Message-Id: <201202051142.q15Bgrh6041302@red.freebsd.org>
Date: Sun, 5 Feb 2012 11:42:53 GMT
From: Nicolas Bourdaud
To: freebsd-gnats-submit@FreeBSD.org
Subject: 'write' system call violates POSIX standard
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         164793
>Category:       standards
>Synopsis:       'write' system call violates POSIX standard
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Feb 05 11:50:08 UTC 2012
>Closed-Date:
>Last-Modified:  Wed Feb 15 15:00:18 UTC 2012
>Originator:     Nicolas Bourdaud
>Release:        FreeBSD 9.0-RELEASE
>Organization:
>Environment:
GNU/kFreeBSD debian-bsd-amd64 9.0-RELEASE FreeBSD 9.0-RELEASE #0: Tue Jan 3 07:46:30 UTC 2012 root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC x86_64 amd64 Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz GNU/kFreeBSD
>Description:
When a write() cannot transfer as many bytes as requested (because of a
file limit), it fails instead of transferring as many bytes as there is
room to write.

This is a violation of the POSIX standard:
http://pubs.opengroup.org/onlinepubs/007904975/functions/write.html
>How-To-Repeat:
fsize-lim.c.txt (attached) illustrates the problem.

With a FreeBSD kernel, the output is:
failed when adding 27 bytes after 59994 bytes (error: File too large)

The expected output (like with a Linux kernel) should be:
added 6 bytes instead of 27 bytes after 59994 bytes
failed when adding 27 bytes after 60000 bytes (error: File too large)
>Fix:


Patch attached with submission follows:

#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <errno.h>
#include <string.h>
#include <stdio.h>

#define TARGETSIZE	80000
#define LIMSIZE		60000
#define PATTSIZE	27

int main(void)
{
	struct rlimit lim;
	int fd;
	ssize_t retc;
	size_t count = 0;
	const char pattern[PATTSIZE] = "Hello world!";

	/* Ignore SIGXFSZ so that hitting the limit shows up as EFBIG. */
	signal(SIGXFSZ, SIG_IGN);
	lim.rlim_cur = LIMSIZE;
	setrlimit(RLIMIT_FSIZE, &lim);

	fd = open("result.txt", O_WRONLY|O_CREAT|O_TRUNC, S_IRUSR|S_IWUSR);

	while (count < TARGETSIZE) {
		retc = write(fd, pattern, PATTSIZE);
		if (retc < PATTSIZE && retc > 0)
			fprintf(stderr, "added %zi bytes instead of %u bytes after %zu bytes\n",
			        retc, PATTSIZE, count);
		else if (retc < 0) {
			fprintf(stderr, "failed when adding %u bytes after %zu bytes (error: %s)\n",
			        PATTSIZE, count, strerror(errno));
			break;
		}
		count += retc;
	}

	close(fd);
	return 0;
}

>Release-Note:
>Audit-Trail:

From: Bruce Evans
To: Nicolas Bourdaud
Cc: freebsd-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/164793: 'write' system call violates POSIX standard
Date: Mon, 6 Feb 2012 05:54:50 +1100 (EST)

On Sun, 5 Feb 2012, Nicolas Bourdaud wrote:

>> Description:
> When a write() cannot transfer as many bytes as requested (because of a file
> limit), it fails instead of transferring as many bytes as there is room to
> write.
>
> This is a violation of the POSIX standard:
> http://pubs.opengroup.org/onlinepubs/007904975/functions/write.html

FreeBSD's handling of the maxfilesize limits is similar, so it has the
same bug.
This affects many filesystems which copied the buggy code from ffs.
(Both truncate() and write() fail if extending to or writing the full
number of bytes would exceed the limit. This is correct for truncate(),
but write() is required to creep up on the limit.)

I think this is actually a bug in POSIX (XSI). Most programs aren't
prepared to deal with short writes, and returning an error like
truncate() is specified to is adequate.

For regular files, most file systems in FreeBSD back out of writes after
an i/o error, using ftruncate() (some truncation is necessary for
security, since the place at which the error occurred is usually not
known precisely), so the following bug in the upper layer rarely matters.
From an old version of sys_generic.c, for writing (reading has a similar
bug):

% 	if ((error = fo_write(fp, &auio, td->td_ucred, flags, td))) {
% 		/* XXX short write botch. */
% 		if (auio.uio_resid != cnt && (error == ERESTART ||
% 		    error == EINTR || error == EWOULDBLOCK))
% 			error = 0;

The XXX comment is only in my version. Here (auio.uio_resid != cnt)
means that some i/o was done. In that case, write() is required to
return the amount done, with no error, which is implemented by setting
`error' to 0. But this is only done if `error' is one of ERESTART,
EINTR or EWOULDBLOCK. At least the case of the most common error that
is not one of these, namely EIO, is broken. The handling of the special
3 here is delicate:
- ERESTART: hopefully can't happen, since if it happens then we should
  restart. This error is a non-error that in most cases means that we
  handled a signal but are not returning with EINTR because SA_RESTART
  says to restart instead of returning.
- EINTR: since we have this and not ERESTART, it is clearly correct to
  return, but if we did some i/o then we must return its amount and
  there is no way to return EINTR.
- EWOULDBLOCK: similar to EINTR for a SIGALRM, but more precise.
  I guess this is here since it is the only other common error, and it
  is not really an error so failing for it would be obviously wrong
  (except when no i/o was done, EWOULDBLOCK = EAGAIN is the standard way
  to indicate this).

The flag that controls backing out of writes is IO_UNIT. This is always
set for write(2), and probably should be set unconditionally (so it
shouldn't exist), since not setting it mainly asks for security holes
and most cases are write(2) anyway. IO_UNIT means that the i/o is done
as an "atomic unit". The semantics of "unit" probably includes doing
all of it or none of it, so it would have to be broken to match the
POSIX spec.

> Patch attached with submission follows:
> ...
> int main(void)
> {
> 	struct rlimit lim;
> 	int fd;
> 	ssize_t retc;
> 	size_t count = 0;
> 	const char pattern[PATTSIZE] = "Hello world!";
>
> 	signal(SIGXFSZ, SIG_IGN);
> 	lim.rlim_cur = LIMSIZE;
> 	setrlimit(RLIMIT_FSIZE, &lim);

This is missing initialization of at least lim.rlim_max in lim. This
gave the bizarre behaviour that when the program was statically linked,
it failed for the first write, because the stack garbage for
lim.rlim_max happened to be 0.

Bruce

From: Nicolas Bourdaud
To: Bruce Evans
Cc: freebsd-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/164793: 'write' system call violates POSIX standard
Date: Wed, 15 Feb 2012 14:13:31 +0100

On 05/02/2012 19:54, Bruce Evans wrote:
> I think this is actually a bug in POSIX (XSI). Most programs aren't
> prepared to deal with short writes, and returning an error like
> truncate() is specified to is adequate.

I disagree. I think that most programs that check that the write
succeeded also check that the write was complete. Actually, it was
because my programs were assuming the POSIX behavior that I noticed the
bug. In addition, I think (this must be confirmed) that the bug doesn't
affect version 8.2... So the programs are already facing the POSIX
behavior. Moreover, the programs that are cross platform (in particular
ported to Linux) are already facing this behavior.

Whatever is decided, either FreeBSD should conform to the POSIX
standard or the standard should be changed.

>> Patch attached with submission follows:
>> ...
>> int main(void)
>> {
>> 	struct rlimit lim;
>> 	int fd;
>> 	ssize_t retc;
>> 	size_t count = 0;
>> 	const char pattern[PATTSIZE] = "Hello world!";
>>
>> 	signal(SIGXFSZ, SIG_IGN);
>> 	lim.rlim_cur = LIMSIZE;
>> 	setrlimit(RLIMIT_FSIZE, &lim);
>
> This is missing initialization of at least lim.rlim_max in lim. This
> gave the bizarre behaviour that when the program was statically linked,
> it failed for the first write, because the stack garbage for
> lim.rlim_max happened to be 0.

Yes, I forgot one line:
getrlimit(RLIMIT_FSIZE, &lim);
just before "lim.rlim_cur = LIMSIZE;"

Best regards
Nicolas

From: Bruce Evans
To: Nicolas Bourdaud
Cc: Bruce Evans, freebsd-gnats-submit@freebsd.org, freebsd-bugs@freebsd.org
Subject: Re: kern/164793: 'write' system call violates POSIX standard
Date: Thu, 16 Feb 2012 01:55:54 +1100 (EST)

On Wed, 15 Feb 2012, Nicolas Bourdaud wrote:

> On 05/02/2012 19:54, Bruce Evans wrote:
>> I think this is actually a bug in POSIX (XSI). Most programs aren't
>> prepared to deal with short writes, and returning an error like
>> truncate() is specified to is adequate.
>
> I disagree. I think that most programs that check that the write
> succeeded also check that the write was complete. Actually, it was

Well, in BSD, programs that don't understand short writes start with the
cp utility in 4.4BSD (it checks for short writes, but then mishandles
them by treating them as errors). This wasn't fixed in FreeBSD until
1998.

> because my programs were assuming the POSIX behavior that I noticed the
> bug. In addition, I think (this must be confirmed) that the bug doesn't
> affect version 8.2... So the programs are already facing the POSIX

No, it was in 4.4BSD, and hasn't been changed in FreeBSD since 1994.
8.2 only differs in having the check in all file systems instead of in
vfs. Perhaps some file systems got it right, but ffs didn't.

> behavior. Moreover, the programs that are cross platform (in particular
> ported to Linux) are already facing this behavior.
>
> Whatever is decided, either FreeBSD should conform to the POSIX
> standard or the standard should be changed.

It must conform, since it is too late to fix standards.

I forgot about this when I looked at ffs's handling of i/o errors
recently. There are many more bugs.
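
For reference, the size-limit behaviour that POSIX asks of write() can be sketched in user space. This is only an illustration, not FreeBSD kernel code: the helper name clamp_to_limit and its signature are assumptions made for this example. With the numbers from the submitter's test (limit 60000, 27-byte writes) it allows 6 bytes at offset 59994 and fails with EFBIG at offset 60000.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (an assumption for illustration only): given the
 * current file offset, the requested byte count and the file size limit,
 * return how many bytes a conforming write() would transfer.
 * Returns -1 and sets errno to EFBIG when there is no room for any byte.
 */
static int64_t
clamp_to_limit(int64_t offset, int64_t count, int64_t limit)
{
	int64_t room;

	if (count == 0)
		return 0;		/* nothing requested, nothing to refuse */
	if (offset >= limit) {
		errno = EFBIG;		/* no room for even one byte */
		return -1;
	}
	room = limit - offset;		/* bytes left below the limit */
	return count < room ? count : room;	/* short write if needed */
}
```

The caller would then transfer only the returned number of bytes and report that count, which is the "creep up on the limit" behaviour discussed above.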

ffs normally tries to back out of writes completely after an i/o error,
by using ftruncate() to return to the original file size. Garbage
written to the disk or memory is too hard to back out of, but ffs avoids
security holes by zeroing it in memory (in case it is mmap()ed) and by
making it inaccessible by normal means on the disk (ftruncate() does
this). When the error is ENOSPC due to a full disk, this gives the same
behaviour as ffs has now for EFBIG for the file size being too big (due
to the maximum size for the file system, or the rlimit). POSIX has
looser wording for the ENOSPC error. It says that ENOSPC shall be
returned if there "was" no space... This can be interpreted as
requiring the same things as EFBIG -- that if there was any space to
begin with, ENOSPC is not required to be returned; presumably the
write() should succeed in writing as much as possible since there is no
other reasonable error. But ffs's behaviour is "correct" here.

The most broken case here is for an i/o error for a write in the middle
of a file. Then it is not reasonable to try to back out. ffs doesn't
do the ftruncate() in this case. But it still tries to back out. This
results in write() returning -1/EIO. This is wrong if something has
been successfully written. On second thoughts, it is the best possible
behaviour. Everything in the region of the file covered by the write()
may have been clobbered, either by writing the requested bytes, or by a
hardware or software error writing garbage, or by the intentional
zeroing for security. The only way to tell the application about this
is to say that the whole write failed. The application should assume
that the entire region has been clobbered, and take steps to check and
limit the extent of the damage, perhaps by trying to rewrite it all in
smaller pieces.
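
On the application side, coping with short writes (and with retrying in pieces) might look like the following sketch. The helper name write_all and its out-parameter are hypothetical, chosen for this example; it simply loops until every byte is accepted, retries on EINTR, and reports how far it got when a real error occurs.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration only): write all
 * `len' bytes of `buf' to `fd', continuing after short writes.
 * Returns 0 on success, or -1 with errno set by write(2).
 * In both cases *done holds the number of bytes actually written.
 */
static int
write_all(int fd, const void *buf, size_t len, size_t *done)
{
	const char *p = buf;
	size_t off = 0;

	while (off < len) {
		ssize_t n = write(fd, p + off, len - off);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before any i/o: retry */
			*done = off;		/* tell the caller how far we got */
			return -1;
		}
		off += (size_t)n;		/* short write: go around again */
	}
	*done = off;
	return 0;
}
```

With a conforming write(), a caller that hits the file size limit would see *done stop at the limit and errno set to EFBIG on the following attempt.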

There seem to be more bugs in [f]truncate():
- POSIX requires SIGXFSZ for attempts to exceed the file size rlimit in
  truncate() too, but FreeBSD doesn't even check the rlimit for
  truncate().

Checking the rlimit in vfs makes all this easier to fix. I think
write() can be fixed in a couple of lines in vfs. All file systems call
back to vfs to check, though I don't know of any requirement for other
errors to have precedence, so vfs could check up front. zfs's write
vnop actually calls back to vfs before doing anything else, so this
error already has precedence over all fs-specific errors for zfs. All
other file systems' write vnop do the check a fair way into the vnop in
much the same place as ffs. No file systems check the limit for
truncate(). The limit checking is commented out in xfs's write vnop.

Bruce
>Unformatted: