SOURCES (LINUX_2_6): linux-2.6-x8664-bitops-fix-for-size-optimized...
pluto
pluto at pld-linux.org
Sun Oct 30 11:47:15 CET 2005
Author: pluto Date: Sun Oct 30 10:47:11 2005 GMT
Module: SOURCES Tag: LINUX_2_6
---- Log message:
- amd64 bitops fix for `-Os` optimized kernel.
---- Files affected:
SOURCES:
linux-2.6-x8664-bitops-fix-for-size-optimized-kernel.patch (NONE -> 1.1.2.1) (NEW)
---- Diffs:
================================================================
Index: SOURCES/linux-2.6-x8664-bitops-fix-for-size-optimized-kernel.patch
diff -u /dev/null SOURCES/linux-2.6-x8664-bitops-fix-for-size-optimized-kernel.patch:1.1.2.1
--- /dev/null Sun Oct 30 11:47:11 2005
+++ SOURCES/linux-2.6-x8664-bitops-fix-for-size-optimized-kernel.patch Sun Oct 30 11:47:06 2005
@@ -0,0 +1,215 @@
+Subject: amd64 bitops fix for -Os
+From: Alexandre Oliva <aoliva at redhat.com>
+Date: Sat, 29 Oct 2005 19:56:02 -0200
+
+https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=171672
+
+This patches fixes a bug that comes up when compiling the kernel for
+x86_64 optimizing for size. It affects 2.6.14 for sure, but I'm
+pretty sure many earlier kernels are affected as well.
+
+The symptom is that, as soon as some change is made to the root
+filesystem (e.g. dmesg > /var/log/dmesg), the kernel mostly hangs. It
+was not the first time I'd run into this symptom, but this time I
+could track the problem down to enabling size optimizations in the
+kernel build. It took some time to narrow down the culprit source
+with a binary search, compiling part of the kernel sources with -Os
+and part with -O2, but eventually it was clear that bitops itself was
+to blame, which should have been clear from the soft lockup oops I
+got.
+
+The problem is that find_first_zero_bit() fails when called with an
+underflown size, because its inline asm assumes at least one iteration
+of scasq will run. When this does not hold, the conditional branch
+that follows it uses flags from instructions prior to the asm
+statement.
+
+When optimizing for speed, the generated code is such that the flags
+will have the correct value, because of the side effects on flags of
+the right shift of the size, that survive through to the asm
+statement. When optimizing for size, however, the mov instruction
+used to initialize %rax with -1 is replaced with a smaller or
+instruction, that modifies the flags and thus breaks the
+zero-trip-count case.
+
+Obviously the asm statement must not rely on the compiler setting up
+flags by chance, so we have to either force the flags to be set
+properly or make sure we run scasq at least once. In teh
+find_first_zero_bit case, this comes at pretty much no cost, since we
+already test size for non-zero, but we used to do that adjusting it
+from bits to words; changing it should have no visible effect on
+performance.
+
+As for find_first_bit, it's quite likely that the same bug is present
+when it's called by find_next_bit in the same conditions, but
+find_first_bit doesn't even test for zero. AFAICT, it has just been
+luckier, so I went ahead and added the same guard code to it. This
+unfortunately adds a test to the fast path, but I don't see how to
+avoid that without auditing all callers.
+
+I actually introduce means to guard against these cases in the public
+wrappers, but the BUG_ONs are disabled by default. I've left a kernel
+running with them enabled for a bit, and they never hit, which is a
+good sign, but I haven't tested it thoroughly or anything. We could
+probably do away with these new tests by modifying the find_next*bit
+functions so as to not call the find_first*bit functions if they've
+already exhausted the range implied by the size argument. I'm not
+sure whether that's worth doing, though, so I didn't.
+
+While staring at the code and trying to figure out what the problem
+was, I removed some needless casts from find_next_zero_bit, by
+constifying the automatic pointer properly, and also moved the actual
+code from find_first_zero_bit to a separate internal function, such
+that we could add the bug-check to the public interface only.
+
+I also noticed find_first_zero_bit was less efficient than
+find_first_bit in that the former saved and restored rbx, because GCC
+chose that to hold (addr) within the asm statement, instead of using
+the readily-available and caller-saved rsi. I've thus changed the
+code to prefer rsi, although in a perfect world the compiler would be
+able to figure that out by itself.
+
+The compiler could do a bit better in find_first_zero_bit: if the
+initial size turns out to be zero, it could return, like it does in
+find_first_bit, but instead of sets rdx to zero and jumps to the end
+of the function where rdx is copied to rax before the return
+statement. This is a negative effect of the assignment of variable
+res to rdx instead of rax, which gets the register allocator to map
+the pseudo register representing the return value to rdx, requiring a
+copy at the end and preventing (as far as the dumb compiler can see
+:-) the direct use of a return in the zero-size case. I've verified
+that this is not caused by the additional inline function that I
+introduced.
+
+I tried to change the use of registers so as to enable the better code
+for this path, but I couldn't come up with anything that was as
+efficient, so I figured I wouldn't try to optimize the exceptional
+path in expense of the common fast path and left it alone. If anyone
+can come up with something better, please go ahead.
+
+
+Anyhow, with this patch I could run 2.6.14, as in the Fedora
+development tree, except for the change to optimize for size.
+
+ Signed-off-by: Alexandre Oliva <oliva at lsd.ic.unicamp.br>
+
+--- arch/x86_64/lib/bitops.c~ 2005-10-27 22:02:08.000000000 -0200
++++ arch/x86_64/lib/bitops.c 2005-10-29 18:24:27.000000000 -0200
+@@ -1,5 +1,11 @@
+ #include <linux/bitops.h>
+
++#define BITOPS_CHECK_UNDERFLOW_RANGE 0
++
++#if BITOPS_CHECK_UNDERFLOW_RANGE
++# include <linux/kernel.h>
++#endif
++
+ #undef find_first_zero_bit
+ #undef find_next_zero_bit
+ #undef find_first_bit
+@@ -13,11 +19,21 @@
+ * Returns the bit-number of the first zero bit, not the number of the byte
+ * containing a bit.
+ */
+-inline long find_first_zero_bit(const unsigned long * addr, unsigned long size)
++static inline long
++__find_first_zero_bit(const unsigned long * addr, unsigned long size)
+ {
+ long d0, d1, d2;
+ long res;
+
++ /* We must test the size in words, not in bits, because
++ otherwise incoming sizes in the range -63..-1 will not run
++ any scasq instructions, and then the flags used by the je
++ instruction will have whatever random value was in place
++ before. Nobody should call us like that, but
++ find_next_zero_bit() does when offset and size are at the
++ same word and it fails to find a zero itself. */
++ size += 63;
++ size >>= 6;
+ if (!size)
+ return 0;
+ asm volatile(
+@@ -30,11 +46,22 @@
+ " shlq $3,%%rdi\n"
+ " addq %%rdi,%%rdx"
+ :"=d" (res), "=&c" (d0), "=&D" (d1), "=&a" (d2)
+- :"0" (0ULL), "1" ((size + 63) >> 6), "2" (addr), "3" (-1ULL),
+- [addr] "r" (addr) : "memory");
++ :"0" (0ULL), "1" (size), "2" (addr), "3" (-1ULL),
++ /* Any register here would do, but GCC tends to
++ prefer rbx over rsi, even though rsi is readily
++ available and doesn't have to be saved. */
++ [addr] "S" (addr) : "memory");
+ return res;
+ }
+
++long find_first_zero_bit(const unsigned long * addr, unsigned long size)
++{
++#if BITOPS_CHECK_UNDERFLOW_RANGE
++ BUG_ON (size + 63 < size);
++#endif
++ return __find_first_zero_bit (addr, size);
++}
++
+ /**
+ * find_next_zero_bit - find the first zero bit in a memory region
+ * @addr: The address to base the search on
+@@ -43,7 +70,7 @@
+ */
+ long find_next_zero_bit (const unsigned long * addr, long size, long offset)
+ {
+- unsigned long * p = ((unsigned long *) addr) + (offset >> 6);
++ const unsigned long * p = addr + (offset >> 6);
+ unsigned long set = 0;
+ unsigned long res, bit = offset&63;
+
+@@ -63,8 +90,8 @@
+ /*
+ * No zero yet, search remaining full words for a zero
+ */
+- res = find_first_zero_bit ((const unsigned long *)p,
+- size - 64 * (p - (unsigned long *) addr));
++ res = __find_first_zero_bit (p, size - 64 * (p - addr));
++
+ return (offset + set + res);
+ }
+
+@@ -74,6 +101,17 @@
+ long d0, d1;
+ long res;
+
++ /* We must test the size in words, not in bits, because
++ otherwise incoming sizes in the range -63..-1 will not run
++ any scasq instructions, and then the flags used by the jz
++ instruction will have whatever random value was in place
++ before. Nobody should call us like that, but
++ find_next_bit() does when offset and size are at the same
++ word and it fails to find a one itself. */
++ size += 63;
++ size >>= 6;
++ if (!size)
++ return 0;
+ asm volatile(
+ " repe; scasq\n"
+ " jz 1f\n"
+@@ -83,8 +121,7 @@
+ " shlq $3,%%rdi\n"
+ " addq %%rdi,%%rax"
+ :"=a" (res), "=&c" (d0), "=&D" (d1)
+- :"0" (0ULL),
+- "1" ((size + 63) >> 6), "2" (addr),
++ :"0" (0ULL), "1" (size), "2" (addr),
+ [addr] "r" (addr) : "memory");
+ return res;
+ }
+@@ -99,6 +136,9 @@
+ */
+ long find_first_bit(const unsigned long * addr, unsigned long size)
+ {
++#if BITOPS_CHECK_UNDERFLOW_RANGE
++ BUG_ON (size + 63 < size);
++#endif
+ return __find_first_bit(addr,size);
+ }
+
================================================================
More information about the pld-cvs-commit
mailing list