[Bash-completion-devel] more testing of disk prefetch

Peter Cordes peter at cordes.ca
Wed Dec 3 18:22:49 UTC 2014


 I redid my patch series to include just the things I'm sure are
bugfixes for git HEAD, and redid their commit messages with line
wrapping.  I'll send a quoting-cleanup patch later, with more changes
bundled into one patch, once I'm ready to sign off on it.


 I've played around with my prefetch idea, and now I'm happier with
it, having explored the possibilities a bit.  It'd be neat if some
people could test this on their own systems, esp. ones with any kind
of slow hard drive or slow CPU, or a crufty /etc/bash_completion.d
with a lot of crap in it.

 Sending an email chock-full of everything that's worth writing down,
just so it's in the list archives in case anyone ever wants it.  I
didn't want to stick most of this into a git commit or a bug comment.



 It seems to be hard to get Linux to really drop filesystem caches.
Either that, or my SATA hard drive's internal cache is big enough that
it doesn't need to seek around for the requested data when all you're
doing is  time . ./bash_completion  between cache drops.  The drive
cache seems the more likely explanation: repeat runs are faster than
the very first run after leaving these files untouched for a while,
but nowhere near as fast as with a hot cache.
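
 A way to check whether drop_caches actually evicted these files is
to ask how many of their pages are still resident (this assumes the
third-party vmtouch utility, which isn't part of my setup here):

$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
$ vmtouch /etc/bash_completion.d/* | grep 'Resident Pages'

drop_caches can't do anything about the drive's internal cache either
way.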


Intel Core2 Duo E6600 (2.4GHz, 2 non-hyperthreaded cores) 5GB RAM
/etc on ext4 (noatime) on a WD10EARS-00M  (fairly old green-power magnetic HD)
/usr/local/src/bash-completion  on xfs on the same HD

One point of interest with these results is that the very FIRST
no-prefetch test is FAR slower than any of the others.  IDK how to
get my system back to a state where it will take that long again; if
anyone has any suggestions, that'd be great.  The HD churn from
grepping a Linux kernel tree helps some, but it's not enough.


alias churn_hd="grep -r  --include '*.c' xxxxx /usr/local/src/linux/ubuntu-trusty/ >/dev/null"
alias dropcache="sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null; sleep 6"
# the sleep 6 gives things time to settle down right after dropping
# caches.  Otherwise the re-reading of constantly-accessed data
# contends with bash_completion, which is realistic but less
# repeatable.
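
 For reference, the number written to drop_caches selects what gets
dropped (per the kernel's Documentation/sysctl/vm.txt):

echo 1 > /proc/sys/vm/drop_caches   # free pagecache only
echo 2 > /proc/sys/vm/drop_caches   # free dentries and inodes
echo 3 > /proc/sys/vm/drop_caches   # free pagecache, dentries and inodes

It only drops clean entries (hence the sync first), and as noted
above it can't touch the drive's own cache.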


  no prefetch
$ for i in {0..4};do  dropcache ; time BASH_COMPLETION_DISABLE_PREFETCH=1 . ./bash_completion;churn_hd; done 2>&1 |
  	grep '^real' --line-buffered | 
	perl -ne '/m(.*)s/ and $tot+=$1 and ++$c; print; END { print "  avg: ", $tot/$c, "\n"; }'
real    0m1.383s
real    0m0.605s
real    0m0.893s
real    0m0.797s
real    0m0.677s
 avg (of last 4 only): 0.743

 My first version of prefetch: just cat
  prefetch with:  ( exec cat $glob &>/dev/null </dev/null )& disown $!
for i in {0..4}; do dropcache; time . ./bash_completion; churn_hd; done 2>&1 | grep '^real'
real    0m0.761s      0m0.557s    0m0.821s    0m0.558s    0m0.749s (collapsed for readability)
  avg: 0.6892

with hot cache I get: avg: 0.0906



 Trying to get fancier: fork off cat, and then access all the inodes
of the files we will want, to get a deeper read queue depth.

prefetch with
(
	shopt -s failglob  # cat doesn't run at all with no matches, even if nullglob is on.
	# I was worried that cat could get stuck reading from stdin,
	# but ( cat & ) redirects stdin from /dev/null because subshells don't have job control.
	exec &>/dev/null  # quash output of cat and any bash failglob errors
	cat $glob &	   # contents
	true $glob/	   # inodes.  just expanding this glob will stat(2), no ls -dL needed.
) &
disown $! 	# don't pollute job control

real    0m1.431s    0m0.665s    0m0.594s    0m0.713s    0m0.570s
  avg (of last 4 only): 0.6355
real    0m0.701s    0m0.557s    0m0.713s    0m0.594s    0m0.797s
  avg: 0.6724
real    0m0.594s    0m0.977s    0m0.558s    0m0.737s    0m0.593s
  avg: 0.6918
real    0m0.893s    0m0.570s    0m0.749s    0m0.606s    0m0.569s
  avg: 0.6774
avg of averages: 0.669s (still excluding the 1.4sec outlier)


observations from strace:
$ for i in {0..4}; do sync; dropcache; (ls -dLF /; cat /dev/null; ) > /dev/null;
  strace -o prefetch.$i.glob-inode.cat-bg.strace -tt -s 256 -f -e trace='!rt_sigprocmask,rt_sigaction' \
  	 bash -c "time . ./bash_completion";
  churn_hd;
done 2>&1 | ...
real    0m1.135s    0m0.591s    0m0.590s    0m0.697s    0m0.588s
  avg: 0.6165 (excluding the outlier).  Faster, I think, because the
clock doesn't start until bash itself has loaded, which gets libc and
all that cached first.

in the cached case, strace bash -c 'time . ./bash_completion'
real    0m0.299s
user    0m0.103s
sys     0m0.064s

 cat's read() system calls weren't always returning at the same time
as bash's.  While bash was chewing on some of the bigger files, esp.
ones whose sourcing made it stat(2) around looking for other files
(e.g. grub did a lot of stuff), the cat process got ahead of bash and
pulled files into cache, so bash's read(2) returned right away when
it got there.  This is exactly the behaviour I was trying to get from
a prefetch process.  IDK if drop_caches isn't clearing inode caches
or something, but all the stat(2) system calls from the prefetch
subshell completed with no delay.  Or maybe almost all the relevant
inodes are near each other on disk, and got read together in one
block?  Anyway, once one of them completes, it's just boom, a whole
sequence of stat calls goes through while the other processes are
stuck on something.
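
 (If anyone wants to repeat this: with -f and -o, strace prefixes
each line with the pid and then the -tt timestamp, so something like

$ grep -E ' (open(at)?|read|stat)\(' prefetch.0.glob-inode.cat-bg.strace | sort -k2,2

interleaves both processes' calls in wall-clock order.  Adjust the
syscall names if your libc uses different variants.)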

 I had been doing inode prefetch by running ls -dL on the glob, but
that always slowed things down by maybe 0.08s compared to cat-only
prefetch, regardless of whether I forked cat & and exec'd ls, or vice
versa.  I think the main problem was just ls finding all its
libraries and the rest of its startup overhead.  Expanding $glob/ to
make bash stat everything to see if it's a directory is a pretty neat
idea, IMHO.  And it might help in a case where not all the inodes
load together.  If they did, just exec'ing cat in the background
subshell would get it done as part of opening the files.  I'm leaving
it in on the theory that inodes might be near each other on disk even
if they don't all come in with the same disk read, so it's better to
prefetch all the stat info before reading file contents.
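
 You can watch the trailing-slash trick generate the stat calls
mentioned above (the exact stat variant may differ on other systems,
and the path here is just an example):

$ strace -e trace=stat,lstat bash -c ': /etc/bash_completion.d/*/' 2>&1 | grep completion.d

Each directory entry gets stat()ed while bash works out which matches
are really directories, with no extra process involved.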

 On cygwin, the extra CPU time from the stat system calls, even in a
background process, might be a bad thing.  cygwin stat is really slow,
last I heard.  And so is fork.  cygwin users might want to set
BASH_COMPLETION_DISABLE_PREFETCH=1, if this patch makes it in.
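
 i.e. something like this in ~/.bashrc, before bash_completion gets
sourced (the variable only means anything with this patch applied):

BASH_COMPLETION_DISABLE_PREFETCH=1
. /etc/bash_completion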

 On a single-core CPU, it could be a very small slowdown: cat makes
a fair number of system calls.  The copying of RAM around isn't a big
deal; the files are all very small, none big enough for cat to need
more than one size=65536 read.  Esp. since write(2) to /dev/null just
returns without doing anything, so there's no extra copying there.
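
 Easy to check what cat is actually doing, if anyone's curious:

$ strace -e trace=read,write cat ./bash_completion >/dev/null 2>cat.trace
$ grep 'read(3' cat.trace | head -2

(the input file lands on fd 3, since 0-2 are already taken.)  That
shows the size=65536 requests, paired with write() calls that
/dev/null completes without copying anything.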


 There is an fadvise(1) tool, which would be perfect.
(POSIX_FADV_WILLNEED does exactly what cat does, but without copying
the data to userspace or writing it anywhere.  It blocks until the
data is cached, if the readahead queue fills up.)  However, the
current implementation is written in perl, so it's not worth starting
it for a handful of small files.  It probably takes about as much
disk IO to start it as the entire bash_completion load does.


 I tested having the prefetch process wait until it had finished
stat(2)ing all the files before running cat, and it performs
essentially the same.  I think I'll go with this version: if the
inodes do generate a significant amount of disk activity, that plus
cat's read requests could make an annoying hiccup of disk load that
might interfere with something else the user had running.  Prob.
better to err on the side of being less aggressive with read queue
depth.  Also, this way it only forks once.

# prefetch with:
(
	shopt -s failglob	# don't even run cat if no matches
	exec &>/dev/null
	true $glob/		# stat all the inodes first...
	exec cat $glob		# ...then read the contents
) & disown $!
real    0m1.143s    0m0.605s    0m0.677s    0m0.725s    0m0.546s
  avg: 0.63825 (last 4 only)
real    0m0.689s    0m0.558s    0m0.785s    0m0.606s    0m0.833s
  avg: 0.6942
real    0m0.629s    0m0.773s    0m0.581s    0m0.749s    0m0.594s
  avg: 0.6652
average of averages: 0.666s

 This is the version in the patch I'm attaching.
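
 For anyone who wants to play with it without applying the patch, the
whole thing wired together looks roughly like this (the compat dir
path is hard-coded here just for the example; don't take this as
exactly what the patch does):

compat_dir=/etc/bash_completion.d	# example path
if [[ -z $BASH_COMPLETION_DISABLE_PREFETCH ]]; then
	(   # background subshell: warms the cache while the main shell keeps parsing
		shopt -s failglob	# don't even run cat if no matches
		exec &>/dev/null	# quash cat output and failglob errors
		true "$compat_dir"/*/	# stat all the inodes first...
		exec cat "$compat_dir"/*	# ...then read the contents
	) &
	disown $!	# don't pollute job control
fi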



 I also tested with  (ls -dLF /; cat /dev/null; ) >/dev/null  ahead
of the timed part, to get some essential stuff cached again.

  prefetch:
(...;	true $glob/
	exec cat $glob ) & disown $!
$ for i in {0..4}; do sync; dropcache; (ls -dLF /; cat /dev/null; ) >/dev/null;
  time . ./bash_completion; churn_hd; done 2>&1 | ...
real    0m1.033s    0m0.472s    0m0.581s    0m0.545s    0m0.749s
  avg: 0.58675 (last 4)
real    0m0.533s    0m0.689s    0m0.509s    0m0.665s    0m0.533s
  avg: 0.5858
real    0m0.665s    0m0.521s    0m0.713s    0m0.533s    0m0.617s
  avg: 0.6098
real    0m0.485s    0m0.605s    0m0.545s    0m0.628s    0m0.581s
  avg: 0.5688
avg of averages: 0.588s


no-prefetch with that ls and cat outside the timed part:
$ for i in {0..4}; do sync; dropcache; (ls -dLF /; cat /dev/null; ) >/dev/null; 
	time BASH_COMPLETION_DISABLE_PREFETCH=1 . ./bash_completion; churn_hd;
  done 2>&1 | grep '^real' --line-buffered | ...
real    0m1.195s
real    0m0.641s
real    0m0.700s
real    0m0.665s
real    0m0.701s
  avg: 0.67675 (only last 4)
real    0m0.617s
real    0m0.706s
real    0m0.617s
real    0m0.700s
real    0m0.641s
  avg: 0.6562
real    0m0.821s
real    0m0.629s
real    0m0.845s
real    0m0.653s
real    0m0.917s
  avg: 0.773
real    0m0.677s
real    0m0.773s
real    0m0.605s
real    0m1.169s
real    0m0.617s
  avg: 0.668 (excluding outlier)
avg of averages: 0.693s

 So, in this case prefetch is saving about 0.1s out of 0.7, or a 15%
speedup.  Not as big a win as I thought it was getting, but still
pretty decent.

It's a good thing bash doesn't support inline assembly,
or I'd be at this for weeks... :P

 Seriously though, a lot of people are stuck waiting for bash for a
second or so, and speeding it up a bit is worth putting effort into,
IMO.

-- 
#define X(x,y) x##y
Peter Cordes ;  e-mail: X(peter at cor , des.ca)

"The gods confound the man who first found out how to distinguish the hours!
 Confound him, too, who in this place set up a sundial, to cut and hack
 my day so wretchedly into small pieces!" -- Plautus, 200 BC

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-_longopt-fix-parsing-help-output-that-has-in-the-des.patch
Type: text/x-diff
Size: 3558 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/bash-completion-devel/attachments/20141203/eca9f893/attachment.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-upstart-support-for-service-completion.patch
Type: text/x-diff
Size: 1481 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/bash-completion-devel/attachments/20141203/eca9f893/attachment-0001.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-speed-up-loading-the-compat-dir-with-disk-prefetch.patch
Type: text/x-diff
Size: 4065 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/bash-completion-devel/attachments/20141203/eca9f893/attachment-0002.patch>

