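This is the script for the talk I gave at YAPC::EU::2002. It is not a transcript of what I actually said; more accurately, it is the pod notes that were converted into slides, plus notes scribbled on paper or kept in my head, which I have tried to assemble into a coherent article. I have also tried to add the useful feedback I received - sometimes I can remember who said it, and I thank them here.
You can view the slides here, and I hope it will be obvious where the slides change.
So you have some Perl programs, but they run too slowly, and you want to do something about it. This article describes how to speed a program up, and how to avoid the problem in the first place.
But these may not be practical or politically acceptable solutions.
So you can compromise.
Here is my Perl program, which calls a Perl function rot32. And here is a C function rot32, which takes two integers, rotates the first by the second, and returns an integer result. That is all you need, and it just works.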
#!/usr/local/bin/perl -w
use strict;

printf "$_:\t%08X\t%08X\n", rot32 (0xdead, $_), rot32 (0xbeef, -$_)
  foreach (0..31);

use Inline C => <<'EOC';

unsigned rot32 (unsigned val, int by) {
  if (by >= 0)
    return (val >> by) | (val << (32 - by));
  return (val << -by) | (val >> (32 + by));
}
EOC

__END__
0: 0000DEAD 0000BEEF
1: 80006F56 00017DDE
2: 400037AB 0002FBBC
3: A0001BD5 0005F778
4: D0000DEA 000BEEF0
...
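Are you using the best features of the language?
I trust you are not doing that. But are you keeping your arrays sorted so that you can use a binary search? That's fast. But using a hash should be faster.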
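As a minimal sketch of the hash approach (the word list and the names `@words` and `%is_word` are made up for illustration), a membership test becomes a single hash lookup once the hash has been built:

```perl
use strict;
use warnings;

# Hypothetical data: a list of known words.
my @words = qw(apple banana cherry damson elder fig);

# Build the lookup hash once; only the keys matter, the values stay undef.
my %is_word;
@is_word{@words} = ();

# Each membership test is now one hash lookup, with no search loop.
for my $candidate (qw(cherry grape)) {
    printf "%s: %s\n", $candidate,
        exists $is_word{$candidate} ? "found" : "not found";
}
```

Building the hash costs time and memory up front, so this tends to pay off when the same set is tested many times.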
Using \G and the /gc flags may be faster.
if ( /\G.../gc ) {
  ...
} elsif ( /\G.../gc ) {
  ...
} elsif ( /\G.../gc ) {
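A filled-in sketch of that pattern might look like the following. The input string, the token names and the four rules are assumptions chosen only to make the skeleton runnable. Each regexp is anchored with \G at the current match position, and /c keeps that position when a rule fails, so the string is never cut up or copied:

```perl
use strict;
use warnings;

# Hypothetical input and token names, chosen only for illustration.
my $input = "width = 42 + height";
my @tokens;

for ($input) {
    while (1) {
        if    ( /\G\s+/gc )    { next }                     # skip whitespace
        elsif ( /\G(\d+)/gc )  { push @tokens, "NUM:$1" }
        elsif ( /\G(\w+)/gc )  { push @tokens, "NAME:$1" }
        elsif ( /\G([=+])/gc ) { push @tokens, "OP:$1" }
        else                   { last }                     # end of input, or no rule matched
    }
}

print join(" ", @tokens), "\n";
```

The number rule comes before the name rule because \w also matches digits.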
pack and unpack
undef
Are you calculating something only to throw it away?
For example the script in the Encode module that compiles character conversion tables would print out a warning if it saw the same character twice. If you or I build Perl we'll just let those build warnings scroll off the screen - we don't care - we can't do anything about it. And it turned out that keeping track of everything needed to generate those warnings was slowing things down considerably. So I added a flag to disable that code, and Perl 5.8 defaults to use it, so it builds more quickly.
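Various helpful hecklers (mostly London.pm people who saw the talk; I'm counting David Adler as part of London.pm because he's subscribed to the list) wanted me to remind you that you really, really shouldn't optimise unless you absolutely need to. You are making your code harder to maintain, harder to extend, and easier to introduce new bugs into. Quite possibly you did something wrong in the first place to end up needing to optimise.
I agree with all of this.
Also, I'm not going to change the running order of the slides. There isn't a good order to try to describe things in, and some of the ideas that follow are actually more "good practice" than optimisation techniques, so possibly ought to come before the slides on finding slowness. I'll mark what I think are good habits to get into, and once you understand the techniques then I'd hope that you'd use them automatically when you first write code. That way (hopefully) your code will never be so slow that you actually want to do some of the brute force optimising I describe here.
[At this point you might laugh nervously, because I'm betting that very few people have written a comprehensive set of tests]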
You can never have enough memory, and it's never fast enough.
Computer memory is like a pyramid. At the point you have the CPU and its registers, which are very small and very fast to access. Then you have 1 or more levels of cache, which is larger, close by and fast to access. Then you have main memory, which is quite large, but further away so slower to access. Then at the base you have disk acting as virtual memory, which is huge, but very slow.
Now, if your program is swapping out to disk, you'll realise, because the OS can tell you that it only took 10 seconds of CPU, but 60 seconds elapsed, so you know it spent 50 seconds waiting for disk and that's your speed problem. But if your data is big enough to fit in main RAM, but doesn't all sit in the cache, then the CPU will keep having to wait for data from main RAM. And the OS timers I described count that in the CPU time, so it may not be obvious that memory use is actually your problem.
This is the original code for the part of the Encode compiler (enc2xs) that generates the warnings on duplicate characters:
if (exists $seen{$uch}) {
  warn sprintf("U%04X is %02X%02X and %02X%02X\n",
               $val,$page,$ch,@{$seen{$uch}});
}
else {
  $seen{$uch} = [$page,$ch];
}
It uses the hash %seen to remember all the Unicode characters that it has processed. The first time that it meets a character it won't be in the hash, the exists is false, so the else block executes. It stores an arrayref containing the code page and character number in that page. That's three things per character, and there are a lot of characters in Chinese.
If it ever sees the same Unicode character again, it prints a warning message. The warning message is just a string, and this is the only place that uses the data in %seen. So I changed the code - I pre-formatted that bit of the error message, and stored a single scalar rather than the three:
if (exists $seen{$uch}) {
  warn sprintf("U%04X is %02X%02X and %04X\n",
               $val,$page,$ch,$seen{$uch});
}
else {
  $seen{$uch} = $page << 8 | $ch;
}
That reduced the memory usage by a third, and it runs more quickly.
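To see why the single scalar carries the same information, here is a small sketch with made-up values for `$page` and `$ch`. It assumes, as the %02X formats suggest, that both fit in one byte. Formatting the packed integer with %04X gives the same four hex digits as formatting the pair with %02X%02X:

```perl
use strict;
use warnings;

# Hypothetical values; both are assumed to fit in one byte (0..255).
my ($page, $ch) = (0x8E, 0xA1);

# Original storage: an array reference holding two scalars.
my $pair = [$page, $ch];

# Replacement storage: one integer, with the page in the high byte.
my $packed = $page << 8 | $ch;

my $from_pair   = sprintf "%02X%02X", @$pair;
my $from_packed = sprintf "%04X", $packed;

print "$from_pair $from_packed\n";   # prints "8EA1 8EA1"
```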
How do you make things faster? Well, this is something of a black art, down to trial and error. I'll expand on aspects of these 4 points in the next slides.
By having commented out slower code near the faster code you can look back and get ideas for other places you might optimise in the same way.
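Here are some things I consider good practice, so you should be doing them as a matter of routine.
AutoSplit and AutoLoader
One potential problem is that AutoLoaded subroutines can be confusing to debug. While testing, you can disable AutoLoader by commenting out the __END__ that precedes the AutoLoaded subroutines. That way they are loaded, compiled and tested in the normal fashion.
...
1;
# While debugging, disable AutoLoader like this:
# __END__
...
Of course, to keep use happy you will then need another 1; at the end of the AutoLoaded section, and possibly another __END__.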
Schwern notes that commenting out __END__ can cause surprises if the main body of your module is running under use strict; because now your AutoLoaded subroutines will suddenly find themselves being run under use strict. This is arguably a bug in the current AutoSplit - when it runs at install time to generate the files for AutoLoader to use it doesn't add lines such as use strict; or use warnings; to ensure that the split out subroutines are in the same environment as was current at the __END__ statement. This may be fixed in 5.10.
Elizabeth Mattijsen notes that there are different memory use versus memory shared issues when running under mod_perl, with different optimal solutions depending on whether your Apache is forking or threaded.
=pod @ __END__
#!perl -w
use strict;

=head1 You don't want to do that

big block of pod

=cut

...
1;
__END__

=head1 You want to do this
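If you put your pod after an __END__ statement, then the Perl parser never has to look at it. This saves a little CPU, but if you have a big block of pod (>4K) it may also mean that the last disk blocks of the file are never read into RAM. This may gain you some speed. [A helpful heckler observed that modern raid systems may well be reading in 64K chunks, and modern OSes are getting good at read ahead, so not reading a block as a result of =pod @ __END__ may actually be quite rare.]
If you keep your pod (and tests) next to the code of your functions (which is arguably a better habit anyway), then this advice does not apply to you.
Exporter is written in Perl. It's fast, but not instant.
Many modules, to save you typing, export a lot of functions and symbols into your namespace by default. If you give use only one argument (the module name), such as
use POSIX; # Exports all the defaults
then POSIX will helpfully export its default list of symbols into your namespace. If you have a list after the module name, then only the symbols in that list are exported. If the list is empty, nothing is exported:
use POSIX (); # Exports nothing.
You can still use all the functions and other symbols - you just have to use their full names, by typing POSIX:: in front. Many people argue that this actually makes your code clearer, because it is now obvious where each subroutine is defined. Apart from that, it's also faster: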
| Module | use Module; | use Module (); |
|---|---|---|
| POSIX | 0.516s | 0.355s |
| Socket | 0.270s | 0.231s |
POSIX exports a lot of symbols by default. If you tell it to export none, it starts in about 30% less time. Socket starts in about 15% less time.
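A minimal sketch of the empty import list in use: floor and ceil are real POSIX functions, and the variable names are just for illustration.

```perl
use strict;
use warnings;
use POSIX ();    # load the module, import nothing

# Nothing was imported, so call the functions by their full names.
my $down = POSIX::floor(3.7);
my $up   = POSIX::ceil(3.2);

print "$down $up\n";   # prints "3 4"
```

If you only need a couple of names unqualified, a short explicit list such as `use POSIX qw(floor ceil);` is a middle ground.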
The $& variable returns the last text successfully matched in any regular expression. It's not lexically scoped, so unlike the match variables $1 etc it isn't reset when you leave a block. This means that to be correct Perl has to keep track of it from any match, as Perl has no idea when it might be needed. As it involves taking a copy of the matched string, it's expensive for Perl to keep track of. If you never mention $&, then Perl knows it can cheat and never store it. But if you (or any module) mention $& anywhere then Perl has to keep track of it throughout the script, which slows things down. So it's a good idea to capture the whole match explicitly if that's what you need.
$text =~ /.* rules/;
$line = $&; # Now every match will copy $& - slow

$text =~ /(.* rules)/;
$line = $1; # Didn't mention $& - fast
use English gives helpful long names to all the punctuation variables. Unfortunately that includes aliasing $& to $MATCH which makes Perl think that it needs to copy every match into $&, even if your script never actually uses it. In Perl 5.8 you can say use English '-no_match_vars'; to avoid mentioning the naughty "word", but this isn't available in earlier versions of Perl.
Capturing parentheses copy the matched text into $1 etc, so if all you need is grouping, use the non-capturing (?:...) instead of the capturing (...).
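For instance, in this sketch (the request line and the variable names are hypothetical) the alternation needs grouping but its text is never used, so only the path is captured:

```perl
use strict;
use warnings;

my $request = "GET /index.html";   # hypothetical input line
my $path;

# (?:GET|POST) only groups the alternation; (\S+) is the one real capture,
# so the path arrives in $1 and the method is never copied anywhere.
if ( $request =~ /^(?:GET|POST) (\S+)/ ) {
    $path = $1;
}

print "$path\n";   # prints "/index.html"
```

A side benefit is that the capture numbering stays stable if more groups are added later.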
If a regexp contains variables that you know will never change, use the /o flag to tell Perl, and it will never waste time checking or recompiling it.
You can use the qr// operator to pre-compile your regexps. It is often the easiest way to write regexp components to build up more complex regexps. Using it to build your regexps once is a good idea. But don't screw up (like parrot's assemble.pl did) by telling Perl to recompile the same regexp every time you enter a subroutine:
sub foo {
  my $reg1 = qr/.../;
  my $reg2 = qr/... $reg1 .../;
You should pull those two regexp definitions out of the subroutine into package variables, or file scoped lexicals.
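A sketch of the file-scoped arrangement follows. The patterns and the subroutine name `parse_pair` are invented for illustration, since the regexps in the example above are elided:

```perl
use strict;
use warnings;

# Compiled once, when this part of the file runs (file-scoped lexicals).
my $number = qr/\d+/;
my $pair   = qr/($number)\s*,\s*($number)/;

# Hypothetical subroutine: it only uses the pre-built regexps.
sub parse_pair {
    my ($text) = @_;
    return $text =~ $pair ? ($1, $2) : ();
}

my @got = parse_pair("12, 34");
print "@got\n";   # prints "12 34"
```

The lexicals must be assigned before the subroutine is first called, which is why they sit above it at the top level of the file.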
You find what is slow by using a profiler. People often guess where they think their program is slow, and get it hopelessly wrong. Use a profiler.
Devel::DProf is in the Perl core from version 5.6. If you're using an earlier Perl you can get it from CPAN.
You run your program with -d:DProf
perl5.8.0 -d:DProf enc2xs.orig -Q -O -o /dev/null ...
which times things and stores the data in a file named tmon.out. Then you run dprofpp to process the tmon.out file, and produce meaningful summary information. This excerpt is the default length and format, but you can use options to change things - see the man page. It also seems to show up a minor bug in dprofpp, because it manages to total things up to get 106%. While that's not right, it doesn't affect the explanation.
Total Elapsed Time = 66.85123 Seconds
User+System Time = 62.35543 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
106. 66.70 102.59 218881 0.0003 0.0005 main::enter
49.5 30.86 91.767 6 5.1443 15.294 main::compile_ucm
19.2 12.01 8.333 45242 0.0003 0.0002 main::encode_U
4.74 2.953 1.078 45242 0.0001 0.0000 utf8::unicode_to_native
4.16 2.595 0.718 45242 0.0001 0.0000 utf8::encode
0.09 0.055 0.054 5 0.0109 0.0108 main::BEGIN
0.01 0.008 0.008 1 0.0078 0.0078 Getopt::Std::getopts
0.00 0.000 -0.000 1 0.0000 - Exporter::import
0.00 0.000 -0.000 3 0.0000 - strict::bits
0.00 0.000 -0.000 1 0.0000 - strict::import
0.00 0.000 -0.000 2 0.0000 - strict::unimport
At the top of the list, the subroutine enter takes about half the total CPU time, with 200,000 calls, each very fast. That makes it a good candidate to optimise, because all you have to do is make a slight change that gives a small speedup, and that gain will be magnified 200,000 times. [It turned out that enter was tail recursive, and part of the speed gain I got was by making it loop instead]
Third on the list is encode_U, which with 45,000 calls is similar, and worth looking at. [Actually, it was trivial code and in the real enc2xs I inlined it]
utf8::unicode_to_native and utf8::encode are built-ins, so you won't be able to change that.
Don't bother below there, as you've accounted for 90% of total program time, so even if you did a perfect job on everything else, you could only make the program run 10% faster.
compile_ucm is trickier - it's only called 6 times, so it's not obvious where to look for what's slow. Maybe there's a loop with many iterations. But now you're guessing, which isn't good.
One trick is to break it into several subroutines, just for benchmarking, so that DProf gives you times for different bits. That way you can see where the juicy bits to optimise are.
Devel::SmallProf should do line by line profiling, but every time I use it it seems to crash.
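Now that you have identified the slow spots, you need to find out which of two pieces of code is faster. The Benchmark module makes this easy. A particularly good subroutine is cmpthese, which takes code snippets and plots a chart. cmpthese is available in the Benchmark that ships with Perl 5.6.
So to compare two code snippets orig and new by running each 10000 times, you would do this: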
use Benchmark ':all';

sub orig {
  ...
}

sub new {
  ...
}

cmpthese (10000, { orig => \&orig, new => \&new } );
Benchmark runs both, times them, and then prints out a helpful comparison chart:
Benchmark: timing 10000 iterations of new, orig...
new: 1 wallclock secs ( 0.70 usr + 0.00 sys = 0.70 CPU) @ 14222.22/s (n=10000)
orig: 4 wallclock secs ( 3.94 usr + 0.00 sys = 3.94 CPU) @ 2539.68/s (n=10000)
Rate orig new
orig 2540/s -- -82%
new 14222/s 460% --
It's plain to see that the new code is more than 4 times faster than the original (460% faster in the chart).
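For a version you can run as-is, here is a sketch with two stand-in snippets. The subroutines `with_loop` and `with_sum` and the data are hypothetical, not taken from enc2xs. It checks that the candidates agree before timing them, and passes a negative count, which asks Benchmark to run each snippet for at least that many CPU seconds instead of a fixed number of iterations:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util qw(sum);

my @numbers = (1 .. 1000);   # hypothetical data

# Candidate 1: add the numbers up with an explicit loop.
sub with_loop {
    my $total = 0;
    $total += $_ for @numbers;
    return $total;
}

# Candidate 2: let List::Util's sum do the same work.
sub with_sum {
    return sum(@numbers);
}

# Check the candidates agree before timing them.
die "results differ" unless with_loop() == with_sum();

# A negative count means "run each for at least that many CPU seconds".
cmpthese(-1, { with_loop => \&with_loop, with_sum => \&with_sum });
```

The rates will vary from machine to machine, so treat the chart as a relative comparison only.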
Actually, I didn't tell the whole truth earlier about what causes slowness in Perl. [And astute hecklers such as Philip Newton had already told me this]
When Perl compiles your program it breaks it down into a sequence of operations it must perform, which are usually referred to as ops. So when you ask Perl to compute $a = $b + $c it actually breaks it down into these ops:
Fetch $b and put it onto the stack
Fetch $c and put it onto the stack
Add the two values on the stack
Assign the result to $a
Computers are fast at simple things like addition. But there is quite a lot of overhead involved in keeping track of "which op am I currently performing" and "where is the next op", and this book-keeping often swamps the time taken to actually run the ops. So often in Perl it's the number of ops your program takes to perform its task that is more important than the CPU they use or the RAM it needs. The hit list is
So what were my example code snippets that I Benchmarked?
It was code to split a line of hex (54726164696e67207374796c652f6d61) into groups of 4 digits (5472 6164 696e ...) , and convert each to a number
sub orig {
  map {hex $_} $line =~ /(....)/g;
}

sub new {
  unpack "n*", pack "H*", $line;
}
The two produce the same results:
| orig | new |
|---|---|
| 21618, 24932, 26990, 26400, 29556, 31084, 25903, 28001, 26990, 29793, 26990, 24930, 26988, 26996, 31008, 26223, 29216, 29552, 25957, 25646 | 21618, 24932, 26990, 26400, 29556, 31084, 25903, 28001, 26990, 29793, 26990, 24930, 26988, 26996, 31008, 26223, 29216, 29552, 25957, 25646 |
but the first one is much slower. Why? Following the data path from right to left, it starts well with a global regexp, which is only one op and therefore a fast way to generate a list of the 4 digit groups. But that map block is actually an implicit loop, so for each 4 digit block it iterates round and repeatedly calls hex. That's at least one op for every list item.
Whereas the second one has no loops in it, implicit or explicit. It uses one pack to convert the hex temporarily into a binary string, and then one unpack to convert that string into a list of numbers. n is big endian 16 bit quantities. I didn't know that - I had to look it up. But when the profiler told me that this part of the original code was a performance bottleneck, the first thing that I did was to look at the pack docs to see if I could use some sort of pack/unpack as a speedier replacement.
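As a worked check, the sketch below runs both approaches on the sample hex string quoted earlier and prints the list from the replacement. Only the variable names `@orig` and `@new` are added here:

```perl
use strict;
use warnings;

my $line = "54726164696e67207374796c652f6d61";   # sample hex from the text

# Original approach: the regexp splits the string into 4-digit groups,
# then map calls hex once per group.
my @orig = map { hex $_ } $line =~ /(....)/g;

# Replacement: "H*" packs the hex digits into bytes, and
# "n*" unpacks those bytes as big-endian 16-bit integers.
my @new = unpack "n*", pack "H*", $line;

print "@new\n";   # prints "21618 24932 26990 26400 29556 31084 25903 28001"
```

The first group, 5472 in hex, is 21618 in decimal, which matches the first entry in the table above.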
You can ask Perl to tell you the ops that it generates for particular code with the Terse backend to the compiler. For example, here's a 1 liner to show the ops in the original code:
$ perl -MO=Terse -e'map {hex $_} $line =~ /(....)/g;'
LISTOP (0x16d9c8) leave [1]
    OP (0x16d9f0) enter
    COP (0x16d988) nextstate
    LOGOP (0x16d940) mapwhile [2]
        LISTOP (0x16d8f8) mapstart
            OP (0x16d920) pushmark
            UNOP (0x16d968) null
                UNOP (0x16d7e0) null
                    LISTOP (0x115370) scope
                        OP (0x16bb40) null [174]
                        UNOP (0x16d6e0) hex [1]
                            UNOP (0x16d6c0) null [15]
                                SVOP (0x10e6b8) gvsv GV (0xf4224) *_
            PMOP (0x114b28) match /(....)/
                UNOP (0x16d7b0) null [15]
                    SVOP (0x16d700) gvsv GV (0x111f10) *line
At the bottom you can see how the match /(....)/ is just one op. But the next diagonal line of ops from mapwhile down to the match are all the ops that make up the map. Lots of them. And they get run each time round map's loop. [Note also that the {}s mean that map enters scope each time round the loop. That's not a trivially cheap op either]
Whereas my replacement code looks like this:
$ perl -MO=Terse -e'unpack "n*", pack "H*", $line;'
LISTOP (0x16d818) leave [1]
    OP (0x16d840) enter
    COP (0x16bb40) nextstate
    LISTOP (0x16d7d0) unpack
        OP (0x16d7f8) null [3]
        SVOP (0x10e6b8) const PV (0x111f94) "n*"
        LISTOP (0x115370) pack [1]
            OP (0x16d7b0) pushmark
            SVOP (0x16d6c0) const PV (0x111f10) "H*"
            UNOP (0x16d790) null [15]
                SVOP (0x16d6e0) gvsv GV (0x111f34) *line
There are fewer ops in total. And no loops, so all the ops you see execute only once. :-)
[My helpful hecklers pointed out that it's hard to work out what an op is. Good call. There's roughly one op per symbol (function, operator, variable name, and any other bit of Perl syntax). So if you golf down the number of functions and operators your program runs, then you'll be reducing the number of ops.]
[These were supposed to be the bonus slides. I talked too fast (quelle surprise) and so managed to actually get through the lot with time for questions]
Memoize follows the grand Perl tradition by trading memory for speed. You tell Memoize the name(s) of functions you'd like to speed up, and it does symbol table games to transparently intercept calls to them. It looks at the parameters the function was called with, and uses them to decide what to do next. If it hasn't seen a particular set of parameters before, it calls the original function with the parameters. However, before returning the result, it stores it in a hash for that function, keyed by the function's parameters. If it has seen the parameters before, then it just returns the result direct from the hash, without even bothering to call the function.
The hash Memoize uses is a regular Perl hash. This means that you can tie the hash to a disk file. This allows Memoize to remember things across runs of your program. That way, you could use Memoize in a CGI to cache static content that you only generate on demand (but remember you'll need file locking). The first person who requests something has to wait for the generation routine, but everyone else gets it straight from the cache. You can also arrange for another program to periodically expire results from the cache. As of 5.8 the Memoize module has been assimilated into the core. Users of earlier Perl can get it from CPAN.
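A minimal sketch of Memoize in use follows. The function `slow_square` is a hypothetical stand-in for an expensive calculation, and the `$calls` counter exists only to show how often the real body runs:

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;    # counts how often the real function body runs

# Hypothetical stand-in for an expensive calculation.
sub slow_square {
    my ($n) = @_;
    $calls++;
    return $n * $n;
}

# Replace slow_square with a caching wrapper.
memoize('slow_square');

my @results = map { slow_square($_) } 4, 4, 5, 4;
print "@results ($calls real calls)\n";   # prints "16 16 25 16 (2 real calls)"
```

Only the first call for each distinct argument reaches the original function; the repeats are answered from the cache. This is only safe for functions whose result depends on nothing but their arguments.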
These are quite general ideas for optimisation that aren't particularly Perl specific.
enc2xs was calling a function each time round a loop based on a hash lookup using $type as the key. The value of $type didn't change, so I pulled the lookup out above the loop into a lexical variable:
my $type_func = $encode_types{$type};
and doing it only once was faster.
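In outline, the change looks like the sketch below. The contents of `%encode_types`, the value of `$type` and the `@items` list are invented here, since the real enc2xs table is not shown:

```perl
use strict;
use warnings;

# Hypothetical dispatch table, standing in for the real %encode_types.
my %encode_types = (
    upper => sub { uc $_[0] },
    lower => sub { lc $_[0] },
);
my $type  = 'upper';
my @items = qw(a b c);

# Look the function up once, outside the loop...
my $type_func = $encode_types{$type};

# ...so each iteration only has to call it.
my @out = map { $type_func->($_) } @items;

print "@out\n";   # prints "A B C"
```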
enc2xs was calling a function which took several arguments from a small number of places. The function contained code to set defaults if some of the arguments were not supplied. I found that the way the program ran, most of the calls passed in all the values and didn't need the defaults. Changing the function to not set defaults, and writing those defaults out explicitly where needed bought me a speed up.
Perl can't spot that it could just throw away the old lexicals and re-use their space, but you can, so you can save CPU and RAM by re-writing your tail recursive subroutines with loops. In general, trying to reduce recursion by replacing it with iterative algorithms should speed things up.
yay for y
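To illustrate the tail recursion point, here is a sketch using Euclid's greatest common divisor algorithm. This is a textbook example chosen for illustration, not the enter subroutine from enc2xs. In the recursive form the recursive call is the last thing the subroutine does, so it can be rewritten as a loop that re-uses the same two lexicals:

```perl
use strict;
use warnings;

# Tail recursive: the recursive call is the final action, yet Perl
# still sets up a fresh call frame and fresh lexicals each time.
sub gcd_recursive {
    my ($x, $y) = @_;
    return $x if $y == 0;
    return gcd_recursive($y, $x % $y);
}

# Same algorithm as a loop: one call, the same two lexicals re-used.
sub gcd_loop {
    my ($x, $y) = @_;
    ($x, $y) = ($y, $x % $y) while $y != 0;
    return $x;
}

print gcd_recursive(48, 18), " ", gcd_loop(48, 18), "\n";   # prints "6 6"
```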
Ops are bad, m'kay
Another example, lifted straight from the source:
#foreach my $c (split(//,$out_bytes)) {
#  $s .= sprintf "\\x%02X",ord($c);
#}
# 9.5% faster changing that loop to this:
$s .= sprintf +("\\x%02X" x length $out_bytes), unpack "C*", $out_bytes;
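As a hedged check that the two forms agree, the sketch below runs both on a made-up three-byte string. The variable names `$loop` and `$once` are added for the comparison, and the format is written as "\\x%02X" so that a literal backslash and x reach the output:

```perl
use strict;
use warnings;

my $out_bytes = "A\x00\xFF";   # hypothetical three-byte sample

# Original: one sprintf and one ord per byte, inside a loop.
my $loop = '';
foreach my $c (split(//, $out_bytes)) {
    $loop .= sprintf "\\x%02X", ord($c);
}

# Rewrite: build the whole format once, then feed it every byte value at once.
my $once = sprintf +("\\x%02X" x length $out_bytes), unpack "C*", $out_bytes;

print "$once\n";   # prints \x41\x00\xFF
```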
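The original makes a temporary list with split and then loops over it, calling ord and sprintf once for every byte. The new code effectively merges the split and the per-byte ord calls into a single unpack, and the per-byte sprintf calls into one sprintf with a repeated format.
How to make Perl fast enough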





