Why is this code 6.5x slower with optimizations enabled?Is it safe to read past the end of a buffer within the same page on x86 and x64?With arrays, why is it the case that a[5] == 5[a]?Why is the Android emulator so slow? How can we speed up the Android emulator?Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?Why are elementwise additions much faster in separate loops than in a combined loop?What is “:-!!” in C code?Why does changing 0.1f to 0 slow down performance by 10x?Why is it faster to process a sorted array than an unsorted array?Why does the C preprocessor interpret the word “linux” as the constant “1”?Why is printing “B” dramatically slower than printing “#”?Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?
What does it mean to express a gate in Dirac notation?
Packing rectangles: Does rotation ever help?
how to find the equation of a circle given points of the circle
How to reduce LED flash rate (frequency)
How do I reattach a shelf to the wall when it ripped out of the wall?
US visa is under administrative processing, I need the passport back ASAP
Binary Numbers Magic Trick
Does Gita support doctrine of eternal cycle of birth and death for evil people?
How did Captain America manage to do this?
What does the "ep" capability mean?
How to stop co-workers from teasing me because I know Russian?
Why does nature favour the Laplacian?
How would one muzzle a full grown polar bear in the 13th century?
Use tikz commands in caption
Reducing vertical space in stackrel
What do the phrase "Reeyan's seacrest" and the word "fraggle" mean in a sketch?
What's the polite way to say "I need to urinate"?
Please, smoke with good manners
What are the potential pitfalls when using metals as a currency?
Size of electromagnet needed to replicate Earth's magnetic field
Noun clause (singular all the time?)
How much cash can I safely carry into the USA and avoid civil forfeiture?
Can someone publish a story that happened to you?
Map of water taps to fill bottles
Why is this code 6.5x slower with optimizations enabled?
Is it safe to read past the end of a buffer within the same page on x86 and x64?With arrays, why is it the case that a[5] == 5[a]?Why is the Android emulator so slow? How can we speed up the Android emulator?Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?Why are elementwise additions much faster in separate loops than in a combined loop?What is “:-!!” in C code?Why does changing 0.1f to 0 slow down performance by 10x?Why is it faster to process a sorted array than an unsorted array?Why does the C preprocessor interpret the word “linux” as the constant “1”?Why is printing “B” dramatically slower than printing “#”?Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;
I wanted to benchmark glibc
's strlen
function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main()
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i)
s[strlen(s)] = 'A';
clock_t end = clock();
printf("%lldn", (long long)(end-start));
return 0;
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
c performance gcc glibc
|
show 5 more comments
I wanted to benchmark glibc
's strlen
function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main()
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i)
s[strlen(s)] = 'A';
clock_t end = clock();
printf("%lldn", (long long)(end-start));
return 0;
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
c performance gcc glibc
2
Please report it to gcc's bugzilla.
– Marc Glisse
Apr 7 at 21:17
2
Using-fno-builtin
makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtinstrlen
is slower than the library's.
– David Schwartz
Apr 7 at 21:18
2
It is generatingrepnz scasb
for strlen at -O1.
– Marc Glisse
Apr 7 at 21:23
8
@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
– Justin
Apr 8 at 2:57
3
@Damon Performance considerations are also considered bug reports for gcc (and most compilers for that matter). If they decide not to change it, that's fine. If they decide that it's worth changing, that's also fine. If you do not file performance bugs, the compiler team will not realize that there's something to look at.
– Justin
Apr 8 at 21:06
|
show 5 more comments
I wanted to benchmark glibc
's strlen
function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main()
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i)
s[strlen(s)] = 'A';
clock_t end = clock();
printf("%lldn", (long long)(end-start));
return 0;
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
c performance gcc glibc
I wanted to benchmark glibc
's strlen
function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main()
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i)
s[strlen(s)] = 'A';
clock_t end = clock();
printf("%lldn", (long long)(end-start));
return 0;
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
c performance gcc glibc
c performance gcc glibc
edited Apr 8 at 6:37
Muntasir
6381919
6381919
asked Apr 7 at 20:54
TsarNTsarN
37848
37848
2
Please report it to gcc's bugzilla.
– Marc Glisse
Apr 7 at 21:17
2
Using-fno-builtin
makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtinstrlen
is slower than the library's.
– David Schwartz
Apr 7 at 21:18
2
It is generatingrepnz scasb
for strlen at -O1.
– Marc Glisse
Apr 7 at 21:23
8
@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
– Justin
Apr 8 at 2:57
3
@Damon Performance considerations are also considered bug reports for gcc (and most compilers for that matter). If they decide not to change it, that's fine. If they decide that it's worth changing, that's also fine. If you do not file performance bugs, the compiler team will not realize that there's something to look at.
– Justin
Apr 8 at 21:06
|
show 5 more comments
2
Please report it to gcc's bugzilla.
– Marc Glisse
Apr 7 at 21:17
2
Using-fno-builtin
makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtinstrlen
is slower than the library's.
– David Schwartz
Apr 7 at 21:18
2
It is generatingrepnz scasb
for strlen at -O1.
– Marc Glisse
Apr 7 at 21:23
8
@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
– Justin
Apr 8 at 2:57
3
@Damon Performance considerations are also considered bug reports for gcc (and most compilers for that matter). If they decide not to change it, that's fine. If they decide that it's worth changing, that's also fine. If you do not file performance bugs, the compiler team will not realize that there's something to look at.
– Justin
Apr 8 at 21:06
2
2
Please report it to gcc's bugzilla.
– Marc Glisse
Apr 7 at 21:17
Please report it to gcc's bugzilla.
– Marc Glisse
Apr 7 at 21:17
2
2
Using
-fno-builtin
makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtin strlen
is slower than the library's.– David Schwartz
Apr 7 at 21:18
Using
-fno-builtin
makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtin strlen
is slower than the library's.– David Schwartz
Apr 7 at 21:18
2
2
It is generating
repnz scasb
for strlen at -O1.– Marc Glisse
Apr 7 at 21:23
It is generating
repnz scasb
for strlen at -O1.– Marc Glisse
Apr 7 at 21:23
8
8
@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
– Justin
Apr 8 at 2:57
@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
– Justin
Apr 8 at 2:57
3
3
@Damon Performance considerations are also considered bug reports for gcc (and most compilers for that matter). If they decide not to change it, that's fine. If they decide that it's worth changing, that's also fine. If you do not file performance bugs, the compiler team will not realize that there's something to look at.
– Justin
Apr 8 at 21:06
@Damon Performance considerations are also considered bug reports for gcc (and most compilers for that matter). If they decide not to change it, that's fine. If they decide that it's worth changing, that's also fine. If you do not file performance bugs, the compiler team will not realize that there's something to look at.
– Justin
Apr 8 at 21:06
|
show 5 more comments
2 Answers
2
active
oldest
votes
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0
or without optimisations, the generated code calls the C library functionstrlen
- at
-O1
the generated code uses a simple inline expansion using arep scasb
instruction. - at
-O2
and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1
code is much slower than the C library implementation:32240
vs3090
- the
-O2
code is faster than the-O1
but still substantially slower than the C ibrary code:8570
vs3090
.
This behavior is specific to gcc
and the glibc
. The same test on OS/X with clang
and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang
generates a call to the C library strlen
at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen
has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I improved this benchmark and tested various string lengths. It appears from the benchmarks on linux with gcc (Debian 4.7.2-5) 4.7.2 running on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz that the inline code generated by -O1
is always slower, by as much as a factor of 10 for moderately long strings, while -O2
is only slightly faster than the libc strlen for very short strings and half as fast for longer strings. From this data, the GNU C library version of strlen
is quite efficient for most string lengths, at least on my specific hardware. Also keeping in mind that cacheing has a major impact on benchmark measurements.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
2
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
Apr 7 at 22:00
1
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. Ifstrlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.
– Brendan
Apr 8 at 0:08
4
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was astrlen_small()
and a separatestrlen_large()
, but there isn't.
– Brendan
Apr 8 at 0:13
1
@MatthieuM.: you can look on Godbolt (and note that-march=
implies-mtune
).gcc -march=skylake -O3
still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inliningstrlen
at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2pcmpeqb / pmovmskb / test
/jz
, thenbsf
, without risking page faults for crossing into the next page. (We knowcalloc
gave us 16-byte aligned memory becausealignof(maxalign_t) = 16
in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.
– Peter Cordes
Apr 8 at 19:20
1
This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)
– Peter Cordes
Apr 8 at 19:23
|
show 15 more comments
GCC's inline strlen
patterns are much slower than what it could do with SSE2 pcmpeqb
/ pmovmskb
, and bsf
, given the 16-byte alignment from calloc
. This "optimization" is actually a pessimization.
My simple hand-written loop that takes advantage of 16-byte alignment is 5x faster than what gcc -O3
inlines for large buffers, and ~2x faster for short strings. (And faster than calling strlen for short strings). I've added a comment to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809 to propose this for what gcc should inline at -O2 / -O3 when it's able. (With a suggestion for ramping up to 16-byte if we only know 4-byte alignment to start with.)
When gcc knows it has 4-byte alignment for the buffer (guaranteed by calloc
), it chooses to inline strlen
as a 4-byte-at-a-time scalar bithack using GP integer registers (-O2
and higher).
(Reading 4 bytes at a time is only safe if we know we can't cross into a page that doesn't contain any string bytes, and thus might be unmapped. Is it safe to read past the end of a buffer within the same page on x86 and x64? (TL:DR yes, in asm it is, so compilers can emit code that does that even if doing so in the C source is UB. libc strlen
implementations also take advantage of that. See my answer there for links to glibc strlen
and a summary of how it runs so fast for large strings.)
At -O1
, gcc always (even without known alignment) chooses to inline strlen
as repnz scasb
, which is very slow (about 1 byte per clock cycle on modern Intel CPUs). "Fast strings" only applies to rep stos
and rep movs
, not the repz
/repnz
instructions, unfortunately. Their microcode is just simple 1 byte at a time, but they still have some startup overhead. (https://agner.org/optimize/)
(We can test this by "hiding" the pointer from the compiler by storing / reloading s
to a volatile void *tmp
, for example. gcc has to make zero assumptions about the pointer value that's read back from a volatile
, destroying any alignment info.)
GCC does have some x86 tuning options like -mstringop-strategy=libcall
vs. unrolled_loop
vs. rep_byte
for inlining string operations in general (not just strlen; memcmp
would be another major one that can be done with rep or a loop). I haven't checked what effect these have here.
The docs for another option also describe the current behaviour. We could get this inlining (with extra code for alignment-handling) even in cases where we wanted it on unaligned pointers. (This used to be an actual perf win, especially for small strings, on targets where the inline loop wasn't garbage compared to what the machine can do.)
-minline-all-stringops
By default GCC inlines string operations only when the destination is known to be aligned to least a 4-byte boundary. This enables more inlining and increases code size, but may improve performance of code that depends on fast memcpy, strlen, and memset for short lengths.
GCC also has per-function attributes you can apparently use to control this, like __attribute__((no-inline-all-stringops)) void foo() ...
, but I haven't played around with it. (That's the opposite of inline-all. It doesn't mean inline none, it just goes back to only inlining when 4-byte alignment is known.)
Both of gcc's inline strlen
strategies fail to take advantage of 16-byte alignment, and are pretty bad for x86-64
Unless the small-string case is very common, doing one 4-byte chunk, then aligned 8-byte chunks would go about twice as fast as 4-byte.
And the 4-byte strategy has much slower cleanup than necessary for finding the byte within the dword containing the zero byte. It detects this by looking for a byte with its high bit set, so it should just mask off the other bits and use bsf
(bit-scan forward). That has 3 cycle latency on modern CPUs (Intel and Ryzen). Or compilers can use rep bsf
so it runs as tzcnt
on CPUs that support BMI1, which is more efficient on AMD. bsf
and tzcnt
give the same result for non-zero inputs.
GCC's 4-byte loop looks like it's compiled from pure C, or some target-independent logic, not taking advantage of bitscan. gcc does use andn
to optimize it when compiling for x86 with BMI1, but it's still less than 4 bytes per cycle.
SSE2 pcmpeqb
+ bsf
is much much better for both short and long inputs. x86-64 guarantees that SSE2 is available, and the x86-64 System V has alignof(maxalign_t) = 16
so calloc
will always return pointers that are at least 16-byte aligned.
I wrote a replacement for the strlen
block to test performance
As expected it's about 4x faster on Skylake going 16 bytes at a time instead of 4.
(I compiled the original source to asm with -O3
, then edited the asm to see what performance should have been with this strategy for inline expansion of strlen
. I also ported it to inline asm inside the C source; see that version on Godbolt.)
# at this point gcc has `s` in RDX, `i` in ECX
pxor %xmm0, %xmm0 # zeroed vector to compare against
.p2align 4
.Lstrlen16: # do
#ifdef __AVX__
vpcmpeqb (%rdx), %xmm0, %xmm1
#else
movdqa (%rdx), %xmm1
pcmpeqb %xmm0, %xmm1 # xmm1 = -1 where there was a 0 in memory
#endif
add $16, %rdx # ptr++
pmovmskb %xmm1, %eax # extract high bit of each byte to a 16-bit mask
test %eax, %eax
jz .Lstrlen16 # while(mask==0);
# RDX points at the 16-byte chunk *after* the one containing the terminator
# EAX = bit-mask of the 0 bytes, and is known to be non-zero
bsf %eax, %eax # EAX = bit-index of the lowest set bit
movb $'A', -16(%rdx, %rax)
Note that I optimized part of the strlen cleanup into the store addressing mode: I correct for the overshoot with the -16
displacement, and that this is just finding the end of the string, not actually calculating the length and then indexing like GCC was already doing after inlining its 4-byte-at-a-time loop.
To get actual string length (instead of pointer to the end), you'd subtract rdx-start and then add rax-16
(maybe with an LEA to add 2 registers + a constant, but 3-component LEA has more latency.)
With AVX to allow load+compare in one instruction without destroying the zeroed register, the whole loop is only 4 uops, down from 5. (test/jz macro-fuses into one uop on both Intel and AMD. vpcmpeqb
with a non-indexed memory-source can keep it micro-fused through the whole pipeline, so it's only 1 fused-domain uop for the front-end.)
(Note that mixing 128-bit AVX with SSE does not cause stalls even on Haswell, as long as you're in clean-upper state to start with. So I didn't bother about changing the other instructions to AVX, only the one that mattered. There seemed to be some minor effect where pxor
was actually slightly better than vpxor
on my desktop, though, for an AVX loop body. It seemed somewhat repeatable, but it's weird because there's no code-size difference and thus no alignment difference.)
pmovmskb
is a single-uop instruction. It has 3-cycle latency on Intel and Ryzen (worse on Bulldozer-family). For short strings, the trip through the SIMD unit and back to integer is an important part of the critical path dependency chain for latency from input memory bytes to store-address being ready. But only SIMD has packed-integer compares, so scalar would have to do more work.
For the very-small string case (like 0 to 3 bytes), it might be possible to achieve slightly lower latency for that case by using pure scalar (especially on Bulldozer-family), but having all strings from 0 to 15 bytes take the same branch path (loop branch never taken) is very nice for most short-strings use-cases.
Being very good for all strings up to 15 bytes seems like a good choice, when we know we have 16-byte alignment. More predictable branching is very good. (And note that when looping, pmovmskb
latency only affects how quickly we can detect branch mispredicts to break out of the loop; branch prediction + speculative execution hides the latency of the independent pmovmskb in each iteration.
If we expected longer strings to be common, we could unroll a bit, but at that point you should just call the libc function so it can dispatch to AVX2 if available at runtime. Unrolling to more than 1 vector complicates the cleanup, hurting the simple cases.
On my machine i7-6700k Skylake at 4.2GHz max turbo (and energy_performance_preference
= performance), with gcc8.2 on Arch Linux, I get somewhat consistent benchmark timing because my CPU clock speed ramps up during the memset. But maybe not always to max turbo; Skylake's hw power management downclocks when memory-bound. perf stat
showed I typically got right around 4.0GHz when running this to average the stdout output and see perf summary on stderr.
perf stat -r 100 ./a.out | awk 'sum+= $1 ENDprint sum/100;'
I ended up copying my asm into a GNU C inline-asm statement, so I could put the code on the Godbolt compiler explorer.
For large strings, same length as in the question: times on ~4GHz Skylake
- ~62100
clock_t
time units:-O1
rep scas: (clock()
is a bit obsolete, but I didn't bother changing it.) - ~15900
clock_t
time units:-O3
gcc 4-byte loop strategy: avg of 100 runs = . (Or maybe ~15800 with-march=native
forandn
) - ~1880
clock_t
time units:-O3
with glibcstrlen
function calls, using AVX2 - ~3190
clock_t
time units: (AVX1 128-bit vectors, 4 uop loop) hand-written inline asm that gcc could/should inline. - ~3230
clock_t
time units: (SSE2 5 uop loop) hand-written inline asm that gcc could/should inline.
My hand-written asm should be very good for short strings, too, because it doesn't need to branch specially. Known alignment is very good for strlen, and libc can't take advantage of it.
If we expect large strings to be rare, 1.7x slower than libc for that case. The length of 1M bytes means it won't be staying hot in L2 (256k) or L1d cache (32k) on my CPU, so even bottlenecked on L3 cache the libc version was faster. (Probably an unrolled loop and 256-bit vectors doesn't clog up the ROB with as many uops per byte, so OoO exec can see farther ahead and get more memory parallelism, especially at page boundaries.)
But L3 cache bandwidth is probably a bottleneck stopping the 4-uop version from running at 1 iteration per clock, so we're seeing less benefit from AVX saving us a uop in the loop. With data hot in L1d cache, we should get 1.25 cycles per iteration vs. 1.
But a good AVX2 implementation can read up to 64 bytes per cycle (2x 32 byte loads) using vpminub
to combine pairs before checking for zeros and going back to find where they were. The gap between this and libc opens wider for sizes of ~2k to ~30 kiB or so that stay hot in L1d.
Some read-only testing with length=1000 indicates that glibc strlen
really is about 4x faster than my loop for medium size strings hot in L1d cache. That's large enough for AVX2 to ramp up to the big unrolled loop, but still easily fits in L1d cache. (Read-only avoid store-forwarding stalls, and so we can do many iterations)
If your strings are that big, you should be using explicit-length strings instead of needing to strlen
at all, so inlining a simple loop still seems like a reasonable strategy, as long as it's actually good for short strings and not total garbage for medium (like 300 bytes) and very long (> cache size) strings.
Benchmarking small strings with this:
I ran into some oddities in trying to get the results I expected:
I tried s[31] = 0
to truncate the string before every iteration (allowing short constant length). But then my SSE2 version was almost the same speed as GCC's version. Store-forwarding stalls were the bottleneck! A byte store followed by a wider load makes store-forwarding take the slow path that merges bytes from the store buffer with bytes from L1d cache. This extra latency is part of a loop-carried dep chain through the last 4-byte or 16-byte chunk of the string, to calculate the store index for the next iteration.
GCC's slower 4-byte-at-a-time code could keep up by processing the earlier 4-byte chunks in the shadow of that latency. (Out-of-order execution is pretty fantastic: slow code can sometimes not affect the overall speed of your program).
I eventually solved it by making a read-only version, and using inline asm to stop the compiler from hoisting strlen
out of the loop.
But store-forwarding is a potential issue with using 16-byte loads. If other C variables are stored past the end of the array, we might hit a SF stall due to loading off the end of the array farther than with narrower stores. For recently-copied data, we're fine if it was copied with 16-byte or wider aligned stores, but glibc memcpy for small copies does 2x overlapping loads that cover the whole object, from the start and end of the object. Then it stores both, again overlapping, handling the memmove src overlaps dst case for free. So the 2nd 16-byte or 8-byte chunk of a short string that was just memcpyied might give us a SF stall for reading the last chunk. (The one that has the data dependency for the output.)
Just running slower so you don't get to the end before it's ready isn't good in general, so there's no great solution here. I think most of the time you're not going to strlen a buffer you just wrote, usually you're going to strlen
an input that you're only reading so store-forwarding stalls aren't a problem. If something else just wrote it, then efficient code hopefully wouldn't have thrown away the length and called a function that required recalculating it.
Other weirdness I haven't totally figured out:
Code alignment is making a factor of 2 difference for read-only, size=1000 (s[1000] = 0;
). But the inner-most asm loop itself is aligned with .p2align 4
or .p2align 5
. Increasing the loop alignment can slow it down by a factor of 2!
# slow version, with *no* extra HIDE_ALIGNMENT function call before the loop.
# using my hand-written asm, AVX version.
i<1280000 read-only at strlen(s)=1000 so strlen time dominates the total runtime (not startup overhead)
.p2align 5 in the asm inner loop. (32-byte code alignment with NOP padding)
gcc -DUSE_ASM -DREAD_ONLY -DHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk 'sum+= $1 ENDprint sum/100;'
Performance counter stats for './a.out' (100 runs):
40.92 msec task-clock # 0.996 CPUs utilized ( +- 0.20% )
2 context-switches # 0.052 K/sec ( +- 3.31% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.008 M/sec ( +- 0.05% )
168,103,223 cycles # 4.108 GHz ( +- 0.20% )
82,293,840 branches # 2011.269 M/sec ( +- 0.00% )
1,845,647 branch-misses # 2.24% of all branches ( +- 0.74% )
412,769,788 instructions # 2.46 insn per cycle ( +- 0.00% )
466,515,986 uops_issued.any # 11401.694 M/sec ( +- 0.22% )
487,011,558 uops_executed.thread # 11902.607 M/sec ( +- 0.13% )
0.0410624 +- 0.0000837 seconds time elapsed ( +- 0.20% )
40326.5 (clock_t)
real 0m4.301s
user 0m4.050s
sys 0m0.224s
Note branch misses definitely non-zero, vs. almost exactly zero for the fast version. And uops issued is much higher than the fast version: it may be speculating down the wrong path for a long time on each of those branch misses.
Probably the inner and outer loop-branches are aliasing each other, or not.
Instruction count is nearly identical, just different by some NOPs in the outer loop ahead of the inner loop. But IPC is vastly different: without problems, the fast version runs an average of 4.82 instructions per clock for the whole program. (Most of that is in the inner-most loop running 5 instructions per cycle, thanks to a test/jz that macro-fuses 2 instructions into 1 uop.) And note that uops_executed is much higher than uops_issued: that means micro-fusion is working well to get more uops through the front-end bottleneck.
fast version, same read-only strlen(s)=1000 repeated 1280000 times
gcc -DUSE_ASM -DREAD_ONLY -UHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk 'sum+= $1 ENDprint sum/100;'
Performance counter stats for './a.out' (100 runs):
21.06 msec task-clock # 0.994 CPUs utilized ( +- 0.10% )
1 context-switches # 0.056 K/sec ( +- 5.30% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.015 M/sec ( +- 0.04% )
86,239,943 cycles # 4.094 GHz ( +- 0.02% )
82,285,261 branches # 3906.682 M/sec ( +- 0.00% )
17,645 branch-misses # 0.02% of all branches ( +- 0.15% )
415,286,425 instructions # 4.82 insn per cycle ( +- 0.00% )
335,057,379 uops_issued.any # 15907.619 M/sec ( +- 0.00% )
409,255,762 uops_executed.thread # 19430.358 M/sec ( +- 0.00% )
0.0211944 +- 0.0000221 seconds time elapsed ( +- 0.10% )
20504 (clock_t)
real 0m2.309s
user 0m2.085s
sys 0m0.203s
I think it's just the branch prediction, not other front-end stuff that's a problem. The test/branch instructions aren't getting split across a boundary that would prevent macro-fusion.
Changing .p2align 5
to .p2align 4
reverses them: -UHIDE_ALIGNMENT
becomes slow.
This Godbolt binary link reproduces the same padding I'm seeing with gcc8.2.1 on Arch Linux for both cases: 2x 11-byte nopw
+ a 3-byte nop
inside the outer loop for the fast case. It also has the exact source I was using locally.
short strlen read-only micro-benchmarks:
Tested with stuff chosen so it doesn't suffer from branch mispredicts or store-forwarding, and can test the same short length repeatedly for enough iterations to get meaningful data.
strlen=33
, so the terminator is near the start of the 3rd 16-byte vector. (Makes my version look as bad as possible vs. the 4-byte version.) -DREAD_ONLY
, and i<1280000
as an outer-loop repeat loop.
1933 clock_t: my asm: nice and consistent best-case time (not noisy / bouncing around when re-running the average.) Equal perf with/without-DHIDE_ALIGNMENT
, unlike for the longer strlen. The loop branch is much more easily predictable with that much shorter pattern. (strlen=33, not 1000).
3220 clock_t: gcc -O3strlen
. (-DHIDE_ALIGNMENT
)- 6100 clock_t: gcc -O3 4-byte loop
- 37200 clock_t: gcc -O1 repz scasb
So for short strings, my simple inline loop beats a library function call to strlen
that has to go through the PLT (call + jmp [mem]
), then run strlen's startup overhead that can't depend on alignment.
There were negligible branch-mispredicts, like 0.05% for all the versions with strlen(s)=33
. The repz scasb version had 0.46%, but that's out of fewer total branches. No inner loop to rack up many correctly predicted branches.
With branch predictors and code-cache hot, repz scasb
is more than 10x worse than calling glibc strlen
for a 33-byte string. It would be less bad in real use cases where strlen
could branch miss or even miss in code-cache and stall, but straight-line repz scasb
wouldn't. But 10x is huge, and that's for a fairly short string.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55563598%2fwhy-is-this-code-6-5x-slower-with-optimizations-enabled%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0
or without optimisations, the generated code calls the C library functionstrlen
- at
-O1
the generated code uses a simple inline expansion using arep scasb
instruction. - at
-O2
and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1
code is much slower than the C library implementation:32240
vs3090
- the
-O2
code is faster than the-O1
but still substantially slower than the C ibrary code:8570
vs3090
.
This behavior is specific to gcc
and the glibc
. The same test on OS/X with clang
and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang
generates a call to the C library strlen
at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen
has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I improved this benchmark and tested various string lengths. It appears from the benchmarks on linux with gcc (Debian 4.7.2-5) 4.7.2 running on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz that the inline code generated by -O1
is always slower, by as much as a factor of 10 for moderately long strings, while -O2
is only slightly faster than the libc strlen for very short strings and half as fast for longer strings. From this data, the GNU C library version of strlen
is quite efficient for most string lengths, at least on my specific hardware. Also keeping in mind that cacheing has a major impact on benchmark measurements.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
2
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
Apr 7 at 22:00
1
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. Ifstrlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.
– Brendan
Apr 8 at 0:08
4
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was astrlen_small()
and a separatestrlen_large()
, but there isn't.
– Brendan
Apr 8 at 0:13
1
@MatthieuM.: you can look on Godbolt (and note that-march=
implies-mtune
).gcc -march=skylake -O3
still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inliningstrlen
at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2pcmpeqb / pmovmskb / test
/jz
, thenbsf
, without risking page faults for crossing into the next page. (We knowcalloc
gave us 16-byte aligned memory becausealignof(maxalign_t) = 16
in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.
– Peter Cordes
Apr 8 at 19:20
1
This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)
– Peter Cordes
Apr 8 at 19:23
|
show 15 more comments
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0
or without optimisations, the generated code calls the C library functionstrlen
- at
-O1
the generated code uses a simple inline expansion using arep scasb
instruction. - at
-O2
and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1
code is much slower than the C library implementation:32240
vs3090
- the
-O2
code is faster than the-O1
but still substantially slower than the C ibrary code:8570
vs3090
.
This behavior is specific to gcc
and the glibc
. The same test on OS/X with clang
and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang
generates a call to the C library strlen
at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen
has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I improved this benchmark and tested various string lengths. It appears from the benchmarks on linux with gcc (Debian 4.7.2-5) 4.7.2 running on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz that the inline code generated by -O1
is always slower, by as much as a factor of 10 for moderately long strings, while -O2
is only slightly faster than the libc strlen for very short strings and half as fast for longer strings. From this data, the GNU C library version of strlen
is quite efficient for most string lengths, at least on my specific hardware. Also keeping in mind that cacheing has a major impact on benchmark measurements.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
2
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
Apr 7 at 22:00
1
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. Ifstrlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.
– Brendan
Apr 8 at 0:08
4
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was astrlen_small()
and a separatestrlen_large()
, but there isn't.
– Brendan
Apr 8 at 0:13
1
@MatthieuM.: you can look on Godbolt (and note that-march=
implies-mtune
).gcc -march=skylake -O3
still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inliningstrlen
at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2pcmpeqb / pmovmskb / test
/jz
, thenbsf
, without risking page faults for crossing into the next page. (We knowcalloc
gave us 16-byte aligned memory becausealignof(maxalign_t) = 16
in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.
– Peter Cordes
Apr 8 at 19:20
1
This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)
– Peter Cordes
Apr 8 at 19:23
|
show 15 more comments
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0
or without optimisations, the generated code calls the C library functionstrlen
- at
-O1
the generated code uses a simple inline expansion using arep scasb
instruction. - at
-O2
and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1
code is much slower than the C library implementation:32240
vs3090
- the
-O2
code is faster than the-O1
but still substantially slower than the C ibrary code:8570
vs3090
.
This behavior is specific to gcc
and the glibc
. The same test on OS/X with clang
and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang
generates a call to the C library strlen
at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen
has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I improved this benchmark and tested various string lengths. It appears from the benchmarks on linux with gcc (Debian 4.7.2-5) 4.7.2 running on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz that the inline code generated by -O1
is always slower, by as much as a factor of 10 for moderately long strings, while -O2
is only slightly faster than the libc strlen for very short strings and half as fast for longer strings. From this data, the GNU C library version of strlen
is quite efficient for most string lengths, at least on my specific hardware. Also keeping in mind that cacheing has a major impact on benchmark measurements.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0
or without optimisations, the generated code calls the C library functionstrlen
- at
-O1
the generated code uses a simple inline expansion using arep scasb
instruction. - at
-O2
and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1
code is much slower than the C library implementation:32240
vs3090
- the
-O2
code is faster than the-O1
but still substantially slower than the C ibrary code:8570
vs3090
.
This behavior is specific to gcc
and the glibc
. The same test on OS/X with clang
and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang
generates a call to the C library strlen
at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen
has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I improved this benchmark and tested various string lengths. It appears from the benchmarks on linux with gcc (Debian 4.7.2-5) 4.7.2 running on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz that the inline code generated by -O1
is always slower, by as much as a factor of 10 for moderately long strings, while -O2
is only slightly faster than the libc strlen for very short strings and half as fast for longer strings. From this data, the GNU C library version of strlen
is quite efficient for most string lengths, at least on my specific hardware. Also keeping in mind that cacheing has a major impact on benchmark measurements.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
edited Apr 13 at 6:08
answered Apr 7 at 21:42
chqrliechqrlie
64.3k852108
64.3k852108
2
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
Apr 7 at 22:00
1
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. Ifstrlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.
– Brendan
Apr 8 at 0:08
4
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was astrlen_small()
and a separatestrlen_large()
, but there isn't.
– Brendan
Apr 8 at 0:13
1
@MatthieuM.: you can look on Godbolt (and note that-march=
implies-mtune
).gcc -march=skylake -O3
still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inliningstrlen
at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2pcmpeqb / pmovmskb / test
/jz
, thenbsf
, without risking page faults for crossing into the next page. (We knowcalloc
gave us 16-byte aligned memory becausealignof(maxalign_t) = 16
in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.
– Peter Cordes
Apr 8 at 19:20
1
This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)
– Peter Cordes
Apr 8 at 19:23
|
show 15 more comments
2
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
Apr 7 at 22:00
1
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. Ifstrlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.
– Brendan
Apr 8 at 0:08
4
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was astrlen_small()
and a separatestrlen_large()
, but there isn't.
– Brendan
Apr 8 at 0:13
1
@MatthieuM.: you can look on Godbolt (and note that-march=
implies-mtune
).gcc -march=skylake -O3
still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inliningstrlen
at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2pcmpeqb / pmovmskb / test
/jz
, thenbsf
, without risking page faults for crossing into the next page. (We knowcalloc
gave us 16-byte aligned memory becausealignof(maxalign_t) = 16
in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.
– Peter Cordes
Apr 8 at 19:20
1
This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)
– Peter Cordes
Apr 8 at 19:23
2
2
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in
<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.– chqrlie
Apr 7 at 22:00
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in
<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.– chqrlie
Apr 7 at 22:00
1
1
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. If
strlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.– Brendan
Apr 8 at 0:08
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. If
strlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.– Brendan
Apr 8 at 0:08
4
4
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was a
strlen_small()
and a separate strlen_large()
, but there isn't.– Brendan
Apr 8 at 0:13
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was a
strlen_small()
and a separate strlen_large()
, but there isn't.– Brendan
Apr 8 at 0:13
1
1
@MatthieuM.: you can look on Godbolt (and note that
-march=
implies -mtune
). gcc -march=skylake -O3
still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inlining strlen
at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2 pcmpeqb / pmovmskb / test
/jz
, then bsf
, without risking page faults for crossing into the next page. (We know calloc
gave us 16-byte aligned memory because alignof(maxalign_t) = 16
in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.– Peter Cordes
Apr 8 at 19:20
@MatthieuM.: you can look on Godbolt (and note that
-march=
implies -mtune
). gcc -march=skylake -O3
still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inlining strlen
at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2 pcmpeqb / pmovmskb / test
/jz
, then bsf
, without risking page faults for crossing into the next page. (We know calloc
gave us 16-byte aligned memory because alignof(maxalign_t) = 16
in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.– Peter Cordes
Apr 8 at 19:20
1
1
This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)
– Peter Cordes
Apr 8 at 19:23
This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)
– Peter Cordes
Apr 8 at 19:23
|
show 15 more comments
GCC's inline strlen
patterns are much slower than what it could do with SSE2 pcmpeqb
/ pmovmskb
, and bsf
, given the 16-byte alignment from calloc
. This "optimization" is actually a pessimization.
My simple hand-written loop that takes advantage of 16-byte alignment is 5x faster than what gcc -O3
inlines for large buffers, and ~2x faster for short strings. (And faster than calling strlen for short strings). I've added a comment to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809 to propose this for what gcc should inline at -O2 / -O3 when it's able. (With a suggestion for ramping up to 16-byte if we only know 4-byte alignment to start with.)
When gcc knows it has 4-byte alignment for the buffer (guaranteed by calloc
), it chooses to inline strlen
as a 4-byte-at-a-time scalar bithack using GP integer registers (-O2
and higher).
(Reading 4 bytes at a time is only safe if we know we can't cross into a page that doesn't contain any string bytes, and thus might be unmapped. Is it safe to read past the end of a buffer within the same page on x86 and x64? (TL:DR yes, in asm it is, so compilers can emit code that does that even if doing so in the C source is UB. libc strlen
implementations also take advantage of that. See my answer there for links to glibc strlen
and a summary of how it runs so fast for large strings.)
At -O1
, gcc always (even without known alignment) chooses to inline strlen
as repnz scasb
, which is very slow (about 1 byte per clock cycle on modern Intel CPUs). "Fast strings" only applies to rep stos
and rep movs
, not the repz
/repnz
instructions, unfortunately. Their microcode is just simple 1 byte at a time, but they still have some startup overhead. (https://agner.org/optimize/)
(We can test this by "hiding" the pointer from the compiler by storing / reloading s
to a volatile void *tmp
, for example. gcc has to make zero assumptions about the pointer value that's read back from a volatile
, destroying any alignment info.)
GCC does have some x86 tuning options like -mstringop-strategy=libcall
vs. unrolled_loop
vs. rep_byte
for inlining string operations in general (not just strlen; memcmp
would be another major one that can be done with rep or a loop). I haven't checked what effect these have here.
The docs for another option also describe the current behaviour. We could get this inlining (with extra code for alignment-handling) even in cases where we wanted it on unaligned pointers. (This used to be an actual perf win, especially for small strings, on targets where the inline loop wasn't garbage compared to what the machine can do.)
-minline-all-stringops
By default GCC inlines string operations only when the destination is known to be aligned to least a 4-byte boundary. This enables more inlining and increases code size, but may improve performance of code that depends on fast memcpy, strlen, and memset for short lengths.
GCC also has per-function attributes you can apparently use to control this, like __attribute__((no-inline-all-stringops)) void foo() ...
, but I haven't played around with it. (That's the opposite of inline-all. It doesn't mean inline none, it just goes back to only inlining when 4-byte alignment is known.)
Both of gcc's inline strlen
strategies fail to take advantage of 16-byte alignment, and are pretty bad for x86-64
Unless the small-string case is very common, doing one 4-byte chunk, then aligned 8-byte chunks would go about twice as fast as 4-byte.
And the 4-byte strategy has much slower cleanup than necessary for finding the byte within the dword containing the zero byte. It detects this by looking for a byte with its high bit set, so it should just mask off the other bits and use bsf
(bit-scan forward). That has 3 cycle latency on modern CPUs (Intel and Ryzen). Or compilers can use rep bsf
so it runs as tzcnt
on CPUs that support BMI1, which is more efficient on AMD. bsf
and tzcnt
give the same result for non-zero inputs.
GCC's 4-byte loop looks like it's compiled from pure C, or some target-independent logic, not taking advantage of bitscan. gcc does use andn
to optimize it when compiling for x86 with BMI1, but it's still less than 4 bytes per cycle.
SSE2 pcmpeqb
+ bsf
is much much better for both short and long inputs. x86-64 guarantees that SSE2 is available, and the x86-64 System V has alignof(maxalign_t) = 16
so calloc
will always return pointers that are at least 16-byte aligned.
I wrote a replacement for the strlen
block to test performance
As expected it's about 4x faster on Skylake going 16 bytes at a time instead of 4.
(I compiled the original source to asm with -O3
, then edited the asm to see what performance should have been with this strategy for inline expansion of strlen
. I also ported it to inline asm inside the C source; see that version on Godbolt.)
# at this point gcc has `s` in RDX, `i` in ECX
pxor %xmm0, %xmm0 # zeroed vector to compare against
.p2align 4
.Lstrlen16: # do
#ifdef __AVX__
vpcmpeqb (%rdx), %xmm0, %xmm1
#else
movdqa (%rdx), %xmm1
pcmpeqb %xmm0, %xmm1 # xmm1 = -1 where there was a 0 in memory
#endif
add $16, %rdx # ptr++
pmovmskb %xmm1, %eax # extract high bit of each byte to a 16-bit mask
test %eax, %eax
jz .Lstrlen16 # while(mask==0);
# RDX points at the 16-byte chunk *after* the one containing the terminator
# EAX = bit-mask of the 0 bytes, and is known to be non-zero
bsf %eax, %eax # EAX = bit-index of the lowest set bit
movb $'A', -16(%rdx, %rax)
Note that I optimized part of the strlen cleanup into the store addressing mode: I correct for the overshoot with the -16
displacement, and that this is just finding the end of the string, not actually calculating the length and then indexing like GCC was already doing after inlining its 4-byte-at-a-time loop.
To get actual string length (instead of pointer to the end), you'd subtract rdx-start and then add rax-16
(maybe with an LEA to add 2 registers + a constant, but 3-component LEA has more latency.)
With AVX to allow load+compare in one instruction without destroying the zeroed register, the whole loop is only 4 uops, down from 5. (test/jz macro-fuses into one uop on both Intel and AMD. vpcmpeqb
with a non-indexed memory-source can keep it micro-fused through the whole pipeline, so it's only 1 fused-domain uop for the front-end.)
(Note that mixing 128-bit AVX with SSE does not cause stalls even on Haswell, as long as you're in clean-upper state to start with. So I didn't bother about changing the other instructions to AVX, only the one that mattered. There seemed to be some minor effect where pxor
was actually slightly better than vpxor
on my desktop, though, for an AVX loop body. It seemed somewhat repeatable, but it's weird because there's no code-size difference and thus no alignment difference.)
pmovmskb
is a single-uop instruction. It has 3-cycle latency on Intel and Ryzen (worse on Bulldozer-family). For short strings, the trip through the SIMD unit and back to integer is an important part of the critical path dependency chain for latency from input memory bytes to store-address being ready. But only SIMD has packed-integer compares, so scalar would have to do more work.
For the very-small string case (like 0 to 3 bytes), it might be possible to achieve slightly lower latency for that case by using pure scalar (especially on Bulldozer-family), but having all strings from 0 to 15 bytes take the same branch path (loop branch never taken) is very nice for most short-strings use-cases.
Being very good for all strings up to 15 bytes seems like a good choice, when we know we have 16-byte alignment. More predictable branching is very good. (And note that when looping, pmovmskb
latency only affects how quickly we can detect branch mispredicts to break out of the loop; branch prediction + speculative execution hides the latency of the independent pmovmskb in each iteration.
If we expected longer strings to be common, we could unroll a bit, but at that point you should just call the libc function so it can dispatch to AVX2 if available at runtime. Unrolling to more than 1 vector complicates the cleanup, hurting the simple cases.
On my machine i7-6700k Skylake at 4.2GHz max turbo (and energy_performance_preference
= performance), with gcc8.2 on Arch Linux, I get somewhat consistent benchmark timing because my CPU clock speed ramps up during the memset. But maybe not always to max turbo; Skylake's hw power management downclocks when memory-bound. perf stat
showed I typically got right around 4.0GHz when running this to average the stdout output and see perf summary on stderr.
perf stat -r 100 ./a.out | awk 'sum+= $1 ENDprint sum/100;'
I ended up copying my asm into a GNU C inline-asm statement, so I could put the code on the Godbolt compiler explorer.
For large strings, same length as in the question: times on ~4GHz Skylake
- ~62100
clock_t
time units:-O1
rep scas: (clock()
is a bit obsolete, but I didn't bother changing it.) - ~15900
clock_t
time units:-O3
gcc 4-byte loop strategy: avg of 100 runs = . (Or maybe ~15800 with-march=native
forandn
) - ~1880
clock_t
time units:-O3
with glibcstrlen
function calls, using AVX2 - ~3190
clock_t
time units: (AVX1 128-bit vectors, 4 uop loop) hand-written inline asm that gcc could/should inline. - ~3230
clock_t
time units: (SSE2 5 uop loop) hand-written inline asm that gcc could/should inline.
My hand-written asm should be very good for short strings, too, because it doesn't need to branch specially. Known alignment is very good for strlen, and libc can't take advantage of it.
If we expect large strings to be rare, 1.7x slower than libc for that case. The length of 1M bytes means it won't be staying hot in L2 (256k) or L1d cache (32k) on my CPU, so even bottlenecked on L3 cache the libc version was faster. (Probably an unrolled loop and 256-bit vectors doesn't clog up the ROB with as many uops per byte, so OoO exec can see farther ahead and get more memory parallelism, especially at page boundaries.)
But L3 cache bandwidth is probably a bottleneck stopping the 4-uop version from running at 1 iteration per clock, so we're seeing less benefit from AVX saving us a uop in the loop. With data hot in L1d cache, we should get 1.25 cycles per iteration vs. 1.
But a good AVX2 implementation can read up to 64 bytes per cycle (2x 32 byte loads) using vpminub
to combine pairs before checking for zeros and going back to find where they were. The gap between this and libc opens wider for sizes of ~2k to ~30 kiB or so that stay hot in L1d.
Some read-only testing with length=1000 indicates that glibc strlen
really is about 4x faster than my loop for medium size strings hot in L1d cache. That's large enough for AVX2 to ramp up to the big unrolled loop, but still easily fits in L1d cache. (Read-only avoid store-forwarding stalls, and so we can do many iterations)
If your strings are that big, you should be using explicit-length strings instead of needing to strlen
at all, so inlining a simple loop still seems like a reasonable strategy, as long as it's actually good for short strings and not total garbage for medium (like 300 bytes) and very long (> cache size) strings.
Benchmarking small strings with this:
I ran into some oddities in trying to get the results I expected:
I tried s[31] = 0
to truncate the string before every iteration (allowing short constant length). But then my SSE2 version was almost the same speed as GCC's version. Store-forwarding stalls were the bottleneck! A byte store followed by a wider load makes store-forwarding take the slow path that merges bytes from the store buffer with bytes from L1d cache. This extra latency is part of a loop-carried dep chain through the last 4-byte or 16-byte chunk of the string, to calculate the store index for the next iteration.
GCC's slower 4-byte-at-a-time code could keep up by processing the earlier 4-byte chunks in the shadow of that latency. (Out-of-order execution is pretty fantastic: slow code can sometimes not affect the overall speed of your program).
I eventually solved it by making a read-only version, and using inline asm to stop the compiler from hoisting strlen
out of the loop.
But store-forwarding is a potential issue with using 16-byte loads. If other C variables are stored past the end of the array, we might hit a SF stall due to loading off the end of the array farther than with narrower stores. For recently-copied data, we're fine if it was copied with 16-byte or wider aligned stores, but glibc memcpy for small copies does 2x overlapping loads that cover the whole object, from the start and end of the object. Then it stores both, again overlapping, handling the memmove src overlaps dst case for free. So the 2nd 16-byte or 8-byte chunk of a short string that was just memcpyied might give us a SF stall for reading the last chunk. (The one that has the data dependency for the output.)
Just running slower so you don't get to the end before it's ready isn't good in general, so there's no great solution here. I think most of the time you're not going to strlen a buffer you just wrote, usually you're going to strlen
an input that you're only reading so store-forwarding stalls aren't a problem. If something else just wrote it, then efficient code hopefully wouldn't have thrown away the length and called a function that required recalculating it.
Other weirdness I haven't totally figured out:
Code alignment is making a factor of 2 difference for read-only, size=1000 (s[1000] = 0;
). But the inner-most asm loop itself is aligned with .p2align 4
or .p2align 5
. Increasing the loop alignment can slow it down by a factor of 2!
# slow version, with *no* extra HIDE_ALIGNMENT function call before the loop.
# using my hand-written asm, AVX version.
i<1280000 read-only at strlen(s)=1000 so strlen time dominates the total runtime (not startup overhead)
.p2align 5 in the asm inner loop. (32-byte code alignment with NOP padding)
gcc -DUSE_ASM -DREAD_ONLY -DHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk 'sum+= $1 ENDprint sum/100;'
Performance counter stats for './a.out' (100 runs):
40.92 msec task-clock # 0.996 CPUs utilized ( +- 0.20% )
2 context-switches # 0.052 K/sec ( +- 3.31% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.008 M/sec ( +- 0.05% )
168,103,223 cycles # 4.108 GHz ( +- 0.20% )
82,293,840 branches # 2011.269 M/sec ( +- 0.00% )
1,845,647 branch-misses # 2.24% of all branches ( +- 0.74% )
412,769,788 instructions # 2.46 insn per cycle ( +- 0.00% )
466,515,986 uops_issued.any # 11401.694 M/sec ( +- 0.22% )
487,011,558 uops_executed.thread # 11902.607 M/sec ( +- 0.13% )
0.0410624 +- 0.0000837 seconds time elapsed ( +- 0.20% )
40326.5 (clock_t)
real 0m4.301s
user 0m4.050s
sys 0m0.224s
Note branch misses definitely non-zero, vs. almost exactly zero for the fast version. And uops issued is much higher than the fast version: it may be speculating down the wrong path for a long time on each of those branch misses.
Probably the inner and outer loop-branches are aliasing each other, or not.
Instruction count is nearly identical, just different by some NOPs in the outer loop ahead of the inner loop. But IPC is vastly different: without problems, the fast version runs an average of 4.82 instructions per clock for the whole program. (Most of that is in the inner-most loop running 5 instructions per cycle, thanks to a test/jz that macro-fuses 2 instructions into 1 uop.) And note that uops_executed is much higher than uops_issued: that means micro-fusion is working well to get more uops through the front-end bottleneck.
fast version, same read-only strlen(s)=1000 repeated 1280000 times
gcc -DUSE_ASM -DREAD_ONLY -UHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk 'sum+= $1 ENDprint sum/100;'
Performance counter stats for './a.out' (100 runs):
21.06 msec task-clock # 0.994 CPUs utilized ( +- 0.10% )
1 context-switches # 0.056 K/sec ( +- 5.30% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.015 M/sec ( +- 0.04% )
86,239,943 cycles # 4.094 GHz ( +- 0.02% )
82,285,261 branches # 3906.682 M/sec ( +- 0.00% )
17,645 branch-misses # 0.02% of all branches ( +- 0.15% )
415,286,425 instructions # 4.82 insn per cycle ( +- 0.00% )
335,057,379 uops_issued.any # 15907.619 M/sec ( +- 0.00% )
409,255,762 uops_executed.thread # 19430.358 M/sec ( +- 0.00% )
0.0211944 +- 0.0000221 seconds time elapsed ( +- 0.10% )
20504 (clock_t)
real 0m2.309s
user 0m2.085s
sys 0m0.203s
I think it's just the branch prediction, not other front-end stuff that's a problem. The test/branch instructions aren't getting split across a boundary that would prevent macro-fusion.
Changing .p2align 5
to .p2align 4
reverses them: -UHIDE_ALIGNMENT
becomes slow.
This Godbolt binary link reproduces the same padding I'm seeing with gcc8.2.1 on Arch Linux for both cases: 2x 11-byte nopw
+ a 3-byte nop
inside the outer loop for the fast case. It also has the exact source I was using locally.
short strlen read-only micro-benchmarks:
Tested with stuff chosen so it doesn't suffer from branch mispredicts or store-forwarding, and can test the same short length repeatedly for enough iterations to get meaningful data.
strlen=33
, so the terminator is near the start of the 3rd 16-byte vector. (Makes my version look as bad as possible vs. the 4-byte version.) -DREAD_ONLY
, and i<1280000
as an outer-loop repeat loop.
1933 clock_t: my asm: nice and consistent best-case time (not noisy / bouncing around when re-running the average.) Equal perf with/without-DHIDE_ALIGNMENT
, unlike for the longer strlen. The loop branch is much more easily predictable with that much shorter pattern. (strlen=33, not 1000).
3220 clock_t: gcc -O3strlen
. (-DHIDE_ALIGNMENT
)- 6100 clock_t: gcc -O3 4-byte loop
- 37200 clock_t: gcc -O1 repz scasb
So for short strings, my simple inline loop beats a library function call to strlen
that has to go through the PLT (call + jmp [mem]
), then run strlen's startup overhead that can't depend on alignment.
There were negligible branch-mispredicts, like 0.05% for all the versions with strlen(s)=33
. The repz scasb version had 0.46%, but that's out of fewer total branches. No inner loop to rack up many correctly predicted branches.
With branch predictors and code-cache hot, repz scasb
is more than 10x worse than calling glibc strlen
for a 33-byte string. It would be less bad in real use cases where strlen
could branch miss or even miss in code-cache and stall, but straight-line repz scasb
wouldn't. But 10x is huge, and that's for a fairly short string.
add a comment |
GCC's inline strlen
patterns are much slower than what it could do with SSE2 pcmpeqb
/ pmovmskb
, and bsf
, given the 16-byte alignment from calloc
. This "optimization" is actually a pessimization.
My simple hand-written loop that takes advantage of 16-byte alignment is 5x faster than what gcc -O3
inlines for large buffers, and ~2x faster for short strings. (And faster than calling strlen for short strings). I've added a comment to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809 to propose this for what gcc should inline at -O2 / -O3 when it's able. (With a suggestion for ramping up to 16-byte if we only know 4-byte alignment to start with.)
When gcc knows it has 4-byte alignment for the buffer (guaranteed by calloc
), it chooses to inline strlen
as a 4-byte-at-a-time scalar bithack using GP integer registers (-O2
and higher).
(Reading 4 bytes at a time is only safe if we know we can't cross into a page that doesn't contain any string bytes, and thus might be unmapped. Is it safe to read past the end of a buffer within the same page on x86 and x64? (TL:DR yes, in asm it is, so compilers can emit code that does that even if doing so in the C source is UB. libc strlen
implementations also take advantage of that. See my answer there for links to glibc strlen
and a summary of how it runs so fast for large strings.)
At -O1
, gcc always (even without known alignment) chooses to inline strlen
as repnz scasb
, which is very slow (about 1 byte per clock cycle on modern Intel CPUs). "Fast strings" only applies to rep stos
and rep movs
, not the repz
/repnz
instructions, unfortunately. Their microcode is just simple 1 byte at a time, but they still have some startup overhead. (https://agner.org/optimize/)
(We can test this by "hiding" the pointer from the compiler by storing / reloading s
to a volatile void *tmp
, for example. gcc has to make zero assumptions about the pointer value that's read back from a volatile
, destroying any alignment info.)
GCC does have some x86 tuning options like -mstringop-strategy=libcall
vs. unrolled_loop
vs. rep_byte
for inlining string operations in general (not just strlen; memcmp
would be another major one that can be done with rep or a loop). I haven't checked what effect these have here.
The docs for another option also describe the current behaviour. We could get this inlining (with extra code for alignment-handling) even in cases where we wanted it on unaligned pointers. (This used to be an actual perf win, especially for small strings, on targets where the inline loop wasn't garbage compared to what the machine can do.)
-minline-all-stringops
By default GCC inlines string operations only when the destination is known to be aligned to least a 4-byte boundary. This enables more inlining and increases code size, but may improve performance of code that depends on fast memcpy, strlen, and memset for short lengths.
GCC also has per-function attributes you can apparently use to control this, like __attribute__((no-inline-all-stringops)) void foo() ...
, but I haven't played around with it. (That's the opposite of inline-all. It doesn't mean inline none, it just goes back to only inlining when 4-byte alignment is known.)
Both of gcc's inline strlen
strategies fail to take advantage of 16-byte alignment, and are pretty bad for x86-64
Unless the small-string case is very common, doing one 4-byte chunk, then aligned 8-byte chunks would go about twice as fast as 4-byte.
And the 4-byte strategy has much slower cleanup than necessary for finding the byte within the dword containing the zero byte. It detects this by looking for a byte with its high bit set, so it should just mask off the other bits and use bsf
(bit-scan forward). That has 3 cycle latency on modern CPUs (Intel and Ryzen). Or compilers can use rep bsf
so it runs as tzcnt
on CPUs that support BMI1, which is more efficient on AMD. bsf
and tzcnt
give the same result for non-zero inputs.
GCC's 4-byte loop looks like it's compiled from pure C, or some target-independent logic, not taking advantage of bitscan. gcc does use andn
to optimize it when compiling for x86 with BMI1, but it's still less than 4 bytes per cycle.
SSE2 pcmpeqb
+ bsf
is much much better for both short and long inputs. x86-64 guarantees that SSE2 is available, and the x86-64 System V has alignof(maxalign_t) = 16
so calloc
will always return pointers that are at least 16-byte aligned.
I wrote a replacement for the strlen
block to test performance
As expected it's about 4x faster on Skylake going 16 bytes at a time instead of 4.
(I compiled the original source to asm with -O3
, then edited the asm to see what performance should have been with this strategy for inline expansion of strlen
. I also ported it to inline asm inside the C source; see that version on Godbolt.)
# at this point gcc has `s` in RDX, `i` in ECX
pxor %xmm0, %xmm0 # zeroed vector to compare against
.p2align 4
.Lstrlen16: # do
#ifdef __AVX__
vpcmpeqb (%rdx), %xmm0, %xmm1
#else
movdqa (%rdx), %xmm1
pcmpeqb %xmm0, %xmm1 # xmm1 = -1 where there was a 0 in memory
#endif
add $16, %rdx # ptr++
pmovmskb %xmm1, %eax # extract high bit of each byte to a 16-bit mask
test %eax, %eax
jz .Lstrlen16 # while(mask==0);
# RDX points at the 16-byte chunk *after* the one containing the terminator
# EAX = bit-mask of the 0 bytes, and is known to be non-zero
bsf %eax, %eax # EAX = bit-index of the lowest set bit
movb $'A', -16(%rdx, %rax)
Note that I optimized part of the strlen cleanup into the store addressing mode: I correct for the overshoot with the -16
displacement, and that this is just finding the end of the string, not actually calculating the length and then indexing like GCC was already doing after inlining its 4-byte-at-a-time loop.
To get actual string length (instead of pointer to the end), you'd subtract rdx-start and then add rax-16
(maybe with an LEA to add 2 registers + a constant, but 3-component LEA has more latency.)
With AVX to allow load+compare in one instruction without destroying the zeroed register, the whole loop is only 4 uops, down from 5. (test/jz macro-fuses into one uop on both Intel and AMD. vpcmpeqb
with a non-indexed memory-source can keep it micro-fused through the whole pipeline, so it's only 1 fused-domain uop for the front-end.)
(Note that mixing 128-bit AVX with SSE does not cause stalls even on Haswell, as long as you're in clean-upper state to start with. So I didn't bother about changing the other instructions to AVX, only the one that mattered. There seemed to be some minor effect where pxor
was actually slightly better than vpxor
on my desktop, though, for an AVX loop body. It seemed somewhat repeatable, but it's weird because there's no code-size difference and thus no alignment difference.)
pmovmskb
is a single-uop instruction. It has 3-cycle latency on Intel and Ryzen (worse on Bulldozer-family). For short strings, the trip through the SIMD unit and back to integer is an important part of the critical path dependency chain for latency from input memory bytes to store-address being ready. But only SIMD has packed-integer compares, so scalar would have to do more work.
For the very-small string case (like 0 to 3 bytes), it might be possible to achieve slightly lower latency for that case by using pure scalar (especially on Bulldozer-family), but having all strings from 0 to 15 bytes take the same branch path (loop branch never taken) is very nice for most short-strings use-cases.
Being very good for all strings up to 15 bytes seems like a good choice, when we know we have 16-byte alignment. More predictable branching is very good. (And note that when looping, pmovmskb
latency only affects how quickly we can detect branch mispredicts to break out of the loop; branch prediction + speculative execution hides the latency of the independent pmovmskb in each iteration.
If we expected longer strings to be common, we could unroll a bit, but at that point you should just call the libc function so it can dispatch to AVX2 if available at runtime. Unrolling to more than 1 vector complicates the cleanup, hurting the simple cases.
On my machine i7-6700k Skylake at 4.2GHz max turbo (and energy_performance_preference
= performance), with gcc8.2 on Arch Linux, I get somewhat consistent benchmark timing because my CPU clock speed ramps up during the memset. But maybe not always to max turbo; Skylake's hw power management downclocks when memory-bound. perf stat
showed I typically got right around 4.0GHz when running this to average the stdout output and see perf summary on stderr.
perf stat -r 100 ./a.out | awk 'sum+= $1 ENDprint sum/100;'
I ended up copying my asm into a GNU C inline-asm statement, so I could put the code on the Godbolt compiler explorer.
For large strings, same length as in the question: times on ~4GHz Skylake
- ~62100
clock_t
time units:-O1
rep scas: (clock()
is a bit obsolete, but I didn't bother changing it.) - ~15900
clock_t
time units:-O3
gcc 4-byte loop strategy: avg of 100 runs = . (Or maybe ~15800 with-march=native
forandn
) - ~1880
clock_t
time units:-O3
with glibcstrlen
function calls, using AVX2 - ~3190
clock_t
time units: (AVX1 128-bit vectors, 4 uop loop) hand-written inline asm that gcc could/should inline. - ~3230
clock_t
time units: (SSE2 5 uop loop) hand-written inline asm that gcc could/should inline.
My hand-written asm should be very good for short strings, too, because it doesn't need to branch specially. Known alignment is very good for strlen, and libc can't take advantage of it.
If we expect large strings to be rare, 1.7x slower than libc for that case. The length of 1M bytes means it won't be staying hot in L2 (256k) or L1d cache (32k) on my CPU, so even bottlenecked on L3 cache the libc version was faster. (Probably an unrolled loop and 256-bit vectors doesn't clog up the ROB with as many uops per byte, so OoO exec can see farther ahead and get more memory parallelism, especially at page boundaries.)
But L3 cache bandwidth is probably a bottleneck stopping the 4-uop version from running at 1 iteration per clock, so we're seeing less benefit from AVX saving us a uop in the loop. With data hot in L1d cache, we should get 1.25 cycles per iteration vs. 1.
But a good AVX2 implementation can read up to 64 bytes per cycle (2x 32 byte loads) using vpminub
to combine pairs before checking for zeros and going back to find where they were. The gap between this and libc opens wider for sizes of ~2k to ~30 kiB or so that stay hot in L1d.
Some read-only testing with length=1000 indicates that glibc strlen
really is about 4x faster than my loop for medium size strings hot in L1d cache. That's large enough for AVX2 to ramp up to the big unrolled loop, but still easily fits in L1d cache. (Read-only avoid store-forwarding stalls, and so we can do many iterations)
If your strings are that big, you should be using explicit-length strings instead of needing to strlen
at all, so inlining a simple loop still seems like a reasonable strategy, as long as it's actually good for short strings and not total garbage for medium (like 300 bytes) and very long (> cache size) strings.
Benchmarking small strings with this:
I ran into some oddities in trying to get the results I expected:
I tried s[31] = 0
to truncate the string before every iteration (allowing short constant length). But then my SSE2 version was almost the same speed as GCC's version. Store-forwarding stalls were the bottleneck! A byte store followed by a wider load makes store-forwarding take the slow path that merges bytes from the store buffer with bytes from L1d cache. This extra latency is part of a loop-carried dep chain through the last 4-byte or 16-byte chunk of the string, to calculate the store index for the next iteration.
GCC's slower 4-byte-at-a-time code could keep up by processing the earlier 4-byte chunks in the shadow of that latency. (Out-of-order execution is pretty fantastic: slow code can sometimes not affect the overall speed of your program).
I eventually solved it by making a read-only version, and using inline asm to stop the compiler from hoisting strlen
out of the loop.
But store-forwarding is a potential issue with using 16-byte loads. If other C variables are stored past the end of the array, we might hit a SF stall due to loading off the end of the array farther than with narrower stores. For recently-copied data, we're fine if it was copied with 16-byte or wider aligned stores, but glibc memcpy for small copies does 2x overlapping loads that cover the whole object, from the start and end of the object. Then it stores both, again overlapping, handling the memmove src overlaps dst case for free. So the 2nd 16-byte or 8-byte chunk of a short string that was just memcpyied might give us a SF stall for reading the last chunk. (The one that has the data dependency for the output.)
Just running slower so you don't get to the end before it's ready isn't good in general, so there's no great solution here. I think most of the time you're not going to strlen a buffer you just wrote, usually you're going to strlen
an input that you're only reading so store-forwarding stalls aren't a problem. If something else just wrote it, then efficient code hopefully wouldn't have thrown away the length and called a function that required recalculating it.
Other weirdness I haven't totally figured out:
Code alignment is making a factor of 2 difference for read-only, size=1000 (s[1000] = 0;
). But the inner-most asm loop itself is aligned with .p2align 4
or .p2align 5
. Increasing the loop alignment can slow it down by a factor of 2!
# slow version, with *no* extra HIDE_ALIGNMENT function call before the loop.
# using my hand-written asm, AVX version.
i<1280000 read-only at strlen(s)=1000 so strlen time dominates the total runtime (not startup overhead)
.p2align 5 in the asm inner loop. (32-byte code alignment with NOP padding)
gcc -DUSE_ASM -DREAD_ONLY -DHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk 'sum+= $1 ENDprint sum/100;'
Performance counter stats for './a.out' (100 runs):
40.92 msec task-clock # 0.996 CPUs utilized ( +- 0.20% )
2 context-switches # 0.052 K/sec ( +- 3.31% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.008 M/sec ( +- 0.05% )
168,103,223 cycles # 4.108 GHz ( +- 0.20% )
82,293,840 branches # 2011.269 M/sec ( +- 0.00% )
1,845,647 branch-misses # 2.24% of all branches ( +- 0.74% )
412,769,788 instructions # 2.46 insn per cycle ( +- 0.00% )
466,515,986 uops_issued.any # 11401.694 M/sec ( +- 0.22% )
487,011,558 uops_executed.thread # 11902.607 M/sec ( +- 0.13% )
0.0410624 +- 0.0000837 seconds time elapsed ( +- 0.20% )
40326.5 (clock_t)
real 0m4.301s
user 0m4.050s
sys 0m0.224s
Note branch misses definitely non-zero, vs. almost exactly zero for the fast version. And uops issued is much higher than the fast version: it may be speculating down the wrong path for a long time on each of those branch misses.
Probably the inner and outer loop-branches are aliasing each other, or not.
Instruction count is nearly identical, just different by some NOPs in the outer loop ahead of the inner loop. But IPC is vastly different: without problems, the fast version runs an average of 4.82 instructions per clock for the whole program. (Most of that is in the inner-most loop running 5 instructions per cycle, thanks to a test/jz that macro-fuses 2 instructions into 1 uop.) And note that uops_executed is much higher than uops_issued: that means micro-fusion is working well to get more uops through the front-end bottleneck.
fast version, same read-only strlen(s)=1000 repeated 1280000 times
gcc -DUSE_ASM -DREAD_ONLY -UHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk 'sum+= $1 ENDprint sum/100;'
Performance counter stats for './a.out' (100 runs):
21.06 msec task-clock # 0.994 CPUs utilized ( +- 0.10% )
1 context-switches # 0.056 K/sec ( +- 5.30% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.015 M/sec ( +- 0.04% )
86,239,943 cycles # 4.094 GHz ( +- 0.02% )
82,285,261 branches # 3906.682 M/sec ( +- 0.00% )
17,645 branch-misses # 0.02% of all branches ( +- 0.15% )
415,286,425 instructions # 4.82 insn per cycle ( +- 0.00% )
335,057,379 uops_issued.any # 15907.619 M/sec ( +- 0.00% )
409,255,762 uops_executed.thread # 19430.358 M/sec ( +- 0.00% )
0.0211944 +- 0.0000221 seconds time elapsed ( +- 0.10% )
20504 (clock_t)
real 0m2.309s
user 0m2.085s
sys 0m0.203s
I think it's just the branch prediction, not other front-end stuff that's a problem. The test/branch instructions aren't getting split across a boundary that would prevent macro-fusion.
Changing .p2align 5
to .p2align 4
reverses them: -UHIDE_ALIGNMENT
becomes slow.
This Godbolt binary link reproduces the same padding I'm seeing with gcc8.2.1 on Arch Linux for both cases: 2x 11-byte nopw
+ a 3-byte nop
inside the outer loop for the fast case. It also has the exact source I was using locally.
short strlen read-only micro-benchmarks:
Tested with stuff chosen so it doesn't suffer from branch mispredicts or store-forwarding, and can test the same short length repeatedly for enough iterations to get meaningful data.
strlen=33
, so the terminator is near the start of the 3rd 16-byte vector. (Makes my version look as bad as possible vs. the 4-byte version.) -DREAD_ONLY
, and i<1280000
as an outer-loop repeat loop.
1933 clock_t: my asm: nice and consistent best-case time (not noisy / bouncing around when re-running the average.) Equal perf with/without-DHIDE_ALIGNMENT
, unlike for the longer strlen. The loop branch is much more easily predictable with that much shorter pattern. (strlen=33, not 1000).
3220 clock_t: gcc -O3strlen
. (-DHIDE_ALIGNMENT
)- 6100 clock_t: gcc -O3 4-byte loop
- 37200 clock_t: gcc -O1 repz scasb
So for short strings, my simple inline loop beats a library function call to strlen
that has to go through the PLT (call + jmp [mem]
), then run strlen's startup overhead that can't depend on alignment.
There were negligible branch-mispredicts, like 0.05% for all the versions with strlen(s)=33
. The repz scasb version had 0.46%, but that's out of fewer total branches. No inner loop to rack up many correctly predicted branches.
With branch predictors and code-cache hot, repz scasb
is more than 10x worse than calling glibc strlen
for a 33-byte string. It would be less bad in real use cases where strlen
could branch miss or even miss in code-cache and stall, but straight-line repz scasb
wouldn't. But 10x is huge, and that's for a fairly short string.
add a comment |
GCC's inline strlen
patterns are much slower than what it could do with SSE2 pcmpeqb
/ pmovmskb
, and bsf
, given the 16-byte alignment from calloc
. This "optimization" is actually a pessimization.
My simple hand-written loop that takes advantage of 16-byte alignment is 5x faster than what gcc -O3
inlines for large buffers, and ~2x faster for short strings. (And faster than calling strlen for short strings). I've added a comment to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809 to propose this for what gcc should inline at -O2 / -O3 when it's able. (With a suggestion for ramping up to 16-byte if we only know 4-byte alignment to start with.)
When gcc knows it has 4-byte alignment for the buffer (guaranteed by calloc
), it chooses to inline strlen
as a 4-byte-at-a-time scalar bithack using GP integer registers (-O2
and higher).
(Reading 4 bytes at a time is only safe if we know we can't cross into a page that doesn't contain any string bytes, and thus might be unmapped. Is it safe to read past the end of a buffer within the same page on x86 and x64? (TL:DR yes, in asm it is, so compilers can emit code that does that even if doing so in the C source is UB. libc strlen
implementations also take advantage of that. See my answer there for links to glibc strlen
and a summary of how it runs so fast for large strings.)
At -O1
, gcc always (even without known alignment) chooses to inline strlen
as repnz scasb
, which is very slow (about 1 byte per clock cycle on modern Intel CPUs). "Fast strings" only applies to rep stos
and rep movs
, not the repz
/repnz
instructions, unfortunately. Their microcode is just simple 1 byte at a time, but they still have some startup overhead. (https://agner.org/optimize/)
(We can test this by "hiding" the pointer from the compiler by storing / reloading s
to a volatile void *tmp
, for example. gcc has to make zero assumptions about the pointer value that's read back from a volatile
, destroying any alignment info.)
GCC does have some x86 tuning options like -mstringop-strategy=libcall
vs. unrolled_loop
vs. rep_byte
for inlining string operations in general (not just strlen; memcmp
would be another major one that can be done with rep or a loop). I haven't checked what effect these have here.
The docs for another option also describe the current behaviour. We could get this inlining (with extra code for alignment-handling) even in cases where we wanted it on unaligned pointers. (This used to be an actual perf win, especially for small strings, on targets where the inline loop wasn't garbage compared to what the machine can do.)
-minline-all-stringops
By default GCC inlines string operations only when the destination is known to be aligned to least a 4-byte boundary. This enables more inlining and increases code size, but may improve performance of code that depends on fast memcpy, strlen, and memset for short lengths.
GCC also has per-function attributes you can apparently use to control this, like __attribute__((no-inline-all-stringops)) void foo() ...
, but I haven't played around with it. (That's the opposite of inline-all. It doesn't mean inline none, it just goes back to only inlining when 4-byte alignment is known.)
Both of gcc's inline strlen
strategies fail to take advantage of 16-byte alignment, and are pretty bad for x86-64
Unless the small-string case is very common, doing one 4-byte chunk, then aligned 8-byte chunks would go about twice as fast as 4-byte.
And the 4-byte strategy has much slower cleanup than necessary for finding the byte within the dword containing the zero byte. It detects this by looking for a byte with its high bit set, so it should just mask off the other bits and use bsf
(bit-scan forward). That has 3 cycle latency on modern CPUs (Intel and Ryzen). Or compilers can use rep bsf
so it runs as tzcnt
on CPUs that support BMI1, which is more efficient on AMD. bsf
and tzcnt
give the same result for non-zero inputs.
GCC's 4-byte loop looks like it's compiled from pure C, or some target-independent logic, not taking advantage of bitscan. gcc does use andn
to optimize it when compiling for x86 with BMI1, but it's still less than 4 bytes per cycle.
SSE2 pcmpeqb
+ bsf
is much much better for both short and long inputs. x86-64 guarantees that SSE2 is available, and the x86-64 System V has alignof(maxalign_t) = 16
so calloc
will always return pointers that are at least 16-byte aligned.
I wrote a replacement for the strlen
block to test performance
As expected it's about 4x faster on Skylake going 16 bytes at a time instead of 4.
(I compiled the original source to asm with -O3
, then edited the asm to see what performance should have been with this strategy for inline expansion of strlen
. I also ported it to inline asm inside the C source; see that version on Godbolt.)
# at this point gcc has `s` in RDX, `i` in ECX
pxor %xmm0, %xmm0 # zeroed vector to compare against
.p2align 4
.Lstrlen16: # do
#ifdef __AVX__
vpcmpeqb (%rdx), %xmm0, %xmm1
#else
movdqa (%rdx), %xmm1
pcmpeqb %xmm0, %xmm1 # xmm1 = -1 where there was a 0 in memory
#endif
add $16, %rdx # ptr++
pmovmskb %xmm1, %eax # extract high bit of each byte to a 16-bit mask
test %eax, %eax
jz .Lstrlen16 # while(mask==0);
# RDX points at the 16-byte chunk *after* the one containing the terminator
# EAX = bit-mask of the 0 bytes, and is known to be non-zero
bsf %eax, %eax # EAX = bit-index of the lowest set bit
movb $'A', -16(%rdx, %rax)
Note that I optimized part of the strlen cleanup into the store addressing mode: I correct for the overshoot with the -16
displacement, and that this is just finding the end of the string, not actually calculating the length and then indexing like GCC was already doing after inlining its 4-byte-at-a-time loop.
To get actual string length (instead of pointer to the end), you'd subtract rdx-start and then add rax-16
(maybe with an LEA to add 2 registers + a constant, but 3-component LEA has more latency.)
With AVX to allow load+compare in one instruction without destroying the zeroed register, the whole loop is only 4 uops, down from 5. (test/jz macro-fuses into one uop on both Intel and AMD. vpcmpeqb
with a non-indexed memory-source can keep it micro-fused through the whole pipeline, so it's only 1 fused-domain uop for the front-end.)
(Note that mixing 128-bit AVX with SSE does not cause stalls even on Haswell, as long as you're in clean-upper state to start with. So I didn't bother about changing the other instructions to AVX, only the one that mattered. There seemed to be some minor effect where pxor
was actually slightly better than vpxor
on my desktop, though, for an AVX loop body. It seemed somewhat repeatable, but it's weird because there's no code-size difference and thus no alignment difference.)
pmovmskb
is a single-uop instruction. It has 3-cycle latency on Intel and Ryzen (worse on Bulldozer-family). For short strings, the trip through the SIMD unit and back to integer is an important part of the critical path dependency chain for latency from input memory bytes to store-address being ready. But only SIMD has packed-integer compares, so scalar would have to do more work.
For the very-small string case (like 0 to 3 bytes), it might be possible to achieve slightly lower latency for that case by using pure scalar (especially on Bulldozer-family), but having all strings from 0 to 15 bytes take the same branch path (loop branch never taken) is very nice for most short-strings use-cases.
Being very good for all strings up to 15 bytes seems like a good choice, when we know we have 16-byte alignment. More predictable branching is very good. (And note that when looping, pmovmskb
latency only affects how quickly we can detect branch mispredicts to break out of the loop; branch prediction + speculative execution hides the latency of the independent pmovmskb in each iteration.
If we expected longer strings to be common, we could unroll a bit, but at that point you should just call the libc function so it can dispatch to AVX2 if available at runtime. Unrolling to more than 1 vector complicates the cleanup, hurting the simple cases.
On my machine i7-6700k Skylake at 4.2GHz max turbo (and energy_performance_preference
= performance), with gcc8.2 on Arch Linux, I get somewhat consistent benchmark timing because my CPU clock speed ramps up during the memset. But maybe not always to max turbo; Skylake's hw power management downclocks when memory-bound. perf stat
showed I typically got right around 4.0GHz when running this to average the stdout output and see perf summary on stderr.
perf stat -r 100 ./a.out | awk 'sum+= $1 ENDprint sum/100;'
I ended up copying my asm into a GNU C inline-asm statement, so I could put the code on the Godbolt compiler explorer.
For large strings, same length as in the question: times on ~4GHz Skylake
- ~62100
clock_t
time units:-O1
rep scas: (clock()
is a bit obsolete, but I didn't bother changing it.) - ~15900
clock_t
time units:-O3
gcc 4-byte loop strategy: avg of 100 runs = . (Or maybe ~15800 with-march=native
forandn
) - ~1880
clock_t
time units:-O3
with glibcstrlen
function calls, using AVX2 - ~3190
clock_t
time units: (AVX1 128-bit vectors, 4 uop loop) hand-written inline asm that gcc could/should inline. - ~3230
clock_t
time units: (SSE2 5 uop loop) hand-written inline asm that gcc could/should inline.
My hand-written asm should be very good for short strings, too, because it doesn't need to branch specially. Known alignment is very good for strlen, and libc can't take advantage of it.
If we expect large strings to be rare, 1.7x slower than libc for that case. The length of 1M bytes means it won't be staying hot in L2 (256k) or L1d cache (32k) on my CPU, so even bottlenecked on L3 cache the libc version was faster. (Probably an unrolled loop and 256-bit vectors doesn't clog up the ROB with as many uops per byte, so OoO exec can see farther ahead and get more memory parallelism, especially at page boundaries.)
But L3 cache bandwidth is probably a bottleneck stopping the 4-uop version from running at 1 iteration per clock, so we're seeing less benefit from AVX saving us a uop in the loop. With data hot in L1d cache, we should get 1.25 cycles per iteration vs. 1.
But a good AVX2 implementation can read up to 64 bytes per cycle (2x 32 byte loads) using vpminub
to combine pairs before checking for zeros and going back to find where they were. The gap between this and libc opens wider for sizes of ~2k to ~30 kiB or so that stay hot in L1d.
Some read-only testing with length=1000 indicates that glibc strlen
really is about 4x faster than my loop for medium size strings hot in L1d cache. That's large enough for AVX2 to ramp up to the big unrolled loop, but still easily fits in L1d cache. (Read-only avoid store-forwarding stalls, and so we can do many iterations)
If your strings are that big, you should be using explicit-length strings instead of needing to strlen
at all, so inlining a simple loop still seems like a reasonable strategy, as long as it's actually good for short strings and not total garbage for medium (like 300 bytes) and very long (> cache size) strings.
Benchmarking small strings with this:
I ran into some oddities in trying to get the results I expected:
I tried s[31] = 0
to truncate the string before every iteration (allowing short constant length). But then my SSE2 version was almost the same speed as GCC's version. Store-forwarding stalls were the bottleneck! A byte store followed by a wider load makes store-forwarding take the slow path that merges bytes from the store buffer with bytes from L1d cache. This extra latency is part of a loop-carried dep chain through the last 4-byte or 16-byte chunk of the string, to calculate the store index for the next iteration.
GCC's slower 4-byte-at-a-time code could keep up by processing the earlier 4-byte chunks in the shadow of that latency. (Out-of-order execution is pretty fantastic: slow code can sometimes not affect the overall speed of your program).
I eventually solved it by making a read-only version, and using inline asm to stop the compiler from hoisting strlen
out of the loop.
But store-forwarding is a potential issue with using 16-byte loads. If other C variables are stored past the end of the array, we might hit a SF stall due to loading off the end of the array farther than with narrower stores. For recently-copied data, we're fine if it was copied with 16-byte or wider aligned stores, but glibc memcpy for small copies does 2x overlapping loads that cover the whole object, from the start and end of the object. Then it stores both, again overlapping, handling the memmove src overlaps dst case for free. So the 2nd 16-byte or 8-byte chunk of a short string that was just memcpyied might give us a SF stall for reading the last chunk. (The one that has the data dependency for the output.)
Just running slower so you don't get to the end before it's ready isn't good in general, so there's no great solution here. I think most of the time you're not going to strlen a buffer you just wrote, usually you're going to strlen
an input that you're only reading so store-forwarding stalls aren't a problem. If something else just wrote it, then efficient code hopefully wouldn't have thrown away the length and called a function that required recalculating it.
Other weirdness I haven't totally figured out:
Code alignment is making a factor of 2 difference for read-only, size=1000 (s[1000] = 0;
). But the inner-most asm loop itself is aligned with .p2align 4
or .p2align 5
. Increasing the loop alignment can slow it down by a factor of 2!
# slow version, with *no* extra HIDE_ALIGNMENT function call before the loop.
# using my hand-written asm, AVX version.
i<1280000 read-only at strlen(s)=1000 so strlen time dominates the total runtime (not startup overhead)
.p2align 5 in the asm inner loop. (32-byte code alignment with NOP padding)
gcc -DUSE_ASM -DREAD_ONLY -DHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk 'sum+= $1 ENDprint sum/100;'
Performance counter stats for './a.out' (100 runs):
40.92 msec task-clock # 0.996 CPUs utilized ( +- 0.20% )
2 context-switches # 0.052 K/sec ( +- 3.31% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.008 M/sec ( +- 0.05% )
168,103,223 cycles # 4.108 GHz ( +- 0.20% )
82,293,840 branches # 2011.269 M/sec ( +- 0.00% )
1,845,647 branch-misses # 2.24% of all branches ( +- 0.74% )
412,769,788 instructions # 2.46 insn per cycle ( +- 0.00% )
466,515,986 uops_issued.any # 11401.694 M/sec ( +- 0.22% )
487,011,558 uops_executed.thread # 11902.607 M/sec ( +- 0.13% )
0.0410624 +- 0.0000837 seconds time elapsed ( +- 0.20% )
40326.5 (clock_t)
real 0m4.301s
user 0m4.050s
sys 0m0.224s
Note branch misses definitely non-zero, vs. almost exactly zero for the fast version. And uops issued is much higher than the fast version: it may be speculating down the wrong path for a long time on each of those branch misses.
Probably the inner and outer loop-branches are aliasing each other, or not.
Instruction count is nearly identical, just different by some NOPs in the outer loop ahead of the inner loop. But IPC is vastly different: without problems, the fast version runs an average of 4.82 instructions per clock for the whole program. (Most of that is in the inner-most loop running 5 instructions per cycle, thanks to a test/jz that macro-fuses 2 instructions into 1 uop.) And note that uops_executed is much higher than uops_issued: that means micro-fusion is working well to get more uops through the front-end bottleneck.
fast version, same read-only strlen(s)=1000 repeated 1280000 times
gcc -DUSE_ASM -DREAD_ONLY -UHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk 'sum+= $1 ENDprint sum/100;'
Performance counter stats for './a.out' (100 runs):
21.06 msec task-clock # 0.994 CPUs utilized ( +- 0.10% )
1 context-switches # 0.056 K/sec ( +- 5.30% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.015 M/sec ( +- 0.04% )
86,239,943 cycles # 4.094 GHz ( +- 0.02% )
82,285,261 branches # 3906.682 M/sec ( +- 0.00% )
17,645 branch-misses # 0.02% of all branches ( +- 0.15% )
415,286,425 instructions # 4.82 insn per cycle ( +- 0.00% )
335,057,379 uops_issued.any # 15907.619 M/sec ( +- 0.00% )
409,255,762 uops_executed.thread # 19430.358 M/sec ( +- 0.00% )
0.0211944 +- 0.0000221 seconds time elapsed ( +- 0.10% )
20504 (clock_t)
real 0m2.309s
user 0m2.085s
sys 0m0.203s
I think it's just the branch prediction, not other front-end stuff that's a problem. The test/branch instructions aren't getting split across a boundary that would prevent macro-fusion.
Changing .p2align 5
to .p2align 4
reverses them: -UHIDE_ALIGNMENT
becomes slow.
This Godbolt binary link reproduces the same padding I'm seeing with gcc8.2.1 on Arch Linux for both cases: 2x 11-byte nopw
+ a 3-byte nop
inside the outer loop for the fast case. It also has the exact source I was using locally.
short strlen read-only micro-benchmarks:
Tested with stuff chosen so it doesn't suffer from branch mispredicts or store-forwarding, and can test the same short length repeatedly for enough iterations to get meaningful data.
strlen=33
, so the terminator is near the start of the 3rd 16-byte vector. (Makes my version look as bad as possible vs. the 4-byte version.) -DREAD_ONLY
, and i<1280000
as an outer-loop repeat loop.
1933 clock_t: my asm: nice and consistent best-case time (not noisy / bouncing around when re-running the average.) Equal perf with/without-DHIDE_ALIGNMENT
, unlike for the longer strlen. The loop branch is much more easily predictable with that much shorter pattern. (strlen=33, not 1000).
3220 clock_t: gcc -O3strlen
. (-DHIDE_ALIGNMENT
)- 6100 clock_t: gcc -O3 4-byte loop
- 37200 clock_t: gcc -O1 repz scasb
So for short strings, my simple inline loop beats a library function call to strlen
that has to go through the PLT (call + jmp [mem]
), then run strlen's startup overhead that can't depend on alignment.
There were negligible branch-mispredicts, like 0.05% for all the versions with strlen(s)=33
. The repz scasb version had 0.46%, but that's out of fewer total branches. No inner loop to rack up many correctly predicted branches.
With branch predictors and code-cache hot, repz scasb
is more than 10x worse than calling glibc strlen
for a 33-byte string. It would be less bad in real use cases where strlen
could branch miss or even miss in code-cache and stall, but straight-line repz scasb
wouldn't. But 10x is huge, and that's for a fairly short string.
GCC's inline strlen
patterns are much slower than what it could do with SSE2 pcmpeqb
/ pmovmskb
, and bsf
, given the 16-byte alignment from calloc
. This "optimization" is actually a pessimization.
My simple hand-written loop that takes advantage of 16-byte alignment is 5x faster than what gcc -O3
inlines for large buffers, and ~2x faster for short strings. (And faster than calling strlen for short strings). I've added a comment to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809 to propose this for what gcc should inline at -O2 / -O3 when it's able. (With a suggestion for ramping up to 16-byte if we only know 4-byte alignment to start with.)
When gcc knows it has 4-byte alignment for the buffer (guaranteed by calloc
), it chooses to inline strlen
as a 4-byte-at-a-time scalar bithack using GP integer registers (-O2
and higher).
(Reading 4 bytes at a time is only safe if we know we can't cross into a page that doesn't contain any string bytes, and thus might be unmapped. Is it safe to read past the end of a buffer within the same page on x86 and x64? (TL:DR yes, in asm it is, so compilers can emit code that does that even if doing so in the C source is UB. libc strlen
implementations also take advantage of that. See my answer there for links to glibc strlen
and a summary of how it runs so fast for large strings.)
At -O1
, gcc always (even without known alignment) chooses to inline strlen
as repnz scasb
, which is very slow (about 1 byte per clock cycle on modern Intel CPUs). "Fast strings" only applies to rep stos
and rep movs
, not the repz
/repnz
instructions, unfortunately. Their microcode is just simple 1 byte at a time, but they still have some startup overhead. (https://agner.org/optimize/)
(We can test this by "hiding" the pointer from the compiler by storing / reloading s
to a volatile void *tmp
, for example. gcc has to make zero assumptions about the pointer value that's read back from a volatile
, destroying any alignment info.)
GCC does have some x86 tuning options like -mstringop-strategy=libcall
vs. unrolled_loop
vs. rep_byte
for inlining string operations in general (not just strlen; memcmp
would be another major one that can be done with rep or a loop). I haven't checked what effect these have here.
The docs for another option also describe the current behaviour. We could get this inlining (with extra code for alignment-handling) even in cases where we wanted it on unaligned pointers. (This used to be an actual perf win, especially for small strings, on targets where the inline loop wasn't garbage compared to what the machine can do.)
-minline-all-stringops
By default GCC inlines string operations only when the destination is known to be aligned to least a 4-byte boundary. This enables more inlining and increases code size, but may improve performance of code that depends on fast memcpy, strlen, and memset for short lengths.
GCC also has per-function attributes you can apparently use to control this, like __attribute__((no-inline-all-stringops)) void foo() ...
, but I haven't played around with it. (That's the opposite of inline-all. It doesn't mean inline none, it just goes back to only inlining when 4-byte alignment is known.)
Both of gcc's inline strlen
strategies fail to take advantage of 16-byte alignment, and are pretty bad for x86-64
Unless the small-string case is very common, doing one 4-byte chunk, then aligned 8-byte chunks would go about twice as fast as 4-byte.
And the 4-byte strategy has much slower cleanup than necessary for finding the byte within the dword containing the zero byte. It detects this by looking for a byte with its high bit set, so it should just mask off the other bits and use bsf
(bit-scan forward). That has 3 cycle latency on modern CPUs (Intel and Ryzen). Or compilers can use rep bsf
so it runs as tzcnt
on CPUs that support BMI1, which is more efficient on AMD. bsf
and tzcnt
give the same result for non-zero inputs.
GCC's 4-byte loop looks like it's compiled from pure C, or some target-independent logic, not taking advantage of bitscan. gcc does use andn
to optimize it when compiling for x86 with BMI1, but it's still less than 4 bytes per cycle.
SSE2 pcmpeqb
+ bsf
is much much better for both short and long inputs. x86-64 guarantees that SSE2 is available, and the x86-64 System V has alignof(maxalign_t) = 16
so calloc
will always return pointers that are at least 16-byte aligned.
I wrote a replacement for the strlen
block to test performance
As expected it's about 4x faster on Skylake going 16 bytes at a time instead of 4.
(I compiled the original source to asm with -O3
, then edited the asm to see what performance should have been with this strategy for inline expansion of strlen
. I also ported it to inline asm inside the C source; see that version on Godbolt.)
# at this point gcc has `s` in RDX, `i` in ECX
pxor %xmm0, %xmm0 # zeroed vector to compare against
.p2align 4
.Lstrlen16: # do
#ifdef __AVX__
vpcmpeqb (%rdx), %xmm0, %xmm1
#else
movdqa (%rdx), %xmm1
pcmpeqb %xmm0, %xmm1 # xmm1 = -1 where there was a 0 in memory
#endif
add $16, %rdx # ptr++
pmovmskb %xmm1, %eax # extract high bit of each byte to a 16-bit mask
test %eax, %eax
jz .Lstrlen16 # while(mask==0);
# RDX points at the 16-byte chunk *after* the one containing the terminator
# EAX = bit-mask of the 0 bytes, and is known to be non-zero
bsf %eax, %eax # EAX = bit-index of the lowest set bit
movb $'A', -16(%rdx, %rax)
Note that I optimized part of the strlen cleanup into the store addressing mode: I correct for the overshoot with the -16
displacement, and that this is just finding the end of the string, not actually calculating the length and then indexing like GCC was already doing after inlining its 4-byte-at-a-time loop.
To get actual string length (instead of pointer to the end), you'd subtract rdx-start and then add rax-16
(maybe with an LEA to add 2 registers + a constant, but 3-component LEA has more latency.)
With AVX to allow load+compare in one instruction without destroying the zeroed register, the whole loop is only 4 uops, down from 5. (test/jz macro-fuses into one uop on both Intel and AMD. vpcmpeqb
with a non-indexed memory-source can keep it micro-fused through the whole pipeline, so it's only 1 fused-domain uop for the front-end.)
(Note that mixing 128-bit AVX with SSE does not cause stalls even on Haswell, as long as you're in clean-upper state to start with. So I didn't bother about changing the other instructions to AVX, only the one that mattered. There seemed to be some minor effect where pxor
was actually slightly better than vpxor
on my desktop, though, for an AVX loop body. It seemed somewhat repeatable, but it's weird because there's no code-size difference and thus no alignment difference.)
pmovmskb
is a single-uop instruction. It has 3-cycle latency on Intel and Ryzen (worse on Bulldozer-family). For short strings, the trip through the SIMD unit and back to integer is an important part of the critical path dependency chain for latency from input memory bytes to store-address being ready. But only SIMD has packed-integer compares, so scalar would have to do more work.
For the very-small string case (like 0 to 3 bytes), it might be possible to achieve slightly lower latency for that case by using pure scalar (especially on Bulldozer-family), but having all strings from 0 to 15 bytes take the same branch path (loop branch never taken) is very nice for most short-strings use-cases.
Being very good for all strings up to 15 bytes seems like a good choice, when we know we have 16-byte alignment. More predictable branching is very good. (And note that when looping, pmovmskb
latency only affects how quickly we can detect branch mispredicts to break out of the loop; branch prediction + speculative execution hides the latency of the independent pmovmskb in each iteration.
If we expected longer strings to be common, we could unroll a bit, but at that point you should just call the libc function so it can dispatch to AVX2 if available at runtime. Unrolling to more than 1 vector complicates the cleanup, hurting the simple cases.
On my machine i7-6700k Skylake at 4.2GHz max turbo (and energy_performance_preference
= performance), with gcc8.2 on Arch Linux, I get somewhat consistent benchmark timing because my CPU clock speed ramps up during the memset. But maybe not always to max turbo; Skylake's hw power management downclocks when memory-bound. perf stat
showed I typically got right around 4.0GHz when running this to average the stdout output and see perf summary on stderr.
perf stat -r 100 ./a.out | awk 'sum+= $1 ENDprint sum/100;'
I ended up copying my asm into a GNU C inline-asm statement, so I could put the code on the Godbolt compiler explorer.
For large strings, same length as in the question: times on ~4GHz Skylake
- ~62100
clock_t
time units:-O1
rep scas: (clock()
is a bit obsolete, but I didn't bother changing it.) - ~15900
clock_t
time units:-O3
gcc 4-byte loop strategy: avg of 100 runs = . (Or maybe ~15800 with-march=native
forandn
) - ~1880
clock_t
time units:-O3
with glibcstrlen
function calls, using AVX2 - ~3190
clock_t
time units: (AVX1 128-bit vectors, 4 uop loop) hand-written inline asm that gcc could/should inline. - ~3230
clock_t
time units: (SSE2 5 uop loop) hand-written inline asm that gcc could/should inline.
My hand-written asm should be very good for short strings, too, because it doesn't need to branch specially. Known alignment is very good for strlen, and libc can't take advantage of it.
If we expect large strings to be rare, 1.7x slower than libc for that case. The length of 1M bytes means it won't be staying hot in L2 (256k) or L1d cache (32k) on my CPU, so even bottlenecked on L3 cache the libc version was faster. (Probably an unrolled loop and 256-bit vectors doesn't clog up the ROB with as many uops per byte, so OoO exec can see farther ahead and get more memory parallelism, especially at page boundaries.)
But L3 cache bandwidth is probably a bottleneck stopping the 4-uop version from running at 1 iteration per clock, so we're seeing less benefit from AVX saving us a uop in the loop. With data hot in L1d cache, we should get 1.25 cycles per iteration vs. 1.
But a good AVX2 implementation can read up to 64 bytes per cycle (2x 32 byte loads) using vpminub
to combine pairs before checking for zeros and going back to find where they were. The gap between this and libc opens wider for sizes of ~2k to ~30 kiB or so that stay hot in L1d.
Some read-only testing with length=1000 indicates that glibc strlen
really is about 4x faster than my loop for medium size strings hot in L1d cache. That's large enough for AVX2 to ramp up to the big unrolled loop, but still easily fits in L1d cache. (Read-only avoid store-forwarding stalls, and so we can do many iterations)
If your strings are that big, you should be using explicit-length strings instead of needing to strlen
at all, so inlining a simple loop still seems like a reasonable strategy, as long as it's actually good for short strings and not total garbage for medium (like 300 bytes) and very long (> cache size) strings.
Benchmarking small strings with this:
I ran into some oddities in trying to get the results I expected:
I tried s[31] = 0
to truncate the string before every iteration (allowing short constant length). But then my SSE2 version was almost the same speed as GCC's version. Store-forwarding stalls were the bottleneck! A byte store followed by a wider load makes store-forwarding take the slow path that merges bytes from the store buffer with bytes from L1d cache. This extra latency is part of a loop-carried dep chain through the last 4-byte or 16-byte chunk of the string, to calculate the store index for the next iteration.
GCC's slower 4-byte-at-a-time code could keep up by processing the earlier 4-byte chunks in the shadow of that latency. (Out-of-order execution is pretty fantastic: slow code can sometimes not affect the overall speed of your program).
I eventually solved it by making a read-only version, and using inline asm to stop the compiler from hoisting strlen
out of the loop.
But store-forwarding is a potential issue with using 16-byte loads. If other C variables are stored past the end of the array, we might hit a SF stall due to loading off the end of the array farther than with narrower stores. For recently-copied data, we're fine if it was copied with 16-byte or wider aligned stores, but glibc memcpy for small copies does 2x overlapping loads that cover the whole object, from the start and end of the object. Then it stores both, again overlapping, handling the memmove src overlaps dst case for free. So the 2nd 16-byte or 8-byte chunk of a short string that was just memcpyied might give us a SF stall for reading the last chunk. (The one that has the data dependency for the output.)
Just running slower so you don't get to the end before it's ready isn't good in general, so there's no great solution here. I think most of the time you're not going to strlen a buffer you just wrote, usually you're going to strlen
an input that you're only reading so store-forwarding stalls aren't a problem. If something else just wrote it, then efficient code hopefully wouldn't have thrown away the length and called a function that required recalculating it.
Other weirdness I haven't totally figured out:
Code alignment is making a factor of 2 difference for read-only, size=1000 (s[1000] = 0;
). But the inner-most asm loop itself is aligned with .p2align 4
or .p2align 5
. Increasing the loop alignment can slow it down by a factor of 2!
# slow version, with *no* extra HIDE_ALIGNMENT function call before the loop.
# using my hand-written asm, AVX version.
i<1280000 read-only at strlen(s)=1000 so strlen time dominates the total runtime (not startup overhead)
.p2align 5 in the asm inner loop. (32-byte code alignment with NOP padding)
gcc -DUSE_ASM -DREAD_ONLY -DHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk 'sum+= $1 ENDprint sum/100;'
Performance counter stats for './a.out' (100 runs):
40.92 msec task-clock # 0.996 CPUs utilized ( +- 0.20% )
2 context-switches # 0.052 K/sec ( +- 3.31% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.008 M/sec ( +- 0.05% )
168,103,223 cycles # 4.108 GHz ( +- 0.20% )
82,293,840 branches # 2011.269 M/sec ( +- 0.00% )
1,845,647 branch-misses # 2.24% of all branches ( +- 0.74% )
412,769,788 instructions # 2.46 insn per cycle ( +- 0.00% )
466,515,986 uops_issued.any # 11401.694 M/sec ( +- 0.22% )
487,011,558 uops_executed.thread # 11902.607 M/sec ( +- 0.13% )
0.0410624 +- 0.0000837 seconds time elapsed ( +- 0.20% )
40326.5 (clock_t)
real 0m4.301s
user 0m4.050s
sys 0m0.224s
Note branch misses definitely non-zero, vs. almost exactly zero for the fast version. And uops issued is much higher than the fast version: it may be speculating down the wrong path for a long time on each of those branch misses.
Probably the inner and outer loop-branches are aliasing each other, or not.
Instruction count is nearly identical, just different by some NOPs in the outer loop ahead of the inner loop. But IPC is vastly different: without problems, the fast version runs an average of 4.82 instructions per clock for the whole program. (Most of that is in the inner-most loop running 5 instructions per cycle, thanks to a test/jz that macro-fuses 2 instructions into 1 uop.) And note that uops_executed is much higher than uops_issued: that means micro-fusion is working well to get more uops through the front-end bottleneck.
fast version, same read-only strlen(s)=1000 repeated 1280000 times
gcc -DUSE_ASM -DREAD_ONLY -UHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk 'sum+= $1 ENDprint sum/100;'
Performance counter stats for './a.out' (100 runs):
21.06 msec task-clock # 0.994 CPUs utilized ( +- 0.10% )
1 context-switches # 0.056 K/sec ( +- 5.30% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.015 M/sec ( +- 0.04% )
86,239,943 cycles # 4.094 GHz ( +- 0.02% )
82,285,261 branches # 3906.682 M/sec ( +- 0.00% )
17,645 branch-misses # 0.02% of all branches ( +- 0.15% )
415,286,425 instructions # 4.82 insn per cycle ( +- 0.00% )
335,057,379 uops_issued.any # 15907.619 M/sec ( +- 0.00% )
409,255,762 uops_executed.thread # 19430.358 M/sec ( +- 0.00% )
0.0211944 +- 0.0000221 seconds time elapsed ( +- 0.10% )
20504 (clock_t)
real 0m2.309s
user 0m2.085s
sys 0m0.203s
I think it's just the branch prediction, not other front-end stuff that's a problem. The test/branch instructions aren't getting split across a boundary that would prevent macro-fusion.
Changing .p2align 5
to .p2align 4
reverses them: -UHIDE_ALIGNMENT
becomes slow.
This Godbolt binary link reproduces the same padding I'm seeing with gcc8.2.1 on Arch Linux for both cases: 2x 11-byte nopw
+ a 3-byte nop
inside the outer loop for the fast case. It also has the exact source I was using locally.
short strlen read-only micro-benchmarks:
Tested with stuff chosen so it doesn't suffer from branch mispredicts or store-forwarding, and can test the same short length repeatedly for enough iterations to get meaningful data.
strlen=33
, so the terminator is near the start of the 3rd 16-byte vector. (Makes my version look as bad as possible vs. the 4-byte version.) -DREAD_ONLY
, and i<1280000
as an outer-loop repeat loop.
1933 clock_t: my asm: nice and consistent best-case time (not noisy / bouncing around when re-running the average.) Equal perf with/without-DHIDE_ALIGNMENT
, unlike for the longer strlen. The loop branch is much more easily predictable with that much shorter pattern. (strlen=33, not 1000).
3220 clock_t: gcc -O3strlen
. (-DHIDE_ALIGNMENT
)- 6100 clock_t: gcc -O3 4-byte loop
- 37200 clock_t: gcc -O1 repz scasb
So for short strings, my simple inline loop beats a library function call to strlen
that has to go through the PLT (call + jmp [mem]
), then run strlen's startup overhead that can't depend on alignment.
There were negligible branch-mispredicts, like 0.05% for all the versions with strlen(s)=33
. The repz scasb version had 0.46%, but that's out of fewer total branches. No inner loop to rack up many correctly predicted branches.
With branch predictors and code-cache hot, repz scasb
is more than 10x worse than calling glibc strlen
for a 33-byte string. It would be less bad in real use cases where strlen
could branch miss or even miss in code-cache and stall, but straight-line repz scasb
wouldn't. But 10x is huge, and that's for a fairly short string.
edited Apr 9 at 11:14
answered Apr 9 at 9:37
Peter CordesPeter Cordes
137k19208352
137k19208352
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55563598%2fwhy-is-this-code-6-5x-slower-with-optimizations-enabled%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
Please report it to gcc's bugzilla.
– Marc Glisse
Apr 7 at 21:17
2
Using
-fno-builtin
makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtinstrlen
is slower than the library's.– David Schwartz
Apr 7 at 21:18
2
It is generating
repnz scasb
for strlen at -O1.– Marc Glisse
Apr 7 at 21:23
8
@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
– Justin
Apr 8 at 2:57
3
@Damon Performance considerations are also considered bug reports for gcc (and most compilers for that matter). If they decide not to change it, that's fine. If they decide that it's worth changing, that's also fine. If you do not file performance bugs, the compiler team will not realize that there's something to look at.
– Justin
Apr 8 at 21:06