This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Hi, Jakub.
> > They had 8 bytes each in order to allow direct comparisons with the count
> > in a register without having to load the value. Even if in memcpy they
> > can be used as 4-byte variables, I have other routines that would benefit
> > from them being 8 bytes long.
>
> In the last round of routines you sent I haven't seen that, but sure, if
> some var has justification for being 64-bit, so be it. The important
> is just (%rip) addressing.
Got it. It actually made the patch much leaner, as it doesn't touch on
RTLD stuff anymore.
> > I guess that using the red zone is better. As the routine has several
> > exit points to improve performance, after each one new CFI directives
> > would have to be added, which complicates maintaining the code.
>
> Even with red zone you need some CFI directives (which say where
> %r12/%r13/%r14 have been saved or cfi_restore for them), but don't need any
> CFA adjustments.
I chose to use the red zone with the CFI directives.
> > I'll double-check that RDI has the expected value always. Otherwise, I'll
> > just use an entry in the red zone.
>
> I believe so. L(1{,a,b,c,d,loop}) always increment %rdi by the size they
> stored into (%rdi). All other ret's are preceded by jnz L(1), which relies
> on %rdi pointing after the last byte stored.
Indeed. The tail code is a tad harder to read, though.
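To make the %rdi invariant concrete, here is a rough C sketch. It is not the assembly from the patch: sketch_mempcpy is a hypothetical name, and the 8-byte and 1-byte steps are picked only for brevity. The point is that every step advances the destination pointer by exactly the number of bytes it stored, so at any exit the pointer sits one past the last byte written, which is the value mempcpy returns.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration only; not the routine from the patch. */
void *sketch_mempcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;        /* plays the role of %rdi */
    const unsigned char *s = src;  /* plays the role of %rsi */

    /* Main loop: store 8 bytes, then advance by the 8 bytes stored. */
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, s, sizeof w);   /* alignment-safe 8-byte load */
        memcpy(d, &w, sizeof w);   /* alignment-safe 8-byte store */
        d += 8;
        s += 8;
        n -= 8;
    }

    /* Tail: store 1 byte, then advance by the 1 byte stored. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }

    /* d now points just past the last byte stored. */
    return d;
}
```

A memcpy-style return value can be recovered from this by subtracting the original count from the returned pointer.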
Again, in addition to the source-code patches, I also attached the
resulting data obtained on a 2.4GHz Athlon64 with DDR2-800 RAM and on a
3GHz Core2 with DDR2-533. The file memcpy-opteron-old.txt has the
original output of string/test-memcpy on the Athlon64 system and the
file memcpy-opteron-new.txt the output using the new routine. The files
memcpy-core2-old.txt and memcpy-core2-new.txt contain the same results
but on the Core2 system.
I also plotted the performance of the new routine relative to the old
one (where a ratio of 1 stands for performance parity and >1 for
performance improvement) in movs-opteron-new-movs-opteron-old.png for
the Athlon64 system and in movs-core2-new-movs-core2-old.png for the
Core2 system.
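For reference, the ratio in the plots can be thought of along these lines. This is a sketch under an assumption: that the figures reported by string/test-memcpy are elapsed times for each test case, so lower is better. The name perf_ratio is hypothetical and is not part of the test suite.

```c
#include <assert.h>

/* Hypothetical helper. Assumes both arguments are elapsed times
   for the same test case, where a lower time is better. */
double perf_ratio(double old_time, double new_time)
{
    assert(old_time > 0.0 && new_time > 0.0);
    /* 1 means parity; greater than 1 means the new routine is faster. */
    return old_time / new_time;
}
```

Under that assumption, a case that took 200 units with the old routine and 100 with the new one gives a ratio of 2.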
2007-05-04 Evandro Menezes <evandro.menezes@amd.com>
* sysdeps/x86_64/memcpy.S: new code to handle more block size
ranges.
* sysdeps/x86_64/mempcpy.S: modified macro definition.
* sysdeps/unix/sysv/linux/x86_64/sysconf.c: moved code to detect
cache sizes...
* sysdeps/x86_64/cacheinfo.c: ... here.
* sysdeps/x86_64/Makefile: added cacheinfo.c.
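As a point of reference for the cache-size detection mentioned above, the sketch below shows how user code can ask glibc for cache sizes through sysconf. It does not reproduce cacheinfo.c. The function names are hypothetical, and the _SC_LEVEL1_DCACHE_SIZE and _SC_LEVEL2_CACHE_SIZE names are glibc extensions, so the code falls back to -1 where they are not defined.

```c
#include <unistd.h>

/* Hypothetical wrappers. Each returns a cache size in bytes;
   0 or -1 means the size is not known on this system. */
long l1d_cache_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_SIZE
    return sysconf(_SC_LEVEL1_DCACHE_SIZE);
#else
    return -1;  /* glibc extension not available */
#endif
}

long l2_cache_size(void)
{
#ifdef _SC_LEVEL2_CACHE_SIZE
    return sysconf(_SC_LEVEL2_CACHE_SIZE);
#else
    return -1;  /* glibc extension not available */
#endif
}
```

A copy routine could compare the block size against such values to pick a strategy, but the thresholds actually used are in the attached patch, not in this sketch.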
Could you please review it?
Thanks,
--
_______________________________________________________
Evandro Menezes AMD Austin, TX
Attachment: movs-core2-new-movs-core2-old-ratio.png
Attachment: movs-opteronf-new-movs-opteronf-old-ratio.png
Attachment: memcpy.diff.bz2