This is a continuation of a series of articles about how I worked on optimization and ended up with the fastest image resize on modern x86 processors. Each article tells part of the story, and I hope it pushes others to optimize their own code. Previous parts:
- Part 0
- Part 1, general optimizations
Last time we got an average 2.5x speedup without changing the approach. This time I will show how to apply the SIMD approach and gain another 3.5x. Of course, using SIMD for graphics processing is not know-how; you could even say SIMD was invented for it. In practice, though, very few developers use it for image processing tasks. For example, the fairly well-known and widespread libraries ImageMagick and LibGD are written without SIMD. Partly this is because the SIMD approach is objectively more complex and not cross-platform, and partly because there is little information about it. The basics are easy to find, but detailed material and analyses of real problems are scarce. That is why Stack Overflow has so many questions about literally every little thing: how to load data, how to unpack it, how to pack it. It seems everyone has to learn this the hard way on their own.
What is SIMD
If you already know the basics, skip this section. SIMD stands for single instruction, multiple data. This approach lets you combine several identical operations into one. The set of data the operation is performed on is called a vector.
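To make the idea concrete, here is a minimal sketch that adds four floats first with a scalar loop and then with a single vector addition. The function names add4_scalar and add4_simd are made up for illustration, and the sketch assumes an x86 target with SSE (any x86-64 compiler provides it).

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Scalar version: four separate additions. */
void add4_scalar(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] + b[i];
    }
}

/* SIMD version: one vector addition handles all four values at once.
 * (add4_simd is a hypothetical name used only for this sketch.) */
void add4_simd(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);  /* load 4 floats, no alignment required */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

Both functions produce the same four sums; the second one does it with one addition instruction instead of four.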

In most cases, on modern processors a SIMD instruction executes in the same number of clock cycles as its scalar counterpart. So in theory, switching to SIMD can give a 2, 4, 8, 16 or even 32x speedup, depending on the data type and instruction set used. In practice it turns out differently. First, even in vectorized code part of the code remains scalar. Second, data often has to be prepared for vector operations: unpacked and packed. As a rule, packing and unpacking data is the hardest part of writing SIMD code. Third, SIMD instructions are not an exact copy of the regular ones: for some operations there are specific instructions that solve the problem neatly, while for other tasks the instruction you need does not exist. For example, there are dedicated SIMD instructions for finding the minimum and maximum that work without conditional jumps. On the other hand, x86 processors have no integer vector division.
How many values fit into a vector register depends on the data type and the register size. SSE registers are 128 bits wide. For example, when working with 32-bit integers, four values fit into one SSE register. These are the main data types available:
- 8-bit integers (signed or unsigned)
- 16-bit integers (signed or unsigned)
- 32-bit integers (signed or unsigned)
- 64-bit integers (signed or unsigned)
- single-precision floating-point numbers, 32 bits
- double-precision floating-point numbers, 64 bits
All vector registers are the same and know nothing about the type of data they hold. The interpretation depends only on the instructions that operate on the registers. Accordingly, most instructions come in variants for different data types. An instruction can also take data of one type and produce another. If you are familiar with assembly, this is somewhat reminiscent of how regular registers work: you can write something to the 32-bit eax register and then operate on its 16-bit ax part.
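As an illustration of "the register does not know its type", the sketch below adds the same two 128-bit values twice: once as sixteen 8-bit lanes and once as four 32-bit lanes. The function names are hypothetical; the intrinsics are plain SSE2.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Interpret both inputs as sixteen 8-bit integers and add lane by lane. */
void add_as_bytes(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *) a);
    __m128i vb = _mm_loadu_si128((const __m128i *) b);
    _mm_storeu_si128((__m128i *) out, _mm_add_epi8(va, vb));
}

/* Interpret the very same bits as four 32-bit integers and add lane by lane. */
void add_as_dwords(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *) a);
    __m128i vb = _mm_loadu_si128((const __m128i *) b);
    _mm_storeu_si128((__m128i *) out, _mm_add_epi32(va, vb));
}
```

With every byte of a set to 0xFF and every byte of b set to 0x01, the byte-wise addition wraps each lane to 0x00, while the 32-bit addition carries across byte boundaries and leaves the bytes 00 01 01 01 in each lane (little-endian order). Same registers, same bits, different instruction, different result.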
Choosing the right instruction set extension
The first SIMD instructions appeared in the Intel Pentium MMX processors. MMX is in fact the name of the instruction set extension. The feature was so important that Intel put it into the name of the processor. I once wrote simple algorithms with MMX, such as blending two images or supersampling. I was writing in Delphi, but to use MMX I had to go down a level and write assembly inserts.
Since then I had not really followed the development of processor instructions and the related development tools, so when I recently took up SIMD again I was pleasantly surprised. No, compilers still cannot automatically apply SIMD instructions in any case of moderate complexity, and when they can, the result is usually worse than hand-written SIMD code. On the other hand, you no longer need to write assembly to use SIMD: everything is done through special functions called intrinsics.
In most cases each intrinsic corresponds to one specific processor instruction, so the resulting code is very efficient and close to the hardware. At the same time you write it in the familiar and relatively safe C syntax. You include the header file where the intrinsics are defined, declare variables using special data types, and call the intrinsics like ordinary functions. In other words, you write ordinary code. One small inconvenience is that arithmetic operators do not work with SIMD data types, so every calculation has to go through an intrinsic. Roughly speaking, you cannot write ss0 + ss1, only add_float(ss0, ss1) (the function name is made up).
There are many SIMD extensions. As a rule, if a processor has a newer extension, it also has all the earlier ones. In chronological order of appearance they are:
MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX-512
As you can see, the list is impressive, and not every machine has every extension. x86 processors without MMX can only be found in a museum. SSE2 is a mandatory extension for 64-bit processors, so these days it is almost everywhere. SSE4.2 support can be found in any processor starting with the Nehalem architecture, that is, since 2008. AVX2, however, is available only in non-budget Intel processors starting with the Haswell core, that is, since 2013; on AMD it appeared in the Ryzen processors released in 2017. AVX-512 is currently available only in Intel Xeon and Xeon Phi server processors.
The choice of instruction set depends on performance, on how hard the code is to write, and on processor support. Sometimes developers write several implementations of the same code for different instruction sets. I chose two: SSE4.2 and AVX2. My reasoning was this: SSE4.2 is a baseline that anyone who cares about performance at all can be expected to have, so there is no need to struggle with implementing everything on, say, SSE2. AVX2 is for those who are not too lazy to upgrade their hardware at least once every three years. Whichever set you pick for an implementation, the number of processors on the market that support it will only grow, so the choice becomes more right as time goes by.
SSE4 implementation
Let's finally get back to the code. To use SSE4.2 in C, you need to include three header files:
    #include <emmintrin.h>
    #include <mmintrin.h>
    #include <smmintrin.h>
You also need to pass the -msse4 compiler flag. If you are building a Python module (like ours), you can add this flag right on the command line so as not to complicate the build:
$ CC="ccache cc -msse4" python ./setup.py develop
The most comprehensive intrinsics reference is the Intel Intrinsics Guide. It has good search and filtering, and each intrinsic's description shows the corresponding instruction, the instruction's pseudocode, and even the execution time in clock cycles on recent generations of Intel processors. As a reference it is unique. However, the format of the guide does not give you an overall picture of what you are supposed to do.
Vectorization only helps with identical operations on different data. In our case, the same actions are performed on different image channels:
    for (xx = 0; xx < imOut->xsize; xx++) {
        ss0 = 0.5;
        ss1 = 0.5;
        ss2 = 0.5;
        for (y = ymin; y < ymax; y++) {
            ss0 = ss0 + (UINT8) imIn->image[y][xx*4+0] * k[y-ymin];
            ss1 = ss1 + (UINT8) imIn->image[y][xx*4+1] * k[y-ymin];
            ss2 = ss2 + (UINT8) imIn->image[y][xx*4+2] * k[y-ymin];
        }
        imOut->image[yy][xx*4+0] = clip8(ss0);
        imOut->image[yy][xx*4+1] = clip8(ss1);
        imOut->image[yy][xx*4+2] = clip8(ss2);
    }
There is similar code for both the vertical and the horizontal pass. For convenience, we will move both into separate functions and use SIMD only within two functions with the following signatures:
    void ImagingResampleHorizontalConvolution8u(
        UINT32 *lineOut, UINT32 *lineIn, int xsize,
        int *xbounds, float *kk, int kmax
    );
    void ImagingResampleVerticalConvolution8u(
        UINT32 *lineOut, Imaging imIn,
        int ymin, int ymax, float *k
    );
If you remember, in the previous part I made three special cases for images with 2, 3 and 4 channels. That was needed to get rid of the inner loop over channels without doing unnecessary calculations for channels the image does not have. In the SIMD version I will not split the implementation by channel count: all calculations are always done on four channels. Each pixel is represented as four 32-bit floating-point numbers and occupies exactly one SSE register. Yes, for three-channel images one operation in four is wasted, and for two-channel images half of them are. But it is easier to turn a blind eye to that than to try to fill SSE registers with useful data as densely as possible.

Look at the code above again. At the very beginning the accumulators are assigned the constant 0.5, which is needed to round the result. The _mm_set1_* functions load a single value into every element of a register; here we load one floating-point value:
__m128 sss = _mm_set1_ps(0.5);
The last part of a function name usually indicates the data type it works with. In our case it is _ps, which stands for packed single.
Next, if we want to work with pixels as vectors of floating-point numbers, we have to convert them into that representation somehow. SSE has no instruction that converts 8-bit values to single-precision numbers in one go. There is _mm_cvtepi32_ps, which converts 32-bit integers to single-precision numbers, but before using it the 8-bit values have to be unpacked into 32-bit ones. The _mm_cvtepu8_epi32 function is handy for that. It takes a 128-bit value, so we pass it the value at the pixel's address in memory:
    __m128i pix_epi32 = _mm_cvtepu8_epi32(*(__m128i *) &imIn->image32[y][xx]);
    __m128 pix_ps = _mm_cvtepi32_ps(pix_epi32);
Notice how much has to be spelled out explicitly in SIMD code just to load a value. In scalar code none of this exists: the compiler itself understands that to multiply an 8-bit integer by a float, the integer first has to be converted to a float.
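Here is a self-contained sketch of that unpacking step for a single pixel. It is not the article's code: the function name pixel_to_floats is made up, the pixel is passed by value and moved into the register with _mm_cvtsi32_si128 (so only 4 bytes are read rather than a 128-bit value through a pointer cast), and the GCC/Clang-specific target attribute is there only so the snippet builds without the -msse4 flag.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepu8_epi32 */
#include <stdint.h>

/* GCC/Clang-specific attribute: enables SSE4.1 for this function only. */
__attribute__((target("sse4.1")))
void pixel_to_floats(uint32_t pixel, float out[4])
{
    /* Put the 32-bit pixel into the lowest lane; the other lanes become zero. */
    __m128i packed = _mm_cvtsi32_si128((int) pixel);
    /* Zero-extend the four low bytes into four 32-bit integers. */
    __m128i pix_epi32 = _mm_cvtepu8_epi32(packed);
    /* Convert the four integers to single-precision floats and store them. */
    _mm_storeu_ps(out, _mm_cvtepi32_ps(pix_epi32));
}
```

For the pixel 0x80FF0201 the channels come out in memory order, lowest byte first: 1.0, 2.0, 255.0, 128.0.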
The coefficient is the same for all channels of a pixel, so we use _mm_set1_ps again:
__m128 mmk = _mm_set1_ps(k[y - ymin]);
All that remains is to multiply the channels by the coefficient and add the result to the accumulator:
    __m128 mul = _mm_mul_ps(pix_ps, mmk);
    sss = _mm_add_ps(sss, mul);
The sss accumulator now holds the pixel channel values, which can in fact fall outside the range [0, 255], so they have to be limited somehow. Remember the clip8 function from the previous article? It had two conditional branches. With SIMD we cannot use conditional jumps that depend on the data, because the processor has to execute the same instruction for all the data. But it is actually even better this way, because there are _mm_min_epi32 and _mm_max_epi32. So we convert the values to signed 32-bit integers and clip them to [0, 255]:
    __m128i mmmax = _mm_set1_epi32(255);
    __m128i mmmin = _mm_set1_epi32(0);
    __m128i ssi = _mm_cvtps_epi32(sss);
    ssi = _mm_max_epi32(mmmin, _mm_min_epi32(mmmax, ssi));
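The branch-free clipping can be tried in isolation. The sketch below applies the same min/max pair to four plain integers; the function name clip_to_u8_range is hypothetical, and the GCC/Clang-specific target attribute is only there so the snippet builds without -msse4.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_min_epi32, _mm_max_epi32 */

/* Clamp four signed 32-bit integers to [0, 255] without any branches. */
__attribute__((target("sse4.1")))
void clip_to_u8_range(const int in[4], int out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *) in);
    __m128i mmmin = _mm_set1_epi32(0);
    __m128i mmmax = _mm_set1_epi32(255);
    /* min() caps the values at 255, then max() raises negatives to 0. */
    v = _mm_max_epi32(mmmin, _mm_min_epi32(mmmax, v));
    _mm_storeu_si128((__m128i *) out, v);
}
```

For the input {-3, 0, 200, 257} the output is {0, 0, 200, 255}: every lane goes through the same two instructions regardless of its value.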
Unfortunately, there is no inverse instruction for _mm_cvtepu8_epi32. So I found nothing better than to shuffle the bytes we need into the first position and then move the SSE register into a general-purpose register with _mm_cvtsi128_si32:
    __m128i shiftmask = _mm_set_epi8(-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,12,8,4,0);
    lineOut[xx] = _mm_cvtsi128_si32(_mm_shuffle_epi8(ssi, shiftmask));
Note that in shiftmask the low bytes are on the right. For the horizontal pass everything is exactly the same; only the order in which pixels are loaded changes: neighboring pixels are loaded from one row rather than from neighboring rows.
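Putting the steps together, a vertical convolution for one output pixel might look like the sketch below. This is an assumption-laden illustration, not the code from the commit: the function name convolve_column_sse4 and the rows parameter (an array of row pointers standing in for imIn->image32) are made up, each pixel is loaded with _mm_cvtsi32_si128 so that only 4 bytes are read, and the float-to-int conversion uses the truncating _mm_cvttps_epi32 so that the 0.5 bias does the rounding, which differs from the _mm_cvtps_epi32 call in the snippets above. The target attribute is GCC/Clang-specific.

```c
#include <smmintrin.h>  /* SSE4.1 */
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */
#include <stdint.h>

/* Weighted sum of pixel column xx over rows [ymin, ymax), 4 channels at once.
 * rows and the function name are hypothetical stand-ins for this sketch. */
__attribute__((target("sse4.1")))
uint32_t convolve_column_sse4(const uint32_t *const *rows, int xx,
                              int ymin, int ymax, const float *k)
{
    __m128 sss = _mm_set1_ps(0.5f);  /* rounding bias in every channel */
    for (int y = ymin; y < ymax; y++) {
        /* Unpack one pixel: 4 bytes -> 4 int32 -> 4 floats. */
        __m128i pix = _mm_cvtepu8_epi32(_mm_cvtsi32_si128((int) rows[y][xx]));
        __m128 mmk = _mm_set1_ps(k[y - ymin]);
        sss = _mm_add_ps(sss, _mm_mul_ps(_mm_cvtepi32_ps(pix), mmk));
    }
    /* Truncate to integers (assumption: truncation plus the 0.5 bias). */
    __m128i ssi = _mm_cvttps_epi32(sss);
    /* Clip to [0, 255] without branches. */
    ssi = _mm_max_epi32(_mm_set1_epi32(0),
                        _mm_min_epi32(_mm_set1_epi32(255), ssi));
    /* Gather the low byte of each 32-bit lane into the lowest 4 bytes. */
    __m128i shiftmask = _mm_set_epi8(-1,-1,-1,-1,-1,-1,-1,-1,
                                     -1,-1,-1,-1,12,8,4,0);
    return (uint32_t) _mm_cvtsi128_si32(_mm_shuffle_epi8(ssi, shiftmask));
}
```

With two rows holding 0x10203040 and 0x30405060 and coefficients {0.5, 0.5}, each channel is averaged and the function returns 0x20304050.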
Everything is ready, so let's run the test and look at the results.
    Scale 2560×1600 RGB image
    to 320x200   bil   0.01151 s   355.87 Mpx/s   161.4 %
    to 320x200   bic   0.02005 s   204.27 Mpx/s   158.7 %
    to 320x200   lzs   0.03421 s   119.73 Mpx/s   137.2 %
    to 2048x1280 bil   0.04450 s    92.05 Mpx/s   215.0 %
    to 2048x1280 bic   0.05951 s    68.83 Mpx/s   198.3 %
    to 2048x1280 lzs   0.07804 s    52.49 Mpx/s   189.6 %
    to 5478x3424 bil   0.18615 s    22.00 Mpx/s   215.5 %
    to 5478x3424 bic   0.24039 s    17.04 Mpx/s   210.5 %
    to 5478x3424 lzs   0.30674 s    13.35 Mpx/s   196.2 %
Results for commit 8d0412b.
A gain of 2.5 to 3 times! Remember that we are testing on a three-channel RGB image, so a 3x speedup can be considered the reference result in this case.
Proper packing
The previous version was an almost one-to-one copy of the scalar code. On top of that, for packing and unpacking it used functions that appeared only in recent versions of SSE: _mm_cvtepu8_epi32, _mm_max/min_epi32 and _mm_shuffle_epi8. Obviously, people coped with these tasks somehow on earlier versions of SSE. Indeed, there is a family of functions, _mm_pack* and _mm_unpack*, for packing and unpacking data. They do not help us much with unpacking (_mm_cvtepu8_epi32 suits our purposes well), but packing can be simplified considerably, to the point where the constants for shuffling and clipping the values (mmmax, mmmin and shiftmask) are no longer needed.
The thing is that the _mm_packs* packing functions work with saturation, as indicated by the letter s in names such as _mm_packs_epi32. Saturation means that if a value goes beyond the limits of the new type during conversion, it stays at the extreme value of that type. For example, when converting a signed 16-bit integer to an unsigned 8-bit one, 257 becomes 255 and -3 becomes 0. So the packing functions narrow the values and keep them in range at the same time.
    __m128i ssi = _mm_cvtps_epi32(sss);
    ssi = _mm_packs_epi32(ssi, ssi);
    ssi = _mm_packus_epi16(ssi, ssi);
    lineOut[xx] = _mm_cvtsi128_si32(ssi);
This optimization did not give any speedup, but it looks nicer and needs no extra constants. See commit b17cdc9.
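To see the saturation at work, here is a small self-contained sketch that packs four 32-bit channel values into one 32-bit pixel using only SSE2. The function name pack_pixel is hypothetical; the two packing intrinsics are the ones discussed above.

```c
#include <emmintrin.h>  /* SSE2: _mm_packs_epi32, _mm_packus_epi16 */
#include <stdint.h>

/* Pack four int32 channel values into a 4-byte pixel with saturation. */
uint32_t pack_pixel(const int32_t channels[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *) channels);
    v = _mm_packs_epi32(v, v);   /* int32 -> int16, signed saturation */
    v = _mm_packus_epi16(v, v);  /* int16 -> uint8, unsigned saturation */
    return (uint32_t) _mm_cvtsi128_si32(v);
}
```

For the channels {257, -3, 100, 255} the bytes become 255, 0, 100, 255, so the function returns 0xFF6400FF. No clipping constants and no shuffle mask are involved.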
AVX registers
If a dog wore pants, would it wear them like this, or like this?

And what if a register held twice as much data? When I learned how AVX works, I immediately recalled that picture. At first glance AVX instructions look strange and illogical. It is not just "SSE times two"; there is some odd logic to it. Look at the same shuffle instruction we have already used. Here is the pseudocode of the SSE version:
    FOR j := 0 to 15
        i := j*8
        IF b[i+7] == 1
            dst[i+7:i] := 0
        ELSE
            index[3:0] := b[i+3:i]
            dst[i+7:i] := a[index*8+7:index*8]
        FI
    ENDFOR
It would be logical to assume that for AVX the counter simply goes up to 31 instead of 15. But no, the pseudocode of the AVX version is more complicated:
    FOR j := 0 to 15
        i := j*8
        IF b[i+7] == 1
            dst[i+7:i] := 0
        ELSE
            index[3:0] := b[i+3:i]
            dst[i+7:i] := a[index*8+7:index*8]
        FI
        IF b[128+i+7] == 1
            dst[128+i+7:128+i] := 0
        ELSE
            index[3:0] := b[128+i+3:128+i]
            dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]
        FI
    ENDFOR
AVX is not SSE times two, it is two SSEs! That is, an AVX register has to be viewed as a pair of 128-bit vectors, and in most instructions these two vectors do not interact. Look at the AVX pseudocode again: the first block clearly works only with the low 128 bits, and the second only with the high 128 bits. You cannot shuffle so that low bytes end up in the high half or the other way round. And this instruction is not even the strictest case of separation: the register that describes how to shuffle can at least shuffle the lower and upper halves differently. Sometimes the arguments of the operation are the same for both halves. Here, for example, is the pseudocode of _mm256_blend_epi16:
    FOR j := 0 to 15
        i := j*16
        IF imm8[j%8]
            dst[i+15:i] := b[i+15:i]
        ELSE
            dst[i+15:i] := a[i+15:i]
        FI
    ENDFOR
    dst[MAX:256] := 0
Note that j runs up to 15, while the mask bit is taken modulo 8, from imm8[j%8]. So the lower and upper halves of the register always use the same mask. Packing and unpacking raise the most questions, and they too happen independently in the lower and upper halves:
    PACK_SATURATED(src[127:0]) {
        dst[15:0] := Saturate_Int32_To_UnsignedInt16 (src[31:0])
        dst[31:16] := Saturate_Int32_To_UnsignedInt16 (src[63:32])
        dst[47:32] := Saturate_Int32_To_UnsignedInt16 (src[95:64])
        dst[63:48] := Saturate_Int32_To_UnsignedInt16 (src[127:96])
        RETURN dst[63:0]
    }
    dst[63:0] := PACK_SATURATED(a[127:0])
    dst[127:64] := PACK_SATURATED(b[127:0])
    dst[191:128] := PACK_SATURATED(a[255:128])
    dst[255:192] := PACK_SATURATED(b[255:128])
    dst[MAX:256] := 0

The low bits of the result are computed only from the low bits of the input parameters, and the high bits only from the high bits. I have not seen this rule stated explicitly in any source, but understanding it makes porting SSE code to AVX much easier.
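The "two SSEs" behavior is easy to observe. The sketch below packs the numbers 0..7 and 8..15 with _mm256_packus_epi32, the intrinsic the pseudocode above describes. The function name pack_lanes_demo is made up, and the target attribute is GCC/Clang-specific so that the snippet builds without -mavx2; it still needs an AVX2-capable processor to run.

```c
#include <immintrin.h>  /* AVX2 */
#include <stdint.h>

/* Pack a = {0..7} and b = {8..15} from int32 to uint16 and store the result. */
__attribute__((target("avx2")))
void pack_lanes_demo(uint16_t out[16])
{
    __m256i a = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i b = _mm256_setr_epi32(8, 9, 10, 11, 12, 13, 14, 15);
    /* Each 128-bit half is packed on its own: low(a), low(b), high(a), high(b). */
    _mm256_storeu_si256((__m256i *) out, _mm256_packus_epi32(a, b));
}
```

If the instruction were simply "SSE times two", the output would be 0..15 in order. Instead it is 0 1 2 3, 8 9 10 11, 4 5 6 7, 12 13 14 15: the halves of a and b are interleaved because each 128-bit half is processed independently.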
AVX instructions
There is something else that sets AVX instructions apart from SSE. Remember the data dependency problem I described at the end of the first part? It has been taken care of in AVX. Of course, the behavior of scalar instructions can no longer be fixed, but the same mistake can be avoided when moving from SSE to AVX. The pseudocode of every AVX instruction ends with this line:
dst[MAX:256] := 0
This means that AVX defines how existing instructions behave on the registers of future generations (512 bits and more). But that is not all. AVX instructions are encoded with the VEX opcode scheme, and AVX also includes the ability to encode SSE instructions with VEX. SSE instructions encoded this way also get the guarantee that the upper bits are zeroed. You may have heard that using AVX instructions after SSE ones, or the other way round, costs a penalty of about 100 cycles. Fortunately, this penalty does not apply to VEX-encoded SSE instructions. If you pass the -mavx flag and use intrinsics, the compiler generates instructions in the new format. The bad news is that code compiled with -mavx is VEX-encoded even if it contains only SSE intrinsics and no AVX ones, so it will not run on processors without AVX. In other words, within one compilation unit you cannot use both old-format SSE instructions and AVX instructions, and you cannot write code like this:
    if (is_avx_available()) {
        resample_avx();
    } else {
        resample_sse();
    }
With the -mavx flag, the code in resample_sse() will not run on a processor without AVX, and without the flag, the code in resample_avx() will not compile.
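One possible way around this, which the article itself does not use as far as the text shows, is to avoid a global -mavx flag and enable the extension per function (or to put each implementation in its own source file compiled with its own flags). The sketch below assumes GCC or Clang: the target attribute and __builtin_cpu_supports are compiler-specific, and sum8, sum8_avx2 and sum8_sse2 are hypothetical stand-ins for resample_avx() and resample_sse().

```c
#include <immintrin.h>

/* AVX2 path: only this function is compiled with AVX2 (and VEX encoding). */
__attribute__((target("avx2")))
static int sum8_avx2(const int *v)
{
    __m256i x = _mm256_loadu_si256((const __m256i *) v);
    /* Fold the upper 128-bit half onto the lower one. */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(x),
                              _mm256_extracti128_si256(x, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E));  /* swap 64-bit halves */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1));  /* swap adjacent lanes */
    return _mm_cvtsi128_si32(s);
}

/* Baseline path: plain SSE2 in the old encoding, runs on any x86-64 CPU. */
static int sum8_sse2(const int *v)
{
    __m128i lo = _mm_loadu_si128((const __m128i *) v);
    __m128i hi = _mm_loadu_si128((const __m128i *) (v + 4));
    __m128i s = _mm_add_epi32(lo, hi);
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1));
    return _mm_cvtsi128_si32(s);
}

/* Pick the implementation at run time based on what the CPU reports. */
int sum8(const int *v)
{
    if (__builtin_cpu_supports("avx2")) {
        return sum8_avx2(v);
    }
    return sum8_sse2(v);
}
```

Both paths return the sum of eight integers; which one runs depends on the processor, not on the flags the rest of the file was compiled with.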
AVX2, vertical pass
Until now the move to SIMD was fairly simple, because one pixel, represented as four floating-point numbers, fits exactly into one SSE4 register. With AVX2, however, we have to process eight floating-point values at once, that is, two pixels. But which pixels should go into one register? This is where I would like to insert the picture of the dog in pants again. Recall what the skeleton looks like for, say, the horizontal convolution:
    /* Schematic: loads and type conversions are omitted. */
    for (xx = 0; xx < xsize; xx++) {
        xmin = xbounds[xx * 2 + 0];
        xmax = xbounds[xx * 2 + 1];
        for (x = xmin; x < xmax; x++) {
            __m128i pix = lineIn[x];
            __m128 mmk = k[x - xmin];
            /* ... */
        }
    }
For example, we could take neighboring pixels in a row: lineIn[x] and lineIn[x + 1]. This is the most obvious option. However, we would have to prepare different coefficients for these pixels (k[x - xmin] and k[x - xmin + 1]). Also, the distance from xmin to xmax can be odd, so computing the last pixel would require combining SSE and AVX code.
We could take pixels from neighboring rows: lineIn1[x] and lineIn2[x]. But then the pixels have to be loaded and stored separately, which is not very convenient either.
Indeed, every approach has its pros and cons. Frankly, the horizontal pass is not very convenient to port to AVX2. The vertical pass is a different matter. Take a look:
    /* Schematic: loads and type conversions are omitted. */
    for (xx = 0; xx < imIn->xsize; xx++) {
        for (y = ymin; y < ymax; y++) {
            __m128i pix = image32[y][xx];
            __m128 mmk = k[y - ymin];
            /* ... */
        }
    }
We can take neighboring pixels in a row, image32[y][xx] and image32[y][xx + 1], and they share the same coefficient. When the inner loop finishes, the accumulator holds the result for two neighboring pixels, which is also easy to pack. In short, the code can be rewritten simply by changing every __m128 prefix to __m256 and every _mm_ to _mm256_. The only real difference is the use of _mm256_castsi256_si128 and _mm_storel_epi64 at the end. The first is a no-op, just a type cast. The second stores a 64-bit value from the register to memory.
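A sketch of the two-pixel vertical pass could look as follows. It is my own arrangement under stated assumptions rather than the code from the commit: the function name convolve_column_pair_avx2 and the rows parameter are hypothetical, the result is truncated with _mm256_cvttps_epi32 (assuming the 0.5 bias does the rounding), and the final packing extracts both 128-bit halves explicitly and uses 128-bit pack instructions so that the two pixels land side by side before _mm_storel_epi64 writes them. The target attribute is GCC/Clang-specific.

```c
#include <immintrin.h>  /* AVX2 */
#include <stdint.h>

/* Vertical convolution of two adjacent pixels (columns xx and xx + 1).
 * rows and the function name are hypothetical stand-ins for this sketch. */
__attribute__((target("avx2")))
void convolve_column_pair_avx2(const uint32_t *const *rows, int xx,
                               int ymin, int ymax, const float *k,
                               uint32_t *out)
{
    __m256 sss = _mm256_set1_ps(0.5f);  /* rounding bias in all 8 channels */
    for (int y = ymin; y < ymax; y++) {
        /* Load 8 bytes (two pixels) and widen every byte to a 32-bit integer. */
        __m128i two = _mm_loadl_epi64((const __m128i *) &rows[y][xx]);
        __m256 pix = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(two));
        /* Both pixels are in the same row, so they share one coefficient. */
        __m256 mmk = _mm256_set1_ps(k[y - ymin]);
        sss = _mm256_add_ps(sss, _mm256_mul_ps(pix, mmk));
    }
    __m256i ssi = _mm256_cvttps_epi32(sss);
    /* Pack with 128-bit instructions: low half is pixel 0, high half pixel 1. */
    __m128i packed = _mm_packs_epi32(_mm256_castsi256_si128(ssi),
                                     _mm256_extracti128_si256(ssi, 1));
    packed = _mm_packus_epi16(packed, packed);
    /* Store the low 64 bits: two finished pixels. */
    _mm_storel_epi64((__m128i *) out, packed);
}
```

With rows {0x10203040, 0x01020304} and {0x30405060, 0x03040506} and coefficients {0.5, 0.5}, the function writes 0x20304050 and 0x02030405 to out.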
    Scale 2560×1600 RGB image
    to 320x200   bil   0.01162 s   352.37 Mpx/s   -0.9 %
    to 320x200   bic   0.02085 s   196.41 Mpx/s   -3.8 %
    to 320x200   lzs   0.03247 s   126.16 Mpx/s    5.4 %
    to 2048x1280 bil   0.03992 s   102.61 Mpx/s   11.5 %
    to 2048x1280 bic   0.05086 s    80.53 Mpx/s   17.0 %
    to 2048x1280 lzs   0.06563 s    62.41 Mpx/s   18.9 %
    to 5478x3424 bil   0.15232 s    26.89 Mpx/s   22.2 %
    to 5478x3424 bic   0.19810 s    20.68 Mpx/s   21.3 %
    to 5478x3424 lzs   0.23601 s    17.36 Mpx/s   30.0 %
Results for commit 86fe8a2.
For the first time, performance dropped slightly in some cases. This is not a measurement error, because the results are quite stable. I will explain the reason later. For now, it is clear that the gain comes mostly when the image is enlarged, not when it is strongly reduced. It is not hard to guess why: the vertical pass runs after the horizontal one, so its effect is greater the larger the final image is. Overall, the picture is quite positive.
AVX2, horizontal pass
For the horizontal pass it is still most convenient to take two neighboring pixels in a row. Then we have to prepare different coefficients for them:
    __m256 mmk = _mm256_set1_ps(k[x - xmin]);
    mmk = _mm256_insertf128_ps(mmk, _mm_set1_ps(k[x - xmin + 1]), 1);
And at the end, the result in the upper half of the 256-bit register has to be added to the result in the lower half:
    __m128 sss = _mm_add_ps(
        _mm256_castps256_ps128(sss256),
        _mm256_extractf128_ps(sss256, 1));
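The two fragments above can be combined into a tiny self-contained sketch: two neighboring pixels sit side by side in one 256-bit register, each half gets its own coefficient, and the halves are summed at the end. The function name weighted_pair_sum and its float-array parameters are made up for illustration (real pixels would first be unpacked from bytes), and the target attribute is GCC/Clang-specific.

```c
#include <immintrin.h>  /* AVX */

/* out = p0 * k0 + p1 * k1, computed for all four channels at once. */
__attribute__((target("avx")))
void weighted_pair_sum(const float p0[4], const float p1[4],
                       float k0, float k1, float out[4])
{
    /* Pixel 0 goes into the lower half, pixel 1 into the upper half. */
    __m256 pix = _mm256_castps128_ps256(_mm_loadu_ps(p0));
    pix = _mm256_insertf128_ps(pix, _mm_loadu_ps(p1), 1);
    /* k0 in the lower half, k1 in the upper half. */
    __m256 mmk = _mm256_set1_ps(k0);
    mmk = _mm256_insertf128_ps(mmk, _mm_set1_ps(k1), 1);
    __m256 sss256 = _mm256_mul_ps(pix, mmk);
    /* Fold the upper half onto the lower half. */
    __m128 sss = _mm_add_ps(_mm256_castps256_ps128(sss256),
                            _mm256_extractf128_ps(sss256, 1));
    _mm_storeu_ps(out, sss);
}
```

For p0 = {1, 2, 3, 4}, p1 = {10, 20, 30, 40}, k0 = 0.5 and k1 = 0.25 the result is {3, 6, 9, 12}.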
    Scale 2560×1600 RGB image
    to 320x200   bil   0.00918 s   446.18 Mpx/s   26.6 %
    to 320x200   bic   0.01490 s   274.90 Mpx/s   39.9 %
    to 320x200   lzs   0.02287 s   179.08 Mpx/s   42.0 %
    to 2048x1280 bil   0.04186 s    97.85 Mpx/s   -4.6 %
    to 2048x1280 bic   0.05029 s    81.44 Mpx/s    1.1 %
    to 2048x1280 lzs   0.06178 s    66.30 Mpx/s    6.2 %
    to 5478x3424 bil   0.16219 s    25.25 Mpx/s   -6.1 %
    to 5478x3424 bic   0.19996 s    20.48 Mpx/s   -0.9 %
    to 5478x3424 lzs   0.23377 s    17.52 Mpx/s    1.0 %
Results for commit fd82859.
Here too there are small losses at some sizes compared with the previous version. But taken together, the two optimizations gave a gain of 6 to 50%. On average, the AVX2 version is 25% faster than the SSE4 version.
Is that a lot or a little? Of course, I would like to get more out of a new instruction set.
[The rest of this passage did not survive in the source; only fragments remain. They mention the figures 100% and 50%; the test processor, an Intel Core i5-4258U, with the values 2.4 and 2.9 (presumably its clock frequencies in GHz); Intel Power Gadget readings of 2.9 for the SSE4 version and 2.75 and 2.6 for the AVX2 versions; and a Xeon E5-2680 v2.]