I continue the detailed walkthrough of the optimization techniques that made it possible to write a fast image resize for modern x86 processors. This time the topic is converting floating-point computations to integer ones. First comes a little theory on how this works. Then we return to real code, including the SIMD versions.
In the previous parts:
- Part 0
- Part 1, general optimizations
- Part 2, SIMD
Integers and floating point
I think everyone knows that there are two main ways to represent numbers in computer memory. Their main properties are as follows.
Integers
- store exact values
- have a relatively narrow range of values
- the difference between two adjacent values is always 1
- convenient for storage
- 1 - 0.1 = 0
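Floating-point numbers
- store an approximate value with a certain precision
- have a very wide range of values
- the difference between adjacent values can be larger than the number of atoms in the universe
- convenient for intermediate computations
- 1 - 0.1 = 0.900000000000000022204460...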
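A floating-point number is stored in memory as two values: a mantissa and an exponent. The mantissa is the set of bits that holds the value itself (almost an integer), and the exponent says by how many digits the mantissa must be shifted. To find the true value of the number, multiply the mantissa by the base raised to the power of the exponent: m·2ᵉ. The base here is 2, because the system is binary.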
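To make the mantissa and exponent tangible, here is a minimal sketch that splits a 32-bit float into its raw fields. It assumes the platform stores float as IEEE 754 binary32 (true on x86) and only handles normal numbers; the FloatParts type and float_parts function are hypothetical names introduced for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three raw fields of an IEEE 754 binary32 value. */
typedef struct {
    uint32_t sign;      /* 1 bit: 0 for positive, 1 for negative */
    int32_t  exponent;  /* power of two, with the bias of 127 already removed */
    uint32_t mantissa;  /* 23 stored fraction bits; the implicit leading 1 is not included */
} FloatParts;

/* Assumption: float is IEEE 754 binary32 and the value is a normal number. */
FloatParts float_parts(float value) {
    uint32_t bits;
    FloatParts p;
    memcpy(&bits, &value, sizeof bits);  /* reinterpret the bytes without aliasing problems */
    p.sign = bits >> 31;
    p.exponent = (int32_t)((bits >> 23) & 0xFF) - 127;
    p.mantissa = bits & 0x7FFFFF;
    return p;
}
```

For example, 6.0f is 1.5·2², so the exponent field decodes to 2 and the stored fraction is 0x400000 (the ".5" after the implicit leading one).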

This system looks very complicated compared to the integer representation. However, on modern processors most floating-point operations (multiplication, addition and so on) take the same number of clock cycles as the corresponding integer operations. The complexity of the operations only increases the number of transistors in the processor, not the number of cycles. Still, apart from the speed of the operations themselves, a few other factors matter for performance.
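As noted above, integers are more often used for storage and floating point for computation. This means that values frequently have to be converted from one representation to the other. The conversion is fast, of course, but in some tasks it can affect performance. At the end of the first part I described a case where exactly such a conversion turned out to be the most expensive operation in the computation.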
The smallest floating-point type on the x86 architecture is 32 bits wide. Integers can often be stored in 16 bits, and sometimes in 8. This does not matter for scalar computing, where the processor executes one operation per instruction. For vector computing, however, narrower elements mean 2 to 4 times more operations per cycle.
At the end of the second part we found that the processor lowers its clock frequency while AVX2 code is running. According to my observations, this happens only when working with floating-point numbers. When integer AVX2 instructions are used, the processor keeps running at its maximum frequency. This is, of course, a rather specific point: the behavior can easily change from generation to generation and depending on the intended purpose of the processor. The conclusion stands, though: floating-point code may run slower than integer code even when the per-cycle performance is the same.
Fixed point
Since integers can give better performance, we can try to convert the arithmetic to integers. Simply replacing float with int everywhere will not work, of course. During a resize many computations happen in the range from 0 to 1, which in an integer representation is just zero.
This is where fixed-point numbers help. Strictly speaking, integers are fixed-point numbers too, with the point fixed right after the least significant bit. But we can mentally move the point, say, 8 binary digits to the left and agree that one unit is actually 1/256. Then 256 is one, 512 is two and 384 is 1.5. What does this give us? In this form we can store not only the integer part of a number but also its fractional part. A fairly common example of a fixed-point number is the currency data type available in some programming languages. It stores a whole number of kopecks or cents; to get rubles or dollars, the value has to be divided by 100.
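To repeat: a fixed-point number is a number multiplied by some constant in order to increase the precision of the computation. Unlike with floating point, this constant is not stored in the number itself. It can be built directly into the implementation of the algorithm or computed along the way.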
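The 1/256 example above can be sketched in a few lines of C. FRAC_BITS, fixed_from_double and fixed_to_double are hypothetical names chosen for this illustration; the scale constant lives in the code, not in the stored value.

```c
#include <stdint.h>

#define FRAC_BITS 8  /* assumption for this sketch: 8 fractional bits, so one unit is 1/256 */

/* Convert a real value to fixed point; the cast truncates toward zero. */
int32_t fixed_from_double(double v) {
    return (int32_t)(v * (1 << FRAC_BITS));
}

/* Convert a fixed-point value back to a real value. */
double fixed_to_double(int32_t f) {
    return (double)f / (1 << FRAC_BITS);
}
```

With these definitions fixed_from_double(1.5) gives 384 and fixed_to_double(512) gives 2.0, matching the numbers in the text.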
In general, working with fixed-point numbers is not a big deal, but there are a few things to keep in mind.
The higher the precision of the fractional part, the smaller the range of values. Moving the fixed point one bit to the left doubles the precision but halves the range. You should always try to move the point as far to the left as possible while still avoiding overflow. So before converting a computation to fixed point, you need to determine the range of the largest values that take part in it. For example, if the values range from -128 to 384, then 10 bits are needed to represent them (including the sign). With a 16-bit data type, only 6 bits remain for precision.
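The bit budget from the example above can be checked mechanically. This is a small sketch under the assumption of two's complement signed integers; bits_for_range and precision_bits_left are hypothetical helper names.

```c
#include <stdint.h>

/* Smallest number of bits of a signed two's complement integer that covers [lo, hi]. */
int bits_for_range(int32_t lo, int32_t hi) {
    int bits = 1;
    while (bits < 32) {
        int64_t min = -((int64_t)1 << (bits - 1));
        int64_t max = ((int64_t)1 << (bits - 1)) - 1;
        if (lo >= min && hi <= max) {
            break;
        }
        bits++;
    }
    return bits;
}

/* Bits left for the fractional part once the integer range is covered. */
int precision_bits_left(int type_bits, int32_t lo, int32_t hi) {
    return type_bits - bits_for_range(lo, hi);
}
```

For the range from -128 to 384 this gives 10 bits for the value and 6 bits of precision in a 16-bit type.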
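Addition and subtraction of fixed-point numbers work as usual. So does multiplication by an integer.
When two fixed-point numbers are multiplied, the number of bits responsible for precision doubles. The number of bits that hold the integer part doubles as well. This means that if you need the original precision after a multiplication, you have to shift the result right by the number of precision bits. Alternatively, you may skip the shift if that is more convenient for the following computations, but then you have to remember that the precision has changed.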
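A short sketch of both multiplication cases, again assuming 8 fractional bits. FRAC_BITS, fixed_mul and fixed_mul_int are hypothetical names, and the right shift of a negative value is assumed to be arithmetic, as it is on mainstream compilers.

```c
#include <stdint.h>

#define FRAC_BITS 8  /* assumption for this sketch: 8 fractional bits */

/* Fixed-point times fixed-point: the raw product carries 2 * FRAC_BITS
   fractional bits, so shift right once to return to the original precision. */
int32_t fixed_mul(int32_t a, int32_t b) {
    int64_t wide = (int64_t)a * b;
    return (int32_t)(wide >> FRAC_BITS);
}

/* Fixed-point times plain integer: the precision is unchanged, no shift needed. */
int32_t fixed_mul_int(int32_t a, int32_t n) {
    return a * n;
}
```

Here fixed_mul(384, 512) is 768 (1.5 times 2.0 equals 3.0), while fixed_mul_int(384, 3) is 1152 (1.5 times 3 equals 4.5).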

Counting the precision
Before converting the code to integers, it is worth working out how many bits can be allotted to precision. Such a calculation makes sense for every operation separately, and it is best to start from the end:
    float ss0, ss1, ss2;
    for (xx = 0; xx < imOut->xsize; xx++) {
        ss0 = ss1 = ss2 = 0.5;
        for (y = ymin; y < ymax; y++) {
            ss0 = ss0 + ((UINT8) imIn->image[y][xx*4+0]) * k[y-ymin];
            ss1 = ss1 + ((UINT8) imIn->image[y][xx*4+1]) * k[y-ymin];
            ss2 = ss2 + ((UINT8) imIn->image[y][xx*4+2]) * k[y-ymin];
        }
        imOut->image[yy][xx*4+0] = clip8(ss0);
        imOut->image[yy][xx*4+1] = clip8(ss1);
        imOut->image[yy][xx*4+2] = clip8(ss2);
    }
The accumulators ss0 to ss2 hold the sums of the products of pixels and coefficients. The pixels lie in the range [0, 255], and the sum of the coefficients is known to equal one. So the final value of the accumulators ss0 to ss2 also lies in the range [0, 255]. But that is only the final value! In general some coefficients can be negative, and as a result the sum of the positive coefficients can be greater than one (see the filter graphs in Part 0). Intermediate values can therefore go slightly outside the range [0, 255]. One bit for negative numbers and one more bit for overflow at the top are enough here. In total, 10 bits are needed to store the value: [-512, 511]. Since it is logical to keep the accumulators no wider than 32 bits, 22 bits remain for accumulator precision (let us call this PRECISION_BITS).
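What remains is the multiplication of pixels by coefficients. I have already said that multiplying a fixed-point number by an integer requires no extra conversion. In our case the integers are the pixel values. This means the coefficients can have the same precision as the accumulators: 22 bits.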
Fixed-point scalar computing
Surprisingly, only one line of the code above has to change to make it work with fixed point. Originally the accumulators are assigned the value 0.5. In the new number system this corresponds to the value 1 << (PRECISION_BITS - 1). That is, a one shifted left by one bit less than the precision, which is the new 0.5.
    int ss0, ss1, ss2;
    for (xx = 0; xx < imOut->xsize; xx++) {
        ss0 = ss1 = ss2 = 1 << (PRECISION_BITS -1);
        for (y = ymin; y < ymax; y++) {
            ss0 = ss0 + ((UINT8) imIn->image[y][xx*4+0]) * k[y-ymin];
            ss1 = ss1 + ((UINT8) imIn->image[y][xx*4+1]) * k[y-ymin];
            ss2 = ss2 + ((UINT8) imIn->image[y][xx*4+2]) * k[y-ymin];
        }
        imOut->image[yy][xx*4+0] = clip8(ss0);
        imOut->image[yy][xx*4+1] = clip8(ss1);
        imOut->image[yy][xx*4+2] = clip8(ss2);
    }
All the other computations stay unchanged, which indirectly suggests that we are on the right track. After all, we changed the concept without complicating the implementation, and we can expect a performance gain.
The clip8 function, however, which limits the final pixel value to the range [0, 255], changes a lot. It was:
    static inline UINT8
    clip8(float in)
    {
        int out = (int) in;
        if (out >= 255)
            return 255;
        if (out <= 0)
            return 0;
        return (UINT8) out;
    }
It became:
    static inline UINT8
    clip8(int in)
    {
        if (in >= (1 << PRECISION_BITS << 8))
            return 255;
        if (in <= 0)
            return 0;
        return (UINT8) (in >> PRECISION_BITS);
    }
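First, the type of the accepted value changes: it is now a 32-bit integer. Second, there is no longer an immediate cast to an integer type (it used to be on the first line). Instead, the value is compared with 1 << PRECISION_BITS << 8. In our fixed-point system this value is 256, because it is shifted by the number of fractional bits and then by 8 more bits, and 1 << 8 without the first shift is exactly 256. Only at the very end, when both comparisons have failed, is the value actually reduced to an ordinary integer without a point. This is done with a regular shift by the number of precision bits.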
Now the coefficients have to be brought to fixed point. To begin with, recall that the coefficients are floating-point numbers from -1 to 1, and that all the coefficients used to compute one pixel add up to one. I am sure there is no point in using integer arithmetic to compute the coefficients themselves. First, computing the coefficients takes far less time than applying them. Second, some filters use trigonometric functions internally. So it seems right to compute the coefficients in floating point and then convert them to fixed point by multiplying by (1 << PRECISION_BITS).
    for (x = 0; x < xsize * kmax; x++) {
        kk[x] = (int) (prekk[x] * (1 << PRECISION_BITS));
    }
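To see the pieces working together, here is a self-contained sketch of one fixed-point convolution for a single channel. It reuses PRECISION_BITS and the integer clip8 from the text, but the convolve_fixed function, its signature and the use of stdint types in place of UINT8 are assumptions made for illustration, not the Pillow code.

```c
#include <stdint.h>

#define PRECISION_BITS (32 - 8 - 2)  /* 22 fractional bits, as derived in the text */

/* Integer clip8 as in the text, written with stdint types. */
static uint8_t clip8(int32_t in) {
    if (in >= (1 << PRECISION_BITS << 8))
        return 255;
    if (in <= 0)
        return 0;
    return (uint8_t)(in >> PRECISION_BITS);
}

/* Hypothetical helper: convolve n pixels of one channel with n float
   coefficients, doing the accumulation entirely in fixed point. */
uint8_t convolve_fixed(const uint8_t *pixels, const double *coefs, int n) {
    int32_t acc = 1 << (PRECISION_BITS - 1);  /* the fixed-point 0.5, for rounding */
    for (int i = 0; i < n; i++) {
        /* Same conversion as in the text: scale, then truncate toward zero. */
        int32_t k = (int32_t)(coefs[i] * (1 << PRECISION_BITS));
        acc += pixels[i] * k;  /* integer pixel times fixed-point coefficient */
    }
    return clip8(acc);
}
```

With pixels 10, 20, 30 and coefficients 0.25, 0.5, 0.25 the result is 20, the same value the floating-point version produces. The sketch relies on the accumulator bound discussed above: intermediate sums must stay within [-512, 511] in pixel units.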
What does this give us? Here are the latest results of the scalar computation on floating-point numbers:
    Scale 2560×1600 RGB image
        to 320x200   bil   0.03009 s   136.10 Mpx/s
        to 320x200   bic   0.05187 s    78.97 Mpx/s
        to 320x200   lzs   0.08113 s    50.49 Mpx/s
        to 2048x1280 bil   0.14017 s    29.22 Mpx/s
        to 2048x1280 bic   0.17750 s    23.08 Mpx/s
        to 2048x1280 lzs   0.22597 s    18.13 Mpx/s
        to 5478x3424 bil   0.58726 s     6.97 Mpx/s
        to 5478x3424 bic   0.74648 s     5.49 Mpx/s
        to 5478x3424 lzs   0.90867 s     4.51 Mpx/s

Results for commit 57e8925.
And here are the results for fixed point:
    Scale 2560×1600 RGB image
        to 320x200   bil   0.02079 s   196.99 Mpx/s   44.7 %
        to 320x200   bic   0.03459 s   118.41 Mpx/s   50.0 %
        to 320x200   lzs   0.05649 s    72.50 Mpx/s   43.6 %
        to 2048x1280 bil   0.10483 s    39.07 Mpx/s   33.7 %
        to 2048x1280 bic   0.13362 s    30.66 Mpx/s   32.8 %
        to 2048x1280 lzs   0.17210 s    23.80 Mpx/s   31.3 %
        to 5478x3424 bil   0.46706 s     8.77 Mpx/s   25.7 %
        to 5478x3424 bic   0.59492 s     6.88 Mpx/s   25.5 %
        to 5478x3424 lzs   0.72819 s     5.62 Mpx/s   24.8 %

Results for commit 15d0573.
As you can see, the effort was not in vain: the gain is quite substantial. Most likely it comes from spending fewer operations on converting pixel values.
Fixed-point SIMD computing
Moving the SIMD code from the previous part to fixed-point computation can be split into four stages:
- porting the SSE4 vertical pass
- porting the SSE4 horizontal pass
- porting the AVX2 vertical pass
- porting the AVX2 horizontal pass
These stages are very much alike, so it makes sense to look closely at only one of them. Here is the SSE4 vertical pass on floating-point numbers.
    void
    ImagingResampleVerticalConvolution8u(UINT32 *lineOut, Imaging imIn,
        int ymin, int ymax, float *k)
    {
        int y, xx = 0;
        for (; xx < imIn->xsize; xx++) {
            __m128 sss = _mm_set1_ps(0.5);
            for (y = ymin; y < ymax; y++) {
                __m128i pix = _mm_cvtepu8_epi32(*(__m128i *) &imIn->image32[y][xx]);
                __m128 mmk = _mm_set1_ps(k[y - ymin]);
                __m128 mul = _mm_mul_ps(_mm_cvtepi32_ps(pix), mmk);
                sss = _mm_add_ps(sss, mul);
            }
            __m128i ssi = _mm_cvtps_epi32(sss);
            ssi = _mm_packs_epi32(ssi, ssi);
            lineOut[xx] = _mm_cvtsi128_si32(_mm_packus_epi16(ssi, ssi));
        }
    }
The __m128 data type holds 4 floating-point numbers. It is no longer needed and has to be replaced with __m128i. The counterpart of the _mm_set1_ps function is _mm_set1_epi32. The conversion functions _mm_cvtepi32_ps and _mm_cvtps_epi32 are no longer needed; instead, the final result has to be shifted right by PRECISION_BITS. The only difficulty is the _mm_mul_ps function, which has no direct counterpart. On closer inspection, though, there is _mm_mullo_epi32. Multiplying two 32-bit numbers really produces a 64-bit number. The "lo" means that the low 32 bits of the result are returned, which is exactly what we need. The complete code looks like this:
    void
    ImagingResampleVerticalConvolution8u(UINT32 *lineOut, Imaging imIn,
        int ymin, int ymax, int *intk)
    {
        int y, xx = 0;
        for (; xx < imIn->xsize; xx++) {
            __m128i sss = _mm_set1_epi32(1 << (PRECISION_BITS -1));
            for (y = ymin; y < ymax; y++) {
                __m128i pix = _mm_cvtepu8_epi32(*(__m128i *) &imIn->image32[y][xx]);
                __m128i mmk = _mm_set1_epi32(intk[y - ymin]);
                __m128i mul = _mm_mullo_epi32(pix, mmk);
                sss = _mm_add_epi32(sss, mul);
            }
            sss = _mm_srai_epi32(sss, PRECISION_BITS);
            sss = _mm_packs_epi32(sss, sss);
            lineOut[xx] = _mm_cvtsi128_si32(_mm_packus_epi16(sss, sss));
        }
    }
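What "the low 32 bits of the result" means for a single lane can be modeled in plain C. This is only a scalar sketch of the per-lane semantics, not the intrinsic itself; mullo32_model is a hypothetical name, and the final conversion back to a signed type assumes the usual two's complement behavior of mainstream compilers.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of _mm_mullo_epi32:
   keep only the low 32 bits of the full 64-bit product. */
int32_t mullo32_model(int32_t a, int32_t b) {
    /* Unsigned multiplication wraps modulo 2^32, which is exactly "the low half". */
    uint32_t low = (uint32_t)a * (uint32_t)b;
    return (int32_t)low;  /* assumption: two's complement conversion */
}
```

A pixel value of 255 multiplied by the largest coefficient, 1 << 22, gives 1069547520, which still fits in 32 bits. This is why discarding the high half loses nothing as long as the precision budget from the earlier section is respected.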
Now we can compare the results of the SSE4 version on floating-point numbers:
    Scale 2560×1600 RGB image
        to 320x200   bil   0.01151 s   355.87 Mpx/s
        to 320x200   bic   0.02005 s   204.27 Mpx/s
        to 320x200   lzs   0.03421 s   119.73 Mpx/s
        to 2048x1280 bil   0.04450 s    92.05 Mpx/s
        to 2048x1280 bic   0.05951 s    68.83 Mpx/s
        to 2048x1280 lzs   0.07804 s    52.49 Mpx/s
        to 5478x3424 bil   0.18615 s    22.00 Mpx/s
        to 5478x3424 bic   0.24039 s    17.04 Mpx/s
        to 5478x3424 lzs   0.30674 s    13.35 Mpx/s

Results for commit 8d0412b.
with the results of the SSE4 version on fixed-point numbers:
    Scale 2560×1600 RGB image
        to 320x200   bil   0.01253 s   326.82 Mpx/s    -8.1 %
        to 320x200   bic   0.02239 s   182.94 Mpx/s   -10.5 %
        to 320x200   lzs   0.03663 s   111.83 Mpx/s    -6.6 %
        to 2048x1280 bil   0.04712 s    86.92 Mpx/s    -5.6 %
        to 2048x1280 bic   0.06731 s    60.86 Mpx/s   -11.6 %
        to 2048x1280 lzs   0.08176 s    50.10 Mpx/s    -4.5 %
        to 5478x3424 bil   0.19010 s    21.55 Mpx/s    -2.1 %
        to 5478x3424 bic   0.25013 s    16.38 Mpx/s    -3.9 %
        to 5478x3424 lzs   0.31413 s    13.04 Mpx/s    -2.4 %

Results for commit 7d8df66.
And here a surprise was waiting for me. For a long time I tried to understand what had gone wrong. At some point I went to look at the timings of every instruction used in the loop, and the answer was right there.
You will not see this in the Intel Intrinsics Guide today, because it is updated all the time and data for older processors is removed now and then. But when I was investigating, the _mm_mullo_epi32 operation had the following timing table:
    Architecture   Latency   Throughput
    Broadwell      10        2
    Haswell        10        2
    Ivy Bridge     5         1
Now compare it with the timings of the analogous _mm_mul_ps instruction for floating-point numbers:
    Architecture   Latency   Throughput
    Broadwell      3         0.5
    Haswell        5         0.5
    Ivy Bridge     5         1
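You can see that, starting with the Haswell architecture, Intel made vector multiplication of 32-bit integers considerably slower. What is more, every other kind of multiplication kept getting faster from architecture to architecture; only the 32-bit integer one regressed.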
Interestingly, this is not observed in the AVX2 version of the code. There the negative effect of the increased latency did not outweigh the positive effect of switching to fixed-point computation, and fixed-point performance is about 10% higher. There are two reasons for this:
- As mentioned at the beginning, the processor does not lower its frequency when executing integer AVX2 instructions, unlike floating-point AVX2 instructions.
- The AVX2 version processes twice as much data at a time, so half as many multiplication instructions are executed for the same amount of data. This makes the negative effect of the high latency half as noticeable.
A digression
I prepared the integer computations for Pillow version 3.3. At that time I was trying to release Pillow and Pillow-SIMD more or less simultaneously and with the same improvements. It was very frustrating that switching to integers gave Pillow a noticeable gain while in Pillow-SIMD the gain was negligible or absent altogether. Later, in a subsequent release, I managed to make up for this lag a little by unrolling the loops. This improved instruction pipelining and somewhat reduced the impact of the slow multiplication. But I will talk about that in the last article of this series.
If you look at how the performance of the regular Pillow version changed, you can see that Pillow 3.3 got a considerable gain from integer computation. In Pillow 3.4 everything stayed at roughly the same level.

With Pillow-SIMD the situation is the opposite: version 3.3 turned out to be hardly any faster than the previous one. But 3.4 made a big leap, which allows me to say that Pillow-SIMD is currently the fastest resize implementation on a CPU.

To achieve such an improvement in Pillow-SIMD 3.4, the 32-bit integer vector multiplication had to go. But how? Convert all computations to 16 bits? In that case it is easy to calculate that the coefficients (PRECISION_BITS) would be left with 16 - 8 - 2 = 6 bits, that is, 64 values in total. In practice even less, because all the coefficients together must add up to one (that is, to 64). The number of coefficients depends on the size of the filter window and on the downscale factor (see Part 0 for details). When an image is reduced 10 times with the Lanczos filter, there are 60 coefficients. Computing in 16 bits is clearly not accurate enough, and something else had to be invented.
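The loss of accuracy is easy to demonstrate with a rough sketch. To keep the arithmetic transparent it assumes 60 equal coefficients of 1/60 each, which is a box-like simplification rather than a real Lanczos window; quantized_sum is a hypothetical helper.

```c
/* Sum of `count` equal coefficients (1/count each) after converting every one
   of them to fixed point with `bits` fractional bits (truncating toward zero).
   Ideally the sum would be exactly 1 << bits. */
int quantized_sum(int count, int bits) {
    int sum = 0;
    for (int i = 0; i < count; i++) {
        sum += (int)((1.0 / count) * (1 << bits));
    }
    return sum;
}
```

With 6 bits every coefficient truncates to 1, so the sum is 60 instead of 64 and the image would come out about 6% darker. With 22 bits the sum is 4194300 against the ideal 4194304, an error of about one part in a million.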
One thought kept bothering me: for some reason Intel decided to cripple multiplication in this strange way, and yet other developers are not complaining about it on the internet; they keep solving their problems successfully. Maybe 32-bit multiplication is not really needed for graphics, and I just cannot see how to do without it.
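I tried to work out how to use _mm_mullo_epi16. The bit depth of the coefficients can be chosen carefully so that the result of the convolution is 32-bit while the product of a pixel value and a coefficient still fits in 16 bits. That leaves 7 bits for the precision of the coefficients themselves (one bit goes to the sign). This is much better than 6 bits for the sum of all the coefficients. I was about to implement this when I accidentally came across another solution.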
What if there were a special instruction for convolutions?
Sometimes, while solving a problem and looking at the tools you have from different angles, you stumble upon a tool that was invented specifically for this task.
What makes multiplication difficult? Storing the result of a multiplication takes twice as many bits as the operands. So a choice has to be made: take either the upper or the lower part of the result. Precision suffers from this, and often fewer significant bits of the operands end up being used. What if we could get the whole multiplication result? Then we would need twice as many bits, that is, two registers holding the result. But what if we add these two registers together after the multiplication? We have to add the multiplication results anyway; that is the whole point of a convolution. Then we get an instruction that takes X pairs of operands, multiplies them to get X products, then adds the adjacent ones and outputs X / 2 sums of products. And, oddly enough, such an instruction turned up already in SSE2: it is called _mm_madd_epi16. Its latency is half that of _mm_mullo_epi32, and it performs three times as many operations.
Once again: the input is two registers, each holding eight signed 16-bit integers. These numbers are multiplied pairwise, with the eight 32-bit products kept internally. Adjacent products are summed to produce four signed 32-bit numbers. That is 8 multiplications and 4 additions in one fast instruction instead of 4 slow multiplications, with practically no loss of precision.
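The behavior just described can be written down as a scalar model. This is a sketch of the semantics only (it ignores the single overflow corner case where both pairs are -32768 times -32768); madd_epi16_model is a hypothetical name.

```c
#include <stdint.h>

/* Scalar model of _mm_madd_epi16: multiply eight pairs of signed 16-bit
   values and add adjacent products, producing four signed 32-bit sums. */
void madd_epi16_model(const int16_t a[8], const int16_t b[8], int32_t out[4]) {
    for (int i = 0; i < 4; i++) {
        int32_t even = (int32_t)a[2 * i] * b[2 * i];          /* full 32-bit product */
        int32_t odd  = (int32_t)a[2 * i + 1] * b[2 * i + 1];  /* full 32-bit product */
        out[i] = even + odd;
    }
}
```

For a = {1, 2, 3, 4, 5, 6, 7, 8} and b = {10, 20, 30, 40, 50, 60, 70, 80} the output is {50, 250, 610, 1130}.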
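The only problem is that adjacent products are added together, and for pixels adjacent means neighboring channels. If the instruction is applied head-on, the red channel of the first pixel is added to the green channel of the first pixel, and the blue channel of the first pixel to its alpha channel. The same happens with the second pixel. For a convolution, the red channel of the first pixel has to be added to the red channel of the second pixel. In other words, the values have to be shuffled a little before the instruction is applied.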
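One possible way to do this shuffle for two pixels is sketched below with SSE2 intrinsics. It is an illustration under stated assumptions, not the code from the commits: it requires an x86 target with SSE2, the function madd_two_pixels and its signature are hypothetical, and pixels are taken as four bytes in R, G, B, A order.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <string.h>

/* For two 4-channel pixels p0, p1 and two 16-bit coefficients k0, k1, compute
   out[c] = p0[c] * k0 + p1[c] * k1 for every channel c with one _mm_madd_epi16. */
void madd_two_pixels(const uint8_t p0[4], const uint8_t p1[4],
                     int16_t k0, int16_t k1, int32_t out[4]) {
    int32_t w0, w1;
    memcpy(&w0, p0, 4);
    memcpy(&w1, p1, 4);

    /* Interleave the bytes of the two pixels: r0 r1 g0 g1 b0 b1 a0 a1. */
    __m128i mixed = _mm_unpacklo_epi8(_mm_cvtsi32_si128(w0), _mm_cvtsi32_si128(w1));
    /* Widen to eight 16-bit lanes by interleaving with zero bytes. */
    __m128i pix = _mm_unpacklo_epi8(mixed, _mm_setzero_si128());

    /* Each 32-bit lane holds the pair (k0, k1): k0 in the low half, k1 in the high half. */
    uint32_t pair = ((uint32_t)(uint16_t)k1 << 16) | (uint16_t)k0;
    __m128i mmk = _mm_set1_epi32((int32_t)pair);

    /* Adjacent lanes now belong to the same channel of neighboring pixels. */
    _mm_storeu_si128((__m128i *)out, _mm_madd_epi16(pix, mmk));
}
```

The point of the interleave is that the two values summed by the instruction come from the same channel of two different pixels, each paired with its own coefficient.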
Switching to 16-bit coefficients
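Unfortunately, simply replacing the int type with INT16 is not enough: the coefficients may not fit into this type. At the beginning I said that the exponent (or, if you like, the precision of the numbers, or the position of the virtual fixed point) can be set in the algorithm itself or computed along the way. This is exactly the case where a different exponent has to be chosen depending on the input data. And for this calculation we need the maximum value of the coefficients.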
    #define MAX_COEFS_PRECISION (16 - 1)
    #define PRECISION_BITS (32 - 8 - 2)

    coefs_precision = 0;
    while (
        maxkk < (1 << (MAX_COEFS_PRECISION-1)) &&
        (coefs_precision < PRECISION_BITS)
    ) {
        maxkk *= 2;
        coefs_precision += 1;
    };
That is, on the one hand the largest coefficient must not exceed 16 bits (because it will be stored in a 16-bit format), and on the other hand the value of the whole convolution must not exceed 32 bits (so far this condition is ensured by coefs_precision < PRECISION_BITS).
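The loop above can be wrapped into a function to see which precision it picks for different inputs. The two macros are the ones from the text; the function names, the double type for maxkk and the truncating conversion in coef_to_int16 are assumptions made for this sketch.

```c
#include <stdint.h>

#define MAX_COEFS_PRECISION (16 - 1)
#define PRECISION_BITS (32 - 8 - 2)

/* Hypothetical wrapper around the loop from the text: pick the number of
   fractional bits so that the largest coefficient still fits in 16 bits. */
int choose_coefs_precision(double maxkk) {
    int coefs_precision = 0;
    while (maxkk < (1 << (MAX_COEFS_PRECISION - 1)) &&
           coefs_precision < PRECISION_BITS) {
        maxkk *= 2;
        coefs_precision += 1;
    }
    return coefs_precision;
}

/* Assumption: coefficients are converted by scaling and truncating toward zero. */
int16_t coef_to_int16(double k, int coefs_precision) {
    return (int16_t)(k * (1 << coefs_precision));
}
```

A largest coefficient of 1.0 yields 14 bits of precision and 0.5 yields 15 bits; very small coefficients, which are typical for strong downscaling, are capped at PRECISION_BITS, that is 22.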
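I suspect I have already tired you with code, so I will not go through what has to be changed and how the pixels are shuffled so that the _mm_madd_epi16 instruction can be applied. Those who are interested can, as always, look at the changes in the commits on GitHub and ask questions in the comments. Here are the results of the SSE4 version with 16-bit coefficients relative to the SSE4 version on floating-point numbers: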
    Scale 2560×1600 RGB image
        to 320x200   bil   0.00844 s   485.20 Mpx/s   36.4 %
        to 320x200   bic   0.01289 s   317.79 Mpx/s   55.5 %
        to 320x200   lzs   0.01903 s   215.24 Mpx/s   79.8 %
        to 2048x1280 bil   0.04481 s    91.41 Mpx/s   -0.7 %
        to 2048x1280 bic   0.05419 s    75.59 Mpx/s    9.8 %
        to 2048x1280 lzs   0.06930 s    59.11 Mpx/s   12.6 %
        to 5478x3424 bil   0.19939 s    20.54 Mpx/s   -6.6 %
        to 5478x3424 bic   0.24559 s    16.68 Mpx/s   -2.1 %
        to 5478x3424 lzs   0.29152 s    14.05 Mpx/s    5.2 %

Results for commit 9b9a91f.
And the results of the AVX2 version with 16-bit coefficients relative to the AVX2 version on floating-point numbers:
    Scale 2560×1600 RGB image
        to 320x200   bil   0.00682 s   600.15 Mpx/s   34.6 %
        to 320x200   bic   0.00990 s   413.86 Mpx/s   50.5 %
        to 320x200   lzs   0.01424 s   287.54 Mpx/s   60.6 %
        to 2048x1280 bil   0.03889 s   105.31 Mpx/s    7.6 %
        to 2048x1280 bic   0.04519 s    90.64 Mpx/s   11.3 %
        to 2048x1280 lzs   0.05226 s    78.38 Mpx/s   18.2 %
        to 5478x3424 bil   0.15195 s    26.96 Mpx/s    6.7 %
        to 5478x3424 bic   0.16977 s    24.13 Mpx/s   17.8 %
        to 5478x3424 lzs   0.20229 s    20.25 Mpx/s   15.6 %

Results for commit 3ad4718.
Summary
Overall, the move to integer computation paid off both for the scalar code and for SIMD. In the SSE4 version, performance drops slightly when upscaling images with some of the filters. In reality, though, the code shown here differs considerably from what went into Pillow-SIMD 3.3 or 3.4; it is more of a retelling. The real versions had no performance drops.
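If you look back and recall the very first version, the current code turns out to be 10 to 12 times faster on the same hardware. What used to take 2 seconds can now be done 5 times per second! But if you look at the official benchmarks, you will see that the real performance of Pillow-SIMD 3.4 with AVX2 is twice as high as what we have reached by the end of this article. So there is both a reason and material for the next part.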