Today we will answer a simple question: how does distributed training work in the context of MXNet?
All code examples were tested on MXNet v0.10.0 and may not work (or may behave differently) on other versions, but I believe the general concepts will stay unchanged for a long time.
Finally, before moving on to the main part, I would like to thank the colleagues without whose help this article could not have been written:
- Madan Jampani;
- Suneel Marthi.
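I also recommend bringing up a machine with the DLAMI and working through all the examples in this article yourself, especially since it is very easy to do. A free-tier AWS machine is perfectly adequate for running the code.
With the preface out of the way, let's dive in...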
MXNet's view of distributed training
In MXNet, all participants in the training process are divided into three logical groups:
- scheduler
- server
- worker
This division is purely logical, so all participants can run on the same machine.
To begin, let's look at a high-level description of what each participant is.
Scheduler
The scheduler is the central node of the cluster. It is responsible for the initial cluster configuration and provides every participant in the training process with the information it needs. We will see how it goes dormant as soon as the cluster is ready to start training. It comes back into play once the cluster has finished training, when its job is to shut the cluster down.
As you have probably guessed, a cluster can have only one scheduler.
Server
A server acts as the store for the parameters of the model being trained. For example, if a model of the form Y = AX + B is being trained, the server stores the vectors A and B. It is also responsible for applying updates to the model correctly. There can be more than one server, in which case there are rules for distributing the model across them. That, however, is a topic for a separate article.
Worker
Workers are the cluster members that actually perform the model training. Each worker receives its portion of the training data, computes a gradient step, and sends it to the servers so that they can update the model.
Cluster example
Let's take a hypothetical cluster as an example:
- a scheduler on a separate machine;
- two servers;
- three machines with workers.
The cluster itself looks like this:
This diagram, like the configuration it describes, is used only to visualize the data flows.
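Cluster initialization
We will not actually create a cluster as large as the one described above. Instead we will make do with a much smaller cluster of three nodes on a single physical machine. There are several reasons for this:
- it is simpler and cheaper;
- all the logs end up in one place, which makes the explanation easier.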
Before continuing, one detail needs to be made clear. In MXNet, distributed training essentially means using KVStore. The name stands for Key Value Storage. In essence it is a distributed store that runs on the servers and has some extra functionality (for example, it knows exactly how to update the model when it receives a gradient step from a worker).
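To make the idea concrete, here is a minimal single-process sketch of a key-value store that updates parameters when a gradient is pushed. It is not the MXNet KVStore API: the class name ToyKVStore, its methods, and the plain SGD update rule are assumptions chosen only for illustration.

```python
class ToyKVStore:
    """In-memory stand-in for a parameter server (hypothetical, not the MXNet API)."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self._params = {}

    def init(self, key, value):
        # Store the initial parameter vector under a key.
        self._params[key] = list(value)

    def push(self, key, gradient):
        # "Server-side" update: apply one plain SGD step for the pushed gradient.
        current = self._params[key]
        self._params[key] = [
            p - self.learning_rate * g for p, g in zip(current, gradient)
        ]

    def pull(self, key):
        # Workers fetch a copy of the latest parameters.
        return list(self._params[key])


store = ToyKVStore(learning_rate=0.5)
store.init("A", [1.0, 2.0])
store.push("A", [2.0, -2.0])  # a gradient step computed by some worker
print(store.pull("A"))        # [0.0, 3.0]
```

The point of the sketch is the division of labour: workers only compute gradients and call push/pull, while the update logic lives with the store.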
Note that KVStore support is available only in one of the following two cases:
- MXNet was built manually with the flag USE_DIST_KVSTORE=1 enabled, or
- you are using the DLAMI (where the framework was built manually with the USE_DIST_KVSTORE=1 flag enabled).
This article assumes that you are using MXNet from the Jun/Jul DLAMI release (MXNet 0.10.0).
There is also a non-zero chance that, by the time you read this, the official MXNet pip package will support KVStore.
Let's start creating the logical cluster members. To create a participant, all you need to do is set a few environment variables and then import the mxnet module.
Scheduler
First, let's start the scheduler:
ubuntu:~$ python
>>> import subprocess
>>> import os
>>> scheduler_env = os.environ.copy()
>>> scheduler_env.update({
...     "DMLC_ROLE": "scheduler",
...     "DMLC_PS_ROOT_PORT": "9000",
...     "DMLC_PS_ROOT_URI": "127.0.0.1",
...     "DMLC_NUM_SERVER": "1",
...     "DMLC_NUM_WORKER": "1",
...     "PS_VERBOSE": "2"
... })
>>> subprocess.Popen("python -c 'import mxnet'", shell=True, env=scheduler_env)
<subprocess.Popen object at 0x7facb0622850>
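Let's pause here for a moment to understand what is going on. The first four lines of code should not raise many questions for a Python programmer: they simply import the dependencies and create a copy of the OS environment. The interesting part is which updates are made to the environment variables.
Let's start with DMLC_ROLE and look at exactly where it is used, namely in the ps-lite package. According to its official README (loosely translated):
A simple and efficient implementation of a server for storing parameters.
And here is the exact place where the environment variable is read (by the way, all links point to a specific commit):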
val = CHECK_NOTNULL(Environment::Get()->find("DMLC_ROLE"));
You do not need to be a C++ guru to understand what is happening here. The logical role of the node is determined by the string stored in this variable, DMLC_ROLE. Funnily enough, there appears to be no check that the variable contains one of the permitted values, which could potentially lead to some interesting problems.
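Since the role string is apparently not validated, a launcher script can do the check itself before starting a process. The following is a small sketch under that assumption; the function name checked_role is hypothetical, and the allowed values are simply the three roles used in this article.

```python
ALLOWED_ROLES = ("scheduler", "server", "worker")


def checked_role(env):
    """Return the DMLC_ROLE value from env, or raise if it is missing or unknown."""
    role = env.get("DMLC_ROLE")
    if role not in ALLOWED_ROLES:
        raise ValueError(
            "DMLC_ROLE must be one of %s, got %r" % (", ".join(ALLOWED_ROLES), role)
        )
    return role


print(checked_role({"DMLC_ROLE": "server"}))  # server

try:
    checked_role({"DMLC_ROLE": "sheduler"})   # typo on purpose
except ValueError as error:
    print("rejected:", error)
```

Failing early in Python is much easier to debug than a node that silently takes on an unexpected role.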
The second point of interest is not where the variable is read, but where it is used. To talk about that we need to turn to the file van.cc, which we will come back to more than once. Here are the specific lines where the variable is used and the variable is_scheduler_ is created:
scheduler_.hostname = std::string(CHECK_NOTNULL(Environment::Get()->find("DMLC_PS_ROOT_URI")));
scheduler_.port = atoi(CHECK_NOTNULL(Environment::Get()->find("DMLC_PS_ROOT_PORT")));
scheduler_.role = Node::SCHEDULER;
scheduler_.id = kScheduler;
is_scheduler_ = Postoffice::Get()->is_scheduler();
If you study the code for a while to see what happens next, you will come across the following interesting spot:
In this particular place, the variable role can never be equal to Node::SCHEDULER. So there is an opportunity to open a pull request that fixes it (if nobody has done so already).
Just by looking at this place you can see that the scheduler does not have much work to do. Unlike workers and servers, the scheduler uses the IP address and port that were passed to it and does not search for a free port on the system.
Moving on to the next parameter: DMLC_PS_ROOT_PORT. Given what we already know, we can deal with it quickly. Here is the code we have already seen:
scheduler_.hostname = std::string(CHECK_NOTNULL(Environment::Get()->find("DMLC_PS_ROOT_URI")));
scheduler_.port = atoi(CHECK_NOTNULL(Environment::Get()->find("DMLC_PS_ROOT_PORT")));
Again, this is from van.cc. As you can guess, this is the port on which the scheduler listens for messages.
At this point it should be clear that DMLC_PS_ROOT_URI is nothing more than the IP address of the scheduler. So let's jump straight to DMLC_NUM_SERVER and DMLC_NUM_WORKER.
It so happens that every logical MXNet node in the cluster needs to know about all the other nodes. For that reason, before each node starts, the numbers of workers and servers in the cluster are written to environment variables (the number of schedulers is not needed, since it is always 1). Incidentally, this information is stored (along with other cluster information) in the Postoffice class.
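Because every node needs the same five DMLC_* values and differs only in its role, the environment can be built by one small helper. This is a sketch rather than part of MXNet: the function name make_node_env and its defaults are assumptions that mirror the values used in this article.

```python
import os


def make_node_env(role, root_uri="127.0.0.1", root_port=9000,
                  num_servers=1, num_workers=1, verbose=2, base_env=None):
    """Build the environment for one logical node (hypothetical helper)."""
    # Start from the current process environment unless a base is supplied.
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "DMLC_ROLE": role,
        "DMLC_PS_ROOT_URI": root_uri,          # scheduler address
        "DMLC_PS_ROOT_PORT": str(root_port),   # scheduler port
        "DMLC_NUM_SERVER": str(num_servers),
        "DMLC_NUM_WORKER": str(num_workers),
        "PS_VERBOSE": str(verbose),
    })
    return env


scheduler_env = make_node_env("scheduler", base_env={})
server_env = make_node_env("server", base_env={})
print(sorted(scheduler_env))
```

Environment values must be strings, which is why the numeric arguments are converted with str() before being stored.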
And now the last parameter, though perhaps one of the most important: PS_VERBOSE. It makes the newly created process print debug information, which is exactly what we need right now.
In terms of our hypothetical diagram, the cluster now looks like this:
Starting the server
Now that we have a scheduler, let's bring up a server. Since we are running all the logical nodes on one machine, starting the server again requires making a copy of the environment and applying the necessary changes to it:
>>> server_env = os.environ.copy()
>>> server_env.update({
...     "DMLC_ROLE": "server",
...     "DMLC_PS_ROOT_URI": "127.0.0.1",
...     "DMLC_PS_ROOT_PORT": "9000",
...     "DMLC_NUM_SERVER": "1",
...     "DMLC_NUM_WORKER": "1",
...     "PS_VERBOSE": "2"
... })
>>> subprocess.Popen("python -c 'import mxnet'", shell=True, env=server_env)
<subprocess.Popen object at 0x7facb06228d0>
I hope that by now what happens in this code does not raise any questions, but just in case:
- we say that the new process is a server (DMLC_ROLE);
- we give it the IP of the scheduler (DMLC_PS_ROOT_URI);
- we give it the port on which the scheduler listens for incoming connections (DMLC_PS_ROOT_PORT);
- we tell the server how many workers there are in the cluster (DMLC_NUM_WORKER);
- we tell the server how many servers there are in the cluster (DMLC_NUM_SERVER);
- and we set the output to debug mode (PS_VERBOSE set to 2).
At this point someone might ask: wait, I thought DMLC_PS_ROOT_PORT and DMLC_PS_ROOT_URI specified the IP and port of the logical node being started? The answer is no. They are the address and port of the scheduler; every other node has to work out its own address and find an available port on its system. The nodes need the scheduler's details so that they can knock on its door and ask to be added to the cluster.
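How a node picks its own port is an implementation detail of ps-lite that is not shown here. One common way to find an available port, given purely as an illustration and not as what ps-lite actually does, is to bind to port 0 and let the operating system choose. The function name find_free_port is hypothetical.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port (illustrative approach only)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 means "pick any free port"; getsockname() reveals the choice.
        sock.bind((host, 0))
        return sock.getsockname()[1]


port = find_free_port()
print("a free port on this machine:", port)
```

Note that the port is only guaranteed to be free at the moment of the call; another process could take it before the node binds to it again.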
After starting the server, our diagram looks like this:
Starting the worker
Now let's start the worker itself and create a KVStore:
>>> os.environ.update({
...     "DMLC_ROLE": "worker",
...     "DMLC_PS_ROOT_URI": "127.0.0.1",
...     "DMLC_PS_ROOT_PORT": "9000",
...     "DMLC_NUM_SERVER": "1",
...     "DMLC_NUM_WORKER": "1",
...     "PS_VERBOSE": "2"
... })
>>> worker_env = os.environ.copy()
>>> import mxnet
>>> kv_store = mxnet.kv.create('dist_async')
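Incidentally, a distributed KVStore can run in two modes: the code above uses the asynchronous one, 'dist_async', and its synchronous counterpart is 'dist_sync'.
I will leave the question of how exactly these modes differ to the curious reader.
After starting the worker, the diagram looks like this: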
Node life cycle (Van)
Before rushing into what happens when a KVStore is created, we need to talk about the fact that every node has a life cycle consisting of the following events:
The same class that handles these events (Van) also has several other important methods. Some of them will be covered in detail in later articles; for now we simply list them:
- Send: sends a message
- PackMeta: converts the message metadata into a proto message
- UnpackMeta: unpacks a proto message and reconstructs the metadata
- HeartBeat: sends a message saying that the node is still alive
Here is what each node does when the start signal arrives:
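Cluster initialization
As soon as all the commands above have been executed, a lot of debug information appears on the screen, coming simultaneously from the three processes started earlier. We will now go through it line by line, explaining in detail what happens at each stage and what our diagram looks like.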
[00:33:12] src/van.cc:75: Bind to role=worker, ip=1.1.1.1, port=37350, is_recovery=0
This is the worker process starting up. More precisely, it is the Start method reporting that the node's address is 1.1.1.1, its role is "worker", and the port it found is 37350. It then immediately tries to notify the scheduler that it is ready to be added to the cluster, giving its address and port:
[00:33:12] src/van.cc:136: ? => 1. Meta: request=0, timestamp=3, control={ cmd=ADD_NODE, node={ role=worker, ip=1.1.1.1, port=37350, is_recovery=0 } }
This particular message is produced by the Send method. Several things in it are worth noting:
- is_recovery=0: reports that the node is not in recovery mode; that part is outside the scope of this article.
- cmd=ADD_NODE: the command asking the scheduler to add the worker to the cluster.
- ? => 1: every node has its own rank, and ranks are assigned by the scheduler. The scheduler itself has rank 1. Here a node that has no rank yet is sending a message to the node with rank 1 (the scheduler).
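When reading long logs it can help to pull these fields out mechanically. The sketch below parses one message line with a regular expression; it assumes the compact log layout shown in the sample string inside the code, and the names LOG_PATTERN and parse_message_line are hypothetical.

```python
import re

# Assumed layout: "[time] file:line: sender => receiver. Meta: ..."
LOG_PATTERN = re.compile(
    r"\[(?P<time>[\d:]+)\]\s+(?P<file>\S+):(?P<line>\d+):\s+"
    r"(?P<sender>\?|\d+) => (?P<receiver>\d+)\. Meta: (?P<meta>.*)"
)


def parse_message_line(line):
    """Return a dict of fields for a Send/Receive log line, or None if it differs."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Extract the control command (ADD_NODE, BARRIER, ...) if present.
    cmd = re.search(r"cmd=(\w+)", fields["meta"])
    fields["cmd"] = cmd.group(1) if cmd else None
    return fields


sample = (
    "[00:33:12] src/van.cc:136: ? => 1. Meta: request=0, timestamp=3, "
    "control={ cmd=ADD_NODE, node={ role=worker, ip=1.1.1.1, "
    "port=37350, is_recovery=0 } }"
)
parsed = parse_message_line(sample)
print(parsed["sender"], "=>", parsed["receiver"], parsed["cmd"])
```

A sender of "?" marks a node that has not been given a rank yet, which is exactly the situation described in the last bullet above.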
In the diagram, this exchange looks like this:
Moving on:
[00:33:13] src/van.cc:75: Bind to role=server, ip=2.2.2.2, port=54160, is_recovery=0
This is the server waking up. It has found a port (54160) and immediately tries to tell the scheduler about it:
[00:33:13] src/van.cc:136: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=2.2.2.2, port=54160, is_recovery=0 } }
In the diagram, it looks like this:
Just like the worker, the server sends an ADD_NODE command to register itself with the cluster. Because the server is not yet registered and has no rank, we see "? => 1".
[00:33:13] src/van.cc:75: Bind to role=scheduler, id=1, ip=127.0.0.1, port=9000, is_recovery=0
Finally, the scheduler is up. It uses the local IP and port 9000 (every node in the cluster must already know this address and port). Now that the scheduler is running, it is logical to expect that it will start receiving all the messages that were sent to it... and voila:
[00:33:13] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=2.2.2.2, port=54160, is_recovery=0 } }
This is the message from the server. This part of the log is produced by the Receive method. The scheduler immediately receives a second message, this time from the worker:
[00:33:13] src/van.cc:161: ? => 1. Meta: request=0, timestamp=3, control={ cmd=ADD_NODE, node={ role=worker, ip=1.1.1.1, port=37350, is_recovery=0 } }
The first thing the scheduler does is assign a rank to the worker (9):
[00:33:13] src/van.cc:235: assign rank=9 to node role=worker, ip=1.1.1.1, port=37350, is_recovery=0
And then to the server (8):
[00:33:13] src/van.cc:235: assign rank=8 to node role=server, ip=2.2.2.2, port=54160, is_recovery=0
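The values 8 and 9 hint at a numbering scheme in which servers and workers get interleaved ids above a reserved low range. The sketch below shows one scheme that is consistent with these two log lines; it is an assumption inferred from the log, not something confirmed here against the ps-lite source, and both function names are hypothetical.

```python
def server_rank_to_id(rank):
    """Assumed mapping: servers get even ids starting at 8."""
    return 8 + 2 * rank


def worker_rank_to_id(rank):
    """Assumed mapping: workers get odd ids starting at 9."""
    return 9 + 2 * rank


# The first server and the first worker match the ids seen in the log.
print(server_rank_to_id(0), worker_rank_to_id(0))  # 8 9

# With more nodes the two sequences never collide.
print([server_rank_to_id(r) for r in range(3)])    # [8, 10, 12]
print([worker_rank_to_id(r) for r in range(3)])    # [9, 11, 13]
```

Under this assumption, the parity of an id is enough to tell a server from a worker.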
After that comes a very important part:
[00:33:13] src/van.cc:136: ? => 9. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=1.1.1.1, port=37350, is_recovery=0 role=server, id=8, ip=2.2.2.2, port=54160, is_recovery=0 role=scheduler, id=1, ip=127.0.0.1, port=9000, is_recovery=0 } }
Messages like this one show that the scheduler has received ADD_NODE commands from all the cluster nodes (in our case, one worker and one server) and has started notifying every node of its rank and of the details of every other node in the cluster. In other words, the scheduler sends all the information about all the cluster nodes to every cluster node.
In this particular message we can see all the data about the cluster, and the message is being sent to the node with rank 9 (that is, the worker). Information about the cluster is important because it is needed, for example, to know which server to send model updates to.
In the diagram, this step looks like this:
The next line of output:
[00:33:13] src/van.cc:136: ? => 8. Meta: request=0, timestamp=1, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=1.1.1.1, port=37350, is_recovery=0 role=server, id=8, ip=2.2.2.2, port=54160, is_recovery=0 role=scheduler, id=1, ip=127.0.0.1, port=9000, is_recovery=0 } }
The scheduler sends the same confirmation to the node with rank 8 (the server). The diagram now looks like this:
[00:33:13] src/van.cc:251: the scheduler is connected to 1 workers and 1 servers
The scheduler happily announces that it is connected to one worker and one server (that is, to all the cluster nodes).
A reminder: when running on a real cluster all these logs live on different machines, so right now it may seem that there is more information than necessary.
[00:33:13] src/van.cc:161: 1 => 2147483647. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=1.1.1.1, port=37350, is_recovery=0 role=server, id=8, ip=2.2.2.2, port=54160, is_recovery=0 role=scheduler, id=1, ip=127.0.0.1, port=9000, is_recovery=0 } }
[00:33:13] src/van.cc:281: W[9] is connected to others
Here the worker reports that it has received the message from the scheduler and is connected to the cluster. You may ask what "2147483647" is. The answer is: I don't know =) Most likely it is a bug, and I would expect to see "1 => 9" here. The value is 2^31 - 1, the largest signed 32-bit integer, which suggests a receiver id that was never filled in. Since the worker clearly knows its own rank, "W[9]", the bug most likely sits somewhere in the logging code, so you can fix it and become a contributor to the project.
[00:33:13] src/van.cc:161: 1 => 2147483647. Meta: request=0, timestamp=1, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=1.1.1.1, port=37350, is_recovery=0 role=server, id=8, ip=2.2.2.2, port=54160, is_recovery=0 role=scheduler, id=1, ip=127.0.0.1, port=9000, is_recovery=0 } }
[00:33:13] src/van.cc:281: S[8] is connected to others
The same goes for the server: it received the message and happily told the world about it.
[00:33:13] src/van.cc:136: ? => 1. Meta: request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }
[00:33:13] src/van.cc:136: ? => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }
[00:33:13] src/van.cc:136: ? => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
Another important part. So far we have seen only one command, ADD_NODE. Here we see a new one: BARRIER. In short, this is the barrier concept that should be familiar to readers with multithreaded programming experience: each node stops and waits until everyone has reached the barrier. The scheduler is responsible for notifying the nodes once everyone has reached it, at which point they can continue. The first barrier is placed right after the cluster has started but before training begins. All three nodes (including the scheduler itself) have sent a message that basically means "I have reached the barrier, let me know when I can move on".
As you can also see from the messages, there is a notion of a barrier group (barrier_group). A barrier group is the group of nodes that take part in a particular barrier. The groups are:
1 - scheduler
2 - server
4 - worker
As you have probably guessed, these are powers of two, so group 7 is 4 + 2 + 1. In other words, this barrier applies to everyone.
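Because the group values are powers of two, a barrier group is effectively a bitmask, and it can be decoded with a bitwise AND. The sketch below illustrates this; the constant and function names are hypothetical, and only the values 1, 2 and 4 come from the list above.

```python
SCHEDULER_GROUP = 1
SERVER_GROUP = 2
WORKER_GROUP = 4

GROUP_NAMES = {
    SCHEDULER_GROUP: "scheduler",
    SERVER_GROUP: "server",
    WORKER_GROUP: "worker",
}


def roles_in_group(barrier_group):
    """List the roles whose bit is set in the given barrier_group value."""
    return [name for bit, name in GROUP_NAMES.items() if barrier_group & bit]


print(roles_in_group(7))  # ['scheduler', 'server', 'worker']
print(roles_in_group(6))  # ['server', 'worker']
print(SCHEDULER_GROUP | SERVER_GROUP | WORKER_GROUP)  # 7
```

Any combination of roles can therefore be addressed with a single integer, for example 6 for a barrier that involves only servers and workers.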
Naturally, since the log shows three messages being sent, it is reasonable to expect three lines about the scheduler receiving them:
[00:33:13] src/van.cc:161: 1 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }
[00:33:13] src/van.cc:291: Barrier count for 7 : 1
[00:33:13] src/van.cc:161: 9 => 1. Meta: request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }
[00:33:13] src/van.cc:291: Barrier count for 7 : 2
[00:33:13] src/van.cc:161: 8 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
[00:33:13] src/van.cc:291: Barrier count for 7 : 3
In the diagram, what is happening looks like this:
Now let's discuss what the scheduler does when it receives a new message saying that a node has reached the barrier for a particular group:
- it increments the counter of nodes that have sent the BARRIER command for that group;
- when the counter equals the number of nodes in the group, it sends everyone a message confirming that normal operation can continue.
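The two steps above can be sketched as a small counter class. This is a simplified single-process model, not the ps-lite implementation: the class name BarrierTracker is hypothetical, and resetting the counter after release is an assumption made so that the same group can be reused for a later barrier.

```python
class BarrierTracker:
    """Counts BARRIER requests per group and reports when a group is complete."""

    def __init__(self, group_sizes):
        # Maps barrier_group -> number of nodes expected at that barrier.
        self.group_sizes = group_sizes
        self.counts = {}

    def on_barrier(self, group):
        """Register one BARRIER request; return True when everyone has arrived."""
        self.counts[group] = self.counts.get(group, 0) + 1
        if self.counts[group] == self.group_sizes[group]:
            # Everyone is here: reset (assumed) and signal that nodes may go on.
            self.counts[group] = 0
            return True
        return False


# Group 7 (scheduler + server + worker) has three nodes in our small cluster.
tracker = BarrierTracker({7: 3})
released = [tracker.on_barrier(7) for _ in range(3)]
print(released)  # [False, False, True]
```

Only the third request releases the barrier, which corresponds to the moment the scheduler starts sending confirmations to all nodes in the group.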
In the log above you can see how the counter grows with every new message received. As soon as it reaches the expected size (3), the scheduler starts sending out confirmations:
[00:33:13] src/van.cc:136: ? => 9. Meta: request=0, timestamp=3, control={ cmd=BARRIER, barrier_group=0 }
[00:33:13] src/van.cc:136: ? => 8. Meta: request=0, timestamp=4, control={ cmd=BARRIER, barrier_group=0 }
[00:33:13] src/van.cc:136: ? => 1. Meta: request=0, timestamp=5, control={ cmd=BARRIER, barrier_group=0 }
In the diagram, it looks like this:
As you can see, the scheduler even sends a confirmation to itself. And of course, since the scheduler has sent messages (three of them), we should see log lines showing these messages being received:
[00:33:13] src/van.cc:161: 1 => 9. Meta: request=0, timestamp=3, control={ cmd=BARRIER, barrier_group=0 }
[00:33:13] src/van.cc:161: 1 => 8. Meta: request=0, timestamp=4, control={ cmd=BARRIER, barrier_group=0 }
[00:33:13] src/van.cc:161: 1 => 1. Meta: request=0, timestamp=5, control={ cmd=BARRIER, barrier_group=0 }
And now the finishing touch. There is a second barrier that all nodes reach after training has finished, and since the scheduler does not take part in training, it reaches that barrier right away. So it sends a message saying that it has reached the barrier for barrier_group=7, immediately receives its own message, and sets the barrier counter for group 7 to 1:
[00:33:13] src/van.cc:136: ? => 1. Meta: request=1, timestamp=6, control={ cmd=BARRIER, barrier_group=7 }
[00:33:13] src/van.cc:161: 1 => 1. Meta: request=1, timestamp=6, control={ cmd=BARRIER, barrier_group=7 }
[00:33:13] src/van.cc:291: Barrier count for 7 : 1
At this stage the cluster initialization is complete and training can begin...
Training
After running all the code above we have an initialized KVStore. Now what? Let's use it directly for training. To do that, we will take a very simple example of a linear regressor. Before continuing, please read through that example and make sure you understand what is going on. To make the training in that example distributed, only one line of code needs to change. Instead of:
model.fit(train_iter, eval_iter,
          optimizer_params={'learning_rate': 0.005, 'momentum': 0.9},
          num_epoch=50,
          eval_metric='mse',
          batch_end_callback=mx.callback.Speedometer(batch_size, 2))
you need to write:
model.fit(train_iter, eval_iter,
          optimizer_params={'learning_rate': 0.005, 'momentum': 0.9},
          num_epoch=50,
          eval_metric='mse',
          batch_end_callback=mx.callback.Speedometer(batch_size, 2),
          kvstore=kv_store)
That simple? In short, yes.
A short conclusion
I hope readers now have a more detailed understanding of what happens when an MXNet cluster starts up. I also hope this article will help with debugging a cluster when something goes wrong. In addition, this knowledge lets us draw some conclusions about the network requirements of a cluster:
- it is not important for the scheduler to have a fast connection to the other nodes;
- it is not important for the servers to have fast connections to each other;
- every worker needs a fast connection to every server;
- it is not important for the workers to have fast connections to each other.
I would be very grateful for recommendations of the original article on Medium. Also, if you happen to be building a distributed, AWS-based machine learning system with MXNet and have questions, feel free to get in touch (viacheslav@kovalevskyi.com).