This post is based on a talk given by Alex Serbul at the Autumn BigData Conference. Big data is a trendy and relevant topic, but many people are still put off by the surplus of theoretical discussion and the lack of concrete practical recommendations. In this post I want to partly fill that gap and describe how parallel algorithms can be used to process big data, using the example of clustering a product catalog of 10 million items.
Unfortunately, when you have a lot of data, you have to "reinvent" the classical algorithms so that they run in parallel on MapReduce. And that is a big problem.

What is the way out? To save time and money, you can of course try to find a library that already implements parallel clustering algorithms. On the Java platform there are the dying Apache Mahout and the growing Apache Spark MLlib. Unfortunately, Mahout has very few algorithms that support MapReduce; most of them are sequential.
The promising Spark MLlib is not rich in clustering algorithms either. And at our data volumes things are even worse. Here is what it offers:
- K-means
- Gaussian mixture
- Power iteration clustering (PIC)
- Latent Dirichlet allocation (LDA)
- Streaming k-means
When you have 10 to 20 million entities to cluster, the solutions listed above no longer help; you need something hardcore. But first things first.
So, we need to cluster a catalog of 10 million items. Why? Our users can enable a recommendation system in their online stores, and that system works by analyzing the aggregated product catalog of all the sites running on our platform. Suppose that in one store a buyer was recommended an axe (yes, this is a corporate blog, but you cannot talk about Big Data and mathematics without a joke, or everyone falls asleep). The system learns about this, analyzes it, and in another store recommends the same buyer a bayan accordion to play and enjoy. In other words, the first task that clustering solves is the transfer of interest.
The second task is to build correct logical links between products in order to increase related sales. For example, if a user buys a camera, the system recommends a battery and a memory card. That means we need to gather similar products into clusters and then work at the cluster level in other stores.
As mentioned above, our product database consists of the catalogs of tens of thousands of online stores running on the 1C-Bitrix platform. Users enter text descriptions of varying length there, and some add the product model and the manufacturer's brand. Building a consistent, accurate, unified system for collecting and classifying all products from tens of thousands of stores would be expensive, slow and difficult. We were looking for a way to quickly group similar products together and significantly improve the quality of collaborative filtering. After all, if the system knows the buyer well, it does not matter which specific products and brands he is interested in: the system will always find something to offer him.
Choosing the tools
First of all, we had to choose a storage system suitable for working with "big" data. I will repeat myself a little here, but it helps to consolidate the material. For ourselves, we identified four "camps" of databases:
- SQL on MapReduce: Hive, Pig, Spark SQL. Load a billion rows into an RDBMS, run an aggregation over them, and it becomes clear that this class of storage systems is "not really" suited to the task. So, intuitively, you can run SQL on top of MapReduce (as Facebook once did). The obvious drawback is the execution speed of short queries.
- SQL on MPP (massively parallel processing): Impala, Presto, Amazon RedShift, Vertica. Representatives of this camp argue that SQL over MapReduce is a dead end, and that what you really need is a driver running on every node where the data is stored. However, many of these systems suffer from instability.
- NoSQL: Cassandra, HBase, Amazon DynamoDB. For the third camp, BigData means NoSQL. But we understand that NoSQL is, in effect, a set of "memcached servers" joined into a ring, which can quickly execute queries such as "get the value by key and return it". They can also return a data set sorted by key in RAM. That is almost all; forget about JOINs.
- Classic: MySQL, MS SQL, Oracle, and so on. The monsters do not give up. They claim that all of the above is nonsense born of overthinking, and that relational databases work well with BigData. Some offer fractal trees ( https://en.wikipedia.org/wiki/TokuDB ), others offer clusters. Monsters want to live too.
Once we looked closely at the clustering problem, we found that there are not that many ready-made offerings on the market:
- Spark MLlib (Scala / Java / Python / R): when there is a lot of data.
- scikit-learn.org (Python): when there is little data.
- R: it evokes mixed feelings because of how crooked the implementation is (this is what happens when programming is entrusted to mathematicians and statisticians), but there are many ready-made solutions, and the recent integration with Spark is very welcome.

In the end we settled on Spark MLlib because, as it seemed to us, it had reliable parallel clustering algorithms.
Searching for a clustering algorithm
First we had to figure out how to group the text descriptions of products (names and short descriptions) into clusters.
Natural language processing is a separate, huge area of computer science that includes machine learning, information retrieval, linguistics and even (ta-dam!) Deep Learning.
When you have tens, hundreds or thousands of data points to cluster, almost any classical algorithm will do, even hierarchical clustering. The problem is that the algorithmic complexity of hierarchical clustering is roughly $O(N^3)$. That is, we would have to wait "billions of years" for it to work through our data volume, even on a huge cluster. Restricting ourselves to processing only a sample was not an option. So head-on hierarchical clustering did not suit us.
Then we turned to the "good old" K-means algorithm.

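It is a very simple, well-studied and widely used algorithm. However, it is very slow on big data: its complexity is roughly $O(nkdi)$. With $n = 10{,}000{,}000$ (number of products), $k = 1{,}000{,}000$ (expected number of clusters), $d < 1{,}000{,}000$ (word types, i.e. the vector dimension) and $i = 100$ (approximate number of iterations), that gives on the order of $10^{21}$ operations. For comparison, the age of the Earth is about $1.4 \times 10^{17}$ seconds.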
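To make these orders of magnitude concrete, here is a small back-of-the-envelope sketch in Python using the figures quoted above. The throughput of 10^12 operations per second is purely a hypothetical assumption for illustration, not a measurement from our cluster.

```python
# Rough operation counts, using the figures from the text.
n = 10_000_000   # number of products
k = 1_000_000    # expected number of clusters
d = 1_000_000    # upper bound on the vector dimension (word types)
i = 100          # approximate number of k-means iterations

kmeans_ops = n * k * d * i      # O(n * k * d * i)
hierarchical_ops = n ** 3       # O(N^3) for head-on hierarchical clustering

earth_age_seconds = 1.4e17

# Hypothetical aggregate throughput, assumed only for illustration.
ops_per_second = 1e12
kmeans_seconds = kmeans_ops / ops_per_second

print(f"k-means operations:      {kmeans_ops:.0e}")
print(f"hierarchical operations: {hierarchical_ops:.0e}")
print(f"k-means seconds at the assumed throughput: {kmeans_seconds:.0e}")
print(f"operations per second of Earth's age: {kmeans_ops / earth_age_seconds:.0f}")
```

Even under such a generous assumption, a single run would take on the order of decades, which is why neither approach was practical for us.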
The C-means algorithm does allow fuzzy clustering, but it is also slow at our volumes, as is spectral factorization. For the same reason, DBSCAN and probabilistic models did not suit us.
To perform the clustering, we decided that the first stage would be to convert the texts into vectors. A vector is a point in a multidimensional space, and the clumps of such points are the clusters we are looking for.
We needed to cluster product descriptions of 2 to 10 words. The simplest, classical head-on solution is a bag of words. Since we have a catalog, we can define a dictionary. The result was a corpus of about one million words. After stemming, about 500 thousand remained. High-frequency and low-frequency words were discarded. We could, of course, have used tf/idf, but decided not to complicate things.
What are the drawbacks of this approach? The resulting huge vectors are expensive to compare with one another for similarity. After all, what is clustering? It is the process of finding similar vectors. When each vector has 500 thousand dimensions, the search takes a very long time, so we need to learn how to compress them. One option is the kernel hack (hashing trick), which hashes words into, say, 100 thousand attributes instead of 500 thousand. It is a good, working tool, but collisions can occur. We did not use it.
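For readers who have not met the hashing trick, here is a minimal Python sketch of the idea. It is not code we ran in production: the function names are made up for this illustration, and `zlib.crc32` is simply an assumed stand-in for whatever hash function one would really choose.

```python
import zlib
from collections import Counter


def bag_of_words(text):
    """Plain bag of words: word -> number of occurrences."""
    return Counter(text.lower().split())


def hashed_vector(text, n_buckets=100_000):
    """Sparse hashed bag of words: bucket index -> count.

    Every word is mapped to one of n_buckets positions, so the vector
    dimension is fixed no matter how large the dictionary grows.
    Different words may land in the same bucket (a collision).
    """
    vec = Counter()
    for word, count in bag_of_words(text).items():
        bucket = zlib.crc32(word.encode("utf-8")) % n_buckets
        vec[bucket] += count
    return vec


v = hashed_vector("red striped pants red")
print(v)

# With a single bucket every word collides, which shows the downside.
print(hashed_vector("red pants", n_buckets=1))
```

The fewer buckets you allow, the smaller the vectors and the more often unrelated words share a position.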
Finally, let me mention one more technology that we discarded at the time but are now seriously considering. It is Word2Vec, a technique developed at Google for statistical text processing that compresses the dimension of text vectors using a two-layer neural network. In essence, it is a development of the good old statistical N-gram text models, except that the skip-gram variation is used.
The first task that Word2Vec solves beautifully is dimensionality reduction by "matrix factorization" (deliberately in quotes, because there are no matrices involved, but the effect is very similar). That is, instead of 500 thousand attributes you get, for example, only 100. When words occur in similar contexts, the system treats them as "synonyms" (with caveats, of course: coffee and tea may end up merged). The points for such similar words in the multidimensional space start to coincide, meaning that words with similar meaning cluster into a common cloud. For example, "coffee" and "tea" will be close in meaning because they often appear together in context. With the Word2Vec library you can reduce the size of the vectors, and the vectors themselves become more meaningful.
This topic is many years old: latent semantic indexing and its variations through PCA / SVD are well studied, and the head-on solution of clustering the columns or rows of the term2document matrix actually gives a similar result. It just takes a very long time.
It is very likely that we will start using Word2Vec. By the way, it also makes it possible to find typos and to play with the vector algebra of sentences and words.
Reinventing the wheel!..
As a result, after a long search through scientific publications, we wrote our own version of k-Means for Spark: clustering by bootstrap averaging.
In essence, this is hierarchical k-Means with preliminary layer-by-layer sampling of the data. It needed a large number of servers, but it processed 10 million products in a reasonable time, a matter of hours. The result, however, was unsatisfactory, because part of the text data could not be clustered properly: socks were glued to airplanes. The method worked, but very crudely and inaccurately.
Our hope rested on an old, now half-forgotten family of probabilistic methods for finding duplicates or "near duplicates": locality-sensitive hashing.
The variant of the method described here required vectors of equal size, derived from the text, so that they could be "scattered" by hash functions. So we took MinHash.
MinHash is a technique for compressing high-dimensional vectors into small ones while preserving their mutual Jaccard similarity. How does it work? We have a certain number of vectors or sets, and we define a set of hash functions through which we run each vector or set.

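For example, define 50 hash functions. Each hash function is then run over the elements of the vector or set, the minimum value it produces is taken, and that number is written into the N-th position of the new compressed vector. We do this 50 times.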

$$\Pr[h_{\min}(A) = h_{\min}(B)] = J(A, B)$$
This solved two problems at once: reducing the dimensionality and bringing all vectors to a single size.
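The procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our Spark implementation: the hash family `(a * x + b) % PRIME`, the use of `zlib.crc32` to turn strings into integers, and all function names are choices made for this example.

```python
import random
import zlib

PRIME = 2_147_483_647  # 2**31 - 1, modulus for the assumed hash family


def make_hash_params(num_hashes=50, seed=42):
    """Random (a, b) pairs, one per hash function h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, hash_params):
    """Compress a non-empty set of strings into len(hash_params) numbers.

    Position N of the signature is the minimum of the N-th hash function
    over all elements of the set.
    """
    base = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in base) for a, b in hash_params]


def estimate_jaccard(sig_a, sig_b):
    """Share of positions where two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


params = make_hash_params(num_hashes=50)
set_a = {"red", "striped", "pants"}
set_b = {"red", "striped", "trousers"}
sig_a = minhash_signature(set_a, params)
sig_b = minhash_signature(set_b, params)

# An estimate of the true Jaccard similarity 2/4 = 0.5; the exact value
# printed depends on the randomly chosen hash functions.
print(estimate_jaccard(sig_a, sig_b))
```

Whatever the size of the input set, the signature always has as many positions as there are hash functions, which is exactly the "single size" property we needed.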
Text shingling
We abandoned head-on text vectorization completely, because product names and short descriptions produced extremely sparse vectors that suffered from the "curse of dimensionality".
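Product names usually looked something like this:
"Striped red wide pants"
"Red striped pants". These two phrases differ in the set of words, their number and their position. On top of that, people make typos when entering them. So the words cannot simply be compared, even after stemming: all the texts would be mathematically dissimilar even though they are close in meaning.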
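In situations like this, the shingles algorithm is often used (shingles as in scales or tiles). We represent the text as a set of shingles, that is, overlapping pieces:
{"shta", "htan", "tany", "any ", "ny k", "kras", ...}. When we compare sets of such pieces, it turns out that two texts with different wording can suddenly become similar to each other. We experimented with the text for a long time, and in our experience this is the only way to compare the short text descriptions in our product catalog. The same method is used to identify similar articles and scientific papers in order to detect plagiarism.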
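A minimal Python sketch of character shingling follows. The shingle length of 4 and the helper names are assumptions made for this illustration; the right length depends on your data.

```python
def shingles(text, k=4):
    """Set of overlapping character k-grams of a normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


a = shingles("Red striped pants")
b = shingles("Striped red pants")

# As whole strings the two names differ, but they share many
# shingles, so their Jaccard similarity is well above zero.
print(sorted(a & b))
print(jaccard(a, b))
```

Because the pieces overlap, a reordered word or a single typo only changes a few shingles instead of making the whole text look different.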
To repeat: we abandoned the very sparse text vectors, replaced each text with a set of shingles, and then brought those sets to a single size using MinHash.
Vectorization
As a result, we solved the problem of vectorizing the catalog as follows. Using MinHash signatures, we obtained small vectors of 100 to 500 elements (the size is chosen to be the same for all vectors). Now each has to be compared with every other one to form clusters. Done head-on, as we already know, this takes a very long time. Thanks to the LSH (Locality-Sensitive Hashing) algorithm, the problem was solved in a single pass.
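The idea is that similar objects, texts or vectors collide into the same bucket under a given set of hash functions. After that, all that remains is to walk through the buckets and collect the similar elements. After clustering we end up with about a million buckets, each of which is a cluster.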
Clustering
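Traditionally, several bands (sets of hash functions) are used. We simplified the task even further and kept just one band. Say we take the first 40 elements of the vector and put them into a hash table as the key. Elements whose signatures begin with the same piece then end up in the same place. That is all! For a start, this works great. If you need more accuracy, you can work with a group of bands, but then the final part of the algorithm takes longer, because you have to collect mutually similar objects from all of them.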
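The single-band variant is short enough to sketch in plain Python. This is an illustration of the idea rather than our Spark job: the function name, the toy signatures and the tiny band size of 3 (instead of 40) are assumptions chosen to keep the example readable.

```python
from collections import defaultdict


def lsh_single_band(signatures, band_size=40):
    """Group items whose signatures share the same first band_size values.

    signatures: dict mapping item id -> MinHash signature (list of ints).
    Returns a dict mapping band (as a tuple) -> list of item ids.
    """
    buckets = defaultdict(list)
    for item_id, signature in signatures.items():
        key = tuple(signature[:band_size])
        buckets[key].append(item_id)
    return buckets


# Toy signatures, made up for the example.
signatures = {
    "p1": [1, 2, 3, 9],
    "p2": [1, 2, 3, 7],
    "p3": [4, 5, 6, 9],
}

buckets = lsh_single_band(signatures, band_size=3)
clusters = sorted(sorted(ids) for ids in buckets.values())
print(clusters)
```

One pass over the data and one dictionary lookup per item is all it takes, which is why this step scales so well. A longer band makes buckets stricter (fewer false merges, more micro-clusters); a shorter one does the opposite.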
After the first iteration we got good results: almost all duplicates and almost all similar products were stuck together. We evaluated this visually. To further reduce the number of micro-clusters, we removed very frequent and very rare words beforehand.
Now, in just two to three hours on 8 spot servers, 10 million products are grouped into roughly one million clusters. In effect this happens in a single pass, because there is only one band. After experimenting with the settings, we got quite adequate clusters: yachts, cars, sausages and so on, without nonsense like "axe + airplane". This compressed cluster model is now used to improve the accuracy of the personal recommendation system.
Summary
In our collaborative algorithms we started working with clusters rather than with specific products. When a new product appears, we find its cluster and put it there. The reverse process works the same way: we recommend a cluster, then pick the most popular product from it and return that to the user. Using the clustered catalog improved the accuracy of recommendations several times over (we compare the recall of the current model with that of a month earlier). And this comes purely from compressing the data (products) and grouping them by meaning. So my advice to you is this: look for simple solutions to your big data problems. Do not try to complicate things. I believe you can always find a simple, effective solution that solves 90% of the task with 10% of the effort! Good luck and success in working with big data!
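The "reverse process" described above can be pictured with a toy Python sketch. All names and data here are hypothetical and exist only for the illustration; the real system works on top of the collaborative filtering model, which is not shown.

```python
def most_popular_in_cluster(cluster_id, product_cluster, purchase_counts, top_n=1):
    """Return the top_n most purchased products of the given cluster.

    product_cluster: dict mapping product id -> cluster id.
    purchase_counts: dict mapping product id -> number of purchases.
    Ties are broken by product id to keep the result deterministic.
    """
    members = [p for p, c in product_cluster.items() if c == cluster_id]
    members.sort(key=lambda p: (-purchase_counts.get(p, 0), p))
    return members[:top_n]


# Hypothetical catalog fragment and purchase statistics.
product_cluster = {
    "camera-a": "cameras",
    "camera-b": "cameras",
    "card-32gb": "memory cards",
    "card-64gb": "memory cards",
}
purchase_counts = {"camera-a": 10, "camera-b": 25, "card-32gb": 40, "card-64gb": 15}

# Suppose the model recommended the cluster "memory cards".
print(most_popular_in_cluster("memory cards", product_cluster, purchase_counts))
```

The recommendation model only has to reason about roughly a million clusters instead of ten million products, and the concrete product is chosen at the very last step.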