
Hello to everyone taking the open machine learning course on Habr!

In the first two parts (1, 2) we practiced primary visual analysis of data and building plots that let us draw conclusions from the data. Today we finally move on to machine learning. We will talk about the problems machine learning solves and look at two simple approaches: decision trees and the nearest neighbors method. We will also discuss how to choose a model for your particular data using cross-validation.

UPD: the course is now available in English under the brand mlcourse.ai, with articles on Medium and materials on Kaggle (Dataset) and GitHub.

As part of the second run of the open course (September-November 2017), a video lecture based on this article is also available.
List of the articles in the series. Outline of this article:

- Introduction
- Decision tree
- Nearest neighbors method
- Choosing model parameters and cross-validation
- Application examples
- Pros and cons of decision trees and the nearest neighbors method
- Homework #3
- Useful resources
Introduction

I suppose you are eager to dive straight into battle, but first let's talk about what kind of problems we are going to solve and where they fit within machine learning. The classic, general (and painfully imprecise) definition of machine learning sounds like this (T. Mitchell, "Machine Learning", 1997):

a computer program is said to learn to solve a problem of class T if its performance, measured by metric P, improves as it accumulates experience E.

In different settings, T, P, and E can mean entirely different things. Among the most popular tasks T in machine learning are:

- classification: assigning an object to one of the categories based on its features
- regression: predicting a numerical attribute of an object based on its other features
- clustering: splitting a set of objects into groups based on their features, so that objects within a group are similar to each other while objects in different groups are not
- anomaly detection: finding objects that are "very different" from all others in the sample or from some group of objects
- and many more specific ones. A good overview is given in the "Machine Learning basics" chapter of the book "Deep Learning" (Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016)

Experience E means data (we get nowhere without it), and depending on the data, machine learning algorithms can be divided into supervised and unsupervised. In unsupervised learning problems one has a sample consisting of objects described by a set of features. In supervised learning problems, in addition, the value of a target feature is known for every object of a certain sample, called the training sample, and the goal is to predict it for other objects that are not in the training sample.
Example

Classification and regression are supervised learning tasks. As an example, consider credit scoring: based on the data a credit organization has accumulated about its clients, we want to predict loan default. Here the experience E for the algorithm is the available training sample: a set of objects (people), each characterized by a set of features (age, salary, loan type, past defaults, etc.) and by a target feature. If this target feature is simply the fact of loan default (1 or 0, i.e. the bank knows which of its clients repaid their loans and which did not), then it is a (binary) classification task. If instead we know by how long a client delayed repayment and want to predict the same for new clients, it is a regression task.

Finally, the third abstraction in the definition of machine learning is the metric P for evaluating the algorithm's performance. Such metrics differ across tasks and algorithms, and we will discuss them as we study the algorithms. For now, let's just say that the simplest quality metric of a classification algorithm is the share of correct answers, accuracy (do not call it precision: that name is reserved for a different metric), i.e. simply the proportion of correct predictions of the algorithm on a test sample.

Next we will talk about two supervised learning problems: classification and regression.
Decision tree

We begin our review of classification and regression methods with one of the most popular ones: the decision tree. Decision trees are used in everyday life in the most diverse areas of human activity, sometimes very far from machine learning. A decision tree can be called a visual instruction for what to do in which situation. Let's take an example from the field of scientific advising at a research institute. The Higher School of Economics creates information schemes that make the life of its employees easier. Here is a fragment of an instruction for publishing a scientific paper on the institute's portal.

In machine learning terms, this can be seen as an elementary classifier that determines the publication format on the portal (book, article, book chapter, preprint, publication in HSE and the media) based on several criteria: the type of publication (monograph, brochure, article, etc.), the type of outlet in which it appeared (scientific journal, collection of works, etc.), and others.

A decision tree often serves as a generalization of experts' experience, a means of transferring knowledge to future employees, or a model of a company's business process. For example, before scalable machine learning algorithms were introduced in banking, the credit scoring task was solved by experts: the decision to grant a loan was made on the basis of several intuitively (or empirically) derived rules, which can be represented as a decision tree.

In that case we can say that a binary classification task is being solved (the target class has two values: "grant a loan" and "refuse") using the features "age", "owns a house", "income", and "education".

A decision tree as a machine learning algorithm is essentially the same thing: a combination of logical rules of the form "value of feature a is less than x AND value of feature b is less than y ... => class 1" in a tree-like data structure. The huge advantage of decision trees is that they are easily interpretable and understandable by humans. For example, using the scheme in the figure above one can explain to a client why they were refused a loan. By contrast, many other, more accurate models that we will discuss later can be perceived as "black boxes" into which data is fed and out of which an answer comes. Thanks to this "understandability" of decision trees and their similarity to a human decision-making model (you can easily explain your model to your boss), decision trees have gained immense popularity; one representative of this group of classification methods, C4.5, is even listed first among the 10 best data mining algorithms ("Top 10 algorithms in data mining", Knowledge and Information Systems, 2008. PDF).
How a decision tree is built

In the credit scoring example we saw that the decision to grant a loan was made based on age, availability of real estate, income, and so on. But which feature should be chosen first? To answer this, consider a simpler example where all features are binary.

Here we can recall the game of "20 questions", often mentioned in introductions to decision trees. Surely everyone has played it: one player thinks of a celebrity, and the other tries to guess who it is by asking only questions that can be answered "yes" or "no" (we'll omit the options "I don't know" and "I can't say"). Which question will the guesser ask first? Of course, the one that narrows down the remaining options the most. For instance, the question "Is it Angelina Jolie?", in case of a negative answer, leaves more than 7 billion options for further search (of course not every person is a celebrity, but still a lot), whereas the question "Is it a woman?" cuts off roughly half of the celebrities. That is, the feature "gender" partitions the sample of people far better than features like "is Angelina Jolie", "of Spanish nationality", or "loves football". This corresponds intuitively to the concept of information gain, which is based on entropy.
Entropy

Shannon's entropy is defined for a system with N possible states as

$$\Large S = -\sum_{i=1}^{N} p_i \log_2 p_i,$$

where p_i is the probability of finding the system in the i-th state. This is a very important concept used in physics, information theory, and other areas. Leaving aside the prerequisites for introducing this concept (combinatorial and information-theoretic), note that intuitively entropy corresponds to the degree of chaos in a system: the higher the entropy, the less ordered the system, and vice versa. This will help us formalize the "effective splitting of a sample" that we talked about in the context of the "20 questions" game.
Example

To illustrate how entropy helps identify good features for building a tree, let's look at the same toy example with colored balls as in the post on entropy and decision trees. We will predict the color of a ball from its coordinate. Of course this has nothing to do with real life, but it lets us show how entropy is used to build a decision tree.

There are 9 blue balls and 11 yellow ones. If we draw a ball at random, then with probability p1 = 9/20 it is blue and with probability p2 = 11/20 it is yellow. Hence the entropy of the initial state is

$$\Large S_0 = -\frac{9}{20}\log_2\frac{9}{20} - \frac{11}{20}\log_2\frac{11}{20} \approx 1.$$

This value by itself does not tell us much yet. Now let's see how the entropy changes if we split the balls into two groups: those with coordinate less than or equal to 12, and the rest.

The left group contains 13 balls, of which 8 are blue and 5 are yellow. Its entropy is

$$\Large S_1 = -\frac{5}{13}\log_2\frac{5}{13} - \frac{8}{13}\log_2\frac{8}{13} \approx 0.96.$$

The right group contains 7 balls, of which 1 is blue and 6 are yellow. Its entropy is

$$\Large S_2 = -\frac{1}{7}\log_2\frac{1}{7} - \frac{6}{7}\log_2\frac{6}{7} \approx 0.6.$$

As you can see, the entropy decreased in both groups compared to the initial state, although less so in the left one. Since entropy is essentially the degree of chaos (or uncertainty) in a system, the decrease in entropy is called information gain. Formally, the information gain (IG) from splitting the sample by a feature Q (in this example the feature is "x ≤ 12") is defined as

$$\Large IG(Q) = S_0 - \sum_{i=1}^{q}\frac{N_i}{N} S_i,$$

where q is the number of groups after the split and N_i is the number of sample elements for which feature Q takes its i-th value. In our case the split produced two groups (q = 2), one with 13 elements (N1 = 13) and one with 7 (N2 = 7). The information gain came out to

$$\Large IG(x \leq 12) = S_0 - \frac{13}{20}S_1 - \frac{7}{20}S_2 \approx 0.16.$$
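These numbers can be verified with a few lines of code (a minimal sketch; `entropy` and `information_gain` are our own helper names):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, parts):
    # IG = S0 - sum(N_i / N * S_i) over the groups produced by the split
    n = len(parent)
    return entropy(parent) - sum(len(p) / n * entropy(p) for p in parts)

# 9 blue ('b') and 11 yellow ('y') balls; the split "x <= 12" gives
# a left group of 8 blue + 5 yellow and a right group of 1 blue + 6 yellow
parent = ['b'] * 9 + ['y'] * 11
left, right = ['b'] * 8 + ['y'] * 5, ['b'] * 1 + ['y'] * 6
print(information_gain(parent, [left, right]))  # about 0.16
```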
It turns out that by splitting the balls into two groups by "coordinate ≤ 12" we already obtained a more ordered system. Let's keep dividing the balls into groups until the balls in each group are all of the same color.

For the right group, one additional partition, "coordinate ≤ 18", was enough; the left group needed three more. Obviously, the entropy of a group in which all balls have the same color is 0 (log2 1 = 0), which matches the idea that such a group is perfectly ordered.

As a result we built a decision tree that predicts the color of a ball from its coordinate. Note that such a tree fits the training set (the initial 20 balls) perfectly, but it may work poorly on new objects (when determining the color of new balls). A tree with fewer "questions", or partitions, would classify new balls better, even if it does not color the training set ideally. We will return to this problem, overfitting, later.
Tree-building algorithm

One can verify that the tree built in the previous example is in a certain sense optimal: it took only 5 "questions" (conditions on the feature x) to fit the decision tree to the training set, that is, to make the tree classify every training object correctly. Under other splitting conditions the tree would come out deeper.

At the heart of popular decision-tree-building algorithms such as ID3 and C4.5 lies the principle of greedy maximization of information gain: at each step the algorithm chooses the feature whose split yields the largest information gain. The procedure is then repeated recursively until the entropy reaches zero or some small value (or until the tree no longer fits the training sample perfectly, in order to avoid overfitting). Different algorithms use different heuristics for "early stopping" or "pruning" to avoid building an overfitted tree.
def build(L):
    create node t
    if the stopping criterion is True:
        assign a predictive model to t
    else:
        Find the best binary split L = L_left + L_right
        t.left = build(L_left)
        t.right = build(L_right)
    return t
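The greedy "find the best binary split" step can be sketched for a single numerical feature like this (our own illustrative helper, using information gain as the criterion):

```python
import math

def entropy(ys):
    n = len(ys)
    return -sum((ys.count(c) / n) * math.log2(ys.count(c) / n) for c in set(ys))

def best_threshold(xs, ys):
    # try every candidate split "x <= t" and keep the one with maximal gain
    base = entropy(ys)
    best_t, best_gain = None, -1.0
    for t in sorted(set(xs))[:-1]:  # splitting above the maximum is useless
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# two well-separated clumps: the best split falls between them
print(best_threshold([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # (3, 1.0)
```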
Other split quality criteria for classification

We have seen how the concept of entropy allows us to formalize the quality of a partition in a tree. But this is only a heuristic; there are others:

- Gini impurity: $\Large G = 1 - \sum_k (p_k)^2$. Maximizing this criterion can be interpreted as maximizing the number of pairs of same-class objects ending up in the same subtree. You can read more about it (and much else) in Evgeny Sokolov's repository. Do not confuse it with the Gini index! That confusion is covered in detail in a blog post by Alexander Dyakonov.
- Misclassification error: $\Large E = 1 - \max_k p_k$

In practice the misclassification error is almost never used, while Gini impurity and information gain work nearly identically.
For binary classification, where p+ is the probability of an object carrying the label +, entropy and Gini impurity take the form

$$\Large S = -p_+\log_2 p_+ - p_-\log_2 p_- = -p_+\log_2 p_+ - (1 - p_+)\log_2(1 - p_+);$$

$$\Large G = 1 - p_+^2 - p_-^2 = 1 - p_+^2 - (1 - p_+)^2 = 2 p_+ (1 - p_+).$$

If we plot these two quantities as functions of p+, we will see that the entropy curve is very close to the doubled Gini impurity curve, which is why in practice these two criteria behave almost the same way.
Import the libraries (numpy, pandas, and matplotlib are used throughout the code below):

from __future__ import division, print_function
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

Code for plotting:

plt.rcParams['figure.figsize'] = (6,4)
xx = np.linspace(0, 1, 50)
plt.plot(xx, [2 * x * (1-x) for x in xx], label='gini')
plt.plot(xx, [4 * x * (1-x) for x in xx], label='2*gini')
plt.plot(xx, [-x * np.log2(x) - (1-x) * np.log2(1 - x) for x in xx], label='entropy')
plt.plot(xx, [1 - max(x, 1-x) for x in xx], label='misclass')
plt.plot(xx, [2 - 2 * max(x, 1-x) for x in xx], label='2*misclass')
plt.xlabel('p+')
plt.ylabel('criterion')
plt.title('Quality criteria as functions of p+ (binary classification)')
plt.legend();
Example

Let's look at an example of applying a decision tree from the Scikit-learn library to synthetic data. The two classes are generated from two normal distributions with different means.

Code for data generation

Let's display the data. Informally, the classification task here is to build some "good" boundary separating the two classes (the red points from the yellow ones). Roughly speaking, machine learning in this case boils down to choosing a good separating boundary. A straight line may be too simple a boundary, while some intricate curve snaking around every red point would be too complex and would make many mistakes on new examples drawn from the same distribution as the training sample. Intuition suggests that a smooth boundary, or at least just a straight line (in the n-dimensional case, a hyperplane), would generalize well to new data.
Code for plotting:

plt.rcParams['figure.figsize'] = (10,8)
plt.scatter(train_data[:, 0], train_data[:, 1], c=train_labels, s=100, cmap='autumn', edgecolors='black', linewidth=1.5);
plt.plot(range(-2,5), range(4,-3,-1));
Let's train a decision tree to separate these two classes, using the max_depth parameter to limit the depth of the tree, and visualize the resulting class separation boundary.

Code for training the tree and plotting its separating boundary:

from sklearn.tree import DecisionTreeClassifier

And how does the tree itself look? We see that the tree "cuts" the space into 7 rectangles (the tree has 7 leaves). In each such rectangle the tree's prediction is constant, determined by which class's objects prevail in it.

Code for displaying the tree

How can such a tree be "read"?

In the beginning there were 200 objects, 100 of one class and 100 of the other. The entropy of the initial state was maximal, equal to 1. Then the objects were split into two groups according to a comparison of the feature x1 with the value 0.3631 (find this section of the boundary in the figure above), and the entropy decreased in both the left and the right group of objects. And so on, down to depth 3. In this visualization, the more objects of the first class, the closer the vertex color is to dark orange, and the more objects of the second class, the closer it is to dark blue. At the start the numbers of objects of the two classes are equal, so the root of the tree is white.
How a decision tree works with numerical features

Suppose the sample contains a numerical feature "Age" with many unique values. A decision tree will look for the best (by an information-gain-type criterion) split of the sample by checking binary features such as "Age < 17", "Age < 22.87", and so on. But what if there are too many such age "cuts"? And what if there is also a numerical feature "Salary", which can likewise be "cut" in many ways? At each step of tree construction there would be too many binary features to choose from. To resolve this, heuristics are used to limit the number of thresholds against which a numerical feature is compared.
Let's look at a toy example. Suppose we have the following sample:

Sort it by age in ascending order.

Let's train a decision tree on this data (with no depth limit) and look at it.

Code for training and drawing the tree (the column names below are English stand-ins for the originals, which were lost):

age_tree = DecisionTreeClassifier(random_state=17)
age_tree.fit(data['Age'].values.reshape(-1, 1), data['Default'].values)
export_graphviz(age_tree, feature_names=['Age'], out_file='../../img/age_tree.dot', filled=True)
!dot -Tpng '../../img/age_tree.dot' -o '../../img/age_tree.png'
In the next figure we see that the tree used 5 values to compare the age against: 43.5, 19, 22.5, 30, and 32. If you look closely, these are exactly the midpoints between the ages at which the target class "switches" from 1 to 0 or vice versa. A convoluted sentence, so an example: 43.5 is the average of 38 and 49; the 38-year-old client did not repay the loan, while the 49-year-old did. Similarly, 19 is the average of 18 and 20. That is, as thresholds for "cutting" a numerical feature, the tree probes only those values at which the target class changes its value.

Think about why it makes no sense here to consider a feature like "Age < 17.5".
Let's consider a more complex example by adding a "Salary" feature (in thousands of rubles per month).

If we sort by age, the target class ("loan default") switches (from 1 to 0 or vice versa) 5 times, and if we sort by salary, it switches 7 times. Which thresholds will the tree pick now? Let's see.

Code for training and drawing the tree (column names are again English stand-ins):

age_sal_tree = DecisionTreeClassifier(random_state=17)
age_sal_tree.fit(data2[['Age', 'Salary']].values, data2['Default'].values);
export_graphviz(age_sal_tree, feature_names=['Age', 'Salary'], out_file='../../img/age_sal_tree.dot', filled=True)
!dot -Tpng '../../img/age_sal_tree.dot' -o '../../img/age_sal_tree.png'

We see that the tree partitions both by salary and by age. Moreover, the thresholds for comparing the features are 43.5 and 22.5 years for age, and 95 and 30.5 thousand rubles per month for salary. Again we notice that 95 is the average of 88 and 102, and the person with a salary of 88 turned out "bad" while the one with 102 was "good". That is, only a few of all possible values of salary and age were probed for comparison. And why did exactly these features appear in the tree? Because they gave better splits (according to the Gini impurity criterion).

Conclusion: the simplest heuristic for handling numerical features in a decision tree is this: the numerical feature is sorted in ascending order, and of all its values the tree checks only those thresholds at which the target changes its value. This is not a very rigorous statement, but I hope the toy examples convey the idea.

Additionally, when the data has many numerical features and each has many unique values, it may be that not all of the thresholds described above are selected, but only the top N giving the maximal gain of the chosen criterion. In other words, for each threshold a tree of depth 1 is effectively built, the decrease of entropy (or Gini impurity) is computed, and only the best thresholds are kept as candidates for comparing the numerical feature against.
For illustration: when splitting by "Salary ≤ 34.5", the left subgroup has entropy 0 (all its clients are "bad") and the right one has entropy 0.954 (3 "bad" and 5 "good"; you can check it yourself — fully understanding how the tree is built is part of the homework). The information gain is about 0.3. And when splitting by "Salary ≤ 95", the left subgroup has entropy 0.97 (6 "bad" and 4 "good") and the right one has entropy 0 (it contains a single object). The information gain is about 0.11.

Computing the information gain of each partition this way, we can preselect, before building the big tree, the thresholds against which each numerical feature will be compared (under whatever criterion).

More examples of discretization of numerical features can be found in posts like this one or this one. One of the best-known scientific papers on the subject is "On the handling of continuous-valued attributes in decision tree generation" (U. M. Fayyad, K. B. Irani, "Machine Learning", 1992).
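The heuristic just described — probe only the values at which the target switches — can be sketched in a few lines (our own illustrative helper; the sample below is made up):

```python
def candidate_thresholds(ages, defaults):
    # midpoints between consecutive sorted ages at which the target class changes
    pairs = sorted(zip(ages, defaults))
    return [(a1 + a2) / 2
            for (a1, t1), (a2, t2) in zip(pairs, pairs[1:])
            if t1 != t2]

print(candidate_thresholds([17, 18, 20, 25, 30, 35],
                           [1, 1, 0, 0, 1, 0]))  # [19.0, 27.5, 32.5]
```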
Basic tree parameters

In principle, a decision tree can be built down to a depth at which every leaf contains exactly one object. In practice this is not done (if only a single tree is built), because such a tree would be overfitted: it adjusts too closely to the training set and will predict poorly on new data. Somewhere near the bottom of the tree, at great depth, splits will appear on less important features (e.g. whether the client came from Saratov or from Kostroma). Exaggerating, it may turn out that of the 4 clients who came to the bank for a loan in green trousers, none repaid the loan — but we hardly want our classification model to generate such specific rules.

There are two exceptions, situations in which trees are built to maximal depth:

- Random forest (a composition of many trees) averages the responses of trees that are built to maximal depth (we will discuss later why this should be done)
- Pruning. In this approach the tree is first built to maximal depth, and then, bottom-up, some nodes are removed by comparing the quality of the tree with and without the corresponding partition (the comparison uses cross-validation, discussed below). More details can be found in the materials of Evgeny Sokolov's repository.

The picture below shows an example of a separating boundary built by an overfitted tree.

The main ways to fight overfitting in decision trees are:

- artificially limiting the depth or the minimal number of objects in a leaf: the construction of the tree simply stops at some point
- pruning the tree
The DecisionTreeClassifier class in Scikit-learn

The main parameters of the sklearn.tree.DecisionTreeClassifier class:

- max_depth: the maximal depth of the tree
- max_features: the maximal number of features among which the best split is searched for (with many features, searching over all of them for the best split, by an information-gain-type criterion, is "expensive")
- min_samples_leaf: the minimal number of objects in a leaf. This parameter has a transparent interpretation: for example, if it equals 5, the tree will generate only those classification rules that hold for at least 5 objects

The parameters of a tree need to be tuned to the input data; this is usually done via cross-validation, which is discussed a bit below.
Decision trees in regression problems

When predicting a numerical feature, the idea of tree construction stays the same, but the quality criterion changes to:

- variance around the mean:

$$\Large D = \frac{1}{\ell}\sum_{i=1}^{\ell}\left(y_i - \frac{1}{\ell}\sum_{j=1}^{\ell} y_j\right)^2,$$

where ℓ is the number of objects in the leaf and y_i are the values of the target feature. Simply put, by minimizing the variance around the mean, we look for features that split the sample in such a way that the values of the target feature within each leaf are approximately equal.
Example

Let's generate data distributed around the function f(x) = e^{-x^2} + 1.5 e^{-(x-2)^2} with some noise, train a decision tree on it, and show what predictions the tree makes.

Code:

n_train = 150
n_test = 1000
noise = 0.1

def f(x):
    x = x.ravel()
    return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)

def generate(n_samples, noise):
    X = np.random.rand(n_samples) * 10 - 5
    X = np.sort(X).ravel()
    y = np.exp(-X ** 2) + 1.5 * np.exp(-(X - 2) ** 2) + \
        np.random.normal(0.0, noise, n_samples)
    X = X.reshape((n_samples, 1))
    return X, y

X_train, y_train = generate(n_samples=n_train, noise=noise)
X_test, y_test = generate(n_samples=n_test, noise=noise)

from sklearn.tree import DecisionTreeRegressor

reg_tree = DecisionTreeRegressor(max_depth=5, random_state=17)
reg_tree.fit(X_train, y_train)
reg_tree_pred = reg_tree.predict(X_test)

plt.figure(figsize=(10, 6))
plt.plot(X_test, f(X_test), "b")
plt.scatter(X_train, y_train, c="b", s=20)
plt.plot(X_test, reg_tree_pred, "g", lw=2)
plt.xlim([-5, 5])
plt.title("Decision tree regressor, MSE = %.2f" % np.mean((y_test - reg_tree_pred) ** 2))
plt.show()

We see that the decision tree approximates the dependency in the data with a piecewise constant function.
Nearest neighbors method

The nearest neighbors method (k Nearest Neighbors, or kNN) is another very popular classification method, also sometimes used in regression problems. Along with the decision tree, it is one of the most comprehensible approaches to classification. On the level of intuition, the essence of the method is: look at your neighbors — whichever class prevails among them is what you are. Formally, the method rests on the compactness hypothesis: if the distance metric between examples is introduced well enough, similar examples end up in the same class far more often than in different ones.

According to the nearest neighbors method, the test case (the green ball) would be assigned to the class "blue" rather than "red".

For instance, if you don't know what product type to list a Bluetooth headset under, you can find 5 similar headsets, and if 4 of them are filed under "Accessories" and only 1 under "Electronics", common sense tells you to file yours under "Accessories" too.
To classify each object of the test sample, the following operations are performed in order:

- compute the distance to every object of the training sample
- select the k objects of the training sample with the smallest distances
- the class of the object being classified is the most frequent class among its k nearest neighbors

The method adapts to the regression task quite easily: at step 3 it returns not a class label but the mean (or median) of the neighbors' target values.
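The three steps above fit into a few lines (a minimal sketch with Euclidean distance and unweighted majority voting; all names are ours):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # 1) distance to every training object, 2) take the k nearest,
    # 3) majority vote among their labels
    nearest = sorted(zip(train_X, train_y),
                     key=lambda xy: math.dist(xy[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
y = [0, 0, 0, 1, 1, 1]
print(knn_predict(X, y, (0.5, 0.5)))  # 0: all three nearest neighbors are class 0
print(knn_predict(X, y, (5.5, 5.5)))  # 1
```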
A remarkable property of this approach is its laziness: computation starts only at the moment a test case needs to be classified, and no model is built from the training examples in advance. That is a difference from, say, the decision tree considered earlier, where a tree is first built from the training sample, and then classification of test cases happens relatively quickly.

It is worth noting that the nearest neighbors method is a well-studied approach (in machine learning, econometrics, and statistics, probably only linear regression is better known). For the nearest neighbors method there are quite a few important theorems stating that on "infinite" samples it is the optimal classification method. The authors of the classic book "The Elements of Statistical Learning" consider kNN a theoretically ideal algorithm whose applicability is limited only by computational resources and the curse of dimensionality.
The nearest neighbors method in real problems

- kNN in its pure form can serve as a good starting point (baseline) for some problems;
- in Kaggle competitions kNN is often used to build meta-features (kNN predictions are fed as input to other models) or for stacking/blending;
- the nearest neighbors idea extends to other tasks as well; in recommender systems, for instance, a simple initial solution is to recommend a product (or service) that is popular among the nearest neighbors of the person we are recommending to;
- in practice, on large samples, approximate nearest neighbor search methods are often used. Here is a lecture by Artyom Babenko on efficient algorithms for nearest neighbor search in high-dimensional spaces (image search). There are also open libraries implementing such algorithms; one well-known example, thanks to Spotify, is Annoy.
The quality of classification/regression with the nearest neighbors method depends on several parameters:

- the number of neighbors;
- the distance metric between objects (Hamming, Euclidean, cosine, and Minkowski distances are common). Note that most metrics require the feature values to be scaled: roughly speaking, we don't want the "Salary" feature, ranging up to 100 thousand, to contribute more to the distance than "Age", which ranges up to 100;
- the weights of the neighbors (a test case's neighbors can enter with different weights; for example, the farther the example, the smaller the weight of its vote).
The KNeighborsClassifier class in Scikit-learn

The main parameters of the sklearn.neighbors.KNeighborsClassifier class:

- weights: "uniform" (all weights equal), "distance" (the weight is inversely proportional to the distance to the test case), or any other user-defined function;
- algorithm (optional): "brute", "ball_tree", "KD_tree", or "auto". In the first case, the nearest neighbors of each test case are found by an exhaustive scan of the training sample. In the second and third cases, the distances between examples are stored in a tree, which speeds up finding the nearest neighbors. If "auto" is set, an appropriate way of finding the neighbors is chosen automatically based on the training sample;
- leaf_size (optional): the threshold for switching to exhaustive search when BallTree or KDTree is used for finding neighbors;
- metric: "minkowski", "manhattan", "euclidean", "chebyshev", and others.
Choosing model parameters and cross-validation

The main task of a learning algorithm is its ability to generalize, that is, to work well on new data. Since we cannot immediately check a model on new data (we still have to build predictions for them, i.e. we do not know the true values of their target feature), we have to sacrifice a small portion of the data to check the model's quality on it.

This is most often done in one of two ways:

- setting aside a part of the sample (the held-out/hold-out set). A portion of the sample (typically from 20% to 40%) is set aside, the model is trained on the remaining data (60-80% of the initial sample), and the model's quality is measured on the hold-out set;
- cross-validation. The most frequent case here is K-fold cross-validation. The model is trained K times on different (K-1)-subset combinations of the initial sample, and its quality is checked on the remaining subset (a different one each time). This yields K quality estimates, which are usually averaged to give an average cross-validation score of classification/regression quality.

Compared to the hold-out set, cross-validation gives a better estimate of model quality on new data. But cross-validation is computationally expensive when there is a lot of data.

Cross-validation is a very important technique in machine learning (it is also applied in statistics and econometrics): with its help, hyperparameters of models are selected, models are compared with each other, and the usefulness of new features in a task is assessed, among other things. More details can be found, for example, in this post by Sebastian Raschka, or in any classical book on machine (statistical) learning.
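The index bookkeeping of K-fold cross-validation can be sketched without any library (our own minimal helper; in practice sklearn's KFold does the same job):

```python
def kfold_indices(n, k):
    # split indices 0..n-1 into k disjoint validation folds;
    # each fold is used for validation once, the rest for training
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, val))
        start += size
    return folds

for train_idx, val_idx in kfold_indices(10, 5):
    print(train_idx, val_idx)  # 5 splits, 8 train / 2 validation indices each
```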
Application examples

Decision trees and nearest neighbors in a telecom customer churn prediction task

Let's read the data into a DataFrame and do some preprocessing. We'll store the states in a separate Series object for now and remove them from the dataframe; we will train the first model without the "State" feature and then see whether it helps.

df = pd.read_csv('../../data/telecom_churn.csv')
df['International plan'] = pd.factorize(df['International plan'])[0]
df['Voice mail plan'] = pd.factorize(df['Voice mail plan'])[0]
df['Churn'] = df['Churn'].astype('int')
states = df['State']
y = df['Churn']
df.drop(['State', 'Churn'], axis=1, inplace=True)

Let's set aside 70% of the sample (X_train, y_train) for training and 30% for the hold-out set (X_holdout, y_holdout). The hold-out set will take no part in tuning the model parameters; at the very end we will use it to assess the quality of the resulting model. Let's train 2 models: a decision tree and kNN. We don't yet know which parameters are good, so we'll guess: a tree depth of 5 and the number of neighbors equal to 10.

Code:

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X_train, X_holdout, y_train, y_holdout = train_test_split(df.values, y, test_size=0.3, random_state=17)
tree = DecisionTreeClassifier(max_depth=5, random_state=17)
knn = KNeighborsClassifier(n_neighbors=10)
tree.fit(X_train, y_train)
knn.fit(X_train, y_train)

We will assess the prediction quality on the hold-out set with a simple metric, the share of correct answers (accuracy). The decision tree did better: 94% of correct answers versus 88% for kNN. But that was with parameters guessed at random.
from sklearn.metrics import accuracy_score

tree_pred = tree.predict(X_holdout)
accuracy_score(y_holdout, tree_pred)

knn_pred = knn.predict(X_holdout)
accuracy_score(y_holdout, knn_pred)
Now let's tune the tree parameters on cross-validation. We will tune max_depth and max_features. Here is the gist of how GridSearchCV works: for each unique pair of values of max_depth and max_features, 5-fold cross-validation is run, and the best combination of parameters is then selected.

from sklearn.model_selection import GridSearchCV, cross_val_score

tree_params = {'max_depth': range(1,11), 'max_features': range(4,19)}

tree_grid = GridSearchCV(tree, tree_params, cv=5, n_jobs=-1, verbose=True)

tree_grid.fit(X_train, y_train)
The best combination of parameters and the corresponding mean share of correct answers on cross-validation:
tree_grid.best_params_

{'max_depth': 6, 'max_features': 17}

tree_grid.best_score_

0.94256322331761677

accuracy_score(y_holdout, tree_grid.predict(X_holdout))

0.94599999999999995
Now let's tune the number of neighbors for kNN:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_jobs=-1))])
knn_params = {'knn__n_neighbors': range(1, 10)}
knn_grid = GridSearchCV(knn_pipe, knn_params, cv=5, n_jobs=-1, verbose=True)
knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_
({'knn__n_neighbors': 7}, 0.88598371195885128)
accuracy_score(y_holdout, knn_grid.predict(X_holdout))
0.89000000000000001
So, in this task the decision tree did better than the nearest neighbors method: 94.2%/94.6% of correct answers on cross-validation and hold-out versus 88.6%/89% for kNN. Moreover, the decision tree performs almost as well as the random forest (trained here with 100 trees and without parameter tuning; random forests will be covered later), which reaches 95.1%/95.3%.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=17)
print(np.mean(cross_val_score(forest, X_train, y_train, cv=5)))
forest_params = {'max_depth': range(1,11), 'max_features': range(4,19)}
forest_grid = GridSearchCV(forest, forest_params, cv=5, n_jobs=-1, verbose=True)
forest_grid.fit(X_train, y_train)
forest_grid.best_params_, forest_grid.best_score_
accuracy_score(y_holdout, forest_grid.predict(X_holdout))
Let's draw the resulting tree. Because it is no longer entirely a toy (its maximal depth is 6), the picture is not small, but you can "walk" through the tree if you open the image separately.

export_graphviz(tree_grid.best_estimator_, feature_names=df.columns, out_file='../../img/churn_tree.dot', filled=True)
!dot -Tpng '../../img/churn_tree.dot' -o '../../img/churn_tree.png'
Complex case for decision trees

To continue the discussion of the pros and cons of these two methods, consider a very simple classification task which a tree handles, but in a somewhat "overcomplicated" way. Let's generate points on a plane (2 features), each labeled with one of two classes (+1 if x1 > x2, and -1 otherwise, leaving a small gap around the diagonal). The true boundary between the classes here is simply the line x1 = x2.
def form_linearly_separable_data(n=500, x1_min=0, x1_max=30, x2_min=0, x2_max=30):
    data, target = [], []
    for i in range(n):
        x1, x2 = np.random.randint(x1_min, x1_max), np.random.randint(x2_min, x2_max)
        if np.abs(x1 - x2) > 0.5:
            data.append([x1, x2])
            target.append(np.sign(x1 - x2))
    return np.array(data), np.array(target)

X, y = form_linearly_separable_data()

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='autumn', edgecolors='black');
Let's train a decision tree on this data without limiting the depth and look at the separating surface it builds. We use a helper function get_grid (defined in the course notebook), which returns a mesh of points covering the 30×30 square of the sample, on which the classifier's predictions are displayed.

Code for training the tree and drawing its boundary:

tree = DecisionTreeClassifier(random_state=17).fit(X, y)

xx, yy = get_grid(X, eps=.05)
predicted = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.pcolormesh(xx, yy, predicted, cmap='autumn')
plt.scatter(X[:, 0], X[:, 1], c=y, s=100, cmap='autumn', edgecolors='black', linewidth=1.5)
plt.title('Easy task. Decision tree complexifies everything');

It turned out that the tree approximates the simple boundary x1 = x2 with a complicated "staircase".

export_graphviz(tree, feature_names=['x1', 'x2'], out_file='../../img/deep_toy_tree.dot', filled=True)
!dot -Tpng '../../img/deep_toy_tree.dot' -o '../../img/deep_toy_tree.png'

We see the same staircase in the tree itself: it came out very deep. Now the same task with kNN:

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

xx, yy = get_grid(X, eps=.05)
predicted = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.pcolormesh(xx, yy, predicted, cmap='autumn')
plt.scatter(X[:, 0], X[:, 1], c=y, s=100, cmap='autumn', edgecolors='black', linewidth=1.5);
plt.title('Easy task, kNN. Not bad');
Decision trees and kNN in the MNIST handwritten digit recognition task

Now let's look at how these 2 algorithms perform on a real task. We will use the sklearn built-in dataset of handwritten digits. This task is an example where kNN works surprisingly well.

The pictures here are 8x8 matrices (the intensity of white for each pixel). Each such matrix is "unrolled" into a vector of 64 numbers, giving the feature description of an object.

Let's draw a few handwritten digits; we can see that they are quite distinguishable.
from sklearn.datasets import load_digits

data = load_digits()
X, y = data.data, data.target

X[0,:].reshape([8,8])
array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])
f, axes = plt.subplots(1, 4, sharey=True, figsize=(16,6))
for i in range(4):
    axes[i].imshow(X[i,:].reshape([8,8]));
Next, let's run exactly the same experiment as in the previous task, changing only the ranges of the tuned parameters.

Let's select 70% of the dataset (X_train, y_train) for training and 30% for the hold-out set (X_holdout, y_holdout). The hold-out set will take no part in parameter tuning; we will use it at the end to assess the quality of the resulting model.

X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3, random_state=17)

Let's train a decision tree and kNN, again guessing the parameters at random.

tree = DecisionTreeClassifier(max_depth=5, random_state=17)
knn = KNeighborsClassifier(n_neighbors=10)
tree.fit(X_train, y_train)
knn.fit(X_train, y_train)

Now let's make predictions on the hold-out set. We see that kNN did noticeably better, though again this was with parameters chosen at random.

tree_pred = tree.predict(X_holdout)
knn_pred = knn.predict(X_holdout)
accuracy_score(y_holdout, knn_pred), accuracy_score(y_holdout, tree_pred)
Now, as before, let's tune the tree parameters on cross-validation, keeping in mind that there are more features this time: 64 of them.

tree_params = {'max_depth': [1, 2, 3, 5, 10, 20, 25, 30, 40, 50, 64],
               'max_features': [1, 2, 3, 5, 10, 20, 30, 50, 64]}

tree_grid = GridSearchCV(tree, tree_params, cv=5, n_jobs=-1, verbose=True)
tree_grid.fit(X_train, y_train)

The best parameter combination and the corresponding share of correct answers on cross-validation:

tree_grid.best_params_, tree_grid.best_score_

That is better than 66% already, but short of 97%. kNN works better on this dataset: with a single nearest neighbor, almost 99% of correct answers is reached on cross-validation.

np.mean(cross_val_score(KNeighborsClassifier(n_neighbors=1), X_train, y_train, cv=5))

Let's also train a random forest on the same data; on the majority of datasets it works better than kNN. But here we have an exception.

np.mean(cross_val_score(RandomForestClassifier(random_state=17), X_train, y_train, cv=5))

You would be right to object that we have not tuned the RandomForestClassifier parameters here, but even with tuning, the share of correct answers does not reach 98% as it does for the one-nearest-neighbor method.
Experiment results

(legend: CV and Holdout are the mean shares of correct answers on cross-validation and on the hold-out set; DT stands for the decision tree, kNN for nearest neighbors, RF for random forest)

The conclusion of this experiment (and a piece of general advice): start by checking simple models on your data — a decision tree and nearest neighbors (next time logistic regression will join this list). It may turn out that they already work well enough.

Complex case for the nearest neighbors method

Let's consider one more simple example. In this classification task, one of the features is simply proportional to the vector of answers, but this is of no help to the nearest neighbors method.
Code for data generation (the original listing was cut off here; the remaining lines are restored to match the description above, and the seed line is fixed to actually seed NumPy):

def form_noisy_data(n_obj=1000, n_feat=100, random_seed=17):
    np.random.seed(random_seed)
    y = np.random.choice([-1, 1], size=n_obj)
    # the first feature is proportional to the target
    x1 = 0.3 * y
    # the remaining features are noise
    x_other = np.random.random(size=[n_obj, n_feat - 1])
    return np.hstack([x1.reshape([n_obj, 1]), x_other]), y

X, y = form_noisy_data()
As always, we will look at the share of correct answers on cross-validation and on the hold-out set, and we will build curves reflecting how these quantities depend on the n_neighbors parameter of the nearest neighbors method. Such curves are called validation curves.

We can see that kNN with the Euclidean distance does not cope with the task well, even when the number of neighbors is varied over a wide range. In contrast, a decision tree easily "detects" the hidden dependence in the data, whatever the depth limit.
kNN validation curves:

from sklearn.model_selection import cross_val_score

cv_scores, holdout_scores = [], []
n_neighb = [1, 2, 3, 5] + list(range(50, 550, 50))

for k in n_neighb:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores.append(np.mean(cross_val_score(knn, X_train, y_train, cv=5)))
    knn.fit(X_train, y_train)
    holdout_scores.append(accuracy_score(y_holdout, knn.predict(X_holdout)))

plt.plot(n_neighb, cv_scores, label='CV')
plt.plot(n_neighb, holdout_scores, label='holdout')
plt.title('Easy task. kNN fails')
plt.legend();
tree = DecisionTreeClassifier(random_state=17, max_depth=1)
tree_cv_score = np.mean(cross_val_score(tree, X_train, y_train, cv=5))
tree.fit(X_train, y_train)
tree_holdout_score = accuracy_score(y_holdout, tree.predict(X_holdout))
print('Decision tree. CV: {}, holdout: {}'.format(tree_cv_score, tree_holdout_score))

Decision tree. CV: 1.0, holdout: 1.0

In this second example the tree solved the problem flawlessly, while kNN ran into difficulties. This, however, is a drawback not so much of the method as of the Euclidean metric we used: it failed to reveal that one feature was far better than the others.
Pros and cons of decision trees and the nearest neighbors method

Decision trees. Pros:

- generation of clear classification rules understandable to a human, e.g. "if age < 25 and interested in motorcycles, refuse the loan"; this property is called the interpretability of the model;
- decision trees are easy to visualize, i.e. both the model itself (the tree) and the prediction for a particular test object (the path in the tree) can be "interpreted";
- fast training and prediction;
- a small number of model parameters;
- support for both numerical and categorical features.
Cons:

- the separating boundary built by a decision tree has its limitations (it consists of hyperplanes perpendicular to one of the coordinate axes), which in practice makes it inferior in quality to some other methods;
- trees are very sensitive to noise in the input data; the whole model can change drastically if the training sample is modified slightly (e.g. if a feature is removed or a few objects are added), which worsens the interpretability of the model;
- the need to fight overfitting by pruning the tree; however, overfitting is a problem for many methods;
- instability: small changes in the data can change the tree considerably (this problem is tackled with ensembles of decision trees, discussed next time);
- the problem of finding the optimal decision tree (of minimal size and correctly classifying the sample) is NP-complete, so in practice heuristics are used, such as the greedy choice of the feature with maximal information gain, which do not guarantee finding the globally optimal tree;
- missing values in the data are hard to support; Friedman estimated that handling them took about 50% of the code of CART (a popular tree algorithm; CART stands for Classification And Regression Trees, and an improved version of it is implemented in sklearn);
- the model can only interpolate, not extrapolate (the same is true of random forests and tree boosting): a decision tree makes constant predictions for objects lying, in feature space, outside the parallelepiped that covers the training sample. In our example with the yellow and blue balls, this means the model gives the same prediction for all balls with coordinate > 19 or < 0.
The nearest neighbors method. Pros:

- simple implementation;
- well studied theoretically;
- as a rule, the method is good as a first solution not only for classification or regression but also for, say, recommendations;
- it can be adapted to a particular problem by choosing a metric or a kernel (the kernel can define a similarity operation for complex objects such as graphs while the kNN approach itself stays the same);
- decent interpretability: one can explain why a test example was classified the way it was. Although this argument can be attacked: if the number of neighbors is large, interpretability worsens ("we did not give him a loan because he is similar to 350 clients, of whom 70 are bad, which is 12% above the sample average").
Cons:

- the method is considered fast in comparison with, say, compositions of algorithms, but in real problems the number of neighbors used for classification is usually large (100-150), in which case the algorithm does not work as fast as a decision tree;
- when the dataset has many features, it is hard to pick appropriate weights and to determine which features are unimportant for classification/regression;
- dependence on the chosen distance metric between objects: choosing the Euclidean distance by default is often unjustified; a good solution can be found by searching over the parameters, but for a large dataset this takes a lot of time;
- there are no theoretical grounds for choosing a particular number of neighbors, only search (though this is largely true for all hyperparameters of all models);
- as a rule, the method works poorly when there are many features, due to the "curse of dimensionality". Pedro Domingos, a professor well known in the ML community, explains this in his popular paper "A Few Useful Things to Know about Machine Learning"; "the curse of dimensionality" is also described in the Deep Learning book in the chapter "Machine Learning basics".
For all their simplicity, decision trees and the nearest neighbors method remain useful tools, and they are worth trying first on new data.

Homework #3

To consolidate the material, we propose an assignment: work out, on a toy example, how a decision tree is built, and then train and tune classification trees in the task of predicting income using the Adult dataset from the UCI repository. Answers should be entered in a web form (your solution will also be checked through it).