On March 16 the ML Boot Camp III machine learning contest came to an end. I am "not a real welder", as the saying goes, yet I managed to take 7th place in the final standings. In this article I would like to share how to get started in championships of this kind, what deserves attention the first time you tackle such a task, and what my own approach was.
ML Boot Camp III
This is an open machine learning championship organized by Mail.Ru Group. The task was to predict whether a player will stay in an online game or leave it. As data, the organizers provided processed user statistics for the preceding two weeks.
Data description
- maxPlayerLevel: the highest game level the player has passed;
- numberOfAttemptedLevels: the number of levels the player attempted;
- triesOnTheHighestLevel: the number of attempts made on the highest level;
- totalNumOfAttempts: the total number of attempts;
- averageNumOfTurnsPerCompletedLevel: the average number of moves made on successfully completed levels;
- doReturnOnLowerLevels: whether the player returned to levels already completed;
- numberOfBoostersUsed: the number of boosters used;
- fractionOfUsefullBoosters: boosters used during successful attempts (attempts in which the player passed the level);
- totalScore: the total number of points scored;
- totalBonusScore: the total number of bonus points earned;
- totalStarsCount: the total number of stars earned;
- numberOfDaysActuallyPlayed: the number of days on which the user played the game.
More details about the championship can be found on the project website.
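Read the rules
Unlike the manual for a household appliance, the rules contain useful information. What to look for:
- the format of the input and output data;
- the maximum number of submissions per day;
- the quality criterion (evaluation function).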
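The last item is the most important part of the rules, because this is the function we have to minimize (or, in some contests, maximize). This time the logarithmic loss function was used:

$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} Y_{ij}\,\log P_{ij}$$

where
N is the number of examples;
M is the number of classes (there are only two);
Pij is the predicted probability that example i belongs to class j;
Yij is 1 if example i actually belongs to class j, and 0 otherwise.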
It is important to note that this formula heavily penalizes confidence in a wrong answer. It therefore pays to submit the probability that a player keeps playing rather than a hard "1" or "0".
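To make the penalty concrete, here is a minimal sketch of the binary case of this metric in NumPy. The function name `log_loss_binary`, the clipping constant and the toy numbers are my own assumptions for illustration; this is not the organizers' scoring code.

```python
import numpy as np

def log_loss_binary(y_true, p_pred, eps=1e-15):
    """Mean binary log loss; p_pred is the predicted probability of class 1."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs when hard 0/1 answers are submitted.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_true = [1, 0, 1, 1]
hard = [1, 0, 0, 1]            # confident labels, one of them wrong
soft = [0.8, 0.2, 0.4, 0.8]    # probabilities carrying the same decisions

loss_hard = log_loss_binary(y_true, hard)
loss_soft = log_loss_binary(y_true, soft)
print(loss_hard, loss_soft)
```

A single confidently wrong answer dominates the hard submission, while the soft one pays only a moderate price for the same mistake.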
Sometimes studying the evaluation function lets you cheat a little and pick up extra points (as participants of past and current contests have done).
More about the various metrics can be read here.
Toolkit
There are many tools you can use during a championship. If people's talk about machine learning sounds like incantations to you, I recommend clearing the ML fog and getting acquainted with the basic algorithms here.
This time most participants chose between Python and R. The general recommendation: stick to one language and study the capabilities of its tools in more depth. Both languages have good solutions, and the most popular libraries (XGBoost, for example) are available in both.
If there is an urgent need, you can always run an individual computation in another package. One example is the t-SNE transformation, which in the Python implementation crashes after eating all the memory.
I chose Python, and my final solution used the following libraries:
- scikit-learn: an excellent machine learning toolkit. At first you can limit yourself to it alone.
- XGBoost: gradient boosting. One of the favorite libraries in machine learning championships.
- LightGBM: an alternative to XGBoost. In my case it ran an order of magnitude faster than XGBoost, though the accuracy of the result was slightly lower.
- Lasagne: a library for building and training neural networks on top of Theano. As an alternative you can try Keras; it looks a little simpler, and I came across more documentation on it. But you don't change horses in midstream, so I decided to stick with my first choice.
First submission
To begin with, let's try reading all the input data and producing a test answer that consists solely of zeros.
Code
>>> import numpy as np
>>> import pandas as pd
>>> X_train = pd.read_csv('x_train.csv', sep=';')
>>> X_test = pd.read_csv('x_test.csv', sep=';')
>>> y_train = pd.read_csv('y_train.csv', header=None).values.ravel()
>>> print(X_train.shape, X_test.shape, y_train.shape)
(25289, 12) (25289, 12) (25289,)
>>> result = np.zeros((X_test.shape[0]))
>>> pd.DataFrame(result).to_csv('submit.csv', index=False, header=False)
Once loading and saving the data has been verified and we have a baseline score, we can train a simple model. I took RandomForestClassifier as an example.
Code
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier()
>>> clf.fit(X_train, y_train)
>>> result = clf.predict_proba(X_test)[:,1]
>>> pd.DataFrame(result).to_csv('submit.csv', index=False, header=False)
If you run the previous example again and submit the result for scoring, you will most likely get a different score. This is because many algorithms use a random number generator. Such behavior makes it much harder to judge how later changes to the model affect the final result. To avoid this problem you can:
Fix the seed value:
>>> np.random.seed(2707)
>>> clf = RandomForestClassifier(random_state=2707)
...
or
run the algorithm with different seeds and take the average result:
>>> runs = 1000
>>> results = np.zeros((runs, X_test.shape[0]))
>>> for i in range(runs):
...     clf = RandomForestClassifier(random_state=2707+i)
...     clf.fit(X_train, y_train)
...     results[i, :] = clf.predict_proba(X_test)[:,1]
>>> result = results.mean(axis=0)
The second option gives a more stable result, but it obviously takes much more computation time, so I used it only for the final checks.
More examples can be found in the organizers' tutorial. It also covers working with categorical features, which this article does not touch on.
Data preparation
To lower the entry threshold, the organizers prepared the data quite well, so no extra cleaning was needed. Moreover, my attempts to remove duplicates or outliers from the training set only made the results worse.
As for the duplicates, it is worth noting that they often belong to different classes (users with identical data may either stay in the game or leave it), so an exact prediction is hard to make without additional information. Fortunately, most models coped with this on their own and produced a probability that minimizes the evaluation function (log loss, in this case).
UPD: the participant who took 3rd place managed to use this fact to his advantage.
Data prepared by the organizers is the exception rather than the rule, so you have to be ready to process it yourself. Besides duplicate rows and outliers, the data may contain missing values. Dropping rows with missing values is wasteful, because they may still contain useful information. That leaves two options:
- leave them as they are: some algorithms can handle missing values (NA);
- try to restore them.
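To restore them you can simply substitute the most frequent value (for categorical features), the mean, or the median. In Python the sklearn.preprocessing.Imputer class served this purpose (scikit-learn 0.22 and later provide sklearn.impute.SimpleImputer instead). There are also more elaborate methods that draw on other features (for example, the mean among users of the same level), and you can even train a separate model that predicts the missing values from the other columns. I wrote above that the data was prepared and had no missing values, but in fact that is not entirely true.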
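As a rough illustration of the simplest recovery strategy, the sketch below fills gaps with the column median using `SimpleImputer`. The toy frames and their values are invented for the example; only the column names are borrowed from the contest data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames; only the column names come from the contest data.
train = pd.DataFrame({"totalScore": [10.0, np.nan, 30.0, 50.0],
                      "maxPlayerLevel": [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({"totalScore": [np.nan, 20.0],
                     "maxPlayerLevel": [4.0, np.nan]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)   # medians are learned from train only
test_filled = imputer.transform(test)         # the same medians are reused on test
print(imputer.statistics_)
```

Switching `strategy` to `"mean"` or `"most_frequent"` gives the other two replacements mentioned above.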
A careful reading of the rules makes it clear that almost all columns are statistics computed from two weeks of logs. A closer look at the data shows that many users started playing more than two weeks ago. If you exclude them, you get an incredibly good cross-validation score, which suggests that improving the predictions for the remaining "dirty" data could be the key to winning. My attempts to restore these users' data as of two weeks earlier did not bring a significant gain, but I kept that solution and later used it together with others.
Another trick that occurred to me was to multiply some of the features of such users by -1. This separates them from everyone else during training, and it performed well, especially given how simple the method is.
A few charts
[Chart: all data]
[Chart: only users who started playing within the last two weeks]
[Chart: attempt to restore the data from other columns]
[Chart: "inversion" for users who started playing more than two weeks ago]
In some cases it also makes sense to get rid of a few columns:
- constant features;
- one of two strongly correlated features (you only need to keep one);
- features with near-zero variance.
This speeds up computation and sometimes improves the overall quality of the model, but you need to be very careful when removing features.
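The three checks above could be sketched with pandas roughly as follows. The helper name `drop_redundant_columns` and both thresholds are assumptions chosen for the demo; in particular, a variance cutoff depends on the scale of each column, so treat the numbers as placeholders.

```python
import numpy as np
import pandas as pd

def drop_redundant_columns(df, var_threshold=1e-8, corr_threshold=0.95):
    """Drop (near-)constant columns and the second column of every
    strongly correlated pair."""
    # 1. Constant and near-zero-variance columns.
    keep = df.loc[:, df.var() > var_threshold]
    # 2. Upper triangle of the absolute correlation matrix: each pair once.
    corr = keep.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return keep.drop(columns=to_drop)

rng = np.random.RandomState(2707)
a = rng.rand(100)
demo = pd.DataFrame({"a": a,
                     "a_scaled": 2 * a + 1,    # perfectly correlated with "a"
                     "const": np.ones(100),    # zero variance
                     "b": rng.rand(100)})
reduced = drop_redundant_columns(demo)
print(list(reduced.columns))
```

Whatever gets dropped this way should still be confirmed by cross-validation, since a "redundant" column occasionally carries signal.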
The last thing you can do with the data at this initial stage is scaling. By itself it does not change the dependencies between features, but it can significantly improve the predictions of some models (linear ones, for example). In Python you can use the sklearn.preprocessing.StandardScaler, sklearn.preprocessing.MinMaxScaler and sklearn.preprocessing.MaxAbsScaler classes for this.
Every data transformation has to be checked carefully. What works in one case may have a negative effect in another, and vice versa.
Always (!!!) make sure the test sample goes through exactly the same transformations as the training sample.
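One way to honor that rule, shown here as a small sketch on synthetic arrays (the `*_demo` names and the random data are placeholders, not the contest files), is to fit the scaler on the training sample only and then reuse the fitted object for the test sample.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(2707)
X_train_demo = rng.normal(loc=100.0, scale=20.0, size=(200, 3))
X_test_demo = rng.normal(loc=100.0, scale=20.0, size=(50, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # statistics come from train
X_test_scaled = scaler.transform(X_test_demo)        # no refitting on test
print(X_train_scaled.mean(axis=0), X_test_scaled.mean(axis=0))
```

The test means come out close to zero but not exactly zero, which is expected: the test sample is shifted by the training statistics, not its own.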
Check yourself
The whole dataset is divided into two parts: the training sample and the test sample. The test sample is in turn split 40/60 into a public part and a hidden part. How well the model predicts the public part determines your position on the leaderboard throughout the championship, while the score on the hidden part becomes available only at the very end and determines the participants' final places.
If you focus only on the result for the public part of the test sample, you will most likely overfit the model and see your score drop sharply once the hidden results are revealed. To avoid this, and to be able to check locally whether the model has improved or worsened, cross-validation is used.
We split the data into K folds, train on K-1 of them, and for the remaining fold make predictions and compute the score. We repeat this for all K folds. The final score is taken to be the mean of the per-fold scores.
Besides the mean, pay attention to the standard deviation (std) of the scores. It can be even more important than the mean over the folds, because it shows how widely the predictions vary from fold to fold. The std can grow considerably as K increases; keep this in mind and don't be alarmed.
The quality of the split into folds plays an important role. To preserve the class distribution when splitting, I used sklearn.model_selection.StratifiedKFold. This matters especially when the classes are heavily imbalanced to begin with. There may also be other problems with how the data is distributed across folds (by day of the week, time, user and so on), and these have to be checked and fixed separately.
As before, wherever a random number generator is involved, always fix the seed value so that the results can be reproduced.
Code
>>> from sklearn.model_selection import StratifiedKFold, cross_val_score
>>> clf = RandomForestClassifier(random_state=2707)
>>> kf = StratifiedKFold(random_state=2707, n_splits=5, shuffle=True)
>>> scores = cross_val_score(clf, X_train, y_train, cv=kf)
>>> print("CV scores:", scores)
CV scores: [ 0.8082625 0.81059707 0.8024911 0.81431679 0.81926043]
>>> print("mean:", np.mean(scores))
mean: 0.810985579862
>>> print("std:", np.std(scores))
std: 0.00564433052781
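It is advisable to try different cross-validation schemes until the difference between the local score and the public score is as small as possible. When the two scores disagree and the local cross-validation is considered sound, it is customary to rely on the local score.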
Making the model more complex (what works may only seem to)
Tuning
Choosing hyperparameters for ML algorithms can be viewed as the task of minimizing a function that returns the model's cross-validation score for the given parameters.
Consider several options for solving this problem:
- Brute force (sklearn.model_selection.GridSearchCV). Although it is an exhaustive search, this method is quite effective; it is how I tuned my XGBoost models. With a good guide on how to do it, you won't have to wait several days. The method is also good because it gives you a better understanding of what the hyperparameters mean, which saves time later.
- Randomized search (sklearn.model_selection.RandomizedSearchCV). Note that here you can set the number of iterations regardless of the number of parameters.
- Hyperopt. It can select a large number of hyperparameters at once, including those of neural networks with different numbers of layers. This is especially convenient when you need to find a workable configuration.
- Differential evolution.
- Manual selection, and so on.
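To show how the iteration budget is decoupled from the size of the search space, here is a small sketch of the randomized option on synthetic data. The parameter ranges, `n_iter=5` and the `make_classification` stand-in are arbitrary choices for the demo, not the settings used in the contest.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the contest data.
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=2707)

param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 8),
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=2707),
    param_distributions=param_distributions,
    n_iter=5,                  # the budget does not depend on the number of parameters
    scoring="neg_log_loss",    # same metric as the contest, negated by scikit-learn
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=2707),
    random_state=2707,
)
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)
```

Replacing `RandomizedSearchCV` with `GridSearchCV` and the distributions with explicit lists gives the exhaustive variant, at a cost that grows with every added parameter value.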
By the way, if you use the cross_val_score function from scikit-learn for cross-validation, keep in mind that some algorithms accept the metric to minimize during training as an argument of their fit method. To set this parameter during cross-validation you need to use fit_params (recent scikit-learn releases rename this argument to params).
UPD: the eval_metric parameter in the xgboost and LightGBM libraries sets the metric by which eval_set is evaluated for early stopping. In other words, a dataset is passed to the fit method, the model is evaluated on it with eval_metric at every gradient boosting step, and if the score has not improved for early_stopping_rounds steps, training stops.
Code
import xgboost as xgb
clf = xgb.XGBClassifier(seed=2707)
kf = StratifiedKFold(random_state=2707, n_splits=5, shuffle=True)
# fit_params forwards eval_metric to XGBClassifier.fit on every fold
scores = cross_val_score(clf, X_train, y_train, cv=kf, scoring='neg_log_loss', fit_params={'eval_metric':'logloss'})
Calibration (hello Garus!)
The idea of calibration is that if the model predicts a class membership probability of 0.6, then 60% of all samples that received this prediction should actually belong to that class. scikit-learn includes the sklearn.calibration.CalibratedClassifierCV class for this. It can improve the score, but remember that calibration uses a cross-validation mechanism internally, which means training time increases considerably.
Code
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
kf = StratifiedKFold(random_state=2707, n_splits=5, shuffle=True)
clf = RandomForestClassifier(random_state=2707)
scores = cross_val_score(clf, X_train, y_train, cv=kf, scoring="neg_log_loss")
print("CV scores:", -scores)
print("mean:", -np.mean(scores))
clf = CalibratedClassifierCV(clf, method='sigmoid', cv=StratifiedKFold(random_state=42, n_splits=5, shuffle=True))
scores = cross_val_score(clf, X_train, y_train, cv=kf, scoring="neg_log_loss")
print("CV scores:", -scores)
print("mean:", -np.mean(scores))

CV scores: [ 1.12679227 1.01914874 1.24362513 0.97109882 1.07280166]
mean: 1.08669332288
CV scores: [ 0.41028741 0.4055759 0.4134125 0.40244068 0.39892905]
mean: 0.406129108769 <---
Bagging
The idea is to run the same algorithm on different (incomplete) subsets of the training samples and features, and then use the average prediction of those models. As usual, scikit-learn already has everything you need, which saves a lot of time: just use the sklearn.ensemble.BaggingClassifier class.
Code
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
kf = StratifiedKFold(random_state=2707, n_splits=5, shuffle=True)
clf = RandomForestClassifier(random_state=2707)
scores = cross_val_score(clf, X_train, y_train, cv=kf, scoring="neg_log_loss")
print("CV scores:", -scores)
print("mean:", -np.mean(scores))
clf = BaggingClassifier(clf, random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=kf, scoring="neg_log_loss")
print("CV scores:", -scores)
print("mean:", -np.mean(scores))

CV scores: [ 1.12679227 1.01914874 1.24362513 0.97109882 1.07280166]
mean: 1.08669332288
CV scores: [ 0.51778172 0.46840953 0.52678512 0.5137191 0.52285478]
mean: 0.509910050424
Of course, nothing stops you from combining this with calibration.
Composite models
It may well turn out that it pays to split the data into groups and predict each group with a different model. For example, some participants split players into groups by level and predicted each group with its own model.
My best model used this principle too. I split the users into two groups: those who started playing within the last two weeks and those who started earlier. In addition, I added to the first group users as they were at the time they recorded their first levels, which improved the overall score. As models I used xgboost with different hyperparameters and different feature sets. Furthermore, I trained the second model on all the data, but gave a weight of 3 to users who started playing more than two weeks ago.
Dirty tricks
You have to understand that competitions and the real-world use of machine learning algorithms are completely different things. Here you can build huge, slow models that gain an extra fraction of a percent in accuracy even if they take an extra day to compute, and you can also correct answers by hand to improve the score. The main thing is to beware of overfitting to the public score.
More data!
To squeeze the last drop of information out of the data we are given, you can (and should!) try generating new features. Building the right feature set from the provided data is often the key to winning a machine learning championship.
- Multiplying or dividing existing features by one another is a simple but effective method.
- Extracting new features: for example, the day of the week from a date, or the number of characters in a text.
- Nonlinear transformations of existing features can bring the distribution of values closer to normal, which in some cases (for neural networks, say) gives better results. Examples: log(x), log(x + 1), sqrt(x), sqrt(x + 1) and so on.
- Anything else your imagination suggests: the largest power of two that divides a number, the age difference from the president, and so on. One of the features I generated, which made it into the final model, was computed with the following formula:
raw_data['totalScore'] / (1 + np.log(1+raw_data['maxPlayerLevel']) * raw_data['maxPlayerLevel'])
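The simpler recipes from the list above might look like this in pandas. The three rows of `raw_data` and the new column names (`scorePerAttempt`, `logTotalScore`, `sqrtAttempts`) are made up for the sketch; only the source column names come from the contest.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names come from the contest data.
raw_data = pd.DataFrame({
    "totalScore": [1200, 0, 56000],
    "totalNumOfAttempts": [10, 1, 240],
    "maxPlayerLevel": [3, 1, 40],
})

features = pd.DataFrame(index=raw_data.index)
# Ratio of two existing columns (the +1 guards against division by zero).
features["scorePerAttempt"] = raw_data["totalScore"] / (raw_data["totalNumOfAttempts"] + 1)
# Nonlinear transforms that compress long right tails.
features["logTotalScore"] = np.log1p(raw_data["totalScore"])
features["sqrtAttempts"] = np.sqrt(raw_data["totalNumOfAttempts"])
print(features)
```

`np.log1p` computes log(x + 1), so a zero score maps to zero instead of minus infinity.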
Now that we have many new columns, we need some way to choose the optimal set that gives the best score.
PCA or TruncatedSVD can reduce the dimensionality of the feature space and thereby speed up the algorithms. However, there is a big risk not only of ignoring nonlinear relationships in the data, but also of losing important features altogether.
Many algorithms, gradient boosting among them, make it very easy, by virtue of their design, to obtain information about the importance of individual features in a trained model. This information can be used to weed out unimportant columns.
Example
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost import plot_importance
clf = xgb.XGBClassifier(seed=2707)
clf.fit(X_train, y_train, eval_metric='logloss')
for a, b in sorted(zip(clf.feature_importances_, X_train.columns)):
    print(a, b, sep='\t\t')
plot_importance(clf)
plt.show()

0.014771		numberOfAttemptedLevels
0.014771		totalStarsCount
0.0221566		totalBonusScore
0.0295421		doReturnOnLowerLevels
0.0354505		fractionOfUsefullBoosters
0.0531758		attemptsOnTheHighestLevel
0.0886263		numberOfBoostersUsed
0.118168		totalScore
0.128508		averageNumOfTurnsPerCompletedLevel
0.144756		maxPlayerLevel
0.172821		numberOfDaysActuallyPlayed
0.177253		totalNumOfAttempts

As always, features must be removed very carefully. Removing an unimportant feature can hurt prediction accuracy, while removing one of the most important ones can, on the contrary, improve it. I used this method only to get rid of completely hopeless columns.
There are also more classical approaches to feature selection. In this contest I made heavy use of a greedy algorithm. The idea is to add new features to the set one at a time, each time keeping the one that gives the best cross-validation score. In the same way you can discard features one at a time. By alternating these two approaches I arrived at my final feature set. The algorithm has a drawback: it ignores features that improve accuracy only as part of a set with several others. From that point of view it seems more productive to encode feature usage as a binary vector and apply a genetic algorithm.
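The forward half of that greedy procedure can be sketched as below. This is my own minimal reading of the description, not the code used in the contest: the function name, the stopping rule (stop as soon as no candidate improves the score), the logistic regression model and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def greedy_forward_selection(model, X, y, cv, scoring="neg_log_loss"):
    """Add one column at a time, keeping the one that improves the CV score most."""
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        trials = [(cross_val_score(model, X[:, selected + [j]], y,
                                   cv=cv, scoring=scoring).mean(), j)
                  for j in remaining]
        score, j = max(trials)
        if score <= best_score:      # no candidate helps any more
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score

X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                     n_redundant=0, n_clusters_per_class=1,
                                     shuffle=False, random_state=2707)
kf_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=2707)
selected, score = greedy_forward_selection(LogisticRegression(max_iter=1000),
                                           X_demo, y_demo, kf_demo)
print(selected, -score)
```

The backward pass is symmetric: start from the selected set and try removing one column at a time, keeping a removal only if the score improves.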
Working through mistakes
Trite but true: fame and prizes are wonderful, of course, but my main motivation this time was to gain experience and knowledge. And, naturally, the learning process was not free of mistakes; analyzing them is what helped me best understand what I was doing. If you are as new to this as I am, my advice is: try everything. Once you have several different results, it is easier to judge them and compare them with one another. And trying to explain to yourself why things turn out the way they do leads to a deeper understanding of how the algorithms work.
The process of working with data and models described above is not linear: throughout the championship I periodically went back to the models, to generating new features, and to tuning. As a result I accumulated several good models, whose outputs I used for the final prediction.
If you are stuck at a dead end:
- keep local minima in mind: some ideas may at first give worse results than your current one, but their further development, or their combination with other ideas, may become your "killer feature";
- you can almost always find scientific papers on the topic of the task;
- study the solutions of participants in other championships (kaggle);
- try different models or generate even more features.
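More models!
Suppose that, after much suffering and many sleepless nights, you have obtained a good model that scores well in local CV and, ideally, on the public leaderboard too. You have also ended up with a couple more models of somewhat lower quality. Don't rush to throw those away. In fact, the predictions of several models can be combined in various ways to get an even more accurate one. This is a very large topic, and I recommend starting with this article. Here I will share two methods of differing complexity that came to my mind.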
The simplest approach, and in my case the more effective one, turned out to be a plain arithmetic mean of the predictions of several models. As variations of this method you can use the geometric mean or assign weights to the models.
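For reference, the three averaging variants fit in a few lines of NumPy. The prediction matrix and the weights below are invented for the illustration; in practice the weights would be chosen by cross-validation.

```python
import numpy as np

# Hypothetical class-1 probabilities from three models for four test rows.
preds = np.array([
    [0.90, 0.20, 0.60, 0.40],   # model A
    [0.80, 0.30, 0.50, 0.10],   # model B
    [0.70, 0.10, 0.70, 0.40],   # model C
])

arithmetic = preds.mean(axis=0)
geometric = np.exp(np.log(preds).mean(axis=0))   # requires strictly positive inputs

weights = np.array([0.5, 0.3, 0.2])              # illustrative; they sum to 1
weighted = np.average(preds, axis=0, weights=weights)
print(arithmetic, geometric, weighted)
```

The geometric mean never exceeds the arithmetic one, so it pulls the blend toward the more cautious model whenever the models disagree.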
The second approach is stacking. Here there is a lot more to chew on... The idea is simple: use the predictions of first-level models as input to another algorithm. Sometimes the original data is added to these predictions, or the outputs of the first-level models are used to generate new features. Two schemes come up here: holdout set and out-of-fold predictions.
Holdout set â (~10%) , , . , .
OOF predictions â K , K-1 . . : , (Variant ), , -1 , (Variant A).

Example
def get_oof(clf):
    """Return out-of-fold predictions for train and fold-averaged predictions for test."""
    X_tr_all, X_te_all = np.asarray(X_train), np.asarray(X_test)
    n_folds = kf.get_n_splits()
    oof_train = np.zeros((X_tr_all.shape[0],))
    oof_test_skf = np.empty((n_folds, X_te_all.shape[0]))
    for i, (train_index, test_index) in enumerate(kf.split(X_tr_all, y_train)):
        # Fit on K-1 folds, then predict the held-out fold and the whole test set.
        clf.fit(X_tr_all[train_index], y_train[train_index])
        oof_train[test_index] = clf.predict_proba(X_tr_all[test_index])[:, 1]
        oof_test_skf[i, :] = clf.predict_proba(X_te_all)[:, 1]
    oof_test = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
? , (data leak), , . , , , .
1: OOF predictions -.
2: K~=10, 1 holdout set.
, , . -, , , .
Don't repeat yourself
, . . , / , -, OOF , - .. , , , . , , .
, , . .
, Scikit Learn , , ( ). , , .
Summary
. telegram . , 6 8 .
GitHub .
, , , .
, .