Inspired by BubaVV's article on predicting the weight of a Playboy model from her shape and height, the author decided to dig deeper (if you know what I mean) into the subject of that study and to look in the same data for outliers: models who stand out from the rest by their shape, height or weight, busty ones included. Alongside this warm-up for the sense of humor, we will also tell beginners a little about outlier and anomaly detection in data, using the One-class Support Vector Machine implementation from the Scikit-learn library in Python.
Downloading and initial data analysis
So, with an honest reference to the data source and to the person who worked on it, we open the CSV file girls.csv and see what is inside. It holds the parameters of 604 Playboy girls from December 1953 to January 2009: bust (Bust, cm), waist (Waist, cm), hips (Hips, cm), height (Height, cm) and weight (Weight, kg).
Open your favorite Python programming environment (Eclipse + PyDev in my case) and load the data with the Pandas library. This article assumes that the Pandas, NumPy, SciPy, sklearn and matplotlib libraries are installed. If they are not, Windows users can rejoice and simply install precompiled libraries. Users of *nix and Macs (the author included) will have to suffer a little, but that is not what this article is about.
First, import the required modules. Their roles are discussed as each one comes up.
    import pandas
    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.font_manager
    from scipy import stats
    from sklearn.preprocessing import scale
    from sklearn import svm
    from sklearn.decomposition import PCA
Create an instance girls of the Pandas DataFrame data structure by reading the data from the girls.csv file (it sits next to this .py file; otherwise give the full path). The header parameter says that the attribute names are in the first line (line zero, if you count like a programmer).
    girls = pandas.read_csv('girls.csv', header=0)
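To make the effect of header=0 concrete, here is a small sketch that reads an in-memory CSV instead of the real file. The column names and the two data rows are taken from the outlier listing in this article; the comma-separated layout and the io.StringIO stand-in for girls.csv are assumptions made only for illustration.

```python
import io

import pandas as pd

# Stand-in for girls.csv: a header line followed by two data rows.
# The comma-separated layout is an assumption about the real file.
csv_text = """Month,Year,Bust,Waist,Hips,Height,Weight
September,1962,91,46,86,152,45
October,1963,94,66,94,183,68
"""

# header=0: take the column names from line 0 of the file.
sample = pd.read_csv(io.StringIO(csv_text), header=0)
print(sample.columns.tolist())
print(sample.shape)

# header=None: the same first line is treated as ordinary data,
# and pandas numbers the columns 0, 1, 2, ...
raw = pd.read_csv(io.StringIO(csv_text), header=None)
print(raw.shape)
```

With header=None the frame gains one extra row and every column ends up holding strings, which is why the header line is declared explicitly.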
By the way, Pandas is a great option for those who are used to Python but still love how quickly data can be parsed in R. The main thing Pandas inherited from R is the convenient DataFrame data structure.
The author got to know Pandas through the Kaggle tutorial for the trial competition Titanic: Machine Learning from Disaster. For anyone not yet familiar with Kaggle, this is a great excuse to finally get acquainted.
Let's look at the general statistics of our girls:
    girls.info()  # info() prints the summary itself and returns None
We are told that there are 604 girls at our disposal, each with 7 attributes: Month (type object), Year (type int64) and five more attributes of type int64.
Now let's learn more about the girls:
    print(girls.describe())
Ah, if only everything in life were that simple!
The interpreter lists the main statistical characteristics of the girls' attributes: the mean, the minimum and the maximum. Not bad already. From this we conclude that the average Playboy model measures 89-60-88 (as expected), is 168 cm tall and weighs 52 kg.
The height seems rather low. Apparently this is because the data are historical and start in the middle of the 20th century; these days a height of 180 cm seems to be considered the standard for models.
Across the girls, bust ranges from 81 to 104 cm, waist from 46 to 89 cm, hips from 61 to 99 cm, height from 150 to 188 cm, and weight from 42 to 68 kg.
Wow, one can already suspect that errors have crept into the data. What kind of beer keg of a model has an 89 cm waist? And how can hips measure 61 cm?
Let's see who these unique ones are:
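    # Look up the month and year of the two suspicious records.
    print(girls.loc[girls['Waist'] == 89, ['Month', 'Year']])
    print(girls.loc[girls['Hips'] == 61, ['Month', 'Year']])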
These are the Playboy girls of December 1998 and January 2005, respectively. They are easy to look up. They are the triplets Nicole, Erica and Jaclyn, hiding under the surname Dahm (all three under one entry), and Destiny Davis. The triplets' waist is 25 inches (64 cm), not 89, and Destiny's hips are 86 cm, not 61.
For beauty's sake, one can also plot histograms of the distributions of the girls' parameters (for a change, these were made in R).
So simply by eyeballing the data one can already find some oddities in it, provided, of course, that there is not much data and that the attributes can be interpreted in a way a human understands.
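Data preprocessing
To train the model we keep only the numeric parameters, leaving out the year. We write them into the NumPy array girl_params, converting them to float64 at the same time. We then scale the data with scale, which standardizes every attribute to zero mean and unit variance; this matters for many learning algorithms. Without going into details, scaling prevents an attribute from receiving more weight merely because its range of variation is wider. For example, if the Euclidean distance between people is computed over the attributes "age" and "income", the contribution of income to the metric will be far greater than that of age, simply because income is measured in, say, thousands and age in tens.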
    girl_params = np.array(girls.values[:, 2:], dtype="float64")
    girl_params = scale(girl_params)
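The age and income argument can be checked with a few numbers. The sketch below uses three invented people and standardizes the columns by hand with NumPy (subtract the mean, divide by the standard deviation), which to my understanding is what scale does with its default settings. All values are made up for illustration.

```python
import numpy as np

# Made-up people: columns are age (years) and income (currency units).
people = np.array([
    [25.0, 30000.0],
    [60.0, 31000.0],
    [26.0, 55000.0],
])


def euclid(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Raw distances: the 35-year age gap between persons 0 and 1 barely
# registers next to the income differences.
raw_01 = euclid(people[0], people[1])
raw_02 = euclid(people[0], people[2])

# Standardize each column: zero mean, unit standard deviation.
scaled = (people - people.mean(axis=0)) / people.std(axis=0)

scaled_01 = euclid(scaled[0], scaled[1])
scaled_02 = euclid(scaled[0], scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

Before scaling, person 2 looks about 25 times farther from person 0 than person 1 does, purely because of the income column. After scaling, the two distances are nearly equal, so a large age gap counts about as much as a large income gap.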
Next we extract the 2 principal components of the data so that it can be plotted. The Principal Component Analysis (PCA) implementation in Scikit-learn comes in handy here. It will also be useful to store the number of our girls. In addition, we declare that we are looking for 1% of outliers in the data, that is, we limit ourselves to 6-7 "strange" girls. (In Python, variables written in capital letters denote constants and are usually placed at the top of the file, right after the imports.)
    X = PCA(n_components=2).fit_transform(girl_params)
    girls_num = X.shape[0]
    OUTLIER_FRACTION = 0.01
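Projecting five attributes onto two components necessarily discards some information, and it can be worth checking how much. The sketch below is not run on the article's data: it uses an invented five-column data set driven by two hidden factors, and reads the fitted PCA object's explained_variance_ratio_ attribute to see what share of the variance the two components keep.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic stand-in for the scaled measurements: 5 correlated columns
# driven by 2 hidden factors plus a little noise (all values invented).
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
data = factors @ mixing + 0.05 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
projected = pca.fit_transform(data)

# Same number of rows, only two columns left.
print(projected.shape)

# Share of the total variance kept by each component, and in total.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

On this toy data almost all of the variance survives because it was built from two factors. On real measurements the total may be noticeably lower, in which case an object that is unusual only along a discarded direction can go unnoticed in the 2D picture.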
Model training
To detect "outliers" in the data we use a one-class support vector machine model. Theoretical work on this variation of SVM was started by Alexey Yakovlevich Chervonenkis. According to Yandex, developing methods for this problem now ranks first in the development of machine learning theory.
We will not go into SVM in general here; a lot has been written about it, for example on Habré (simpler) and on machinelearning.ru (more involved). Note only that a one-class SVM, as the name suggests, is meant to recognize objects of a single class. Detecting anomalies in data is just a modest application of that idea. Today, in the age of deep learning, one-class classification algorithms are being used to try to teach a computer to "form a representation" of an object, much as a child learns to tell a dog from all other objects.
But back to the Scikit-learn implementation of One-class SVM, which is well documented on the Scikit-learn site.
We create a classifier instance with a Gaussian kernel and "feed" it the data.
    clf = svm.OneClassSVM(kernel="rbf")
    clf.fit(X)
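The call above leaves the nu parameter at its default, which in scikit-learn is 0.5. As I read the documentation, nu is an upper bound on the fraction of training errors, so with the default setting roughly half of the training points can end up outside the learned region. The sketch below illustrates this on an invented 2D cloud, not on the article's data; the names points and shares are hypothetical.

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(42)

# Invented 2-D cloud standing in for the PCA-projected data.
points = rng.normal(size=(300, 2))

shares = {}
for nu in (0.5, 0.05):
    model = svm.OneClassSVM(kernel="rbf", nu=nu)
    model.fit(points)
    # predict() returns +1 for points inside the learned region, -1 outside.
    labels = model.predict(points)
    shares[nu] = float(np.mean(labels == -1))

for nu, share in shares.items():
    print(f"nu={nu}: {share:.1%} of training points fall outside")
```

This is one reason not to rely on predict() directly with the default model when only about 1% of the objects should be flagged. Picking a custom threshold on the decision function is one way around it; passing a small nu is another.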
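Finding the outliers
We create the array dist_to_border, which stores the distances from the objects of the training set X to the constructed separating surface. Then, once a threshold has been chosen, we create an array of indicators (True or False) telling whether an object is a representative of the class rather than an outlier. The distance is positive if the object lies "inside" the region bounded by the separating surface (that is, it is a representative of the class) and negative otherwise. The threshold is determined statistically, as the distance to the separating surface below which OUTLIER_FRACTION (here, one percent) of the sample falls; in other words, the threshold here is the 1st percentile of the array of distances to the separating surface.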
    dist_to_border = clf.decision_function(X).ravel()
    threshold = stats.scoreatpercentile(dist_to_border, 100 * OUTLIER_FRACTION)
    is_inlier = dist_to_border > threshold
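The percentile rule can be tried in isolation. The sketch below replaces the real decision values with 604 invented numbers and uses np.percentile in place of stats.scoreatpercentile; both interpolate linearly between neighbouring values by default, so the substitution should not change the outcome.

```python
import numpy as np

OUTLIER_FRACTION = 0.01

rng = np.random.RandomState(1)

# Invented signed distances for 604 objects (the size of the data set).
dist_to_border = rng.normal(size=604)

# 1st percentile, with linear interpolation between neighbouring values.
threshold = np.percentile(dist_to_border, 100 * OUTLIER_FRACTION)

# Everything strictly above the threshold is treated as an inlier.
is_inlier = dist_to_border > threshold
n_outliers = int(np.sum(~is_inlier))

print(threshold)
print(n_outliers)
```

For 604 distinct values the interpolated threshold sits at position 0.01 * 603 = 6.03 in the sorted array, between the 7th and 8th smallest distances. Seven objects are therefore flagged, which is where the "6-7 strange girls" estimate comes from. Ties in real data could shift this count slightly.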
Displaying and interpreting the results
Finally, let's visualize what we got. I will not dwell on this step; anyone interested can work through matplotlib on their own. This is reworked code from the Scikit-learn example "Outlier detection with several methods".
    xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))
    n_inliers = int((1. - OUTLIER_FRACTION) * girls_num)
    n_outliers = int(OUTLIER_FRACTION * girls_num)

    # Evaluate the decision function on a grid to draw the learned region.
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.title("Outlier detection")
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7), cmap=plt.cm.Blues_r)
    a = plt.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    plt.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
    b = plt.scatter(X[~is_inlier, 0], X[~is_inlier, 1], c='white')
    c = plt.scatter(X[is_inlier, 0], X[is_inlier, 1], c='black')
    plt.axis('tight')
    # legend_elements() is used instead of a.collections[0], which recent matplotlib releases no longer provide.
    plt.legend([a.legend_elements()[0][0], b, c],
               ['learned decision function', 'outliers', 'inliers'],
               prop=matplotlib.font_manager.FontProperties(size=11))
    plt.xlim((-7, 7))
    plt.ylim((-7, 7))
    plt.show()
The following picture appears:
Seven "outliers" are shown. To understand what kind of girls are hiding behind these unflattering "outliers", let's look at them in the source data.
    print(girls[~is_inlier])
             Month  Year  Bust  Waist  Hips  Height  Weight
    54   September  1962    91     46    86     152      45
    67     October  1963    94     66    94     183      68
    79     October  1964   104     64    97     168      66
    173  September  1972    98     64    99     185      64
    483   December  1998    86     89    86     173      52
    507   December  2000    86     66    91     188      61
    535      April  2003    86     61    69     173      54
And now for the most entertaining part: interpreting the outliers we found.
Note that this Kunstkamera holds only 7 exhibits (the OUTLIER_FRACTION threshold was chosen well), so each of them can be examined.
- Mickey Winters. September 1962. 91-46-86, height 152, weight 45.
A 46 cm waist is impressive, of course! How does that go with a 91 cm bust?
- Christine Williams. October 1963. 94-66-94, height 183, weight 68.
Not a small girl for those years. Certainly no Mickey Winters.
- Rosemarie Hillcrest. October 1964. 104-64-97, height 168, weight 66.
Wow! An imposing woman.
- Susan Miller. September 1972. 98-64-99, height 185, weight 64.
- The Dahm triplets. 86-89 (actually 64)-86, height 173, weight 52.
An example of a data error. It is not very clear how all three of them were measured.
- Cara Michelle. December 2000. 86-66-91, height 188, weight 61.
At 188 cm she is taller than the author of this article. A clear "outlier" among such "historical" data.
- Carmella DeCesare. April 2003. 86-61-69, height 173, weight 54.
Possibly because of the hips.
It is worth noting that the girl with 61 cm hips, who differs strongly from the other girls in that respect but is otherwise quite normal, was not identified as an "outlier" by the SVM.
Conclusion
In closing, I would stress the importance of initial data analysis, done "just by eye". And of course anomaly detection in data is used for more serious tasks too: in credit scoring to recognize unreliable customers, in security systems to detect potential "bottlenecks", in the analysis of bank transactions to look for cybercriminals, and so on. The interested reader will also find many other algorithms for detecting anomalies and outliers in data, along with their applications.