1. Introduction
Very heavily loaded portals or APIs may need machine learning algorithms, for example to classify users. This note shows how some high-performance linear models are implemented and explains the basic theory behind them.
2. Pairwise linear relationship
Let's start with the simplest and most commonly used model. Suppose we have a pair of metrics with a clear linear relationship. We display the data visually: the value of the first metric is the position of a point along the abscissa, and the value of the second metric is its position along the ordinate. The plot shows that as the explanatory variable (also called the predictor, regressor, or independent variable) increases, the dependent variable increases too. For clarity, here is a theoretical example in R.
a <- c(1, 5, 5, 6, 4, 8, 9, 11, 15, 18, 22, 28, 29, 31, 31, 32)
b <- c(1, 5, 6, 4, 5, 8, 9, 10, 17, 19, 22, 28, 28, 30, 30, 32)
# Scatter plot of the two metrics with the fitted regression line
plot(a, b)
abline(lm(b ~ a), col = "blue")

We need an algorithm that establishes whether a linear dependence exists and measures how strong it is. Formally, we need to compute Karl Pearson's linear correlation coefficient. I suggest recalling the theoretical foundations and going through the formulas in more detail.

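First, we are interested in a characteristic of such a set called the variance of a random variable. If we subtract the mean from each element of the set and square the result, we get a new set; its mean is called the variance of the random variable (for the general population). As you can see, if all elements of the original set are equal, the variance is zero, and the further the elements deviate from the mean, the larger the variance becomes. It can never be negative.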
# Sample variance of a: squared deviations divided by n - 1
var_a <- sum((a - mean(a)) ^ 2) / (length(a) - 1)
c(var(a), var_a)
# 127.2625 127.2625
c(sd(a), sqrt(var_a))
# 11.28107 11.28107
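Note that in the example above we divided not by the size of the set but by one less than that. In other words, we computed the variance of a sample, not of the population. Because the differences between the elements and the mean are squared, it makes sense to take the square root of the variance to obtain the standard deviation, which is expressed in the original units. To find the relationship, we also need the standard deviation of the second set.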
# Sample variance of b
var_b <- sum((b - mean(b)) ^ 2) / (length(b) - 1)
c(var_b, var(b))
# 122.7833 122.7833
c(sd(b), sqrt(var_b))
# 11.08076 11.08076
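As an aside for readers who prefer Python, the same check can be sketched with the standard library only. This is an illustration, not part of the original R session; the helper name `sample_variance` is my own. It contrasts the sample variance (division by n - 1) with the population variance (division by n).

```python
import math
import statistics

a = [1, 5, 5, 6, 4, 8, 9, 11, 15, 18, 22, 28, 29, 31, 31, 32]
b = [1, 5, 6, 4, 5, 8, 9, 10, 17, 19, 22, 28, 28, 30, 30, 32]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


var_a = sample_variance(a)          # same quantity as var(a) in R
var_b = sample_variance(b)          # same quantity as var(b) in R
pvar_a = statistics.pvariance(a)    # population variance: divides by n
sd_a = math.sqrt(var_a)             # same quantity as sd(a) in R

print(round(var_a, 4), round(var_b, 4))
print(round(pvar_a, 4), round(sd_a, 5))
```

The population variance is always slightly smaller than the sample variance for the same data, because the same sum is divided by n rather than n - 1.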
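Now we need to compute a measure of the linear dependence between the two quantities: the covariance. The formula is very similar to the variance; moreover, if the two sets are identical, it yields exactly the variance of the random variable. The formula is symmetric, so the order of the arguments does not matter (the sets can be swapped): the covariance of A and B equals the covariance of B and A.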
# Sample covariance of a and b
cov_ab <- sum((a - mean(a)) * (b - mean(b))) / (length(a) - 1)
c(cov(a, b), cov_ab)
# 124.525 124.525
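Karl Pearson's linear correlation coefficient is simply the ratio of the covariance to the product of the standard deviations of the two sets. Unlike the covariance, it is very easy to interpret: it always lies between -1 and 1. The closer it is to one, the stronger the linear correlation. A value close to -1 indicates a negative correlation (in other words, the larger one variable is, the smaller the other). A value that does not deviate much from zero indicates a weak dependence. It is important to stress that we are talking only about a clearly expressed linear relationship without outliers; otherwise using this coefficient is meaningless.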
cov_ab / (sqrt(var_a) * sqrt(var_b))
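The properties just described (symmetry, and values confined to the range -1 to 1) can be checked with a short Python sketch. The function names `covariance` and `pearson` are illustrative; the data are the two sets from the R example.

```python
import math

a = [1, 5, 5, 6, 4, 8, 9, 11, 15, 18, 22, 28, 29, 31, 31, 32]
b = [1, 5, 6, 4, 5, 8, 9, 10, 17, 19, 22, 28, 28, 30, 30, 32]


def covariance(x, y):
    """Sample covariance: mean-centered products summed, divided by n - 1."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    products = ((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sum(products) / (len(x) - 1)


def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    sd_x = math.sqrt(covariance(x, x))   # cov(x, x) is the variance of x
    sd_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (sd_x * sd_y)


r = pearson(a, b)
r_swapped = pearson(b, a)                 # argument order does not matter
r_negated = pearson(a, [-v for v in a])   # perfectly inverse relationship

print(round(r, 7), round(r_swapped, 7), round(r_negated, 7))
```

Calling `covariance(x, x)` to obtain the variance relies on the fact stated above: the covariance of a set with itself is its variance.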
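The linear correlation coefficient can also be computed after the data have been normalized or standardized, because the dependence is preserved. Let's look at both transformations of the original data. For standardization, we subtract the mean from each element of the set (obtaining how far the value deviates from the mean) and then divide by the standard deviation. The result is a new set with mean 0 and variance 1. For normalization, we subtract the minimum from each element and divide by the range (the data then fall between 0 and 1).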
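# Normalization (min-max scaling); body reconstructed from the description above
nm <- function(a) {
  (a - min(a)) / (max(a) - min(a))
}
# Standardization (z-score); body reconstructed from the description above
snt <- function(a) {
  (a - mean(a)) / sd(a)
}
cor(a, b)
# 0.9961772
cor(nm(a), nm(b))
# 0.9961772
cor(snt(a), snt(b))
# 0.9961772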
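A minimal Python sketch of the same experiment, assuming the two transformations are exactly the ones described in the text (min-max scaling and z-scoring). The helper names are illustrative. Both transformations are positive linear rescalings, which is why the correlation survives them unchanged.

```python
import math

a = [1, 5, 5, 6, 4, 8, 9, 11, 15, 18, 22, 28, 29, 31, 31, 32]
b = [1, 5, 6, 4, 5, 8, 9, 10, 17, 19, 22, 28, 28, 30, 30, 32]


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def normalize(values):
    """Min-max scaling: the result lies in the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-scoring: the result has mean 0 and standard deviation 1."""
    m, s = mean(values), sd(values)
    return [(v - m) / s for v in values]


def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    return num / den


r_raw = pearson(a, b)
r_norm = pearson(normalize(a), normalize(b))
r_std = pearson(standardize(a), standardize(b))
print(round(r_raw, 7), round(r_norm, 7), round(r_std, 7))
```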
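Having observed a pairwise linear dependence, we approximate it with a straight line. This makes it possible to predict the value of one metric when only the other is known. Because we are studying a strictly pairwise linear dependence, only two parameters have to be computed: a constant (the intercept) and the coefficient of the single predictor, that is, the slope of the line. To compute the predictor coefficient, it is enough to divide the standard deviation of the dependent variable by that of the predictor and multiply the result by the correlation. The intercept is even easier to find: subtract the product of the coefficient and the mean of the predictor from the mean of the dependent variable.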
slope <- cor(a, b) * (sd(b) / sd(a))
intercept <- mean(b) - (slope * mean(a))
c(intercept, slope)
# 0.2803261 0.9784893
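Since the correlation is cov(a, b) / (sd(a) * sd(b)), multiplying it by sd(b) / sd(a) cancels sd(b) and leaves cov(a, b) / var(a). The Python sketch below uses that shortcut; the function name `fit_line` is illustrative, and the result should agree with the R output above.

```python
a = [1, 5, 5, 6, 4, 8, 9, 11, 15, 18, 22, 28, 29, 31, 31, 32]
b = [1, 5, 6, 4, 5, 8, 9, 10, 17, 19, 22, 28, 28, 30, 30, 32]


def fit_line(x, y):
    """Least-squares line for one predictor; returns (intercept, slope)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # The n - 1 denominators of covariance and variance cancel in the ratio
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(a, b)
prediction = intercept + slope * 17   # predicted b for a = 17
print(round(intercept, 7), round(slope, 7), round(prediction, 4))
```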
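If the dependence is probabilistic rather than functional, some error will appear. Consider an example. Knowing only the predictor, let's try to predict the value of the dependent variable with linear regression. The predicted values, which lie strictly on the line, are shown in red, and the actual values in black.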
# Predicted values lie exactly on the regression line
y <- (0.2803261 + (0.9784893 * a))
plot(a, b)
points(a, y, col = "red")
abline(lm(b ~ a), col = "blue")

The error is the difference between the actual value and the predicted value. In R, for example, descriptive statistics of the errors are shown in the "Residuals" section, and robust (outlier-resistant) measures are reported. The middle of the sorted set (the median), as well as the lower and upper quartiles, is resistant to outliers. The mean is not used here because it is not resistant to outliers. It is easy to guess that the maximum and minimum are the record errors, that is, the largest misses.
summary(lm(b ~ a))
# Residuals:
#      Min       1Q   Median       3Q      Max
# -2.15126 -0.61350 -0.09749  0.50744  2.04233
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  0.28033    0.44308   0.633    0.537
# a            0.97849    0.02293  42.669 3.17e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
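The output also shows the intercept and the predictor coefficient computed earlier. The adjacent column is the standard error. Next comes the t statistic, which tests the null hypothesis that the coefficient is zero (since subtracting zero from the coefficient changes nothing, the coefficient is simply divided by its standard error). For the predictor coefficient, the p-value is small enough to reject the null hypothesis; for the intercept (p = 0.537) it is not. For clarity, let's compute these indicators by hand.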
e <- (b - y)
# Residuals:
c(min(e), quantile(e, .25), median(e), quantile(e, .75), max(e))
# -2.15126190 -0.61349440 -0.09748515 0.50744140 2.04233440
# Std. Error (a)
sqrt(sum(e ^ 2) / ((length(e) - 2) * sum((a - mean(a)) ^ 2)))
# 0.02293208
# t value (a)
0.9784893 / 0.02293208
# 42.66902
# Pr(>|t|) (a)
round((pt(42.66902, df = 14, lower.tail = FALSE) * 2), digits = 18)
# 3.17e-16
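The standard error and the t value can also be reproduced in plain Python. This is only a sketch: the function name `slope_statistics` is illustrative, and the p-value is left out because the Python standard library has no Student's t distribution.

```python
import math

a = [1, 5, 5, 6, 4, 8, 9, 11, 15, 18, 22, 28, 29, 31, 31, 32]
b = [1, 5, 6, 4, 5, 8, 9, 10, 17, 19, 22, 28, 28, 30, 30, 32]


def slope_statistics(x, y):
    """Return (slope, standard error of the slope, t value) for y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    # Two parameters were estimated, so n - 2 degrees of freedom remain
    std_error = math.sqrt(sse / ((n - 2) * s_xx))
    return slope, std_error, slope / std_error


slope, std_error, t_value = slope_statistics(a, b)
print(round(slope, 7), round(std_error, 8), round(t_value, 3))
```

The `n - 2` in the denominator is the same 14 degrees of freedom that the R snippet passes to `pt`.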
To assess the accuracy of the model we use the following metrics: MSE, MAE, and RMSE. MSE stands for mean squared error. MAE (mean absolute error) is the mean absolute value of the errors. In other words, in the first case we take the mean of the squared errors, and in the second the mean of their absolute values. RMSE (root mean squared error) is simply the square root of MSE.
mae <- mean(abs(e))
mse <- mean(e ^ 2)
rmse <- sqrt(mse)
c(mae, mse, rmse)
# 0.7131298 0.8783887 0.9372239
# Distribution of the errors
hist(e, breaks = 10, col = "blue")
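For comparison, here is a small Python sketch of the three metrics. The function name `error_metrics` is illustrative; the predictions use the same rounded coefficients as the R code.

```python
import math

a = [1, 5, 5, 6, 4, 8, 9, 11, 15, 18, 22, 28, 29, 31, 31, 32]
b = [1, 5, 6, 4, 5, 8, 9, 10, 17, 19, 22, 28, 28, 30, 30, 32]


def error_metrics(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equally long sequences."""
    errors = [yi - pi for yi, pi in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e ** 2 for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)


# Predictions from the line fitted earlier (rounded coefficients)
predicted = [0.2803261 + 0.9784893 * x for x in a]
mae, mse, rmse = error_metrics(b, predicted)
print(round(mae, 7), round(mse, 7), round(rmse, 7))
```

Because MSE squares every error, a few large misses dominate it, while MAE weights all misses linearly; RMSE brings the squared measure back to the original units.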

3. Porting an existing model
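Let's move on to more practical matters. Suppose there is a very heavily loaded API to which we need to add the ability to determine the value of a dependent variable from the value of a single predictor. In effect, this is an approximation (regression) problem. An implementation of a linear function stands out for its very compact and fast code.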
We assume the API is written in PHP 7. Remember that the external system knows nothing about the internals of the model. All of the working logic is encapsulated in one class; only the requirements for the input data and the return value are known. As the Strategy design pattern suggests, the client works with any class that implements the interface. The interface requires one method that takes a single argument (the predictor) and returns another scalar value (the dependent variable).
<?php
declare(strict_types = 1);

interface IModel {
    public function predict(float $x): float;
}

// Pre-trained model: the coefficients are exported from R
class Example implements IModel {
    const SLOPE = 0.9784893;
    const INTERCEPT = 0.2803261;

    public function predict(float $x): float {
        return (self::INTERCEPT + (self::SLOPE * $x));
    }
}

// The client depends only on the interface (Strategy pattern)
class Client {
    private $_model;

    public function setModel(IModel $model) {
        $this->_model = $model;
    }

    public function run(float $x): float {
        return $this->_model->predict($x);
    }
}

$client = new Client();
$client->setModel(new Example());
echo $client->run(17);
For comparison, here is the complete code for fitting a pairwise linear relationship.
<?php
declare(strict_types = 1);

class Model {
    public $slope = 0.0;
    public $intercept = 0.0;

    // Estimate slope and intercept from paired observations
    public function fit(array $x, array $y) {
        $this->slope = Stat::cor($x, $y) * (Stat::sd($y) / Stat::sd($x));
        $this->intercept = Stat::mean($y) - ($this->slope * Stat::mean($x));
    }

    public function predict(float $x): float {
        return ($this->intercept + ($this->slope * $x));
    }
}
The descriptive statistics methods are implemented in a separate class.
<?php
declare(strict_types = 1);

class Stat {
    public static function max(array $values): float {
        return max($values);
    }

    public static function min(array $values): float {
        return min($values);
    }

    public static function sum(array $values): float {
        return array_sum($values);
    }

    public static function mean(array $values): float {
        return self::sum($values) / count($values);
    }

    // Sample variance (n - 1 in the denominator)
    public static function variance(array $values): float {
        $mean = self::mean($values);
        $pow = array_map(function($v) use ($mean) {
            return pow($v - $mean, 2);
        }, $values);
        return self::sum($pow) / (count($pow) - 1);
    }

    public static function sd(array $values): float {
        return sqrt(self::variance($values));
    }

    // Sample covariance of two equally long arrays
    public static function cov(array $a, array $b): float {
        $meanA = self::mean($a);
        $meanB = self::mean($b);
        $diff = [];
        for($i = 0; $i < count($a); $i++) {
            $diff[] = ($a[$i] - $meanA) * ($b[$i] - $meanB);
        }
        return self::sum($diff) / (count($diff) - 1);
    }

    // Pearson linear correlation coefficient
    public static function cor(array $a, array $b): float {
        return self::cov($a, $b) / (self::sd($a) * self::sd($b));
    }
}
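Let's make the task harder. The number of predictors is now arbitrary, so we are no longer dealing with a straight line but with a hyperplane in a multidimensional space. We also change the type of problem from approximation to classification: for example, a binary classification problem solved with logistic regression. Suppose we want to use this statistical model in a very heavily loaded service to classify users. To do that, the training algorithm itself is not needed; only the parameters of the separating hyperplane are.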
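The principle of the model does not change in this situation. As before, each predictor is multiplied by its coefficient and the results are summed. The intercept is then added to that sum. Sometimes an artificial constant predictor is added instead of the bias; in that case the whole formula reduces to the sum of the products of the predictors and their coefficients (the scalar product of the weight vector and the feature vector).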
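The equivalence of the two formulations can be shown with a short Python sketch. The weights and feature values are hypothetical, chosen only for illustration, and the function names are my own.

```python
def predict_with_intercept(intercept, weights, features):
    """Intercept plus the sum of weight * feature products."""
    return intercept + sum(w * x for w, x in zip(weights, features))


def predict_with_constant(weights, features):
    """Pure dot product: weights[0] is the intercept and a constant
    predictor equal to 1.0 is prepended to the feature vector."""
    extended = [1.0] + list(features)
    return sum(w * x for w, x in zip(weights, extended))


# Hypothetical parameters: intercept -4.0 and two predictor weights
intercept = -4.0
weights = [6.0, 2.0]
features = [0.9, 0.5]

score_a = predict_with_intercept(intercept, weights, features)
score_b = predict_with_constant([intercept] + weights, features)
print(score_a, score_b)
```

The second form is convenient when the parameters are stored as a single vector, because the prediction becomes one dot product with no special case for the intercept.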
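For a classification problem, the computed result is simply compared with a given threshold. If the value is greater, the observation is assigned to class one; otherwise, to class zero. Porting such a model is just as straightforward. The two most important advantages of such linear models are compact code and very high performance. This makes it possible to determine the class of an observation from its feature vector literally on the fly.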
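# Load the data
dataset <- read.csv('dataset.csv')
# Pairwise scatter plots, colored by class
pairs(dataset, col = factor(dataset$class))
# Fit a logistic regression
model <- glm(formula = class ~ ., data = dataset, family = binomial)
# Extract the coefficients of the separating hyperplane
b <- model$coefficients
# Classify by the sign of the linear score, then plot the predicted classes
nc <- (b[1] + (dataset$alpha * b[2]) + (dataset$beta * b[3])) > 0
plot(dataset$alpha, dataset$beta, col = factor(nc))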

The image shows that such a linear function splits the points into two classes. In practice, its parameters have to be exported into code written in another programming language. As we have already seen, this is a simple task, and the process lends itself well to automation. The main thing is to make sure that the hyperplane classifies correctly and that all of the predictors are really needed.
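One possible way to automate the export is to serialize the coefficients to JSON and load them in the consuming service. The sketch below is written in Python purely for illustration: the JSON layout, the class name `LinearClassifier`, and the coefficient values are all hypothetical and are not taken from the fitted model.

```python
import json

# Hypothetical export; real values would come from model$coefficients in R
exported = json.dumps({
    "intercept": -4.0,
    "weights": {"alpha": 6.0, "beta": 2.0},
})


class LinearClassifier:
    """Holds only the hyperplane parameters; no training code is needed."""

    def __init__(self, intercept, weights):
        self.intercept = intercept
        self.weights = weights

    @classmethod
    def from_json(cls, payload):
        data = json.loads(payload)
        return cls(data["intercept"], data["weights"])

    def score(self, features):
        """Linear score: intercept plus weighted sum of the named features."""
        return self.intercept + sum(
            weight * features[name] for name, weight in self.weights.items()
        )

    def predict(self, features, threshold=0.0):
        """True for the positive class when the score exceeds the threshold."""
        return self.score(features) > threshold


classifier = LinearClassifier.from_json(exported)
print(classifier.predict({"alpha": 0.9, "beta": 0.5}))
print(classifier.predict({"alpha": 0.2, "beta": 0.5}))
```

For logistic regression, a threshold of 0 on the linear score corresponds to a predicted probability of 0.5, which is why the R code above compares the score with zero.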
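Different data sets must be used to train the model and to check its accuracy. There are many classification quality metrics, and the main ones are certainly worth a look. First of all, I suggest recalling the share of correct answers (accuracy). It is computed as the number of correct answers divided by the total number of answers.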
test <- factor(as.logical(dataset$class))
length(nc)
# 345
table(test == nc)
# FALSE  TRUE
#    17   328
328 / 345
# 0.9507246
But what if the share of observations of one class does not even reach a thousandth of a percent? Then even a classifier that always outputs a constant will show impressive accuracy. It therefore makes sense to look at the confusion matrix. In binary classification there are only four possible outcomes of predicting a class label. A true positive is a correct guess of the positive class (TRUE predicted as TRUE). A true negative is a correct guess of the negative class (FALSE predicted as FALSE). Logically, a false positive is FALSE predicted as TRUE, and a false negative is TRUE predicted as FALSE. Let's look at a concrete example:
table(nc, test)
#        test
# nc      FALSE TRUE
#   FALSE   120    8
#   TRUE      9  208
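Here "nc" holds the answers of the classifier and "test" holds the true answers. We define precision as the share of the classifier's positive answers that were actually positive, and recall (completeness) as the share of the actual positives that the model identified. The meaning of these measures is as follows. Precision shows how far the classifier can be trusted when it gives a positive answer; in other words, how confident we can be that the observation really belongs to the positive class. Recall shows how much the classifier is able to detect, that is, the share of positives it finds. If you are afraid of mistakenly calling an observation positive, precision matters more. If you need to find as many positives as possible, recall matters more.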
library(caret)
precision <- posPredValue(factor(nc), test, positive = "TRUE")
# 0.9585253
recall <- sensitivity(factor(nc), test, positive = "TRUE")
# 0.962963
# F1: harmonic mean of precision and recall
f1 <- (2 * precision * recall) / (precision + recall)
# 0.960739
Let's compute precision and recall by hand. It is enough to substitute the values from the confusion matrix.
# precision: TP / (TP + FP)
208 / (208 + 9)
# 0.9585253
# recall: TP / (TP + FN)
208 / (208 + 8)
# 0.962963
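All four measures follow directly from the four cells of the confusion matrix. The Python sketch below uses the counts from the table above (208 true positives, 9 false positives, 8 false negatives, 120 true negatives); the function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of positive answers that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = classification_metrics(tp=208, fp=9, fn=8, tn=120)
for name, value in metrics.items():
    print(name, round(value, 7))
```

F1 is the harmonic mean of precision and recall, so it is pulled toward the smaller of the two.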
4. Assessing predictor importance
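Suppose a complex psychological study of people has been carried out. Half of the subjects suffer from neurosis, while the other half are fine. What is the difference between them? Or another example: we measure the performance of equipment, and some devices work well while others have problems. What influences this? Intuitively, we should look for a correlation between the dependent variable and each predictor. It may suddenly turn out that a measure of fuel quality correlates strongly with durability (the better the fuel, the more durable the device). Alternatively, we can check how the mean values of the predictors differ between observations of different classes.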
But first let's try to display the data set visually. A conditional boundary is visible beyond which the color of the points changes. For the alpha predictor it runs at around 0.5. This can also be seen in the histogram of the distribution (shown in different colors). For the beta predictor no such clear difference is observed.

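Despite the errors (some points turn out to be misclassified), it is intuitively clear that such a criterion solves the problem more effectively than any possible split on the beta predictor. The importance of alpha is therefore much higher. After the split, the probability of encountering a point of the target class increases considerably. To make the difference visible, we first compute the probabilities without any split.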
table(dataset$class)
#   0   1
# 129 216
length(dataset$class)
# 345
c(129 / 345, 216 / 345)
# 0.373913 0.626087
Knowing the probabilities, we compute a measure of how homogeneous the data are (the Gini impurity is 0 when only members of one class are present). The Gini impurity is computed as one minus the sum of the squared class probabilities.
# Gini impurity
gini <- function(p) {
  (1 - sum(p ^ 2))
}
gini(c(0.373913, 0.626087))
# 0.4682041
We split all points according to the condition above. The idea comes down to intuitive logic: choose the most effective split on the most useful predictor.
node_1 <- subset(dataset, alpha > .5)
table(node_1$class)
#  0   1
# 25 197
length(node_1$class)
# 222
# Gini impurity after the split
gini(c(25/222, 197/222))
# 0.199862
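How much a split helps can be expressed as the decrease in impurity weighted by the sizes of the two resulting nodes. The Python sketch below uses the class counts printed above. The counts for the complementary node (alpha <= 0.5) are not shown in the original; they are derived here by subtraction (129 - 25 and 216 - 197), which assumes that every observation falls into exactly one of the two nodes.

```python
def gini(counts):
    """Gini impurity from class counts: 1 minus the sum of squared shares."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted


parent = [129, 216]                  # class 0 and class 1 before the split
node_1 = [25, 197]                   # alpha > 0.5
rest = [129 - 25, 216 - 197]         # alpha <= 0.5, derived by subtraction

print(round(gini(parent), 6), round(gini(node_1), 6), round(gini(rest), 6))
print(round(impurity_decrease(parent, node_1, rest), 4))
```

A tree-building algorithm would evaluate this quantity for every candidate predictor and threshold and keep the split with the largest decrease.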
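We repeat the procedure recursively for each new subset obtained. This continues until some stopping condition is met, for example until only observations of a single class remain. The result is well described by a decision tree, in which each internal node holds a test condition (a comparison of a predictor with a constant) and each terminal node (leaf) holds observations of the same class. Because the tree tries to use the most effective predictors first, split conditions on them accumulate in the nodes closest to the root. The level (depth) at which a predictor is used for splitting therefore reflects its importance.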
node_2 <- subset(node_1, beta < .29)
table(node_2$class)
#  1
# 34
length(node_2$class)
# 34
# Only one class remains, so the impurity is zero
gini(c(0/34, 34/34))
# 0
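You will agree that doing this by hand is not the most pleasant job. This is exactly where machine learning algorithms help. Unfortunately, an ensemble of decision trees cannot explain the difference between the classes (interpreting a huge number of decision trees is difficult), but it can compute the importance of the predictors. The same applies to approximation problems, where the principle is similar and only the splitting criterion differs.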
In this data set it was as obvious as it could be that one of the predictors is very important. Let's see how it correlates with the class labels.

The difference between the mean values for the different classes is also visible. I recommend looking at the descriptive statistics (examples in Python and R).

Assessing predictor importance with a random forest:

5. Conclusion
The linear models described in this note have two main advantages that make them easy to port to a wide range of projects. The first is compact code (nothing but arithmetic operations). The second is very high performance. Everything has its drawbacks, however. Unfortunately, if the points are hard to separate with a hyperplane, or the dependence is poorly approximated by one, the quality of the model will be unacceptably low.
6. Appendix
The following source code snippets (Python) were used to determine the importance of the predictors.
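import pandas as pd

dataset = pd.read_csv('dataset.csv')
# Structure, a random sample, and descriptive statistics
dataset.info()
print(dataset.sample(5))
print(dataset.describe())
# Pairwise correlations and per-class means of the predictors
print(dataset.corr())
print(dataset.groupby('class').mean())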
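import seaborn as sns
import matplotlib.pyplot as plt

# Continues from the previous snippet: `dataset` is already loaded
dataset.hist(bins=20, figsize=(6, 6))
plt.show()

sns.heatmap(dataset.corr(), square=True, annot=True)
plt.show()

# `height` replaces the `size` argument, which newer seaborn releases no longer accept
sns.pairplot(data=dataset, hue='class', height=2, palette='seismic')
plt.show()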
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

dataset = pd.read_csv('dataset.csv')
# Separate the target first so that it cannot leak into the features
y = dataset.pop('class')
X_train, X_test, y_train, y_test = train_test_split(dataset, y, test_size=.5)

model = RandomForestClassifier(
    n_estimators=1000,
    max_depth=100,
    max_features=2
).fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
# Importance of each predictor according to the forest
print(pd.Series(model.feature_importances_, index=X_train.columns))
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, f1_score

kf = KFold(n_splits=3, shuffle=True)
dataset = pd.read_csv('dataset.csv')
y = dataset.pop('class').values
X = dataset.values
columns = dataset.columns

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # calc_feature_importance is kept from the original snippet;
    # newer CatBoost releases may not accept this argument
    model = CatBoostClassifier(
        calc_feature_importance=True
    ).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print(f1_score(y_test, y_pred))
    print(list(zip(columns, model.feature_importances_)))