Hello to everyone taking the open Data Science community course!
Over the course so far, you have already become familiar with several key machine learning algorithms. However, before moving on to the more sophisticated algorithms and approaches, let's take a step aside and talk about preparing data for model training. The well-known principle of "garbage in, garbage out" applies to 100% of machine learning tasks. Any experienced analyst can recall cases from practice when a simple model trained on well-prepared data turned out better than a cunning ensemble built on data that was not clean enough.
UPD: the course is now in English under the mlcourse.ai brand, with articles on Medium and materials on Kaggle (Dataset) and GitHub.
A list of the articles in the series. Within the framework of today's article, we will review three similar but different tasks:

- feature extraction and feature engineering: transforming domain-specific data into vectors a model can understand;
- feature transformation: transforming the data to improve the accuracy of the algorithm;
- feature selection: removing unnecessary features.
Note, apart from everything else, that this article contains almost no formulas, but a comparatively large amount of code.
Some examples will use the Renthop dataset from the Two Sigma Connect: Rental Listing Inquiries competition on Kaggle. In this problem, you need to predict the popularity of a rental listing, i.e. solve a classification problem with three classes ['low', 'medium', 'high']. Solutions are evaluated with the log loss metric (the smaller, the better). Those who do not yet have a Kaggle account will have to register; you will also need to accept the competition rules in order to download the data.
# Download train.json.zip from the Kaggle competition page and unzip it
import json
import pandas as pd

# Load the Renthop data
with open('train.json', 'r') as raw_data:
    data = json.load(raw_data)
    df = pd.DataFrame(data)
In practice, data rarely comes in the form of ready-made matrices, which is why every task begins with feature extraction. Sometimes, of course, it is enough to read a csv file and convert it into numpy.array, but these are happy exceptions. Let's look at the popular types of data from which features can be extracted.
Texts
Text is the most obvious example of data in free format; there are enough methods for working with text that they will not fit into a single article. Nevertheless, let's review the most popular ones.
Before working with text, it must be tokenized. Tokenization means splitting the text into tokens; in the simplest case these are just words. But splitting with an overly simple regular expression may lose meaning: "Nizhny Novgorod" is one token, not two, while some hyphenated exclamations can safely be split into two tokens. There are ready-made tokenizers that take the peculiarities of a language into account, but they can also be wrong, especially if you work with specific texts (professional vocabulary, jargon, typos).
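For illustration, here is a minimal sketch with nltk, one such ready-made tokenizer (it assumes the punkt model has been downloaded via nltk.download('punkt'); the sample sentence is made up):

from nltk.tokenize import word_tokenize

# The Treebank-style tokenizer handles contractions for us
print(word_tokenize("Don't split this!"))
# ['Do', "n't", 'split', 'this', '!']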
In most cases, after tokenization you should think about reducing words to their normal form. This is where stemming and/or lemmatization come in; these are similar processes used to handle word forms. You can read about the difference between them here.
So, we have turned the document into a sequence of words, and we can start turning it into a vector. The simplest approach is called Bag of Words: we create a vector with the length of the dictionary, count the number of occurrences of each word in the text, and place that count in the corresponding position of the vector. In code this looks even simpler than in words:
Bag of Words without third-party libraries

from functools import reduce
import numpy as np

texts = [['i', 'have', 'a', 'cat'],
         ['he', 'have', 'a', 'dog'],
         ['he', 'and', 'i', 'have', 'a', 'cat', 'and', 'a', 'dog']]

dictionary = list(enumerate(set(reduce(lambda x, y: x + y, texts))))

def vectorize(text):
    vector = np.zeros(len(dictionary))
    for i, word in dictionary:
        num = 0
        for w in text:
            if w == word:
                num += 1
        if num:
            vector[i] = num
    return vector

for t in texts:
    print(vectorize(t))
The same idea is also nicely illustrated with a picture:
This is an extremely naive implementation. In practice, you need to take care of stop words, the maximum size of the dictionary, and an efficient data structure (text data is usually turned into sparse vectors)...
When using algorithms like Bag of Words, we lose the order of the words in the text, which means that the texts "i have no cows" and "no, i have cows" will appear identical after vectorization, although they are semantically opposite. To avoid this problem, we can take a step back and change the approach to tokenization: for example, use N-grams (combinations of N consecutive terms).
Let's check in practice:

In : from sklearn.feature_extraction.text import CountVectorizer
In : vect = CountVectorizer(ngram_range=(1,1))
In : vect.fit_transform(['no i have cows', 'i have no cows']).toarray()
Out: array([[1, 1, 1],
            [1, 1, 1]], dtype=int64)
In : vect.vocabulary_
Out: {'cows': 0, 'have': 1, 'no': 2}
In : vect = CountVectorizer(ngram_range=(1,2))
In : vect.fit_transform(['no i have cows', 'i have no cows']).toarray()
Out: array([[1, 1, 1, 0, 1, 0, 1],
            [1, 1, 0, 1, 1, 1, 0]], dtype=int64)
In : vect.vocabulary_
Out: {'cows': 0, 'have': 1, 'have cows': 2, 'have no': 3, 'no': 4, 'no cows': 5, 'no have': 6}
Also note that it is not necessary to operate on words: in some cases it is possible to generate N-grams of characters (for example, such an algorithm will account for the similarity of related words or of typos).
In : from scipy.spatial.distance import euclidean
In : vect = CountVectorizer(ngram_range=(3,3), analyzer='char_wb')
# Four surnames: the two 'петр-' ones share many character trigrams
In : n1, n2, n3, n4 = vect.fit_transform(['иванов', 'петров', 'петренко', 'смит']).toarray()
In : euclidean(n1, n2)
Out: 3.1622776601683795
In : euclidean(n2, n3)
Out: 2.8284271247461903
In : euclidean(n3, n4)
Out: 3.4641016151377544
Developing the Bag of Words idea further: words that are rarely found in the corpus (in all the documents of the dataset under consideration) but are present in a particular document may turn out to be more important. Then it makes sense to increase the weight of narrowly topical words in order to separate them from the common ones. This approach is called TF-IDF; it cannot be written down in ten lines, so those interested can get acquainted with the details in external sources such as the wiki. The default option looks like this:
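In a common textbook formulation (the exact weighting and smoothing vary between implementations):

$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}, \qquad \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$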
Analogues of Bag of Words can be found outside of text problems too: for example, the Bag of Sites in the Catch Me If You Can competition that we are running. Other examples are a bag of apps or a bag of events.
With such algorithms one can get a perfectly workable solution for a simple problem, a baseline of sorts. However, for those who dislike the classics, there are newer approaches. The most popular method of the new wave is Word2Vec, but there are alternatives (GloVe, Fasttext, etc.).
Word2Vec is a special case of word embedding algorithms. With Word2Vec and similar models, we can not only vectorize words into a space of large dimension (typically several hundred) but also compare their semantic proximity. A classic example of operations on vectorized representations: king - man + woman = queen.
It is worth understanding that this model, of course, does not comprehend the meaning of words; it simply tries to position the vectors so that words used in a common context are placed close to each other. If this is not taken into account, a lot of amusing things can be invented: for example, finding the opposite of Hitler by multiplying the corresponding vector by -1.
For the vector coordinates to actually reflect the semantics of words, such models need to be trained on very large datasets. For your own problem, you can instead download a pretrained model, for example, here.
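As an illustration, here is a minimal sketch with gensim, assuming a pretrained model file is available locally (the file name below, the classic GoogleNews vectors, is just an example):

import gensim

# Load a pretrained model in the word2vec binary format
model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# The arithmetic from the text: king - man + woman is close to queen
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))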
By the way, similar methods are applied in other areas (for example, bioinformatics). Among the most unexpected applications is food2vec.
Images
With images, everything is simpler and more complex at the same time. Simpler, because often you can use one of the popular pretrained networks without much thought; more complex, because if you do need to dig into the details, the rabbit hole turns out to be mighty deep. But first things first.
In the times when GPUs were weaker and the "renaissance of neural networks" had not yet happened, generating features from images was a separate complex field. To work with pictures, one had to operate at a low level, defining, for example, corners and the boundaries of regions. Experienced specialists in computer vision could draw many parallels between the older approaches and the neural network hipsterism; in particular, the convolutional layers of modern networks are very similar to Haar cascades. Not being experienced in this matter, I will not attempt to transfer knowledge from public sources; I will leave a couple of links to the skimage and SimpleCV libraries and proceed straight to our days.
For tasks associated with pictures, a convolutional network of one kind or another is often used. You do not have to come up with the architecture and train the network from scratch: you can take a pretrained state-of-the-art network whose weights can be downloaded from open sources. To adapt it to their task, data scientists practice so-called fine tuning: the last fully connected layers of the network are "torn off", new ones chosen for the specific task are added instead, and the network is trained on the new data. But if you simply want to vectorize an image for some purpose (for example, to use some non-network classifier), just tear off the last layers and use the output of the preceding layers:
from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from scipy.misc import face
import numpy as np

resnet_settings = {'include_top': False, 'weights': 'imagenet'}
resnet = ResNet50(**resnet_settings)

# What a cute raccoon!
img = image.array_to_img(face())
# In real life, you may want to pay more attention to resizing
img = img.resize((224, 224))
x = image.img_to_array(img)
# An extra axis, because the model works with batches of images
x = np.expand_dims(x, axis=0)
features = resnet.predict(x)
Here, a classifier trained on one dataset is adapted to another by tearing off the last layer and adding a new one in its place. Still, you should not fixate on neural network methods: some features generated by hand can prove useful even today. For example, when predicting the popularity of a rental listing, one can assume that bright apartments attract more attention and make a feature like "the average pixel value". You can get inspired by the examples in the documentation of the relevant libraries.
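As a quick sketch of such a hand-made feature, here is mean pixel brightness with Pillow and numpy (the file name is hypothetical):

import numpy as np
from PIL import Image

img = Image.open('apartment_photo.jpg')
# Average over all pixels and color channels: a crude 'brightness' feature
mean_pixel = np.array(img).mean()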
If text is expected on the image, it can also be read without unrolling a complicated neural network by hand: for example, with pytesseract.
In : import pytesseract
In : from PIL import Image
In : import requests
In : from io import BytesIO

In : img = 'http://ohscurrent.org/wp-content/uploads/2015/09/domus-01-google.jpg'
# Just a random picture from a search
In : img = requests.get(img)
...: img = Image.open(BytesIO(img.content))
...: text = pytesseract.image_to_string(img)

In : text
Out: 'Google'
One must understand that pytesseract is far from a panacea:
# This time we take a picture from Renthop
In : img = requests.get('https://photos.renthop.com/2/8393298_6acaf11f030217d05f3a5604b9a2f70f.jpg')
...: img = Image.open(BytesIO(img.content))
...: pytesseract.image_to_string(img)
Out: 'Cunveztible to 4}»'
Another case where neural networks will not help is extracting features from meta-information. And EXIF can store a lot of useful things: the camera manufacturer and model, the resolution, the use of the flash, the geo-coordinates of the shot, the software used to process the image, and much more.
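A hedged sketch of reading EXIF with Pillow (the file name is made up; _getexif() is JPEG-specific and returns None when no EXIF data is present):

from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open('photo.jpg')
exif_raw = img._getexif() or {}
# Map numeric EXIF tag ids to human-readable names
exif = {TAGS.get(tag, tag): value for tag, value in exif_raw.items()}
print(exif.get('Model'), exif.get('Flash'), exif.get('GPSInfo'))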
Geodata
Geographic data is not found in problems that often, but it is still useful to master the basic techniques for working with it, especially since there are quite a few ready-made solutions in this area.
Geodata is most often presented in the form of an address or of a "latitude + longitude" pair, i.e. a point. Depending on the task, you may need two mutually inverse operations: geocoding (recovering a point from an address) and reverse geocoding (the other way around). Both are available through external APIs such as Google Maps or OpenStreetMap. Different geocoders have their own characteristics, and the quality varies from region to region. Fortunately, there are universal libraries like geopy that act as wrappers over many external services.
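For illustration, a minimal geopy sketch using the free Nominatim (OpenStreetMap) geocoder; the user_agent string is arbitrary, the address is just an example, and the service rate-limits requests:

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent='feature-engineering-demo')

# Geocoding: address to point
location = geolocator.geocode('175 5th Avenue, New York')
print(location.latitude, location.longitude)

# Reverse geocoding: point to address
print(geolocator.reverse('40.744, -73.948').address)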
If you have a lot of data, you will quickly run into the limits of external APIs. Besides, receiving information over HTTP is not always the fastest solution. Therefore, it is worth keeping in mind the possibility of using a local version of OpenStreetMap.
If you have little data, enough time, and no desire to extract fancy features, you can skip OpenStreetMap and use reverse_geocoder:
In : import reverse_geocoder as revgc
In : revgc.search((df.latitude, df.longitude))
Loading formatted geocoded file...
Out: [OrderedDict([('lat', '40.74482'),
                   ('lon', '-73.94875'),
                   ('name', 'Long Island City'),
                   ('admin1', 'New York'),
                   ('admin2', 'Queens County'),
                   ('cc', 'US')])]
When working with geocoding, do not forget that addresses may contain typos, so it is worth spending time on cleaning. Coordinates usually contain fewer typos, but not everything is fine with them either: GPS, by the very nature of the data, can be noisy, and in some places (tunnels, the quarters of high-rise buildings...) quite noticeably so. If the data source is a mobile device, it is worth considering that in some cases geolocation is determined not by GPS but by the WiFi networks in the area, which leads to holes and teleportation.
The teleportation hypothesis: WiFi location tracking is based on the combination of SSID and MAC address, which can coincide at completely different points (for example, when a federal provider standardizes the firmware of its routers down to the MAC address and places them in different cities). There are also more mundane reasons, such as a company moving to another office together with its routers.
A point is usually located not in the middle of nowhere but among infrastructure; here you can give free rein to your imagination and start inventing features, applying life experience and knowledge of the domain. The proximity of a point to the subway, the number of storeys in the building, the distance to the nearest store, the number of ATMs within a radius: within a single task you can invent dozens of features and extract them from various external sources. For tasks outside the urban infrastructure, features from more specific sources may help: for example, the elevation above sea level.
If two or more points are interconnected, it may be worth extracting features from the route between them. Distances are useful here (it is worth looking both at the great circle distance and at the "honest" distance calculated over the road graph), as well as the number of turns with the left-to-right turn ratio, the number of traffic lights, interchanges, and bridges. In one of my own tasks, for example, a feature I called "road complexity" worked well: the graph-calculated distance divided by the great circle distance.
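For reference, a quick sketch of the great circle (haversine) distance between two (lat, lon) points in kilometers, one of the distances mentioned above; the sample coordinates are arbitrary:

from math import radians, sin, cos, asin, sqrt

def great_circle_distance(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    # Haversine formula
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

print(great_circle_distance(40.7448, -73.9488, 40.7484, -73.9857))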
Date and time
It would seem that working with dates and times should be standardized given how widespread such features are, but pitfalls remain.
Let's start with the day of the week: it is easy to turn into 7 dummy variables with one-hot encoding. In addition, it is useful to highlight a separate feature for the weekend.
df['dow'] = df['created'].apply(lambda x: x.date().weekday())
df['is_weekend'] = df['created'].apply(lambda x: 1 if x.date().weekday() in (5, 6) else 0)
Some tasks may require additional calendar features: for example, cash withdrawals can be tied to pay day, and the purchase of a travel card to the beginning of the month. In general, when working with time-related data, it is worth having at hand a calendar with public holidays, abnormal weather conditions, and other important events.
A professional joke of sorts: what do the Chinese New Year, the New York marathon, the Gay Pride parade, and Trump's inauguration have in common?

- They all need to be added to the calendar of potential anomalies.
With hours (minutes, days of the month...), everything is not as rosy as it might seem. If you use the hour as a real variable, we slightly contradict the nature of the data: 0 < 23, although 02.01 0:00:00 > 01.01 23:00:00. For some tasks this may be critical. If you encode them as categorical variables, you will breed a large number of features and lose information about proximity: the difference between 22 and 23 will be the same as the difference between 22 and 7.
There are also more esoteric approaches to such data, for example, projecting the time onto a circle and using the two resulting coordinates.
def make_harmonic_features(value, period=24):
    value *= 2 * np.pi / period
    return np.cos(value), np.sin(value)
This transformation preserves the distance between points, which is important for some distance-based algorithms (kNN, SVM, k-means...).
In : from scipy.spatial.distance import euclidean
In : euclidean(make_harmonic_features(23), make_harmonic_features(1))
Out: 0.5176380902050424
In : euclidean(make_harmonic_features(9), make_harmonic_features(11))
Out: 0.5176380902050414
In : euclidean(make_harmonic_features(9), make_harmonic_features(21))
Out: 2.0
However, the difference between such encoding methods can usually only be caught in the third decimal place of the metric, and not before.
Time series, web, etc.
Since I never had much fun working with time series, I will leave a link to a library that automatically generates features for time series and move along.
If you work with the web, you usually have information about the user's User Agent. It is a real storehouse of information.
First, the operating system should be extracted from it. Second, make a feature is_mobile. Third, look at the browser.
An example of extracting features from a User Agent:

In : ua = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/56.0.2924.76 Chrome/56.0.2924.76 Safari/537.36'
In : import user_agents
In : ua = user_agents.parse(ua)
In : ua.is_bot
Out: False
In : ua.is_mobile
Out: False
In : ua.is_pc
Out: True
In : ua.os.family
Out: 'Ubuntu'
In : ua.os.version
Out: ()
In : ua.browser.family
Out: 'Chromium'
In : ua.browser.version
Out: (56, 0, 2924)
As in other domains, you can invent your own features based on guesses about the nature of the data. At the time of writing, Chromium 56 was new, and after a while only users who have not rebooted their browser for a long time will keep this version. In that case, why not introduce a feature like "lag behind the latest version of the browser"?
In addition to the OS and the browser, you can look at the referrer (not always available), http_accept_language, and other meta-information.
The next piece of useful information is the IP address, from which you can extract at least the country, and often also the city, the provider, and the connection type (mobile/landline). You need to understand that there is a variety of proxies and outdated databases, so this feature may contain noise. Gurus of network administration may try to extract far more refined features: for example, make assumptions about the use of a VPN. By the way, it is a good idea to combine the IP address data with http_accept_language: if the user is sitting behind a Chilean proxy while the browser locale is ru_RU, something is unclean, and it deserves a one in the corresponding column of the table (is_traveler_or_proxy_user).
In general, there is so much domain-specific stuff in any particular area that it cannot fit into a single head. Therefore, I invite dear readers to share their experience and tell us in the comments about the extraction and generation of features in their work.
Normalization and changing the distribution
A monotonic transformation of features is critical for some algorithms and has no effect on others. By the way, this is one of the reasons for the popularity of decision trees and all the derivative algorithms (random forest, gradient boosting): not everyone can, or wants to, tinker with transformations, and these algorithms are robust to unusual distributions.
There are also purely engineering reasons: np.log can be a way of dealing with large numbers that do not fit into np.float64. But this is an exception rather than a rule; more often the transformation is driven by the desire to adapt the dataset to the requirements of the algorithm. Parametric methods usually require at least a symmetric and unimodal distribution of the data, which real data does not always provide. There may also be stricter requirements (it is appropriate to recall the earlier article about linear models here).

However, requirements on the data are imposed not only by parametric methods: the same K nearest neighbors will predict complete nonsense if the features are not normalized, for example when one distribution sits in the vicinity of zero and never leaves (-1, 1) while another ranges over hundreds of thousands.

A simple example: suppose the task is to predict the cost of an apartment from two variables, the distance from the center and the number of rooms. The number of rooms rarely exceeds 5, while the distance from the center in big cities can easily be measured in tens of thousands of meters.

The simplest transformation is Standard Scaling (also known as Z-score normalization).
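In the usual notation, with the sample mean $\mu$ and standard deviation $\sigma$:

$$z = \frac{x - \mu}{\sigma}$$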
Note that Standard Scaling does not make the distribution normal in the strict sense of the word...
In : from sklearn.preprocessing import StandardScaler
In : from scipy.stats import beta
In : from scipy.stats import shapiro
In : data = beta(1, 10).rvs(1000).reshape(-1, 1)
In : shapiro(data)
Out: (0.8783774375915527, 3.0409122263582326e-27)
# With such a p-value we have to reject the null hypothesis of normality
In : shapiro(StandardScaler().fit_transform(data))
Out: (0.8783774375915527, 3.0409122263582326e-27)
# The p-value is the same: a linear transformation does not change the shape of the distribution
⊠-
In : data = np.array([1, 1, 0, -1, 2, 1, 2, 3, -2, 4, 100]).reshape(-1, 1).astype(np.float64)
In : StandardScaler().fit_transform(data)
Out: array([[-0.31922662],
       [-0.31922662],
       [-0.35434155],
       [-0.38945648],
       [-0.28411169],
       [-0.31922662],
       [-0.28411169],
       [-0.24899676],
       [-0.42457141],
       [-0.21388184],
       [ 3.15715128]])
In : (data - data.mean()) / data.std()
Out: array([[-0.31922662],
       [-0.31922662],
       [-0.35434155],
       [-0.38945648],
       [-0.28411169],
       [-0.31922662],
       [-0.28411169],
       [-0.24899676],
       [-0.42457141],
       [-0.21388184],
       [ 3.15715128]])
Another fairly popular option is MinMax Scaling, which brings all the points into a predetermined interval (typically (0, 1)).
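In formula form:

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$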
In : from sklearn.preprocessing import MinMaxScaler
In : MinMaxScaler().fit_transform(data)
Out: array([[ 0.02941176],
       [ 0.02941176],
       [ 0.01960784],
       [ 0.00980392],
       [ 0.03921569],
       [ 0.02941176],
       [ 0.03921569],
       [ 0.04901961],
       [ 0.        ],
       [ 0.05882353],
       [ 1.        ]])
In : (data - data.min()) / (data.max() - data.min())
Out: array([[ 0.02941176],
       [ 0.02941176],
       [ 0.01960784],
       [ 0.00980392],
       [ 0.03921569],
       [ 0.02941176],
       [ 0.03921569],
       [ 0.04901961],
       [ 0.        ],
       [ 0.05882353],
       [ 1.        ]])
Standard Scaling and MinMax Scaling have similar areas of applicability and are often more or less interchangeable. However, if the algorithm involves calculating distances between points or vectors, the default choice is Standard Scaling. MinMax Scaling is useful for visualization, for example to bring features into the interval (0, 255).
If we assume that some data is not normally distributed but is described by the log-normal distribution, it can easily be brought to a normal distribution:
In : from scipy.stats import lognorm
In : data = lognorm(s=1).rvs(1000)
In : shapiro(data)
Out: (0.05714237689971924, 0.0)
In : shapiro(np.log(data))
Out: (0.9980740547180176, 0.3150389492511749)
The lognormal distribution is suitable for describing salaries, the prices of securities, the population of cities, the number of comments on articles on the internet, and so on. However, the underlying distribution does not necessarily have to be lognormal: you can try to apply this transformation to any distribution with a heavy right tail. Moreover, you can try to use other similar transformations, guided by your own hypotheses about how to bring the available distribution closer to the normal one. Examples of such transformations are the Box-Cox transformation (the logarithm is a special case of it) and the Yeo-Johnson transformation (which extends the range of applicability to negative numbers); you can also try simply adding a constant to the feature, np.log(x + const).
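A quick sketch of the Box-Cox transformation mentioned above, via scipy (it requires strictly positive input):

import numpy as np
from scipy.stats import boxcox, lognorm, shapiro

data = lognorm(s=1).rvs(1000)
transformed, lmbda = boxcox(data)  # lambda is fitted by maximum likelihood
print(lmbda)                       # close to 0, where Box-Cox reduces to the log
print(shapiro(transformed))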
In the examples above, we worked with synthetic data and strictly tested normality using the Shapiro-Wilk test. Let's try to look at some real data and test normality with a less formal method, the QQ plot. For a normal distribution it will look like a smooth diagonal line, and visual anomalies are intuitively understandable.
Let's draw QQ plots!

In : import statsmodels.api as sm

# Take the price feature from the Renthop data and filter out the most extreme values
In : price = df.price[(df.price <= 20000) & (df.price > 500)]
In : price_log = np.log(price)
# A few gestures so that sklearn does not shower us with warnings
In : price_mm = MinMaxScaler().fit_transform(price.values.reshape(-1, 1).astype(np.float64)).flatten()
In : price_z = StandardScaler().fit_transform(price.values.reshape(-1, 1).astype(np.float64)).flatten()

In : sm.qqplot(price_log, loc=price_log.mean(), scale=price_log.std()).savefig('qq_price_log.png')
In : sm.qqplot(price_mm, loc=price_mm.mean(), scale=price_mm.std()).savefig('qq_price_mm.png')
In : sm.qqplot(price_z, loc=price_z.mean(), scale=price_z.std()).savefig('qq_price_z.png')
QQ plot of the initial feature.

QQ plot after StandardScaler: the shape does not change.

QQ plot after MinMaxScaler: the shape does not change either.

QQ plot after taking the logarithm: things are getting better!
Now let's see whether the transformations help the real model in any way. There is no silver bullet here, but I prepared a small script that reads the Renthop data, builds some features (and drops the rest for simplicity), and returns data ready for training.
In : from demo import get_data
In : x_data, y_data = get_data()
In : x_data.head(5)
Out:
        bathrooms  bedrooms     price  dishwasher  doorman  pets  \
10            1.5         3  8.006368           0        0     0
10000         1.0         2  8.606119           0        1     1
100004        1.0         1  7.955074           1        0     1
100007        1.0         1  8.094073           0        0     0
100013        1.0         4  8.116716           0        0     0

        air_conditioning  parking  balcony  bike  ...  stainless  \
10                     0        0        0     0  ...          0
10000                  0        0        0     0  ...          0
100004                 0        0        0     0  ...          0
100007                 0        0        0     0  ...          0
100013                 0        0        0     0  ...          0

        simplex  public  num_photos  num_features  listing_age  room_dif  \
10            0       0           5             0          278       1.5
10000         0       0          11            57          290       1.0
100004        0       0           8            72          346       0.0
100007        0       0           3            22          345       0.0
100013        0       0           3             7          335       3.0

        room_sum  price_per_room  bedrooms_share
10           4.5      666.666667        0.666667
10000        3.0     1821.666667        0.666667
100004       2.0     1425.000000        0.500000
100007       2.0     1637.500000        0.500000
100013       5.0      670.000000        0.800000

[5 rows x 46 columns]

In : x_data = x_data.values
In : from sklearn.linear_model import LogisticRegression
In : from sklearn.ensemble import RandomForestClassifier
In : from sklearn.model_selection import cross_val_score
In : from sklearn.feature_selection import SelectFromModel

In : cross_val_score(LogisticRegression(), x_data, y_data, scoring='neg_log_loss').mean()
/home/arseny/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/linear_model/base.py:352: RuntimeWarning: overflow encountered in exp
  np.exp(prob, prob)
# Ouch, the data is not scaled yet, hence the overflow warning
Out: -0.68715971821885724

In : from sklearn.preprocessing import StandardScaler
In : cross_val_score(LogisticRegression(), StandardScaler().fit_transform(x_data), y_data, scoring='neg_log_loss').mean()
/home/arseny/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/linear_model/base.py:352: RuntimeWarning: overflow encountered in exp
  np.exp(prob, prob)
Out: -0.66985167834479187
# Wow, it got better!

In : from sklearn.preprocessing import MinMaxScaler
In : cross_val_score(LogisticRegression(), MinMaxScaler().fit_transform(x_data), y_data, scoring='neg_log_loss').mean()
Out: -0.68522489913898188
# And this time a bit worse :(
Interactions
In the previous examples, we worked with transformations that change the distribution of a single feature; sometimes, however, it is more useful to combine features. Let's return to Two Sigma Connect: Rental Listing Inquiries. Among the features in this problem are the number of rooms and the price. Worldly wisdom suggests that the cost per room is more indicative than the total cost, so we can generate such a feature ourselves.
rooms = df["bedrooms"].apply(lambda x: max(x, .5))  # Avoid division by zero; .5 is chosen more or less arbitrarily
df["price_per_bedroom"] = df["price"] / rooms
There are no limits in this matter: you can add features up, multiply them, divide one by another, and so on. If you do not know what exactly to combine, there is a head-on approach: multiply all the features pairwise. For linear models, this trick and the quadratic growth of the feature space it implies are discussed in the article about linear models ( https://habrahabr.ru/company/ods/blog/322076/ ) (see sklearn.preprocessing.PolynomialFeatures).
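A quick sketch of that pairwise multiplication with scikit-learn (the toy matrix is made up):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(6).reshape(3, 2)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
# Columns of the result: x1, x2, x1*x2
print(poly.fit_transform(x))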
" ", . , , . python : pandas.DataFrame.fillna
sklearn.preprocessing.Imputer
.
These solutions have no magic happening behind the scenes. The approaches to handling missing values are fairly straightforward:

- encode the gaps with a separate blank value like "n/a" (for categorical variables);
- use the most probable value of the feature (the mean or median for numerical variables, the most frequent value for categorical ones);
- or, on the contrary, encode with some extreme value (this suits decision-tree models well, since it allows the model to put the gaps into a separate partition);
- for ordered data (for example, time series), take the adjacent value, the next or the previous one.

Easy-to-use library solutions sometimes suggest sticking to something like df = df.fillna(0) and not bothering about the gaps. But this is not the smartest solution: data preparation usually takes more time than building a model, so thoughtless gap-filling can hide a bug in the processing and damage the model.
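A hedged illustration of those two one-liners; the toy frame and the column name are invented for the example (sklearn.preprocessing.Imputer is the imputation class contemporary to this article):

import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer

df_toy = pd.DataFrame({'rooms': [1.0, 2.0, np.nan, 4.0]})

# The pandas way: fill with the median of the column
filled_pd = df_toy['rooms'].fillna(df_toy['rooms'].median())

# The sklearn way: the same median strategy as a reusable transformer
filled_sk = Imputer(strategy='median').fit_transform(df_toy)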
Feature selection
Why might it even be necessary to select features? To some, this idea may seem counterintuitive, but in fact there are at least two important reasons to get rid of unimportant features. The first is clear to every engineer: the more data, the higher the computational complexity. While we play around with toy datasets the size of the data is not a problem, but for loaded production systems hundreds of extra features can be quite tangible. The second reason is that some algorithms take noise (uninformative features) for a signal and overfit.
Statistical approaches

The most obvious candidate for removal is a feature whose value remains unchanged, i.e. one that contains no information at all. If we build on this thought, it is reasonable to assume that features with low variance are worse than features with high variance. So one can arrive at the idea of cutting off features whose variance is below a certain threshold.
In : from sklearn.feature_selection import VarianceThreshold
In : from sklearn.datasets import make_classification
In : x_data_generated, y_data_generated = make_classification()
In : x_data_generated.shape
Out: (100, 20)
In : VarianceThreshold(.7).fit_transform(x_data_generated).shape
Out: (100, 19)
In : VarianceThreshold(.8).fit_transform(x_data_generated).shape
Out: (100, 18)
In : VarianceThreshold(.9).fit_transform(x_data_generated).shape
Out: (100, 15)
There are other ways as well, also based on classical statistics.
In : from sklearn.feature_selection import SelectKBest, f_classif
In : x_data_kbest = SelectKBest(f_classif, k=5).fit_transform(x_data_generated, y_data_generated)
In : x_data_varth = VarianceThreshold(.9).fit_transform(x_data_generated)
In : from sklearn.linear_model import LogisticRegression
In : from sklearn.model_selection import cross_val_score
In : cross_val_score(LogisticRegression(), x_data_generated, y_data_generated, scoring='neg_log_loss').mean()
Out: -0.45367136377981693
In : cross_val_score(LogisticRegression(), x_data_kbest, y_data_generated, scoring='neg_log_loss').mean()
Out: -0.35775228616521798
In : cross_val_score(LogisticRegression(), x_data_varth, y_data_generated, scoring='neg_log_loss').mean()
Out: -0.44033042718359772
These methods are good when there is a lot of data, little time, and thousands of features, but they have a significant drawback: they do not take the model into account.
Selection by modeling

Another approach: use some baseline model to evaluate the features, as long as the model clearly shows the importance of the features it uses. Two types of models are usually applied: some "wooden" composition (for example, Random Forest) or a linear model with Lasso regularization, which tends to nullify the weights of weak features. The logic is intuitive: if the features are plainly useless in a simple model, there is no need to drag them into a more complex one.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

x_data_generated, y_data_generated = make_classification()

pipe = make_pipeline(SelectFromModel(estimator=RandomForestClassifier()),
                     LogisticRegression())

lr = LogisticRegression()
rf = RandomForestClassifier()

print(cross_val_score(lr, x_data_generated, y_data_generated, scoring='neg_log_loss').mean())
print(cross_val_score(rf, x_data_generated, y_data_generated, scoring='neg_log_loss').mean())
print(cross_val_score(pipe, x_data_generated, y_data_generated, scoring='neg_log_loss').mean())

-0.184853179322
-0.235652626736
-0.158372952933
We must not forget that this is not a silver bullet either; it can even make things worse.
Let's try on the Renthop data.

x_data, y_data = get_data()
x_data = x_data.values

pipe1 = make_pipeline(StandardScaler(),
                      SelectFromModel(estimator=RandomForestClassifier()),
                      LogisticRegression())

pipe2 = make_pipeline(StandardScaler(),
                      LogisticRegression())

rf = RandomForestClassifier()

print('LR + selection: ', cross_val_score(pipe1, x_data, y_data, scoring='neg_log_loss').mean())
print('LR: ', cross_val_score(pipe2, x_data, y_data, scoring='neg_log_loss').mean())
print('RF: ', cross_val_score(rf, x_data, y_data, scoring='neg_log_loss').mean())

LR + selection:  -0.714208124619
LR:  -0.669572736183
Grid search

Finally, we come to the most reliable but also the most computationally expensive method: trivial brute-force search. Train a model on a subset of features, remember the result, repeat for different subsets, and compare the quality of the models to identify the best feature set. This approach is called Exhaustive Feature Selection.
Searching through all the combinations usually takes too long, so you can try to reduce the search space. Fix a small number N, iterate over all combinations of N features, choose the best combination, and then iterate over combinations of N+1 features so that the previous best combination is fixed and only one new feature is considered. Repeat until the maximum number of features is reached. This algorithm is called Sequential Feature Selection.
The algorithm can be reversed: start with the complete feature space and remove features one by one until it does not hurt the quality of the model, or until the desired number of features is reached.
Be ready to wait!

In : from mlxtend.feature_selection import SequentialFeatureSelector
In : selector = SequentialFeatureSelector(LogisticRegression(), scoring='neg_log_loss',
                                          verbose=2, k_features=3, forward=False, n_jobs=-1)
# x_data_scaled is the standardized Renthop feature matrix from above
In : selector.fit(x_data_scaled, y_data)
[2017-03-30 01:42:24] Features: 45/3
Homework #6

This time we suggest practicing feature engineering and feature selection.

The assignment is based on a dataset from the UCI repository; as usual, fill in the missing code in the Jupyter notebook template and follow the instructions in it.