
I came across an interactive lastfm map and decided I absolutely had to build a similar project for movies. Below the cut is the story of how to collect data and build a graph, using Kinopoisk and imdb data as the example, and how to turn it into your own interactive map. We will look at the Scrapy scraping framework, go over ways to visualize large graphs, and sort out the tools for showing large graphs interactively in the browser.
1. Data collection: Scrapy
I chose Kinopoisk as the data source. It later turned out to be too small, so I took IMDb as well. To build the graph, we need to know the list of recommended movies for each movie. A quick search turns up plenty of Kinopoisk parsers and all kinds of unofficial APIs, but none of them offers a way to get recommendations. IMDb shares its datasets openly, but they contain no recommendations either. So there is only one option left: write a spider.
There are already several articles on Habr about scraping, so I will skip the overview of possible approaches. In short: if you write in Python and do not want to write your own framework, use Scrapy. Almost everything you might need is already provided.
Scrapy really is a very powerful and at the same time very simple tool. The barrier to entry is low, yet Scrapy scales easily to projects of any size and complexity. It includes practically everything you need: from tools for making HTTP requests, parsing the response, and processing and saving the extracted items, to project management, including ways to get around blocking and to pause and resume a crawl.
A project starts with scrapy startproject mycoolproject, which gives you a ready-made structure with templates for the required elements and files in a minimal working configuration. To turn it into a working project, you only need to describe how to parse a page, that is, write a spider and put it in the project's spiders folder, and describe the information to extract by inheriting a class from scrapy.Item in items.py. This way a fully working project can be put together in under an hour. There are built-in tools for saving results, for example to csv or json, but unless it is a five-minute project I recommend using an external database. The behavior related to processing results, including saving them, is specified in pipelines.py. The last important file is settings.py, whose purpose is clear from its name. Here you can configure the project: proxy usage, delays between requests, and so on.
So, step by step:
- Read up on how to create a Scrapy project in the articles (one, two) and in the documentation. By analogy, create your own class for the items.
items.py for Kinopoisk:

    import scrapy


    class MovieItem(scrapy.Item):
        '''Movie scraped info'''
        movie_id = scrapy.Field()
        name = scrapy.Field()
        like = scrapy.Field()
        genre = scrapy.Field()
        date = scrapy.Field()
        country = scrapy.Field()
        director = scrapy.Field()
- Find the elements you need on the page and get their xpath. This can be done in Chrome, for example: right-click the element, choose Inspect, then right-click the element in the code and look for Copy -> XPath. For debugging, you can run the Scrapy shell and pass it a page URL:

    scrapy shell https://www.kinopoisk.ru/film/518214/

  It instantiates a response object from which you can extract the elements you need. Like this:

    $ scrapy shell https://www.kinopoisk.ru/film/sakhar-i-korica-1915-201125/
    >>> response.xpath('//span[@itemprop="director"]/a/span/text()').extract_first()
    ' '
- Set up the processing of the scraped items. The records are very simple, so I decided to store them in an sqlite database.
- To check that everything works, launch the spider with the CLOSESPIDER_PAGECOUNT=5 setting, which limits the number of requests.
- Into battle! Create a directory where the intermediate results will be stored, for example crawls1, and start the spider with scrapy crawl myspider -s JOBDIR=crawls1. If something goes wrong, the spider can be restarted from the point where it stopped. See the relevant section of the documentation.
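The storage step from the list above could look roughly like the sketch below. It is not the author's pipeline: the table layout, the default file name movies.db and the choice to store the like field as JSON text are all assumptions made for illustration. A Scrapy item pipeline is a plain Python class with open_spider, close_spider and process_item methods, so the sketch needs nothing beyond the standard library.

```python
import json
import sqlite3


class SqlitePipeline:
    """Item pipeline sketch: store scraped movies in a local SQLite file."""

    def __init__(self, db_path="movies.db"):  # hypothetical file name
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called by Scrapy once, when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS movies ("
            "movie_id INTEGER PRIMARY KEY, name TEXT, "
            "director TEXT, recommendations TEXT)"
        )

    def close_spider(self, spider):
        # Called by Scrapy once, when the spider finishes.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        row = dict(item)
        # Assumption: the `like` field holds the recommended movie ids.
        self.conn.execute(
            "INSERT OR REPLACE INTO movies VALUES (?, ?, ?, ?)",
            (
                row.get("movie_id"),
                row.get("name"),
                row.get("director"),
                json.dumps(row.get("like", [])),
            ),
        )
        return item
```

To switch it on, the class would be listed in ITEM_PIPELINES in settings.py, for example {'mycoolproject.pipelines.SqlitePipeline': 300}, where the module path depends on your project name.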
1.1 Getting around the request limit
Kinopoisk banned me while I was still debugging the spider, when I was sending batches of 5 requests every 5 seconds with a 1-second timeout. There are many ways to get around the limit. For Scrapy it is easy to google ready-made examples of using Tor, of picking a random proxy from a list, or of connecting to a paid rotating-proxy service. Even though this is only a "weekend project", I chose a paid rotating proxy as the fastest option to implement. How it works: you connect to a fixed IP:port of the proxy provider and get a new outgoing IP for every request. On the Scrapy side you need to add one line to the project's settings.py file and pass the ip:port pair as a parameter with every request.
In code it looks like this. Find the appropriate section in settings.py and add the line there:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 543,
    }

Then, for every request in the spider:
scrapy.Request(url=url, callback=self.parse, meta={'proxy':'http:
The author of the project that inspired me started from Nightwish and walked the lastfm recommendations breadth-first, like a tree, which is how he got a connected graph. My approach was similar. On Kinopoisk a movie can be fetched by id, which apparently is simply the movie's serial number on the site. Just going through all the ids does not work well: most movies have no recommendations, so they would become isolated points that turn into noise on the map. Besides, the ids of recent movies are on the order of 500,000, which is far too many for a visual rendering. So we start from the top 250 list and iteratively follow each movie's list of recommendations.
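The traversal described above is an ordinary breadth-first search. Here is a minimal sketch of the idea outside Scrapy (inside Scrapy, the scheduler plays the role of the queue). The get_recommendations callable and the max_movies cap are hypothetical names introduced for illustration; in the real spider the recommendations come from parsing each movie page.

```python
from collections import deque


def crawl_breadth_first(seed_ids, get_recommendations, max_movies=None):
    """Visit movies breadth-first; return {movie_id: [recommended ids]}."""
    graph = {}
    queue = deque(seed_ids)
    seen = set(seed_ids)
    while queue:
        if max_movies is not None and len(graph) >= max_movies:
            break
        movie_id = queue.popleft()
        recommended = list(get_recommendations(movie_id))
        graph[movie_id] = recommended
        for rec_id in recommended:
            # Queue each movie only once, the first time it is seen.
            if rec_id not in seen:
                seen.add(rec_id)
                queue.append(rec_id)
    return graph


# Usage with a toy in-memory "site" instead of real HTTP requests.
toy_site = {1: [2, 3], 2: [3], 3: [1, 4], 4: []}
toy_graph = crawl_breadth_first([1], lambda m: toy_site.get(m, []))
```

Because only movies reachable from the seed list are visited, the result is connected to the seeds by construction, and movies without any incoming recommendation never show up as isolated points.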
I expected to get about 100,000 movies, but by the end of the night it turned out that the spider had stopped at about 12,600. That is where Kinopoisk's recommendations run out. As I said at the beginning, I turned to IMDb for more data. Scraping IMDb is even easier. It took a few hours to rework the finished project, and the new spider was ready to launch. After 2-3 days of crawling (at about 8 requests per second) the spider stopped, having collected 173+ thousand movies. The full code of the spiders is on github: Kinopoisk and IMDb.
2. Visualization
On the one hand, there is a whole zoo of graph visualization tools. On the other hand, as soon as it comes to very large graphs, this zoo suddenly vanishes. For such cases I picked two tools for myself: sfdp from graphviz, and gephi. sfdp is a CLI utility with a wide range of parameters that can draw graphs of a million nodes, but it is not the most convenient tool for our case, where we need to control the layout process. For cases like ours Gephi is great: it is an application with a graphical interface and layouts for almost every taste.
The graph data is exported with a simple python script. I usually use the dot format because it is very simple, what is called "human-readable". The format was originally intended for graphviz, but it is now supported by many other graph applications.
Format description
First we write the header digraph kinopoisk {\n, and we must not forget the closing brace } at the end of the file. Each line describes one edge of the graph, node1 -> node2;, and the list is then closed. The format is described in the official docs, and there is a simple example on Wikipedia.
Sample file with explanations:

    digraph sample {
        1 -> 2;
        1 -> 3;
        5 -> 4 [weight="5"];
        4 [shape="circle"];
    }
digraph means that we are declaring a directed graph. If direction is not needed, simply write graph. sample is the name of the graph (optional). Each line declares an edge or a vertex. If the edges are undirected, write -- instead of ->. Square brackets hold the parameters of an edge or a vertex. In this example the edge between nodes 5 and 4 gets a weight of 5, and vertex 4 gets a circle shape. Vertex names do not have to be numbers; they can also be strings. For more examples and parameters, see the documentation. In our case the features above are enough.
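The article does not show the export script itself, so here is a minimal sketch of what writing an edge list in this format might look like. The function name and the decision to always quote node names (so that string names with spaces stay valid) are assumptions, not the author's code.

```python
def edges_to_dot(edges, graph_name="kinopoisk"):
    """Render (source, target) pairs as a directed graph in dot format."""
    def quote(name):
        # Quote every node id and escape embedded double quotes.
        return '"' + str(name).replace('"', '\\"') + '"'

    lines = [f"digraph {graph_name} {{"]
    for source, target in edges:
        lines.append(f"    {quote(source)} -> {quote(target)};")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Usage: in the real project the pairs would come from the scraped
# recommendations; here they are a hard-coded toy list.
dot_text = edges_to_dot([(1, 2), (1, 3)], graph_name="sample")
```

The resulting text can be written to a .dot file and opened in Gephi or passed to sfdp.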
2.1 Comparing the layouts
For large graphs gephi has two reasonable options: OpenOrd and ForceAtlas 2. OpenOrd is a very fast approximate algorithm with few configurable parameters. ForceAtlas, like other classic force-directed algorithms, gives a more accurate result and is very flexible to tune, but you pay for that with time. Below is an example of both algorithms at work on a graph that represents a grid.


Model graph: a grid. Left: OpenOrd, right: ForceAtlas.
You might think that if there is time to wait for a more accurate result, OpenOrd should not be used at all. In practice, it is not unusual for ForceAtlas to pull all the nodes of a graph into one tight lump while OpenOrd shows at least some structure.
To speed things up, I used OpenOrd as the initial approximation and then refined the layout with ForceAtlas. For the picture to be at least somewhat legible, the nodes must not overlap each other. For this it is convenient to use the Yifan Hu layout, which pushes the clusters apart a little, and then Noverlap, which removes the overlaps completely. Removing the overlaps in the Kinopoisk graph took one night, while for imdb a whole weekend was not enough.


3. Exporting the result: interactive maps
Gephi can export the picture to many formats, such as svg and png. But with big data come big difficulties. A pretty picture alone is not enough: we want to see the names of the movies and how they are connected. If you draw the node labels, with this many nodes you get a completely unreadable cloud of letters. You could export to SVG and zoom in until something becomes visible, or draw only the most important labels. But there is a better option, and I decided to focus on it: making an interactive map.

An overview of the tools:
sigma.js
The first option is the simplest and at the same time one of the most intuitive: a gephi plugin that exports to a sigma.js template. It is the one shown in the gif above. After you install the plugin from the gephi menu, a new export item appears in the File menu. Fill in the form, export, and you get a ready-made working visualization. Simple and powerful. The result can be seen here. The drawback: on a large graph the browser can barely cope.
gexf-js
The next option is even simpler than the previous one and, on the whole, very similar to it. With gexf-js you just export the project from gephi in gexf format and put it in the template folder. Done. The drawbacks are exactly the same as in the previous case. On top of that, with sigmajs I could at least view the imdb graph locally, whereas with gexf-js it would not start at all.
openseadragon
If you need to display a very large image, there is openseadragon. The principle is exactly the same as when rendering geographic maps: on zoom, new tiles are loaded that match the current magnification and the visible area. This is exactly what the author of the project that inspired me used. There is one drawback: minimal interactivity. You cannot select a node, and it is hard to see where the edges go. You cannot "peek under" overlapping nodes and edges.
shinglejs
But what if we made something like a mix of the previous options, so that on zoom it is pieces of the graph rather than images that get loaded, with interactive nodes and edges as in the first case? A ready-made solution turned up, literally by a miracle: shinglejs.
Pros: it can render a very (very!) large graph in the browser while keeping it interactive.
Cons: not as pretty as sigmajs; preparing the data is not trivial.
Screenshot of the graph rendered with shinglejs
Not as pretty, but very quick
To visualize the imdb graph I chose the last option. To be honest, there was not much of a choice. You can see the result here; below is a little about how to prepare the data for such a visualization.
Exporting data to shinglejs:
As I said, exporting the data in this last case is not so simple, so here is an example of how to get a graph out of gephi and into shinglejs.
- Export the graph from gephi in gdf format. This is probably the only simple way to get the node coordinates as a table. The file is structured as follows: first comes a table describing the nodes, then a table describing the edges.
- Read the file and extract the descriptions of the vertices and edges from it. I did this with pandas and then split the dataframe in two: edges and nodes. See also: working with pandas.
- Rename the columns as the Shinglejs docs require and export to json. Shinglejs does not support exporting colors directly, but you can assign a "community" to each node and then color the communities. So the movie rating is exported as the community tag.
- Do not forget to specify the list of community colors in the source of the main page. To color a node, the element of the color list is used whose index is computed as follows: id %
- Glue the files into one. I do this with bash:

    cat start imdbnodes.json middle imdbedges.json end > imdbdata.json

  Beforehand I created the files start, middle and end, containing {"nodes": and , "relations": and } respectively.
- Then build the site, following the instructions from the repository.
- Do not forget to create bitmap images and put them in the graph data folder. The author of the project does not seem to mention this detail, but by default the site tries to load image_2400.jpg and image_1200.jpg, and these are not produced by npm when the default project is built.
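The reading and gluing steps above can also be done in one go. The sketch below is an alternative to the pandas and bash route, not the author's script: it parses GDF text with the standard csv module and writes the {"nodes": ..., "relations": ...} wrapper directly. Renaming the columns to whatever the Shinglejs docs require is deliberately left out, since those names are not given in the article; all values are kept as strings.

```python
import csv
import io
import json


def parse_gdf(text):
    """Split GDF text into (nodes, edges), each a list of dicts."""
    nodes, edges = [], []
    current, columns = None, []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        head = row[0]
        if head.startswith("nodedef>") or head.startswith("edgedef>"):
            current = nodes if head.startswith("nodedef>") else edges
            row[0] = head.split(">", 1)[1]
            # "x DOUBLE" -> "x": keep the column name, drop the type.
            columns = [cell.split()[0] for cell in row]
            continue
        if current is not None:
            current.append(dict(zip(columns, row)))
    return nodes, edges


def to_graph_json(nodes, edges):
    """Wrap nodes and edges the same way the start/middle/end files do."""
    return json.dumps({"nodes": nodes, "relations": edges})


# Usage with a tiny hand-written GDF sample.
sample = (
    "nodedef>name VARCHAR,label VARCHAR,x DOUBLE,y DOUBLE\n"
    "1,First,0.5,1.5\n"
    "2,Second,2.0,3.0\n"
    "edgedef>node1 VARCHAR,node2 VARCHAR\n"
    "1,2\n"
)
sample_nodes, sample_edges = parse_gdf(sample)
sample_json = to_graph_json(sample_nodes, sample_edges)
```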
4. Interesting observations
The lastfm graph shows clear clustering by the bands' country of origin, such as Japanese pop and rock or Greek metal. Exactly the same thing happens with movies: Korean, Turkish, Japanese and Brazilian films are very clearly separated. In imdb a large cluster of animated films stands out, far away from the main mass. In both graphs the cluster of comic-book superhero movies is very dense. It may seem obvious, but it was still unexpected, that bad movies gather in one big cloud. There are separate clusters of music videos, children's YouTube blogs, and fan films about the Harry Potter universe.
5. What else can be done with this data
I am sure readers can come up with and carry out many more interesting projects on the collected data. Here are the ideas that came to my mind right away.
- Cluster the graph and compile collections of movies. DBSCAN worked out almost from the first run (an example follows).
- Build your own recommender system.
- Collect interesting statistics about movies in general.
- And, of course, expand your own list of films to watch.
5.1 DBSCAN
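There are many dedicated methods for clustering graphs, and together they deserve a separate article. As an experiment, I used a method that is not designed for graphs. The reasoning is this: since the layout visually breaks the graph into clouds of similar films, DBSCAN can be used to find the regions where films sit especially close together. Without digging too deep, let's see what this method does. The name DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise, that is, the method merges points that lie sufficiently close to each other. This is formalized through two main hyperparameters: the radius within which we look for the neighbors of each point, and the minimum number of neighbors.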
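To make the two hyperparameters concrete, here is a toy pure-Python sketch of the idea. It is not the scikit-learn implementation used later in the article, and its quadratic neighbor search only suits small inputs. As in scikit-learn, a point counts as its own neighbor, and noise points get the label -1.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Indices of all points within radius eps of point i (i included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # j is a core point: expand
                queue.extend(reachable)
    return labels


# Usage: two tight groups of three points and one far-away outlier.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
toy_labels = dbscan(toy_points, eps=1.5, min_samples=3)
```

A larger radius merges neighboring clouds into one cluster, while a larger minimum number of neighbors turns sparse regions into noise.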
1. Getting the coordinates.
To do this, export the graph from gephi in gdf format. Read the file as csv with pandas:

    import pandas as pd

    data = pd.read_csv('./kinopoisk.gdf')
Let's draw what it looks like:

    import matplotlib.pyplot as plt

    plt.figure(figsize=(7, 7))
    plt.scatter(data['x DOUBLE'].values, data['y DOUBLE'].values, marker='.', alpha=0.3);

Well, DBSCAN should be able to cope with this.
2. Clustering.
Pick the parameters and look at the distribution of cluster sizes. No serious work was planned, so I judged the quality "by eye".

    import matplotlib.pyplot as plt
    from sklearn.cluster import DBSCAN

    # data_nodes: the node rows of the GDF table (the rows that have coordinates)
    coords = data_nodes[['x DOUBLE', 'y DOUBLE']].values
    dbscan = DBSCAN(eps=70, min_samples=5, leaf_size=30, n_jobs=-1)
    labels = dbscan.fit_predict(coords)
    plt.hist(labels, bins=50);

Distribution of cluster sizes
Let's color the points by cluster and see how plausible the result looks.

    plt.figure(figsize=(8, 8))
    # One scatter call per cluster label, so each cluster gets its own color
    for l in set(labels):
        coordsm = coords[labels == l]
        plt.scatter(coordsm[:,0], coordsm[:,1], marker='.', alpha=0.3);

Looks like what we wanted.
Let's try to get the movie clusters as lists. I did not bother to write a convenient way of getting the lists, so this time there is no code. Below is the list of movies that ended up in the same cluster as What We Do in the Shadows. Not bad, in my opinion.
Table of movies

movie_id | name | date | genre | country | director |
---|---|---|---|---|---|
271695 | 3rd Rock from the Sun | 1996-01-09 | sci-fi | USA | Terry Hughes |
663135 | The Neighbors | 2012-09-26 | comedy | USA | Chris Koch |
277375 | Aliens | 1997-11-07 | animation | France | Jim Gomez |
81845 | Sweeney Todd: The Demon Barber of Fleet Street | 2007-12-03 | musical | USA | Tim Burton |
445196 | Burke & Hare | 2010-10-29 | thriller | UK | John Landis |
271878 | The Red Inn | 2007-12-05 | comedy | France | Gérard Krawczyk |
3609 | Plunkett & Macleane | 1999-01-22 | action | UK | Jake Scott |
183497 | Burke & Hare | 1972-02-03 | horror | UK | Vernon Sewell |
3482 | The Doctor and the Devils | 1985-10-04 | horror | UK | Freddie Francis |
2528 | Addams Family Values | 1993-11-19 | fantasy | USA | Barry Sonnenfeld |
503578 | Lullaby | 2010-02-12 | fantasy | Poland | Juliusz Machulski |
87404 | The Red Inn | 1951-10-19 | comedy | France | Claude Autant-Lara |
5293 | The Addams Family | 1991-11-22 | fantasy | USA | Barry Sonnenfeld |
18089 | The Body Snatcher | 1945-02-16 | horror | USA | Robert Wise |
271846 | I Sell the Dead | 2008-10-10 | horror | USA | Glenn McQuaid |
272111 | Just Buried | 2007-09-09 | drama | Canada | Chaz Thorne |
34186 | Elvira: Mistress of the Dark | 1988-09-30 | comedy | USA | James Signorelli |
818981 | What We Do in the Shadows | 2014-01-19 | comedy | New Zealand | Jemaine Clement |
8421 | Edward Scissorhands | 1990-12-06 | fantasy | USA | Tim Burton |
5622 | Sleepy Hollow | 1999-11-17 | horror | USA | Tim Burton |
2389 | Beetlejuice | 1988-03-29 | fantasy | USA | Tim Burton |
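The article gives no code for pulling out such a list. As a rough sketch, assuming you have the movie ids in the same order as the labels array returned by DBSCAN, the members of one movie's cluster could be collected like this (both function names are made up for illustration):

```python
from collections import defaultdict


def movies_by_cluster(movie_ids, labels):
    """Group movie ids by cluster label, skipping noise."""
    clusters = defaultdict(list)
    for movie_id, label in zip(movie_ids, labels):
        if label != -1:  # DBSCAN marks noise points with -1
            clusters[label].append(movie_id)
    return dict(clusters)


def cluster_of(movie_id, movie_ids, labels):
    """Return all ids sharing a cluster with movie_id, or [] if it is noise."""
    label = dict(zip(movie_ids, labels)).get(movie_id, -1)
    return movies_by_cluster(movie_ids, labels).get(label, [])


# Usage with toy ids and labels; 818981 is the id from the table above.
toy_ids = [818981, 2389, 5622, 12345]
toy_labels = [0, 0, -1, 1]
same_cluster = cluster_of(818981, toy_ids, toy_labels)
```

The resulting ids can then be joined with the scraped table to get names, dates and directors.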
I find this approach interesting because it yields a list of similar movies that are not necessarily linked by direct recommendations and that may not even be reachable in a few steps when walking the graph. In other words, it lets you find movies that you would be unlikely to reach just by clicking through "similar movies" on the site.
P.S.
Thanks to all my friends who were ready to answer my questions. To all movie fans: new discoveries, and to data scientists: high-quality data!