The New Year holidays are a wonderful excuse to procrastinate in cozy home surroundings and reminisce about the memes of 2k17 that are dear to our hearts and are now leaving forever, like the conscience of Electronic Arts.

However, even a conscience generously seasoned with salads occasionally woke up and demanded that we pull ourselves together and do at least something useful. So we combined business with pleasure and, using our favorite memes as an example, looked at how to parse ourselves a small database while bypassing all the blocks, traps and restrictions that the server sets up along the way. Everyone interested is welcome to read on.
Machine learning, econometrics, statistics and many other data sciences are all about searching for patterns. Every day valiant analyst-inquisitors torture nature by various methods and extract from it information about how the great data-generating process that created our universe works. The Spanish inquisitors used the physical body of their victim directly in their daily work. Nature, however, is omnipresent and has no definite physical form. Because of this, the profession of the modern inquisitor has a strange peculiarity: nature is tortured through the analysis of data, and that data has to come from somewhere. Usually peaceful gatherers bring the data to the inquisitors. This article is meant to lift the veil of secrecy a little on where data comes from and how you can gather some yourself.
Our motto is the famous phrase of Captain Jack Sparrow: "Take what you can, give nothing back." Sometimes we will have to use rather gangster-like methods to collect memes. Nevertheless, we will remain peaceful data gatherers and will by no means become gangsters. We will take the memes from the main meme vault.
1. Breaking into the meme vault
1.1. What we want to get
So, we want to parse knowyourmeme.com and get a pile of different variables:
- Name - the name of the meme
- Origin_year - the year it was created
- Views - the number of views
- About - a text description of the meme
- and many others
What's more, we want to do it without all of this hassle.
Once the data is downloaded and cleaned of garbage, we can start building models, for example to try to predict the popularity of a meme from its parameters. But all of that comes later; for now, let's get familiar with a few definitions:
- A parser is a script that robs a site of its information
- A crawler is the part of a parser that wanders through links
- Crawling is moving between pages and links
- Scraping is collecting data from pages
- Parsing is crawling and scraping at once!
1.2. What is HTML?
HTML (HyperText Markup Language) is a markup language, just like Markdown or LaTeX. It is the standard for writing websites. Commands in this language are called tags. If you open absolutely any site, right-click, and then click View page source, you will see the HTML skeleton of that site.
You can see that an HTML page is nothing more than a set of nested tags. For example, you may notice the following tags:
<title> - the page title
<h1>…<h6> - headings of different levels
<p> - a paragraph
<div> - marks out a fragment of the document in order to change how its content looks
<table> - draws a table
<tr> - row separator in a table
<td> - column separator in a table
<b> - sets a bold font
Usually the command <...> opens a tag and </...> closes it. Everything between these two commands obeys the rule dictated by the tag. For example, everything between <p> and </p> is a separate paragraph.
Tags form a kind of tree rooted at the <html> tag and split the page into logical pieces. Each tag can have its own descendants (children), which are the tags nested inside it, and its own parents.
For example, the HTML tree of a page might look like this:
<html>
  <head> </head>
  <body>
    <div> </div>
    <div>
      <b> </b>
    </div>
  </body>
</html>
You can work with this html as text, or you can work with it as a tree. Traversing this tree is exactly what parsing a web page is. We will simply find the nodes we need among all this variety and take the information from them!
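To make the tree idea concrete, here is a minimal sketch that walks a toy page with Python's standard-library `html.parser` and records how deeply each tag is nested. The `TreePrinter` class and the sample page are illustrative assumptions, not part of the original article, which uses BeautifulSoup for this job.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect every opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []  # (depth, tag name) in document order

    def handle_starttag(self, tag, attrs):
        self.nodes.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# Toy page with the same shape as the tree above; the text inside is made up.
page = "<html><head></head><body><div>first</div><div><b>bold</b></div></body></html>"

printer = TreePrinter()
printer.feed(page)
for depth, tag in printer.nodes:
    print("  " * depth + tag)
```

The indentation of the printed tags mirrors the parent-child relations: `b` sits inside the second `div`, which sits inside `body`.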
Traversing these trees by hand is not much fun, so there are special languages for traversing trees.
- CSS selectors (this is when we look for a page element by a key-value pair)
- XPath (this is when we write out a path along the tree, like this: /html/body/div[1]/div[3]/div/div[2]/div)
- All kinds of libraries for all kinds of languages, for example BeautifulSoup for Python. This is the library we will use.
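As a rough illustration of the path-style addressing mentioned above, the sketch below uses the limited XPath subset supported by the standard-library `xml.etree.ElementTree`. The toy document is an assumption made up for this example, and it is deliberately well-formed XML; real HTML pages often are not, which is one reason a dedicated library is used in the rest of the article.

```python
import xml.etree.ElementTree as ET

# Well-formed toy document (hypothetical content, for illustration only).
doc = ET.fromstring(
    "<html><head/><body>"
    "<div>first</div>"
    "<div><b>bold</b></div>"
    "</body></html>"
)

# Path relative to the root <html>: the second <div> inside <body>, then its <b>.
node = doc.find("body/div[2]/b")
print(node.text)
```

Positions in such paths are 1-based, so `div[2]` means the second `div` child, just like in the XPath example in the list.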
1.3. Our first request
The requests module gives us access to web pages. Let's load it, along with a couple of other useful packages for company.
import requests
For our noble research purposes, we need to collect data on each meme from its own page. But first we need the addresses of those pages. So we open the main page, where all the memes are laid out. It looks like this:
This is where we will pull the links to each of the listed memes from. We save the address of the main page in the variable page_link and open it with the requests library.
page_link = 'http://knowyourmeme.com/memes/all/page/1'
response = requests.get(page_link)
response
Out: <Response [403]>
And here is the first problem! Consulting the main source of knowledge, we find out that a 403 error is returned when the server is available and able to process requests but, for some personal reasons, refuses to do so.
Let's try to find out why. To do that, we check what the final request sent to the server looked like, and more specifically what our User-Agent looked like in the eyes of the server.
for key, value in response.request.headers.items():
    print(key+": "+value)
Out:
User-Agent: python-requests/2.14.2
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
It looks like we made it unambiguously clear to the server that we are sitting on Python and using the requests library, version 2.14.2. Most likely this made the server suspicious of our good intentions, and it decided to reject us mercilessly. For comparison, you can look at what the request headers of a healthy person look like:

Obviously our modest request cannot compete with the abundance of meta-information sent with a request from an ordinary browser. Fortunately, nothing stops us from pretending to be human and throwing dust in the server's eyes by generating a fake user agent. There are a great many libraries for this task; personally I like fake-useragent best. When its method is called, it assembles a random combination of operating system, specifications and browser version from various pieces, which can then be passed to a request:
from fake_useragent import UserAgent
UserAgent().chrome
Out: 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36'
Let's try running our request again, this time with a generated agent.
response = requests.get(page_link, headers={'User-Agent': UserAgent().chrome})
response
Out: <Response [200]>
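The same disguise can be sketched with the standard library alone, which also makes it easy to inspect the header before anything is sent. This is only an illustration: the `USER_AGENTS` pool and the `build_request` helper are hypothetical names introduced here, and `urllib` is swapped in for requests and fake-useragent. No network request is made.

```python
import random
import urllib.request

# Hypothetical pool of browser-like User-Agent strings
# (the first one is the sample value shown above).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0",
]


def build_request(url):
    """Return a Request object carrying a randomly chosen User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})


req = build_request("http://knowyourmeme.com/memes/all/page/1")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Building the request object separately from sending it is a convenient way to check what the server would see.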
Wonderful, our little disguise worked and the deceived server obediently returned the blessed 200 response: the connection is established and the data received, everything is great! Let's see what we actually got.
html = response.content
html[:1000]
Out: b'<!DOCTYPE html>\n<html xmlns:fb=\'http://www.facebook.com/2008/fbml\' xmlns=\'http://www.w3.org/1999/xhtml\'>\n<head>\n<meta content=\'text/html; charset=utf-8\' http-equiv=\'Content-Type\'>\n<script type="text/javascript">window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"c1a6d52f38","applicationID":"31165848","transactionName":"dFdfRUpeWglTQB8GDUNKWFRLHlcJWg==","queueTime":0,"applicationTime":59,"agent":""}</script>\n<script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(e,t,n){function r(n){if(!t[n]){var o=t[n]={exports:{}};e[n][0].call(o.exports,function(t){var o=e[n][1][t];return r(o||t)},o,o.exports)}return t[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(e,t,n){function r(){}function o(e,t,n){return function(){return i(e,[f.now()].concat(u(arguments)),t?null:this,n),t?void 0:this}}var i=e("handle"),a=e(2),u=e(3),c=e("ee").get("tracer")'
That looks indigestible. How about cooking something prettier out of it? For example, a beautiful soup.
1.4. Beautiful soup

The bs4 package, a.k.a. BeautifulSoup (see man's best friend, the documentation), is named after the verse about beautiful soup from Alice in Wonderland.
Beautiful Soup is an utterly magical library that takes the raw, unprocessed HTML code of a page and gives you a structured data set in which it is very convenient to search for the tags, classes, attributes, text and other web page elements you need.
The package called BeautifulSoup is most likely not what we need. That is the third version (Beautiful Soup 3), and we will use the fourth. You need to install the beautifulsoup4 package. To make things really fun, on import you have to specify yet another package name, bs4, and import the class called BeautifulSoup. All in all it is easy to get confused at first, but these difficulties must be overcome.
The package also works with the raw XML code of a page (XML is HTML mangled and turned into a dialect with its own commands). For the package to work correctly with XML markup, you will have to install the lxml package on top of the rest of our arsenal.
from bs4 import BeautifulSoup
We pass the content of the web page we just received to BeautifulSoup.
soup = BeautifulSoup(html,'html.parser')
We get something like this:
<!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml"> <head> <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/> <script type="text/javascript">window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"c1a6d52f38","applicationID":"31165848","transactionName":"dFdfRUpeWglTQB8GDUNKWFRLHkUNWUU=","queueTime":0,"applicationTime":24,"agent":""}</script> <script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var o=n[t]={exports:{}};e[t][0].call(o.exports,function(n){var o=e[t][1][n];return r(o||n)},o,o.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t.length;o++)r(t[o]);return r}({1:[function(e,n,t){function r(){}function o(e,n,t){return function(){return i(e,[c.now()].concat(u(arguments)),n?null:this,t),n?void 0:this}}var i=e("handle"),a=e(2),u=e(3),f=e("ee").get("tracer"),c=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var p=
Much better, isn't it? So what is in the soup variable? An inattentive user would most likely say that nothing has changed at all. But that is not so. Now we can roam freely around the page's HTML tree, look for children and parents, and pull them out!
For example, we can walk through the nodes by specifying a path of tags.
soup.html.head.title
Out: <title>All Entries | Know Your Meme</title>
With the text attribute we can pull the text out of wherever we have wandered.
soup.html.head.title.text
Out: 'All Entries | Know Your Meme'
Moreover, knowing an element's address, we can find it right away. For example, we can do it by class. The following command should find an element that sits inside an a tag and has the class photo.
obj = soup.find('a', attrs = {'class':'photo'})
obj
Out: <a class="photo left" href="/memes/nu-male-smile" target="_self"><img alt='The "Nu-Male Smile" Is Duck Face for Men' data-src="http://i0.kym-cdn.com/featured_items/icons/wide/000/007/585/7a2.jpg" height="112" src="http://a.kym-cdn.com/assets/blank-b3f96f160b75b1b49b426754ba188fe8.gif" title='The "Nu-Male Smile" Is Duck Face for Men' width="198"/> <div class="info abs"> <div class="c"> The "Nu-Male Smile" Is Duck Face for Men </div> </div> </a>
However, contrary to our expectations, the object we pulled out has the class "photo left". It turns out that BeautifulSoup4 treats the class attribute as a set of separate values, so for the library "photo left" is equivalent to ["photo", "left"], and the value "photo" that we specified is a member of that list. To avoid this unpleasant situation and trips to links we do not need, we have to use our own function and demand an exact match:
obj = soup.find(lambda tag: tag.name == 'a' and tag.get('class') == ['photo'])
obj
Out: <a class="photo" href="/memes/people/mf-doom"><img alt="MF DOOM" data-src="http://i0.kym-cdn.com/entries/icons/medium/000/025/149/1489698959244.jpg" src="http://a.kym-cdn.com/assets/blank-b3f96f160b75b1b49b426754ba188fe8.gif" title="MF DOOM"/> <div class="entry-labels"> <span class="label label-submission"> Submission </span> <span class="label" style="background:
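The difference between the two searches comes down to how a class attribute is compared. The sketch below reproduces that logic in plain Python, without any library; the helper names `class_list`, `loose_match` and `exact_match` are made up for this illustration and are not BeautifulSoup functions.

```python
def class_list(attr_value):
    """Split a raw class attribute into its individual class names."""
    return attr_value.split()


def loose_match(attr_value, wanted):
    """True if `wanted` is one of the classes (how the first search behaved)."""
    return wanted in class_list(attr_value)


def exact_match(attr_value, wanted):
    """True only if `wanted` is the sole class on the element."""
    return class_list(attr_value) == [wanted]


print(loose_match("photo left", "photo"))
print(exact_match("photo left", "photo"))
print(exact_match("photo", "photo"))
```

The lambda in the search above plays the role of `exact_match`: it compares the whole list of classes with `['photo']` instead of checking membership.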
The object we get from a search also has a bs4 structure, so we can go on searching inside it! Let's pull out the link to this meme. This can be done via the href attribute, which holds our link.
obj.attrs['href']
Out: '/memes/people/mf-doom'
Note that after all these crazy transformations the data type has changed. It is now str. This means we can work with it as text and use regular expressions to filter out unneeded information.
print("Type before extracting the link:", type(obj))
print("Type after extracting the link:", type(obj.attrs['href']))
Out:
Type before extracting the link: <class 'bs4.element.Tag'>
Type after extracting the link: <class 'str'>
If several elements on the page match the given address, the find method returns only the first one. To find all elements with that address, use the findAll method (spelled find_all in current versions of the library), which returns a list. This way we can get all the objects containing links to meme pages in a single search.
meme_links = soup.findAll(lambda tag: tag.name == 'a' and tag.get('class') == ['photo'])
meme_links[:3]
Out: [<a class="photo" href="/memes/people/mf-doom"><img alt="MF DOOM" data-src="http://i0.kym-cdn.com/entries/icons/medium/000/025/149/1489698959244.jpg" src="http://a.kym-cdn.com/assets/blank-b3f96f160b75b1b49b426754ba188fe8.gif" title="MF DOOM"/> <div class="entry-labels"> <span class="label label-submission"> Submission </span> <span class="label" style="background:
All that remains is to clean the garbage out of the resulting list:
meme_links = [link.attrs['href'] for link in meme_links]
meme_links[:10]
Out: ['/memes/people/mf-doom', '/memes/here-lies-beavis-he-never-scored', '/memes/people/vanossgaming', '/memes/stream-sniping', '/memes/kids-describe-god-to-an-illustrator', '/memes/bad-teacher', '/memes/people/adam-the-creator', '/memes/but-can-you-do-this', '/memes/people/ken-ashcorp', '/memes/heartbroken-cowboy']
Done: we got exactly 16 links, one for each meme on a single search page.
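The extracted links are relative paths, so they still need the site address in front of them before they can be requested. The sketch below shows the whole step (exact class match, href extraction, absolute URL) using only the standard library. The `PhotoLinkParser` class and the shortened sample markup are assumptions made for illustration; the article itself does this with BeautifulSoup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://knowyourmeme.com"


class PhotoLinkParser(HTMLParser):
    """Collect href values of <a> tags whose class attribute is exactly "photo"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and (attrs.get("class") or "").split() == ["photo"]:
            # Turn the relative href into an absolute URL.
            self.links.append(urljoin(BASE_URL, attrs.get("href") or ""))


# Shortened, made-up sample in the shape of the search page markup.
sample = (
    '<a class="photo left" href="/memes/nu-male-smile"></a>'
    '<a class="photo" href="/memes/people/mf-doom"></a>'
    '<a class="photo" href="/memes/stream-sniping"></a>'
)

parser = PhotoLinkParser()
parser.feed(sample)
print(parser.links)
```

The featured link with the class "photo left" is skipped, and the two remaining hrefs come out as full addresses.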
Of course, it is great that we can look elements up by their address, but where do we get that address? You can install a browser utility that helps pick the right tags on a page, for example selectorgadget.
However, that is not the way of a true samurai. For followers of bushido there is another path: finding the tags for each required element by hand. To do this, right-click in the browser window and press Inspect. After all these manipulations the browser will look something like this:
The HTML that pops up, with the address of the selected object, can be safely copied into the code while you enjoy your own brutality.
One last thing remains. Once we have downloaded all the memes from the current page, we need to get to the next one somehow. On the site you can do that simply by scrolling the page down: javascript functions pull new memes into the current window. We would rather not touch those functions for now.
Usually all the search parameters we set on a site are reflected in the structure of the link. Memes are no exception. To get the first batch of memes, we go to the site at
http://knowyourmeme.com/memes/all/page/1
To get the second batch of sixteen memes, we change the link slightly, replacing the page number with 2.
http://knowyourmeme.com/memes/all/page/2
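In this simple way we can walk through all the pages and rob the meme vault. Finally, let's wrap all of the manipulations above in a pretty function:
Function for downloading links to memes
def getPageLinks(page_number):
    """
    Return the list of meme links found on one search page.
    page_number: int/string
    """
    page_link = 'http://knowyourmeme.com/memes/all/page/{}'.format(page_number)
    response = requests.get(page_link, headers={'User-Agent': UserAgent().chrome})
    soup = BeautifulSoup(response.content, 'html.parser')
    meme_links = soup.findAll(lambda tag: tag.name == 'a' and tag.get('class') == ['photo'])
    return ['http://knowyourmeme.com' + link.attrs['href'] for link in meme_links]
Let's test the function and make sure everything is fine.
meme_links = getPageLinks(1)
meme_links[:2]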
Out: ['http://knowyourmeme.com/memes/people/mf-doom', 'http://knowyourmeme.com/memes/here-lies-beavis-he-never-scored']
Great, the function works, and in theory we can now get the links to all 17171 memes, for which we will have to go through 17171/16 ≈ 1074 pages. Before upsetting the server with that many requests, let's see how to get all the information we need about a particular meme.
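The page arithmetic and the list of addresses to visit can be written down in a few lines. This is a small sketch under the numbers quoted above; the constant names are chosen here for illustration.

```python
import math

TOTAL_MEMES = 17171   # figure quoted in the text
MEMES_PER_PAGE = 16   # links found on one search page
URL_TEMPLATE = "http://knowyourmeme.com/memes/all/page/{}"

# Round up: the last page is only partly filled.
n_pages = math.ceil(TOTAL_MEMES / MEMES_PER_PAGE)
page_urls = [URL_TEMPLATE.format(n) for n in range(1, n_pages + 1)]

print(n_pages)
print(page_urls[0])
print(page_urls[-1])
```

Rounding up matters here: 17171/16 is a little over 1073, so the 1074th page holds the remaining memes.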
1.5. Final preparations for the heist
By analogy with the links, we can pull out anything we like. This takes a few steps:
- Open the page with the meme
- Find the tag for the information we need
- Stuff it all into the beautiful soup
- ......
- Profit
To fix the information in the curious reader's head, let's pull out the number of views of a meme.
As an example, let's take the most popular meme on this site, Doge, which had racked up more than 12 million views as of January 1, 2018.
The page from which we will extract the information so dear to a researcher's heart looks like this:

As before, we first save the link to the page in a variable and pull out its content.
meme_page = 'http://knowyourmeme.com/memes/doge'
response = requests.get(meme_page, headers={'User-Agent': UserAgent().chrome})
html = response.content
soup = BeautifulSoup(html,'html.parser')
Let's see how to pull out the view and comment statistics, as well as the number of uploaded videos and photos related to our meme. All of this is stored in the top right, under "dd" tags with the classes "views", "videos", "photos" and "comments".
views = soup.find('dd', attrs={'class':'views'})
print(views)
Out: <dd class="views" title="12,318,185 Views"> <a href="/memes/doge" rel="nofollow">12,318,185</a> </dd>
Clean off the tags and punctuation:
views = views.find('a').text
views = int(views.replace(',', ''))
print(views)
Out: 12318185
Once again, let's pack everything into a small function.
Function that returns a statistic for a meme
def getStats(soup, stats):
    """
    Return the cleaned number for one statistic of the current meme.
    soup: bs4.BeautifulSoup
    stats: string, one of views/videos/photos/comments
    """
    obj = soup.find('dd', attrs={'class':stats})
    obj = obj.find('a').text
    obj = int(obj.replace(',', ''))
    return obj
Everything is ready!
views = getStats(soup, stats='views')
videos = getStats(soup, stats='videos')
photos = getStats(soup, stats='photos')
comments = getStats(soup, stats='comments')
print("Views: {}\nVideos: {}\nPhotos: {}\nComments: {}".format(views, videos, photos, comments))
Out:
Views: 12318185
Videos: 59
Photos: 1645
Comments: 918
Another interesting, research-worthy item: let's get the date and time the meme was added. Looking at the page in a browser, you might think that the most we can extract is the number of years since publication: Added 4 years ago by NovaXP. But we will not give up that easily. Let's climb into the html and dig out the piece responsible for this label.
Aha! Here are the details of the date added, as a full timestamp. Let's grab them.
date = soup.find('abbr', attrs={'class':'timeago'}).attrs['title']
date
Out: '2017-12-31T01:59:14-05:00'
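The extracted value is still a plain string. As a minimal sketch, assuming the attribute always has the ISO 8601 shape shown above, it can be turned into a proper datetime with the standard library (the variable names are chosen here for illustration):

```python
from datetime import datetime, timezone

raw = '2017-12-31T01:59:14-05:00'

added = datetime.fromisoformat(raw)          # timezone-aware datetime
added_utc = added.astimezone(timezone.utc)   # same instant expressed in UTC

print(added.year, added.month, added.day)
print(added_utc.isoformat())
```

Converting everything to UTC keeps timestamps comparable if the site ever reports different offsets.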
In fact, parsers do not like surprises. Pages we parse often have a very heterogeneous structure. For example, when parsing memes, a description may be present on some pages and missing on others. As soon as the code first runs into a missing description, it throws an error and stops. To collect all the data properly, you have to handle exceptions. The meme vault is fairly well organized, and no emergencies should occur.
Nevertheless, nobody wants to wake up in the morning and see that the code did 20 iterations, ran into an error and died. To prevent this, you can, for example, use the try - except construct and simply handle the errors you do not want. You can read about exceptions on the internet. In this case we can also do without catching them, by checking in advance with an ordinary if - else whether the required element is on the page, and only then trying to parse it.
For example, we want to pull out the status of the meme, so we find the tags that surround it:
properties = soup.find('aside', attrs={'class':'left'})
meme_status = properties.find("dd")
meme_status
Out: <dd> Confirmed </dd>
Next we need to extract the text from the tag and strip all the extra whitespace.
meme_status.text.strip()
Out: 'Confirmed'
However, if it suddenly turns out that a meme has no status, the find method returns None. The text attribute, in turn, cannot be read from a missing tag, and an error is raised. To protect ourselves from such voids, we can either handle the exception or use if - else. Since the current meme does have a status, we deliberately set it to an empty object to verify that the error is caught in both cases:
Out: Exception Empty
Code like this protects us from errors. We can also rewrite the whole if - else construction as one convenient line. That line checks whether meme_status contains anything and yields an empty value if it does not.
Out: Confirmed
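Both guards described above can be sketched as follows. This is an illustration rather than the article's exact code: `FakeTag` is a hypothetical stand-in for a found tag so that the example runs without downloading a page, and `status_try` and `status_if` are names invented here.

```python
class FakeTag:
    """Hypothetical stand-in for a found tag; only the .text attribute matters here."""

    def __init__(self, text):
        self.text = text


def status_try(tag):
    """Guard with try-except: reading .text on a missing tag raises AttributeError."""
    try:
        return tag.text.strip()
    except AttributeError:
        return ""


def status_if(tag):
    """Guard with a one-line conditional expression."""
    return "" if not tag else tag.text.strip()


found = FakeTag(" Confirmed ")
missing = None  # what a failed search hands back

print(status_try(found), status_if(found))
print(repr(status_try(missing)), repr(status_if(missing)))
```

Catching the specific `AttributeError` rather than using a bare `except` keeps unrelated bugs visible, which is usually the safer choice in a long-running scraper.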
顿šã«ãããããŒãžããæ®ãã®æ
å ±ãåŒãåºãããšãã§ããŸãããã®ããã«ãåã³é¢æ°ãæžããŸãã
ããŒã ããããã£ãè§£æããããã®é¢æ° def getProperties(soup): """ (tuple) , , , soup: bs4.BeautifulSoup """
getProperties(soup)
Out: ('Doge', 'Confirmed', 'Animal', '2013', 'Tumblr', 'animal, dog, shiba inu, shibe, such doge, super shibe, japanese, super, tumblr, much, very, many, comic sans, photoshop meme, such, shiba, shibe doge, doges, dogges, reddit, comic sans ms, tumblr meme, hacked, bitcoin, dogecoin, shitposting, stare, canine')
The meme properties are collected. Now, by analogy, let's collect its text description.
Function for parsing the text description of a meme
def getText(soup):
    """
    Return the text description of the meme.
    soup: bs4.BeautifulSoup
    """
    ...
meme_about, meme_origin, other_text = getText(soup)
print("About:\n{}\n\nOrigin:\n{}\n\nOther text:\n{}...\n"\
      .format(meme_about, meme_origin, other_text[:200]))
Out:
About:
Doge is a slang term for “dog” that is primarily associated with pictures of Shiba Inus (nicknamed “Shibe”) and internal monologue captions on Tumblr. These photos may be photoshopped to change the dog's face or captioned with interior monologues in Comic Sans font.

Origin:
The use of the misspelled word “doge” to refer to a dog dates back to June 24th, 2005, when it was mentioned in an episode of Homestar Runner's puppet show. In the episode titled “Biz Cas Fri 1”[2], Homestar calls Strong Bad his “doge” while trying to distract him from his work.

Other text:
Identity On February 23rd, 2010, Japanese kindergarten teacher Atsuko Sato posted several photos of her rescue-adopted Shiba Inu dog Kabosu to her personal blog.[38] Among the photos included a peculi...
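Finally, a function that gathers everything about a single meme:
def getMemeData(meme_page):
    """
    Collect all the data for one meme.
    meme_page: string, link to the meme page
    """
    ...
Now let's prepare a table for the results and add the first row to it:
import pandas as pd

final_df = pd.DataFrame(columns=['name', 'status', 'type', 'origin_year', 'origin_place',
                                 'date_added', 'views', 'videos', 'photos', 'comments',
                                 'tags', 'about', 'origin', 'other_text'])
data_row = getMemeData('http://knowyourmeme.com/memes/doge')
# DataFrame.append was removed in pandas 2.0; there, use pd.concat instead
final_df = final_df.append(data_row, ignore_index=True)
final_df
Out:
| name | status | type | origin_year | ... |
|---|---|---|---|---|
| Doge | Confirmed | Animal | 2013 | ... |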
Now let's do the same in a loop over the links in meme_links.
for meme_link in meme_links:
    data_row = getMemeData(meme_link)
    # DataFrame.append was removed in pandas 2.0; there, use pd.concat instead
    final_df = final_df.append(data_row, ignore_index=True)
Out:
| name | status | type | origin_year | ... |
|---|---|---|---|---|
| Doge | Confirmed | Animal | 2013 | ... |
| Charles C. Johnson | Submission | Activist | 2013 | ... |
| Bat- (Prefix) | Submission | Snowclone | 2018 | ... |
| The Eric Andre Show | Deadpool | TV Show | 2012 | ... |
| Hopsin | Submission | Musician | 2003 | ... |
Great!
2.
2.1
圌ãããïŒ , , , , â . , . . try-except
. .
! - - , , .

, , , . .

2.2
Before hiding behind Tor, let's look at our current IP address. The function below sends a GET request to a service that reports the caller's IP.
def checkIP():
    ip = requests.get('http://checkip.dyndns.org').content
    soup = BeautifulSoup(ip, 'html.parser')
    print(soup.find('body').text)

checkIP()
Out: Current IP Address: 82.198.191.130
This is the IP we will hide with Tor. First, tor has to be installed:
- Linux:
apt-get install tor
- Mac, with brew:
brew install tor
- Windows
2.3
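To switch the IP from Python we need the PySocks library, which can be installed with pip3 install PySocks.
The code below connects to Tor on port 9150 and routes every new socket through it using the socks and socket modules.
import socks
import socket

socks.set_default_proxy(socks.SOCKS5, "localhost", 9150)
socket.socket = socks.socksocket
Let's check the IP address again.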
checkIP()
Out: Current IP Address: 51.15.92.24
Now let's pull the data for a meme through the new IP address.
data_row = getMemeData('http://knowyourmeme.com/memes/doge')
for key, value in data_row.items():
    print(key.capitalize()+":", str(value)[:200], end='\n\n')
Out:
Name: Doge
Status: Confirmed
Type: Animal
Origin_year: 2013
Origin_place: Tumblr
Date_added: 2017-12-31T01:59:14-05:00
Views: 12318185
Videos: 59
Photos: 1645
Comments: 918
Tags: animal, dog, shiba inu, shibe, such doge, super shibe, japanese, super, tumblr, much, very, many, comic sans, photoshop meme, such, shiba, shibe doge, doges, dogges, reddit, comic sans ms, tumblr meme
About: Doge is a slang term for “dog” that is primarily associated with pictures of Shiba Inus (nicknamed “Shibe”) and internal monologue captions on Tumblr...
To make Tor switch to a new IP address every 10 seconds, add the following settings to the torrc file (for example ~/Library/Application Support/TorBrowser-Data/torrc):
CircuitBuildTimeout 10
LearnCircuitBuildTimeout 0
MaxCircuitDirtiness 10
Now the IP should change roughly every 10 seconds. Let's check:
import time

for i in range(10):
    checkIP()
    time.sleep(5)
Out:
Current IP Address: 89.31.57.5
Current IP Address: 93.174.93.71
Current IP Address: 62.210.207.52
Current IP Address: 209.141.43.42
Current IP Address: 209.141.43.42
Current IP Address: 162.247.72.216
Current IP Address: 185.220.101.17
Current IP Address: 193.171.202.150
Current IP Address: 128.31.0.13
Current IP Address: 185.163.1.11
, ip 10 . . 20 .
- ;
- ;
- .....
- Profit
from tqdm import tqdm_notebook

final_df = pd.DataFrame(columns=['name', 'status', 'type', 'origin_year', 'origin_place',
                                 'date_added', 'views', 'videos', 'photos', 'comments',
                                 'tags', 'about', 'origin', 'other_text'])
for page_number in tqdm_notebook(range(1075), desc='Pages'):
    ...
. . . : , ip ?
2.4
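For finer control over when the IP changes, there is a ready-made library on Github: TorCrawler.py. It needs a few changes to the torrc file, which is usually located in /usr/local/etc/tor/ or in /etc/tor/. First, generate a hashed password: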
tor --hash-password mypassword
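- Open torrc in any editor, for example vim, nano or atom
- Find the line in torrc that starts with HashedControlPassword
- Put the hash generated above after HashedControlPassword
- Uncomment the line ControlPort 9051
- Save the file.
Then restart tor. On Linux: service tor start; on Mac: tor.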
from TorCrawler import TorCrawler
The crawler's get method returns a bs4 object right away.
meme_page = 'http://knowyourmeme.com/memes/doge'
response = crawler.get(meme_page, headers={'User-Agent': UserAgent().chrome})
type(response)
Out: bs4.BeautifulSoup
- .
views = response.find('dd', attrs={'class':'views'})
views
Out: <dd class="views" title="12,318,185 Views"> <a href="/memes/doge" rel="nofollow">12,318,185</a> </dd>
IP
crawler.ip
Out: '51.15.40.23'
By default the IP is changed every 25 requests. This is controlled by the n_requests parameter.
crawler.n_requests
Out: 25
If needed, the IP can also be changed manually.
crawler.rotate()
IP successfully rotated. New IP: 62.176.4.1
, , IP . .
. .
Conclusion
, . â , , , , - , , , . , , , , , DDoS-. , , â , â .
â - , . time.sleep()
request-header
â .
!
Authors: filfonul, Skolopendriy