
Hello, Habrahabr.
Following up on the article about graph theory in Game of Thrones, I have translated a tutorial by Erik Germani, who extracted the social graph of the first five books of the A Song of Ice and Fire series, the graph on which that article was based. The tutorial does not explain machine learning methods in depth. Instead it shows, in practice, how to use existing tools to find the speaker of each line of dialogue in a text. Be warned, it is a long read. Let's go!
This tutorial is aimed at machine learning beginners, which is what I was when I started this project a year ago. (And I still am one, although no longer quite so clueless on this topic.) We will work out who speaks each line of dialogue in George R. R. Martin's A Song of Ice and Fire. To do that, we will use conditional random fields (CRF) and Naoaki Okazaki's excellent CRFsuite utility. For text processing we will use Python 2.7 and NLTK (Natural Language Toolkit).
I will try to explain everything in as much detail as I can. I hope that as I describe each step, you will pick up new tools and methods that are useful in your own projects. The code is explained by a beginner for beginners: someone who understands Python syntax and knows list comprehensions, but not much more. If my code explanations feel like they are draining your soul, skip them.
Important: if you are looking for the theory behind conditional random fields, this material is not for you. To me, CRFsuite is a beautiful black box that I poke with my monkey paws. I spent some time trying to make the model more effective, but those attempts were misguided. If that worries you, keep two things in mind:
- You can get good results (about 75% accuracy) from CRFsuite straight out of the box.
- There will be no LaTeX.
The game plan is simple. As with any machine learning algorithm, we need to prepare data for training and validation. Then we choose the features the algorithm will use for classification. Once we have processed the text with those features, we feed the result to CRFsuite and congratulate ourselves on a job well done. (Or we take on the backbreaking work of checking the machine's guesses.)
Let's begin.
Loading the text
First of all, we need a copy of the source text. Whether you pay gold for it is up to you.
If you are new to natural language processing, you may underestimate how tricky source text can be. Every .txt file has an encoding that determines how its characters are stored. ASCII, the format in which I used to read Ocarina of Time walkthroughs, has been replaced by UTF-8, which can handle all special characters. (ASCII can represent only 128 characters.) My copy of ASOIAF (A Song of Ice and Fire) is in UTF-8, which is slightly inconvenient but is actually a bonus.
We will load the text into NLTK to make it easier to work with. NLTK can do a lot, and it is how I learned Python, so if it turns out to interest you, have a look at its excellent online book. For our purposes, we will use it to split the text into tokens. That means splitting sentences into words and punctuation marks, as is commonly done in natural language processing projects.
    import nltk
    nltk.word_tokenize("NLTK is ready to go.")
    ['NLTK', 'is', 'ready', 'to', 'go', '.']
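To make the idea of tokenizing concrete without installing NLTK, here is a minimal sketch that uses only the standard library. It assumes a simple rule (runs of word characters, or runs of punctuation), which is close in spirit to NLTK's `wordpunct_tokenize` but is not NLTK's `word_tokenize`; the name `simple_tokenize` is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    # \w+ matches a word; [^\w\s]+ matches a run of punctuation marks.
    return re.findall(r"\w+|[^\w\s]+", text)


tokens = simple_tokenize("NLTK is ready to go.")
print(tokens)

# A comma followed by a closing curly quote stays together as one token.
quoted = simple_tokenize(u"Gared,\u201d Ser")
print(quoted)
```

Note how punctuation that sits directly next to a curly quote ends up in a single token; the same kind of token shows up in the corpus output later on.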
NLTK ships with preloaded corpora, but we need to load our own.
Create a folder and put the ASOIAF text files in it. The books are very large, so the plain text comes to almost 10 MB, which is not ideal for search and replace. I split the text by book. Real professionals who plan to analyze it further would split each book into chapters and number them sequentially.
But let's not complicate things for now! Once the text is in the folder, we can run the following.
corpus = nltk.corpus.PlaintextCorpusReader(r'corpus', 'George.*\.txt', encoding = 'utf-8')
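Here r marks the string as raw, so it is not processed for escape sequences. It does not matter in this case, because I access the "corpus" folder directly, but if your folder has a more complicated path, it is better not to forget it.
The second argument is a regular expression that tells NLTK to take every file in the folder whose name contains "George" and whose extension is ".txt".
The encoding parameter is very important: if the encoding of the text does not match the one you specify, you will get errors.
An NLTK corpus is very convenient, because it lets you pull information out of the text at several levels.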
    corpus.words("George RR Martin - 01 - A Game Of Thrones.txt")[-5:]
    [u'the', u'music', u'of', u'dragons', u'.']
    corpus.words()[0]
    u'PROLOGUE'
    corpus.sents()[1][:6]
    [u'\u201c', u'We', u'should', u'start', u'back', u',\u201d']
Here we hear the doomed Gared from the prologue of A Game of Thrones and see how Unicode characters are represented in Python. Every Unicode string starts with u and may contain special characters: \u201c is the left quotation mark and \u201d is the right one. I said that UTF-8 is a bonus, and this is why. Let's see what happens if we open the same file without specifying the encoding.
    bad_corpus = nltk.corpus.PlaintextCorpusReader(r'corpus', '.*\.txt')
    bad_corpus.sents()[1][:9]
    ['\xe2', '\x80\x9c', 'We', 'should', 'start', 'back', ',', '\xe2', '\x80\x9d']
Just as \u marks a Unicode character, \x marks a hexadecimal byte. NLTK is handed three bytes (\xe2, \x80, \x9c) and tries to split them into tokens. As you can see, it does not know how.
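The three bytes in the broken output are simply the UTF-8 encoding of the left quotation mark. The sketch below, which uses only built-in string methods, shows the round trip and what happens when the same bytes are decoded with the wrong codec (Latin-1 is used here purely as an illustration of a wrong choice).

```python
left_quote = u"\u201c"

# Encoding the single character yields three bytes in UTF-8.
raw = left_quote.encode("utf-8")

# Decoding with the right codec restores one character.
decoded = raw.decode("utf-8")

# Decoding with the wrong codec turns every byte into its own character.
wrong = raw.decode("latin-1")

print(len(raw), len(decoded), len(wrong))
```

This is why the corpus reader needs the `encoding` argument: without it, each quotation mark falls apart into several meaningless pieces.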
We will be working with paragraphs, so let's look at one of them.
    print corpus.paras()[1]
    [[u'\u201c', u'We', u'should', u'start', u'back', u',\u201d', u'Gared', u'urged', u'as', u'the', u'woods', u'began', u'to', u'grow', u'dark', u'around', u'them', u'.'], [u'\u201c', u'The', u'wildlings', u'are', u'dead', u'.\u201d']]
Notice how NLTK structures the data. A sentence is a list of tokens, and a paragraph is a list of sentences. Easy!
Labels
Next we need to prepare the training data, but to do that we have to decide which labels to use. When a part-of-speech tagger parses text, it decides which lexical category each token belongs to, and each category has its own label: JJ is an adjective, NN is a noun, IN is a preposition. The choice of labels plays an important role in how reliable the model is. The Penn Treebank (a text annotation project) distinguishes 36 such labels.
What will our labels be? The simplest option is character names. That will not work, for several reasons:
- ASOIAF contains more than 1,000 characters. That is far too many choices for our poor model. We need to weed out as many labels as possible if we want it to classify correctly with nothing but dumb luck to rely on.
- Characters are addressed in different ways. Joffrey can be "Joffrey", "Joff", "Prince", or simply "him".
- If we use character names as labels, every name has to appear in the training data. Otherwise the model will not know the character exists and can never predict it.
- All the characters sound alike. (I found this out in another machine learning experiment, in which I tried to tell characters apart by their vocabulary.) A few have catchphrases, such as "grievous" for Varys and "Hodor" for Hodor, but that is rare. On top of that, many characters do not speak enough to be distinguished from the others.
Identifying speakers by name sounds very appealing, but let's drop that idea and think about what goes on in a reader's head when solving the same problem.
Pick up the nearest book, open it at a random page, and work out who is speaking. How do you do it? You look for the proper name closest to the dialogue.
    “Will saw them,” Gared said.
    [...]
    Ser Waymar Royce glanced at the sky with disinterest. “Night falls every day about this time. Does the dark unman you, Gared?”
However, not every line of dialogue is marked. Read on and you will see:
    “Did you make note of the position of the bodies?”
Look at the paragraphs above and below. Here are the two above:
    “And weapons?”
    “Some swords, a few bows. One man had an axe. Heavy-looking, double-bladed, a cruel piece of iron. It was on the ground beside him, right by his hand.”
Not a hint. The two paragraphs below:
    Will shrugged. “Couple are sitting up against the rock. Most of them on the ground. Fallen, like.”
    “Or sleeping,” Royce suggested.
We know Will would not question himself, so we can say he is not the speaker of that line. And since many conversations run over several paragraphs, we assume that the speaker of the first line is Royce.
This scheme tells us how to label for the model. We teach it to identify proper names next to the dialogue and, if there are none, to look in the neighboring paragraphs. Our labels then become:
    PS ±2, FN ±2, NN ±2, OTHER.
PS stands for post speaker. If a paragraph is labeled PS -2, the speaker's name appears two paragraphs above, right after a piece of dialogue. FN 1 means the first name in the next paragraph. NN 0 means that at least two names come before the dialogue and we need the one nearest to it.
I also define ADR ±2 for characters who are addressed in the text of the dialogue.
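As a small illustration of the label scheme, the sketch below splits a positional label into its kind and its paragraph offset and resolves the offset against the current paragraph index. The function names `parse_label` and `target_paragraph` are hypothetical and are not part of the original scripts; labels without an offset (such as OTHER) are assumed to carry no position.

```python
def parse_label(label):
    """Split a label such as 'PS -2' into ('PS', -2)."""
    parts = label.split()
    if len(parts) == 1:
        # Labels like OTHER have no paragraph offset.
        return parts[0], None
    return parts[0], int(parts[1])


def target_paragraph(current_index, label):
    """Return the label kind and the index of the paragraph it points at."""
    kind, offset = parse_label(label)
    if offset is None:
        return kind, None
    return kind, current_index + offset


print(parse_label("PS -2"))
print(target_paragraph(10, "FN 1"))
print(parse_label("OTHER"))
```

A label therefore never names a character. It only says where to look for the name, which is what lets a model trained on one book work on characters it has never seen.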
Tagging the text
Now we prepare the training data. SublimeText helps here. I opened the text of A Game of Thrones, selected a left quotation mark, chose Find -> Quick Find All, and pressed the Home key twice. That puts a cursor at the start of every paragraph that contains dialogue. Then I typed "{}". The text contains no curly braces, so we can use them to leave notes for later.
To jump between the braces we use the regular expression (?<=\{)(?=\}). If you have not met this construct before, it is called positive lookbehind and positive lookahead. The first parenthesized expression makes SublimeText match only at a position that is preceded by an opening brace (escaped with a backslash). The second expression makes it match only where a closing brace follows. As you can see, both expressions use the ?= construct, and only the first one contains <.
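The same lookbehind and lookahead pattern works in Python's `re` module, which makes it easy to see what it matches. This is a sketch with made-up placeholder lines, not text from the books: the pattern matches the empty position between `{` and `}`, so braces that already contain a label are skipped.

```python
import re

# Two untagged paragraphs ("{}") and one that already has a label.
text = (
    u"{}\u201cFirst line,\u201d he said.\n"
    u"{PS 0}\u201cSecond line.\u201d\n"
    u"{}\u201cThird line.\u201d"
)

# Zero-width match: preceded by "{" and followed by "}".
EMPTY_BRACES = re.compile(r"(?<=\{)(?=\})")

positions = [match.start() for match in EMPTY_BRACES.finditer(text)]
print(positions)
```

Each reported position is where the cursor lands, ready for a label to be typed between the braces.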
Pressing F3, the find-next hotkey in SublimeText on Windows, now moves you from one pair of braces to the next. This kind of optimization matters, because you will be tagging about a thousand pieces of dialogue. At least, that is how many I did. It was neither as hard nor as time-consuming as I expected. (Although maybe I am lying, since I only finished a year later.)
Before you start, one remark: think about whether to use positional labels (PS, FN, NN) or character names after all. I have already said that we will not use names as labels, but if you tag with positional labels, you tie the training data to this particular model. If you tag Jon's dialogue as "Jon", you can later convert the label to a positional one, or to some other kind of label if you find a better one.
I do not think there is a single right answer. Last year I tagged with character names. Now I have to do extra preprocessing that adds ambiguity. If Eddard's name appears two paragraphs above and one paragraph below, which one do you choose? That decision directly affects the model's behavior, and making it automatically makes the process even less precise. So I am not sure what to advise. For manual tagging, writing the character's name is easier. For automation, positional labels are far more convenient.
Extracting features
So, you have tagged part of the text. I applaud your commitment to natural language processing. All we need now is to write a few functions that take a paragraph as an argument and mark it with the features we are interested in.
Which features? The workhorses behind the model's accuracy are these: whether PS, FN, or NN is present in the current paragraph or in the neighboring ones.
Finding names
Our first function has to find proper names. One way to do that is part-of-speech tagging.
    sentence = corpus.paras()[33][0]
    print " ".join(sentence)
    print nltk.pos_tag(sentence)
    “ Such eloquence , Gared ,” Ser Waymar observed .
    [(u'\u201c', 'NN'), (u'Such', 'JJ'), (u'eloquence', 'NN'), (u',', ','), (u'Gared', 'NNP'), (u',\u201d', 'NNP'), (u'Ser', 'NNP'), (u'Waymar', 'NNP'), (u'observed', 'VBD'), (u'.', '.')]
The NNP next to Ser and Waymar means they are proper nouns. But this approach has drawbacks:
- It makes mistakes. Notice that the closing quotation mark was tagged as a proper noun.
- Part-of-speech tagging takes time.
    %timeit nltk.pos_tag(sentence)
    100 loops, best of 3: 8.93 ms per loop
    asoiaf_sentence_count = 143669
    (asoiaf_sentence_count * 19.2) / 1000 / 60
    45.974079999999994
ASOIAF has a lot of sentences to process, and more than 45 minutes of part-of-speech tagging would slow down testing and refactoring. Of course, we could tag everything once and keep working with the result. But then we would have to deal with yet another data structure, and the tagging would have to be redone every time the source text changes. (And that is inevitable.)
Fortunately, we do not need parts of speech to identify character names. This is one of the advantages of choosing ASOIAF for analysis: a huge amount of data has already been collected. Let's grab some of it.
Existing information
A Wiki of Ice and Fire was a great help here. I got an almost exhaustive list of character names by literally copying the page that lists the characters. The result can be found here. If that is enough for you, see you in the next section of the article. For those who are interested in extracting data from a page automatically, I will show a couple of methods that I have used in other projects.
Wget
An excellent utility that is very simple to use when you need to fetch a known set of links. You do not have to think about how to crawl them. Just create a file containing the list and pass it with the -i flag, like this:
wget -i list_of_links.txt
Requests
Python has the requests library, which is well suited to working with individual pages.
    import requests
    r = requests.get("http://awoiaf.westeros.org/index.php/List_of_characters")
    html = r.text
    print html[:100]
    <!DOCTYPE html>
    <html lang="en" dir="ltr" class="client-nojs">
    <head>
    <meta charset="UTF-8" />
    <title
Parsing
Once the html is downloaded, we need to strip the unnecessary tags from the page to get at the links. BeautifulSoup is an HTML parser that lets you get the links without any fuss. After installing it and parsing the page, you can find all the links just by running:
parsed_html.find_all("a")
You can read more about it here.
I would like to describe another way, which uses the lxml library. With this library you can work with XPath. I am new to XPath, but it is a powerful way to navigate tree structures.
    import lxml.html
    tree = lxml.html.fromstring(html)
    character_names = tree.xpath("//ul/li/a[1]/@title")
    print character_names[:5]
    ['Abelar Hightower', 'Addam', 'Addam Frey', 'Addam Marbrand', 'Addam Osgrey']
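If the XPath expression above looks cryptic, here is what each step does:
    //ul      every unordered list in the document
    /li       each list item inside it
    /a[1]     the first link in each item
    /@title   the title attribute of that link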
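The standard library's `xml.etree.ElementTree` supports a limited subset of XPath, which is enough to try the same path on a tiny hand-written page. This is a sketch under two assumptions: the markup below is invented (only the character names come from the output above), and ElementTree has no `/@title` step, so the attribute is read with `.get("title")` instead.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed stand-in for the wiki page.
page = """<html><body><ul>
<li><a href="/Addam_Frey" title="Addam Frey">Addam Frey</a>
    of <a href="/House_Frey" title="House Frey">House Frey</a></li>
<li><a href="/Addam_Marbrand" title="Addam Marbrand">Addam Marbrand</a></li>
</ul></body></html>"""

root = ET.fromstring(page)

# .//ul/li/a[1] selects the first <a> child of every list item.
first_links = root.findall(".//ul/li/a[1]")
titles = [link.get("title") for link in first_links]
print(titles)
```

Taking only the first link of each item is what keeps house names and other secondary links out of the list of characters.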
Next we need to pick the names out of the result and throw away anything that is not a name. Just skimming the ASOIAF page, I noticed entries such as "of Myr". We do not want the model to match the particle "of" against dialogue.
NLTK helps here. It has a corpus of "bad" words, the stopwords: words so common that they tell you nothing about a text.
    particles = ' '.join(character_names).split(" ")
    print len(set(particles))
    stopwords = nltk.corpus.stopwords.words('english')
    print stopwords[:5]
    particles = set(particles) - set(stopwords)
    print len(particles)
    2167
    ['i', 'me', 'my', 'myself', 'we']
    2146
    True
Finally, we need to add the nicknames that were missed, such as Dany, Blackfish, and Joff. Once you are happy with the list of names, save it to a file for later use.
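The whole clean-up step can be sketched with plain sets. This is a simplified illustration: the three names and the tiny stopword set are stand-ins for the scraped list and for NLTK's English stopwords, and the file name is hypothetical (the article's own file is `asoiaf_name_particles.txt`).

```python
import io
import os
import tempfile

character_names = ["Addam Frey", "Addam Marbrand", "Thoros of Myr"]
stopwords = {"of", "the", "a"}             # stand-in for NLTK's list
nicknames = {"Dany", "Blackfish", "Joff"}  # added by hand

# Split full names into single words, drop stopwords, add nicknames.
particles = set(" ".join(character_names).split(" "))
particles -= stopwords
particles |= nicknames

# Save one particle per line, then read the file back.
path = os.path.join(tempfile.mkdtemp(), "name_particles_demo.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"\n".join(sorted(particles)) + u"\n")
with io.open(path, encoding="utf-8") as handle:
    reloaded = set(line.rstrip("\n") for line in handle)

print(sorted(reloaded))
```

Storing one particle per line keeps the file easy to edit by hand when another missing nickname turns up.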
Finding names. Part 2
We have given up on finding names through part-of-speech tagging and obtained a list of names instead. We will extract sequences of tokens and check whether we can find them in that list. Time to write some code at last.
    import itertools
    from operator import itemgetter

    particles = [particle.rstrip('\n') for particle in open('asoiaf_name_particles.txt')]
    tokens = [u'\u201c', u'Such', u'eloquence', u',', u'Gared', u',\u201d', u'Ser', u'Waymar', u'observed', u'.']

    def roll_call(tokens, particles):
        speakers = {}
        particle_indices = [i for (i, w) in enumerate(tokens) if w in particles]
        for k, g in itertools.groupby(enumerate(particle_indices), lambda (i,x): i-x):
            index_run = map(itemgetter(1), g)
            speaker_name = ' '.join(tokens[i] for i in index_run)
            speakers[min(index_run)] = speaker_name
        return speakers
This function uses a lambda expression, something I could not do when I first worked on this project last year. The script I used back then was so awful and unreadable that I never dared to publish it. Besides, I think a beginner can learn something new from this script, so I will explain it in a little more detail.
Itertools is a tool that deserves attention. I often use it to flatten nested lists or to generate permutations. What we need from it here is the groupby function. At the time of writing I much prefer groupby to dropwhile and takewhile, which I used to call recursively.
While programming, I decided that roll_call should know the positions of the names it finds, so I keep the index of every name token. You can see this in the third line of the function:
particle_indices = [i for (i, w) in enumerate(tokens) if w in particles]
Enumerate was a great help when I was getting to know Python. It takes a list and, for each element, returns a pair made of the element's index and the element itself.
The fourth line is the trickiest piece of code in this whole tutorial, and I did not write it. It is taken straight from the library documentation.
    for k, g in itertools.groupby(enumerate(particle_indices), lambda (i,x): i-x):
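Groupby walks through the list and groups elements according to the result of the lambda function. Lambdas are anonymous functions. Unlike roll_call, they do not have to be defined in advance. A lambda is just a piece of code that takes arguments and returns a value. In this case it simply subtracts the number from its index.
Let's see how it works.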
    print tokens
    particle_indices = [i for (i, w) in enumerate(tokens) if w in particles]
    print particle_indices
    for index, location in enumerate(particle_indices):
        lambda_function = index-location
        print "{} - {} = {}".format(index, location, lambda_function)
    [u'\u201c', u'Such', u'eloquence', u',', u'Gared', u',\u201d', u'Ser', u'Waymar', u'observed', u'.']
    [4, 6, 7]
    0 - 4 = -4
    1 - 6 = -5
    2 - 7 = -5
This is the trick behind groupby: the indices are numbered consecutively, so if the items in the list are also consecutive, the lambda returns the same result for each of them.
groupby sees -4 and puts the value 4 in a group of its own. Items 6 and 7 both give -5 and are grouped together.
Now we know where the multi-word names are, and we need to use them. What does groupby return? A key, which is the result of our lambda, and the group itself, a grouper object. We then use map to apply itemgetter(1), which extracts an element from a tuple, to every element of the group. That gives us a list of the name's positions in the original token list.
After groupby, all that is left is to extract the names we found and store them in the speakers dictionary.
    roll_call(tokens, particles)
    {4: u'Gared', 6: u'Ser Waymar'}
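The tuple-unpacking lambda `lambda (i,x): i-x` is Python 2 syntax and is rejected by Python 3. For readers on Python 3, here is a sketch of the same consecutive-run grouping that indexes into the pair instead. The particle set is reduced to the three words needed for the example; everything else follows the function above.

```python
from itertools import groupby


def roll_call(tokens, particles):
    """Map the index of each name's first token to the full name."""
    speakers = {}
    particle_indices = [i for i, w in enumerate(tokens) if w in particles]
    # Consecutive indices share the same (position in list - index) value.
    runs = groupby(enumerate(particle_indices),
                   key=lambda pair: pair[0] - pair[1])
    for _, group in runs:
        index_run = [token_index for _, token_index in group]
        speakers[index_run[0]] = " ".join(tokens[i] for i in index_run)
    return speakers


tokens = [u"\u201c", u"Such", u"eloquence", u",", u"Gared", u",\u201d",
          u"Ser", u"Waymar", u"observed", u"."]
result = roll_call(tokens, {u"Gared", u"Ser", u"Waymar"})
print(result)
```

Because `particle_indices` is already in ascending order, the first index of each run is also its minimum, so `index_run[0]` plays the role of `min(index_run)`.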
Optimization
Let's compare the speed of this function with the part-of-speech approach.
    100 loops, best of 3: 3.85 ms per loop
Not bad, 5-6 times faster. But we can improve on that by using a set. A set checks almost instantly whether an item is a member.
    set_of_particles = set(particle.rstrip('\n') for particle in open('asoiaf_name_particles.txt'))
    %timeit roll_call(tokens, set_of_particles)
    10000 loops, best of 3: 22.6 µs per loop
When Greek letters show up in your timings, you know you are doing well.
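The difference between list and set membership can be measured outside IPython with the standard `timeit` module. This is a rough sketch: the 2,000 synthetic entries only imitate the size of the particle list, and the absolute timings depend on the machine, so no particular numbers are claimed.

```python
import timeit

# Roughly as many entries as the particle list in the article.
names_list = ["name%d" % i for i in range(2000)]
names_set = set(names_list)

# Worst case for the list: the item we look for is at the very end.
probe = "name1999"

list_time = timeit.timeit(lambda: probe in names_list, number=2000)
set_time = timeit.timeit(lambda: probe in names_set, number=2000)

print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

A list has to be scanned element by element, while a set uses hashing, so the set lookup time barely changes as the collection grows.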
Finding names around dialogue
Now we need to write a program that calls the function above in the right places, so that it finds character names before, inside, and after the text of the dialogue. We will wrap all of this in a class that collects the complete set of names for us. We then pass that to another algorithm that extracts the features, and the result goes to CRFsuite.
But first I would like to tidy up our data.
XML parser
After the success of the one-line XPath command, I decided to write an XML parser for our text files. Choosing this format makes a lot of sense. ASOIAF consists of several books, which contain chapters, which consist of paragraphs, and some of those contain dialogue that we need to mark unobtrusively. If I had not converted the text to XML (and at first I had not), the labels would have littered the text itself.
I would rather say nothing about the script below. It reminds me of my first steps in Python: giant functions, crutches, and variables with very long names.
    from lxml import etree
    import codecs
    import re

    def ASOIAFtoXML(input):
        # [...]
The gist of the code above: we use lxml to build a tree and then go through the text line by line. If a line is recognized as a chapter title (capital letters, punctuation, spaces), we add a new chapter to the node of the current book. Once we are inside the text of a chapter, we work through the paragraphs, using another regular expression to decide whether dialogue is being spoken, and add it to the corresponding dialogue node. Of course, the dialogue has to be tagged beforehand.
An interesting note about XML. It is a hierarchical structure, so by its nature it demands strict branching, one node inside another. Prose is not like that. In prose, dialogue sits inside the surrounding text. lxml offers a solution: text and tail. An XML node stores its text, but that text is cut off as soon as the next node is added.
    markup = '''<paragraph>Worse and worse, Catelyn thought in despair. My brother is a fool. Unbidden, unwanted, tears filled her eyes. <dialogue speaker="Catelyn Stark"> “If this was an escape,”</dialogue> she said softly, <dialogue speaker="Catelyn Stark">“and not an exchange of hostages, why should the Lannisters give my daughters to Brienne?”</dialogue></paragraph>'''
    graf = etree.fromstring(markup)
    print graf.text
    Worse and worse, Catelyn thought in despair. My brother is a fool. Unbidden, unwanted, tears filled her eyes.
    print graf[0].text
    “If this was an escape,”
What happens to the remaining "she said softly"? It is stored in the node's tail attribute.
    print graf[0].tail
    she said softly,
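The `text` and `tail` attributes behave the same way in the standard library's `xml.etree.ElementTree`, so the idea can be tried without lxml. The sketch below uses an abridged version of the paragraph above; the markup is encoded to bytes before parsing so that the curly quotes are handled consistently.

```python
import xml.etree.ElementTree as ET

markup = (
    u'<paragraph>Tears filled her eyes. '
    u'<dialogue speaker="Catelyn Stark">\u201cIf this was an escape,\u201d</dialogue>'
    u' she said softly, '
    u'<dialogue speaker="Catelyn Stark">\u201cand not an exchange of hostages?\u201d</dialogue>'
    u'</paragraph>'
)
graf = ET.fromstring(markup.encode("utf-8"))

# Text before the first child node belongs to the paragraph itself.
narration = graf.text
# Text inside a dialogue node is that node's .text ...
first_line = graf[0].text
# ... and whatever follows it, up to the next node, is its .tail.
after_first_line = graf[0].tail

# Collect every non-empty tail, and rebuild the full paragraph text.
tails = [d.tail for d in graf.iter("dialogue") if d.tail is not None]
full_text = u"".join(graf.itertext())
print(tails)
```

The narration that names the speaker usually lives in a tail, which is why this layout makes the search for speakers so much easier.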
And so on: the remaining text is attached to each dialogue node. As a result, finding the speaker of a piece of dialogue becomes much simpler when we need to do it. And we need to do it now!
    class feature_extractor_simple:
        """Analyze dialogue features of a paragraph. Paragraph should be an lxml node."""
        def __init__(self, paragraph_node, particles, tag_distance=0):
            self.paragraph = paragraph_node
            self.particles = set(particles)
            self.tag_distance = tag_distance
            self.raw = ''.join(t for t in self.paragraph.itertext())
            self.tokens = self.tokenize(self.raw)

        def tokenize(self, string):
            return nltk.wordpunct_tokenize(string)

        def find_speakers(self, tokens):
            speakers = {}
            particle_indices = [i for (i, w) in enumerate(tokens) if w in self.particles]
            for k, g in itertools.groupby(enumerate(particle_indices), lambda (i,x): i-x):
                index_run = map(itemgetter(1), g)
                speaker_name = ' '.join(tokens[i] for i in index_run)
                speakers[min(index_run)] = speaker_name
            return speakers

        def pre_speak(self, prior_tag="FN", near_tag="NN"):
            # [...]
A few words about these functions. If you are new to Python, do not be afraid of classes. You write ordinary functions and pass self as an argument, which tells Python which object the function is currently working on. A class is something like a clone factory, and objects are the clones. All clones share the same DNA, that is, the same methods and variables, but their personalities differ because of their life experience. In this context, life experience is the data passed to them. Classes also have a special function, __init__, which lets you initialize an object's variables.
Now you can relax. Your data is in the hands of a dedicated class, and because you have abstracted its behavior away, you can get the processed information with a snap of your fingers.
    paragraph = tree.xpath(".//paragraph")[32]
    example_extractor = feature_extractor_simple(paragraph, particles)
    print example_extractor.raw
    print example_extractor.pre_speak()
    print example_extractor.dur_speak()
    print example_extractor.post_speak()
    “Such eloquence, Gared,” Ser Waymar observed. “I never suspected you had it in you.”
    {}
    {'ADR 0': u'Gared'}
    {'PS 0': 'Ser Waymar'}
In case the behavior of some of these functions is confusing, I will briefly explain what they do. If everything above already makes sense to you, you know what to do: see you in the next section.
There is some awkward juggling with dictionaries, because dictionaries in Python are unordered. It reminds me of the feeling of leaving the house, finding no keys in my pocket, and realizing the door is already locked. In some cases we have to keep checking whether we got the first name or the last one, so we look at the keys and pick the minimum or the maximum.
pre_speak
As I said above, the text attribute holds all the text up to the first line of dialogue. We just need to find the character names in it.
dur_speak
When the name is inside the dialogue itself, which may consist of several lines, we need to look through all of them:
    for dialogue in self.paragraph.itertext("dialogue", with_tail=False)
The itertext function in lxml returns all the text of a node. We also set the flag with_tail=False so that only the nodes' own text is searched, without their tails, which means only the text of the dialogue. To find the name of the character being addressed, we select only names that are set off by commas. That lets us find the addressee (for example, "Ned, promise me." / "Promise me, Ned.").
My gut feeling is that a name found inside dialogue very often belongs to the person who answers in the next paragraph, so the addressee is overwritten with the last name mentioned.
post_speak
For this function we only need the first name after the dialogue, so we break out of the loop as soon as we find one. The function looks at the first two tokens after the closing quotation mark. That is how it catches dialogue such as:
    “Perhaps,” Jon said.
A tip for beginner programmers: you can call a function inside a list comprehension.
    tails = [line.tail for line in self.paragraph.iterfind("dialogue") if line.tail is not None]
This gets the text following each piece of dialogue in a single line. (The condition simply drops every result that has no tail.)
CRFsuite
This is probably the part that interests you most. It is where the conditional random fields live, it is launched from the command line, and there is no way to look inside and see how it works. In practice, though, CRFsuite is the simple and pleasant part. While writing this article I discovered that it has a Python library, but for now we will not complicate things and will use the executable from the command line. (I plan to update the model when the next book, The Winds of Winter, sees the light of day. But we have a couple of years before that happens.)
All CRFsuite needs is text with tab-separated features, like this:
    FN 0	Graf Sent Len=4	FN 1=True	FN -2=True	FN 0=True	NN 1=True
This is the format of the training data. The first field is the correct answer. Everything after it is a feature. Features can look however you like, but do not use colons in them: colons are reserved for weighted features, so they can lead to misinterpretation.
Open a command line in the folder that contains crfsuite.exe and type the following:
    crfsuite learn -m asoiaf.model train.txt
This creates the model, which is the brain of the whole thing. You can call it whatever you like. I called mine asoiaf. To check the accuracy of the model, type:
    crfsuite tag -qt -m asoiaf.model test.txt
To actually run the model for tagging, type:
    crfsuite tag -m asoiaf.model untagged.txt
untagged.txt should look like train.txt, but without the correct answer at the start of each line, that is, something like this:
    NN -1=True	FN 0=True	FN 2=True	FN -1=True	NN 0=True
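A small helper makes it harder to produce malformed lines by accident. The sketch below builds one tab-separated line from a label and a list of features, and rejects feature names that contain a colon or a tab. The function name `format_item` is hypothetical; it only reflects the format described above and does not call CRFsuite.

```python
def format_item(features, label=None):
    """Build one tab-separated line: optional label first, then features."""
    for name in features:
        # A colon would be read as the start of a feature weight,
        # and a tab would split the feature in two.
        if ":" in name or "\t" in name:
            raise ValueError("unsafe character in feature: %r" % name)
    fields = list(features)
    if label is not None:
        fields.insert(0, label)
    return "\t".join(fields)


train_line = format_item(["Graf Sent Len=4", "FN 1=True", "NN 1=True"],
                         label="FN 0")
untagged_line = format_item(["NN -1=True", "FN 0=True"])
print(repr(train_line))
print(repr(untagged_line))
```

Writing the training and the untagged files through the same function keeps the two formats in step, with the label as the only difference.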
You can find the details here.
Let's play with the many features that could improve the model's accuracy. We start with the simplest ones: boolean values that say where the positional labels occur inside a paragraph and near it.
Here, once again, is our feature extraction class, now with a few new functions added at the start.
    class feature_extractor:
        """Analyze dialogue features of a paragraph. Paragraph should be an lxml node."""
        def __init__(self, paragraph_node, particles, tag_distance=0):
            self.paragraph = paragraph_node
            self.particles = set(particles)
            self.tag_distance = tag_distance
            self.raw = ''.join(t for t in self.paragraph.itertext())
            self.tokens = self.tokenize(self.raw)
            self.speaker = self.xpath_find_speaker()

        def features(self):
            features = {}
            features.update(self.pre_speak())
            features.update(self.dur_speak())
            features.update(self.post_speak())
            return features

        def local_features(self):
            # [...]
    And in their hands, swords.
    {}
    ['NoQuotes=True', 'Short Graf=True', 'Little Talk=True']
    ['PS 0=False', 'FN 0=False', 'NN 0=False', 'ADR 0=False']
Last night, in a fit of undocumented machine learning madness, I tried to improve a lot of features. Below are some of the drafts that are fit to publish.
Option 1: true positional booleans only
    Label   Count  Recall
    PS 0    207    0.9949
    FN 0    185    0.95
    NULL    118    0.3492
    OTHER   56     0.3939
    PS -2   44     0.5238
    Item accuracy: 430 / 678 (0.6342)
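For readers who want to see how numbers like these are computed, here is a sketch of per-label precision, recall, and F1, plus item accuracy, on a tiny invented list of gold and predicted labels. The data and the function name are illustrative only; CRFsuite produces its own report when run with `tag -qt`.

```python
def precision_recall_f1(gold, predicted, label):
    """Per-label precision, recall and F1 from two parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["PS 0", "PS 0", "FN 0", "NULL", "NULL", "NULL"]
pred = ["PS 0", "FN 0", "FN 0", "NULL", "PS 0", "NULL"]

ps0 = precision_recall_f1(gold, pred, "PS 0")
null = precision_recall_f1(gold, pred, "NULL")
item_accuracy = sum(1 for g, p in zip(gold, pred) if g == p) / float(len(gold))

# The item accuracy in the table above is the same kind of ratio.
reported = round(430 / 678.0, 4)
print(ps0, null, item_accuracy, reported)
```

Recall answers "how many of the true PS 0 paragraphs did the model find", while precision answers "how many of the paragraphs it called PS 0 really were PS 0".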
You will see a lot of statistics like these from here on, so let's work out what they mean straight away.
Imagine we are having dinner and watching people go by. I ask you to decide whether a random passerby is a member of the Illuminati. As someone who believes wholeheartedly in conspiracy theories, you finish your dumpling and start tagging passersby.
Precision, a value that is not shown here, reflects how often type I errors occur. In other words, it shows how often you mistakenly count someone among the Illuminati.
Recall measures how many of the labels in the validation data the model identified correctly.
F1 is a combination of the two. You can see that if you classify everybody as Illuminati, you guarantee maximum recall and miserable precision.
Since every item gets a label, I am not very interested in the model's precision. I need recall and accuracy.
In the first version of the features, I only considered the true boolean values. That is, for the paragraph above, all the features had the form "ADR 0=True" and "PS 0=True". The item accuracy was 63.4%.
Is 63.4% good? Given that NULL, PS 0, and FN 0 make up three quarters of the test data, and that they are naturally easy to find, we can definitely do better. Next we add the remaining positional booleans, the false ones.
Option 2: all positional booleans
    Label   Count  Recall
    NULL    254    0.9048
    PS 0    204    0.9899
    FN 0    149    0.975
    OTHER   24     0.2273
    PS -2   19     0.2857
    Item accuracy: 515 / 678 (0.7596)
Now the simple cases are identified almost perfectly and we get a decent accuracy. 75% means that you only need to tag the first book, A Game of Thrones, and a third of A Clash of Kings, and the model will label three quarters of the rest by itself. That takes a few hours, which is reasonable. Still, there is no reason not to identify NULL with 98%+ recall, so let's add a feature aimed at that.
Option 3: quotation marks?
    Label   Count  Recall
    PS 0    218    0.9907
    NULL    180    0.9119
    FN 0    167    0.9118
    OTHER   63     0.3784
    PS 2    25     0.5
    Item accuracy: 550 / 710 (0.7746)
Here we count the number of opening quotation marks in the paragraph. I must say I am surprised that NULL did not become more accurate. This needs more work. Next I would like to improve FN 0.
Option 4: index of the name?
    Label   Count  Recall
    PS 0    218    0.9907
    NULL    183    0.9057
    FN 0    157    0.8971
    OTHER   68     0.4189
    PS -2   23     0.5484
    Item accuracy: 551 / 710 (0.7761)
This feature contains the index of the name. Hmm... perhaps that is too complicated. Let's go back to booleans.
Option 5: name at index 0? + redundancy
    Label   Count  Recall
    PS 0    216    0.986
    FN 0    166    0.9265
    NULL    160    1
    OTHER   85     0.5811
    PS 2    32     0.7143
    Item accuracy: 578 / 710 (0.8141)
Eureka! I had been counting the opening quotation marks incorrectly, and that was ruining the results. As soon as I fixed it, NULL was identified perfectly.
There are no easy ways left to improve the model. From now on I really have to get inventive to improve the result any further. Let's see whether this works...
Option 6: post speaker (PS) ±2
If the post speaker is two paragraphs above or below the current one, we add a boolean for it. In theory, this should raise the result for PS -2.
    Label   Count  Recall
    PS 0    216    0.986
    FN 0    166    0.9265
    NULL    160    1
    OTHER   84     0.5676
    PS 2    32     0.7143
    Item accuracy: 578 / 710 (0.8141)
No effect!
Option 7: sequences??
    Label   Count  Recall
    PS 0    217    0.986
    FN 0    168    0.9265
    NULL    160    1
    OTHER   82     0.5541
    PS 2    30     0.6429
    Item accuracy: 576 / 710 (0.8113)
    Instance accuracy: 56 / 142 (0.3944)
Wait! It turns out that CRFs can work with sequences. In fact, that is their whole point. I had been ignoring the instance accuracy value because it was always 0/1, which meant that the model was treating the whole text as one long conversation.
Excuse me, I need to give myself a slap. Assuming that sequences improve the accuracy, which is still an open question, how do we use this capability? I tried setting the length of every sequence to five paragraphs, but that did not seem right. Perhaps two consecutive NULLs should end a sequence, on the assumption that the conversation is over.
After playing with this, I was unable to build a model that works on conversations. As I understand it, the model needs many special transition weights that depend on the position in the sequence. It would then make different decisions depending on whether we are at the start, in the middle, or at the end of a conversation. But nothing in the model's behavior suggests that this is happening. I will experiment a little with other features in the near future.
Oh yes, let's look at the script that generates the training and test data. It is not optimized: it computes the features of every paragraph five times. I will leave it as it is for this article, but note that you could speed it up by using one loop to store each paragraph's boolean features and another to add them to the neighboring paragraphs.
    tree = ASOIAFtoXML([{"title": "ASOIAF", "contents": "corpus/train_asoiaf_pos_tagged.txt"}])
    paragraphs = tree.xpath(".//paragraph")

    def prep_test_data(paragraphs):
        max_index = len(paragraphs)
        results = []
        for index, paragraph in enumerate(paragraphs):
            extractor = feature_extractor(paragraph, set_of_particles)
            all_features = extractor.local_features() + extractor.feature_booleans()
            for n in [-2, -1, 1, 2]:
                if 0 <= n+index < max_index:
                    neighbor_features = feature_extractor(paragraphs[index + n], set_of_particles, tag_distance = n).feature_booleans()
                    if neighbor_features:
                        all_features += neighbor_features
            all_features.insert(0, extractor.speaker)
            results.append("\t".join(all_features))
        return results

    results = prep_test_data(paragraphs)

    max_index = len(results)
    with codecs.open(r"new_test.txt", "w", "utf-8") as output:
        for line in results[:int(max_index/2)]:
            output.write(line + '\n')
    with codecs.open(r"new_train.txt", "w", "utf-8") as output:
        for line in results[int(max_index/2):]:
            output.write(line + '\n')
Other features
I tried several other features:
- Counting the names that occur before the first line of dialogue. In theory this is where NN matters most. No result.
- A feature marking whether the paragraph is dialogue entirely or only in part. It was supposed to help with PS -2 and FN -2, but the difference was insignificant.
- Short/long paragraphs. Meh.
- "And" or "but" in the text before the dialogue (an attempt to focus on the neglected NN 0).
I thought the last one was a pretty clever move, but it did not work, and I never got an accuracy above 81%.
When I swapped the training and validation data, I got 84%. You should not spend a lot of time tuning features for one specific data set, because that leads to overfitting. In fact, shuffling the training and test data together is a good idea. I had not shuffled them because I thought it would break the sequences. But since we do not use sequences, why not? Let's shuffle.
With lightly shuffled data I got 82%.
Okay! I think I have reached the limit of my skill.
To be continued?
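Shuffling before splitting takes only a few lines with the standard `random` module. The sketch below is an assumption about how one might do it, not the author's script: the function name, the fixed seed, and the 50/50 split are illustrative choices (the 50/50 split mirrors the halves written by the script above).

```python
import random


def shuffled_split(lines, test_fraction=0.5, seed=0):
    """Shuffle a copy of the lines and split it into (train, test)."""
    rng = random.Random(seed)   # fixed seed makes the split repeatable
    shuffled = list(lines)      # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]


lines = ["item %d" % i for i in range(10)]
train, test = shuffled_split(lines)
train_again, test_again = shuffled_split(lines)
print(len(train), len(test))
```

Shuffling discards the order of the paragraphs, so it only makes sense as long as the model does not rely on sequences, which is the case here.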
Finally, let's sum up and talk about what could be done next.
- [...] 700 [...] 40000 [...] 1.7% [...] 80%. ([...] 80%, [...] 75%.) [...] 10000? [...] ADR [...] 700 [...]
- CRFsuite [...]
- [...]
- [...]
- Python [...]
- [...] OTHER [...]
- [...]
Conclusion
That's all! I hope this turns out to be useful to someone. Thank you for reading, and if you want to get in touch with me, I am on Twitter.
I would also like to mention that all of the above was done for a large-scale critical study of Game of Thrones. If you are a fan of these books and would like to read the analysis that the dialogue labels made possible, I will be publishing it all soon.