Natural language processing is used just about everywhere today, except perhaps in the most conservative sectors. Most technical solutions have long since incorporated recognition and processing of human language. That is why the usual IVR with hard-coded response options is gradually becoming a thing of the past, chatbots are starting to communicate sensibly without a live operator, mail filters are working at full strength, and so on. And what about recognition of recorded speech, that is, text? What, in fact, underlies modern recognition and processing techniques? This adapted translation is a good answer: it will help you close the gap in the basics of NLP. Happy reading!
What is Natural Language Processing?
Natural language processing (hereafter NLP) is a subfield of computer science and AI devoted to how computers analyze natural (human) language. NLP makes it possible to apply machine learning algorithms to text and speech.
For example, NLP can be used to build systems such as speech recognition, document summarization, machine translation, spam detection, named entity recognition, question answering, autocompletion, predictive text input, and so on.
Today, many of us have smartphones with speech recognition — they use NLP to understand our speech. Many people also use laptops with speech recognition built into the operating system.
Examples
Cortana

Windows has a virtual assistant, Cortana, that recognizes speech. With Cortana you can create reminders, open applications, send letters, play games, check the weather, and so on.
Siri

Siri is the assistant in Apple's operating systems: iOS, watchOS, macOS, HomePod, and tvOS. Many functions also work by voice control: call someone, write a letter or send an email, set a timer, take a photo, and so on.
Gmail

The well-known email service can detect spam so that it doesn't reach your inbox.
Dialogflow

A Google platform for building NLP bots. For example, you can create a pizza-ordering bot that doesn't need the old-fashioned IVR to take your order.
The NLTK Python library

NLTK (Natural Language Toolkit) is the leading platform for building NLP programs in Python. It offers easy-to-use interfaces to many language corpora, as well as libraries for word processing: for classification, tokenization, stemming, markup, filtering, and semantic reasoning. It is also a free, open-source project developed with the help of the community.
We will use this tool to show the basics of NLP. All the examples below assume that NLTK has already been imported, which can be done with:

```python
import nltk
```
The Basics of NLP for Text
In this article we will cover the following topics:
- Sentence tokenization;
- Word tokenization;
- Lemmatization and stemming;
- Stop words;
- Regular expressions;
- Bag of words;
- TF-IDF.
1. Sentence Tokenization
Sentence tokenization (sometimes called segmentation) is the process of splitting written language into its component sentences. The idea looks quite simple: in English and some other languages, we can split off a sentence each time we find a certain punctuation mark, the period.

But even in English this task is not trivial, since the period is also used in abbreviations. A table of abbreviations helps a great deal in avoiding misplaced sentence boundaries during text processing. In most cases libraries are used for this, so you don't really need to worry about the implementation details.
Example: take a short text about the game of backgammon:
Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.
To tokenize sentences with NLTK, you can use the nltk.sent_tokenize method:
```python
text = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()
```
At the output, we get three separate sentences:
Backgammon is one of the oldest known board games.

Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.

It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.
2. Word Tokenization
Word tokenization (sometimes called segmentation) is the process of splitting sentences into their component words. In English and many other languages that use some version of the Latin alphabet, the space is a good word separator.

However, problems can arise if we rely on spaces alone: in English, compound nouns are written in different ways and are sometimes separated by a space. Here again, libraries help us.
Example: take the sentences from the previous example and apply the nltk.word_tokenize method to them:
```python
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)
    print()
```
Output:
['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']

['Its', 'history', 'can', 'be', 'traced', 'back', 'nearly', '5,000', 'years', 'to', 'archeological', 'discoveries', 'in', 'the', 'Middle', 'East', '.']

['It', 'is', 'a', 'two', 'player', 'game', 'where', 'each', 'player', 'has', 'fifteen', 'checkers', 'which', 'move', 'between', 'twenty-four', 'points', 'according', 'to', 'the', 'roll', 'of', 'two', 'dice', '.']
3. Lemmatization and Stemming
A text usually contains different grammatical forms of the same word, and it may also contain cognate words. Lemmatization and stemming aim to reduce all the word forms encountered to a single, normal dictionary form.
Example: reducing different word forms to one:
dog, dogs, dog's, dogs' => dog
The same, but applied to a whole sentence:
the boy's dogs are different sizes => the boy dog be differ size
Lemmatization and stemming are both special cases of normalization, but they differ.
Stemming is a crude heuristic process that cuts the "excess" off the root of a word, which often leads to the loss of derivational suffixes.
Lemmatization is a more subtle process that uses a vocabulary and morphological analysis to ultimately reduce a word to its canonical form, the lemma.
The difference is that a stemmer (a specific implementation of a stemming algorithm — translator's note) operates without knowledge of context and, accordingly, does not understand the difference between words whose meaning depends on the part of speech. But stemmers have their advantages too: they are easier to implement and they run faster. Plus, the lower "accuracy" may not matter in some cases.
Examples:

- The word "good" is the lemma of the word "better". A stemmer will not see this connection, because it requires a dictionary lookup.
- The word "play" is the base form of the word "playing". Both stemming and lemmatization handle this.
- The word "meeting" can be either the normal form of a noun or a form of the verb "to meet", depending on the context. Unlike stemming, lemmatization tries to choose the correct lemma based on context.
Now that we know the difference, let's look at an example:
```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

def compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word, pos):
    """
    Print the results of stemming and lemmatization using the passed stemmer, lemmatizer, word and pos (part of speech)
    """
    print("Stemmer:", stemmer.stem(word))
    print("Lemmatizer:", lemmatizer.lemmatize(word, pos))
    print()

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "seen", pos = wordnet.VERB)
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "drove", pos = wordnet.VERB)
```
Output:
Stemmer: seen
Lemmatizer: see

Stemmer: drove
Lemmatizer: drive
4. Stop Words
Stop words are words that are thrown out of the text before or after text processing. When machine learning is applied to text, such words can add a lot of noise, so irrelevant words have to be removed.

Stop words are usually understood to mean articles, interjections, conjunctions, and so on, which carry no semantic weight. Keep in mind that there is no universal list of stop words; everything depends on the specific case.
NLTK ships with a predefined list of stop words. Before the first use you need to download it: nltk.download("stopwords"). After downloading, you can import the stopwords package and look at the words themselves:
```python
from nltk.corpus import stopwords
print(stopwords.words("english"))
```
Output:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Now let's look at how to remove stop words from a sentence:
```python
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
without_stop_words = [word for word in words if not word in stop_words]
print(without_stop_words)
```
Output:
['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']
If you are not yet comfortable with list comprehensions, it's worth reading up on them. Here is another way to get the same result:
```python
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
without_stop_words = []
for word in words:
    if word not in stop_words:
        without_stop_words.append(word)

print(without_stop_words)
```
Keep in mind, though, that list comprehensions tend to be faster because they are optimized: the interpreter detects a predictable pattern during the loop.

You may ask why we converted our list into a set. A set is an abstract data type that stores unique values in no particular order. Searching a set is much faster than searching a list. For a small number of words this makes no difference, but if you are dealing with a large number of words, using sets is strongly recommended. If you want to learn a bit more about how long various operations take, have a look at a time-complexity cheat sheet for Python data structures.
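The list-vs-set difference is easy to measure with a membership test. A minimal sketch with a synthetic word list (the words and sizes are arbitrary):

```python
import timeit

# Build a list and a set with the same 10,000 synthetic words
words = [f"word{i}" for i in range(10_000)]
word_list = list(words)
word_set = set(words)

# Membership in a list is O(n); in a set it is O(1) on average.
# Looking up the last element 1,000 times shows the gap clearly.
t_list = timeit.timeit(lambda: "word9999" in word_list, number=1_000)
t_set = timeit.timeit(lambda: "word9999" in word_set, number=1_000)
print(f"list: {t_list:.4f}s, set: {t_set:.4f}s")
```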
5. Regular Expressions
A regular expression (regex, regexp) is a sequence of characters that defines a search pattern. For example:
- . — any character except a newline;
- \w — any word character;
- \d — any digit;
- \s — any whitespace character;
- \W — any non-word character;
- \D — any non-digit;
- \S — any non-whitespace character;
- [abc] — matches any of the specified characters: a, b, or c;
- [^abc] — matches any character other than those specified;
- [a-g] — matches a character in the range from a to g.
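A few of these character classes in action with Python's re module; the sample string is arbitrary:

```python
import re

text = "NLP in 2019: tokens, stems & lemmas!"

# \d matches single digits
print(re.findall(r"\d", text))     # ['2', '0', '1', '9']

# \w+ matches runs of word characters
print(re.findall(r"\w+", text))    # ['NLP', 'in', '2019', 'tokens', 'stems', 'lemmas']

# [a-g] matches single characters in the range a-g
print(re.findall(r"[a-g]", text))

# \s matches whitespace; here every space is replaced with an underscore
print(re.sub(r"\s", "_", text))
```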
An excerpt from the Python documentation:

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's use of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string.

The solution is to use Python's raw string notation for search patterns: backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.
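The raw-string behavior described in the excerpt is easy to verify directly:

```python
# r"\n" keeps the backslash and the letter n: two characters
raw = r"\n"
print(len(raw))    # 2
print(list(raw))   # ['\\', 'n']

# "\n" is a single newline character
plain = "\n"
print(len(plain))  # 1
```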
Regular expressions can be used to filter text further. For example, you can remove all non-word characters. In many cases punctuation is not needed, and it is easy to remove with regexes.

The re module in Python provides regular expression operations. We can use the re.sub function to replace everything that matches a search pattern with a specified string. This is how all non-word characters can be replaced with spaces:
```python
import re
sentence = "The development of snowboarding was inspired by skateboarding, sledding, surfing and skiing."
pattern = r"[^\w]"
print(re.sub(pattern, " ", sentence))
```
Output:
'The development of snowboarding was inspired by skateboarding sledding surfing and skiing '
Regexes are a powerful tool, and far more complex patterns can be built with them. If you want to learn more about regular expressions, I recommend these two web applications: regex and regex101.
6. Bag of Words
Machine learning algorithms cannot work with raw text directly, so the text has to be converted into sets of numbers (vectors). This is called feature extraction.
Bag of words is a popular and simple feature extraction technique used when working with text. It describes the occurrence of each word within a text.
To use the model, we need to:

- define a vocabulary of known words (tokens);
- choose a measure of the presence of known words.
Any information about the order or structure of words is ignored. That is why it is called a bag of words. The model tries to tell whether a familiar word occurs in a document, but it does not know where exactly it occurs.

The intuition is that similar documents have similar content. Also, thanks to the content, we can learn something about the meaning of a document.
Example: let's walk through the steps of building this model. We use only four sentences to see how the model works. In real life you will deal with far more data.
1. Load the data
Imagine that this is our data and we want to load it as an array:
I like this movie, it's funny. I hate this movie. This was awesome! I like it. Nice one. I love it.
To do this, we read the file and split it into lines:
```python
with open("simple movie reviews.txt", "r") as file:
    documents = file.read().splitlines()

print(documents)
```
Output:
["I like this movie, it's funny.", 'I hate this movie.', 'This was awesome! I like it.', 'Nice one. I love it.']
2. Define the vocabulary
We collect all the unique words from the four loaded sentences, ignoring case, punctuation, and one-character tokens. These words will form our vocabulary (the known words).

To build the vocabulary we can use the CountVectorizer class from the sklearn library. On to the next step.
3. Create document vectors

Next, we need to score the words in each document. The goal at this step is to turn the raw text into a set of numbers, which we then use as input to a machine learning model. The simplest scoring method is to mark the presence of words: put 1 if the word is present and 0 if it is absent.

Now we can create a bag of words using the CountVectorizer class mentioned above:
```python
# Import the libraries we need
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Step 2. Design the Vocabulary
# The default token pattern removes tokens of a single character. That's why we don't have the "I" and "s" tokens in the output
count_vectorizer = CountVectorizer()

# Step 3. Create the Bag-of-Words Model
bag_of_words = count_vectorizer.fit_transform(documents)

# Show the Bag-of-Words Model as a pandas DataFrame
# (in scikit-learn >= 1.0 use get_feature_names_out(); the original get_feature_names() was removed in 1.2)
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(bag_of_words.toarray(), columns = feature_names)
```
Output:

   awesome  funny  hate  it  like  love  movie  nice  one  this  was
0        0      1     0   1     1     0      1     0    0     1    0
1        0      0     1   0     0     0      1     0    0     1    0
2        1      0     0   1     1     0      0     0    0     1    1
3        0      0     0   1     0     1      0     1    1     0    0

These are our sentences. Now we can see how the bag-of-words model works.
A few words about the bag of words
The complexity of this model lies in deciding how to define the vocabulary and how to count word occurrences.
When the vocabulary grows, the document vector grows too. In the example above, the vector length equals the number of known words. In some cases there can be an incredibly large amount of data, and then the vector can consist of thousands or millions of elements. Moreover, each document may contain only a small fraction of the vocabulary words.

As a consequence, the vector representation will contain lots of zeros. Vectors with many zeros are called sparse vectors, and they require more memory and computational resources.
However, we can reduce the number of known words when using this model, in order to lower the demands on computing resources. To do so, we can use the same techniques we already considered before building the bag of words:

- ignoring the case of words;
- ignoring punctuation;
- throwing out stop words;
- reducing words to their base forms (lemmatization and stemming);
- fixing misspelled words.
Another, more sophisticated way to build a vocabulary is to use grouped words. This changes the size of the vocabulary and gives the bag of words more detail about a document. This approach is called an N-gram.
An N-gram is a sequence of some entities (words, letters, numbers, digits, etc.). In the context of language corpora, an N-gram usually means a sequence of words. A unigram is one word, a bigram is a sequence of two words, a trigram is three words, and so on. The number N indicates how many grouped words are in the N-gram. Not all possible N-grams get into the model, only those that appear in the corpus.
Example: consider the following sentence:
The office building is open today
Its bigrams are:

- the office
- office building
- building is
- is open
- open today
As you can see, a bag of bigrams is a more effective approach than a bag of (single) words.
Word scoring

Once the vocabulary is built, the presence of words needs to be scored. We have already considered the simple, binary approach (1 — the word is present, 0 — it is absent).

There are other methods:

- Count. The number of times each word appears in the document is counted.
- Frequency. How often each word occurs in the text is computed (relative to the total number of words).
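Both scoring schemes can be sketched in a few lines of plain Python; the sample tokens here are arbitrary:

```python
from collections import Counter

tokens = "it was awesome and i like it".split()

# Count scoring: how many times each word appears
counts = Counter(tokens)

# Frequency scoring: occurrences relative to the total number of words
total = len(tokens)
freqs = {word: count / total for word, count in counts.items()}

print(counts["it"])           # 2
print(round(freqs["it"], 3))  # 0.286
```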
7. TF-IDF
Frequency scoring has a problem: the words with the highest frequency end up with the highest scores. Yet these words may carry less information gain for the model than rarer words do. One way to fix the situation is to lower the score of a word that frequently occurs across all similar documents. This is called TF-IDF.
TF-IDF (short for term frequency — inverse document frequency) is a statistical measure used to assess the importance of a word in a document that is part of a collection or corpus.
A word's TF-IDF score grows proportionally to its frequency of occurrence in a document, but this is offset by the number of documents that contain the word.
The scoring formula for a word X in a document Y:

tf-idf(X, Y) = tf(X, Y) * idf(X)

(The TF-IDF formula. Source: filotechnologia.blogspot.com/2014/01/a-simple-java-class-for-tfidf-scoring.html)

TF (term frequency) is the ratio of the number of occurrences of a word to the total number of words in a document.

IDF (inverse document frequency) is the inverse of the frequency with which a certain word occurs in the documents of a collection:

idf(X) = log(N / df(X)),

where N is the number of documents in the collection and df(X) is the number of documents containing X.

As a result, the TF-IDF of the word term can be computed as tf-idf(term, doc) = tf(term, doc) * idf(term).
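These formulas can be written out directly. A minimal sketch of the plain, unsmoothed textbook definition (scikit-learn's TfidfVectorizer applies smoothing and normalization on top of this, so its numbers will differ); the toy documents are arbitrary:

```python
import math

documents = [
    ["i", "like", "this", "movie"],
    ["i", "hate", "this", "movie"],
    ["nice", "movie"],
]

def tf(term, doc):
    # occurrences of the term divided by the total number of words in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (number of documents / number of documents containing the term)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "movie" appears in every document, so its idf (and hence tf-idf) is zero
print(tf_idf("movie", documents[0], documents))  # 0.0
# "like" appears in one of the three documents
print(tf_idf("like", documents[0], documents))   # 0.25 * log(3)
```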
Example: we can use the TfidfVectorizer class from the sklearn library to compute TF-IDF. Let's run it on the same messages we used in the bag-of-words example.
I like this movie, it's funny. I hate this movie. This was awesome! I like it. Nice one. I love it.
Code:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tfidf_vectorizer = TfidfVectorizer()
values = tfidf_vectorizer.fit_transform(documents)

# Show the Model as a pandas DataFrame
# (in scikit-learn >= 1.0 use get_feature_names_out(); the original get_feature_names() was removed in 1.2)
feature_names = tfidf_vectorizer.get_feature_names_out()
pd.DataFrame(values.toarray(), columns = feature_names)
```
Output: a DataFrame with the TF-IDF weight of every vocabulary word in each of the four documents.
Conclusion
In this article we covered the basics of NLP for text, namely:
- NLP makes it possible to apply machine learning algorithms to text and speech;
- NLTK (Natural Language Toolkit) is the leading platform for building NLP programs in Python;
- sentence tokenization is the process of splitting written language into its component sentences;
- word tokenization is the process of splitting sentences into their component words;
- lemmatization and stemming aim to reduce all the word forms encountered to a single, normal dictionary form;
- stop words are words that are thrown out of the text before or after text processing;
- a regular expression (regex, regexp) is a sequence of characters that defines a search pattern;
- bag of words is a popular and simple feature extraction technique used when working with text; it describes the occurrence of each word within a text.
That's it! Now that you know the basics of feature extraction, you can use these features as input to machine learning algorithms.
If you want to see all the described concepts combined in one big example, take a look here.