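I'm not a fan of titles like "Pokémon in Their Own Juice for Dummies/Pots/Pans" myself, but this seems to be exactly that case: we'll talk about basic things that quite often lead to a pile of bruises and a lot of lost time around the question "Why doesn't it work?". If you are still afraid of Unicode and/or don't understand it, read on.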
Why?
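This is the main question of a beginner who runs into an impressive number of encodings and seemingly convoluted mechanisms for working with them (for example, in Python 2.x). The short answer: because that's how it turned out :)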
An encoding, for those who don't know, is a way of representing digits, letters, and all other characters in computer memory (read: as zeros and ones, i.e. as numbers). For example, a space is represented as 0b100000 (binary), 32 (decimal), or 0x20 (hexadecimal).
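As a quick check of the numbers above, here is a minimal sketch. It assumes Python 3 (the rest of the article uses Python 2.x), and the variable names are only illustrative.

```python
# The numeric code of the space character, written in three notations.
code = ord(' ')

# Binary, decimal, and hexadecimal literals all denote the same number.
assert code == 0b100000 == 32 == 0x20

space_forms = (bin(code), code, hex(code))
print(space_forms)  # ('0b100000', 32, '0x20')
```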
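Once upon a time memory was scarce, and 7 bits were enough for every computer to represent all the characters it needed (digits, the lowercase and uppercase Latin alphabet, a bunch of punctuation marks, and the so-called control characters: all 128 possible numbers, 0 through 127, were assigned to something). There was a single encoding back then: ASCII. Time passed and everyone was happy, and those who weren't (read: those who lacked the "©" sign or a letter of their native alphabet) used the remaining 128 values of the 8-bit byte at their own discretion, that is, they created new encodings. This is how ISO-8859-1 and our (that is, Cyrillic) cp1251 and KOI8 appeared. Along with them came the problem of interpreting bytes of the form 0b1******* (characters/numbers from 128 to 255): for example, 0b11011111 is the Cyrillic "Я" in cp1251, while in ISO-8859-1 it is the German Eszett "ß". As expected, network communication and plain file exchange between different computers turned into who-knows-what, even though headers such as "Content-Type" (with its charset parameter) in HTTP, email messages, and HTML pages saved the situation a little.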
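The ambiguity of the upper 128 byte values is easy to reproduce. The following is a small sketch, assuming Python 3 and its standard codec names 'cp1251' and 'iso-8859-1'; it decodes the same single byte under both encodings.

```python
# One and the same byte, 0b11011111 (0xDF), read under two single-byte encodings.
raw = bytes([0b11011111])

as_cp1251 = raw.decode('cp1251')      # Cyrillic capital letter Ya
as_latin1 = raw.decode('iso-8859-1')  # German sharp s (Eszett)

print(as_cp1251, as_latin1)
```

Nothing in the byte itself says which reading is the right one; that information has to travel separately, for example in a charset declaration.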
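At this point some bright minds got together and proposed a new standard: Unicode. It is a standard, not an encoding. Unicode by itself does not determine how characters are stored on a hard drive or transmitted over a network. It only defines the relationship between a character and a certain number (a code point), while the format in which those numbers are turned into bytes is determined by a Unicode encoding (for example, UTF-8 or UTF-16). At the time of writing, the Unicode standard contains a little over 100 thousand characters, whereas UTF-16 can address more than a million code points (and so can UTF-8).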
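To make the difference between a code point and its encoded bytes concrete, here is a minimal sketch, assuming Python 3. The character chosen is just an example.

```python
# A character has one Unicode code point but different byte representations.
ch = '\N{CYRILLIC SMALL LETTER SHCHA}'

code_point = ord(ch)            # the number Unicode assigns: 0x449
utf8 = ch.encode('utf-8')       # two bytes in UTF-8
utf16 = ch.encode('utf-16-be')  # two bytes in big-endian UTF-16, without a BOM

print(hex(code_point), utf8, utf16)
```

Both byte strings decode back to the same character, as long as the decoder is told which encoding was used.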
For a more detailed and more entertaining take on the topic, I recommend reading the great Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Closer to the point!
Naturally, Python supports Unicode too. Unfortunately, only in Python 3 did all strings become Unicode, so beginners have to bang their heads against errors like this:
>>> with open('1.txt') as fh:
...     s = fh.read()
...
>>> print s
кощей
>>> parser_result = u'баба-яга'  # imagine this is the result of some parser
>>> parser_result + s
Traceback (most recent call last):
  File "<pyshell#43>", line 1, in <module>
    parser_result + s
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
Or like this:
>>> str(parser_result)
Traceback (most recent call last):
  File "<pyshell#52>", line 1, in <module>
    str(parser_result)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
Let's figure this out, but in order.
Why does anyone use Unicode?
Why does my favorite HTML parser return Unicode? Let it return an ordinary string, and I'll deal with it from there! Right? Not quite. Every character that exists in Unicode can (probably) be represented in some single-byte encoding (ISO-8859-1, cp1251, and others are called single-byte because they encode any character in exactly one byte), but what do you do if a string has to contain characters from different encodings? Assign a separate encoding to each character? No, of course not: you have to use Unicode.
Why do we need a new "unicode" type?
So we have reached the most interesting part. What is a string in Python 2.x? It is just bytes: plain binary data that can be anything. In fact, when we write something like:
>>> x = 'abcd'
>>> x
'abcd'
the interpreter does not create a variable that holds the first four letters of the Latin alphabet, only the sequence
('a', 'b', 'c', 'd')
of four bytes, and the Latin letters are used here solely to denote these particular byte values. In other words, 'a' here is just a synonym for writing '\x61', and nothing more. For example:
>>> import struct
>>> '\x61'
'a'
>>> struct.unpack('>4b', x)  # 'x' is just four signed chars
(97, 98, 99, 100)
And that's all!
And the answer to the question of why we need "unicode" is now more obvious: we need a type that is represented by characters, not bytes.
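The "just bytes" view can be pushed a little further. This sketch assumes Python 3, where the Python 2.x `str` discussed here corresponds to `bytes`; it reads the same four bytes as characters, as small integers, as two shorts, and as one 32-bit integer.

```python
import struct

x = b'abcd'  # in Python 3 the byte string is spelled with a b prefix

as_numbers = list(x)                 # iterating over bytes yields integers
as_chars = struct.unpack('>4b', x)   # four signed chars
as_shorts = struct.unpack('>2h', x)  # or two big-endian shorts
as_long = struct.unpack('>l', x)     # or one big-endian 32-bit integer

print(as_numbers, as_chars, as_shorts, as_long)
```

The letters never enter the picture: which interpretation applies is decided by the code that reads the bytes, not by the bytes.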
OK, I understand what a string is. Then what is Unicode in Python?
"type unicode" is first of all an abstraction that implements the idea of Unicode (a set of characters and the numbers associated with them). An object of type "unicode" is no longer a sequence of bytes but a sequence of characters proper, with no notion of how those characters are actually stored in computer memory. If you like, it is a higher level of abstraction than byte strings (which is what Python 3 calls the ordinary strings used in Python 2.6).
How to use Unicode?
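A Unicode string in Python 2.6 can be created in three ways (at least, naturally):
- with a u'' literal: u'abc'
- with the "decode" method of a byte string: 'abc'.decode('ascii')
- with the "unicode" function: unicode('abc', 'ascii')
Each of these gives u'abc'. The ascii in the last two examples is the encoding that will be used to turn bytes into characters. The stages of this conversion look roughly like this: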
'\x61' -> ascii encoding -> lowercase Latin "a" -> u'\u0061' (the Unicode code point for this letter)
'\xe0' -> cp1251 encoding -> lowercase Cyrillic "а" -> u'\u0430'
How do you get an ordinary string from a Unicode string? Encode it:
>>> u'abc'.encode('ascii')
'abc'
The encoding algorithm is, naturally, the reverse of the one above.
Remember and don't mix them up: unicode == characters, string == bytes; turning bytes into something meaningful (characters) is decoding (decode), and turning characters into bytes is encoding (encode).
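The rule of thumb above can be sketched as a round trip. This assumes Python 3, where the character type is `str` and the byte type is `bytes`; the sample text is arbitrary.

```python
text = 'abc\u0430'            # characters: three Latin letters and Cyrillic "a"

data = text.encode('cp1251')  # characters -> bytes is encoding
back = data.decode('cp1251')  # bytes -> characters is decoding

print(type(text).__name__, type(data).__name__, data)
```

Calling the methods the other way around is not possible in Python 3 at all: `bytes` has no `encode` method and `str` has no `decode` method.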
It won't encode :(
Let's go through the examples from the beginning of the article. How does concatenating a string and a Unicode string work? The plain string has to be converted to a Unicode string, and since the interpreter doesn't know its encoding, it uses the default one, ascii. If that encoding fails to decode the string, we get an ugly error. In that case we have to convert the string to a Unicode string ourselves, using the correct encoding:
>>> print type(parser_result), parser_result
<type 'unicode'> баба-яга
>>> s = 'кощей'
>>> parser_result + s
Traceback (most recent call last):
  File "<pyshell#67>", line 1, in <module>
    parser_result + s
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
>>> parser_result + s.decode('cp1251')
u'\xe1\xe0\xe1\xe0-\xff\xe3\xe0\u043a\u043e\u0449\u0435\u0439'
>>> print parser_result + s.decode('cp1251')
баба-ягакощей
>>> print '&'.join((parser_result, s.decode('cp1251')))
баба-яга&кощей
"UnicodeDecodeError" is usually a sign that a string needs to be decoded to Unicode using the correct encoding.
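For comparison, here is a sketch of the same situation in Python 3 (an assumption; the article itself targets Python 2.x). There the interpreter no longer guesses ascii when text and bytes are mixed and refuses the operation outright, but the fix is the same explicit decode.

```python
parser_result = '\u0431\u0430\u0431\u0430-\u044f\u0433\u0430'  # "baba-yaga" in Cyrillic
s = b'\xea\xee\xf9\xe5\xe9'                                    # "koshchey" as cp1251 bytes

# Mixing characters and bytes is rejected instead of being decoded implicitly.
try:
    parser_result + s
except TypeError as exc:
    mix_error = exc

# Decoding with the wrong codec reproduces the error from the article.
try:
    s.decode('ascii')
except UnicodeDecodeError as exc:
    decode_error = exc

# Decoding with the right codec works.
combined = '&'.join((parser_result, s.decode('cp1251')))
print(combined)
```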
Now for using "str" with Unicode strings. Don't use "str" with Unicode strings :) "str" gives you no way to specify an encoding, so the default encoding is always used and any character with a code above 127 leads to an error. Use the "encode" method instead:
>>> print type(s), s
<type 'unicode'> кощей
>>> str(s)
Traceback (most recent call last):
  File "<pyshell#90>", line 1, in <module>
    str(s)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
>>> s = s.encode('cp1251')
>>> print type(s), s
<type 'str'> кощей
"UnicodeEncodeError" is a sign that we need to specify the correct encoding when converting a Unicode string to an ordinary one (or use the second parameter 'ignore', 'replace', or 'xmlcharrefreplace' of the "encode" method).
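The three error handlers mentioned above behave quite differently. The following sketch assumes Python 3, where `str.encode` accepts the same handler names; the sample string is illustrative.

```python
s = '\u043a\u043e\u0449\u0435\u0439!'  # five Cyrillic letters and an exclamation mark

ignored = s.encode('ascii', 'ignore')              # unencodable characters are dropped
replaced = s.encode('ascii', 'replace')            # each one becomes a question mark
as_xml = s.encode('ascii', 'xmlcharrefreplace')    # each one becomes a numeric entity

print(ignored, replaced, as_xml)
```

Only 'xmlcharrefreplace' keeps enough information to recover the original text; the other two lose it for good, so they are best kept for logging and similar output.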
I want more!
OK, let's use Baba Yaga from the example above once more:
>>> parser_result = u'баба-яга'
The example is not entirely simple, but it has everything (well, almost everything). What is going on here:
1. What do we have at the input? The bytes that IDLE passes to the interpreter. What do we need at the output? Unicode, that is, characters. All that remains is to turn the bytes into characters, but that takes an encoding, right? Which encoding will be used? Let's look further.
2. Here is the important point:
>>> 'баба-яга'
'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
>>> u'\u00e1\u00e0\u00e1\u00e0-\u00ff\u00e3\u00e0' == u'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
True
As you can see, Python doesn't bother choosing an encoding; the bytes are simply turned into Unicode code points:
>>> ord('а')
224
>>> ord(u'а')
224
3. There is just one problem: character 224 in cp1251 (the encoding used by the interpreter) is not at all the same as character 224 in Unicode. This is why we get krakozyabry (garbage characters) when we try to print our Unicode string.
4. How do we help the old lady? It turns out that the first 256 Unicode characters are the same as in the ISO-8859-1/latin1 encoding, so if we use it to encode the Unicode string, we get back the bytes we typed in (for the curious: see Objects/unicodeobject.c and look for the definition of the function "unicode_encode_ucs1"):
>>> parser_result.encode('latin1')
'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
5. How do we get the old lady in Unicode? We have to say which encoding to use:
>>> parser_result.encode('latin1').decode('cp1251')
u'\u0431\u0430\u0431\u0430-\u044f\u0433\u0430'
6. The method from step 5 is certainly not great; it is much more convenient to use the built-in unicode.
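The latin1/cp1251 trick from the steps above works as a general repair for text whose bytes were mapped one-to-one onto code points. Here is a sketch assuming Python 3; the name `garbled` is hypothetical and stands for the mis-decoded string.

```python
# Text whose cp1251 bytes were turned into code points 0..255 one-to-one.
garbled = '\xe1\xe0\xe1\xe0-\xff\xe3\xe0'

# latin1 maps code points 0..255 back to the identical byte values...
raw = garbled.encode('latin1')

# ...and those bytes can then be decoded with the encoding they were really in.
fixed = raw.decode('cp1251')

print(fixed)
```

This only works when the wrong step was exactly a latin1-style decode; if some other codec was applied by mistake, that codec has to be used for the `encode` step instead.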
In fact, things are not that bad with u'' literals, since the problem only arises in the console. Indeed, when non-ASCII characters are used in a source file, Python insists on a header like "# -*- coding: <encoding name> -*-" (PEP 0263), and Unicode strings will then use the correct encoding.
There is also a way to use u'' to represent, say, Cyrillic without specifying an encoding or unreadable Unicode code points (that is, u'\u1234'). The method is not entirely convenient, but it is interesting: use named character escapes of the form \N{...}:
>>> s = u'\N{CYRILLIC SMALL LETTER KA}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER SHCHA}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER SHORT I}'
>>> print s
кощей
Well, that seems to be all. The main advice: don't confuse "encode" with "decode", and understand the difference between bytes and characters.
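The character names used in \N{...} escapes come from the Unicode character database, which the standard `unicodedata` module exposes. This is a short sketch, assuming Python 3, of going from characters to names and back.

```python
import unicodedata

s = '\N{CYRILLIC SMALL LETTER KA}\N{CYRILLIC SMALL LETTER O}'

# From characters to their official names...
names = [unicodedata.name(ch) for ch in s]

# ...and from a name back to the character.
shcha = unicodedata.lookup('CYRILLIC SMALL LETTER SHCHA')

print(names, hex(ord(shcha)))
```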
Python 3
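No code here, since I have no experience with it. Witnesses claim that everything there is much simpler and more fun. Whoever takes it upon themselves to demonstrate the differences between here (Python 2.x) and there (Python 3.x) on a simple example has my respect.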
Useful
Since we are talking about encodings, I recommend a resource that helps defeat krakozyabry from time to time: http://2cyr.com/decode/?lang=en
Once again, a link to Spolsky's article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Unicode HOWTO is the official document on where, how, and why Unicode is used in Python 2.x.
Thank you for your attention. I'd be grateful for comments by private message.
P.S. I was given a link to a translation of Spolsky's article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.