unicodedata

Tutorials

Normalization: Difference Between NFD, NFC, NFKD, and NFKC Explained with Python Code
- normalize()
Function-by-function tutorial: python unicodedata用法_周小董-CSDN博客_unicodedata
English tutorial: Unicode In Python - The unicodedata Module Explained - AskPython
Official Python tutorial: Unicode HOWTO — Python 3.9.7 documentation
Unicode category reference: General_Category_Values - Unicode Character Database
11. Programming languages — Programming with Unicode
- A general introduction to Unicode across programming languages

Character set data

Data in unicodeit:
- unicodeit/data.py at master · svenkreiss/unicodeit · GitHub

Module name

unicodedata
- Short for "Unicode database"

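Concepts

code point

The integer that Unicode assigns to each character. It is not a byte sequence; bytes appear only once a code point has been encoded.
Value range
- 0 to 0x10FFFF
  - about 1.1 million values
  - the number of code points actually assigned is smaller
- The largest value needs 21 bits, so it fits in three bytes
Notation
- U+265E
  - i.e. 0x265e
  - decimal value: 9822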
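The relation between the three notations can be checked with the built-ins chr() and ord(). This is only a small sketch; the variable names are made up for illustration.

```python
import sys

code_point = 0x265E           # the U+265E example used in these notes

knight = chr(code_point)      # int -> one-character str
decimal_value = ord(knight)   # one-character str -> int

# Conventional U+XXXX notation: at least four uppercase hex digits.
notation = f"U+{code_point:04X}"

print(knight, decimal_value, notation)  # ♞ 9822 U+265E

# The upper bound of the code space is exposed as sys.maxunicode.
print(hex(sys.maxunicode))  # 0x10ffff
```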

What the Unicode database stores

The standard is made up of many tables
Each entry holds
- the code point
- the character
- a descriptive name

eg:

0061    'a'; LATIN SMALL LETTER A
0062    'b'; LATIN SMALL LETTER B
...
007B    '{'; LEFT CURLY BRACKET
...
2167    'Ⅷ'; ROMAN NUMERAL EIGHT
...
265E    '♞'; BLACK CHESS KNIGHT
...

Here
- 0061 and 007B are code points
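The rows of such a table can be rebuilt from the database that ships with Python. A minimal sketch, where table_row is a hypothetical helper and not part of the unicodedata module:

```python
import unicodedata

def table_row(ch):
    """Format one character as 'CODE    'char'; NAME'."""
    return f"{ord(ch):04X}    {ch!r}; {unicodedata.name(ch)}"

for ch in "ab{\u2167\u265e":
    print(table_row(ch))
```

unicodedata.name() raises ValueError for characters without a name (control characters, for instance), so a real table builder would pass a default as the second argument.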

glyph

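Definition
- The graphical form of a character
- How it appears on screen or written on paper
- i.e. the shape drawn for a character
Display
- The appearance varies with the font
- A Python program does not deal with how glyphs are drawn
- Rendering is the job of the GUI toolkit or the terminal's font renderer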

Encodings

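Why
- To be stored, a code point has to be represented as a set of code units
  - code units are then mapped onto 8-bit bytes
Definition
- The rules for translating code points into a sequence of bytes are called a character encoding, or simply an encoding
Drawbacks of encoding every code point directly as 4 bytes (32 bits)
1. Not portable: different processors order the bytes differently
  - by this criterion a byte-oriented encoding is preferable
2. Wastes space: English text, for example, is mostly ASCII, where one byte per character is enough and four bytes are wasteful
3. Incompatible with existing C functions such as strlen(); a new family of wide-string functions would be needed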
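The three drawbacks can be seen with Python's UTF-32 codecs, which implement the "4 bytes per code point" scheme. This is an illustrative sketch; bytes.hex() with a separator needs Python 3.8 or later.

```python
text = "a\u265e"  # 'a' followed by U+265E

little = text.encode("utf-32-le")
big = text.encode("utf-32-be")

# 1. Same code points, different byte order.
print(little.hex(" "))  # 61 00 00 00 5e 26 00 00
print(big.hex(" "))     # 00 00 00 61 00 00 26 5e

# 2. Four bytes per code point, versus 1 + 3 bytes in UTF-8.
utf8 = text.encode("utf-8")
print(len(little), len(utf8))  # 8 4

# 3. Embedded zero bytes: a C strlen() would stop after the first byte.
first_zero = little.index(0)
print(first_zero)  # 1
```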

UTF-8 encoding

Encoding rules
- code point < 128 (ASCII): stored as the single byte with the same value
- code point >= 128: turned into a sequence of two, three, or four bytes
  - every byte of such a sequence is in the range 128 to 255 (inclusive)
Properties
1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains embedded zero bytes only where they represent the null character (U+0000). This means that UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes for anything other than end-of-string markers.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be represented with one or two bytes.
5. If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.
  - In other words, a damaged or missing byte does not prevent the bytes that follow from being decoded.
6. UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes. This avoids the byte-ordering issues that can occur with integer and word oriented encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending on the hardware on which the string was encoded.
  - This is the endianness problem: in UTF-16 and UTF-32 the same code point is stored as different byte sequences on big-endian and little-endian machines.
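Several of these properties are easy to observe from Python. The sketch below uses arbitrary sample characters; the "damaged" data is simply a valid sequence with its first byte removed.

```python
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # 1, 2, 3 and 4 bytes in UTF-8

lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")

# ASCII text is already valid UTF-8.
ascii_is_utf8 = "abc".encode("ascii") == "abc".encode("utf-8")

# Every byte of a multi-byte sequence is in the range 128..255.
high_bytes_only = all(b >= 0x80 for b in "\u20ac".encode("utf-8"))

# Resynchronization: lose the first byte of the euro sign, then decode.
damaged = "\u20acuro".encode("utf-8")[1:]
recovered = damaged.decode("utf-8", errors="replace")
print(recovered.endswith("uro"))  # True: the text after the damage survives
```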

BOM

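byte order mark
A Unicode character (U+FEFF) placed at the start of a file to mark the byte order
UTF-16
- Uses a BOM to indicate the byte order
- The BOM is dropped automatically when the file is read
utf-16-le and utf-16-be
- Variants of UTF-16 with a fixed byte order
- The BOM is not skipped when reading
utf-8
- Does not need a BOM
- A BOM at the start of the data is not stripped by the "utf-8" codec; it is decoded as a leading U+FEFF character, which can cause problems later
  - Fix
    - Use the "utf-8-sig" encoding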
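The difference between the generic UTF-16 codec and its fixed-endian variants can be checked on in-memory bytes. A short sketch using the BOM constants from the standard codecs module:

```python
import codecs

# The generic codec writes a BOM (in native byte order) before the text ...
encoded = "hi".encode("utf-16")
has_bom = encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# ... and consumes it again when decoding.
round_trip = encoded.decode("utf-16")

# The fixed-endian codecs write no BOM ...
le_bytes = "hi".encode("utf-16-le")

# ... and do not skip one on input: it comes back as U+FEFF.
with_bom = codecs.BOM_UTF16_LE + le_bytes
decoded = with_bom.decode("utf-16-le")
print(repr(decoded))  # '\ufeffhi'
```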

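Unicode handling in Python

Character notation

By character name
- eg: "\N{GREEK CAPITAL LETTER DELTA}"
Using "\xff"
- two hex digits: an 8-bit value
Using "\u1122"
- four hex digits: a 16-bit value
Using "\U0001F600"
- eight hex digits: a 32-bit value, limited to 0x10FFFF (so "\U11223344" is a syntax error)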

Examples

>>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
'\u0394'
>>> "\u0394"                          # Using a 16-bit hex value
'\u0394'
>>> "\U00000394"                      # Using a 32-bit hex value
'\u0394'
>>> "\x01"                            # Using an 8-bit hex value
'\x01'

Strings

>>> s = "a\xac\u1234\u20ac\U00008000"
... #     ^^^^ two-digit hex escape
... #         ^^^^^^ four-digit Unicode escape
... #                     ^^^^^^^^^^ eight-digit Unicode escape
>>> [ord(c) for c in s]
[97, 172, 4660, 8364, 32768]

Number to character: chr(number: int) -> char

Character to number: ord(ch: char) -> int

Encoding: str.encode('utf-8')

Decoding: bytes.decode('utf-8')
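The direction is easy to mix up: encoding goes from str to bytes, decoding from bytes back to str. A small round trip on an arbitrary sample string:

```python
text = "na\u00efve \u265e"  # 7 code points

# Encoding: str -> bytes.
data = text.encode("utf-8")
print(type(data).__name__, len(text), len(data))  # bytes 7 10

# Decoding: bytes -> str.
restored = data.decode("utf-8")

# chr() and ord() are inverses in the same way.
same_char = chr(ord("\u265e")) == "\u265e"

# str has no decode() and bytes has no encode(), so the direction cannot be swapped.
wrong_direction_possible = hasattr(text, "decode") or hasattr(data, "encode")
```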

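String comparison

References
- python compare unicode strings
Tools
- unicodedata.normalize()
- str.casefold()
  - Removes case distinctions
  - The result is similar to str.lower(), but more aggressive (for example, "ß" becomes "ss")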
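The two tools are usually combined. The sketch below is one common recipe, not the only one: normalize to NFD, casefold, then normalize again because casefolding can produce text that is no longer normalized. The helper names are invented for this example.

```python
import unicodedata

def caseless_key(s):
    """Key for case-insensitive, normalization-insensitive comparison."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())

def equal_caseless(a, b):
    return caseless_key(a) == caseless_key(b)

# Precomposed and decomposed e-acute differ as plain strings ...
print("\u00e9" == "e\u0301")                 # False
# ... but compare equal after normalization and casefolding.
print(equal_caseless("\u00e9", "E\u0301"))   # True

# casefold() handles cases that lower() does not.
print("Stra\u00dfe".lower() == "STRASSE".lower())   # False
print(equal_caseless("Stra\u00dfe", "STRASSE"))     # True
```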

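Unicode regular expressions

\d
- in bytes patterns (i.e. ASCII): matches [0-9]
- in str patterns (i.e. Unicode): matches any character whose unicodedata.category() is 'Nd'
\w
- in bytes patterns: matches [a-zA-Z0-9_]
- in str patterns: matches a wide range of Unicode word characters
\s
- in bytes patterns: matches [ \t\n\r\f\v]
- in str patterns: matches Unicode whitespace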
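The difference shows up as soon as the text contains non-ASCII digits or letters. In this sketch the sample string holds fullwidth digits, Arabic-Indic digits and ASCII digits; re.ASCII is the standard flag that gives a str pattern the bytes-style behaviour.

```python
import re

text = "\uff12\uff10 \u0662\u0660 20"  # fullwidth "20", Arabic-Indic "20", ASCII "20"

str_digits = re.findall(r"\d+", text)                     # all three runs
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)   # ['20']
bytes_digits = re.findall(rb"\d+", text.encode("utf-8"))  # [b'20']
print(len(str_digits), ascii_digits, bytes_digits)

words = "na\u00efve caf\u00e9"
str_words = re.findall(r"\w+", words)                     # two whole words
ascii_words = re.findall(r"\w+", words, flags=re.ASCII)   # split at the accents
print(str_words, ascii_words)
```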

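Unicode file handling

Reading content

Scenario
- Reading part of the bytes
  - data = f.read(size), followed by decoding in your own code
  - The size bytes read may end in the middle of a character and hold only its leading bytes
  - Fix
    - Let open(filename, encoding=...) do the decoding in text mode; it is built on the low-level incremental decoding interface, which handles partial sequences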
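What text-mode open() does internally can be imitated with an incremental decoder from the codecs module. This is a sketch of the idea, with the chunk boundary deliberately placed inside the three-byte euro sign.

```python
import codecs

data = "\u20acuro".encode("utf-8")   # e2 82 ac 75 72 6f
chunks = [data[:2], data[2:]]        # the split falls inside the euro sign

# Decoding each chunk on its own fails on the incomplete sequence.
try:
    chunks[0].decode("utf-8")
    naive_failed = False
except UnicodeDecodeError:
    naive_failed = True

# An incremental decoder buffers the partial sequence until the rest arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
parts = [decoder.decode(chunk) for chunk in chunks]
parts.append(decoder.decode(b"", final=True))  # flush; raises if bytes are left over
text = "".join(parts)
print(parts, text)
```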

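Filename encoding

References
- python unicode filenames
How the encoding is chosen
- The operating system determines it
- On Unix
  - It is set through the LANG and LC_CTYPE environment variables; if neither is set, the default is UTF-8
Getting the system's filename encoding
- sys.getfilesystemencoding()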

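Filename handling in Python

open(filename)
- Pass a str directly
- Python converts it to the filesystem's filename encoding for you
os.stat
- Converts automatically as well
os.listdir
- Given a str, returns str
- Given bytes, returns bytes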

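Principles for handling Unicode in Python

Inside the program, work only with Unicode strings (str)
Decode input data as soon as it arrives
Encode output only at the very end

Argument types of functions

Accept either str (Unicode) only or bytes only
If a function accepts both str and bytes
- the program becomes fragile and bug-prone

Web data and untrusted data

Check for illegal characters before interpreting the data as a command or storing it in a database
Run the check on the decoded str, not on the raw bytes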

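Normalization

unicodedata.normalize(form, unistr), where form is 'NFC', 'NFD', 'NFKC' or 'NFKD'

Forms

References
- Difference Between NFD, NFC, NFKD, and NFKC Explained with Python Code
- wiki: Unicode equivalence - Wikipedia
  - explains Unicode equivalence and the principles behind it
Forms usually chosen
- NFKC or NFKD
  - NFKC: unifies the representation and shortens the text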

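Classification

Whether the original form is changed

Unchanged (canonical equivalence only)
- NFC
  - Normalization Form Canonical Composition
  - Canonical decomposition first, then canonical composition
  - Properties:
    - Conversion: does not turn fullwidth characters into halfwidth ones
- NFD
  - Normalization Form Canonical Decomposition
  - Canonical decomposition, after which combining marks are put into canonical order
Changed (compatibility characters are replaced)
- NFKC
  - Normalization Form Compatibility Composition
  - Compatibility decomposition first, then canonical composition of the resulting code points
  - Properties:
    - Conversion: fullwidth characters -> halfwidth characters
- NFKD
  - Normalization Form Compatibility Decomposition
  - Compatibility decomposition, after which combining marks are put into canonical order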

eg:

  In [107]: print(unicodedata.normalize('NFC', '1．1'), ';', unicodedata.normalize('NFKC', '1．１'))
  # output: 1．1 ; 1.1
  # NFC does not convert fullwidth characters to halfwidth
  # NFKC converts fullwidth characters to halfwidth

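Abbreviations
- D
  - Decomposition
    - decomposing naturally makes the text longer
- C
  - Composition
    - composing makes the text shorter
- K
  - Compatibility
    - unifies compatibility variants
    - compatibility decomposition is applied first, then reordering or composition
- NF
  - normalization form

Whether the length changes
- Composed (shorter)
  - NFC
  - NFKC
- Decomposed (longer)
  - NFD
  - NFKD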
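The effect of the four forms on length and content can be compared side by side. A small sketch with three sample characters chosen for illustration; code_points is a hypothetical helper.

```python
import unicodedata

def code_points(form, s):
    """Normalize s and list the resulting code points as U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize(form, s)]

samples = {
    "e-acute": "\u00e9",       # has a canonical decomposition
    "fi ligature": "\ufb01",   # has only a compatibility decomposition
    "fullwidth A": "\uff21",   # has only a compatibility decomposition
}

for label, s in samples.items():
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(f"{label:12} {form:5} {code_points(form, s)}")
```

The canonical forms (NFC, NFD) leave the ligature and the fullwidth letter alone; only the K forms replace them.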

Categories

unicodedata.category(char)

Uses

Removing accents (diacritics)

References:
- regex - Python: efficient method to replace accents (é to e), remove {^a-zA-Z…
- Category types: UnicodeCategory Enum (System.Globalization) | Microsoft Docs

Method

>>> import unicodedata
>>> s='éô'
>>> ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
'eo'

Iterate over the characters of the NFD-normalized string and keep those with unicodedata.category(char) != 'Mn'
- Dropping every character in category Mn (Mark, nonspacing, i.e. the combining accents) removes the accents

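Recognizing numeric characters

References
- what-is-the-difference-between-unicodedata-digit-and-unicodedata-numeric

unicodedata.decimal(ch: char) -> int

Like digit, but stricter: only decimal digits are accepted

unicodedata.digit(ch: char) -> int

Digit characters
- eg: 0, 1, 2, …, 9
- 1, ¹ (SUPERSCRIPT ONE), ① (CIRCLED DIGIT ONE), ١ (ARABIC-INDIC DIGIT ONE)
Valid characters
- unicodedata.category(ch) returns
  Nd
  DecimalDigitNumber, "0" ~ "9" and the decimal digits of other scripts
- plus digit-like characters in category No, such as ¹ and ①; decimal() rejects these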

unicodedata.numeric(ch)

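Single characters that nevertheless express a numeric value
- eg:
  - "一" (the Chinese numeral one)
  - ⅐ (VULGAR FRACTION ONE SEVENTH)
    - i.e. 1/7, expressed as a single character
  - "Ⅱ" (Roman numeral)
  - "⑴", "⒀", enumeration characters
Valid characters
- unicodedata.category(ch) types
  Nl
  letter number, e.g. Roman numerals
  ?
  …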
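The three functions form a ladder from strict to permissive, which is easiest to see by calling all of them on the same characters. A sketch using the optional default argument so that unsupported characters give None instead of raising ValueError; numeric_values is a made-up helper name.

```python
import unicodedata

def numeric_values(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

# ASCII 1, Arabic-Indic 1, superscript 1, circled 1, Roman II, 1/7, CJK one, 'a'
for ch in "1\u0661\u00b9\u2460\u2161\u2150\u4e00a":
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {numeric_values(ch)}")
```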

unicodedata.bidirectional(ch)

Returns the bidirectional class of a character; mainly relevant to right-to-left scripts such as Arabic
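The return value is a short class code defined by the Unicode bidirectional algorithm. A brief sketch with a few sample characters chosen for illustration:

```python
import unicodedata

samples = {
    "a": "Latin letter",
    "\u0627": "Arabic letter alef",
    "\u05d0": "Hebrew letter alef",
    "1": "ASCII digit",
    "\u0661": "Arabic-Indic digit one",
    " ": "space",
}

bidi = {ch: unicodedata.bidirectional(ch) for ch in samples}
for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {bidi[ch]}")
```

Here L is left-to-right, AL is an Arabic letter, R is right-to-left, EN and AN are European and Arabic numbers, and WS is whitespace.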

unicodedata.category(char): character category

References:

Category enumeration: Unicode Character Categories

Punctuation

hyphen, en dash and em dash

Reference: Dashes vs. Hyphens–What's the Difference?

The three characters:

hyphen: "-"
en dash: "–"
em dash: "—"

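Note: depending on the font, they may be hard to tell apart

hyphen

The shortest of the three; joins the parts of compound words

en dash

As wide as an uppercase N

em dash

As wide as an uppercase M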
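When the font makes the three characters look alike, unicodedata can tell them apart. The sketch below also shows one possible way to fold both dashes into the ASCII hyphen; whether that is appropriate depends on the application.

```python
import unicodedata

dashes = {"hyphen": "\u002d", "en dash": "\u2013", "em dash": "\u2014"}

info = {
    label: (unicodedata.name(ch), unicodedata.category(ch))
    for label, ch in dashes.items()
}
for label, (name, category) in info.items():
    print(f"{label}: {name} ({category})")

# Assumed normalization policy for this example: map both dashes to "-".
to_hyphen = str.maketrans({"\u2013": "-", "\u2014": "-"})
folded = "1990\u20131999 \u2014 a decade".translate(to_hyphen)
print(folded)  # 1990-1999 - a decade
```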

utf-8 vs utf-8-sig

utf-8-sig is meant for UTF-8 files that begin with a BOM
If a file with a BOM is read as utf-8, the BOM is read as an ordinary character (U+FEFF)
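The behaviour of the two codecs can be compared on the same bytes. A minimal sketch; the sample content is invented.

```python
import codecs

data = codecs.BOM_UTF8 + "id,name".encode("utf-8")

as_utf8 = data.decode("utf-8")      # BOM kept as U+FEFF
as_sig = data.decode("utf-8-sig")   # BOM removed
print(repr(as_utf8))  # '\ufeffid,name'
print(repr(as_sig))   # 'id,name'

# utf-8-sig also accepts data without a BOM ...
no_bom = "id,name".encode("utf-8").decode("utf-8-sig")

# ... and writes one when encoding.
written = "x".encode("utf-8-sig")
```

The same codec names work with open(path, encoding="utf-8-sig").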

Printing superscripts and subscripts

Using Unicode escape sequences directly

# subscript
print('H\u2082SO\u2084')  # H₂SO₄

# superscript
print("x\u00b2 + y\u00b2 = 2")  # x² + y² = 2

Using str.maketrans

subscript_table = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")
print("C2H5OH".translate(subscript_table))
# C₂H₅OH

unicodedata.name(): the name of a character

>>> import unicodedata

>>> unicodedata.name("€")
'EURO SIGN'
>>> unicodedata.lookup("EURO SIGN")
'€'

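unicodedata.combining(char): test whether a character is a combining character

References:

Combining character - Wikipedia
Class enumeration: Unicode Combining Classes

Return value:

== 0: the character is not a combining character
!= 0: the canonical combining class of the combining character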
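The combining class is also what normalization uses to order several marks on the same base character. The sketch below prints the class of a few marks and then shows NFD reordering a cedilla (class 202) in front of an acute accent (class 230).

```python
import unicodedata

marks = ["a", "\u0301", "\u0327", "\u0332", "\u0305"]
classes = {ch: unicodedata.combining(ch) for ch in marks}
for ch in marks:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {classes[ch]}")

# Marks typed in the "wrong" order: acute (230) before cedilla (202).
scrambled = "a\u0301\u0327"
ordered = unicodedata.normalize("NFD", scrambled)
print([f"U+{ord(c):04X}" for c in ordered])  # ['U+0061', 'U+0327', 'U+0301']
```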

Counting characters

Count the visible characters rather than the bytes or code points. Reference:

How do I get the "visible" length of a combining Unicode string in Python? - …

import unicodedata
sum(1 for ch in 'A\u0332\u0305BC' if unicodedata.combining(ch) == 0)  # 3

ϕφψ𝛗
