XML in Python

Parsing XML with lxml

Getting started

Read XML -> root tree (an _ElementTree object)

from lxml import etree

tree = etree.parse('/my/file.xml')   # returns an _ElementTree
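A minimal sketch of what the parse step returns. It reads an in-memory document instead of '/my/file.xml' so it can run anywhere. The sample XML and the variable names are made up for illustration, and the fallback import to the standard library's ElementTree is an assumption for environments without lxml.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib API is compatible for the calls used below
    import xml.etree.ElementTree as etree

# Hypothetical sample document standing in for /my/file.xml
xml_bytes = b"<note><to>Tove</to><from>Jani</from></note>"

tree = etree.parse(io.BytesIO(xml_bytes))   # tree object (_ElementTree in lxml)
root = tree.getroot()                       # root element of the tree

print(root.tag)                        # tag name of the root element
print(root.findtext("to"))             # text of the first <to> child
print([child.tag for child in root])   # tags of the direct children
```

The tree object wraps the whole document; most queries start from the root element returned by getroot().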

_ElementTree object -> string

from lxml import etree

tree = etree.parse('/my/file.xml')
xml_string = etree.tostring(tree, encoding='unicode')   # str; without encoding= the result is bytes
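Serializing back to a string is typically done with etree.tostring. The sketch below shows the difference between the default bytes result and a str result; the sample document is hypothetical, and the stdlib fallback import is an assumption for environments without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib API is compatible for the calls used below
    import xml.etree.ElementTree as etree

# Hypothetical sample document
root = etree.fromstring("<note><to>Tove</to></note>")

as_bytes = etree.tostring(root)                      # bytes by default
as_text = etree.tostring(root, encoding="unicode")   # str

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```

Passing the root element (tree.getroot()) rather than the tree object keeps the call portable between lxml and the standard library.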

Parsing XML with BeautifulSoup

from bs4 import BeautifulSoup

# content: the XML document as str or bytes; the 'xml' parser requires lxml
soup = BeautifulSoup(content, 'xml')

soup.find('abstract').text      # str

Preserving spaces in text

Reference:

python - extracting element and insert a space - Stack Overflow

Key point:

Use the get_text() method rather than the text attribute.

div.text                        # no space is inserted between the text of nested tags, so two words get merged into one
# output: hellothere

div.get_text(separator=' ')
# output: hello there
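The difference can be reproduced with a small self-contained sketch. The markup below is a made-up example, and it uses the built-in "html.parser" rather than the 'xml' parser only so that it runs without lxml; the behaviour of text and get_text() is the same.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two nested tags with no whitespace between them
html = "<div><span>hello</span><span>there</span></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

joined = div.text                      # nested texts concatenated directly
spaced = div.get_text(separator=" ")   # a space inserted between text pieces

print(joined)
print(spaced)
```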

Extracting XML on the command line

yq – xq

Reference:

Querying JSON and XML with jq and xq | Ashby

A command included in the yq package
Uses jq filter syntax

Example:

xq '.TEI.teiHeader.profileDesc.abstract.div.p.s' ./003fffb4365c5171dc7fe3c4cd4029f1.tei.xml
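For comparison, a rough Python counterpart of the same path lookup might look like the sketch below. It assumes the file uses the standard TEI namespace (http://www.tei-c.org/ns/1.0); the sample document, the "tei" prefix, and the variable names are made up for illustration, and the stdlib fallback import is an assumption for environments without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib API is compatible for the calls used below
    import xml.etree.ElementTree as etree

# Prefix chosen here for illustration; the namespace URI is the TEI one
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily trimmed TEI document
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract><div><p>
    <s>First sentence.</s>
    <s>Second sentence.</s>
  </p></div></abstract></profileDesc></teiHeader>
</TEI>"""

root = etree.fromstring(sample)   # root is the <TEI> element
path = "tei:teiHeader/tei:profileDesc/tei:abstract/tei:div/tei:p/tei:s"
sentences = [s.text for s in root.findall(path, namespaces=TEI_NS)]

print(sentences)
```

Unlike the jq-style filter, the element path has to carry the namespace prefix on every step.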

Another xq

Reference:

GitHub - sibprogrammer/xq: Command-line XML and HTML beautifier and content e…

Features:

Supports XPath text extraction
- Note:
  - The structure of the text is not preserved; the output is a single string

Installation

# Quick check once xq is installed: pretty-print an XML document read from stdin
curl -s https://www.w3schools.com/xml/note.xml | xq

Using XPath

cat test/data/xml/unformatted.xml | xq -x //city

# Select several XML tags at once with the "|" operator
xq -x '//titleStmt/title|//abstract' ./044514358acacfca26478d4b92877272.tei.xml | wc -l
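A similar multi-tag selection can be sketched in Python. The sample document is hypothetical and carries no namespace. The sketch runs each path separately with findall so that it also works with the standard library, whose limited XPath subset has no "|" operator; with lxml, the union expression could instead be passed to xpath() directly.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib API is compatible for the calls used below
    import xml.etree.ElementTree as etree

# Hypothetical, namespace-free sample document
sample = """<TEI><teiHeader>
  <fileDesc><titleStmt><title>A sample title</title></titleStmt></fileDesc>
  <profileDesc><abstract><p>A sample abstract.</p></abstract></profileDesc>
</teiHeader></TEI>"""

root = etree.fromstring(sample)

# One path per tag to select; results are concatenated path by path
paths = [".//titleStmt/title", ".//abstract"]
matches = [el for p in paths for el in root.findall(p)]

# Flatten each match to a single string, similar to plain-text extraction
texts = ["".join(el.itertext()).strip() for el in matches]

print(texts)
```

Results come back grouped by path rather than in document order, which is one difference from a true XPath union.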
