
How do I convert this XPath expression to BeautifulSoup?


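I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape a few substrings out of some HTML, and pyparsing has some useful methods for doing that. Using this code: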

from pyparsing import makeHTMLTags, withAttribute, SkipTo
from urllib.request import urlopen

# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
with urlopen(url) as page:
    # decode the raw bytes so pyparsing scans a str
    # (UTF-8 is an assumption about the page's encoding)
    html = page.read().decode("utf-8", errors="replace")

# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes,
# upper/lower case, etc.)
tdStart, tdEnd = makeHTMLTags("td")
aStart, aEnd = makeHTMLTags("a")

# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class", "altRow")))

# compose total matching pattern (add trailing tdStart to filter out
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart

# scan input HTML source for matching refs, and print out the text and
# href values
for ref, s, e in patt.scanString(html):
    print(ref.text, ref.a.href)

I extracted 914 references from your page, from Abel to Zupikova.

Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova
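
For comparison, here is a minimal, dependency-free sketch of the same kind of extraction using the standard library's `html.parser` instead of pyparsing or BeautifulSoup. It is a swapped-in technique, not the code that produced the list above. The class name `AltRowLinkParser`, the helper `extract_links`, and the sample markup are hypothetical, and the real page's structure is an assumption. It is also looser than the pyparsing pattern: it collects any `<a>` inside a `<td class="altRow">` and does not require a following `<td>`.

```python
from html.parser import HTMLParser


class AltRowLinkParser(HTMLParser):
    """Collect (text, href) for <a> tags nested inside <td class="altRow">."""

    def __init__(self):
        super().__init__()
        self.links = []          # collected (text, href) pairs
        self._in_alt_td = False  # currently inside a <td class="altRow">?
        self._href = None        # href of the <a> being read, if any
        self._text = []          # text chunks of the <a> being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # a class attribute may hold several space-separated names
            self._in_alt_td = "altRow" in (attrs.get("class") or "").split()
        elif tag == "a" and self._in_alt_td:
            self._href = attrs.get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "td":
            self._in_alt_td = False

    def handle_data(self, data):
        # only keep text while an accepted <a> is open
        if self._href is not None:
            self._text.append(data)


def extract_links(html):
    """Return a list of (link text, href) pairs found in altRow cells."""
    parser = AltRowLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup; only the two names and hrefs come from the list above.
sample_html = """
<table>
  <tr><td class="altRow"><a href="/cabel">Abel, Christian</a></td><td class="altRow">Partner</td></tr>
  <tr><td class="other"><a href="/skip">Not, Matched</a></td><td class="other">x</td></tr>
  <tr><td class="altRow"><a href="/jzhu">Zhu, Jie</a></td><td class="altRow">Associate</td></tr>
</table>
"""

for text, href in extract_links(sample_html):
    print(text, href)
```

On the sample fragment this picks up the two links in `altRow` cells and skips the row whose cells use a different class. For a real page you would pass the decoded HTML string to `extract_links` instead of `sample_html`.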
Category: Other. Posted 2022/1/1 18:46:22. 328 views.
