XPath Taking Text With Hyperlinks (Python)
I'm new to XPath (and a relative beginner at Python in general). I'm trying to use it to extract the text of the first paragraph of a Wikipedia page. Take for instance
Solution 1:
The links themselves are nodes that you need to descend into.
/html/body/div[3]/div[3]/div[4]/div/p[1]//text()
Solution 2:
Your XPath query matches only the text nodes that are direct children of that node. The text of the embedded links lives on other nodes (the link elements) and is therefore excluded.
To descend, use
//text()
as suggested; this retrieves the text of every descendant node, starting from the node in question:
/html/body/div[3]/div[3]/div[4]/div/p[1]//text()
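To make the difference concrete, here is a minimal sketch that runs offline. The HTML snippet is a made-up stand-in for a Wikipedia paragraph (an assumption, not the real page markup), and the short relative paths replace the long absolute path above only to keep the example small.

```python
from lxml import html

# Hypothetical stand-in for a paragraph that contains a hyperlink.
snippet = '<div><p>Python is a <a href="/wiki/High-level">high-level</a> language.</p></div>'
tree = html.fromstring(snippet)

# text() selects only the text nodes that are direct children of <p>,
# so the link text is missing.
direct = tree.xpath('.//p/text()')

# //text() also descends into <a>, so the link text is included.
nested = tree.xpath('.//p//text()')

print(direct)            # ['Python is a ', ' language.']
print(nested)            # ['Python is a ', 'high-level', ' language.']
print(''.join(nested))   # Python is a high-level language.
```

Because the result is a list of separate strings, joining them is what gives back the paragraph as one piece of text.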
Alternatively, you can select the node in question itself and call the lxml.html element method
text_content()
which returns the text of the node including all of its child nodes.
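from lxml import html
import requests

# Download the page and parse it into an element tree.
page = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
tree = html.fromstring(page.content)
# xpath() returns a list of matching elements, even for a single match.
firstp = tree.xpath('/html/body/div[3]/div[3]/div[4]/div/p[1]')
# text_content() returns the text of the element and all its descendants.
firstp[0].text_content()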
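Since xpath() returns a list, indexing with [0] raises an IndexError whenever the path matches nothing, which can easily happen if the page layout changes. A small sketch of a guarded version follows; the helper name first_paragraph_text and the sample markup are hypothetical, and it uses the looser path //p rather than the absolute path above.

```python
from lxml import html

def first_paragraph_text(source):
    """Return the text of the first <p>, link text included, or '' if there is none."""
    tree = html.fromstring(source)
    paragraphs = tree.xpath('//p')  # always a list, possibly empty
    if not paragraphs:
        return ''
    return paragraphs[0].text_content()

# Hypothetical sample markup, not the real Wikipedia page.
snippet = '<div><p>Python is a <a href="#">high-level</a> language.<sup>[1]</sup></p></div>'
print(first_paragraph_text(snippet))
print(repr(first_paragraph_text('<div>no paragraphs here</div>')))
```

Note that text_content() keeps everything inside the paragraph, including footnote markers such as the [1] in the sample, so some cleanup may still be needed afterwards.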