
XPath Taking Text With Hyperlinks (Python)

I'm new to XPath (and a relative beginner at Python in general). I'm trying to use it to extract the text of the first paragraph of a Wikipedia page. Take for instance

Solution 1:

The links themselves are nodes that you need to descend into.

/html/body/div[3]/div[3]/div[4]/div/p[1]//text()
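
To see why the double slash matters, here is a minimal offline sketch using lxml. The HTML snippet and the variable names are made up for illustration and only stand in for a Wikipedia paragraph; the real page markup will differ.

```python
from lxml import html

# Hypothetical stand-in for a paragraph with hyperlinks (not real Wikipedia markup)
snippet = (
    '<div><p>Python is a <a href="/wiki/High-level">high-level</a> '
    'language by <a href="/wiki/Guido">Guido van Rossum</a>.</p></div>'
)
tree = html.fromstring(snippet)

# /text() selects only the text nodes that are direct children of <p>
direct = tree.xpath('.//p[1]/text()')

# //text() also selects text nodes nested inside the <a> elements
nested = tree.xpath('.//p[1]//text()')

print(direct)           # the link text is missing here
print(''.join(nested))  # full sentence, link text included
```

With the single slash, the words inside the links drop out of the result; with the double slash, joining the pieces gives back the whole sentence.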

Solution 2:

Your XPath query matches only the text nodes that are direct children of that node. The text of the embedded links lives in other nodes and is therefore excluded.

  1. To descend, use //text() as suggested; this retrieves the text of every descendant node, starting from the node in question.

    /html/body/div[3]/div[3]/div[4]/div/p[1]//text()
    
  2. Alternatively, you can select the node in question itself and retrieve its text with lxml's text_content() method, which returns the text of the element including all of its descendants:

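from lxml import html
import requests

# Download the page and parse it into an element tree
page = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
tree = html.fromstring(page.content)

# xpath() returns a list; take the first matching <p> element
firstp = tree.xpath('/html/body/div[3]/div[3]/div[4]/div/p[1]')
# text_content() returns the paragraph text, including text inside links
print(firstp[0].text_content())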
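
The two approaches should give the same string. The sketch below checks that on a small inline snippet, so it runs without a network request; the snippet and variable names are invented for illustration, and the XPath string() function is shown as a third option that the answers above do not mention.

```python
from lxml import html

# Hypothetical paragraph with nested elements (not real Wikipedia markup)
snippet = '<p><b>Python</b> is a <a href="#">programming language</a>.<sup>[1]</sup></p>'
p = html.fromstring(snippet)

# Option 1: collect every descendant text node and join the pieces
via_xpath = ''.join(p.xpath('.//text()'))

# Option 2: let lxml concatenate the text of the element and its descendants
via_method = p.text_content()

# Extra option (an addition here): the XPath string() function
via_string = p.xpath('string(.)')

print(via_method)
```

Note that xpath() returns an empty list when the path matches nothing, so indexing with [0] on a long absolute path such as the one above can raise an IndexError if the page layout changes; checking the list before indexing is a reasonable precaution.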
