lxml Tutorial: XML Processing and Web Scraping With lxml
Flipnode on May 31 2023
Welcome to this lxml Python tutorial, where we will dive into the powerful lxml library. Throughout this tutorial, we will cover the fundamental aspects of working with XML documents and progress to handling both XML and HTML documents. We will then bring everything together and explore how to extract data using lxml. Each step will be accompanied by practical examples in Python, ensuring a hands-on learning experience.
Prerequisites
This tutorial assumes a basic understanding of Python programming. Familiarity with XML and HTML concepts, such as elements and attributes, will also help you follow along with the material presented in this article.
The code snippets in this tutorial are written in Python 3, but with minimal modifications, they can also be adapted to Python 2.
What is lxml in Python?
The lxml library is renowned for its exceptional performance and extensive features when it comes to XML and HTML processing in Python. It serves as a convenient interface to the underlying C libraries, libxml2 and libxslt, combining the efficiency of the C implementation with the ease of use provided by Python.
With the Python lxml library, you can effortlessly create, parse, and query XML and HTML documents. It is a crucial dependency for various complex packages, such as Scrapy, enabling advanced web scraping capabilities.
Installation
To install the lxml library, the recommended approach is to obtain it from the Python Package Index (PyPI). On Linux (debian-based) systems, you can use the following command:
sudo apt-get install python3-lxml
Alternatively, you can utilize the pip package manager, which is compatible with Windows, Mac, and Linux. Execute the following command:
pip3 install lxml
For Windows users running Python 3, you can simply use the command pip install lxml.
Creating a simple XML document
To represent XML or XML-compliant HTML, we can visualize it as a tree structure consisting of a root and various branches. Each branch can further contain additional branches. In the lxml library, these branches and the root are represented as elements.
Here's an example of a simple XML document:
<root>
  <branch>
    <branch_one>
    </branch_one>
    <branch_one>
    </branch_one>
  </branch>
</root>
If an HTML document follows XML compliance, it adheres to the same tree structure concept. However, it's important to note that not all HTML documents are XML compliant. For instance, an HTML document with a <br> tag without a corresponding closing tag is still considered valid HTML but not valid XML. In the later part of this tutorial, we will explore how to handle such cases. For now, let's focus on XML-compliant HTML.
The Element class
To create an XML document using Python lxml, you need to start by importing the etree module from lxml:
from lxml import etree
In lxml, every XML document starts with a root element. To create the root element, you can use the Element type. This type serves as a flexible container object for storing hierarchical data. In this Python lxml example, we aim to create an XML-compliant HTML, so the root element should have the name "html":
root = etree.Element("html")
Similarly, an HTML document consists of a head and a body. We can create these elements using the Element type as well:
head = etree.Element("head")
body = etree.Element("body")
To establish parent-child relationships, we can use the append() method:
root.append(head)
root.append(body)
The XML document can be serialized and printed to the terminal using the tostring() function. This function requires the root element as a mandatory argument. Optionally, you can set pretty_print to True to make the output more readable. It's important to note that the tostring() function returns bytes, which can be converted to a string by calling decode():
print(etree.tostring(root, pretty_print=True).decode())
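Putting the steps above together, here is a minimal runnable sketch that builds the tree and serializes it (the output shown assumes lxml's default two-space pretty-printing):

```python
from lxml import etree

# Build the root and its two children
root = etree.Element("html")
head = etree.Element("head")
body = etree.Element("body")
root.append(head)
root.append(body)

# tostring() returns bytes; decode() converts them to a str
output = etree.tostring(root, pretty_print=True).decode()
print(output)
# <html>
#   <head/>
#   <body/>
# </html>
```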
The SubElement class
To improve code readability and simplicity, you can utilize the SubElement type in lxml. It provides a convenient way to create child elements without the need to separately create an Element object and then use the append() function. SubElement takes two arguments: the parent node and the element name. By using SubElement, you can replace the following two lines of code with a single line:
body = etree.Element("body")
root.append(body)
The equivalent code using SubElement is:
body = etree.SubElement(root, "body")
This accomplishes the same result of creating the body element as a child of the root element in a more concise manner.
Setting text and attributes
Setting text and attributes in lxml is straightforward: use the text property and the set() method available on element objects. Here are some examples:
To set the text of an element, you can assign a value to the text attribute:
para = etree.SubElement(body, "p")
para.text = "Hello World!"
Similarly, attributes can be set using the set method with the key-value convention:
para.set("style", "font-size:20pt")
Alternatively, attributes can be passed directly in the constructor of SubElement:
para = etree.SubElement(body, "p", style="font-size:20pt", id="firstPara")
para.text = "Hello World!"
Using this approach not only saves lines of code but also improves clarity. Here's a complete example that generates an HTML document, which is also a well-formed XML:
from lxml import etree
root = etree.Element("html")
head = etree.SubElement(root, "head")
title = etree.SubElement(head, "title")
title.text = "This is Page Title"
body = etree.SubElement(root, "body")
heading = etree.SubElement(body, "h1", style="font-size:20pt", id="head")
heading.text = "Hello World!"
para = etree.SubElement(body, "p", id="firstPara")
para.text = "This HTML is XML Compliant!"
para = etree.SubElement(body, "p", id="secondPara")
para.text = "This is the second paragraph."
etree.dump(root) # prints everything to console. Use for debug only
with open('input.html', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))
Note that in this example, etree.dump() is used for debugging purposes to print the XML structure to the console. For serialization, etree.tostring() is used, which returns the XML as bytes that can be stored in a variable or written to a file.
The provided code snippet saves the contents to an "input.html" file in the same folder as the script. It generates a well-formed XML that can be interpreted as XML or HTML.
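Once text and attributes are set, they can also be read back. A small sketch, using get() for a single attribute and the attrib mapping for all of them:

```python
from lxml import etree

body = etree.Element("body")
para = etree.SubElement(body, "p", style="font-size:20pt", id="firstPara")
para.text = "Hello World!"

# get() reads one attribute; attrib exposes all of them as a dict-like mapping
print(para.get("style"))  # font-size:20pt
print(para.get("id"))     # firstPara
print(dict(para.attrib))
```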
How do you parse an XML file using LXML in Python?
In the previous section, we covered the Python lxml tutorial for creating XML files. Now, let's explore how to traverse and manipulate an existing XML document using the lxml library.
Before we proceed, save the following snippet as "input.html":
<html>
<head>
<title>This is Page Title</title>
</head>
<body>
<h1 style="font-size:20pt" id="head">Hello World!</h1>
<p id="firstPara">This HTML is XML Compliant!</p>
<p id="secondPara">This is the second paragraph.</p>
</body>
</html>
When an XML document is parsed, it creates an in-memory ElementTree object.
If the XML contents are in a file system, you can load it using the parse method. This method returns an ElementTree object, from which you can obtain the root element by calling the getroot() method.
from lxml import etree
tree = etree.parse('input.html')
elem = tree.getroot()
etree.dump(elem) # prints file contents to console
The lxml.etree module also provides the fromstring() method, which can be used to parse XML contents from a valid XML string.
xml = '<html><body>Hello</body></html>'
root = etree.fromstring(xml)
etree.dump(root)
It's important to note that the fromstring() method returns an element object directly, eliminating the need to call getroot().
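A quick sketch illustrating this: the object returned by fromstring() can be inspected and indexed directly, with child elements accessible by position:

```python
from lxml import etree

xml = '<html><body>Hello</body></html>'
root = etree.fromstring(xml)

# fromstring() hands back the root element itself -- no getroot() needed
print(root.tag)      # html
print(root[0].tag)   # body
print(root[0].text)  # Hello
```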
If you are interested in learning more about parsing, we have a separate tutorial on BeautifulSoup, a Python package used for parsing HTML and XML documents. It's worth mentioning that lxml can use BeautifulSoup as a parser backend, and similarly, BeautifulSoup can utilize lxml as a parser.
Finding elements in XML
There are two main approaches for finding elements using the Python lxml library. The first method involves using the querying languages XPath and ElementPath. For instance, the following code snippet demonstrates how to find the first paragraph element:
tree = etree.parse('input.html')
elem = tree.getroot()
para = elem.find('body/p')
etree.dump(para)
Output:
<p id="firstPara">This HTML is XML Compliant!</p>
Similarly, the findall() method can be used to retrieve a list of all elements that match the specified selector:
elem = tree.getroot()
para = elem.findall('body/p')
for e in para:
    etree.dump(e)
Output:
<p id="firstPara">This HTML is XML Compliant!</p>
<p id="secondPara">This is the second paragraph.</p>
The second approach involves using XPath directly, which is particularly useful for developers familiar with XPath. With XPath, you can easily retrieve the element instances, text content, or attribute values using standard XPath syntax. Here's an example:
para = elem.xpath('//p/text()')
for e in para:
    print(e)
Output:
This HTML is XML Compliant!
This is the second paragraph.
Both methods provide flexibility in locating elements within an XML document, allowing you to choose the approach that suits your needs and familiarity with XPath.
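Beyond text nodes, XPath can also select attribute values and filter by attributes. The following self-contained sketch parses the same sample markup from a string rather than from input.html, so it can run on its own:

```python
from lxml import etree

doc = """<html><body>
<p id="firstPara">This HTML is XML Compliant!</p>
<p id="secondPara">This is the second paragraph.</p>
</body></html>"""

root = etree.fromstring(doc)

# //p/@id selects the id attribute of every <p> element
ids = root.xpath('//p/@id')
print(ids)  # ['firstPara', 'secondPara']

# A predicate narrows the match: text of the <p> whose id is "secondPara"
second = root.xpath('//p[@id="secondPara"]/text()')[0]
print(second)  # This is the second paragraph.
```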
Handling HTML with lxml.html
When working with HTML that may not be well-formed or XML compliant, the lxml.html module can be used instead of lxml.etree. A common pattern is to read the file contents into a string and parse it with fromstring(); the module also provides its own parse() function for reading directly from files. Here's an example code snippet that demonstrates how to print all paragraphs from an HTML file using lxml.html:
from lxml import html
with open('input.html') as f:
    html_string = f.read()
tree = html.fromstring(html_string)
paragraphs = tree.xpath('//p/text()')
for paragraph in paragraphs:
    print(paragraph)
Output:
This HTML is XML Compliant!
This is the second paragraph.
By utilizing lxml.html, you can handle HTML documents that may have varying structures or compliance levels, allowing you to extract desired elements using XPath queries. Remember to read the file contents into a string and then create an lxml.html tree from that string for further processing.
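To see the difference in practice, here is a small sketch feeding lxml.html a fragment that is valid HTML but not valid XML (an unclosed <br> and a missing closing </p>); the parser repairs the markup into a proper tree:

```python
from lxml import html

# Not valid XML: <br> is unclosed and </p> is missing
broken = "<p>First line<br>Second line"

tree = html.fromstring(broken)

# The serialized result has the tags properly closed,
# and text_content() gathers all text despite the broken input
print(html.tostring(tree).decode())
print(tree.text_content())
```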
lxml web scraping tutorial
Now that we have learned how to parse and find elements in XML and HTML, the missing piece is retrieving the HTML of a web page.
For this task, the 'requests' library is a great choice. You can install it using the pip package manager:
pip install requests
Once the 'requests' library is installed, you can retrieve the HTML of any web page using the get() method. Here's an example:
import requests
response = requests.get('http://books.toscrape.com/')
print(response.text)
# prints the source HTML
This can be combined with lxml to extract any desired data.
Here's a quick example that prints a list of countries from Wikipedia:
import requests
from lxml import html
response = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2010')
tree = html.fromstring(response.text)
countries = tree.xpath('//span[@class="flagicon"]')
for country in countries:
    print(country.xpath('./following-sibling::a/text()')[0])
In this code, the HTML returned by response.text is parsed into the tree variable, which you can query using standard XPath syntax. XPath expressions can also be chained: the inner query is evaluated relative to each country element returned by the outer one. Note that the xpath() method returns a list, so in this code snippet only the first item is taken.
This approach can be easily extended to retrieve any attribute from the HTML. For example, the modified code snippet below prints both the country name and the image URL of the flag:
for country in countries:
    flag = country.xpath('./img/@src')[0]
    country_name = country.xpath('./following-sibling::a/text()')[0]
    print(country_name, flag)
By combining the 'requests' library with lxml, you can fetch the HTML content of web pages and extract specific data using XPath queries.
Conclusion
This Python lxml tutorial has covered various aspects of XML and HTML handling using the lxml library. The lxml library is lightweight, fast, and feature-rich, making it suitable for tasks such as creating XML documents, reading existing documents, and finding specific elements. It is equally powerful for working with both XML and HTML documents. Additionally, when combined with the requests library, lxml can be seamlessly used for web scraping tasks. Overall, lxml is a versatile tool for XML and HTML processing in Python.