lxml Tutorial: XML Processing and Web Scraping With lxml

Flipnode on May 31, 2023


Welcome to this lxml Python tutorial, where we will dive into the powerful lxml library. Throughout this tutorial, we will cover the fundamental aspects of working with XML documents and progress to handling both XML and HTML documents. We will then bring everything together and explore how to extract data using lxml. Each step will be accompanied by practical examples in Python, ensuring a hands-on learning experience.

Prerequisites

This tutorial assumes that you have a basic understanding of Python programming. Familiarity with XML and HTML concepts, such as elements and attributes, is also beneficial.

The code snippets in this tutorial are written in Python 3, though most can be adapted to Python 2 with minimal modifications.

What is lxml in Python?

The lxml library is renowned for its exceptional performance and extensive features when it comes to XML and HTML processing in Python. It serves as a convenient interface to the underlying C libraries, libxml2 and libxslt, combining the efficiency of the C implementation with the ease of use provided by Python.

With the Python lxml library, you can effortlessly create, parse, and query XML and HTML documents. It is a crucial dependency for various complex packages, such as Scrapy, enabling advanced web scraping capabilities.

Installation

To install the lxml library, you can use either your system's package manager or pip, which fetches it from the Python Package Index (PyPI). On Debian-based Linux systems, you can use the following command:

sudo apt-get install python3-lxml

Alternatively, you can utilize the pip package manager, which is compatible with Windows, Mac, and Linux. Execute the following command:

pip3 install lxml

For Windows users running Python 3, you can simply use the command pip install lxml.
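To confirm that the installation succeeded, a quick sanity check is to print the installed version from the command line:

python3 -c "from lxml import etree; print(etree.__version__)"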

Creating a simple XML document

To represent XML or XML-compliant HTML, we can visualize it as a tree structure consisting of a root and various branches. Each branch can further contain additional branches. In the lxml library, these branches and the root are represented as elements.

Here's an example of a simple XML document:

<root>
  <branch>
    <branch_one>
    </branch_one>
    <branch_one>
    </branch_one>
  </branch>
</root>

If an HTML document is XML compliant, it follows the same tree structure. However, not all HTML documents are XML compliant. For instance, an HTML document with a <br> tag and no corresponding closing tag is still valid HTML but not valid XML. Later in this tutorial, we will explore how to handle such cases. For now, let's focus on XML-compliant HTML.

The Element class

To create an XML document using Python lxml, you need to start by importing the etree module from lxml:

from lxml import etree

In lxml, every XML document starts with a root element. To create the root element, you can use the Element type. This type serves as a flexible container object for storing hierarchical data. In this Python lxml example, we aim to create an XML-compliant HTML, so the root element should have the name "html":

root = etree.Element("html")

Similarly, an HTML document consists of a head and a body. We can create these elements using the Element type as well:

head = etree.Element("head")
body = etree.Element("body")

To establish parent-child relationships, we can use the append() method:

root.append(head)
root.append(body)

The XML document can be serialized and printed to the terminal using the tostring() function. This function requires the root element as a mandatory argument. Optionally, you can set pretty_print to True to make the output more readable. It's important to note that the tostring() function returns bytes, which can be converted to a string by calling decode():

print(etree.tostring(root, pretty_print=True).decode())
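With only the empty head and body attached to the root, the serialized output should look like this (lxml renders empty elements as self-closing tags):

<html>
  <head/>
  <body/>
</html>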

The SubElement class

To improve code readability and simplicity, you can utilize the SubElement type in lxml. It provides a convenient way to create a child element without separately creating an Element object and then calling append(). SubElement takes the parent node and the element name as its first two arguments; attributes can optionally be passed as well, as we'll see shortly. Using SubElement, you can replace the following two lines of code with a single line:

body = etree.Element("body")
root.append(body)

The equivalent code using SubElement is:

body = etree.SubElement(root, "body")

This accomplishes the same result of creating the body element as a child of the root element in a more concise manner.

Setting text and attributes

Setting text and attributes in lxml is straightforward and can be done using the text property and the set() method available on element objects. Here are some examples:

To set the text of an element, you can assign a value to the text attribute:

para = etree.SubElement(body, "p")
para.text = "Hello World!"

Similarly, attributes can be set using the set method with the key-value convention:

para.set("style", "font-size:20pt")

Alternatively, attributes can be passed directly in the constructor of SubElement:

para = etree.SubElement(body, "p", style="font-size:20pt", id="firstPara")
para.text = "Hello World!"

Using this approach not only saves lines of code but also improves clarity. Here's a complete example that generates an HTML document, which is also a well-formed XML:

from lxml import etree

root = etree.Element("html")
head = etree.SubElement(root, "head")
title = etree.SubElement(head, "title")
title.text = "This is Page Title"
body = etree.SubElement(root, "body")
heading = etree.SubElement(body, "h1", style="font-size:20pt", id="head")
heading.text = "Hello World!"
para = etree.SubElement(body, "p", id="firstPara")
para.text = "This HTML is XML Compliant!"
para = etree.SubElement(body, "p", id="secondPara")
para.text = "This is the second paragraph."
etree.dump(root)  # prints everything to the console; use for debugging only

with open('input.html', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))

Note that in this example, etree.dump() is used only for debugging, printing the XML structure to the console. For serialization, etree.tostring() is used; it returns bytes, which is why the file is opened in binary mode ('wb').
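If you would rather work with a str than bytes, tostring() also accepts an encoding argument; passing encoding='unicode' returns a Python string directly:

xml_string = etree.tostring(root, pretty_print=True, encoding='unicode')
print(xml_string)  # already a str, no decode() needed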

The provided code snippet saves the contents to an "input.html" file in the same folder as the script. It generates well-formed XML that can be interpreted as either XML or HTML.

How do you parse an XML file using lxml in Python?

In the previous section, we covered the Python lxml tutorial for creating XML files. Now, let's explore how to traverse and manipulate an existing XML document using the lxml library.

Before we proceed, save the following snippet as "input.html":

<html>
  <head>
    <title>This is Page Title</title>
  </head>
  <body>
    <h1 style="font-size:20pt" id="head">Hello World!</h1>
    <p id="firstPara">This HTML is XML Compliant!</p>
    <p id="secondPara">This is the second paragraph.</p>
  </body>
</html>

When an XML document is parsed, it creates an in-memory ElementTree object.

If the XML contents are in a file, you can load them using the parse() function. It returns an ElementTree object, from which you can obtain the root element by calling the getroot() method:

from lxml import etree

tree = etree.parse('input.html')
elem = tree.getroot()
etree.dump(elem) # prints file contents to console

The lxml.etree module also provides the fromstring() method, which can be used to parse XML contents from a valid XML string.

xml = '<html><body>Hello</body></html>'
root = etree.fromstring(xml)
etree.dump(root)

It's important to note that the fromstring() method returns an element object directly, eliminating the need to call getroot().

If you are interested in learning more about parsing, we have a separate tutorial on BeautifulSoup, a Python package used for parsing HTML and XML documents. It's worth mentioning that lxml can use BeautifulSoup as a parser backend, and similarly, BeautifulSoup can utilize lxml as a parser.
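For example, lxml ships with a soupparser module that delegates parsing to BeautifulSoup, which is handy for badly broken markup. The snippet below is a minimal sketch, assuming the beautifulsoup4 package is installed:

from lxml.html.soupparser import fromstring  # requires the beautifulsoup4 package

broken_html = "<html><body><p>Unclosed paragraph<br></body>"
root = fromstring(broken_html)  # BeautifulSoup repairs the markup first
print(root.find(".//p").text)   # prints: Unclosed paragraph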

Finding elements in XML

There are two main approaches to finding elements with the Python lxml library. The first uses the find() and findall() methods, which accept ElementPath expressions, a simplified XPath-like syntax. For instance, the following code snippet demonstrates how to find the first paragraph element:

tree = etree.parse('input.html')
elem = tree.getroot()
para = elem.find('body/p')
etree.dump(para)

Output:

<p id="firstPara">This HTML is XML Compliant!</p>

Similarly, the findall() method can be used to retrieve a list of all elements that match the specified selector:

elem = tree.getroot()
para = elem.findall('body/p')
for e in para:
    etree.dump(e)

Output:

<p id="firstPara">This HTML is XML Compliant!</p>
<p id="secondPara">This is the second paragraph.</p>

The second approach involves using XPath directly, which is particularly useful for developers familiar with XPath. With XPath, you can easily retrieve the element instances, text content, or attribute values using standard XPath syntax. Here's an example:

para = elem.xpath('//p/text()')
for e in para:
    print(e)

Output:

This HTML is XML Compliant!
This is the second paragraph.

Both methods provide flexibility in locating elements within an XML document, allowing you to choose the approach that suits your needs and familiarity with XPath.
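XPath can also pull attribute values directly. Using the same input.html and the elem root from above, the following prints the id attribute of each paragraph:

ids = elem.xpath('//p/@id')
print(ids)  # ['firstPara', 'secondPara']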

Handling HTML with lxml.html

When working with HTML that may not be well-formed or XML compliant, the lxml.html module can be used instead of lxml.etree. In the example below, the file contents are read into a string and parsed with fromstring(). Here's how to print all paragraphs from an HTML file using lxml.html:

from lxml import html

with open('input.html') as f:
    html_string = f.read()

tree = html.fromstring(html_string)
paragraphs = tree.xpath('//p/text()')

for paragraph in paragraphs:
    print(paragraph)

Output:

This HTML is XML Compliant!
This is the second paragraph.

By utilizing lxml.html, you can handle HTML documents that may have varying structures or compliance levels, allowing you to extract desired elements using XPath queries. Remember to read the file contents into a string and then create an lxml.html tree from that string for further processing.
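When the XPath expression selects elements rather than text nodes, xpath() returns element objects, and you can read their attributes with get(). A short sketch using the same tree:

for p in tree.xpath('//p'):
    print(p.get('id'), p.text)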

lxml web scraping tutorial

Now that we have learned how to parse and find elements in XML and HTML, the missing piece is retrieving the HTML of a web page.

For this task, the 'requests' library is a great choice. You can install it using the pip package manager:

pip install requests

Once the 'requests' library is installed, you can retrieve the HTML of any web page using the get() method. Here's an example:

import requests

response = requests.get('http://books.toscrape.com/')
print(response.text)
# prints the source HTML

This can be combined with lxml to extract any desired data.

Here's a quick example that prints a list of countries from Wikipedia:

import requests
from lxml import html
response = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2010')
tree = html.fromstring(response.text)
countries = tree.xpath('//span[@class="flagicon"]')
for country in countries:
    print(country.xpath('./following-sibling::a/text()')[0])

In this code, the HTML returned by response.text is parsed into the tree variable, which you can query using standard XPath syntax. XPath expressions can also be evaluated relative to an element, as the './'-prefixed query inside the loop shows. Note that the xpath() method returns a list, so this snippet only takes the first item.

This approach can be easily extended to retrieve any attribute from the HTML. For example, the modified code snippet below prints both the country name and the image URL of the flag:

for country in countries:
    flag = country.xpath('./img/@src')[0]
    country_name = country.xpath('./following-sibling::a/text()')[0]
    print(country_name, flag)

By combining the 'requests' library with lxml, you can fetch the HTML content of web pages and extract specific data using XPath queries.
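In practice, it is worth failing fast on HTTP errors before parsing. A minimal sketch, reusing the books.toscrape.com page from earlier and requests' built-in raise_for_status():

import requests
from lxml import html

response = requests.get('http://books.toscrape.com/', timeout=10)
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
tree = html.fromstring(response.text)
print(tree.xpath('//title/text()')[0].strip())  # prints the page title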

Conclusion

This Python lxml tutorial has covered various aspects of XML and HTML handling using the lxml library. The lxml library is lightweight, fast, and feature-rich, making it suitable for tasks such as creating XML documents, reading existing documents, and finding specific elements. It is equally powerful for working with both XML and HTML documents. Additionally, when combined with the requests library, lxml can be seamlessly used for web scraping tasks. Overall, lxml is a versatile tool for XML and HTML processing in Python.

