XPath vs CSS Selectors
Flipnode on Jun 20 2023
Web scraping and automation tools keep getting more useful and more accessible, and a growing number of companies are investing in them. Whatever technology or framework you use, you will have to decide whether to locate elements with XPath or with CSS selectors. This article explores both options, explains the strengths of each, and helps you make a well-informed choice for your web scraping projects.
What is an XPath selector?
XPath, short for XML Path Language, is a powerful tool used to navigate and query different parts of a document using a non-XML syntax. It allows for easy identification and matching of specific elements within the document structure.
During the software boom, companies began adopting software solutions to manage their operations individually, resulting in a chaotic multi-system environment. This led to the presence of numerous systems using different programming languages and platforms that needed to communicate with each other.
To address this challenge, XML (eXtensible Markup Language) emerged as a solution. XML serves as a standardized way of representing information. An XML document example could be as follows:
<?xml version="1.0" encoding="UTF-8"?>
<inventory>
<product type="bottle">Water Bottle</product>
<brand>ABC</brand>
<code type="upc-13">54268453659458</code>
<stock>54</stock>
<description>Red colored water bottle</description>
</inventory>
XML Schema Definition was introduced to standardize each XML document for specific purposes, while XPath was developed as a means to query and extract specific information from XML documents.
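As a minimal sketch of what querying this document looks like, the snippet below uses Python's standard-library xml.etree.ElementTree module, which implements only a subset of XPath. The variable names are illustrative, and the XML declaration is left out of the string for brevity.

```python
import xml.etree.ElementTree as ET

# The inventory document from above (XML declaration omitted for brevity).
INVENTORY = """<inventory>
<product type="bottle">Water Bottle</product>
<brand>ABC</brand>
<code type="upc-13">54268453659458</code>
<stock>54</stock>
<description>Red colored water bottle</description>
</inventory>"""

root = ET.fromstring(INVENTORY)

# Paths are evaluated relative to the root element <inventory>.
product_name = root.find("product").text
stock_count = int(root.find("stock").text)

# A predicate in square brackets filters on an attribute value.
code = root.find("code[@type='upc-13']").text

print(product_name, stock_count, code)
```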
Although HTML does not follow XML's strict well-formedness rules, XPath can still be used to query HTML documents. Even when the markup is malformed, parsers such as lxml repair the document structure so that it can be queried.
Over the years, XML and XPath have remained widely adopted standards, influencing various technologies. For instance, the popular automation tool Selenium leverages XPath as a preferred method for locating elements, particularly among automation testers.
How to create XPath?
To gain a better understanding of the fundamentals of XPath, you can refer to this straightforward HTML document. Copy the provided HTML markup and save it as document1.html.
<html>
<head>
<title>XPath vs CSS</title>
</head>
<body>
<h2 id="header">Welcome to XPath</h2>
<div id="navbar">
<a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
<a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
</div>
</body>
</html>
Let's begin by clarifying some basic terminology associated with XPath.
Nodes and Relationships
When dealing with XML or HTML, it's important to understand the three kinds of nodes and values you will work with most often:
- Element Node: These nodes refer to the tags in the XML or HTML document. For example, <title>XPath vs CSS</title> represents an element node.
- Attribute Node: Attribute nodes are part of an element node and define additional properties. An example of an attribute node is id="blog" within an element.
- Atomic Value: These nodes represent the final values, which can be either the text contained within an element or the value of an attribute. For instance, "Case Studies" is an atomic value in the given example.
These nodes can have different relationships with each other, and understanding these relationships is crucial when creating XPath selectors. The following fundamental relationships apply to both XPath and CSS:
- Children: Children refer to elements that are directly one level down. In the provided HTML document, the elements <h2> and <div> are children of the <body> element. Note that the <a> elements are not children of the <body>.
- Parent: Parent elements are exactly one level up from a given element. Each element has only one parent. In the example, the parent of the <a> element is the <div>.
- Siblings: Siblings are elements that share the same parent. In the given example, the <h2> and <div> tags are siblings since they have the same parent, <body>.
- Descendants: Descendants include all elements at any level below the current element, including children. In the example, the descendants of the <body> element are <h2>, <div>, and the two <a> elements.
- Ancestor: Ancestors refer to elements at any level above the current element, including the parent element. In this case, the ancestors of the <a> tag are <div>, <body>, and <html>.
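These relationships can be checked in code. The sketch below is one possible way to do it with Python's standard-library ElementTree, assuming document1.html is well-formed enough to be parsed as XML (it is, as written above). ElementTree elements do not keep a reference to their parent, so the parent_of lookup is a helper built for this illustration.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<html>
<head>
<title>XPath vs CSS</title>
</head>
<body>
<h2 id="header">Welcome to XPath</h2>
<div id="navbar">
<a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
<a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
</div>
</body>
</html>"""

root = ET.fromstring(DOCUMENT)
body = root.find("body")

# Illustrative helper: a child -> parent lookup for the whole tree.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Children: elements exactly one level below <body>.
children = [el.tag for el in body]

# Descendants: elements at any level below <body> (iter() also yields body itself).
descendants = [el.tag for el in body.iter() if el is not body]

# Parent: exactly one level above the first <a>.
first_link = root.find(".//a")
parent = parent_of[first_link].tag

# Ancestors: keep walking up until the top-level element is reached.
ancestors = []
node = parent_of.get(first_link)
while node is not None:
    ancestors.append(node.tag)
    node = parent_of.get(node)

# Siblings: other elements that share the parent of <h2>.
h2 = body.find("h2")
siblings = [el.tag for el in parent_of[h2] if el is not h2]

print(children, descendants, parent, ancestors, siblings)
```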
Creating the full XPath
The "bottom-up" approach is the simplest way to create an XPath selector. Follow these steps to build an XPath expression:
- Start by writing the name of the element you want to select. In this case, it's h2.
- Add a forward slash before the element name. The XPath becomes /h2.
- Write the name of the parent element, preceded by another forward slash, in front of the expression. Here the parent is <body>, so the XPath becomes /body/h2.
- Repeat for each parent in turn. The parent of <body> is <html>, so the XPath becomes /html/body/h2.
- Stop when you reach the top-level element. Since <html> has no parent, the final XPath expression is /html/body/h2.
While this approach may seem tedious, it helps in understanding the structure of XPath.
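The bottom-up procedure can also be sketched in a few lines of Python. The full_xpath function below is a hypothetical helper written for this article, not part of any library. It relies on the standard-library ElementTree and deliberately ignores positional indexes such as div[3], which a browser adds when an element has same-named siblings.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<html>
<head>
<title>XPath vs CSS</title>
</head>
<body>
<h2 id="header">Welcome to XPath</h2>
<div id="navbar">
<a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
<a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
</div>
</body>
</html>"""


def full_xpath(root, target):
    """Build a full XPath by walking from target up to the top-level element.

    Simplification: no positional indexes are added, so the result is only
    unambiguous when each step has no same-named siblings.
    """
    parent_of = {child: parent for parent in root.iter() for child in parent}
    names = []
    node = target
    while node is not None:
        names.append(node.tag)        # start with the element itself
        node = parent_of.get(node)    # then move to its parent, if any
    # Names were collected bottom-up, so reverse them and prefix each with "/".
    return "/" + "/".join(reversed(names))


root = ET.fromstring(DOCUMENT)
h2 = root.find("body/h2")
h2_path = full_xpath(root, h2)
print(h2_path)
```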
Alternatively, you can use the Developer Tools in your browser to generate the XPath:
- Save the provided HTML to a file and open it in Chrome or Firefox.
- Right-click on the text "Welcome to XPath" and select "Inspect" from the context menu.
- This opens the "Developer Tools" panel with the <h2> element highlighted. Right-click the element within the panel, hover over "Copy," and choose "Copy Full XPath." This produces the same XPath that we created manually.
XPath Building Blocks
The "Copy" menu in Developer Tools offers two XPath options: XPath and Full XPath. The previous section explained how to create a Full XPath.
If we select XPath instead of Full XPath in the example above, the resulting XPath would be:
//*[@id="header"]
However, there are some issues with using Full XPath. The simple example we demonstrated earlier may not always apply. For instance, if we try to generate the Full XPath for the table of contents on a Wikipedia page, the Full XPath becomes complex and lengthy:
/html/body/div[3]/div[3]/div[5]/div[1]/div[2]/ul/li[1]/a/span[2]
There are three main problems with Full XPath:
- Readability and maintainability suffer because the XPath expression becomes difficult to understand.
- Full XPath is too rigid. Even slight changes in the page structure can break the XPath. For example, if an information banner is added at the top of the page, the Full XPath won't work.
- Full XPath traverses each node, which can make it slower.
That's why we need to learn a shorter, faster, and more readable XPath. Let's start with the first building block: the slash character. A single slash steps to the direct children of the current node, while a double slash selects matching nodes at any depth in the document.
For example, to find all <a> tags in a document, the XPath expression would be:
//a
Similarly, to find all <h2> tags, the XPath would be:
//h2
In the document1.html file we discussed earlier, there is only one <h2> tag. Therefore, the XPath we created earlier, /html/body/h2, can be shortened to //h2.
The next building block is the asterisk character, which acts as a wildcard and matches any element. For instance, to retrieve a list of all <a> elements that are children of a <div> element, the XPath without using an asterisk would be:
//div/a
Likewise, to select all <p> elements that are children of <div>, we can use the following XPath:
//div/p
However, if we want to select all elements that are children of a <div> element, we can use the asterisk as follows:
//div/*
The asterisk is useful when we want to match elements regardless of their tag name. To narrow the match down further, conditions can be added in square brackets; these are known as predicates.
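To see the slash and the asterisk in action, here is a small sketch that runs these expressions against document1.html with Python's standard-library ElementTree. One assumption to keep in mind: ElementTree evaluates paths relative to an element, so each expression is written with a leading dot (.//a instead of //a).

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<html>
<head>
<title>XPath vs CSS</title>
</head>
<body>
<h2 id="header">Welcome to XPath</h2>
<div id="navbar">
<a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
<a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
</div>
</body>
</html>"""

root = ET.fromstring(DOCUMENT)

# Double slash: every <a> and every <h2>, wherever they are in the document.
all_links = root.findall(".//a")
all_headings = root.findall(".//h2")

# Single slash after div: only <a> elements that are direct children of a <div>.
div_links = root.findall(".//div/a")

# Asterisk: direct children of a <div>, whatever their tag name.
div_children = root.findall(".//div/*")

# There are no <p> elements in this document, so this list is empty.
div_paragraphs = root.findall(".//div/p")

print(len(all_links), len(all_headings), len(div_links), len(div_children))
```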
Predicates in XPath
Predicates are written inside square brackets in XPath. In its simplest form, a predicate consists of just a number.
For instance, the following XPath will match two anchor tags:
//a
By adding a number inside the square brackets, we can specify which element to select. If we want to select the first anchor tag, the XPath will be:
//a[1]
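Please note that the first node is represented by the number 1, not 0 as in most programming languages. Also note that //a[1] selects every <a> that is the first <a> child of its own parent; to pick only the first match in the whole document, wrap the path in parentheses: (//a)[1].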
Functions can also be used within the square brackets. To obtain the last <a> tag, we can use the following XPath:
//a[last()]
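Positional predicates are evaluated per parent, which is easy to miss when a page has only one group of links. The sketch below uses a small hypothetical document (not document1.html) with two groups and Python's standard-library ElementTree, which supports numeric and last() predicates but not the parenthesized form.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with two groups of links, made up for this example.
TWO_GROUPS = "<body><div><a>A1</a><a>A2</a></div><div><a>B1</a></div></body>"
root = ET.fromstring(TWO_GROUPS)

# [1] keeps each <a> that is the first <a> child of its own parent.
first_per_parent = [a.text for a in root.findall(".//a[1]")]

# [last()] keeps each <a> that is the last <a> child of its own parent.
last_per_parent = [a.text for a in root.findall(".//a[last()]")]

# Indexing the Python list picks the first match in the whole document,
# which is what the XPath expression (//a)[1] selects.
first_in_document = root.findall(".//a")[0].text

print(first_per_parent, last_per_parent, first_in_document)
```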
There are many more functions that will be covered shortly. But before that, let's explore how we can extract the text from elements.
Getting Text from Elements
The XPath //h2 selects the element. Now let's explore how we can extract the text, or atomic value, from elements.
The text is typically found between the opening and closing tags of an element. For example:
<h2 id="header">Welcome to XPath</h2>
To extract this text, we can use the XPath node test text(). The following XPath returns the text inside the h2 element:
//h2/text()
To extract the value of an attribute, use the "@" symbol followed by the attribute name. For instance, consider this anchor tag:
<a href="https://flipnode.io/blog">Visit our Blog</a>
To extract the value of the href attribute, we can use the XPath:
//a/@href
This will retrieve the text "https://flipnode.io/blog".
Attribute selectors can also be used in XPath predicates. For example, the following XPath selects all <div> elements where the value of the id attribute is "header":
//div[@id="header"]
If you want to search for an element that contains specific text, you can use text() within square brackets. For example:
//a[text()="Visit our Blog"]
Note that this looks for an exact match. For partial matching, which is an important criterion when choosing between XPath and CSS, you can use the contains() function:
//a[contains(text(),"Blog")]
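Not every XPath engine implements all of these features. Python's standard-library ElementTree, for example, supports attribute predicates but not text(), @href extraction, or contains(). The sketch below shows roughly equivalent Python for each expression against document1.html. The [.='...'] predicate assumes Python 3.7 or later, and a full XPath engine such as lxml would accept the original expressions directly.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<html>
<head>
<title>XPath vs CSS</title>
</head>
<body>
<h2 id="header">Welcome to XPath</h2>
<div id="navbar">
<a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
<a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
</div>
</body>
</html>"""

root = ET.fromstring(DOCUMENT)

# Counterpart of //h2/text(): read the .text property of the element.
heading_text = root.find(".//h2").text

# Counterpart of //a/@href: read the attribute from each matched element.
all_hrefs = [a.get("href") for a in root.findall(".//a")]

# Attribute predicate, supported as-is: //*[@id="header"].
header = root.find(".//*[@id='header']")

# Counterpart of //a[text()="Visit our Blog"] (exact match, Python 3.7+).
exact = root.findall(".//a[.='Visit our Blog']")

# Counterpart of //a[contains(text(),"Blog")]: filter in Python instead.
partial = [a for a in root.findall(".//a") if "Blog" in (a.text or "")]

print(heading_text, all_hrefs, header.tag, len(exact), len(partial))
```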
These are the basics of XPath. In the next section, we will explore the key areas where XPath truly excels.
Advantages of using XPath
When working with HTML or the DOM, there are three main axes: ancestors, descendants, and siblings. While CSS can also select siblings and descendants, only XPath allows traversing up to ancestors.
Consider this example:
<a href="https://flipnode.io/">
<span class="link">Flipnode</span>
</a>
In this example, if we need to extract the value of the href attribute from the <a> tag, we can locate the <span> tag and go up one level to find the <a> tag:
//span[@class="link"]/../@href
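Notice the use of ".." to move up one level. CSS selectors have no equivalent step for moving from an element to its parent. (The newer :has() pseudo-class can filter a parent by its children, but support varies across tools.)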
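The parent step can be tried out with Python's standard-library ElementTree, which supports ".." in its XPath subset. In this sketch the snippet is wrapped in a <div>, an assumption made so that the search starts from an element above the <a> tag. Reading the attribute with get() stands in for the trailing /@href, which ElementTree does not support.

```python
import xml.etree.ElementTree as ET

# The snippet from above, wrapped in a <div> for this illustration.
SNIPPET = """<div>
<a href="https://flipnode.io/">
<span class="link">Flipnode</span>
</a>
</div>"""

root = ET.fromstring(SNIPPET)

# Locate the <span> by its class, then step up one level with "..".
link = root.find(".//span[@class='link']/..")

# Read the attribute from the parent element that was found.
href = link.get("href")

print(link.tag, href)
```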
Furthermore, regarding partial text matching, an early CSS3 draft proposed a :contains() pseudo-class, but it was dropped from the specification and is supported only by some libraries as a non-standard extension. The XPath function contains(), on the other hand, is part of the standard and is universally supported.
To summarize, if you need to traverse up the DOM, XPath is the only choice. Similarly, if you require partial matching, XPath may be your only option.
What is a CSS selector?
The CSS selector is an essential component of a CSS rule. CSS, or Cascading Style Sheets, consists of rules that instruct the browser on how to locate HTML elements and apply style properties to them.
HTML is rarely presented without CSS. When there is a need to modify the appearance of an HTML element, the most common approach is to apply a style to it, whether it involves a simple change in text color or a more complex animation.
In both cases, the first step is to locate the element, and only then can styles be applied to it. CSS selectors are what identify the elements. Although they were developed for styling rather than for scraping or testing, they do the same job as XPath: selecting elements.
Now that you know what XPath and CSS selectors are, let's explore how CSS selectors are created and when to choose one over the other.
How to create a CSS selector?
Getting started with a CSS selector is straightforward. A CSS selector can utilize tag names, IDs, (pseudo-)classes, or attributes. Let's consider the following element:
<h2 id="header" name="ctrl" class="fancy">XPath vs CSS</h2>
Here are a few examples of CSS selectors for this tag:
- h2: Selects elements by their tag name. The asterisk (*) is a wildcard that matches any tag.
- #header: Uses the pound sign (#) to specify the ID.
- .fancy: Uses a period (.) to specify the class.
- [name="ctrl"]: Uses square brackets to specify any attribute.
These selectors can also be combined. For example:
- h2#header: Selects <h2> elements with the ID "header".
When it comes to selecting children, the ">" combinator is used, while a space selects descendants at any depth. The "+" combinator selects the sibling that immediately follows an element, and the "~" combinator selects all following siblings. Here are a few examples:
- div > a: Selects <a> elements that are children of <div>.
- div a: Selects <a> elements that are descendants of <div>.
- div + a: Selects an <a> element that comes immediately after a <div> and shares its parent.
- div ~ a: Selects all <a> elements that come after a <div> and share its parent.
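Python's standard library does not include a CSS selector engine, so the sketch below takes a different route: it pairs several of the selectors above with XPath expressions that select the same elements and runs them with ElementTree. The markup is a hypothetical test page, and the mapping is an approximation made for this article. In particular, .fancy is matched here as an exact class attribute value, whereas CSS matches any element that has "fancy" among its classes. The sibling combinators are left out because ElementTree lacks the following-sibling axis.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the <h2> from above plus a <div> with nested links.
PAGE = """<body>
<h2 id="header" name="ctrl" class="fancy">XPath vs CSS</h2>
<div><a>direct</a><p><a>nested</a></p></div>
</body>"""

# CSS selector -> approximately equivalent XPath (ElementTree subset).
CSS_TO_XPATH = {
    "h2": ".//h2",
    "#header": ".//*[@id='header']",
    ".fancy": ".//*[@class='fancy']",  # exact match of the attribute only
    '[name="ctrl"]': ".//*[@name='ctrl']",
    "h2#header": ".//h2[@id='header']",
    "div > a": ".//div/a",   # children only
    "div a": ".//div//a",    # descendants at any depth
}

root = ET.fromstring(PAGE)

# Text of every element matched by each XPath counterpart.
matches = {
    css: [el.text for el in root.findall(xpath)]
    for css, xpath in CSS_TO_XPATH.items()
}

for css, texts in matches.items():
    print(f"{css!r:18} -> {texts}")
```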
If you understand these concepts, you should be able to handle most scenarios.
Advantages of using CSS Selectors
CSS selectors have several advantages that make them easy to learn and maintain.
Firstly, CSS selectors are widely supported by various web scraping and testing libraries, including Selenium, Beautiful Soup (Python), Scrapy (Python), Cheerio (JavaScript), and Puppeteer (JavaScript). However, it's important to note that Beautiful Soup specifically does not support XPath, so CSS selectors are the only option in that case.
Furthermore, modern browsers fully support CSS selectors, eliminating any compatibility concerns. While older browsers like Internet Explorer had limited CSS support, this is no longer an issue.
In terms of performance, CSS selectors are generally faster than XPath. However, it's worth noting that the difference in speed is so minimal that it is negligible in most real-world scenarios.
Overall, CSS selectors provide a convenient and efficient way to locate and manipulate elements on web pages, making them a preferred choice for many web developers and testers.
CSS vs XPath compared
XPath provides bidirectional traversal: from child to parent and from parent to child. CSS, on the other hand, allows traversal in one direction only, from parent to child. In terms of speed and performance, XPath tends to be slower, while CSS is generally the faster and more efficient choice.
However, it's important to consider these factors with some caution. Browsers are rapidly evolving, and nowadays, most browsers are based on Chromium, with the exception of Firefox and Safari. Additionally, computer resources have become less of a bottleneck, leading to improved performance overall.
It's worth noting that most tests and comparisons are primarily conducted within the context of the Selenium Framework. However, when it comes to web scraping, there are various other factors to consider. If rendering is not necessary for the task at hand, using a full browser may not be required, opening up alternative possibilities.
Choosing between XPath and CSS
The choice between XPath and CSS should not be based solely on speed. In different situations, XPath may perform better than CSS selectors, and vice versa. However, the performance difference between them is generally negligible in modern browsers and programming languages, making it an insignificant factor to consider.
When it comes to web scraping, factors like network performance have a much greater impact on overall efficiency.
There are specific scenarios where CSS selectors are not suitable, such as when you need to traverse up the tree or when using the contains function for partial matching.
If you are using a library like Beautiful Soup, you can also use its own find and find_all methods. These search the parsed tree by tag name and attributes and remove the need to choose between XPath and CSS selectors.
Conclusion
In conclusion, the choice between XPath and CSS selectors should not be based solely on performance. Instead, other factors such as the desired feature set, ease of use, and compatibility should be considered when deciding which selectors to use.