Select HTML Components in Declarative Rust
With scraper_query, you don't need to learn another language (CSS) to just clean up your HTML
For the Chinese version, see the link.
As an LLM engineer and a sloppy frontend newbie, I don’t really know much about CSS, but I do need to crawl some web pages and turn them into a knowledge base for my LLM agents. So I thought, I’m going to do this in my favorite language, Rust :)
After some research, I found scraper, which parses HTML files. It also lets us select HTML components with CSS selectors. Simple and easy, but wait, what’s a CSS selector? I just want to remove some components, like <meta>, <script> and any <div> with class promotion.
Here come some fun facts:
- A component, say a <div> that has two classes a and b, is written in HTML as <div class="a b"></div>. But how do you select it with CSS? It’s written as div.a.b! That seems a bit weird to a guy like me who got used to object-oriented dot notation. But OK, these are just conventions, I can work with them.
- How about selecting a <div> that has either class a or class b in CSS? It’s written as div.a, div.b. My GitHub Copilot even gave me multiple other options when completing the last sentence!
Things get a bit complicated when I actually take a closer look at my HTML files. I need to select and remove:
- <div> that has either class promotion or class ads
- <text> that has class search
- <a> that does not have class link-to-documentation
- <p> that has class misc, unless it also has class documentation
- all the components that are one of [li, ol, ul ...]
- …….. this list went a bit long, so I don’t want to list it all.
How can I write CSS selectors to describe what I want? Also, I’d like to update these criteria programmatically, which means I have to do string manipulation that is essentially scrappy metaprogramming. I actually did it! But it’s just painful when you embed a DSL inside a language in the form of strings.
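To make the pain concrete, here is a minimal sketch of what generating selectors from Rust tends to look like. The helper names (any_class_selector, all_classes_selector, class_unless_selector) are made up for illustration and are not part of scraper; they only build selector strings, which a CSS engine would still have to parse.

```rust
/// Builds a selector matching `tag` elements that carry ANY of `classes`,
/// e.g. ("div", ["a", "b"]) -> "div.a, div.b".
fn any_class_selector(tag: &str, classes: &[&str]) -> String {
    classes
        .iter()
        .map(|class| format!("{tag}.{class}"))
        .collect::<Vec<_>>()
        .join(", ")
}

/// Builds a selector matching `tag` elements that carry ALL of `classes`,
/// e.g. ("div", ["a", "b"]) -> "div.a.b".
fn all_classes_selector(tag: &str, classes: &[&str]) -> String {
    let mut selector = String::from(tag);
    for class in classes {
        selector.push('.');
        selector.push_str(class);
    }
    selector
}

/// Builds a selector for `tag` elements that have class `with`
/// but not class `without`, using the CSS `:not()` pseudo-class.
fn class_unless_selector(tag: &str, with: &str, without: &str) -> String {
    format!("{tag}.{with}:not(.{without})")
}
```

For example, any_class_selector("div", &["promotion", "ads"]) builds div.promotion, div.ads, and class_unless_selector("p", "misc", "documentation") builds p.misc:not(.documentation). Nothing checks these strings until a selector parser sees them at runtime, which is exactly the part I find uncomfortable.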
Yes, I know using CSS to select HTML components is kind of the norm, and I believe there are good reasons for it. I know crates like dom_query are super useful. But I really want to focus on ergonomics and simplicity, not performance or curiosity about CSS.
So, let’s forget CSS for now. We have some HTML components, each with some properties, like its name, class and id. We want to select some of them based on some logic. It sounds like we just need a query engine. Or more precisely, a SAT solver (no, that’s a joke :)
Let’s think declaratively
I think we are pretty used to booleans. If naming is done right, logic expressions can be pretty declarative, literate and readable like sentences. For instance, component.is_div() && component.is_class("a") seems quite readable. However, this object-oriented style of API doesn’t fit the design of scraper. Working with scraper, we manipulate the HTML file and its components through Node and NodeId, which help us index into the tree structure.
Despite that, we can make a query more declarative. Instead of div.a.b, we can write Tag::Div & class("a") & class("b"). We can almost read the logic as a sentence.
This is what scraper_query allows. The complete sample is:
use scraper::Html;
use scraper_query::*; // brings in `HTMLIndex`, `Tag`, `class`, `id`
use markup5ever::interface::tree_builder::TreeSink; // provides `remove_from_parent`
// `HTML` is a `&str` holding the page source
let mut document = Html::parse_document(HTML);
let index = HTMLIndex::new(&document);
let query = Tag::Div & class("a") & class("b"); // find all divs with both class "a" and class "b"
let node_ids = index.query(query);
// simple manipulation: detach every matched node from the tree
for id in node_ids {
    document.remove_from_parent(&id);
}
Now we can use &, |, ! and of course () to compose queries. If you prefer to be even more explicit, you can write something like Tag::Div.and(class("a").and(class("b"))).
With &, | and !, this tiny DSL disguised as Rust logic expressions is functionally complete: any Boolean combination of tag, class and id conditions can be expressed!
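For readers curious how operators like &, | and ! can build a query at all, here is a minimal std-only sketch of the idea. Everything in it (Query, Element, tag, class, matches) is hypothetical and simplified: it evaluates the query directly on one element, whereas scraper_query turns the operators into Polars expressions.

```rust
use std::ops::{BitAnd, BitOr, Not};

/// A tiny query AST (hypothetical, not scraper_query's actual types).
#[derive(Debug, Clone, PartialEq)]
enum Query {
    Tag(String),
    Class(String),
    And(Box<Query>, Box<Query>),
    Or(Box<Query>, Box<Query>),
    Not(Box<Query>),
}

fn tag(name: &str) -> Query {
    Query::Tag(name.to_string())
}

fn class(name: &str) -> Query {
    Query::Class(name.to_string())
}

// `a & b` builds an And node instead of computing anything.
impl BitAnd for Query {
    type Output = Query;
    fn bitand(self, rhs: Query) -> Query {
        Query::And(Box::new(self), Box::new(rhs))
    }
}

// `a | b` builds an Or node.
impl BitOr for Query {
    type Output = Query;
    fn bitor(self, rhs: Query) -> Query {
        Query::Or(Box::new(self), Box::new(rhs))
    }
}

// `!a` builds a Not node.
impl Not for Query {
    type Output = Query;
    fn not(self) -> Query {
        Query::Not(Box::new(self))
    }
}

/// A stand-in for an HTML element: just a tag name and its classes.
struct Element {
    tag: &'static str,
    classes: Vec<&'static str>,
}

impl Query {
    /// Walks the AST and evaluates it against a single element.
    fn matches(&self, el: &Element) -> bool {
        match self {
            Query::Tag(t) => el.tag == t.as_str(),
            Query::Class(c) => el.classes.iter().any(|k| *k == c.as_str()),
            Query::And(l, r) => l.matches(el) && r.matches(el),
            Query::Or(l, r) => l.matches(el) || r.matches(el),
            Query::Not(q) => !q.matches(el),
        }
    }
}
```

With this sketch, the "<p> with class misc unless it also has class documentation" rule from my list becomes tag("p") & class("misc") & !class("documentation"), and the promotion-or-ads rule becomes tag("div") & (class("promotion") | class("ads")). Rust's own operator precedence (! before &, & before |) does the parsing for free.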
Under the hood
scraper_query is pretty simple. It has only a few hundred lines of code. The heavy lifting of the query engine is done by Polars. With Polars, all I need to do is:
- Iterate through all HTML nodes and build a Polars DataFrame, which is conceptually a table
  - The columns are NODE_ID, CLASS, ID (i.e., id in HTML components) and TAG (i.e., the name of a component)
  - Each component has one row in the table
  - The table is encapsulated in HTMLIndex
- Query the data frame with expressions that look like logic expressions, such as Tag::Div & class("a") & class("b")
  - These are transformed into Polars expressions in the std::ops::{BitAnd, BitOr, Not} implementations.
  - For details of Polars expressions, see polars_plan::dsl.
For example:
<div class="a b">
  <div class="c d">
    <p> Hello </p>
  </div>
  <div class="e" id="f">
    <p> World </p>
  </div>
</div>
This HTML file, after parsing and building the HTMLIndex, will give us a Polars dataframe like this:
NODE_ID | TAG | CLASS | ID |
---|---|---|---|
0 | div | ["a", "b"] | |
1 | div | ["c", "d"] | |
2 | p | [] | |
3 | div | ["e"] | "f" |
4 | p | [] | |
If we query components with class("a") | class("c"), the resulting Polars expression is essentially generated like this:
use polars::prelude::{col, lit, Expr};
let class_a_expr: Expr = col("CLASS").list().contains(lit("a"));
let class_c_expr: Expr = col("CLASS").list().contains(lit("c"));
let query_expr: Expr = class_a_expr.or(class_c_expr); // this will be used to filter the dataframe
Simple and clean, it serves its purpose.
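The same idea can be sketched without Polars, assuming a hand-built tree in place of the parsed document. Row, Element, flatten, example_tree and ids_with_class_a_or_c are illustrative names, not scraper_query's; the pre-order numbering simply mirrors the simplified table above.

```rust
/// One row per element, mirroring the NODE_ID / TAG / CLASS / ID columns.
#[derive(Debug, PartialEq)]
struct Row {
    node_id: usize,
    tag: &'static str,
    classes: Vec<&'static str>,
    id: Option<&'static str>,
}

/// A hand-built element tree standing in for the parsed HTML (an assumption
/// of this sketch; the real index is built from scraper's document).
struct Element {
    tag: &'static str,
    classes: Vec<&'static str>,
    id: Option<&'static str>,
    children: Vec<Element>,
}

/// Pre-order walk: give each element the next free NODE_ID, then recurse.
fn flatten(el: &Element, rows: &mut Vec<Row>) {
    let node_id = rows.len();
    rows.push(Row {
        node_id,
        tag: el.tag,
        classes: el.classes.clone(),
        id: el.id,
    });
    for child in &el.children {
        flatten(child, rows);
    }
}

/// The example HTML from above, written out as a tree.
fn example_tree() -> Element {
    let p = || Element { tag: "p", classes: vec![], id: None, children: vec![] };
    Element {
        tag: "div",
        classes: vec!["a", "b"],
        id: None,
        children: vec![
            Element { tag: "div", classes: vec!["c", "d"], id: None, children: vec![p()] },
            Element { tag: "div", classes: vec!["e"], id: Some("f"), children: vec![p()] },
        ],
    }
}

/// Plain-Rust equivalent of filtering the table on `class("a") | class("c")`.
fn ids_with_class_a_or_c(rows: &[Row]) -> Vec<usize> {
    rows.iter()
        .filter(|row| row.classes.contains(&"a") || row.classes.contains(&"c"))
        .map(|row| row.node_id)
        .collect()
}
```

Flattening example_tree() yields five rows matching the table, and the filter returns node IDs 0 and 1: the outer <div class="a b"> and the <div class="c d">.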
Future Work
For one, the Polars dependency might seem too heavy. It’s a full suite of tools built for dataframes, after all. Perhaps we can strip out just its query engine. Or build a little SAT solver just for this use case.
For another, I find the process of examining and cleaning HTML files neither straightforward nor interactive. I will probably implement a web app that interactively and responsively shows the rendering of a processed HTML file based on selection rules like Tag::Div & class("a") & class("b").
Metadata
Version: 0.0.1
Date: 2024.11.04
License: CC BY-SA 4.0