wxpath is a declarative web crawler: crawling and scraping are expressed directly in XPath.
Instead of writing imperative crawl loops, you describe what to follow and what to extract in a single expression:
import wxpath
# Crawl, extract fields, build a Wikipedia knowledge graph
path_expr = """
url('https://en.wikipedia.org/wiki/Expression_language')
///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
/map{
    'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
    'url': string(base-uri(.)),
    'short_description': //div[contains(@class, 'shortdescription')]/text() ! string(.),
    'forward_links': //div[@id="mw-content-text"]//a/@href ! string(.)
}
"""
for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
print(item)
The key addition is a `url(...)` operator that fetches and returns HTML for further XPath processing, and `///url(...)` for deep (or paginated) traversal. Everything else is standard XPath 3.1 (maps/arrays/functions).
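To see the building block on its own, here's a minimal single-hop sketch (no ///url, so no deep traversal). It reuses the same seed URL and the same blocking iterator as the example above; the extraction paths are illustrative, not anything baked into wxpath:

import wxpath

# Fetch one page and run ordinary XPath 3.1 against it; no crawling beyond the seed.
single_page = """
url('https://en.wikipedia.org/wiki/Expression_language')
/map{
    'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
    'wiki_links': //a/@href[starts-with(., '/wiki/') and not(contains(., ':'))] ! string(.)
}
"""

# Same call shape as the first example; max_depth shouldn't matter without ///url.
for item in wxpath.wxpath_async_blocking_iter(single_page, max_depth=1):
    print(item)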
Features:
- Async/concurrent crawling with streaming results
- Scrapy-inspired auto-throttle and polite crawling
- Hook system for custom processing
- CLI for quick experiments
Another example, paginating through HN comment pages (via the follow= argument) and extracting data:
url('https://news.ycombinator.com',
    follow=//a[text()='comments']/@href | //a[@class='morelink']/@href)
//tr[@class='athing']
/map {
    'text': .//div[@class='comment']//text(),
    'user': .//a[@class='hnuser']/@href,
    'parent_post': .//span[@class='onstory']/a/@href
}
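Running it looks just like the first example. A sketch assuming the same blocking-iterator entry point (the max_depth value is a guess at how you'd bound the pagination):

import wxpath

hn_expr = """
url('https://news.ycombinator.com',
    follow=//a[text()='comments']/@href | //a[@class='morelink']/@href)
//tr[@class='athing']
/map {
    'text': .//div[@class='comment']//text(),
    'user': .//a[@class='hnuser']/@href,
    'parent_post': .//span[@class='onstory']/a/@href
}
"""

# Results stream in as pages are fetched; raise max_depth to follow more pages.
for comment in wxpath.wxpath_async_blocking_iter(hn_expr, max_depth=2):
    print(comment)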
Limitations: HTTP-only (no JS rendering yet), no crawl persistence.
Both are on the roadmap if there's interest.
GitHub: https://github.com/rodricios/wxpath
PyPI: pip install wxpath
I'd love feedback on the expression syntax and any use cases this might unlock.
Thanks!
Just wanted to mention a few things.
wxpath is the result of a decade of working on and thinking about web crawling and scraping. I created two somewhat popular Python web-extraction projects a decade ago (eatiht and libextract), and even helped publish a meta-analysis on scrapers, all heavily relying on lxml/XPath.
After finding some time on my hands, and after a hiatus from actually writing web scrapers, I decided to return to this little problem domain.
Obviously, LLMs have proven to be quite formidable at web content extraction, but they encounter the now-familiar issues of token limits and cost.
Besides LLMs, there have been some great projects making real progress on web data extraction: the Scrapy and Crawlee frameworks, Ferret (https://www.montferret.dev/docs/introduction/) - another declarative web crawling framework - and others such as Xidel (https://github.com/benibela/xidel).
The common abstraction across most web-scraping frameworks and tools is the node selector: the syntax and engine for extracting nodes and their data.
XPath has proven resilient and continues to be a popular node-selection and processing language. However, what it lacks, and what those frameworks provide, is crawling.
wxpath is an attempt to fill that gap.
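To make the contrast concrete, here's roughly the imperative loop that the Wikipedia example in the post collapses into a single expression. Just a sketch with requests and lxml, not anything wxpath does internally, and it skips the concurrency, throttling, and politeness wxpath handles for you:

import requests
from lxml import html

# Hand-rolled version of the first wxpath example: fetch the seed, follow /wiki/
# links one level deep, and extract the same fields imperatively.
seed = 'https://en.wikipedia.org/wiki/Expression_language'
seen, queue = set(), [seed]

while queue:
    url = queue.pop()
    if url in seen:
        continue
    seen.add(url)
    doc = html.fromstring(requests.get(url).content)
    print({
        'title': ''.join(doc.xpath('(//span[contains(@class, "mw-page-title-main")]/text())[1]')),
        'url': url,
        'short_description': ''.join(doc.xpath('//div[contains(@class, "shortdescription")]/text()')),
        'forward_links': [str(h) for h in doc.xpath('//div[@id="mw-content-text"]//a/@href')],
    })
    if url == seed:  # crude one-level depth limit, i.e. max_depth=1
        for href in doc.xpath("//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))]"):
            queue.append('https://en.wikipedia.org' + href)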
Hope people find it useful!
eatiht: https://github.com/rodricios/eatiht
libextract: https://github.com/datalib/libextract