2024-11-17 17:11:00
github.com
Nokolexbor is a drop-in replacement for Nokogiri. It’s 5.2x faster at parsing HTML and up to 997x faster at CSS selectors.
It’s a performance-focused HTML5 parser for Ruby based on Lexbor. It supports both CSS selectors and XPath. Nokolexbor’s API is designed to be 1:1 compatible as much as possible with Nokogiri’s API.
Nokolexbor ships with pre-compiled gems for the most common platforms:
- Linux: x86_64, with glibc >= 2.17
- macOS: x86_64 and arm64
- Windows: ucrt64, mingw32 and mingw64
If you are on a supported platform, just jump to the Installation section. Otherwise, you need to install CMake to compile C extensions:
sudo apt-get install cmake
Add to your Gemfile:
gem 'nokolexbor'
Then, run bundle install.
Or, install the gem directly:
gem install nokolexbor
require 'nokolexbor'
require 'open-uri'
# Parse HTML document
doc = Nokolexbor::HTML(URI.open('https://github.com/serpapi/nokolexbor'))
# Search for nodes by css
doc.css('#readme h1', 'article h2', 'p[dir=auto]').each do |node|
  puts node.content
end
# Search for text nodes by css
doc.css('#readme p > ::text').each do |text|
  puts text.content
end
# Search for nodes by xpath
doc.xpath('//div[@id="readme"]//h1', '//article//h2').each do |node|
  puts node.content
end
- Nokogiri-compatible APIs.
- High performance HTML parsing, DOM manipulation and CSS selectors engine.
- XPath search engine (ported from libxml2).
- Text nodes CSS selector support: ::text.
- css and at_css
  - Based on Lexbor.
  - Only accepts CSS selectors, doesn’t support mixed syntax like div#abc /text().
  - To select text nodes, use the pseudo-element ::text, e.g. div#abc > ::text.
  - Performance is much higher than libxml2-based methods.
- xpath and at_xpath
  - Based on libxml2.
  - Only accepts XPath syntax.
  - Works in the same way as Nokogiri’s xpath and at_xpath.
- nokogiri_css and nokogiri_at_css (requires Nokogiri installed)
  - Based on libxml2.
  - Accepts mixed syntax like div#abc /text().
  - Works in the same way as Nokogiri’s css and at_css.
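As a rough illustration of the first two method families, the sketch below runs a CSS query with ::text and the equivalent XPath query against a small inline document. The HTML string and variable names are hypothetical, chosen only for this example, and the guard at the top simply stops the script if the gem is not installed.

```ruby
begin
  require 'nokolexbor'
rescue LoadError
  warn 'nokolexbor is not installed; run `gem install nokolexbor` first'
  exit
end

# Hypothetical snippet used only for this illustration.
html = '<div id="abc">Hello <b>bold</b> world</div>'
doc = Nokolexbor::HTML(html)

# css / at_css: Lexbor-based, accepts CSS selectors only.
bold = doc.at_css('div#abc > b').content

# ::text selects text nodes; here, the text children of div#abc.
css_texts = doc.css('div#abc > ::text').map(&:content)

# xpath / at_xpath: libxml2-based, accepts XPath syntax only.
xpath_texts = doc.xpath('//div[@id="abc"]/text()').map(&:content)

puts bold
puts css_texts.inspect
puts xpath_texts.inspect
```

Both text queries target the same nodes, so either style works; per the notes above, the CSS form is the faster path.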
- For the selector :nth-of-type(n), n is not affected by prior filters. For example, take five sibling div elements, two of which carry class a and class b, and suppose we want to select the 3rd div excluding those two, which is the last div.
  In Nokogiri, the selector should be div:not(.a):not(.b):nth-of-type(3).
  In Nokolexbor, :not does not change the position of the last div (same as in browsers), so the selector should be div:not(.a):not(.b):nth-of-type(5), although this defeats the purpose of filtering.
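The two counting rules can be modelled in plain Ruby without either parser. This is only a sketch: the layout is an assumption (five sibling div elements, the 2nd with class a and the 3rd with class b), chosen to fit the selectors above, and Div, standard_nth_of_type and filtered_nth are hypothetical names, not part of either library.

```ruby
# Hypothetical sibling list: five <div> elements, the 2nd with class "a"
# and the 3rd with class "b". `position` is the index among sibling divs.
Div = Struct.new(:position, :classes)

DIVS = [
  Div.new(1, []),
  Div.new(2, ['a']),
  Div.new(3, ['b']),
  Div.new(4, []),
  Div.new(5, [])
].freeze

def excluded?(div)
  (div.classes & %w[a b]).any?
end

# Browser / Nokolexbor rule: n counts among ALL sibling divs, and
# :not(...) is an independent condition on that same element.
def standard_nth_of_type(divs, n)
  divs.select { |d| d.position == n && !excluded?(d) }
end

# Nokogiri rule as described above: filter first, then take the nth match.
def filtered_nth(divs, n)
  match = divs.reject { |d| excluded?(d) }[n - 1]
  match ? [match] : []
end

puts filtered_nth(DIVS, 3).map(&:position).inspect          # Nokogiri-style, n = 3
puts standard_nth_of_type(DIVS, 3).map(&:position).inspect  # standard rule, n = 3
puts standard_nth_of_type(DIVS, 5).map(&:position).inspect  # standard rule, n = 5
```

Under the standard rule, n = 3 matches nothing in this layout because the 3rd div carries class b, which is why the Nokolexbor selector has to use n = 5.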
Benchmark of parsing a Google result page (368 KB) and selecting nodes using CSS and XPath, run on a MacBook Pro (2019) with a 2.3 GHz 8-core Intel Core i9.
Run with: ruby bench/bench.rb
| | Nokolexbor (iters/s) | Nokogiri (iters/s) | Diff |
|---|---|---|---|
| parsing | 487.6 | 93.5 | 5.22x faster |
| at_css | 50798.8 | 50.9 | 997.87x faster |
| css | 7437.6 | 52.3 | 142.11x faster |
| at_xpath | 57.077 | 53.176 | same-ish |
| xpath | 51.523 | 58.438 | same-ish |
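The Diff column appears to be the ratio of the two iterations-per-second figures. A minimal sketch that recomputes it for the three rows with a clear winner, using the values from the raw data; RESULTS and speedup are hypothetical names used only here:

```ruby
# Iterations per second taken from the raw benchmark output.
RESULTS = {
  'parsing' => { nokolexbor: 487.564, nokogiri: 93.470 },
  'at_css'  => { nokolexbor: 50_798.8, nokogiri: 50.907 },
  'css'     => { nokolexbor: 7_437.6, nokogiri: 52.338 }
}.freeze

# Speedup = Nokolexbor throughput divided by Nokogiri throughput.
def speedup(row)
  (row[:nokolexbor] / row[:nokogiri]).round(2)
end

RESULTS.each do |name, row|
  puts format('%-8s %.2fx faster', name, speedup(row))
end
```

The xpath rows are left out because the raw data reports their difference as falling within the measurement error.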
Raw data
Warming up --------------------------------------
Nokolexbor parse 56.000 i/100ms
Nokogiri parse 8.000 i/100ms
Calculating -------------------------------------
Nokolexbor parse 487.564 (±10.9%) i/s - 9.688k in 20.117173s
Nokogiri parse 93.470 (±21.4%) i/s - 1.736k in 20.024163s
Comparison:
Nokolexbor parse: 487.6 i/s
Nokogiri parse: 93.5 i/s - 5.22x (± 0.00) slower
Warming up --------------------------------------
Nokolexbor at_css 5.548k i/100ms
Nokogiri at_css 6.000 i/100ms
Calculating -------------------------------------
Nokolexbor at_css 50.799k (±13.8%) i/s - 987.544k in 20.018481s
Nokogiri at_css 50.907 (±35.4%) i/s - 828.000 in 20.666258s
Comparison:
Nokolexbor at_css: 50798.8 i/s
Nokogiri at_css: 50.9 i/s - 997.87x (± 0.00) slower
Warming up --------------------------------------
Nokolexbor css 709.000 i/100ms
Nokogiri css 4.000 i/100ms
Calculating -------------------------------------
Nokolexbor css 7.438k (±14.7%) i/s - 145.345k in 20.083833s
Nokogiri css 52.338 (±36.3%) i/s - 816.000 in 20.042053s
Comparison:
Nokolexbor css: 7437.6 i/s
Nokogiri css: 52.3 i/s - 142.11x (± 0.00) slower
Warming up --------------------------------------
Nokolexbor at_xpath 2.000 i/100ms
Nokogiri at_xpath 4.000 i/100ms
Calculating -------------------------------------
Nokolexbor at_xpath 57.077 (±31.5%) i/s - 920.000 in 20.156393s
Nokogiri at_xpath 53.176 (±35.7%) i/s - 876.000 in 20.036717s
Comparison:
Nokolexbor at_xpath: 57.1 i/s
Nokogiri at_xpath: 53.2 i/s - same-ish: difference falls within error
Warming up --------------------------------------
Nokolexbor xpath 3.000 i/100ms
Nokogiri xpath 3.000 i/100ms
Calculating -------------------------------------
Nokolexbor xpath 51.523 (±31.1%) i/s - 903.000 in 20.102568s
Nokogiri xpath 58.438 (±35.9%) i/s - 852.000 in 20.001408s
Comparison:
Nokogiri xpath: 58.4 i/s
Nokolexbor xpath: 51.5 i/s - same-ish: difference falls within error