Tuesday, June 10, 2025
Techcratic

I made a worse search engine than Elasticsearch

Hacker News by Hacker News
June 5, 2025

2025-06-05 14:37:00
softwaredoug.com

I want you to share in my shame at daring to make a search library. And in this shame, you too, can experience the humility and understanding of what a real, honest-to-goodness, not side-project, search engine does to make lexical search fast.

BEIR is a set of Information Retrieval benchmarks, oriented around question-answer use cases.

My side project, SearchArray, adds full-text search to Pandas. So naturally, to stand in awe at my amazing developer skills, I wanted to use BEIR to compare SearchArray to Elasticsearch (w/ same query + tokenization). So I spent a Saturday integrating SearchArray into BEIR and measuring its relevance and performance on the MSMarco Passage Retrieval corpus (8M docs).

… and 🥁

Library             | Elasticsearch    | SearchArray
NDCG@10             | 0.2275           | 0.225
Search Throughput   | 90 QPS           | ~18 QPS
Indexing Throughput | 10K Docs Per Sec | ~3.5K Docs Per Sec

… Sad trombone 🎺

It’s worse in every dimension.

At least NDCG@10 is nearly identical, so our BM25 calculation is probably correct (the small gap is likely due to negligible differences in tokenization).

Imposter Syndrome anyone?

Instead of wallowing in my shame, I DO know exactly what’s going on… And it’s fairly educational. Let’s chat about why a real, non side-project, search engine is fast.

A Magic WAND

(Or how SearchArray is top 8M retrieval while Elasticsearch == top K retrieval)

In lexical search systems, you search for multiple terms. You take the BM25 score of each term, and then finally, combine those into a final score for the document. IE, a search for luke skywalker really means: BM25(luke) ??? BM25(skywalker) where ??? is some mathematical operator.

In a simple “OR” query, you just take the SUM of each term for each doc, IE, a search for luke skywalker is BM25(luke) + BM25(skywalker) like so:

Term                     | Doc A (BM25) | Doc B (BM25)
luke                     | 1.90         | 1.45
skywalker                | 11.51        | 4.3
Combined doc score (SUM) | 13.41        | 5.75
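
In numpy terms, the table above is just an element-wise sum of one score array per term. Here is a minimal sketch, assuming made-up arrays that hold the table's values. The names `term_scores` and `top_n` are invented for this illustration and are not part of SearchArray.

```python
import numpy as np

# Hypothetical per-term BM25 scores, one entry per document (Doc A, Doc B).
term_scores = {
    "luke": np.array([1.90, 1.45]),
    "skywalker": np.array([11.51, 4.3]),
}

# An "OR" query: sum the per-term score arrays element-wise.
combined = np.sum(list(term_scores.values()), axis=0)


def top_n(scores, n):
    """Return the indices of the n highest scores, best first."""
    n = min(n, len(scores))
    # argpartition finds the n best without fully sorting every score
    idx = np.argpartition(-scores, n - 1)[:n]
    return idx[np.argsort(-scores[idx])]


print(combined)            # roughly [13.41, 5.75]
print(top_n(combined, 1))  # index of the best document (Doc A)
```

Note that even with `argpartition`, every document still gets scored first. The top-N step only trims the output; it does not avoid the scoring work.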

SearchArray just does BM25 scoring. You get back big numpy arrays of every document’s BM25 score. Then you combine the scores – literally using np.sum. Of course, that’s not what a search engine like Elasticsearch would do. Instead it has a different guarantee, it gets the highest scoring top N of your specified OR query.

This little bit of seemingly minute wiggle room gives search engines a great deal of latitude. Search engines can use an algorithm called Weak-AND or WAND to avoid work when combining multiple term scores into the final top N results.

I won’t get into the full nitty gritty of the algorithm, but here’s a key intuition to noodle over:

A scoring system like BM25 depends heavily on document frequency of a term. So rare terms – a high (1 / document frequency) – have a higher likelihood of impacting the final score, and ending up in the top K. Luckily these terms (like skywalker) occur on fewer documents. So we can fetch these select, elite few docs quickly in the data structure that maps skywalker -> [... handful of matching doc ids...] (aka postings). We can reach deeply into this list.

On the other hand, we can be much more circumspect about the boring, common term, luke. And that’s useful because luke has a very extensive postings list luke -> [... a giant honking list of documents...]. We’d prefer to avoid scanning all of these.

We might imagine that each document id in these lists is also paired with its term frequency (how often that term occurs in that document), the other major input of BM25. And if the list is SORTED from highest -> lowest term frequency, we can go down it until it's impossible for the BM25 score of a term to have any chance of making the top K results. Then exit early.
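
To make the early-exit intuition concrete, here is a toy sketch for a two-term query. It is NOT the WAND algorithm itself, just a simplified illustration under loud assumptions: each term's score contribution is precomputed and depends only on term frequency, the common term's list is already sorted from highest to lowest contribution, and all names (`top_k_two_terms`, `rare`, `common_sorted`) are invented here.

```python
import heapq


def top_k_two_terms(rare, common_sorted, k):
    """Toy early-exit top-k for a two-term OR query (assumes k >= 1).

    rare:          dict doc_id -> score contribution of the rare term
    common_sorted: list of (doc_id, contribution) for the common term,
                   sorted from highest to lowest contribution
    Returns (top results as (doc_id, score), best first; entries scanned).
    """
    common = dict(common_sorted)  # random access into the common term
    heap = []                     # min-heap of (score, doc_id): best k so far

    def offer(score, doc_id):
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))

    # 1. The rare term has a short list: fully score every doc in it.
    for doc_id, score in rare.items():
        offer(score + common.get(doc_id, 0.0), doc_id)

    # 2. Walk the long list from the top. Any doc left to discover here
    #    matches only the common term, so once its contribution cannot
    #    beat the current k-th best score, nothing further down can either.
    scanned = 0
    for doc_id, score in common_sorted:
        if len(heap) == k and score <= heap[0][0]:
            break  # early exit
        scanned += 1
        if doc_id not in rare:
            offer(score, doc_id)

    best_first = sorted(heap, reverse=True)
    return [(doc_id, score) for score, doc_id in best_first], scanned


rare = {7: 11.51, 3: 4.3}  # "skywalker"
common_sorted = [(1, 2.0), (7, 1.9), (3, 1.45),
                 (2, 1.0), (5, 0.5), (9, 0.2)]  # "luke"
top, scanned = top_k_two_terms(rare, common_sorted, k=2)
print(top, scanned)
```

With these made-up numbers, the two documents matching the rare term already outscore anything the common term could contribute alone, so the long list is never walked at all.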

While WAND – and similar optimizations – help Elasticsearch avoid work, SearchArray gleefully does this work like an ignoramus, happily giving you a giant idiotic array of BM25 scores.

When you look at this icicle graph of SearchArray’s performance doing an “OR” search, you can see all the time spent summing a giant array and also needlessly BM25 scoring many many documents.

[Figure: icicle graph of SearchArray doing an “OR” search]

SearchArray doesn’t directly store postings

Unlike most search engines, SearchArray doesn’t have postings lists of terms -> documents.

Instead, under the hood, SearchArray stores a positional index, built first-and-foremost for phrase matching. You give SearchArray a list of terms ['mary', 'had', 'a', 'little', 'lamb']. It then finds every place mary occurs one position before had, etc. It does this by storing, for each term, the positions as a roaring bitmap.

In our roaring bitmap, each 64 bit word has a header indicating where the positions occur (document and region in the doc). Each bit position corresponds to a position in the document. A 1 indicates this term is present, a 0 missing.

So to collect phrase matches, for mary had we can simply find places where one term’s bits occur adjacent to another. This can be done very fast with simple bit arithmetic.

mary
   00000010000000    | 00000000000010
had
   00000001000000    | 00000000000001      #
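
The adjacency check in the bit patterns above can be sketched with plain Python integers. This is only an illustration: it assumes the leftmost bit is the earliest position, ignores the header bits, and ignores phrases that straddle two words. `phrase_positions` is a name invented here.

```python
def phrase_positions(first_bits: int, second_bits: int) -> int:
    """Bits set wherever `first` is immediately followed by `second`.

    Assumes the leftmost bit is the earliest position, so "one position
    later" means one bit to the right.
    """
    return (first_bits >> 1) & second_bits


# The two payload words from the illustration above
mary = [0b00000010000000, 0b00000000000010]
had = [0b00000001000000, 0b00000000000001]

matches = [phrase_positions(m, h) for m, h in zip(mary, had)]
# Each set bit marks one place where "mary had" occurs
phrase_count = sum(bin(word).count("1") for word in matches)
print(phrase_count)
```

A shift plus an AND handles a whole word of positions at once, which is why this is cheap compared to walking lists of integer positions.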

But a nice property of this, one that eases maintenance for this one-person project, is that we can also use it to compute term frequencies. Simply by performing a popcount (counting the number of set bits) and then collecting the counts per document for a term, we get a mapping of doc ids -> term frequencies.

So we spend a fair amount of time doing that, as you can see here:

[Figure: profile showing the time spent computing term frequencies]

Now I lied, actually. While this is the core mechanism for storing term frequencies, we do cache: the cache remembers the doc id -> term frequencies when the roaring bit array is > N 64-bit words. This lets users tune N to their memory / speed tradeoff, and get closer to a postings list.
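
The popcount-based term frequency described above might look roughly like this. It is a sketch under assumptions: the real index packs the document id into a header inside each 64-bit word, while this example keeps it as a separate tuple element for readability, and `term_frequencies` is a name invented here.

```python
from collections import Counter


def term_frequencies(words):
    """Map doc id -> term frequency by counting set bits.

    words: iterable of (doc_id, position_bits) pairs for ONE term.
    This is a hypothetical layout, not SearchArray's actual encoding.
    """
    tf = Counter()
    for doc_id, bits in words:
        tf[doc_id] += bin(bits).count("1")  # popcount of the position bits
    return dict(tf)


# One term occurring in doc 0 (spread over two words) and in doc 1
words = [
    (0, 0b00000010000000),
    (0, 0b00000000000010),
    (1, 0b00000000000101),
]
print(term_frequencies(words))
```

Caching the resulting dict for terms that span many words is the memory-for-speed trade that the tunable N above controls.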

Caching Non term-dependent BM25 components

Take a look at this snippet for computing BM25:

        bm25 = (
            term_freq / (term_freq + (k1 * ((1 - b) + (b * (doc_len / avg_doc_lens)))))
        ) * idf

Notice there is a BIG PART of this calculation that has nothing to do with the terms being searched:

 (k1 * ((1 - b) + (b * (doc_len / avg_doc_lens))))

In my testing, a bit of latency (1ms on 1 million docs) can be shaved by caching everything in the k1 * ... avg_doc_lens term somewhere. If doc_len corresponds to an array with a doc length value for every document, you can create an array with this formula cached. But it’s a bit of a maintenance burden to have one additional, globally shared cache. So I have avoided this so far.
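
A sketch of what that cache could look like with numpy, assuming `doc_lens` is an array with one length per document. The function names and the `k1=1.2`, `b=0.75` defaults (common BM25 defaults) are choices made for this illustration, not SearchArray's API.

```python
import numpy as np


def doc_len_norm(doc_lens, k1=1.2, b=0.75):
    """The query-independent part of BM25: compute once, reuse per term."""
    avg_doc_lens = doc_lens.mean()
    return k1 * ((1 - b) + (b * (doc_lens / avg_doc_lens)))


def bm25(term_freq, idf, norm):
    """BM25 for one term across all docs, using the cached norm array."""
    return (term_freq / (term_freq + norm)) * idf


doc_lens = np.array([10.0, 20.0, 30.0])
norm = doc_len_norm(doc_lens)          # cache this array

term_freq = np.array([1.0, 0.0, 3.0])  # made-up term frequencies
scores = bm25(term_freq, idf=2.0, norm=norm)
print(scores)
```

The catch, as noted above, is invalidation: the cached array has to be rebuilt whenever documents are added or removed, since both `doc_lens` and the average change.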

Caching the FULL query, not just individual BM25 term scoring

SearchArray is just a system for computing BM25 scores (or whatever similarity). You USE it to build up an “or query” or whatever using numpy… it doesn’t do it for you. IE the code implemented in BEIR is simply:

import numpy as np

def bm25_search(corpus, query, column):
    tokenizer = corpus[column].array.tokenizer
    # De-duplicate so a repeated query term is only scored once
    query_terms = set(tokenizer(query))
    scores = np.zeros(len(corpus))
    for term in query_terms:
        # One BM25 score per document for this term
        scores += corpus[column].array.score(term)
    return scores

But in a regular search engine like Solr, Elasticsearch, OpenSearch, or Vespa, this logic is expressed in the search engine’s Query DSL. So the search engine can plan+cache the complete calculation, whereas SearchArray gives you all the tools to shoot yourself in the foot, performance wise (not to mention the earlier point about WAND, etc).

That’s why you should hug a search engineer

There you have it!

SearchArray is a tool for prototyping, using normal Pydata tooling, not for building giant retrieval systems like Elasticsearch. It’s good to know the tradeoffs behind your lexical system, as each one optimizes for different things. You might find it useful for dorking around on

What would be great would be if we COULD express our queries in such a dataframe-oriented DSL. IE a Polars-esque lazy top-N retrieval system that pulled from different retrieval sources, scored them, summed them, and did whatever arbitrary math to the underlying scores. I can cross my fingers such a thing might exist. So far people build these DAGs in less expressive ways: as part of their Torch model DAG, or some homegrown query-time DAG system.

In any case, I’m absolutely humbled by folks that work on big, large scale, distributed lexical search engines (like Vespa, Lucene, OpenSearch, Elasticsearch, Solr). These folks ought to be your heroes too; they do this grunt work for us, and we should NOT take it for granted.

Below are some notes and appendices for BEIR and the different benchmarking scripts, in case you’re curious.


Appendix links to scripts

Appendix – How to integrate with BEIR…

BEIR has a set of built-in datasets and metrics tools that you can use if you implement a BaseSearch class with the following signature:

    class SearchArraySearch(BaseSearch):

        def search(self,
                   corpus: Dict[str, Dict[str, str]],
                   queries: Dict[str, str],
                   top_k: int,
                   *args,
                   **kwargs) -> Dict[str, Dict[str, float]]:

The inputs:

  • Corpus: A dict pointing a document id to a set of fields to index, ie
{'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.',
 'title': ''}
...
  • Queries: A dict pointing a query id -> query:
{"1": "Who was the original governor of the plymouth colony"}
...

Finally the output is a dictionary of query ids -> {doc ids -> scores} – each query w/ top_k scored.

So when search is called you need to

  1. Index the corpus
  2. Issue all queries and gather scores

Essentially this looks something like:

def search(self,
           corpus: Dict[str, Dict[str, str]],
           queries: Dict[str, str],
           top_k: int,
           *args,
           **kwargs) -> Dict[str, Dict[str, float]]:
    corpus = self.index_corpus(corpus)     # 
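
The listing above is cut off after the indexing step. As a rough, standalone sketch of step 2 (issue all queries and gather the top_k scores), something like the following would produce the output shape BEIR expects. `run_queries` and its parameters are invented for this illustration and are not taken from the original script.

```python
from typing import Callable, Dict, List

import numpy as np


def run_queries(doc_ids: List[str],
                queries: Dict[str, str],
                top_k: int,
                some_search_function: Callable[[str], np.ndarray]
                ) -> Dict[str, Dict[str, float]]:
    """Return query id -> {doc id -> score}, keeping top_k docs per query."""
    results: Dict[str, Dict[str, float]] = {}
    for query_id, query in queries.items():
        scores = some_search_function(query)  # one score per document
        k = min(top_k, len(scores))
        if k == 0:
            results[query_id] = {}
            continue
        # Indices of the k highest scores (in no particular order)
        top = np.argpartition(-scores, k - 1)[:k]
        results[query_id] = {doc_ids[i]: float(scores[i]) for i in top}
    return results


# Usage with a stand-in search function that returns fixed scores
doc_ids = ["a", "b", "c"]
fake_search = lambda query: np.array([0.1, 3.0, 2.0])
results = run_queries(doc_ids, {"1": "plymouth colony"}, 2, fake_search)
print(results)
```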

How does this look for SearchArray?

To index, we loop over each str column, and add a SearchArray column to the DF. Below, tokenized with a snowball tokenizer:

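# Index every string (object-dtype) column as a SearchArray column
for column in corpus.columns:
    if corpus[column].dtype == 'object':
        # Assign back rather than calling fillna(inplace=True) on a column selection
        corpus[column] = corpus[column].fillna("")
        corpus[f'{column}_snowball'] = SearchArray.index(corpus[column],
                                                         data_dir=DATA_DIR,
                                                         tokenizer=snowball_tokenizer)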

Then replace some_search_function above w/ something that searches the SearchArray columns. Maybe this simple bm25_search:

def bm25_search(corpus, query):
    query = snowball_tokenizer(query)
    scores = np.zeros(len(corpus))
    for q in query:
        scores += corpus['text_snowball'].array.score(q)
    return scores

(Leaving out some annoying threading, but you can look at all the code here)

