Wednesday, June 11, 2025
Techcratic

I made a worse search engine than Elasticsearch

Hacker News by Hacker News
June 5, 2025

2025-06-05 14:37:00
softwaredoug.com

I want you to share in my shame at daring to make a search library. And in this shame, you too, can experience the humility and understanding of what a real, honest-to-goodness, not side-project, search engine does to make lexical search fast.

BEIR is a set of Information Retrieval benchmarks, oriented around question-answer use cases.

My side project, SearchArray, adds full text search to Pandas. So naturally, to stand in awe at my amazing developer skills, I wanted to use BEIR to compare SearchArray to Elasticsearch (w/ same query + tokenization). So I spent a Saturday integrating SearchArray into BEIR, and measuring its relevance and performance on the MSMarco Passage Retrieval corpus (8M docs).

… and 🥁

Library               Elasticsearch        SearchArray
NDCG@10               0.2275               0.225
Search Throughput     90 QPS               ~18 QPS
Indexing Throughput   10K Docs Per Sec     ~3.5K Docs Per Sec

… Sad trombone 🎺

It’s worse in every dimension

At least NDCG@10 is nearly the same, so our BM25 calculation is correct (the small gap is probably due to negligible differences in tokenization).

Imposter Syndrome anyone?

Instead of wallowing in my shame, I DO know exactly what’s going on… And it’s fairly educational. Let’s chat about why a real, non side-project, search engine is fast.

A Magic WAND

(Or how SearchArray is top 8m retrieval while Elasticsearch == top K retrieval)

In lexical search systems, you search for multiple terms. You take the BM25 score of each term, and then finally, combine those into a final score for the document. IE, a search for luke skywalker really means: BM25(luke) ??? BM25(skywalker) where ??? is some mathematical operator.

In a simple “OR” query, you just take the SUM of each term for each doc, IE, a search for luke skywalker is BM25(luke) + BM25(skywalker) like so:

Term                       Doc A (BM25)   Doc B (BM25)
luke                       1.90           1.45
skywalker                  11.51          4.3
Combined doc score (SUM)   13.41          5.75

SearchArray just does BM25 scoring. You get back big numpy arrays of every document's BM25 score. Then you combine the scores, literally using np.sum. Of course, that's not what a search engine like Elasticsearch would do. Instead it makes a different guarantee: it returns only the highest-scoring top N results of your specified OR query.
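To make the difference concrete, here is a minimal sketch using the toy scores from the table above. The array names and the `top_n` helper are illustrative assumptions, not part of SearchArray or Elasticsearch. The first half is the score-everything approach; the helper only shows what a top N result looks like, not how an engine gets there cheaply.

```python
import numpy as np

# Per-term BM25 scores, one entry per document (Doc A, Doc B),
# copied from the table above.
luke = np.array([1.90, 1.45])
skywalker = np.array([11.51, 4.3])

# The "OR" query as a plain SUM: every document gets a combined score.
combined = np.sum([luke, skywalker], axis=0)


def top_n(scores, n):
    """Return (doc index, score) pairs for the n highest scores, best first."""
    n = min(n, len(scores))
    if n == 0:
        return []
    # argpartition finds the n largest without fully sorting the array
    candidates = np.argpartition(scores, -n)[-n:]
    ranked = candidates[np.argsort(-scores[candidates])]
    return [(int(i), float(scores[i])) for i in ranked]


print(top_n(combined, 1))
```

Note that this still scores every document before picking the best ones; the only saving is in the sort. The optimizations discussed next avoid computing most of those scores in the first place.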

This little bit of seemingly minute wiggle room gives search engines a great deal of latitude. Search engines can use an algorithm called Weak-AND or WAND to avoid work when combining multiple term scores into the final top N results.

I won’t get into the full nitty gritty of the algorithm, but here’s a key intuition to noodle over:

A scoring system like BM25 depends heavily on the document frequency of a term. So rare terms, those with a high (1 / document frequency), have a higher likelihood of impacting the final score, and ending up in the top K. Luckily these terms (like skywalker) occur in fewer documents. So we can fetch these select, elite few docs quickly in the data structure that maps skywalker -> [... handful of matching doc ids...] (aka postings). We can reach deeply into this list.

On the other hand, we can be much more circumspect about the boring, common term, luke. And that’s useful because luke has a very extensive postings list luke -> [... a giant honking list of documents...]. We’d prefer to avoid scanning all of these.

We might imagine that each document id in these lists is also paired with its term frequency (how often that term occurs in that document), the other major input of BM25. And if the list is SORTED from highest -> lowest term frequency, we can go down it until it's impossible for the BM25 score of that term to have any chance of making the top K results. Then exit early.
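The early exit can be sketched as follows. This is a deliberately simplified illustration, not the full WAND algorithm and not how Elasticsearch or Lucene implements it: it ignores document length (so a term's score depends only on term frequency), and `threshold` (the score currently needed to make the top K) and `other_terms_max` (the most all other query terms could add) are hypothetical inputs that a real engine would track itself.

```python
def term_score(term_freq, idf, k1=1.2):
    """Simplified BM25 term score that ignores document length."""
    return term_freq / (term_freq + k1) * idf


def scan_postings(postings, idf, threshold, other_terms_max=0.0):
    """Walk postings sorted by descending term frequency, exiting early.

    postings: list of (doc_id, term_freq), highest term_freq first.
    threshold: score a document must reach to enter the current top K.
    other_terms_max: upper bound on what the other query terms can add.
    """
    candidates = []
    for doc_id, term_freq in postings:
        score = term_score(term_freq, idf)
        # Scores only shrink as term_freq shrinks, so once even the most
        # optimistic total misses the threshold, no later doc can make it.
        if score + other_terms_max < threshold:
            break
        candidates.append((doc_id, score))
    return candidates


# Hypothetical postings for a common term like "luke"
luke_postings = [(17, 10), (4, 5), (92, 1), (8, 1)]
survivors = scan_postings(luke_postings, idf=1.0, threshold=0.5)
```

With these made-up numbers the scan stops at the third posting, so the tail of the list is never scored. On a real postings list with millions of entries, that skipped tail is where the savings come from.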

While WAND (and similar optimizations) helps Elasticsearch avoid work, SearchArray gleefully does this work like an ignoramus, happily giving you a giant idiotic array of BM25 scores.

When you look at this icicle graph of SearchArray’s performance doing an “OR” search, you can see all the time spent summing a giant array and also needlessly BM25 scoring many many documents.

[image: icicle graph of SearchArray's time spent during an "OR" search]

SearchArray doesn’t directly store postings

Unlike most search engines, SearchArray doesn’t have postings lists of terms -> documents.

Instead, under the hood, SearchArray stores a positional index, built first-and-foremost for phrase matching. You give SearchArray a list of terms ['mary', 'had', 'a', 'little', 'lamb']. It then finds every place mary occurs one position before had, etc. It does this by storing, for each term, the positions as a roaring bitmap.

In our roaring bitmap, each 64 bit word has a header indicating where the positions occur (document and region in the doc). Each bit position corresponds to a position in the document. A 1 indicates this term is present, a 0 missing.

So to collect phrase matches, for mary had we can simply find places where one term’s bits occur adjacent to another. This can be done very fast with simple bit arithmetic.

mary
   00000010000000    | 00000000000010
had
   00000001000000    | 00000000000001      #
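As a rough sketch of that bit arithmetic, using plain Python integers in place of SearchArray's actual 64 bit words (and ignoring the header bits entirely, an assumption made for brevity): reading the diagram left to right as increasing position, shifting mary one position toward had and ANDing the two leaves a set bit wherever "mary had" occurs.

```python
def phrase_matches(first_bits, second_bits):
    """Count places where the second term sits one position after the first.

    Bits are read left to right as increasing position, so "one position
    later" is one bit to the right, i.e. a right shift of the first term.
    """
    adjacent = (first_bits >> 1) & second_bits
    return bin(adjacent).count("1")  # popcount of the surviving bits


# The two words from the diagram above, per term
mary = [0b00000010000000, 0b00000000000010]
had = [0b00000001000000, 0b00000000000001]

matches_per_word = [phrase_matches(m, h) for m, h in zip(mary, had)]
```

This sketch does not handle a phrase that straddles the boundary between two words; a real implementation has to carry the shifted-out bit into the neighbouring word.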

But a nice property of this, one that eases maintenance for this one-person project, is that we can also use it to compute term frequencies. Simply by performing a popcount (counting the number of set bits) on each word, then collecting those counts per document for a term, we get a mapping of doc ids -> term frequencies.
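A minimal sketch of that popcount step, assuming a hypothetical layout where each word has already been split into a (doc id, position bits) pair; SearchArray's real header encoding is not reproduced here.

```python
from collections import Counter


def term_frequencies(words):
    """Map doc id -> term frequency from (doc_id, position_bits) pairs.

    A document may span several words, so counts are summed per doc id.
    """
    freqs = Counter()
    for doc_id, position_bits in words:
        freqs[doc_id] += bin(position_bits).count("1")  # popcount
    return dict(freqs)


# Hypothetical words for one term: doc 0 spans two words, doc 3 has one
words_for_term = [
    (0, 0b00000010000000),
    (0, 0b00000000000010),
    (3, 0b00100000000100),
]
tf = term_frequencies(words_for_term)
```

Every lookup of a term's frequencies repeats this walk over all of its words, which is the cost a stored postings list would avoid.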

So we spend a fair amount of time doing that, as you can see here:

[image: profile showing the time spent computing term frequencies]

Now, I lied, actually. While this is the core mechanism for storing term frequencies, we do cache: the cache remembers the doc id -> term frequencies mapping when the roaring bit array is > N 64 bit words. This lets users tune N to their memory / speed tradeoff, and get closer to a postings list.

Caching Non term-dependent BM25 components

Take a look at this snippet for computing BM25:

        bm25 = (
            term_freq / (term_freq + (k1 * ((1 - b) + (b * (doc_len / avg_doc_lens)))))
        ) * idf

Notice there is a BIG PART of this calculation that has nothing to do with the terms being searched:

 (k1 * ((1 - b) + (b * (doc_len / avg_doc_lens))))

In my testing, a bit of latency (1ms on 1 million docs) can be shaved by caching everything in the k1 * ... avg_doc_lens term somewhere. If doc_len corresponds to an array with a doc length value for every document, you can create an array with this formula cached. But it's a bit of a maintenance burden to have one additional, globally shared cache. So I have avoided this so far.
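For illustration, the cached version might look like the sketch below. The function names and the default k1 and b values are assumptions for this example, not SearchArray's API; the point is only that the length-dependent term is computed once per corpus and reused for every query term.

```python
import numpy as np


def length_norm(doc_len, k1=1.2, b=0.75):
    """Precompute the term-independent part of BM25 for every document."""
    avg_doc_lens = doc_len.mean()
    return k1 * ((1 - b) + (b * (doc_len / avg_doc_lens)))


def bm25_cached(term_freq, idf, cached_norm):
    """Same formula as above, with the length term looked up, not recomputed."""
    return (term_freq / (term_freq + cached_norm)) * idf


doc_len = np.array([100.0, 250.0, 40.0])
cached_norm = length_norm(doc_len)          # computed once, reused per term

term_freq = np.array([3.0, 0.0, 1.0])
scores = bm25_cached(term_freq, idf=2.0, cached_norm=cached_norm)
```

The maintenance burden shows up as soon as documents are added or removed: avg_doc_lens changes, so the whole cached array has to be rebuilt.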

Caching the FULL query, not just individual BM25 term scoring

SearchArray is just a system for computing BM25 scores (or whatever similarity). You USE it to build up an “or query” or whatever using numpy… it doesn’t do it for you. IE the code implemented in BEIR is simply:

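import numpy as np

def bm25_search(corpus, query, column):
    # Reuse the tokenizer the column was indexed with
    tokenizer = corpus[column].array.tokenizer
    # De-duplicate so a repeated query term is only scored once
    query_terms = set(tokenizer(query))
    scores = np.zeros(len(corpus))
    for term in query_terms:
        # score() returns a BM25 score for every document for this term
        scores += corpus[column].array.score(term)
    return scores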

But in a regular search engine like Solr, Elasticsearch, OpenSearch, or Vespa, this logic is expressed in the search engine’s Query DSL. So the search engine can plan+cache the complete calculation, whereas SearchArray gives you all the tools to shoot yourself in the foot, performance wise (not to mention the earlier point about WAND, etc).

That’s why you should hug a search engineer

There you have it!

SearchArray is a tool for prototyping, using normal Pydata tooling, not for building giant retrieval systems like Elasticsearch. It's good to know the tradeoffs behind your lexical system, as different systems optimize for different things. You might find it useful for dorking around on

What would be great would be if we COULD express our queries in such a dataframe-oriented DSL. IE a Polars-esque lazy top-N retrieval system that pulled from different retrieval sources, scored them, summed them, and did whatever arbitrary math to the underlying scores. I can cross my fingers such a thing might exist. So far people build these DAGs in less expressive ways: as part of their Torch model DAG, or some homegrown query-time DAG system.

In any case, I'm absolutely humbled by folks that work on big, large scale, distributed lexical search engines like Vespa, Lucene, OpenSearch, Elasticsearch, and Solr. These folks ought to be your heroes too; they do this grunt work for us, and we should NOT take it for granted.

Below are some notes and appendices for BEIR and the different benchmarking scripts, in case you’re curious


Appendix links to scripts

Appendix – How to integrate with BEIR…

BEIR has a set of built-in datasets and metrics tools that you can use if you implement a BaseSearch class with the following signature:

    class SearchArraySearch(BaseSearch):

        def search(self,
                   corpus: Dict[str, Dict[str, str]],
                   queries: Dict[str, str],
                   top_k: int,
                   *args,
                   **kwargs) -> Dict[str, Dict[str, float]]:

The inputs:

  • Corpus: A dict mapping a document id to a set of fields to index, ie
{'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.',
 'title': ''}
...
  • Queries: A dict mapping a query id -> query:
{"1": "Who was the original governor of the plymouth colony"}
...

Finally the output is a dictionary of query ids -> {doc ids -> scores} – each query w/ top_k scored.

So when search is called you need to

  1. Index the corpus
  2. Issue all queries and gather scores

Essentially this looks something like:

def search(self,
           corpus: Dict[str, Dict[str, str]],
           queries: Dict[str, str],
           top_k: int,
           *args,
           **kwargs) -> Dict[str, Dict[str, float]]:
    corpus = self.index_corpus(corpus)     # 
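A minimal sketch of those two steps, written as a standalone function so it runs without BEIR installed. Here `index_corpus` and `some_search_function` are hypothetical callables standing in for the indexing and scoring steps; `some_search_function(index, query)` is assumed to return one score per document, in the same order as the corpus keys.

```python
import heapq
from typing import Callable, Dict, List, Sequence


def search(corpus: Dict[str, Dict[str, str]],
           queries: Dict[str, str],
           top_k: int,
           index_corpus: Callable,
           some_search_function: Callable) -> Dict[str, Dict[str, float]]:
    doc_ids: List[str] = list(corpus.keys())
    index = index_corpus(corpus)                     # 1. index the corpus
    results: Dict[str, Dict[str, float]] = {}
    for query_id, query in queries.items():          # 2. issue every query
        scores: Sequence[float] = some_search_function(index, query)
        # Keep only the top_k best-scoring documents for this query
        best = heapq.nlargest(top_k, zip(scores, doc_ids))
        results[query_id] = {doc_id: float(score) for score, doc_id in best}
    return results
```

The return value has the shape BEIR expects: query ids -> {doc ids -> scores}, with at most top_k documents per query.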

How does this look for SearchArray?

To index, we loop over each str column, and add a SearchArray column to the DF. Below, tokenized with a snowball tokenizer:

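            for column in corpus.columns:
                if corpus[column].dtype == 'object':
                    # Assign back instead of inplace=True on a column selection,
                    # which newer pandas versions warn about and may not apply
                    corpus[column] = corpus[column].fillna("")
                    corpus[f'{column}_snowball'] = SearchArray.index(corpus[column],
                                                                     data_dir=DATA_DIR,
                                                                     tokenizer=snowball_tokenizer)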

Then replace some_search_function above w/ something that searches the SearchArray columns. Maybe this simple bm25_search:

def bm25_search(corpus, query):
    query = snowball_tokenizer(query)
    scores = np.zeros(len(corpus))
    for q in query:
        scores += corpus['text_snowball'].array.score(q)
    return scores

(Leaving out some annoying threading, but you can look at all the code here)


