Tuesday, June 10, 2025
Techcratic

I made a worse search engine than Elasticsearch

Hacker News by Hacker News
June 5, 2025
in Hacker News
Reading Time: 12 mins read

2025-06-05 14:37:00
softwaredoug.com

I want you to share in my shame at daring to make a search library. And in this shame you, too, can experience the humility and understanding of what a real, honest-to-goodness, non-side-project search engine does to make lexical search fast.

BEIR is a set of Information Retrieval benchmarks, oriented around question-answer use cases.

My side project, SearchArray, adds full text search to Pandas. So naturally, to stand in awe at my amazing developer skills, I wanted to use BEIR to compare SearchArray to Elasticsearch (w/ same query + tokenization). So I spent a Saturday integrating SearchArray into BEIR, and measuring its relevance and performance on the MSMarco Passage Retrieval corpus (8M docs).

… and 🥁

Library             | Elasticsearch        | SearchArray
NDCG@10             | 0.2275               | 0.225
Search Throughput   | 90 QPS               | ~18 QPS
Indexing Throughput | 10K Docs Per Sec     | ~3.5K Docs Per Sec

… Sad trombone 🎺

It’s worse in every dimension

At least NDCG@10 is nearly identical, so our BM25 calculation is probably correct (the small gap is likely due to negligible differences in tokenization).

Imposter Syndrome anyone?

Instead of wallowing in my shame, I DO know exactly what’s going on… And it’s fairly educational. Let’s chat about why a real, non side-project, search engine is fast.

A Magic WAND

(Or how SearchArray is top 8m retrieval while Elasticsearch == top K retrieval)

In lexical search systems, you search for multiple terms. You take the BM25 score of each term, and then finally, combine those into a final score for the document. IE, a search for luke skywalker really means: BM25(luke) ??? BM25(skywalker) where ??? is some mathematical operator.

In a simple “OR” query, you just take the SUM of each term for each doc, IE, a search for luke skywalker is BM25(luke) + BM25(skywalker) like so:

Term                     | Doc A (BM25) | Doc B (BM25)
luke                     | 1.90         | 1.45
skywalker                | 11.51        | 4.3
Combined doc score (SUM) | 13.41        | 5.75
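
As a minimal sketch of the "OR" combination in the table above, here are the per-term scores hard-coded as NumPy arrays and summed per document. The array names are illustrative only and are not part of SearchArray's API:

```python
import numpy as np

# Per-term BM25 scores, one entry per document (Doc A, Doc B),
# copied from the table above.
luke = np.array([1.90, 1.45])
skywalker = np.array([11.51, 4.3])

# An "OR" query is the element-wise sum of the per-term score arrays.
combined = np.sum([luke, skywalker], axis=0)

# Rank documents from highest to lowest combined score.
ranking = np.argsort(combined)[::-1]
print(combined, ranking)
```

Note that every document gets a score here, even ones that could never make the top N.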

SearchArray just does BM25 scoring. You get back big numpy arrays of every document’s BM25 score. Then you combine the scores, literally using np.sum. Of course, that’s not what a search engine like Elasticsearch would do. It makes a different guarantee: it returns only the highest scoring top N for your specified OR query.

This little bit of seemingly minute wiggle room gives search engines a great deal of latitude. Search engines can use an algorithm called Weak-AND or WAND to avoid work when combining multiple term scores into the final top N results.

I won’t get into the full nitty gritty of the algorithm, but here’s a key intuition to noodle over:

A scoring system like BM25 depends heavily on the document frequency of a term. So rare terms, those with a high (1 / document frequency), have a higher likelihood of impacting the final score and ending up in the top K. Luckily these terms (like skywalker) occur in fewer documents. So we can fetch these select, elite few docs quickly from the data structure that maps skywalker -> [... handful of matching doc ids...] (aka postings). We can reach deeply into this list.

On the other hand, we can be much more circumspect about the boring, common term, luke. And that’s useful because luke has a very extensive postings list luke -> [... a giant honking list of documents...]. We’d prefer to avoid scanning all of these.

We might imagine that each list of document ids is also paired with its term frequency (how often that term occurs in that document), the other major input of BM25. And if the list is SORTED from highest -> lowest term frequency, we can go down it until it’s impossible for the BM25 score of a term to have any chance of making the top K results. Then exit early.
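
The early-exit intuition above can be sketched roughly as follows. This is not WAND itself and not how Elasticsearch or SearchArray is implemented; it is a simplified illustration that assumes length normalization is switched off (b = 0), so a term's BM25 score only depends on its term frequency. The function names, the postings layout, and the idea of a caller-supplied threshold are all hypothetical:

```python
def tf_saturation(tf, k1=1.2):
    # BM25 term-frequency component with length normalization switched
    # off (b = 0), so the score only grows as tf grows.
    return tf / (tf + k1)


def scan_with_early_exit(postings, idf, threshold, k1=1.2):
    """Score an impact-sorted postings list, stopping once hopeless.

    postings:  list of (doc_id, term_freq), sorted by term_freq descending
    threshold: minimum score this term must contribute for a doc to still
               have a chance at the top K (assumed supplied by the caller)
    """
    candidates = []
    for doc_id, tf in postings:
        score = idf * tf_saturation(tf, k1)
        if score < threshold:
            # Every remaining entry has an equal or lower tf,
            # so none of them can reach the threshold either.
            break
        candidates.append((doc_id, score))
    return candidates


luke_postings = [(7, 5), (3, 3), (9, 1), (4, 1)]
print(scan_with_early_exit(luke_postings, idf=2.0, threshold=1.0))
```

A real WAND implementation tracks score upper bounds across all query terms and raises the threshold as the top K fills up; this sketch only shows why a sorted list lets you stop scanning.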

While WAND and similar optimizations help Elasticsearch avoid work, SearchArray gleefully does this work like an ignoramus, happily giving you a giant idiotic array of BM25 scores.

When you look at this icicle graph of SearchArray’s performance doing an “OR” search, you can see all the time spent summing a giant array and also needlessly BM25 scoring many many documents.

[Image: icicle graph of SearchArray’s performance doing an “OR” search]

SearchArray doesn’t directly store postings

Unlike most search engines, SearchArray doesn’t have postings lists of terms -> documents.

Instead, under the hood, SearchArray stores a positional index, built first-and-foremost for phrase matching. You give SearchArray a list of terms ['mary', 'had', 'a', 'little', 'lamb']. It then finds every place mary occurs one position before had, etc. It does this by storing, for each term, the positions as a roaring bitmap.

In our roaring bitmap, each 64 bit word has a header indicating where the positions occur (document and region in the doc). Each bit position corresponds to a position in the document. A 1 indicates this term is present, a 0 missing.

So to collect phrase matches, for mary had we can simply find places where one term’s bits occur adjacent to another. This can be done very fast with simple bit arithmetic.

mary
   00000010000000    | 00000000000010
had
   00000001000000    | 00000000000001      #
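
The bit trick in the diagram above can be sketched with plain Python integers. This is only an illustration: it treats the two columns as two independent bitmap words per term, ignores the header bits, and does not handle a phrase that straddles a word boundary. The function name is hypothetical:

```python
# Two position bitmaps per term, written left to right as in the
# diagram above: the leftmost bit is the earliest position.
mary = [0b00000010000000, 0b00000000000010]
had = [0b00000001000000, 0b00000000000001]


def count_adjacent(first_words, second_words):
    """Count places where `first` occurs one position before `second`."""
    matches = 0
    for first, second in zip(first_words, second_words):
        # Moving `first` one position later is a right shift in this
        # left-to-right layout; AND keeps positions where both line up.
        adjacent = (first >> 1) & second
        matches += bin(adjacent).count("1")  # popcount
    return matches


print(count_adjacent(mary, had))
```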

But a nice property of this layout, and one that reduces maintenance for this one-person project, is that we can also use it to compute term frequencies. By performing a popcount (counting the number of set bits) and then collecting the counts per document for a term, we get a mapping of doc ids -> term frequencies.

So we spend a fair amount of time doing that, as you can see here:

[Image: profile showing the time spent computing term frequencies]

Now I lied, actually. While this is the core mechanism for storing term frequencies, we do cache: the cache remembers the doc id -> term frequencies when the roaring bit array is > N 64 bit words. This lets users tune N to their memory / speed tradeoff, and get closer to a postings list.
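
A rough sketch of the popcount-plus-cache idea, under a made-up in-memory layout (term -> doc id -> list of bitmap words). The class, its parameter names, and the layout are assumptions for illustration, not SearchArray's internals:

```python
def term_freqs(words_by_doc):
    """Popcount each doc's bitmap words to get doc id -> term frequency."""
    return {
        doc_id: sum(bin(word).count("1") for word in words)
        for doc_id, words in words_by_doc.items()
    }


class CachedTermFreqs:
    def __init__(self, index, min_words_to_cache):
        self.index = index                    # term -> {doc_id: [words]}
        self.min_words = min_words_to_cache   # the tunable "N"
        self.cache = {}

    def get(self, term):
        if term in self.cache:
            return self.cache[term]
        words_by_doc = self.index.get(term, {})
        freqs = term_freqs(words_by_doc)
        # Only remember terms whose bitmaps are large enough that
        # recomputing the popcounts would be expensive.
        n_words = sum(len(words) for words in words_by_doc.values())
        if n_words > self.min_words:
            self.cache[term] = freqs
        return freqs


index = {
    "luke": {0: [0b101, 0b1], 1: [0b1]},
    "skywalker": {0: [0b10]},
}
tfs = CachedTermFreqs(index, min_words_to_cache=2)
print(tfs.get("luke"), tfs.get("skywalker"))
```

Lowering the threshold caches more terms (faster, more memory); raising it keeps more of the work in the bitmaps.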

Caching Non term-dependent BM25 components

Take a look at this snippet for computing BM25:

        bm25 = (
            term_freq / (term_freq + (k1 * ((1 - b) + (b * (doc_len / avg_doc_lens)))))
        ) * idf

Notice there is a BIG PART of this calculation that has nothing to do with the terms being searched:

 (k1 * ((1 - b) + (b * (doc_len / avg_doc_lens))))

In my testing, a bit of latency (1ms on 1 million docs) can be shaved by caching everything in the k1 * ... avg_doc_lens expression somewhere. If doc_len corresponds to an array with a doc length value for every document, you can create an array with this formula precomputed. But it’s a bit of a maintenance burden to have one additional, globally shared cache. So I have avoided this so far.
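
A minimal sketch of what that caching could look like with NumPy, assuming doc_len is an array with one length per document. The function names and the default k1 / b values are illustrative choices, not SearchArray's code:

```python
import numpy as np


def length_norm(doc_lens, k1=1.2, b=0.75):
    # The term-independent part of BM25: compute once per index,
    # then reuse for every query term.
    avg_doc_lens = doc_lens.mean()
    return k1 * ((1 - b) + (b * (doc_lens / avg_doc_lens)))


def bm25(term_freq, idf, cached_norm):
    # Same formula as the snippet above, with the cached array swapped in.
    return (term_freq / (term_freq + cached_norm)) * idf


doc_lens = np.array([10.0, 20.0, 30.0])
norm = length_norm(doc_lens)          # cache this array

term_freq = np.array([1.0, 0.0, 3.0])
print(bm25(term_freq, idf=2.0, cached_norm=norm))
```

The catch mentioned above still applies: the cached array has to be rebuilt whenever documents are added or k1 / b change.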

Caching the FULL query, not just individual BM25 term scoring

SearchArray is just a system for computing BM25 scores (or whatever similarity). You USE it to build up an “or query” or whatever using numpy… it doesn’t do it for you. IE the code implemented in BEIR is simply:

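import numpy as np

def bm25_search(corpus, query, column):
    # Tokenize the query with the same tokenizer used to index the column
    tokenizer = corpus[column].array.tokenizer
    # De-duplicate terms so a repeated query term is only scored once
    query_terms = set(tokenizer(query))
    scores = np.zeros(len(corpus))
    for term in query_terms:
        # score() returns one BM25 value per document for this term
        scores += corpus[column].array.score(term)
    return scores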

But in a regular search engine like Solr, Elasticsearch, OpenSearch, or Vespa, this logic is expressed in the search engine’s Query DSL. So the search engine can plan+cache the complete calculation, whereas SearchArray gives you all the tools to shoot yourself in the foot, performance wise (not to mention the earlier point about WAND, etc).

That’s why you should hug a search engineer

There you have it!

SearchArray is a tool for prototyping with normal PyData tooling, not for building giant retrieval systems like Elasticsearch. It’s good to know the tradeoffs behind your lexical system, as each one optimizes for different things. You might find it useful for dorking around on

What would be great would be if we COULD express our queries in such a dataframe-oriented DSL. IE a Polars-esque lazy top-N retrieval system that pulled from different retrieval sources, scored them, summed them, and did whatever arbitrary math to the underlying scores. I can cross my fingers such a thing might exist. So far people build these DAGs in less expressive ways: as part of their Torch model DAG, or some homegrown query-time DAG system.

In any case, I’m absolutely humbled by folks that work on big, large scale, distributed lexical search engines like Vespa, Lucene, OpenSearch, Elasticsearch, and Solr. These folks ought to be your heroes too. They do this grunt work for us, and we should NOT take it for granted.

Below are some notes and appendices for BEIR and the different benchmarking scripts, in case you’re curious


Appendix links to scripts

Appendix – How to integrate with BEIR…

BEIR has a set of built-in datasets and metrics tools that you can use if you implement a BaseSearch class with the following signature:

    class SearchArraySearch(BaseSearch):

        def search(self,
                   corpus: Dict[str, Dict[str, str]],
                   queries: Dict[str, str],
                   top_k: int,
                   *args,
                   **kwargs) -> Dict[str, Dict[str, float]]:

The inputs:

  • Corpus: A dict mapping a document id to a set of fields to index, i.e.
{'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.',
 'title': ''}
...
  • Queries: A dict pointing a query id -> query:
{"1": "Who was the original governor of the plymouth colony"}
...

Finally the output is a dictionary of query ids -> {doc ids -> scores} – each query w/ top_k scored.
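
As a sketch of that last step, here is one way to turn a full array of scores for a single query into the {doc ids -> scores} dict, keeping only the top_k. The helper name and the doc_ids list are hypothetical; this is not taken from the BEIR or SearchArray code:

```python
import numpy as np


def top_k_results(scores, doc_ids, top_k):
    """Return {doc_id: score} for the top_k highest scoring documents."""
    k = min(top_k, len(scores))
    if k == 0:
        return {}
    # argpartition finds the k largest without sorting the whole array...
    top = np.argpartition(scores, -k)[-k:]
    # ...then only those k entries are sorted, highest score first.
    top = top[np.argsort(scores[top])[::-1]]
    return {doc_ids[i]: float(scores[i]) for i in top}


scores = np.array([0.5, 3.0, 1.5, 0.0])
doc_ids = ["a", "b", "c", "d"]
print(top_k_results(scores, doc_ids, top_k=2))
```

Calling this once per query id and collecting the results in an outer dict gives the shape BEIR expects.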

So when search is called you need to

  1. Index the corpus
  2. Issue all queries and gather scores

Essentially this looks something like:

def search(self,
           corpus: Dict[str, Dict[str, str]],
           queries: Dict[str, str],
           top_k: int,
           *args,
           **kwargs) -> Dict[str, Dict[str, float]]:
    corpus = self.index_corpus(corpus)     # 

How does this look for SearchArray?

To index, we loop over each str column, and add a SearchArray column to the DF. Below, tokenized with a snowball tokenizer:

            for column in corpus.columns:
                if corpus[column].dtype == 'object':
                    # Assign back rather than fillna(inplace=True), which may act on a copy
                    corpus[column] = corpus[column].fillna("")
                    corpus[f'{column}_snowball'] = SearchArray.index(corpus[column],
                                                                     data_dir=DATA_DIR,
                                                                     tokenizer=snowball_tokenizer)

Then replace some_search_function above w/ something that searches the SearchArray columns. Maybe this simple bm25_search:

def bm25_search(corpus, query):
    query = snowball_tokenizer(query)
    scores = np.zeros(len(corpus))
    for q in query:
        scores += corpus['text_snowball'].array.score(q)
    return scores

(Leaving out some annoying threading, but you can look at all the code here.)
