Thursday, June 5, 2025
Techcratic

MinishLab/semhash: Fast Semantic Text Deduplication

Hacker News by Hacker News
January 12, 2025

2025-01-12 11:20:00
github.com


SemHash is a lightweight and flexible tool for deduplicating datasets using semantic similarity. It combines fast embedding generation from Model2Vec with efficient ANN-based similarity search through Vicinity.

SemHash supports both single-dataset deduplication (e.g., cleaning up a train set) and multi-dataset deduplication (e.g., ensuring no overlap between a test set and a train set). It works with simple datasets, such as text lists, and more complex ones, like multi-column QA datasets. Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.
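The core idea (embed each text, then drop any text that is too similar to one already kept) can be sketched in a few lines. This is a minimal illustration and not SemHash's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model such as Model2Vec, the exhaustive loop stands in for an ANN index, and all function names and the 0.9 default are assumptions made for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def self_deduplicate(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept: list[str] = []
    kept_vectors: list[Counter] = []
    for text in texts:
        vector = embed(text)
        # Exhaustive search over kept items; an ANN index would replace this loop.
        if all(cosine(vector, other) < threshold for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vector)
    return kept


texts = [
    "The cat sat on the mat",
    "the cat sat on the mat",
    "Stock markets fell sharply today",
]
unique_texts = self_deduplicate(texts)
```

Here the second text differs from the first only in casing, so it is dropped, while the unrelated third text is kept.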

Install the package with:

pip install semhash

Deduplicate a single dataset with the following code (note: the examples assume you have datasets installed, which you can install with pip install datasets):

from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate().deduplicated

Or, deduplicate across two datasets with the following code (e.g., eliminating train/test leakage):

from datasets import load_dataset
from semhash import SemHash

# Load two datasets to deduplicate
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance with the training data
semhash = SemHash.from_records(records=train_texts)

# Deduplicate the test data against the training data, optionally with a specific threshold
deduplicated_test_texts = semhash.deduplicate(records=test_texts, threshold=0.9).deduplicated
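To get a feel for what the threshold controls when filtering one dataset against another, here is a small self-contained sketch. It does not use SemHash: `difflib.SequenceMatcher` gives a character-level score that only stands in for embedding similarity, and `deduplicate_against` and the sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; SemHash itself compares embeddings instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def deduplicate_against(reference: list[str], candidates: list[str], threshold: float) -> list[str]:
    """Return the candidates whose best match in the reference set is below the threshold."""
    return [
        candidate
        for candidate in candidates
        if max((similarity(candidate, ref) for ref in reference), default=0.0) < threshold
    ]


train = ["Oil prices rise as supply tightens", "New phone released this week"]
test = [
    "Oil prices rise as supply tightens",   # exact copy of a train record
    "Oil prices rise as supplies tighten",  # near-duplicate of a train record
    "Local team wins championship",         # unrelated to the train set
]

strict = deduplicate_against(train, test, threshold=0.99)  # removes only the exact copy
loose = deduplicate_against(train, test, threshold=0.8)    # also removes the near-duplicate
```

A higher threshold removes fewer records (only very close matches count as duplicates), while a lower threshold removes more.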

Or, deduplicate multi-column datasets with the following code (e.g., deduplicating a QA dataset):

from datasets import load_dataset
from semhash import SemHash

# Load the dataset
dataset = load_dataset("squad_v2", split="train")

# Convert the dataset to a list of dictionaries
records = [dict(row) for row in dataset]

# Initialize SemHash with the columns to deduplicate
semhash = SemHash.from_records(records=records, columns=["question", "context"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate().deduplicated
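To illustrate what selecting columns means for record identity, the sketch below removes only exact duplicates, treating two records as the same when the chosen columns match and ignoring every other field. It is not how SemHash compares records semantically; the `exact_deduplicate` helper and the sample records are hypothetical.

```python
def exact_deduplicate(records: list[dict], columns: list[str]) -> list[dict]:
    """Keep the first record for each distinct combination of the chosen columns."""
    seen: set[tuple] = set()
    unique: list[dict] = []
    for record in records:
        key = tuple(record[column] for column in columns)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"id": 1, "question": "Who wrote Hamlet?", "context": "Hamlet is a tragedy by Shakespeare."},
    {"id": 2, "question": "Who wrote Hamlet?", "context": "Hamlet is a tragedy by Shakespeare."},
    {"id": 3, "question": "Who wrote Hamlet?", "context": "A different passage about the play."},
]
unique_records = exact_deduplicate(records, columns=["question", "context"])
```

Record 2 is dropped even though its `id` differs, because only `question` and `context` take part in the comparison; record 3 survives because its context is different.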

The deduplicate and self_deduplicate functions return a DeduplicationResult. This object stores the deduplicated corpus, a set of duplicate objects (along with the objects that caused duplication), and several useful functions to further inspect the deduplication result. Examples of how these functions can be used can be found in the usage section.

  • Fast: SemHash uses model2vec to embed texts and vicinity to perform similarity search, making it extremely fast.
  • Scalable: SemHash can deduplicate large datasets with millions of records thanks to the ANN backends in Vicinity.
  • Flexible: SemHash can be used to deduplicate a single dataset or across two datasets, and can also be used to deduplicate multi-column datasets (such as QA datasets).
  • Lightweight: SemHash is a lightweight package with minimal dependencies, making it easy to install and use.
  • Explainable: Easily inspect the duplicates and what caused them with the DeduplicationResult object. You can also view the lowest similarity duplicates to find the right threshold for deduplication for your dataset.
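The threshold-tuning idea in the last bullet can be sketched without SemHash. Assume each flagged duplicate carries the similarity score of its closest match; the `ScoredDuplicate` type, the helper functions, and the sample scores are hypothetical and are not the library's data model. Sorting by score surfaces the most borderline cases, and raising the threshold releases every duplicate that scores below it.

```python
from dataclasses import dataclass


@dataclass
class ScoredDuplicate:
    text: str
    matched_text: str
    score: float  # similarity to the record it duplicates


def least_similar(duplicates: list[ScoredDuplicate], n: int = 1) -> list[ScoredDuplicate]:
    """Return the n duplicates with the lowest scores, i.e. the most borderline ones."""
    return sorted(duplicates, key=lambda d: d.score)[:n]


def rethreshold(duplicates: list[ScoredDuplicate], threshold: float) -> list[ScoredDuplicate]:
    """Keep only duplicates whose score still meets the new threshold."""
    return [d for d in duplicates if d.score >= threshold]


duplicates = [
    ScoredDuplicate("the cat sat on the mat", "The cat sat on the mat.", 0.99),
    ScoredDuplicate("cats like mats", "The cat sat on the mat.", 0.91),
    ScoredDuplicate("markets fell today", "Stock markets fell sharply today", 0.96),
]

borderline = least_similar(duplicates)            # inspect this pair before trusting the threshold
still_duplicates = rethreshold(duplicates, 0.95)  # the 0.91 pair is no longer a duplicate
```

If the borderline pair turns out not to be a real duplicate, the threshold is probably too low for the dataset.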

The following examples show the various ways you can use SemHash to deduplicate datasets. These examples assume you have the datasets library installed, which you can install with pip install datasets.

Deduplicate a single dataset

The following code snippet shows how to deduplicate a single dataset using SemHash (in this example, the train split of the AG News dataset):

from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate().deduplicated

Deduplicate across two datasets

The following code snippet shows how to deduplicate across two datasets using SemHash (in this example, the train/test split of the AG News dataset):

from datasets import load_dataset
from semhash import SemHash

# Load two datasets to deduplicate
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance with the training data
semhash = SemHash.from_records(records=train_texts)

# Deduplicate the test data against the training data
deduplicated_test_texts = semhash.deduplicate(records=test_texts).deduplicated

Deduplicate multi-column datasets

The following code snippet shows how to deduplicate multi-column datasets using SemHash (in this example, the train split of the QA dataset SQuAD 2.0, which consists of questions, contexts, and answers):

from datasets import load_dataset
from semhash import SemHash

# Load the dataset
dataset = load_dataset("squad_v2", split="train")

# Convert the dataset to a list of dictionaries
records = [dict(row) for row in dataset]

# Initialize SemHash with the columns to deduplicate
semhash = SemHash.from_records(records=records, columns=["question", "context"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate().deduplicated

DeduplicationResult functionality

The DeduplicationResult object returned by the deduplicate and self_deduplicate functions contains several useful functions to inspect the deduplication result. The following code snippet shows how to use these functions:

from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplication_result = semhash.self_deduplicate()

# Check the deduplicated texts
deduplication_result.deduplicated
# Check the duplicates
deduplication_result.duplicates
# See what fraction of the texts were duplicates
deduplication_result.duplicate_ratio
# See what fraction of the texts were exact duplicates
deduplication_result.exact_duplicate_ratio

# Get the least similar text from the duplicates. This is useful for finding the right threshold for deduplication.
least_similar = deduplication_result.get_least_similar_from_duplicates()

# Rethreshold the duplicates. This allows you to instantly rethreshold the duplicates with a new threshold without having to re-deduplicate the texts.
deduplication_result.rethreshold(0.95)

Using custom encoders

The following code snippet shows how to use a custom encoder with SemHash:

from datasets import load_dataset
from model2vec import StaticModel
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Load an embedding model (in this example, a multilingual model)
model = StaticModel.from_pretrained("minishlab/M2V_multilingual_output")

# Initialize SemHash with the custom encoder
semhash = SemHash.from_records(records=texts, model=model)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate().deduplicated

Any encoder can be used that adheres to our encoder protocol. For example, any sentence-transformers model can be used as an encoder:

from datasets import load_dataset
from semhash import SemHash
from sentence_transformers import SentenceTransformer

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Load a sentence-transformers model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Initialize SemHash with the custom encoder
semhash = SemHash.from_records(records=texts, model=model)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate().deduplicated

NOTE: By default, we use the ANN (approximate nearest neighbors) backend for deduplication. We recommend keeping this default: recall on smaller datasets is ~100%, and larger datasets (>1M samples) take too long to deduplicate without ANN. If you want to use the flat/exact-matching backend, you can set use_ann=False when creating the SemHash instance:

semhash = SemHash.from_records(records=texts, use_ann=False)
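To see why exact matching becomes impractical at scale, it helps to count comparisons. The sketch below assumes that a flat search compares every record with every other record; that is a simplifying assumption about brute-force search in general, not a statement about Vicinity's internals.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for size in (1_000, 100_000, 1_000_000):
    print(f"{size:>9,} records -> {pairwise_comparisons(size):,} pairs")
```

At one million records this is roughly 5 × 10^11 similarity computations, which is the regime where the note above recommends ANN.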

We’ve benchmarked SemHash on a variety of datasets to measure the deduplication performance and speed. The benchmarks were run with the following setup:

  • The benchmarks were all run on CPU
  • The benchmarks were all run with use_ann=True
  • The encoder used is the default encoder (potion-base-8M).
  • The timings include the encoding time, index building time, and deduplication time.

Train Deduplication Benchmark

| Dataset | Original Train Size | Deduplicated Train Size | % Removed | Deduplication Time (s) |
|---|---|---|---|---|
| bbc | 1225 | 1144 | 6.61 | 0.57 |
| senteval_cr | 3012 | 2990 | 0.73 | 0.14 |
| tweet_sentiment_extraction | 27481 | 26695 | 2.86 | 1.77 |
| emotion | 16000 | 15695 | 1.91 | 0.77 |
| amazon_counterfactual | 5000 | 4992 | 0.16 | 0.33 |
| ag_news | 120000 | 106921 | 10.90 | 5.20 |
| enron_spam | 31716 | 20540 | 35.24 | 2.03 |
| subj | 8000 | 7990 | 0.12 | 0.63 |
| sst5 | 8544 | 8526 | 0.21 | 0.58 |
| 20_newgroups | 11314 | 10684 | 5.57 | 0.73 |
| hatespeech_offensive | 22783 | 22090 | 3.04 | 0.92 |
| ade | 17637 | 15718 | 10.88 | 0.73 |
| imdb | 25000 | 24830 | 0.68 | 1.76 |
| massive_scenario | 11514 | 9366 | 18.66 | 0.47 |
| student | 117519 | 63856 | 45.66 | 8.80 |
| squad_v2 | 130319 | 109698 | 15.82 | 8.81 |
| wikitext | 1801350 | 884645 | 50.89 | 83.53 |

Train/Test Deduplication Benchmark

| Dataset | Train Size | Test Size | Deduplicated Test Size | % Removed | Deduplication Time (s) |
|---|---|---|---|---|---|
| bbc | 1225 | 1000 | 870 | 13.00 | 0.71 |
| senteval_cr | 3012 | 753 | 750 | 0.40 | 0.13 |
| tweet_sentiment_extraction | 27481 | 3534 | 3412 | 3.45 | 1.53 |
| emotion | 16000 | 2000 | 1926 | 3.70 | 0.65 |
| amazon_counterfactual | 5000 | 5000 | 4990 | 0.20 | 0.51 |
| ag_news | 120000 | 7600 | 6198 | 18.45 | 3.74 |
| enron_spam | 31716 | 2000 | 1060 | 47.00 | 1.94 |
| subj | 8000 | 2000 | 1999 | 0.05 | 0.62 |
| sst5 | 8544 | 2210 | 2205 | 0.23 | 0.59 |
| 20_newgroups | 11314 | 7532 | 7098 | 5.76 | 2.25 |
| hatespeech_offensive | 22783 | 2000 | 1925 | 3.75 | 0.77 |
| ade | 17637 | 5879 | 4952 | 15.77 | 0.81 |
| imdb | 25000 | 25000 | 24795 | 0.82 | 2.81 |
| massive_scenario | 11514 | 2974 | 2190 | 26.36 | 0.46 |
| student | 117519 | 5000 | 2393 | 52.14 | 3.78 |
| squad_v2 | 130319 | 11873 | 11863 | 0.08 | 7.13 |
| wikitext | 1801350 | 4358 | 2139 | 50.92 | 40.32 |

As can be seen, SemHash is extremely fast, and scales to large datasets with millions of records. There are some notable examples of train/test leakage, such as enron_spam and student, where the test dataset contains a significant amount of semantic overlap with the training dataset.
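The "% Removed" column and an approximate throughput can be recomputed from the reported figures. The snippet below uses three rows copied from the train deduplication benchmark; the helper names are illustrative, and the throughput is a rough average that folds encoding, indexing, and deduplication time together.

```python
def percent_removed(original: int, deduplicated: int) -> float:
    """Share of records removed, as a percentage rounded to two decimals."""
    return round(100 * (original - deduplicated) / original, 2)


def records_per_second(size: int, seconds: float) -> float:
    """Average number of input records processed per second."""
    return size / seconds


# (dataset, original size, deduplicated size, time in seconds) from the train benchmark
train_rows = [
    ("ag_news", 120000, 106921, 5.20),
    ("enron_spam", 31716, 20540, 2.03),
    ("wikitext", 1801350, 884645, 83.53),
]

for name, original, deduplicated, seconds in train_rows:
    print(name, percent_removed(original, deduplicated), round(records_per_second(original, seconds)))
```

For wikitext this works out to a little over 21,000 records per second on CPU.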

Reproducing the Benchmarks

To run the benchmarks yourself, you can use the following command (assuming you have the datasets library installed):

python -m benchmarks.run_benchmarks

Optionally, the datasets can be updated in the datasets.py file.
