Sunday, June 15, 2025
Techcratic

Study: Transparency is often lacking in datasets used to train large language models | MIT News

MIT Tech by MIT Tech
October 12, 2024


Adam Zewe | MIT News
2024-08-30 05:00:00
news.mit.edu

In order to train more powerful large language models, researchers use vast dataset collections that blend diverse data from thousands of web sources.

But as these datasets are combined and recombined into multiple collections, important information about their origins and restrictions on how they can be used are often lost or confounded in the shuffle.

Not only does this raise legal and ethical concerns, it can also damage a model’s performance. For instance, if a dataset is miscategorized, someone training a machine-learning model for a certain task may end up unwittingly using data that are not designed for that task.

In addition, data from unknown sources could contain biases that cause a model to make unfair predictions when deployed.

To improve data transparency, a team of multidisciplinary researchers from MIT and elsewhere launched a systematic audit of more than 1,800 text datasets on popular hosting sites. They found that more than 70 percent of these datasets omitted some licensing information, while about 50 percent had information that contained errors.

Building off these insights, they developed a user-friendly tool called the Data Provenance Explorer that automatically generates easy-to-read summaries of a dataset’s creators, sources, licenses, and allowable uses.

“These types of tools can help regulators and practitioners make informed decisions about AI deployment, and further the responsible development of AI,” says Alex “Sandy” Pentland, an MIT professor, leader of the Human Dynamics Group in the MIT Media Lab, and co-author of a new open-access paper about the project.

The Data Provenance Explorer could help AI practitioners build more effective models by enabling them to select training datasets that fit their model’s intended purpose. In the long run, this could improve the accuracy of AI models in real-world situations, such as those used to evaluate loan applications or respond to customer queries.

“One of the best ways to understand the capabilities and limitations of an AI model is understanding what data it was trained on. When you have misattribution and confusion about where data came from, you have a serious transparency issue,” says Robert Mahari, a graduate student in the MIT Human Dynamics Group, a JD candidate at Harvard Law School, and co-lead author on the paper.

Mahari and Pentland are joined on the paper by co-lead author Shayne Longpre, a graduate student in the Media Lab; Sara Hooker, who leads the research lab Cohere for AI; as well as others at MIT, the University of California at Irvine, the University of Lille in France, the University of Colorado at Boulder, Olin College, Carnegie Mellon University, Contextual AI, ML Commons, and Tidelift. The research is published today in Nature Machine Intelligence.

Focus on fine-tuning

Researchers often use a technique called fine-tuning to improve the capabilities of a large language model that will be deployed for a specific task, like question-answering. For fine-tuning, they carefully build curated datasets designed to boost a model’s performance for this one task.

The MIT researchers focused on these fine-tuning datasets, which are often developed by researchers, academic organizations, or companies and licensed for specific uses.

When crowdsourced platforms aggregate such datasets into larger collections for practitioners to use for fine-tuning, some of that original license information is often left behind.

“These licenses ought to matter, and they should be enforceable,” Mahari says.

For instance, if the licensing terms of a dataset are wrong or missing, someone could spend a great deal of money and time developing a model they might be forced to take down later because some training data contained private information.

“People can end up training models where they don’t even understand the capabilities, concerns, or risk of those models, which ultimately stem from the data,” Longpre adds.

To begin this study, the researchers formally defined data provenance as the combination of a dataset’s sourcing, creating, and licensing heritage, as well as its characteristics. From there, they developed a structured auditing procedure to trace the data provenance of more than 1,800 text dataset collections from popular online repositories.

After finding that more than 70 percent of these datasets contained “unspecified” licenses that omitted much information, the researchers worked backward to fill in the blanks. Through their efforts, they reduced the number of datasets with “unspecified” licenses to around 30 percent.
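As a rough illustration of the audit described above, the sketch below models a provenance record and measures how many records still carry an "unspecified" license before and after a fill-in pass. The `ProvenanceRecord` schema, the helper names, and the toy data are all assumptions made for this example; they are not the researchers' actual data model or procedure.

```python
from __future__ import annotations

from dataclasses import dataclass, replace

UNSPECIFIED = "unspecified"


@dataclass(frozen=True)
class ProvenanceRecord:
    # Hypothetical schema: field names are illustrative, not the paper's.
    name: str
    sources: tuple[str, ...] = ()
    creators: tuple[str, ...] = ()
    license: str = UNSPECIFIED


def unspecified_share(records: list[ProvenanceRecord]) -> float:
    """Fraction of records whose license is still unspecified."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.license == UNSPECIFIED)
    return missing / len(records)


def fill_in_licenses(
    records: list[ProvenanceRecord], traced: dict[str, str]
) -> list[ProvenanceRecord]:
    """Return new records, filling unspecified licenses from `traced`,
    a mapping of dataset name to the license found at the original source."""
    return [
        replace(r, license=traced[r.name])
        if r.license == UNSPECIFIED and r.name in traced
        else r
        for r in records
    ]


# Toy data: 10 made-up datasets, 7 of them with no license recorded.
records = [ProvenanceRecord(f"toy-{i}") for i in range(7)] + [
    ProvenanceRecord(f"toy-{i}", license="CC-BY-4.0") for i in range(7, 10)
]
# Pretend that tracing back to the source turned up a license for 4 of the 7.
traced = {f"toy-{i}": "CC-BY-NC-4.0" for i in range(4)}

before = unspecified_share(records)
after = unspecified_share(fill_in_licenses(records, traced))
print(f"unspecified before: {before:.0%}, after: {after:.0%}")
```

The toy numbers are chosen only to mirror the reported drop from roughly 70 percent to roughly 30 percent; the real audit traced each dataset by hand.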

Their work also revealed that the correct licenses were often more restrictive than those assigned by the repositories.

In addition, they found that nearly all dataset creators were concentrated in the global north, which could limit a model’s capabilities if it is trained for deployment in a different region. For instance, a Turkish language dataset created predominantly by people in the U.S. and China might not contain any culturally significant aspects, Mahari explains.

“We almost delude ourselves into thinking the datasets are more diverse than they actually are,” he says.

Interestingly, the researchers also saw a dramatic spike in restrictions placed on datasets created in 2023 and 2024, which might be driven by concerns from academics that their datasets could be used for unintended commercial purposes.

A user-friendly tool

To help others obtain this information without the need for a manual audit, the researchers built the Data Provenance Explorer. In addition to sorting and filtering datasets based on certain criteria, the tool allows users to download a data provenance card that provides a succinct, structured overview of dataset characteristics.
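To make the idea of filtering datasets and producing a provenance card concrete, here is a minimal sketch. It does not use the Data Provenance Explorer's real interface; `DatasetInfo`, `filter_datasets`, `provenance_card`, and the toy catalog are hypothetical names and values invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    # Hypothetical fields; the real tool's schema may differ.
    name: str
    license: str
    commercial_use: bool
    languages: tuple[str, ...]
    creators: tuple[str, ...]
    sources: tuple[str, ...]


def filter_datasets(
    datasets: list[DatasetInfo],
    *,
    commercial_use: bool | None = None,
    language: str | None = None,
) -> list[DatasetInfo]:
    """Keep datasets matching every criterion that is given, sorted by name."""
    kept = []
    for d in datasets:
        if commercial_use is not None and d.commercial_use != commercial_use:
            continue
        if language is not None and language not in d.languages:
            continue
        kept.append(d)
    return sorted(kept, key=lambda d: d.name)


def provenance_card(d: DatasetInfo) -> str:
    """Render a short plain-text summary of one dataset."""
    return "\n".join(
        [
            f"Dataset: {d.name}",
            f"Creators: {', '.join(d.creators)}",
            f"Sources: {', '.join(d.sources)}",
            f"License: {d.license}",
            f"Commercial use allowed: {'yes' if d.commercial_use else 'no'}",
            f"Languages: {', '.join(d.languages)}",
        ]
    )


# Toy catalog with made-up entries.
catalog = [
    DatasetInfo("toy-qa", "CC-BY-4.0", True, ("en",), ("Example Lab",), ("example.org",)),
    DatasetInfo("toy-chat", "CC-BY-NC-4.0", False, ("en", "tr"), ("Example Univ",), ("example.com",)),
    DatasetInfo("toy-summaries", "Apache-2.0", True, ("tr",), ("Example Co",), ("example.net",)),
]

usable = filter_datasets(catalog, commercial_use=True, language="tr")
for d in usable:
    print(provenance_card(d))
```

Passing `None` for a criterion means "do not filter on this", so the same function covers a single filter or several combined.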

“We are hoping this is a step, not just to understand the landscape, but also help people going forward to make more informed choices about what data they are training on,” Mahari says.

In the future, the researchers want to expand their analysis to investigate data provenance for multimodal data, including video and speech. They also want to study how terms of service on websites that serve as data sources are echoed in datasets.

As they expand their research, they are also reaching out to regulators to discuss their findings and the unique copyright implications of fine-tuning data.

“We need data provenance and transparency from the outset, when people are creating and releasing these datasets, to make it easier for others to derive these insights,” Longpre says.

“Many proposed policy interventions assume that we can correctly assign and identify licenses associated with data, and this work first shows that this is not the case, and then significantly improves the provenance information available,” says Stella Biderman, executive director of EleutherAI, who was not involved with this work. “In addition, section 3 contains relevant legal discussion. This is very valuable to machine learning practitioners outside companies large enough to have dedicated legal teams. Many people who want to build AI systems for public good are currently quietly struggling to figure out how to handle data licensing, because the internet is not designed in a way that makes data provenance easy to figure out.”

