My NumPy Year: Creating a DType for the Next Generation of Scientific Computing

2024-10-23 13:44:00
quansight.com

This project was a mix of challenges and learning as I navigated the CPython C API and worked closely with the NumPy community. I want to share a behind-the-scenes look at my work on introducing a new string DType in NumPy 2.0, mostly drawn from a recent talk I gave at SciPy. In this post, I’ll walk you through the technical process, key design decisions, and the ups and downs I faced. Plus, you’ll find tips on tackling mental blocks and insights into becoming a maintainer.

By the end, I hope you’re going to have the answers to these questions:

What was wrong with NumPy string arrays before NumPy 2.0, and why did they need to be fixed?
How did the community fund the work that fixed it?
How did I become a NumPy maintainer in the midst of that?
How did I start working on the project that helped fix NumPy strings?
What cool new feature did I add to NumPy?

A Brief History of Strings in NumPy

First, I’ll start with a brief history of strings in NumPy to explain how strings worked before NumPy 2.0 and why it was a little bit broken.

String Arrays in Python 2

Let’s go back to Python 2.7 and look at how strings worked in NumPy before the Python 3 Unicode revolution. I actually compiled Python 2 in 2024 to make this post. It doesn’t build on my ARM Mac, but it does compile on Ubuntu 22.04. Python 2 "strings" were what we now call byte strings in Python 3 – arrays of arbitrary bytes with no attached encoding. NumPy string arrays had similar behavior.

				
					Python 2.7.18 (default, Jul  1 2024, 10:27:04)&#13;
&gt;&gt;&gt; import numpy as np&#13;
&#13;
&gt;&gt;&gt; np.array(["hello", "world"])&#13;
array(['hello', 'world'], dtype="|S5")&#13;
&#13;
&gt;&gt;&gt; np.array(['hello', '☃'])&#13;
array(['hello', '\xe2\x98\x83'], dtype="|S5")&#13;

Let’s say you create an array with the contents “hello", "world”, you can see it gets created with the DType “S5”. So, what does that mean? It means it’s a Python 2-string array with five elements, five characters, or five bytes per array (characters and bytes are the same thing in Python 2).

It sort of works with Unicode if you squint at it. For instance, I wrote 'hello', ‘☃’ and if you happen to know the UTF-8 bytes for Unicode 'snowman', it’s '\xe2\x98\x83'. So, it’s just taking the UTF-8 bytes from my terminal and putting them straight into the array.

Here, we have the bytes in the Python 2 string array: the ASCII byte for 'h', the ASCII byte for 'e', and over in the second element of the array is the UTF-8 bytes for the Unicode snowman. It’s also important to know that for these fixed-width strings—if you don’t fill up the width of the array, it just adds zeros to the end, which are null bytes.

				
					&gt;&gt;&gt; arr = np.array([u'hello', u'world'])&#13;
&gt;&gt;&gt; arr&#13;
array([u'hello', u'world'], dtype="<u5>
				</u5>

Python 2 also had this Unicode type, where you could create an array with the contents 'hello', 'world', but as Unicode strings, and that creates an array with the DType 'U5'. This works, and it’s exactly what Python 2 did with Unicode strings. Each character is a UTF-32 encoded character, so four bytes per character

String Arrays in Python 3

				
					&gt;&gt;&gt; arr = np.array(['hello', 'world'])&#13;
&gt;&gt;&gt; arr&#13;
array(['hello', 'world'], dtype="<u5>
				</u5>

In Python 3, they made this the default since Python 3 strings are Unicode strings, and that was the pragmatic, easy decision, but I argue it was a bad decision—and here’s why:

				
					&gt;&gt;&gt; arr.tobytes()&#13;
b'h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'

If we look at the bytes actually in the array, these are all ASCII characters, so they really only need one byte, which means there’s a bunch of zeros in the array that are just wasted. You’re using four times as much memory than is actually needed to store the array.

Before, it was written in C as a Python for-loop over the elements of the array. For each element of the array, it would create a scalar, call the string operation on that scalar, and then stuff the results into the result array. As you can imagine, that’s pretty slow. But by rewriting it to loop over the array buffer without accessing each item as a scalar, you can make it anywhere from 500 times faster for small two-element arrays or two to five times faster for longer strings

Another thing people have done, and what they’ve defaulted to because of these issues with Unicode strings in NumPy, is to use object arrays.

				
					&gt;&gt;&gt; arr = np.array(&#13;
	['this is a very long string', np.nan, 'another string'],&#13;
	dtype=object&#13;
)&#13;
&gt;&gt;&gt; arr&#13;
array(['this is a very long string', nan, 'another string'],&#13;
  	dtype=object)&#13;

You can create an array in NumPy with dtype=object and it stores the Python strings and Python objects that you put into the array directly. These are references to Python objects. If we call np.isnan on the second element of the array, you get back np.True_ because the object is np.nan, and the other elements are python strings stored directly in the array.

				
					&gt;&gt;&gt; arr = np.array(&#13;
	['this is a very long string', np.nan, 'another string'],&#13;
	dtype=object&#13;
)&#13;
&gt;&gt;&gt; np.isnan(arr[1])&#13;
np.True_&#13;
&#13;
&gt;&gt;&gt; type(arr[0])&#13;
str&#13;

Source Link

Support Techcratic

If you find value in Techcratic’s insights and articles, consider supporting us with Bitcoin. Your support helps me, as a solo operator, continue delivering high-quality content while managing all the technical aspects, from server maintenance to blog writing, future updates, and improvements. Support Innovation! Thank you.

Bitcoin Address:

bc1qlszw7elx2qahjwvaryh0tkgg8y68enw30gpvge

Please verify this address before sending funds.

Bitcoin QR Code

Simply scan the QR code below to support Techcratic.

Please read the Privacy and Security Disclaimer on how Techcratic handles your support.

Disclaimer: As an Amazon Associate, Techcratic may earn from qualifying purchases.

Tags: HACKER NEWS

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

My NumPy Year: Creating a DType for the Next Generation of Scientific Computing

A Brief History of Strings in NumPy

String Arrays in Python 2

String Arrays in Python 3

Support Techcratic

Bitcoin QR Code

Step by Step Multimedia Art with Herbivert

New Grandoreiro Banking Malware Variants Emerge with Advanced Tactics to Evade Detection

Related Posts

Leave a Reply Cancel reply

Tech News

Tech News

Tech News

Tech News​

Site Links

Tech News