Go Plan9 Memo, Speeding Up Calculations 450%

2024-10-18 10:36:00
pehringer.info

I want to take advantage of Go’s concurrency and parallelism for some of my upcoming projects, allowing for some serious number crunching capabilities. But what if I wanted EVEN MORE POWER?!? Enter SIMD, Single Instruction, Multiple Data [“sim”-“dee”]. Simd instructions allow for parallel number crunching right down at the hardware level. Many programming languages either have compiler optimizations that use simd or libraries that offer simd support. However, (as far as I can tell) Go’s compiler does not utilize simd, and I could not find a general purpose simd package that I liked. I just want a package that offers a thin abstraction layer over arithmetic and bitwise simd operations. So like any good programmer I decided to slightly reinvent the wheel and write my very own simd package. How hard could it be?

After doing some preliminary research I discovered that Go uses its own internal assembly language called Plan9. I consider it more of an assembly format than its own language. Plan9 uses the target platform’s instructions and registers with slight modifications to their names and usage. This means that x86 Plan9 is different than, say, arm Plan9. Overall, pretty weird stuff. I am not sure why the Go team went down this route. Maybe it simplifies the compiler by having this bespoke assembly format?

I always find learning by example to be the most informative.
So let’s Go (haha) over a simple example.

example
 ┣━ AddInts_amd64.s
 ┗━ main.go

example/AddInts_amd64.s

// +build amd64

TEXT ·AddInts(SB), 4, $0
    MOVL    left+0(FP), AX
    MOVL    right+8(FP), BX
    ADDL    BX, AX
    MOVL    AX, int+16(FP)
    RET

LINE 1: The file contains amd64-specific instructions, so we need to include a Go build tag to make sure Go does not try to compile this file for non-x86 machines.

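LINE 3: You can think of this line as the function’s declaration. TEXT declares that this is a function (it lives in the text section). ·AddInts(SB) specifies our function’s name. 4 is the numeric value of the “NOSPLIT” flag, which tells the assembler not to insert the usual stack-overflow check at the start of the function; that is fine for a small leaf function like this one that uses no stack of its own. And $0 is the size of the function’s stack frame (used for local variables). It’s zero in this case because we can easily fit everything into the registers.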

LINE 4 & 5: Go’s calling convention for assembly functions is to put the function arguments onto the stack. So we MOVe both Long 32-bit values into the AX and BX registers by dereferencing the frame pointer (FP) with the appropriate offsets. The first argument is stored at offset 0. The second argument is stored at offset 8, because an int is 8 bytes wide on amd64. (MOVL copies only the low 32 bits of each argument, which is enough for the small values used here.)

LINE 6: Add the Long 32-bit value in BX (right) to the Long 32-bit value in AX (left), and store the resulting Long 32-bit value in AX.

LINE 7 & 8: Go’s calling convention (as far as I can tell) is to put the function return values after its arguments on the stack. So we MOVe the Long 32-bit value in the AX register onto the stack by dereferencing the frame pointer (FP) with the appropriate offset, which is 16 in this case, and then RETurn to the caller.
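The offsets 0, 8 and 16 can be double-checked from Go itself. The sketch below is only an illustration: addIntsFrame is a hypothetical struct (not something the Go toolchain defines) that mirrors how the two int arguments and the int result sit next to each other relative to FP. On a 64-bit platform such as amd64 it reports the offsets 0, 8 and 16.

```go
package main

import (
	"fmt"
	"unsafe"
)

// addIntsFrame is a hypothetical struct used purely for illustration.
// It mirrors the stack layout of func AddInts(left, right int) int:
// the arguments come first and the result slot follows them.
type addIntsFrame struct {
	left  int // read as left+0(FP) in the assembly
	right int // read as right+8(FP) in the assembly
	ret   int // result slot, written at offset 16 in the assembly
}

// frameOffsets returns the byte offsets of left, right and the result slot.
func frameOffsets() (left, right, ret uintptr) {
	var f addIntsFrame
	return unsafe.Offsetof(f.left), unsafe.Offsetof(f.right), unsafe.Offsetof(f.ret)
}

func main() {
	left, right, ret := frameOffsets()
	fmt.Println("size of int in bytes:", unsafe.Sizeof(int(0)))
	fmt.Println("offsets of left, right, result:", left, right, ret)
}
```

The spacing between the fields comes from the size of int, not from padding, which is why a 4-byte move such as MOVL only touches half of each slot.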

example/main.go

1  package main
2
3  import "fmt"
4
5  func AddInts(left, right int) int
6
7  func main() {
8      fmt.Println("1 + 2 = ", AddInts(1, 2))
9  }

LINE 5: This is the forward declaration for our Plan9 function. Since they both share the same name (AddInts), Go will link them together during compilation.

LINE 8: We can now use our Plan9 function just like any other function.

Now that we are Go assembly experts, let’s get into the details of how I structured the package. My main goal for the package was to offer a thin abstraction layer over arithmetic and bitwise simd operations. Basically, I wanted a set of functions that would allow me to perform simd operations on slices.

Here’s a look at a simplified example of my project structure.

example
 ┣━ internal
 ┃   ┗━ addition
 ┃       ┣━ AddInts_amd64.s
 ┃       ┗━ addition_amd64.go
 ┣━ init_amd64.go
 ┗━ example.go

First, we will create a private function pointer with a corresponding public function that wraps around it. By default we will point the private pointer to a software implementation of the function.

example/example.go:

package example

func fallbackAddInts(left, right int) int {
    return left + right
}

var addInts func(left, right int) int = fallbackAddInts

func AddInts(left, right int) int {
    return addInts(left, right)
}

Next, we create an internal package that contains an architecture specific Plan9 implementation of our function.

example/internal/addition/AddInts_amd64.s

// +build amd64

TEXT ·AddInts(SB), 4, $0
    MOVL    left+0(FP), AX
    MOVL    right+8(FP), BX
    ADDL    BX, AX
    MOVL    AX, int+16(FP)
    RET

example/internal/addition/addition_amd64.go

// +build amd64

package addition

func AddInts(left, right int) int

Lastly, we will create an init function to configure the private function pointer with our internal package’s corresponding Plan9 function.

example/init_amd64.go

// +build amd64

package example

import "example/internal/addition"

func init() {
    addInts = addition.AddInts
}

TLDR: The use of a private function pointer combined with architecture-specific init functions and packages (using Go build tags) allows our example package to support multiple architectures easily!

Now with all that gunk loaded into your mind, I will let you decipher some of my x86 simd Plan9 functions.

simd/internal/sse/Supported_amd64.s

// +build amd64

// func Supported() bool
TEXT ·Supported(SB), 4, $0
  //Check SSE supported.
  MOVQ    $1, AX
  CPUID
  TESTQ   $(1
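A function with the signature func Supported() bool fits naturally into the init pattern from earlier: the accelerated implementation is only swapped in when the CPU reports the feature. The following is a sketch under that assumption, not the package’s actual code. sseSupported and sseAddFloat32 are hypothetical stand-ins for the assembly-backed sse.Supported and sse.AddFloat32, written in plain Go so the example runs on its own.

```go
package main

import "fmt"

// fallbackAddFloat32 is the portable software implementation used by default.
func fallbackAddFloat32(left, right, result []float32) int {
	n := len(result)
	if len(left) < n {
		n = len(left)
	}
	if len(right) < n {
		n = len(right)
	}
	for i := 0; i < n; i++ {
		result[i] = left[i] + right[i]
	}
	return n
}

// sseSupported is a hypothetical stand-in for sse.Supported.
// It reports false here; the real function would ask the CPU.
var sseSupported = func() bool { return false }

// sseAddFloat32 is a hypothetical stand-in for sse.AddFloat32.
// It simply reuses the fallback so that this sketch needs no assembly.
var sseAddFloat32 = fallbackAddFloat32

// addFloat32 is the private function pointer behind the public wrapper.
var addFloat32 = fallbackAddFloat32

// selectImplementation points addFloat32 at the accelerated version only
// when the feature check passes, and reports which one was chosen.
func selectImplementation() string {
	if sseSupported() {
		addFloat32 = sseAddFloat32
		return "sse"
	}
	addFloat32 = fallbackAddFloat32
	return "fallback"
}

// AddFloat32 is the public wrapper around the private function pointer.
func AddFloat32(left, right, result []float32) int {
	return addFloat32(left, right, result)
}

func main() {
	fmt.Println("using:", selectImplementation())
	result := make([]float32, 3)
	n := AddFloat32([]float32{1, 2, 3}, []float32{10, 20, 30}, result)
	fmt.Println(n, result)
}
```

In a real package the selection would live in an init function in a file guarded by the amd64 build tag, so other architectures keep the fallback without any extra code.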

simd/internal/sse/AddFloat32_amd64.s

// +build amd64

// func AddFloat32(left, right, result []float32) int
TEXT ·AddFloat32(SB), 4, $0
    //Load slices lengths.
    MOVQ    leftLen+8(FP), AX
    MOVQ    rightLen+32(FP), BX
    MOVQ    resultLen+56(FP), CX
    //Get minimum length.
    CMPQ    AX, CX
    CMOVQLT AX, CX
    CMPQ    BX, CX
    CMOVQLT BX, CX
    //Load slices data pointers.
    MOVQ    leftData+0(FP), SI
    MOVQ    rightData+24(FP), DX
    MOVQ    resultData+48(FP), DI
    //Initialize loop index.
    MOVQ    $0, AX
multipleDataLoop:
    MOVQ    CX, BX
    SUBQ    AX, BX
    CMPQ    BX, $4
    JL      singleDataLoop
    //Add four float32 values.
    MOVUPS  (SI)(AX*4), X0
    MOVUPS  (DX)(AX*4), X1
    ADDPS   X1, X0
    MOVUPS  X0, (DI)(AX*4)
    ADDQ    $4, AX
    JMP     multipleDataLoop
singleDataLoop:
    CMPQ    AX, CX
    JGE     returnLength
    //Add one float32 value.
    MOVSS   (SI)(AX*4), X0
    MOVSS   (DX)(AX*4), X1
    ADDSS   X1, X0
    MOVSS   X0, (DI)(AX*4)
    INCQ    AX
    JMP     singleDataLoop
returnLength:
    MOVQ    CX, int+72(FP)
    RET
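If deciphering the assembly by hand is not your idea of fun, the same control flow can be mirrored in plain Go. This is a reading aid, not part of the package: addFloat32Reference is a hypothetical name, and the four explicit additions only stand in for the single ADDPS instruction (plain Go gives no guarantee they are executed as one simd operation).

```go
package main

import "fmt"

// addFloat32Reference is a hypothetical pure-Go mirror of the AddFloat32
// assembly. It adds left and right element-wise into result and returns
// the number of elements processed.
func addFloat32Reference(left, right, result []float32) int {
	// Get minimum length (the two CMPQ/CMOVQLT pairs).
	n := len(result)
	if len(left) < n {
		n = len(left)
	}
	if len(right) < n {
		n = len(right)
	}

	// Initialize loop index (MOVQ $0, AX).
	i := 0

	// multipleDataLoop: runs while at least four elements remain.
	for n-i >= 4 {
		// Stands in for MOVUPS, MOVUPS, ADDPS, MOVUPS on four values.
		result[i+0] = left[i+0] + right[i+0]
		result[i+1] = left[i+1] + right[i+1]
		result[i+2] = left[i+2] + right[i+2]
		result[i+3] = left[i+3] + right[i+3]
		i += 4
	}

	// singleDataLoop: handles the remaining zero to three elements
	// one at a time (MOVSS, MOVSS, ADDSS, MOVSS).
	for i < n {
		result[i] = left[i] + right[i]
		i++
	}

	// returnLength: the minimum length is the return value.
	return n
}

func main() {
	left := []float32{1, 2, 3, 4, 5, 6}
	right := []float32{10, 20, 30, 40, 50, 60, 70}
	result := make([]float32, 6)
	n := addFloat32Reference(left, right, result)
	fmt.Println(n, result)
}
```

With six elements the first loop runs once and the second loop handles the last two, which is the same split the assembly makes between its two labels.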

I promise all this gunk is worth it. I made a few charts so you can see the performance difference between a Go software implementation and a Plan9 simd implementation. There is roughly a 200-450% speedup depending on the number of elements. I hope this memo inspires others to use Plan9 and simd in their future projects!
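Here is a minimal sketch of how a comparison like this can be measured with testing.Benchmark from the standard library. It is an assumption about methodology, not the exact harness behind the numbers above: addFloat32Go and benchmarkAdd are hypothetical names, the element counts are arbitrary, and only the plain Go baseline is timed because the sketch has to run without assembly. A simd implementation with the same signature could be passed to benchmarkAdd in exactly the same way.

```go
package main

import (
	"fmt"
	"testing"
)

// addFloat32Go is a plain Go loop that serves as the baseline.
func addFloat32Go(left, right, result []float32) int {
	n := len(result)
	if len(left) < n {
		n = len(left)
	}
	if len(right) < n {
		n = len(right)
	}
	for i := 0; i < n; i++ {
		result[i] = left[i] + right[i]
	}
	return n
}

// benchmarkAdd times one implementation on slices of the given size.
// The slices are allocated once, outside the timed loop.
func benchmarkAdd(add func(left, right, result []float32) int, size int) testing.BenchmarkResult {
	left := make([]float32, size)
	right := make([]float32, size)
	result := make([]float32, size)
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			add(left, right, result)
		}
	})
}

func main() {
	// Hypothetical element counts; adjust them to the workload of interest.
	for _, size := range []int{16, 1024, 65536} {
		r := benchmarkAdd(addFloat32Go, size)
		fmt.Printf("%6d elements: %d ns/op\n", size, r.NsPerOp())
	}
}
```

Running both implementations over several sizes matters because the fixed cost of a function call weighs more heavily on short slices than on long ones.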

Currently, my package only supports 64-bit x86 machines. If there is enough interest, I will throw in some 64-bit ARM support as well!

