Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SEGSEGV during goroutine switch while utils.int32MaxMinAVX2 is running #279

Open
GeneTinderholm opened this issue Feb 14, 2025 · 4 comments
Labels
Type: bug Something isn't working

Comments

@GeneTinderholm
Copy link

GeneTinderholm commented Feb 14, 2025

Describe the bug, including details regarding any error messages, version, and platform.

I'm using this library to parse individual columns out of parquet files that have a dynamic schema into a struct of arrays.

This library works really nicely for that, and the examples + some debugging were enough for me to use the parquet package.

I ran into an issue, however. I had several runtime exits that seem to happen when goroutine switches happened while utils.int32MaxMinAVX2 is running. I gave the underlying assembly function (convenience link) a once over and it seemed fine. Thought it might be the lack of NOSPLIT, but it seems like you're not actually using anything on the stack save for the function arguments.

I'm only seeing this on linux/amd64, but without a way to consistently trigger switches while the function is running, I wouldn't bet my life that it's the only place this is happening.

I managed to get around it with a runtime.LockOSThread call, and that seems to have solved the issue for now. There's not a lot of documentation in that package that I could find. I was wondering if this implied I was doing something wrong.

I've included a cleaned stack trace of what I'm getting whenever this happens, it always seems to be during an IsValid call.

SIGSEGV: segmentation violation
PC=0x4794ec m=23 sigcode=128 addr=0x0

goroutine 0 gp=0xc0034c5880 m=23 mp=0xc000680e08 [idle]:
runtime.fpTracebackPCs(...)
	/usr/local/go/src/runtime/tracestack.go:258
runtime.traceStack(0xc000680e08?, 0x2?, 0x210)
	/usr/local/go/src/runtime/tracestack.go:116 +0x2ac fp=0x7f2a0ee9bcb8 sp=0x7f2a0ee9b850 pc=0x4794ec
runtime.traceLocker.stack(...)
	/usr/local/go/src/runtime/traceevent.go:176
runtime.traceLocker.GoStop({0xc000680e08?, 0x7f2a0ee9bdb8?}, 0x2)
	/usr/local/go/src/runtime/traceruntime.go:478 +0x85 fp=0x7f2a0ee9bd40 sp=0x7f2a0ee9bcb8 pc=0x4786c5
runtime.traceLocker.GoPreempt(...)
	/usr/local/go/src/runtime/traceruntime.go:473
runtime.goschedImpl(0xc000606a80, 0x1?)
	/usr/local/go/src/runtime/proc.go:4117 +0x7a fp=0x7f2a0ee9bda8 sp=0x7f2a0ee9bd40 pc=0x4555fa
runtime.gopreempt_m(0xc000606a80?)
	/usr/local/go/src/runtime/proc.go:4153 +0x18 fp=0x7f2a0ee9bdc8 sp=0x7f2a0ee9bda8 pc=0x455b98
runtime.mcall()
	/usr/local/go/src/runtime/asm_amd64.s:459 +0x4e fp=0x7f2a0ee9bde0 sp=0x7f2a0ee9bdc8 pc=0x49014e

goroutine 2995994 gp=0xc000606a80 m=23 mp=0xc000680e08 [running]:
runtime.asyncPreempt2()
	/usr/local/go/src/runtime/preempt.go:308 +0x39 fp=0xc00248d478 sp=0xc00248d458 pc=0x44b499
runtime.asyncPreempt()
	/usr/local/go/src/runtime/preempt_amd64.s:53 +0xdb fp=0xc00248d600 sp=0xc00248d478 pc=0x4936fb
github.com/apache/arrow-go/v18/internal/utils.int32MaxMinAVX2({0xc011aee45c?, 0x4?, 0x1b1fe00?})
	/github/home/go/pkg/mod/github.com/apache/arrow-go/[email protected]/internal/utils/min_max_avx2_amd64.go:64 +0x4a fp=0xc00248d638 sp=0xc00248d600 pc=0x168f10a
github.com/apache/arrow-go/v18/internal/utils.GetMinMaxInt32(...)
	/github/home/go/pkg/mod/github.com/apache/arrow-go/[email protected]/internal/utils/min_max.go:190
github.com/apache/arrow-go/v18/parquet/internal/encoding.(*Float64DictConverter).IsValid(0xc001980ab0, {0xc011aee45c, 0x1, 0x1})
	/github/home/go/pkg/mod/github.com/apache/arrow-go/[email protected]/parquet/internal/encoding/typed_encoder.gen.go:1176 +0x38 fp=0xc00248d668 sp=0xc00248d638 pc=0x182f0d8
github.com/apache/arrow-go/v18/parquet/internal/utils.(*RleDecoder).GetBatchWithDictFloat64(0xc015c04240, {0x227e748, 0xc001980ab0}, {0xc004692000, 0xa60, 0x1000})
	/github/home/go/pkg/mod/github.com/apache/arrow-go/[email protected]/parquet/internal/utils/typed_rle_dict.gen.go:957 +0x266 fp=0xc00248d6f0 sp=0xc00248d668 pc=0x174d3e6
github.com/apache/arrow-go/v18/parquet/internal/utils.(*RleDecoder).GetBatchWithDict(0x260?, {0x227e748?, 0xc001980ab0?}, {0x1ac41e0?, 0xc00248d798?})
	/github/home/go/pkg/mod/github.com/apache/arrow-go/[email protected]/parquet/internal/utils/rle.go:421 +0x90 fp=0xc00248d748 sp=0xc00248d6f0 pc=0x1746a70
github.com/apache/arrow-go/v18/parquet/internal/encoding.(*dictDecoder).decode(...)
	/github/home/go/pkg/mod/github.com/apache/arrow-go/[email protected]/parquet/internal/encoding/decoder.go:146
github.com/apache/arrow-go/v18/parquet/internal/encoding.(*DictFloat64Decoder).Decode(0xc001524300, {0xc004692000?, 0xf3b?, 0x1?})
	/github/home/go/pkg/mod/github.com/apache/arrow-go/[email protected]/parquet/internal/encoding/typed_encoder.gen.go:1117 +0x67 fp=0xc00248d7c0 sp=0xc00248d748 pc=0x182ec27
github.com/apache/arrow-go/v18/parquet/file.(*Float64ColumnChunkReader).ReadBatch.func1(0xc002090480?, 0x1000?)
	/github/home/go/pkg/mod/github.com/apache/arrow-go/[email protected]/parquet/file/column_reader_types.gen.go:195 +0xe7 fp=0xc00248d808 sp=0xc00248d7c0 pc=0x1869827
github.com/apache/arrow-go/v18/parquet/file.(*columnChunkReader).readBatch(0xc002090480, 0x1000, {0xc00248f988, 0x1000, 0x1000}, {0xc00248d988, 0x1000, 0x1000}, 0xc00248d8f8)
	/github/home/go/pkg/mod/github.com/apache/arrow-go/[email protected]/parquet/file/column_reader.go:513 +0xc2 fp=0xc00248d8b0 sp=0xc00248d808 pc=0x1867f22
github.com/apache/arrow-go/v18/parquet/file.(*Float64ColumnChunkReader).ReadBatch(0x0?, 0x0?, {0xc004692000?, 0x0?, 0x0?}, {0xc00248f988?, 0x0?, 0x0?}, {0xc00248d988, 0x1000, ...})
	/github/home/go/pkg/mod/github.com/apache/arrow-go/[email protected]/parquet/file/column_reader_types.gen.go:194 +0x7d fp=0xc00248d930 sp=0xc00248d8b0 pc=0x18696dd

Component(s)

Parquet

@GeneTinderholm GeneTinderholm added the Type: bug Something isn't working label Feb 14, 2025
@zeroshade
Copy link
Member

Is it consistently reproducible in any way (perhaps running it N times in a loop until it happens?) If so are you able to share the parquet file you're using and any code?

You could also try running with the noasm tag which would be less intrusive than LockOSThread, but should likely also fix the problem if the assembly is the issue.

I'll take a look into this in the morning, it's really odd that locking the OS thread fixes it. Maybe something weird in the register usage?

@GeneTinderholm
Copy link
Author

That seemed to drop performance rather significantly, which is primary in my use-case. I don't mind managing the os thread locking personally, but it felt hacky enough that I thought double checking might be a good idea.

I'm unfortunately not at liberty to give the parquet files away without checking with some other people first. There's nested columns in some cases, but the leaves are the only thing that are nullable

I wrote a simplified version of the code that's being used that does essentially the same thing.

const batchSize = 4096
var errWrongType = errors.New("enountered column of wrong type")

func ParseFloatCol(rdr *file.Reader, colName string) ([]float64, error) {
    idx := rdr.MetaData().Schema.ColumnIndexByName(colName)
    floatBatch := [batchSize]float64{}
    defLevels := [batchSize]int16{}

    current := int64(0)
    result := make([]float64, rdr.NumRows())
    for i := range rdr.NumRowGroups() {
        rgr := rdr.RowGroup(i)
        col, err := rgr.Column(idx)
        if err != nil {
            return nil, err
        }
        f64Col, ok := col.(*file.Float64ColumnChunkReader)
        if !ok {
            return nil, errWrongType
        }
        total, _, err := f64Col.ReadBatch(batchSize, floatBatch[:], defLevels[:], nil)
        if err != nil {
            return nil, err
        }
        batchIdx := 0
        maxDefLevel := col.Descriptor().MaxDefinitionLevel()
        for i := range total {
            if defLevels[i] == maxDefLevel {
                result[current+i] = floatBatch[batchIdx]
                batchIdx++
            } else {
                result[current+i] = math.NaN()
            }
        }
    }
    return result, nil
}

@zeroshade
Copy link
Member

Can you give me any information on the Parquet file (number of rows/rowgroups/rows per rowgroup? number of columns?)

Since you can't share the file, I have a few ideas that might affect things if you are able to use a local version of the library and make some code changes and test them out.

For instance:

  1. Try adding NOSPLIT to the assembly?
  2. In typed_encoder.gen.go on line 1176: change GetMinMaxInt32(*(*[]int32)(unsafe.Pointer(&idxes))) to just GetMinMaxInt32(idxes)
  3. Add a fmt.Println to print out the idxes being checked and see if there's any pattern we can discern to the slice (length, values, etc.) in the situation when it fails (is it always a specific size, does it always contain a specific value, etc.)
  4. In min_max_avx2_amd64.go in the int32MaxMinAVX2 function, same idea, add a Println(values) to see if there's any common pattern to the slice when it fails. Also, potentially change it so that instead of using the address of the named returns, make explicit min and max variables on the stack there and return those. i.e. change the function to the following:
func int32MaxMinAVX2(values []int32) (int32, int32) {
        var min, max int32
        _int32_max_min_avx2(unsafe.Pointer(unsafe.SliceData(values)), len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
        return min, max
}

maybe the compiler is doing something weird with registers because of the named returns?

What version of Go are you using?

@GeneTinderholm
Copy link
Author

Further update, the runtime.LockOSThread does not seem to have completely solved the issue, just reduced the occurrence.

We're using go 1.23.

I found this issue which seems eerily similar, except for the cgo usage. I don't know how the bp would get nil'd out without it, but the assembly function is using that register, so just in case I updated to go 1.24 (needed an excuse to do that anyways). Just in case the compiler is doing something screwy that nobody was expecting.

I'm going to monitor for a bit to see if I can find another example (we're down to ~1 per day for millions of requests at the moment).

As much info on the parquet as I can share:

  • Each file has between ~30 and ~50 columns, ~80% of which are doubles
  • Encountering files with 1 million+ rows is a fairly common occurrence
  • The vast majority of the files only have a single row group

I'll see what I can do about getting a local copy of the library used. Unfortunately my dev platform is darwin/arm64, so I'm not confident of my ability to reproduce this exact scenario locally. If a deploy/wait cycle becomes too long, I'm just going to spin up something that calls the min/max function in a loop on as many goroutines as possible and leave it running for a while (assuming moving past go 1.23.0 doesn't eliminate this).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants