
Stats auto refresh #7424

Merged — 30 commits merged into main on Feb 8, 2024
Conversation

@max-hoffman (Contributor) commented Jan 27, 2024

Three new system variables configure stats auto refresh: one turns it on, one sets a timer for how often to check whether stats are fresh, and a third sets the freshness threshold that triggers writing new stats. The system variables are global and apply to all databases in a server context. Adding or dropping databases, tables, and indexes is reflected by the statistics provider.

Statistic updates now funnel through a process that preserves buckets for pre-existing chunks and creates new buckets only for new chunk ordinals. The auto-refresh threshold compares the number of new and deleted buckets to the existing bucket count: inserts, updates, and deletes all create or delete chunks relative to the previously saved statistics, and the fraction (new + deleted) / previous count is compared against the threshold.
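The threshold check described above can be sketched in Go. This is a minimal illustration; `shouldRefresh` and its parameter names are hypothetical, not the PR's actual identifiers:

```go
package main

import "fmt"

// shouldRefresh reports whether statistics are stale enough to rebuild.
// newCnt and delCnt are the chunks added/removed since the last stats
// write; prevCnt is the bucket count of the saved statistics; threshold
// is the configured freshness fraction.
func shouldRefresh(newCnt, delCnt, prevCnt int, threshold float64) bool {
	if prevCnt == 0 {
		return true // no saved stats yet; always refresh
	}
	frac := float64(newCnt+delCnt) / float64(prevCnt)
	return frac > threshold
}

func main() {
	// 2 new + 1 deleted against 10 existing buckets = 0.3, above a 0.2 threshold
	fmt.Println(shouldRefresh(2, 1, 10, 0.2))
}
```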

TODO:

  • how much logging is appropriate
  • go server tests
  • is stats dataset replication blocked by default?

other:

  • ANALYZE table and auto-refresh concurrency: last write wins? Acceptable as long as the auto-refresh thread doesn't die
  • drop database/ANALYZE concurrency: error when writing stats to a deleted folder

@coffeegoddd (Contributor) — @max-hoffman DOLT
comparing_percentages: 99.998653 to 99.998653
version 340e43c: not ok 80, ok 5937377, total_tests 5937457
correctness_percentage: 99.998653

@coffeegoddd (Contributor) — @max-hoffman DOLT
comparing_percentages: 99.998653 to 99.998653
version f180d20: not ok 80, ok 5937377, total_tests 5937457
correctness_percentage: 99.998653

@coffeegoddd (Contributor) — @max-hoffman DOLT
comparing_percentages: 99.998653 to 99.998653
version 97d8a04: not ok 80, ok 5937377, total_tests 5937457
correctness_percentage: 99.998653

@coffeegoddd (Contributor) — @max-hoffman DOLT
comparing_percentages: 99.998653 to 99.998653
version 81dba2b: not ok 80, ok 5937377, total_tests 5937457
correctness_percentage: 99.998653

@coffeegoddd (Contributor) — @coffeegoddd DOLT
comparing_percentages: 99.998653 to 99.998653
version 5edf046: not ok 80, ok 5937377, total_tests 5937457
correctness_percentage: 99.998653

@max-hoffman max-hoffman requested a review from nicktobey January 30, 2024 23:34
@coffeegoddd (Contributor) — @max-hoffman DOLT
comparing_percentages: 99.998787 to 99.998653
version 38392dd: not ok 80, ok 5937377, total_tests 5937457
correctness_percentage: 99.998653

@nicktobey (Contributor) left a comment:
Should we have bats tests for setting different values of threshold and interval?

Expected: []sql.Row{{uint64(6), uint64(6), uint64(0)}, {uint64(6), uint64(3), uint64(0)}},
},
},
},
{
Name: "stats updates",
nicktobey (Contributor):

What's being tested here? That the second analyze changes the stats? That the stats auto-update? The name should be more clear about what's being tested.

max-hoffman (Author):

This tests incrementally rewriting stats depending on which chunks were edited, rather than doing a full scan. Changed the test name.


run dolt sql -r csv -q "select count(*) from dolt_statistics"
[ "$status" -eq 0 ]
[ "${lines[1]}" = "8" ]
nicktobey (Contributor):

Why do we expect this to be 8?


run dolt sql -r csv -q "select count(*) from dolt_statistics"
[ "$status" -eq 0 ]
[ "${lines[1]}" = "8" ]
nicktobey (Contributor):

I feel like this should be testing to confirm that something changed from the last select.

}
curCnt := float64(len(curStat.active))
updateCnt := float64(len(idxMeta.updateChunks))
deleteCnt := float64(len(curStat.active) - len(idxMeta.preexisting))
nicktobey (Contributor):

Should this be reversed? The number deleted is the pre-existing minus the current count?

max-hoffman (Author):

The naming here is somewhat rough and ready, but curStat is the existing statistics on disk, and idxMeta is the update metadata. idxMeta.preexisting is the set of chunks in the update metadata that overlap with curStat's chunks. Will rename idxMeta to indicate it's a candidate update.

nicktobey (Contributor):

Yeah this is confusingly named. Pre-existing sounds like the old stats, and curStat sounds like the current (new) stats.

Maybe rename curStats to prevStats?
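For illustration, the counting being debated here can be written out with hypothetical names (prevCnt is the saved bucket count, keptCnt is how many saved chunks survive in the new tree, newCnt is how many chunks need fresh buckets; the PR's actual fields are curStat.active, idxMeta.preexisting, and idxMeta.updateChunks):

```go
package main

import "fmt"

// bucketDelta computes how many histogram buckets were added and deleted
// between the saved statistics and a candidate update. The deleted count
// is the saved buckets whose backing chunks no longer appear, i.e.
// prevCnt - keptCnt, matching the reviewer's reading.
func bucketDelta(prevCnt, keptCnt, newCnt int) (added, deleted int) {
	return newCnt, prevCnt - keptCnt
}

func main() {
	// 10 saved buckets, 7 chunks survived, 4 chunks are new
	added, deleted := bucketDelta(10, 7, 4)
	fmt.Println(added, deleted)
}
```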

var missingOffsets [][]uint64
var offset uint64
for _, n := range levelNodes {
// check if hash is in current list
nicktobey (Contributor):

Can you add a comment explaining what's going on in this block?

max-hoffman (Author):

Added a comment. Basically we partition the target-level histogram chunk hashes for the current tree into 1) preserved or 2) missing, and track scanning metadata for the missing chunks.
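A minimal sketch of that partitioning step, with plain strings standing in for hash.Hash and explicit per-chunk sizes standing in for tree node metadata (all names here are illustrative):

```go
package main

import "fmt"

// partitionChunks splits the chunk hashes at the target tree level into
// those whose saved buckets can be reused (preserved) and those that must
// be rescanned (missing), recording the [start, stop) ordinal range of
// each missing chunk so a scan can target just those rows.
func partitionChunks(levelHashes []string, saved map[string]bool, sizes []int) (preserved []string, missing [][2]int) {
	var offset int
	for i, h := range levelHashes {
		end := offset + sizes[i]
		if saved[h] {
			preserved = append(preserved, h)
		} else {
			missing = append(missing, [2]int{offset, end})
		}
		offset = end
	}
	return preserved, missing
}

func main() {
	saved := map[string]bool{"a": true, "c": true} // chunks with saved buckets
	pres, miss := partitionChunks([]string{"a", "b", "c"}, saved, []int{5, 3, 4})
	fmt.Println(pres, miss)
}
```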

nicktobey (Contributor):

Okay I follow. Is there a particular reason we do it this way by storing all the hashes at some level of the tree, instead of just storing the root hash?

max-hoffman (Author) commented Feb 2, 2024:

Maybe I'm misunderstanding the question, but each histogram bucket is the contents of a chunk at the target level, one-to-one equivalence. We do track the table hash to short-circuit the target level comparison if nothing has changed, but otherwise we use structural sharing to 1) get an estimate of how much has changed, and 2) avoid unnecessarily recomputing buckets that haven't changed.

nicktobey (Contributor):

Ah, okay. I missed that the chunks being tracked here mapped 1-1 with histogram buckets.

This makes sense.

max-hoffman (Author):

Will update docstrings to clarify.

)

var ErrFailedToLoad = errors.New("failed to load statistics")

type DoltStats struct {
level int
chunks []hash.Hash
active map[hash.Hash]int
nicktobey (Contributor):

Can you add docstrings explaining the purpose of these fields? active in particular is unclear to me.

}

var _ sql.StatsProvider = (*Provider)(nil)

// Load scans the statistics tables, populating the |stats| attribute.
// Statistics are not available for reading until we've finished loading.
func (p *Provider) Load(ctx *sql.Context, dbs []dsess.SqlDatabase) error {
p.mu.Lock()
nicktobey (Contributor):

Is it okay that this doesn't have a lock anymore?

max-hoffman (Author):

Rewrote how the concurrency control works. It's fairly straightforward, but I might need to add some Go tests for race detection.
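One straightforward scheme for this kind of load path, sketched with illustrative fields rather than Dolt's actual Provider: build the new stats map outside the lock, then swap it in atomically so readers never observe a half-populated map.

```go
package main

import (
	"fmt"
	"sync"
)

// Provider holds loaded statistics guarded by a mutex shared with the
// background refresh thread. Fields are hypothetical stand-ins for the
// real provider's state.
type Provider struct {
	mu    sync.Mutex
	stats map[string]int
}

// Load scans the given databases and replaces the stats map in one
// critical section; the expensive build happens before the lock is taken.
func (p *Provider) Load(dbs []string) {
	next := make(map[string]int)
	for _, db := range dbs {
		next[db] = 0 // placeholder for scanning that database's stats ref
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	p.stats = next // atomic swap of the fully built map
}

func main() {
	p := &Provider{}
	p.Load([]string{"mydb", "otherdb"})
	fmt.Println(len(p.stats))
}
```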

keyBuilder.PutString(0, qual.Database)
keyBuilder.PutString(1, qual.Table())
keyBuilder.PutString(2, qual.Index())
keyBuilder.PutInt64(3, 10000)
nicktobey (Contributor):

Add a comment for which column is being written here, and why it gets a hardcoded 10k.

max-hoffman (Author):

Added a maxBucketFanout constant to make this clearer.

if i < len(oldHist) && oldHist[i].Chunk == chunkAddr {
mergeHist = append(mergeHist, oldHist[i])
i++
} else if j < len(newHist) && newHist[j].Chunk == chunkAddr {
nicktobey (Contributor):

Possible bug: if an identical chunk appears in both the old and new histograms, then j will never be incremented and all the new chunks that come after it will be skipped.

max-hoffman (Author):

Yes, this was a bug.
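The pitfall the reviewer flagged is the classic merge-walk bug: when the same chunk address appears in both lists, only one cursor advances and later new buckets get skipped. A simplified sketch of a corrected merge, with string addresses and a toy bucket type standing in for the real histogram types:

```go
package main

import "fmt"

type bucket struct {
	Chunk string // chunk address backing this histogram bucket
	Rows  int
}

// mergeHistograms walks chunkAddrs (the chunk order of the new tree),
// reusing an old bucket when its chunk survived and taking the newly
// computed bucket otherwise. When a chunk appears in both histograms,
// both cursors advance so subsequent new buckets are not skipped.
func mergeHistograms(chunkAddrs []string, oldHist, newHist []bucket) []bucket {
	var merged []bucket
	i, j := 0, 0
	for _, addr := range chunkAddrs {
		oldMatch := i < len(oldHist) && oldHist[i].Chunk == addr
		newMatch := j < len(newHist) && newHist[j].Chunk == addr
		switch {
		case oldMatch:
			merged = append(merged, oldHist[i])
			i++
			if newMatch {
				j++ // the fix: advance both cursors on a shared chunk
			}
		case newMatch:
			merged = append(merged, newHist[j])
			j++
		}
	}
	return merged
}

func main() {
	old := []bucket{{"a", 10}, {"b", 20}}
	fresh := []bucket{{"b", 21}, {"c", 30}}
	fmt.Println(mergeHistograms([]string{"a", "b", "c"}, old, fresh))
}
```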

func firstRowForIndex(ctx *sql.Context, prollyMap prolly.Map, keyBuilder *val.TupleBuilder, prefixLen int) (sql.Row, error) {
buffPool := prollyMap.NodeStore().Pool()

firstIter, err := prollyMap.IterOrdinalRange(ctx, 0, 1)
nicktobey (Contributor):

Why are these values of start and stop used?

max-hoffman (Author):

The first row is ordinal 0, so IterOrdinalRange(ctx, 0, 1) yields exactly that row.

@coffeegoddd (Contributor) — @max-hoffman DOLT
comparing_percentages: 99.999074 to 99.998653
version b5c93ad: not ok 80, ok 5937377, total_tests 5937457
correctness_percentage: 99.998653

@coffeegoddd (Contributor) — @coffeegoddd DOLT
comparing_percentages: 99.999074 to 99.998653
version 5288c1b: not ok 80, ok 5937377, total_tests 5937457
correctness_percentage: 99.998653

@max-hoffman (Author):

I'm going to try to get some sort of tests that are race sensitive, which might be a better format for adding finer-grained parameter testing.

@coffeegoddd (Contributor) — @max-hoffman DOLT
comparing_percentages: 99.999562 to 99.999562
version 7198173: not ok 26, ok 5937431, total_tests 5937457
correctness_percentage: 99.999562

@coffeegoddd (Contributor) — @max-hoffman DOLT
comparing_percentages: 99.999562 to 99.999562
version bbfbc36: not ok 26, ok 5937431, total_tests 5937457
correctness_percentage: 99.999562

@coffeegoddd (Contributor) — @coffeegoddd DOLT
comparing_percentages: 99.999562 to 99.999562
version 18fd075: not ok 26, ok 5937431, total_tests 5937457
correctness_percentage: 99.999562

@max-hoffman (Author):

@nicktobey I added some functions for controlling auto refresh, plus minimal tests confirming they stop, restart, and drop stats. The intention is to give a bit more control when things go wrong. I had to refactor some of the stats provider lifecycle so that background threads and sessions have consistent root value views, and I added concurrency testing for some common cases. There's probably a lot more testing we could add, but I think this is a comfortable MVP: things mostly work, and when they don't, we mostly won't brick the database. It might be helpful to check TestStatsAutoRefreshConcurrency, enginetest.StatProcTests, and dprocedures/stat_funcs.go.

@coffeegoddd (Contributor) — @max-hoffman DOLT
comparing_percentages: 99.999562 to 99.999562
version be41ce9: not ok 26, ok 5937431, total_tests 5937457
correctness_percentage: 99.999562

"github.com/dolthub/dolt/go/libraries/doltcore/sqle/dsess"
)

func statsFunc(name string) func(ctx *sql.Context, args ...string) (sql.RowIter, error) {
nicktobey (Contributor):

Switching on the string here feels unnecessary. Can't DoltProcedures just do something like:

{Name: "dolt_stats_drop", Schema: doltMergeSchema, Function: statsFunc(statsDrop)},

max-hoffman (Author):

Yep, good idea.
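The suggested registration style can be sketched with placeholder types (Dolt's real procedure table uses richer signatures such as sql.ExternalStoredProcedureDetails; the types and row values here are illustrative):

```go
package main

import "fmt"

type rowIter []string

// Procedure mirrors the shape of a stored-procedure registration entry:
// each entry binds a name directly to its implementation, so no string
// switch inside a single statsFunc dispatcher is needed.
type Procedure struct {
	Name     string
	Function func(args ...string) (rowIter, error)
}

func statsDrop(args ...string) (rowIter, error)    { return rowIter{"dropped stats"}, nil }
func statsRestart(args ...string) (rowIter, error) { return rowIter{"restarted stats collection"}, nil }

// procedures registers each stats function directly, as the reviewer
// suggested with {Name: "dolt_stats_drop", Function: statsFunc(statsDrop)}.
var procedures = []Procedure{
	{Name: "dolt_stats_drop", Function: statsDrop},
	{Name: "dolt_stats_restart", Function: statsRestart},
}

func main() {
	for _, p := range procedures {
		rows, _ := p.Function()
		fmt.Println(p.Name, rows[0])
	}
}
```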

}
return fmt.Sprintf("restarted stats collection: %s", ref.StatsRef{}.String()), nil
}
return nil, nil
nicktobey (Contributor):

Suggested change:
- return nil, nil
+ return nil, fmt.Errorf("provider does not implement AutoRefreshStatsProvider")


sqlDb, _ := harness.provider.BaseDatabase(harness.NewContext(), "mydb")

// auto-refresh with a 0-interval, constant loop
nicktobey (Contributor):

Make this comment more explicit, e.g.: setting an interval of 0 and a threshold of 0 results in the stats being updated after every operation.

_, iter, err := engine.Query(ctx, q)
require.NoError(t, err)
_, err = sql.RowIterToRows(ctx, iter)
//fmt.Printf("%s %d\n", tag, id)
nicktobey (Contributor):

Remove this commented out line.

if strings.EqualFold(db, qual.Database) && strings.EqualFold(table, qual.Table()) {
if s.AvgSize > avgSize {
avgSize = s.AvgSize
}
nicktobey (Contributor):

This should probably be returning avgSize, not 0.

for qual, s := range dbStat.stats {
if strings.EqualFold(db, qual.Database) && strings.EqualFold(table, qual.Table()) {
if s.AvgSize > avgSize {
avgSize = s.AvgSize
nicktobey (Contributor):

Why do we take the highest value for avgSize across all stats?

max-hoffman (Author):

Changed this to match the primary index.
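For illustration, here is the loop with the reviewer's return-value correction applied, using simplified stand-in types (the PR ultimately reads the primary index's AvgSize directly rather than taking a max across indexes):

```go
package main

import (
	"fmt"
	"strings"
)

// idxStat is a stripped-down stand-in for the per-index statistic the
// real loop iterates; only the fields the snippet touches are kept.
type idxStat struct {
	Database, Table string
	AvgSize         uint64
}

// tableAvgSize returns the recorded average row size for |db|.|table|,
// matching case-insensitively and returning the accumulated avgSize
// rather than 0, as the review pointed out.
func tableAvgSize(stats []idxStat, db, table string) uint64 {
	var avgSize uint64
	for _, s := range stats {
		if strings.EqualFold(db, s.Database) && strings.EqualFold(table, s.Table) {
			if s.AvgSize > avgSize {
				avgSize = s.AvgSize
			}
		}
	}
	return avgSize
}

func main() {
	stats := []idxStat{{"mydb", "t", 24}, {"mydb", "t", 32}, {"mydb", "u", 99}}
	fmt.Println(tableAvgSize(stats, "MYDB", "T")) // case-insensitive match
}
```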

"github.com/dolthub/dolt/go/store/prolly"
"github.com/dolthub/dolt/go/store/prolly/tree"
stypes "github.com/dolthub/dolt/go/store/types"
"github.com/dolthub/dolt/go/store/val"
)

const maxBucketFanout = 200 * 200
nicktobey (Contributor):

Add a comment explaining why this value was chosen.

@coffeegoddd (Contributor) — @max-hoffman DOLT
comparing_percentages: 99.999562 to 99.999562
version f09626b: not ok 26, ok 5937431, total_tests 5937457
correctness_percentage: 99.999562

@max-hoffman max-hoffman merged commit d019cce into main Feb 8, 2024
21 checks passed
@max-hoffman max-hoffman deleted the max/stats-refresh branch February 8, 2024 02:20
3 participants