
feat(index): add lotus shed commands to prune all indexes #12393

Open · wants to merge 13 commits into master
Conversation

@akaladarshi (Contributor) commented Aug 15, 2024

Related Issues

Fixes: #12377

Proposed Changes

  • Add lotus-shed commands to prune the indexes:

    • Add a prune-txhash command to prune the tx hash index
    • Add a prune-events command to prune the events index
    • Refactor the prune-msgindex command to prune the msg index
    • Add a prune-all-indexes command to prune all the indexes (msg, events, tx hash) at once
  • Use the --older-than flag to get the height; indexes older than the provided height are pruned

@akaladarshi (Contributor, Author) commented Aug 15, 2024

@aarshkshah1992 We can also create separate commands for pruning the events and txhash DBs.

Need suggestions on a couple of things:

  • Instead of going to each event one by one, this PR deletes directly based on the height and epochs
  • Whether to use goroutines

@akaladarshi akaladarshi marked this pull request as ready for review August 15, 2024 13:55
&cli.IntFlag{
	Name:  "from",
	Value: 0,
	Usage: "the tipset height to start backfilling from (0 is head of chain)",
Member

"pruning", not "backfilling"?

Contributor Author

Done.

if err := rows.Scan(&id, &height); err != nil {
	return "", "", "", xerrors.Errorf("error scanning event id: %w", err)
}
eventIds = append(eventIds, id)
Member

oh, this could get big, probably not ideal for an sql query

Contributor Author

Updated.

@rvagg (Member) commented Aug 16, 2024

I think this is going to be pretty inefficient doing the select and building a really long targeted delete; can we not just use a range delete? I'm also not sure it's worth the bother doing a full range with these (i.e. the "epochs" arg probably doesn't get used by anyone). I think you'd just want an "older than" (in fact, you should rename "from" to something like that: --older-than?).

So, can we just bundle the heavy lifting directly into the queries?

DELETE FROM event_entry WHERE event_id IN (SELECT id FROM event WHERE height < ?);
DELETE FROM event WHERE height < ?;
DELETE FROM events_seen WHERE height < ?;

@akaladarshi (Contributor, Author) commented Aug 16, 2024

> I think this is going to be pretty inefficient doing the select and building a really long targeted delete, can we not just use a range delete? I'm also not sure it's worth the bother doing a full range with these (i.e. the "epochs" arg probably doesn't get used by anyone). I think you'd just want to have an "older than" (in fact, you should rename "from" to be something like that, --older-than?).
>
> So, can we just bundle the heavy lifting directly into the queries?
>
> DELETE FROM event_entry WHERE event_id IN (SELECT id FROM event WHERE height < ?);
> DELETE FROM event WHERE height < ?;
> DELETE FROM events_seen WHERE height < ?;
  • Yeah, you're correct, range queries will be more efficient; I will update it.

    • I was thinking in the direction of how the data was inserted (event -> event_entry -> events_seen), but we can delete them directly (much more efficient).
  • As per the discussion we (@aarshkshah1992 and I) had, we decided to include the epoch part, but as you say it won't be required much, and the queries will become simpler. @aarshkshah1992 please share your views as well before I get into the changes.

  • As for the --older-than flag, it makes much more sense than from, but the other prune command prune-msgindex was using the from flag, so I wanted to maintain consistency.

    • So should I update there (prune-msgindex) as well?
    • Also, I was thinking we could have separate pruning commands for each index; what are your thoughts on it @rvagg?

@rvagg (Member) commented Aug 16, 2024

I don't think it needs symmetry with the backfill commands, and honestly I find 'from' a little bit awkward for those anyway (I added the "going backwards" note in there because of that awkwardness); I kind of wish they were something like --start=4100000 --end=-10000, but it's not too important I think.

Separate sub-commands make sense to me rather than all in one. You may only have one of these indexes enabled anyway.

@aarshkshah1992 (Contributor)

@akaladarshi I agree with Rod. Looking at this now, separate pruning cmds for all indices where we "prune everything older than epoch X" makes a lot of sense.

@akaladarshi (Contributor, Author) commented Aug 16, 2024

> @akaladarshi I agree with Rod. Looking at this now, separate pruning cmds for all indices where we "prune everything older than epoch X" makes a lot of sense.

I was inclined towards having both: a single command to prune all the indexes at once, as well as individual commands.

From a code perspective, there won't be much change as I have already separated each index pruning into its function. It's a matter of calling those functions (individually or together).

@akaladarshi akaladarshi changed the title from "feat(index): add prune all indexes command" to "feat(index): add lotus shed commands to prune all indexes" on Aug 16, 2024
@akaladarshi akaladarshi requested a review from rvagg August 19, 2024 04:17
Comment on lines 910 to 927
basePath, err := homedir.Expand(cctx.String("repo"))
if err != nil {
	return err
}

startHeight := cctx.Int64("older-than")
if startHeight == 0 {
	currTs, err := api.ChainHead(ctx)
	if err != nil {
		return err
	}

	startHeight += int64(currTs.Height())

	if startHeight == 0 {
		return xerrors.Errorf("bogus start height %d", startHeight)
	}
}
Member

how about you refactor this into a utility function since this block is repeated 4 times: basePath, startHeight, err := parsePruneArgs(cctx)
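The suggested helper might look something like this, a minimal sketch only: parsePruneArgs here takes plain values, whereas the real version would read the repo path and --older-than flag from cctx and call the node API for the chain head, so the signature and names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// parsePruneArgs sketches the utility suggested above: expand the repo
// path and resolve the --older-than height in one place instead of
// repeating the block four times.
func parsePruneArgs(repo string, olderThan, chainHead int64) (string, int64, error) {
	basePath := repo
	if strings.HasPrefix(basePath, "~") {
		home, err := os.UserHomeDir()
		if err != nil {
			return "", 0, err
		}
		basePath = filepath.Join(home, strings.TrimPrefix(basePath, "~"))
	}

	// A zero --older-than means "start from the current chain head".
	startHeight := olderThan
	if startHeight == 0 {
		startHeight = chainHead
	}
	if startHeight == 0 {
		return "", 0, fmt.Errorf("bogus start height %d", startHeight)
	}
	return basePath, startHeight, nil
}

func main() {
	base, height, err := parsePruneArgs("/var/lib/lotus", 0, 1867646)
	fmt.Println(base, height, err)
}
```

Each command body then shrinks to one call plus an error check, which is the point of the review comment.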

Contributor Author

Done.

@rvagg (Member) commented Aug 19, 2024

This is neat, thanks @akaladarshi. I have a suggestion, but I don't want to impose this as a requirement, just if you think it's interesting and worth doing: a --dry-run for each of these that does a query instead of a delete for the given args. You could SELECT count(*) FROM events WHERE height < ? and the same for >= ? so you can report something like "prune would delete X and leave Y events".

I think this should even work to do both in one query, for each table:

SELECT
    SUM(CASE WHEN height < ? THEN 1 ELSE 0 END) AS delete_count,
    SUM(CASE WHEN height >= ? THEN 1 ELSE 0 END) AS leave_count
FROM events

This is just a nice feature, I think. Running this on my own node, I don't think I want to be pruning at the moment (I'm building up a big db which is helpful for debugging), but I would like to see what it would do, so that's what I came up with.
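A sketch of how the dry-run report could be wired up. The rowScanner interface and fakeRow type are hypothetical stand-ins for *sql.Row so the logic can be shown without a live SQLite handle; only the SQL itself comes from the comment above.

```go
package main

import "fmt"

// dryRunQuery counts, in one pass, the rows a prune at the given height
// would delete versus leave behind (from the suggestion above).
const dryRunQuery = `SELECT
    SUM(CASE WHEN height < ? THEN 1 ELSE 0 END) AS delete_count,
    SUM(CASE WHEN height >= ? THEN 1 ELSE 0 END) AS leave_count
FROM event`

// rowScanner abstracts *sql.Row for this sketch.
type rowScanner interface {
	Scan(dest ...any) error
}

// dryRunReport scans both counts and renders the suggested message.
func dryRunReport(row rowScanner) (string, error) {
	var deleteCount, leaveCount int64
	if err := row.Scan(&deleteCount, &leaveCount); err != nil {
		return "", err
	}
	return fmt.Sprintf("prune would delete %d and leave %d events", deleteCount, leaveCount), nil
}

// fakeRow stands in for db.QueryRow(dryRunQuery, h, h) in this sketch.
type fakeRow struct{ del, keep int64 }

func (f fakeRow) Scan(dest ...any) error {
	*(dest[0].(*int64)) = f.del
	*(dest[1].(*int64)) = f.keep
	return nil
}

func main() {
	msg, _ := dryRunReport(fakeRow{del: 1000, keep: 250})
	fmt.Println(msg) // prune would delete 1000 and leave 250 events
}
```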

@akaladarshi (Contributor, Author)

> This is neat, thanks @akaladarshi. I have a suggestion, but I don't want to impose this as a requirement, just if you think it's interesting and worth doing: a --dry-run for each of these that does a query instead of a delete for the given args. You could SELECT count(*) FROM events WHERE height < ? and the same for >= ? so you can report something like "prune would delete X and leave Y events".
>
> I think this should even work to do both in one query, for each table:
>
> SELECT
>     SUM(CASE WHEN height < ? THEN 1 ELSE 0 END) AS delete_count,
>     SUM(CASE WHEN height >= ? THEN 1 ELSE 0 END) AS leave_count
> FROM events
>
> This is just a nice feature, I think. Running this on my own node, I don't think I want to be pruning at the moment (I'm building up a big db which is helpful for debugging), but I would like to see what it would do, so that's what I came up with.

That sounds like a good-to-have feature. I will start working on it once I am done testing this PR locally with a calibration testnet node, before the merge and the migration of indexes to a common DB.

@akaladarshi akaladarshi requested a review from rvagg August 20, 2024 04:20
blkMsgs, err := api.ChainGetBlockMessages(ctx, blockheader.Cid())
if err != nil {
	log.Infof("failed to get block messages at height: %d", currTs.Height())
	continue // should we break here?
Contributor Author

@rvagg should we break here or keep going till the first tipset?

Is there a possibility that a block is missing in between?

While testing I saw one block was missing, and after that it kept going on like this.

Testing logs:

	2024-08-21T20:21:39.118+0530 INFO lotus-shed lotus-shed/indexes.go:1148 failed to get block messages at height: 1867646
	2024-08-21T20:21:39.122+0530 INFO lotus-shed lotus-shed/indexes.go:1148 failed to get block messages at height: 1867641
	2024-08-21T20:21:39.127+0530 INFO lotus-shed lotus-shed/indexes.go:1148 failed to get block messages at height: 1867633
	2024-08-21T20:21:39.134+0530 INFO lotus-shed lotus-shed/indexes.go:1148 failed to get block messages at height: 1867623

Member

Oh, this one is tricky! Because we don't have an epoch to delete against, we can't do it in bulk, as you've discovered; but we also can't do it this way, because we kind of want the inverse: we want to delete entries for messages where we don't have the message in the blockstore ourselves. So it's almost like: collect all messages, then delete everything not in that list. Which is impractical, because that could get very large.

The other problem with using the chain to do this is that there are going to be messages not on the canonical chain that we need to consider (i.e. there's been a reorg, so walking back from some tip doesn't lead us to tipsets that were previously considered canonical but are no longer). Perhaps we want to include them, perhaps we should delete them because they are non-canonical; it's actually not even clear what you'd want to do, because it depends on how the lookup API is being used. But conceptually what you'd want is something like: nothing in this txhashdb that isn't also in my blockstore. Because if I can't get to the original message anyway, then there's no real help in being able to translate it from an eth hash.

BUT: just doing a quick look at the db, it's the one we already have GC in, and we achieve that by timestamp. If you turn on EthTxHashMappingLifetimeDays then it'll clean up according to the timestamp it was inserted at.

So I think the strategy needs to change for this table: we just can't do it on epoch unfortunately; we either need to not offer it here at all, or we need to come up with a timestamp and run something like the existing functionality:

func (ei *EthTxHashLookup) DeleteEntriesOlderThan(days int) (int64, error) {

We can get a timestamp. So, if you want to keep this pruning here, you could keep the existing api.ChainGetTipSetByHeight(ctx, abi.ChainEpoch(startHeight), types.EmptyTSK) to get the tipset, look at the first block and get its timestamp (e.g. like this), then work out how many seconds ago it is and run the delete using datetime() maths, something like DELETE FROM eth_tx_hashes WHERE insertion_time < datetime('now', '-X seconds'). Note how sqlite datetime maths works:

sqlite> select datetime('now'), datetime('now', '-10 seconds'), datetime('now', '-86400 seconds');
2024-08-22 03:38:03|2024-08-22 03:37:53|2024-08-21 03:38:03

This is one area where we could apply some reuse of existing code: in eth_transaction_hash_lookup.go, we could expose a method to reuse the existing delete statement, and since we only run this once per hour on the existing GC, we can get rid of the prepared statement too and expose a public function to run it on any instance of a db. Here's a quick go at what that may look like over in that file:

func (ei *EthTxHashLookup) DeleteEntriesOlderThan(days int) (int64, error) {
	if ei.db == nil {
		return 0, xerrors.New("db closed")
	}
	epochs := abi.ChainEpoch(uint64(days) * builtin.SecondsInDay / build.BlockDelaySecs)
	return DeleteEntriesOlderThan(ei.db, epochs)
}

// DeleteEntriesOlderThan deletes entries older than the given number of epochs from now.
func DeleteEntriesOlderThan(db *sql.DB, epochs abi.ChainEpoch) (int64, error) {
	secondsAgo := int64(epochs) * int64(build.BlockDelaySecs)
	res, err := db.Exec(deleteOlderThan, "-"+strconv.FormatInt(secondsAgo, 10)+" seconds")
	if err != nil {
		return 0, err
	}
	return res.RowsAffected()
}
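The timestamp step described above (block timestamp to sqlite modifier) can be sketched as a small pure helper; secondsAgoModifier is a hypothetical name, and the real code would take "now" from the wall clock and the block timestamp from the tipset's first block.

```go
package main

import (
	"fmt"
	"strconv"
)

// secondsAgoModifier turns a block timestamp (seconds since the Unix
// epoch) into the sqlite datetime() modifier used by the delete above,
// i.e. the ? in datetime('now', ?).
func secondsAgoModifier(now, blockTimestamp int64) string {
	return "-" + strconv.FormatInt(now-blockTimestamp, 10) + " seconds"
}

func main() {
	// A block exactly one day old yields a -86400 seconds modifier.
	fmt.Println(secondsAgoModifier(1724298000, 1724211600)) // -86400 seconds
}
```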

Contributor Author

Updated as per your suggestion of reusing code.

totalRowsAffected += rowsAffected
log.Infof("pruned %s table, rows affected: %d", eventTable, rowsAffected)

res, err = tx.Exec(deleteEventEntriesByEventIDs, startHeight)
@akaladarshi (Contributor, Author) commented Aug 21, 2024

After executing deleteEventEntriesByEventIDs, res.RowsAffected() returns 0 even though the data is being deleted in the DB.

My guess is that SQLite is not able to figure out how many rows are affected because there is a subquery inside it.

What are your thoughts on this, @rvagg?

Member

you just need to swap this with deleteEventFromStartHeight: this query relies on the entries being in the event table because of the sub-select, but you've just deleted the ones it's trying to match; so clean up event_entry first, then event, and you're all good
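The ordering constraint can be made explicit by issuing the deletes in dependency order inside one transaction. A sketch, with execer and fakeTx as hypothetical stand-ins for *sql.Tx so it runs without a real SQLite handle:

```go
package main

import "fmt"

// execer abstracts *sql.Tx for this sketch.
type execer interface {
	Exec(query string, args ...any) (int64, error)
}

// pruneEventsOlderThan runs the range deletes in dependency order:
// event_entry rows are matched via a sub-select against event, so they
// must be deleted before the event rows they reference disappear.
func pruneEventsOlderThan(tx execer, height int64) error {
	queries := []string{
		"DELETE FROM event_entry WHERE event_id IN (SELECT id FROM event WHERE height < ?)",
		"DELETE FROM event WHERE height < ?",
		"DELETE FROM events_seen WHERE height < ?",
	}
	for _, q := range queries {
		if _, err := tx.Exec(q, height); err != nil {
			return fmt.Errorf("pruning events index: %w", err)
		}
	}
	return nil
}

// fakeTx records the statements it receives, standing in for *sql.Tx.
type fakeTx struct{ queries []string }

func (f *fakeTx) Exec(q string, args ...any) (int64, error) {
	f.queries = append(f.queries, q)
	return 0, nil
}

func main() {
	tx := &fakeTx{}
	if err := pruneEventsOlderThan(tx, 1867646); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(tx.queries), "statements executed")
}
```

Running event_entry first also means RowsAffected() on each statement reports real per-table counts, which is what the surrounding discussion was tripping over.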

Contributor Author

As I mentioned, the data is still being deleted, but it's not returning the number of rows affected.

Anyway, I will update it as you mentioned.

Member

But how do you know it's being deleted? You've lost the entries they join with on the event table. Have you tried a select count(*) from event and select count(*) from event_index to see that they're actually deleted? Because the subquery works on rows that were previously deleted, you should not have the join and therefore no deletion.

Contributor Author

You're correct, I was checking it the wrong way.

defer closer()
ctx := lcli.ReqContext(cctx)
api := srvc.FullNodeAPI()
closer := srvc.Close
Member

maybe just keep the pattern the same across these; you have some churn in naming and an extra closer here. The one below in pruneTxHashCmd is clean:

		srv, err := lcli.GetFullNodeServices(cctx)
		if err != nil {
			return err
		}
		defer srv.Close() //nolint:errcheck

		api := srv.FullNodeAPI()
		ctx := lcli.ReqContext(cctx)

Contributor Author

Done.

}

// pruneEventsIndex is a helper function that prunes the events index.db for a number of epochs starting from a specified height
func pruneEventsIndex(_ context.Context, basePath string, startHeight int64) error {
Member

just ditch the context arg if you have no way of using it, I know it's nice to have symmetry but if you don't need the symmetry for any tactical reason then just keep it clean

Contributor Author

Done.


// database related constants
const (
databaseType = "sqlite"
Member

I found this naming confusing when parsing the path resolution below, perhaps something like indexesDir? i.e. it's the name of a directory and isn't used to refer to the type of database anywhere

Contributor Author

I didn't realise this until after I ran the node; initially, I thought it was the DB type.

Anyway, I have updated it.

Comment on lines 1037 to 1040
err := db.Close()
if err != nil {
fmt.Printf("ERROR: closing db: %s", err)
}
Member

prefer compact form where you can get away with it, IMO it helps with Go's nasty eye-sprawl

Suggested change:

	err := db.Close()
	if err != nil {
		fmt.Printf("ERROR: closing db: %s", err)
	}

becomes:

	if err := db.Close(); err != nil {
		fmt.Printf("ERROR: closing db: %s", err)
	}

Contributor Author

Done.

}

totalRowsAffected += rowsAffected
log.Infof("pruned %s table, rows affected: %d", eventTable, rowsAffected)
Member

not critical: I suggest keeping track of these numbers separately and then logging after you've committed the transaction, because until you do, they're not really gone; if you have a rollback then these log messages will be incorrect

@rvagg (Member) commented Aug 22, 2024

if you're concerned about progress logging, print a message before each execution: log.Infof("pruning `%s` table", eventTable); then at least the user knows it's working if this takes a long time for some reason (it shouldn't, unless it's on a really slow disk)

Contributor Author

Updated.

Comment on lines 1094 to 1097
err = tx.Commit()
if err != nil {
return xerrors.Errorf("error committing tx: %w", err)
}
Member

Suggested change:

	err = tx.Commit()
	if err != nil {
		return xerrors.Errorf("error committing tx: %w", err)
	}

becomes:

	if err = tx.Commit(); err != nil {
		return xerrors.Errorf("error committing tx: %w", err)
	}

Contributor Author

Done.

Comment on lines 1207 to 1210
err := db.Close()
if err != nil {
fmt.Printf("ERROR: closing db: %s", err)
}
Member

Suggested change:

	err := db.Close()
	if err != nil {
		fmt.Printf("ERROR: closing db: %s", err)
	}

becomes:

	if err := db.Close(); err != nil {
		fmt.Printf("ERROR: closing db: %s", err)
	}

Contributor Author

Done.

@akaladarshi akaladarshi requested a review from rvagg August 26, 2024 06:24
@akaladarshi akaladarshi force-pushed the akaladarshi/lotus-shed-prune-all-indexes-cmd branch from e60187d to d092ede on August 26, 2024 06:37
@akaladarshi akaladarshi force-pushed the akaladarshi/lotus-shed-prune-all-indexes-cmd branch from d092ede to e25ebee on August 26, 2024 06:37
@BigLep (Member) commented Sep 17, 2024

@akaladarshi : what are the next steps for this PR so it can be merged?

@akaladarshi (Contributor, Author)

> @akaladarshi : what are the next steps for this PR so it can be merged?

@BigLep I am not sure about this PR as of now because we have removed the existing indexes in favour of #12421.

I think @aarshkshah1992 or @rvagg are the best people who can comment on this.

Successfully merging this pull request may close these issues:

  • Implement a lotus-shed command to garbage collect all the indices not available in the state store

4 participants