Revamp age csv loader (#2044) #2068

Merged (1 commit) on Aug 22, 2024

MuhammadTahaNaveed

  • Allow 0 as entry_id
  • Use batch inserts to improve performance
    • Changed heap_insert to heap_multi_insert, since it is faster than calling heap_insert() in a loop. When multiple tuples fit on a single page, one WAL record covers all of them and the page only needs to be locked and unlocked once.
    • BATCH_SIZE is set to 1000; this is the number of tuples inserted in a single batch. The value was chosen after some experimentation.
    • Change some of the field names to avoid confusion.
  • Use sequence for generating ids for edge and vertex
    • The sequence is not used when id_field_exists is true in the load_labels_from_file function, since the entry id is already present in the CSV.
  • Add a function to create a temporary table for ids (used only when loading vertices)
    • The first time load_labels_from_file is called, a temporary table is created and populated with the vertex ids that have already been generated. A unique index is created on the id column to ensure that new ids (built from the entry id in the CSV) are unique. The table and index are dropped automatically when the session ends.
    • Whenever a row is inserted in labels, the corresponding id is inserted into temp table as well.
  • Add functions to create graph and label automatically
    • These functions will check existence of graph and label, and create them if they don't exist.

* Allow 0 as entry_id

- No regression tests were impacted by this change.

* Use batch inserts to improve performance

- Changed heap_insert to heap_multi_insert, since it is faster than
  calling heap_insert() in a loop. When multiple tuples fit on a
  single page, one WAL record covers all of them and the page only
  needs to be locked and unlocked once.

- BATCH_SIZE is set to 1000, which is the number of tuples to insert in
  a single batch. This number was chosen after some experimentation.

- Change some of the field names to avoid confusion.
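
The batching pattern described above can be sketched in plain C. This is only an illustration under stated assumptions: `Row`, `BatchState`, `flush_batch`, `add_row`, and `load_rows` are hypothetical names, and `flush_batch` merely counts calls where the loader would issue its single bulk insert. Only the `BATCH_SIZE` value of 1000 comes from the PR.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_SIZE 1000  /* value stated in the PR description */

/* Hypothetical stand-in for a row; the real loader builds heap tuples. */
typedef struct { long id; } Row;

typedef struct {
    Row    rows[BATCH_SIZE];
    size_t count;         /* rows currently buffered */
    size_t flush_calls;   /* number of bulk writes issued */
    size_t rows_written;  /* total rows handed to the bulk writer */
} BatchState;

/* Stand-in for the one bulk call per batch (heap_multi_insert in the PR). */
static void flush_batch(BatchState *b)
{
    if (b->count == 0)
        return;               /* nothing buffered, nothing to write */
    b->flush_calls++;
    b->rows_written += b->count;
    b->count = 0;
}

/* Buffers one row and flushes as soon as the buffer is full. */
static void add_row(BatchState *b, Row r)
{
    assert(b->count < BATCH_SIZE);
    b->rows[b->count++] = r;
    if (b->count == BATCH_SIZE)
        flush_batch(b);
}

/* Loads n rows, then flushes the final partial batch.
 * Returns the number of bulk writes that were needed. */
static size_t load_rows(BatchState *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        add_row(b, (Row){ .id = (long) i });
    flush_batch(b);
    return b->flush_calls;
}
```

The final `flush_batch` call in `load_rows` matters: without it, rows left over after the last full batch would never be written.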

* Use sequence for generating ids for edge and vertex

- The sequence is not used when id_field_exists is true in the
  load_labels_from_file function, since the entry id is already
  present in the CSV.
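
A minimal sketch of that choice, assuming a simplified in-memory counter in place of a real PostgreSQL sequence. `Sequence`, `sequence_nextval`, and `choose_entry_id` are hypothetical names for illustration; only `id_field_exists` appears in the PR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for a database sequence. */
typedef struct { int64_t last_value; } Sequence;

/* Advances the counter and returns the new value. */
static int64_t sequence_nextval(Sequence *s)
{
    return ++s->last_value;
}

/* Picks the entry id for one CSV row: the id from the file when the
 * id column is present, otherwise the next sequence value. The
 * sequence is left untouched in the first case. */
static int64_t choose_entry_id(bool id_field_exists,
                               int64_t csv_entry_id,
                               Sequence *seq)
{
    if (id_field_exists)
        return csv_entry_id;
    return sequence_nextval(seq);
}
```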

* Add function to create temporary table for ids

- Create a temporary table and populate it with the vertex ids that
  have already been generated. A unique index is created on the id
  column to ensure that new ids (built from the entry id in the CSV)
  are unique.

* Insert generated ids in the temporary table to enforce uniqueness

- Insert ids into the temporary table and update its index to
  enforce uniqueness.
- If the entry id provided in the CSV is greater than the current
  sequence value, the sequence value is updated to match the entry id.
  For example, suppose the current sequence value is 1 and the CSV
  entry id is 2. If we use 2 without updating the sequence to 2, the
  next time the CREATE clause is used the sequence will return 2 as
  an entry id, resulting in a duplicate.
- Update batch functions
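
The interaction between CSV-provided ids and the sequence can be sketched as follows. This is an in-memory illustration, not the loader's code: `IdState`, `id_exists`, `register_csv_id`, and `next_generated_id` are hypothetical, a small array with a linear scan stands in for the temporary table and its unique index, and a plain counter stands in for the sequence.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IDS 64  /* small fixed capacity, for illustration only */

typedef struct {
    int64_t ids[MAX_IDS];    /* stand-in for the temporary id table */
    size_t  count;
    int64_t seq_last_value;  /* stand-in for the sequence */
} IdState;

/* Linear scan playing the role of the unique index lookup. */
static bool id_exists(const IdState *s, int64_t id)
{
    for (size_t i = 0; i < s->count; i++)
        if (s->ids[i] == id)
            return true;
    return false;
}

/* Records an id taken from the CSV. Returns false for a duplicate
 * (or when the table is full). If the id is ahead of the sequence,
 * the sequence is moved up so it cannot hand out the same id later. */
static bool register_csv_id(IdState *s, int64_t entry_id)
{
    if (s->count == MAX_IDS || id_exists(s, entry_id))
        return false;
    s->ids[s->count++] = entry_id;
    if (entry_id > s->seq_last_value)
        s->seq_last_value = entry_id;
    return true;
}

/* Generates the next id from the sequence and records it.
 * Returns -1 when the table is full. */
static int64_t next_generated_id(IdState *s)
{
    if (s->count == MAX_IDS)
        return -1;
    int64_t id = ++s->seq_last_value;
    s->ids[s->count++] = id;
    return id;
}
```

Replaying the example from the commit message: with the sequence at 1, registering CSV id 2 moves the sequence to 2, so the next generated id is 3 rather than a second 2.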

* Add functions to create graph and label automatically

- These functions will check existence of graph and label, and create
  them if they don't exist.
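
The check-then-create behaviour is idempotent: calling it twice has the same effect as calling it once. A minimal sketch, assuming a hypothetical in-memory `Catalog` of names in place of the real graph and label catalogs (`catalog_has` and `ensure_exists` are illustrative names, not functions from the PR):

```c
#include <stdbool.h>
#include <string.h>

#define MAX_NAMES 16
#define NAME_LEN  64

/* Hypothetical in-memory stand-in for the graph/label catalog. */
typedef struct {
    char names[MAX_NAMES][NAME_LEN];
    int  count;
} Catalog;

static bool catalog_has(const Catalog *c, const char *name)
{
    for (int i = 0; i < c->count; i++)
        if (strcmp(c->names[i], name) == 0)
            return true;
    return false;
}

/* Creates the entry only if it is missing. Returns true when a new
 * entry was created, false when it already existed or there was no
 * room left. */
static bool ensure_exists(Catalog *c, const char *name)
{
    if (catalog_has(c, name) || c->count == MAX_NAMES)
        return false;
    strncpy(c->names[c->count], name, NAME_LEN - 1);
    c->names[c->count][NAME_LEN - 1] = '\0';  /* always terminate */
    c->count++;
    return true;
}
```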

* Add regression tests

github-actions bot added the PG11 (PostgreSQL11) and override-stale (to keep issues/PRs untouched from stale action) labels on Aug 22, 2024.
jrgemignani merged commit 7ee9156 into apache:PG11 on Aug 22, 2024.
7 checks passed