Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding source freshness reading and adding tests #2

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

ethanfuerst
Copy link
Collaborator

@ethanfuerst ethanfuerst commented Apr 6, 2023

  • creating requirements.txt and adding venv/ to .gitignore
  • breaking up text processing in to methods which is easier to read
  • using regex to make dbt core logs and dbt source logs work
  • formatting with black
  • adding tests with pytest - just run pytest in the terminal

@ethanfuerst
Copy link
Collaborator Author

ethanfuerst commented Apr 6, 2023

when following the logic to break up a line of a log, I realized that we could probably break up the logs dynamically with regex. I used regex101 to test and build the regex, you can see the logic behind it here. for a future PR, I'd like to capture test cases as well because they aren't captured in the current regex string! (you can see that in the link)

@ethanfuerst ethanfuerst marked this pull request as ready for review April 6, 2023 20:38
@ethanfuerst ethanfuerst changed the title Adding source freshness Adding source freshness reading and adding tests Apr 6, 2023
Copy link
Owner

@foundinblank foundinblank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for submitting this PR! 😍 I really like a lot of the improvements you've suggested here. I just have one major concern about the regex approach vs. the current string-parsing approach, especially because I previously used a regex approach then switched to the string-parsing. Happy to discuss more the pros/cons of one approach over another.

I learned loads about how to set up test cases from this PR, by the way!

* ⏱️ Return runtime duration for each still-running model
* 📊 Communicate model runtimes for successful models and maybe a cool viz
* 📝 Add functionality for dbt test and dbt build
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the script currently does support dbt test and dbt build (it's able to parse output when running views, tables, incrementals, tests, and seeds)

"""
10:03:09 1 of 10 START sql table model hyrule.source_quests [RUN]

REGEX = r"\s*(?:\[0m)?(\d{2}:\d{2}:\d{2})\s+(\d+)\s+of\s+\d+\s+.*?(?:model\s|of\s)\b(\w+\.\w+)\b"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment about the regex approach taken here. In the first release of this script I used a regex approach too! Then I noticed that the output was very systematic and that by dropping certain words (extraneous_words), I could parse based on space separators and output into a data frame and do the rest there.

Your PR suggests returning to the regex approach and I'm not sure I want to. Want to walk me through your reasoning some more?

10:03:09 1 of 10 START sql table model hyrule.source_quests [RUN]

REGEX = r"\s*(?:\[0m)?(\d{2}:\d{2}:\d{2})\s+(\d+)\s+of\s+\d+\s+.*?(?:model\s|of\s)\b(\w+\.\w+)\b"
EXAMPLE_LOGS = """10:03:09 1 of 10 START sql table model hyrule.source_quests [RUN]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love that idea of assigning all this to a variable! Feels like a "duh"

Comment on lines +102 to +106
except AttributeError:
st.error(
"The input you provided doesn't look like dbt output. Please check your input and try again."
)
st.stop()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this error!

Comment on lines +98 to +101
try:
df["post_regex"] = df["raw_line"].apply(
lambda col: re.search(REGEX, col).groups()
)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment about using regex approach, help convince me this is better than df = df["raw_line"].str.split(expand=True)!

Comment on lines +121 to +127
still_running = (
df.groupby("model_num")["raw_line"]
.count()
.reset_index()
.rename(columns={"raw_line": "count_records"})
.query("count_records == 1")["model_num"]
.to_list()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice way of thinking through this, it's cleaner than joining two subsets

Comment on lines +174 to +175
if __name__ == "__main__":
main()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do?

Comment on lines +164 to +171
def main():
'''
Run the app
'''
title_and_description()
raw_input = get_logs()
df = clean_input(raw_input)
output(df)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Learned a lot from this approach of wrapping up most things in functions, thank you for the demo!

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a script that runs automatically or needs to be invoked manually?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants