-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding source freshness reading and adding tests #2
base: main
Are you sure you want to change the base?
Conversation
when following the logic to break up a line of a log, I realized that we could probably break up the logs dynamically with regex. I used regex101 to test and build the regex, you can see the logic behind it here. for a future PR, I'd like to capture test cases as well because they aren't captured in the current regex string! (you can see that in the link) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for submitting this PR! 😍 I really like a lot of the improvements you've suggested here. I just have one major concern about the regex approach vs. the current string-parsing approach, especially because I previously used a regex approach then switched to the string-parsing. Happy to discuss more the pros/cons of one approach over another.
I learned loads about how to set up test cases from this PR, by the way!
* ⏱️ Return runtime duration for each still-running model | ||
* 📊 Communicate model runtimes for successful models and maybe a cool viz | ||
* 📝 Add functionality for dbt test and dbt build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the script currently does support dbt test and dbt build (it's able to parse output when running views, tables, incrementals, tests, and seeds)
""" | ||
10:03:09 1 of 10 START sql table model hyrule.source_quests [RUN] | ||
|
||
REGEX = r"\s*(?:\[0m)?(\d{2}:\d{2}:\d{2})\s+(\d+)\s+of\s+\d+\s+.*?(?:model\s|of\s)\b(\w+\.\w+)\b" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment about the regex approach taken here. In the first release of this script I used a regex approach too! Then I noticed that the output was very systematic and that by dropping certain words (extraneous_words
), I could parse based on space separators and output into a data frame and do the rest there.
Your PR suggests returning to the regex approach and I'm not sure I want to. Want to walk me through your reasoning some more?
10:03:09 1 of 10 START sql table model hyrule.source_quests [RUN] | ||
|
||
REGEX = r"\s*(?:\[0m)?(\d{2}:\d{2}:\d{2})\s+(\d+)\s+of\s+\d+\s+.*?(?:model\s|of\s)\b(\w+\.\w+)\b" | ||
EXAMPLE_LOGS = """10:03:09 1 of 10 START sql table model hyrule.source_quests [RUN] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love that idea of assigning all this to a variable! Feels like a "duh"
except AttributeError: | ||
st.error( | ||
"The input you provided doesn't look like dbt output. Please check your input and try again." | ||
) | ||
st.stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this error!
try: | ||
df["post_regex"] = df["raw_line"].apply( | ||
lambda col: re.search(REGEX, col).groups() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous comment about using regex approach, help convince me this is better than df = df["raw_line"].str.split(expand=True)
!
still_running = ( | ||
df.groupby("model_num")["raw_line"] | ||
.count() | ||
.reset_index() | ||
.rename(columns={"raw_line": "count_records"}) | ||
.query("count_records == 1")["model_num"] | ||
.to_list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice way of thinking through this, it's cleaner than joining two subsets
if __name__ == "__main__": | ||
main() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do?
def main(): | ||
''' | ||
Run the app | ||
''' | ||
title_and_description() | ||
raw_input = get_logs() | ||
df = clean_input(raw_input) | ||
output(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Learned a lot from this approach of wrapping up most things in functions, thank you for the demo!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a script that runs automatically or needs to be invoked manually?
pytest
in the terminal