This repository has been archived by the owner on Oct 6, 2023. It is now read-only.

Implement the raw data summary workflow step #1

Open
Joe-Heffer-Shef opened this issue Aug 25, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

Joe-Heffer-Shef (Collaborator) commented Aug 25, 2023

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Joe-Heffer-Shef added the "enhancement" (New feature or request) label on Aug 25, 2023
G-Accad (Collaborator) commented Sep 11, 2023

Is your feature request related to a problem? Please describe.
When dealing with large volumes of raw data, it is hard to quickly grasp the data's characteristics and extract key insights. Without a structured summary of the raw data, making informed decisions or performing further analysis is time-consuming and error-prone.

Describe the solution you'd like
This workflow step should involve the following components:

  1. Data Quality Checks: Perform data quality checks to identify missing values, duplicates, outliers, and any other data anomalies. This step helps confirm that the data is clean and reliable enough for analysis.
  2. Data Profiling: Automatically generate descriptive statistics and metrics for the selected columns in the raw data, including measures like mean, median, standard deviation, and count. This will provide a high-level overview of the data's distribution and characteristics.
  3. Data Visualization: Create visualizations such as histograms, box plots, and scatter plots for numeric data, and bar charts for categorical data.
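As a rough illustration of the first two components, the sketch below runs the quality checks and the profiling on a single numeric column. It is written in Python with only the standard library, purely for illustration; the issue does not name an implementation language. The function names `quality_report` and `profile`, the use of `None` for missing values, and Tukey's 1.5 × IQR rule for outliers are all assumptions, not part of the proposal.

```python
import statistics
from collections import Counter


def quality_report(values):
    """Count missing values, duplicates, and IQR outliers in one column.

    Hypothetical helper: missing values are assumed to be encoded as None.
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    # Every repeat of a value after its first occurrence counts as a duplicate.
    duplicates = sum(n - 1 for n in Counter(present).values() if n > 1)
    # Tukey's rule (an assumed choice): flag points beyond 1.5 * IQR from the quartiles.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]
    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}


def profile(values):
    """Descriptive statistics named in the proposal: count, mean, median, stdev."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
    }


# Made-up example column with one missing value, one repeat, and one extreme value.
column = [4.0, 5.0, 5.0, 6.0, None, 7.0, 50.0]
print(quality_report(column))
print(profile(column))
```

Keeping the checks separate from the profile means a column can be inspected for problems before any statistics computed from it are trusted.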

Describe alternatives you've considered

  1. Manual Data Summary: for example, in Excel (too time-consuming)

Workflow:

graph LR
    func["Function Runs"]
    input1("Type of Dataset") --> func
    input2("Columns of interest") --> func
    func --> output1("Generate Basic Descriptive Statistics")
    output1--> output2("Visualizations: Histograms, Box Plots")
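The workflow in the diagram could be wired together roughly as follows. This is a minimal Python sketch under stated assumptions: `run_summary`, `histogram_counts`, the list-of-dicts row format, and the treatment of the dataset type as a plain label are hypothetical choices made for illustration. The histogram is returned as bin counts rather than drawn, so the sketch needs no plotting library.

```python
import statistics


def histogram_counts(values, bins=3):
    """Count values in equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum would land one past the end, so clamp it into the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


def run_summary(dataset_type, columns, rows):
    """Inputs: dataset type and columns of interest. Outputs: statistics, then histograms."""
    summary = {"dataset_type": dataset_type, "columns": {}}
    for name in columns:
        # Skip rows where the column is absent or None (assumed missing-value encoding).
        values = [row[name] for row in rows if row.get(name) is not None]
        summary["columns"][name] = {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "histogram": histogram_counts(values),
        }
    return summary


# Made-up rows; "survey" is an arbitrary dataset-type label.
rows = [
    {"age": 20, "height": 150},
    {"age": 30, "height": None},
    {"age": 40, "height": 170},
    {"age": 60, "height": 180},
]
print(run_summary("survey", ["age", "height"], rows))
```

Returning a plain dictionary keeps the step independent of the reporting layer, so the same result could be rendered by whichever document format is eventually chosen.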

G-Accad (Collaborator) commented Sep 11, 2023

Quarto vs R Markdown

| Aspect | Quarto | R Markdown |
| --- | --- | --- |
| Ease of Use | + User-friendly, especially for non-technical users | - Requires some familiarity with R and Markdown syntax |
| | + Simplified YAML configuration | |
| | + Built-in support for Pandoc templates | |
| Document Structure | + Flexible structure with notebooks, reports, and documents | - Standard Markdown structure with YAML header |
| | | - Less flexibility in structuring documents |
| | + Notebook-style interactivity | - Limited interactivity |
| Interactivity | + Interactive code chunks | + Supports interactive code chunks (with R) |
| | + Data visualization with JavaScript | - Limited interactivity with other languages |
| Output Formats | + Multiple output formats (HTML, PDF, Word) | + Supports various output formats |
| | + Customizable templates | - Templates can be customized |
| Extensibility and Ecosystem | + Integration with the Quarto ecosystem | + Established R Markdown ecosystem with numerous packages |
| | + Growing community support | |
| Learning Curve | + Shorter learning curve for beginners | - Steeper learning curve for non-R users |
| | + Easier for non-programmers | - More programming knowledge required |
