Percentile qc - test #162

RasmusBahbah · 2023-08-08T09:33:28Z

No description provided.

ladsmund · 2023-09-04T06:17:12Z

src/pypromice/process/L1toL2.py

+                diff.fillna(method='ffill', inplace=True) # forward filling all NaNs! 
+                diff = np.array(diff)
+
+                diff_period = np.ones_like(diff) * False


Suggested change

-                diff_period = np.ones_like(diff) * False
+                diff_period = np.zeros_like(diff, dtype='bool')
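
A minimal sketch of why the suggestion matters, using a made-up `diff` array for illustration: multiplying by `False` produces a float array of zeros, while `np.zeros_like(..., dtype='bool')` gives a boolean mask directly, so no later cast is needed.

```python
import numpy as np

# Hypothetical sample standing in for the forward-filled differences
diff = np.array([0.1, 0.0, 0.0, 0.3])

as_float = np.ones_like(diff) * False         # float64 array filled with 0.0
as_bool = np.zeros_like(diff, dtype='bool')   # boolean array filled with False

print(as_float.dtype, as_bool.dtype)
```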

ladsmund · 2023-09-04T06:18:42Z

src/pypromice/process/L1toL2.py

+                        if sum(abs(diff[i-diff_h:i])) < static_lim:
+                            diff_period[i-diff_h:i] = True
+
+                diff_period = np.array(diff_period).astype('bool')


Casting to a boolean dtype is unnecessary if the array is instantiated as suggested above.

Suggested change

-                diff_period = np.array(diff_period).astype('bool')

ladsmund · 2023-09-04T07:14:34Z

src/pypromice/process/L1toL2.py

+
+    base_path = os.getcwd()
+
+    file_path1 =  base_path + '/main/src/pypromice/qc/percentiles.db'


Environment-specific paths and variables should not be hard-coded in the function body. Pass them as input parameters or wrap them in a class instead.
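
One way to follow this advice, as a sketch only: the caller supplies the database location, and the function validates it. The name `resolve_percentiles_db` and its parameter are hypothetical and not part of pypromice. Using `pathlib` also avoids assembling paths with hand-written separators.

```python
from pathlib import Path

def resolve_percentiles_db(db_path):
    """Return the database location as a Path, failing early if it is missing.

    `db_path` is supplied by the caller (for example from a config file or a
    command-line argument) instead of being derived from os.getcwd().
    """
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f'percentiles database not found: {path}')
    return path
```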

ladsmund · 2023-09-04T07:20:39Z

src/pypromice/qc/percentiles.db

Be very careful when adding binary files to a code repository: they increase the repository size and are hard to version control.
Alternatively, rely on the committed script to generate the database.

I believe all *.db files are included in .gitignore

ladsmund · 2023-09-04T07:28:59Z

src/pypromice/process/L1toL2.py

+        else:
+            print(f'percentiles.db does not exist running {script_path2}')
+            subprocess.call(['python',script_path2])
+            file_path = file_path2


All of the above should be implemented in a separate module, perhaps with a Python class wrapping the database.

Why use a database instead of just a datafile like NetCDF?

@ladsmund I wrote the original code that uses SQLite. The general idea is to provide processing efficiency. If the percentile distributions (for each variable) are stored in a SQL database, we can extract the percentiles of interest very fast with a SQL query, rather than opening and reading a static file for each station. SQLite is the simplest database solution, and it only needs a small file on disk.

However, this is of more significant benefit for much larger datasets (e.g. 10s or 100s of thousands of stations). A solution using a static file could also be implemented if desired. Either way, you are persisting something on disk between runs that holds the percentile distributions.
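
As a rough sketch of the "class wrapping the database" idea combined with the fast-lookup argument: the class name, table name, columns, and values below are all assumptions made for illustration and do not reflect the actual layout of percentiles.db.

```python
import sqlite3

class PercentileStore:
    """Hypothetical wrapper around a SQLite file holding percentile values."""

    def __init__(self, db_path):
        # db_path is passed in by the caller rather than hard-coded
        self.conn = sqlite3.connect(db_path)

    def create(self):
        # Assumed schema: one row per (station, variable, percentile)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS percentiles '
            '(station TEXT, variable TEXT, percentile REAL, value REAL)'
        )

    def add(self, station, variable, percentile, value):
        self.conn.execute(
            'INSERT INTO percentiles VALUES (?, ?, ?, ?)',
            (station, variable, percentile, value),
        )

    def get(self, station, variable, percentile):
        # A single parameterized query returns the value of interest
        row = self.conn.execute(
            'SELECT value FROM percentiles '
            'WHERE station = ? AND variable = ? AND percentile = ?',
            (station, variable, percentile),
        ).fetchone()
        return None if row is None else row[0]


store = PercentileStore(':memory:')           # in-memory database for the example
store.create()
store.add('STATION_A', 't_u', 0.005, -45.0)   # made-up values
store.add('STATION_A', 't_u', 0.995, 12.0)
lo = store.get('STATION_A', 't_u', 0.005)
hi = store.get('STATION_A', 't_u', 0.995)
```

Keeping the SQL inside one class means the storage backend could later be swapped for a static file without touching the QC functions.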

ladsmund · 2023-09-04T11:24:12Z

src/pypromice/process/L1toL2.py

+
+                for i,d in enumerate(diff): # algorithm that ensures values can stay the same within the diff_period
+                    if i > (diff_h-1): 
+                        if sum(abs(diff[i-diff_h:i])) < static_lim:


Consider using the mean instead of the sum, so that the threshold is independent of the window size.
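
A sketch of what the mean-based variant could look like. The function name `flag_static` and the sample values are hypothetical; the loop bounds mirror the loop in the diff. With the mean, `limit` reads as an average absolute change per sample, so it does not need rescaling when `window` changes.

```python
import numpy as np

def flag_static(diff, window, limit):
    """Flag every window whose mean absolute difference is below `limit`."""
    diff = np.asarray(diff, dtype=float)
    flags = np.zeros_like(diff, dtype='bool')
    # Same bounds as the original loop: i runs from `window` to len(diff) - 1
    for i in range(window, len(diff)):
        if np.mean(np.abs(diff[i - window:i])) < limit:
            flags[i - window:i] = True
    return flags

# Made-up differences: a static stretch followed by normal variation
sample = [0.0, 0.0, 0.0, 0.0, 0.5, 0.4, 0.0]
flags = flag_static(sample, window=3, limit=0.01)
```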

ladsmund · 2023-09-04T11:37:44Z

src/pypromice/process/L1toL2.py

@@ -141,6 +156,248 @@ def toL2(L1, T_0=273.15, ews=1013.246, ei0=6.1071, eps_overcast=1.,
                                             T_0, T_100, ews, ei0)                   
    return ds

+def differenceQC(ds):


It would make sense to add such functions to a separate module like pypromice.qc

Yes, it might make sense to take this out of L1toL2.py and have all qc-related functions in the qc directory? Not sure if there are any implications of moving this other than organizational, but worth considering.

Also, it would be great to add a short description at the top of the docstrings for both differenceQC and percentileQC that gives a basic description of what the function does.

This is definitely something we will do, probably in a future update. A more descriptive docstring is also a good idea.

patrickjwright · 2023-09-04T19:04:35Z

src/pypromice/process/L1toL2.py

-    cc = calcCloudCoverage(ds['t_u'], T_0, eps_overcast, eps_clear,                  # Calculate cloud coverage
-                           ds['dlr'], ds.attrs['station_id'])  
-    ds['cc'] = (('time'), cc.data)
+    # Determiune cloud cover for on-ice stations


Wondering why this change for cloud cover is here? This has nothing to do with percentiles I assume?

This is something Baptiste implemented in the meantime, so no, nothing to do with the qc.

patrickjwright · 2023-09-04T19:12:14Z

src/pypromice/process/L1toL2.py

+        else:
+            print(f'percentiles.db does not exist running {script_path2}')
+            subprocess.call(['python',script_path2])
+            file_path = file_path2


@ladsmund I wrote the original code that uses SQLite. The general idea is to provide processing efficiency. If the percentile distributions (for each variable) are stored in a SQL database, we can extract the percentiles of interest very fast with a SQL query, rather than opening and reading a static file for each station. SQLite is the simplest database solution, and it only needs a small file on disk.

However, this is of more significant benefit for much larger datasets (e.g. 10s or 100s of thousands of stations). A solution using a static file could also be implemented if desired. Either way, you are persisting something on disk between runs that holds the percentile distributions.

patrickjwright · 2023-09-04T19:13:26Z

src/pypromice/qc/percentiles.db

I believe all *.db files are included in .gitignore

patrickjwright · 2023-09-04T19:16:57Z

src/pypromice/process/L1toL2.py

@@ -141,6 +156,248 @@ def toL2(L1, T_0=273.15, ews=1013.246, ei0=6.1071, eps_overcast=1.,
                                             T_0, T_100, ews, ei0)                   
    return ds

+def differenceQC(ds):


Yes, it might make sense to take this out of L1toL2.py and have all qc-related functions in the qc directory? Not sure if there are any implications of moving this other than organizational, but worth considering.

Also, it would be great to add a short description at the top of the docstrings for both differenceQC and percentileQC that gives a basic description of what the function does.

patrickjwright · 2023-09-04T19:20:28Z

src/pypromice/process/L1toL2.py

@@ -141,6 +156,248 @@ def toL2(L1, T_0=273.15, ews=1013.246, ei0=6.1071, eps_overcast=1.,
                                             T_0, T_100, ews, ei0)                   
    return ds

+def differenceQC(ds):


What is differenceQC? Just glancing at the code, this appears to be a persistence check? That is, checking if a value has gotten stuck in a persisting state for a certain period of time. If so, I would rename to persistenceQC.

Yep, it is a persistence/static check. We will rename it to staticQC.

patrickjwright · 2023-09-04T19:27:09Z

src/pypromice/qc/compute_percentiles.py

+    # Define threshold dict to hold limit values, and 'hi' and 'lo' percentile.
+    # Limit values indicate how far we will go beyond the hi and lo percentiles to flag outliers.
+    # *_u are used to calculate and define all limits, which are then applied to *_u, *_l and *_i
+    var_threshold = {


I noticed that the values in this var_threshold are different from the values in var_threshold in percentileQC. Is this intentional? Was this function used for testing and determining thresholds?

Thanks for spotting that, I will change it back! I tested some different thresholds and apparently did not change them back.
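
For readers unfamiliar with how the limits interact with the percentiles, a minimal sketch under stated assumptions: the function names, the baseline series, and the limit of 10 are invented for illustration and are not the values or code used in this PR. The lo and hi percentiles are computed from a baseline, widened by the limit, and anything outside the widened range is flagged.

```python
import numpy as np

def percentile_bounds(baseline, lo_q=0.005, hi_q=0.995, limit=0.0):
    """Return (lower, upper): the lo/hi percentiles widened by `limit`."""
    lo, hi = np.nanquantile(np.asarray(baseline, dtype=float), [lo_q, hi_q])
    return lo - limit, hi + limit

def flag_outliers(values, lower, upper):
    """Boolean mask that is True where a value falls outside the bounds."""
    values = np.asarray(values, dtype=float)
    return (values < lower) | (values > upper)

# Made-up baseline and limit, for illustration only
baseline = np.arange(0.0, 101.0)
lower, upper = percentile_bounds(baseline, limit=10.0)
flags = flag_outliers([50.0, 120.0, -20.0, np.nan], lower, upper)
```

NaN values compare as False on both sides, so missing samples are not flagged as outliers in this sketch.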

patrickjwright · 2023-09-04T19:35:04Z

@RasmusBahbah I left a few comments, mostly minor. Overall, I am curious what kind of testing and plotting was done to determine both the limit thresholds (the values in var_threshold), as well as the percentiles to use as baselines for applying the limits (i.e. 0.005 and 0.995).

I remember in our initial testing, we were certainly finding many outliers that could be removed with the percentiles check, but we were also finding instances where good data would be removed. I assume there is no tolerance for removing any good data from the historical dataset, so were you able to do enough testing for each station to determine that the limits in this PR are conservative enough to preserve all good data?

Was air temp the only variable that uses seasonal distributions? All other variables use the full dataset to calculate percentiles?

Due to my limited time for this review, I would also highly encourage you get a review from @PennyHow as well. Great that this is being pushed forward!

RasmusBahbah · 2023-09-06T09:20:17Z

Thanks for your feedback, @patrickjwright, much appreciated.
@ladsmund and I will do some testing of the QC on the pipeline, to determine if we are removing any good data, and tweak the percentiles and thresholds accordingly.

And yes, air temperature is the only variable with seasonal percentiles; the other variables use the whole time series. It would be interesting in a future update to test whether monthly percentiles or another configuration could improve the QC, and perhaps to apply it to the other variables as well.

I really appreciate your time on this! @ladsmund and I will make some improvements on the structure, the staticQC, and how to test it. After that, we will definitely include @PennyHow for the final review.

ladsmund · 2023-09-06T08:39:38Z

src/pypromice/process/L1toL2.py


+
+    ds = differenceQC(ds)                                                      # Flag and Remove difference outliers     


Is ds the entire time series?

PennyHow · 2023-11-07T11:44:03Z

Can this PR be closed now that #183 is merged? @ladsmund @RasmusBahbah

patrickjwright and others added 27 commits March 21, 2023 16:18

initial implementation

2af3d54

minor tweaks to compute_percentiles.py

8586a91

typo in aws.py

9f2a0ec

stub in prelim location for running percentile QC check

df91382

stub out percentile QC check in L1toL2

7985e7d

fix season definitions in compute_percentiles.py

0d9bece

set percentile outliers to nan

c766e9c

clean up and comment L1toL2.py

5bdbfbc

tweak comments

fa1c0e1

clean up and comment L1toL2.py

560c808

add plotting capability to compute_percentiles.py

5a17f96

Changing the Percentiles limits. Adding Difference QC. And checking i…

d678550

…f percentiles.db existst

Wrong index

156161f

fixing subprocess run compute_percentiles.py

2ad99fe

Trying to fix subprocess path

2807977

script_check

a66e392

Update L1toL2.py

b9d29d5

Printing Current Path

1c5fd1a

Update L1toL2.py

f3d113e

Fixing paths to percentiles and compute_percentiles

c185aaa

Update L1toL2.py

564705f

Fixing paths

6a55ec7

Update L1toL2.py

068004d

updating subprocess

53e2d5c

Update L1toL2.py

64c0e48

Fixing path to .db file

879d43d

changing path to l3 data

44f4360

RasmusBahbah requested a review from patrickjwright August 15, 2023 12:14

RasmusBahbah added 2 commits August 15, 2023 14:49

crash bug if there are no season data for temp.

ddf2fd3

Add files via upload

1ab0ba0

RasmusBahbah added 7 commits August 15, 2023 15:12

Update L1toL2.py

4a8d581

Merge branch 'percentile-qc' of https://github.com/GEUS-Glaciology-an…

aa38424

…d-Climate/pypromice into percentile-qc

Bug fix - if station do not have var (p, wspd,rh)

09b4f44

windows and Linux Separator bug

305fd69

check paths

9ea80a9

update cc

370572d

correcting path to .db file

3c3d54f

RasmusBahbah requested a review from ladsmund August 29, 2023 11:32

ladsmund reviewed Sep 4, 2023

patrickjwright requested changes Sep 4, 2023

ladsmund reviewed Sep 6, 2023
