Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Percentile qc - test #162

Open
wants to merge 36 commits into
base: main
Choose a base branch
from
Open

Percentile qc - test #162

wants to merge 36 commits into from

Conversation

RasmusBahbah
Copy link
Contributor

No description provided.

@RasmusBahbah RasmusBahbah requested a review from ladsmund August 29, 2023 11:32
diff.fillna(method='ffill', inplace=True) # forward filling all NaNs!
diff = np.array(diff)

diff_period = np.ones_like(diff) * False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
diff_period = np.ones_like(diff) * False
diff_period = np.zeros_like(diff, dtype='bool')

if sum(abs(diff[i-diff_h:i])) < static_lim:
diff_period[i-diff_h:i] = True

diff_period = np.array(diff_period).astype('bool')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unnecessary to assign to boolean data type if the array is instantiated as suggested above

Suggested change
diff_period = np.array(diff_period).astype('bool')


base_path = os.getcwd()

file_path1 = base_path + '/main/src/pypromice/qc/percentiles.db'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Environment specific paths/variables shall not be assigned statically in the function body. Use input parameters or a class instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be very careful when adding binary files to a code repository because It will increase the repository size while being hard to version control.
Alternatively rely on the committed script to generate the database.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe all *.db files are included in .gitignore

else:
print(f'percentiles.db does not exist running {script_path2}')
subprocess.call(['python',script_path2])
file_path = file_path2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the above should be implemented in a separate module and maybe a python class wrapping the database

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use a database instead of just a datafile like NetCDF?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ladsmund I wrote this original code that uses sqlite. The general idea is to provide processing effficiency. If the percentile distributions (for each variable) are stored in a sql database, we are able to extract the percentiles of interest very fast with a sql query, rather than opening/reading a static file for each station. sqlite is the simplest database solution, where we can have a small file on disk.

However, this is of more significant benefit for much larger datasets (e.g. 10s or 100s of thousands of stations). A solution using a static file could also be implemented if desired. Either way, you are persisting something on disk between runs that holds the percentile distributions.


for i,d in enumerate(diff): # algorithm that ensures values can stay the same within the diff_period
if i > (diff_h-1):
if sum(abs(diff[i-diff_h:i])) < static_lim:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using mean instead of sum to make the threshold invariant of window size

@@ -141,6 +156,248 @@ def toL2(L1, T_0=273.15, ews=1013.246, ei0=6.1071, eps_overcast=1.,
T_0, T_100, ews, ei0)
return ds

def differenceQC(ds):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make sense to add such functions to a separate module like pypromice.qc

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it might make sense to take this out of L1toL2.py and have all qc-related functions in the qc directory? Not sure if there are any implications of moving this other than organizational, but worth considering.

Also, it would be great to add a short description at the top of the docstrings for both differenceQC and percentileQC that gives a basic description of what the function does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is definitely something we will do in the future, probably in a future update. A better described docsrting is also a good idea

cc = calcCloudCoverage(ds['t_u'], T_0, eps_overcast, eps_clear, # Calculate cloud coverage
ds['dlr'], ds.attrs['station_id'])
ds['cc'] = (('time'), cc.data)
# Determiune cloud cover for on-ice stations
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering why this change for cloud cover is here? This has nothing to do with percentiles I assume?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something Baptiste implemented in the meantime, so no, nothing to do with the qc.

else:
print(f'percentiles.db does not exist running {script_path2}')
subprocess.call(['python',script_path2])
file_path = file_path2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ladsmund I wrote this original code that uses sqlite. The general idea is to provide processing effficiency. If the percentile distributions (for each variable) are stored in a sql database, we are able to extract the percentiles of interest very fast with a sql query, rather than opening/reading a static file for each station. sqlite is the simplest database solution, where we can have a small file on disk.

However, this is of more significant benefit for much larger datasets (e.g. 10s or 100s of thousands of stations). A solution using a static file could also be implemented if desired. Either way, you are persisting something on disk between runs that holds the percentile distributions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe all *.db files are included in .gitignore

@@ -141,6 +156,248 @@ def toL2(L1, T_0=273.15, ews=1013.246, ei0=6.1071, eps_overcast=1.,
T_0, T_100, ews, ei0)
return ds

def differenceQC(ds):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it might make sense to take this out of L1toL2.py and have all qc-related functions in the qc directory? Not sure if there are any implications of moving this other than organizational, but worth considering.

Also, it would be great to add a short description at the top of the docstrings for both differenceQC and percentileQC that gives a basic description of what the function does.

@@ -141,6 +156,248 @@ def toL2(L1, T_0=273.15, ews=1013.246, ei0=6.1071, eps_overcast=1.,
T_0, T_100, ews, ei0)
return ds

def differenceQC(ds):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is differenceQC? Just glancing at the code, this appears to be a persistence check? That is, checking if a value has gotten stuck in a persisting state for a certain period of time. If so, I would rename to persistenceQC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it is a persistence/static check. We will rename it to staticQC.

# Define threshold dict to hold limit values, and 'hi' and 'lo' percentile.
# Limit values indicate how far we will go beyond the hi and lo percentiles to flag outliers.
# *_u are used to calculate and define all limits, which are then applied to *_u, *_l and *_i
var_threshold = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am noticing that the values in this var_threshold are different that the values in var_threshold in percentileQC. Is this intentional? Was this function used for testing and determining thresholds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting that, I will change it back! I tested some different thresholds and apparently did not change them back.

@patrickjwright
Copy link
Collaborator

@RasmusBahbah I left a few comments, mostly minor. Overall, I am curious what kind of testing and plotting was done to determine both the limit thresholds (the values in var_threshold), as well as the percentiles to use as baselines for applying the limits (i.e. 0.005 and 0.995).

I remember in our initial testing, we were certainly finding many outliers that could be removed with the percentiles check, but we were also finding instances where good data would be removed. I assume there is no tolerance for removing any good data from the historical dataset, so were you able to do enough testing for each station to determine that the limits in this PR are conservative enough to preserve all good data?

Was air temp the only variable that uses seasonal distributions? All other variables use the full dataset to calculate percentiles?

Due to my limited time for this review, I would also highly encourage you get a review from @PennyHow as well. Great that this is being pushed forward!

@RasmusBahbah
Copy link
Contributor Author

Thanks for your feedback, @patrickjwright, much appreciated.
@ladsmund and I will do some testing of the QC on the pipeline, to determine if we are removing any good data, and tweak the percentiles and thresholds accordingly.

And yes, airtemp is the only variable with seasonal percentiles, and the other variables use the whole time-series. It will be engaging in a future update to test if monthly percentiles or another conf. could improve the QC. Maybe also use it on the other variables.

I really appreciate your time on this! @ladsmund and I will make some improvements on the structure, the staticQC, and how to test it. After that, we will definitely include @PennyHow for the final review.



ds = differenceQC(ds) # Flag and Remove difference outliers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ds the entire time series?

@PennyHow
Copy link
Member

PennyHow commented Nov 7, 2023

Can this PR be closed now that #183 is merged? @ladsmund @RasmusBahbah

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants