load_data, empty line #12
Comments
Hi, Gui. The tests passed, so I believe I'm ready to go! I was thinking I'd tackle a bug before jumping into enhancements, but this is the only one I see as still pending; did you want to retain ownership? If it's OK for me to work on, do you have (more extensive) notes somewhere on what you intended to do (e.g., in the source itself, perhaps)? Did you implement a test for the "temp. solution", and if so, should the "refactor" pass the same test, perhaps an enhanced version, or one that replaces it entirely? Any other input/advice? DLG
Thanks @OlyDLG, that would be great! I did a quick test: if you include an empty line anywhere, it fails because the regexp doesn't match anymore. I saw some old CTD files with this kind of issue, so it is real. Do you see any reason why we would need to retain a blank line after parsing an input file? I don't. Remember that headers and notes pass through

raw_text = re.sub('(\r\n){2,}', '\r\n', raw_text)
raw_text = re.sub('\n{2,}', '\n', raw_text)

on the first lines of the CNV object, when we load the raw_text. What's your opinion on that?
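As a minimal sketch of what those two substitutions do when raw_text is loaded (the wrapper function name here is just illustrative, not the actual CNV code):

```python
import re

def collapse_blank_lines(raw_text):
    """Collapse runs of consecutive line breaks into a single one.

    Mirrors the two substitutions quoted above: first CRLF pairs,
    then bare LF, so blank lines in the header, notes, or data
    block disappear before any further parsing.
    """
    raw_text = re.sub('(\r\n){2,}', '\r\n', raw_text)
    raw_text = re.sub('\n{2,}', '\n', raw_text)
    return raw_text

# A blank line between two data rows is removed.
sample = "1.0 2.0\r\n\r\n3.0 4.0\r\n"
assert collapse_blank_lines(sample) == "1.0 2.0\r\n3.0 4.0\r\n"
```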
The only reason I can see for keeping track of blank lines is that they might be symptomatic of a "deeper" problem with the data in that file, and where exactly they occur might be useful in determining whether that's the case and what the problem might be. Accordingly, to increase the robustness of the resulting data structure, I would suggest adding an attribute in which we store some indication of where blank lines occur, e.g., a line number, or the line immediately following a blank line, or some such. That way we're not losing any potential information, but the information is segregated enough that it won't interfere with anything else we're doing. (And, thinking long-term, if/when we get to implementing automated QA of files, we'll already have this as a possible indicator of problems in the file.)

Another option: when I was parsing CTD files from moored sensors for the WA State Dept. of Ecology, we had occasions where data rows would get cropped for various reasons, and these definitely caused problems for my parser! So, if you don't already have such support, you'll probably eventually need to handle such "short" rows, and from that perspective a blank line can be seen as simply a cropped row of length zero. In other words, perhaps we shouldn't have a separate "solution" for blank rows, but one that treats them as one instance of this more general problem.

Of course, either of the things I'm proposing is really an enhancement, not merely a bug fix, so we could implement a more robust bug fix for the time being (esp. if my ideas are for much further down the road). That said, in general I'm opposed to simply dismissing anything that may turn out to be informative, so I would urge that whatever we do in the short term, it not simply ignore blank lines.
@OlyDLG those are two good points. One initial solution could be to collect all blank and/or cropped lines into a dictionary, using the line position in the original file as the key, and save all this into a new attribute of the CNV object. And you're right, the data is now loaded all at once, but it would be better to load it line by line. I'll drop the idea of simply removing the blank lines and open two new issues to keep those ideas on the radar. About the automatic QC, you might enjoy checking out CoTeDe; it's already in production, and you can choose which QC rules you want to apply. There is even a command line tool (ctdqc) to read a .cnv file and write the QC'ed output as a netCDF file.
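A rough sketch of that idea (hypothetical names, not the actual CNV implementation): the loader walks the data block line by line, keeps the good rows, and records blank or cropped lines keyed by their original position:

```python
def scan_lines(raw_text, expected_fields):
    """Split raw_text into rows and flag blank or cropped lines.

    Returns (clean_lines, problems), where problems maps the 1-based
    line number in the original file to a short description, so the
    information is preserved without interfering with parsing.
    """
    clean_lines = []
    problems = {}
    for lineno, line in enumerate(raw_text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            problems[lineno] = 'blank line'
        elif len(fields) < expected_fields:
            problems[lineno] = 'cropped row (%d of %d fields)' % (
                len(fields), expected_fields)
        else:
            clean_lines.append(line)
    return clean_lines, problems


lines, skipped = scan_lines("1.0 2.0\n\n3.0\n4.0 5.0\n", expected_fields=2)
# lines   == ['1.0 2.0', '4.0 5.0']
# skipped == {2: 'blank line', 3: 'cropped row (1 of 2 fields)'}
```

The skipped dictionary could then be stored as the proposed new attribute on the CNV object and reused later for automated QC.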
load_data was not able to handle an empty line in the raw_data.
A temporary solution is to remove the empty line with re.sub().
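For illustration, a simplified stand-in for the failure mode (this per-row pattern is hypothetical, not the regexp load_data actually uses):

```python
import re

# Hypothetical per-row pattern: each data row must be whitespace-separated numbers.
row = re.compile(r'^\s*[-+]?\d+\.?\d*(\s+[-+]?\d+\.?\d*)*\s*$')

raw_data = "1.00 35.2\r\n\r\n2.00 35.1\r\n"

for n, line in enumerate(raw_data.splitlines(), start=1):
    if not row.match(line):
        print('line %d does not match the row pattern' % n)  # the blank line 2

# Temporary workaround described above: collapse the blank line before matching.
cleaned = re.sub('(\r\n){2,}', '\r\n', raw_data)
assert all(row.match(line) for line in cleaned.splitlines())
```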