Strip White Space from Cell Values #215

JVickery-TBS · 2024-05-07T14:15:11Z

Use case

Legacy datasets that have white space in columns that business owners want to be treated as number/timestamp. XLoader cannot set that field or fail's data loads when said data is found 3000+ rows deep.

Changes being made

feat(logic): strip white space;

Strip white space padding from cell values when loading into DataStore.

- Strip white space from cell values when loading into DataStore.

- Fixed white space stripping for `load_table`.

duttonw · 2024-05-07T21:25:12Z

Hi @JVickery-TBS ,

What is the business use case for this change or what use behaviour are you seeing that you are trying to correct?

should/could this be controlled by a per resource feature flag on reload (like types) for when removing tailing space is not wanted?

ckanext/xloader/parser.py

JVickery-TBS · 2024-05-08T13:26:28Z

@duttonw we have some users with large CSVs that have just spaces in empty cells instead of actual empty cells (apparently from some old data building applications). So they are unable to use the numeric or timestamp types as their "empty" values are actually spaces.

So just stripping all string cell values would work. For the "per resource flag", would that be adding a field to the Data Dictionary for each field? Just a "Strip White Space" checkbox or something eh?

- Condition stripping on `str` type.

- Align test with new strip code. - Write headers to stream for `load_csv`.

duttonw · 2024-05-08T20:33:39Z

@duttonw we have some users with large CSVs that have just spaces in empty cells instead of actual empty cells (apparently from some old data building applications). So they are unable to use the numeric or timestamp types as their "empty" values are actually spaces.

So just stripping all string cell values would work. For the "per resource flag", would that be adding a field to the Data Dictionary for each field? Just a "Strip White Space" checkbox or something eh?

Hi @JVickery-TBS ,

That is a very good use case, are you able to update this pr per the following:

Always strip whitespace for non-text set fields. (Your code should already be doing this)
Do not 'strip' whitespace on type 'text'. As you never know if something might be using the datastore to concat emails and need's that trailing white space.
Optional: Add an option of per resource overrides, per how we set comments on the db table field if it should be test/numeric/timestamp and description (nice title) etc. We could extend to have that extra option added to this template (inside xloader plugin) dictionary_form.html, add logic to save it to the 'comments' section of the datastore table as well as use it when reading in for it.

Other things in this space

I'm not really a fan of having the datastore field types being stored on the datastore table as comments since when you drop the table (which is like all the time when you upload a new file to the resource**) if the uploaded file fails all customizations are lost.

I feel it should be set as a custom param against the resource instead so its persisted on a xloader failure so that when a valid csv file is upload it does not affect downstream api products expecting a number/timestamp when xloader defaulted to 'text' again.

Do you also think its a good idea in correcting the data storage location?

** recent changes now only truncate if csv fields match.

duttonw · 2024-05-08T20:41:40Z

Thankfully I've not run into this before but did run into something similar when requested to review why only 3800ish rows were imported into the db instead of the 3million on https://www.data.qld.gov.au/dataset/road-location-and-traffic-data/resource/daab3617-077f-450a-a1c0-57c26d8ba47c

The reason I found was that one of the col's which the first 3000 rows had looked like it was 100% numeric, but it was not and the system would crash but leave any loaded data it had as it was doing it the slow way (fast way fails fast when pushing the 256mb file to db)

JVickery-TBS · 2024-05-09T13:52:53Z

@duttonw okay, I will find some time to add in the condition on text type overrides to not strip from those. And will implement a new Data Dictionary field for stripping white space. (will add some test coverage for these things too)

As for the Data Dictionary being stored in the DB table info, I have always found it a bit strange. But I think the idea behind it was to keep everything in the same table and database so the queries would be the fastest and there would not need to be any joins or querying between different databases.

That being said, I would say to keep the Data Dictionary inside the table info as it is currently, but we could come up with another place where it gets saved too? Such as the normal resource table in an actual field called data_dictionary. And then add in some more logic in datastore_create, which would check between any existing table info and the data_dictionary field value.

But yeah, having the Data Dictionary inside the table info where the tables can get deleted is not great to me.

@wardi is in tomorrow, so I will discuss this further with him. He may know of any issues we could run into here or maybe know of the best way to do this.

duttonw · 2024-05-09T21:10:38Z

We could take it as is.

I just want to not have side effects that people were using before or have the ability to not strip by flag.

Reference Comic: https://xkcd.com/1172/ (workflow)

As to location of data store data types/comments, with items like validation schema going into resource extra params. We already have a pattern we could reference. This of course is something that needs to be fleshed out if it were to go ahead as well as if we want to also mirror it into the data pusher plugin so they are in sync with a defacto way.

JVickery-TBS · 2024-05-10T13:40:08Z

@duttonw yeah for sure would need to be fully fleshed out in upstream. For now, with this PR I will add a new field to the Data Dictionary form and use it in the new code. (and skip the text overrides).

I also just remembered that I did make this action: https://github.com/ckan/ckan/blob/master/ckanext/datastore/logic/action.py#L498

Which is really just a wrapper of datastore_delete but forces a filters param to be given and not null. So this could be an option in the meantime for Xloader. As if you do datastore_delete with filters={}, it will just delete all the records in the table and not the table itself. Which would preserver the table info and Data Dictionary with any overrides, comments, labels, etc.

wardi · 2024-05-10T14:39:35Z

When re-xloader-ing data if we use datastore_records_delete instead of datastore_delete (or datastore_delete with an empty filters dict) then the existing columns and data dictionary values won't be removed.
Then datastore_create should update columns while keeping data dictionary values from the previous columns.

I think data dictionary data in the resources is a good idea (this is something that ckanext-validation already does) we just need to deal with custom resource schemas somehow, and handling when columns are added/removed.

wardi · 2024-05-10T14:56:24Z

Note that the new IDataDictionaryForm allows users to customize their data dictionaries with private data that may not be appropriate for a public schema in the resource. So for now having xloader not delete columns when loading data is the only way to keep this data.

wardi · 2024-05-10T14:59:52Z

In fact xloader should add its own settings to columns with an IDataDictionaryForm so things like "strip whitespace" will be presented in the data dictionary form and validated by ckan.

- Extended tabulator stream iterator to strip white space from cell values.

duttonw · 2024-05-12T20:26:20Z

Note that the new IDataDictionaryForm allows users to customize their data dictionaries with private data that may not be appropriate for a public schema in the resource. So for now having xloader not delete columns when loading data is the only way to keep this data.

There is ways to strip it from going out unless you are sysadmin or internal plugin. We had a highly known vendor do this for PII fields from going out publicly, You can see this in effect here https://github.com/qld-gov-au/ckanext-resource-visibility/blob/master/ckanext/resource_visibility/logic/action.py#L28 which hides these fields which are in the db but not in the solr index nor page display (unless author/sysadmin) https://github.com/qld-gov-au/ckanext-resource-visibility/blob/e35a0fc6c3b84176373be6c9b0c1faffbb30f01f/ckanext/resource_visibility/constants.py#L15

duttonw · 2024-05-12T20:34:25Z

When re-xloader-ing data if we use datastore_records_delete instead of datastore_delete (or datastore_delete with an empty filters dict) then the existing columns and data dictionary values won't be removed. Then datastore_create should update columns while keeping data dictionary values from the previous columns.

I think data dictionary data in the resources is a good idea (this is something that ckanext-validation already does) we just need to deal with custom resource schemas somehow, and handling when columns are added/removed.

The main problem we are running into Authors who accidently use the "datastore 'csv'" instead of the original which includes the _id field, edits that and then uploads that back into CKAN, due to a change in fields, xloader drops the table to reload it. Of course when a validation schema is present (which takes time to get added everywhere). This could be stopped and hopefully @JVickery-TBS work on integration for that is fully ready for others to consume.

On the point of getting validation schemas to the masses'. We got a a standalone plugin built that adds a new tab when a datastore is made that general Authors can use to generate a simple validation schema which could then be applied at the resource level, or on the dataset (for monthly new files).

https://github.com/qld-gov-au/ckanext-validation-schema-generator
It may need a bit more TLC + readme updates on allowing it to be standalone consumable since they did go to down in placing huge number of overrides in our primary theme/plugin customizations. https://github.com/qld-gov-au/ckanext-data-qld/blob/develop/ckanext/data_qld/ckan_dataset.json#L203

JVickery-TBS · 2024-05-14T12:35:02Z

@duttonw I do have an idea in the backlog to exclude the _id column when downloading from the Web UI. As it really is not needed by users in that way. If they need those ID columns for some machine integration, I feel like those users are able to use datastore_search

- Added `strip_extra_white` info field and form fields. - Added validator for `strip_extra_white`. - Used `strip_extra_white` to control stripping white space.

- Added `strip_extra_white` field and form fields. - Used `strip_extra_white` to control stripping white space.

- Minor logic fixes for the new `strip_extra_white` field.

JVickery-TBS · 2024-05-14T17:21:56Z

@ThrawnCA @duttonw okay I have added a new DataStore field and schema for strip_extra_white and baked it into the Loader script.

Not sure why the tests are unable to import the DataStore module however :/

duttonw

Hi @JVickery-TBS ,

Some tweaks to the tests but overall. You have done an amazing job.

I think its ok to go as it his but would like @ThrawnCA to review it once he's more available. Currently he is dealing with a tight deadline that AWS imposed on us :'( on our infra IasC.

I do want to test this in a live environment (or pick ckan:develop as upstream) as I'm hesitant in including gui changes which don't have selenium BDD testing on a straight PR.

As Such, @JVickery-TBS if you do deploy this to one of your test environments that is publicly visibly and you can give myself of @ThrawnCA author rights for a small window of time, we could do the independent verification testing on the gui and merge it in.

We will still be pulling this into the qld-gov-au fork and can test on our dev.data.qld.gov.au system in the coming weeks/monthish.

It's also on my todo list on getting the pypi system deployments automated when my current workload drops off a bit so we can move back to nice pip installs again.

Regards,

@duttonw

ckanext/xloader/loader.py

ckanext/xloader/templates/datastore/snippets/dictionary_form.html

ckanext/xloader/tests/test_loader.py

duttonw · 2024-05-18T21:56:16Z

cicd testing issue due to missing dependancy wiring that exists in normal runtime.

    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/__w/ckanext-xloader/ckanext-xloader/ckanext/xloader/plugin.py", line 7, in <module>
    from ckanext.datastore.interfaces import IDataDictionaryForm
ImportError: cannot import name 'IDataDictionaryForm' from 'ckanext.datastore.interfaces' (/srv/app/src/ckan/ckanext/datastore/interfaces.py)

Reviewing how https://github.com/qld-gov-au/ckanext-s3filestore/ works with iUploader interface.
Please see if just adding:

requirements.txt
ckantoolkit>=0.0.4
dev-requirements.txt
ckanapi==4.3

If the interface or methods you want to use do not exist in 2.9 you may need to wire in checks to skip the new functionality to make it backwards compatible.

See:
https://github.com/qld-gov-au/ckanext-s3filestore/blob/master/ckanext/s3filestore/uploader.py#L28

if toolkit.check_ckan_version(min_version='2.8'):
    from ckan.lib.uploader import ALLOWED_UPLOAD_TYPES
else:
    from cgi import FieldStorage
    if toolkit.check_ckan_version(min_version='2.7.0'):
        from werkzeug.datastructures import FileStorage as FlaskFileStorage
        ALLOWED_UPLOAD_TYPES = (FieldStorage, FlaskFileStorage)
    else:
        ALLOWED_UPLOAD_TYPES = (FieldStorage)

and maybe also in combination of
https://github.com/qld-gov-au/ckan/blob/d13b0e80726d133539d7067fee1f83d1e791e03b/ckan/logic/action/delete.py#L191

        # Don't break if old plugin/interface is found
        if hasattr(upload, "delete"):
            upload.delete(id)  # type: ignore
        else:
            log.warning("%s does not have delete function, could not cleanup: %s", type(upload).__name__, id)

duttonw · 2024-07-02T01:39:15Z

Hi @JVickery-TBS ,

Once you can resolve the merge conflict as well as why the tests is failing, this should be good to go

2024-05-14 17:22:31,450 INFO  [ckan.config.environment] Loading templates from /srv/app/src/ckan/ckan/templates
Traceback (most recent call last):
  File "/usr/bin/ckan", line 8, in <module>
    sys.exit(ckan())
  File "/usr/lib/python3.10/site-packages/click/core.py", line 1[13](https://github.com/ckan/ckanext-xloader/actions/runs/9083543297/job/24962461543?pr=215#step:5:14)0, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.10/site-packages/click/core.py", line 1054, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "/usr/lib/python3.10/site-packages/click/core.py", line 920, in make_context
    self.parse_args(ctx, args)
  File "/srv/app/src/ckan/ckan/cli/cli.py", line 121, in parse_args
    result = super().parse_args(ctx, args)
  File "/usr/lib/python3.10/site-packages/click/core.py", line 1613, in parse_args
    rest = super().parse_args(ctx, args)
  File "/usr/lib/python3.10/site-packages/click/core.py", line 1378, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "/usr/lib/python3.10/site-packages/click/core.py", line 2360, in handle_parse_result
    value = self.process_value(ctx, value)
  File "/usr/lib/python3.10/site-packages/click/core.py", line 2322, in process_value
    value = self.callback(ctx, self, value)
  File "/srv/app/src/ckan/ckan/cli/cli.py", line 131, in _init_ckan_config
    _add_ctx_object(ctx, value)
  File "/srv/app/src/ckan/ckan/cli/cli.py", line [14](https://github.com/ckan/ckanext-xloader/actions/runs/9083543297/job/24962461543?pr=215#step:5:15)0, in _add_ctx_object
    ctx.obj = CtxObject(path)
  File "/srv/app/src/ckan/ckan/cli/cli.py", line 57, in __init__
    self.app = make_app(raw_config)
  File "/srv/app/src/ckan/ckan/config/middleware/__init__.py", line 27, in make_app
    load_environment(conf)
  File "/srv/app/src/ckan/ckan/config/environment.py", line 69, in load_environment
    p.load_all()
  File "/srv/app/src/ckan/ckan/plugins/core.py", line [22](https://github.com/ckan/ckanext-xloader/actions/runs/9083543297/job/24962461543?pr=215#step:5:23)4, in load_all
    load(*plugins)
  File "/srv/app/src/ckan/ckan/plugins/core.py", line 240, in load
    service = _get_service(plugin)
  File "/srv/app/src/ckan/ckan/plugins/core.py", line 347, in _get_service
    return plugin.load()(name=plugin_name)
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line [24](https://github.com/ckan/ckanext-xloader/actions/runs/9083543297/job/24962461543?pr=215#step:5:25)64, in load
    return self.resolve()
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line 2470, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/__w/ckanext-xloader/ckanext-xloader/ckanext/xloader/plugin.py", line 7, in <module>
    from ckanext.datastore.interfaces import IDataDictionaryForm
ImportError: cannot import name 'IDataDictionaryForm' from 'ckanext.datastore.interfaces' (/srv/app/src/ckan/ckanext/datastore/interfaces.py)
Error: Process completed with exit code 1.

ckanext/xloader/tests/test_loader.py

- Updated various tests. - Added more gettext.

# Conflicts: # ckanext/xloader/loader.py ### RESOLVED.

- Updated post parser to set empty string type cells to `None` for parody with `load_csv`. - Updated some tests for new code. - Added check for ckan version `2.11` for the data dictionary form. - Updated conditions in `load_csv`.

- Fixed test for new output with `strip_extra_white`. - Fixed CKAN version pardoy of `info` vs `_info` in `update_datastore_info_field` (upstream issue).

JVickery-TBS · 2024-07-15T20:09:05Z

Just the one 2.9 test failing now, will have to get a 2.9 environment set up quickly and see if I can debug it in there. I know there are major differences between 2.10 datastore_info and 2.9 so I imagine it is something in there.

ckanext/xloader/plugin.py

ckanext/xloader/loader.py

- DS fields for ckan versions.

- Support current versions of CKAN for the DataDictionary form override for `strip_extra_white`. - Support current versions of CKAN for existing info before existing data dictionary custom fields.

ThrawnCA · 2024-07-16T22:16:47Z

ckanext/xloader/tests/test_jobs.py

@@ -81,7 +81,7 @@ def test_xloader_data_into_datastore(self, cli, data):
        with mock.patch("ckanext.xloader.jobs.get_response", get_response):
            stdout = cli.invoke(ckan, ["jobs", "worker", "--burst"]).output
            assert "File hash: d44fa65eda3675e11710682fdb5f1648" in stdout
-            assert "Fields: [{'id': 'x', 'type': 'text'}, {'id': 'y', 'type': 'text'}]" in stdout
+            assert "Fields: [{'id': 'x', 'type': 'text', 'strip_extra_white': True}, {'id': 'y', 'type': 'text', 'strip_extra_white': True}]" in stdout


We also need tests of the ability to disable this behaviour.

@ThrawnCA okay added test coverage for load_csv and load_table with 'strip_extra_white': False

...welp those didn't work. Worked locally on 2.9 and 2.11, will see if I can replicate and fix em up

@ThrawnCA I cannot seem to be able to replicate the failed tests in any of my environments.

- Added test coverage for no strip white extra space.

feat(logic): strip white space;

0d0dbea

- Strip white space from cell values when loading into DataStore.

JVickery-TBS mentioned this pull request May 7, 2024

Strip Trailing White Space open-data/ckanext-xloader#26

Merged

JVickery-TBS added 2 commits May 7, 2024 14:35

fix(logic): strip white space for load table;

616c948

- Fixed white space stripping for `load_table`.

fix(logic): strip white space for load table;

860ca9e

- Fixed white space stripping for `load_table`.

ThrawnCA requested changes May 8, 2024

View reviewed changes

ckanext/xloader/parser.py Outdated Show resolved Hide resolved

JVickery-TBS added 3 commits May 8, 2024 13:53

fix(logic): strip white space;

88f96a8

- Condition stripping on `str` type.

Merge branch 'master' into feature/strip-white-space

341ec1e

fix(tests): new code for tests;

21a1ece

- Align test with new strip code. - Write headers to stream for `load_csv`.

ThrawnCA approved these changes May 8, 2024

View reviewed changes

fix(logic): load csv white space;

a6ab0a0

- Extended tabulator stream iterator to strip white space from cell values.

JVickery-TBS added 3 commits May 14, 2024 15:49

feat(logic): added strip_extra_white info field;

54f87e0

- Added `strip_extra_white` info field and form fields. - Added validator for `strip_extra_white`. - Used `strip_extra_white` to control stripping white space.

feat(logic): added strip_extra_white field;

50080ea

- Added `strip_extra_white` field and form fields. - Used `strip_extra_white` to control stripping white space.

fix(logic): minor logic fixes;

116c29f

- Minor logic fixes for the new `strip_extra_white` field.

JVickery-TBS requested a review from ThrawnCA May 14, 2024 17:20

duttonw approved these changes May 14, 2024

View reviewed changes

ThrawnCA reviewed Jul 2, 2024

View reviewed changes

ckanext/xloader/tests/test_loader.py Outdated Show resolved Hide resolved

JVickery-TBS added 4 commits July 15, 2024 15:34

feat(tests,i18n): updated tests;

43b9f94

- Updated various tests. - Added more gettext.

Merge branch 'master' into feature/strip-white-space

06ee48a

# Conflicts: # ckanext/xloader/loader.py ### RESOLVED.

fix(tests,logic): misc fixes;

c00fb5a

- Updated post parser to set empty string type cells to `None` for parody with `load_csv`. - Updated some tests for new code. - Added check for ckan version `2.11` for the data dictionary form. - Updated conditions in `load_csv`.

fix(tests,logic): new output and parody;

669930e

- Fixed test for new output with `strip_extra_white`. - Fixed CKAN version pardoy of `info` vs `_info` in `update_datastore_info_field` (upstream issue).

duttonw reviewed Jul 15, 2024

View reviewed changes

ckanext/xloader/plugin.py Show resolved Hide resolved

ThrawnCA requested changes Jul 15, 2024

View reviewed changes

ckanext/xloader/loader.py Show resolved Hide resolved

JVickery-TBS added 3 commits July 16, 2024 13:57

fix(logic): ckan versioning;

3263bab

- DS fields for ckan versions.

fix(logic): ckan versioning;

7cb6a84

- DS fields for ckan versions.

feat(templates,logic): pre-datadictionary implement;

bf2e939

- Support current versions of CKAN for the DataDictionary form override for `strip_extra_white`. - Support current versions of CKAN for existing info before existing data dictionary custom fields.

ThrawnCA reviewed Jul 16, 2024

View reviewed changes

JVickery-TBS added 2 commits July 22, 2024 19:45

feat(tests): add coverage;

d6de1b1

- Added test coverage for no strip white extra space.

Merge branch 'master' into feature/strip-white-space

8564cdd

ThrawnCA approved these changes Aug 2, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strip White Space from Cell Values #215

Strip White Space from Cell Values #215

JVickery-TBS commented May 7, 2024 •

edited by duttonw

Loading

duttonw commented May 7, 2024

JVickery-TBS commented May 8, 2024

duttonw commented May 8, 2024 •

edited

Loading

duttonw commented May 8, 2024

JVickery-TBS commented May 9, 2024

duttonw commented May 9, 2024

JVickery-TBS commented May 10, 2024

wardi commented May 10, 2024 •

edited

Loading

wardi commented May 10, 2024

wardi commented May 10, 2024

duttonw commented May 12, 2024

duttonw commented May 12, 2024

JVickery-TBS commented May 14, 2024

JVickery-TBS commented May 14, 2024

duttonw left a comment

duttonw commented May 18, 2024

duttonw commented Jul 2, 2024

JVickery-TBS commented Jul 15, 2024

ThrawnCA Jul 16, 2024

JVickery-TBS Jul 22, 2024

JVickery-TBS Jul 22, 2024

JVickery-TBS Jul 23, 2024

Strip White Space from Cell Values #215

Are you sure you want to change the base?

Strip White Space from Cell Values #215

Conversation

JVickery-TBS commented May 7, 2024 • edited by duttonw Loading

Use case

Changes being made

duttonw commented May 7, 2024

JVickery-TBS commented May 8, 2024

duttonw commented May 8, 2024 • edited Loading

Other things in this space

duttonw commented May 8, 2024

JVickery-TBS commented May 9, 2024

duttonw commented May 9, 2024

JVickery-TBS commented May 10, 2024

wardi commented May 10, 2024 • edited Loading

wardi commented May 10, 2024

wardi commented May 10, 2024

duttonw commented May 12, 2024

duttonw commented May 12, 2024

JVickery-TBS commented May 14, 2024

JVickery-TBS commented May 14, 2024

duttonw left a comment

Choose a reason for hiding this comment

duttonw commented May 18, 2024

duttonw commented Jul 2, 2024

JVickery-TBS commented Jul 15, 2024

ThrawnCA Jul 16, 2024

Choose a reason for hiding this comment

JVickery-TBS Jul 22, 2024

Choose a reason for hiding this comment

JVickery-TBS Jul 22, 2024

Choose a reason for hiding this comment

JVickery-TBS Jul 23, 2024

Choose a reason for hiding this comment

JVickery-TBS commented May 7, 2024 •

edited by duttonw

Loading

duttonw commented May 8, 2024 •

edited

Loading

wardi commented May 10, 2024 •

edited

Loading