Added column-filter functionality #16

Open
wants to merge 296 commits into base: master

reenberg
Contributor

@reenberg reenberg commented Aug 1, 2017

<tl;dr> This adds an option to filter out entire columns and rows based on column data in a .csv file, for when you only want to create a small table from a large .csv file.

If the column_filter is not specified or is an empty list, then the table is
not modified. Otherwise the raw_table_list is filtered based on the values in
the column_filter (i.e., columns whose indexes are not listed in the filter are removed).

Each element in the column_filter list must be an integer or a dictionary
with at least the key 'col'.

Specifying an integer in the column_filter list makes sure that the column
with that index is kept (the first column is index 0, as in Python list indexing).

Specifying a dictionary additionally allows one of the following optional
keys (note: the keys are mutually exclusive, and specifying more than one
has undefined behaviour).

  • filter: filters out (removes) the row if the content of this column
    doesn't match (exact string comparison between the value of this key and
    the content of the cell). The value may also be a list of strings to be matched.

  • regex: filters out (removes) the row if the content of this column
    doesn't match. The value of this key is passed directly to
    re.match(pattern, string) as the pattern, with the cell value as the
    string. Note: we currently assume that only a small number of regexes are
    used, so we don't compile them ourselves but rely on the caching built
    into the re module to handle it for us.
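The rules above can be sketched roughly as follows. This is a simplified illustration, not the code in this pull request: the function name `apply_column_filter`, treating the header row like any other row, and emitting columns in the order they are listed are all assumptions.

```python
import re


def apply_column_filter(rows, column_filter=None):
    """Keep only the listed columns and drop rows failing a 'filter'/'regex' test.

    Hypothetical helper for illustration; `rows` is a list of lists of strings.
    """
    if not column_filter:
        return rows  # missing or empty filter: the table is not modified

    # Normalise every entry to a dict that has at least the key 'col'.
    specs = [{"col": e} if isinstance(e, int) else e for e in column_filter]

    def keep(row):
        for spec in specs:
            cell = row[spec["col"]]
            if "filter" in spec:
                allowed = spec["filter"]
                if isinstance(allowed, str):
                    allowed = [allowed]  # a single string behaves like a one-item list
                if cell not in allowed:  # exact string matching
                    return False
            elif "regex" in spec:
                # re.match anchors at the start of the cell; re caches compiled patterns.
                if re.match(spec["regex"], cell) is None:
                    return False
        return True

    return [[row[s["col"]] for s in specs] for row in rows if keep(row)]


table = [["A", "B", "C"], ["1", "2", "3"]]
example = [0, {"col": 1, "regex": r".*B|[\d]"}, {"col": 2, "filter": ["C", "3"]}]
unchanged = apply_column_filter(table, example)
first_and_last = apply_column_filter(table, [0, 2])
```

With the filter from the example in this description, both rows pass every test and all three columns are listed, so `unchanged` equals the input table; `[0, 2]` keeps only the first and last columns.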

Example: This example won't filter out any column, but it demonstrates the
three different ways that you may specify a column-filter. Try changing any
one of them to see how columns or rows are filtered from the resulting
table.

``` {.table}
---
caption: "*Bar* table"
markdown: yes
column-filter:
    - 0
    - col: 1
      regex: ".*B|[\\d]"
    - col: 2
      filter: ['C', '3']
---
A,B,C
1,2,3
```

ickc and others added 16 commits January 13, 2017 21:14
because contaminated panflute 1.9.7 and 1.10 has been deleted.
pip no longer support py3.2
@ickc
Owner

ickc commented Aug 1, 2017

I think we need more discussion on this for the syntax of this. You might try to ask people in Markdown, tables and CSV - Google Groups to see if there are any suggestions there, and/or open an issue here (I'll open one soon). For now I'll put this pull request on hold.

@ickc ickc mentioned this pull request Aug 1, 2017
@ickc
Owner

ickc commented Aug 1, 2017

And remember to include tests in pull requests. There are 2 kinds of tests here: one is a Python unit test that calls the functions and compares the results. The other runs pandoc directly and checks that the output native AST is the same as a predefined one (usually generated automatically and just eyeballed to verify it's doing what it's supposed to do).

@ickc ickc mentioned this pull request Aug 2, 2017
@sergiocorreia

I think we need more discussion on this for the syntax of this

I also agree. Ideally, you want a solution that is both general and simple to implement. For instance, allowing lambdas that will be eval()uated at runtime.

@ickc
Owner

ickc commented Aug 13, 2017

@reenberg, can you briefly describe what exactly you want to accomplish? i.e. let's forget about syntax and how to do it for the moment, but give some small before & after examples of what you want to do. In particular, how you would want the regex to behave.

e.g. the simplest kind of filter will be 1,2,3,..... filtered to 1,2 only, extracting only the first 2 columns.

@reenberg
Contributor Author

My current issue is that I'm writing a document where I have a spreadsheet of events.

This actually started out as a .csv file that I edited with a spreadsheet editor, but it has now evolved such that I found the need for formulas (time calculations, column concatenation) and conditional formatting (to easily show groups of rows, etc.), and thus it is now a .ods document that I export to .csv.

The .csv file describes all the event data, such as type, start, end, various kinds of descriptions, work loads, etc.
I use this information to generate various pieces of information in my main document. One example is a table of specific event types and some of their descriptions.

Thus my need is specifically to be able to keep only some of the columns (e.g., 1,3,4,6,7), and then I also need to filter the rows, such that I only get the rows concerning the specific event types.

This has previously been dealt with by some nasty LaTeX macros that I just couldn't be bothered to maintain any more.

My initial implementation with the 'filter' and 'regex' properties was just what came to my mind when coding it. However, specifically I'm using the regex right now to easily filter out 'event', 'event2' and 'event3'. I use suffix numbering of the event type to have the events in different colours when generating some of the other overviews (think something like graphs).
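As a small illustration of that use case (the pattern and the extra type names 'meeting' and 'my event' are made up for this sketch, not taken from my actual data), a single pattern passed to `re.match` covers the base type and its numbered variants:

```python
import re

# Hypothetical event types; only the first three should be kept.
event_types = ["event", "event2", "event3", "meeting", "my event"]

# re.match anchors at the start of the string; `$` makes the suffix check strict.
pattern = r"event\d*$"

matching = [t for t in event_types if re.match(pattern, t)]
```

Because `re.match` only matches at the beginning of the cell, 'my event' is rejected even though it contains the word.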

* Changed column_filter to table_filter.
* Changed the filter into a generator.
@alerque
Copy link
Contributor

alerque commented Aug 28, 2019

I'm accomplishing something similar using CSVKit, specifically csvcut, to get just the columns I want in a preprocessing step before dumping the results into the markdown. There are quite a few tools with similar filtering capabilities, including Python-based ones. In general I think this workflow is better: I would be skeptical of putting a bunch of active code in the content of my data, and would be skeptical of Pantable if it was trying to be a full-fledged data manipulation tool rather than just a format conversion tool.
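For comparison, the same kind of preprocessing step can be sketched with nothing but the Python standard library. This is only an illustration of the idea, not CSVKit itself: the function name `cut_and_grep`, its parameters and the sample data are all hypothetical.

```python
import csv
import io


def cut_and_grep(csv_text, keep_columns, match_column, wanted):
    """Keep selected columns, and only rows whose `match_column` value is in `wanted`.

    Hypothetical helper; the header row is always kept.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([header[i] for i in keep_columns])
    for row in reader:
        if row[match_column] in wanted:
            writer.writerow([row[i] for i in keep_columns])
    return out.getvalue()


# Made-up sample data for the sketch.
data = "type,start,end\nevent,09:00,10:00\nbreak,10:00,10:15\n"
small = cut_and_grep(data, keep_columns=[0, 1], match_column=0, wanted={"event"})
```

The reduced CSV text in `small` is what would then be pasted or piped into the table block.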

@reenberg Why do you think this should be implemented in Pantable itself?

@ickc
Owner

ickc commented Aug 28, 2019

The “pandoc way” to accomplish a task like this, without over bloating a filter, is to have another filter processing the filtering of the csv before pantable (ie piping a filter before pantable.) But inevitably this other filter before pantable has to be designed for pantable (eg which class to use.) So it is not strictly composable (ie not entirely independent of pantable.) So this hypothetical other filter is more like a pantable plugin, and hence may be why people want to put them together.

I think the solution you mention has to go through the shell (eg to me I’d use a makefile with an intermediate file chaining them together.) A solution like this is not universal. Also, a build like this makes the document less reproducible (in the sense that more details on how the document is built are needed.)

@reenberg
Contributor Author

It's always nice that someone cares, even if it's just shy of 2 years since I left a reply to your comments @ickc.

To be honest, I don't think that I knew about CSVkit back then. And I guess I just fell victim to the classic "everything looks like a nail when you have a hammer". I don't think that my proposed changes do anything more than what can be achieved with a good combination of csvcut and csvgrep. So with that in mind this can be discarded.

However, I remember thinking that it was cleaner not having to set up an "elaborate build pipeline"/Makefile to carve out all the intermediate files that I needed back then. It felt smoother to have the people writing the document and invoking pantable be able to just specify what data they needed from the csv file inside markdown. Not everyone is comfortable piping unix tools.

@ickc ickc force-pushed the master branch 2 times, most recently from c51c4a4 to accb831 Compare November 10, 2020 01:55

4 participants