Merge "sibling" RSS item that belong to a multi-entry docket item #217

johnhawkinson · 2018-05-18T04:31:30Z

Fixes freelawproject/recap#248.

Sorry this history is a little messier than it should be, but it's all going to be squashed so there's probably not much point in cleaning it up. I am really sad that we don't get to keep 49cbaf2 which the non-Python progarmmer in me really likes as a solution to avoid if expressions that don't line up:

+        def twin_entries(a, b):
+            fields = ['title', 'link', 'id', 'published']
+            matching_fields = (a[f] == b[f] for f in fields)
+            return all(matching_fields)
@@ -122,10 +127,7 @@ def data(self):
             if (
                     preventry and
                     prevdata[u'docket_entries'] and
-                    entry.title == preventry.title and
-                    entry.link == preventry.link and
-                    entry.id == preventry.id and
-                    entry.published == preventry.published and
+                    twin_entries(entry, preventry) and
                     len(de) > 0  # xxx
             ):

I'm also not 100% the explanation in a978447 is accurate; and if it is, why was that the right way to go?:

-        """Override this to create a list of docket-like objects instead of the
-         usual dict that is usually provided by the docket report.
+        """Return a list of docket-like objects instead of the usual dict that
+         is usually provided by the BaseDocketReport superclass.

Attempt to explain why we override this function. I...actually don't understand why we do, so this explanation may be wrong. It seems like we could have fit within the BaseDocketReport framework of returning our information as `docket_entries` rather than `data` but maybe I'm missing something.

Snazzier than setting an old= var at the end of the loop.

prevdata -> lastdata (well, not really) preventry -> previous_entry Elminate `de` and just use data['docket_entries'] which we moved up

Because otherwise tests break. I think there may be a platform issue here, but this fix is obviously correct, so I'm not worried about masking it.

Update per freelawproject#217 (comment)

mlissner · 2018-05-18T05:10:18Z

juriscraper/pacer/rss_feeds.py

+            # the previous entry's and continue the loop.
+            if (
+                data_list and data_list[-1][u'docket_entries']
+                and data[u'docket_entries']


Seems like these two lines are of a different nature than the rest of the equality lines. Is it worth pulling them out of the if statement here? For example, maintain the logic, but do these types of tests in an earlier if statement? Something like:

if data_list and data_list[-1][u'docket_entries'] and data[u'docket_entries']: if all([entry.title == previous_entry.title, ...]):

It might allow you to put the comments about equality closer to where the equality is tested.

Err...I think it is advantageous to not materially increase the nesting level of logic with gratuitous nested ifs.
The comments didn't seem to me to be far removed from equality — to me they made sense before the word if.

Perhaps your reaction is to the fact that the comments don't discuss the first 2 lines of the if expressions…that's because I was focusing on what was maybe a little specialized. But certainly the comment could be broadened to discus the first 2 lines, i.e. "If there are docket entries associated with the previous entry and the current entry, and if those two entries match in metdata…"

But also, there's nothing stopping us from putting comments in the midst of the if expressions, so if there's really a desire for "Instructions at the point of need" then that can just happen, there's no need to change the structure of the if statement.

So if you want to change the structure of the if, probably a better reason is required. I actually did try the all() style and concluded erroneously that it didn't work here (I think I omitted the []), but that said it's not terribly familiar to me so I wouldn't vote for it though I don't oppose it.

Yeah, the extraneous nesting bugged me too. If you pull all this logic into a function, you could reverse the logic here and do:

if not all([this, that, the other thing]): return # Or what have you

That'd avoid the deep logic and keep the two things apart?

mlissner · 2018-05-18T05:13:51Z

juriscraper/pacer/rss_feeds.py

+                and entry.id == previous_entry.id
+                and entry.published == previous_entry.published
+            ):
+                data_list[-1][u'docket_entries'][0][u'short_description'] += (


I feel like a line like this deserves a comment, if even a short one. Or maybe break the bits into variables for readability?

previous_entry_desc = data_list[-1][u'docket_entries'][0][u'short_description'] current_entry_desc = data[u'docket_entries'][0][u'short_description'] combined_desc = ' AND '.join([previous_entry_desc, current_entry_desc]) data_list[-1][u'docket_entries'][0][u'short_description'] = combined_desc

That's a bit wordy, but clearer. This multiline thing is pretty bad.

I actually don't think that is clearer.
Forcing the repetition of data_list[-1][u'docket_entries'][0][u'short_description'] is something I would like to avoid, because it requires the reader to parse the 5 levels of hierarchy in their brain twice, and it's not obvious until inspection that they are the same. Edit: esp. when they're not consecutive.

Of course, the flipside is that I suspect you have an unfamiliarity with the += operator and do not love it the way you should, and no amount of cajoling will instill that love in you until you have worked with it for a long time. To my eye the avoidance of complex repetition is the strongest case for the so-called "addition assignment" operator. (The real problem is you can't do many of the usual fixes because the complexity is in an lvalue not just an rvalue.

And I definitely don't think that ' AND '.join([a, b])
is somehow clearer than a+' AND '+b.

I'll think about how to write this better.

At one point I had prevdata (or previous_data) instead of data_list[-1] but I axed it because I felt it wasn't helping. (d59e007)

I'm familiar with the +=. It's more about how you have three lines of code to do one small thing. That's a smell, but I'll await further approaches.

Me:

I suspect A and B

Mike:

Not A.

So B then. As I feared.

So, I suspect you won't go for this one, but it kind of jumped out at me so I thought I'd offer it here:

def entry(d): return d[u'docket_entries'][0] def get_append_shorttext(de, text=''): de[u'short_description'] += text return de[u'short_description'] previous_entry = entry(data_list[-1]) current_text = get_append_shorttext(entry(data)) get_append_shorttext(previous_entry, ' AND ' + current_text)

Think of it as a conversation-starter :)

Yeah, there's a lot of indirection in there, and get_append_shorttext is either getting or setting depending on the text argument....not great.

Hm. What about something like:

previous_entry = data_list[-1][u'docket_entries'][0] current_entry = data[u'docket_entries'][0] previous_entry[u'short_description'] += ' AND ' + current_entry[u'short_description']

That gets us back down to three lines, avoids repetition of the 5 levels of hierarchy, and uses += instead of a join? No additional helper functions needed?

Well, while I'm functionally happen with that, it fails PEP-8. And it's actually kind of tricky to fix, and it's why I introduced the helper function, which I agree is totally ridiculous and I'm not sure I could dignify by calling it a strawman:

# 1 2 3 4 5 6 7 8 #2345678901234567890123456789012345678901234567890123456789012345678901234567890 previous_entry[u'short_description'] += ' AND ' +current_entry[u'short_description'] #1 previous_entry[u'short_description'] += ( #2 ' AND ' + current_entry[u'short_description']) previous_entry[u'short_description'] += ( #3 ' AND ' + current_entry[u'short_description']) previous_entry[u'short_description'] += ( #4 ' AND ' + current_entry[u'short_description'] ) previous_entry[u'short_description'] += \ ' AND ' + current_entry[u'short_description'] #5

Although I find 5 tempting, I think that backslash terminates are disfavored (although not against PEP8?).
And of 2,3,4 I dunno. I had the sense you wanted to avoid multiple lines for this expression, and yet there's a certain additional clarity. What's your preference?

I think 5 has my vote followed closely by 1 (a minor PEP8 violation isn't the end of the world in my book). 2, 3, and 4 just don't seem worth it. I leave the rest of this to you. Let's land this.

mlissner

Looks good. One other thing that doesn't work in a line comment: What about pulling all this logic, comments, etc into its own little method/function so that the data method is still small? This feels like a lot for it.

johnhawkinson · 2018-05-18T05:47:24Z

What about pulling all this logic, comments, etc into its own little method/function so that the data method is still small?

Well, doing that would basically pull 90% of the bulk out of data() into newmethod() and not make newmethod() any smaller. It didn't seem like the complexity really affected one's ability to understand what data() is doing, but I can see the structural argument for it.

I don't think there's much benefit but I'll give it a whirl.

It's super-confusing to talk about "docket entries" as well as "RSS entries" and variables called "entry" don't help disambiguate this. Unfortunately we can't do much about the fact that feedparser's dict of items is `.entries` but hopefully this is still a readability improvement.

johnhawkinson · 2018-05-18T07:28:29Z

What about pulling all this logic, comments, etc into its own little method/function so that the data method is still small?

Well, doing that would basically pull 90% of the bulk out of data() into newmethod() and not make newmethod() any smaller. It didn't seem like the complexity really affected one's ability to understand what data() is doing, but I can see the structural argument for it.

I don't think there's much benefit but I'll give it a whirl.

Err, so...maybe you had some better idea, but doing this really doesn't seem [to me] to work: johnhawkinson@ad5be23 (specifically these lines).

I mean, the function needs access to a lot of internal state from the principle for loop of data(), which already hurts, so we have to pass a lot of args (unless we make it sub-function of data(), which I guess we could, but...)

And it also has to tell that for loop whether to iterate normally or to continue to avoid adding the item to data_list.

So I don't think this is a great way to go. But I'm a little tired, so maybe I'm missing something important.

On the other hand, we could also duplicate the for loop (thus reverting data()'s for loop to one that doesn't use complexity like previous_and_next(). It's not like performance is critical, right? But...

Hoping for a better suggestion...

mlissner

Hoping for a better suggestion...

I can't figure out Github's reply mechanism for your comment, but what about doing this as a separate step after we loop through and create all the dockets. So, instead of injecting doing the merging in the currently loop, we create a separate function after the loop terminated that does the merging?

mlissner · 2018-05-18T21:14:45Z

juriscraper/pacer/rss_feeds.py

+                and entry.id == previous_entry.id
+                and entry.published == previous_entry.published
+            ):
+                data_list[-1][u'docket_entries'][0][u'short_description'] += (


Yeah, there's a lot of indirection in there, and get_append_shorttext is either getting or setting depending on the text argument....not great.

Hm. What about something like:

previous_entry = data_list[-1][u'docket_entries'][0] current_entry = data[u'docket_entries'][0] previous_entry[u'short_description'] += ' AND ' + current_entry[u'short_description']

That gets us back down to three lines, avoids repetition of the 5 levels of hierarchy, and uses += instead of a join? No additional helper functions needed?

johnhawkinson · 2018-05-21T14:40:06Z

I can't figure out Github's reply mechanism for your comment,

Ha! It was just a regular "reply" at the bottom of the page.

but what about doing this as a separate step after we loop through and create all the dockets. So, instead of injecting doing the merging in the currently loop, we create a separate function after the loop terminated that does the merging?

Umm, sure. It's what I suggested above:

On the other hand, we could also duplicate the for loop (thus reverting data()'s for loop to one that doesn't use complexity like previous_and_next(). It's not like performance is critical, right? But...

I figured the ugliness of looping twice would doom this idea, but if it's your preference I can do that.

mlissner · 2018-05-21T17:07:24Z

I figured the ugliness of looping twice would doom this idea, but if it's your preference I can do that.

I actually don't mind processing data multiple times like that. Reminds me of map/reduce/filter kinds of processing.

CLAassistant · 2021-12-18T00:42:00Z

All committers have signed the CLA.

flooie · 2023-01-15T00:45:57Z

@johnhawkinson hi 👋

johnhawkinson added 14 commits May 18, 2018 00:24

rss_feeds: Merge adjacent qualifying entries rather than overwriting

3b60e8f

rss_feeds: use twin_entries() instead of stack of a==b

49cbaf2

Use previous_and_next()

1a1c476

Snazzier than setting an old= var at the end of the loop.

Bye bye twin_entries(), you're not readable

358bf2d

prevdata->lastdata, preventry -> previous, de -> data[de]

d59e007

prevdata -> lastdata (well, not really) preventry -> previous_entry Elminate `de` and just use data['docket_entries'] which we moved up

len(x) -> x

e7ace3b

Remove xxx, add assert() invariant

05d7078

linebreak on AND for clarity

5ea25af

reorder if protasises for sense

84d6050

Check data_list instead

a6a9c08

rss_feeds.py/docket_entries(): u'' not ''

b43395c

Because otherwise tests break. I think there may be a platform issue here, but this fix is obviously correct, so I'm not worried about masking it.

tests/.../rss_feeds: Update for multi-entry merging.

d3d7526

rss_feeds.py:data() improve docstring

6ce975c

Update per freelawproject#217 (comment)

mlissner reviewed May 18, 2018

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge "sibling" RSS item that belong to a multi-entry docket item #217

Merge "sibling" RSS item that belong to a multi-entry docket item #217

johnhawkinson commented May 18, 2018

mlissner May 18, 2018

johnhawkinson May 18, 2018

mlissner May 18, 2018

mlissner May 18, 2018

johnhawkinson May 18, 2018 •

edited

Loading

mlissner May 18, 2018

johnhawkinson May 18, 2018

johnhawkinson May 18, 2018

mlissner May 18, 2018

johnhawkinson May 21, 2018

mlissner May 21, 2018

mlissner left a comment

johnhawkinson commented May 18, 2018

johnhawkinson commented May 18, 2018

mlissner left a comment

mlissner May 18, 2018

johnhawkinson commented May 21, 2018

mlissner commented May 21, 2018

CLAassistant commented Dec 18, 2021 •

edited

Loading

flooie commented Jan 15, 2023

Merge "sibling" RSS item that belong to a multi-entry docket item #217

Are you sure you want to change the base?

Merge "sibling" RSS item that belong to a multi-entry docket item #217

Conversation

johnhawkinson commented May 18, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

johnhawkinson May 18, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mlissner left a comment

Choose a reason for hiding this comment

johnhawkinson commented May 18, 2018

johnhawkinson commented May 18, 2018

mlissner left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

johnhawkinson commented May 21, 2018

mlissner commented May 21, 2018

CLAassistant commented Dec 18, 2021 • edited Loading

flooie commented Jan 15, 2023

johnhawkinson May 18, 2018 •

edited

Loading

CLAassistant commented Dec 18, 2021 •

edited

Loading