Another unusable venue site #174
-
Here's the Gist. The only issue I can really see with this is that it opens up the possibility of supply chain attacks, but I suppose there could be some kind of trusted contributor status.
-
I like the idea because, as you mentioned elsewhere, croncert's spirit is really one of capturing the smaller event locations. On the other hand, one of my main goals with this project is automation: ideally, you'd have as little work as possible to grab as much (relevant) data as possible. Even right now, with the config file instead of new code for each new source, in my opinion there's still too much work involved in adding new locations (you're the first real contributor, so I guess that could also be read as a sign that it's too much effort). And I know the more manual work I try to eliminate, the more complicated it gets 😅 but I still want to strive towards this goal. That being said, if you are willing to put in the extra effort of copying the data from such unusable sites to somewhere more usable, you are more than welcome to do so. Is your idea then to scrape the Gist with goskyr? The specific site you're mentioning above, https://la-datcha.ch/, might even be scraped with goskyr despite its inconsistent HTML. I just did a quick test and I think I get some usable output with the following config.
Maybe a few different regexes could then do the job of extracting the date, title and description? We would miss a small number of events, but in my opinion that's acceptable considering the lower maintenance effort. There'd also be some duplicates in this list, but if those have the exact same values for each field they should be deduped on submission to the event API.
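To make the regex idea a bit more concrete, here's a minimal Go sketch, completely independent of goskyr's internals. The dd.mm.yyyy date pattern, the " - " title/description separator and the field names are just assumptions for illustration, and the dedup is simply "drop events whose fields are all identical":

```go
// Rough sketch of the regex + dedup idea. The date pattern, separator and
// field names are assumptions, not anything goskyr or the event API defines.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type event struct {
	Date, Title, Description string
}

var dateRe = regexp.MustCompile(`\b(\d{1,2}\.\d{1,2}\.\d{4})\b`)

func extract(lines []string) []event {
	seen := make(map[event]bool)
	var events []event
	for _, line := range lines {
		m := dateRe.FindStringSubmatchIndex(line)
		if m == nil {
			continue // no recognizable date, skip the line
		}
		rest := strings.TrimSpace(line[m[1]:])
		title, desc, _ := strings.Cut(rest, " - ")
		e := event{
			Date:        line[m[2]:m[3]],
			Title:       strings.TrimSpace(title),
			Description: strings.TrimSpace(desc),
		}
		if !seen[e] { // exact duplicates collapse into one entry
			seen[e] = true
			events = append(events, e)
		}
	}
	return events
}

func main() {
	fmt.Println(extract([]string{
		"21.06.2025 Concert X - folk trio, doors 20h",
		"21.06.2025 Concert X - folk trio, doors 20h", // duplicate, dropped
	}))
}
```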
-
I came up with this:
But it's still a mess. It misses some events and picks up others twice. I think the only thing to do for this site is to volunteer to help them out. Now, to meet the goal of a zero-config setup, I think the only possible approach is to build some kind of machine learning model. I found this: https://upstackhq.com/blog/software-development/golang-machine-learning so it looks like golang resources exist, but my only experience with machine learning is from the 2019 SANS Holiday Hack Challenge (https://0xdf.gitlab.io/holidayhack2019/8). Still, I'm willing to learn.
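I haven't gotten much further than imagining what the input to such a model would even look like. As a very rough sketch (plain golang.org/x/net/html, no actual ML library, and the feature set is purely my guess), I'm picturing something like extracting simple per-node features that a classifier could later learn to label as title, date or noise:

```go
// Not a model, just the kind of per-node features I imagine feeding into one:
// tag name, class attribute, amount of direct text, number of children.
// Everything here is my own assumption, not anything goskyr does today.
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

type features struct {
	Tag      string
	Class    string
	TextLen  int
	Children int
}

func collect(n *html.Node, out *[]features) {
	if n.Type == html.ElementNode {
		f := features{Tag: n.Data}
		for _, a := range n.Attr {
			if a.Key == "class" {
				f.Class = a.Val
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			f.Children++
			if c.Type == html.TextNode {
				f.TextLen += len(strings.TrimSpace(c.Data))
			}
		}
		*out = append(*out, f)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		collect(c, out)
	}
}

func main() {
	doc, err := html.Parse(strings.NewReader(
		`<div class="post"><p>21.06.2025</p><div>Concert X</div></div>`))
	if err != nil {
		panic(err)
	}
	var fs []features
	collect(doc, &fs)
	for _, f := range fs {
		fmt.Printf("%+v\n", f)
	}
}
```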
-
One of the spots I'd really like to add is La Datcha, Lausanne, because they host some cool concerts, etc.
But the website https://la-datcha.ch/ basically cannot be scraped. The only thing that's consistent in the HTML is that it is inconsistent. I mean, it looks nice on the site, but under the hood the HTML is just all over the place, sometimes using `<p>` and sometimes `<div>` to accomplish the same thing. There's no attempt at using semantic classes or anything else. It appears to be a WordPress site, but they clearly aren't using any of the facilities WordPress provides for event management.
Anyway, I don't see any way to do this other than doing it myself, like with Bleu Lézard. I might ask them if I can take over the site.
Or, what about this idea: I could stand up a site publishing events like this using the JSON format normally output by goskyr. I could even do this in a Gist, here on GitHub.
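To make that concrete, here's roughly how a consumer of such a Gist could look in Go. This is only a sketch: the Gist URL is a placeholder and the field names are my guesses, not necessarily goskyr's exact output schema.

```go
// Hypothetical consumer side of the Gist idea. The URL is a placeholder and
// the field names are guesses, not goskyr's documented output format.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type event struct {
	Location string `json:"location"`
	Title    string `json:"title"`
	Date     string `json:"date"`
	URL      string `json:"url"`
}

func main() {
	// Raw URL of a (hypothetical) Gist holding a JSON array of events.
	resp, err := http.Get("https://gist.githubusercontent.com/someuser/abc123/raw/la-datcha.json")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var events []event
	if err := json.NewDecoder(resp.Body).Decode(&events); err != nil {
		panic(err)
	}
	for _, e := range events {
		fmt.Printf("%s | %s | %s\n", e.Date, e.Title, e.Location)
	}
}
```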