add german media (#23) #24

Merged
merged 122 commits into from
Dec 26, 2024
schochastics
Contributor

No description provided.

@schochastics
Contributor Author

schochastics commented Oct 15, 2024

@JBGruber there are many articles at "Der Spiegel" that do not seem to have an author.

rss_url <- "https://www.spiegel.de/schlagzeilen/index.rss"
test_df <- paperboy::pb_collect(rss_url)
#> ℹ 1 unique URLs provided
#> ✔ 1 unique URLs provided [32ms]
#> 
#> ⠉ Fetching pages...
#> ✔ Fetching pages... [1.2s]
#> 
#> ℹ Parsing RSS feeds
#> ✔ Parsing RSS feeds [2.1s]
#> 
#> ℹ 30 pages from 1 domain collected.
#> ✔ 30 pages from 1 domain collected. [40ms]
#> 
#> ℹ no links had issues.
#> ✔ no links had issues. [36ms]
#> 
test_df_parsed <- paperboy:::test_parser(test_df)
#> ℹ Trying to parse raw data
#> ! No parser for domain spiegel.de yet, attempting generic approach.
#> ℹ Checking results
test_df_parsed$author
#>  [1] "DER SPIEGEL, "                                          
#>  [2] "Leo Klimm, Simon Hage, Alexander Demling, DER SPIEGEL, "
#>  [3] "DER SPIEGEL, "                                          
#>  [4] "DER SPIEGEL, "                                          
#>  [5] "DER SPIEGEL, "                                          
#>  [6] "David Böcking, DER SPIEGEL, "                           
#>  [7] "DER SPIEGEL, "                                          
#>  [8] "DER SPIEGEL, "                                          
#>  [9] "Regina Steffens, DER SPIEGEL, "                         
#> [10] "DER SPIEGEL, "                                          
#> [11] "DER SPIEGEL, "                                          
#> [12] "DER SPIEGEL, "                                          
#> [13] "DER SPIEGEL, "                                          
#> [14] "DER SPIEGEL, "                                          
#> [15] "DER SPIEGEL, "                                          
#> [16] "DER SPIEGEL, "                                          
#> [17] "DER SPIEGEL, "                                          
#> [18] "DER SPIEGEL, "                                          
#> [19] "DER SPIEGEL, "                                          
#> [20] "DER SPIEGEL, "                                          
#> [21] "DER SPIEGEL, "                                          
#> [22] "DER SPIEGEL, "                                          
#> [23] "Marvin Rishi Krishan, DER SPIEGEL, "                    
#> [24] "Deike Diening, DER SPIEGEL, "                           
#> [25] "Dunja Ramadan, Ghada Alkurd, DER SPIEGEL, "             
#> [26] "Marc Pitzke, DER SPIEGEL, "                             
#> [27] "Oliver Trenkamp, DER SPIEGEL, "                         
#> [28] "DER SPIEGEL, "                                          
#> [29] "Nadine Wolter, DER SPIEGEL, "                           
#> [30] "DER SPIEGEL, "

Created on 2024-10-15 with reprex v2.1.1

But the author column has "DER SPIEGEL" added for articles without an author (and sometimes also when there actually is an author). I verified that pb_deliver.spiegel.de() returns an empty author field in that case. Is there a function that adds "DER SPIEGEL" retroactively?
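
For downstream analysis, the outlet name and the trailing separator can be stripped from the author column after parsing. The following is a minimal sketch in base R, assuming a hypothetical helper clean_byline() that is not part of paperboy. It treats entries that contain only the outlet name as having no named author.

```r
# Hypothetical helper, not part of paperboy: drop the outlet byline and
# stray separators from a character vector of author strings.
clean_byline <- function(author, outlet = "DER SPIEGEL") {
  # Split each entry on commas; every entry becomes a vector of name pieces
  parts <- strsplit(author, ",", fixed = TRUE)
  vapply(parts, function(p) {
    p <- trimws(p)
    # Drop empty pieces (from the trailing ", ") and the outlet name itself
    p <- p[nzchar(p) & p != outlet]
    # Entries with no remaining names are treated as missing
    if (length(p) == 0) NA_character_ else paste(p, collapse = ", ")
  }, character(1))
}

# Example strings copied from the output above
authors <- c(
  "DER SPIEGEL, ",
  "Marc Pitzke, DER SPIEGEL, ",
  "Leo Klimm, Simon Hage, Alexander Demling, DER SPIEGEL, "
)
clean_byline(authors)
```

Whether the outlet name should be removed at all is a judgment call: if it is the official byline, keeping it may be the more faithful choice.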

@JBGruber
Owner

JBGruber commented Oct 15, 2024

The function actually needs to be named pb_deliver_paper.spiegel_de. I wrote the confusingly named paperboy:::classify("spiegel.de") for this, in case you're in doubt about the name. I also had a look at the issue that most bylines have DER SPIEGEL. I took the info from a different part of the page and came up with the same result. It seems that this is SPON's official byline? (While I was at it, I also saw an easier and potentially more robust way to find the headline.)
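
The naming matters because a parser is only picked up if its name matches the generic and the class assigned to the data. A minimal sketch of that mechanism with plain S3 dispatch, using hypothetical stand-ins (classify_sketch() and deliver_sketch() are illustrative and not paperboy's actual functions) and assuming the class is the domain with dots replaced by underscores, as the name above suggests:

```r
# Hypothetical stand-ins; paperboy's real internals may differ.
classify_sketch <- function(domain) {
  # Assumption: the class name is the domain with "." replaced by "_"
  gsub(".", "_", domain, fixed = TRUE)
}

# Generic that dispatches on the class of its first argument
deliver_sketch <- function(x, ...) UseMethod("deliver_sketch")

# Fallback used when no domain-specific method exists
deliver_sketch.default <- function(x, ...) "generic approach"

# Domain-specific method: only found because its suffix equals the class
deliver_sketch.spiegel_de <- function(x, ...) "spiegel.de parser"

page <- structure(list(domain = "spiegel.de"),
                  class = classify_sketch("spiegel.de"))
deliver_sketch(page)

# A domain without its own method falls back to the default
other <- structure(list(domain = "example.org"),
                   class = classify_sketch("example.org"))
deliver_sketch(other)
```

In this sketch, a method named for a different generic or a differently spelled class would never be called, and the default would run instead.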

@schochastics
Contributor Author

Thanks! Had a bit of a merge nightmare but gonna test both bild and spiegel thoroughly now before moving on

@schochastics schochastics marked this pull request as ready for review October 21, 2024 18:29
@schochastics
Contributor Author

@JBGruber So this got a bit out of hand. At GESIS, we have a web tracking project, and one thing led to another: I got a list of the 200 most visited German news websites that we need scrapers for. This pull request now covers a little more than the top 100, but I ran out of steam. For now, I hope this does not blow the package out of proportion. Happy to fix anything you think might need fixing.

@JBGruber
Owner

Thank you, that's fantastic! I'll check them out this week. Don't forget to add yourself to the description 😉

@JBGruber JBGruber merged commit 3234c0c into JBGruber:main Dec 26, 2024
5 checks passed