add german media (#23) #24
@JBGruber there are many articles at "Der Spiegel" that do not seem to have an author.

rss_url <- "https://www.spiegel.de/schlagzeilen/index.rss"
test_df <- paperboy::pb_collect(rss_url)
#> ℹ 1 unique URLs provided
#> ✔ 1 unique URLs provided [32ms]
#>
#> ⠉ Fetching pages...
#> ✔ Fetching pages... [1.2s]
#>
#> ℹ Parsing RSS feeds
#> ✔ Parsing RSS feeds [2.1s]
#>
#> ℹ 30 pages from 1 domain collected.
#> ✔ 30 pages from 1 domain collected. [40ms]
#>
#> ℹ no links had issues.
#> ✔ no links had issues. [36ms]
#>
test_df_parsed <- paperboy:::test_parser(test_df)
#> ℹ Trying to parse raw data
#> ! No parser for domain spiegel.de yet, attempting generic approach.
#> ℹ Checking results
test_df_parsed$author
#> [1] "DER SPIEGEL, "
#> [2] "Leo Klimm, Simon Hage, Alexander Demling, DER SPIEGEL, "
#> [3] "DER SPIEGEL, "
#> [4] "DER SPIEGEL, "
#> [5] "DER SPIEGEL, "
#> [6] "David Böcking, DER SPIEGEL, "
#> [7] "DER SPIEGEL, "
#> [8] "DER SPIEGEL, "
#> [9] "Regina Steffens, DER SPIEGEL, "
#> [10] "DER SPIEGEL, "
#> [11] "DER SPIEGEL, "
#> [12] "DER SPIEGEL, "
#> [13] "DER SPIEGEL, "
#> [14] "DER SPIEGEL, "
#> [15] "DER SPIEGEL, "
#> [16] "DER SPIEGEL, "
#> [17] "DER SPIEGEL, "
#> [18] "DER SPIEGEL, "
#> [19] "DER SPIEGEL, "
#> [20] "DER SPIEGEL, "
#> [21] "DER SPIEGEL, "
#> [22] "DER SPIEGEL, "
#> [23] "Marvin Rishi Krishan, DER SPIEGEL, "
#> [24] "Deike Diening, DER SPIEGEL, "
#> [25] "Dunja Ramadan, Ghada Alkurd, DER SPIEGEL, "
#> [26] "Marc Pitzke, DER SPIEGEL, "
#> [27] "Oliver Trenkamp, DER SPIEGEL, "
#> [28] "DER SPIEGEL, "
#> [29] "Nadine Wolter, DER SPIEGEL, "
#> [30] "DER SPIEGEL, "

Created on 2024-10-15 with reprex v2.1.1

But the author column has "DER SPIEGEL" added for the articles without an author (and sometimes also when there actually is an author). I verified that |
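To illustrate the issue described above, here is a minimal sketch of how the publisher name could be stripped from the author strings after parsing. `clean_author()` is a hypothetical helper written for this illustration; it is not part of paperboy and not necessarily how the parser in this pull request handles it. It assumes authors are comma-separated and that the publisher appears verbatim as "DER SPIEGEL".

```r
# Hypothetical helper (not part of paperboy): remove the publisher name from
# a comma-separated author string and tidy up leftover separators.
clean_author <- function(author, publisher = "DER SPIEGEL") {
  parts <- strsplit(author, ",", fixed = TRUE)
  vapply(parts, function(p) {
    p <- trimws(p)
    # Drop missing values, empty fragments, and the publisher itself
    p <- p[!is.na(p) & nzchar(p) & p != publisher]
    # Articles without a named author become NA instead of the publisher
    if (length(p) == 0) NA_character_ else paste(p, collapse = ", ")
  }, character(1))
}

# Values taken from the output above
authors <- c(
  "DER SPIEGEL, ",
  "Leo Klimm, Simon Hage, Alexander Demling, DER SPIEGEL, ",
  "Marc Pitzke, DER SPIEGEL, "
)
clean_author(authors)
```

With a helper like this, entries that contain only the publisher would end up as `NA`, while real bylines keep their names without the trailing "DER SPIEGEL, ".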
The function actually needs to be named |
Thanks! Had a bit of a merge nightmare, but I'm gonna test both bild and spiegel thoroughly now before moving on.
@JBGruber So this got a bit out of hand. At GESIS, we have a webtracking project, and one thing led to another: I got a list of the 200 most visited German news websites that we need a scraper for. This pull request now covers a little more than the top 100, but I ran out of steam. For now, I hope this does not blow the package out of proportion. Happy to fix anything you think might need fixing.
Thank you, that's fantastic! I'll check them out this week. Don't forget to add yourself to the description 😉