add german media (#23) #24

schochastics · 2024-10-15T14:05:06Z

No description provided.

schochastics · 2024-10-15T14:07:24Z

@JBGruber there are many articles at "Der Spiegel" that do not seem to have an author.

rss_url <- "https://www.spiegel.de/schlagzeilen/index.rss"
test_df <- paperboy::pb_collect(rss_url)
#> ℹ 1 unique URLs provided
#> ✔ 1 unique URLs provided [32ms]
#> 
#> ⠉ Fetching pages...
#> ✔ Fetching pages... [1.2s]
#> 
#> ℹ Parsing RSS feeds
#> ✔ Parsing RSS feeds [2.1s]
#> 
#> ℹ 30 pages from 1 domain collected.
#> ✔ 30 pages from 1 domain collected. [40ms]
#> 
#> ℹ no links had issues.
#> ✔ no links had issues. [36ms]
#> 
test_df_parsed <- paperboy:::test_parser(test_df)
#> ℹ Trying to parse raw data
#> ! No parser for domain spiegel.de yet, attempting generic approach.
#> ℹ Checking results
test_df_parsed$author
#>  [1] "DER SPIEGEL, "                                          
#>  [2] "Leo Klimm, Simon Hage, Alexander Demling, DER SPIEGEL, "
#>  [3] "DER SPIEGEL, "                                          
#>  [4] "DER SPIEGEL, "                                          
#>  [5] "DER SPIEGEL, "                                          
#>  [6] "David Böcking, DER SPIEGEL, "                           
#>  [7] "DER SPIEGEL, "                                          
#>  [8] "DER SPIEGEL, "                                          
#>  [9] "Regina Steffens, DER SPIEGEL, "                         
#> [10] "DER SPIEGEL, "                                          
#> [11] "DER SPIEGEL, "                                          
#> [12] "DER SPIEGEL, "                                          
#> [13] "DER SPIEGEL, "                                          
#> [14] "DER SPIEGEL, "                                          
#> [15] "DER SPIEGEL, "                                          
#> [16] "DER SPIEGEL, "                                          
#> [17] "DER SPIEGEL, "                                          
#> [18] "DER SPIEGEL, "                                          
#> [19] "DER SPIEGEL, "                                          
#> [20] "DER SPIEGEL, "                                          
#> [21] "DER SPIEGEL, "                                          
#> [22] "DER SPIEGEL, "                                          
#> [23] "Marvin Rishi Krishan, DER SPIEGEL, "                    
#> [24] "Deike Diening, DER SPIEGEL, "                           
#> [25] "Dunja Ramadan, Ghada Alkurd, DER SPIEGEL, "             
#> [26] "Marc Pitzke, DER SPIEGEL, "                             
#> [27] "Oliver Trenkamp, DER SPIEGEL, "                         
#> [28] "DER SPIEGEL, "                                          
#> [29] "Nadine Wolter, DER SPIEGEL, "                           
#> [30] "DER SPIEGEL, "

Created on 2024-10-15 with reprex v2.1.1

However, the author column has "DER SPIEGEL" added for the articles without an author (and sometimes also when there actually is an author). I verified that pb_deliver.spiegel.de() has an empty author field in that case. Is there a function that adds "DER SPIEGEL" retroactively?
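To make the pattern in the output above concrete, here is a minimal post-processing sketch in base R. It assumes one simply wants to drop the trailing ", " and the publisher name from each byline. `clean_byline()` is a hypothetical helper written for illustration; it is not part of paperboy, and whether the publisher name should be removed at all is a separate question.

```r
# Hypothetical helper (not part of paperboy): tidy byline strings such as
# "Marc Pitzke, DER SPIEGEL, " by removing empty pieces and the publisher.
clean_byline <- function(author, publisher = "DER SPIEGEL") {
  # Split each byline on commas; the result is a list with one entry per byline.
  parts <- strsplit(author, ",", fixed = TRUE)
  vapply(parts, function(p) {
    p <- trimws(p)
    # Drop missing values, empty pieces (from the trailing ", "),
    # and the publisher name itself.
    p <- p[!is.na(p) & nzchar(p) & p != publisher]
    # Bylines that only named the publisher become NA.
    if (length(p) == 0) NA_character_ else paste(p, collapse = ", ")
  }, character(1))
}

authors <- c(
  "DER SPIEGEL, ",
  "Marc Pitzke, DER SPIEGEL, ",
  "Dunja Ramadan, Ghada Alkurd, DER SPIEGEL, "
)
clean_byline(authors)
#> [1] NA                            "Marc Pitzke"
#> [3] "Dunja Ramadan, Ghada Alkurd"
```

Returning `NA` for publisher-only bylines keeps "no named author" distinguishable from an empty string in later filtering.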

JBGruber · 2024-10-15T17:30:54Z

The function actually needs to be named pb_deliver_paper.spiegel_de. If you're in doubt about the name, I wrote the confusingly named paperboy:::classify("spiegel.de") for that. I also had a look at the issue that most bylines contain DER SPIEGEL. I took the info from a different part of the page and came up with the same result, so it seems that is SPON's official byline? (While at it, I also saw an easier and potentially more robust way to find the headline.)
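The naming requirement follows from how S3 dispatch works in R: a method is found only if the part after the generic's name matches the object's class exactly. The sketch below illustrates this with toy names. `domain_to_class()`, `deliver()` and `parse_page()` are hypothetical stand-ins written for this example. They are not paperboy's internals, and the real `classify()` may behave differently.

```r
# Hypothetical stand-in for the domain-to-class step: "spiegel.de" -> "spiegel_de".
# paperboy's internal classify() may work differently.
domain_to_class <- function(domain) {
  gsub(".", "_", domain, fixed = TRUE)
}

# Toy generic standing in for the package's parser generic.
deliver <- function(x, ...) UseMethod("deliver")

# Fallback that runs when no site-specific method exists.
deliver.default <- function(x, ...) "generic parser"

# Site-specific method: the suffix must equal the class name exactly.
deliver.spiegel_de <- function(x, ...) "spiegel.de parser"

parse_page <- function(domain) {
  # Tag the page object with a class derived from its domain, then dispatch.
  page <- structure(list(domain = domain), class = domain_to_class(domain))
  deliver(page)
}

parse_page("spiegel.de")
#> [1] "spiegel.de parser"
parse_page("example.org")
#> [1] "generic parser"
```

Under this scheme, a method whose suffix is spelled `spiegel.de` would never be selected for an object of class `spiegel_de`, so the call would fall through to the default method.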


schochastics · 2024-10-15T18:19:55Z

Thanks! Had a bit of a merge nightmare, but I'm gonna test both bild and spiegel thoroughly now before moving on.

schochastics · 2024-10-21T18:33:05Z

@JBGruber So this got a bit out of hand. At GESIS, we have a webtracking project, and one thing led to another: I got a list of the 200 most visited German news websites that we need a scraper for. This pull request now covers a little more than the top 100, but I ran out of steam. For now, I hope this does not blow the package out of proportion. Happy to fix anything you think might need fixing.

JBGruber · 2024-10-21T18:46:13Z

Thank you, that's fantastic! I'll check them out this week. Don't forget to add yourself to the description 😉

schochastics added 2 commits October 15, 2024 15:28

added spiegel

91d61a1

added docs and export

28619d3

schochastics and others added 2 commits October 15, 2024 16:11

renamed spiegel function

b6c6bba

change spiegel function

a42d16d

JBGruber and others added 3 commits October 15, 2024 19:31

fix namespace

5e9bc7e

fixed spiegel and added bild

87526ca

Merge branch 'german_media' of github.com:schochastics/paperboy into german_media

a39d4b9

schochastics added 4 commits October 15, 2024 20:50

added welt.de

aa8e8ec

added tageschau.de

c79cc18

added focus.de

cdef66a

added fr.de

c975bcd

schochastics mentioned this pull request Oct 15, 2024

parsing script type="application/ld+json" #25

Closed

schochastics added 15 commits October 15, 2024 21:48

added stern.de

90a91c2

added sueddeutsche

16d5783

added n-tv.de

453872b

added rtl.de

d0e4330

added prosieben.de

42c8922

replaced base R pipe with magrittr pipe

7ef9bfc

renamed sueddeutsche file

a4470ed

added rp-online.de

4d6bfee

added t-online.de

bc7ba88

added zdf.de

a945fc4

added tagesspiegel.de

7e26217

added morgenpost.de

05d3045

added handelsblatt.com

7cb8a2c

added berliner-zeitung.de

34895ef

added badische-zeitung.de

9f40bb0

schochastics added 23 commits October 18, 2024 19:18

added epochtimes.de

f2480b2

added ostsee-zeitung.de

d40724a

added swr3.de

5525f52

added newsflash24.de

e5d4ada

added jungefreiheit.de

20ed71a

added kabeleins.de

4d50b5c

added thueringer-allgemeine.de

4a8fff6

added watson.ch

cc90fda

added maz-online.de

ba2023c

better json error handling (part 1)

29f154e

better json error handling (part 2)

c37f7b0

rm json dumping

b7d142b

better error handling (based on webtrack data)

d0b3e70

added taz.de

3ec0c8e

better error handling tz

a95b7bf

added schwaebische.de

4ae7b6b

added wz.de

7bb9b33

added dnn.de

7cd635c

added frankenpost.de

b69303f

removed non ascii in handelsblat scraper

920f4e5

better error handling focus

78aeb72

removed call to deprecated html_node

72768e8

further focus.de error handling

a7605e3

schochastics marked this pull request as ready for review October 21, 2024 18:29

schochastics added 3 commits October 22, 2024 06:49

added David as ctb

d54653b

better error handling rtl.de

230db67

changed text scraping for spiegel.de

3b4b3f9

JBGruber merged commit 3234c0c into JBGruber:main Dec 26, 2024
5 checks passed
