add german media (#23) #24

Merged
merged 122 commits into from
Dec 26, 2024
91d61a1
added spiegel
schochastics Oct 15, 2024
28619d3
added docs and export
schochastics Oct 15, 2024
b6c6bba
renamed spiegel function
schochastics Oct 15, 2024
a42d16d
change spiegel function
JBGruber Oct 15, 2024
5e9bc7e
fix namespace
JBGruber Oct 15, 2024
87526ca
fixed spiegel and added bild
schochastics Oct 15, 2024
a39d4b9
Merge branch 'german_media' of github.com:schochastics/paperboy into …
schochastics Oct 15, 2024
aa8e8ec
added welt.de
schochastics Oct 15, 2024
c79cc18
added tageschau.de
schochastics Oct 15, 2024
cdef66a
added focus.de
schochastics Oct 15, 2024
c975bcd
added fr.de
schochastics Oct 15, 2024
90a91c2
added stern.de
schochastics Oct 15, 2024
16d5783
added sueddeutsche
schochastics Oct 15, 2024
453872b
added n-tv.de
schochastics Oct 16, 2024
d0e4330
added rtl.de
schochastics Oct 16, 2024
42c8922
added prosieben.de
schochastics Oct 16, 2024
7ef9bfc
replaced base R pipe with magrittr pipe
schochastics Oct 16, 2024
a4470ed
renamed sueddeutsche file
schochastics Oct 16, 2024
4d6bfee
added rp-online.de
schochastics Oct 16, 2024
bc7ba88
added t-online.de
schochastics Oct 16, 2024
a945fc4
added zdf.de
schochastics Oct 16, 2024
7e26217
added tagesspiegel.de
schochastics Oct 16, 2024
05d3045
added morgenpost.de
schochastics Oct 16, 2024
7cb8a2c
added handelsblatt.com
schochastics Oct 16, 2024
34895ef
added berliner-zeitung.de
schochastics Oct 16, 2024
9f40bb0
added badische-zeitung.de
schochastics Oct 16, 2024
1a0807c
added derwesten.de
schochastics Oct 16, 2024
a34a578
added tag24.de
schochastics Oct 16, 2024
07c6d2b
added heise.de
schochastics Oct 16, 2024
edd6439
added merkur.de
schochastics Oct 16, 2024
3899858
added ndr.de
schochastics Oct 16, 2024
744e0f7
added br.de
schochastics Oct 16, 2024
67ec519
added t3n.de
schochastics Oct 17, 2024
30b0385
added karlsruhe-insider.de
schochastics Oct 17, 2024
755b04e
added mdr.de
schochastics Oct 17, 2024
4fc6503
added ruhr24.de
schochastics Oct 17, 2024
c0a0efe
added tz.de
schochastics Oct 17, 2024
7be71e4
added swr.de
schochastics Oct 17, 2024
28ea94c
added swp.de
schochastics Oct 17, 2024
949018b
added augsburger-allgemeine.de
schochastics Oct 17, 2024
466eb5f
added watson.de
schochastics Oct 17, 2024
3336344
added wiwo.de
schochastics Oct 17, 2024
794a79b
added rnd.de
schochastics Oct 17, 2024
228437c
added news.de
schochastics Oct 17, 2024
d516701
added deutschlandfunk.de
schochastics Oct 17, 2024
b34e02a
added businessinsider.de
schochastics Oct 17, 2024
a753622
added empty author if not found for ka_insider_de
schochastics Oct 17, 2024
aaf24c2
added nzz.ch
schochastics Oct 17, 2024
e793f4b
added waz.de
schochastics Oct 17, 2024
d5a04af
added finanzen.net
schochastics Oct 17, 2024
5e86402
added presseportal.de
schochastics Oct 17, 2024
0c4398d
added wdr.de
schochastics Oct 17, 2024
63798f3
added hna.de
schochastics Oct 17, 2024
5744393
replaced base R pipe with magrittr
schochastics Oct 17, 2024
1c835a8
added express.de
schochastics Oct 17, 2024
6496292
removed extra content in headline for express.de
schochastics Oct 17, 2024
2a9520c
added ksta.de
schochastics Oct 17, 2024
2f3c56e
added suedkurier.de
schochastics Oct 17, 2024
c7dd458
added deutschlandfunkkultur.de
schochastics Oct 17, 2024
dd441a4
added kreiszeitung.de
schochastics Oct 17, 2024
9b6eaa9
added abendblatt.de
schochastics Oct 17, 2024
9716ad6
added stuttgarter-zeitung.de
schochastics Oct 17, 2024
c947183
added infranken.de
schochastics Oct 17, 2024
2d6b079
added rbb24.de
schochastics Oct 17, 2024
d2dbe2d
added abendzeitung-muenchen.de
schochastics Oct 17, 2024
6731ed8
added echo24.de
schochastics Oct 17, 2024
13ae0ab
added mopo.de
schochastics Oct 17, 2024
63a1231
added saechsische.de
schochastics Oct 17, 2024
32e8c61
added kurier.at
schochastics Oct 17, 2024
059f2d5
added manager-magazin.de
schochastics Oct 17, 2024
b92fae0
added bnn.de
schochastics Oct 17, 2024
beffc2f
added nordkurier.de
schochastics Oct 17, 2024
660b7de
added rollingstone.de
schochastics Oct 17, 2024
37f1739
added berliner-kurier.de
schochastics Oct 18, 2024
10542fd
added vice.com
schochastics Oct 18, 2024
e856414
fixed wrong variable name for br.de
schochastics Oct 18, 2024
eb53e76
added ruhrnachrichten.de
schochastics Oct 18, 2024
7a76577
added vox.de
schochastics Oct 18, 2024
a9d1d81
added der-postillon.com
schochastics Oct 18, 2024
45586a2
added heidelberg24.de
schochastics Oct 18, 2024
8a98991
added news-und-nachrichten.de
schochastics Oct 18, 2024
62cd6e4
added volksstimme.de
schochastics Oct 18, 2024
ee67203
added 3sat.de
schochastics Oct 18, 2024
e55b30c
added derstandard.at
schochastics Oct 18, 2024
677e734
added lvz.de
schochastics Oct 18, 2024
18a0a5c
added swrfernsehen.de
schochastics Oct 18, 2024
e1da882
added shz.de
schochastics Oct 18, 2024
0d1b2f9
added fnp.de
schochastics Oct 18, 2024
0abfc16
added freiepresse.de
schochastics Oct 18, 2024
47e6dd2
added wa.de
schochastics Oct 18, 2024
d409ec1
added haz_de
schochastics Oct 18, 2024
6629448
added haz.de remaining
schochastics Oct 18, 2024
1304b8f
added nw.de
schochastics Oct 18, 2024
0bee0d2
added noz.de
schochastics Oct 18, 2024
9a383eb
added orf.at
schochastics Oct 18, 2024
633fd2c
added srf.ch
schochastics Oct 18, 2024
f2480b2
added epochtimes.de
schochastics Oct 18, 2024
d40724a
added ostsee-zeitung.de
schochastics Oct 18, 2024
5525f52
added swr3.de
schochastics Oct 18, 2024
e5d4ada
added newsflash24.de
schochastics Oct 18, 2024
20ed71a
added jungefreiheit.de
schochastics Oct 18, 2024
4d50b5c
added kabeleins.de
schochastics Oct 18, 2024
4a8fff6
added thueringer-allgemeine.de
schochastics Oct 18, 2024
cc90fda
added watson.ch
schochastics Oct 18, 2024
ba2023c
added maz-online.de
schochastics Oct 18, 2024
29f154e
better json error handling (part 1)
schochastics Oct 20, 2024
c37f7b0
better json error handling (part 2)
schochastics Oct 21, 2024
b7d142b
rm json dumping
schochastics Oct 21, 2024
d0b3e70
better error handling (based on webtrack data)
schochastics Oct 21, 2024
3ec0c8e
added taz.de
schochastics Oct 21, 2024
a95b7bf
better error handling tz
schochastics Oct 21, 2024
4ae7b6b
added schwaebische.de
schochastics Oct 21, 2024
7bb9b33
added wz.de
schochastics Oct 21, 2024
7cd635c
added dnn.de
schochastics Oct 21, 2024
b69303f
added frankenpost.de
schochastics Oct 21, 2024
920f4e5
removed non ascii in handelsblat scraper
schochastics Oct 21, 2024
78aeb72
better error handling focus
schochastics Oct 21, 2024
72768e8
removed call to deprecated html_node
schochastics Oct 21, 2024
a7605e3
further focus.de error handling
schochastics Oct 21, 2024
d54653b
added David as ctb
schochastics Oct 22, 2024
230db67
better error handling rtl.de
schochastics Oct 24, 2024
3b4b3f9
changed text scraping for spiegel.de
schochastics Oct 24, 2024
added rtl.de
schochastics committed Oct 16, 2024
commit d0e4330005ce2482c21529d9daeb0480bed6a089
1 change: 1 addition & 0 deletions NAMESPACE
@@ -53,6 +53,7 @@ S3method(pb_deliver_paper,nypost_com)
S3method(pb_deliver_paper,nytimes_com)
S3method(pb_deliver_paper,parlamentnilisty_cz)
S3method(pb_deliver_paper,rte_ie)
S3method(pb_deliver_paper,rtl_de)
S3method(pb_deliver_paper,rtl_nl)
S3method(pb_deliver_paper,seznamzpravy_cz)
S3method(pb_deliver_paper,sfgate_com)
31 changes: 31 additions & 0 deletions R/deliver_rtl_de.R
@@ -0,0 +1,31 @@
#' @export
pb_deliver_paper.rtl_de <- function(x, verbose = NULL, pb, ...) {
  pb_tick(x, verbose, pb)
  # raw html is stored in column content_raw
  html <- rvest::read_html(x$content_raw)

  json_txt <- rvest::html_nodes(html, "script[type = \"application/ld+json\"] ")[2] |> rvest::html_text()
  json_df <- jsonlite::fromJSON(json_txt)
  if (json_df$`@type` != "VideoObject") { # NewsArticle
    datetime <- lubridate::as_datetime(json_df$datePublished)
    headline <- json_df$headline
    author <- toString(json_df$author$name)
    text <- html %>%
      rvest::html_elements(".article-body .LeadText_lead__rfwFU,.article-body .AnnotatedMarkup_paragraph__IUT9l") %>%
      rvest::html_text2() %>%
      paste(collapse = "\n")
  } else {
    datetime <- lubridate::as_datetime(json_df$uploadDate)
    headline <- json_df$name
    author <- ""
    text <- json_df$transcript # for video objects, use transcript as text
  }

  s_n_list(
    datetime,
    author,
    headline,
    text,
    json_df # dumping the whole json data of an article
  )
}
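Review note on R/deliver_rtl_de.R: the function above selects the second JSON-LD script by position (`[2]`) and then branches on its `@type`. One possible hardening, sketched here under stated assumptions, is to pick the block by `@type` and skip blocks that fail to parse. `find_json_ld` is a hypothetical helper that is not part of paperboy, and the sample page is made up for illustration; whether rtl.de always exposes a `NewsArticle` or `VideoObject` block has not been verified here.

```r
library(rvest)
library(jsonlite)

# Hypothetical helper (not part of paperboy): return the first JSON-LD block
# whose @type is one of `types`, or NULL if no block parses and matches.
find_json_ld <- function(html, types = c("NewsArticle", "VideoObject")) {
  scripts <- rvest::html_elements(html, "script[type='application/ld+json']")
  for (txt in rvest::html_text(scripts)) {
    # Malformed JSON yields NULL instead of stopping the whole delivery
    parsed <- tryCatch(
      jsonlite::fromJSON(txt, simplifyVector = FALSE),
      error = function(e) NULL
    )
    if (is.list(parsed) && any(unlist(parsed[["@type"]]) %in% types)) {
      return(parsed)
    }
  }
  NULL
}

# Made-up page for illustration only; not real rtl.de markup
page <- rvest::read_html(paste0(
  "<html><head>",
  "<script type=\"application/ld+json\">",
  "{\"@type\": \"Organization\", \"name\": \"Example\"}",
  "</script>",
  "<script type=\"application/ld+json\">",
  "{\"@type\": \"NewsArticle\", \"headline\": \"Example headline\"}",
  "</script>",
  "</head><body></body></html>"
))

json_ld <- find_json_ld(page)
headline <- json_ld$headline
```

With this approach the caller can test `is.null(json_ld)` and fall back to empty fields, rather than failing when the second script is missing or holds a different type.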
1 change: 1 addition & 0 deletions inst/status.csv
@@ -59,6 +59,7 @@
"pagesix.com","![](https://img.shields.io/badge/status-requested-lightgrey)","","[#1](https://github.com/JBGruber/paperboy/issues/1)",NA
"parlamentnilisty.cz","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","[@JBGruber](https://github.com/JBGruber/)","","http://www.parlamentnilisty.cz/export/rss.aspx"
"rte.ie","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","[@JBGruber](https://github.com/JBGruber/)","","https://www.rte.ie/feeds/rss/?index=/news/"
"rtl.de","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","[@schochastics](https://github.com/schochastics)","[#23](https://github.com/JBGruber/paperboy/issues/23)","https://www.rtl.de/rss/feed/news"
"rtl.nl","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","[@JBGruber](https://github.com/JBGruber/)","","https://www.rtlnieuws.nl/rss.xml"
"seznamzpravy.cz","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","[@JBGruber](https://github.com/JBGruber/)","","https://www.seznamzpravy.cz/rss"
"sfgate.com","![](https://img.shields.io/badge/status-gold-%23ffd700.svg)","[@JBGruber](https://github.com/JBGruber/)","",NA