Bridge request for Nikkei Asia #4153

Open
4 of 10 tasks
notyourslave opened this issue Jul 20, 2024 · 2 comments
Labels
Bridge-Request Request for a new bridge

Comments

@notyourslave

Bridge request

Bridge for Nikkei Asia.

General information

  • Host URI for the bridge: https://asia.nikkei.com

  • Which information would you like to see?

Articles

  • How should the information be displayed/formatted?

  • Which of the following parameters do you expect?

    • Title
    • URI (link to the original article)
    • Author
    • Timestamp
    • Content (the content of the article)
    • Enclosures (pictures, videos, etc...)
    • Categories (categories, tags, etc...)

Options

  • Limit number of returned items
    • Default limit: 20
  • Load full articles
    • Cache articles (articles are stored in a local cache on first request): yes
    • Cache timeout (max = 24 hours): 1 hour
  • Balance requests (RSS-Bridge uses cached versions to reduce bandwidth usage)
    • Timeout (default = 5 minutes, max = 24 hours): 5 minutes

Additional notes

  • Fetch from rss feed
  • Fetch article full content
  • Remove ads
<?php

class NikkeiBridge extends BridgeAbstract
{
  const NAME = 'Nikkei Bridge';
  const URI = 'https://asia.nikkei.com';
  const DESCRIPTION = 'Fetches the latest articles from Nikkei Asia';
  const MAINTAINER = 'notme';
  const CACHE_TIMEOUT = 3600;

  const MAX_CONTENTS = 20;

  public function collectData()
  {
    $rssFeedUrl = 'https://asia.nikkei.com/rss/feed/nar';
    $rssContent = file_get_contents($rssFeedUrl);

    if (!$rssContent) {
      returnServerError('Could not request ' . $rssFeedUrl);
    }

    $rss = simplexml_load_string($rssContent);

    if (!$rss) {
      returnServerError('Could not parse RSS feed from ' . $rssFeedUrl);
    }

    $count = 0;
    foreach ($rss->item as $element) {
      if ($count >= self::MAX_CONTENTS) {
        break;
      }

      $count++;

      $item = [];
      $item['title'] = (string)$element->title;
      $item['uri'] = (string)$element->link;
      $item['timestamp'] = strtotime((string)$element->pubDate);

      // Fetch the article content
      $articleContent = $this->fetchArticleContent($item['uri']);
      if ($articleContent) {
        $item['content'] = $articleContent;
      } else {
        $item['content'] = 'Content could not be retrieved';
      }

      $this->items[] = $item;
    }
  }

  private function fetchArticleContent($url)
  {
    // Extract the path from the URL
    $urlComponents = parse_url($url);
    $path = $urlComponents['path'];

    // Base64 encode the path
    $encodedPath = base64_encode($path);

    // Create the API URL
    $apiUrl = 'https://asia.nikkei.com/__service/v1/piano/article_access/' . $encodedPath;

    // Fetch the JSON content from the API
    $apiResponse = file_get_contents($apiUrl);

    if (!$apiResponse) {
      error_log('Could not request ' . $apiUrl);
      return null;
    }

    $apiResponseData = json_decode($apiResponse, true);

    if (!isset($apiResponseData['body'])) {
      error_log('Invalid API response for ' . $apiUrl);
      return null;
    }

    // Load the HTML content
    $htmlContent = $apiResponseData['body'];

    // Remove elements with class o-ads
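    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    // Prefix an encoding hint so loadHTML() does not read the UTF-8 body as ISO-8859-1
    $dom->loadHTML('<?xml encoding="UTF-8">' . $htmlContent);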
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*[contains(@class, "o-ads")]') as $adsNode) {
      $adsNode->parentNode->removeChild($adsNode);
    }

    // Save the cleaned HTML content
    $cleanedHtml = $dom->saveHTML();

    return $cleanedHtml;
  }
}
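
The ad-removal step can be tried on its own, outside RSS-Bridge. The sketch below is only an illustration: `stripByClass` is a hypothetical helper name, and the sample markup is made up. It differs from the draft in two ways that may or may not be wanted here. It matches `o-ads` as a whole class token, whereas `contains(@class, "o-ads")` is a substring match that would also hit a class such as `video-ads`. It also returns only the body children instead of a full HTML document.

```php
<?php

/**
 * Hypothetical helper: remove every element whose class list contains the
 * exact token $class and return the remaining markup as a fragment.
 * Assumes $class is a plain class name without quotes.
 */
function stripByClass(string $html, string $class): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    // Encoding hint: without it loadHTML() treats the input as ISO-8859-1
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Pad the class attribute with spaces so only whole tokens match
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    foreach ($xpath->query($query) as $node) {
        $node->parentNode->removeChild($node);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize only the children of <body>, dropping the wrapper the parser added
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$sample = '<p class="lead">Text</p>'
    . '<div class="o-ads">Ad</div>'
    . '<span class="video-ads">Keep</span>';
$clean = stripByClass($sample, 'o-ads');
```

If Nikkei marks ads with variants such as a modifier suffix on the class name, the substring match from the draft may be the better fit. That would need checking against real article markup.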
notyourslave added the Bridge-Request label Jul 20, 2024
@notyourslave
Author

Actually it would be better to cache the fetchArticleContent for more than 1 hour, no idea how to do that.

@NotsoanoNimus
Contributor

> Actually it would be better to cache the fetchArticleContent for more than 1 hour, no idea how to do that.

If I'm not mistaken, you should just be able to change the value in the line const CACHE_TIMEOUT = 3600; to the number of seconds you'd like to cache for, up to 24 hours (86400).
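
As far as I can tell, CACHE_TIMEOUT applies to the bridge output as a whole. To keep individual article bodies for longer than the feed itself, one option is to cache each fetchArticleContent result under its own key with its own lifetime. The sketch below is a standalone illustration under that assumption: `cachedFetch`, the temp-directory cache, and the placeholder URL are all hypothetical and not part of RSS-Bridge. RSS-Bridge's BridgeAbstract appears to offer loadCacheValue() and saveCacheValue() helpers that could play the same role, but check the current source before relying on them.

```php
<?php

/**
 * Hypothetical helper (not part of RSS-Bridge): return the cached value for
 * $key if it is younger than $ttl seconds, otherwise call $fetch and store
 * the result on disk.
 */
function cachedFetch(string $key, int $ttl, callable $fetch, string $cacheDir): ?string
{
    $file = $cacheDir . '/' . sha1($key) . '.cache';

    // Serve from the cache while the file is still fresh
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        $cached = file_get_contents($file);
        if ($cached !== false) {
            return $cached;
        }
    }

    $value = $fetch($key);

    // Only store successful fetches so failures are retried on the next run
    if (is_string($value) && $value !== '') {
        file_put_contents($file, $value, LOCK_EX);
        return $value;
    }

    return null;
}

// Demo: a fresh temp directory and a fake fetcher that counts its calls
$cacheDir = sys_get_temp_dir() . '/nikkei_cache_demo_' . uniqid();
mkdir($cacheDir, 0777, true);

$calls = 0;
$fetch = function (string $url) use (&$calls): string {
    $calls++; // stands in for the real network request
    return '<p>Body of ' . $url . '</p>';
};

$url = 'https://asia.nikkei.com/example-article'; // placeholder URL
$first = cachedFetch($url, 86400, $fetch, $cacheDir);  // miss: calls $fetch
$second = cachedFetch($url, 86400, $fetch, $cacheDir); // hit: read from disk
```

Inside the bridge, the callable would wrap `$this->fetchArticleContent($url)`, so an article that was fetched once is not requested again until its own 24-hour lifetime runs out.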
