Tutorial: BBC world news map
This tutorial will guide you through the creation of a pipeline that displays sentiments from BBC world news articles on a world map. It assumes that you have either registered on the cloud version of Exynize or deployed your own copy of the platform, and that you have finished the "Hello world" tutorial and the Twitter product comparison tutorial.
First, we'll create a source component that connects to the RSS feed and delivers the latest published articles.
To simplify the creation, we'll rely on the feedparser npm package.
Here's how the code will look:
import FeedParser from 'feedparser';
import request from 'request';

export default (url, obs) => {
    // construct request and feedparser
    const req = request(url);
    const feedparser = new FeedParser();
    // handle errors
    req.on('error', err => obs.onError(err));
    feedparser.on('error', err => obs.onError(err));
    // pipe request into feedparser
    req.on('response', function(res) {
        const stream = this;
        if (res.statusCode !== 200) {
            return this.emit('error', new Error('Bad status code'));
        }
        stream.pipe(feedparser);
    });
    // process articles
    feedparser.on('readable', function() {
        const stream = this;
        let item;
        while ((item = stream.read())) {
            obs.onNext(item);
        }
    });
    // trigger end once done
    feedparser.on('end', () => obs.onCompleted());
};
This component will dispatch the latest articles from the feed and complete automatically, so we'll only see the latest ~20 articles every time we run this source.
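To make the source's contract concrete, here's a minimal, dependency-free sketch of the observer object it talks to: onNext for each article, onError on failure, onCompleted when the feed ends. The fake source and its "articles" below are made-up stand-ins, not real feedparser output.

```javascript
// Collect everything the source pushes, so we can inspect it afterwards
const received = [];
const obs = {
    onNext: item => received.push(item),
    onError: err => console.error('source failed:', err),
    onCompleted: () => received.push('completed'),
};

// A fake source that emits two articles and then completes,
// mimicking the shape of the RSS source above
const fakeSource = (url, observer) => {
    observer.onNext({title: 'Article 1', link: 'http://example.com/1'});
    observer.onNext({title: 'Article 2', link: 'http://example.com/2'});
    observer.onCompleted();
};

fakeSource('http://example.com/rss', obs);
console.log(received.length); // 3: two articles plus the completion marker
```

Once a source calls onCompleted, Exynize stops expecting further items from that run.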
Next, we'll create a processor component that fetches the full text of the articles for us, since RSS feeds usually do not provide it.
To simplify the creation, we'll rely on the superagent npm package for HTTP requests and on cheerio for HTML parsing.
Here's how the code will look:
import request from 'superagent';
import cheerio from 'cheerio';

const cleanText = text => text
    .replace(/[\n\r\t]+/g, ' ')
    .replace(/\s+/g, ' ')
    .replace(/(\w)\.([A-Z0-9_])/g, '$1. $2');
const cleanHtml = html => html
    .replace(/[\n\r\t]+/g, ' ')
    .replace(/<!\[CDATA\[.+?\]\]>/g, ' ')
    .replace(/<!--.+?-->/g, ' ')
    .replace(/\s+/g, ' ');

export default (data) => {
    return Rx.Observable.create(obs => {
        const {link} = data;
        request
            .get(link)
            .end((err, res) => {
                if (err) {
                    return obs.onError(err);
                }
                const $ = cheerio.load(res.text);
                $('script').remove();
                $('object').remove();
                // try to extract only the article text using the BBC news selector
                let obj = $('.story-body__inner');
                if (!obj || !obj.length) {
                    obj = $('body');
                }
                // cleanup
                $('figure', obj).remove();
                // get html and text
                const resHtml = cleanHtml(obj.html());
                const resText = cleanText(obj.text());
                // assign to data
                data.text = resText;
                data.html = resHtml;
                // send
                obs.onNext(data);
                obs.onCompleted();
            });
    });
};
This processor will first fetch the full HTML using the link field of the incoming data object, then try to extract only the meaningful text from it, append both text and HTML to the data and return this new data. You can test this by entering {"link": "http://some.link.with/text"} into the data field in the Exynize editor and hitting the "Test" button.
After the test succeeds, hit the "Save" button to save your new processor component.
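To see what the cleanup helpers above do in isolation, here's the cleanText regex chain as a standalone snippet you can run in plain Node. The sample string is made up for illustration:

```javascript
// Standalone copy of the cleanText helper from the processor above
const cleanText = text => text
    .replace(/[\n\r\t]+/g, ' ')               // turn newlines and tabs into spaces
    .replace(/\s+/g, ' ')                     // collapse runs of whitespace
    .replace(/(\w)\.([A-Z0-9_])/g, '$1. $2'); // restore the space between glued sentences

console.log(cleanText('First sentence.Second\n\tsentence.'));
// "First sentence. Second sentence."
```

The last replace matters because stripping HTML tags with cheerio often glues the end of one sentence to the start of the next.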
We'll reuse the sentiment component that we created during the Twitter product comparison tutorial.
Next, we'll create a processor component that annotates the full text of the incoming articles.
We'll rely on the FOX NLP tool API for this, and to simplify the creation, we'll use the request npm package for HTTP requests.
Here's how the code will look:
import _ from 'lodash';
import request from 'request';

// FOX NLP tool API url
const foxUrl = 'http://fox-demo.aksw.org/call/ner/entities';

export default (data) => Rx.Observable.create(obs => {
    // construct request
    const json = {
        input: data.text,
        type: 'text',
        task: 'ner',
        output: 'JSON-LD',
    };
    // send request
    request({
        method: 'POST',
        url: foxUrl,
        headers: {
            'Content-Type': 'application/json',
        },
        body: JSON.stringify(json),
    }, (err, res, body) => {
        // handle error
        if (err) {
            obs.onError(err);
            return;
        }
        // check if the status code is OK
        if (res && res.statusCode !== 200) {
            obs.onError(`Error code: ${res.statusCode}, ${res.statusMessage}`);
            return;
        }
        // parse results
        const result = JSON.parse(body);
        const entries = result['@graph'] ? result['@graph'] : [];
        const annotations = entries.map(it => ({
            types: it['@type'] ? it['@type']
                .map(t => t.indexOf(':') !== -1 ? t.split(':')[1] : t)
                .map(t => t.toLowerCase())
                .map(_.capitalize)
                .filter(t => t !== 'Annotation') : [],
            name: it['ann:body'],
            beginIndex: typeof it.beginIndex === 'string' ? [it.beginIndex] : it.beginIndex,
            endIndex: typeof it.endIndex === 'string' ? [it.endIndex] : it.endIndex,
        }));
        data.annotations = annotations;
        // return and complete
        obs.onNext(data);
        obs.onCompleted();
    });
});
This processor will annotate the text from the text field of the incoming data object, append the resulting annotations to the data and return this new data. You can test this by entering the following data into the data field in the Exynize editor and hitting the "Test" button: {"text": "The philosopher and mathematician Leibniz was born in Leipzig in 1646 and attended the University of Leipzig from 1661-1666. The current chancellor of Germany, Angela Merkel, also attended this university. "}
After the test succeeds, hit the "Save" button to save your new processor component.
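The densest part of the code above is the type-normalization chain. Here's a standalone sketch of it, with lodash's _.capitalize swapped for a hand-rolled equivalent so it runs without dependencies. The sample type strings are illustrative assumptions about namespaced FOX types, not verified API output:

```javascript
// Hand-rolled stand-in for lodash's _.capitalize
// (the input is already lowercased by the previous map step)
const capitalize = s => s.charAt(0).toUpperCase() + s.slice(1);

// Same chain as in the processor above: strip namespace prefixes,
// normalize case, and drop the generic Annotation type
const normalizeTypes = types => types
    .map(t => t.indexOf(':') !== -1 ? t.split(':')[1] : t)
    .map(t => t.toLowerCase())
    .map(capitalize)
    .filter(t => t !== 'Annotation');

console.log(normalizeTypes(['ns:LOCATION', 'ns:Annotation']));
// ['Location']
```

This is what turns the raw JSON-LD types into the clean Location / Person / Organization labels the next processor relies on.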
Next, we'll create a processor component that finds coordinates for all annotations that have the type Location.
We'll rely on the Nominatim API for this, and to simplify the creation, we'll use the nominatim npm package.
Here's how the code will look:
import _ from 'lodash';
import nominatim from 'nominatim';

// convert the search function into an observable
const observableSearch = Rx.Observable.fromNodeCallback(nominatim.search);

export default (inputData) => Rx.Observable.return(inputData)
    .flatMap(data => {
        // if no annotations - just return the original data
        if (!data.annotations) {
            return Rx.Observable.return(data);
        }
        // init places array
        if (!data.places) {
            data.places = [];
        }
        // resolve all annotations and merge the results
        return Rx.Observable.merge(data.annotations.map(annotation => {
            if (_.includes(annotation.types, 'Location')) {
                return observableSearch({q: annotation.name})
                    .map(([opt, results]) => {
                        if (results && results[0]) {
                            return {
                                name: opt.q,
                                lat: results[0].lat,
                                lon: results[0].lon,
                            };
                        }
                        return undefined;
                    });
            }
            return Rx.Observable.return(undefined);
        }))
        .filter(loc => loc !== undefined)
        .reduce((acc, place) => [place, ...acc], [])
        .map(places => {
            data.places = places;
            return data;
        });
    });
This processor will use all annotations with the type Location to fetch geo coordinates for them, then append the resulting coordinates to the data and return this new data. You can test this by entering the following data into the data field in the Exynize editor and hitting the "Test" button: {"annotations": [{"types": ["Location"], "name": "Leipzig"}]}
After the test succeeds, hit the "Save" button to save your new processor component.
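Before any geocoding happens, the processor only considers annotations whose types include Location. A dependency-free sketch of that selection step, using sample annotations loosely based on the Leibniz test data above:

```javascript
// Pick the annotation names that should be sent to Nominatim:
// only annotations typed as Location qualify
const annotations = [
    {types: ['Location'], name: 'Leipzig'},
    {types: ['Person'], name: 'Leibniz'},
];

const toGeocode = annotations
    .filter(a => a.types.indexOf('Location') !== -1)
    .map(a => a.name);

console.log(toGeocode); // ['Leipzig']
```

Every name in that list costs one Nominatim request, which is why person and organization annotations are skipped rather than looked up and discarded.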
Finally, we need to create a render component that will display the results. We'll create a component that displays the incoming data as color-coded points on a map. We'll use leaflet.js to simplify the creation of the map. Here's how the code will look:
import L from 'leaflet';
import 'leaflet/dist/leaflet.css';

const styleGray = '#cccccc';
const styleGreen = '#5cb85c';
const styleRed = '#d9534f';

const mapConfig = {
    minZoom: 2,
    maxZoom: 20,
    layers: [
        L.tileLayer(
            'http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
            {
                attribution: '© <a href="http://openstreetmap.org">OpenStreetMap</a>' +
                    ' contributors, <a href="http://creativecommons.org/licenses/by-sa/2.0/">CC-BY-SA</a>',
            }
        )
    ],
    attributionControl: false,
};

// popup rendering
const popup = (it) => `
<a href="${it.link}" target="_blank">${it.title}</a>
<div>${it.description}</div>
`;

// main render generator
export default () => React.createClass({
    componentDidMount() {
        // init map
        this.map = L.map(this.refs.map, mapConfig);
        // set view to show the full world map
        this.map.setView([-10, 10], 2);
    },
    componentWillReceiveProps(props) {
        // render items
        props.data.forEach(this.renderItem);
    },
    renderItem(it) {
        if (!it.places) {
            return;
        }
        // go over locations
        it.places.forEach((loc) => {
            // do not render locations with -1 as lat or lon
            if (loc.lat === -1 || loc.lon === -1) {
                return;
            }
            const color = it.sentiment.score === 0 ? styleGray :
                it.sentiment.score > 0 ? styleGreen : styleRed;
            const marker = L.circle([loc.lat, loc.lon], 100000, {
                stroke: false,
                fillColor: color,
                fillOpacity: 0.8,
                className: 'leaflet-marker-animated',
            }).addTo(this.map);
            marker.bindPopup(popup(it));
        });
    },
    render() {
        return (
            <div id="map" ref="map" style={{width: '100%', height: '100%', position: 'absolute'}}></div>
        );
    },
});
This component will render a map with red or green circles (depending on sentiment) representing the places mentioned in incoming articles.
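The color coding in renderItem boils down to a single ternary chain. Extracted as a standalone helper (a sketch for illustration, not part of the original component), it reads:

```javascript
const styleGray = '#cccccc';
const styleGreen = '#5cb85c';
const styleRed = '#d9534f';

// map a sentiment score to a marker color:
// gray for neutral, green for positive, red for negative
const colorFor = score => score === 0 ? styleGray :
    score > 0 ? styleGreen : styleRed;

console.log(colorFor(3), colorFor(-2), colorFor(0));
// #5cb85c #d9534f #cccccc
```

The scores themselves come from the sentiment component reused from the Twitter product comparison tutorial.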
Now that all the components have been created, we need to assemble them into a pipeline.
When adding the RSS source, you'll need to provide the BBC world news RSS URL: http://feeds.bbci.co.uk/news/world/rss.xml
The processors do not require any configuration; just adding them is sufficient. But make sure to add them in the same order we created them here: order is important.
Finally, no configuration is required when adding the render component either.
Make sure to test the pipeline by pressing the "Test" button before saving it with the "Save" button.
Now that you've assembled, tested and saved your new pipeline, you can start it and view the rendered result by clicking the "Web" button next to the pipeline name.