Text analytics

The EventRegistry module also includes an Analytics class for performing various text analytics tasks. The class will be extended with additional functionality, but for now it allows you to:

semantically annotate your documents with the entities and non-entities mentioned in them,
categorize a document into a list of predefined categories based on the DMOZ.org taxonomy,
compute the sentiment of a document,
determine the language of a document.

To visually test the different methods, please visit our demo pages.

Available methods

Semantic annotation

To semantically annotate a given document, use code such as:

import eventregistry as ER
er = ER.EventRegistry()
analytics = ER.Analytics(er)
ann = analytics.annotate("Microsoft released a new version of Windows OS.")

Text categorization

Categorization is currently supported only for English. To categorize a document into a predefined set of categories and identify the top related keywords, use code such as:

import eventregistry as ER
er = ER.EventRegistry()
analytics = ER.Analytics(er)
cat = analytics.categorize("Microsoft released a new version of Windows OS.")

Sentiment detection

Here is sample code to detect the sentiment expressed in a document:

import eventregistry as ER
er = ER.EventRegistry()
analytics = ER.Analytics(er)
sent = analytics.sentiment("Microsoft released a new version of Windows OS.")

Language detection

Here is sample code to detect the language of a document:

import eventregistry as ER
er = ER.EventRegistry()
analytics = ER.Analytics(er)
langInfo = analytics.detectLanguage("Microsoft released a new version of Windows OS.")

Train a custom topic

By analyzing several tens of documents, you can identify the common concepts and categories associated with them. The sample code below demonstrates the usage of the API:

import eventregistry as ER
er = ER.EventRegistry()
analytics = ER.Analytics(er)
ret = analytics.trainTopicCreateTopic("my topic")
uri = ret["uri"]
# add the documents relevant for your topic of interest
analytics.trainTopicAddDocument(uri, "Facebook has removed 18 accounts and 52 pages associated with the Myanmar military, including the page of its commander-in-chief, after a UN report accused the armed forces of genocide and war crimes.")
analytics.trainTopicAddDocument(uri, "Emmanuel Macron’s climate commitment to “make this planet great again” has come under attack after his environment minister dramatically quit, saying the French president was not doing enough on climate and other environmental goals.")
analytics.trainTopicAddDocument(uri, "Theresa May claimed that a no-deal Brexit “wouldn’t be the end of the world” as she sought to downplay a controversial warning made by Philip Hammond last week that it would cost £80bn in extra borrowing and inhibit long-term economic growth.")
# finish training of the topic
ret = analytics.trainTopicFinishTraining(uri)
assert "topic" in ret
# use the "concepts" and "categories" properties in the topic. They represent what your documents are mostly about
topic = ret["topic"]

Train a topic based on tweets

You can analyze a large number of tweets matching a search criterion and build a topic from the common concepts and categories associated with those tweets. Select the tweets to analyze by username (using @ as a prefix), by hashtag (using # as a prefix), or by a regular keyword. You can choose to analyze the content of the tweets or only the links provided in the tweets.

import time
import eventregistry as ER
er = ER.EventRegistry()
analytics = ER.Analytics(er)
# enqueue the task of building a topic based on the tweets from a user
ret = analytics.trainTopicOnTweets("@SeanEllis", useTweetText = True, maxConcepts = 50, maxCategories = 20, maxTweets = 400)
assert ret and "uri" in ret
uri = ret["uri"]
# training the topic can take several minutes, so use the uri provided in the response
# to retrieve the topic after a while
time.sleep(5)
# retrieve the topic definition. If the topic is not built yet, it will not be returned
ret = analytics.trainTopicGetTrainedTopic(uri)

Returned data format

Text categorization

{
    "dmoz": {
        // top categories associated with the text
        "categories": [
            {
                // category ID
                "label": "dmoz/Computers/Companies/Microsoft_Corporation",
                // relevance of the category to the document
                "score": 0.456
            },
            ...
        ],
        // top keywords that summarize the document and their weights
        "keywords": [
            {
                "keyword": "Computers",
                "wgt": 0.160
            },
            ...
        ]
    }
}
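
As a rough illustration of consuming this response, the sketch below picks the highest-scoring categories from a dictionary shaped like the one above. The helper top_categories and its min_score and limit parameters are hypothetical names introduced here, and the second category in the sample is made up for the example.

```python
def top_categories(response, min_score=0.0, limit=5):
    """Return (label, score) pairs sorted by descending score."""
    categories = response.get("dmoz", {}).get("categories", [])
    kept = [c for c in categories if c.get("score", 0.0) >= min_score]
    kept.sort(key=lambda c: c.get("score", 0.0), reverse=True)
    return [(c["label"], c.get("score", 0.0)) for c in kept[:limit]]


# Sample shaped like the documented response; the second category is illustrative only.
sample_cat = {
    "dmoz": {
        "categories": [
            {"label": "dmoz/Computers/Software", "score": 0.120},
            {"label": "dmoz/Computers/Companies/Microsoft_Corporation", "score": 0.456},
        ],
        "keywords": [{"keyword": "Computers", "wgt": 0.160}],
    }
}

for label, score in top_categories(sample_cat, min_score=0.2):
    print(label, score)
```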

Language detection

{
    "reliable": true,
    "textBytes": 32,
    // the language candidates for the document
    "languages": [
        {
            "name": "ENGLISH",
            // two-letter ISO 639-1 code of the language
            "code": "en",
            // probability of the document being in this language
            "percent": 96,
            "score": 1321
        },
        ...
    ]
}
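
A minimal sketch of reading this response follows. It returns the code of the candidate with the highest "percent" value. The helper best_language and its min_percent threshold are hypothetical, and treating "reliable": false as "do not trust the result" is an assumption, since the meaning of that flag is not described here. The second candidate in the sample is made up for the example.

```python
def best_language(response, min_percent=50):
    """Return the code of the most likely language, or None if the result looks untrustworthy."""
    candidates = response.get("languages", [])
    if not candidates or not response.get("reliable", False):
        return None
    best = max(candidates, key=lambda lang: lang.get("percent", 0))
    if best.get("percent", 0) < min_percent:
        return None
    return best["code"]


# Sample shaped like the documented response; the second candidate is illustrative only.
sample_lang = {
    "reliable": True,
    "textBytes": 32,
    "languages": [
        {"name": "ENGLISH", "code": "en", "percent": 96, "score": 1321},
        {"name": "GERMAN", "code": "de", "percent": 3, "score": 40},
    ],
}

print(best_language(sample_lang))
```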

Train topic and train topic on Twitter

{
    "name": "@SeanEllis",
    "topic": {
        "concepts": [
            {
                "uri": "https://en.wikipedia.org/wiki/Amazon_(company)",
                "type": "org",
                "label": "Amazon (company)",
                "wgt": 50
            },
            ...
        ],
        "categories": [
            {
                "uri": "dmoz:Business/Investing",
                "label": "Business/Investing",
                "wgt": 42
            },
            ...
        ]
    }
}
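
To show how the trained topic might be used, the sketch below lists the most heavily weighted concepts and categories. It assumes that "topic" is an object holding "concepts" and "categories" lists, as the training example describes; the helper summarize_topic and its top_n parameter are hypothetical names introduced for illustration.

```python
def summarize_topic(response, top_n=10):
    """Return the top-weighted (label, wgt) pairs for concepts and categories."""
    topic = response.get("topic", {})

    def top(items):
        ranked = sorted(items, key=lambda item: item.get("wgt", 0), reverse=True)
        return [(item["label"], item.get("wgt", 0)) for item in ranked[:top_n]]

    return {
        "concepts": top(topic.get("concepts", [])),
        "categories": top(topic.get("categories", [])),
    }


# Sample shaped like the documented response.
sample_topic = {
    "name": "@SeanEllis",
    "topic": {
        "concepts": [
            {
                "uri": "https://en.wikipedia.org/wiki/Amazon_(company)",
                "type": "org",
                "label": "Amazon (company)",
                "wgt": 50,
            }
        ],
        "categories": [
            {"uri": "dmoz:Business/Investing", "label": "Business/Investing", "wgt": 42}
        ],
    },
}

summary = summarize_topic(sample_topic)
print(summary["concepts"])
print(summary["categories"])
```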

Semantic annotation

{
    // the list of annotations
    "annotations": [
        {
            // the URL that uniquely identifies the concept represented by the annotation
            "url": "http://en.wikipedia.org/wiki/Microsoft",
            // the label that can be used to represent the annotation (in the language of the document)
            "title": "Microsoft",
            // the input language
            "lang": "en",

            // secondary URL that uniquely identifies the concept as a concept on English wikipedia
            "secUrl": "http://en.wikipedia.org/wiki/Microsoft",
            // label that can represent the concept in English language
            "secTitle": "Microsoft",
            "secLang": "en",

            // dbpedia URI of the concept
            "dbPediaIri": "http://dbpedia.org/resource/Microsoft",
            // dbpedia types for the concept
            "dbPediaTypes": [
                "Agent",
                "Organisation",
                "Company"
            ],
            // general categorization of the concept (person, org or loc)
            "type": "org",
            // importance of the concept for the whole document
            "wgt": 0.6666,
            // mentions of the concept in the document
            "support": [
                {
                    // character positions in text
                    "chFrom": 0,
                    "chTo": 8,
                    // based on the word(s) mentioned in the text, how likely it is that this is the correct annotation
                    "pMentionGivenSurface": 0.253001126280801,
                    "pageRank": 0.03690052603740375,
                    // the word/phrase that is used to mention the concept in the text
                    "text": "Microsoft",
                    // word indices
                    "wFrom": 0,
                    "wTo": 0,
                    "wikiLang": "en"
                }
            ],
            "pageRank": 0.2520778231483313,
            // wikidata id for the concept
            "wikiDataItemId": "Q2283",
            // wikidata class ids for the concept
            "wikiDataClassIds": [
                "Q891723",
                "Q1058914",
                "Q4830453",
                "Q43229",
                "Q874405",
                "Q24229398",
                "Q16334295",
                "Q58778",
                "Q35120",
                "Q16334298",
                "Q286583",
                "Q17519152",
                "Q517966",
                "Q223557",
                "Q16889133",
                "Q18844919",
                "Q488383",
                "Q5127848"
            ],
            // wikidata class ids and names
            "wikiDataClasses": [
                {
                    "enLabel": "public company",
                    "itemId": "Q891723"
                },
                {
                    "enLabel": "software house",
                    "itemId": "Q1058914"
                },
                ...
            ]
        },
        ...
    ],
    // list of nouns identified in the document
    "nouns": [
        {
            // starting and ending indices of the noun
            "iFrom": 25,
            "iTo": 31,
            // normalized form of the text
            "normForm": "version",
            // list of Wordnet synset IDs for the word
            "synsetIds": [
                "101267901",
                "105840650",
                "105928513",
                "106408779",
                "106536389",
                "107173585"
            ]
        },
        ...
    ],
    // list of adjectives found in the document
    "adjectives": [
        {
            // position in the document
            "iFrom": 21,
            "iTo": 23,
            // normalized form of the adjective
            "normForm": "new",
            // wordnet synset ids
            "synsetIds": [
                "300024996",
                "300128733",
                "300818008",
                "300937186",
                "301640850",
                "301687167",
                "301687965",
                "302070491",
                "302584699"
            ]
        },
        ...
    ],
    // list of verbs identified in the document
    "verbs": [
        {
            // text positions
            "iFrom": 10,
            "iTo": 17,
            // normalized form of the verb
            "normForm": "release",
            // wordnet synset ids
            "synsetIds": [
                "200069295",
                "200104868",
                "200269682",
                "200967625",
                "201436518",
                "201474550",
                "201757994",
                "202316304",
                "202421374",
                "202494047"
            ]
        },
        ...
    ],
    // list of adverbs
    "adverbs": [

    ],
    // there are other returned properties that don't have significant importance for the user
}
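
The following sketch shows one way to map the annotation output back onto the input text. Judging by the sample above ("chFrom": 0 and "chTo": 8 for "Microsoft", "iFrom": 10 and "iTo": 17 for "released"), the end positions appear to be inclusive; that reading is an inference from the example rather than a documented guarantee. The helpers mentions and word_spans are hypothetical names, and the sample keeps only the fields the code reads.

```python
def mentions(text, response):
    """Return (title, type, surface text) for every mention of every annotated concept."""
    found = []
    for ann in response.get("annotations", []):
        for sup in ann.get("support", []):
            # chTo is treated as inclusive, hence the + 1
            surface = text[sup["chFrom"]:sup["chTo"] + 1]
            found.append((ann["title"], ann.get("type"), surface))
    return found


def word_spans(text, response, key):
    """Return (normalized form, surface text) for "nouns", "verbs", "adjectives" or "adverbs"."""
    return [
        (word["normForm"], text[word["iFrom"]:word["iTo"] + 1])
        for word in response.get(key, [])
    ]


text = "Microsoft released a new version of Windows OS."

# Trimmed sample based on the documented response.
sample_ann = {
    "annotations": [
        {
            "title": "Microsoft",
            "type": "org",
            "support": [{"chFrom": 0, "chTo": 8, "text": "Microsoft"}],
        }
    ],
    "nouns": [{"iFrom": 25, "iTo": 31, "normForm": "version"}],
    "adjectives": [{"iFrom": 21, "iTo": 23, "normForm": "new"}],
    "verbs": [{"iFrom": 10, "iTo": 17, "normForm": "release"}],
    "adverbs": [],
}

print(mentions(text, sample_ann))
print(word_spans(text, sample_ann, "verbs"))
```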