yyn19951228 edited this page Feb 12, 2020 · 9 revisions

Dorothy-Ymir Wiki

Data-set

Toy Dataset

Text Information

The toy dataset consists of [] instances randomly sampled from the US_Grant_CPC_MCF_2020_01_01 files; the content of each document is retrieved from Dorothy's API. The API is documented in the Backend API documentation section below.

Each instance in this dataset contains the following fields:

  1. mcf # the CPC code (one MCF row)
  2. title # title of the patent
  3. abstraction # abstract of the patent
  4. claims # flattened claims of the patent
  5. brief_summary # brief summary of the patent
  6. description # description of the patent
  7. cpc_codes # all the MCF codes with the same document id

Among them, the claims are flattened into one single string, and all of the HTML tags/math/refs/images/tables are removed from claims, brief_summary and description.

All of the fields are strings except cpc_codes; the cpc_codes field is described below.

In the MCF file, each document can have multiple MCF rows, and each row represents one CPC classification result. We will use these CPC results as the classification labels.

The cpc_codes field is a list containing all the MCF codes that belong to the document.
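The cpc_codes entries are raw CPC symbols such as "H03G   3/20" (as seen in the MCF row example later on this page). If hierarchical labels are needed, a symbol can be split into its CPC levels. This is only a sketch: the `cpc_levels` helper name is ours, and it assumes the standard CPC layout (1-letter section, 2-digit class, 1-letter subclass, then group/subgroup).

```python
# Sketch: split a CPC symbol such as "H03G   3/20" into its hierarchy
# levels (section, class, subclass, group). Assumes the standard CPC
# layout: 1-letter section, 2-digit class, 1-letter subclass, then
# "group/subgroup" padded with spaces.
def cpc_levels(symbol):
    section = symbol[0]       # e.g. "H"
    cpc_class = symbol[:3]    # e.g. "H03"
    subclass = symbol[:4]     # e.g. "H03G"
    group = subclass + symbol[4:].replace(" ", "")  # e.g. "H03G3/20"
    return section, cpc_class, subclass, group

print(cpc_levels("H03G   3/20"))
# -> ('H', 'H03', 'H03G', 'H03G3/20')
```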

How to load

The dataset is saved using pickle. To read the data, use the code below:

import pickle

data = pickle.load(open("filename", "rb"))

Now data contains all the patents. It is a list of dicts, in which each element is one patent instance. The keys of each dict are described in the two sections above. The following code can be used to read each field:

first_element = data[0]
title = first_element['title']
abst = first_element['abstraction']
cpc_codes = first_element['cpc_codes']
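To iterate over the whole dataset, the same pattern applies; for example, collecting every distinct CPC code. The `data` list below is a tiny synthetic stand-in with the same keys as the pickled dataset (with the real file, replace it with the `pickle.load` call above):

```python
# Sketch of iterating over the loaded list-of-dicts structure.
# `data` here is a small synthetic example, not the real pickle file.
data = [
    {"mcf": "...", "title": "Example patent", "abstraction": "An example.",
     "claims": "1. A device ...", "brief_summary": "...", "description": "...",
     "cpc_codes": ["H03G   3/20", "Y10T 292/391"]},
]

# Collect every distinct CPC code seen in the dataset.
all_codes = set()
for patent in data:
    all_codes.update(patent["cpc_codes"])

print(len(data), sorted(all_codes))
```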

Backend API documentation

How to connect

Simply run the following command to build the connection, after which we can access the APIs locally:

ssh -NL 8000:localhost:8000 [email protected]

Note: the IP address might be changed by Dorothy in the future.

To access the API, we can use the following code:

import slumber

a = slumber.API('http://localhost:8000/api/v0')
par = {'username': 'capstone2020', 'api_key': '48f0580836eaf85e7af82c57a0e7391a7e06530f'}
# using document id
patent = a.patent.get(document_id=document_id, **par, full_document=True)
# or using document number
patent = a.patent.get(document_number=document_number, **par, full_document=True)
# or using a list of document_ids, like ['US20050168051A1', 'US20050168050A1',]
patents = a.patent.get(document_id__in=document_ids, **par, full_document=True)
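When passing many ids via document_id__in, a very long list can make the query string unwieldy, so it may help to fetch in batches. This is only a sketch: the `chunks` helper is ours, and the chunk size of 50 is an arbitrary assumption, not a verified API limit.

```python
# Sketch: split a long list of document_ids into smaller batches before
# calling document_id__in. The chunk size of 50 is arbitrary.
def chunks(items, size=50):
    for i in range(0, len(items), size):
        yield items[i:i + size]

document_ids = ['US20050168051A1', 'US20050168050A1']  # sample ids from above
for batch in chunks(document_ids):
    # patents = a.patent.get(document_id__in=batch, **par, full_document=True)
    print(batch)
```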

How to get document number and ID

The document number is present in the MCF file in (fixed-width) columns 10:18, and the kind code in columns 0:2.

The document id is basically 'US%s%s' % (document_number, kind)

So, as an example, we can use the following code to extract the document_id and retrieve the document:

s = 'B21513931210200000H03G   3/20    20130101FI  0 0'      # provided in MCF file as one row
a.patent.get(document_id='US%s%s' % (s[10:18].lstrip(),s[0:2].lstrip()),
             **par, full_document=False)
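The inline expression above can be wrapped in a small helper for reuse. The `mcf_row_to_document_id` name is ours; the column positions follow this page (kind in columns 0:2, document number in columns 10:18):

```python
# Sketch: build a document_id from one fixed-width MCF row.
# Column positions per this page: kind code in 0:2, document number in 10:18.
def mcf_row_to_document_id(row):
    kind = row[0:2].strip()
    number = row[10:18].lstrip()
    return 'US%s%s' % (number, kind)

row = 'B21513931210200000H03G   3/20    20130101FI  0 0'
print(mcf_row_to_document_id(row))
# -> US10200000B2
```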

Note that there might be records like this:

A           800007Y10T 292/391   20150401LA  0 0

This record is an old one: the patent was published in 1905 and expired in 1922, so it is not in our database, and we can simply ignore such records.

There should not be very many of them: of all the patents that have ever been issued, most are utility patents published after 1970 (can't remember the exact year) and should be in our db.

Note that our db doesn't contain design or plant patents.

Important Notes

  1. The CPC code might be empty for some patents retrieved from the API; those patents only have an IPC code
  2. Some MCF rows might represent old patents, which we can simply ignore
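Given note 1, instances without CPC codes may need to be filtered out before training. A minimal sketch, using a synthetic `data` list with the keys described earlier:

```python
# Sketch for note 1: drop instances whose cpc_codes list is empty
# (patents that only carry an IPC code). `data` is a synthetic sample.
data = [
    {"title": "has labels", "cpc_codes": ["H03G   3/20"]},
    {"title": "ipc only", "cpc_codes": []},
]

labeled = [p for p in data if p["cpc_codes"]]
print(len(labeled))  # -> 1
```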

Model

sklearn

Install

pip3 install git+https://github.com/globality-corp/sklearn-hierarchical-classification.git@f19e3a0320b46c5a004e0d9ea1105dc09f62bd3a