Home
The toy dataset consists of [] instances, randomly sampled from the US_Grant_CPC_MCF_2020_01_01 files, with the corresponding file content retrieved from Dorothy's API. The API is described in the next section.
The document fields contained in this dataset are:

```
mcf            # the CPC code
title          # title of the patent
abstraction    # abstract of the patent
claims         # flattened claims of the patent
brief_summary  # brief summary of the patent
description    # description of the patent
cpc_codes      # all the MCF codes with the same document id
```
Among them, the `claims` are flattened into one single string, and all HTML tags, math, references, images, and tables are removed from `claims`, `brief_summary`, and `description`. All of the fields are strings except `cpc_codes`. The next section, Label Information, describes the `cpc_codes` field.
In the MCF file, each document has multiple MCF rows, and each row represents one CPC classification result; we use these CPC results as the classification labels. The `cpc_codes` field is a list containing all the related MCF codes that belong to the document.
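To make the label field concrete, the sketch below builds two toy patent records and tallies how often each CPC code appears across them. The record values are invented for illustration and are not taken from the real files.

```python
from collections import Counter

# Hypothetical toy records shaped like the dataset described above;
# the titles and codes are invented for illustration.
toy_data = [
    {"title": "Example patent A", "cpc_codes": ["H03G 3/20", "H04B 1/16"]},
    {"title": "Example patent B", "cpc_codes": ["H03G 3/20"]},
]

# Each patent carries a list of MCF/CPC codes, so label frequencies
# can be tallied by flattening the per-document lists.
label_counts = Counter(code for patent in toy_data
                       for code in patent["cpc_codes"])
print(label_counts["H03G 3/20"])  # -> 2 (shared by both toy patents)
```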
The dataset is saved using `pickle`. To read the data, use the code below:

```python
import pickle

data = pickle.load(open("filename", "rb"))
```
Now `data` contains all the patents. It is a `list` of `dict`s, in which each element is one patent instance. The keys of each `dict` are described in the two sections above. The following code can be used to read each field:
```python
first_element = data[0]
title = first_element['title']
abst = first_element['abstraction']
cpc_codes = first_element['cpc_codes']
```
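Building on the field-access code above, a common next step is to assemble text/label pairs for classification. The helper below is our own sketch: concatenating `title` and `abstraction` is just one reasonable choice of input text, not something mandated by the dataset, and the example record is hypothetical.

```python
def to_text_and_labels(patent):
    """Combine title and abstract into one input string, paired with
    the patent's CPC codes as multi-label targets."""
    text = "%s %s" % (patent["title"], patent["abstraction"])
    return text, patent["cpc_codes"]

# Hypothetical record for demonstration; real records come from `data`.
example = {
    "title": "Gain control",
    "abstraction": "An amplifier.",
    "cpc_codes": ["H03G 3/20"],
}
text, labels = to_text_and_labels(example)
print(text)  # -> Gain control An amplifier.
```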
Simply run the following command; the connection will be established and we can access the API locally:

```shell
ssh -NL 8000:localhost:8000 [email protected]
```
Note: the IP address might be updated by Dorothy in the future.
To access the API, we can use the following code:

```python
import slumber

a = slumber.API('http://localhost:8000/api/v0')
par = {'username': 'capstone2020', 'api_key': '48f0580836eaf85e7af82c57a0e7391a7e06530f'}

# using a document id
patent = a.patent.get(document_id=document_id, **par, full_document=True)
# or using a document number
patent = a.patent.get(document_number=document_number, **par, full_document=True)
# or using a list of document ids, like ['US20050168051A1', 'US20050168050A1']
patents = a.patent.get(document_id__in=document_ids, **par, full_document=True)
```
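Since `document_id__in` accepts a list of ids, a long id list can be issued in batches. The chunking helper below is a plain-Python sketch; the batch size of 50 is an arbitrary assumption on our part, not a documented API limit.

```python
def chunks(items, size):
    """Yield successive fixed-size slices of a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

document_ids = ['US20050168051A1', 'US20050168050A1']  # ids from the example above
batches = list(chunks(document_ids, 50))
# Each batch would then be passed as:
# a.patent.get(document_id__in=batch, **par, full_document=True)
```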
The document number appears in the MCF file in fixed-width columns 10:18, and the document id is basically `'US%d%s' % (document_number, kind)`. So we can use the following code to extract the `document_id` and retrieve the document as an example:
```python
s = 'B21513931210200000H03G 3/20 20130101FI 0 0'  # one row from the MCF file
a.patent.get(document_id='US%s%s' % (s[10:18].lstrip(), s[0:2].lstrip()),
             **par, full_document=False)
```
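The id construction above can be wrapped in a small helper. This is a sketch based only on the column positions stated above (kind code in columns 0:2, document number in columns 10:18); the function name is our own.

```python
def mcf_row_to_document_id(row):
    """Build a document id ('US' + number + kind) from one fixed-width MCF row."""
    number = row[10:18].lstrip()  # document number, columns 10:18
    kind = row[0:2].strip()       # kind code, columns 0:2
    return 'US%s%s' % (number, kind)

row = 'B21513931210200000H03G 3/20 20130101FI 0 0'
print(mcf_row_to_document_id(row))  # -> US10200000B2
```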
Note that there might be records like this:

```
A 800007Y10T 292/391 20150401LA 0 0
```

This record is an old one, published in 1905 and expired in 1922, so it is not in our database; simply ignore such records.
There should not be very many of them: of all the patents that have ever been issued, most are utility patents published after 1970 (can't remember the exact year) and should be in our db. Note that our db does not contain design or plant patents.
- The CPC code might be empty for some patents retrieved from the API; they only have an IPC code.
- Some MCF rows might represent old patents, which we can simply ignore.
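Putting the two notes above into code, one simple preprocessing pass is to drop patents whose `cpc_codes` list is empty (the IPC-only case) and records that could not be retrieved at all. The records below are hypothetical, and the exact filtering criteria may need adjusting for the real data.

```python
# Hypothetical records: one normal, one IPC-only (empty cpc_codes),
# and one old patent that was never retrieved (None).
patents = [
    {"title": "Kept", "cpc_codes": ["H03G 3/20"]},
    {"title": "IPC only", "cpc_codes": []},
    None,
]

# Keep only records that exist and carry at least one CPC code.
usable = [p for p in patents if p is not None and p.get("cpc_codes")]
print(len(usable))  # -> 1
```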
Install

```shell
pip3 install git+https://github.com/globality-corp/sklearn-hierarchical-classification.git@f19e3a0320b46c5a004e0d9ea1105dc09f62bd3a
```
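The installed package expects a class hierarchy (a mapping from each node to its children). CPC codes themselves encode a hierarchy (section, class, subclass, full code), so one plausible approach, sketched below without using the library itself, is to derive that hierarchy directly from the dataset's codes. The level split and the `"<ROOT>"` sentinel name are our own assumptions; the library defines its own `ROOT` constant.

```python
def cpc_levels(code):
    """Split a CPC code like 'H03G 3/20' into hierarchy levels:
    section 'H', class 'H03', subclass 'H03G', then the full code."""
    return [code[0], code[:3], code[:4], code]

def build_hierarchy(codes, root="<ROOT>"):
    """Map each node to the set of its children, starting from a root sentinel."""
    hierarchy = {}
    for code in codes:
        chain = [root] + cpc_levels(code)
        for parent, child in zip(chain, chain[1:]):
            hierarchy.setdefault(parent, set()).add(child)
    return hierarchy

h = build_hierarchy(["H03G 3/20", "H04B 1/16"])
print(sorted(h["<ROOT>"]))  # -> ['H']
print(sorted(h["H"]))       # -> ['H03', 'H04']
```

A dict of this shape (root mapped through intermediate nodes down to the leaf labels) is what hierarchical classifiers generally consume.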