Add JSON reader & extractor #27
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,153 @@ | ||
''' | ||
This module defines the JSONReader. | ||
|
||
It can parse documents nested in one file, for which it uses the pandas library, | ||
or multiple files with one document each, which use the generic Python json parser. | ||
''' | ||
|
||
import json | ||
from os.path import isfile | ||
from typing import Iterable, List, Optional, Union | ||
|
||
from pandas import json_normalize | ||
from requests import Response | ||
|
||
from .core import Reader, Document, Source | ||
import ianalyzer_readers.extract as extract | ||
|
||
class JSONReader(Reader): | ||
As a design question, since the single-document and multiple-document versions use different libraries and accept different arguments, perhaps they should just be two different classes? |
||
''' | ||
A base class for Readers of JSON encoded data. | ||
|
||
The reader can either be used on a collection of JSON files (`single_document=True`), in which each file represents a document, | ||
or for a JSON file containing lists of documents. | ||
|
||
If the attributes `record_path` and `meta` are set, they are used as arguments to [pandas.json_normalize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) to unnest the JSON data. | ||
|
||
Attributes: | ||
single_document: indicates whether the data is organized such that a file represents a single document | ||
record_path: a path or list of paths by which a list of documents can be extracted from a large JSON file; irrelevant if `single_document = True` | ||
meta: a list of paths, or list of lists of paths, by which metadata common for all documents can be located; irrelevant if `single_document = True` | ||
|
||
Examples: | ||
### Multiple documents in one file: | ||
```python | ||
example_data = { | ||
'path': { | ||
'sketch': 'Hungarian Phrasebook', | ||
'episode': 25, | ||
'to': { | ||
'records': | ||
[ | ||
{'speech': 'I will not buy this record. It is scratched.', 'character': 'tourist'}, | ||
{'speech': "No sir. This is a tobacconist's.", 'character': 'tobacconist'} | ||
] | ||
} | ||
} | ||
} | ||
|
||
class MyJSONReader(JSONReader): | ||
record_path = ['path', 'to', 'records'] | ||
meta = [['path', 'sketch'], ['path', 'episode']] | ||
|
||
speech = Field('speech', JSON('speech')) | ||
character = Field('character', JSON('character')) | ||
sketch = Field('sketch', JSON('path.sketch')) | ||
episode = Field('episode', JSON('path.episode')) | ||
``` | ||
To define the paths used to extract the field values, consider the data format that `pandas.json_normalize` creates: | ||
a table in which each row represents a document and each column corresponds to a path, either relative to the documents within `record_path` | ||
or relative to the top level (`meta`), with nested keys joined by dots. | ||
```csv | ||
row,speech,character,path.sketch,path.episode | ||
0,"I will not buy this record. It is scratched.","tourist","Hungarian Phrasebook",25 | ||
1,"No sir. This is a tobacconist's.","tobacconist","Hungarian Phrasebook",25 | ||
``` | ||
|
||
### Single document per file: | ||
```python | ||
example_data = { | ||
'sketch': 'Hungarian Phrasebook', | ||
'episode': 25, | ||
'scene': { | ||
'character': 'tourist', | ||
'speech': 'I will not buy this record. It is scratched.' | ||
} | ||
} | ||
|
||
class MyJSONReader(JSONReader): | ||
single_document = True | ||
|
||
speech = Field('speech', JSON('scene', 'speech')) | ||
character = Field('character', JSON('scene', 'character')) | ||
sketch = Field('sketch', JSON('sketch')) | ||
episode = Field('episode', JSON('episode')) | ||
``` | ||
|
||
''' | ||
|
||
single_document: bool = False | ||
''' | ||
Set to `True` if each .json file encodes exactly one document. | ||
In that case, the reader assumes that the file contains no lists. | ||
''' | ||
|
||
record_path: Optional[List[str]] = None | ||
''' | ||
a keyword or list of keywords by which a list of documents can be extracted from a large JSON file. | ||
Only relevant if `single_document=False`. | ||
''' | ||
|
||
meta: Optional[List[Union[str, List[str]]]] = None | ||
''' | ||
a list of keywords, or list of lists of keywords, by which metadata for each document can be located, | ||
if it is in a different path than `record_path`. Only relevant if `single_document=False`. | ||
''' | ||
|
||
def source2dicts(self, source: Source, *nargs, **kwargs) -> Iterable[Document]: | ||
""" | ||
Given a source (a file path, raw bytes, or a `requests.Response`, optionally paired with a metadata dictionary in a tuple), yields the extracted documents. | ||
|
||
Parameters: | ||
source: the input data | ||
|
||
Returns: | ||
an iterable of documents | ||
""" | ||
if isinstance(source, tuple): | ||
metadata = source[1] | ||
json_data = self._get_json_data(source[0]) | ||
else: | ||
metadata = None | ||
json_data = self._get_json_data(source) | ||
|
||
if not self.single_document: | ||
documents = json_normalize( | ||
json_data, record_path=self.record_path, meta=self.meta | ||
).to_dict('records') | ||
else: | ||
documents = [json_data] | ||
|
||
self._reject_extractors(extract.XML, extract.CSV, extract.RDF) | ||
This line is correct, but all other |
||
|
||
for doc in documents: | ||
field_dict = { | ||
field.name: field.extractor.apply( | ||
doc, metadata=metadata, *nargs, **kwargs | ||
) | ||
for field in self.fields | ||
} | ||
|
||
yield field_dict | ||
|
||
def _get_json_data(self, source: Source) -> dict: | ||
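# Check for a Response first: isfile() can raise a TypeError for objects that are not path-like | ||
if isinstance(source, Response): | ||
return source.json() | ||
elif isfile(source): | ||
with open(source, "r") as f: | ||
return json.load(f) | ||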
elif isinstance(source, bytes): | ||
return json.loads(source) | ||
else: | ||
raise Exception("Unexpected source type for JSON Reader") |
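
For this simple case, the table shown in the `JSONReader` docstring can be reproduced without pandas. What follows is a minimal stdlib-only sketch, not what the PR does and not pandas' implementation: `get_path` and `flatten_records` are hypothetical helpers that follow `record_path` to the list of documents and attach the `meta` values under dot-joined keys. It assumes that every key along a path maps to a dictionary and that the records themselves are flat; `pandas.json_normalize` covers considerably more cases.

```python
from typing import Any, Dict, Iterator, List


def get_path(data: Dict[str, Any], path: List[str]) -> Any:
    """Follow a list of keys into nested dictionaries (hypothetical helper)."""
    for key in path:
        data = data[key]
    return data


def flatten_records(
    data: Dict[str, Any], record_path: List[str], meta: List[List[str]]
) -> Iterator[Dict[str, Any]]:
    """Yield one flat dict per record, with shared metadata under dot-joined keys."""
    # Metadata is looked up once, relative to the top level of the file
    shared = {'.'.join(path): get_path(data, path) for path in meta}
    # Each record found under record_path becomes one document
    for record in get_path(data, record_path):
        yield {**record, **shared}


example_data = {
    'path': {
        'sketch': 'Hungarian Phrasebook',
        'episode': 25,
        'to': {
            'records': [
                {'speech': 'I will not buy this record. It is scratched.', 'character': 'tourist'},
                {'speech': "No sir. This is a tobacconist's.", 'character': 'tobacconist'},
            ]
        },
    }
}

documents = list(
    flatten_records(
        example_data,
        record_path=['path', 'to', 'records'],
        meta=[['path', 'sketch'], ['path', 'episode']],
    )
)

for doc in documents:
    print(doc)
```

The resulting keys (`speech`, `character`, `path.sketch`, `path.episode`) match the column names in the docstring's CSV illustration, which is why the example fields use `JSON('path.sketch')` rather than a nested path.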
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,8 @@ dependencies = [ | |
"beautifulsoup4", | ||
"lxml", | ||
"openpyxl", | ||
"pandas", | ||
Pandas is quite a large library, which seems out of balance with how little it's doing here. You already wrote the functionality to look up paths in JSON data (for the extractor), so the extra bit of logic to get the list of documents and the metadata based on a path seems kind of trivial?
I actually struggled quite a bit with parsing a nested list, as well as metadata. The pandas implementation is far from trivial, as it also allows for data within the
Ah, I probably underestimated this at first glance. And you're right, it's a non-issue for I-analyzer. Let's go with the pandas solution then 👍 |
||
"requests", | ||
"rdflib", | ||
] | ||
|
||
|
This doesn't work with an empty list of keys, but I can imagine there would be use cases for that.
That would be useful only if we wanted to parse the whole contents of a file. I cannot think of a use case for that off the top of my head. Leaving this for later.
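
One way to make an empty list of keys well-defined is to treat it as "return the whole document". The extractor code under discussion is not shown in this diff, so the sketch below uses a hypothetical `lookup` helper rather than the PR's `JSON` extractor; returning `None` for a missing key is likewise an assumption, not the library's documented behaviour.

```python
from typing import Any, Sequence


def lookup(data: Any, keys: Sequence[str], default: Any = None) -> Any:
    """Walk `keys` into nested dictionaries; an empty `keys` returns `data` itself."""
    current = data
    for key in keys:
        # Stop early if the path leaves the dictionary structure
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


document = {
    'sketch': 'Hungarian Phrasebook',
    'episode': 25,
    'scene': {
        'character': 'tourist',
        'speech': 'I will not buy this record. It is scratched.',
    },
}

speech = lookup(document, ['scene', 'speech'])
whole = lookup(document, [])
missing = lookup(document, ['scene', 'location'])
```

Because the loop body never runs for an empty sequence, the empty-keys case needs no special branch; whether returning the whole file is desirable is the open question raised in the comments.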