A Python script that parses HTML files containing ProQuest search results and generates an easily navigable CSV file (and pandas DataFrame).
This package requires two other packages to run: pandas and BeautifulSoup. Install them by running these two commands in your command line:
pip install pandas
pip install beautifulsoup4
Drop the ProQuestResult.py file into your project folder. Then add the following import to your project, whether it is a Python file or a Jupyter Notebook:
from ProQuestResult import *
The program allows you to define two optional settings. Open ProQuestResult.py and find the two lines that define the variables STOPFILES and CACHE_RAW_IN_OBJECT.
STOPFILES needs to be a list of strings. It determines which file names the program will skip when reading a directory. By default it contains only one element, Mac OS X's annoyingly present .DS_Store files:
STOPFILES = ['.DS_Store']
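To illustrate what a stop list like this does, here is a minimal sketch of filtering a directory listing by file name. The helper `list_result_files` is hypothetical and is not part of ProQuestResult.py; the actual implementation may well differ.

```python
from pathlib import Path
import tempfile

STOPFILES = ['.DS_Store']  # same default as described above

def list_result_files(directory, stopfiles=STOPFILES):
    """Hypothetical helper: return the files in `directory` whose names are not in `stopfiles`."""
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.name not in stopfiles
    )

# Demonstration on a throwaway directory with one blocked file
with tempfile.TemporaryDirectory() as tmp:
    for name in ['.DS_Store', 'first_file.html', 'second_file.html']:
        (Path(tmp) / name).write_text('')
    kept = [path.name for path in list_result_files(tmp)]

print(kept)
```

Adding another name to the list (for example a hypothetical 'Thumbs.db') would exclude that file in the same way.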
CACHE_RAW_IN_OBJECT needs to be a boolean. It determines whether each ProQuestResult will contain an instance variable (ProQuestResult._raw) that holds the raw HTML of its file. By default, this variable is set to False in order to save memory. Switch it to True if you need to access the raw HTML from your search result file.
You have two options when creating an object containing your search results: ProQuestResult (1) and ProQuestResults (2). The subtle difference is in the plural.
If you have one individual HTML file with ProQuest search results, ProQuestResult is the object you want to invoke. It provides a list of dictionaries (ProQuestResult.results) and a DataFrame object (ProQuestResult.df) with all the details for the search results.
To set up an object, provide it with the file parameter:
parsed_results = ProQuestResult(file = './my_search_results/the_file_with_results.html')
The file
parameter should be a string but can also be a PosixPath (see pathlib's documentation for reference).
Once the object has been set up, you can easily access the search results as a list of dictionaries:
print(parsed_results.results)
If you'd rather see the search results as a pandas DataFrame, you can do so by calling:
parsed_results.df
This also provides an easy way to export the DataFrame to a CSV, by calling:
parsed_results.df.to_csv('xxx.csv')
Note: The instance variables results and df are both generated on demand when you access them. Depending on the number of search results in the file, that can take some time to run.
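If the values are in fact rebuilt on every access (an assumption; the script may cache them internally), it helps to read the attribute once and reuse the value. The sketch below uses a toy stand-in class, `OnDemandResults`, which is hypothetical and not the real ProQuestResult; the 'title' key is likewise only illustrative.

```python
class OnDemandResults:
    """Toy stand-in for illustration only, not the real ProQuestResult class."""

    def __init__(self, rows):
        self._rows = rows
        self.build_count = 0  # how many times `results` has been generated

    @property
    def results(self):
        # Assumption: the list is rebuilt each time the attribute is accessed
        self.build_count += 1
        return [dict(row) for row in self._rows]

parsed = OnDemandResults([{'title': 'A'}, {'title': 'B'}])

# Access the attribute once, keep the value, and reuse it
results = parsed.results
titles = [row['title'] for row in results]
count = len(results)

print(titles, count, parsed.build_count)
```

Reading `parsed.results` inside a loop instead would trigger one rebuild per iteration in this toy class.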
The object also gives you easy access to the search query as a string:
print(parsed_results.query)
If you call len() on the object, it returns the number of search results in the file:
len(parsed_results)
If you have a directory or a list of files containing search results from ProQuest and you want to collect all of them in one object, use ProQuestResults instead of ProQuestResult.
The program is flexible and accepts its input through either of two parameters: files or directory.
files
needs to be provided as a list of file names as strings (or PosixPaths). For example:
parsed_results = ProQuestResults(files = ['./first_file.html', './second_file.html', './third_file.html', './fourth_file.html'])
directory
can be provided as either (i) a string (or a PosixPath) with a path to a directory containing the search result files you want to work with, or (ii) a list of strings (or PosixPaths) that refer to any number of directories containing search result files.
(i) For example, if you work with a single directory, you would call:
parsed_results = ProQuestResults(directory = './my_search_results/')
(ii) If you have a number of directories you need to summarize in one object, you would call the same object but set it up with a list of directories:
parsed_results = ProQuestResults(directory = ['./my_first_search_result_directory/', './my_second_search_result_directory/'])
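One way a parameter can accept both a single path and a list of paths is to normalize it into a list first. The sketch below shows that idea with a hypothetical helper, `normalize_directories`; it is not taken from ProQuestResult.py, which may handle this differently.

```python
from pathlib import Path

def normalize_directories(directory):
    """Hypothetical helper: turn a str, a Path, or a list of either into a list of Paths."""
    if isinstance(directory, (str, Path)):
        directory = [directory]  # wrap a single value so both cases share one code path
    return [Path(item) for item in directory]

single = normalize_directories('./my_search_results/')
several = normalize_directories([
    './my_first_search_result_directory/',
    Path('./my_second_search_result_directory/'),
])

print(single)
print(several)
```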
Once the object has been set up, you can easily access the search results in the same manner as the examples under ProQuestResult
above:
To access all the search results as a list of dictionaries:
print(parsed_results.results)
To access all the search results as a DataFrame:
parsed_results.df
Note: As is the case with ProQuestResult, the instance variables results and df are both generated on demand when you access them. Depending on the number of search results in each file, that can take some time to run.
Since a ProQuestResults object is built from multiple files, each of which contains one search query, there are two attributes for accessing search query information. The program can provide the search query for each file (through ProQuestResults.files_to_query) and a list of the files that contain each search query (through ProQuestResults.query_to_files).
files_to_query is accessible as a native Python dictionary with the key-value structure {Path(file): 'search term'}:
dict_object_with_files_to_query = parsed_results.files_to_query
query_to_files is accessible in the same way as a native Python dictionary but with the inverse key-value structure {'search term': [Path(file), ...]}:
dict_object_with_query_to_files = parsed_results.query_to_files
Since both of these attributes provide you with a native dictionary, you can use any of the dictionary type's built-in operations on them, such as looking up a value by key:
from pathlib import Path
file = Path('./my_search_results/the_file_with_results.html')
dict_object_with_files_to_query[file]
You can also iterate over the results with the dictionary type's native method items():
for search_term, list_of_files in dict_object_with_query_to_files.items():
    print("The search term", search_term, "was used to generate these files:", list_of_files)
for file, search_term in dict_object_with_files_to_query.items():
    print("The file", file, "was generated from this search term:", search_term)
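The relationship between the two dictionaries can be sketched as a simple inversion. The example assumes that query_to_files maps each search term to a list of files, as the first loop above suggests; the sample file names and search terms are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical sample data in the shape of files_to_query
files_to_query = {
    Path('first_file.html'): 'climate change',
    Path('second_file.html'): 'climate change',
    Path('third_file.html'): 'renewable energy',
}

def invert(files_to_query):
    """Group files by search term: {'search term': [Path(file), ...]}."""
    grouped = defaultdict(list)
    for file, search_term in files_to_query.items():
        grouped[search_term].append(file)
    return dict(grouped)

query_to_files = invert(files_to_query)
print(query_to_files['climate change'])
```

Because several files can share one search term, the inverted dictionary needs a list as its value rather than a single path.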
No future features are planned. If you would like to request a feature, feel free to do so by opening an Issue on GitHub.