Future of the pypath.share.curl module #199
I agree that this is an important topic for pypath "v2"; just my two cents: collaborators have built a download/caching application named Buffy, basically a smart downloader with file caching functionality, which in my eyes would be suitable for managing pypath downloads. We could try whether it works for our use case and maybe add missing functionality there directly. This could replace pypath's download and temp file handling. I am currently busy finalising the BioCypher manuscript, so if there is an immediate need to process something (e.g. these KEGG data), then I suggest adding a preprocessing module that gets the data independently and then working on the aggregated data. It may not be worth modifying the
Talked briefly about it with Tim @motey, the Buffy maintainer. We could implement async functionality there.
Buffy has a server/client architecture, and in principle async calls are possible. Maybe a batch API in the client would be useful. Very simplified, Buffy is a wrapper around the Python
Hi, thanks for looking at this, it's indeed something we should address soon. The curl module in pypath is very old and not well designed or maintained, hence no surprise the code is convoluted and difficult to use. Buffy looks great, it's definitely an option we should consider. It has many of the features that we need; maybe we have to make some adjustments, but that might be cheaper than developing from scratch. The features that we need are full control over all HTTP parameters: GET, POST, multipart forms, binary payload, HTTP headers (both request & response), cookies, 30x redirects, disabling ALPN, downgrade to HTTP 1.1, retries, timeouts, FTP. And the two most important: the library should not add headers or do session management on its own (or even better if it does, but it can be disabled), and it should be able to provide detailed debug traces.

Looking at Buffy, two important points: can we start the server automatically, and without docker? And the same for redis? Also, if redis is a software to be independently installed, that complicates our installation; shall we add a Python key-value storage backend? @motey Maybe based on the points above, you can advise how suitable Buffy is for the purpose? Originally I would have preferred to have the cache management in a completely separate module. We should still decide about this.
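To make the "full control, no hidden headers" requirement concrete, here is a minimal stdlib-only sketch of the level of control meant above: explicit method, caller-defined headers, binary payload. This is illustrative only, not pypath's or Buffy's actual API.

```python
from urllib.request import Request

# Hypothetical sketch: the caller decides everything; urllib adds
# nothing behind the caller's back until the request is actually
# sent, when urlopen() fills in Host and Content-Length.
req = Request(
    'https://pharos-api.ncats.io/graphql',
    data=b'{"query": "{ targets { name } }"}',  # binary payload
    headers={
        'Content-Type': 'application/json',
        'DNT': '1',
    },
    method='POST',
)

print(req.get_method())                # POST
print(req.get_header('Content-type'))  # application/json

# sending it would be: urllib.request.urlopen(req, timeout=30)
```

Whatever library we pick should allow this kind of inspection before anything goes over the wire, which is also what makes detailed debug traces possible.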
I was thinking that Buffy could be this "separate module". Depending on how aligned our feature wishes are with what Tim has in mind for Buffy, we could just contribute there. Or we could write a shallow wrapper around Buffy that adds our required features.
@slobentanzer I talked about having the download manager and cache manager modules separated.
@deeenes I see, got it.
Of course:

```python
import json

from pypath.share.curl import Curl


def pharos_general(query, variables=None):

    url = "https://pharos-api.ncats.io/graphql"
    req_headers = {
        "Accept-Encoding": "gzip, deflate, br",
        "Content-Type": "application/json",
        "Connection": "keep-alive",
        "DNT": "1",
        "Origin": "https://pharos-api.ncats.io",
    }

    payload = {'query': query}

    if variables:
        payload['variables'] = variables

    binary_data = json.dumps(payload).encode('utf-8')

    c = Curl(
        url=url,
        req_headers=req_headers,
        binary_data=binary_data,
        compressed=True,
        compr='gz',
    )

    return json.loads(c.result)['data']
```
```python
def get_targets(chunk_size=100):

    step = 0
    result = []
    variables = {
        'chunk_size': chunk_size,
        'step': step,
    }
    query = '''
        query targetDetails($chunk_size: Int! $step: Int!) {
            targets {
                targets(top: $chunk_size skip: $step) {
                    name
                    sym
                    uniprot
                    expressions(top: 10000) {
                        expid
                        type
                        tissue
                        value
                        uberon {
                            name
                            uid
                        }
                        pub {
                            pmid
                        }
                    }
                    gtex {
                        tissue
                        tpm
                        log2foldchange
                        uberon {
                            name
                            uid
                        }
                    }
                    orthologs(top: 10000) {
                        name
                        species
                        orid
                        dbid
                        geneid
                        source
                    }
                    ligands(top: 10000 isdrug: true) {
                        ligid
                        name
                        activities(all: true) {
                            actid
                            type
                            moa
                            value
                        }
                    }
                    xrefs(source: "Ensembl") {
                        name
                    }
                    diseases(top: 10000) {
                        name
                        mondoID
                        dids {
                            id
                            dataSources
                            doName
                        }
                    }
                }
            }
        }
    '''

    while True:
        response = pharos_general(query, variables)['targets']['targets']
        if not response:
            break
        result.extend(response)
        variables['step'] += chunk_size

    return result
```
If we decrease that number to 100, it takes very long to reach 14000: one query takes approximately 5-10 seconds, so we could benefit from async requests, as the order of the responses doesn't matter. A similar example would be getting drug-drug interactions from KEGG: making thousands of (14000, again) REST requests without async support takes a lot of time. I think I made my point; please ask me anything about the use cases if you want.
@deeenes wrote:
I would love to provide/invest in some new features for Buffy for external use cases, and I also welcome contributions :) 🚀
For any Buffy request, the method, header fields and request body can already be fully customized, see https://git.connect.dzd-ev.de/dzdpythonmodules/buffy/-/blob/main/buffy/buffypyclient/buffypyclient.py#L685 One spontaneous idea for an improvement would be just a new param containing a dict that will pass through any other params to But yeah, the design philosophy/base idea is to provide reasonable default values wherever we can, but give the client/caller full control over the request params if needed. Any suggestions for improvements are welcome :)
Easy :)

```python
from multiprocessing import Process

from buffy.buffyserver.main import start_buffy_server

# start_buffy_server() returns a tuple[Process, Process]
backend_process, api_webserver_process = start_buffy_server()
```
Maybe, but I guess that would be a dirty hack :)
YES! Redis is just my go-to prototyping database. Actually it is bad at being a database for Buffy :) Redis is better at being a f**** fast cache than at being a persistent DB. A Python-based key-value store sounds very attractive, but I am a little bit afraid regarding performance. Let's maybe do an evaluation, and let's see which solutions are already on the "market". A Python-native solution that allows Buffy to run in one context would be great!
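For the evaluation, one candidate worth benchmarking is `sqlite3` from the stdlib, which is persistent and needs no server. A hypothetical sketch of what a redis-free key-value backend could look like (this is not Buffy's actual storage interface, just an illustration of the shape):

```python
import json
import sqlite3
import time


class SqliteKV:
    """Minimal persistent key-value store on the stdlib only."""

    def __init__(self, path=':memory:'):
        # a real backend would use a file path, not ':memory:'
        self.con = sqlite3.connect(path)
        self.con.execute(
            'CREATE TABLE IF NOT EXISTS kv '
            '(key TEXT PRIMARY KEY, value TEXT, ts REAL)'
        )

    def set(self, key, value):
        self.con.execute(
            'INSERT OR REPLACE INTO kv VALUES (?, ?, ?)',
            (key, json.dumps(value), time.time()),
        )
        self.con.commit()

    def get(self, key, default=None):
        row = self.con.execute(
            'SELECT value FROM kv WHERE key = ?', (key,)
        ).fetchone()
        return json.loads(row[0]) if row else default


kv = SqliteKV()
kv.set('cache:pharos', {'etag': 'abc', 'size': 1024})
print(kv.get('cache:pharos')['size'])  # 1024
```

Performance would certainly trail redis for hot-path caching, but for tracking downloaded files it may well be fast enough, and it removes the external dependency from the install story.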
Are there any updates on this topic?
Sorry, a little busy atm with other stuff. I have your use case on my todo list, but I don't know if I'll manage to write a test case for it this year. Regarding an alternative storage implementation to replace redis (via Docker): atm, I don't have too many reasons to work on that :D I would prefer to enable your framework to handle containers. Also, that would open up a whole new universe of tools.
No problem at all. Sorry for disturbing you @motey, it's an open source project after all, you may not have time to look at it. I just wanted to learn the status so I can act on it.
There should be an async option, since some API requests take long to respond and we may have 14000 more requests to make, which is a real case with the KEGG database. I tried to implement an `__await__` method on the `Curl` class, but I wasn't successful, as the `curl` module is monolithic and obsolete.

Other than that, there are so many arbitrary options and so much redundant code in `Curl`. I think it violates the simplicity rule: do one thing and do it well. I expect it to just connect to the web and return a response, so I can handle the outcome. In contrast, this module does so many complex things that shouldn't be its responsibility, and it doesn't fit everything. I myself struggled with a different problem on every new database I tried to add. There were some connections which couldn't be made with our curl module but could be made with 3 different libraries:

- the external `pycurl` library,
- the external `requests` library,
- the standard `urllib` library.

And, for example, `aiohttp` would do nicely for async calls. I understand that there may be a need for caching; I don't know the caching capabilities of these libraries, but I think we should keep things small and modular.
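The "small and modular" argument can be made concrete: caching does not have to live inside the downloader at all, it can wrap any fetch callable from the outside. A hypothetical sketch (`disk_cached` and `fake_fetch` are made up for illustration, this is not pypath's design):

```python
import functools
import hashlib
import json
import os
import tempfile


def disk_cached(fetch):
    """Wrap any fetch(url) callable with file-based caching,
    keeping the download and cache concerns in separate layers."""
    cache_dir = tempfile.mkdtemp(prefix='pypath_cache_')

    @functools.wraps(fetch)
    def wrapper(url):
        key = hashlib.sha256(url.encode()).hexdigest()
        path = os.path.join(cache_dir, key)
        if os.path.exists(path):
            with open(path) as fp:
                return json.load(fp)       # cache hit: no download
        result = fetch(url)                # cache miss: delegate
        with open(path, 'w') as fp:
            json.dump(result, fp)
        return result

    return wrapper


calls = []


@disk_cached
def fake_fetch(url):
    calls.append(url)  # counts the "real" network hits
    return {'url': url, 'ok': True}


fake_fetch('https://example.org/a')
fake_fetch('https://example.org/a')  # second call served from cache
print(len(calls))  # 1
```

The downloader underneath stays a dumb "connect and return a response" function, and the same decorator works whether it is backed by `pycurl`, `requests`, `urllib` or `aiohttp`.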
I said 'I' throughout the whole issue and this reflects my own ideas, but this topic has been discussed in a meeting with @slobentanzer and our professor. We agree that this topic may need another meeting.
So, is curl going to be replaced in the near future? How should we start the change? Should we continue using it?