Skip to content
This repository has been archived by the owner on Jun 7, 2023. It is now read-only.

Speedup/refactor proposal for rforcecom.query #22

Open
ax42 opened this issue Oct 23, 2015 · 6 comments
Open

Speedup/refactor proposal for rforcecom.query #22

ax42 opened this issue Oct 23, 2015 · 6 comments

Comments

@ax42
Copy link
Contributor

ax42 commented Oct 23, 2015

I have an idea on how to improve up rforcecom.query, but I'm struggling with the mechanics of the XML, so if someone can point me in the right direction (@reportmort?) that would be helpful.

I think the function should run as follows:

  • fetch the first hunk of xml
  • extract all the //records (into a list?)
  • check the //done entry, if it's false
    • fetch the next lot of xml, add the //records into the list(?) above
  • Run through the records, find all that have a "type" attribute (which means they are a lookup field)
    • For all the child records, rename them to ParentName.ChildName (so the Name child in the Opportunity record become Opportunity.Name
    • Shift these records up a level in the tree (append and delete original?)
  • Rerun this step until it returns no records (because the tree could be multiple levels deep)

I think this will be faster / cleaner than the existing code (which builds an empty data.frame and then populates cell-by-cell) as it can also use locators which are hopefully quite optimised.

So far, I've got:

function(session, soqlQuery, queryAll=FALSE){

 # Retrieve XML via REST API

 if(queryAll){
   endpointPath <- rforcecom.api.getSoqlAllEndpoint(session['apiVersion'])
 } else {
   endpointPath <- rforcecom.api.getSoqlEndpoint(session['apiVersion'])
 }
 URL <- paste(session['instanceURL'], endpointPath, curlEscape(soqlQuery), sep="")

 fetchXML <- function(session, URL) {
   h <- basicHeaderGatherer()
   t <- basicTextGatherer()
   OAuthString <- paste("Bearer", session['sessionID'])
   httpHeader <- c("Authorization"=OAuthString, "Accept"="application/xml")
   curlPerform(url=URL, httpheader=httpHeader, headerfunction = h$update, writefunction = t$update, ssl.verifypeer=F)

   # BEGIN DEBUG
   if(exists("rforcecom.debug") && rforcecom.debug){ message(URL) }
   if(exists("rforcecom.debug") && rforcecom.debug){ message(t$value()) }
   # END DEBUG  

   return(t$value())
 }

 system.time(
   xml.in <- fetchXML(session, URL)
 )

 # Parse XML
 x.root <- xmlRoot(xmlParse(xml.in, asText=T))

 records.xml <- x.root['//records']

..but as I said, I'm really struggling with getting the XML stripped apart.

@ax42
Copy link
Contributor Author

ax42 commented Oct 23, 2015

Another advantage of the above is that I think it will work with arbitrarily deep nested queries (e.g. SELECT Id, Name, (SELECT ActivityDate, Description FROM ActivityHistories) FROM Account from Issue #15

@StevenMMortimer
Copy link
Contributor

@ax42 I wrote a separate function just for parsing query response XML that can handle nested parent-child relationships and follows an algorithm similar to your suggested pseudocode. You can find that function on a branch in my forked repository: https://github.com/ReportMort/RForcecom/blob/query-refactor/R/rforcecom.utils.R

You can download and test this improved version of rforcecom.query by installing from my branch. I may open a pull request if things seem acceptable to you.

library(devtools)
install_github('ReportMort/RForcecom', ref='query-refactor')

This function will obey your global setting for stringsAsFactors = F, so that addresses Issue #21 and it also handles the nested relationships described in Issue #15. The one thing is that it no longer uses type.convert, so numeric and integer variables stay as characters because the XML is characters. I think we can address the type conversion (including dates) at another time since I'd rather leave the user to decide how to type convert for their specific implementation.

One last thing I didn't test for an improvement in speed, but I think the lower bound on performance is always going to be how fast you can parse the XML, which is tough to get going quickly when it's millions of characters long. That said, I'm not sure how much we can optimize at the moment, but would love your input.

@ax42
Copy link
Contributor Author

ax42 commented Oct 24, 2015

Thanks -- I'll try and poke around with it on Sunday. Splitting things into pieces makes it easier to measure, and then if necessary, refactor them one-by-one.

@ax42
Copy link
Contributor Author

ax42 commented Oct 24, 2015

FWIW, I'll be using line profiling as detailed here: http://adv-r.had.co.nz/Profiling.html One of the quotes from that page is "unlist(x, use.names = FALSE) is much faster than unlist(x)" so there may already be a quick win there.

@ax42
Copy link
Contributor Author

ax42 commented Oct 25, 2015

First report back:

  • Works with the nested query SELECT Id, Name, CreatedDate, (SELECT ActivityDate, Description FROM ActivityHistories) FROM Account where CreatedDate > 2015-08-01T00:00:00.000Z 👍
  • The query SELECT Id, Amount, Name, Account.Name, Owner.Name, Owner.Alias, Master_Opportunity__r.Name FROM Opportunity WHERE Opportunity.CreatedDate >=2014-06-01T00:00:00.000Z causes the function to choke with Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 1, 0

I'm passing the result of x.root <- xmlRoot(xmlParse(xml.in, asText=T)) to the function, where xml.in is the result payload from the request.

@StevenMMortimer
Copy link
Contributor

Thanks for checking. I just committed a new version of query_parser() that works for the list of queries shown below. I don't have the object Master_Opportunity__r, but it should work and behave like the references to Account and Owner in that same query you have. The code was a rush job this morning. I can review and clean it up later, but figured you might want something to play around with today if you've got the time.

Seemingly Working Queries:

columnTest <- rforcecom.query(session, soqlQuery = "SELECT Id, Name, CreatedDate FROM Account")
columnTest <- rforcecom.query(session, soqlQuery = "SELECT Id, Name, CreatedDate, (SELECT ActivityDate, Description FROM ActivityHistories) FROM Account where CreatedDate > 2015-08-01T00:00:00.000Z")
columnTest <- rforcecom.query(session, soqlQuery = "SELECT Id, Amount, Name, Account.Name, Owner.Name FROM Opportunity WHERE Opportunity.CreatedDate >=2014-06-01T00:00:00.000Z")
columnTest <- rforcecom.query(session, soqlQuery = "SELECT Id, Amount, Name, Account.Name, Owner.Name, Owner.Alias FROM Opportunity WHERE Opportunity.CreatedDate >=2014-06-01T00:00:00.000Z")

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants