getBM() error "... more columns than column names" #109

Open
karlmakepeace opened this issue Aug 13, 2024 · 1 comment

@karlmakepeace

I am trying to access 5' and 3' UTR Ensembl sequence page attributes ("5utr" and "3utr") using getBM() but encounter the following error:

Error in read.table(text = postRes, sep = "\t", header = TRUE, quote = quote, :
more columns than column names

If I include an additional attribute (e.g. "external_gene_name"), getBM() returns results without error, although there is still a header/value mismatch as described in issue #108:

# {biomaRt} error "... more columns than column names" #------------------------
# install.packages("tibble")

mart <- biomaRt::useEnsembl(
  biomart = "genes",
  version = "112", # latest as of 2024-08-13
  dataset = "hsapiens_gene_ensembl")

gene_sequences_attributes_subset_1 <- biomaRt::getBM(
  mart = mart,
  attributes = c("external_gene_name", "5utr", "3utr"),
  filters = c("external_gene_name"),
  values = list(c("TP53")))

gene_sequences_attributes_subset_1 |> tibble::as_tibble()
# # A tibble: 20 × 3
#    `3utr` external_gene_name `5utr`                                         
#    <chr>  <chr>              <chr>                                          
#  1 TP53   ENSG00000141510    CCCCATGTTCCTGGCTAGCCAAGGAACCACCAGTTGATTAGCAGAG…
#  2 TP53   ENSG00000141510    GGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCA…
#  3 TP53   ENSG00000141510    CTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGAC…
#  4 TP53   ENSG00000141510    TGAGGCCAGGAGATGGAGGCTGCAGTGAGCTGTGATCACACCACTG…
#  5 TP53   ENSG00000141510    Sequence unavailable                           
#  6 TP53   ENSG00000141510    AAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCC…
#  7 TP53   ENSG00000141510    TGAGGCCAGGAGATGGAGGCTGCAGTGAGCTGTGATCACACCACTG…
#  8 TP53   ENSG00000141510    CTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGC…
#  9 TP53   ENSG00000141510    AAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCC…
# 10 TP53   ENSG00000141510    AAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCC…
# 11 TP53   ENSG00000141510    CTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGAC…
# 12 TP53   ENSG00000141510    CTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGAC…
# 13 TP53   ENSG00000141510    TTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGG…
# 14 TP53   ENSG00000141510    TTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGG…
# 15 TP53   ENSG00000141510    TCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGG…
# 16 TP53   ENSG00000141510    GTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGC…
# 17 TP53   ENSG00000141510    TTTGTAATGCAGGGCTGAGGAGTGTCCGAAGAGAATGGGCAGCAGC…
# 18 TP53   ENSG00000141510    GGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAG…
# 19 TP53   ENSG00000141510    AAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCC…
# 20 TP53   ENSG00000141510    CTAGAGCTTTTGGGGAAGAGGGAGTGGTTGTTAAGAGATGAGATTA…

gene_sequences_attributes_subset_2 <- biomaRt::getBM(
  mart = mart,
  attributes = c("5utr", "3utr"),
  filters = c("external_gene_name"),
  values = list(c("TP53")))
# Error in read.table(text = postRes, sep = "\t", header = TRUE, quote = quote,  : 
#                       more columns than column names
@grimbough
Owner

grimbough commented Aug 15, 2024

I think both this and #108 are because you're asking for both 5utr and 3utr. If you look at the web interface for the sequence attributes, you'll see that those options are radio buttons, so you can only select one of them:

[Image: the sequence attribute radio buttons in the BioMart web interface]

Unfortunately the BioMart API doesn't provide any way to detect which attributes are mutually exclusive like this, so I can't detect and filter it in biomaRt. It seems the server is also happy to run a query, even if what comes back doesn't reflect exactly what was asked for.
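
Since the API can't report which attributes are exclusive, one user-side option is to split the request so that each query carries at most one sequence attribute. The following is only a rough sketch: `sequence_attributes` and `split_sequence_queries()` are hypothetical names, not part of biomaRt, and the list of exclusive attributes is an assumption based solely on the two attributes mentioned in this issue.

```r
# Attributes assumed to be mutually exclusive (an assumption taken from this
# issue, not something read from the BioMart API).
sequence_attributes <- c("5utr", "3utr")

# Hypothetical helper: split a requested attribute vector into one attribute
# set per sequence attribute, so each query asks for a single sequence type.
split_sequence_queries <- function(attributes,
                                   seq_attrs = sequence_attributes) {
  is_seq <- attributes %in% seq_attrs
  requested_seq <- attributes[is_seq]
  other <- attributes[!is_seq]
  if (length(requested_seq) == 0) {
    # Nothing to split: keep the request as a single query.
    return(list(attributes))
  }
  lapply(requested_seq, function(s) c(other, s))
}

queries <- split_sequence_queries(c("ensembl_transcript_id", "5utr", "3utr"))

# Each element of `queries` could then go to its own getBM() call, e.g.
# (not run here, since it needs a Mart object and network access):
# results <- lapply(queries, function(a) biomaRt::getBM(
#   mart = mart, attributes = a,
#   filters = "external_gene_name", values = "TP53"))
```

Keeping a shared identifier such as "ensembl_transcript_id" in every query should make it possible to join the per-type results afterwards, though I haven't verified how the server orders columns in each case.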

The getSequence() function does a similar job to what you're looking for and will fail if you ask for more than one sequence type, but I'm not sure there's any way I can catch this in generic calls to getBM(). I think it's a server-side issue that a query like this is even allowed to run if it isn't viable.
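
For reference, a minimal sketch of the getSequence() route, calling it once per sequence type. `fetch_utrs()` is a hypothetical wrapper written for this example, and using "hgnc_symbol" as the identifier type is an assumption; adjust it to whatever identifier you are filtering on.

```r
# Hypothetical wrapper: fetch each UTR type with its own getSequence() call.
# `mart` is a Mart object such as the one created with useEnsembl() above.
fetch_utrs <- function(mart,
                       symbol = "TP53",
                       seq_types = c("5utr", "3utr")) {
  results <- lapply(seq_types, function(seq_type) {
    biomaRt::getSequence(
      id = symbol,
      type = "hgnc_symbol",  # assumed identifier type for this example
      seqType = seq_type,    # exactly one sequence type per call
      mart = mart)
  })
  names(results) <- seq_types
  results
}

# Usage (needs network access, so not run here):
# utrs <- fetch_utrs(mart)
# utrs[["5utr"]]
```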
