Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When running on the sponging branch, there is an error in the annotation part. #170

Closed
xfk274280 opened this issue Aug 23, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@xfk274280
Copy link

xfk274280 commented Aug 23, 2024

Description of the bug

image
Here are my annotation file and bed file. Are the formats correct?

Execution cancelled -- Finishing pending tasks before exit
ERROR ~ Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:BSJ_DETECTION:ANNOTATE_PER_SAMPLE:ANNOTATE (C23)'

Caused by:
  Process `NFCORE_CIRCRNA:CIRCRNA:BSJ_DETECTION:ANNOTATE_PER_SAMPLE:ANNOTATE (C23)` terminated with an error exit status (1)

Command executed [~/workspace/nf_core/nf-core-circrna_sponging/sponging/./workflows/circrna/../../subworkflows/local/./../../modules/local/annotation/templates/annotation.py]:

  #!/usr/bin/env python
  
  import pandas as pd
  import numpy as np
  import platform
  import csv
  
  def format_yaml_like(data: dict, indent: int = 0) -> str:
      """Formats a dictionary to a YAML-like string.
  
      Args:
          data (dict): The dictionary to format.
          indent (int): The current indentation level.
  
      Returns:
          str: A string formatted as YAML.
      """
      yaml_str = ""
      for key, value in data.items():
          spaces = "  " * indent
          if isinstance(value, dict):
              yaml_str += f"{spaces}{key}:\n{format_yaml_like(value, indent + 1)}"
          else:
              yaml_str += f"{spaces}{key}: {value}\n"
      return yaml_str
  
  
  columns = {
      0: 'chr',
      1: 'start',
      2: 'end',
      3: 'name',
      4: 'score',
      5: 'strand',
      9: 'tx_start',
      10: 'tx_end',
      14: 'attributes'
  }
  
  attributes = ['gene_id', 'gene_name', 'transcript_id']
  
  exon_boundary = int("0")
  
  try:
      df = pd.read_csv("C23.intersect_gtf.bed", sep="\t", header=None, usecols=columns.keys())
  except pd.errors.EmptyDataError:
      raise ValueError("Intersection between circRNAs and GTF file is empty.")
  df = df.rename(columns=columns)
  
  # Extract circRNAs without match
  mask = df['tx_start'] == -1
  df_intergenic = df[mask]
  df = df[~mask]
  df_intergenic['type'] = 'intergenic-circRNA'
  df_intergenic['gene_id'] = 'intergenic_' + df_intergenic['name']
  df_intergenic['gene_name'] = 'intergenic_' + df_intergenic['name']
  df_intergenic['transcript_id'] = 'intergenic_' + df_intergenic['name']
  
  # Convert attributes to a dictionary
  df['attributes'] = df['attributes'].apply(lambda row: dict([[value.strip(r'"') for value in entry.strip().split(' ', 1)] for entry in row.split(';') if entry]))
  # Make sure all attributes are present
  df_incomplete = df['attributes'].apply(lambda row: ", ".join([key for key in attributes if key not in row]))
  df_incomplete = df_incomplete[df_incomplete != ""]
  if len(df_incomplete) > 0:
      counts = df_incomplete.value_counts()
      counts.name = 'count'
      counts.index.name = 'missing'
      raise ValueError(f"The following attributes are missing in the intersection file:\n\n{counts.to_frame()}")
  # Keep only the attributes we want
  df['attributes'] = df['attributes'].apply(lambda row: {key: row[key] for key in attributes if key in row})
  # Convert attributes to columns
  df = pd.concat([df.drop(['attributes'], axis=1), df['attributes'].apply(pd.Series)], axis=1)
  
  df['any_outside'] = (df['start'] < df['tx_start'] - exon_boundary) | (df['end'] > df['tx_end'] + exon_boundary)
  # Perfect is inverse of any_outside
  df['perfect'] = ~df['any_outside']
  # Drop any_outside
  df = df.drop(['any_outside', 'tx_start', 'tx_end'], axis=1)
  
  df = df.groupby(['chr', 'start', 'end', 'strand']).aggregate({
      'name': lambda x: x.iloc[0],
      'score': lambda x: x.iloc[0],
      'gene_id': lambda x: list(x),
      'gene_name': lambda x: list(x),
      'transcript_id': lambda x: list(x),
      'perfect': lambda x: list(x)
  })
  
  def filter_perfect(row, col):
      if any(row['perfect']):
          matching_values = [value for value, perfectness in zip(row[col], row['perfect']) if perfectness]
      else:
          matching_values = row[col]
      valid_values = set([value for value in matching_values if type(value) == str])
      return ",".join(valid_values) if valid_values else "NaN"
  
  def determine_type(row):
      if row["no_transcript"]:
          return "ciRNA"
      if any(row['perfect']):
          return "circRNA"
      else:
          return 'EI-circRNA'
  
  df['no_transcript'] = df['transcript_id'].apply(lambda x: all([type(value) != str and np.isnan(value) for value in x]))
  df['type'] = df.apply(lambda row: determine_type(row), axis=1)
  df['gene_id'] = df.apply(lambda row: filter_perfect(row, 'gene_id'), axis=1)
  df['gene_name'] = df.apply(lambda row: filter_perfect(row, 'gene_name'), axis=1)
  df['transcript_id'] = df.apply(lambda row: filter_perfect(row, 'transcript_id'), axis=1)
  # Drop perfect
  df = df.drop(['perfect'], axis=1)
  
  df = df.reset_index()
  df_intergenic = df_intergenic.reset_index()
  bed_order = ['chr', 'start', 'end', 'name', 'score', 'strand', 'type', 'gene_id', 'gene_name', 'transcript_id']
  df = df[bed_order]
  df_intergenic = df_intergenic[bed_order]
  
  df = pd.concat([df, df_intergenic], axis=0)
  
  db_intersections = "C23-circBase.intersect_database.bed".split()
  has_db = len(db_intersections) > 0
  
  if has_db:
      db_colnames = ['chr', 'start', 'end', 'name', 'score', 'strand', 'db_chr', 'db_start', 'db_end', 'db_name', 'db_score', 'db_strand']
      db_usecols = ['chr', 'start', 'end', 'name', 'score', 'strand', 'db_name']
      df_databases = pd.concat([pd.read_csv(db_path, sep="\t", names=db_colnames, usecols=db_usecols) for db_path in db_intersections])
  
      # Group by chr, start, end, name, score, strand, and aggregate the db_name to list
      df_databases = df_databases.groupby(['chr', 'start', 'end', 'name', 'score', 'strand']).aggregate({
          'db_name': lambda x: ",".join([val for val in x if val != '.'])
      })
  
      df_databases['db_name'] = df_databases['db_name'].apply(lambda x: x if x else '.')
  
      df = df.merge(df_databases, how='left', on=['chr', 'start', 'end', 'name', 'score', 'strand'])
  else:
      df['db_name'] = "."
  
  # Sort by chr, start, end
  df = df.sort_values(['chr', 'start', 'end'])
  
  df.to_csv("C23.annotated.bed", sep='\t', index=False, header=False)
  
  # Convert to GTF
  df['source'] = 'circRNA'
  df['frame'] = '.'
  df['attributes'] = 'gene_id "' + df['gene_id'] + '"; gene_name "' + df['gene_name'] + '"; transcript_id "circ_' + df['name'] + '"; db_ids "' + df['db_name'] + '";'
  
  gtf_order = ['chr', 'source', 'type', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
  df = df[gtf_order]
  
  df.to_csv("C23.annotated.gtf", sep='\t', index=False, header=False, quoting=csv.QUOTE_NONE)
  
  # Versions
  
  versions = {
      "NFCORE_CIRCRNA:CIRCRNA:BSJ_DETECTION:ANNOTATE_PER_SAMPLE:ANNOTATE": {
          "python": platform.python_version(),
          "pandas": pd.__version__,
          "numpy": np.__version__
      }
  }
  
  with open("versions.yml", "w") as f:
      f.write(format_yaml_like(versions))

Command exit status:
  1

Command output:
  (empty)

Command error:
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 44 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 45 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 46 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 47 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 48 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 49 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 50 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 51 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 52 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 53 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 54 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 55 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 56 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 57 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 58 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 59 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 60 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 61 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 62 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 63 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  .command.run: fork: retry: Resource temporarily unavailable
  .command.run: fork: retry: Resource temporarily unavailable
  .command.run: fork: retry: Resource temporarily unavailable
  .command.sh:45: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
    df = pd.read_csv("C23.intersect_gtf.bed", sep="\t", header=None, usecols=columns.keys())
  .command.run: fork: retry: Resource temporarily unavailable
  .command.run: fork: Resource temporarily unavailable
  .command.run: line 158:    37 Segmentation fault      /usr/bin/env python .command.sh
  .command.run: line 158: kill: (39) - No such process

Command used and terminal output

nextflow run nf-core-circrna_sponging/sponging --input samplecircrna.bk.csv --phenotype samplecircrna_condition.bk.csv --outdir result240812 -profile singularity --fasta Genome/GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa --gtf Genome/GRCh38/Homo_sapiens.GRCh38.101.chr.gtf --skip_trimming --limitSjdbInsertNsj 10000000 -resume --max_cpus 31 --max_time '960.h' --annotation circrna_ann.csv

Relevant files

No response

System information

No response

@xfk274280 xfk274280 added the bug Something isn't working label Aug 23, 2024
@nictru
Copy link
Contributor

nictru commented Aug 23, 2024

Hey, you seem to have a problem similar to this. I think this is more a problem with your server than with the pipeline.

Also, is there a reason why you use the sponging? It is not ready for productive usage yet.

@xfk274280
Copy link
Author

xfk274280 commented Aug 28, 2024

I've heard that the 'sponging' branch can speed up the 'miRNA_prediction' step, so I tried it out.Based on the previous tips, I re-ran the program for the 'annotation' step and it seems to have succeeded, but other errors occurred.

Execution cancelled -- Finishing pending tasks before exit
-[nf-core/circrna] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:STATISTICAL_TESTS:CIRCTEST_CIRCTEST'

Caused by:
  Process `NFCORE_CIRCRNA:CIRCRNA:STATISTICAL_TESTS:CIRCTEST_CIRCTEST` terminated with an error exit status (1)

Command executed [~/nf_core/nf-core-circrna_sponging/sponging/./workflows/circrna/../../subworkflows/local/../../modules/local/circtest/circtest/templates/circtest.R]:

  #!/usr/bin/env Rscript
  
  require(aod)
  require(plyr)
  require(ggplot2)
  
  ## CircTest functions
  ## Package: CircTest (https://github.com/dieterich-lab/CircTest)
  ## Version: 0.1.1
  ## Author(s): Jun Cheng, Tobias Jakobi
  ## License: GPL
  
  ## SUMMMARY
  
  #'@title  Summarizes data
  ## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
  ##   data: a data frame.
  ##   measurevar: the name of a column that contains the variable to be summariezed
  ##   groupvars: a vector containing names of columns that contain grouping variables
  ##   na.rm: a boolean that indicates whether to ignore NA's
  ##   conf.interval: the percent range of the confidence interval (default is 95%)
  #'
  summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
                          conf.interval=.95, .drop=TRUE) {
  
      require(plyr)
  
      # New version of length which can handle NA's: if na.rm==T, don't count them
      length2 <- function (x, na.rm=FALSE) {
          if (na.rm) sum(!is.na(x))
          else       length(x)
      }
  
      # This does the summary. For each group's data frame, return a vector with
      # N, mean, and sd
      datac <- ddply(data, groupvars, .drop=.drop,
                      .fun = function(xx, col) {
                          c(N    = length2(xx[[col]], na.rm=na.rm),
                          mean = mean   (xx[[col]], na.rm=na.rm),
                          sd   = sd     (xx[[col]], na.rm=na.rm)
                      )
                      },
                      measurevar
      )
  
      # Rename the "mean" column
      datac <- rename(datac, c("mean" = measurevar))
  
      datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
  
      # Confidence interval multiplier for standard error
      # Calculate t-statistic for confidence interval:
      # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
      ciMult <- qt(conf.interval/2 + .5, datac$N-1)
      datac$ci <- datac$se * ciMult
  
      return(datac)
  }
  
  ## RATIO PLOT
  
  Circ.ratioplot <- function(Circ,Linear,CircCoordinates = None,plotrow='1',size=24,ncol=2,groupindicator1=NULL,groupindicator2=NULL,x='Conditions',y='circRNA/(circRNA+Linear)',lab_legend='groupindicator1', circle_description = c(1:3), gene_column = None, y_axis_range = 1, colour_mode = "colour"){
  
      if( !is.null(groupindicator1) & length(groupindicator1) != ncol(Circ)-length(circle_description) ){
          stop("If provided, the length of groupindicator1 should be equal to the number of samples.")
      }
      if( !is.null(groupindicator2) & length(groupindicator2) != ncol(Circ)-length(circle_description) ){
          stop("If provided, the length of groupindicator2 should be equal to the number of samples.")
      }
      if(is.null(groupindicator1)){
          stop("At least one grouping should be provided through groupindicator1.")
      }
      if(!is.null(groupindicator2)){
          twolevel <- TRUE
      }else{
          twolevel <- FALSE
      }
  
      rownames.circ <- rownames(Circ)
      Circ <- data.frame(lapply(Circ, as.character), stringsAsFactors=FALSE)
      rownames(Circ) <- rownames.circ
  
      rownames.linear <- rownames(Linear)
      Linear <- data.frame(lapply(Linear, as.character), stringsAsFactors=FALSE)
      rownames(Linear) <- rownames.linear
  
      if(!missing(CircCoordinates)){
          rownames.circCoordinates <- rownames(CircCoordinates)
          CircCoordinates <- data.frame(lapply(CircCoordinates, as.character), stringsAsFactors=FALSE)
          rownames(CircCoordinates) <- rownames.circCoordinates
      }else{
          CircCoordinates <- data.frame(Circ[,circle_description])
          rownames(CircCoordinates) <- rownames.circ
          rownames.circCoordinates <- rownames(CircCoordinates)
          CircCoordinates <- data.frame(lapply(CircCoordinates, as.character), stringsAsFactors=FALSE)
          rownames(CircCoordinates) <- rownames.circCoordinates
      }
  
      groupindicator1 <- factor(groupindicator1,levels=unique(groupindicator1))
      groupindicator2 <- factor(groupindicator2,levels=unique(groupindicator2))
  
      # Get gene name, if no annotation, output NULL
      if (is.character(plotrow)){
          if ( ! plotrow %in% rownames(CircCoordinates) ){
              stop("Specified 'plotrow' not found.")
          }
      }else{
          if ( is.numeric(plotrow) ){
              if ( ! plotrow %in% 1:nrow(CircCoordinates) ){
                  stop("Specified 'plotrow' not found.")
              }
          }else{
              stop("Specified plotrow should be ONE rowname or ONE rownumber.")
          }
      }
      # Choose your own column containing the gene name using gene_column. The genename will be displayed in the plot title if available
      if (missing(gene_column)){
          genename = NULL
      }else{
          genename <- as.character(CircCoordinates[plotrow,gene_column])
          if (genename == '.'){
              genename = NULL
          }
      }
      if(twolevel){
          plotdat <- summarySE( data.frame(Ratio=as.numeric(Circ[plotrow,-circle_description])/(as.numeric(Linear[plotrow,-circle_description])+as.numeric(Circ[plotrow,-circle_description])),
                                                                          groupindicator1,
                                                                          groupindicator2),
                                                  measurevar='Ratio',groupvars=c('groupindicator1','groupindicator2') )
      }else{
          plotdat <- summarySE( data.frame(Ratio=as.numeric(Circ[plotrow,-circle_description])/(as.numeric(Linear[plotrow,-circle_description])+as.numeric(Circ[plotrow,-circle_description])),
                                                                          groupindicator1),
                                                                          measurevar='Ratio',groupvars=c('groupindicator1') )
      }
  # construct plot
      Q <- ggplot(plotdat, aes(x=groupindicator1, y=Ratio)) +
              geom_boxplot() + theme_classic() +
              theme(axis.text.x = element_blank())+
              theme(axis.text.y = element_text(size=size+4))+
              theme(axis.ticks = element_line(colour = 'black', size = 1)) +
              theme(axis.ticks.x = element_blank())+
              theme(legend.title=element_blank()) +
              theme(text=element_text(size=size+4))+
              theme(legend.text=element_text(size=size)) +
              theme(plot.title = element_text(size=size)) +
              theme(axis.text.y = element_text(margin=margin(5,5,10,5,"pt")))+
              ggtitle(paste("Annotation: ", genename, "\nChr ", toString(Circ[plotrow,circle_description]),sep="")) +
              ylab("circRNA/(circRNA + Linear RNA)") +
              xlab("Sample") +
              geom_errorbar(aes(ymin=Ratio, ymax=Ratio+se), width=.2 , size=2) +
              geom_bar(stat="identity",aes(fill=groupindicator1), color = "black", size=2)
  
      if (colour_mode == "bw"){
              Q <- Q + scale_fill_grey(start = 0.0, end = 1)
      } else {
              Q <- Q + scale_fill_discrete(name=lab_legend)
      }
  
              Q <- Q +
              theme(legend.position="bottom") +
              theme(axis.ticks.length = unit(0.5, "cm")) +
              theme(panel.background = element_blank(),
                  panel.grid.major = element_blank(),
                  panel.grid.minor = element_blank(),
                  axis.line = element_line(colour = "black"),
                  panel.border = element_rect(colour = "black", fill=NA, size=3)) +
                  guides(fill=guide_legend(
                                  keywidth=0.3,
                                  keyheight=0.3,
                                  default.unit="inch")
              ) + scale_y_continuous(expand=c(0,0), limits= c(0, y_axis_range))
  
      if(twolevel){
          Q <- Q + facet_wrap( ~ groupindicator2,ncol=ncol )
      }
  
      print(Q)
  }
  
  ## LINEPLOT
  
  Circ.lineplot <- function(Circ,Linear,CircCoordinates = None,plotrow='1',size=18,ncol=2,groupindicator1=NULL,groupindicator2=NULL,x='Conditions',y='Counts', circle_description = c(1:3), gene_column = None){
  
      require(ggplot2)
  
      if( !is.null(groupindicator1) & length(groupindicator1) != ncol(Circ)-length(circle_description) ){
          stop("If provided, the length of groupindicator1 should be equal to the number of samples.")
      }
      if( !is.null(groupindicator2) & length(groupindicator2) != ncol(Circ)-length(circle_description) ){
          stop("If provided, the length of groupindicator2 should be equal to the number of samples.")
      }
      if(is.null(groupindicator1)){
          stop("At least one grouping should be provided through groupindicator1.")
      }
      if(!is.null(groupindicator2)){
          twolevel <- TRUE
      }else{
          twolevel <- FALSE
      }
  
      rownames.circ <- rownames(Circ)
      Circ <- data.frame(lapply(Circ, as.character), stringsAsFactors=FALSE)
      rownames(Circ) <- rownames.circ
  
      rownames.linear <- rownames(Linear)
      Linear <- data.frame(lapply(Linear, as.character), stringsAsFactors=FALSE)
      rownames(Linear) <- rownames.linear
  
      # if CircCoordinates are available, use them, otherwise get more information from the Circ table, as indicated by the circle_description columns.
      if(!missing(CircCoordinates)){
          rownames.circCoordinates <- rownames(CircCoordinates)
          CircCoordinates <- data.frame(lapply(CircCoordinates, as.character), stringsAsFactors=FALSE)
          rownames(CircCoordinates) <- rownames.circCoordinates
      }else{
          CircCoordinates <- data.frame(Circ[,circle_description])
          rownames(CircCoordinates) <- rownames.circ
          rownames.circCoordinates <- rownames(CircCoordinates)
          CircCoordinates <- data.frame(lapply(CircCoordinates, as.character), stringsAsFactors=FALSE)
          rownames(CircCoordinates) <- rownames.circCoordinates
      }
  
      groupindicator1 <- factor(groupindicator1,levels=unique(groupindicator1))
      groupindicator2 <- factor(groupindicator2,levels=unique(groupindicator2))
  
      # Get gene name, if no annotation, output NULL
      if (is.character(plotrow)){
          if ( ! plotrow %in% rownames(CircCoordinates) ){
              stop("Specified 'plotrow' not found.")
          }
      }else{
          if ( is.numeric(plotrow) ){
              if ( ! plotrow %in% 1:nrow(CircCoordinates) ){
                  stop("Specified 'plotrow' not found.")
              }
          }else{
              stop("Specified plotrow should be ONE rowname or ONE rownumber.")
          }
      }
      # Choose your own column containing the gene name using gene_column. The genename will be displayed in the plot title if available
      if (missing(gene_column)){
          genename = NULL
      }else{
          genename <- as.character(CircCoordinates[plotrow,gene_column])
          if (genename == '.'){
              genename = NULL
          }
      }
  
      plot.func <- function(row=plotrow){
          if(twolevel){
              plotdat <- summarySE(data.frame(Counts=c(as.numeric(Circ[row,-circle_description]),as.numeric(Linear[row,-circle_description])),
                                                                              groupindicator1,
                                                                              groupindicator2,
                                                                              Type=c(rep('circRNA',ncol(Circ)-length(circle_description)),rep('linear RNA',ncol(Circ)-length(circle_description)))
              ), measurevar='Counts',groupvars=c('Type','groupindicator1','groupindicator2') )
          }else{
              plotdat <- summarySE(data.frame(Counts=c(as.numeric(Circ[row,-circle_description]),as.numeric(Linear[row,-circle_description])),
                                                                              groupindicator1,
                                                                              Type=c(rep('circRNA',ncol(Circ)-length(circle_description)),rep('linear RNA',ncol(Circ)-length(circle_description)))
              ), measurevar='Counts',groupvars=c('Type','groupindicator1') )
          }
  
          Q=ggplot(plotdat, aes(x=groupindicator1, y=Counts, group=Type,colour=Type)) +
              theme(text=element_text(size=size))+
              theme_bw()+
              labs( list(title=paste(toString(Circ[row,circle_description]),genename,sep=" "),x=x,y=y) ) +
              ggtitle(paste(toString(Circ[row,circle_description]),genename,sep=" "))+
              geom_errorbar(aes(ymin=Counts-se, ymax=Counts+se), width=.1, position=position_dodge(.1) ) +
              xlab("Condition") +
              geom_line(position=position_dodge(.1)) +
              geom_point(position=position_dodge(.1))
          if (twolevel){
              Q = Q + facet_wrap( ~ groupindicator2,ncol=ceiling(sqrt(length(levels(groupindicator2)))) )
          }
  
          print(Q)
      }
  
      return(plot.func(row=plotrow))
  }
  
  ## FILTER
  
  Circ.filter <- function(circ=circ,linear=linear,Nreplicates=3,filter.sample=4,filter.count=5,percentage=1, circle_description=c(1:3)){
      del_row=c()
      for ( i in 1:nrow(circ) ){
          if ( sum(circ[i,-circle_description]>=filter.count)<filter.sample | circ_max_perc(circ[i,-circle_description], linear[i,-circle_description],Nreplicates=Nreplicates)<percentage)
              del_row = c(del_row,i)
      }
  
      if (length(del_row) > 0){
          new_dat=circ[-del_row,]
          return(new_dat)
      } else {
          return(circ)
      }
  }
  
  circ_max_perc <- function(circ=circ,linear=linear,Nreplicates=3){
      # convert to vector
      circ = as.numeric(circ)
      linear = as.numeric(linear)
      if( length(circ) != length(linear) ){
          stop ('Number of samples in circRNA is not equal to Hostgene.')
      }
      Ngroups = length(circ)/Nreplicates
      # calculate percentage
      circ_sum = unname(tapply(circ, (seq_along(1:length(circ))-1) %/% Nreplicates, sum ))
      linear_sum = unname(tapply(linear, (seq_along(1:length(linear))-1) %/% Nreplicates, sum ))
      perc = max(circ_sum / (circ_sum+linear_sum),na.rm=T)
      return(perc)
  }
  
  ## CIRC TEST
  
  Circ.test <- function(Circ, Linear, CircCoordinates=None, group, alpha=0.05, plotsig=T, circle_description = c(1:3)){
  
          # Requre packge
          require(aod)
  
          # check whether the input matrix are correct
          if ( nrow(Circ)!=nrow(Linear) | ncol(Circ) != ncol(Linear)){
                  stop('Circ data and Linear data are not matched, dimention different.')
          }
  
          # A vector for pvalue and directions indicator
          p.val <- c()
          direction <- c()
  
          # groups
          if ( length(group) != ncol(Circ)-length(circle_description) ){
                  print(length(group))
                  print(ncol(Circ)-length(circle_description))
                  stop("length of 'group' must be equal to the number of samples of 'Circ' and 'Linear'. ")
          }
          group <- factor(group)
          counter <- 0
  
          ## test
          # construct test matrix for each circRNA
  
          tmp_df = Circ[,FALSE]
  
          for (j in seq(1,length(unique(group)))){
                  tmp_df[paste("group_",j,"_ratio_mean",sep="")] <- NA
          }
  
          for ( i in rownames(Circ) ){
                  counter <- counter+1
  
                  # total read counts vector
                  tot <- round( as.numeric(Linear[i,-circle_description]) + as.numeric(Circ[i,-circle_description]) )
  
                  # circRNA read counts
                  circ <- as.numeric(Circ[i,-circle_description])
  
                  # if there is 0 in the total count vector, the model will fail. So permute 0 to 1
                  if ( 0 %in% tot ){
                      tot[tot==0]=1
                  }
  
                  if (counter %% 1000 == 0){
                          message(paste(counter, "candidates processed"))
                  }
  
                  tmp_rations <- data.frame(Ratio=as.numeric(Circ[i,-circle_description])/(as.numeric(Linear[i,-circle_description])+as.numeric(Circ[i,-circle_description])),
                  group=group)
                  for (rep_group in seq(1,max(as.numeric(levels(group))),1)){
                          tmp_df[i, paste("group_",rep_group,"_ratio_mean",sep="")] <- mean(na.omit(unlist(tmp_rations[tmp_rations$group==rep_group,1])))
                  }
  
                  # Constract data frame
                  testdat = data.frame(tot,circ,group)
  
                  ## do test
                  # Null model
                  fitNull <- betabin(cbind(circ,tot-circ) ~ 1, ~ 1, data=testdat)
                  # Alternative model
                  fitAlt <- betabin(cbind(circ,tot-circ) ~ group, ~ 1, data=testdat)
                  # test models
                  a <- anova(fitNull,fitAlt)
                  p.value <- [email protected][,11][2]
                  p.val <- c( p.val, p.value )
          }
          message(paste(counter, "candidates processed in total"))
  
          Circ$direction <- direction
          p.adj <- p.adjust(p.val,n=sum(!is.na(p.val)),'BH')
          # select significant ones
          sig_dat <- Circ[p.adj<=alpha    & !is.na(p.adj),]
          sig_ratios <- tmp_df[p.adj<=alpha    & !is.na(p.adj),]
          sig_p <- p.adj[p.adj<=alpha    & !is.na(p.adj)]
  
          # sort by p-val
          sig_dat <- sig_dat[order(sig_p),]
          sig_ratios <- sig_ratios[order(sig_p),]
          sig_p <- sort(sig_p)
  
          # A summary table
          if (missing(CircCoordinates)){
                  summary_table <- data.frame(sig_dat[,circle_description],sig_p,sig_dat[,circle_description])
  
                  rownames(summary_table) <- rownames(sig_dat)
                  names(summary_table) <- c(names(sig_dat)[circle_description],"sig_p",names(sig_ratios)[circle_description])
          } else {
                  summary_table <- cbind(CircCoordinates[rownames(sig_dat),],sig_p,sig_ratios)
                  colnames(summary_table) <- c(colnames(CircCoordinates),"sig_p",colnames(sig_ratios))
          }
  
          message(paste(nrow(summary_table), "candidates passed the specified thresholds"))
  
          # return all objects in a list
          return(list(summary_table=summary_table,
                              sig.dat=sig_dat,
                              p.val=p.val,
                              p.adj=p.adj,
                              sig_p=sig_p,
                              ratios=sig_ratios
                          )
                  )
  }
  
  ## MAIN
  
  circs = read.table("tx_counts_circs.tsv", header=T, sep="\t")
  genes = read.table("tx_counts_genes.tsv", header=T, sep="\t")
  pheno = read.csv  ("samplecircrna_condition.bk.csv"  , header=T, row.names = "sample")
  
  circs <- Circ.filter(circ = circs, linear = genes, filter.sample = 2, filter.count = 5, percentage = 0.00001)
  genes <- genes[rownames(circs),]
  
  description <- c(1)
  pheno <- pheno[colnames(circs[,-description]),,drop=FALSE]
  
  test <- Circ.test(circs, genes, group=as.numeric(as.factor(pheno$condition)), circle_description = description)
  write.table(test$summary_table, "tx_counts_summary.txt", row.names=F)
  
  
  pdf("circ_linear_ratio_plots.pdf", width = 8, height = 10)
  for (i in rownames(test$summary_table)) {
      Circ.ratioplot(circs, genes, plotrow=i, groupindicator1=pheno$condition,
          lab_legend = 'condition', circle_description = description )
  }
  dev.off()
  
  pdf("circ_linear_line_plots.pdf", width = 8, height = 10)
  for (i in rownames(test$summary_table)) {
      Circ.lineplot(circs, genes, plotrow=i, groupindicator1=pheno$condition,
          circle_description = description )
  }
  dev.off()
  
  
  ################################################
  ################################################
  ## VERSIONS FILE                              ##
  ################################################
  ################################################
  
  r.version <- strsplit(version[['version.string']], ' ')[[1]][3]
  
  writeLines(
      c(
          '"NFCORE_CIRCRNA:CIRCRNA:STATISTICAL_TESTS:CIRCTEST_CIRCTEST":',
          paste('    r-base:', r.version),
          paste('    aod:', packageVersion('aod')),
          paste('    plyr:', packageVersion('plyr')),
          paste('    ggplot2:', packageVersion('ggplot2'))
      ),
  'versions.yml')
  
  ################################################
  ################################################
  ################################################
  ################################################

Command exit status:
  1

Command output:
  (empty)

Command error:
  Loading required package: aod
  Loading required package: plyr
  Loading required package: ggplot2
  Warning message:
  In max(circ_sum/(circ_sum + linear_sum), na.rm = T) :
    no non-missing arguments to max; returning -Inf
  756 candidates processed in total
  756 candidates passed the specified thresholds
  There were 50 or more warnings (use warnings() to see the first 50)
  Error: Must request at least one colour from a hue palette.
  In addition: There were 33 warnings (use warnings() to see them)
  Execution halted

@xfk274280
Copy link
Author

xfk274280 commented Aug 28, 2024

image

If I only need the circular RNA matrix and want to perform differential analysis with DESeq2, can I directly use the 'quantification/combined/circular.tsv' file? Also, how can I convert the IDs, such as 'circ_9:127887483-127896332:-', to the IDs in the annotation CircBase database?

@nictru
Copy link
Contributor

nictru commented Aug 28, 2024

Hey

I've heard that the 'sponging' branch can speed up the 'miRNA_prediction' step, so I tried it out

This was true until #166 was merged. Since then, this functionality is also available on the dev branch.

If I only need the circular RNA matrix and want to perform differential analysis with DESeq2, can I directly use the 'quantification/combined/circular.tsv' file?

DESeq2 is not made for differential expression on transcript level, but I know it is frequently used in the circRNA space. This files contain the output of psirc-quant, which is an adaption of kallisto for quantifying circRNAs. If you want to, you can use them with DESeq. The experiments.merged.rds contains a summarizedExperiment that can be used with fishpond/swish, which is made for differential expression on transcript level.

Also, how can I convert the IDs, such as 'circ_9:127887483-127896332:-', to the IDs in the annotation CircBase database?

Check out this section of the docs. The *.annotated.bed and the *.annotated.gtf should contain the annotations based on the files you provided via the --annotation parameter.

@xfk274280
Copy link
Author

Thank you for your reply.

Hey

I've heard that the 'sponging' branch can speed up the 'miRNA_prediction' step, so I tried it out

This was true until #166 was merged. Since then, this functionality is also available on the dev branch.

If I only need the circular RNA matrix and want to perform differential analysis with DESeq2, can I directly use the 'quantification/combined/circular.tsv' file?

DESeq2 is not made for differential expression on transcript level, but I know it is frequently used in the circRNA space. This files contain the output of psirc-quant, which is an adaption of kallisto for quantifying circRNAs. If you want to, you can use them with DESeq. The experiments.merged.rds contains a summarizedExperiment that can be used with fishpond/swish, which is made for differential expression on transcript level.

Also, how can I convert the IDs, such as 'circ_9:127887483-127896332:-', to the IDs in the annotation CircBase database?

Check out this section of the docs. The *.annotated.bed and the *.annotated.gtf should contain the annotations based on the files you provided via the --annotation parameter.

@nictru nictru closed this as completed Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants