Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible bug when using Phenograph with merge method = FIXED and number of events lower than fixedNum #12

Closed
lconde-ucl opened this issue Oct 11, 2017 · 6 comments

Comments

@lconde-ucl
Copy link

Hi,

I have been using cytofkit for a while and I have noticed that when I try to analyze data with merge method = FIXED and one or more of the FCS files has a lower number of events than the specified fixedNum (i.e., sampling with replacemenet), phenograph gets stuck for a long time in the "Finding nearest neighbors..." step until it crashes.

For example, a run on a single FCS file with 5004 events, merge = FIXED and FixedNum = 10000 ends like this:

[...]
> cluster_res <- lapply(clusterMethods, cytof_cluster, 
                   ydata = allDimReducedList[[dimReductionMethod]], 
                   xdata = exprs_data)
 Running PhenoGraph...Run Rphenograph starts:
 -Input data of 10000 rows and 17 columns
 -k is set to 30
 Finding nearest neighbors...

Whereas if I run the same dataset with FixNum = 5000, it runs no problem:

[...]
 > cluster_res <- lapply(clusterMethods, cytof_cluster, 
                           ydata = allDimReducedList[[dimReductionMethod]], 
                           xdata = exprs_data)
  Running PhenoGraph...Run Rphenograph starts:
  -Input data of 5000 rows and 17 columns
  -k is set to 30
  Finding nearest neighbors...DONE ~ 1.957 s
  Compute jaccard coefficient between nearest-neighbor sets...DONE ~ 1.328 s
  Build undirected graph from the weighted links...DONE ~ 0.687 s
  Run louvain clustering on the graph ...DONE ~ 0.603 s
Run Rphenograph DONE, took a total of 4.575s.
  Return a community class
  -Modularity value: 0.7892829 
  -Number of clusters: 13 DONE!
[...]

Looks like the merge with replacement is done correctly ("Input data of 10000|5000 rows...'), but for some reason Phenograph does not like it.

If I use other merge methods (ceil, all), it runs fine, it only crashes when merge = fixed and the number of events is lower than FixNum.

Also, I saw that there was an unrelated bug in version 1.8.3 (issue #11), so just in case this bug was also specific for version v1.8.3, I tried with versions 1.6.5 and 1.9.4 and in both versions I find the same issue, i.e. phenograph crashing when merging with replacement.

Happy to send example files if you want to try to replicate the problem.

Thanks in advance for your time and for developing such a great package,
Best regards,
Lucia

@MattMyint
Copy link
Contributor

Hi Lucia,

Thanks for identifying this! It seems to be an issue with the "Finding neighbours stage", where the function nn2 from the package RANN is used.

The treetype argument here is "bd", which according to this issue page, gives the error you reported when there are duplicates in the data (such as those created when sampling with replacement).

I'll look into whether its feasible to use "kd" rather than "bd", or add an internal test for duplicates to decide which is best.

@SamGG
Copy link

SamGG commented Oct 13, 2017

If kd fails, adding some noise may do the job: I think rnorm(..., sd = 1e-7) should avoid data duplication.

@MattMyint
Copy link
Contributor

I realised there may be downstream problems if doing sampling with replacement.

I've come up with two possible solutions if cytofkit detects: number of events in any FCS file < fixedNum, and mergeMethod = "fixed"

Solution 1: Force mergeMethod = "ceil"

Solution 2: Force fixedNum to the lowest number of events across the FCS files (e.g. if I have 5 FCS files with 4000, 5000, 5500, 6000 and 4500 events each, fixedNum will be set to 4000.

Currently, I've decided to go with solution 2, as I'm of the opinion that it would provide more balanced analysis results if the same amount of data was extracted from each FCS.

However, if you feel that solution 1 would be more appropriate for analysis, do let me know!

@lconde-ucl
Copy link
Author

Sounds good to me. Thanks so much for looking into this!
Lucia

@traumenCapote
Copy link

Solution 2: Force fixedNum to the lowest number of events across the FCS files (e.g. if I have 5 FCS files with 4000, 5000, 5500, 6000 and 4500 events each, fixedNum will be set to 4000.

Currently, I've decided to go with solution 2, as I'm of the opinion that it would provide more balanced analysis results if the same amount of data was extracted from each FCS.

So then is there no way to sample with replacement as the documentation suggests? I'd like to set the number to the maximum, not the minimum, number of events across fcs files and upsample.

@SamGG
Copy link

SamGG commented Mar 30, 2019

Hi,
How many cells per sample have you got? The sample function could do with replacement, but t is not implemented in cytofkit. One reason is that tSNE dislikes identical cells and there is usually a warning about that unless the corresponding flag has been set to FALSE. So I think there is something bad about identical cells in tSNE and there may be something wrong with the overall cytofkit pipeline as well. Be also warned that computation will be longer as well. If you really want to try it it's not a big deal to replace the code of fixedNum to sample with replacement.
Best.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants