
Many small updates. #113

Merged · 11 commits · Jan 27, 2016

Conversation

@deflaux (Contributor) commented on Jan 26, 2016

Also removed several obsolete files. The following redirects need to be configured before this change can be merged:

  • pgp-data.html to use_cases/discover_public_data/pgp_public_data.html
  • use_cases/browse_genomic_data/index.html to sections/access_data.html
  • use_cases/run_familiar_tools/index.html to sections/access_data.html
  • workshops/index.html to sections/learn_more.html

@@ -1,4 +1,4 @@
-* Deploy your Spark cluster using `Google Cloud Dataproc`_.
+* Deploy your Spark cluster using `Google Cloud Dataproc`_. This can be done using the `Cloud Console <https://console.developers.google.com/project/_/dataproc/clustersAdd>`__ or the following ``gcloud`` command:
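A minimal sketch of the kind of ``gcloud`` invocation the new text refers to (the cluster name, zone, and worker count here are placeholder values, not necessarily those in the final doc):

    gcloud dataproc clusters create example-cluster \
        --zone us-central1-a \
        --num-workers 2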
Review comment (Contributor):
Can you set the URL to console.cloud.google.com as per #112?

@mbookman (Contributor):
LGTM

-counts the number of variants two samples have in common. These counts are then placed into an
-``NxN`` matrix where ``N`` is the number of samples in the variant set. The matrix is centered,
-scaled, and then the first two principal components are computed for each invididual.
+`Principal Coordinate Analysis <http://occamstypewriter.org/boboh/2012/01/17/pca_and_pcoa_explained/>`_ counts the number of variants two samples have in common. These counts are then placed into an ``(N+M)x(N+M)`` matrix where ``N`` is the number of samples in the control variant set (e.g., 1,000 Genomes) and ``M`` is the number of samples in the case variant set. The matrix is centered, scaled, and then the first two principal components are computed for each invididual.
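A minimal numpy sketch of the computation the new text describes, with toy data and illustrative variable names only (the actual implementation referenced in the comment below is Scala):

    import numpy as np

    # Toy presence/absence matrix: rows = samples, columns = variants (hypothetical data).
    G = np.array([[1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [0, 1, 1, 0]])

    # Number of variants each pair of samples has in common: an NxN similarity matrix.
    S = G @ G.T

    # Double-center the matrix before extracting components.
    S_c = S - S.mean(axis=0) - S.mean(axis=1)[:, None] + S.mean()

    # The two eigenvectors with the largest eigenvalues give the first two
    # principal components, one coordinate pair per individual.
    vals, vecs = np.linalg.eigh(S_c)
    top2 = np.argsort(vals)[::-1][:2]
    pc1, pc2 = vecs[:, top2[0]], vecs[:, top2[1]]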
Review comment (@pgrosu):

Nicole, when performing this multidimensional scaling, I'm not sure you're actually doing any scaling in the code referenced in the link below, besides the centering before running the PCA on the centered dissimilarity matrix:

https://github.com/googlegenomics/spark-examples/blob/15213450109e0d8629317f98c275be1cf5114072/src/main/scala/com/google/cloud/genomics/spark/examples/VariantsPca.scala#L221

Hope it helps,
~p

Review comment (Contributor):
Nit: s/invididual/individual/

@cmclean (Contributor) commented on Jan 27, 2016

Not a blocking comment, but I'm curious whether /includes/dataflow_on_gce_setup.rst could also be updated to suggest Java 8 rather than Java 7. The forthcoming LD pipeline uses Java 8 for writing to Bigtable. Can the other pipelines also run on 8?

@deflaux (Contributor, Author) commented on Jan 27, 2016

Filed #115 for Java 8.

Saving #112 for another PR.

I set up the redirects, but they do not appear to be working at the moment (see also readthedocs/readthedocs.org#1826).

deflaux added a commit that referenced this pull request on Jan 27, 2016
@deflaux merged commit ef7aa75 into master on Jan 27, 2016
@pgrosu commented on Jan 27, 2016

Nicole, maybe an additional PR is needed for the other points as well, so they don't get lost in the mists of time :)

~p

@deflaux (Contributor, Author) commented on Jan 27, 2016

@pgrosu Thanks for pointing out the lack of scaling! I filed googlegenomics/spark-examples#82.

@pgrosu commented on Jan 27, 2016

Nicole, sure thing and sorry to be picky, but the above text says:

The matrix is centered, scaled, and then the first two principal components are computed for each invididual.

This suggests that you are scaling before you find PC1 and PC2, which is confusing. Usually, in this particular case, you determine the eigenvectors with the two highest eigenvalues and then scale the data by multiplying with them, not beforehand. Maybe my eyes are getting tired, but I'm not sure I see where you're actually performing that in the Dataflow code. I notice that you find maxEigenvalue and secondEigenvalue, but I'm not sure where you rescale the original data with the corresponding eigenvectors.

Thanks,
~p
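A minimal sketch of the ordering pgrosu describes (classical MDS scales after the eigendecomposition, not before); the matrix values here are illustrative only:

    import numpy as np

    # Toy centered similarity matrix (hypothetical, symmetric).
    S_c = np.array([[ 2.0, -1.0, -1.0],
                    [-1.0,  2.0, -1.0],
                    [-1.0, -1.0,  2.0]])

    # Eigendecompose first, then scale: the coordinates are the top-two
    # eigenvectors multiplied by the square roots of their eigenvalues.
    vals, vecs = np.linalg.eigh(S_c)
    top2 = np.argsort(vals)[::-1][:2]
    coords = vecs[:, top2] * np.sqrt(vals[top2])  # one (x, y) per sample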
