Many small updates. #113
@@ -1,4 +1,4 @@
-* Deploy your Spark cluster using `Google Cloud Dataproc`_.
+* Deploy your Spark cluster using `Google Cloud Dataproc`_. This can be done using the `Cloud Console <https://console.developers.google.com/project/_/dataproc/clustersAdd>`__ or the following ``gcloud`` command:
Can you set the URL to console.cloud.google.com as per #112?
LGTM
-counts the number of variants two samples have in common. These counts are then placed into an
-``NxN`` matrix where ``N`` is the number of samples in the variant set. The matrix is centered,
-scaled, and then the first two principal components are computed for each invididual.
+`Principal Coordinate Analysis <http://occamstypewriter.org/boboh/2012/01/17/pca_and_pcoa_explained/>`_ counts the number of variants two samples have in common. These counts are then placed into an ``(N+M)x(N+M)`` matrix where ``N`` is the number of samples in the control variant set (e.g., 1,000 Genomes) and ``M`` is the number of samples in the case variant set. The matrix is centered, scaled, and then the first two principal components are computed for each invididual.
Nicole, when performing this Multidimensional scaling I'm not sure you're actually doing any scaling in the code - as referenced in the link below - besides the centering before running the PCA on the centered dissimilarity matrix:
Hope it helps,
~p
Nit: s/invididual/individual/
Not a blocking comment, but I'm curious whether /includes/dataflow_on_gce_setup.rst could also be updated to suggest Java 8 rather than Java 7. The forthcoming LD pipeline uses Java 8 for writing to BigTable. Can the other pipelines be run on 8?
Filed #115 for Java 8. Saving #112 for another PR. I set up the redirects but they do not appear to be working at the moment. (see also readthedocs/readthedocs.org#1826)
Nicole, Maybe an additional PR might be needed for the other points as well, so they don't get lost in the mists of time :) ~p
@pgrosu Thanks for pointing out the lack of scaling! I filed googlegenomics/spark-examples#82
Nicole, sure thing and sorry to be picky, but the above text says:
This suggests that you are scaling before you find PC1 and PC2, which is confusing. Usually, in this particular case, you determine the eigenvectors with the two highest eigenvalues and then scale the data by multiplying with them, not beforehand. Maybe my eyes are getting tired, but I'm not sure I see where you're actually performing that in the Dataflow code. I notice that you find
Thanks,
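For readers following the scaling discussion, here is a minimal sketch of the order of operations being described: count shared variants, center the matrix, find the two leading eigenvectors, and only then scale. It is not the code from the Spark or Dataflow pipelines. All function names (`shared_variant_counts`, `double_center`, `top_eigenpairs`, `pcoa`) are hypothetical, the eigen-solver is a simple power iteration chosen to keep the example dependency-free, and scaling each eigenvector by the square root of its eigenvalue is assumed as one common PCoA convention.

```python
import math
import random


def shared_variant_counts(samples):
    """Build the NxN matrix of variants shared by each pair of samples.

    `samples` is a list of sets of variant identifiers (an assumed input shape).
    """
    n = len(samples)
    return [[len(samples[i] & samples[j]) for j in range(n)] for i in range(n)]


def double_center(m):
    """Subtract row and column means and add back the grand mean."""
    n = len(m)
    row_mean = [sum(r) / n for r in m]
    col_mean = [sum(m[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [[m[i][j] - row_mean[i] - col_mean[j] + grand for j in range(n)]
            for i in range(n)]


def top_eigenpairs(m, k=2, iters=1000, seed=0):
    """Leading k eigenpairs of a symmetric positive semidefinite matrix,
    found by power iteration with deflation."""
    n = len(m)
    a = [row[:] for row in m]
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        lam = sum(v[i] * sum(a[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam < 1e-12:
            lam = 0.0
        pairs.append((lam, v))
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                a[i][j] -= lam * v[i] * v[j]
    return pairs


def pcoa(samples, k=2):
    """Return k coordinates per sample; scaling happens after the eigen step."""
    centered = double_center(shared_variant_counts(samples))
    coords = [[0.0] * k for _ in samples]
    for c, (lam, v) in enumerate(top_eigenpairs(centered, k)):
        scale = math.sqrt(lam)
        for i in range(len(samples)):
            coords[i][c] = scale * v[i]
    return coords
```

Calling `pcoa([{1, 2, 3}, {1, 2, 3}, {7, 8, 9}, {7, 8, 9}])` places the first two samples at one point on the first axis and the last two at the mirrored point. The shared-count matrix is a Gram matrix of indicator vectors, so it stays positive semidefinite after centering, which is why plain power iteration is enough here.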
Also removed several obsolete files. The following redirects need to be configured before this change can be merged: