
Many small updates. #113

Merged · 11 commits · Jan 27, 2016

Conversation

@deflaux (Contributor) commented on Jan 26, 2016

Also removed several obsolete files. The following redirects need to be configured before this change can be merged:

  • pgp-data.html to use_cases/discover_public_data/pgp_public_data.html
  • use_cases/browse_genomic_data/index.html to sections/access_data.html
  • use_cases/run_familiar_tools/index.html to sections/access_data.html
  • workshops/index.html to sections/learn_more.html

@@ -1,4 +1,4 @@
-* Deploy your Spark cluster using `Google Cloud Dataproc`_.
+* Deploy your Spark cluster using `Google Cloud Dataproc`_. This can be done using the `Cloud Console <https://console.developers.google.com/project/_/dataproc/clustersAdd>`__ or the following ``gcloud`` command:
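A minimal sketch of the kind of ``gcloud`` invocation the new text refers to (the cluster name, zone, and worker count here are placeholder values, not necessarily those in the final doc):

    gcloud dataproc clusters create example-cluster \
        --zone us-central1-a \
        --num-workers 2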
Review comment (Contributor):
Can you set the URL to console.cloud.google.com as per #112?

@mbookman (Contributor):
LGTM

-counts the number of variants two samples have in common. These counts are then placed into an
-``NxN`` matrix where ``N`` is the number of samples in the variant set. The matrix is centered,
-scaled, and then the first two principal components are computed for each invididual.
+`Principal Coordinate Analysis <http://occamstypewriter.org/boboh/2012/01/17/pca_and_pcoa_explained/>`_ counts the number of variants two samples have in common. These counts are then placed into an ``(N+M)x(N+M)`` matrix where ``N`` is the number of samples in the control variant set (e.g., 1,000 Genomes) and ``M`` is the number of samples in the case variant set. The matrix is centered, scaled, and then the first two principal components are computed for each invididual.
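A minimal numpy sketch of the computation the new text describes, with toy data and illustrative variable names only (the actual implementation referenced in the comment below is Scala):

    import numpy as np

    # Toy presence/absence matrix: rows = samples, columns = variants (hypothetical data).
    G = np.array([[1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [0, 1, 1, 0]])

    # Number of variants each pair of samples has in common: an NxN similarity matrix.
    S = G @ G.T

    # Double-center the matrix before extracting components.
    S_c = S - S.mean(axis=0) - S.mean(axis=1)[:, None] + S.mean()

    # The two eigenvectors with the largest eigenvalues give the first two
    # principal components, one coordinate pair per individual.
    vals, vecs = np.linalg.eigh(S_c)
    top2 = np.argsort(vals)[::-1][:2]
    pc1, pc2 = vecs[:, top2[0]], vecs[:, top2[1]]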
Review comment (@pgrosu):

Nicole, when performing this multidimensional scaling, I'm not sure you're actually doing any scaling in the code referenced in the link below, besides the centering before running the PCA on the centered dissimilarity matrix:

https://github.com/googlegenomics/spark-examples/blob/15213450109e0d8629317f98c275be1cf5114072/src/main/scala/com/google/cloud/genomics/spark/examples/VariantsPca.scala#L221

Hope it helps,
~p

Review comment (Contributor):
Nit: s/invididual/individual/

@cmclean (Contributor) commented on Jan 27, 2016

Not a blocking comment, but I'm curious whether /includes/dataflow_on_gce_setup.rst could also be updated to suggest Java 8 rather than Java 7. The forthcoming LD pipeline uses Java 8 for writing to Bigtable. Can the other pipelines also run on 8?

@deflaux (Contributor, Author) commented on Jan 27, 2016

Filed #115 for Java 8.

Saving #112 for another PR.

I set up the redirects, but they do not appear to be working at the moment (see also readthedocs/readthedocs.org#1826).

deflaux added a commit that referenced this pull request on Jan 27, 2016
@deflaux merged commit ef7aa75 into master on Jan 27, 2016
@pgrosu commented on Jan 27, 2016

Nicole, maybe an additional PR is needed for the other points as well, so they don't get lost in the mists of time :)

~p

@deflaux (Contributor, Author) commented on Jan 27, 2016

@pgrosu Thanks for pointing out the lack of scaling! I filed googlegenomics/spark-examples#82.

@pgrosu commented on Jan 27, 2016

Nicole, sure thing and sorry to be picky, but the above text says:

The matrix is centered, scaled, and then the first two principal components are computed for each invididual.

This suggests that you are scaling before you find PC1 and PC2, which is confusing. Usually, in this particular case, you determine the eigenvectors with the two highest eigenvalues and then scale the data by multiplying with them, not beforehand. Maybe my eyes are getting tired, but I'm not sure I see where you're actually performing that in the Dataflow code. I notice that you find maxEigenvalue and secondEigenvalue, but I'm not sure where you rescale the original data with the corresponding eigenvectors.

Thanks,
~p
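A minimal sketch of the ordering pgrosu describes (classical MDS scales after the eigendecomposition, not before); the matrix values here are illustrative only:

    import numpy as np

    # Toy centered similarity matrix (hypothetical, symmetric).
    S_c = np.array([[ 2.0, -1.0, -1.0],
                    [-1.0,  2.0, -1.0],
                    [-1.0, -1.0,  2.0]])

    # Eigendecompose first, then scale: the coordinates are the top-two
    # eigenvectors multiplied by the square roots of their eigenvalues.
    vals, vecs = np.linalg.eigh(S_c)
    top2 = np.argsort(vals)[::-1][:2]
    coords = vecs[:, top2] * np.sqrt(vals[top2])  # one (x, y) per sample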
