Alan Iwi edited this page Dec 15, 2015 · 14 revisions
Wiki Reorganisation
This page has been classified for reorganisation. It has been given the category REVISE.
This page contains useful content but needs revision. It may contain out of date or inaccurate content.

ESGF Node Installation FAQ


General

What's up with the ESGF P2P Release Names?

The release names follow an alphabetical list of Brooklyn (NYC) neighborhoods.

What operating systems are supported for Node installation?

Currently we build and test on CentOS/RedHat 5. Installations have also succeeded on Ubuntu and SUSE, and most Linux distributions are probably supported with little or no modification. The current installation does NOT support Windows or Mac. (For the Mac, the only real issue is how users are created, i.e. not with useradd. This will probably be addressed once the Linux installs are more settled and/or there is sufficient demand for Mac support.)

What do I need on my system before I install?

The system should have gcc, gcc-c++, openssl-devel and the X11-devel headers installed before you begin the Node installation. You may also need to install zlib-devel, gettext-devel and expat-devel. The CDAT installation installs these last three libraries itself, but Git is needed to pull down the CDAT distribution and they are prerequisites for building Git. To support the GridFTP installation you will need Flex and Bison (see the GridFTP section below for details).

NOTE: the node will not work with OpenSSL 1.0. See Bug 123

CentOS 6.3 (64-bit)

autoconf

  • autoconf-archive.noarch: The Autoconf Macro Archive
  • autoconf.noarch: A GNU tool for automatically configuring source code

automake

  • automake.noarch: A GNU tool for automatically creating Makefiles

bison

  • bison-devel: -ly library for development using Bison-generated parsers
  • bison-runtime: Runtime support files used by Bison-generated parsers

file

  • file-libs: Libraries for applications using libmagic
  • file-roller: Tool for viewing and creating archives

flex

  • flexiport-devel: Header files and libraries for flexiport
  • jflex-javadoc.noarch: Javadoc for jflex

gcc

  • libgcc: GCC version 4.4 shared support library

gcc-c++

  • gcc-c++: C++ support for GCC

gettext-devel

  • gettext-devel: Development files for gettext

libtool

  • libtool-ltdl-devel: Tools needed for development using the GNU Libtool Dynamic Module Loader
  • libtool: The GNU Portable Library Tool

libuuid

  • libuuid-devel: Universally unique ID library

libxml2

  • libxml2: Library providing XML and HTML support
  • libxml2-devel: Libraries, includes, etc. to develop XML and HTML applications

libxslt

  • libxslt: Library providing the Gnome XSLT engine
  • libxslt-devel: Libraries, includes, etc. to embed the Gnome XSLT engine

lsof

  • lsof: A utility which lists open files on a Linux/UNIX system

make

  • make: A GNU tool which simplifies the build process for users

openssl

  • openssl-devel: Files for development of applications which will use OpenSSL

pam

  • pam-devel: Files needed for developing PAM-aware applications and modules for PAM

pax

  • pax: POSIX File System Archiver
  • pax-utils: PaX aware and related utilities for ELF binaries

readline

  • readline-devel: Files needed to develop programs which use the readline library

tk

  • tk-devel: Tk graphical toolkit development files

wget

  • wget: A utility for retrieving files using the HTTP or FTP protocols

zlib-devel

  • zlib-devel: Header files and libraries for Zlib development

ExtUtils

  • perl-ExtUtils*

perl-Archive-Tar

  • perl-Archive-Tar: A module for Perl manipulation of .tar files

perl-XML-Parser

  • perl-XML-Parser: Perl module for parsing XML files

x11

  • xorg-x11*

A more copy-and-paste friendly version of that list (using yum on CentOS):

yum install autoconf automake bison file flex gcc gcc-c++ gettext-devel libtool libuuid-devel libxml2 libxml2-devel libxslt libxslt-devel lsof make openssl-devel pam-devel pax readline-devel tk-devel wget zlib-devel '*ExtUtils*' perl-Archive-Tar perl-XML-Parser
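If you want a quick sanity check before running the installer, something like the following sketch could help. It only looks for commands on the PATH, so it cannot tell whether the -devel header packages are installed, and the tool list is an illustrative subset chosen for this example, not the installer's own check.

```python
import shutil

# Commands named in this FAQ as build prerequisites. This is an
# illustrative subset, not a list taken from the installer itself.
REQUIRED_TOOLS = ["gcc", "g++", "make", "autoconf", "automake",
                  "libtool", "flex", "bison", "wget", "lsof"]


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    absent = missing_tools(REQUIRED_TOOLS)
    print("missing:", ", ".join(absent) if absent else "none")
```

The `which` argument exists only so that the lookup can be replaced when trying the function out; in normal use the default is what you want.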

NOTE: There are additional prerequisites from the UV-CDAT tool, which is installed as part of the DATA configuration of the stack. Please see them here: https://github.com/UV-CDAT/uvcdat/wiki/System-Requirements, most notably the need for gfortran. (In newer versions of UV-CDAT, gfortran is part of the installation procedure.)

Can I install it on a machine without a host and domain name (or with localhost.localdomain)?

Probably not; attempting this runs into a lot of difficulties.

For testing, just give the machine a name by calling hostname myname.mydomain and updating the /etc/hosts file to include it. For example:

...
my.ip.goes.here myname.mydomain
127.0.0.1 myname.mydomain localhost.localdomain
...

This should resolve at least some of the issues, which probably arise from the new security infrastructure.
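To see whether the name your machine currently reports would be acceptable, a small check along these lines may be handy. The rule it applies (at least host.domain, and not a localhost alias) is my reading of the advice above, not something the installer is known to enforce in exactly this form.

```python
import socket


def is_usable_node_name(name):
    """Return True if `name` looks like host.domain and is not a localhost alias."""
    labels = name.strip().rstrip(".").lower().split(".")
    if len(labels) < 2 or not all(labels):
        return False  # bare host name, empty string, or empty label
    return labels[0] != "localhost"


if __name__ == "__main__":
    fqdn = socket.getfqdn()
    verdict = "ok" if is_usable_node_name(fqdn) else "needs a real host.domain name"
    print(fqdn, "->", verdict)
```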

When building CDAT I get: "Your Python does not have support for Tkinter CDAT will not work"

This most likely indicates that your machine does not have X11 headers installed. This is a vestigial dependency because the publisher has a tk/tcl graphical user interface that gets built in this process. (Pre-Requisite: X11 headers)

Solution: Install all the X11 headers

When I am installing Thredds I have an error with "reinit"

ERROR:

----------------------------
Thredds Data Server Test... (publisher catalog gen)
----------------------------

Tomcat (jsvc) process is running...
Postgres process is running...
/usr/local/cdat/bin/esgpublish --use-existing pcmdi.ichec.ie.test.mytest
--noscan --thredds
INFO       2011-01-14 11:34:22,437 Writing THREDDS catalog
/esg/content/thredds/esgcet/1/pcmdi.ichec.ie.test.mytest.v1.xml
WARNING    2011-01-14 11:34:22,468 No dataset_id option found for project test
INFO       2011-01-14 11:34:22,494 Writing THREDDS ESG master catalog
/esg/content/thredds/esgcet/catalog.xml
INFO       2011-01-14 11:34:22,497 Reinitializing THREDDS server
ERROR      2011-01-14 11:34:22,499 Error reading url
https://localhost:443/thredds/admin/debug?catalogs/reinit: URLError('unknown
url type: https',)
Traceback (most recent call last):
  File "/usr/local/cdat/bin/esgpublish", line 5, in <module>
    pkg_resources.run_script('esgcet==2.7.4', 'esgpublish')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 489, in run_script
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1207, in
run_script
  File
"/usr/local/cdat/lib/python2.6/site-packages/esgcet-2.7.4-py2.6.egg/EGG-INFO/scripts/esgpublish",
line 434, in <module>
...
  • 1 - Check that the connector section for port 443 in Tomcat's server.xml file is set up properly. Namely, check that the paths to the truststore and keystore are correct and that the passwords are also correct. To check the passwords, use Java's keytool to do a simple listing, which requires the password. If the listing succeeds with the path and password taken from the connector, both are good.

    %> keytool -list -v -keystore <path-to-keystore> -storepass <password>

  • 2 - Make sure you have installed the openssl-devel package on your machine. It is a prerequisite that the installation script does not install. OpenSSL is needed when Python is built during the CDAT portion of the install, because Python needs OpenSSL support. (see: bug report )

    Solution: install openssl-devel

I set the environment variable but it doesn't take effect!?!?

The ESGF Node uses environment variables to set different parameters used during installation and operation. If you change an environment variable at the command line or in your environment and the change doesn't seem to take effect (i.e. the value is not changed accordingly), first check whether the variable is already set in /etc/esg.env. Key variables are set in the /etc/esg.env file. This file is sourced last in the environment sourcing sequence, which means that it supersedes variables set at the command line or in the shell environment. The file is chmod'ed 644 to prevent anyone except the node administrator from setting these values.

Solution: Check the /etc/esg.env file... if the value you are attempting to set is already present in /etc/esg.env then you will have to either 1) remove the value from the file or 2) edit the file to change the variable entry to desired value.  To do either of these things, you have to be the node admin.
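A rough way to see what the file pins is sketched below. It assumes that /etc/esg.env consists of simple `export NAME=value` or `NAME=value` lines; anything more elaborate (command substitution, line continuations) is not handled, and the function names are made up for this example.

```python
def parse_env_file(text):
    """Return {NAME: value} for lines like `export NAME=value` or `NAME=value`."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        name, sep, value = line.partition("=")
        if sep and name.isidentifier():
            # Drop surrounding quotes so 'x' and "x" both read as x.
            settings[name] = value.strip().strip("'\"")
    return settings


def overridden_by_env_file(name, text):
    """Return the value the file sets for `name`, or None if it does not set it."""
    return parse_env_file(text).get(name)
```

For instance, `overridden_by_env_file("GLOBUS_HOSTNAME", open("/etc/esg.env").read())` returns the pinned value, or None when the variable is free to be set elsewhere.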

When starting the node after install I get HTTP 500 Jasper error...

Occasionally after a fresh install or upgrade when visiting the main web page you get a strange 500 error page saying the following...

HTTP Status 500 -

type Exception report

message

description The server encountered an internal error () that prevented it from fulfilling this request.

exception

org.apache.jasper.JasperException: org.apache.jasper.JasperException: Unable to load class for JSP
        org.apache.jasper.servlet.JspServletWrapper.getServlet(JspServletWrapper.java:161)
        org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:340)
        org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
        org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
...

To fix this, simply restart the node:

%> esg-node restart

This issue may be related to a page-caching chicken-and-egg problem, much like having to run LaTeX multiple times to compile a document. The cause is still under investigation, but the solution is relatively straightforward.


Database

I get a java.sql.SQLException from the AccessLoggingFilter

 ERROR - esg.node.filters.AccessLoggingDAO - java.sql.SQLException:
>> ERROR: relation "seq_access_logging" does not exist
>> Query: select nextval('seq_access_logging') Parameters: []

This happens if you are upgrading the ESGF node manager from an installed version older than 1.0.4.0. The database has to be updated manually. As the database admin, run the following script: [ esgf node manager update database script ](http://rainbow.llnl.gov/dist/esg-node/db_upgrade/create_access_logging.sql)

Postgres will not start: Error regarding /dev/null?

Scenario: The build and installation of postgres seemed to have gone fine, however when it is time to start postgres you get:

su: /dev/null: Permission denied
Starting Postgress...
su postgres -c "/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data start"
su: /dev/null: Permission denied
 ERROR: Could not start database!

This is because a postgres user already existed on your system and its login shell is set to /dev/null in /etc/passwd. To use the command-line postgres commands, the postgres user needs a real shell. Edit the /etc/passwd file and change the entry for postgres to use the shell /bin/bash,

or run this sed command:

sed -i 's#^\(postgres:.*:\)[^:]*$#\1/bin/bash#' /etc/passwd

Tomcat

The THREDDS servlet isn't starting

If you also have "COMPUTE" installed, make sure that the file /esg/content/las/conf/server/las_servers_static.xml exists:

# touch /esg/content/las/conf/server/las_servers_static.xml

(or the respective $ESGF_HOME location). Then restart the node.

The THREDDS servlet isn't starting: Permission Denied

This happens when the installer breaks at certain points. The created/downloaded files in the Tomcat webapp directory are then still owned by root and cannot be accessed by the Tomcat server. Just ensure everything is owned by tomcat:

# chown -R tomcat:tomcat /usr/local/tomcat

Tomcat is complaining about too many open files

The default limit on open files is 1024, which might be too low (it shouldn't be, but a leak leaves files open until they get garbage collected). Check the number of open files allowed for tomcat:

 #as tomcat run ulimit
# su -c "ulimit -n" tomcat
1024

 #if it's that low try to increase it to 4096 by adding this line to /etc/security/limits.conf
tomcat               -       nofile          4096

 #Check it has been changed
# su -c "ulimit -n" tomcat
4096

This is probably a bug in the security library
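The same check can be scripted. The sketch below reads the limit of the process that runs it, so it has to be run as the tomcat user to say anything about Tomcat. The 4096 threshold is simply the value suggested above, and the function name is made up for this example.

```python
import resource


def nofile_too_low(soft_limit, wanted=4096):
    """Return True if the soft open-files limit is below `wanted`."""
    if soft_limit == resource.RLIM_INFINITY:
        return False  # unlimited is never too low
    return soft_limit < wanted


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft=%s hard=%s too_low=%s" % (soft, hard, nofile_too_low(soft)))
```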

Tomcat is complaining about too many connections "Maximum number of threads (200) created..."

Something in the software stack (or the clients) is leaving dangling connections. Those connections are garbage collected at some point, and the only resources they take are the ports they leave open. A workaround is to increase the number of threads in the connector (port 80) to something above the default of 200:

    <Connector port="80" protocol="HTTP/1.1"
               ...
               maxThreads="400"/>

If you still run into problems, try doubling it again.

You'll need to restart the node after that.
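To see what the connectors are currently set to without reading server.xml by hand, a sketch like this may do. It assumes the plain layout shown above, where maxThreads is an attribute of the Connector element; a Connector that delegates to a shared Executor is not handled. The function name is made up for this example.

```python
import xml.etree.ElementTree as ET

DEFAULT_MAX_THREADS = 200  # the default quoted in this FAQ


def connector_max_threads(server_xml_text):
    """Map each Connector port to its maxThreads (the default if the attribute is absent)."""
    root = ET.fromstring(server_xml_text)
    return {
        conn.get("port"): int(conn.get("maxThreads", DEFAULT_MAX_THREADS))
        for conn in root.iter("Connector")
    }
```

Feed it the contents of Tomcat's server.xml; ports come back as strings because that is how they appear in the file.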

How do I Monitor Tomcat?

The best approach I know of is using "jconsole". Here's a description of what needs to be done: http://download.oracle.com/javase/1.5.0/docs/guide/management/agent.html

Basically you need to start Tomcat with some extra parameters. The simplest set, which is apparently not viable for a production environment because of the memory consumption of the jconsole thread, is just this:

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=[your jmx port]
-Dcom.sun.management.jmxremote.ssl=true
-Dcom.sun.management.jmxremote.authenticate=true
-Djavax.net.ssl.keyStorePassword=[your password]
-Djavax.net.ssl.keyStore=[full path to keystore file]

Then you'll have to start jconsole on the same machine and attach it to the (probably "unnamed") process. You can check the PID externally to be sure.


MyProxy

How can I renew my CA when it expires?

There is good documentation on how to do this using openssl in the official Globus Toolkit documentation:

(client perspective)

Where can I get more MyProxy help?

Please be sure to check out this page as well if your question is not answered here:

IDP Node type FAQ on MyProxy (server-side)

Cannot connect to MyProxy Server

Try to connect manually and see what is causing the error:

globus/bin/myproxy-logon -v -s <gateway_host.domain> -l <user> -p <port> -o <X509_CERT_DIR> -T

Where (the defaults are for tests and will change):

  • gateway_host.domain (default: pcmdi3.llnl.gov): The gateway where the MyProxy server is running.
  • user (no default): The user name of the gateway account. This is the one used for publishing.
  • port (default: 2119): The default value is really 7512 for the MyProxy service, but pcmdi3.llnl.gov is using this port instead.
  • X509_CERT_DIR (default: /root/.globus/certificate-file): Where the certificate will get stored.

The -T parameter tells the MyProxy client to retrieve the root certificate of the server. If you connect to a second server you will probably have to delete the certificates directory (X509_CERT_DIR) first. Otherwise you will probably get an error like:

OpenSSL Error: s3_clnt.c:897: in library: SSL routines, function SSL3_GET_SERVER_CERTIFICATE: certificate verify failed
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Can't get the local trusted CA certificate: Untrusted self-signed certificate in chain with hash acdc777a


The error above appears to indicate that whatever's in your $X509_CERT_DIR is not compatible with the MyProxy server that you're trying to get credentials from. For example, if for some reason the MyProxy server is no longer trusted (i.e. trustroots have changed on the server side), you have little choice but to clear out or remove the existing X509_CERT_DIR and try again. An example of this is shown below:

export X509_CERT_DIR=/some/dir
rm -rf $X509_CERT_DIR
[ re-run myproxy logon here using the -T option ]

The X509_CERT_DIR directory on the client side, while not useless, is disposable. So you can rm -rf it if you'd like before every MyProxy logon if you wanted to be very inefficient about things. In most cases if you run into trouble, that will solve the issue.

During data node installation, error during "Registering the Data node with Globus Platform"

Symptom:

Please provide a Globus username []: <hidden>
Globus password []: Creating directory: /var/lib/globus-connect-server
ENTER: IO.setup()
ENTER: IO.configure_credential()
ENTER: GCMU.configure_credential()
EXIT: GCMU.configure_credential()
Writing GridFTP credential configuration
EXIT: IO.configure_credential()
ENTER: configure_server()
Creating gridftp configuration
EXIT: IO.configure_server()
ENTER: IO.configure_sharing()
GridFTP Sharing Disabled
ENTER: IO.configure_trust_roots()
ENTER: GCMU.configure_trust_roots()
Fetching MyProxy CA trust roots
ENTER: get_myproxy_dn_from_server()
fetching myproxy dn from server
MyProxy DN is None
EXIT: get_myproxy_dn_from_server()
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/globus/connect/server/io/setup.py", line 137, in <module>
    ioobj.setup(reset=reset)
  File "/usr/lib/python2.6/site-packages/globus/connect/server/io/__init__.py", line 68, in setup
    self.configure_trust_roots(**kwargs)
  File "/usr/lib/python2.6/site-packages/globus/connect/server/io/__init__.py", line 285, in configure_trust_roots
    super(IO, self).configure_trust_roots(**kwargs)
  File "/usr/lib/python2.6/site-packages/globus/connect/server/__init__.py", line 475, in configure_trust_roots
    self.get_myproxy_dn_from_server()
  File "/usr/lib64/python2.6/os.py", line 471, in __setitem__
    putenv(key, item)
TypeError: putenv() argument 2 must be string, not None

  • Underlying problem: it is trying to do myproxy-logon -b -s <myproxy_endpoint> and failing.
  • Possible cause: firewall. Check what host is in myproxy.endpoint in /esg/config/esgf.properties, and if it is your host, check that incoming port 7512/tcp is open from the data nodes.
  • Testing: as non-root (e.g. user globus), try the myproxy-logon -b -s <myproxy_endpoint>. Do not try it as root. If you try it as root, it will fail for some other reason even if it is in fact working fine as non-root.
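Before digging into MyProxy itself, it can be worth ruling out the firewall with a plain TCP probe from the data node. The sketch below only shows that something accepts connections on the port; it says nothing about whether MyProxy is configured correctly. The function name is made up for this example.

```python
import socket


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Call it with the host from myproxy.endpoint and port 7512, for example `port_open("myproxy.example.org", 7512)`, where the host name is a placeholder for your own endpoint.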

MyProxy Alert 42 and 43

The next two errors differ from the one above and are related to each other. The errors are shown below; the solution in both cases is generally the same: don't run myproxy-logon as root.

Error authenticating: GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
globus_gss_assist: Error during context initialization
globus_gsi_gssapi: Unable to verify remote side's credentials
globus_gsi_gssapi: Unable to verify remote side's credentials: Couldn't verify the remote certificate
OpenSSL Error: s3_pkt.c:1053: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert bad certificate SSL alert number 42


------------------------

Error authenticating: GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
globus_gss_assist: Error during context initialization
globus_gsi_gssapi: Unable to verify remote side's credentials
globus_gsi_gssapi: SSLv3 handshake problems: Couldn't do ssl handshake
OpenSSL Error: s3_pkt.c:1086: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert unsupported certificate SSL alert number 43

As the root user, myproxy-logon attempts to use host certificates if they are present, rather than user certificates to authenticate with the server. It's highly unlikely that this will succeed. As a non-root user, the user certificates are looked for. Unlike the host certificates on the datanode, the user certificates do not exist. Therefore, anonymous authentication is attempted (which succeeds).

If for some reason it's absolutely imperative that you run myproxy-logon as root, you can force myproxy-logon to think that no host certificates are present on the datanode by explicitly setting the following environment variables to files that don't exist:

export X509_USER_CERT=foo
export X509_USER_KEY=bar

For more information on MyProxy trustroots, check out the official documentation here:

GridFTP

GridFTP Installation Pre-Requisites

Be sure to install these before running installation script...

Flex v2.5.35

http://downloads.sourceforge.net/project/flex/flex/flex-2.5.35/flex-2.5.35.tar.gz

Bison v2.4

http://ftp.gnu.org/gnu/bison/bison-2.4.tar.gz

How do I debug the gridFTP connection?

A couple of environment variables make the debugging output more verbose for both client and server:

export GLOBUS_ERROR_OUTPUT=1
export GLOBUS_ERROR_VERBOSE=1
export GLOBUS_GSI_AUTHZ_DEBUG_LEVEL=2

You may then start the server in the debug mode (you could add -l /tmp/gridftplog to save the output to a log file.):

globus-gridftp-server -debug -d all -p <port>

In this mode the server exits after the transfer is done. In any case it is best to add the -debug -d ALL parameters to the call already used to start it (in case you can start GridFTP). The following displays the complete invocation:

ps -wwo args= -C "globus-gridftp-server"

GridFTP authorization fails

If this is a security problem, the client might show output like this:

error: globus_ftp_client: the server responded with an error
500 500-Command failed. : globus_i_gfs_data.c:globus_l_gfs_authorize_cb:911:
500-authorization failed.
500-globus_gsi_authz.c:globus_gsi_authorize:507:
500-Callout returned an error
500-globus_callout.c:globus_callout_handle_call_type:749:
500-The callout returned an error
500-globus_gfork_lib.c:gfork_l_get_env_fd:460:
500-GFork error: Env not set
500 End.

The causes might be many, so you'll have to debug the server. Mimic the starting command as closely as possible in order to debug it properly (see "How do I debug the gridFTP connection?" above). For example, a complete debug command (in our case) looks like this:

GLOBUS_GSI_AUTHZ_DEBUG_LEVEL=1 GLOBUS_ERROR_OUTPUT=1 GLOBUS_ERROR_VERBOSE=1 GLOBUS_TCP_PORT_RANGE=60000,64000 GLOBUS_TCP_SOURCE_RANGE=60000,64000 GSI_AUTHZ_CONF=/etc/grid-security/gsi-authz.conf /usr/local/globus/sbin/globus-gridftp-server -disable-command-list APPE,DELE,ESTO,MKD,RMD,RNFR,RNTO,RDEL,STOR,STOU,XMKD,XRMD,CHMOD -p 2811 -chroot-path /esg/gridftp_root -usage-stats-id 2811 -usage-stats-target localhost:0\!all -acl customgsiauthzinterface -no-cas -debug -d ALL

The following are some examples of what might go wrong:

Certificates' directory missing

globus_error_put(): globus_gsi_system_config.c:globus_i_gsi_sysconfig_create_cert_dir_string:411:
Could not find a valid trusted CA certificates directory
globus_gsi_system_config.c:globus_gsi_sysconfig_dir_exists_unix:4694:
File does not exist: /root/.globus/certificates is not a valid directory

This means the X509_CERT_DIR environment variable was not set. Point it at the certificates directory, e.g. /etc/grid-security/certificates, before starting the server.

Soap not called

Be sure the globus-gridftp-server command was started with the -no-cas parameter which tells the gridFTP instance to use ESG's security procedure.

Certificate verify failed

Calling out to auth service https://albedo2.dkrz.de/esgcet/saml/soap/secure/authorizationService.htm to retrieve SAML Assertion
        for identity https://albedo2.dkrz.de/esgcet/myopenid/user, file ftp://cmip2.dkrz.de/somefile.nc, and action read
SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
"SSL_ERROR_SSL
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed"
Detail: SSL_connect error in tcp_connect()

This means that the SSL connection to the gateway for authorization purposes failed. The SSL connection requires the server to have a trustworthy certificate, which is the case if either the certificate itself or the CA chain certifying its validity is in the CA certificate directory.

So you'll have to:

  1. Check that the certificate is valid:

    printf "GET /\n\n" | openssl s_client -connect ipcc-ar5.dkrz.de:443 -CApath $X509_CERT_DIR -verify 999 -quiet

  2. Check for the "Certificates' directory missing" problem above.

  3. If you are using chroot, ensure that $X509_CERT_DIR also exists inside the chroot. For example, with ${chroot} standing for the chroot path, this should do:

    cp -r $X509_CERT_DIR ${chroot}$X509_CERT_DIR

(normally this implies /esg/gridftp_root/etc/grid-security/certificates == /etc/grid-security/certificates )

  4. Check that the file exists and that you have permission to download it (see Permission missing ).

Permission missing

This is almost impossible to spot, because at this time none of the outputs mentioned gives any hint. You'll have to check the gateway logs and make sure you are logging esg.saml.authz.service.impl.SAMLAuthorizationServiceSoapImpl at debug level. If so, you'll see something like this in the logs:

[DEBUG] esg.saml.authz.service.impl.SAMLAuthorizationServiceSoapImpl: SOAP response:
<?xml version="1.0" encoding="UTF-8"?>
<soap11:Envelope xmlns:soap11="http://schemas.xmlsoap.org/soap/envelope/">
 <soap11:Body>
  <samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol" ID="c3434258-f4e5-4b8c-9593-f72cefa00519" InResponseTo="140734728698544" IssueInstant="2010-11-02T13:49:59.366Z" Version="2.0">
        [...]
    <saml:AuthzDecisionStatement Decision="Indeterminate" Resource="gsiftp://cmip1.dkrz.de:2811/test.nc">
     <saml:Action>read</saml:Action>
    </saml:AuthzDecisionStatement>
        [...]
  </samlp:Response>
 </soap11:Body>
</soap11:Envelope>

There are two things you should note here:

  1. The saml:AuthzDecisionStatement Decision="Indeterminate" means the access request was not granted (you would see Decision="PERMIT" if it were).

  2. The requested resource was Resource="gsiftp://cmip1.dkrz.de:2811/test.nc". This entry should be verbatim equal to the one in the metadata.file_accesess_point table of the gateway DB. At the time of writing the publisher wrongly publishes files as (in our example) gsiftp://cmip1.dkrz.de:2811//test.nc, so you have to access the file exactly like that (i.e. with doubled slashes before the file name).
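If you have to go through many of these log entries, pulling the decision and resource out of the SOAP response can be automated. The sketch below assumes the standard SAML 2.0 assertion namespace for the saml: prefix (the log excerpt above elides that declaration) and compares the decision case-insensitively. The function names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Standard SAML 2.0 assertion namespace; assumed to be what the
# saml: prefix is bound to in the elided part of the log.
SAML_ASSERTION_NS = "urn:oasis:names:tc:SAML:2.0:assertion"


def authz_decisions(soap_xml):
    """Return (decision, resource) for every AuthzDecisionStatement in the response."""
    root = ET.fromstring(soap_xml)
    tag = "{%s}AuthzDecisionStatement" % SAML_ASSERTION_NS
    return [(stmt.get("Decision"), stmt.get("Resource")) for stmt in root.iter(tag)]


def is_permitted(soap_xml):
    """True only if there is at least one statement and all of them say permit."""
    decisions = authz_decisions(soap_xml)
    return bool(decisions) and all((d or "").lower() == "permit" for d, _ in decisions)
```

The resource string that comes back is the one to compare character by character with the access point stored in the gateway DB, doubled slashes included.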

Virtual Host or alias

The GridFTP server doesn't send the gsiftp URL string "as is" to the gateway for authorization. It replaces the hostname in the URL with the name of the host where it is running. So if hostname -f is not reporting the expected name (i.e. the one from gsiftp://host_name/), you should define it in the GLOBUS_HOSTNAME environment variable (which you would probably like to add to the /etc/esg.env file).

To clarify this a little more with an example:

  • The data node is called cmip.dkrz.de
  • The alias bmbf-ipcc-ar5.dkrz is used for accessing the file:

    globus-url-copy gsiftp://bmbf-ipcc-ar5.dkrz/mydata/myfile.nc file:///dev/null

This will trigger the query (gsiftp://cmip.dkrz.de/mydata/myfile.nc, read) to the gateway attribute service, which probably won't work.

In order for this to work you'll have to:

echo "export GLOBUS_HOSTNAME=bmbf-ipcc-ar5.dkrz" >> /etc/esg.env
#restart the node as usual
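To predict which URL the gateway will be asked about for a given request, the host substitution described above can be mimicked as follows. This only illustrates the behaviour as this FAQ describes it; it is not the server's code, and keeping an explicit port unchanged is an assumption. The function name is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def url_sent_for_authorization(url, server_hostname):
    """Swap the host in `url` for `server_hostname`, keeping any port and the path."""
    parts = urlsplit(url)
    if parts.port is None:
        netloc = server_hostname
    else:
        netloc = "%s:%d" % (server_hostname, parts.port)
    return urlunsplit(parts._replace(netloc=netloc))
```

Pass the requested URL together with whatever hostname -f (or GLOBUS_HOSTNAME, once set) reports, then compare the result with the access point stored at the gateway.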

Thredds Server (TDS)

The server has started reporting "404: File not found" for no apparent reason

There are some issues with how the server caches the existing files and their locations. If a file or directory is not present when it is accessed, the TDS marks it as "non-existent" and doesn't try it again afterwards. Every later access attempt is reported as an attempt to access a non-existent file, even though the file might be accessible by now.

This happens especially when a remote file system is mounted after the TDS has been started. DKRZ has experienced this a couple of times, because the GPFS system might take longer than expected to become available and thus cause such an inconsistency. The only known solution is to restart the server.


Publishing

PREREQUISITES

1) Be sure that your node or institution is NOT performing web caching for this node!!!
2) Be sure that your node is visible from the outside, i.e. can accept inbound external connections!!!

When installing the Publisher (esgcet) I get: "ImportError: /usr/local/cdat/lib/python2.6/site-packages/cdtime.so:"

This looks like a problem with the SELinux security extension, which (apparently) affects shared library loading. E.g., from

http://www.archiware.com/support/index.php?_m=knowledgebase&_a=viewarticle&kbarticleid=58 :

In case you run a Linux host and get the following error message in the logfile when starting PresSTORE:

Error: modload: /usr/local/aw/bin/libarchdev.so: couldn't load file "/usr/local/aw/bin/libarchdev.so": /usr/local/aw/bin/libarchdev.so: cannot restore segment prot after reloc: Permission denied
Fatal: modload: failed to load module '/usr/local/aw/bin/libarchdev.so'

This problem is most probably caused by the security extension SELinux. SELinux is active in newer Linux distributions with 2.6 kernels. SELinux changes some system default behaviour, including the shared library loading.

This can be checked by disabling SELinux: just add the line

  • SELINUX=disabled

to the file

  • /etc/sysconfig/selinux

and restart the host.

In case the shared library can be loaded this way but SELinux shall be kept active, it is required to adapt the security context for the shared library loading by using the chcon program.

Got Error: "Parent THREDDS catalog is null, cannot start algorithm without parent THREDDS catalog"

This is a fairly common error. It means the gateway could not reach the node THREDDS server to upload the catalog. Things to check:

  • In esg.ini, thredds_url should be the address of the esgf "data" node THREDDS server
  • port 443 should be accessible from outside.
  • tomcat & thredds should be running.

Got Error: "esgcet.publish.hessianlib.RemoteCallException: Java ServiceException: Access is denied"

This is probably because the gateway account doesn't have the required publishing role.

Find out which group membership is required: go to the parent project and select the "Administration" tab.

Log in to the gateway and go to Account->"List current Membership".

Check that the account is a member of the proper group and has the special role of Data Publisher.

Got Error: raise "FOO" / TypeError: exceptions must be old-style classes or derived from BaseException, not str

This is an improperly handled Java error thrown by the index node, so it could be anything. Most likely it is a security issue that should be fixed on the index node side (if you have access, check the logs; they'll tell you exactly what happened).

If not, at least verify that you are really publishing what you want, and that the catalog exists and is accessible from the web server (anything that prevents this will cause such an exception).

In particular, check the esg.ini file for this property (check for typos!):

thredds_url = http://<data_node_fqdn>/thredds/esgcet

While trying to publish I get an httplib.BadStatusLine Error

This appears to be a problem with Tomcat at the gateway. Contact the gateway admin, or see the gateway FAQ if you are one.

While trying to publish I get a gaierror: (-2, 'Name or service not known')

This is probably caused by a port number in the hessian_service_url entry. Although the name says url, only a subset of URL syntax is supported (no port number). Use the hessian_service_port variable for the port instead.

For example change this:

hessian_service_url=https://myserver.com:8443/...path...

into:

hessian_service_url=https://myserver.com/...path...
hessian_service_port=8443

The problem is that myserver.com:8443 is taken as the server name, and the publisher tries to resolve it through DNS to find its IP address.
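If you have several configuration files to fix, the split can be done mechanically. The sketch below takes a URL and returns the two values to put into hessian_service_url and hessian_service_port. The function name is made up for this example, and the path in the usage line is a placeholder because the real one is elided above.

```python
from urllib.parse import urlsplit, urlunsplit


def split_hessian_url(url):
    """Return (url_without_port, port_or_None) for a hessian_service_url value."""
    parts = urlsplit(url)
    if parts.port is None:
        return url, None  # nothing to move
    # parts.hostname is the host without the port (and lower-cased).
    return urlunsplit(parts._replace(netloc=parts.hostname)), parts.port
```

For example, `split_hessian_url("https://myserver.com:8443/some/path")` gives the URL without the port plus the integer 8443.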

While publishing to the Gateway:

esgcet.publish.hessianlib.RemoteCallException: Java ServiceException: Parent THREDDS catalog is null, cannot start algorithm without parent THREDDS catalog.

Check that the _thredds_url_ variable in the esg.ini file is properly set; a typo there might prevent the gateway from finding the TDS. For example:

thredds_url = http://cmip2.dkrz.de/thredds/esgcet

If you are not publishing to the default esgcet collection, be sure the thredds_url also points to the thredds_root directory, e.g.:

thredds_root = /esg/content/thredds/lucid
thredds_url = http://cmip2.dkrz.de/thredds/lucid
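A quick consistency check for these two settings could look like the sketch below (Python 3). It rests on an assumption taken from the example above, namely that the last path component of thredds_root and of thredds_url should be the same collection name. The function name same_collection is made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def same_collection(thredds_root, thredds_url):
    """True if both settings end in the same, non-empty collection name.

    Assumption: the collection name is the last path component of each value.
    """
    root_name = posixpath.basename(thredds_root.rstrip("/"))
    url_name = posixpath.basename(urlsplit(thredds_url).path.rstrip("/"))
    return root_name != "" and root_name == url_name


# Values from the example above.
print(same_collection("/esg/content/thredds/lucid",
                      "http://cmip2.dkrz.de/thredds/lucid"))
```

A False result here would only be a hint that one of the two values contains a typo or still points to the default collection.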

How do I recreate the catalogs for datasets that are already in the DB?

Of course you could use the map files if you still have them and issue:

esgpublish --map mapfile.map --noscan --thredds

But if you don't, you could get a list of the datasets (e.g. of project cmip5) via

esglist_datasets --select name --no-header cmip5 > datasets.txt

Then you could trim that down (or pipe grep in the middle) to select the datasets of interest. Then you publish them using the --use-list flag:

esgpublish --use-list datasets.txt --project cmip5 --noscan --thredds
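If grep is not convenient, the trimming step can also be sketched in Python 3. The helper name select_datasets and the dataset names below are hypothetical; real names come from the esglist_datasets output.

```python
import re


def select_datasets(names, pattern):
    """Keep the dataset names that match the regular expression `pattern`.

    `names` is any iterable of lines, e.g. an open datasets.txt file.
    Blank lines are skipped and surrounding whitespace is removed.
    """
    rx = re.compile(pattern)
    return [n.strip() for n in names if n.strip() and rx.search(n)]


# Hypothetical dataset names, for illustration only.
lines = [
    "cmip5.output1.SOME-INSTITUTE.some-model.historical\n",
    "cmip5.output1.SOME-INSTITUTE.some-model.rcp45\n",
]
for name in select_datasets(lines, r"\.historical$"):
    print(name)
```

The selected names, written one per line to a file, can then be passed to esgpublish with the --use-list flag as shown above.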

Basically to recreate the complete catalog you could issue:

esglist_datasets --select name --no-header cmip5 | esgpublish --use-list - --project cmip5 --noscan --thredds

This will recreate the catalogs, but if they were already published to a gateway, they must be republished there, or the new URL must reach the gateway in some other way :-)

In trying to publish I get this error: :SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure'),)

We're forcing TLS to overcome a security vulnerability in SSLv3. Unfortunately, for now this requires a manual patch to the Python installed with uvcdat for use with the publisher.

  • Edit /usr/local/uvcdat/1.5.0/lib/python2.7/httplib.py

  • Modify line 1176 to read:

    self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file,ssl_version=ssl.PROTOCOL_TLSv1)

I'm still having problems

If you have access to the gateway, look at the Gateway's publishing FAQ.


Security

Do I have every certificate in the right place?

Well, at the current time all federation certificates must be present in the following places:

  • $X509_CERT_DIR

  • ${chroot}$X509_CERT_DIR

  • ${chroot} + distro dependent (see this to know more)

    • _Notes_:

    • $X509_CERT_DIR is set before starting gridFTP and normally points to /etc/grid-security/certificates.

    • ${chroot} points to the chroot directory used by gridFTP (if you are not using it, assume chroot="")

And the java truststore file:

  • $CATALINA_HOME/conf/esg-truststore.ts
  • $JAVA_HOME[/jre]/lib/security/jssecacerts

_Notes_:

  • I _think_ [/jre] is optional and depends on whether $JAVA_HOME points to a JDK or to a JRE (in the latter case there is no '/jre').
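Because the [/jre] part may or may not apply, a small Python 3 sketch can list every candidate truststore location and report which ones exist. The function name truststore_candidates is made up for illustration; the paths are built only from the two entries listed above.

```python
import os
import posixpath


def truststore_candidates(catalina_home, java_home):
    """Return the truststore paths named above, with and without '/jre'."""
    return [
        posixpath.join(catalina_home, "conf", "esg-truststore.ts"),
        posixpath.join(java_home, "jre", "lib", "security", "jssecacerts"),
        posixpath.join(java_home, "lib", "security", "jssecacerts"),
    ]


if __name__ == "__main__":
    # Falls back to empty strings if the variables are not set.
    paths = truststore_candidates(os.environ.get("CATALINA_HOME", ""),
                                  os.environ.get("JAVA_HOME", ""))
    for path in paths:
        state = "found" if os.path.isfile(path) else "missing"
        print("%-8s %s" % (state, path))
```

Only one of the two jssecacerts candidates is expected to exist on a given system, depending on whether $JAVA_HOME is a JDK or a JRE.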

I get a 500 when the ORP sends me back from the gateway after OpenID authentication took place

The probable cause for this is that the tomcat ssl certificate for the node is missing from the truststore. This is required for the time being.

The installer should have already done this, but check it is there with this and insert it if it's not.

I cannot retrieve a file directly from the node

Check that you can download it from the gateway (you must publish the file to a gateway in order to be able to retrieve it). If you can, the problem is probably at the SAML level, as the attribute server is not properly identifying either:

  • the transaction: check you are not accessing it in https mode at the node (if so change to http and try again). Check the gateway link to the file works.
  • the data node: most specifically this might be a certificate issue, see the previous entry.

I'm still having security related issues!

See the Security subsection and especially its FAQ.


MISC

What does the "nodeType" numeric value mean, in the registration.xml?

In the ESGF P2P Node these values are the base-10 values of the individual bits in a bit vector. They can be added together in any combination to give you the node configuration you desire. They correspond exactly to the installation "--type" value you set, specifically:

  • DATA_BIT=4 -> "data" type

  • INDEX_BIT=8 -> "index" type

  • IDP_BIT=16 -> "idp" type

  • COMPUTE_BIT=32 -> "compute" type

There is the type "all" which is the sum of these values, namely:

  • ALL_BIT=$((DATA_BIT+INDEX_BIT+IDP_BIT+COMPUTE_BIT)), which gives a sum of 60 -> "all" type

FYI the other bits are used to determine the setting of the script execution, specifically:

  • INSTALL_BIT=1
  • TEST_BIT=2
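To read a nodeType value found in registration.xml, the bit values listed above can be decoded with a few lines of Python 3. This is a sketch for illustration only; the function names decode_node_type and encode_node_type are made up and are not part of the installer.

```python
# Bit values as listed above.
TYPE_BITS = {"data": 4, "index": 8, "idp": 16, "compute": 32}


def decode_node_type(value):
    """Return the node type names whose bits are set in `value`."""
    ordered = sorted(TYPE_BITS.items(), key=lambda item: item[1])
    return [name for name, bit in ordered if value & bit]


def encode_node_type(names):
    """Return the numeric nodeType for a collection of type names."""
    value = 0
    for name in names:
        value |= TYPE_BITS[name]
    return value


print(decode_node_type(60))                # ['data', 'index', 'idp', 'compute']
print(encode_node_type(["data", "index"])) # 12
```

The INSTALL_BIT and TEST_BIT values are ignored by decode_node_type, since they describe the script execution rather than the node type.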

ESGF IDP Node FAQ

here


ESGF Node Development FAQ


OS specifics

Why does GIT always tell me that files are modified even after reset or commit?

This happens because of an error while handling file permissions. See this to work around it: http://superuser.com/questions/204757/git-chmod-problem-checkout-screws-exec-bit

