3. Storage on Spider

Tip

Spider is meant for processing big data, so it supports several storage backends. On this page you will learn:

  • which internal and external storage systems we support
  • best practices to manage your data

3.1. Internal storage

3.1.1. Transfers within Spider

To transfer data between directories located within Spider, we advise you to use the Unix commands cp and rsync. Other options may be available, but these are currently not supported by us.

Help on these commands is available by (i) typing man cp or man rsync on the command line after logging into the system, or (ii) contacting our helpdesk.
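As a minimal sketch of such a transfer, the commands below copy one directory tree into another. The two directories are temporary stand-ins created purely for illustration; on Spider you would substitute your own paths, for example under /home/[USERNAME] or /project/[Project Name]/Data.

```shell
# Temporary stand-ins for a source and a destination directory (hypothetical).
src=$(mktemp -d)
dst=$(mktemp -d)
echo "sample" > "$src/input.txt"

# cp -a copies recursively and preserves permissions and timestamps.
cp -a "$src/." "$dst/"

# rsync -a preserves the same attributes but skips files that are already
# up to date, which helps when a large copy has to be repeated or resumed.
if command -v rsync >/dev/null 2>&1; then
    rsync -a "$src/" "$dst/"
fi
```

The trailing slash on the rsync source means "the contents of this directory"; without it, rsync would create a subdirectory named after the source inside the destination.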

3.1.2. Spider filesystems

3.1.2.1. Using Home

Spider provides each user with a globally mounted home directory that is listed as /home/[USERNAME]. This directory is accessible from all nodes. It is also the directory you will find yourself in when you first log in to the system. The data stored in the home folder will remain available for the duration of your project.

3.1.2.2. Using scratch

Each of Spider's worker nodes has a large scratch area on local SSD. These scratch directories enable particularly efficient data I/O for large data processing pipelines that can be split up into many parallel independent jobs.

Please note that you should only use the scratch space to temporarily store and process data for the duration of the submitted job. The scratch space is cleaned automatically at regular intervals and hence cannot be used for long-term storage.
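The stage-in, process, stage-out pattern described above could look roughly like the sketch below. It assumes that the node-local scratch location is exposed through the TMPDIR environment variable and falls back to /tmp otherwise; that is an assumption, not something this page states. The input and output directories are temporary stand-ins, and sort stands in for the real workload.

```shell
# Stand-ins for a persistent input location and output location (hypothetical).
input_dir=$(mktemp -d)
output_dir=$(mktemp -d)
printf 'b\na\nc\n' > "$input_dir/input.txt"

# Assumption: node-local scratch is exposed as $TMPDIR; fall back to /tmp.
scratch=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX")

# 1. Stage the input onto scratch.
cp "$input_dir/input.txt" "$scratch/"

# 2. Process on scratch (sort is a placeholder for the real computation).
sort "$scratch/input.txt" > "$scratch/output.txt"

# 3. Copy the results back to persistent storage, then clean up scratch.
cp "$scratch/output.txt" "$output_dir/"
rm -rf "$scratch"
```

Copying results back before the job ends matters because anything left on scratch may be removed by the automatic cleanup.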

3.1.2.3. Using project spaces

Like home folders, Spider’s project spaces are available on all worker nodes. The following paths are available on your Spider UI:

  • /project/[Project Name]/Data
  • /project/[Project Name]/Public
  • /project/[Project Name]/Share
  • /project/[Project Name]/Software

This allows you to easily access the software, data and output in your project spaces from the worker nodes. See below for an example of a command that could be executed from a script on a worker node:

sh /project/[Project Name]/Software/[script].sh /project/[Project Name]/Data/[input file(s)] /home/[USERNAME]/[output]

3.1.2.4. Using scientific catalogs

Scientific catalogs allow you to share software and data repositories across projects. For example, if you would like to share a large biobank of data with other research projects, you could request access to upload it to a scientific catalog. It will then be accessible from the worker nodes, similarly to the /home and /project folders.

To request access to add a shared catalog, please reach out to our helpdesk.

3.2. External storage

3.2.1. Transfers from own system

When you are logged in on Spider, we support scp, rsync, curl and wget to transfer data between Spider and your own Unix-based system. Other options may be available, but these are currently not supported by us.

  • Example of transferring data from Spider to your own Unix-based system:
scp /home/[USERNAME]/transferdata.tar.gz [own-system-user]@own_system.nl:/home/[own-system-user]/
rsync -a -W /home/[USERNAME]/transferdata.tar.gz [own-system-user]@own_system.nl:/home/[own-system-user]/
  • Example of retrieving data from your own Unix-based system on Spider:
scp [own-system-user]@own_system.nl:/home/[own-system-user]/transferdata.tar.gz /home/[USERNAME]/
rsync -a -W [own-system-user]@own_system.nl:/home/[own-system-user]/transferdata.tar.gz /home/[USERNAME]/
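The examples above cover scp and rsync only. For curl and wget, a download helper might be sketched as follows. This assumes the data is served over HTTP(S) by a server you control; the function name download_file and the URL in the commented usage line are hypothetical.

```shell
# Download a file over HTTP(S) into a directory, resuming a partial
# transfer if one exists. Usage: download_file URL DEST_DIR
download_file() {
    url=$1
    dest_dir=$2
    if command -v curl >/dev/null 2>&1; then
        # -f: fail on HTTP errors, -L: follow redirects,
        # -C -: resume where a previous attempt stopped,
        # -O: keep the remote file name.
        (cd "$dest_dir" && curl -f -L -C - -O "$url")
    else
        # -c: continue a partial download, -P: directory to save into.
        wget -c -P "$dest_dir" "$url"
    fi
}

# Hypothetical URL; adapt to a server you actually run before uncommenting.
# download_file "https://own_system.nl/data/transferdata.tar.gz" "/home/[USERNAME]"
```

Both tools pull data towards the machine they run on, so for sending data from Spider to your own system scp or rsync remain the simpler choice.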

3.2.2. SURFsara dCache

dCache is our large, scalable storage system for storing and processing huge volumes of data fast. The system runs on dCache software, which is designed for managing scientific data. You can use dCache for disk or tape, or address both types of storage under a single virtual filesystem tree. Our dCache service is a remote storage system with an extremely fast network link to Spider. You may use this storage if your data does not fit within the storage allocation of your Spider project space or if your application is I/O intensive.

There are several protocols and storage clients to interact with dCache. On Spider we support two main methods to use dCache, the ADA and Grid interfaces:

Our ADA (Advanced dCache API) interface is based on the dCache API and the WebDAV protocol. It lets you access and process your data on dCache from any platform and with various authentication methods.

Our Grid interface is based on Grid computing technology and the GridFTP protocol. It lets you access and process your data on dCache from Grid-compliant platforms with X.509 certificate authentication.

3.2.3. SURFsara SWIFT

Coming soon.

3.2.4. SURFsara Central archive

For long-term preservation of precious data, SURFsara offers the Data Archive. Data ingested into the Data Archive is kept in two different tape libraries at two different locations in the Netherlands. The Data Archive is connected to all compute infrastructures, including Spider.

Access to the Data Archive is not provided by default to Spider projects. To request Data Archive access, please contact our helpdesk.

If you already have access to the Data Archive, you can use it directly from Spider with scp and rsync to transfer data between Spider and the Data Archive:

  • Transfer data from Spider to Data Archive:
scp /home/[USERNAME]/transferdata.tar.gz [ARCHIVE_USERNAME]@archive.surfsara.nl:/home/[ARCHIVE_USERNAME]/
rsync -a -W /home/[USERNAME]/transferdata.tar.gz [ARCHIVE_USERNAME]@archive.surfsara.nl:/home/[ARCHIVE_USERNAME]/
  • Retrieve data from Data Archive on Spider:
scp [ARCHIVE_USERNAME]@archive.surfsara.nl:/home/[ARCHIVE_USERNAME]/transferdata.tar.gz /home/[USERNAME]/
rsync -a -W [ARCHIVE_USERNAME]@archive.surfsara.nl:/home/[ARCHIVE_USERNAME]/transferdata.tar.gz /home/[USERNAME]/

If the file to be retrieved from Data Archive to Spider is not directly available on disk, the scp/rsync command will hang until the file has been moved from tape to disk. Data Archive users can query the state of their files by logging into the Data Archive user interface and running dmls -l on the files of interest. The state of a file is shown as either on disk (REG) or on tape (OFL). The Data Archive user interface is accessible via ssh from anywhere for users that have a login account; an example is given below:

ssh [ARCHIVE_USERNAME]@archive.surfsara.nl
      touch test.txt
      dmls  -l test.txt
      -rw-r--r--  1 homer    homer    0 2019-04-25 15:24 (REG) test.txt
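To act on the state in a script, the parenthesised field can be extracted from the dmls -l output. The helper below is a minimal sketch that assumes the output format shown in the example above; the function name dmls_state is hypothetical, and the sample line is copied from that example rather than produced by a live dmls call.

```shell
# Extract the state field, such as REG or OFL, from `dmls -l` output on stdin.
# Assumes the state is the last parenthesised run of capital letters on the line.
dmls_state() {
    sed -n 's/.*(\([A-Z]*\)).*/\1/p'
}

# Sample line taken from the example above.
line='-rw-r--r--  1 homer    homer    0 2019-04-25 15:24 (REG) test.txt'
state=$(printf '%s\n' "$line" | dmls_state)

if [ "$state" = "REG" ]; then
    echo "test.txt is on disk"
else
    echo "test.txt is not on disk (state: $state)"
fi
```

On the Data Archive itself the input would come from a pipe such as dmls -l test.txt | dmls_state. A check like this can indicate in advance whether an scp or rsync retrieval is likely to block while the file is recalled from tape.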

Best practices for the usage of Data Archive are described on the Data Archive page.

3.3. Quota policy

Each Spider user is granted specific compute and storage resources in the context of a project. For these resources there are currently no hard quotas. However, we monitor both the core-hour consumption and storage usage to prevent users from exceeding their granted allocation.
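Because no hard quota stops you at the limit, it can be useful to check your own storage usage from time to time. The sketch below uses the standard du command; the helper name usage_kb is hypothetical and the directory is a temporary stand-in for a path such as /home/[USERNAME] or /project/[Project Name].

```shell
# Print the disk usage of a directory in units of 1024 bytes.
usage_kb() {
    du -sk "$1" | cut -f1
}

# Temporary stand-in directory (hypothetical); replace with your own path.
demo_dir=$(mktemp -d)
printf 'some data\n' > "$demo_dir/file.txt"

kb=$(usage_kb "$demo_dir")
echo "Usage of $demo_dir: ${kb} KB"
```

On directories with many files du can take a while, since it has to visit every file.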

3.4. Backup policy

The data stored on CephFS (home and project spaces) is disk only, replicated three times for redundancy. For disk-only data there is no backup. If you cannot afford to lose this data, we advise you to copy it elsewhere as well.
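One way to prepare such a copy is to bundle a directory into a single compressed archive, like the transferdata.tar.gz file used in the transfer examples on this page, and then move it with scp or rsync. The following is a minimal sketch; the directories are temporary stand-ins created for illustration.

```shell
# Temporary stand-in for a directory worth backing up (hypothetical).
data_dir=$(mktemp -d)
printf 'result\n' > "$data_dir/output.txt"
archive="$(mktemp -d)/transferdata.tar.gz"

# -c: create, -z: gzip-compress, -f: archive file name,
# -C: change into the directory first so stored paths are relative.
tar -czf "$archive" -C "$data_dir" .

# List the contents to confirm the archive is readable before transferring it.
tar -tzf "$archive"
```

A single large archive generally transfers more efficiently than many small files, which is especially relevant for tape-backed storage.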

See also

Still need help? Contact our helpdesk