3. Storage on Spider¶
Spider is designed for processing large volumes of data and therefore supports several storage backends. On this page you will learn:
- which internal and external storage systems we support
- best practices to manage your data
3.1. Internal storage¶
Spider offers two kinds of internal storage: CephFS and local SSDs. Home and project spaces are mounted on CephFS, while the batch worker nodes have large scratch areas on local SSD.
CephFS is a distributed parallel filesystem that stores files as objects and is suited to workloads that deal with comparatively large files. Please note that conda/pip packages, which handle lots of small files, can slow down the system response. For high I/O performance, we recommend the local scratch space on the SSDs of the worker nodes.
3.1.1. Transfers within Spider¶
To transfer data between directories located within Spider we advise you to use the Unix commands cp and rsync. Other options may be available, but these are currently not supported by us.
Help on these commands can be found by (i) typing man cp or man rsync on the command line after logging into the system, or (ii) contacting our helpdesk.
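As a minimal sketch of such an internal transfer, the helper below copies a directory tree with rsync and falls back to cp when rsync is not found. The function name `copy_tree` and the example paths are hypothetical and serve only as illustration.

```shell
#!/bin/sh
# copy_tree SRC DST: copy the contents of directory SRC into directory DST.
# Hypothetical helper for illustration; not part of Spider.
copy_tree() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    if command -v rsync >/dev/null 2>&1; then
        # -a preserves permissions and timestamps; the trailing slash on
        # the source copies its contents rather than the directory itself.
        rsync -a "$src"/ "$dst"/
    else
        # Fallback: "SRC/." copies the contents of SRC, including dotfiles.
        cp -Rp "$src"/. "$dst"/
    fi
}

# Example (placeholder paths):
# copy_tree "/project/[Project Name]/Data/run1" "/home/[USERNAME]/run1"
```

Because rsync only transfers files that differ, rerunning the same command after an interruption resumes the copy instead of starting over.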
3.1.2. Spider filesystems¶
3.1.2.1. Using Home¶
Spider provides each user with a globally mounted home directory, listed as /home/[USERNAME]. This directory is accessible from all nodes. It is also the directory in which you will find yourself upon first login to the system. The data stored in the home folder will remain available for the duration of your project.
3.1.2.2. Using scratch¶
Each of Spider's worker nodes has a large scratch area on local SSD. These scratch directories enable particularly efficient data I/O for large data processing pipelines that can be split up into many independent parallel jobs.
Please note that you should only use the scratch space to temporarily store and process data for the duration of the submitted job. The scratch space is cleaned automatically at regular intervals and hence cannot be used for long-term storage.
For more information about how to use scratch during your compute jobs, please see the compute section.
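Using scratch only for the duration of a job usually means copying the input to scratch, processing it there, and copying the results back before the job ends. A minimal sketch of that pattern follows. It rests on assumptions: the scratch location is taken from `$TMPDIR` (falling back to `/tmp`), which may differ from the actual scratch path on Spider worker nodes, and the function name `run_on_scratch` is hypothetical.

```shell
#!/bin/sh
# run_on_scratch INPUT OUTDIR CMD...: copy INPUT to a private scratch
# directory, run CMD there with the file name as its last argument,
# capture stdout in result.out, copy that file to OUTDIR and clean up.
# ASSUMPTION: scratch is reachable via $TMPDIR; adjust to your system.
run_on_scratch() {
    input=$1
    outdir=$2
    shift 2
    work=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || return 1
    cp "$input" "$work"/ || { rm -rf "$work"; return 1; }
    name=$(basename "$input")
    status=0
    # Run the command inside the scratch directory.
    ( cd "$work" && "$@" "$name" > result.out ) || status=$?
    if [ "$status" -eq 0 ]; then
        # Copy results to persistent storage before scratch is cleaned.
        mkdir -p "$outdir" && cp "$work/result.out" "$outdir"/ || status=$?
    fi
    # Remove the scratch directory whether or not the command succeeded.
    rm -rf "$work"
    return "$status"
}

# Example (placeholder paths):
# run_on_scratch "/project/[Project Name]/Data/[input file]" "/home/[USER]/[output]" wc -l
```

Cleaning up at the end of the function keeps the scratch area free for other jobs even when the command fails.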
3.1.2.3. Using project spaces¶
Like home folders, Spider’s project spaces are available on all worker nodes. The following paths are available on your Spider UI:
This gives the worker nodes easy access to the software, data and output stored in your project spaces. See below for an example of a command that could be executed from a script on a worker node:
sh /project/[Project Name]/Software/[script].sh /project/[Project Name]/Data/[input file(s)] /home/[USER]/[output]
3.1.2.4. Using scientific catalogs¶
Scientific catalogs allow you to share software and data repositories across projects. For example, if you would like to share a large biobank of data with other research projects, you could request access to upload it to the scientific catalog. It will then be accessible from the worker nodes, similarly to the project spaces.
To request access to add a shared catalog, please reach out to our helpdesk.
3.1.2.5. Querying internal storage usage¶
As a mounted filesystem, Spider storage can be queried with local Linux commands. For optimal performance, however, we recommend querying the preconfigured extended attributes (fattr tags) instead of running du commands, which slow down the system.
The total usage of local Spider storage is the combined usage of the project's home folders and the project space.
Please note that this shows your current usage, not the maximum or the average for the month.
# Project folder
getfattr -n ceph.dir.rbytes --absolute-names /project/[PROJECT]/
# Home folder
getfattr -n ceph.dir.rbytes --absolute-names /home/[PROJECT]-[USER]
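The values reported by ceph.dir.rbytes are plain byte counts. A small sketch for turning them into a combined, human-readable figure is shown below. The function names `sum_bytes` and `bytes_to_gib` are hypothetical helpers, and the commented usage assumes that your getfattr supports the `--only-values` option to print the bare number.

```shell
#!/bin/sh
# sum_bytes N...: print the sum of the given byte counts.
sum_bytes() {
    total=0
    for n in "$@"; do
        total=$((total + n))
    done
    echo "$total"
}

# bytes_to_gib N: print N bytes as GiB with two decimals.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

# Usage on Spider (placeholders as in the commands above):
# project=$(getfattr --only-values -n ceph.dir.rbytes --absolute-names "/project/[PROJECT]/")
# home=$(getfattr --only-values -n ceph.dir.rbytes --absolute-names "/home/[PROJECT]-[USER]")
# bytes_to_gib "$(sum_bytes "$project" "$home")"
```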
3.2. External storage¶
3.2.1. Transfers from own system¶
If you are logged in as a user on Spider, we support scp, rsync or wget to transfer data between Spider and your own Unix-based system. Other options may be available, but these are currently not supported by us.
- Example of transferring data from Spider to your own system:
# Using scp
scp [spider-username]@spider.surfsara.nl:[path-to-your-spider-folder]/transferdata.tar.gz [path-to-your-local-folder]/
# Using rsync
rsync -a -W [spider-username]@spider.surfsara.nl:[path-to-your-spider-folder]/transferdata.tar.gz [path-to-your-local-folder]/
- Example of transferring data from your own system to Spider:
# Using scp
scp [path-to-your-local-folder]/transferdata.tar.gz [spider-username]@spider.surfsara.nl:[path-to-your-spider-folder]/
# Using rsync
rsync -a -W [path-to-your-local-folder]/transferdata.tar.gz [spider-username]@spider.surfsara.nl:[path-to-your-spider-folder]/
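The examples above move a single archive, transferdata.tar.gz. One possible way to produce such an archive from a directory, and to record a checksum for verifying the copy after the transfer, is sketched below. The function name `make_bundle` is hypothetical, and the sketch assumes that tar and sha256sum are available on both systems.

```shell
#!/bin/sh
# make_bundle DIR ARCHIVE: pack DIR into the gzip-compressed tarball
# ARCHIVE and write its SHA-256 checksum to ARCHIVE.sha256.
make_bundle() {
    dir=$1
    archive=$2
    # -C changes into the parent so the archive holds relative paths.
    tar -czf "$archive" -C "$(dirname "$dir")" "$(basename "$dir")" || return 1
    # Record the checksum next to the archive, using the bare file name
    # so that it can be checked from any directory.
    ( cd "$(dirname "$archive")" &&
      sha256sum "$(basename "$archive")" > "$(basename "$archive").sha256" )
}

# After the transfer, in the folder that holds both files:
# sha256sum -c transferdata.tar.gz.sha256
```

Bundling many small files into one archive also keeps the number of files written to CephFS low.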
3.2.2. SURFsara dCache¶
dCache is our large, scalable storage system for quickly processing huge volumes of data. The system runs on dCache software, which is designed for managing scientific data. You can use dCache for disk or tape, or address both types of storage under a single virtual filesystem tree. Our dCache service is a remote storage system with an extremely fast network link to Spider. You may use it if your data does not fit within the storage allocation of your Spider project space or if your application is I/O intensive.
There are several protocols and storage clients to interact with dCache. On Spider we support two main methods to use dCache, ADA and Grid interfaces:
- Our ADA (Advanced dCache API) interface is based on the dCache API and the WebDAV protocol. It lets you access and process your data on dCache from any platform and with various authentication methods.
- Our Grid interface is based on Grid computing technology and the GridFTP protocol. It lets you access and process your data on dCache from Grid-compliant platforms with X509 certificate authentication.
3.2.3. SURFsara Central archive¶
For long-term preservation of precious data SURFsara offers the Data Archive. Data ingested into the Data Archive is kept in two different tape libraries at two different locations in the Netherlands. The Data Archive is connected to all compute infrastructures, including Spider.
Access to the Data Archive is not provided to Spider projects by default. To request Data Archive access, please contact our helpdesk.
If you already have access to the Data Archive, you can use it directly from Spider by using scp or rsync to transfer data between Spider and Data Archive:
- Transfer data from Spider to Data Archive:
# Using scp
scp /home/[USERNAME]/transferdata.tar.gz [ARCHIVE_USERNAME]@archive.surfsara.nl:/home/[ARCHIVE_USERNAME]/
# Using rsync
rsync -a -W /home/[USERNAME]/transferdata.tar.gz [ARCHIVE_USERNAME]@archive.surfsara.nl:/home/[ARCHIVE_USERNAME]/
- Transfer data from Data Archive to Spider:
# Using scp
scp [ARCHIVE_USERNAME]@archive.surfsara.nl:/home/[ARCHIVE_USERNAME]/transferdata.tar.gz /home/[USERNAME]/
# Using rsync
rsync -a -W [ARCHIVE_USERNAME]@archive.surfsara.nl:/home/[ARCHIVE_USERNAME]/transferdata.tar.gz /home/[USERNAME]/
If the file to be retrieved from Data Archive to Spider is not directly available on disk, the scp/rsync command will hang until the file has been moved from tape to disk. Data Archive users can query the state of their files by logging into the Data Archive user interface and running dmls -l on the files of interest. The reported state of a file is either on disk (REG) or on tape (OFL).
The Data Archive user interface is accessible via ssh from anywhere for users who have a login account. An example is given below:
ssh [ARCHIVE_USERNAME]@archive.surfsara.nl
touch test.txt
dmls -l test.txt
-rw-r--r-- 1 homer homer 0 2019-04-25 15:24 (REG) test.txt
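Because a transfer hangs while a file is still on tape, it can help to check the state before starting scp or rsync. The sketch below extracts the state field from one line of dmls -l output, assuming the format shown in the example above, where the state appears in parentheses before the file name. The function names `dmls_state` and `describe_state` are hypothetical, and only the two states named in the text are interpreted.

```shell
#!/bin/sh
# dmls_state LINE: print the state found in parentheses in one line of
# `dmls -l` output, e.g. REG or OFL. Prints nothing if no state is found.
dmls_state() {
    printf '%s\n' "$1" | sed -n 's/.*(\([A-Z]*\)).*/\1/p'
}

# describe_state STATE: translate the two states named in the text.
describe_state() {
    case $1 in
        REG) echo "on disk" ;;
        OFL) echo "on tape" ;;
        *)   echo "unknown" ;;
    esac
}

# Usage on the Data Archive user interface:
# describe_state "$(dmls_state "$(dmls -l test.txt)")"
```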
Best practices for the usage of Data Archive are described on the Data Archive page.
3.3. Quota policy¶
Compute and storage resources on Spider are granted in the context of a project. There are currently no hard quotas on these resources. However, we monitor both core-hour consumption and storage usage to prevent users from exceeding their granted allocation.
3.4. Backup policy¶
The data stored on CephFS (home and project spaces) is disk only, replicated three times for redundancy. For disk-only data there is no backup. If you cannot afford to lose this data, we advise you to copy it elsewhere as well.
3.5. Data Ownership Policy¶
The data stored in the /project folder is owned by the grant’s signing authority. If data in the /project folder is owned by a user who is leaving the project, we ask that you have that user transfer ownership to an active project member before they leave.
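To see which files would need a new owner, one option is to list everything under a folder that belongs to a given user. The sketch below does this with the standard find command; the function name `list_owned_by` and the placeholder paths are hypothetical.

```shell
#!/bin/sh
# list_owned_by DIR USER: list regular files under DIR owned by USER
# (a login name or a numeric user ID).
list_owned_by() {
    find "$1" -type f -user "$2"
}

# Example (placeholder values):
# list_owned_by "/project/[PROJECT]" "[USER]"
```

On a large project space such a scan walks every file, so it is best run sparingly.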
The data stored in the /home folders is owned by the individual users of those folders and cannot be transferred to another user without their consent. We are also obligated to remove a user's data no more than 6 months after they have left the project.
Still need help? Contact our helpdesk