The researcher must store the research data and transfer and share it safely throughout the research lifecycle. Research data must always be processed in accordance with the protection and processing instructions of your own organisation (File saving and sharing in UEF, in Heimo Services, requires UEF login).
Storage and preservation solutions are influenced by the
- level of protection of the data
- size of the data
- possible need to use the data in cooperation between different organisations.
The University of Eastern Finland's Digital Services (DiPa) produces a large part of the IT services and server resources used by researchers. Researchers also have access to a wide range of services provided by CSC - IT Center for Science Ltd. for the processing, storing, and opening of research data (see below, Other services).
Protection levels of research data and processing measures
The content of the research data affects the type of data protection needed. For example, public information can be processed, stored, and shared outside the university, usually without special measures. Since this is public information, it cannot, in principle, end up in the wrong hands. In this case, consumer cloud services (e.g., Google Drive, Dropbox, iCloud) are also possible, although they are generally not recommended for work use. Information that is not public should not be handled in consumer cloud services.
Many kinds of data fall into the base protection level, such as anonymized research data, research plans, or personal data that is not sensitive. Such data can be stored and shared in many solutions offered by the university, with certain limitations. Sensitive personal data and otherwise confidential data require special protection and are subject to high security requirements.
Detailed instructions on the protection levels and storage solutions to be followed in the UEF can be found in the data processing instructions of the University of Eastern Finland (in Heimo Services, requires UEF login).
File sharing
Sharing files within UEF is relatively straightforward. Of course, data protection must be ensured so that only those with the appropriate access rights can access the shared information. Access rights can be defined when using the disk space of the research groups.
There are plenty of services for sharing research data. They may, for example, be typical for the field of research or depend on a partner. In this context, we refer, above all, to the general services supported by UEF.
Funet FileSender
Large files can be sent to partners outside the university using the Funet FileSender service. UEF users can access the Funet FileSender service through Haka login. A user without a UEF or other Haka login can also use the service after receiving a so-called upload voucher invitation from a UEF user.
The service is browser-based and can be used to send files of more than 100 GB. Funet FileSender is not, as such, suitable for sending sensitive data, but a research data file sent using the service can be encrypted. To decrypt the file, the recipient needs a password from the sender. The password is not stored on the server and must always be sent to the recipient separately (for example, as a text message to the phone).
Other services
The IDA storage solution also enables the sharing and storage of research data with various partners. IDA is part of CSC's Fairdata services and is, as a rule, offered free of charge to researchers from Finnish higher education institutions or state research institutes and to other persons working in research. You can start using IDA by contacting the IDA contact person of your home organisation. At UEF, you can do this by contacting the IT services for research (servicedesk@uef.fi).
The pan-European EUDAT service catalogue (EUDAT Service catalogue), which is jointly maintained by numerous higher education institutions and research institutes, enables the sharing and storage of research data. EUDAT B2SHAREBasic is a free solution for researchers for storing, publishing, and sharing research data that also provides a persistent identifier (DOI or Handle). The EUDAT catalogue also includes many other services and functionalities, for example for searching for existing research data or for the long-term preservation of research data.
The quality of the research data refers to slightly different issues depending on the context. In research data management, quality refers to so-called technical or external factors and in this context the suitability of the data content to the research question is not tackled. Rather, the latter is part of the discussion concerning research methods and theory.
Integrity is another term that is used alongside the quality of the data. In general, integrity refers to the fact that the data are in the form they are designed for. The data have not, for example, accidentally changed, and are thus also useful in the research context.
Ensuring the quality and integrity of research data starts at the planning stage. It is important to consider what can happen during data processing that would weaken the suitability or justification of the research data in terms of the research question or, in the worst case, invalidate the research project.
The data types and the data processing methods naturally affect the quality assurance methods, i.e., what must be considered in, for example, data collection or conversion to another form. These may include calibration of measuring instruments, external transcription of the interview data or data checksums that reveal deviations in values.
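As an illustration of the checksum idea mentioned above, the following is a minimal sketch and not a UEF-prescribed procedure. It assumes SHA-256 as the checksum algorithm, and the function names `sha256_of` and `is_unchanged` are hypothetical. A checksum recorded when the file is created can later be compared with a freshly computed one to reveal accidental changes.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file as a hex string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_checksum: str) -> bool:
    """Compare the file's current checksum with one recorded earlier."""
    return sha256_of(path) == recorded_checksum
```

The recorded checksum can be stored, for example, in a text file next to the data, so that anyone receiving the file can repeat the comparison.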
The risks affecting the quality of research data are also prevented by measures pertaining to nearly all research data, such as backups, version control, and data description and documentation (see below the sections Backup and version control and Documentation, description and metadata).
Backup and version control (versioning) are an important part of risk management during research and of the systematic implementation of research data quality management. These measures safeguard the preservation of files and support the comprehensibility of data.
It is a good idea to plan the measures in advance and ensure that all members of the research group are also aware of the measures and responsibilities. Such information should be included in the common guidelines for the research project and in a place where it can be easily found.
Backup
Backing up protects research data from accidental alterations or destruction, damage caused by hardware or software failures, or damage caused by external factors (e.g. hackers, computer viruses, fires, water damage).
Backup measures should take into account, for example,
- routine and regularity
- decentralisation so that not all backups are in the same (physical) location
- the suitability of the backup device and its replacement at regular intervals
- file formats that work during and after the research for as long as necessary.
The storage location of files and data affects the backup process. Although backups are usually made automatically in the storage locations provided by the university, it is worth remembering to distribute backups across more than one location. If the research data are stored, for example, on the hard disk of your personal computer, you must perform the backup yourself.
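When the backup has to be performed manually, a small script can make the routine repeatable. The sketch below is only an illustration under stated assumptions: the function name `back_up` is hypothetical, and the backup directories stand for whatever separate locations (for example, an external drive and a network share) are actually available. Each copy is compared with the original so that a failed copy is noticed immediately.

```python
import filecmp
import shutil
from pathlib import Path


def back_up(source: Path, backup_dirs: list[Path]) -> list[Path]:
    """Copy one file into every backup directory and verify each copy."""
    copies = []
    for backup_dir in backup_dirs:
        backup_dir.mkdir(parents=True, exist_ok=True)
        target = backup_dir / source.name
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        # shallow=False compares the file contents, not only size and mtime.
        if not filecmp.cmp(source, target, shallow=False):
            raise OSError(f"Backup copy differs from the original: {target}")
        copies.append(target)
    return copies
```

Passing directories that sit on physically different devices is what provides the decentralisation; the script itself cannot check where the directories are located.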
You will find information about the backup of the storage solutions offered by the university in the UEF instructions on information processing (Heimo Services, requires UEF login).
Version control and file naming
Version control keeps a record of the changes made to the research data. How version control is implemented depends on the data type. For example, software version control utilises versioning systems, whereas for research data consisting of text files, for example, file naming is a key tool of version control.
Version control is particularly important when several people work with the same research data. Versioning systems typically enable simultaneous work. One example of a versioning system is Git, which is used, for example, on the Microsoft-owned GitHub platform.
It is a good idea to plan the organisation and naming of files so that it supports the monitoring of changes to the data. Such methods include dividing research data into file folders and systematically naming files within folders. The file name should include a date that is always marked in the same way (e.g. yyyy-mm-dd: 2022-07-22). The date is used to avoid vague "latest version" entries in file names. The folder structure and file naming description should be included as a separate file (e.g. *.txt).
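The naming convention described above can be sketched in a few lines. This is a hypothetical helper, not an official UEF convention: the function name `dated_name`, the choice to place the date first, and the replacement of all characters other than a-z and 0-9 with underscores (which also affects letters such as ä and ö) are assumptions made for the example.

```python
import re
from datetime import date


def dated_name(description: str, day: date, extension: str) -> str:
    """Build a file name of the form yyyy-mm-dd_description.ext."""
    # Lower-case the description and collapse every run of characters
    # other than a-z and 0-9 into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", description.lower()).strip("_")
    return f"{day.isoformat()}_{slug}.{extension.lstrip('.')}"


name = dated_name("Interview transcripts", date(2022, 7, 22), "txt")
# name == "2022-07-22_interview_transcripts.txt"
```

With the date written as yyyy-mm-dd at the start of the name, an ordinary alphabetical listing of a folder is also a chronological one.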
There are numerous file formats for different purposes. File formats are also constantly being renewed: some go out of use and are replaced by new ones. The longer you work with the same research data, the more important it is to ensure that the files remain usable and readable. Special attention must be paid to file formats, especially for long-term preservation and archiving.
As a general instruction, it is recommended that you make at least one copy of the file in a commonly used format. The Ministry of Education and Culture's Open Science and Digital Cultural Heritage entity maintains extensive guidelines on file formats suitable for preservation and transfer, which should be examined especially when planning the long-term preservation of research data.
Different file formats
The file format indicates the structure of the file and often how information is stored in digital format (e.g. PDF - Portable Document Format or TIFF - Tagged Image File Format). This facilitates file interoperability. Some file formats are linked to commercial software (such as Microsoft Office), while others are openly accessible to anyone without commercial links (such as OpenDocument).
Openly accessible file formats are recommended especially for opening research data and/or for preserving it after the research, so that the files can be read using different software without paying software licences. The file format is indicated by a file extension separated by a dot at the end of the file name.
Common text file formats include DOC/DOCX (*.doc, *.docx), which contains text formatting and is familiar from Microsoft Word; unformatted text stored as TXT (*.txt); the open OpenDocument Text format, ODT (*.odt); and Comma-Separated Values, CSV (*.csv). For statistical data, SPSS software (*.sav) or spreadsheet software (e.g. Excel, *.xls, *.xlsx) is often used.
The JPEG format (*.jpg, *.jpeg) is commonly used for image files because it does not take up much space; however, it does not contain as much information as, for example, the TIFF format (*.tiff, *.tif). Formats that record sound, or sound and image, are rather dependent on the systems used and are therefore constantly changing. When you want to keep such files usable for a longer period of time, they are often converted to formats such as WAV (*.wav, *.wave) or MPEG (*.mpg).
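Because the file extension indicates the format, a quick overview of the formats in a project folder can be produced programmatically. The sketch below is illustrative only: the `FORMAT_NOTES` table and the `format_note` function are hypothetical, and the table covers just a handful of the extensions mentioned above, not an authoritative list of recommended formats.

```python
from pathlib import Path

# Illustrative notes for a few extensions mentioned in the text (assumption:
# this table is an example, not an official classification).
FORMAT_NOTES = {
    ".doc": "linked to commercial software (Microsoft Word)",
    ".docx": "linked to commercial software (Microsoft Word)",
    ".xlsx": "linked to commercial software (Microsoft Excel)",
    ".sav": "linked to commercial software (SPSS)",
    ".odt": "openly accessible (OpenDocument Text)",
    ".txt": "openly accessible (unformatted text)",
    ".csv": "openly accessible (comma-separated values)",
}


def format_note(file_name: str) -> str:
    """Return a short note on the format indicated by the file extension."""
    suffix = Path(file_name).suffix.lower()  # e.g. "Notes.ODT" -> ".odt"
    return FORMAT_NOTES.get(suffix, "unknown format, check before archiving")
```

For example, `format_note("2022-07-22_notes.docx")` returns the note for Microsoft Word files, which can serve as a reminder to also save a copy in an open format.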
Conversion and digitisation
Transferring files from one format to another is called converting. Conversion may be necessary if software other than the original is used, for example, because the hardware does not support the original data format. When converting files, data may be lost or corrupted. Conversion should therefore always be planned in advance and carried out so that the loss of data is minimised. Many software programs offer a Save As or Export function for saving a file in another format. There is also separate software for conversion.
Research data on paper can be converted into digital format by scanning. Even in this case, attention should be paid to quality, such as resolution, colour tones or darkness, so that all necessary information is transferred and can be read or viewed as well as possible. At the same time, however, it should be remembered that the higher the quality of the result, the larger the file, which affects the storage and usage requirements of the file. Scanning is based on imaging the material, but a text file can also be produced from material containing text using an OCR (Optical Character Recognition) program. PDF (Portable Document Format) is a widely used file format that maintains the layouts of scanned material well. The PDF/A file format is recommended for archiving.
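To illustrate a planned conversion that checks for data loss, the following minimal sketch converts a CSV file to JSON and then reads the result back to confirm that nothing was lost. The formats, the function name `csv_to_json`, and the UTF-8 encoding are assumptions chosen for the example; the same read-back check can be applied to other conversions.

```python
import csv
import json
from pathlib import Path


def csv_to_json(csv_path: Path, json_path: Path) -> int:
    """Convert a CSV file to JSON and return the number of rows converted."""
    with csv_path.open(newline="", encoding="utf-8") as source:
        rows = list(csv.DictReader(source))  # one dict per data row
    json_path.write_text(
        json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    # Read the converted file back and compare it with the source rows,
    # so that lost or corrupted values are noticed immediately.
    restored = json.loads(json_path.read_text(encoding="utf-8"))
    if restored != rows:
        raise ValueError("Converted file does not match the source data")
    return len(restored)
```

The original file is left untouched, so the conversion can be repeated if the check fails.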
Analogue audio or video and audio recordings can be converted to digital format using separate devices or devices directly connected to a computer.
In order for research data to be findable, understandable and useful for the researchers themselves and others, it must be enriched with additional information. In this context, we talk about metadata, description and documentation, which should be planned from the very start of the research and implemented throughout it. This makes it as easy as possible to publish or archive research data at the end of the research. Retrospectively, metadata is difficult, if not impossible, to create.
Descriptive information that is as rich as possible is one of the key means of implementing the FAIR principles so that the data is
- Findable,
- Accessible,
- Interoperable and
- Re-usable.
You can read more about the FAIR principles on the UEF Data Support website in the section Data management planning and the beginning of research (What does FAIR mean and why are FAIR principles often mentioned in connection with data management?).
Definition of metadata, description and documentation
There are no strict definitions for the terms metadata, description and documentation at the level of practical action, which may cause confusion and headaches. Metadata is typically explained as information about the data. In addition, paradata is sometimes discussed, which, for example, in the Data Management Guidebook of the Data Archive refers to "empirical information on data collection processes" (e.g. the start and end time of interviews, delay in response).
Documentation may refer in particular to the actions taken on the research data during the research (versions, file and folder structures, codes, etc.).
In connection with research data, metadata is often divided into
- descriptive metadata,
- administrative metadata, and
- structural metadata.
Administrative metadata describes the terms and technical conditions under which research data can be used at the level of a single file. Such information includes file format and size, license, embargo (i.e. when the material may be published) or ownership. Descriptive metadata describes the content and nature of research data, which also includes some kind of basic data, such as the author's name, title, persistent identifier and provenance. Structural metadata refers to the structure and order of the research data.
As a whole, the description can be understood so that administrative metadata focuses on the files and descriptive metadata on the research data in its entirety. Ultimately, it is not essential to focus on the division of metadata, but rather on the fact that research data is described comprehensively and clearly.
Metadata Standard
In connection with metadata, researchers are often urged to utilise standards or schemas. This is because research data that is described in a uniform and machine-readable manner is easier to find, combine and utilise in different contexts. At their simplest, metadata standards are fillable forms that follow a specific structure. Consequently, the metadata obtained is similar in form regardless of who fills in the form. In this way, metadata resembles the familiar descriptive information of publications, which includes information such as name, author, ownership, etc.
There are numerous standards. Some are so-called generic metadata standards, such as commonly used Dublin Core (DC), and others are discipline-specific. Researchers are often directed to utilise the standards of their field, which can be found in lists maintained by the Digital Curation Centre or the Research Data Alliance.
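As a rough illustration of what a generic metadata standard looks like in machine-readable form, the sketch below writes a few Dublin Core elements as XML. It is a minimal example under stated assumptions: the wrapper element `metadata`, the function name `dublin_core_record` and the sample values are hypothetical, while `title`, `creator`, `date` and `rights` are elements of the Dublin Core Metadata Element Set and the namespace URI is the one published for it.

```python
import xml.etree.ElementTree as ET

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields: dict) -> str:
    """Serialise a mapping of Dublin Core element names to values as XML."""
    ET.register_namespace("dc", DC_NAMESPACE)
    record = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        element = ET.SubElement(record, f"{{{DC_NAMESPACE}}}{name}")
        element.text = value
    return ET.tostring(record, encoding="unicode")


xml_text = dublin_core_record({
    "title": "Interview study on data management practices",  # sample value
    "creator": "Researcher, Example",                          # sample value
    "date": "2022-07-22",
    "rights": "CC BY 4.0",
})
```

Because every record uses the same element names, a repository or search service can process records from many researchers in the same way.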
The selection of a metadata standard can also be guided by the data repository or archive, which the researcher intends to utilise for their research data. For example, the Data Archive uses the DDI metadata standard (Data Documentation Initiative), which has been developed and is maintained especially for the needs of describing social science data.
The use of the standard is not always conscious: researchers utilise metadata standards, for example, when entering information about their research data into a data repository or describing their data using the national, browser-based Qvain tool.
The use of the Qvain tool is relatively easy: the researcher enters the requested information into an online form that can be completed later and published if they wish. After the publication, the research data in question can be found through the Etsin tool, and the metadata is also harvested by other services and platforms, such as UEF's eRepository (UEF eRepo).
The Qvain User Guide introduces you step by step to providing the necessary metadata. Qvain asks for the following information, some of which is mandatory:
- Data source
- Rights and licenses
- Data description
- Actors
- Publications and other outputs related to data
- Geographical area
- Period
- Infrastructure
- Provenance (history and events)
- Project and funding
Description and documentation of files
A metadata standard, or a kind of bibliographic or formal, research-level metadata, is not sufficient to guarantee the comprehensibility of research data, as the Qvain field list above shows. Sometimes it even feels that metadata and description instructions are too concentrated around standards.
A kind of internal description of research data is an integral part of metadata, even if this is not apparent from the metadata standard. By means of the internal description, for example, the variables and terms used, the order and hierarchy of the files, technical and administrative requirements and other factors relevant to understanding the data are recorded for yourself and others. In practice, this can be a description added to a spreadsheet file or a README file included in the file folders. A useful guiding question when writing the description is: how could I understand my own research data ten years after its active use?
The description thus includes information on
- the context of data collection, research objectives and methods,
- file structure,
- quality assurance methods,
- version management,
- terms of use,
- variables and records,
- codes and classification systems,
- special terms and abbreviations, and
- missing values.
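The list above can be turned into a README skeleton so that no item is forgotten. The following is a minimal sketch, assuming a plain-text README; the function name `readme_template`, the layout and the placeholder text are hypothetical choices for the example, not a required format.

```python
# Section headings taken from the list of items the description should cover.
README_SECTIONS = [
    "Context of data collection, research objectives and methods",
    "File structure",
    "Quality assurance methods",
    "Version management",
    "Terms of use",
    "Variables and records",
    "Codes and classification systems",
    "Special terms and abbreviations",
    "Missing values",
]


def readme_template(dataset_title, notes=None):
    """Return README text with one section per item in README_SECTIONS."""
    notes = notes or {}
    lines = [dataset_title, "=" * len(dataset_title), ""]
    for section in README_SECTIONS:
        # Use the supplied note, or a placeholder that is easy to search for.
        body = notes.get(section, "(to be completed)")
        lines += [section, "-" * len(section), body, ""]
    return "\n".join(lines)
```

Calling `readme_template("Interview study 2022", {"Missing values": "Coded as -99"})` produces a file body in which the remaining sections are clearly marked as incomplete, and the result can be saved as a README file in the data folder.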
Glossary, thesaurus and ontology
The use of glossaries, thesauri and ontologies is recommended for describing research data. According to Finto, glossaries, thesauri and ontologies are structured and machine-readable collections of concepts. The terms and concepts in glossaries are predefined and selected on certain grounds. In thesauri and ontologies, the relationships between concepts are important, which makes it possible to link information.
In practice, the use of glossaries and ontologies means that when describing research data, an attempt is made to find words suitable for said research data among existing, generally accepted descriptive terms and keywords. In Finland, one such resource is the General Finnish Ontology, which contains descriptive terminology in three languages (Finnish, Swedish, English). Glossaries and ontologies are also used automatically in data repositories, for example, when information about research data is entered in them.
The researcher can also describe their data in freely chosen terms and words, which makes possible the most suitable and versatile description of data from the researcher's perspective, but does not necessarily promote the accessibility of the data as such.
Because glossaries and ontologies are built on commonly agreed meanings and relationships between terms, they support the quality of the research data's metadata. They promote the discoverability, interoperability and reusability of data, i.e. serve key FAIR principles.
Qvain is a web browser-based tool for describing research data. It is part of CSC's Fairdata services. Using Qvain requires the creation of a CSC user account (the instructions can be found here). After that, you can log in to the service by using, e.g., the UEF ID (Haka ID).
From the front page, you can either create a new dataset or edit an existing one. The Qvain User Guide helps you to fill in the required metadata step by step.
Once you have published the metadata, the dataset information can be found in the Etsin service and, through Etsin, in other services as well, such as the UEF eRepo and the national Finnish Research.fi service.