Guideline for data correction

Table of Contents

Introduction

This document describes the general principles and rules to follow to amend the data of the SuperCon 2 database. SuperCon 2 is a database of superconducting materials which properties are extracted automatically from scientific literature.

The guidelines assume that the user knows well the SuperCon 2 application. SuperCon 2 is intuitive and should not require specific training, the main features are illustrated in the Readme.md.

Data Model

This section describes the different information that are stored in the database.

Record field description

Items related to material Items related to target properties Items Related to paper Miscellanecus
Raw material Critical Temperature Document Flag
Name Applied Pressure DOI Actions
Formula Method obtaining Tc Year Link Type
Doping Section Record Status
Variables Subsection Error Types
Form Path
Substrate Timestamp
Fabrication Authors
Material Class Title
Unit cell type Publisher
Space Group Journal
Structure type Filename

General principles

In this section, we illustrate the general principles that are applied to the guidelines. Both "Record status" and "Error types" are be covered in Rules with examples and illustrations.

Units

As a general rule the Units are kept in the data. Although GPa and K are the most common units for applied pressure and superconducting critical temperature, there are still several cases of valid papers mentioning other units, e.g., kbar or MPa.

Correction workflow

Each records contains two main internal properties:

  • status, indicate the state of the record. More information here,
  • type, indicate whether the record was modified manually or automatically

workflow.png

Record status

Status Description
new Default status when a new record is created.
curated The record has been updated by a human.
validated The record was validated by a human.
invalid The record is wrong or inappropriate for the situation (e.g., Tm or Tcurie extracted as superconducting critical temperature).
obsolete Assigned to a record when it is modified, triggering the creation of a new record (internal status, not visible to users).
deleted The record has been removed by a curator (internal status, not visible to users).
Table 1: A record can be marked with four status type.

Error types

The extraction of superconductors materials is performed, following a unified data flow. Failures can occur at each stage in the flow, and we distinguish each failure by naming them "error type". Error types indicate the causes for which a specific material-properties record is invalid, wrong or missing. Table 2 illustrates these type of errors:

Name Description
From table The entities Material → Tc → Pressure are identified in a table. At the moment, table extraction is not performed.
Extraction The material, temperature, pressure are not extracted (no box) or extracted incorrectly.
Linking The material is incorrectly linked to the Tc given that the entities are correctly recognized.
Tc classification The temperature is not correctly classified as "superconductors critical temperature" (e.g., Curie temperature, Magnetic temperature...).
Composition resolution The exact composition cannot be resolved (e.g., the stoichiometric values cannot be resolved).
Value resolution The extracted formula contains variables that cannot be resolved, even after having read the paper. This includes when data is from tables.
Anomaly detection The data is automatically modified by the anomaly detection script.
Curation amend The curator is updating the data which does not present issues due to the automatic system.
Table 2: List of error types, sorted by their occurrence in the data flow.

Priority between error types

Some errors can occur in specific part of the process, shown below:

Following the dataflow, the priority between certain errors is as follows:

From table > Extraction > Tc classification > Linking > Composition resolution

For example, when a wrong formula (Extraction) is linked incorrectly (Linking) to Curie temperature (Tc classification), the error is "Extraction".

Annotated entities conventions

Rules

There are two types of actions that a curator can do when checking the data:

  1. Data reporting or flagging
  2. Data correction

The (1) data reporting (or flagging) is the process in which a record is marked as "possibly invalid". The term "flag" indicates the action of adding a flag on top of a record. In this case, the record will be hidden from the public view of the database. In addition, curators could select only reported records and inspect them thoughtfully, amending or removing for good (2).

Detailed description and rules for individual items

  • Raw material

    • Description & typical example:

    The extracted material name as it is

    For example: "tetragonal Ba(Fe_1-x_Co_x_)2_As_2 grown on Si(111)" - Curation rule:

    We do not curate this item because is the way originally the record is extracted from the document

  • Name

    • Description & typical example:

    Abbreviation/acronym of a material, for example: "LSCO"

    • Curation rule:

    Try to fill it up when it is available in the text or context

    • Possible error and examples: N/A
  • Formula

    • Description & typical example:

    Chemical formula of the material

    For example: La_1.75_Sr_0.25_CuO_4_ - Curation rule:

    What we want is chemical formula, that ONLY consists of atomic species and numbers. Therefore;

    - When the value for $x$ in the formula can be found in the same document, try to fill it.
    - When any additional information regarding the materials remains attached (e.g. tetragonal, annealed, grown on substrate), split and put them in appropriate sections.
    
    • Possible error and examples :

      • Wrong - Extraction

      example1: extracted formula is only a part of it

      • Wrong - Composition resolution:

      example1: variable in formula remained unsubstituted

      example2: formula has extra

      • Invalid - Extraction

      Example1: extracted something else than formula

      • Missing - Extraction

      example1: formula is not annotated

  • Doping

    • Description & typical example:

    Atoms and molecules that are used for doping, adjointed to the material name

    For example: Ag-doped Bi_2_Te_3_ - Curation rule:

    Try to fill when it is available - Possible error and examples :

  • Variables

    • Description & typical example:

    Variables that can be substituted in the formula - Curation rule:

    This is often kept unfilled, due to "composition resolution" error. Try to fill it when curator finds it in the paper. - Possible error and examples :

  • Form

    • Description & typical example:

    Identify the form of the material

    For example: "poly crystals", thin film, wire - Curation rule:

    Try to fill when it is available

    • Possible error and examples
  • Substrate

    • Description & typical example:

    Substrate on which target material is grown

    For example: Cu grown on Si(111) substrate

    • Curation rule:

    Try to fill when it is available

    • Possible error and examples
  • Fabrication

    • Description & typical example:

    Represent all the various information that are not belonging to any of the previous tags

    For example: annealed, irradiated

    • Curation rule:

    Try to fill when it is available

    • Possible error and examples
  • Material Class

    • Description & typical example:

    For the time being, class name is given by rule-based approach, based on containing either anion or cation atoms.

    For example: Fe-based

    • Curation rule:

    We do not curate this item - Possible error and examples

  • Unit Cell Type

    • Description & typical example:

    Bravais lattice that the crystal structure of the material belongs to

    For example: tetragonal, orthorhombic

    • Curation rule:

    Try to fill whenever available - Possible error and examples

    - Wrong - Composition resolution
    
      Example1: formula has extra
      ![](images/ex_W_compres_2.png)
    
  • Space Group

    • Description & typical example:

    Space group for which the material's crystal structure belongs to. Either No. (1-230) or "standard short symbol" is fine.

    For example: "Space group No. 225 (Fm-3m)" - Curation rule:

    try to fill whenever available - Possible error and examples - Wrong - Composition resolution

      Example1: formula has extra
      ![](images/ex_W_compres_2.png)
    
  • Structure type

    • Description & typical example:

    Type of crystal structure, described by the name of typical material that takes the crystal structure.

    For example: MnP-type, AlB_2_-type - Curation rule:

    Try to fill whenever available - Possible error and examples

  • Critical Temperature

    • Description & typical example:

    Represent the value of the superconducting critical temperature, Tc. Other temperatures (fabrication conditions, etc.) should not be extracted.

    For example: 100 K

    • Curation rule:

    It has to be properly linked with composition and applied pressure(if it exists) - Possible error and examples : - Invalid - From table

      Example1: entity is from table
      ![](images/ex_I_table_1.png)
    
    - Invalid - Tc classification
    
      Example1: extracted T is not superconducting transition temperature
      ![](images/ex_I_Tc_classification_1.png)
    
    - Missing - Tc classification
    
      Example1: Superconducting transition temperature was annotated but classified as other than SC
      ![](images/ex_M_Tc_classification_1.png)
    
    - Wrong - Linking
    
      Example1: Link among material - Tc_value - Presssure_value was not correct
      ![](images/ex_W_linking_1.png)
    
    - Missing - Linking Example1: Tc_value was annotated but failed to link to material
      ![](images/ex_M_linking_1.png)
    
  • Applied Pressure

    • Description & typical example:

    Represent the value of applied pressure on which superconducting critical temperature Tc is determined. Other pressures (pressure during fabrication process, etc.) should not be extracted.

    For example: 10 GPa - Curation rule:

    • Possible error and examples :

      • Extraction

      • Tc classification

      • Wrong - Linking

      Example1: Link among material - Tc_value - Presssure_value was not correct

  • Method Obtaining Tc

    • Description & typical example:

    Indicates the techniques used to determine the superconductiving transition temperature, either by experimental measurements or theoretical calculations. This includes also explanatory sentences for temperature dependence of resistivity, magnetic susceptibility or specific heat graphs, not necessarily related to superconductivity.

    For example: resistivity, magnetic susceptibility, specific heat, calculated - Curation rule:

    Try to fill whenever available - Possible error and examples

  • DOI

Digital Object Identifier of the paper where the entity is extracted

  • Year

Published year of the paper where the entity is extracted

  • Section

Indicate from which of the main sections of the paper (header, body, annex) the record has been extracted

  • Subsection

Indicate the portion of the section from which the record has been extracted: - header: title, abstract, keyword - body: paragraph, figureCaption, tableCaption - annex: paragraph

  • Timestamp

Latest timestamp on which the record was modified

  • Authors

Authors' names of the paper where the entity is extracted

  • Title

Title of the paper where the entity is extracted

  • Publisher

Publisher's name of the paper where the entity is extracted

  • Journal

Journal's name where the entity is extracted

  • Filename

  • The original file name of the PDF document

Miscellaneous

  • Flag

Flag a record as invalid or validated

  • Actions

Actions that can be performed on a specific record (edit, remove, validate, etc.)

  • Link Type

The algorithm used to link the two entities

  • Record Status

Described in the "General principles" section

  • Error Types

Described in the "General principles" section

Collection of examples

When it is not obvious which state-errortype is appropriate, examples below might help curator to decide.

examples of state-errortypes: - Invalid - from_table ![](images/ex_I_table_1.png) - Wrong - Extraction ![](images/ex_W_extraction_1.png) - Invalid - Extraction ![](images/ex_I_extraction_1.png) - Missing - Extraction ![](images/ex_M_extraction_1.png) - Invalid - Tc_classification ![](images/ex_I_Tc_classification_1.png) - Missing - Tc_classification ![](images/ex_M_Tc_classification_1.png) - Wrong - Linking ![](images/ex_W_linking_1.png) - Missing - Linking ![](images/ex_M_linking_1.png) - Wrong - Composition resolution ![](images/ex_W_compres_1.png) ![](images/ex_W_compres_2.png) - multiple errors in vicinity ![](images/ex_two_vicinity_1.png)

Missing entity

In the following example the entity has been missed completely. In this case the cause is likely "Extraction" because the process failed to recognise HCl.

Figure 1: Example of missing entity.

Invalid temperature

In the following example the temperature of about 1234 K has been extracted. This case is likely a problem of Tc classification because the temperature should not have classified as superconductor critical temperature and therefore not linked.

Figure 2: Example of invalid extracted temperature.

Composition extraction

[LF] This specific case should be clarified Ref

Figure 3: Example