Reading Data

Overview

In sdmx-core, data is read as a stream, iterating over the Series and Observations in a dataset. Because the data is streamed rather than loaded in full, the memory required to read a dataset is kept to a minimum.

The key interfaces for reading data are:

  1. DataReaderEngine - the format specific reader
  2. DataReaderManager - obtains the correct DataReaderEngine for the dataset that was provided
  3. ReadableDataLocation - the source dataset
  4. SdmxBeanRetrievalManager - retrieves the structural metadata referenced by the dataset

Note: It is not the responsibility of the DataReaderEngine to validate data in the generic sense, i.e. to check that a reported Concept is valid with respect to the Data Structure Definition. The DataStructureBean and other structural metadata are read by the reader for information only. Such validation is handled by the sdmx-core validation framework, which is separate from the DataReaderEngine. Each implementation of the DataReaderEngine may provide its own syntax validation and report any errors it finds; many readers allow an error handler to be supplied, which decides how syntax errors are managed.

Data Reader Engine

Overview

The DataReaderEngine is an abstraction from the underlying data format, allowing the dataset to be read and processed in a format and syntax agnostic way.

The DataReaderEngine operates as an iterator, allowing the consumer to walk the data message by:

  1. Moving to the next DataSet
  2. Moving to the next Series in a DataSet
  3. Moving to the next Observation in a Series

When the iterator moves to the next Dataset, Series, or Observation, the DataReaderEngine can provide the details, for example:

  1. Read the Dataset Header details
  2. Read the Series values (reported Dimension and Attribute values)
  3. Read the Observation values (reported Attribute and Measure values)

The DataReaderEngine can at any point be reset, which moves the iterator back to the start of the message. The DataReaderEngine can be copied, which allows the same message to be read by two parallel streams in two separate threads. When the DataReaderEngine is closed, the resources are released.
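
The iterator behaviour described above can be illustrated with a minimal, self-contained sketch. ToySeriesReader is a hypothetical in-memory stand-in written for this example; it is not the sdmx-core DataReaderEngine, and its method names are only modelled on the operations described in this section. It shows nested iteration over series and observations, a reset that returns the cursor to the start, and a copy that reads the same data independently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory reader for illustration only; NOT the sdmx-core API.
class ToySeriesReader {
    // series key -> observations ("period=value"), in insertion order
    private final Map<String, List<String>> data;
    private final List<String> keys;
    private int keyIndex = -1;   // -1 means "before the first series"
    private int obsIndex = -1;   // -1 means "before the first observation"

    ToySeriesReader(Map<String, List<String>> data) {
        this.data = data;
        this.keys = new ArrayList<>(data.keySet());
    }

    // Moves to the next series; returns false when there are no more
    boolean moveNextKeyable() {
        obsIndex = -1; // a new series always starts before its first observation
        if (keyIndex < keys.size()) keyIndex++;
        return keyIndex < keys.size();
    }

    // Moves to the next observation of the current series
    boolean moveNextObservation() {
        if (keyIndex < 0 || keyIndex >= keys.size()) return false;
        List<String> obs = data.get(keys.get(keyIndex));
        if (obsIndex < obs.size()) obsIndex++;
        return obsIndex < obs.size();
    }

    String getCurrentKey() { return keys.get(keyIndex); }

    String getCurrentObservation() { return data.get(keys.get(keyIndex)).get(obsIndex); }

    // Moves the cursor back to the start of the data
    void reset() { keyIndex = -1; obsIndex = -1; }

    // A copy has its own cursor, so it can be read independently of the original
    ToySeriesReader createCopy() { return new ToySeriesReader(data); }
}

class ToySeriesReaderDemo {
    // Walks every series and observation, collecting one line per observation
    static List<String> readAll(ToySeriesReader reader) {
        List<String> lines = new ArrayList<>();
        while (reader.moveNextKeyable()) {
            while (reader.moveNextObservation()) {
                lines.add(reader.getCurrentKey() + " " + reader.getCurrentObservation());
            }
        }
        return lines;
    }

    static Map<String, List<String>> sample() {
        Map<String, List<String>> data = new LinkedHashMap<>();
        data.put("A:UK:EMP", List.of("2020=10", "2021=12"));
        data.put("A:FR:EMP", List.of("2020=7"));
        return data;
    }

    public static void main(String[] args) {
        ToySeriesReader reader = new ToySeriesReader(sample());
        List<String> first = readAll(reader);
        reader.reset();                       // back to the start of the data
        List<String> second = readAll(reader);
        System.out.println(first);
        System.out.println(first.equals(second));
    }
}
```

Keeping the cursor state inside the reader is what allows a copy to be handed to a second thread: each copy advances its own indexes and never touches the other's position.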

Data Model

The Data Model is what the DataReaderEngine produces when iterating the dataset. Two main classes are produced: the Keyable, which represents a Series Key or a Group Key, and the Observation, which contains the measured values and an optional Time component. Both Keyable and Observation have optional Attributes. Every reported value is a KeyValue, which combines the Concept ID with the reported value: either a single code, or a list of values if the Component allows multiple values to be reported.

classDiagram
  IPositionalDataElement <|-- Attributable
  Attributable <|-- IDatasetAttributes
  Attributable <|-- Keyable
  Attributable <|-- Observation
  Attributable "1" *-- "*" KeyValue
  IPositionalDataElement *-- IDataPosition
  IDataPosition *-- POSITION

  class IPositionalDataElement {
     +IDataPosition getPosition()
  }

  class Attributable {
     +List~KeyValue~ getAttributes()
     +List~IMultilingualKeyValue~ getMultilingualAttributes()
  }

  class Keyable {
    +DataStructureBean getDataStructure()
    +DataflowBean getDataflow()
    +String getShortCode()
    +List~KeyValue~ getKey()
    +boolean isSeries()
    +String getGroupName()
  }

  class Observation {
    +Keyable getSeriesKey()
    +String getShortCode()
    +List~KeyValue~ getMeasures()
    +SdmxDate getSdmxObsTime()
    +List~IMultilingualKeyValue~ getMultilingualMeasures()
  }

  class KeyValue {
     +String getCode()
     +String getConcept()
     +List~String~ getValues()
     +boolean hasMultipleValues()
     +boolean isIntentionallyMissing()
  }

  class IDataPosition {
      +String getId()
      +POSITION getPositionType()
      +String toPositionalString()
  }

  class POSITION {
    <<enumeration>>
    FILE
    CSV
    EXCEL
    RELATIVE
    CUSTOM
    UNAVAILABLE
  }

Both Keyable and Observation are IPositionalDataElement; this provides the client with the position in the source 'file' where the key or observation was read. As the file formats differ, the way the position is reported can differ, and as such each POSITION type has a different concrete implementation of the IDataPosition interface. See positional information for additional details.
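
To make the KeyValue idea from the data model concrete, here is a small self-contained sketch. SimpleKeyValue and ShortCodeDemo are hypothetical names invented for this example, not sdmx-core classes. The sketch assumes that a short code is the key's reported values joined with ':' (as in the A:UK:EMP example used elsewhere on this page) and that getCode is only meaningful when a single value was reported.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a reported value; NOT the sdmx-core KeyValue class.
final class SimpleKeyValue {
    private final String concept;
    private final List<String> values;

    SimpleKeyValue(String concept, String... values) {
        this.concept = concept;
        this.values = List.of(values);
    }

    String getConcept() { return concept; }

    // Assumption for this sketch: a code is only returned for single-valued components
    String getCode() { return values.size() == 1 ? values.get(0) : null; }

    List<String> getValues() { return values; }

    boolean hasMultipleValues() { return values.size() > 1; }
}

class ShortCodeDemo {
    // Joins the single reported value of each key component with ':'
    static String shortCode(List<SimpleKeyValue> key) {
        return key.stream()
                  .map(SimpleKeyValue::getCode)
                  .collect(Collectors.joining(":"));
    }

    public static void main(String[] args) {
        List<SimpleKeyValue> key = List.of(
            new SimpleKeyValue("FREQ", "A"),
            new SimpleKeyValue("REF_AREA", "UK"),
            new SimpleKeyValue("INDICATOR", "EMP"));
        System.out.println(shortCode(key));

        // A component that allows several values reports them as a list
        SimpleKeyValue status = new SimpleKeyValue("OBS_STATUS", "A", "E");
        System.out.println(status.getConcept() + " -> " + status.getValues());
    }
}
```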

Iterating a Dataset

ReadableDataLocation rdl = ...
SdmxBeans beans = ...

DataReaderEngine dre = new CompactDataReaderEngine(rdl, beans);

//Print ID for the message
HeaderBean header = dre.getHeader();
System.out.println(header.getId());

//Iterate Datasets
while(dre.moveNextDataset()) {

  DatasetHeaderBean dsHeader =  dre.getCurrentDatasetHeaderBean();
  DataflowBean flow = dre.getDataFlow();

  //Print Dataset metadata
  System.out.println(dsHeader.getDatasetId());
  System.out.println(dsHeader.getAction());
  System.out.println(flow.getUrn());

  //Iterate Series for Dataset
  while(dre.moveNextKeyable()) {

    //Print series short Code, e.g. A:UK:EMP
    Keyable series = dre.getCurrentKey();
    System.out.println(series.getShortCode());

    //Iterate Observations for Series
    while(dre.moveNextObservation()) {
      Observation obs = dre.getCurrentObservation();

      //Print obs time and value
      String obsTime = obs.getSdmxObsTime().getDateInSdmxFormat();
      String obsValue = obs.getPrimaryMeasure().getCode();
      System.out.println(obsTime + " = " + obsValue);
    }
  }
}

//Close resources once all datasets have been read
dre.close();

Positional Information

The DataReaderEngine decouples the client from the underlying data format. However, for clients that validate data, having a link back to the original dataset can be useful; this is supported in sdmx-core with Positional Information.

Keyable and Observation both inherit from IPositionalDataElement, which provides a method to request the IDataPosition: a reference back to the position in the underlying file that is being read. IDataPosition is an abstract interface, because the kind of position that can be reported depends on the type of file being read. The IDataPosition reports its POSITION type, allowing it to be downcast to the correct implementation.

POSITION     IDataPosition Implementation  Description
FILE         FilePosition                  Offset from start of file in bytes, and length in bytes
CSV          CsvFilePosition               Extends FilePosition by providing Delimiter and Column Names
EXCEL        -                             No implementation
RELATIVE     -                             No implementation
CUSTOM       -                             No implementation
UNAVAILABLE  UnavailableFilePosition
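
When the reader type is not known in advance, the reported position type can be used to choose the downcast. The following is a minimal sketch of that pattern only. All types in it (ToyPositionType, ToyDataPosition, ToyFilePosition, ToyCsvFilePosition, ToyUnavailablePosition) are hypothetical stand-ins defined for this example, and their fields are assumptions based on the descriptions in the table above rather than the real sdmx-core classes.

```java
import java.util.List;

// Hypothetical types for illustration only; NOT the sdmx-core classes.
enum ToyPositionType { FILE, CSV, UNAVAILABLE }

interface ToyDataPosition {
    ToyPositionType getPositionType();
}

class ToyFilePosition implements ToyDataPosition {
    final long offset; // bytes from the start of the file
    final long bytes;  // length of the element in bytes

    ToyFilePosition(long offset, long bytes) {
        this.offset = offset;
        this.bytes = bytes;
    }

    public ToyPositionType getPositionType() { return ToyPositionType.FILE; }
}

class ToyCsvFilePosition extends ToyFilePosition {
    final char delimiter;
    final List<String> columnNames;

    ToyCsvFilePosition(long offset, long bytes, char delimiter, List<String> columnNames) {
        super(offset, bytes);
        this.delimiter = delimiter;
        this.columnNames = columnNames;
    }

    @Override
    public ToyPositionType getPositionType() { return ToyPositionType.CSV; }
}

class ToyUnavailablePosition implements ToyDataPosition {
    public ToyPositionType getPositionType() { return ToyPositionType.UNAVAILABLE; }
}

class PositionDescriber {
    // Uses the reported type to pick the concrete class before casting
    static String describe(ToyDataPosition pos) {
        switch (pos.getPositionType()) {
            case CSV: {
                ToyCsvFilePosition csv = (ToyCsvFilePosition) pos;
                return "offset=" + csv.offset + ", bytes=" + csv.bytes
                        + ", columns=" + csv.columnNames;
            }
            case FILE: {
                ToyFilePosition file = (ToyFilePosition) pos;
                return "offset=" + file.offset + ", bytes=" + file.bytes;
            }
            default:
                return "position unavailable";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(new ToyFilePosition(302334, 251)));
        System.out.println(describe(
            new ToyCsvFilePosition(0, 10, ',', List.of("FREQ", "OBS_VALUE"))));
        System.out.println(describe(new ToyUnavailablePosition()));
    }
}
```

Checking the type before casting keeps a client safe when it receives data from a reader that cannot report positions, which would otherwise cause a ClassCastException.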

If the type of reader is known, it can be wrapped in a PositionAwareDataReaderEngine, which automatically casts the IDataPosition to the concrete type:

public static void main(String[] args) throws IOException {
    SdmxJsonModule.register();
    SdmxMLModule.register();
    SdmxBeans beans = getStructures("src/main/resources/bis_structures.xml");

    Path filePath = Paths.get("src/main/resources/BIS_RB_T4.xml");
    ReadableDataLocation rdl = new ReadableDataLocationTmp(filePath);

    //Create DataReaderEngine and wrap it in PositionAwareDataReaderEngine
    //casting the IDataPosition to a FilePosition
    DataReaderEngine dre = new CompactDataReaderEngine(rdl, beans, null, null, null, null);
    PositionAwareDataReaderEngine<FilePosition> posReader = new PositionAwareDataReaderEngine<>(dre);

    //Iterate the dataset, and print the positions
    while(posReader.moveNextDataset()) {
        while(posReader.moveNextKeyable()) {
            FilePosition keyPos = posReader.getPosition();
            printPosition(keyPos, filePath);

            while(posReader.moveNextObservation()) {

                FilePosition obsPos = posReader.getPosition();
                printPosition(obsPos, filePath);
            }
        }
    }
}

/**
  * Print the position, and print the String read from the file
  * at this position
  *
  * Example:
  *
  * offset=302334, bytes=251
  * <Series FREQ="A" REP_CTY="SA" DEVICE_TYPE="A" FUNCTION="D" SUB_FUNCTION="A" TECHNOLOGY="A" ISSUER="A" TABLE="4" COLLECTION="E" AVAILABILITY="A" DECIMALS="0" UNIT_MULT="0" TITLE="Saudi Arabia - Number of cards with a debit function" UNIT_MEASURE="373">
  *
  * offset=302585, bytes=79
  * <Obs TIME_PERIOD="2012" OBS_VALUE="16440258" OBS_STATUS="A" OBS_CONF="F"></Obs>
*/
private static void printPosition(FilePosition pos, Path filePath) throws IOException {
    long offset = pos.getOffset();
    long readByteCount = pos.getBytes();

    //Print the offset
    System.out.println("offset="+offset+", bytes="+readByteCount);

    //Read the bytes from the file at the given offset + length
    byte[] snippet = PathUtil.getContentsFromOffset(filePath, offset, readByteCount);

    //Convert bytes to string, in UTF-8 format
    String actual = new String(snippet, StandardCharsets.UTF_8);

    //Print the string
    System.out.println(actual);
}
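
For readers curious how bytes can be fetched from a file at a given offset, here is a minimal sketch using only the standard Java library. OffsetReader.readAt is a hypothetical helper written for this example; it is not the implementation of PathUtil.getContentsFromOffset, which is not shown in this document. The demo writes a small temporary file and reads back the element that starts at byte 8.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class OffsetReader {
    // Hypothetical helper: reads up to 'length' bytes starting at 'offset'
    static byte[] readAt(Path file, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(length);
            long position = offset;
            while (buffer.hasRemaining()) {
                int read = channel.read(buffer, position);
                if (read < 0) break; // end of file reached before 'length' bytes
                position += read;
            }
            buffer.flip();
            byte[] result = new byte[buffer.remaining()];
            buffer.get(result);
            return result;
        }
    }

    // Writes a small sample file, then reads the <Obs> element back by position
    static String demo() {
        try {
            Path tmp = Files.createTempFile("offset-demo", ".xml");
            try {
                String xml = "<Series><Obs TIME_PERIOD=\"2012\"/></Series>";
                Files.write(tmp, xml.getBytes(StandardCharsets.UTF_8));
                // "<Series>" occupies bytes 0..7, so the Obs element starts at 8
                byte[] snippet = readAt(tmp, 8, 25);
                return new String(snippet, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because offsets and lengths are expressed in bytes, the snippet is decoded to a String only after the bytes have been read; slicing an already decoded String by the same numbers would give wrong results for multi-byte characters.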

SDMX Readers

Implementation              Supported Format Versions  IDataPosition Implementation
GenericDataReaderEngine     2.0, 2.1                   FilePosition
CompactDataReaderEngine     2.0, 2.1, 3.0              FilePosition
SdmxJsonDataReaderEngine    1.0                        UnavailableFilePosition
SdmxJsonDataReaderEngineV2  2.0                        UnavailableFilePosition
SdmxCsvDataReaderEngineV1   1.0                        CsvFilePosition
SdmxCsvDataReaderEngineV2   2.0                        CsvFilePosition
EDIDataReaderEngineImpl                                FilePosition

Note: the CompactDataReaderEngine reads CompactData in version 2.0 and StructureSpecificData in versions 2.1 and 3.0. CompactData was renamed to StructureSpecificData in version 2.1 of the SDMX standard.

Custom Formats

ExcelReportingTemplateReader

Reads Excel datasets generated as a report template.

CellDataReaderEngine

Reads datasets in Excel that conform to the format used by FusionXL.

KryoDataReaderEngine

Reads datasets that have been serialised using the KryoDataWriterEngine, which uses the Kryo serialisation format; typically used as a fast caching solution where datasets require local processing.

Preserves the IDataPosition of the original dataset that was serialised.

Data Reader Manager

Whilst a specific implementation of the DataReaderEngine can be created by calling the constructor directly, the intent of sdmx-core is to always go through the DataReaderManager, as this provides the abstraction from the data format, enabling the code to remain format and syntax agnostic. The DataReaderManager is given a ReadableDataLocation and it returns the DataReaderEngine that is capable of reading the stream.

ReadableDataLocation rdl = ...
SdmxBeans beans = ...
DataReaderManager drm = new DataReaderManagerImpl();

//the manager determines which engine can read the data, based on the
//registered DataReaderFactory
DataReaderEngine dre = drm.getDataReaderEngine(rdl, beans, null);

Note: the DataReaderManager is typically configured in an application as a Singleton and registered with the framework. A new instance is not expected to be built for each request.
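
The manager-and-factory arrangement described in this section can be sketched as follows. This is an illustration of the general pattern only: ToyEngine, ToyEngineFactory, and ToyReaderManager are hypothetical types invented for the example, the format sniffing is deliberately naive, and throwing an exception when no factory matches is an assumption rather than documented sdmx-core behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only; NOT the sdmx-core interfaces.
interface ToyEngine {
    String formatName();
}

interface ToyEngineFactory {
    // Returns an engine if this factory recognises the content, otherwise null
    ToyEngine tryCreate(String content);
}

class ToyReaderManager {
    private final List<ToyEngineFactory> factories = new ArrayList<>();

    void register(ToyEngineFactory factory) {
        factories.add(factory);
    }

    // Asks each registered factory in turn; the first one that can read the data wins
    ToyEngine getEngine(String content) {
        for (ToyEngineFactory factory : factories) {
            ToyEngine engine = factory.tryCreate(content);
            if (engine != null) {
                return engine;
            }
        }
        // Assumption for this sketch: unreadable data is reported as an error
        throw new IllegalArgumentException("No registered factory can read this data");
    }
}

class ToyReaderManagerDemo {
    static ToyEngine named(String name) {
        return () -> name;
    }

    // Built once and shared, mirroring the singleton usage described above
    static ToyReaderManager defaultManager() {
        ToyReaderManager manager = new ToyReaderManager();
        manager.register(c -> c.trim().startsWith("<") ? named("xml") : null);
        manager.register(c -> c.trim().startsWith("{") ? named("json") : null);
        return manager;
    }

    public static void main(String[] args) {
        ToyReaderManager manager = defaultManager();
        System.out.println(manager.getEngine("<Data/>").formatName());
        System.out.println(manager.getEngine("{\"data\":{}}").formatName());
    }
}
```

The calling code never names a concrete engine, so supporting a new format means registering another factory rather than changing the callers.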