IRI Data Library: enhancing accessibility of climate knowledge
© Blumenthal et al.; licensee Springer. 2014
Received: 1 October 2013
Accepted: 27 December 2013
Published: 17 June 2014
Climate variability affects a broad swath of socio-economic sectors. If that variability increases, or if a sector becomes overly tuned to past or present climate conditions, it becomes of increasing concern to a wide range of non-climate specialists. The significant challenges to building the capacity of non-climate specialists to use climate information in research and decision-making include the difficulties in accessing relevant and timely quality-controlled data and information in formats that can be readily incorporated into specific analysis and reporting.
The IRI Data Library is a facility designed to cope with these issues of information dissemination. Methods developed include Map Rooms which are designed for rapid access to needed information for particular user groups, analysis tools useful for a wide range of users (especially while training), and a metadata framework that uses semantic technologies to transform metadata from a variety of sources into a variety of standards.
The results are tools to merge standard climate products with GIS information (e.g. averaging climate data over the political boundaries used to geolocate health and socio-economic data), as well as simplified access/transformation of large datasets only available as collections of many files or service points elsewhere.
The IRI Data Library is thus a key platform that makes climate and other data products more widely accessible through tool development, data organization and transformation, and data/technology transfer.
Climate variability affects a broad swath of socio-economic sectors, and if climate variability increases or the sector becomes more efficient and thus more precisely tuned to past or present climate conditions, it becomes of increasing concern to a wide range of non-climate specialists.
For example, public health professionals are increasingly concerned about the potential impact that climate can have on health outcomes. In the absence of effective disease control, climate determines the spatial and seasonal distribution of many infectious diseases and is a key determinant of inter-annual variability in disease incidence, including epidemics and longer-term changes in endemicity (Kelly-Hope and Thomson 2008).
Protecting public health from the vagaries of climate will require new working relationships between the public health sector and providers of climate data and information. It will also demand a wide variety of strategies occurring at multiple levels. One of these strategies is to increase the public health community’s capacity to understand, use, and demand appropriate climate data and information to mitigate the public health impacts of the climate. However, good information is not enough. The public health community must also be able to distinguish between different kinds of data and information products to determine what is relevant for their specific needs, how it can be readily accessed, and what methodologies and tools can best serve their purpose. Health practitioners and researchers concerned with climate-sensitive decisions are not routinely trained to consider these issues.
Significant challenges to building the capacity of health professionals to use climate information in research and decision-making include the difficulties in accessing relevant and timely quality-controlled data and information in formats that can be readily incorporated into specific analysis with other data sources (Thomson et al. 2011). While initiatives to improve health communities' access to relevant quality-controlled climate data are underway (Dinku et al. 2013), many barriers remain in terms of data, services, practice and policy (IRI 2006) that will need to be overcome for climate and environmental information to play a significant part in reducing climate-related risks with regard to health (Connor et al. 2010).
These barriers include (but are not limited to) a lack of:
access to relevant local and globally accessible data that may be used to create policy-relevant evidence for local, national, and regional decision-making;
ability to generate new knowledge because there is insufficient capacity to understand, assess, and use climate information (as well as other environmental and demographic information), in analyses designed to support a specific research question;
effective and available tools to enable the analysis of relevant data in space and time and which communicate easily with other software used for research or knowledge sharing;
policies for data sharing as well as technological constraints to knowledge and data sharing that could facilitate networks of researchers to engage with each other around common research agendas; and
a policy and practice environment that is responsive to new information concerning changes in disease risk.
The capacities of the IRI Climate Data Library can be used to build an integrated knowledge system to support the use of climate and environmental information in climate-sensitive decision-making. Initially funded as an aid to climate scientists for exploratory data analysis, it has now expanded to provide a platform for interdisciplinary researchers focused on topics related to climate impacts on society (del Corral et al. 2012).
The IRI Climate Data Library
As its name suggests, the IRI Climate Data Library is a collection of datasets, both locally and remotely held, designed to make information more accessible for the library’s users. Datasets in the library come from many different sources, “data cultures”, and formats. By “dataset” we mean a collection of data organized as multidimensional dependent variables, independent variables, and sub-datasets, along with the metadata (particularly on purpose and use) that makes it possible to interpret the data in a meaningful manner. The Data Library allows users to:
access, manage and manipulate any number of datasets from a variety of earth science and climate-related topics, including public health;
create analyses of data (including climate and health data) ranging from simple averaging to more advanced Empirical Orthogonal Function (EOF) analyses using the Ingrid programming language;
monitor current and review past climate/environmental conditions with maps and analyses;
create multi-dimensional visual representations of climate and public health data, including animations over time; and
customize and download data plots and maps in a variety of image and data formats, including those compatible with geographical information systems (GIS) or other software for data visualization.
Traditional GIS platforms are now widely used by planners and decision makers in society. However, they are highly focused on geospatial capabilities and have limited functionality for temporal analysis. Without information on the latter, meaningful inference about the causation of disease outbreaks is impossible (Jacquez 2000). Furthermore, many tools are unable to readily process the vast quantities of space-time data associated with, for example, the outputs of a global climate model. The IRI Climate Data Library overcomes the limitations imposed by GIS platforms by being based on a much more general multi-dimensional data model that includes both space and time dimensions. All datasets, including GIS features (such as points, lines, and polygons), are geo-located and temporally referenced in a uniform framework. Functions and operators in the Data Library use this framework to perform a wide range of analyses that integrate climate/environmental datasets and public health-related datasets. Large datasets, such as 100-year climate change model intercomparison results, are available through the Climate Data Library's cataloging and data transfer protocol support. In addition, the Data Library's interface and functions can be used to access shared repositories in different parts of the world.
A further challenge to spatio-temporal analysis used in agriculture, hydrology, and public health is the integration of climate/environmental data with the sector data. There are normally important differences in the spatial and temporal scales of the datasets. Within the Data Library, an environmental dataset can be temporally averaged to match the time frequency of the sector data. If the sector data are based on geographic points or administrative polygons, the environmental dataset can be sampled with the same geographic constraints.
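As an illustration only, the temporal alignment described above can be sketched in a few lines of Python. This is not the Data Library's own Ingrid code: the function name `monthly_means`, the daily rainfall values, and the monthly case counts are all hypothetical, chosen to show daily environmental data being averaged to the monthly frequency of a sector dataset and then joined on the shared time key.

```python
from collections import defaultdict
from datetime import date, timedelta


def monthly_means(daily):
    """Average (date, value) pairs into a {(year, month): mean} mapping."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, value in daily:
        key = (day.year, day.month)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical daily rainfall (mm): 2 mm/day in January, 4 mm/day in February 2001.
start = date(2001, 1, 1)
daily_rain = []
for offset in range(59):  # 31 January days + 28 February days
    day = start + timedelta(days=offset)
    daily_rain.append((day, 2.0 if day.month == 1 else 4.0))

rain_by_month = monthly_means(daily_rain)

# Hypothetical monthly case counts standing in for the sector dataset.
cases_by_month = {(2001, 1): 120, (2001, 2): 310}

# Join the two datasets on their now-common monthly time step.
aligned = [
    (key, rain_by_month[key], cases_by_month[key])
    for key in sorted(cases_by_month)
]
```

Once both series share a time step, each row of `aligned` pairs a climate value with a sector value and can be passed to any correlation or regression routine.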
The IRI Climate Data Library can be used via two distinct mechanisms that are designed to serve different communities. Expert Mode serves the needs of operational practitioners and researchers that have an in-depth knowledge of the functionality of the system and are able to customize it to their own specific needs. Advanced users may develop custom functions and perform tailored analyses using Ingrid, the Data Library’s programming language. This functionality is widely used around the world by climate researchers as Expert Mode allows users with programming skills a very extensive level of personalized functionality. Online tutorials, examples, and function definitions are part of the Data Library.
Map Rooms are web-accessible tools targeted at particular user-groups, the end result of a process which evaluates user-group needs and builds tools that help address those needs. These tools preselect data and analyses suitable for the task, building an easy-to-use framework for addressing the users' immediate needs, as well as providing links that allow the user to quickly download the data into the user group's standard tools for further analysis. While there are now many map rooms and hundreds of map room pages, several stand out in their operational use by their user groups.
Data are organized into datasets composed of sub-datasets and multi-dimensional variables with use metadata. These variables can be quite large (terabytes) with many dimensions, so that a single variable can conceptually unify what in practice may be many files spread across many directories, details the user can ignore (or be blissfully unaware of).
Analysis filters usually return variables (sometimes datasets), i.e. data with associated use metadata. This means filters can be chained together: any analysis result behaves as if it were a named dataset. In fact, a number of variables named in the dataset collection are analyses based on other datasets.
Specifying a calculation is separate from actually executing it, so that chains of calculations processing large amounts of data can be specified and manipulated while the actual execution of the data flow (or portions thereof) is delayed until it is actually required. This allows one to think in the abstract about manipulating the entire dataset, yet actually access it one portion at a time. It also allows shifting the responsibility of efficiently arranging the calculation away from the user, who can then focus on the actual scientific and statistical analysis.
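The separation between specifying and executing a calculation can be illustrated with a minimal Python sketch. It is not written in Ingrid and does not reproduce the Data Library's implementation; the class `LazyVariable`, the filter helpers `anomaly_from` and `first`, and the synthetic data source are all hypothetical. The sketch chains two filters onto a nominally large variable and shows that nothing is read until the result is requested, and then only the portion that is needed.

```python
import itertools
from typing import Callable, Iterator, List


class LazyVariable:
    """A named variable whose values are produced only when requested."""

    def __init__(self, name: str, source: Callable[[], Iterator[float]]):
        self.name = name
        self._source = source

    def then(self, label: str, transform) -> "LazyVariable":
        """Chain a filter; this only records the step, it computes nothing."""
        return LazyVariable(
            f"{self.name} {label}", lambda: transform(self._source())
        )

    def evaluate(self) -> List[float]:
        """Run the recorded chain and collect the result."""
        return list(self._source())


reads = []  # records which source values were actually read


def read_values() -> Iterator[float]:
    """Stand-in for a very large variable stored across many files."""
    for i in range(1_000_000):
        reads.append(i)
        yield float(i)


def anomaly_from(reference: float):
    """Filter: subtract a reference value from every element."""
    return lambda stream: (x - reference for x in stream)


def first(n: int):
    """Filter: keep only the first n elements of the stream."""
    return lambda stream: itertools.islice(stream, n)


temperature = LazyVariable("temperature", read_values)
result = temperature.then("anomaly", anomaly_from(10.0)).then("first3", first(3))

reads_before = list(reads)   # still empty: the chain is only a specification
values = result.evaluate()   # execution happens here, one element at a time
```

Because each filter returns another `LazyVariable`, the result of an analysis can itself be named and reused as the input to further filters, which mirrors the chaining behaviour described above.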
This easy access to analysis filters is particularly useful in training. The Climate Information for Public Health course (Cibrelus and Mantilla 2010), for example, is intended to engage decision makers directly, not just through expert lectures, but also through focused discussions and practical training sessions. These sessions introduce the participants to geographical information system (GIS)-based computational tools for analyzing epidemiological data with climate, population and environmental data. To allow the students to focus on the course content and still be able to analyze their own data in the context of available climate information, we have built services that allowed them to access and analyze their own data within the Climate Data Library, as well as adding analysis functions particularly useful in health analyses, ranging from k-means clustering to disease epidemic threshold calculations. This course and its tools have been taught in an annual Summer Institute and in sessions around the world, some using the Standalone Data Library (see below).
Another example of advanced analysis using the IRI Data Library is to create spatial-temporal maps of malaria incidence using health surveillance data. Monthly data on clinical malaria cases from 242 health facilities in 58 subzobas (district boundaries from the National Statistics and Evaluation Office) in Eritrea from 1996 to 2003 were used in a novel stratification process to guide future interventions and development of an epidemic early warning system. The process used principal component analysis and nonhierarchical clustering to define five areas with distinct malaria intensity and seasonality patterns and has been used by the Eritrean Malaria Control program in its planning process (Ceccato et al. 2007b).
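The nonhierarchical clustering step of such a stratification can be sketched with a small k-means routine in plain Python. This is an illustration under stated assumptions, not the procedure used in the Eritrean study: the principal component analysis step is omitted, the seasonal incidence profiles are invented, the initial centroids are fixed by hand, and only two clusters are sought rather than five.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of profiles."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(points, centroids, iterations=20):
    """Assign points to the nearest centroid, update, repeat until stable."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        updated = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(mean_point(members) if members else centroids[k])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical incidence profiles (four seasons) for six reporting units:
# the first three peak late in the year, the last three peak early.
profiles = [
    [1, 1, 8, 9], [2, 1, 9, 8], [1, 2, 9, 9],
    [7, 8, 1, 1], [8, 9, 2, 1], [9, 8, 1, 2],
]

labels, centroids = k_means(profiles, [profiles[0], profiles[3]])
```

Each label identifies the stratum to which a reporting unit belongs; mapping the labels back onto the units' boundaries would give a stratification map of the kind described above.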
Often in a research community there are several different metadata standards used to describe the same object. Associated with each metadata standard is a conceptual model, frequently not explicit, which describes the object in its own way. We are using a framework based on RDF/XML (Resource Description Framework) to address this issue, and create a flexible, reusable solution that can adapt to a variety of new metadata standards. It implements a semantic framework for explicitly writing down multiple metadata schemas and conceptual models as ontologies; the ontologies identify metadata elements and concepts and characterize the relationships between them. We also use the framework to write crosswalks, i.e. explicit characterizations of the relationships between concepts and metadata elements belonging to different systems, including the connections between the metadata objects and the concepts they represent. Not only does this framework allow translation between alternate systems, it also facilitates building a more complete description of data objects out of a number of narrowly-focused standard systems. Going beyond standards, it can explicitly describe the data models implicit in programs that display and manipulate data. Writing Models, Crosswalks, and Objects all with RDF/SemanticWeb means that these data models and metadata standards can be combined into a single framework, leading to an interoperable metadata standard (Blumenthal et al. 2011).
Crosswalking between different standards can be as simple as two different names for the same quantity, but sooner or later the mapping gets more complicated. Frequently, different objects are related conceptually but are very different structurally. Our framework thus has both structure and conceptual models. Structure models describe how dataset metadata is written, e.g. cfatt, which describes the attributes of a Climate and Forecast Metadata (CF) Convention netcdf file. Conceptual models describe the conceptual objects represented in the convention, e.g. cf-obj, which describes the more abstract objects (like geo-located data) that are being described in the CF convention, objects that are also described in other systems, but are not explicitly written in any given CF netcdf file. XML Schema is a common way to represent structure models for XML files, and we have a translation of XML Schema to RDF/OWL which allows us to create conforming XML files from RDF information. We have applied this to the WCS Schema, for example, to extract the needed information for an OPeNDAP WCS service based on RDF extracted from CF/netcdf files. We also have included controlled vocabularies such as CF standard names or GCMD scientific parameters. Controlled vocabularies are a common way to structure classifications, and are important for us to build a faceted search that works across diverse datasets.
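The simplest kind of crosswalk, two different names for the same quantity, can be sketched in Python using plain (subject, predicate, object) triples. This is a toy illustration and not the RDF/OWL machinery itself: `standard_name` and `units` are real CF attribute names, but the target element names `parameter` and `unit_of_measure`, the subject identifier `var1`, and both helper functions are hypothetical.

```python
# Hypothetical crosswalk: CF attribute name -> element name in another system.
CROSSWALK = {
    "standard_name": "parameter",
    "units": "unit_of_measure",
}


def to_triples(subject, attributes):
    """Express a variable's attributes as (subject, predicate, object) triples."""
    return {(subject, key, value) for key, value in attributes.items()}


def apply_crosswalk(triples, crosswalk):
    """Re-express every triple whose predicate has a mapping; drop the rest."""
    return {
        (subject, crosswalk[predicate], obj)
        for subject, predicate, obj in triples
        if predicate in crosswalk
    }


# Attributes as they might appear on a variable in a CF netcdf file.
cf_attributes = {
    "standard_name": "air_temperature",
    "units": "K",
    "cell_methods": "time: mean",  # no mapping defined in this toy crosswalk
}

source_triples = to_triples("var1", cf_attributes)
translated = apply_crosswalk(source_triples, CROSSWALK)
```

A one-to-one renaming like this handles only the easy case. Mappings between structurally different objects need rules that relate whole groups of triples, which is the reason for the separate structure and conceptual models described above.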
The framework is established by creating ontologies for each metadata representation of these objects, and rule-based crosswalks between them, so that each object is expressed in all representations and can therefore be viewed in multiple systems. This technology has been encapsulated in a Java-based persistence/inferencing framework for OPeNDAP (Cornillon et al. 2009) as part of a NOAA/IOOS project (Holloway et al. 2010). This work combines custom innovations, the use of ontologies, and leading Semantic Web technologies such as Sesame and OWLIM. Because this framework was developed on Java technology, the system is highly portable between various platforms.
We also developed an XML element extraction system based on Java, which allows the extraction of information from the framework into an XML format that is based on data description and delivery standards (WMS, SERF, etc.). With these tools we can further develop technologies for delivering climate data and analysis to partner systems.
Merging standard climate products with GIS information
A central part of the IRI Data Library's functionality is that it brings a wide variety of data together into a framework that allows those data to be analyzed together. The framework is sufficiently general that it overcomes the differences that disparate domains can have, while remaining able to represent the results of an analysis so that they can be used for further analysis.
For example, consider different ways that data can be characterized geospatially. Atmospheric and oceanic scientists tend to have multi-dimensional data, a simple case being where temperature or precipitation is characterized as being a function of latitude, longitude, and time. Health or economics sciences, on the other hand, tend to have data as a function of time and geospatial entity, which might be a district or state or census tract. GIS data tends to have two structures, either a raster image with associated projection information, or vector descriptions of shapes, describing how to draw geospatial entities as sets of polygons or points. The IRI Data Library combines all three of these geospatial frameworks, allowing interoperability.
A manifestation of that geospatial interoperability is how the Data Library brings together three-dimensional (longitude, latitude, time) climate data and GIS spatial entity descriptions in the MEWS Malaria Map Room mentioned earlier. That Map Room includes a tool which not only displays a zoomable map of precipitation, but also allows selection of a district and displays a downloadable time series computed by averaging the three-dimensional data over the district (Figure 1b). This combination of data to get time series for a particular geographic entity is an important enhancement to the accessibility of the original data, for users whose other information is geolocated by entity. In this case, the user requests the analysis simply by clicking, but the underlying analysis functions can be used to combine many data sets and geographic entities.
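The district-averaging operation can be sketched in plain Python as follows. The sketch makes several simplifying assumptions that are not taken from the Data Library: a grid cell counts as inside the district if its centre falls within the polygon (tested by ray casting), all selected cells are weighted equally rather than by area, and the grid, coordinates, and district outline are invented for the example.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def district_time_series(grid, lons, lats, polygon):
    """Average grid[t][j][i] over cells whose centres lie inside the polygon.

    grid[t][j][i] is the value at time step t, latitude lats[j], longitude lons[i].
    """
    cells = [
        (j, i)
        for j, lat in enumerate(lats)
        for i, lon in enumerate(lons)
        if point_in_polygon(lon, lat, polygon)
    ]
    if not cells:
        raise ValueError("no grid cell centre falls inside the polygon")
    return [sum(field[j][i] for j, i in cells) / len(cells) for field in grid]


# Hypothetical 2 x 3 precipitation grid at two time steps.
lons = [0.5, 1.5, 2.5]
lats = [0.5, 1.5]
grid = [
    [[1.0, 2.0, 30.0], [3.0, 4.0, 30.0]],  # time step 0
    [[2.0, 4.0, 60.0], [6.0, 8.0, 60.0]],  # time step 1
]

# Hypothetical district: a square covering the two western columns of cells.
district = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]

series = district_time_series(grid, lons, lats, district)
```

The result is one value per time step, the same shape as entity-indexed health or economic data, so the two can be compared directly. A production version would weight cells by area and handle cells only partly covered by the district.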
Simplified access/transformation of large datasets
An essential role of the Data Library has been to enhance the value of publicly available disparate datasets by bringing them into a single framework that allows them to be analyzed together. One way we have done this is by transforming large datasets from their reference format (frequently large numbers of files in specialized formats with either purely descriptive access documentation or highly specialized metadata) to a more conceptual structure that allows the user to make selections and analyses in space, time, and physical variable without mastering the original multipart structure, the requests being made in the same way whether the variable represents megabytes or terabytes. The data can then be directly analyzed, partially analyzed to reduce its size and/or make the data more suitable for the users' needs, and displayed and/or downloaded in a wide range of commonly used formats and for many commonly used tools.
Transforming from a structure that is appropriate for a provider to one that is appropriate for a user is a critically enabling step, preventing technical barriers from keeping users from accessing the information. It is also important to remember that both the provider and the user have critically important needs. While the user needs to analyze the data as a coherent whole, the provider needs to characterize the data with clean provenance: which parts of the dataset were created when, were any parts of the dataset revised, and as the dataset is extended in time what new segments have been added. Simply stating that each extension to the dataset creates an entirely new one, for example, means that anyone trying to track the changes in the dataset would falsely think there is an enormous volume of data in keeping all the versions.
Shared Data Library Technology
The standalone Data Library configuration for partners is used in situations where targeted data services are needed. This may be in a region where only regional data are to be analyzed and delivered over the internet. The mirror Data Library configuration is used where the partner would like to see parts of or all the IRI Data Library datasets in their local configuration. This configuration allows the partner to view, analyze, and deliver both local regional data and global data, as well as allow the partner to share their local data with the IRI Data Library in a seamless data catalog available to both the partner and the IRI. This technology can be used to create a federation of remotely deployed Data Library sites and the IRI Data Library. This means that locally stored data can be shared globally over the internet among federation members.
The Data Library technology incorporates a content delivery service over local area and wide area networks. The portable Data Library can be used in a classroom setting (local area network) where all the visualization, analysis, and data delivery capabilities of the software are accessible to each student sitting in front of a computer or tablet that is running a browser. In a wide area network setting, the portable Data Library can be used as a self-contained website, or as part of an existing website. The start-up costs for a partner to implement a ready-made website for climate data can be minimized by using a portable Data Library.
When partners are evaluating the risk factors of climate change and variability on various sectors they encounter two impediments. One is the inability to bring sectoral and climate data together in a unified framework for comprehensive analysis. The other impediment is that often, government ministries (or departments within a single ministry) are reluctant to share data. A portable Data Library brought into a region can help remove some of these barriers. It is a neutral platform that can be installed in almost any (neutral) location. The functions within the Data Library software can be used to align climate data with the spatial and temporal resolutions of the sector data of interest. Once this alignment is performed, the correlation and statistical functions within the Data Library can be used to determine the climate risk factors affecting a sector or sectors in the partner’s region.
The IRI Data Library is a key platform that makes climate and other data products more widely accessible through tool development, data organization and transformation, and data/technology transfer. Tools developed include Map Rooms which are designed for rapid access to needed information for particular user groups, analysis tools useful for a wide range of users, tools to merge standard climate products with GIS information (e.g. boundaries of political entities used to geolocate health and socio-economic data), and simplified access/transformation of large datasets only available as collections of large numbers of files or service points elsewhere. We have developed a metadata framework that uses semantic technologies to transform metadata from a variety of sources into a variety of standards. We have also shared Data Library technology with partners to assist them in their own data sharing and access.
General Data Library work has been funded most recently under NOAA Grant NA10OAR4310210 and USAID Grant AID-OAA-A-11-00011. Specific map rooms mentioned have been funded under these grants as well as NOAA Grant NA08OAR4310622. We also acknowledge considerable cooperation and effort from all our partners. Our partners in creating the map rooms mentioned are, in particular, the Food and Agriculture Organization (FAO), the International Federation of Red Cross and Red Crescent Societies, and the National Ethiopian Meteorological Agency.
Responsible editor: Xiubin Li.
- Blumenthal MB, del Corral JC, Liu H, Holloway D, Potter N: Semantic Framework for climate metadata interoperability (T248A). In WCRP Open Science Conference; 24–28 Oct. Denver CO, USA; 2011.
- Ceccato P, Bell MA, Blumenthal MB, Connor SJ, Dinku T, Grover-Kopec EK, Ropelewski CF, Thomson MC: Use of Remote Sensing for Monitoring Climate Variability for Integrated Early Warning Systems: Applications for Human Diseases and desert Locust Management. IGARSS Denver: IEEE International Conference on Geoscience and Remote Sensing Symposium 2006.
- Ceccato P, Cressman K, Giannini AS, Trzaska S: The desert locust upsurge in West Africa (2003–2005): Information on the desert locust early warning system and the prospects for seasonal climate forecasting. Int J Pest Manag 2007a, 53: 7–13. doi:10.1080/09670870600968826
- Ceccato P, Ghebremeskel T, Jaiteh M, Graves PM, Levy M, Ghebreselassie S, Ogbamariam A, Barnston AG, Bell MA, del Corral JC, Connor SJ, Fesseha I, Brantly EP, Thomson MC: Malaria stratification, climate, and epidemic early warning in Eritrea. Am J Trop Med Hyg 2007b, 77(6):61–68.
- Cibrelus L, Mantilla G: Climate Information for Public Health: A Curriculum for Best Practices - Putting Principles to Work. Palisades, New York: International Research Institute for Climate and Society Report; 2010.
- Connor SJ, Omumbo J, DaSilva J, Green C, Mantilla G, Delacollette C, Hales S, Rogers D, Thomson MC: Health and Climate - Needs. Procedia Environmental Sci 2010, 1: 27–36.
- Cornillon P, Adams J, Blumenthal MB, Chassignet E, Davis E, Hankin S, Kinter J, Mendelssohn R, Potemra JT, Srinivasan A, Sirott J: NVODS and the development of OPeNDAP. Oceanography 2009, 22(2):116–127. doi:10.5670/oceanog.2009.43
- del Corral JC, Blumenthal MB, Mantilla G, Ceccato P, Connor SJ, Madeleine C, Thomson MC: Climate Information for Public Health: the role of the IRI Climate Data Library in an Integrated Knowledge System. Geospat Health 2012, 6(3):S15-S24.
- Dinku T, Hailemariam K, Maidment R, Tarnavsky E, Connor S: Combined use of satellite estimates and rain gauge observations to generate high-quality historical rainfall time series over Ethiopia. Int J Climatol 2013. doi:10.1002/joc.3855
- Dinku T, Hailemariam K, Grimes D, Kidane A, Connor SJ: Improving availability, access and use of climate information. World Meteorological Bulletin 2011, 60: 2.
- Greene AM, Goddard L, Cousin R: Web tool deconstructs variability in twentieth-century climate. Eos Trans AGU 2011, 92(45):397.
- Grover-Kopec EK, Kawano M, Klaver RW, Blumenthal MB, Ceccato P, Connor SJ: An online operational rainfall-monitoring resource for epidemic malaria early warning systems in Africa. Malar J 2005, 4: 6. doi:10.1186/1475-2875-4-6
- Holloway D, Blumenthal MB, Liu H, Potter N: Using Semantic Web Technologies with OPeNDAP. 2010 Fall Meeting, AGU, 13–17 Dec, 2010; San Francisco, Calif 2010. IN41C-1369
- IRI: A Gap Analysis for the Implementation of the Global Climate Observing System Programme in Africa. International Research Institute for Climate and Society. Palisades, NY 2006.
- Jacquez GM: Spatial analysis in epidemiology: Nascent science or a failure of GIS? J Geogr Systems 2000, 2: 91–97. doi:10.1007/s101090050035
- Kelly-Hope L, Thomson MC: Climate and infectious diseases. Seasonal Forecasts, Climatic Change and Human Health 2008, 31–70.
- Lyon B, Bell MA, Tippett MK, Kumar A, Hoerling MP, Quan X, Wang H: Baseline probabilities for the seasonal prediction of meteorological drought. J Appl Meteorol Climatol 2012, 51: 1222–1237. doi:10.1175/JAMC-D-11-0132.1
- McKee TB, Doesken NJ, Kleist J: The relationship of drought frequency and duration to time scales. Eighth Conf. on Applied Climatology. Anaheim, CA: Amer. Meteor. Soc.; 1993, 179–184.
- Someshwar S, Boer R, Conrad E: Managing Peatland Fire Risk in Central Kalimantan, Indonesia. World Resources Report: Washington DC; 2010.
- Thomson MC, Connor SJ, Zebiak SE, Jancloes M, Mihretie A: Africa needs climate data to fight disease. Nature 2011, 471: 440–442. doi:10.1038/471440a
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.