INSPIRE Thematic Clusters

Combining datasets for data harmonisation: how to deal with metadata?

Started by Thijs BRENTJENS Replies (10)

Hello all,

In the Netherlands we have several organizations providing data for Protected Sites. Some of these want to combine their datasets to create a single, harmonised dataset for Protected Sites. This makes publication easier for the service provider they have chosen. And in a sense it is also a nice result of INSPIRE: harmonising data on a national level. Data and responsibility for the original / source datasets will stay at the current organizations for now.

But publishing this new dataset (with Download and View Services) means that we also have to provide a single (new) metadata document of this newly created dataset. At this point we have some questions.

We need to combine information in the metadata. For the elements on Responsible Party, the contact info may contain multiple organizations, so there is no issue for us there. But for some other elements of the metadata we don't yet know how to describe the situation. Specifically, Completeness Omission, Absolute External Positional Accuracy and Lineage are a bit hard. How should we express that there are multiple sources here, with different values for these elements?

Do others have ideas how to deal with this or experiences / similar situations? Please feel free to comment.

Replies

    By Michael LUTZ

    Hi Thijs,

    interesting question. By looking at some examples of data set series in the geoportal lately, we have been wondering whether data set series could not be a powerful / useful concept to be used in such cases. The idea could be to leave the data sets as they are, but to create a data set series as an overarching concept linking them together. It would then probably be enough to set up the view and download services for the series and not necessarily for each data set.

    We would be interested to hear if there are other experiences using some such approach.

    Also, Thijs, if you are interested, we could use the NL example to come up with a proposal for how data set series could be used in such a situation.

    Best regards,
    Michael

    P.S. Thijs, could you please tag the issue as "cross-cluster", so it shows up on the main page of the TC platform?

    By Iurie MAXIM

    Hi Thijs,

    When speaking about protected sites, the situation is common in most, if not all, EU Member States. I know of no MS where a single organisation manages the data of all natural protected areas (national parks, natural monuments, etc.), archaeological sites and cultural sites, as these fall under different national authorities, for example the Ministry of Environment for natural protected areas and the Ministry of Culture for cultural sites. Even for natural protected areas alone, in most cases there are different organisations: some are responsible for protected areas of local or national importance, others for those of European Community importance, and others for international designations under conventions such as Ramsar or UNESCO World Heritage Sites. Even in the EEA CDR (European Environment Agency Central Data Repository) there are two reports: one for "CDDA" (Common Database on Designated Areas), which deals with national designations and is a voluntary data delivery to the EEA, and another for "Natura 2000", which deals with SPAs, SCIs and SACs and is an obligation of EU MS. Usually different organisations prepare the data for these deliveries, with different precision, data quality and lineage.

    On the other hand, for any dataset, it should be kept in mind that a dataset can even be based on multiple data themes. This is why a user can select multiple data themes in the metadata editor. There are datasets that contain, for example, natural protected sites (PS), administrative units (AU) and biogeographical regions (BR) in topology with the protected sites, and on top of these the geographical names (GN) of the protected sites and administrative units.

    In both situations described above, there are different organisations involved, some of them being Originators or Principal Investigators, some being Authors or Owners, others being Data Processors or Resource Providers, while some are only the Contact Point for that dataset. All these organisations, with their roles, are entered in the Responsible Party section of the metadata.

    It is then quite clear, for example, that organisations that put all these data together, correct topology errors, aggregate different data and transform them into the INSPIRE format have the role "Processor". Organisations that do not produce the data themselves but provide the information needed to produce the spatial datasets have the role "Principal Investigator". Organisations that do not produce the data but own it, having contracted its production to another entity (for example a private company), have the role "Owner". Organisations that only keep the view and download services alive 99% of the time, but have no role in producing the spatial dataset, are "Resource Providers" that can/should be mentioned in the metadata of the spatial data service. This is how the Responsible Party section should be filled in, both in the metadata of the spatial dataset and in the metadata of the spatial data services.
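The role assignments described above can be written down as a small lookup before filling in the Responsible Party elements. This is only an illustrative sketch: the role values are the ISO 19115 CI_RoleCode names corresponding to the roles mentioned in this reply, while the organisation names, the `ResponsibleParty` class and the `parties_by_role` helper are hypothetical.

```python
from dataclasses import dataclass

# CI_RoleCode values matching the roles named in the reply above.
ROLES = {
    "originator", "principalInvestigator", "author", "owner",
    "processor", "resourceProvider", "pointOfContact",
}


@dataclass(frozen=True)
class ResponsibleParty:
    organisation: str
    role: str

    def __post_init__(self):
        # Reject roles outside the list above to catch typos early.
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role}")


def parties_by_role(parties):
    """Group organisation names by role, preserving input order."""
    grouped = {}
    for party in parties:
        grouped.setdefault(party.role, []).append(party.organisation)
    return grouped


# Hypothetical organisations, for illustration only.
dataset_parties = [
    ResponsibleParty("Agency A", "owner"),
    ResponsibleParty("Institute B", "principalInvestigator"),
    ResponsibleParty("Agency C", "processor"),
    ResponsibleParty("Agency A", "pointOfContact"),
]
grouped = parties_by_role(dataset_parties)
```

Grouping by role makes it visible that one organisation can appear under several roles, which is typical when several providers contribute to one aggregated dataset.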

    In the lineage section of the dataset metadata, the relevant story of how the dataset was produced, from the very beginning to the final product, can/should be told, explaining in words what the different organisations did to make the final dataset. The lineage can give details about the different precision of the data (for example, that the marine sites are at scale 1:100.000, while most of the terrestrial sites are at scale 1:5.000 and some are at 1:25.000, possibly mentioning which ones). The lineage can also state that the dataset does not include some protected sites (for example sites that are legally established but were not identified in the field, or that are excluded for some other reason), and it is even possible to mention their names or national codes.

    Therefore anything a user needs in order to understand how the data was produced, and how complete and how precise it is (so that they can judge whether the dataset can be used for a certain purpose), can/should be written in words in the Lineage section of the metadata of the spatial dataset. The metadata of the spatial dataset should not be confused with the metadata of the spatial data service (view, download, etc.).
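One way to keep such a Lineage text consistent when several providers contribute is to assemble it from short per-source notes. The sketch below assumes a hypothetical `lineage_statement` helper and invented provider names and site codes; the scales mirror the example figures mentioned above, written without thousands separators.

```python
def lineage_statement(sources, omissions=()):
    """Compose a free-text lineage statement from per-source notes."""
    lines = ["This dataset was compiled from the following sources:"]
    for src in sources:
        lines.append(
            f"- {src['provider']}: {src['content']}, "
            f"captured at scale 1:{src['scale_denominator']}."
        )
    if omissions:
        # Omitted sites are listed in words, as suggested for Lineage.
        lines.append("Known omissions: " + "; ".join(omissions) + ".")
    return "\n".join(lines)


# Hypothetical providers and site codes, for illustration only.
sources = [
    {"provider": "Provider A", "content": "marine sites", "scale_denominator": 100000},
    {"provider": "Provider B", "content": "terrestrial sites", "scale_denominator": 5000},
]
text = lineage_statement(
    sources, omissions=["site X1 (legally established, not yet surveyed)"]
)
print(text)
```

The resulting string is plain text, so it can be pasted into the Lineage element of whatever metadata editor is in use.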

    Nor should a spatial dataset series be confused with spatial datasets from different data sources that are used to create an aggregated dataset: spatial dataset series (e.g. CLC1990, CLC2000) are related to TIME and not to different organisations having different roles in providing an aggregated spatial dataset.

    As regards the elements that are in a one-to-many relationship but for which only one value can be provided, such as precision, there are no rules, but either the worst or the average precision is provided, with more explanation given in the lineage section.
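The "worst or average" choice can be made explicit with a few lines of code. This is a minimal sketch, assuming positional accuracy is available per source as a value in metres (a larger value meaning less accurate); the source names, the figures and the `summarise_accuracy` helper are invented for illustration.

```python
from statistics import mean


def summarise_accuracy(accuracies_m):
    """Return the worst (largest) and the average accuracy in metres."""
    if not accuracies_m:
        raise ValueError("at least one source accuracy is required")
    values = list(accuracies_m.values())
    return {"worst_m": max(values), "average_m": mean(values)}


# Hypothetical per-source accuracy values in metres.
source_accuracy_m = {
    "marine sites": 50.0,
    "terrestrial sites (detailed)": 5.0,
    "terrestrial sites (coarse)": 20.0,
}
summary = summarise_accuracy(source_accuracy_m)
print(summary)  # {'worst_m': 50.0, 'average_m': 25.0}
```

Whichever of the two values ends up in the single metadata element, the per-source values themselves still belong in the Lineage text.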

    To give specific examples, here you may find what you are looking for:

    Metadata for spatial dataset: http://gmlid.eu/RO/ENV/PADS/MD

    Metadata for view spatial data service: http://gmlid.eu/RO/ENV/PADS/MD/WMS

    Metadata for download spatial data service: http://gmlid.eu/RO/ENV/PADS/MD/WFS

    Even though the metadata are in Romanian, Google Translate can be used, as it provides quite a good translation of the texts. The Lineage can be found at the very end of the metadata files, after the thousands of keywords (used for discovering the dataset/services when searching in the Geoportal).

    The EC Geoportal can also be used for viewing these metadata files, instead of looking at the XML files in the browser. When searching for any site name or national site code in the EC Geoportal, e.g. "ROSCI0001" or "retezat", the "Romanian Protected Areas Dataset" (PADS) will be discovered and its metadata can be displayed in the EC Geoportal.

    If you have a specific question on how to write a certain metadata element, don't hesitate to ask. Remember to use Lineage to provide all relevant information to the users, including information that can't be correctly expressed in single metadata elements (e.g. precision).

    Hope it helps,

    Iurie Maxim

    Romania

     

    By Thijs BRENTJENS

    Hi Michael, Iurie,

    Thanks for your answers, these are really, really helpful. I'll discuss it with the organizations; I think the practical approach of Romania is very suitable.

    Regarding dataset series: do I understand correctly that you both see spatial dataset series differently? That is: Iurie, you write that spatial dataset series should be used for time-based series of datasets only, not for aggregation of different datasets. Is that the ISO definition of dataset series, for example? We would like to have a justification. I see that in the MD guidelines there is only a recommendation. TG Recommendation 6 states:

    series: is a collection of resources or related datasets that share the same product specification

    So if we aggregate datasets, it might not be the intended use of series. Or Michael, would you like to try out how to use dataset series? I can imagine that series might also be a solution, especially to express the differences in quality.

    By Thijs BRENTJENS

    Thanks Stefania, that certainly helps. So dataset series might be an option too.

    Thijs

    By Michael LUTZ

    Thanks, Stefania. Indeed, time series are just one of the cases, where the use of dataset series makes sense. Another one (already used in a number of cases in INSPIRE) is a series of data sets with different spatial extent, e.g. a series of ortho-photos covering a whole country. Here, each data set of the series is a single ortho-photo.

    Best regards,
    Michael

    By Stefania MORRONE

    Dear all,

    please have a look at the dataset series definition in the ISO 19115.

    So, 'the exact definition of what constitutes a series entry is determined by the data provider'

    Hope this helps

    Stefania

    By Iurie MAXIM

    Hi all,

    First of all, to reply to Thijs: the reason I said that a spatial dataset series should not be confused with a spatial dataset lies in your question and Michael's reply. You mentioned that several institutions are producing data, that there is a need to combine information in the metadata, and that there are issues for certain metadata elements, such as "Completeness Omission, Absolute External Positional Accuracy and Lineage", because "there are multiple sources here, with different values for these elements". According to your question, this means that the "collection of spatial data" produced by the different organisations does not "share similar characteristics of theme, source date, resolution, and methodology", so it can't be considered a dataset series according to the ISO definition shared by Stefania.

    Second, indeed, according to the ISO definition, time series are just one of the cases, but they are the most used kind of series. Looking in more depth at the examples in the ISO definition, there is only space and time. At least to my understanding, series can be collections of spatial datasets that differ in time, in space (2D or 3D), or in both. To my knowledge nothing else can be merged/divided in order to form a collection of spatial data in a spatial data series. Probably I am wrong according to the ISO definition, which theoretically covers more than just space and time, but this is my GIS understanding: I see no spatial dataset series other than those related to time or space, most of them being related to time and only a few to space and time or only to space. In the INSPIRE context, individual aerial images / satellite scenes and scanned paper maps are not covered by any data theme, therefore dataset series related to space, or to space and time, would be quite unusual in INSPIRE, but possible in a federal state or where data is produced and maintained at regional level without being aggregated at national level.

    I can't picture two different spatial datasets "sharing similar characteristics of theme, source date, resolution, and methodology" that form a dataset series and that differ by something other than time and/or space. I do not see, for example, a dataset containing hydrography and another dataset containing the transportation network being part of a dataset series, but I do see them as part of the same dataset. I see as a dataset series all Corine Land Cover datasets from different years. I see as a dataset series orthophoto datasets from different years or periods, whether each one covers an entire country or different parts of the country. I see as a dataset series the federal collection of all protected areas spatial datasets produced at the local/regional level according to the same methodology in a federal state and made available at the local/regional level.

    As Thijs explained that he wants to create a single dataset by combining multiple datasets, a new dataset would exist with all elements put together and accessible via download and view services, and that dataset needs its own metadata. Thijs did not ask only to produce metadata for a spatial dataset series that would point to all protected areas spatial datasets without producing a new aggregated dataset.

    Having in mind that:

    - "The creation of a dataset series metadata level is an optional feature that allows users to consult higher level characteristics for data search. The definition of this type of metadata may be adequate for the initial characterization of the available spatial data, but may not be adequate for detailed assessment of data quality of specific datasets"

    - a dataset series does not seem to be a sub-type of a dataset, nor a dataset itself, but only a collection of spatial datasets, similar to a folder containing similar files (where the files are the datasets and the folder is the dataset series)

    - ‘spatial data set’ means an identifiable collection of spatial data [INSPIRE Directive, Art. 3]

    - "download Service is a service enabling copies of spatial data sets, or parts of such sets, to be downloaded and, where practicable, accessed directly" [INSPIRE Directive, Art. 11], and I see no download services for spatial dataset series in any INSPIRE document. In INSPIRE documents, dataset series seem to be covered only in the Metadata IR and TG. All other INSPIRE documents seem to refer only to spatial datasets and not to spatial dataset series.

    ... my understanding is the following:

    If I have two Corine Land Cover spatial datasets, one from 1990 and one from 2000, then I produce metadata files for each spatial dataset and for each network service (i.e. view and download). On top of these spatial datasets I can produce metadata for the spatial dataset series, in which I put as coupled resource locator the links to the metadata of the two spatial datasets, indicating in the abstract and lineage why the two datasets are part of the same dataset series. I do not make any other metadata file for a network service, and I do not aggregate the data of the two Corine Land Cover datasets into an aggregated dataset to be served via view and download services. When I produce a new CLC, I produce the metadata file for the new spatial dataset and for the respective view and download services, and I update the metadata file of the spatial dataset series to include information on the new dataset. The dataset series metadata allows users to find the specific datasets and their view and download services, but there is no dataset that aggregates both datasets, and there are no view and download services allowing users to view or download all the spatial datasets that are part of the series.

    Similarly, suppose I have an orthophoto covering the entire country with a flight period of 2004-2006, for which I produce the metadata of the spatial dataset and the metadata for the corresponding view and download services. I have other orthophotos covering the entire country from 2007-2010, 2011-2014 and 2014-2017. For these I produce metadata for the other three spatial datasets and for their corresponding view and download services. Because I am the data owner of all four orthophotos, I decide to provide metadata for the spatial dataset series, in which I indicate in the coupled resource locator the locations of the metadata files of the individual orthophotos. In each individual dataset I include as a coupled resource locator the link to the metadata file of the spatial dataset series. Of course I do not produce any aggregated dataset out of the four orthophotos (as it would not make any sense).

    Similarly, if all regional bodies produce metadata for the spatial datasets containing the protected areas under their jurisdiction, and metadata for the corresponding view and download services, then the national body/federal state can produce a metadata file for a spatial dataset series, provided all individual spatial datasets at the regional level "share similar characteristics of theme, source date, resolution, and methodology". If the national body also decides to produce an aggregated spatial dataset by combining all spatial datasets from the regional bodies, then it should produce a metadata file for that aggregated dataset and for the corresponding view and download services. Although I am not sure, the link to the metadata of the aggregated dataset could probably then be included as a coupled resource locator in the metadata file of the spatial dataset series, indicating in the lineage and in the abstract that the series also includes the aggregated data.
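The linking pattern in the three examples above (series metadata pointing to each member's metadata, and each member pointing back to the series) can be sketched as a simple data structure. The class names, field names and example.org URLs below are hypothetical; this models only the cross-references, not the actual XML encoding of the metadata.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    title: str
    metadata_url: str
    series_url: str = ""  # link back to the series metadata, if any


@dataclass
class SeriesRecord:
    title: str
    metadata_url: str
    member_urls: list = field(default_factory=list)

    def add_member(self, dataset):
        """Cross-link series and dataset; no aggregated dataset is created."""
        if dataset.metadata_url not in self.member_urls:
            self.member_urls.append(dataset.metadata_url)
        dataset.series_url = self.metadata_url


# Hypothetical metadata locations, for illustration only.
series = SeriesRecord("Corine Land Cover", "https://example.org/md/clc-series")
clc1990 = DatasetRecord("CLC1990", "https://example.org/md/clc1990")
clc2000 = DatasetRecord("CLC2000", "https://example.org/md/clc2000")
for ds in (clc1990, clc2000):
    series.add_member(ds)
```

When a new dataset is produced later, only `add_member` needs to be called again; the existing member records stay untouched, which matches the update step described for a new CLC.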

    Is this the way the metadata for a dataset series should be filled in, or is my understanding wrong?

    For the protected sites example in the question, it is difficult to see how it could be a spatial dataset series. Anybody who tries to fill in the metadata for a spatial dataset series will run into the same problem as in Thijs's question, and will probably find that, according to ISO, aggregated data from multiple sources and organisations, representing different entities and made according to different methodologies in different periods of time, can't be a spatial dataset series.

    Looking in the EC Geoportal, there are 96152 spatial datasets and only 2662 dataset series. It is interesting to have a look. Surprisingly, almost half are from the Species Distribution data theme from the UK, with resource titles such as "1960 - 1960 Centre for Environment, Fisheries & Aquaculture Science (Cefas) Survey : SLAN/16/1960 (part of CEFAS Historic surveys)". There are even 75 spatial data series for Protected Sites, but none of those I opened suggests a spatial dataset series. Indeed, it seems that a separate topic just for discussing spatial dataset series would be useful.

    Even if Thijs's dataset were considered a spatial dataset series, personally I have no clue how the metadata for such an aggregated spatial dataset could be filled in for a spatial dataset series. So if anyone can provide guidance on what to write in lineage, spatial resolution, responsible parties and quality, as requested by Thijs, and what to write as resource locator, it would be useful. In any case, my understanding is that Thijs should then not produce any aggregated dataset, but just a metadata file for the spatial dataset series that has the individual datasets as coupled resource locators. Of course no view and download service would exist to see all protected areas together, as no aggregated spatial dataset would exist.

    Iurie

    By Iurie MAXIM

     

    As a picture is worth more than 1000 words: is this correct or not?

    Not sure if the blue links exist, are optional or obligatory, but I think that either they do not exist or they are optional.

    It may be considered that for Dataset 1, service 1.1 is the view service. Then the download service will be service 1.2, and in the Dataset 1 MD resource locator a second URL will be added, pointing to the Download Service GetCapabilities document.

    Iurie

    By Thijs BRENTJENS

    Iurie, thanks again for these great posts! Your post really helps in my understanding of what options we have and how spatial dataset series could (and could not) be used in the (INSPIRE) practice.

    You are right, the source datasets we are dealing with in this case have too many differences to consider them similar. A series wouldn't work in this case because we want to publish a single Download and a single View Service.

    I think that we should go back to your first post in this case and consider the dataset we will publish as a new dataset, with a (new) dataset metadata document containing an extensive description in Lineage, and be as reasonable as we can for the other elements (like resolution etc.). I'll discuss this with the data providers and get back with our solution.

    But please, feel free to discuss further, really helpful :-).

    By Iurie MAXIM

    Hi Thijs,

    I am glad that the information was helpful.

    I also want to clarify that my intention was not to contradict Michael, as I saw that it sometimes seems we do not share similar points of view and the posts look quite strange. It is not and was not my intention to contradict people. This being a platform for sharing information and for helping the organisations that are implementing INSPIRE, I try to share my knowledge, as there is no other instrument to do so except this platform and the INSPIRE Conferences. It is important to mention that, as I have a great team that has spent a lot of man-days over the last 10 years reading, understanding and implementing INSPIRE, our point of view comes from practical implementations. As the Technical Guidelines were unfortunately not developed in a way that lets organisations understand exactly what to do and how to do it, but are rather just a set of Requirements and Recommendations, this platform is one of the only ways in which organisations can ask about and discuss practical things.

    For me it is quite clear that, in order to make some steps ahead in INSPIRE, the first thing needed is Practical Guidelines with examples, as the TGs that we have now could be called "Requirements and Recommendations" instead of Technical Guidelines, and they are simply "extending" the Implementing Rules.

    In this context, I am almost sure that organisations are asking first of all for simplification mainly because of this. If organisations had Practical Guidelines to understand what to do, the necessary tools to do it, and the knowledge of how to do it, then everything would be clearer.

    Indeed we can discuss data series in more detail, but this was not the intention of the thread, nor is it related to the question that you raised. Simply by looking in the EC Geoportal it is clear that there is very limited understanding of data series. Similarly, looking in the EC Geoportal it is quite clear that there is very limited understanding of all types of metadata (for spatial datasets, spatial data series, spatial data services, GetCapabilities) and of how they are related and interlinked. And this is mainly because of the TGs. How many organisations understand from the TGs what to write for the "Resource locator - The resource locator defines the link(s) to the resource and/or the link to additional information about the resource. The value domain of this metadata element is a character string, commonly expressed as uniform resource locator (URL)."? From the EC Geoportal it seems that less than 1%. How many organisations are aware that the EC Metadata Editor is wrongly implemented and does not allow the creation of valid metadata for spatial data services (there is only one field for coupled resource instead of three fields), and that the only solution is Notepad++? When I raised such issues I was seen as an uncomfortable person rather than as a person interested in moving things forward.

    Best regards,

    Iurie

Biodiversity and Management Areas Cluster
