What are “datasets” and “data services” in the RING?

“Data services” is a generic term for any type of information service on the web, from a simple website to a search engine, an application programming interface (API), or a data dump. More technically, an “information service” is any platform that delivers information from a server instance (website, mail server, web services endpoint, XML archive) to a client (browser, email client, news reader, special protocol client). Any service that is registered in the RING is listed in the “information services” section.

“Datasets” is a more specific term that has been defined in several ways, all of which further specify or extend the basic concept of “a collection of data”.
The RING follows the definition of “dataset” given by the W3C Government Linked Data Working Group: a dataset is “a collection of data, published or curated by a single source, and available for access or download in one or more formats”. According to the same definition, the “instances” of the dataset “available for access or download in one or more formats” are called “distributions”: a distribution is “a specific available form of a dataset. Each dataset might be available in different forms, these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed”.

Therefore, datasets in the RING are a subset of the more generic data services and comprise only the services that make a collection of data available for access or download in one or more formats (distributions). The word “access” here has a specific technical meaning: machine access at a certain address through a certain protocol, not just access through a web user interface (therefore, an online catalog search is not a dataset). In the same way, “in one or more formats” here means in one or more machine-processable formats (therefore, a downloadable Word or PDF file with a list of bibliographic citations is not a dataset).

For example, a website with a search engine where the user can interactively search and browse a collection of data is not a dataset, while the following can all be considered datasets:

  • an RSS feed reachable at a URL;
  • an XML dump downloadable via FTP or reachable at a URL;
  • a SPARQL engine that responds to a query with an RDF response;
  • an OAI-PMH target that responds to a verb call with an XML response;
  • any web service or API endpoint whose response is in a machine-readable format.
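
To illustrate what machine access means for the examples above, here is a minimal Python sketch. It builds the URL of an OAI-PMH verb call and parses a small RSS document into records. The base URL https://example.org/oai, the sample feed, and the function names are assumptions made for illustration only; they do not describe any service registered in the RING.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint, used only for illustration.
OAI_BASE_URL = "https://example.org/oai"


def oai_pmh_request_url(base_url, verb, **arguments):
    """Build the URL of an OAI-PMH verb call (e.g. ListRecords)."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# A tiny, made-up RSS 2.0 document standing in for a feed fetched from a URL.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>First record</title><link>https://example.org/1</link></item>
    <item><title>Second record</title><link>https://example.org/2</link></item>
  </channel>
</rss>"""


def rss_items(rss_text):
    """Return (title, link) pairs for every item in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]
```

In both cases a program, not a person, reads the response: the OAI-PMH URL can be requested by any HTTP client, and the RSS items come back as structured records that can be re-used directly.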

When users want to register a dataset, they must start by registering a generic information service and then, under the “Access to data” tab, fill in the information about the available distributions of the dataset.

Following the above definitions, any information service that is registered in the RING is listed among the generic “information services”, while only those services for which at least one “distribution of data” is available for access or download in one or more formats are listed among the datasets.
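
The listing rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the RING's actual implementation: the class names, field names, example services, and the set of formats treated as machine-processable are all assumptions.

```python
from dataclasses import dataclass, field

# Assumed set of machine-processable formats; the RING's own list may differ.
MACHINE_PROCESSABLE = {"rss", "xml", "rdf", "csv", "json"}


@dataclass
class Distribution:
    access_url: str   # address where the data can be accessed or downloaded
    data_format: str  # e.g. "rss", "xml", "pdf"


@dataclass
class InformationService:
    name: str
    distributions: list = field(default_factory=list)


def is_dataset(service):
    """True if the service has at least one machine-processable distribution."""
    return any(
        d.data_format.lower() in MACHINE_PROCESSABLE
        for d in service.distributions
    )


# Made-up services for illustration.
catalog = InformationService("Online catalog search")  # web interface only
feed = InformationService(
    "News feed", [Distribution("https://example.org/feed.rss", "RSS")]
)
report = InformationService(
    "Citation list", [Distribution("https://example.org/list.pdf", "PDF")]
)

services = [catalog, feed, report]  # all three are information services
datasets = [s.name for s in services if is_dataset(s)]
```

Under these assumptions all three services would appear among the generic information services, while only “News feed” would also be listed among the datasets.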

For the purposes of data sharing and re-use, and for building better information and data services, registering a service with at least one real, accessible “dataset” is far more useful than registering just a website or an interactive search engine: data in a dataset are re-usable, while data behind a search engine are not.

The RING is part of the agINFRA project (EC 7th Framework Programme, INFRA-2011-1.2.2, Grant agreement no. 283770).