Machine-actionable DMPs: What can we automate?

Following up on some initial thoughts about Scoping Machine-Actionable DMPs (maDMPs), we’re keen to dive into the substance. There are plenty of research questions we plan to explore here and over the course of our maDMP prototyping efforts. Let’s begin with these:

What can we automate?
What needs to be entered manually?

One of the major goals for maDMPs is to automate the creation and maintenance of some pieces of information.

Automation stands to alleviate administrative burdens and improve the quality of information contained in a DMP.

Thankfully, we’re not starting from scratch: Tomasz Miksa crafted an assignment for his CS students at the Technical University of Vienna (TU Wien) to build an maDMP prototype tool and answer these very questions (course details; assignment). The student reports provide valuable insights that will help guide our own and others’ work on the topic. Read on for a brief overview of the assignment and a discussion of the key results; the results are woven into answers to the questions above.

I will also note that our own project treats grant numbers as a key piece of project metadata; grant numbers were not part of this assignment. We’re currently exploring the NSF Awards API and institutional grants management systems in the context of these questions. More on this anon.

Assignment
Students were instructed to build a tool that gathers information from external sources and automatically creates a DMP. The assignment was modeled on the European Commission’s DMP requirements for Horizon 2020. Students could choose to create a DMP either when a project starts (first version upon receiving funding) or when a project ends and all products have been preserved/published (final report). For the first option, the tool should help researchers estimate their storage needs and select a proper repository to store their research outputs. For the second option, the tool should connect to services where data is stored to retrieve information for creating a DMP.

External (or controlled) sources of information included:

  1. Administrative info (researcher name, project title): Use one or both of these inputs to search the university profile system and/or ORCID API to retrieve additional info (affiliation, contact email, etc.).
  2. Find a repository (option 1): Use the OpenDOAR API or re3data API to recommend a repository based on sample data types and location (Europe, Austria).
  3. Get metadata about things deposited in a repository (option 2): Collect as much info as possible (e.g., license, format, size) from the GitHub API for software products and from OAI-PMH-compliant repositories for other products.
  4. Select a license (if not provided in step 3): Use the EUDAT license selector, reusing existing code.
  5. Preservation details: Allow users to tag all research products (e.g., input data, output data, software, documentation, presentation, etc.). Group them if appropriate. Provide a combo-box to define how long each product will be preserved (5, 10, 20 years).
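
To make step 3 concrete, here is a minimal sketch (ours, not taken from the student tools) of mapping a GitHub repository record onto a few DMP fields. It assumes the record has already been fetched from the GitHub REST API's repository endpoint and parsed into a dictionary. The function name, the output field names, and the sample record are hypothetical.

```python
from typing import Any, Dict, Optional


def product_from_github_repo(repo: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Map a parsed GitHub repository record to a few DMP-relevant fields."""
    # GitHub returns null for "license" when no license file is detected.
    license_info = repo.get("license") or {}
    return {
        "title": repo.get("full_name"),
        "url": repo.get("html_url"),
        "type": "software",
        "license": license_info.get("spdx_id"),
        "size_kb": repo.get("size"),  # GitHub reports repository size in kilobytes
        "language": repo.get("language"),
    }


# Hand-written sample shaped like a GitHub repository record (not real data).
sample_repo = {
    "full_name": "example-org/example-tool",
    "html_url": "https://github.com/example-org/example-tool",
    "license": {"key": "mit", "spdx_id": "MIT", "name": "MIT License"},
    "size": 2048,
    "language": "Python",
}

product = product_from_github_repo(sample_repo)

# When no license comes back, the tool would fall back to step 4.
needs_license_selector = product["license"] is None
```

Keeping the mapping separate from the HTTP call means the same function can be reused for records from other sources, as long as they are first reshaped into a similar dictionary.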

The final reports describe the architecture and implementation of the tool; demonstrate how it works; include both a human-readable DMP and an maDMP created with the tool; and answer some questions about the benefits and limitations of automation.

Results
The student reports emphasize that a mixture of automation and manual processes is necessary to produce DMPs that meet all of the requirements outlined by funders. They demonstrate how we can leverage automation for maDMPs and provide thoughtful analyses about how we can consume available sources of information.

Portions of a DMP that can be automated easily include:

  • Basic project details such as title, names/authors, DMP creation date
  • Information (including metadata) about the research products associated with the project (e.g., data, software…)
  • Repository details: e.g., Zenodo, GitHub for software

For other portions of a DMP, automation enables some inference but isn’t adequate by itself:

  • Licenses: can be derived from a GitHub/Zenodo link
  • Software and data preservation details: some data is given for each file; some assumptions can be made based on the repository
  • Data sharing, access, and security details: some data is given for each file; some assumptions can be made based on the repository
  • Costs/resources: estimations can be made based on the size and type of data
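
As an illustration of the last item, a cost estimate can be as simple as multiplying data volume, preservation period, and a unit price. The sketch below is ours and rests on assumptions: the function name, the `copies` parameter, and the price of 0.05 per GB per year are all hypothetical, and real repository pricing is rarely this linear. The preservation periods (5, 10, 20 years) come from the assignment.

```python
def estimate_storage_cost(
    size_gb: float,
    years: int,
    price_per_gb_year: float,
    copies: int = 1,
) -> float:
    """Rough storage cost: volume x number of copies x years x unit price."""
    if size_gb < 0 or years < 0 or price_per_gb_year < 0 or copies < 1:
        raise ValueError("sizes, years, and prices must be non-negative; copies >= 1")
    return round(size_gb * copies * years * price_per_gb_year, 2)


# Hypothetical inputs: 500 GB kept for 10 years in two copies at 0.05 per GB-year.
cost_10_years = estimate_storage_cost(500, 10, 0.05, copies=2)

# The same data set across the three preservation periods offered in the assignment.
cost_by_period = {y: estimate_storage_cost(500, y, 0.05, copies=2) for y in (5, 10, 20)}
```

An estimate like this is a starting point for a researcher to correct, which is why this item sits in the "some inference" group and not the fully automated one.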

Portions of a DMP that cannot be completed via automation:

  • Roles and responsibilities (although at TU Wien this is partially automated; they assume the project uses their infrastructure and provide details to designate individuals responsible for backups, final data deposit, etc.)
  • Licenses and policies for reuse and derivatives (complete answers must be provided manually)
  • Ethical and privacy questions
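
One way to carry the three groupings above into a tool is to record, for each DMP field, how its value was obtained, and then list the fields a person still needs to look at. This is a sketch of that idea under our own assumptions; the class names, field names, and sample values are hypothetical and do not come from the student reports.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Provenance(Enum):
    AUTOMATED = "automated"  # retrieved directly from an external source
    INFERRED = "inferred"    # derived by the tool; should be confirmed by a person
    MANUAL = "manual"        # must be supplied by a person


@dataclass
class DmpField:
    name: str
    provenance: Provenance
    value: Optional[str] = None


def fields_needing_attention(fields: List[DmpField]) -> List[str]:
    """Return names of inferred fields and of manual fields that are still empty."""
    return [
        f.name
        for f in fields
        if f.provenance is Provenance.INFERRED
        or (f.provenance is Provenance.MANUAL and f.value is None)
    ]


# Hypothetical DMP state after an automated pass.
fields = [
    DmpField("project_title", Provenance.AUTOMATED, "Example project"),
    DmpField("license", Provenance.INFERRED, "MIT"),
    DmpField("storage_cost", Provenance.INFERRED, "500.00"),
    DmpField("ethics_and_privacy", Provenance.MANUAL),
    DmpField("roles_and_responsibilities", Provenance.MANUAL, "PI handles final deposit"),
]

todo = fields_needing_attention(fields)
```

Here `todo` contains the two inferred fields plus the unanswered ethics question, which is the kind of short checklist a researcher or support staff member could work through.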

Check out this example of a human-readable landing page for the DMP produced by one student team (Rafael Konlechner and Simon Oblasser) and the corresponding JSON output for the maDMP version. Some other examples of maDMP-creation tools for both assignment options are available here (ex 1, ex 2, ex 3, ex 4, ex 5, ex 6); they’re provided as Docker containers that can be launched quickly.

Discussion
The student prototypes and some other projects in this arena (e.g., UQRDM) inform larger maDMP goals surrounding automation and maintenance/versioning (i.e., keeping info in a DMP up to date). They identify sources/systems of existing information, mechanisms (APIs, persistent identifiers) for consuming and connecting it, and some important limitations regarding the informational content that require manual interventions and enrichment.

Our own prototype is following a similar trajectory to the student assignment. We’re defining existing data sources/systems and exploring the possibilities for moving information between them. The good news is that there are lots of sources and APIs out there in the wild with implications for maDMPs. There are also lots of existing initiatives to connect all the things that could become part of an maDMP framework (e.g., Scholix, ORCIDs, OrgIDs).

By taking this approach, we want to make the creation and maintenance of a DMP an iterative and incremental process that engages all relevant stakeholders (not just researchers writing grant proposals). Researchers need guides and translators to find the best resources and do their research efficiently, and in a manner that complies with open data policies. As we noted in the previous blog post, we want to enable repository operators, research support staff, policy experts, and many others to contribute to DMPs in order to achieve good data management.

Up next
Some related questions that we’re mulling over, but won’t endeavor to answer in this post:

  • Which stakeholders and/or systems should be able to make and update assertions (in a DMP) about a grant-funded project?
  • What is required to put it all together?

A teaser for the second question: interoperability and delivery of DMP information across systems require a common data model for DMPs. You can join the RDA DMP Common Standards working group to contribute to this ongoing effort. We’ll unpack this one in a future blog post.
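
To give a feel for what a shared model buys us, here is a deliberately small sketch of a DMP serialized to JSON and read back by another system. The structure and every field name are our own illustrative assumptions, not the working group's model, and all values are placeholders.

```python
import json

# Hypothetical, minimal maDMP record; field names are illustrative only.
madmp = {
    "dmp": {
        "title": "Example project DMP",
        "created": "2018-01-01",
        "contact": {"name": "Jane Researcher", "orcid": "0000-0000-0000-0000"},
        "project": {"title": "Example project", "grant_id": None},  # grant not yet known
        "products": [
            {
                "title": "example-org/example-tool",
                "type": "software",
                "host": "GitHub",
                "license": "MIT",
                "preserve_years": 10,
            }
        ],
    }
}

# One system writes the record...
serialized = json.dumps(madmp, indent=2, sort_keys=True)

# ...and another reads it back without losing information.
restored = json.loads(serialized)
```

The point is less the format than the agreement: once systems share field names and meanings, a repository or grants system can update its own part of the record without a person retyping it.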

Thanks to Tomasz (also a co-chair of the RDA group) and his students for taking an inspirational lead in maDMP prototyping!
