Flies in your metadata (ointment) - Crossref

Quality metadata is foundational to the research nexus and all Crossref services. When inaccuracies creep in, they create problems that compound down the line. No wonder that reports of metadata errors from authors, members, and other metadata users are some of the most common messages our technical support team receives (we encourage you to continue to report these metadata errors).


This is a companion discussion topic for the original entry at https://0-www-crossref-org.library.scad.edu/blog/flies-in-your-metadata-ointment/

Thanks to all who took time to read the blog post! Was there anything about the research nexus and our explanation about how metadata (errors) gets amplified beyond Crossref into the scholarly ecosystem that was news to anyone? Reactions?

The blog post was quite good. I think part of the problem is inadequate schemas. Take page numbers as an example: part of the problem may be that the schema accepts any string of up to 32 characters. Of course some page numbers could be roman numerals, so it’s natural that you might accept a string, but it also means that something like 121-123 is not caught as an error. There is a natural tension between being as flexible as possible in allowing weird page numbers and assuming that most people will use the field in a given way. Sometimes it’s better to err on the side of a restrictive schema if you are really counting on that data for matching. In this case I suspect page numbers are diminishing in importance.
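To make the tension concrete, here is a minimal sketch comparing a permissive rule (any string up to 32 characters) with a restrictive one. The function names and the regular expressions are hypothetical illustrations, not the actual Crossref schema definitions.

```python
import re

# Hypothetical patterns for illustration; not taken from the Crossref schema.
ARABIC = re.compile(r"[0-9]+")
# Checks only that the characters are roman-numeral letters,
# not that the numeral is well formed.
ROMAN = re.compile(r"[ivxlcdm]+", re.IGNORECASE)


def permissive_ok(page: str) -> bool:
    """Loose rule: any non-empty string of up to 32 characters."""
    return 0 < len(page) <= 32


def restrictive_ok(page: str) -> bool:
    """Strict rule: a single arabic or roman-numeral page number."""
    return bool(ARABIC.fullmatch(page) or ROMAN.fullmatch(page))


for value in ["121", "xiv", "121-123"]:
    print(value, permissive_ok(value), restrictive_ok(value))
```

The range 121-123 passes the loose rule but fails the strict one. The cost of the strict rule is that any unusual but legitimate page label made of other characters would also be rejected, which is exactly the trade-off described above.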

We have struggled in collecting author names correctly, because coauthors tend to be sloppy in how they write their coauthor’s name (authors themselves are sometimes sloppy). People also change their names, and some don’t have surnames. I recommend that everyone read “Falsehoods Programmers Believe About Names” to understand how complex it is.

Titles in computer science and mathematics are also problematic, because it is common to use mathematical notation in a title, and essentially nobody uses MathML correctly (authors use TeX, and there are no reliable translators). That makes matching on titles problematic. The JATS format accepts inline mathematics in either MathML or TeX.
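As a rough illustration of why this complicates matching, the sketch below classifies a title by the kind of math markup it contains. The function name and patterns are assumptions made for this example: it looks only for tags with the `mml:` prefix and for `$...$` inline TeX, and it is not how Crossref matches titles.

```python
import re

# Assumed conventions for this sketch only: MathML tags carry the "mml:"
# prefix, and TeX math is delimited by single dollar signs.
MATHML_TAG = re.compile(r"</?mml:[A-Za-z]+[^>]*>")
# Naive: a title with two literal dollar signs would also match.
TEX_INLINE = re.compile(r"\$[^$]+\$")


def math_markup(title: str) -> str:
    """Return 'mathml', 'tex', or 'none' for the markup found in a title."""
    if MATHML_TAG.search(title):
        return "mathml"
    if TEX_INLINE.search(title):
        return "tex"
    return "none"


titles = [
    "On <mml:math><mml:mi>p</mml:mi></mml:math>-adic numbers",
    "On $p$-adic numbers",
    "On p-adic numbers",
]
for t in titles:
    print(math_markup(t), "|", t)
```

All three strings describe the same title, yet a plain string comparison treats them as different, so a matcher has to normalize the markup before comparing.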

I am particularly concerned about the schema for citations, which is pretty loose (e.g., it only supports a single author name). I suspect that in the case where a DOI is included for a citation, then that is all you need. In the case where there is no DOI, it’s important to have a schema that accurately reflects what a citation would look like. Because of this, we have decided to collect citation information in the JATS format instead of the Crossref format.

Hi @mccurley ,

Thanks for reading and your kind words. I also appreciate your follow-up comments.

I’d say that I agree with your comment about the diminishing importance of page numbers for matching, but… There were a few sentences in my blog post about this that I sought clarification on from Dominika Tkaczyk, our Head of Strategic Initiatives. She did tell me that if the pagination registered with us is incorrect, and differs from the pagination stated in the citation, the matching process has a harder job, so it is still part of our process, and, thus, important.

Yes, on the author names. I hadn’t read “Falsehoods Programmers Believe About Names,” but find it illuminating, much like I did Nguyen Tan Thai Hung’s tweets about naming conventions in Southeast Asia from earlier this year: https://twitter.com/hung_tt_nguyen/status/1503936865916919813?s=21. Thanks for sharing.

Here are some examples of DOIs with MathML in the title of the metadata record: https://0-api-crossref-org.library.scad.edu/works?query=mml:math&select=DOI,title - for those reading who want to see what you mean about the complications before us on matching with titles that include MathML or LaTeX.

I suspect that in the case where a DOI is included for a citation, then that is all you need.

Yes, that’s right.

In the case where there is no DOI, it’s important to have a schema that accurately reflects what a citation would look like.

Not sure if you have seen this, but Dominika did address some related concerns a couple of years ago in her blog post “What's your (citations') style?”. I certainly understand your concern, but I am curious about your reaction to her analysis therein.

Thanks again,
Isaac

I saw the blog post by Dominika, but that seems to refer to when people post <unstructured_citation>. In that case you need an ML classifier, but that should be a last resort. If people posted a citation that was complete and followed a well-specified schema to break the elements apart, then we wouldn’t need to be guessing on the basis of a classifier for unstructured_citation. Relying on that will often introduce errors. In most papers in mathematics, computer science and physics, a citation starts from a well-structured BibTeX entry, but gets munged into some blob of text by the BibTeX/BibLaTeX bibliography style. We shouldn’t have to recover the original metadata elements from this text blob, particularly when every journal uses its own style. As she said, she only got 94.7% accuracy on the test set. In arXiv:2006.05563 they got up to 96.3%, but I maintain that this is still an inferior approach when a well-structured schema could completely solve the problem. A good start would be to include all known authors instead of just one. The average number of authors on a scientific paper has been reported to be 4.4. We’re building a journal workflow that captures very detailed information automatically, but we have to throw a lot away to fit into the Crossref schema. You could still fall back to trying to guess from an unstructured_citation.
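The point about keeping every author can be sketched as follows. The `StructuredCitation` record and its field names are hypothetical, chosen only for this example; they mirror neither the Crossref nor the JATS schema, and the sample values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StructuredCitation:
    """Hypothetical structured citation; field names are illustrative only."""
    title: str
    year: int
    authors: List[str] = field(default_factory=list)
    doi: Optional[str] = None


def first_author_only(c: StructuredCitation) -> StructuredCitation:
    """Simulate a schema that can hold a single author name."""
    return StructuredCitation(c.title, c.year, c.authors[:1], c.doi)


# Placeholder data, not a real reference.
full = StructuredCitation(
    title="An example title",
    year=2020,
    authors=["Author A", "Author B", "Author C", "Author D"],
)
reduced = first_author_only(full)
print(len(full.authors), "authors captured,", len(reduced.authors), "kept")
```

Everything dropped in the reduction is information a matcher could have used, and it cannot be recovered afterwards without going back to the source or guessing from the formatted text.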

Hi @mccurley, thanks for this feedback (I’m the person who oversees our metadata development). I agree that matching against the robust structured metadata supported by JATS would improve things.

We don’t currently do this for a few reasons:

  • most of our matching code is fairly old and it would be a big job to update - it’s important that we do this, but we have other projects under way that need to be completed first. The existing refs markup is admittedly stale and hasn’t kept pace with the metadata and content that’s now available for references; we do intend to address that in the future.
  • we’d like to better support JATS and have a project underway to transform JATS XML to Crossref XML (which won’t address the refs issue) and would ultimately like to accept JATS as a native format - that said, JATS is journal-specific, and we need to have a solution for all types of content that is registered with us
  • not all JATS users do a good job of structuring their references, and many members are not able to structure their references at all, so focusing on unstructured matching helps us help everyone

So the short answer is yes, we’d like to do that some day and recognize the value, but that’s not where we are today. We have some members who pay close attention to the metadata they send us, and others who …don’t. I’m glad to hear that you are storing your citation metadata in JATS, as we’ll most likely be able to make use of that soon.

Also,

Yes, this is so true - the 32 character page number limit predates me, but we’ve done a lot of tweaking of character limits over the years and try to be restrictive while also supporting the wide range of metadata provided by our members. We don’t really see lengthy page numbers in our metadata; the problems that pop up are more often incorrect values like ‘1’ or ‘null’ used as a default, but I can certainly take a closer look at that.

Thanks again for your feedback,

Patricia
