A walkthrough of Orion’s backend, data and design decisions

Kostas Stathoulopoulos

In the previous post, I introduced Orion, an open-source tool for the science of science that I am developing as part of my fellowship with Mozilla. Orion can be split into three parts: a backend that includes the processes that collect and analyse research data, a search engine and a data visualisation layer. In this post, I will discuss Orion’s backend as well as some important data and design choices.

Orion’s backend

We expect Orion to be used by researchers and policymakers, whose needs and expertise can vary, so we organised its backend to accommodate both communities. Orion should be flexible, meaning that it can be easily reconfigured to work with different thematic topics and levels of analysis. Orion should also be modular so that users with technical expertise can tailor the backend by adding or removing components according to their needs. Lastly, Orion should be open, meaning that our workflow should be visible and our data decisions well-documented and accessible.

Orion’s backend comprises three parts: data collection, data enrichment and data analysis.

Data collection

The first component of Orion’s backend orchestrates the collection and parsing of Microsoft Academic Graph (MAG) data. MAG is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study. It contains more than 232M documents that cover every academic discipline. We decided to develop Orion using MAG because of its great data coverage as well as its easy-to-use, expansive API. This publication by Kuansan and his team provides a detailed description of MAG for the science of science.

Orion offers multiple entry points to MAG: users can collect academic documents by querying MAG with conferences, journals or fields of study (i.e. paper keywords). For example, we queried Orion with a journal, bioRxiv, to collect all of the papers published on its platform. In another use case, we collected all of the MAG papers containing one of the following Fields of Study: climate emergency, climate security, tipping point (climatology) and disaster risk reduction. Furthermore, researchers can enrich an existing database with additional metadata by querying MAG with the paper titles. Lastly, it is possible to bound the timeframe of the data collection, as MAG has papers from decades ago and a user might be interested only in recent research. This plurality of methods to collect MAG data reflects one of our core design principles: flexibility.
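To make the entry points more concrete, here is a minimal sketch of how a query for one of them could be assembled as a string. It is not Orion's actual code: the function name and the entry-point labels are hypothetical, and the attribute names (J.JN, C.CN, F.FN, Y) and the Composite/And syntax follow my recollection of the Academic Knowledge API query expression format, so treat them as assumptions to check against the API documentation.

```python
# Hypothetical mapping from an entry point to a MAG attribute (assumed names).
MAG_ATTRIBUTES = {
    "journal": "J.JN",
    "conference": "C.CN",
    "field_of_study": "F.FN",
}


def build_mag_expression(entity, value, year_from=None, year_to=None):
    """Build a query expression for one entry point, optionally bounded by year."""
    if entity not in MAG_ATTRIBUTES:
        raise ValueError(f"Unknown entry point: {entity}")

    # Assumption: names are matched in their lowercase form.
    expression = f"Composite({MAG_ATTRIBUTES[entity]}=='{value.lower()}')"

    # Express the timeframe as a single clause so And() always has two arguments.
    if year_from is not None and year_to is not None:
        year_clause = f"Y=[{year_from}, {year_to}]"
    elif year_from is not None:
        year_clause = f"Y>={year_from}"
    elif year_to is not None:
        year_clause = f"Y<={year_to}"
    else:
        return expression

    return f"And({expression}, {year_clause})"


if __name__ == "__main__":
    print(build_mag_expression("journal", "bioRxiv"))
    print(build_mag_expression("field_of_study", "climate emergency", year_from=2015))
```

Keeping the expression builder separate from the code that calls the API means a new entry point only needs a new key in the mapping.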

After querying MAG, the raw data files are stored on AWS S3. Then, they are parsed and stored in a PostgreSQL database. This completes the core data collection activity in Orion.

Data enrichment

The second component of Orion’s backend organises the collection of databases that enrich the MAG data. Currently, we are collecting data from the sources described below.

Location data

We geocode author affiliations using Google Places API. We use a two-step process to do the matching:

  1. Find the unique Place ID that Google assigns to every place by querying its API with the affiliation names (same as you would do with Google Maps).
  2. Query the Place ID to retrieve all of its details.

We decided to use this API mainly because of its great coverage, its scalability and its ability to match a queried name with the right place. Note that the service is not free but it is fairly cheap for small to medium-sized projects. A good alternative could be OpenStreetMap.
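The two-step lookup described above could be sketched as follows. This is an illustration rather than Orion's implementation: the function names are hypothetical, the requested fields are an arbitrary choice, and the two endpoints are the Find Place and Place Details web service URLs as I understand them, so check them against Google's documentation. The fetch parameter is there so the logic can be tried without a network call or an API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoints of the Places web service.
FIND_PLACE_URL = "https://maps.googleapis.com/maps/api/place/findplacefromtext/json"
DETAILS_URL = "https://maps.googleapis.com/maps/api/place/details/json"


def http_get_json(url, params):
    """Default fetcher: GET a URL with query parameters and decode the JSON body."""
    with urlopen(f"{url}?{urlencode(params)}") as response:
        return json.load(response)


def geocode_affiliation(name, api_key, fetch=http_get_json):
    """Return the place details for an affiliation name, or None if no match."""
    # Step 1: find the Place ID that matches the affiliation name.
    found = fetch(
        FIND_PLACE_URL,
        {"input": name, "inputtype": "textquery", "fields": "place_id", "key": api_key},
    )
    candidates = found.get("candidates", [])
    if not candidates:
        return None
    place_id = candidates[0]["place_id"]

    # Step 2: query the Place ID to retrieve its details.
    details = fetch(
        DETAILS_URL,
        {"place_id": place_id, "fields": "name,geometry,formatted_address", "key": api_key},
    )
    return details.get("result")
```

Since the service is billed per request, it is worth geocoding each unique affiliation name once and reusing the result.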

Gender data

We infer the authors’ gender using the GenderAPI. Its database contains more than 2M validated names from 177 countries that are collected from publicly available governmental sources and combined with data crawled from social networks. Each name has to be verified by different sources and the API provides two confidence parameters, namely the number of samples and their accuracy. A review of name-to-gender inference systems suggested that the GenderAPI is overall the best performing Python service. However, it also revealed that its performance is not as good with Asian names as with European ones.

Orion removes any authors without a complete first name before sending the names to the API. For any downstream tasks, such as producing gender diversity indicators, we remove matches with an accuracy below 70%.
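The two filters above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, "complete first name" is interpreted here as a first token that is longer than one letter and is not an initial, and each match is assumed to be a dictionary with an "accuracy" key holding a percentage.

```python
def has_full_first_name(full_name):
    """Return True if the first token looks like a full first name, not an initial."""
    tokens = full_name.strip().split()
    if not tokens:
        return False
    # Drop a trailing dot and allow hyphenated names such as "Jean-Pierre".
    first = tokens[0].rstrip(".")
    return len(first) > 1 and first.replace("-", "").isalpha()


def keep_confident_matches(matches, min_accuracy=70):
    """Keep only the inferred matches whose accuracy is at least min_accuracy."""
    return [match for match in matches if match["accuracy"] >= min_accuracy]


if __name__ == "__main__":
    authors = ["Kostas Stathoulopoulos", "K. Stathoulopoulos", "J.-P. Dupont"]
    # Only names with a full first name would be sent to the API.
    print([name for name in authors if has_full_first_name(name)])
```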

We should also highlight that inferring gender from names assumes that gender identity is both a fixed and a binary concept. This does not reflect reality, as an individual might identify with a different gender from the one assigned at birth. Nevertheless, we decided to include this dimension in Orion since we believe it’s better to provide a partial view of this important issue than to disregard it. For those who are sceptical about it, it is possible to use Orion without inferring the authors’ gender.

World Bank indicators

We collect indicators from the World Bank to give users more context when they examine research at the country level. We use pandas-datareader, a Python package that provides access to economic databases, since it lets users collect indicators by querying their unique codes. Orion currently collects the following country-level indicators:

Country details

We provide additional information about countries using the restcountries API. This includes metadata such as the country code, continent, subregion and population of a country.

Lastly, the restcountries API, Google Places API and the World Bank use slightly different naming conventions for countries. Orion homogenises these country names so that the collected data can be combined for analysis and data visualisation.
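One simple way to homogenise country names is a lookup table that maps each known variant to a single canonical name. The sketch below assumes that approach; the alias table and the function name are hypothetical and the entries are only illustrative, not the mapping that Orion uses.

```python
# Illustrative aliases only: variant spellings (lowercased) -> canonical name.
COUNTRY_ALIASES = {
    "united states of america": "United States",
    "usa": "United States",
    "russian federation": "Russia",
    "korea, rep.": "South Korea",
    "korea (republic of)": "South Korea",
}


def homogenise_country(name):
    """Map a country name from any source to one canonical spelling."""
    cleaned = name.strip()
    # Normalise case and repeated whitespace before the lookup.
    key = " ".join(cleaned.lower().split())
    # Names without a known alias are returned unchanged.
    return COUNTRY_ALIASES.get(key, cleaned)


if __name__ == "__main__":
    for raw in ["Russian Federation", "Korea, Rep.", "Greece"]:
        print(raw, "->", homogenise_country(raw))
```

Applying the same function to every source before joining the tables keeps the join keys consistent.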

Data analysis

The last component of Orion’s backend is focused on creating new features from the data we collected previously.

Level of analysis

We want Orion to be useful not only as a data collection and enrichment system but also as a tool that enables users to measure research activity and make comparisons between countries, institutions or authors. One of the first questions we had to answer was “What’s the right level of analysis?”

To answer this question, we decided to break it down into (1) time, (2) thematic topics and (3) entities and geography. We opted for examining publications at the country level and on an annual basis, while we leveraged MAG’s Fields of Study taxonomy to create a set of topics that are granular enough to make meaningful comparisons and broad enough to capture the diversity of the research topics in the data. This resulted in 64 topics including Artificial intelligence, Bioinformatics, Genomics, Neuroscience and Immunology.

Metrics

Orion contains metrics that show the topic-level similarities and differences between entities (in this case, countries) and how they have evolved through time. In detail, we measure the following for each year and topic:

  • The research specialisation of a country and how it compares with the rest by calculating its revealed comparative advantage.
  • The research interdisciplinarity of a country by recursively collecting all of the child Fields of Study of the topics we identified and measuring the Shannon-Wiener and Simpson diversity indexes.
  • The country-level gender diversity.
  • The country-level semantic similarity based on the publication content. In detail, we encode paper abstracts into high-dimensional vectors using Google’s pre-trained Universal Sentence Encoder model. Then, for each topic, we average the abstract vectors to create a country vector and generate an index with FAISS, which we use to measure country similarity.
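The first two metrics in the list above have compact definitions, so a small sketch may help. It assumes the standard Balassa form of revealed comparative advantage (a country's share of papers in a topic divided by the global share of papers in that topic), the Shannon index as -Σ p·ln(p) and the Simpson index in its 1 - Σ p² form. Which variant of the Simpson index Orion reports is an assumption here, and the function names and input layouts are hypothetical.

```python
import math


def revealed_comparative_advantage(counts, country, topic):
    """RCA for one year; counts maps (country, topic) -> number of papers."""
    country_topic = counts.get((country, topic), 0)
    country_total = sum(n for (c, _), n in counts.items() if c == country)
    topic_total = sum(n for (_, t), n in counts.items() if t == topic)
    grand_total = sum(counts.values())
    if country_total == 0 or topic_total == 0:
        return 0.0
    # Country's share of its output in the topic over the global share of the topic.
    return (country_topic / country_total) / (topic_total / grand_total)


def shannon_index(field_counts):
    """Shannon diversity of a {field of study: paper count} mapping."""
    total = sum(field_counts.values())
    if total == 0:
        return 0.0
    shares = [n / total for n in field_counts.values() if n > 0]
    return -sum(p * math.log(p) for p in shares)


def simpson_index(field_counts):
    """Simpson diversity (1 - sum of squared shares) of a {field: count} mapping."""
    total = sum(field_counts.values())
    if total == 0:
        return 0.0
    return 1 - sum((n / total) ** 2 for n in field_counts.values())


if __name__ == "__main__":
    counts = {("GR", "ai"): 2, ("GR", "bio"): 2, ("UK", "ai"): 1, ("UK", "bio"): 3}
    print(revealed_comparative_advantage(counts, "GR", "ai"))
    print(shannon_index({"genomics": 1, "immunology": 1}))
```

An RCA above 1 means the country publishes relatively more in that topic than the world does on average, while both diversity indexes grow as a country's papers spread more evenly across fields.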

Other features

Apart from the metrics, Orion’s backend contains other features that can be used in research. For example, it enables researchers to draw a topic-specific country collaboration network to examine how the links between countries change over time. Moreover, Orion separates industry from non-industry affiliations, and it enables users to visualise the semantic similarity of papers by projecting their high-dimensional vectors onto a 2D or 3D space (this feature is also used in Orion’s front-end).
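As an illustration of the collaboration network, the edges for one topic and year can be built by counting how often two countries appear on the same paper. This is a minimal sketch, not Orion's code: the function name and the paper schema (dictionaries with "topic", "year" and "countries" keys) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def country_collaboration_edges(papers, topic, year):
    """Count co-authorship links between countries for one topic and year."""
    edges = Counter()
    for paper in papers:
        if paper["topic"] != topic or paper["year"] != year:
            continue
        # Deduplicate and sort so that each pair has one canonical ordering.
        countries = sorted(set(paper["countries"]))
        for pair in combinations(countries, 2):
            edges[pair] += 1
    return edges


if __name__ == "__main__":
    papers = [
        {"topic": "ai", "year": 2019, "countries": ["UK", "GR", "UK"]},
        {"topic": "ai", "year": 2019, "countries": ["GR", "UK", "US"]},
    ]
    print(country_collaboration_edges(papers, "ai", 2019))
```

Running the function once per year gives a sequence of weighted edge lists, which is enough to see how links between countries change over time.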

Conclusion

Orion is an ever-changing pot of experimentation and blue-sky ideas that is being progressively consolidated as a flexible and modular tool. It is supported by Mozilla and developed by me and Zac Ioannidis. Orion is still in development and might change in the future as we add new features and modify old ones. We will update you on any major changes.

In the next posts, I will walk through how we use Airflow, a workflow management tool, to orchestrate Orion’s backend and discuss how Orion’s semantic search works.

Acknowledgements

Lilia Villafuerte, an amazing HCI researcher and Interface Designer, is helping us to make Orion not only visually pleasing but also useful.

One feature that is available as an option but is not currently used is querying MAG with paper titles. Joel Klinger developed this feature for Nesta.

If you have any questions about our work or would like to collaborate, send me an email at kostas@mozillafoundation.org.
