Mariam R. Rizkallah

Data engineering and the challenging ecosystem of biological data

Data engineering has been called “data plumbing”, and its practitioners have described themselves as “data janitors” and “blacksmiths”. In this post, I introduce data engineering and point out where it fits in the context of biological research.

Data engineering addresses the acquisition and transformation of large collections of data to facilitate information extraction and to enable data-driven decision making. Therefore, data engineering is concerned with the procedures and tools that facilitate the flow and, consequently, the use of data. Data engineering is widely applied in fields such as business intelligence, multimedia, environment and health. Every time you build or use tools to integrate, clean or transform data to produce new data, or when you extract information from data, you “do” data engineering.

Data engineering tasks include designing, constructing and maintaining data processing systems such as classical Extract, Transform, and Load pipelines, or more complex data wrangling techniques such as cloud-based real-time data processing pipelines. This unique set of tasks makes data engineers not only integral to organizations, but also the lead in the organization's data-handling structure (i.e., the data practitioners). In an organization, the data engineers enable the data scientists to deploy their developed machine learning models and allow for mass processing of data. Data engineering tools and skills span big data, machine learning, data mining, storage (e.g., compression and caching), networking, databases, distributed computing, statistics, and software engineering. The illustration below shows a number of technologies used by a data engineer, and it distinguishes between those a data engineer and a data scientist use, and where they intersect.

Data Engineer vs Data Scientist. Source: Medium
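To make the classical Extract, Transform, and Load pipeline mentioned above more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is made up for illustration: the sample data, the table name `measurement`, and the function names are assumptions, and a real pipeline would read from files or services and handle many more edge cases.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: glucose measurements with inconsistent formatting.
RAW_CSV = """sample_id,analyte,value,unit
S1,glucose,5.4,mmol/L
S2,Glucose, 99 ,mg/dL
S3,glucose,,mmol/L
"""

# Assumption for this sketch: only glucose appears in the data.
# Glucose has a molar mass of about 180.16 g/mol, so 1 mmol/L is about 18.016 mg/dL.
MGDL_PER_MMOLL = 18.016


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without a value, normalize names and units."""
    cleaned = []
    for row in rows:
        value = row["value"].strip()
        if not value:
            continue  # skip missing measurements
        value = float(value)
        if row["unit"] == "mg/dL":
            value = value / MGDL_PER_MMOLL  # convert to mmol/L
        cleaned.append((row["sample_id"], row["analyte"].strip().lower(), round(value, 2)))
    return cleaned


def load(records, connection):
    """Load: write the cleaned records into a SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS measurement "
        "(sample_id TEXT, analyte TEXT, value_mmol_l REAL)"
    )
    connection.executemany("INSERT INTO measurement VALUES (?, ?, ?)", records)
    connection.commit()


def run_pipeline(text, connection):
    """Run extract, transform and load, then return the stored rows."""
    load(transform(extract(text)), connection)
    query = "SELECT sample_id, analyte, value_mmol_l FROM measurement ORDER BY sample_id"
    return connection.execute(query).fetchall()
```

Calling `run_pipeline(RAW_CSV, sqlite3.connect(":memory:"))` runs the three stages against an in-memory database. Keeping each stage a separate function is one possible design choice that makes the stages easy to test and replace independently.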

Biological data represent a large and challenging sector of data engineering applications, spanning environment and health. The data ecosystem in biology is very rich. Biomedical data can be collected from people through examinations, interviews or sensors (e.g., wearable devices, sequencers, images, and laboratory measurements). Data may also be obtained from other sources such as large administrative databases (e.g., health insurance databases, cancer registries). Moreover, linking these data is of major interest in biological research, for example for investigating the interaction between environmental and genetic factors and its effect on disease etiology. To be linked, the data have to be matched, pseudonymized or anonymized, and stored properly. In addition, data collected from sensors may feed the Internet of Medical Things (a.k.a. Internet of Health Things or Smart Healthcare) to monitor patients' treatment and health status. Given the ecosystems described above, we can roughly categorize biological data based on source: primary data (e.g., from experiments and research projects) and databases of routinely collected (i.e., secondary) data.
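As a rough illustration of matching and pseudonymizing records before linkage, the sketch below replaces a direct identifier with a keyed hash (HMAC-SHA256) and joins a primary dataset to a secondary one on that pseudonym. The field name `insurance_id`, the example records, the key, and the function names are all hypothetical. In practice, record linkage is usually governed by a data protection concept and often carried out by a trusted third party, and keyed hashing yields pseudonymized, not anonymized, data.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored and managed securely.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymize(identifier, key=SECRET_KEY):
    """Map a direct identifier to a stable pseudonym using HMAC-SHA256."""
    normalized = identifier.strip().upper()  # simple matching rule (an assumption)
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link(primary_records, secondary_records, key=SECRET_KEY):
    """Join two record lists on the pseudonym and drop the direct identifier."""
    by_pseudonym = {}
    for rec in secondary_records:
        pid = pseudonymize(rec["insurance_id"], key)
        by_pseudonym[pid] = {k: v for k, v in rec.items() if k != "insurance_id"}

    linked = []
    for rec in primary_records:
        pid = pseudonymize(rec["insurance_id"], key)
        if pid in by_pseudonym:
            merged = {"pseudonym": pid}
            merged.update({k: v for k, v in rec.items() if k != "insurance_id"})
            merged.update(by_pseudonym[pid])
            linked.append(merged)
    return linked


# Made-up example: a study cohort (primary data) and claims (secondary data).
cohort = [{"insurance_id": "a123 ", "bmi": 27.1}, {"insurance_id": "B456", "bmi": 22.0}]
claims = [{"insurance_id": "A123", "diagnosis": "E11"}]
linked_records = link(cohort, claims)
```

Here only the first cohort record finds a match in the claims data, and the linked record carries the pseudonym instead of the original identifier. The same key must be used for both sources, otherwise the pseudonyms will not match.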

We are delightedly aware of the speed at which different types of biological data become available, thanks to the advances in biological data acquisition systems (e.g., high-throughput platforms in genomics and metabolomics). Similarly, over the years, electronic medical records, national registries and healthcare claims databases have grown, providing information on the disease status of individuals at the national, continental and international levels. Such advancements increase both the depth and the breadth of biological data. The depth of data collected on individuals is manifested in more comprehensive profiling of individuals, including genetic variants and microbiome characteristics. As for breadth, the number of organisms sampled and sequenced is increasing as well; projects such as Oceanomics yield an amount of data comparable to that from the Human Microbiome projects.

Although storage and computational costs are decreasing, managing and analyzing biological data pose a number of challenges. On the data front, biological data are often poorly standardized, unstructured or simply high-dimensional, which challenges conventional analytical methods. Moreover, biological software development is troubled by a lack of clear requirements, and software is often developed on a single-project basis, which makes skill transfer difficult. Finally, each data source has its own challenges in data acquisition, privacy and preprocessing.

How could applying data engineering principles make a difference in processing and analyzing sequence and non-sequence-based biological data? Perhaps I could talk about that in an upcoming post. Stay tuned!

I thank my PhD supervisor Prof. Iris Pigeot, Director of the Leibniz Institute for Prevention Research and Epidemiology – BIPS, and I thank Nandan Mishra (Jacobs University Bremen; MS ’19, Data Engineering), for reviewing this post.


What is Data Engineering? - DataCamp:

Data Engineer Exam Guide - Google Cloud:

We don’t need data scientists, we need data engineers | Hacker News:

Cloud Data Engineering for Dummies - Snowflake:

The Oceanomics Project:

Principles of Data Wrangling: Practical Techniques for Data Preparation 1st ed. Rattenbury T, Hellerstein JM, Heer J, Kandel S, & Carreras C. O’Reilly Media, Inc.: Sebastopol, CA, USA, 2017.

Internet of Things for Smart Healthcare: Technologies, Challenges, and Opportunities. Baker SB, Xiang W, Atkinson I. IEEE Access 2017; 5: 26521–26544.
