Interview with Dacheng Liu
Today, I interview Dacheng Liu from Boehringer Ingelheim. We talk about the opportunities open to data scientists in the pharmaceutical industry and cover the following points:
- How did you get into data science and what does “data science” mean to you?
- What are typical data science methods that we are not yet using enough?
- Which areas do you think currently have the biggest potential to benefit from such methods?
- Which case studies do you have in mind?
- Which areas will become important in the future in terms of data science?
Dacheng Liu is the Global Head of Clinical Data Sciences at Boehringer Ingelheim, with 16 years of experience in the pharma industry. He leads a global team of 230 clinical data scientists, including statisticians and programmers, which focuses on activities across the drug development life cycle, including early and late clinical development, medical affairs, and real-world data applications. At BI he has led early- and late-phase projects in multiple disease areas, including several landmark studies. He has experience with various regulatory submissions and FDA advisory committee meetings. He also led SOP process harmonization and the standardization of statistical methodologies within the company. He represents BI on industry-wide working groups, such as the PhRMA clinical development working group. He has over 40 publications in the areas of clinical research, statistical methodology, and machine learning.
Alexander: You’re listening to The Effective Statistician Podcast, a weekly podcast with Alexander Schacht, Benjamin Piske, and Sam Gardner, designed to help you reach your potential, lead great science, and serve patients without becoming overwhelmed by work. Today, we are talking about the opportunities for data science in the pharmaceutical industry. A really hot topic, so stay tuned for this.
There are already lots of other episodes about data science, and we’ll have a couple more methodological episodes in this area coming up later this year, so there are lots of other things you can learn about data science.
I’m producing this podcast in association with PSI, a community dedicated to leading and promoting the use of statistics within the healthcare industry for the benefit of patients. Join PSI today to further develop your statistical capabilities with access to the video-on-demand content library, free registration to all PSI webinars, and much, much more. Head over to psiweb.org to learn more about PSI activities and become a member today.
Welcome to another episode of The Effective Statistician. Today, I’m talking with Dacheng Liu from Boehringer-Ingelheim. Hi! Happy to have you on the podcast.
Dacheng: Hi, Alex. Thanks for having me.
Alexander: Very good. You work in an organization that very much emphasizes the use of Data Science and Data Science Applications in the Pharmaceutical Industry. How did you get into this area of data science? And what does Data Science mean to you?
Dacheng: Thank you. Great question, and it makes me reflect on how I ended up where I am now. If I think back to my college days, I studied mathematics as an undergraduate when I was in China. After that, I spent a few years doing my master’s in the area of information systems. These two are closely connected, you know: information systems and what we nowadays call information technology and data science are all connected. I remember, as part of the requirements to get my master’s degree, I had to study data structures and algorithms.
Alexander: What are data structures and algorithms? What does that mean?
Dacheng: Well, it’s basically, you know, the types of data you would analyze in a computer system, and then the typical algorithms, such as sorting, those kinds of things.
Alexander: Okay, okay.
Dacheng: Like a tree structure in a computer system and how things are done based on that data structure. So there are various algorithms attached to it. In my case, I actually wrote a paper while doing my master’s. It used a simple statistical methodology to do sorting. I still remember, it’s basically about when you encounter a large data set and the distribution is usually unknown, right? Then your sorting algorithm will suffer. So I had to turn that unknown distribution into something uniform. Essentially you apply a transformation based on your cumulative distribution function, turning the data into something uniform, and that makes things so much faster. After that I went to the U.S. to study statistics. You can see the connection, right? After I got my PhD in statistics, I did a postdoc at the University of Rochester, studying mathematical modeling of the immune system. And after that, I joined Boehringer Ingelheim. I always had a very strong interest, when I was doing my PhD, in the area of computing, you know, Bayesian computing and model selection in particular. So that was an area of very strong interest for me, and all these things are really connected in terms of data science.
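As an illustration of the sorting idea described here, the following is a minimal sketch, not the method from Dacheng's paper. It assumes the unknown distribution is estimated from a random sample (an empirical CDF), maps each value through that estimate so the transformed values are roughly uniform on [0, 1], and then distributes them into equal-width buckets. The function name `cdf_bucket_sort` and its parameters are hypothetical.

```python
import bisect
import random


def cdf_bucket_sort(values, n_buckets=None, sample_size=256, seed=0):
    """Sort values by mapping them through an estimated CDF into near-uniform buckets."""
    values = list(values)
    n = len(values)
    if n < 2:
        return values
    n_buckets = n_buckets or max(1, n // 4)
    rng = random.Random(seed)
    # Estimate the unknown distribution from a sorted random sample (empirical CDF).
    sample = sorted(rng.sample(values, min(sample_size, n)))

    def ecdf(x):
        # Fraction of sample points <= x: non-decreasing in x, always in [0, 1].
        return bisect.bisect_right(sample, x) / len(sample)

    buckets = [[] for _ in range(n_buckets)]
    for x in values:
        # After the transform the values are roughly uniform, so buckets fill evenly.
        idx = min(int(ecdf(x) * n_buckets), n_buckets - 1)
        buckets[idx].append(x)

    result = []
    for bucket in buckets:
        bucket.sort()  # each bucket stays small when the CDF estimate is reasonable
        result.extend(bucket)
    return result
```

Because the estimated CDF is non-decreasing, a smaller value never lands in a later bucket than a larger one, so concatenating the sorted buckets yields a fully sorted list even when the estimate is rough. A poor estimate only costs speed, not correctness.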
Alexander: Yeah. If you would give a definition of what Data Science is, what, what would that be? Maybe you would say, who is actually a data scientist? What’s the job of a Data Scientist?
Dacheng: Okay, I think you may get many different answers from different people. I have to say that there are really two papers that influenced me a lot in this area. One is Leo Breiman’s “Two Cultures” paper, which is very well known. The other is by David Donoho, a Stanford professor, who wrote a paper called “50 Years of Data Science”. I tend to agree with the definition David Donoho provides in that paper. He basically says data science is just the science of learning from data, doing whatever you need to learn from data. People often associate data science with big data, or with AI and machine learning, but I think those are just some of the many aspects of data science.
Alexander: Yeah, I think there are lots of different aspects to data science, you know. It starts with organizing your data, which some people call data management, but not data management in the way we think about clinical trial data management. It is a much broader view of how you organize all the data in your company, govern it, and things like this. And it goes all the way up to the point where you get real information out of it and communicate this information, right?
Dacheng: I mean, it’s very broad, as you said. In Donoho’s paper, he also lists various aspects: what you said about data manipulation, data exploration, visualizing the data to find some patterns, then trying to build a model and compute with the data. And finally, once you have something available, you have to be able to interpret the results, right? There is an interactive element: we constantly discuss the results with our clinical development team, physicians, etc. So there are just many, many aspects to this.
Alexander: So if we have this very broad definition of data science, then pretty much everything is a data science method: a kind of computing, a means of doing a test, all these different things are basically data science methods. If you look beyond the typical things that we do, like mixed models, logistic regression, linear models, things like this, what would be the more advanced data science methods that you see us using currently?
Dacheng: I think it’s really difficult to isolate the methods from the kind of question you’re trying to answer. To me it is always the question first, and then how you can address the question using the data at hand. Some data may be simple, say you have some univariate data. Sometimes you have high-dimensional multivariate data, which can be complex. And you mentioned some of the typical methods we are using, let’s say in clinical research, right? Linear models, logistic regression. So I think this goes back to Leo Breiman’s “Two Cultures” paper. In drug development, historically, and especially in clinical development, we really care about inference, right? At the end of the day you want to say, okay, treatment A is better than treatment B. You have a p-value less than 0.05 in the phase 3 study, so your drug is going to be submitted and get approved, right? So it’s more about inference, while in Leo Breiman’s paper he describes two ways of thinking about things: one is around inference, the other is prediction. Generally speaking, in clinical trials, I think we do a lot less prediction than inference. Once you get into the complexity of prediction, where you try to link all kinds of data sources, that’s where more complex methods are used. Well, I wouldn’t say they’re more complex; sometimes they’re just less familiar to statisticians with traditional training in an inference framework.
Alexander: It could be just a logistic regression where you have some kind of model selection and maybe some.
Dacheng: Yeah, you can turn logistic regression into a neuron and call the loss cross-entropy, for example, right?
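The equivalence mentioned here can be sketched in a few lines: a single "neuron" with a sigmoid activation, trained by gradient descent on the cross-entropy loss, is fitting a logistic regression, because the mean cross-entropy is the negative mean Bernoulli log-likelihood. This is a minimal sketch on simulated data; the data-generating model, learning rate, and function names are assumptions for illustration only.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cross_entropy(y, p, eps=1e-12):
    # Mean binary cross-entropy = negative mean Bernoulli log-likelihood.
    return -sum(
        yi * math.log(pi + eps) + (1 - yi) * math.log(1 - pi + eps)
        for yi, pi in zip(y, p)
    ) / len(y)


def train_neuron(x, y, lr=0.5, epochs=2000):
    """Fit p = sigmoid(w * x + b) by gradient descent on the cross-entropy loss."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        p = [sigmoid(w * xi + b) for xi in x]
        # Gradients of the mean cross-entropy with respect to w and b.
        grad_w = sum((pi - yi) * xi for pi, yi, xi in zip(p, y, x)) / n
        grad_b = sum(pi - yi for pi, yi in zip(p, y)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Simulated data (hypothetical): true log-odds of response are 2 * x - 1.
rng = random.Random(0)
x = [rng.uniform(-3, 3) for _ in range(500)]
y = [1 if rng.random() < sigmoid(2 * xi - 1) else 0 for xi in x]

w, b = train_neuron(x, y)
loss = cross_entropy(y, [sigmoid(w * xi + b) for xi in x])
```

The fitted `w` and `b` play the role of the slope and intercept of a logistic regression. What differs between the two cultures is mostly the vocabulary and what is done with the fit afterwards (inference on coefficients versus predictive performance).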
Alexander: Yeah. Prediction is one area: predicting which people will respond, which people will get a side effect, these kinds of questions. How about approaches where you have lots of data and lots of endpoints and you need to reduce dimensionality? Where would these be important? We see more of these dimension-reduction problems coming up.
Dacheng: Yeah. So, let’s take one step back and think about a typical clinical trial, right? Phase 2 or phase 3. Usually what we call the baseline is really typical demographic data and some baseline efficacy values, in conjunction with very strict entry criteria, right? If I think about our clinical trials, usually we apply, I don’t know, a dozen or 20 entry criteria.
Dacheng: But the patients also tend to be rather homogeneous.
Dacheng: So, in this rather homogeneous population, if you think about the feature space when you want to apply machine learning or whatever, the feature space tends to be feature-poor. You know, we’ve done exercises where we tried to do some disease modeling using our combined clinical trial database, just based on baseline data and some efficacy measures at baseline. It’s usually pretty challenging to create a model out of such a data set that is reasonably predictive. So that’s hard, which means we really have to augment the typical data with some feature-rich data set, right? In some early phase trials, you actually measure a lot of things like biomarkers, sometimes we measure genes, so we measure all kinds of things, which gives us feature-rich data to work with, okay? However, when you have a feature-rich data set in early phase, the constraint is the amount of data you have.
Alexander: You have very few patients, but for these very few patients, you know quite a lot.
Dacheng: Right, think about your typical phase 1 dose-finding study, right? You have a few cohorts, each cohort with three to six patients, so you have a very small data set, yet it’s feature-rich. In addition, the other challenge is that early phase trials tend to be short-term, right? You only follow patients for maybe a couple of months and that’s it. And sometimes what you really care about is the long-term endpoints. So there’s a huge issue, I think, in terms of how to utilize early phase data. How do you really translate that into something that’s predictive of the future? So just going back to the question, I think maybe one of the possibilities is that somehow we have to augment the early phase data. Let’s say we augment it with real-world data. That might be one possibility.
Alexander: If you speak about real-world data, is that, for example, data that is collected through wearables or things like that?
Dacheng: I think that’s part of the so-called real-world data. If you look at the FDA’s guidance document, that’s considered real-world data as well. What I was thinking about was along the lines of the typical electronic health records, reimbursement data, those kinds of data. I think the benefit of this data is that you can have long-term follow-up of certain patients. In addition, nowadays there’s the possibility of integrating lots of lab data, so you can get something that’s really comparable to an actual clinical trial. And there’s a huge volume, right? There are a lot of patients. However, there are so many issues with real-world data.
Alexander: Yes. It’s really interesting that for certain disease areas, the amount of data you can capture increases dramatically. If you think about, for example, diabetes, where you have ongoing blood glucose measurements, you basically get thousands of measurements a day. Or you’re interested in movement and movement disorders, and you would measure these kinds of things with wearables. You know, these kinds of topics.
Dacheng: Yes, I think that’s another aspect. What you’re talking about is related to digital endpoints, right? As you mentioned, with continuous glucose monitoring you can actually measure the data on a continuous basis. So you get this kind of high-frequency data, so to speak. Certainly, you have missing data issues, yet the hope is that by increasing the frequency you can somehow boost the signal, right? That’s the hope. I think it depends on the kind of indication you study. In some indications, I think this high-frequency measurement can be beneficial. In other situations, you also have to wait for a while to see any kind of benefit show up. For example, in the area of dermatitis, I think there are companies which are really measuring the nighttime scratching behavior of patients. These patients get itchy at night. Certainly, if you have a device that can really capture that, that’s beneficial. It’s much better than a PRO instrument where you ask a patient to recall, how was your sleep two weeks ago?
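The hope that more frequent measurement boosts the signal can be made concrete with a small calculation. This is a sketch under simplifying assumptions that are not from the interview: each patient contributes $n$ measurements $Y_1, \dots, Y_n$ with common mean $\mu$, common variance $\sigma^2$, and equal pairwise correlation $\rho \ge 0$.

```latex
\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i, \qquad
\operatorname{Var}(\bar{Y}) = \frac{\sigma^2}{n}\bigl[1 + (n-1)\rho\bigr]
\;\xrightarrow[n \to \infty]{}\; \rho\,\sigma^2 .
```

With independent measurements ($\rho = 0$) the variance of the patient-level average shrinks as $\sigma^2 / n$. With correlated measurements it levels off at $\rho\,\sigma^2$, so under these assumptions more frequent sampling helps only up to a point.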
Alexander: Yeah. Or questionnaires that you need to fill in every morning, things like that.
Dacheng: Exactly. And there you have this variability issue, right? With the PRO, you have the issue of what your MCID is, by the way. So there are all kinds of subjectivity and variability attached to the typical, you know, PRO instruments. If you can substitute something that can be objectively measured, that’s much better.
Alexander: Yeah, it’s really interesting. The other part, if I think about dermatology indications, is imaging data. Instead of measuring, for example, psoriasis via a questionnaire that, let’s say, a physician or the patient fills in, you could take pictures of the plaques. That would also be completely different. I’m sure there are other indications where measurement with imaging data is even more important and gets used more and more.
Dacheng: Yeah, I think so. If you look outside the healthcare environment, just generally at computer vision and machine learning, I think we’re seeing really rapid development in the area of computer vision, right? In terms of image analysis, we’re doing so much better with our cell phones, with portable devices. In terms of disease diagnosis, there are publications, some in major journals, showing that a computer-based system can already meet or even beat experts’ assessments. So I think there’s definitely a lot in that area.
Even within my company, in my group, last year we had a company-wide competition using baseline images of patients who have IPF, idiopathic pulmonary fibrosis. The teams tried to use the images, the scans of those patients, to predict disease progression. So again, if we go back to what we discussed, we try to augment the baseline data with something that’s really feature-rich.
Alexander: Yeah. I think that is a really interesting topic. If you have these image data at baseline, they help both in predicting who will respond and in narrowing down which are real cases in terms of the diagnosis. In some areas there’s a lot of potential for misdiagnosis, and if you can exclude those patients who actually don’t have the disease but just look a little bit like they do, that truly improves things quite a lot, including the efficacy.
Dacheng: Exactly. So with imaging data, one major benefit is that it’s non-invasive, right? Let’s say we are studying NASH, the liver disease. The typical way of doing that is you have to do a biopsy, which is not really pleasant for anybody.
Alexander: Nope. Quite invasive.
Dacheng: Quite invasive, right? So if there’s a way that you can somehow leverage the power of images to create some kind of validated endpoint as a substitute, or perhaps even as a surrogate endpoint, that would be a major benefit to the patients.
Alexander: Yeah, it would be a major benefit to research overall because you can do things much faster. So much easier from a safety perspective.
Dacheng: Absolutely. I think Pfizer has a group that is doing some kind of study in that direction in NASH.
Alexander: Yeah, that’s really interesting. What other areas are there where this kind of machine learning and more advanced technologies can help you make better sense of data?
Dacheng: I think there are a couple of things. One is, as we said, there may be a scientific question we need to answer. That’s one thing. The other is, from the operational perspective, we can introduce certain measures to improve quality, and we can introduce methods that automate things. That’s another possibility. I think we’ve been talking mainly about scientific questions around the disease, those kinds of topics, but think about even running clinical trials, right? Are there ways to improve patient recruitment? That’s one possibility. Is there a way that we can assess our entry criteria? So maybe based on the data we can say, okay, this criterion we can relax, and we may not need all 20, 30, or 40 inclusion criteria. We can boil it down to 20 or 15. That may improve recruitment and help the patients.
There was actually an article this year, written by, I think, a Stanford group, in a major journal on this topic, about how to assess entry criteria for oncology patient recruitment. And that’s pretty interesting. Then there is Dooti Roy in my group, and she talked about AiCure, right?
Alexander: Yeah, I had her on the podcast. Yeah, this is cool.
Dacheng: Yeah, exactly. AiCure is this company which essentially uses a camera to see if a patient is truly swallowing the drug, and then based on that assigns some kind of compliance score. What Dooti and her team did was to build a model. Based on the AiCure data you can say, okay, if I look at the data from this patient from the past four weeks, or three weeks, or two weeks, we see a trend that this patient may not be compliant. Then maybe there’s an opportunity to inform the physician or the site and introduce an intervention to help the patient with compliance. So again, that’s about operational efficiency.
Another thing I could think of concerns data management, biostatistics and data management. As we said earlier, we have this SDTM stuff. As an insider, you know, it can be a little bit complicated and error-prone. Sometimes we have quality issues. So I think there are also possibilities there. Let’s say, can we do something automatic, maybe use natural language processing to automate this process and also reduce the errors?
Alexander: That is interesting. So you basically use natural language processing. For example, look for similarities and things like this.
Alexander: So differences in spelling, English spelling variants, or simply a typo somewhere, all these kinds of different things you can sort out up front, instead of in the past where you could just say “is equal to” or “is not equal to”.
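The contrast between exact comparison and similarity-based matching can be sketched with Python's standard `difflib` module. This is a minimal illustration, not a production coding tool: the term list `STANDARD_TERMS`, the helper names, and the 0.8 cutoff are all hypothetical choices, and real term mapping would use a proper dictionary and validation.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical list of standardized terms, for illustration only.
STANDARD_TERMS = ["HEADACHE", "NAUSEA", "DIARRHOEA", "DIZZINESS"]


def similarity(a, b):
    # Case-insensitive similarity in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_term(raw, terms=STANDARD_TERMS, cutoff=0.8):
    """Return the closest standard term, or None if nothing is similar enough."""
    matches = get_close_matches(raw.upper(), terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Exact comparison treats a typo or spelling variant as a mismatch...
exact_hit = "headach" == "HEADACHE"    # False
# ...while similarity-based matching tolerates it.
typo_hit = map_term("headach")         # "HEADACHE"
variant_hit = map_term("diarrhea")     # "DIARRHOEA"
no_hit = map_term("fracture")          # None
```

The cutoff trades false matches against missed matches, so in practice borderline cases would be routed to a human reviewer rather than mapped automatically.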
Dacheng: Right. You can even look up at the sky and say, okay, can I have NLP that helps the machine to understand the protocol? Somehow that can also help you with a lot of things downstream and make things more automatic.
Alexander: Natural language processing. I think it is a really big area because if you think about it, lots of the data that we have is text. Yeah.
Alexander: If you think about what physicians write down in their notes, it’s text. Case reports and ranked data, there’s a lot of text in there. And if you think about, let’s say, real-world data and all the social media stuff, and you want to better understand certain trends out there, for example whether a drug name is associated with something that happens on social media, all these kinds of different things become really, really interesting.
Dacheng: Yeah, that’s right. So in the safety space, certainly, as you said, right? Can you deploy a bot that can really help you extract any safety signal for your drug? And in a typical clinical trial, think about the narratives. We have to manually prepare the narratives, and it’s mainly text, too. I think that’s also where NLP may play a role. And if we are talking about real-world data, there’s very rich information in physicians’ notes, which unfortunately is usually not available to pharma. But given the richness of the information, I know that some companies do NLP: they extract the information from the unstructured notes and convert it into some kind of structured field. That’s also pretty useful if you think about some of the challenges we face with real-world data. A very simple example: in real-world data, you have disease coding, right? The ICD-10 coding. But oftentimes the coding is incorrect. There’s research showing that the coding itself can be 40 percent, 50 percent incorrect, even for a term like diabetes. So we actually work with a group at Harvard. They developed methods to extract information from physicians’ notes using NLP, and then they have an actual physician review the information to provide the right coding. And then you can build a model, right? Once you have labeled data, you can build a model to get better coding than ICD-10.
Dacheng: Yeah, so even in my team, we were running a clinical trial in patients with borderline personality disorder, BoPD, which is also challenging for us if you just use ICD codes. So we actually worked with a physician at Mount Sinai. She did some case reviews for us, about 200 to 300 cases, and she would label each case as highly likely BoPD or unlikely, on a scale from 1 to 5. Then our team built a model, in conjunction with some of the labels in the real-world data, to have some kind of screening tool. We can then deploy the tool at the investigational site and provide recommendations to the physicians about which patients are possible BoPD patients. So I think there are a lot of things to explore in this space.
Alexander: Yeah, absolutely. So there are a couple of areas that I’m sure we were not really able to leverage in the past, or maybe just in a descriptive way or in listings, that in the future will be much, much easier to tackle. Speaking about the future, what do you think are the areas where we can do much more because we have more data, better data, and more advanced methodology to solve questions? I know none of us has a crystal ball, but based on the trends that you currently see, what would be your predictions, your gut feelings?
Dacheng: Okay, I mean, if we go beyond clinical development and look at drug discovery as a whole, I think there has been a lot of movement in the area of drug discovery. We all know DeepMind’s AlphaFold, right? It can really predict the protein structure based on the sequence of amino acids. So I think in the drug discovery space, AI is going to play a more and more important role. Now, going to clinical drug development, I think at the moment we are still constrained by the challenges I mentioned earlier, right? For some large studies, we don’t necessarily have feature-rich data, and for some early phase studies, we have feature-rich data, but the data size is too small. There’s another part: if you identify some signal, let’s say you have a first-in-class drug and in the early phase you see some signal in some biomarkers, then you need to validate it. For validation, you have to run another study. And we all know how expensive and how time-consuming it is to generate another piece of data.
Dacheng: So these are really the challenges. Nevertheless, as we talked about, there are some new types of data, such as imaging data and multi-omics data, which is also high-dimensional. There could be some data which is high-frequency. These types of data do introduce new opportunities for us. In the space of digital endpoints, which we talked about, I think there are definitely opportunities. Just one recent example: I think Merck submitted their phase 3 program for, I think, a cough medication based on the measurement of the sound of coughing through a device. That was submitted, I think, earlier this year. And I mentioned the nighttime scratching behavior for an atopic dermatitis indication.
Dacheng: So it’s coming, in a way. And what we don’t see as often, because it’s not publicized enough, is really the early phase space, where this kind of data can help us make a decision: either to continue, which means substantial investment down the road, or to stop early, which I think is equally important.
Alexander: Yeah. I think the biggest gains will probably be before phase 2, or maybe leading up to phase 2, and after phase 3, because after phase 3 you can collect much more data in the field. I’m pretty sure that in the future we will have much more data collected by patients when they actually take these medications, and that some companies will use these data to help patients stay on treatment, optimize the dose, manage side effects, and all kinds of different things. Maybe even alert relatives in case anything is happening. So I think there are a lot of things that will happen in this space in the future. Just think about what we are already measuring with devices on our bodies. I think that will only increase over the years, because there are a lot of symptoms and a lot of diseases we can avoid or treat better if we detect them early. I’m pretty sure that in a couple of years, I don’t know when, we will have lots of monitors for typical things on our bodies. I would guess.
Dacheng: Yeah, I agree. Just think about all the functionalities of your Apple Watch, right?
Alexander: Yeah, all cardiovascular topics. Yeah, I’m sure it will be very easy. There’s lots of things that are rather easy to measure. Yes, which will be important.
Dacheng: Yeah. And just to tack on to what you said about post-approval data collection, right? If you have access to real-world data post-approval, there you also have the opportunity, as you said, to provide some kind of individualized treatment solution, right? Think about the space of diabetes. Let’s say your company has the luxury of having all the classes of drugs, and if the database is large enough, then maybe you can even develop an algorithm to say “this patient should follow this kind of treatment pattern”. If you fail this one, then maybe the next one, out of all the five classes, should be class number 3, right?
Alexander: Yeah. These high cost chronic diseases would probably be the first ones and diabetes is very much at the forefront of that. Yeah.
Dacheng: That’s right, yes.
Alexander: Awesome! We touched on a lot of different things, starting from what data science is and how we can describe it, going into a couple of different examples and use cases of data science, up to where we might be in a couple of years or a couple of decades. These are things that tell us that we need to stay on top of what’s happening here.
And there’s a lot of opportunities for quantitative people like us to play a significant role here in this field.
Any kind of key takeaways that you would like the listener to leave this episode with?
Dacheng: Okay. I think, really coming from the clinical development perspective, we will not get less data; there will be more and more. And in terms of types of data, again, there will be a more diverse variety coming along the way. So I certainly see the importance for our profession, trained statisticians, to really embrace all the changes. It’s not only about inference; there’s a lot to do with prediction, modeling, automation, and these are relevant for us.
Alexander: Yeah, I completely agree. And as you said, there are a lot of areas beyond purely medical data where we can play our role, as Steve Pike mentioned in an episode some time ago. There are lots of opportunities for us there as well. Thanks so much. It was awesome to have you.
Dacheng: Thank you.
Alexander: And keep in touch. All right, you will find all that Dacheng mentioned in the show notes. So check out all the show notes on theeffectivestatistician.com.
Dacheng: Thank you for having me.
Alexander: The show was created in association with PSI. Thanks to Reine and her team, who help the show in the background, and thank you for listening. Head over to theeffectivestatistician.com to find the show notes and learn more about this podcast. Boost your career as a statistician in the health area. Reach your potential, lead great science, and serve patients. Just be an Effective Statistician.