Purpose of Q&A
This group is designed to serve as an open-source platform for addressing queries related to real-world data (RWD). We aim to create an environment similar to Stack Overflow, where you can receive answers to your questions.

Every organisation develops internal documents outlining learnings and best practice for conducting research with real-world data. This might include database quirks, standard cohort definitions and preferred algorithms. Currently, there’s no open-source alternative to these internal documents, which leads to duplicated errors, slower individual learning and slower advancement of real-world evidence at the industry level. By creating an open-source platform, we hope to prevent duplication of effort, accelerate the evolution of the real-world evidence space and ultimately offer better insights for participants and patients. Stakeholders include data vendors, pharmaceutical companies and any other entities that use real-world data.

Question | Teams Collective Response

Are there open-source templates for creating RWE cohorts?

Response 1: Open-source cohort tool https://github.com/r-world-devs

Response 2: OHDSI's cohort tool ATLAS, which uses OMOP terminology that can be looked up in ATHENA. See also the OHDSI Phenotype Library: https://github.com/OHDSI/PhenotypeLibrary

What are some recommended algorithms for propensity score matching in RWE studies?

Response 1: The propensity score itself is most commonly estimated with logistic regression, but machine learning approaches such as random forests and boosting can also be used. Libraries like MatchIt in R provide functions to perform the matching once scores are estimated.

Response 2: You might also want to consider the twang package in R, which estimates propensity scores with generalised boosted models and uses them as weights to estimate causal effects, making it a good fit for some RWE studies.
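
To make the matching step concrete, here is a minimal sketch in Python of greedy 1:1 nearest-neighbour matching with a caliper. It assumes propensity scores have already been estimated (for example by logistic regression); the function name `greedy_match`, the caliper of 0.05 and the toy scores are illustrative assumptions, not part of any package mentioned above.

```python
def greedy_match(scores, treated, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    scores  : propensity scores in [0, 1], one per patient
    treated : booleans, True if the patient is in the treated group
    caliper : largest absolute score difference accepted for a match
    Returns a list of (treated_index, control_index) pairs.
    """
    treated_idx = [i for i, t in enumerate(treated) if t]
    available = {i for i, t in enumerate(treated) if not t}
    pairs = []
    for i in treated_idx:
        if not available:
            break
        # Closest control that has not been matched yet (ties broken by index).
        j = min(available, key=lambda c: (abs(scores[c] - scores[i]), c))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


# Toy data: two treated patients followed by four potential controls.
scores = [0.80, 0.30, 0.78, 0.33, 0.55, 0.10]
treated = [True, True, False, False, False, False]
pairs = greedy_match(scores, treated)
print(pairs)
```

Greedy matching depends on the order in which treated patients are processed, so in a real study it is worth checking covariate balance after matching and comparing against an optimal-matching routine.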

How can I handle missing data in my RWE dataset?

Response 1: There are many ways to handle missing data, including simple methods such as listwise deletion and mean imputation and more sophisticated methods such as multiple imputation and maximum likelihood estimation. Choice of method depends on the type and pattern of missing data.

Response 2: mice (Multivariate Imputation by Chained Equations) is a popular package in R for handling missing data. It imputes each missing value multiple times so that the uncertainty about the missing values carries through to the analysis.
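
The two simple methods named above can be sketched in a few lines of Python. This is only an illustration on made-up records (the column names `age` and `bmi` are assumptions); it does not implement multiple imputation, which is usually preferable because mean imputation understates variability.

```python
from statistics import mean

# Toy records; None marks a missing value.
records = [
    {"age": 64, "bmi": 27.5},
    {"age": None, "bmi": 31.0},
    {"age": 52, "bmi": None},
    {"age": 70, "bmi": 24.5},
]


def listwise_delete(rows):
    """Keep only records with no missing values in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values in that column."""
    columns = list(rows[0].keys())
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in columns}
        for r in rows
    ]


complete_cases = listwise_delete(records)
imputed = mean_impute(records)
print(len(complete_cases), "complete records out of", len(records))
print(imputed)
```

Listwise deletion halves this toy dataset, which shows why the choice of method matters when missingness is common.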

How can I ensure the quality of RWE data?

Response 1: Regular audits, checks for missing data and checks for outliers are important. Tools like DataExplorer and validate in R can be very helpful in this regard.

Response 2: R packages like dataMaid and validate are designed to assist in data cleaning and validation, which are essential steps in ensuring data quality.
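
As a rough illustration of the checks for missing data and outliers mentioned above, the following Python sketch computes a completeness rate and flags outliers with Tukey's 1.5 x IQR rule. The function names, the `sbp` column and the values are hypothetical, and the threshold is a common convention rather than a recommendation from this group.

```python
from statistics import quantiles


def completeness(rows, column):
    """Share of records with a non-missing value in `column`."""
    present = sum(1 for r in rows if r.get(column) is not None)
    return present / len(rows)


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Toy systolic blood pressure readings, one missing and one implausible.
rows = [{"sbp": v} for v in (120, 118, 125, 130, 122, 119, 400)] + [{"sbp": None}]
observed = [r["sbp"] for r in rows if r["sbp"] is not None]

print(completeness(rows, "sbp"))
print(iqr_outliers(observed))
```

A flagged value is a prompt for review, not proof of an error; clinical plausibility ranges defined with a domain expert are usually a better final rule.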

Are there any considerations to be made when using natural language processing (NLP) in RWE research?

Response 1: Yes. Privacy and anonymisation are key considerations when dealing with patient data. Also, language and context can greatly affect the interpretation, so having a domain expert can be beneficial.

Response 2: Tools like Python’s NLTK or Stanford’s CoreNLP can assist in handling many NLP tasks. For R, the tm and text2vec packages offer functionalities for managing text data.
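
To show why context matters, here is a deliberately naive Python sketch that checks whether a mention of a condition in a clinical note is preceded by a negation cue. The cue list, the window of three words and the function name `mentions` are assumptions made for illustration; it is not how NLTK or CoreNLP handle negation, and a real pipeline would need a validated approach and expert review.

```python
import re

# Hypothetical, very small list of negation cues.
NEGATION_CUES = {"no", "not", "denies", "without"}


def mentions(note, term, window=3):
    """For each occurrence of `term`, report whether a negation cue
    appears within the `window` words before it."""
    tokens = re.findall(r"[a-z]+", note.lower())
    flags = []
    for i, token in enumerate(tokens):
        if token == term:
            preceding = tokens[max(0, i - window):i]
            flags.append(any(p in NEGATION_CUES for p in preceding))
    return flags


affirmed = mentions("Patient denies chest pain. History of diabetes.", "diabetes")
negated = mentions("No evidence of diabetes.", "diabetes")
print(affirmed, negated)
```

A plain keyword search would count both notes as diabetes cases, which is the kind of misinterpretation a domain expert helps to catch.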

How do we deal with the potential biases or limitations associated with real-world evidence data?

Response 1: To address potential biases in real-world evidence (RWE) data and programming, perform thorough data preprocessing and standardisation to identify and address data quality issues, missing values, outliers and inconsistencies.

Response 2: Clearly define research questions and use robust statistical methods. For example, employ methods that handle missing data appropriately, account for confounding variables or address selection biases.

Response 3: Seek feedback from peers, subject matter experts and other programmers. Collaboration and peer review can help identify potential biases that may have been overlooked and provide valuable insights for improving the analysis.

How would you ensure the inclusion of diverse patient populations and capture relevant social determinants of health to address healthcare disparities effectively?


Response 1: Clearly articulate the goals of the data collection process. Actively seek out diverse sources of data to ensure representation of different demographic groups, including race, ethnicity, gender, age, socioeconomic status and geographic location.

Response 2: Engage and involve diverse stakeholders throughout the research process, including data collection, analysis and interpretation.

What data cleaning and preprocessing techniques can be applied to improve data quality in real-world evidence studies?


Response 1: Protect privacy and confidentiality by de-identifying sensitive information such as personally identifiable information (PII) through techniques such as removing or generalising identifiers, pseudonymisation or data anonymisation.

Response 2: Implement data validation checks to identify and correct errors, inconsistencies or discrepancies in the data. This includes removing duplicates, resolving data entry errors and verifying data against predefined rules or reference sources.
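
The two responses above can be combined into a small cleaning pass, sketched here in Python. It removes duplicate records, checks each record against predefined rules and pseudonymises the patient identifier with a keyed hash. The field names, the rules, the allowed sex codes and the key are all hypothetical; a real study would take them from its data dictionary and store the key securely.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; never hard-code a real key.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymise(patient_id, key=SECRET_KEY):
    """Replace an identifier with a keyed hash so records stay linkable but not readable."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def validate(row):
    """Return the list of rule violations for one record (rules are illustrative)."""
    problems = []
    if row["age"] is None or not 0 <= row["age"] <= 120:
        problems.append("age out of range")
    if row["sex"] not in {"F", "M", "U"}:
        problems.append("unknown sex code")
    return problems


def clean(rows):
    """Drop duplicates, set aside invalid records and pseudonymise the rest."""
    seen, kept, rejected = set(), [], []
    for row in rows:
        key = (row["patient_id"], row["visit_date"])
        if key in seen:
            continue  # same patient and visit date already processed
        seen.add(key)
        problems = validate(row)
        if problems:
            rejected.append((row, problems))
        else:
            kept.append({**row, "patient_id": pseudonymise(row["patient_id"])})
    return kept, rejected


rows = [
    {"patient_id": "P001", "visit_date": "2023-01-05", "age": 54, "sex": "F"},
    {"patient_id": "P001", "visit_date": "2023-01-05", "age": 54, "sex": "F"},
    {"patient_id": "P002", "visit_date": "2023-02-10", "age": 230, "sex": "M"},
    {"patient_id": "P003", "visit_date": "2023-03-15", "age": 41, "sex": "X"},
]
kept, rejected = clean(rows)
print(len(kept), "kept;", len(rejected), "rejected")
```

Rejected records are kept alongside the reasons rather than silently dropped, so that data entry errors can be resolved against a reference source. Pseudonymisation alone does not make data anonymous, since other fields may still identify a patient.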

What KPIs and metrics should be considered to improve data quality in real-world evidence studies?

Response 1: Evaluate the completeness and quality of data documentation, including data dictionaries, metadata and data collection protocols, to ensure transparency and understanding of the data.

Response 2: Solicit feedback from data users and stakeholders to identify areas of improvement and address specific data quality concerns.

Community Help

If you would like to contribute towards answering the questions below, please contact workinggroups@phuse.global.

Question | Teams and Community Collective Response

Are there any open-source discussion platforms/documents to ensure accuracy, completeness and representativeness of real-world evidence data?


What considerations should be given to data governance and data sharing agreements when linking datasets from different sources in RWE studies?


How can data quality issues impact the process of linking datasets in RWE studies?


What are the considerations for selecting a suitable comparator group in RWE studies?


What are the emerging trends and future directions in RWE data collection, analysis and utilisation?


How can data from electronic patient-reported outcomes (ePROs) or mobile applications be integrated into RWE analyses?


What are the popular and emerging real-world evidence (RWE) data sources used in clinical trials and patient recruitment?


What patient-reported outcome measures (PROMs), quality-of-life assessments or health behaviour data would be essential to evaluate the intervention’s impact on patient well-being?

