
6 guiding steps for selecting a fit-for-purpose data set

Using RWD that are not fit for purpose can waste time and money. Review 6 simple steps to guide your research towards the right data.

October 2023 | 7-minute read

Impactful analyses require the right data

From development through commercialization, biopharma companies are increasingly turning to real-world data (RWD) to uncover insights from cost and clinical information. It’s an exciting time to be working with and deriving evidence from these data in support of your pipeline. But your results won’t be as impactful or efficient if you’re looking for evidence in all the wrong places.

Researchers know how crucial it is to select data that are fit for purpose, aligning key elements in the data asset to the needs of your business questions. But there’s a plethora of data types and sources out there. And without a common definition of fit for purpose, how do you go about selecting a data set?

While there’s no magic answer to that question, it can be helpful to follow a framework to prepare yourself to evaluate a data asset before proceeding down an analytic path. Here are 6 steps to help assess the fit-for-purpose nature of a data set.

1. Know your audience

A critical consideration in understanding the fit-for-purpose nature of a data set is thinking about stakeholder expectations, including what they want to learn and how the outputs of the research will be employed. This includes assessing if the analysis will be for internal use only, or if it will be included in a regulatory submission.

Regulatory bodies such as the FDA have high standards for real-world data and evidence used to support new products and indications. Using an analysis in regulatory submissions will require certain data transparency in terms of traceability, auditing and sharing. Understanding the ability of a given asset to support those needs is essential.

Ensuring you know the final audience for your work well in advance — and what that audience requires from a data quality documentation and transparency perspective — will start you on a solid foundation before diving into the specifics of your study.

2. Understand the research question, hypothesis or business issue

Having a comprehensive understanding of the question you’re asking, the hypothesis to be tested, or the business issue at hand is inarguably one of the most important steps in determining which data asset to employ.

Skipping this step, even in part, has the potential to set you on a path that may lead to a time-consuming, costly failure. Given the criticality of the question and hypothesis definition, you should invest heavily in the design, definition and validation of the question(s) being asked, and the answer(s) being sought.

Iterating on such questions can ensure the question(s) meet the appropriate level of specificity well in advance of selecting a data asset.

For example, consider an organization that wants to investigate bariatric surgery outcomes. The following examples illustrate how one way of approaching the research goal is better than the other because of the level of specificity.

  • Example 1: “Examine bariatric surgery to understand the outcomes.”
  • Example 2: “What is the rate of success of bariatric surgery over time and what percentage of patients do not see success? What does that lack of success translate to in terms of cost?”

The second example is a better place to start because it allows you to understand which data elements are clearly necessary to answer the underlying questions. From the second example, you can see that you need a data set that includes both patient outcomes and cost metrics and identifies patients who’ve undergone a surgical procedure. These are all simple but useful considerations in selecting a fit-for-purpose data set.

3. Be specific with data metrics that will define success

Getting granular in your thinking can further refine your evaluation of what constitutes a fit-for-purpose data set.

With well-defined question(s) in hand, you can begin to zero in on the specific outcomes you wish to target. This approach is critical for understanding the fit-for-purpose nature of a data set. As with the example in step 2, you should avoid overgeneralization and be as precise as possible in identifying the success measure(s) to be tested.

A lack of specificity often leads to false starts and the potential for lost time and money. One common method that researchers turn to when they skip this step is using the feasibility process to determine if a data asset is research ready.

Running multiple feasibility queries, one question at a time, is not an effective way to evaluate whether data are fit for purpose. This approach is less efficient and prone to inconclusive assessments of the data. It is better to define your success metrics proactively, ahead of any feasibility assessment.

Following the example of bariatric surgery, a list of more specific criteria may look like the following:

  • Define what bariatric surgery is. For example, ICD-10 or CPT® codes may be used.
  • Engage literature scans and/or clinical expertise to work through these definitions and success metrics and confirm they are accurate.
  • Establish how the rate of success will be measured. For example, a change in BMI (specific thresholds may be defined) from pre-surgery to a specific time post-surgery would be a suitable starting point.
  • Think through what defines an unsuccessful outcome. In this case, a minimal to no change may be appropriate (again, specific thresholds may be defined).
  • Identify any relevant time frames for measurement. For example, one year and five years post-intervention may be isolated.
  • Identify any important hard outcomes. In this case, cost outcomes may be desirable, including what the patient and health plan ultimately paid for surgery. 

With this example, the research question can now be further refined, as follows: 

  • What is the success rate of bariatric surgery, as measured by a (potentially specifically defined) reduction in BMI at one year and five years post-surgery, and what percentage of patients do not see success (minimal to no change in BMI)? What are the differences in cost, at both the patient and plan level, for successful versus unsuccessful bariatric surgery patients?

Based on this question, you can see there are 5 key metrics (BMI, time, procedure codes, plan amount paid and patient amount paid) that must be resident in or calculable from a data set for it to be considered fit for purpose for this study.
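To make the refined question concrete, here is a minimal Python sketch of how the 5 metrics could combine into a success rate and a cost comparison. It is an illustration only: the record layout, the `min_bmi_drop` threshold of 5 BMI points and the sample values are all hypothetical, and a real study would take its thresholds from the literature or from clinical experts.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SurgeryRecord:
    """One patient with a qualifying procedure code (hypothetical layout)."""
    patient_id: str
    bmi_pre: float        # BMI measured before surgery
    bmi_post_12m: float   # BMI measured about one year after surgery
    plan_paid: float      # amount the health plan paid for the surgery
    patient_paid: float   # amount the patient paid for the surgery


def is_success(rec, min_bmi_drop=5.0):
    """Success means BMI fell by at least min_bmi_drop (hypothetical threshold)."""
    return (rec.bmi_pre - rec.bmi_post_12m) >= min_bmi_drop


def summarize(records, min_bmi_drop=5.0):
    """Return the success rate and mean total paid for each outcome group."""
    if not records:
        raise ValueError("no records to summarize")
    successes = [r for r in records if is_success(r, min_bmi_drop)]
    failures = [r for r in records if not is_success(r, min_bmi_drop)]

    def mean_total_paid(group):
        # Total cost is plan paid plus patient paid; None if the group is empty.
        return mean(r.plan_paid + r.patient_paid for r in group) if group else None

    return {
        "success_rate": len(successes) / len(records),
        "mean_total_paid_success": mean_total_paid(successes),
        "mean_total_paid_failure": mean_total_paid(failures),
    }


# Invented sample data, for illustration only.
records = [
    SurgeryRecord("p1", 42.0, 33.0, 18000.0, 2000.0),
    SurgeryRecord("p2", 45.0, 44.5, 21000.0, 3000.0),
    SurgeryRecord("p3", 40.0, 34.0, 17000.0, 1000.0),
    SurgeryRecord("p4", 39.0, 39.0, 19000.0, 1000.0),
]
summary = summarize(records)
print(summary)
```

If any field in `SurgeryRecord` cannot be populated from a candidate data set, that is an early sign the asset may not be fit for purpose for this question.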

You can also list out any desirable supporting data elements to evaluate, if available. Consider how important those are to the overall analysis. Examples may include supporting data such as demographics, social determinants of health (SDOH) information or other factors that may be used as control variables or to profile populations.


4. Determine if the data set can support key metrics of interest

With a full set of desired data elements and variables in hand, you can methodically create a table to serve as a virtual checklist to evaluate the ability of a given asset to support an analysis. See our example below. The grid identifies the important variables in the rows and the different types of data that may be available in the columns.

You can set up a table to include all data types you currently have access to, considering both cost (claims data) and clinical data sources. With the structure established, you can systematically complete the table according to the presence and absence of the desired data elements.
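One possible way to build such a checklist is sketched below in Python. The required elements come from the bariatric surgery example; the availability recorded for each data type is invented for illustration and says nothing about any real data source. The names `REQUIRED_ELEMENTS`, `AVAILABLE` and the helper functions are hypothetical.

```python
# Required data elements, taken from the refined research question in step 3.
REQUIRED_ELEMENTS = [
    "procedure codes",
    "BMI",
    "time (service and measurement dates)",
    "plan amount paid",
    "patient amount paid",
]

# HYPOTHETICAL availability by data type. Replace with what you verify
# for the assets you actually have access to.
AVAILABLE = {
    "claims": {
        "procedure codes",
        "time (service and measurement dates)",
        "plan amount paid",
        "patient amount paid",
    },
    "clinical": {
        "procedure codes",
        "BMI",
        "time (service and measurement dates)",
    },
    "linked claims + clinical": set(REQUIRED_ELEMENTS),
}


def missing_elements(required, available):
    """List required elements that a data type does not contain."""
    return [element for element in required if element not in available]


def fit_for_purpose_sources(required, sources):
    """Names of data types that contain every required element."""
    return [
        name for name, available in sources.items()
        if not missing_elements(required, available)
    ]


def render_checklist(required, sources):
    """Plain-text grid: variables in rows, data types in columns."""
    names = list(sources)
    width = max(len(element) for element in required)
    lines = [" " * width + " | " + " | ".join(names)]
    for element in required:
        cells = [
            ("yes" if element in sources[name] else "no").ljust(len(name))
            for name in names
        ]
        lines.append(element.ljust(width) + " | " + " | ".join(cells))
    return "\n".join(lines)


print(render_checklist(REQUIRED_ELEMENTS, AVAILABLE))
print("Candidates:", fit_for_purpose_sources(REQUIRED_ELEMENTS, AVAILABLE))
```

Keeping the grid as data rather than a one-off spreadsheet makes it easy to rerun the check when the must-have list changes.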


5. Design feasibility to help develop your analysis

Only after you’ve taken the steps to refine your research question(s) and identify the type(s) of data that best address the project’s needs is it time to conduct a feasibility analysis. 

One common pitfall in this process is beginning the feasibility assessment before taking steps 1 through 4, which may cause you to prematurely select a data set that isn’t fit for purpose. When the feasibility request doesn’t closely approximate the final study protocol, the resulting counts can be misleading, and sample size problems may surface later in the study. Executing step 5 well, much like the other steps, should result in more seamless project execution downstream.

Given the bariatric surgery example, the following would be an appropriate example of a feasibility request:

  • Identify all first-time bariatric surgery patients from January 2015–2017, using ICD-10 and CPT® codes, who have 3 months of pre-surgery eligibility and 12 or 60 months of post-surgery eligibility (from date of surgery), a corresponding BMI measurement in the pre- and post-surgery periods (12 and 60 months, +/- 1 month), and valid (non-negative) plan and patient paid amounts for the surgery.
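A feasibility request like this one can be expressed as a sequence of exclusion rules, which also shows where patients drop out. The Python sketch below is a rough illustration under several assumptions: the input is already limited to first-time surgeries identified by procedure codes, a month is approximated as 30 days, the study window bounds are placeholders passed in by the caller, and the `Patient` layout and sample records are invented.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

DAYS_PER_MONTH = 30  # simplifying assumption for month arithmetic


@dataclass
class Patient:
    """Hypothetical layout; assumes first-time surgery already identified by code."""
    patient_id: str
    surgery_date: date
    enrolled_from: date
    enrolled_to: date
    plan_paid: float
    patient_paid: float
    bmi_dates: list = field(default_factory=list)  # dates with a BMI measurement


def months(n):
    return timedelta(days=n * DAYS_PER_MONTH)


def has_bmi_between(patient, start, end):
    """True if any BMI measurement falls in [start, end], inclusive."""
    return any(start <= d <= end for d in patient.bmi_dates)


def exclusion_reason(patient, window_start, window_end, follow_up_months=12):
    """Return the first rule the patient fails, or None if eligible."""
    s = patient.surgery_date
    if not (window_start <= s <= window_end):
        return "surgery outside study window"
    if patient.enrolled_from > s - months(3):
        return "less than 3 months pre-surgery eligibility"
    target = s + months(follow_up_months)
    if patient.enrolled_to < target:
        return "insufficient post-surgery eligibility"
    if not has_bmi_between(patient, s - months(3), s):
        return "no pre-surgery BMI"
    if not has_bmi_between(patient, target - months(1), target + months(1)):
        return "no follow-up BMI"
    if patient.plan_paid < 0 or patient.patient_paid < 0:
        return "invalid paid amount"
    return None


def attrition(patients, window_start, window_end, follow_up_months=12):
    """Count patients by outcome of the rules above."""
    counts = {}
    for p in patients:
        reason = exclusion_reason(p, window_start, window_end, follow_up_months)
        key = reason or "eligible"
        counts[key] = counts.get(key, 0) + 1
    return counts


# Invented sample data; the window bounds are placeholders, not from the request.
patients = [
    Patient("a", date(2016, 3, 1), date(2015, 1, 1), date(2018, 12, 31),
            18000.0, 2000.0, [date(2016, 2, 1), date(2017, 3, 5)]),
    Patient("b", date(2016, 6, 1), date(2016, 5, 1), date(2018, 12, 31),
            20000.0, 1500.0, [date(2016, 5, 20), date(2017, 6, 1)]),
    Patient("c", date(2015, 9, 1), date(2015, 1, 1), date(2018, 12, 31),
            19000.0, 1000.0, [date(2015, 8, 15)]),
    Patient("d", date(2016, 1, 10), date(2015, 1, 1), date(2018, 12, 31),
            -50.0, 1000.0, [date(2015, 12, 20), date(2017, 1, 8)]),
]
counts = attrition(patients, date(2015, 1, 1), date(2017, 12, 31))
print(counts)
```

Reporting the count lost at each rule, rather than only the final number, helps show whether the feasibility result will hold up once the full protocol is applied.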

6. Ask general questions to make final informed decisions

At this point in the decision-making process, outstanding questions will likely be about which data provider(s) and number of sources to use. The right approach could be to use multiple sources, based upon specific needs. Some general questions and thoughts that can support decision-making include:

Representativeness and diversity of the data

  • Is a given data source skewed by age, gender, geography, race, line of business, etc.?
  • What is the longitudinality of the data set, type of source (open or closed), representativeness by site of service, etc.?

Understanding considerations such as these will allow you to identify any material limitations that may inhibit the generalizability of your research conclusions.
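As a rough illustration of a skew check, the Python sketch below compares a data source's age mix with a reference population and flags groups that differ by more than a chosen tolerance. The age bands, counts, reference shares and the 5-percentage-point tolerance are all hypothetical; the same comparison could be repeated for gender, geography or line of business.

```python
def shares(counts):
    """Convert raw counts per group into proportions of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("source has no patients")
    return {group: n / total for group, n in counts.items()}


def skew_report(source_counts, reference_shares, tolerance=0.05):
    """Return groups whose share differs from the reference by more than tolerance.

    Values are source share minus reference share, rounded to 3 decimals:
    positive means over-represented, negative means under-represented.
    """
    source_shares = shares(source_counts)
    flagged = {}
    for group, ref_share in reference_shares.items():
        diff = source_shares.get(group, 0.0) - ref_share
        if abs(diff) > tolerance:
            flagged[group] = round(diff, 3)
    return flagged


# Invented numbers, for illustration only.
source_age_counts = {"18-44": 200, "45-64": 500, "65+": 300}
reference_age_shares = {"18-44": 0.38, "45-64": 0.35, "65+": 0.27}

flagged = skew_report(source_age_counts, reference_age_shares)
print(flagged)
```

A flagged group does not rule a source out by itself, but it is a limitation worth documenting before drawing general conclusions.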

Rights to use the data

  • Is the question you are attempting to address, including use of the data, permitted by the data provider?

This can be a contentious point, especially if you don’t check beforehand and end up finding yourself in breach of a data agreement. In general, you should vet your intended data use with a data provider and thoroughly understand any data use agreements in place.

Implications of linking data sources

If no single source of data can address all research questions, you may seek to link 2 distinct assets to gain additional insight not resident in a single source. In this case, it’s critical that you understand the implications for the privacy regulatory status of data sets when linking or combining them.

Keeping patient data de-identified should be of the utmost importance when considering linking assets. Check with the data provider to verify if a data source can be linked and by what methods this can be safely and securely accomplished.


Cost

Cost should always be evaluated if you’re considering a source that’s not already a licensed asset your company has access to.

Be thoughtful in your approach to the data

Keep in mind that the process for selecting a fit-for-purpose data set is not a one-size-fits-all game. Just as RWD reflect real life, your research process does as well — things can get messy from time to time.

Distinguish between the must-haves and the nice-to-haves in the data. And in parallel, ensure your key stakeholders are aligned with any gaps in the data before you proceed.

Evaluate if data are fit for purpose well in advance

Harnessing RWD can be an insightful, game-changing process. But picking the right tools is just as important as what you do with them. Addressing the steps outlined here can help set you on the right path with your research, bringing you one step closer to getting your products to market and to the patients who need them.

