Analyzing The Wisconsin Breast Cancer Dataset With Azure ML Studio

Frank Mendoza of Catalytics gave a presentation on analyzing the same Wisconsin Breast Cancer Data Set that we explored with the HealthCare.AI toolkit in November 2017.  It was interesting to see how Azure Machine Learning approaches predictive analysis at a higher level, with an intuitive graphical interface, compared to HealthCare.AI.  I intend to add Azure Machine Learning to my ML arsenal, for sure.


Machine Learning And Cancer

If you were to perform a daily web search for “Machine Learning and Cancer”, you would find more companies each day using Supervised Machine Learning to perform Predictive Analytics on cancer data sets.  There is Tempus here in Chicago, IL, Berghealth in Framingham, MA, and FlatIron Health in New York, NY, as examples.  So, how do these companies get their data?  These companies are well funded, and they likely pay good money to partner with health organizations.  As partners, they categorize themselves as Business Associates and comply with HIPAA in order to access patient information.  These companies are doing great work!

As a small consulting group, we have been able to get data from the All Payor/All Claims databases of certain states (this site lists the participation status of many U.S. states in the All Payor/All Claims process), but the data is not very useful, because the ability to track a patient’s disease is virtually nil.  In the initial stages of analysis, I thought I could track a patient from the first to the last service date of his or her medical encounters.  The states have anonymized the patient IDs (a good thing), but they go overboard and don’t give you enough information to piece together the correct sequence of service dates for a given patient.  Furthermore, they limit the tracking of a patient to a single insurance carrier.  What if the person changes insurance during treatment (a very likely occurrence)?  You just can’t get the data needed to describe a person’s episode with the diagnosis being studied.  The All Payor/All Claims database program, begun during the Obama Administration, was a great idea, but it hasn’t gone far enough to provide useful, quality data.

In fact, there is no easy way for me to get good patient data anywhere in which patient identifiers and names are anonymized and dates of birth are null.  The reason is that I am neither a well-funded Business Associate in a contractual relationship with a Covered Entity, nor a researcher at a university approved by an Institutional Review Board and holding Business Associate status.

The problem with both of these barriers is that they take a lot of money to overcome; therefore, the analysis of such data sets is limited to a select few researchers and companies.

I believe that, just like universal/free education and open source software, open healthcare data (anonymized to protect individual patients) with accurate diagnoses and patient characteristics (e.g. age, sex, city, county, state) would give anyone the chance to generate useful models, for example to predict whether a tumor is benign or malignant.

Overseeing such an open healthcare data program would be a great function for the federal government.  Many states are already trying something similar with All Payor/All Claims.  The federal government could learn from the state efforts, create such a program, build the quality data that is needed, and make it available in one place.

Comments on this are welcome…

HealthCare.AI Applied To Cancer

On November 28, 2017, Dan Wellisch led a second revision of his October 2017 discussion, this time using the Breast Cancer Wisconsin (Diagnostic) Data Set.  Dan also revised his slides to show the overall Machine Learning process, along with graphs that help determine which Machine Learning model performs best on a given data set.


HealthCare.AI (Python version) based on Python sk-learn library

On October 30, 2017, Dan Wellisch presented the Python version of the HealthCare.AI open source library.  He went through a basic example using a diabetes data set, where the goal was to predict 30-day readmissions for diabetes patients.  He discussed the process of training several models, selecting the highest-performing one, saving that model, and finally running 30-day readmission predictions using the saved model.  Here is the full presentation:
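The workflow Dan described (train several candidate models, keep the best, save it, and run predictions with it) can be sketched in plain scikit-learn, the library HealthCare.AI is built on.  This is not HealthCare.AI's actual API; the synthetic data set, the model choices, and the file name are all illustrative stand-ins:

```python
# Minimal sketch: train several models, keep the best, save it, predict.
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for the diabetes data (X = patient features, y = readmitted?).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score each candidate with 5-fold cross-validation and keep the best.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)

# Save the winning model; a later job can load it and predict readmissions.
with open("best_model.pkl", "wb") as f:
    pickle.dump(best_model, f)

with open("best_model.pkl", "rb") as f:
    loaded = pickle.load(f)
predictions = loaded.predict(X[:5])
print(best_name, len(predictions))
```

The point is the shape of the process, not the particular models: every candidate is scored the same way, and only the saved winner is used downstream.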


Simple Linear Regression: Step By Step

On September 26, 2017, Dan Wellisch, Organizer of the Chicago Technology For Value-Based Healthcare Meetup, presented a detailed breakdown of the Linear Regression Model.  The Linear Regression Model is the first model taught to Data Science students.  In its simplest form, there are two variables: an input (or feature) variable and an output (or prediction) variable.  The idea is to find the best-fit straight line through all the (x, y) points on a two-dimensional graph, where the input variable (x) predicts the output variable (y).  By minimizing the sum of squared errors, we can obtain that best-fit line.  Here is the full presentation:
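For the two-variable case described above, minimizing the sum of squared errors has a closed-form answer, which can be shown in a few lines of Python.  The sample points below are made up for illustration:

```python
# Least-squares fit of y = slope*x + intercept for one input variable.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Minimizing the sum of squared errors gives:
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying exactly on y = 2x + 1, so the fit recovers slope 2, intercept 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```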

An Exercise In Gathering Healthcare Intelligence From An All Payer/All Claims Data Set

I have written an article with Mike Ghen on the analysis of healthcare data.  We used IBM Watson Analytics, Microsoft Azure SQL Database, and some Linux editing tools to produce a nice summary table of findings.  The article is found here: An Exercise In Gathering HealthCare Intelligence Using Azure SQL Database, Linux Editing Commands, And IBM Watson Analytics.


Data Warehouse 201

So, continuing on from the previous article, Data Warehouse 101, I would like to delve a little deeper.  First, I would like to revisit how I defined a data warehouse in the Data Warehouse 101 article:

a large set of databases where data from disparate systems may be stored before their use for reporting

This is one way to construct a data warehouse, but the data does not have to come from disparate systems.  Another type of data warehouse has a single transactional source, perhaps an in-house transactional system.  The performance of this transactional system is very important, so we don’t want to run report queries against its database.  In that case, the source system for the data warehouse (Enterprise Data Warehouse, or EDW) is only one system, namely our in-house system.

The transactional system consists of one or many “normalized” databases.  A “normalized” database has many tables in order to reduce data redundancy.  Normalized databases are not efficient for reporting, so we need an ETL (Extract, Transform, and Load) process to copy the normalized data into a new set of databases and schemas.  Each of these schemas is architected in a “star” configuration; hence, they are called “star” schemas.  “Star” schemas are the de facto standard architecture for reporting in an EDW.  Simply put, a “star” schema consists of a fact table with foreign keys, each of which points to a dimension table.  A fact table contains quantities of things that are already pre-calculated and ready to report on.

Dimension tables answer the questions “who, what, where, when, and how?”.  A fact table answers the question “how many or how much?”.  For example, a fact table named FactSales could contain the following: DateId, StoreId, ProductId, NumberOfUnitsSold.  The FactSales table would be at the center of the “star”.  NumberOfUnitsSold is the lone fact in this table; it answers “how many?”.  The other attributes are foreign keys, each of which points to a dimension table.  There would be three dimension tables: DimProduct, DimStore, and DimDate, answering “what?” (Product), “where?” (Store), and “when?” (Date), respectively.  The schema looks like this:



The FactSales table may be useful, in this case, if there are enough different combinations of date, store, and product that identify NumberOfUnitsSold.  Imagine 10 years of history across many permutations of these attributes.
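As a sketch, the star schema above can be built and queried with SQLite from Python's standard library.  The table and column names follow the text; the row values are made up for illustration:

```python
# Build the FactSales star schema in an in-memory SQLite database.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DimDate    (DateId INTEGER PRIMARY KEY, CalendarDate TEXT);
CREATE TABLE DimStore   (StoreId INTEGER PRIMARY KEY, StoreName TEXT);
CREATE TABLE DimProduct (ProductId INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE FactSales (
    DateId    INTEGER REFERENCES DimDate(DateId),
    StoreId   INTEGER REFERENCES DimStore(StoreId),
    ProductId INTEGER REFERENCES DimProduct(ProductId),
    NumberOfUnitsSold INTEGER
);
INSERT INTO DimDate    VALUES (1, '2017-09-01');
INSERT INTO DimStore   VALUES (1, 'Downtown');
INSERT INTO DimProduct VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO FactSales  VALUES (1, 1, 1, 40), (1, 1, 2, 25);
""")

# A typical report query: join the fact table to its dimensions and aggregate.
rows = con.execute("""
    SELECT s.StoreName, d.CalendarDate, SUM(f.NumberOfUnitsSold)
    FROM FactSales f
    JOIN DimStore s ON f.StoreId = s.StoreId
    JOIN DimDate  d ON f.DateId  = d.DateId
    GROUP BY s.StoreName, d.CalendarDate
""").fetchall()
print(rows)  # [('Downtown', '2017-09-01', 65)]
```

Notice that the report query never touches the transactional system; it joins the fact table outward to the dimensions, which is exactly the access pattern the star shape is optimized for.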

But, what about data warehouses for healthcare?  A healthcare data warehouse may not look like one for product sales.  What would it look like?  It may not be that easy to define.  We will take a look at some of these challenges in another post.

Thanks for reading!

Data Warehouse 101

What is a data warehouse?

I like to answer this question by defining a warehouse first.  What is a warehouse?

Here is the definition provided by Google: a large building where raw materials or manufactured goods may be stored before their export or distribution for sale

So, where do these raw materials or manufactured goods come from?

Raw materials and manufactured goods come from a variety of sources.  They are harvested or produced by different companies.  It is the warehouse that provides the item to the next step in the supply chain.

So, the data warehouse is the same thing: it is supplied by many different systems.  Each source database defines its own primary keys for each of its tables, but those primary keys are designed specifically for that source database.  Since a data warehouse draws from multiple sources, each set of primary keys is designed differently from the others, and there could even be duplicate primary key values across source databases.

So, what needs to happen?

We need to add a new primary key called a surrogate key so that we can guarantee primary key uniqueness.
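A minimal sketch of surrogate-key assignment, assuming two hypothetical source systems whose native primary keys collide:

```python
# Assign a warehouse-owned surrogate key to each incoming row, so that rows
# from different source databases never collide on primary key.
def assign_surrogate_keys(rows):
    """rows: (source_system, source_pk, payload) tuples."""
    warehouse = []
    key_map = {}   # (source_system, source_pk) -> surrogate key
    next_key = 1
    for source, pk, payload in rows:
        if (source, pk) not in key_map:
            key_map[(source, pk)] = next_key
            next_key += 1
        warehouse.append({"sk": key_map[(source, pk)],
                          "source": source, "source_pk": pk,
                          "data": payload})
    return warehouse

# Both hypothetical sources reuse primary key 1, yet each row gets
# a unique surrogate key in the warehouse.
rows = [("crm", 1, "Alice"), ("billing", 1, "Bob")]
loaded = assign_surrogate_keys(rows)
print([r["sk"] for r in loaded])  # [1, 2]
```

Keeping the original (source, primary key) pair alongside the surrogate key also lets us trace any warehouse row back to the system it came from.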

We also need to cleanse the data, which may mean reformatting it.  For example, if source database 1 stores phone numbers with dashes and source database 2 stores them without, then we need to decide on one phone number format for the data warehouse, because we want it to be consistent for searching purposes.
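The phone-number decision can be sketched as a small normalization step.  Stripping down to digits only is just one possible canonical format; the important thing is that every source is mapped to the same one:

```python
# Normalize phone numbers to a single canonical format (digits only),
# so rows from different sources match when searched.
import re

def normalize_phone(raw):
    return re.sub(r"\D", "", raw)   # strip dashes, spaces, parentheses

# Source 1 stores dashes, source 2 does not; both normalize identically.
print(normalize_phone("312-555-0148"))  # 3125550148
print(normalize_phone("3125550148"))    # 3125550148
```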

Back to the original question: what is a data warehouse?  I tweaked the definition of a physical warehouse noted above to the following:

a large set of databases where data from disparate systems may be stored before their use for reporting

Next up, we will dive a little deeper into how a data warehouse is constructed.  Stay tuned!

Value-Based Care And Bundled Payments

As I am currently employed by PinpointCare, I would like to explain why we are in business.  To do that, I must define the Bundled Payment.  The Centers for Medicare and Medicaid Services, or CMS (formerly the Health Care Financing Administration), is working towards enabling a “Value-Based” care environment.  A Bundled Payment is a form of Value-Based care.

Value-Based care is essentially oriented towards paying for making patients well.  At PinpointCare, our focus is currently on Orthopedic Surgeries.  CMS has focused on Orthopedic Surgeries because it is an area where they have seen a wide disparity in charges for an episode of care in one part of the country versus another.  So, CMS is creating models where a provider manages the whole episode of care (evaluation, surgery, and post-acute care) and receives a lump-sum payment for the whole episode.  This contrasts with Fee-For-Service, which has been mostly in effect until now.  Fee-For-Service leads to overpayments and fraud, as it is the model where a provider gets paid for every encounter with the patient, whether or not it helps the patient get well.

The Bundled Payment is a model for Value-Based care.  One provider, either the hospital or the doctors’ group, assumes the risk of getting a patient well.  This managing provider agrees on a contracted price with CMS and is then responsible for the surgery and post-acute care.  The managing provider profits if the episode of care costs less than the contracted price.  If the episode costs more than the contracted price, then the provider must eat that cost.  This makes the provider accountable for the patient’s recovery.  So, the Bundled Payment is simply this lump-sum payment for the whole episode of care.  The concept was born out of the Affordable Care Act.  Stay tuned for what happens next (with the new administration)!  You can Google “Bundled Payments” for more details.

So, what do Bundled Payments have to do with PinpointCare?

PinpointCare has a “Coordinated Care” software platform.  Coordinated Care is the idea that all providers involved in an episode of care must communicate with each other and document each specific piece of information only once, rather than many times.

Our platform promotes efficiency: providers (hospital, physician, skilled nursing, home health, physical therapy) have logins to our system, so all the information each provider enters is in one place and is not duplicated.

Our platform promotes communication through notifications that tell the next provider in the continuum of care when to start their next phase.

This is how we help our customers reduce their costs and improve their quality, a necessity for customers who are at risk for the entire episode of care.