
A Data-Driven Discovery & Framing Process

At Enfuse, our top priority is delivering relevant, valuable, and timely suggestions and solutions to our clients. When we begin a consulting engagement, one of our first goals is to get a clear understanding of the issues they (you!) are trying to solve. One of the tools in our toolbelt is the Discovery & Framing (D&F) process, in which we work to understand and prioritize the problems you are facing (Discovery), then brainstorm, ideate, test, and prioritize potential solutions (Framing). We come out of this 4-12 week phase with a clear problem statement, prioritized solutions, and an action plan to get started.

D&F usually includes extensive interviews with staff (and, if we’re lucky, users too), a lot of research, and maybe some creative surveys. We then analyze the survey results and conduct follow-up interviews and working sessions with the team to ensure we are moving in the right direction. Recently, we had the added bonus of having all our interviews automatically transcribed through MS Teams, which gave us the opportunity to explore our client’s thoughts on their issues in a completely new, data-driven way.

Our top priority is to understand and prioritize the client’s concerns and to arrive at an appropriate and efficient solution, so we are always looking for new ways to uncover insights. As a data-driven company, we also ask how we can use data to drive those insights instead of relying solely on human intuition.

So we took these transcripts as an opportunity to dig a little deeper and asked ourselves: what insights might be revealed if we remove the human element from the analysis? Perhaps there are underlying issues that get overlooked or go unnoticed during the interviews, issues the client is not even aware exist. With all the natural language processing (NLP) and machine learning (ML) methods and tools at our disposal, we had to take advantage of this opportunity.

In this blog, we are going to dive deeper into the technology, methodologies, and thoughts behind our analysis during our D&F process. If you would like to hear more about our consulting engagements and the D&F process, please leave a comment below or reach out to us directly.

Project Background

For this project, we are working with a company that provides services (warranty, preventative maintenance, etc.) for lab equipment. They have a lot of stale and inconsistent data spread across their different CRM and ERP systems. They came to us with a list of Critical Data Elements, or “CDEs” (also referred to as KBEs, Key Business Elements): simple and derived data elements such as a sold-to party (encompassing the name, company, physical address, billing information, and more).

Analysis

Text Clustering

Our first method was to perform text clustering analysis with the transcripts. This method aims to organize data into themes, or clusters. It proved to be a great tool to help us narrow down areas of importance and to prioritize concerns.

We chose K-Means clustering for its simplicity. It is an unsupervised ML algorithm, which means it finds structure in the data itself and needs no labeled training set. So what do we need? Our analysis is all contained in Jupyter notebooks using Python.

Our chosen libraries:

Text Pre-Processing

Before applying K-Means clustering, almost all text needs a little cleaning up, especially unstructured data like ours. Punctuation, line feeds, and carriage returns need to be removed, along with other specifics depending on the data. Our transcripts have lines we don’t really need, such as timestamps. Next comes stemming and/or lemmatizing the text. Stemming relies on a fixed set of rules to shorten a word, typically by stripping suffixes. Lemmatizing relies on knowledge of the language to find the dictionary root of a word. The usual recommendation is to do one or the other: applying both adds processing time, and the difference in results is usually minimal.
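To make the cleaning step concrete, here is a minimal sketch using only the Python standard library. The timestamp pattern and the sample transcript are assumptions made up for illustration; the real Teams export may look different, so the pattern would need adjusting.

```python
import re
import string

# Hypothetical timestamp pattern (e.g. "0:12:05" or "00:12:05.120").
# Adjust it to whatever the real transcript export contains.
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}:\d{2}(?:\.\d+)?\b")
# Map every punctuation character to a space so words do not get glued together.
PUNCTUATION = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_transcript(text: str) -> str:
    """Lowercase the text and strip timestamps, punctuation, and line breaks."""
    text = TIMESTAMP.sub(" ", text)
    text = text.lower().translate(PUNCTUATION)
    # split() with no arguments also collapses \r, \n, and repeated spaces.
    return " ".join(text.split())


# Made-up sample, not taken from the client transcripts.
sample = (
    "0:01:15\r\nSpeaker A: Our customer data is stale, honestly.\n"
    "0:01:22\r\nSpeaker B: Agreed!"
)
cleaned = clean_transcript(sample)
print(cleaned)
```

Stemming would follow as a separate pass over the cleaned tokens, for example with NLTK's PorterStemmer.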

At this point, the data is still not ready. Stop words (e.g. “like”, “and”, “to”) need to be taken out to improve your results. NLTK comes in handy for this, but most of the time it’s necessary to extend the list of stop words to suit your data.
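As a rough illustration of extending a stop word list, the sketch below uses a tiny hand-written set as a stand-in for NLTK's English list (which NLTK exposes through `nltk.corpus.stopwords.words("english")` once the corpus is downloaded). Both word sets here are hypothetical examples, not the lists we actually used.

```python
from typing import List

# Stand-in for NLTK's English stop word list, kept tiny so the sketch runs
# without downloading any corpus.
BASE_STOP_WORDS = {"a", "and", "is", "like", "of", "our", "the", "to", "we"}
# Hypothetical domain-specific additions (filler words, speaker labels).
EXTRA_STOP_WORDS = {"um", "yeah", "speaker"}
STOP_WORDS = BASE_STOP_WORDS | EXTRA_STOP_WORDS


def remove_stop_words(text: str) -> List[str]:
    """Split on whitespace and drop every token found in STOP_WORDS."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


tokens = remove_stop_words("yeah we like to keep the customer data clean")
print(tokens)
```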

Vectorizing

Now the data can be converted into a numerical representation, or vectors. Some popular methods are count vectorizing, TF-IDF vectorizing, and Word2Vec vectorizing, which uses word embeddings (a topic for another blog). We went with scikit-learn’s TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer. It offers the option of vectorizing only a maximum number of features, which we used in a couple of ways before applying any ML algorithm. To get to know our data a little more as we went along, we set out to find the top 50 features in our dataset. From the word cloud, we see data, customers, and services stand out as front-runners, while product, team, and processes follow not too far behind.

Top 50 Features (Stemmed)

During our interviews we asked each person which of the CDEs we were given were the most important, and whether they felt any had been left off. We filtered the data to only include comments related to the mentioned CDEs. We then vectorized this subset and generated the word cloud below from the TF-IDF scores. Our initial goal was to ensure that the top 10 were the right ones, that none were missing, and that none were coasting (included without deserving priority). The results allowed us to confirm the priority of our initial CDEs (blue colors) and identify a couple more to be included (purple: Contract and Contact). Ultimately, the client supported adding Contact to the priority list but keeping Contract for the next phase, as it was more of a high-effort data element.

K-Means Clustering

Finally, we are ready to apply scikit-learn’s K-Means algorithm. But in order to do that, you must magically know how many clusters you’d like to see. Unless you are very familiar with the data, this is not an obvious decision. We had no idea, so we turned to two trusted methods to figure it out: the Silhouette method and the Elbow method. The Elbow method came through for us, and we continued with an initial choice of 2 clusters. We also explored 4 and 9 clusters and found some insights, but we found the best cohesion in the results for 3 clusters. The ordered features are displayed below.

Top 20 features per cluster

Looking at the top features in each cluster points us in the right direction to focus our efforts. It is clear that the client’s priorities are their data and how to manage it, the handling of their customers’ accounts, and their product.

While pointing us to the areas of concern, these clusters give us a head start on solutions and also validate some of the initial choices for product CDEs: component, family, manufacturer, instrument, lab, and description. These will be used when working on the data models that will contribute to the proper handling of their data.

Since the #1 feature is ‘data’, we wanted to dig a little deeper into the ‘data’ related theme. Is the client concerned about organizing and accessing the data? Or the methods to import it and process it? Perhaps data security is lacking? The details will determine which route to follow when we search for a solution.

We filtered the transcripts to only include strings containing data-related terms. We processed the text once again to join words into a single compound word to make our analysis easier. So, if we find ‘data’ followed by ‘governance’ we turn it into ‘data-governance’.
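The joining step can be sketched with the standard library alone. The set of word pairs and the sample lines below are hypothetical; in practice the pairs would come from inspecting the data-related lines.

```python
from collections import Counter

# Hypothetical word pairs to merge into a single hyphenated token.
COMPOUNDS = {("data", "quality"), ("data", "governance"), ("customer", "data")}


def join_compounds(tokens):
    """Scan left to right and merge each adjacent pair listed in COMPOUNDS."""
    joined, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMPOUNDS:
            joined.append("-".join(pair))
            i += 2  # both words were consumed by the compound
        else:
            joined.append(tokens[i])
            i += 1
    return joined


# Made-up, already-cleaned lines.
lines = [
    "data quality is our biggest issue",
    "we need data governance and better data quality",
    "customer data lives in two systems",
    "the instrument ships next week",
]
# Keep only lines that mention "data" at all, then count the compounds.
data_lines = [line for line in lines if "data" in line.split()]
counts = Counter(
    token
    for line in data_lines
    for token in join_compounds(line.split())
    if "-" in token
)
print(counts.most_common(10))
```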

From the top 10 counts (stemmed) we see that data-quality is, by far, the biggest concern. Data-governance is the next area we’d want to address. One that stands out is customer-data. Part of their product is data itself, but this suggests there is a separate set that needs attention as well, and that is the internal data about their clients.

The top 20 features list now shows data-quality taking priority over customer-related features.

Sentiment Analysis with AWS Comprehend

Focusing on the problems that need attention is not enough to paint a full picture. Knowing what is functioning well can guide us to a better solution. We want to fix the issue without breaking something that is already working. With this in mind, we used Sentiment Analysis to get a feel for what areas are perceived as positive, negative, or neutral.

Amazon Web Services (AWS) offers Comprehend as a quick and effective NLP tool for sentiment analysis. We stored the text files in an S3 bucket. Once Comprehend is done processing the data, it drops a file with the results in the same bucket.

But that is not where the analysis ends. Comprehend assigns positive, negative, neutral and mixed scores to each line, as well as the sentiment derived.
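Once the results file has been downloaded and unpacked, the per-line records can be read with a few lines of Python. The sketch below assumes one JSON object per line with `Line`, `Sentiment`, and `SentimentScore` fields, which is our understanding of the format for one-document-per-line jobs; the file name and the scores are made up, and the layout is worth checking against an actual output file.

```python
import json
from collections import Counter

# Made-up records in the assumed shape of Comprehend's per-line results.
raw_output = """\
{"File": "interviews.txt", "Line": 0, "Sentiment": "NEGATIVE", "SentimentScore": {"Mixed": 0.01, "Negative": 0.90, "Neutral": 0.07, "Positive": 0.02}}
{"File": "interviews.txt", "Line": 1, "Sentiment": "POSITIVE", "SentimentScore": {"Mixed": 0.02, "Negative": 0.03, "Neutral": 0.10, "Positive": 0.85}}
{"File": "interviews.txt", "Line": 2, "Sentiment": "NEUTRAL", "SentimentScore": {"Mixed": 0.01, "Negative": 0.04, "Neutral": 0.90, "Positive": 0.05}}
"""

# One JSON document per non-empty line.
records = [json.loads(line) for line in raw_output.splitlines() if line.strip()]

# How many lines fall under each derived sentiment label.
sentiment_counts = Counter(record["Sentiment"] for record in records)

# For each line number, the score category with the highest value.
strongest = {
    record["Line"]: max(record["SentimentScore"], key=record["SentimentScore"].get)
    for record in records
}
print(sentiment_counts)
print(strongest)
```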

Mostly, we were interested to see the sentiment surrounding some of the critical data elements that will be used in a possible data modeling solution. The sentiment wheel below gives a quick overview, starting with the most mentioned term ‘ship to’ and moving clockwise to the least mentioned ‘instrument’.

Top CDEs Sentiments

A quick note: we should not read too much into whether the sentiment is “positive” or “negative”, as either one simply means that the CDE has a higher opportunity for benefit. The same underlying message is conveyed by the sentences “The Lab information is always wrong which prevents us from […]” and “Having accurate Lab information allows us to […]”. Both convey how important the element is, so while the positive/negative split is interesting to see, the total is really what helps us drive priority. In this case, these are the CDEs with the highest total sentiment scores.
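One way to compute such a total is sketched below. It rests on an assumption of ours for illustration: that the "total" for a CDE is the sum of the positive and negative scores over every line that mentions it. The CDE list, the sample lines, and the scores are all made up.

```python
from collections import defaultdict

# (cleaned text, positive score, negative score), all values invented.
scored_lines = [
    ("the lab information is always wrong", 0.02, 0.91),
    ("having accurate lab information helps", 0.88, 0.03),
    ("the ship to address is usually fine", 0.40, 0.10),
]
CDES = ["lab", "ship to"]  # hypothetical subset of the CDE list

# Assumed definition: positive plus negative score, summed over mentions.
totals = defaultdict(float)
for text, positive, negative in scored_lines:
    for cde in CDES:
        if cde in text:  # simple substring match, enough for a sketch
            totals[cde] += positive + negative

ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for cde, total in ranked:
    print(f"{cde}: {total:.2f}")
```

Under this definition a CDE that is discussed often and with strong feeling ranks high whether the comments are complaints or praise.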

The Bottom Line

The NLP and ML methods described above were very useful tools in our Discovery & Framing, giving our team a quick overall view of and feel for our client, their priorities, and how we can move forward. We also get the added bonus of showing that valuable data can be found in unexpected places. This moves conversations from ‘How do we manage our data?’ to ‘Do we have a full understanding of our data?’. It allowed us to focus prioritization and conversations on higher-value work, with less effort for all parties. Now the client can fully appreciate the options we bring to them.

Author

Mayra Quiroga