An Experimental Development Process for Making an Impact with Machine Learning

Originally published on Towards Data Science.

It’s really hard to build product features and internal operations tools that use machine learning to provide tangible user value. Not just because it’s hard to work with data (it is), or because there are many frivolous uses of AI that are neat but aren’t that useful (there are), but because it’s almost never the case that you’ll have a clearly defined and circumscribed problem handed to you, and there are many unknowns outside of the purely technical aspects that could derail your project at any point. I’ve seen a lot of great articles on the technology side, providing code and advice on how to work with data and build a machine learning model, and on the people side, covering how to hire engineers and scientists, but that’s only one part. The other part is how to steer the technology and people through the hurdles of getting this kind of work to have an impact.

Fortunately, I’ve failed many times, and for many reasons, to deploy AI that provides business and user value, and watched friends and colleagues from startups to Fortune 500 data science and research groups struggle with the same. Almost invariably the technology could have been valuable, and the people were competent, but what made the difference was how people were working together and what technology they were working on. In other words, I trust you can hire good, technically able people who can apply their tools well, but unless it’s the right people building the right things at the right time for the right business problem, it’s not going to matter. (Yeah, no duh, right? But it’s harder to do than you may think.)

In this post I’ve tried to consolidate learnings, reference existing articles I’ve found useful (apologies to the references I’m sure to have missed), and add some color to why building experimental products is hard, how it’s different from other engineering, what your process could look like, and where you’re likely to encounter failure points.

Caveat: As with anything this dynamic, you’ll need to adapt the parts that make sense to your own needs and constantly evolve it, but most of the basic building blocks should be applicable. There’s also nothing smart here, mostly common sense, but a lot of things are like that in retrospect. What I’ve tried to do is put things into a conceptual framework that supports an experimental development process — establishing a problem, defining metrics, building something simple end-to-end that can be learned from and iterated — that’s deceptively similar in theory but quite different in practice from what many developers, managers, and especially folks outside R&D are used to.

Feel free to skip ahead to the summary or further reading.

Why an Experimental Development Process

In the software engineering we’re most familiar with, we start development with a set of specified product or business requirements that need to be implemented as software. Today’s popular processes, based on sprints of small features that incrementally build a project, were established around predictability, not experimentation. They’re optimized for building software where we may be inventing the product capability, but not the technology, so we know it will work and can estimate approximately how long it should take. This is by no means easy. There is still uncertainty, there are software and design constraints, and there are cost/time/performance tradeoffs, but those tradeoffs and the methods for dealing with them are better understood. At a high level these processes make sense for experimental development too, but when applied they often lead to failure, as it’s very hard to estimate how well an experiment will work out, and when.

On the other hand, when we start product development on an AI enabled feature, or to optimize an internal business process, we have many open questions, the major ones being:

  • What is the right question to ask?
  • What is the best solution to address it?
  • Do we have the data for it?
  • How well will it work?
  • How well can we deliver it into production?

In addition to the software and design constraints, we have uncertainty in the problem and solution itself. Thus, these products have a high experimental and operational risk. Experimental risk exists because, by definition, there are many unknowns. This risk is on both the user side (i.e. the problem, what do they actually want?) and technology side (i.e. the solution, what can we actually build?), which are closely linked.

As we experiment on what we can build and how well it will perform, we change our understanding of how best it will solve a user problem, and evolve the right question to ask.

Thus, these features almost never have a clear finish point, each iteration creates further understanding of the problem and opens new solutions.

Operational risk exists as with any project, and is increased by the experimental process, since timeline uncertainties and iterative solutions can lead to longer project lifecycles and missed opportunities for delivery. It’s very difficult to build an experimental project on a predictable schedule.

In fact, it’s useful to expect to build it wrong the first (multiple) times no matter how much time you spend on it. This means you won’t understand everything perfectly, nor build it right before getting anything out. Data products have a natural course of evolution: the guiding principle is that we have a reasonable, working but imperfect solution that we use to collect more knowledge to improve the solution.

Failures

Some of the most common experimental and operational failures are below. No matter what you do, don’t expect to avoid them entirely, but you can definitely learn to reduce them.

Not getting to the point of building a solution. Not because it was tested, failed, and the project was closed (that’s fine), but because you got stuck iterating on a problem with no end in sight. This often comes back to not having established user success criteria, or a progress-tracking mechanism with frequent stakeholder check-ins.

Building a solution that is not productionized. This is often accompanied by not having sufficient organizational or stakeholder buy-in early on, not establishing engineering deployment feasibility or accounting for cost of infrastructure, or time of tech transfer and maintenance, or going for an overly complex solution to start which cannot make the transfer from prototype.

Productionizing a solution but not iterating on it. This is often accompanied by not having the proper infrastructure for rebuilding and redeploying solutions and the necessary data pipelines to collect feedback and monitor metrics.

Building a solution that is not useful. Everything worked out, but it’s still a failure, and arguably the worst kind since more time was wasted. This could happen for a number of reasons, but most commonly it comes back all the way to the beginning in not having alignment with the stakeholders on an impactful and actionable problem and thus working on a problem with unclear value for the business.

We have all experienced each of these failures several times over, and it’s disheartening for everyone.

You think you’ve wasted time and feel you’re incapable of making an impact; stakeholders think you don’t have empathy with the user and don’t understand the business problems; engineering has no idea what you’re doing and thinks you don’t do anything; all the while, the rest of the company is working to hit deadlines. Over time this compounds into a loss of credibility and questions about what value experimental projects bring if they never actually end up solving a business problem.

Experimental Development Stages

No one likes process for process’s sake; it adds overhead, and when executed poorly (which is likely) it takes away otherwise productive time. But unless you’re working by yourself, and likely even then, you need some structure to organize around. A process is only good insofar as it helps you achieve an outcome, so make sure the process isn’t the goal, and that it’s as lightweight as possible.

The ultimate outcome we’re focusing on is solving a business or user need. But let’s break that down into intermediate outcomes:

  • identifying an impactful business problem
  • getting a solution out quickly
  • learning from and iterating on it

The single most important predictor of success for all of these outcomes is making sure everyone involved understands and agrees on what problem is being solved and how (in corporatese: you need alignment with the stakeholders). Not just once at the beginning, but as things inevitably keep coming up. Since this requires constant and active communication between you, product, design, engineering, and any other business stakeholders, a large part of the below is really just meant to facilitate you talking to someone at the right times, telling them what they need to know, and getting out of them what you need to know (we’ll talk about what the what is later, but for now notice the latter isn’t a passive ‘they tell you’; it’s an active effort on your part to gather information).

You need that communication to give you:

  • insight into the business or user needs, to identify a concrete, valuable problem with a clear impact;
  • feedback loops on the quality of results as you experiment, to inform direction and acceptance;
  • engineering assessment of the viability of the solution;
  • resources for data collection and pipelines, monitoring performance, and creating feedback loops to the models;
  • capability for allocating infrastructure resources for efficient experimentation and productionizing the solution.

This can be broken down into the following major stages: Ideation, Experimentation, Data Collection, Prototyping, Tech Transfer, and Monitoring, illustrated in the diagram below. I’m going to assume that you’re working in an environment where your primary stakeholders are product and engineering teams, but most of this remains applicable even if the names change.

Experimental development stages, assuming research, product, and engineering organization. Product and engineering stages may differ, but should have a closely corresponding one. The light pink box is dominated by your experimentation, the light purple by the user problem and solution, and the light green by engineering. Gray boxes represent expected output from each stage.

These stages are in practice almost never sequential; rather going back and forth, especially between ideation, data collection, and prototyping. Each stage in the process has at least four definable components: goals, expected inputs, expected outputs, and estimated timelines. On a high-level these components will be consistent across projects, but the exact instantiation for each will vary from project to project.

A key to keep your process transparent and accountable is to maintain a shared document for each project where you define what the components are as you go through each stage. In addition, it’s a good idea to maintain a shared document that has the updated status of all projects you’re working on across all stages of development, including previous successes that you can point to, and previous failures and where they failed.

The table below summarizes the process of what we’re trying to achieve in each stage and what to watch out for; the rest of the document describes each of these in full.

Ideation

The first, and most important step of the entire process is to establish the problem you’re solving. This should be a concrete user problem — ill-defined and ill-structured problems are very tough to solve (if we don’t know what we want, how would we know when we solved it?). It should also have a high measurable value to the business — increasing customers or revenue, e.g. offering a new product, opening a new market, increasing customer satisfaction; or saving cost, e.g. by optimizing or automating part of a workflow.

Almost every discussion on data science or machine learning in business will start with getting alignment from the business. It’s almost impossible to overstate how critical that is. Two broad strategies for identifying problems are a) coming up with them internally and “pitching” solutions to the stakeholder, and b) stakeholders coming to you with a problem. In reality, it’s usually more collaborative than either one or the other, but in so far as the distinction exists, b) is much more likely to expose a valuable problem for the business, create a sense of alignment for the stakeholder that will lead to accountability for you, and serve to prevent getting stuck; in short, much more likely that the solution will be built and have an impact.

It’s tempting and inescapable that you will at one time or another think: but I have a great new idea for something no one is even thinking about. Unfortunately, it’s also very difficult for that to become a success story without a lot of extra effort on your part getting buy-in (fear not though, you may eventually get there after building trust by repeatedly delivering something of value). If that is the route you want to take, then make sure to focus on why what you’ll build provides value, not how or what you’re building.

So how do you know you’ve got the right problem? It’s not only okay, but requisite, here that you question the stakeholder on the proposed business value intensely. This can be tricky, as many people aren’t used to separating an intellectual challenge to an idea from a challenge to themselves or their role, especially at different organizational levels. You also don’t want to let them down and diminish their enthusiasm for working with you, as they might find it odd that you’re not jumping at the opportunity to apply your skills, but everyone benefits from establishing the strongest case for or against a project as early as possible. It takes some practice, but making that explicit upfront, and then affably and collaboratively arriving at the business value, is well worth it.

We’re all keen to please, and with our mighty machine learning hammer most things look like a nail, so make sure it makes sense to apply machine learning (we’ll get into this more in the later stages, but even at this point this will save you time). It may create some buzz, and that may be why you’re doing it, but machine learning won’t magically make a feature valuable if there isn’t something there already, and even if you could do it, not every problem needs AI. If the problem can be solved with a simple algorithm, do the simple thing, it may hurt your ego, but it will help your chances. Much easier to pass on a project now than have it not be useful after all the work is done.

Another key is to ensure you and the stakeholder understand how the user behavior will change based on the output and analysis provided, and what change management is required to get there. Even if you can affirmatively answer the subsequent technical questions (i.e. you have the data, can build a model, can scale it), if there will be no change in action based on the output provided (likely because it’s solving the wrong problem, or because the output is unimportant or immutable), or the change requires significant change management (and therefore likely lack of adoption), then you’re not adding value.

Useful areas to consider for determining if it’s the right time are: your honest guess at the risk of failure at each stage and the effort involved; resource dependencies, including infrastructure and data; and the complexity of the delivery mechanism. We’ll get to each of these in turn.

A good place to start is with problems that have a high impact value (the output will result in user action), enhance an existing user workflow (not requiring them to change it), and have low risk and effort (available data, ability to quickly prototype, performant algorithm, simple delivery mechanism).

The more of the above you can satisfy, the more likely you are to get relatively quick demonstrations of value. Starting with these kinds of problems helps build confidence from stakeholders and you in your ability to deliver value, thus creating the trust necessary to work on problems that have higher risk.

That doesn’t mean you shouldn’t tackle complex problems where things aren’t as clear; of course you should. Most of us enjoy challenging problems without easy answers, where we only get to claim victory after persevering over difficulties and creating a novel solution (that’s what drives us to do what we do, right?). These lead to more personal growth and fulfillment, and sometimes solving them can create tremendous value. That’s why they are so attractive.

However, you have constraints (your limited bandwidth, your company’s finite runway, etc.), so you can’t work on everything. When prioritizing which problem to pursue, be careful not to equate difficulty (of the problem, and of what you’re going to have to go through to solve it) with value. Difficult problems are often not where the greatest business impact lies. It often lies in the simpler problems (or at least problems that can be reframed as, or broken down into, simple problems). So take on difficult problems (live a little!), just try to simplify them.

If you find yourself spending lots of time fighting against something (e.g. convincing the stakeholder the output is valuable, taxing the infrastructure), that’s a good sign it’s not the right time. Sometimes it’s tough to let these go, especially if you strongly believe in its value, but that time could be better spent elsewhere for now.

Helping stakeholders identify these kinds of problems will increase the likelihood your work has an impact (this is where the distinction between a) and b) above gets blurry). To many (most?) people AI and machine learning are still unknowns, and they don’t understand why predicting A is currently impossible but reframing the problem to predict B — which looks a lot like A — is super easy, or how by solving B you can actually give them C, D, and E too, which may actually be more valuable for them. You need to bridge those gaps.

Even when someone comes asking you to build a specific machine learning solution, it’s beneficial to open up a brainstorming session and back off to the problem. Don’t get too technical, but suggest different framings of the problem, and examples of different kinds of things you could do to solve the problem.

Some formulations of the problem may be equivalent to the stakeholder or user, but you know may take it from impossible to easy. It’s up to you to really understand what the problem is (rephrased: it’s up to you to help others understand their problem).


After the problem, the next most important step is to identify the evaluation metrics and success criteria; knowing what to measure and how you evaluate performance means you know how well you’re solving the problem (and when you’re good enough, or not). Importantly, although this includes experimental model evaluation metrics (e.g. accuracy, recall), it’s really about aligning on user or business success criteria with the stakeholder. The evaluation metric is what you optimize; the success criteria are what the stakeholder or user cares about, and they’re not likely to be the same. For example, it’s usually not clear how a 5% improvement in recall or a drop in log-loss will affect the user experience (it may or may not), so what’s important is to translate that into actual value to the user or business (e.g. conversion ratio, number of items saved, time spent to perform an action, number of additional seats sold). Do your best to agree on the target success criteria before starting, otherwise it can become a moving target later on (and an enabler for getting stuck in endless optimization).
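To make the distinction concrete, here’s a minimal sketch (with entirely hypothetical numbers and names) of tracking the evaluation metric and the success criterion as two separate, explicit quantities:

```python
# Hypothetical illustration: the metric you optimize vs. the metric the
# stakeholder cares about, tracked side by side.

def recall(y_true, y_pred):
    """Model evaluation metric: fraction of true positives recovered."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    return tp / positives if positives else 0.0

def conversion_rate(sessions):
    """User success criterion: fraction of sessions ending in a conversion."""
    return sum(1 for s in sessions if s["converted"]) / len(sessions)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
sessions = [{"converted": True}, {"converted": False}, {"converted": True}]

print(f"recall = {recall(y_true, y_pred):.2f}")         # what you optimize
print(f"conversion = {conversion_rate(sessions):.2f}")  # what the stakeholder cares about
```

An improvement in the first number does not imply an improvement in the second; the point of the agreed success criteria is to force that translation.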

One area many users struggle with is that no matter what you do, the model will make mistakes — there will be an error rate. Some applications, like a movie recommendation, can be forgiving, whereas others, like a medical diagnosis, have a much higher risk, where even one mistake can completely degrade user trust in the model. Make sure you and the stakeholder understand what the cost of a mistake is. Working on higher risk applications increases the chance of failure, but again, that doesn’t mean they should be avoided. Instead, try to reframe the problem by breaking it into simpler pieces, such that the output may be partially correct, or still provide something of value even when there are errors.

It’s also useful to put the error rate in the context of the status quo. Sometimes people have a tendency to downweigh the errors in whatever they do now, and you are exposing errors to them that were always there, but were never quantified. Now that they are, for better or worse, they’re associated with your model. Whether you’re allowing for a feature that would not otherwise be possible, or automating a previously manual task, if you can show that some dimension has improved even though it’s not perfect (e.g. similar accuracy as human but much faster), you don’t need to worry about the model errors quite so much.

Once you’ve established it’s an important problem that makes sense to move forward on now, with a known metric of success, you can propose a solution. This solution is like a hypothesis, and will make a lot of assumptions:

  • That you have the data / resources
  • That you know how to model the problem and build a system that will function well enough
  • That the system will scale
  • That it is technically viable and affordable for the company to run now

To set the right expectations with your stakeholders it’s important to break the problem down, explain why and which pieces are hard (without getting too technical), where the assumptions come in, and where risks and limitations are likely to be.

Each of these assumptions will be tested and validated throughout the next stages of the project, but each will inevitably be wrong in some way. It’s totally fine, and expected, to follow your intuition and make simplifying assumptions. Experimental projects are usually trying to solve a hard problem, oftentimes without a great understanding in the beginning of the problem you’re trying to solve. It’s easy for people unfamiliar with this process to lose enthusiasm or confidence as things drag on. You’ll need to be the champion of the project, believe it will work and bring value, and shepherd it all the way through the next stages.

Data Collection

With a problem and proposed solution in hand, you need to identify if you have the data resources to build it. A lot of helpful advice has already been written on data collection, so I’ll be brief. This is where most projects quickly run into problems.

Every data product relies on having available, clean, and reliable data for building, and this is almost never the case. Data is reliably missing, noisy, and unreliable. A huge chunk of time for any machine learning project will be spent aggregating and cleaning the data. That’s often where the biggest performance gains come from too. Cleaning and prepping the data isn’t outside the problem, it is in large part the problem.
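For a flavor of what that looks like in practice, here’s a minimal sketch of typical cleaning steps — deduplicating, dropping incomplete records, normalizing values — where the field names and rules are hypothetical, and the right choices always depend on your problem:

```python
# Hypothetical raw records: duplicates, missing fields, inconsistent formats.
raw = [
    {"user_id": "a1", "plan": " Pro ", "spend": "120"},
    {"user_id": "a1", "plan": " Pro ", "spend": "120"},  # exact duplicate
    {"user_id": "b2", "plan": None,    "spend": "35"},   # missing plan
    {"user_id": "c3", "plan": "free",  "spend": "0"},
]

def clean(rows, required=("user_id", "plan", "spend")):
    seen, out = set(), []
    for row in rows:
        if any(row.get(k) is None for k in required):
            continue  # drop incomplete rows (or impute, depending on the problem)
        key = tuple(row[k] for k in required)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        out.append({
            "user_id": row["user_id"],
            "plan": row["plan"].strip().lower(),  # normalize categorical values
            "spend": float(row["spend"]),         # coerce types
        })
    return out

cleaned = clean(raw)
print(cleaned)  # 2 usable rows out of 4
```

Multiply this by every source you’re aggregating and it’s easy to see where the time goes — and why the gains from doing it well are real.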

It is often necessary here to iterate and reframe the user problem in terms of available data. Note that we do not start by examining our data and asking what problem we could solve with it, but rather once we have a problem in hand, what data do we have available to solve it. As you understand the limitations of your data, and reframe the problem, it’s critical to ensure with the stakeholder that this does not fundamentally change the value.

In most cases, the data may be partially available, but incomplete, or ill-structured, since it wasn’t collected with your solution in mind. In others, the data may not exist at all and you have to collect it. This includes finding it available in the wild, collecting it automatically (from users or outside your application), manually creating it, or crowdsourcing. If the data needs to serve as a training set with some form of class labels that do not exist, this includes automatically bootstrapping labels or manually labeling.

It’s okay to have a smaller sample size and some noise at this point. Try to gather enough data that it’s reasonably representative but not a burden, and clean enough so that it’s not clearly wrong. Beyond that there’s diminishing returns from spending too much time here now.

As a running theme, don’t prematurely optimize. No matter how much time you spend here, you could spend more. It’s just that you shouldn’t spend too much time here before you move through the other stages and know you’re going to be using the data. I’ve seen this go poorly on both sides. On one, too little or messy data is carried forward to a prototype, turning out to be unrepresentative. On the other, lots of time is spent gathering and waiting for more and more data before getting to a prototype. In the former, you can cycle back and include more data in your existing model and immediately measure the improvement. In the latter, who knows at what point it would have sufficed.

At this point you should think ahead about whether you could create robust data pipelines and monitor data quality if you needed to, but don’t worry about actually doing it yet (we’ll get to that later).
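As a sketch of what that monitoring might eventually check — the simplest useful signals are the missing-value rate and out-of-range values per field (field names and ranges here are hypothetical):

```python
# Hypothetical data-quality report: per-field missing and out-of-range rates.
def quality_report(rows, ranges):
    report = {}
    for field, (lo, hi) in ranges.items():
        values = [r.get(field) for r in rows]
        missing = sum(v is None for v in values) / len(values)
        out_of_range = sum(
            v is not None and not (lo <= v <= hi) for v in values
        ) / len(values)
        report[field] = {"missing": missing, "out_of_range": out_of_range}
    return report

rows = [{"age": 34}, {"age": None}, {"age": 210}, {"age": 28}]
report = quality_report(rows, {"age": (0, 120)})
print(report)
```

Running something like this on each new batch of data, and alerting when the rates jump, is most of what “monitoring data quality” means at this stage.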

Prototyping

This stage is the major reason for the necessity of an experimental development process. It’s close to impossible to conduct experiments on a fixed schedule; it may take a day to train a model, realize it doesn’t work, implement new features, get new data, retrain, repeat…for a week or a month before realizing it’s not working out.

This leads to a natural and inevitable friction between experimentation and delivering a feature on the roadmap or internally on schedule. It’s easy to fall down rabbit holes: trying new data, new models, new features, new architectures, etc. It makes you feel busy and productive, but you may not actually be accomplishing much. This is especially the case without a clear user problem or well defined success criteria. The overarching goal is to get to a point where you know how to solve the problem, or realize you can’t, as quickly as possible.

The process challenge is to create a balance: make space for exploration while identifying when you’re heading down an unproductive rabbit hole.

You need a mechanism for setting goals and monitoring your progress against them in small timeboxes, usually weekly. You can create a prioritized roadmap of experiments to do, record the results internally, and report out changes in performance on the success criteria to the appropriate stakeholders on a frequent basis to validate how to proceed. Each week you should update your intuitions about the likelihood of success, using evidence to show that you are either moving forward, or you are stuck.
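One lightweight way to do this is to keep the experiment log as data, so “are we stuck?” becomes a computed flag rather than a feeling. A minimal sketch, with hypothetical metric values and an arbitrary stagnation rule you’d tune to your own cadence:

```python
# Hypothetical weekly experiment log against the agreed success criterion.
experiments = [
    {"week": 1, "change": "baseline linear model", "metric": 0.61},
    {"week": 2, "change": "added text features",   "metric": 0.68},
    {"week": 3, "change": "tuned regularization",  "metric": 0.685},
    {"week": 4, "change": "new architecture",      "metric": 0.686},
    {"week": 5, "change": "more tuning",           "metric": 0.687},
]

def is_stuck(log, window=3, min_gain=0.01):
    """Stuck if the last `window` experiments gained less than `min_gain` total."""
    if len(log) < window + 1:
        return False
    recent = [e["metric"] for e in log[-(window + 1):]]
    return (recent[-1] - recent[0]) < min_gain

print("stuck?", is_stuck(experiments))
```

The exact rule matters less than having one agreed upfront, so the weekly check-in argues about evidence instead of optimism.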

A nice forum for this is having demos of ongoing projects with intermediate results to show progress. It’s natural to be uncomfortable here, since you know these early prototypes aren’t perfect, and you may still have many open questions, but the benefits are creating a forcing function for getting a project to a point of intermediate completion, thinking about design and UX, getting stakeholder feedback in early and changing direction, and finally, creating excitement about the possibilities and a sense of progress. If you wait until you have everything sorted before sharing with stakeholders, you’ve set yourself up for failure.

Building simple, working, prototypes and sharing them externally is a great way to derisk the project to ensure the core value is there, motivate the potential value, expose problems early, increase buy-in, and keep momentum going.

I want to underscore the importance of thinking about UX and design early. You are likely able to look at a terminal output with some examples or a confusion matrix and visualize several hypothetical displays (charts, lists, percentages, numbers, etc.). Since you know that their information content is isomorphic to one another, it doesn’t really matter to you how it’s visualized. Your stakeholders likely cannot do the same transformations. They will get caught up in specific examples and how they’re displayed. This will limit the broader message and derail them from seeing the potential value of the output.

As part of the ongoing demos don’t just present terminal output; provide several examples and variations of mocked up output. For example, some people like looking at numbers (mostly you), some like looking at visuals (where you still have to do cognitive work), while some just want the insights given to them directly. Almost no one understands what to do with a probability.

Often a good place to start is boiling down the output to something very simple, like a rating or label. Surprisingly (but not really) most people don’t like looking at numbers as much as we do. This may change how you model the problem, or just how the output is presented to the user.
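As a trivial sketch of that boiling down, here’s a probability mapped to a plain-language label; the bucket boundaries and label names are hypothetical, and in practice you’d choose them with the stakeholder:

```python
# Hypothetical mapping from a model probability to a user-facing label.
def to_label(prob):
    if prob >= 0.8:
        return "very likely"
    if prob >= 0.5:
        return "likely"
    if prob >= 0.2:
        return "unlikely"
    return "very unlikely"

for p in (0.93, 0.55, 0.07):
    print(f"{p:.2f} -> {to_label(p)}")
```

The information lost in the bucketing is usually a feature, not a bug: the user wanted a decision aid, not a distribution.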

It’s rare that you won’t be able to fairly quickly build at least one simple solution that works somewhat well (at least as a baseline). In practice, I’ve found the potential for getting stuck in rabbit holes to be high after this, as you start prematurely optimizing by trying to improve the evaluation metric performance, without knowing whether the current solution is already good enough for the user success criteria. I know you can make it better, but this is a good reason to get your solution out in front of people quickly, so that you have something to deliver, even as you work to improve it.

An important part of that is whether it will work at the quality level it needs to. Regardless of how simple or complicated the solution, whether it’s rule-based or a deep learning model, most solutions will do something. I’ve found that this is conceptually a major sticking point for people, especially those used to traditional software engineering, where it’s more obvious to verify whether the functionality works as intended (the software runs or it doesn’t).

In one sense the model works, as it will give a prediction, or a recommendation, or some set of results. The problem is evaluating whether the set of results is acceptable to solve the user problem. Note that this is not the same as improving performance on your model evaluation metrics, unless they happen to be the same (which is very unlikely). That’s why it’s so important to have established success criteria, and evaluation tools to measure performance.

If you don’t know how your evaluation metric affects your user success (and you may not), you have no way of determining the right recall-precision tradeoff, or log-loss or accuracy threshold, and you can happily keep iterating and opining on whether it’s good enough.
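A small sketch makes the point: sweeping the decision threshold (over hypothetical scores and labels) trades precision against recall, and nothing inside the model tells you which point the user needs — only the user-facing cost of each error type does:

```python
# Hypothetical threshold sweep: each threshold is a different precision/recall
# tradeoff; choosing among them requires the user success criteria.
def precision_recall(y_true, scores, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(y_true, scores, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```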

Especially while there’s uncertainty about the relationship between the experimental evaluation metric and the user metric, you need to keep the solution as simple as possible, so that you can get it out to the end user sooner and start testing that relationship.

So you’ve been going at it, but the algorithm isn’t working out; or the experimental metrics have plateaued and you may not be able to achieve performance that satisfies the user need; or the data is not as representative as you believed. Perhaps the breakthrough is just around the corner, or perhaps it’s just not the right problem.

How do you know when to stop pursuing a solution that is not coming to fruition? This is one of the hardest parts of the process; there’s no good answer, so this should be iterated as you learn by doing (and to be honest, I don’t know anyone who’s nailed it yet). Since it’s also the most precarious, it requires careful attention and enforcement (having stakeholders who care about the problem hold you accountable is a good start).

A natural time for this is after the proposed set of experiments and intuitions have been exhausted, and the likelihood of success seems low. Here you evaluate the opportunity cost of continuing (including other problems you’re not working on) against the benefit of success.

The exploration of solutions will usually result in modification of the question: you’ll realize that you can’t solve the problem as first posed well enough, but can solve a closely related one. Because you don’t know which question you’re going to end up solving, you should not prematurely optimize the first solution: it shouldn’t be built to scale, and the models don’t need to be optimized. At this point you should be optimizing for iterating on proof-of-concept prototypes.

That said, while not building for scalability, you should be thinking ahead toward scalability, and find the simplest algorithm that will function at scale.

There’s a natural desire to add complexity from the beginning, to try the latest algorithm or package that blew away everything before it, or to build something novel. But remember: you’re doing that for yourself; the user doesn’t care how you do it as long as it works. The most important thing is to have a functional system that can be deployed to gather feedback. For example, a linear model will likely work just fine to start; you can increase the complexity of the algorithm later as necessary. The longer it can stay simple, the better.
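As a sketch of that starting point, a plain logistic regression baseline on toy data (assuming scikit-learn is available; the dataset here is synthetic):

```python
# Minimal baseline sketch: a linear model trained in seconds. The point is
# to have something deployable to test against user metrics; complexity is
# added later only if this falls short.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"baseline accuracy: {baseline.score(X_te, y_te):.2f}")
```

If the baseline already clears the success criteria, you’ve saved yourself a complex model; if not, you now have a reference number that any fancier approach has to beat.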

The more complex the solution, the more resources (computational, engineering, maintenance) it will require to get off the ground, and the more things can go wrong. If you start with something complex, you’ve likely spent too long on prototyping and decreased the likelihood of successfully deploying it, so you’ll likely come back to prototyping after failing at tech transfer. You also don’t need to build everything yourself; using existing open source tools will often get you most of the way there. Now you can take all the time saved and use it to work on that difficult problem :).

Tech Transfer

Once you have built and validated a functional system, the major remaining problem is to productionize the model.

This is often where our vernacular gets us into trouble — you say you’re “done” with prototyping, and it’s easy for a stakeholder to hear “it’s ready”, or assume that since you’ve removed a lot of the uncertainty and unknowns, the problem becomes much closer to one of software engineering and should now be easy or quick.

Despite having gone through the uncertainties of the previous stages, figuring out how to productionize the solution is just as hard, if not harder.

There are many reasons a prototype may not successfully transfer to production. One is scaling. Making something work sufficiently efficiently offline in your Jupyter notebook, where you pulled and joined data once from multiple sources to make a smaller dataset, is one thing. Doing the same in production, in real time, while merging changing data streams from multiple sources is another. It may be computationally infeasible on more data, due to the complexity of the feature computation or of the algorithm. Or certain assumptions were made about how the data comes in or what it contains that don’t hold true in production, and will require a significant investment in data pipelines that validate, normalize, merge, and serve all the data you need.
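For example, the assumptions that held implicitly in the notebook can be made explicit as validation checks on every incoming record. The field names, types, and ranges below are hypothetical:

```python
# Sketch of validating production records against assumptions baked into
# the prototype. In a notebook these checks happen implicitly, once; in
# production every record needs them, or the model silently sees garbage.
def validate_record(record: dict) -> list[str]:
    """Return a list of violations; an empty list means the record is usable."""
    errors = []
    if not isinstance(record.get("user_id"), str):
        errors.append("user_id missing or not a string")
    age = record.get("age")
    if not isinstance(age, (int, float)) or not (0 <= age <= 120):
        errors.append("age missing or out of range")
    if record.get("country") is None:
        errors.append("country missing")  # the prototype assumed it always existed
    return errors

good = {"user_id": "u1", "age": 34, "country": "US"}
bad = {"user_id": None, "age": 250}
print(validate_record(good))  # []
print(validate_record(bad))   # three violations
```

In practice this grows into a real validation layer (schema checks, type coercion, quarantine for bad records), but even a simple gate like this catches the assumption drift that breaks prototypes in production.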

Thus, productionization often requires simplification and optimization of the model, along with making it robust and error tolerant enough to satisfy the service level agreement and quality of service needed in production. It also requires infrastructure which may be financially costly, or difficult to maintain, and resource allocation from engineering teams to integrate, and thus integration on a product roadmap.

With the constant flow of demands on the resources you’ll need from product, engineering, and other places in the business, often for work with more immediate or concrete business value, early alignment is absolutely necessary, although insufficient, for successful tech transfer.

A related problem is a bit circular: you often need to prove the value a solution will bring in order to prioritize it for production, but you need to deploy the solution, and maybe even iterate on it, to gather the data that shows its value. As with the prototype, the simpler the production delivery mechanism, the more likely it will be successfully integrated. In other words, the burden of proof is correlated with the cost of delivery, and when the value is uncertain, a low-cost delivery mechanism increases the likelihood of getting it built.

The easiest way to decrease the cost of delivery is to reduce your computational requirements. We have a natural tendency during model building to build the best model possible — pull in more data, generate more features, combine more algorithms. This is the opposite exercise. Do you need to merge multiple data streams, or can the user metric be satisfied with just one? Do you need to run online, or is a batch computation once a day sufficient? Even when your model or feature computation is trivially parallelizable when you throw enough money at compute, it’s worth simplifying for reasons we’ll discuss below.
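To illustrate how far simplification can go, a once-a-day batch job can often replace an online service entirely. This sketch (the file names, columns, and scoring function are all invented) scores a CSV extract and writes the results to another CSV:

```python
# Sketch of a daily batch scoring job: read an extract, score each row,
# write results. No service, no real-time merging of data streams.
import csv

def score(row):
    # stand-in for a trained model; here a trivial linear score
    return 0.3 * float(row["visits"]) + 0.7 * float(row["purchases"])

def run_batch(in_path, out_path):
    with open(in_path, newline="") as f_in, open(out_path, "w", newline="") as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.writer(f_out)
        writer.writerow(["user_id", "score"])
        for row in reader:
            writer.writerow([row["user_id"], f"{score(row):.3f}"])

# toy input to exercise the job
with open("users.csv", "w", newline="") as f:
    f.write("user_id,visits,purchases\nu1,10,2\nu2,3,0\n")
run_batch("users.csv", "scores.csv")
```

A cron job running this nightly is a perfectly respectable first delivery mechanism, and the output file can feed a dashboard, a spreadsheet, or a downstream table while you gather evidence of value.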

You could also start by delivering internal tools for business operations, or dogfooding features. Since these can often be accomplished through simple delivery mechanisms, like passing CSVs or notebooks, and won’t require the same polish as user-facing production features, it’s a nice way to establish value without engaging as many other organizational resources. You could also try leveraging an existing delivery mechanism, like Elasticsearch or whatever search engine you already use, by indexing your output into it, allowing it to be served immediately alongside existing data.

Finally, strive for a repeatable delivery process. At first the deployment of these features will likely be in one-off or ad hoc fashion, where you create a service or library and bespoke pipelines. Not only does each one take significant time, but as you launch a few of these, the maintenance cost will start building up, and the data pipelines will get cumbersome. As you build momentum within your organization with a few wins, you should start thinking about how to create a more robust engineering pipeline for the delivery of these features that allows you to plug-in new models and make them available relatively easily. Ultimately your goal will be the ability to reuse and chain your work together. By caching certain data or model artifacts, creating common feature stores, and having robust APIs, you’ll not only be able to deliver more reliably into production, but will be able to work through the earlier phases of data collection and prototyping for a new project better too (who doesn’t like compounding returns?).
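One way to think about that repeatable process is a single registry and entry point that every model plugs into, so logging, monitoring, and caching get implemented once rather than per model. A minimal sketch (the interface and the model are hypothetical, not a real library):

```python
# Sketch of a plug-in model registry: each new model registers itself and
# reuses the same delivery path instead of a bespoke pipeline.
from typing import Callable

REGISTRY: dict[str, Callable[[dict], float]] = {}

def register(name: str):
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

@register("churn_v1")
def churn_v1(features: dict) -> float:
    # toy model: churn risk grows with inactivity, capped at 1.0
    return min(1.0, 0.1 * features.get("days_inactive", 0))

def serve(model_name: str, features: dict) -> float:
    # single entry point: logging, monitoring, and caching hook in here once
    return REGISTRY[model_name](features)

print(serve("churn_v1", {"days_inactive": 3}))
```

A real version would add versioned artifacts and a feature store behind `serve`, but the shape is the same: new models become additions to the registry, not new infrastructure projects.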

A sometimes overlooked non-technical aspect of productionizing is supporting the go-to-market effort. Even if there should be little change management involved, the user base needs to be educated. By now the stakeholders hopefully know no machine learning solution is perfect, but the users may not, and ultimately your success depends on them. If they don’t find a feature valuable, even if the stakeholders did, it’s still a failure. (For an internal tool, the stakeholders have likely been only a subset of your users, and for a product feature the internal stakeholder isn’t the user.)

Users likely don’t need to know or understand the internal workings, error rates, etc. as well as the internal stakeholders do, but you can be sure they’ll have questions. You may not be directly responsible for fielding them, but providing resources for those who are can be really useful for making sure user expectations are set correctly and there are minimal misunderstandings and misrepresentations of the feature’s capabilities (‘they think this does what?’ sound familiar?). You can help by writing content for the user, and, since you’ll likely get mostly flavors of the same question, a FAQ.

Monitoring

We’ve made it into production, we’re done, right?! Not quite. Now you need to monitor the performance of your model and update it as necessary. Failing to do this reminds me of the Red Queen’s remark in Lewis Carroll’s Through the Looking-Glass: “it takes all the running you can do, to keep in the same place.” If you don’t update your model it will eventually go stale, and when it does, users relying on it will start noticing that something is wrong, and its value will drop.

This stage is a crucial partner to building a simple solution and deploying it quickly. Before a solution is productionized, there needs to be a plan on how to monitor and maintain the quality of results, how to update the model, as well as collect feedback on the system performance from the user to see what impact it’s having and if it’s achieving the expected user value.

No matter how representative and carefully constructed your training and test data are, once live data is flowing you will uncover new issues (you didn’t spend too much time in data collection, right?). Hopefully your model can tolerate most of them, but just like with any bugs in software, you need to be paying attention when they happen. The underlying data distributions you initially modeled will also likely drift over time, causing your model to misbehave in ways much more subtle than outright failure.
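One common lightweight drift check is the population stability index (PSI) between a feature’s training distribution and its live distribution. A sketch (the 0.2 alert threshold is a rule of thumb, not a standard, and the data here is synthetic):

```python
# Sketch of a PSI drift check: compare the binned distribution of a
# feature at training time against what's arriving in production.
import numpy as np

def psi(expected, actual, bins=10):
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) in empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 5000)       # training-time distribution
stable = rng.normal(0, 1, 5000)      # live data, unchanged
shifted = rng.normal(0.8, 1, 5000)   # live data after drift

print(f"stable PSI:  {psi(train, stable):.3f}")   # small -> fine
print(f"shifted PSI: {psi(train, shifted):.3f}")  # large -> flag for retraining
```

Run against a nightly sample of live data per feature, a check like this turns “the model feels off” into an alert you can act on before users notice.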

You can address these by collecting the new data, updating your training and test sets, and periodically retraining your models. A machine learning feature needs feedback loops to improve your understanding of the user, the data, and the quality of the solution. At first you can have a simple feedback loop where data is collected and then used to manually trigger retraining of a single version of the model. That’s fine as long as you can articulate to the stakeholder and user a plan for how the models will be maintained.

Eventually, as you understand the necessary schedule for recomputation, you may want the model to update online, or to maintain versions of everything — models, training sets, and test sets — for reproducibility and the ability to roll back if something goes wrong. Each of these introduces further engineering complexity that should be avoided until necessary. But as you mature and create more robust pipelines, versioning and monitoring are going to be integral pieces.

Closing Thoughts

Machine learning and other AI technologies are extremely powerful, and when everything aligns properly, you can make a huge impact with them. As a researcher, it’s fun to explore difficult problems and algorithms, but there’s much more to making AI solutions useful than building a well optimized model. You’ve noticed the themes: talk to people frequently and really make sure you understand each other, plan ahead to the requirements of the next stages so you’re not surprised by them, but don’t get bogged down too early, and keep everything technical as simple as possible for as long as possible.

Despite the best-laid plans, you and I are still going to fail sometimes (often?). No process will save you from that, but with some structure to help you navigate the common pitfalls, at least you can learn to reduce the failures. Once you walk through a few projects it’ll start to feel more natural, and once experimental work becomes part of your organizational culture (what’s culture anyway but a set of shared rituals?), then you can really fly.

Further Reading

[1] Domino Data Labs Managing Data Science (Strata lecture, blog)
Excellent guide covering lots of challenges and presenting a similar lifecycle.

[2] IBM’s CRISP-DM
As with most things in tech, IBM did it first, comprehensively.

[3] Science of Managing Data Science

[4] Challenges of Production ML Systems

[5] AI Hierarchy of Needs

[6] Managing DS is Different

[7] Thoughts on Managing DS

[8] Guide to Starting AI

[9] DS Workflow

[10] A Best-Practice Approach to Machine Learning Model Development

[11] Data Science Process Rediscovered

[12] Construct Valuable Data Science Projects

[13] The Most Difficult Thing in Data Science: Politics

[14] Building Data Products / Evolution of Data Products

Author: Vlad Eidelman

Vlad Eidelman is the VP of Research at FiscalNote. Prior to that he worked as a researcher in a number of academic and industry settings, completing his Ph.D. in CS, as an NSF and NDSEG Fellow, with Philip Resnik at the University of Maryland and his B.S. in CS and Philosophy at Columbia University. His research focuses on machine learning for natural language processing applications in computational social science, and has been published in conferences like ACL, NAACL and EMNLP, and appeared in media like Wired, Vice News, Washington Post and Newsweek.
