One of the first decisions that you need to make as a startup is choosing a cloud provider. You might have the feeling that selecting a cloud provider is a lifelong commitment. Therefore, you might be tempted to make a lengthy comparison of all the different features so that you definitely have the ‘best’ cloud solution. Alternatively, you might choose fast and build many layers of abstraction so that you can switch quickly to a new cloud provider if the opportunity arises. But is it really necessary to be so cautious?
Let’s take a step back. If you are in the early stages of your startup, you are still figuring out product-market fit and building a lot of minimum viable products (MVPs). The motto at this stage is to build things fast and fail fast. This means there’s no time for a lengthy cloud comparison study. In this blog post, we’ll cover some of the main considerations for building your MVPs, from the perspective of both architecture and data science.
The architecture for an MVP will consist of the following five components. You will harvest a data source through an API (1). Next, you will have either a batch or a streaming Extract Transform Load (ETL) process (2) that transforms your data and stores it in your data store (3). Your data store is either a relational database, a NoSQL database or a data lake. Then you will have an AI-API (4) that extracts data from your data store and applies your AI algorithms to it. The results of these algorithms are finally shown in the front-end (5). On top of these five components, you also need to think about security and DevOps.
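The five components above can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `reviews` table and the keyword-based "model" are all hypothetical, the data source is a hard-coded stub rather than a real API, and an in-memory SQLite database stands in for a cloud data store.

```python
import sqlite3


def harvest():
    """(1) Stand-in for calling a data-source API; returns raw records."""
    return [
        {"id": 1, "text": "  Great product, love it "},
        {"id": 2, "text": "Terrible support"},
    ]


def etl(raw_records):
    """(2) Batch ETL: clean each raw record into an (id, text) row."""
    return [(r["id"], r["text"].strip().lower()) for r in raw_records]


def store(conn, rows):
    """(3) Data store: an in-memory SQLite table stands in for a cloud database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews (id INTEGER PRIMARY KEY, text TEXT)"
    )
    conn.executemany("INSERT INTO reviews (id, text) VALUES (?, ?)", rows)
    conn.commit()


def ai_api(conn):
    """(4) AI-API: read from the store and apply a toy keyword 'model'."""
    positive_words = {"great", "love"}
    results = []
    for row_id, text in conn.execute("SELECT id, text FROM reviews ORDER BY id"):
        words = set(text.replace(",", " ").split())
        label = "positive" if words & positive_words else "negative"
        results.append({"id": row_id, "label": label})
    return results


def front_end(results):
    """(5) Front-end: render the results as plain text lines."""
    return [f"review {r['id']}: {r['label']}" for r in results]


def run_pipeline():
    """Wire the five components together for one batch run."""
    conn = sqlite3.connect(":memory:")
    store(conn, etl(harvest()))
    lines = front_end(ai_api(conn))
    conn.close()
    return lines
```

Because each step is a small function, any one of them can later be replaced by a managed cloud service without touching the others.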
When you are a data or AI focused startup, your goal is to use or develop an AI-API that will disrupt a particular industry. This will be the main task of your data science team. At the same time, your engineering team will build the architecture. You want the engineering team to get comfortable with your new cloud provider as quickly as possible so that it can build a basic infrastructure, to experiment with that infrastructure without being too afraid of cloud costs, and to enable your data scientists to develop machine learning algorithms fast. One important note: at an early-stage startup, the data science team and the engineering team can each consist of only one person.
To get comfortable with the technology of a cloud provider, conferences like Google Cloud OnBoard, Microsoft Ignite and the AWS Summit are ideal. These conferences enable you to absorb a lot of information about the cloud providers in a short time. You can also meet people who have experience using these technologies for real-world applications. Is there no conference in your area in the near future? No worries, the providers’ websites offer considerable information on their data solutions.
You need two important parts to get started with your basic infrastructure: a relational database, and virtual machines to run your APIs and your front-end layers. The benefit of using a SQL database is that it makes model development easier for the data scientists, because they don’t need to learn a new query language to fetch data. For SQL-based data stores, you have Azure SQL Database for Azure, Redshift for AWS and BigQuery for Google Cloud. For building the APIs, you can work with Linux VMs and containers, which are readily available from all the providers: Azure Containers for Azure, Amazon Elastic Container Service for AWS and Containers on Compute Engine for Google Cloud.
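To illustrate why a SQL data store keeps things simple for data scientists, here is a minimal sketch of fetching model features with a plain SQL query. It assumes a hypothetical `events` table with `user_id` and `amount` columns, and uses Python's built-in `sqlite3` module as a local stand-in; against a cloud database, the query would stay largely the same and only the connection code would change.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a small, hypothetical events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO events (user_id, amount) VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )
    conn.commit()
    return conn


def fetch_features(conn):
    """Aggregate raw events into one feature row per user using plain SQL."""
    query = """
        SELECT user_id, COUNT(*) AS n_events, SUM(amount) AS total_amount
        FROM events
        GROUP BY user_id
        ORDER BY user_id
    """
    return conn.execute(query).fetchall()
```

Calling `fetch_features(make_demo_db())` returns one tuple per user, ready to be fed into a model.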
When you have your infrastructure in place, you should not be paralyzed by war stories about impossibly high cloud bills from founders of older startups. Trying to optimize cloud costs can slow you down a lot. Fortunately, the various cloud providers offer grants for startups.
Cloud providers have no interest in putting you out of business with one big cloud bill. Their goal is for you to be successful. That is why there are programs like Azure for startups and Google for Entrepreneurs. Azure “provides startups with up to $120,000 in free Azure credits, enterprise-grade technical support and development tools – supporting the languages of their choice, such as Node.js, Java and .NET. In addition, qualified startups also get access to productivity and business applications, including Office 365 and Microsoft Dynamics 365.”
Finally, we come to the core aspect of your business: building the AI solution. First consider whether you really need to build all your machine learning solutions from scratch. Can you already build a good data solution with products like Amazon Comprehend, Cognitive Services from Microsoft Azure or the ML APIs from Google Cloud? All providers offer a lot of NLP capabilities, such as named entity recognition and sentiment analysis.
When you find the models that you need, first evaluate how well they perform on your types of data and whether you can build everything that you need with one of these providers. Keep in mind that your data will then live in that provider’s cloud, which may have privacy and legal implications. You are typically not charged to move your data into a cloud, but you are typically charged to move it out again. Switching later is therefore costly, so it is important that you hire developers who will work with the cloud provider that you picked. If you build your models yourself, make sure that you empower your data scientists to be as independent as possible. Enable them, for example, to deploy their own web services with a solution like Azure Machine Learning Studio.
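Evaluating a pre-built model on your own data can start very small. The sketch below assumes you have a handful of labeled examples and some callable that returns a label for a text. Here that callable is a toy keyword rule named `toy_predict`, which is purely hypothetical; in practice you would wrap a call to the provider's API in a function with the same shape.

```python
def evaluate(predict, labeled_examples):
    """Compare predicted labels with expected labels.

    Returns overall accuracy and, per expected label, the share of
    examples that were predicted correctly (recall).
    """
    correct = 0
    per_label = {}
    for text, expected in labeled_examples:
        predicted = predict(text)
        stats = per_label.setdefault(expected, {"total": 0, "correct": 0})
        stats["total"] += 1
        if predicted == expected:
            correct += 1
            stats["correct"] += 1
    accuracy = correct / len(labeled_examples) if labeled_examples else 0.0
    recall = {
        label: stats["correct"] / stats["total"]
        for label, stats in per_label.items()
    }
    return {"accuracy": accuracy, "recall": recall}


def toy_predict(text):
    """Hypothetical stand-in for a provider's sentiment API."""
    return "positive" if "good" in text.lower() else "negative"
```

Running the same labeled examples through each candidate provider gives you comparable numbers on your own data rather than on a vendor's benchmark.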
I hope you feel more empowered to choose your cloud provider. If you still have questions and you are located in the Waterloo region, you can always ask the growth coaches at ODX for their expertise.
Written by Mary Loubele, Data Growth Coach
Mary Loubele is the Analytics Dev Manager at MappedIn. Prior to joining MappedIn, she was a Senior Data Engineer at TalkIQ, which was acquired by Dialpad in May 2018.
Prior to joining TalkIQ, she was the Director of Data Science and Engineering at FunnelCake. Before that, she held positions as a Data Scientist at D2L and as an NLP Software Developer at Maluuba, now a Microsoft company. She holds a PhD in Medical Image Computing and a Master’s degree in Computer Engineering from KU Leuven. She also organizes several meetups in the Waterloo region, including KW Intersections and Waterloo Data Science and Engineering.