Recall the functional and nonfunctional requirements you gathered for the batch data pipeline used to serve training data for the product recommendation system. Now, let's suppose that since you are the first data engineer at this e-commerce company, you prefer to use data tools that provide more convenience and help you avoid undifferentiated heavy lifting, such as writing custom code.
Which of the following combinations of AWS services would you use to implement the batch data pipeline?
Correct
✅ Source System: Since you'll be ingesting tabular data, the source system can be a relational database, like Amazon Relational Database Service (RDS).
✅ Ingestion and Transformation: Since the batch pipeline needs to run on demand (not continuously), and since transforming the data into training datasets is a simple task that can be executed in a short period of time, a serverless option is a good fit here. AWS Glue is a serverless ETL service that makes it easy to create ETL jobs: it can extract data from the source database, transform it, and load it into downstream storage. So AWS Glue ETL is a suitable option for this batch pipeline. For simple workloads that run on demand, AWS Glue can be reasonably priced; see the AWS Glue pricing page to learn more. Because AWS Glue ETL is serverless, the pipeline is easy to maintain and manage, even if you're the only data engineer at the company. Finally, as you saw in the previous video with Morgan, Glue ETL is a more convenient option than EMR, which is what you're looking for here.
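As a rough sketch of what defining such an on-demand Glue job might look like programmatically (the job name, IAM role, script location, and worker settings below are hypothetical placeholders, not values from the course):

```python
def build_glue_job_config(job_name, role_arn, script_s3_path):
    """Assemble a create_job request for an on-demand, serverless Glue ETL job.

    All values below are illustrative assumptions, not settings from the lab.
    """
    return {
        "Name": job_name,
        "Role": role_arn,  # IAM role Glue assumes to read RDS and write to S3
        "Command": {
            "Name": "glueetl",  # Spark-based ETL job type
            "ScriptLocation": script_s3_path,  # transform script stored in S3
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,  # small, short-running job
    }

# With the AWS SDK (boto3) and credentials available, the job could then be
# created and triggered on demand, e.g.:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_job(**build_glue_job_config("recs-etl", role, script))
#   run_id = glue.start_job_run(JobName="recs-etl")["JobRunId"]
```

Keeping the configuration in a plain function like this makes the on-demand nature of the pipeline explicit: nothing runs until `start_job_run` is called.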
✅ Storage: Amazon S3 is a durable, performant, and low-cost storage solution that allows you to serve data to a machine learning training process. Since the downstream data scientist is another technical data professional who’s planning to manipulate the data and use it to train the product recommendation system, S3 is a great storage option because it is flexible, scalable, and relatively cost-effective.
2.
Question 2
Recall the functional and nonfunctional requirements you gathered for the streaming pipeline that will be used to provide the product recommendations to the users. Again, let's suppose that you prefer to use services that provide more convenience because you're newer to data streaming architectures, and want to avoid undifferentiated heavy lifting, such as writing custom code.
Which of the following is the best combination of AWS services to implement this streaming data pipeline?
Correct
✅ Streaming System: Amazon Kinesis Data Streams is a highly scalable streaming solution that provides low-latency access to data. It offers an on-demand serverless deployment that makes it easy to set up and manage your data pipeline. It is a simpler solution than Amazon MSK and can help you get started quickly without requiring special expertise. Kinesis Data Streams integrates well with other AWS services and scales with increased data volume (see the Kinesis Data Streams pricing page for details). Amazon Data Firehose is also used here because it delivers streaming data from Kinesis data streams into data stores such as S3. Moreover, in the lab, you will configure Amazon Data Firehose to invoke a Lambda function that runs the deployed model's computations to find the products to recommend.
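For context, a Firehose data-transformation Lambda receives batches of base64-encoded records and must return each record with its `recordId`, a `result` status, and re-encoded `data`. A minimal sketch of such a handler follows; the `recommend` function is a hypothetical stand-in for the deployed recommendation model, not the one used in the lab:

```python
import base64
import json


def recommend(user_id):
    # Hypothetical stand-in for invoking the deployed recommendation model.
    return ["product-123", "product-456"]


def lambda_handler(event, context):
    """Transform each Firehose record: decode, attach recommendations, re-encode."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["recommendations"] = recommend(payload.get("user_id"))
        output.append({
            "recordId": record["recordId"],  # must echo the incoming recordId
            "result": "Ok",                  # "Ok", "Dropped", or "ProcessingFailed"
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```

The newline appended before re-encoding keeps the objects delivered to S3 on separate lines, which makes the output easier to analyze later.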
✅ Storage: Amazon S3 is a durable, performant, and low-cost storage solution that you can use to store the product recommendations for later analysis. Since the downstream data scientist is another technical data professional who’s planning to analyze the data and use it to retrain the model when needed, S3 is a great storage option because it is flexible, scalable, and relatively cost-effective.