Data Engineering Pipeline
File System (object storage of raw data)
- AWS S3
- Azure Blob Storage
Streaming and messaging
- AWS Kinesis
- GCP Pub/Sub
- Spark Streaming
- Kafka Streams
Databases
- Redis: Open-source in-memory data structure store, used as a database, cache, and
message broker. Organized as a key-value store.
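- Apache Cassandra: Distributed wide-column store with stronger built-in fault tolerance
than Redis (masterless replication across nodes). Queried with CQL (Cassandra Query
Language), not HiveQL. Well suited to write-heavy workloads.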
- Azure Cosmos DB
- Hive: Data warehouse built on top of Hadoop. Queries are written in HiveQL, a SQL-like
language that Hive compiles into MapReduce jobs, so there is no need to write against
the MapReduce Java API.
- HBase: Non-relational distributed database written in Java. Part of the Hadoop
ecosystem; runs on top of HDFS.
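To make the MapReduce model mentioned in the Hive entry more concrete, here is a minimal conceptual sketch in plain Python of how a grouped count could be expressed as map, shuffle, and reduce steps. It roughly corresponds to a query like `SELECT page, COUNT(*) FROM page_views GROUP BY page`. The table name, column names, sample rows, and function names are all assumptions made for illustration; this is not how Hive actually plans or executes a query.

```python
from collections import defaultdict

# Hypothetical rows of a "page_views" table: (user, page).
rows = [("ann", "/home"), ("bob", "/home"), ("ann", "/cart"), ("cy", "/home")]


def map_phase(rows):
    """Emit a (key, 1) pair per row, keyed by the GROUP BY column."""
    for _user, page in rows:
        yield page, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

In a real cluster the map and reduce steps run in parallel on different nodes; the single-process version above only shows the shape of the computation.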
Web app frameworks
- Flask: Lightweight web framework for Python
Orchestrators (workflow management)
- Airflow: Author, schedule, and monitor workflows as DAGs (directed acyclic graphs) of tasks.
- Ballerina: Cloud-native programming language.
- Hadoop: Open-source framework for distributed storage (HDFS) and processing (MapReduce)
of large datasets across clusters.
- Spark: Unified analytics engine for large-scale data processing.
- Capistrano: Open-source tool for writing automated deployment scripts (Ruby).
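The idea of authoring a workflow as a DAG, as in the Airflow entry, can be sketched without Airflow itself. The example below uses only Python's standard-library `graphlib` to declare task dependencies and derive a valid run order. The task names and the `run_order` helper are hypothetical, and this is not Airflow's API; it only illustrates the dependency-ordering idea behind DAG-based orchestration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}


def run_order(deps):
    """Return one execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Because the graph is acyclic, an order always exists; if a cycle were introduced, `TopologicalSorter` would raise `graphlib.CycleError`, which is the kind of check an orchestrator needs before it can schedule anything.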