Data Engineer with expertise in building scalable data pipelines and real-time processing systems. Strong foundation in full-stack development with 10+ years of backend experience (PHP, Yii2, Laravel) and deep knowledge of SQL and relational database design, now specialized in modern data infrastructure and distributed systems.
Currently completing a Master's degree in Data Engineering with a focus on real-time analytics, cloud-native architectures, and machine learning integration. I combine software engineering best practices with modern data technologies to deliver production-ready, enterprise-scale solutions.
Core expertise: Real-time data processing (Kafka, Spark Streaming) • ETL pipelines (Airflow, AWS Glue, Azure Data Factory) • Cloud data warehouses (Snowflake, Redshift) • NoSQL & distributed databases (MongoDB, DynamoDB, Redis) • SQL optimization • Web scraping & data collection • Cloud data services (AWS S3/EMR, Azure Data Lake) • Backend APIs (FastAPI, PHP) • Desktop applications (Delphi, VB) • ML deployment
Real-time Analytics Pipeline
- Kafka + Spark Streaming architecture for live data processing
- Event-driven data ingestion and transformation
- Deployed on AWS with automated monitoring
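The production pipeline runs on Kafka and Spark Structured Streaming; as a dependency-free illustration of the tumbling-window aggregation such a job expresses (the event tuples, keys, and window size below are made up), here is a minimal Python sketch:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Group (timestamp, key) events into fixed tumbling windows and
    count occurrences per key -- the same shape of aggregation a
    Spark Structured Streaming job would express with
    groupBy(window(...), "key").count()."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)

# Illustrative events: window [0, 60) holds two clicks,
# window [60, 120) holds one view and one click.
events = [(0, "click"), (30, "click"), (61, "view"), (65, "click")]
window_counts = tumbling_window_counts(events)
```

In the real pipeline the same aggregation runs continuously over a Kafka topic with watermarking for late events; the sketch only shows the window-bucketing arithmetic.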
Cloud Data Warehouse on Snowflake
- Designed and implemented scalable data warehouse on Snowflake
- Multi-layered architecture (raw, staging, production layers)
- Star schema dimensional modeling with optimized SQL queries
- Query performance tuning with Snowflake-specific optimizations
- Cost optimization through virtual warehouse sizing and clustering keys
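Dimensional loads like this are written in Snowflake SQL; the surrogate-key pattern behind a star-schema load can be sketched in plain Python (the table and column names below are hypothetical):

```python
def build_dimension(records, natural_key):
    """Assign surrogate keys to distinct dimension members, as a load
    into a star-schema dimension table would."""
    dim, next_sk = {}, 1
    for rec in records:
        nk = rec[natural_key]
        if nk not in dim:
            dim[nk] = next_sk
            next_sk += 1
    return dim

def build_facts(records, dim, natural_key, measure):
    """Swap each fact row's natural key for its surrogate key."""
    return [(dim[rec[natural_key]], rec[measure]) for rec in records]

# Hypothetical raw sales records flowing from the staging layer:
sales = [
    {"customer_id": "C-1", "amount": 120.0},
    {"customer_id": "C-2", "amount": 75.5},
    {"customer_id": "C-1", "amount": 30.0},
]
customer_dim = build_dimension(sales, "customer_id")
fact_rows = build_facts(sales, customer_dim, "customer_id", "amount")
```

In the warehouse itself this is a MERGE into the dimension followed by a keyed insert into the fact table; the sketch shows only the key-mapping idea.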
ETL Framework
- Modular data transformation system built with Apache Airflow
- Orchestrates complex workflows loading data into Snowflake and traditional databases
- Automated data quality checks and error handling
- Support for incremental and full-load strategies
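Inside Airflow this logic lives in a task; stripped of orchestration, the watermark idea behind the incremental strategy looks roughly like this (field names are illustrative, and a full load is the degenerate case of no stored watermark):

```python
def incremental_extract(rows, watermark):
    """Return only rows changed since the stored watermark, plus the
    new watermark to persist for the next run. Passing watermark=None
    degrades gracefully to a full load."""
    if watermark is None:
        selected = list(rows)
    else:
        selected = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in selected), default=watermark)
    return selected, new_watermark
```

A run would persist `new_watermark` (e.g. in an Airflow Variable or a metadata table) so the next scheduled run picks up only the delta.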
Web Scraping & Data Collection
- Automated data extraction from multiple web sources using Python (Beautiful Soup, Scrapy)
- Robust scraping pipelines with error handling and retry logic
- Integration with cloud storage (S3) and databases (MongoDB)
- Scheduling and monitoring with Airflow
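The retry logic such pipelines depend on can be sketched without any scraping library; `fetch` below stands in for whatever request call the pipeline actually makes, and the delays are illustrative:

```python
import time

def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky fetch with exponential backoff -- the
    error-handling core of a scraping pipeline. The final failure is
    re-raised so the orchestrator can mark the task failed."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))

# Hypothetical flaky source: fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient error")
    return "page-content"

delays = []
result = fetch_with_retry(flaky, retries=4, sleep=delays.append)
```

Injecting `sleep` keeps the backoff testable; production code would also cap the delay and add jitter.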
Cloud Data Infrastructure
- AWS data services: S3 (data lake), DynamoDB (NoSQL), EMR (Spark clusters), Glue (ETL), Redshift (warehouse)
- Azure data services: Data Lake Storage, Cosmos DB (NoSQL), Databricks (Spark), Data Factory (ETL)
- Multi-cloud data architecture design and implementation
ML Deployment Pipeline
- End-to-end machine learning model lifecycle management
- Automated training, validation, and deployment workflows
- Production monitoring with performance tracking
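One building block of such a workflow is a promotion gate that compares the candidate model against what is in production; a minimal sketch, assuming accuracy is the deciding metric (metric names and thresholds are illustrative):

```python
def should_deploy(candidate, production, min_improvement=0.0):
    """Deployment gate: promote the candidate model only if its
    validation accuracy beats the production model's by at least
    min_improvement. In a real pipeline this runs after automated
    training and validation, before the deploy step."""
    return candidate["accuracy"] >= production["accuracy"] + min_improvement
```

Real gates usually check several metrics (and guard against regressions on each); this shows only the decision shape.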
REST API Suite
- Production-ready APIs built with Yii2, Laravel, and FastAPI
- MySQL/MariaDB and PostgreSQL database integration
- JWT authentication and role-based access control
- Complex SQL queries and database optimization
- Comprehensive API documentation and testing
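Production services should use a vetted library (e.g. PyJWT) for token handling; purely to show the shape of an HS256 token behind JWT authentication, a stdlib-only sketch:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Unpadded URL-safe base64, as JWT segments use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_token(payload: dict, secret: bytes) -> str:
    """Build a JWT-style header.payload.signature string signed with
    HMAC-SHA256. Illustrative only -- no expiry or claim validation."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = b64url(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_token(token: str, secret: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    header, body, sig = token.split(".")
    expected = b64url(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)
```

Role-based access control then reduces to reading a `role` claim from a verified payload before authorizing the request.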
Microservices Architecture
- Scalable backend services with Docker containerization
- Inter-service communication patterns
- Distributed system design and implementation
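A common inter-service communication pattern is the circuit breaker: after repeated failures, stop calling a struggling downstream service for a cooldown period instead of hammering it. A minimal, dependency-free sketch (thresholds and timeouts are illustrative):

```python
import time

class CircuitBreaker:
    """Guard for calls to a downstream service. After max_failures
    consecutive failures the circuit opens and calls fail fast; after
    reset_timeout seconds one trial call is allowed through."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

Injecting `clock` keeps the cooldown testable; in production this sits in front of each HTTP or message-based call to another service.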
Desktop Applications
- Legacy desktop applications built with Embarcadero Delphi
- Excel integration and data processing tools
- Windows GUI applications with Visual Basic
- Database connectivity and reporting features
Currently Learning
- Advanced Apache Spark optimization and performance tuning
- MLOps practices and model deployment strategies
- Cloud-native data architectures on AWS
- Distributed systems design patterns
- LLM integration for intelligent data automation
Goals
- Complete Master's degree in Data Engineering
- Build comprehensive portfolio of production-grade data engineering projects
- Expand expertise in cloud-native architectures and real-time systems
- Transition fully into Data Engineering role at enterprise level
I'm open to discussing data engineering opportunities, collaboration on interesting projects, or technical conversations about data infrastructure and scalable systems.
LinkedIn: Marcello Orru
Email: marcelorru@gmail.com
Portfolio: Coming soon