Mandatory Skills:
Apache Spark and either PySpark or Scala: Extensive hands-on experience with Spark for large-scale data processing and analysis. Proficiency in either PySpark or Scala for developing Spark applications.
Databricks: Strong expertise in using Databricks for big data analytics, data engineering, and collaborative work on Apache Spark.
GitHub: Proficient in version control using Git and GitHub for managing and tracking changes in the codebase.
Data Warehousing (DWH): Experience with one or more of the following DWH technologies: Snowflake, Presto, Hive, or Hadoop. Ability to design, implement, and optimize data warehouses.
Python: Advanced programming skills in Python for data manipulation, analysis, and scripting tasks.
SQL: Strong proficiency in SQL for querying, analyzing, and manipulating large datasets in relational databases.
Data Streaming and Batch Processing: In-depth knowledge and hands-on experience in both data streaming and batch processing methodologies.
Good to Have:
Kafka: Familiarity with Apache Kafka for building real-time data pipelines and streaming applications.
Jenkins: Experience with Jenkins for continuous integration and continuous delivery (CI/CD) in the data engineering workflow.
Responsibilities:
Design, develop, and maintain scalable and efficient data engineering solutions using Apache Spark and related technologies.
Collaborate with cross-functional teams to understand data requirements, design data models, and implement data processing pipelines.
Utilize Databricks for collaborative development, debugging, and optimization of Spark applications.
Work with various data warehousing technologies such as Snowflake, Presto, Hive, or Hadoop to build robust and high-performance data storage solutions.
Develop and optimize SQL queries for efficient data retrieval and transformation.
Implement both batch and streaming data processing solutions to meet business requirements.
Collaborate with other teams to integrate data engineering solutions into the broader data platform and workflows.