Data Engineer
Responsibilities:
- Monitored, maintained, and reported on the Cloudera Hadoop cluster
- Responsible for data services and data movement infrastructure
- Processed millions of JSON, Avro, and Parquet events from various probes (sources) into the target HDFS data lake for real-time analysis (see the format-handling sketch below)
- Created secured pipelines for different protocols such as proxy, DHCP, NetFlow, and DNS
- Developed an ingestion and enrichment solution in Python using Apache Spark with Kafka streaming (see the streaming sketch below)
- Transformed data with an in-house API in the Spark ingestion solution and performed enrichment lookups against HBase (see the lookup sketch below)
- Implemented Kafka offset management using the Direct Stream API (illustrated in the streaming sketch below)
- Implemented partitioning and bucketing in Hive on external tables to optimize query performance (see the DDL sketch below)
- Created Oozie workflows to automate Hive and Impala jobs
- Used Azure Databricks and PySpark to enrich and transform the data
- Tuned the performance of Spark jobs for high-volume feeds in the production environment (see the tuning sketch below)
- Interacted regularly with clients for show-and-tell of end results
- Participated in the full software development lifecycle, including requirements, solution design, development, QA implementation, and product support, using Scrum and other Agile methodologies
- Collaborated with team members and stakeholders on the design and development of the data environment
- Helped prepare associated documentation for specifications, requirements, and testing
Environment: PySpark, Cloudera, Hadoop, Hive, Impala, Oozie, Kafka, Flume, Maven, GitHub.
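Illustrative code sketches (all names, paths, and values below are hypothetical placeholders, not the production configuration):

Format-handling sketch - a minimal PySpark sketch of reading the three probe formats and landing them in the data lake, assuming the spark-avro package is on the classpath and each format is landed under its own sub-path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("probe-formats").getOrCreate()

    # Read each source format from its (hypothetical) landing directory.
    json_df = spark.read.json("/landing/probes/json/")
    avro_df = spark.read.format("avro").load("/landing/probes/avro/")  # built-in "avro" name since Spark 2.4
    parquet_df = spark.read.parquet("/landing/probes/parquet/")

    # Land everything as Parquet in the HDFS data lake, one sub-path per
    # source format so differing schemas do not collide.
    for name, df in [("json", json_df), ("avro", avro_df), ("parquet", parquet_df)]:
        df.write.mode("append").parquet("/data/lake/events/src=%s" % name)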
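Streaming sketch - a minimal sketch of Spark-Kafka direct-stream ingestion with manual offset tracking, using the classic DStream API (spark-streaming-kafka-0-8, the package that provides the Direct Stream approach named above; removed in Spark 3.x). Topic, broker, and path names are hypothetical:

    import json

    from pyspark.sql import SparkSession
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # requires spark-streaming-kafka-0-8

    spark = SparkSession.builder.appName("probe-ingestion").getOrCreate()
    ssc = StreamingContext(spark.sparkContext, 30)  # 30-second micro-batches

    # Direct stream: executors read Kafka partitions directly, so offsets
    # can be tracked in the job itself rather than by a receiver.
    stream = KafkaUtils.createDirectStream(
        ssc,
        ["dns_events"],                              # hypothetical topic
        {"metadata.broker.list": "broker1:9092"})    # hypothetical broker list

    offset_ranges = []

    def capture_offsets(rdd):
        # offsetRanges() exists only on the KafkaRDD produced by the direct
        # stream, so capture it before any other transformation.
        global offset_ranges
        offset_ranges = rdd.offsetRanges()
        return rdd

    def write_batch(rdd):
        if not rdd.isEmpty():
            # Messages arrive as (key, value) pairs; the value holds the JSON payload.
            df = spark.read.json(rdd.map(lambda kv: kv[1]))
            df.write.mode("append").parquet("/data/lake/dns_events")  # hypothetical path
        for o in offset_ranges:
            # In production these offsets would be persisted (e.g. to ZooKeeper or
            # HBase) so a restarted job resumes from the last committed position.
            print(o.topic, o.partition, o.fromOffset, o.untilOffset)

    stream.transform(capture_offsets).foreachRDD(write_batch)
    ssc.start()
    ssc.awaitTermination()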
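Lookup sketch - a minimal sketch of the HBase enrichment-lookup pattern, assuming the happybase client (an assumption; the actual client library is not stated above). One connection is opened per partition so lookup overhead stays bounded on high-volume feeds:

    import happybase
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("enrichment").getOrCreate()

    def enrich_partition(records):
        # One HBase connection per partition, not per record.
        conn = happybase.Connection("hbase-master.example.com")  # hypothetical host
        table = conn.table("ip_enrichment")                      # hypothetical table
        try:
            for rec in records:
                row = table.row(rec["client_ip"].encode("utf-8"))
                rec["geo"] = row.get(b"d:geo", b"unknown").decode("utf-8")  # hypothetical column
                yield rec
        finally:
            conn.close()

    events = spark.sparkContext.parallelize(
        [{"client_ip": "10.1.2.3", "query": "example.com"}])
    print(events.mapPartitions(enrich_partition).collect())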
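DDL sketch - a minimal sketch of partitioning and bucketing on a Hive external table, issued through spark.sql(); the table, columns, and location are hypothetical. Partitioning prunes scans to the relevant dates, while bucketing speeds up joins and sampling on the bucketed column:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("hive-ddl") \
        .enableHiveSupport() \
        .getOrCreate()

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS dns_events (
            query      STRING,
            client_ip  STRING,
            rcode      INT
        )
        PARTITIONED BY (event_date STRING)          -- prunes scans to the queried dates
        CLUSTERED BY (client_ip) INTO 32 BUCKETS    -- speeds joins/sampling on client_ip
        STORED AS PARQUET
        LOCATION '/data/lake/dns_events'            -- hypothetical HDFS path
    """)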
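Tuning sketch - the kind of job-level settings adjusted for high-volume feeds; the values shown are illustrative, not the actual production settings:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("high-volume-feed")
             # Kryo is faster and more compact than Java serialization.
             .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
             # Match shuffle width to data volume to avoid tiny or oversized tasks.
             .config("spark.sql.shuffle.partitions", "400")
             .config("spark.executor.memory", "8g")
             .config("spark.executor.cores", "4")
             # Let the cluster scale executors with the feed's load.
             .config("spark.dynamicAllocation.enabled", "true")
             .getOrCreate())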