Photo by Glenn Carstens-Peters on Unsplash
Hi! I'm Raviraj Baraiya, a senior at the IIT Kharagpur Department of Industrial and Systems Engineering. This post is about my experience as an intern at Zingg. I hope it helps those who are looking for career options at Zingg as well as other students who pursue internships with startups.
I started looking for an internship in the early summer to improve my software development abilities. My department’s primary study domain is optimization methods, and therefore I was eager to work on deep technology products. I came across Zingg on LinkedIn, and the problem of entity resolution really intrigued me. As the company was hiring interns, I got in touch with the founder, Sonal Goyal. The interview process was quite smooth and I was made very comfortable. I learned about the company vision and internship roles during the interview. I discovered that Zingg is doing cutting-edge work at the intersection of machine learning, distributed computing, open source, data, and cloud. So, when I got the internship, I was thrilled. It was going to be my first professional as well as startup experience!
When I joined, the team was planning a major change to the way Zingg is set up and used. The existing way to run Zingg is by building a configuration JSON that defines the match types as well as the input and output data. This configuration is the basis for the Zingg Spark jobs for different phases like training, matching, and linking for identity and entity resolution. In many instances, users wanted a way to build the configuration programmatically and run Zingg directly from their notebooks. This led to the idea of a Python interface, as most Zingg users were Python users. My big responsibility was to develop, package, and test this part.
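To give a feel for what "building the configuration programmatically" can look like, here is a minimal sketch in plain Python. It is not Zingg's actual API or JSON schema: the class names (`FieldSpec`, `DataSource`, `MatchConfig`), the keys, and the match type strings are all hypothetical, chosen only to illustrate the idea of assembling fields, inputs, and outputs in code and serializing them to JSON.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class FieldSpec:
    """One column to compare and how to compare it (hypothetical schema)."""
    name: str
    match_type: str


@dataclass
class DataSource:
    """An input or output location (hypothetical schema)."""
    name: str
    format: str
    props: Dict[str, str] = field(default_factory=dict)


@dataclass
class MatchConfig:
    """Collects fields, inputs, and outputs, then serializes them to JSON."""
    fields: List[FieldSpec] = field(default_factory=list)
    inputs: List[DataSource] = field(default_factory=list)
    outputs: List[DataSource] = field(default_factory=list)

    def add_field(self, name: str, match_type: str) -> "MatchConfig":
        self.fields.append(FieldSpec(name, match_type))
        return self  # returning self allows chained calls in a notebook cell

    def to_json(self) -> str:
        # asdict() converts nested dataclasses into plain dicts and lists
        return json.dumps(asdict(self), indent=2)


# Placeholder column names, match types, and paths, for illustration only
config = (
    MatchConfig()
    .add_field("first_name", "FUZZY")
    .add_field("email", "EXACT")
)
config.inputs.append(DataSource("customers", "csv", {"location": "customers.csv"}))
config.outputs.append(DataSource("matches", "csv", {"location": "/tmp/matches"}))

print(config.to_json())
```

The appeal for notebook users is that the configuration becomes ordinary Python objects that can be built in loops, validated, and reused, instead of a JSON file edited by hand.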
I had been working with Python, so I felt it was going to be an easy project for me. But, well, I was wrong! Zingg's code is a mix of Java and Scala, chosen for performance because entity resolution is hard to scale. Still, Python support was necessary for ease of use, and a new, separate Python code base would be a maintenance nightmare. Hence, I had to figure out a way to invoke the existing interface through Python. The problem befuddled me, and I worked closely with the team, borrowing principles from PySpark to replicate something similar in our code base. Py4J seemed like magic to me; it was eye-opening to see how we could expose our Java classes through Python. I was ecstatic when our initial experiments worked: we could write wrappers over the existing code and build out the Python API seamlessly. A great lesson on simplicity in complexity!
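The wrapper idea can be sketched without a JVM at all. In the snippet below, `FakeJavaJobSettings` is a stand-in I made up for a Java-side object; in a real setup, the wrapped object would be a proxy handed back by a bridge such as Py4J. `JobSettings`, `setModelId`, and `getModelId` are hypothetical names too, not Zingg's real classes or methods. The point is only the shape of the pattern: the Python class holds a reference to the Java-side object and forwards calls to it, so no logic is duplicated in Python.

```python
class FakeJavaJobSettings:
    """Stand-in for a Java-side object (hypothetical). A JVM bridge would
    normally supply a proxy exposing the same camelCase methods."""

    def __init__(self):
        self._model_id = None

    def setModelId(self, model_id):
        self._model_id = model_id

    def getModelId(self):
        return self._model_id


class JobSettings:
    """Thin Python wrapper: keeps the Java-side object and delegates to it."""

    def __init__(self, java_obj):
        self._java = java_obj

    def set_model_id(self, model_id):
        # Convert to the type the Java-side setter expects, then forward
        self._java.setModelId(str(model_id))
        return self

    def get_model_id(self):
        return self._java.getModelId()


settings = JobSettings(FakeJavaJobSettings()).set_model_id(100)
print(settings.get_model_id())
```

Because the wrapper only translates names and argument types, the Java and Scala code stays the single source of truth, which is what avoids the maintenance nightmare of a parallel Python code base.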
An external-facing API needs documentation, testing, and packaging. It is not enough to write code alone (certainly different from my college coding experiences). Hence, I was entrusted with figuring it all out, with some input on the directions I could take. I worked on the pip packaging and soon realized, after team discussions, that our packaging was far more intricate than bundling the Python modules alone. We needed the Python modules as well as the script, the jar, the examples, and the models. This turned out to be a lot of really cool work with lots of learning. I also worked on the Python unit testing and the documentation using Sphinx. I saw how focused the Zingg team was on user experience, and I spent a lot of time documenting the API really well. But given the internal focus, I suspect we are going to polish it even further when the 0.3.4 release comes out ;-)
As a part of my internship, I attended a community meetup on the 7th of June in which we had a very lively discussion on what we are building. We discussed our product roadmap and the significant features we have added to Zingg to give users more control over their training and matching processes. We connected with the early adopters and gathered feedback from the users. Though I knew it already from the Slack chatter, I was amazed at the love Zingg is getting in the data community, and how appreciative the users are of our work. By attending the meetup, I saw firsthand why and how open source works. The users were excited about the upcoming release, and their excitement was a motivation for us to build better!
Besides the Python work, I also worked on new match types like EmailMatchType and PinCodeMatchType which define new similarity functions. I worked on model documentation changes and learned Freemarker. It is mind-boggling how much I accomplished in such a short time and how productive I have been in the last 2 months. There is a great sense of accomplishment too. I have grown personally, and working at Zingg has helped me to gain a sense of professionalism. I have a clearer view of what it means to work in open source on complex technical problems. I have learned about teamwork and how the freedom to innovate along with respect and support from mentors makes you a responsible employee and brings out your best work. I have also witnessed the emphasis on deep focused work, getting things done, and refining features till they are ready to roll out. These are lessons I will carry forward in my life.
Zingg is a foundational product enabling a variety of use cases and now that I know more about the modern data stack, I am waiting for that Modern Data Stack infographic that puts Zingg on the map :-). I feel it's only a matter of time as the base building blocks of the stack are already in place, and the need for identity and entity resolution will become more and more critical.
Working at Zingg has been a wonderful experience, as it offered a very supportive and peaceful work environment. Although I have to return to college, I will continue to contribute to Zingg and other open-source projects. I am strongly rooting for Zingg, and am super proud to have contributed to it. Wishing Zingg the very best!