Tuesday, December 04, 2018

AI Expo 2019 Notes: Ameen Kazerouni (Zappos) - Part 1 of 4

I recently attended the AI Expo 2019 at the Santa Clara Convention Center where there were talks on various ML platforms. I'm leading the ML Training Infra platform at Pinterest. These notes are from those talks and are summarized from the speaker's presentations. Any and all errors are mine alone. Hope you find the below useful.

Speaker: Ameen Kazerouni (Zappos)

The talk covered Zappos' ML platform ecosystem and the problems the team faced after solving the basic five: problem specification, dataset design, model selection, training, and validation. A condensed list of those issues:


1. Data management: data lifetime (how long should data be kept?), security footprint (who should have access?), governance (who did have access?), and data scrubbing and anonymization (avoiding privacy issues under GDPR).


2. Team: There are very few unicorns who can do PhD-level statistics and ML math and also write distributed systems. They hire for domain competence and the ability to communicate with others in the ML domain. They're seeing a changing ecosystem where ML scientists and engineers collaborate, rather than a waterfall model in which a scientist builds a model and a data engineer then puts it into production.


3. Language dependencies: ML researchers and statisticians use Python (scikit-learn, TF) and R for analysis, while the production data team uses Scala and Java. To bridge the language gap, they use Protobufs for all data intermediates, with the generated code giving typed access to the data across languages and systems, and serialized (typed) model representations for delivery into production.


4. Tier 1 SLAs: Making the tradeoff between latency and feature value (how slow is too slow?). Precomputed models and ensemble models (personalization on top of precompute) help break complex problems down into simpler solutions that are manageable from a latency perspective.


5. Model Interoperability: Their Java serving systems use DL4J for TF model serving; some systems use TFServing. For model representation, they rely heavily on PMML and (unclear), but they round-trip through ONNX for model format conversions across language boundaries and DL systems (e.g. Scikit, Spark, TF).


6. Productionization: There's an emphasis on constant communication between teams. They decouple systems through microservice APIs, aim to serve product features (not recommendation algos), and degrade gracefully from high-precision to high-coverage models.


7. Product Roadmap: The lesson learned is that "ML everywhere" doesn't work as a driver of the product roadmap. ML is the supporting cast. Build the product roadmap first and fit ML in later.
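
To make items 4 and 6 concrete for myself, here is a minimal Python sketch of the "personalization on top of precompute" idea combined with graceful degradation from a high-precision model to a high-coverage one. This is my own illustration, not Zappos' implementation: `recommend`, `personalized_model`, `PERSONALIZED`, `POPULAR_ITEMS` and the sample data are all hypothetical names. It also assumes the latency budget is only checked between tiers, so it does not interrupt a slow call that is already in flight.

```python
import time
from typing import Callable, List, Sequence, Tuple

# Hypothetical model signature: user id in, ranked item list out (empty if no answer).
Model = Callable[[str], List[str]]


def recommend(
    user_id: str,
    tiers: Sequence[Tuple[str, Model]],
    fallback: Sequence[str],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, List[str]]:
    """Try tiers from high precision to high coverage within a latency budget.

    Returns (tier name, items). If no tier answers in time, the precomputed
    fallback list is returned so the product feature still renders.
    """
    deadline = clock() + budget_s
    for name, model in tiers:
        if clock() >= deadline:
            break  # out of budget: skip the remaining expensive tiers
        try:
            items = model(user_id)
        except Exception:
            continue  # a failing tier should not fail the whole request
        if items and clock() <= deadline:
            return name, items
    return "precomputed", list(fallback)


# Hypothetical sample data: a personalized model that only covers some users,
# and a precomputed list that covers everyone.
PERSONALIZED = {"u1": ["boots", "sandals"]}
POPULAR_ITEMS = ["sneakers", "slippers"]


def personalized_model(user_id: str) -> List[str]:
    return PERSONALIZED.get(user_id, [])


TIERS = [("personalized", personalized_model)]
```

Calling `recommend("u1", TIERS, POPULAR_ITEMS, budget_s=1.0)` goes through the personalized tier, while a user the personalized model doesn't cover falls through to the precomputed list. Injecting `clock` is just a convenience so the budget logic can be exercised without real delays.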