Tuesday, December 04, 2018

AI Expo 2019: Tim Jurka (LinkedIn, Director Feed AI) - Part 4 of 4

I recently attended the AI Expo 2019 at the Santa Clara Convention Center. Notes are from my understanding of the talk. Any errors are mine and mine alone.

LinkedIn: A look behind the AI that powers the LI feed

Tim Jurka (Dir. Feed AI)

The talk covered the objectives of LinkedIn's Feed and was pitched at a high-level (exec) audience. While I was familiar with the space, the formulation and presentation of the objective function were interesting:

The recommendation problem for LinkedIn is maximizing Like/Comment/Share CTR + downstream network activation (virals) + encouraging new creators.

Problem Formulation:

P(click) + P(viral) * (alpha_downstream + alpha_creator * e ^ (- decay * E[num_response_to_creator]))

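alpha_downstream accounts for downstream effects; alpha_creator weights a bonus that decays as the expected number of responses to the creator grows, which penalizes already-popular creators to induce diversity.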
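
A minimal sketch of the objective above, reading the formula with the exponent closing right after the expectation term. The function name feed_score and the default weights and decay are placeholders of mine, not values from the talk.

```python
import math


def feed_score(p_click, p_viral, expected_responses_to_creator,
               alpha_downstream=1.0, alpha_creator=1.0, decay=0.1):
    """Score one candidate feed item.

    The weights and decay are illustrative placeholders (assumptions),
    not values quoted in the talk.
    """
    # Bonus for the creator; it shrinks exponentially as the creator is
    # expected to receive more responses.
    creator_bonus = alpha_creator * math.exp(-decay * expected_responses_to_creator)
    return p_click + p_viral * (alpha_downstream + creator_bonus)


if __name__ == "__main__":
    # Same click and viral probabilities, different creator popularity.
    print(feed_score(0.1, 0.2, expected_responses_to_creator=0))
    print(feed_score(0.1, 0.2, expected_responses_to_creator=50))
```

With identical click and viral probabilities, the item from the creator with fewer expected responses scores higher, which is the diversity effect described above.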

General approaches (Toolbox):

Multi Objective Optimization (ads vs organic content).

Logistic Regression: Features, Embeddings and Decision Trees (XGBoost for Feature Importance), occasional pruning 

Auto tuning of weights of the MOO to correct for drifts in accuracy of the component models (to meet a product goal).

Running the Team:

The goal of the team is to maximize: Successful Experiments / Total Experiments Run

Two approaches: maximize successful experiments, minimize unsuccessful experiments.

Maximize successful experiments:

1. Hire the best talent

2. Increase the total number of experiments being run online.

3. Automate deployments, parameter tuning, retraining, rebasing, ramping to maximize developer throughput.

Minimize unsuccessful experiments:

1. Replay models over historical data to figure out whether they would perform better than the current model before moving to online.

2. Compute actual business metrics, and determine precision @ 1 and precision @ top3/top5 over a randomized sample of data.

3. Use bandits to figure out how to be intelligent about collecting data and exploiting the current model.

4. AI for AI: Auto retrain and evaluate models. Identify promising features and ramp online. Find optimizations for existing models automatically. Highlight promising variants to engineers.
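
The replay and precision @ k steps above can be sketched roughly as follows. This is my own illustration, not LinkedIn's tooling: the session layout (a "candidates" list plus an "engaged" set) and the rank_fn interface are assumptions.

```python
def precision_at_k(ranked_items, engaged_items, k):
    """Fraction of the top-k ranked items that the user actually engaged with."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_items[:k] if item in engaged_items)
    return hits / k


def mean_precision_at_k(sessions, rank_fn, k):
    """Replay rank_fn over logged sessions and average precision@k.

    Each session is a dict with hypothetical keys:
      "candidates": items that were eligible to be shown
      "engaged":    set of items the user clicked/liked/shared
    """
    if not sessions:
        return 0.0
    scores = [
        precision_at_k(rank_fn(session["candidates"]), session["engaged"], k)
        for session in sessions
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    sessions = [
        {"candidates": ["a", "b", "c", "d"], "engaged": {"a", "c"}},
        {"candidates": ["x", "y", "z"], "engaged": {"z"}},
    ]
    current_model = lambda items: list(items)              # keeps logged order
    candidate_model = lambda items: list(reversed(items))  # toy alternative
    for k in (1, 3):
        print(k, mean_precision_at_k(sessions, current_model, k),
              mean_precision_at_k(sessions, candidate_model, k))
```

A candidate model would only move to an online experiment if it beats the current model on the replayed sample.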

AI Expo 2019: Emilio Billi (CTO, A3Cube) - Part 3 of 4

Why and How the computational power influences the rate of progress in the technology

Emilio Billi CTO A3Cube Inc

Background: ML, Big Data & Analytics, AI, HPC.

This was a big data infra focused talk. The speaker had a background in systems infra with past DoD experience. Not the most engaging delivery, but really nice takeaways:

Moving 128 bytes on a CPU using 100Gbit ETH: CPU waits 8900ns for nothing (~7.1M compute ops lost);

Moving the same 128 bytes using optimized RDMA intra-cluster costs 1200ns CPU time (~0.96M compute ops lost)

The difference is roughly 6M compute ops recovered per 128-byte transfer for ML. That's a great acceleration for ML workloads.
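
The arithmetic behind these figures can be checked with a short sketch. The per-nanosecond compute rate was not stated in my notes; it is inferred here from the 100Gbit ETH numbers, so treat it as an assumption.

```python
# Figures from the talk notes.
ETH_WAIT_NS = 8900     # CPU wait per 128-byte move over 100Gbit ETH
RDMA_WAIT_NS = 1200    # CPU wait per 128-byte move over optimized RDMA
ETH_OPS_LOST = 7.1e6   # compute ops lost during the ETH wait

# Assumption: the compute rate implied by the ETH figures (ops per ns).
ops_per_ns = ETH_OPS_LOST / ETH_WAIT_NS

# Ops lost while waiting on RDMA, at the same implied rate.
rdma_ops_lost = RDMA_WAIT_NS * ops_per_ns

# Ops recovered per transfer by switching from ETH to RDMA.
ops_recovered = ETH_OPS_LOST - rdma_ops_lost

if __name__ == "__main__":
    print(f"implied rate: {ops_per_ns:.0f} ops/ns")
    print(f"RDMA ops lost: {rdma_ops_lost / 1e6:.2f}M")
    print(f"ops recovered per transfer: {ops_recovered / 1e6:.2f}M")
```

The implied rate comes out at about 800 ops/ns, which reproduces the ~0.96M figure for RDMA and a difference of a little over 6M ops.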

Basic contention: ETH, TCP, slow storage is legacy technology.

The clusters of the future will look like the supercomputer systems of today:

1. Low latency converged parallel file systems (think S3 for the cluster).

2. Built in Distributed Resource scheduler (think Kubernetes for the cluster).

3. Cooperative RAM over network fabric (RDMA over Infiniband)

4. Cluster wide sharing of Accelerators (e.g. GPUs / FPGAs)

Do the above without changing the server setup.

Case Study:

Optimizing the stack with these features got 64 nodes to the same compute capacity as 360 AWS nodes (6x speedup).

Work was done for a DoD project.

AI Expo 2019 - Prakhar Mehrotra (Walmart Sr. Director of ML) - Part 2 of 4

I attended the AI Expo 2019 at the Santa Clara Convention Center where Prakhar gave a talk. Notes are my summarization of the talk. Any errors are mine and mine alone. 

Walmart - Prakhar Mehrotra (Sr. Director of ML, previously at Uber)

Walmart has huge scale: $0.5 trillion+ in revenue, 3000+ stores with massive physical footprints, a massive global supply chain, Jet.com, Walmart.com, Shoes.com, Flipkart.com, and it keeps growing.

The talk was focused on Walmart's application of ML, the contrasts of Uber-style surge pricing vs Walmart's fixed in-store pricing ("everyday low pricing"). A focus point was causality over correlation: understanding Walmart's customer and its supply chain (the Why?). Their primary domain was solving for shelf placement of inventory. Other interesting problems were inventory management, bridging the online and offline worlds (if we ship from warehouse, it's going to cost you X but if you pick up at this store where it's in stock it's X-3). The takeaway: Omni-channel shoppers are changing the customer profile for Walmart but adjusting for that isn't as simple as taking the .com purchase data and feeding it into the models.

The crux of the talk was a discussion of finding causality with Bayesian networks, where learning the network structure is NP-hard. Walmart worked around the NP-hardness by manually decomposing its supply chain into relatively independent units and their interrelations (suppliers, merchants, inventory, pricing, warehousing, transportation etc.). There was also a discussion about counterfactuals ("would sales have declined if I had not placed this item on promotion?"). A/B testing is hard in the Walmart space because pricing is fixed in-store (and not per-shopper).

Fundamental datasets and metrics: Imagery, Similarity, Variants, Attributes, Classification, Quality, Scoring, Analysis

They're solving for:

1. Interventions (should I provide an offer to this user or intervene in another way?)

2. Associations (shelf placement, cart composition, suppliers)

3. Counterfactuals (what-if analysis)

Walmart's tech stack and algorithms:

Models/Algos: NN, Bayesian nets, Structural models

Tech Stack: Hive, Hadoop for big data, Teradata for medium-data, ETL systems, Jupyter for ad-hoc, hyper parameter optimization systems, CPU&GPU training, Scala and Python as primary languages for data scientists.

The final food for thought was: how fast will the online and offline worlds converge on dynamic pricing? What does that look like to the customer?

AI Expo 2019 Notes: Ameen Kazerouni (Zappos) - Part 1 of 4

I recently attended the AI Expo 2019 at the Santa Clara Convention Center where there were talks on various ML platforms. I'm leading the ML Training Infra platform at Pinterest. These notes are from those talks and are summarized from the speaker's presentations. Any and all errors are mine alone. Hope you find the below useful.

Speaker: Ameen Kazerouni (Zappos)

The scope of the talk was Zappos' ML Platform ecosystem and the problems they faced after solving the basic 5: problem specification, dataset design, model selection, training and validation. A condensed list of their issues is:

1. Data management: data lifetime (how long?), security footprint (who should have access?), governance issues (who did have access?), data scrubbing and anonymization (avoiding privacy issues under GDPR).

2. Team: There are very few unicorns that can do PhD statistics, ML math and write distributed systems. They hire for domain competence and the ability to communicate to others in the ML domain. They're seeing a changing ecosystem where ML scientists and engineers are collaborating rather than a waterfall model where a model is built and a data engineer puts it into production.

3. Language dependencies: ML researchers and statisticians use Python (Scikit, TF) and R for analysis, while the data team in production uses Scala and Java. They bridge the language gap with Protobufs for all data intermediates (with generated code for typed access to the data across languages and systems) and with serialized (typed) model representations for delivery into production.

4. Tier 1 SLAs: Making the latency - feature value tradeoffs (how slow is too slow?): precomputed models, ensemble models (personalization on top of precompute) help break down complex problems into simpler solutions that are manageable from a latency perspective.

5. Model Interoperability: Their Java serving systems use DL4J for TF model serving. Some systems use TFServing. For model representation, they rely heavily on PMML and (unclear) but they round-trip through ONNX for model format conversions across language boundaries and DL systems (e.g. Scikit, Spark, TF).

6. Productionization: There's an emphasis on constant communication between teams, they decouple systems through microservice APIs, they aim to serve product features (not recommendation algos) and they degrade gracefully from high precision to high coverage models.

7. Product Roadmap: Learning: ML everywhere doesn't work as a driver of the product roadmap. ML is supporting cast. Build the product roadmap first and fit ML later.
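
The "degrade gracefully from high precision to high coverage" idea in items 4 and 6 can be sketched as a simple fallback chain. This is my own illustration, not Zappos' code: the tier functions and their names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

# A tier takes a user id and returns recommendations, or None if it has no answer.
Recommender = Callable[[str], Optional[List[str]]]


def recommend_with_fallback(user_id: str, tiers: Sequence[Recommender]) -> List[str]:
    """Try tiers in order (highest precision first) and return the first answer."""
    for tier in tiers:
        try:
            result = tier(user_id)
        except Exception:
            result = None  # a failing tier is treated as "no answer"
        if result:
            return result
    return []  # nothing available; the caller decides what to render


if __name__ == "__main__":
    # Hypothetical tiers: a personalized model with limited coverage and a
    # precomputed popularity list that covers everyone.
    personalized = {"alice": ["trail runners", "wool socks"]}

    def personalized_tier(user_id):
        return personalized.get(user_id)

    def popular_tier(user_id):
        return ["boots", "sneakers"]

    tiers = [personalized_tier, popular_tier]
    print(recommend_with_fallback("alice", tiers))
    print(recommend_with_fallback("bob", tiers))
```

Ordering the tiers from most precise to broadest coverage keeps the product feature available even when the personalized model has nothing to say or fails.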

Sunday, October 28, 2018

The toughest interview questions asked recently

During a recent round of interviews, I was asked these 5 questions that I found particularly interesting / challenging. Each covered a different, interesting piece of computer science that I thought was worth sharing. Check them out and see what you think about them:

1. A set of overlapping ranges is provided in [start, end) format. Find the max number of ranges that overlap each given range. Improve your solution to O(N log N) complexity (the basic solution is O(N^2) complexity for a set of ascending ranges [1,4), [2, 4), [3, 4) etc.).
2. Implement a multithreaded producer / consumer queue using condition variables.
3. Implement a multithreaded rate limiter (token bucket with defined capacity) using no hard-coded poll durations and without a background thread for "filling". Discuss fairness vs head of line blocking tradeoffs of the implementation.
4. Implement a multithreaded scheduler that executes tasks repeatedly at specified time intervals which manages task overruns (task time > interval time) and does not skip scheduled run points. Estimate the maximum number of outstanding threads that a task may produce.
5. Given a matrix of size NxM with 0 and 1 entries, find the number of connected regions in the matrix. Extend your solution to handle updates of 0 entries to 1s that may connect existing regions. Expected solution is O(NxM) complexity in the worst case even for the second case with updates being approximately O(1) and space being O(NxM). Hint: Disjoint sets. 
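
For question 5, here is a minimal sketch of the disjoint-set approach from the hint. It is one possible solution, not the interviewer's reference answer, and it assumes 4-connectivity (up/down/left/right), which the question as I noted it does not specify.

```python
class RegionCounter:
    """Counts 4-connected regions of 1s and supports 0 -> 1 updates."""

    def __init__(self, grid):
        self.rows = len(grid)
        self.cols = len(grid[0]) if grid else 0
        self.parent = {}   # cell index -> parent index (only cells set to 1)
        self.size = {}     # root index -> number of cells in its set
        self.regions = 0
        for r in range(self.rows):
            for c in range(self.cols):
                if grid[r][c] == 1:
                    self.set_one(r, c)

    def _find(self, x):
        # Iterative find with path halving.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def _union(self, a, b):
        ra, rb = self._find(a), self._find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra          # union by size
        self.size[ra] += self.size[rb]
        self.regions -= 1             # two regions merged into one

    def set_one(self, r, c):
        """Set cell (r, c) to 1 and return the current number of regions."""
        cell = r * self.cols + c
        if cell in self.parent:
            return self.regions       # already a 1, nothing changes
        self.parent[cell] = cell
        self.size[cell] = 1
        self.regions += 1
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < self.rows and 0 <= nc < self.cols:
                neighbor = nr * self.cols + nc
                if neighbor in self.parent:
                    self._union(cell, neighbor)
        return self.regions


if __name__ == "__main__":
    counter = RegionCounter([[1, 0, 1],
                             [0, 0, 0],
                             [1, 0, 1]])
    print(counter.regions)
    print(counter.set_one(0, 1))  # bridges the two top corners
```

Building the structure touches each cell once, and with union by size plus path halving each update is close to O(1) amortized, which matches the complexity the question asks for.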

For System Design, the following were interesting questions:
1. Design typeahead search (given a query prefix, provide completions). 
The interesting part was managing hotspots for small query lengths, the use of a KV store for serving given the low latency requirements and managing personalization.
2. Design an ad click prediction system (ML infra pipeline).
3. Design a lineage system for tracking input data to ML model training to avoid sample bias.
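
For the typeahead question, the KV-store serving idea can be sketched by precomputing the top completions for every prefix. A plain dict stands in for the KV store here, and the function names, the query counts, and the max_prefix_len cap are all assumptions for illustration.

```python
from collections import defaultdict


def build_prefix_index(query_counts, k=3, max_prefix_len=10):
    """Map each prefix to its top-k completions, ranked by query count."""
    candidates = defaultdict(list)
    for query, count in query_counts.items():
        for i in range(1, min(len(query), max_prefix_len) + 1):
            candidates[query[:i]].append((count, query))
    index = {}
    for prefix, entries in candidates.items():
        # Highest count first; ties broken alphabetically for stable output.
        entries.sort(key=lambda entry: (-entry[0], entry[1]))
        index[prefix] = [query for _, query in entries[:k]]
    return index


def complete(index, prefix):
    """Serve completions with a single key lookup."""
    return index.get(prefix, [])


if __name__ == "__main__":
    counts = {"shoes": 50, "shirt": 30, "shorts": 20, "socks": 10}
    index = build_prefix_index(counts, k=2)
    print(complete(index, "sh"))
    print(complete(index, "so"))
```

Because every prefix is precomputed, serving is one lookup per keystroke. The short prefixes ("s", "sh") are the keys that receive the most traffic, which is where the hotspot concern in the question comes from.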

Overall, interviewing was a pleasant process that refreshed a lot of skills. Let me know about other interview questions that might have been interesting for you.


Friday, August 17, 2018

Great workplace habits

  1. Wellness: Maintaining a healthy body, mind, and spirit/mood.
  2. Self-presentation: Controlling one’s grooming, attire, and manners—given the social and cultural situation at hand—so as to make a positive impression on others.
  3. Timeliness: Arriving early, staying late, and taking short breaks. Meeting or beating schedules and deadlines.
  4. Productivity: Working at a fast pace without significant interruptions.
  5. Organization: Using proven systems for documentation and tracking—note taking, project plans, checklists, and filing.
  6. Attention to detail: Following instructions, standard operating procedures, specifications, and staying focused and mindful in performing tasks and responsibilities.
  7. Follow-through and consistency: Fulfilling your commitments and finishing what you start.
  8. Initiative: Being a self-starter. Taking productive action without explicit direction. Going above and beyond; the extra mile.
Found from: https://www.td.org/insights/8-good-work-habits-that-really-matter

I'm not the best at them, but I've found that they have made me a better person. Highly recommended.

Friday, May 11, 2018

Actionable Production Escalations

I've long considered the following items the basics of an actionable production escalation. These were taught to me by Googlers (mostly when I violated these understated values). The fundamentals of any production escalation require the documentation of the following from SREs:
1. An exception, call graph, logs or metrics showing the problem
2. A first pass characterization of the problem (what is it / how much impact)
3. Why me? (Do we need a PoC that you wouldn't know otherwise?) 
4. What have you already tried?
5. Things that you have noted that are out of the ordinary.
6. How specifically can I help solve this problem? (Find a PoC? look at the code? Judge downstream impact? Validate severity?)

Following the above process keeps a check on the level of due diligence needed before a Dev escalation. It also helps formulate concrete action items as part of the escalation process. I've found that this helps resolve issues quicker and keeps the prod overhead low for devs. What do you think?