How to use open source technologies to build search engines for billions of users, billions of revenue, billions of documents

Andrei Lopatenko
11 min read · May 15, 2020


My keynote talk at The 16th International Conference on Open Source Systems.

https://www.slideshare.net/AndreiYigalLopatenko/building-multi-billion-search-engines-on-open-source-technologies

And a podcast about building teams and about my path:

https://podcasts.apple.com/us/podcast/e83-andrei-lopatenko-vp-of-engineering-at-zillow-group/id1442272943

Extract:

  1. Building multi-billion search engines on open source technologies. Andrei Lopatenko, PhD, Vice President of Engineering, Zillow Group
  2. Who am I: Core contributor to Google Search (2006–2010) and Apple AppStore/iTunes; co-designed and co-implemented Apple Maps Search (2010) and Walmart Grocery. Have led search teams at Zillow (now), Walmart, and eBay. PhD in Computer Science, The University of Manchester, UK. My path: from core contributor to Google Search to leading the search ecosystems of market leaders in real estate (Zillow, Trulia), eCommerce (Walmart, eBay), and digital distribution (Apple). Designing, building, implementing, improving, and running multi-billion search engines for the last 15 years.
  3. My goal for this talk: to demonstrate that using open source significantly helps in the implementation, continuous improvement (both in search infrastructure and in search quality), and support and operations of search engines serving billions of users, billions of documents, and (dozens of) billions in revenue/GMV. A typical search engine uses hundreds of open source products and libraries; I'll focus on some that have proven to be useful.
  4. How I am going to do it: show a reference architecture of a typical multi-billion search engine, show a typical open-source-based implementation, and show the limitations of open source implementations (more in Q&A). (My talk is limited to 30 minutes, so I'll be brief; the topic deserves a day-long tutorial.) The talk will be encyclopedic in style; let's go deeper in Q&A. I'll focus on software used or tried in the search engines I've built, rather than a comprehensive survey of every open source package available (I might be biased at times), and on what is useful for different types of search applications.
  5. Why a multi-billion search engine? Many billions of what? 1. Multiple billions of dollars in revenue/GMV. 2. Billions of users -> billions of queries per day. 3. Billions of documents. How are these numbers relevant to the software architecture?
  6. Multi-billion dollars implies: 1. Potential gains of hundreds of millions to billions of dollars per year from higher conversion due to better ranking, retrieval, and query understanding functions, latency improvements, and UX. 2. Continuous work on search quality, search features, search infrastructure, and ranking improvements to capture those billion-dollar gains. 3. Complexity of the search stack and of many of its features, driven by the complexity of the business it supports.
  7. Multi-billion users implies: 1. Billions of queries per day. 2. Tough throughput requirements. 3. Distributed deployments for high load.
  8. Multi-billion documents implies: 1. Distributed retrieval and ranking systems to run retrieval and ranking over billions of documents. 2. Complexity of search functions (one needs good ranking when too many documents match a query). 3. Frequency of updates (even a small update rate causes frequent updates when there are billions of documents) -> index update latency requirements.
  9. Are there many search engines based on open source? Google, Amazon, eBay, and Microsoft/Bing use their own technologies: 1. For all of them, the search engine is the core of the business. 2. The complexity of their ranking, index, etc. is well beyond what open source provides. 3. They reap huge monetary and user gains from optimizations in index size, index update frequency, complexity of ranking and query understanding functions, and integration with other systems. 4. Each company has a big search team to improve and develop its search engine. LinkedIn, eBay, and Airbnb moved away from open source due to the limitations of open source search engines.
  10. Are there many search engines based on open source? Walmart eCommerce (Solr, moved to Elasticsearch), Apple App Store and iTunes (Solr), Adobe Cloud (Elasticsearch), Uber (Solr), Salesforce. (See conferences such as Activate and Elastic{ON} for details on how open source is used in search.)
  11. Typical search engines, ordered by complexity, size, demand for quality, and operation cost. Simple search (small collection of documents, low sensitivity to user satisfaction): search SaaS such as Algolia, Google Custom Search, Swiftype (bought by Elastic), Searchify. Typical big search (hundreds of millions to billions of users, billions in revenue, billions of documents): open-source-based search engines; this is the focus of this talk. Very big search (billions of users, hundreds of billions of documents, high demand for search quality, huge search ops costs): custom search engines (Google, Amazon, etc.).
  12. Historical reasons are important. We built a custom search engine at Apple in 2010 (C++); Solr/Lucene at the time was not good enough for the expected number of users (a billion) and the required search quality. Now I would make a different decision. We built a custom graph-based search engine in 2016 (Ozlo, sold to Facebook); open source was not good enough in QPS for the graph query language we needed, in support for natural language search, in the frequency of graph updates, etc. Now open source graph search engines are good enough for reasonable requirements.
  13. Typical search engine — high-level view: data acquisition, indexing, ranking, retrieval, query understanding, UX, search assistance, logging, monitoring, experiment management, SERP logic, and other components.
  14. Key main components. Search quality: search assistance services (autosuggest, dynamic facets, etc.), query understanding, ranking, SERP logic (snippet building, universal search: mixing results of different corpora). Search infrastructure: retrieval, logging, monitoring, ops. Indexing: data acquisition (crawling the web, feeds, data imports), including data enrichment (duplicate resolution, data cleaning, mapping into common dictionaries, extraction), and indexing itself (building the index for the retrieval systems).
  15. Query understanding. Task: process a query; parse it, extract information from it, classify it, and add information useful for retrieval and ranking. Latency requirement: typical limit 10 ms, up to 50 ms. Throughput requirement: at least the QPS of the system; billions of queries per day -> ~10^5 QPS. Frequently distributed: it requires loading large vocabularies (language models, embeddings) gigabytes in size; it naturally splits into many components for different classification, parsing, and expansion tasks (up to hundreds); heavy CPU and GPU load.
  16. Query understanding: an example. Query: "52 inches tv samsung alexa". Query understanding output: category: TV; size: 52 inch; brand: Samsung; additional: Alexa. Relaxations: size: 48–60 inch; additional: Alexa <- optional; brand: Sharp, Sony. (See the sketch below for what such structured output might look like in code.)
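A minimal sketch of the structured output such a query understanding step might produce; the class and field names are illustrative, not from the talk:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ParsedQuery:
    """Illustrative structured result of query understanding."""
    raw: str
    category: Optional[str] = None
    attributes: dict = field(default_factory=dict)   # hard constraints for retrieval
    relaxations: dict = field(default_factory=dict)  # fallbacks if recall is too low

parsed = ParsedQuery(
    raw="52 inches tv samsung alexa",
    category="TV",
    attributes={"size_inch": 52, "brand": "samsung", "feature": "alexa"},
    relaxations={"size_inch": (48, 60), "brand": ["sharp", "sony"], "feature": "optional"},
)
```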
  17. Query Understanding — Open Source: 1. spaCy 2. fastText 3. Stanford CoreNLP 4. Apache OpenNLP 5. Google SLING. These are frequently not performant enough: one of my groups had to rewrite the internals of the Stanford parser to make it fast enough. (A fastText-based classification sketch is below.)
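For query classification at high QPS, a supervised fastText model is a common choice; a minimal sketch, assuming a hypothetical training file of labeled queries in fastText's `__label__` format:

```python
import fasttext

# labeled_queries.txt (hypothetical) holds lines like:
#   __label__TV 52 inches tv samsung alexa
model = fasttext.train_supervised(
    input="labeled_queries.txt", lr=0.5, epoch=25, wordNgrams=2
)
labels, probs = model.predict("52 inches tv samsung alexa")
print(labels[0], probs[0])  # e.g. ('__label__TV', 0.97)
```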
  18. Query Understanding — Open Source: 1. Hugging Face Transformers 2. Zalando Flair 3. Facebook PyText 4. ULMFiT 5. OpenNMT for machine translation.
  19. Query understanding. The latency and throughput requirements are tight, so there is a lot of caching for hard-to-compute models (the query distribution is skewed). Fast, performant models, such as fastText-based ones, have a big advantage. (A caching sketch follows.)
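Because the query distribution is heavily skewed, even a simple in-process LRU cache over normalized queries absorbs a large share of calls to an expensive model; a minimal sketch, with a hypothetical stand-in for the heavy model:

```python
from functools import lru_cache

def run_expensive_model(query: str) -> tuple:
    # stand-in for a heavy parsing/classification model
    return (("category", "unknown"),)

@lru_cache(maxsize=1_000_000)
def _understand_normalized(normalized: str) -> tuple:
    return run_expensive_model(normalized)

def understand(query: str) -> tuple:
    # normalize first so variants of the same query share one cache entry
    return _understand_normalized(" ".join(query.lower().split()))
```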
  20. Query Understanding — Open Source — Tools: Doccano, an annotation tool for creating training sets; AllenNLP, as an NLP "rank lab" to test models; Snorkel MeTaL, a great weak supervision tool (weak supervision is used for many practical NLP tasks).
  21. Search assistance. Autosuggest / type-ahead: Solr has the Suggester component, and Elasticsearch has similar features too. But building custom autosuggest on top of them is a huge improvement: 1. Better language modeling. 2. Context (user, location). 3. Improved retrieval (substring matching, not prefix only; spell correction inside autosuggest). All are big improvements in quality and user satisfaction; the difference is big, 30%+ in engagement for eCommerce. (A sketch of querying Solr's Suggester is below.)
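A minimal sketch of querying Solr's Suggester over HTTP, assuming a Solr core named `products` with a configured suggester named `default` (both names are illustrative):

```python
import requests

resp = requests.get(
    "http://localhost:8983/solr/products/suggest",
    params={
        "suggest": "true",
        "suggest.dictionary": "default",
        "suggest.q": "52 inch",
        "suggest.count": 10,
    },
    timeout=0.05,  # autosuggest latency budgets are tight
)
for s in resp.json()["suggest"]["default"]["52 inch"]["suggestions"]:
    print(s["term"], s["weight"])
```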
  22. Search assistance: filters and facets. Both Solr and Elasticsearch generate facets, but usually you need to build your own faceting logic on top of them to produce facets that are good for users. The difference can be huge: 15%+ in conversion for eCommerce. (A facet request sketch is below.)
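A minimal sketch of a classic Solr facet request (the core and field names are illustrative); a real system post-processes these counts to decide which facets to show, in what order, and with what buckets:

```python
import requests

resp = requests.get(
    "http://localhost:8983/solr/products/select",
    params={
        "q": "tv",
        "rows": 0,  # we only want the facet counts here
        "facet": "true",
        "facet.field": ["brand", "screen_size"],
        "facet.mincount": 1,
    },
)
# Solr returns facet counts as a flat [value, count, value, count, ...] list
counts = resp.json()["facet_counts"]["facet_fields"]["brand"]
brands = dict(zip(counts[::2], counts[1::2]))
```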
  23. Ranking: what is it? Given 1. a query with all the information derived about it, 2. a user with all the information known about them, and 3. a set of retrieved results, produce a ranking of the results optimized for the highest success, where success might be click-through rate, conversion rate, GMV, revenue, or other engagement, satisfaction, and monetization metrics. Latency requirement: typical limit 10 ms, up to 50 ms. Throughput requirement: at least the QPS of the system; billions of queries per day -> ~10^5 QPS.
  24. Ranking — technologies — LeToR. LeToR (learning to rank) / MLR (machine-learned ranking) functions have proven highly successful for ranking, handling: 1. the complexity of the many query, user, and document features used for ranking; 2. the complexity of the ranking functions themselves; 3. tuning for various metrics and re-tuning as metrics change. They may still require manual intervention (policies and legal regulations on search engines, multi-metric optimization, internal policies not expressible in learnable functions).
  25. Ranking — Open Source — Training. A popular approach: a proprietary LeToR "rank lab" built around an LTR model type such as GBDT or SVMrank, using open source such as XGBoost, CatBoost, or SVMlight to learn the ranking function; converting the model generated by that step into source code for your ranking engine (Java, C++); compiling and deploying the ranking function as part of the ranking layer. (A training sketch with XGBoost is below.)
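A minimal learning-to-rank training sketch with XGBoost on synthetic data; the feature layout, labels, and group sizes are illustrative:

```python
import numpy as np
import xgboost as xgb

# Rows are (query, document) feature vectors, labels are relevance grades,
# and `group` gives the number of documents per query.
X = np.random.rand(100, 20)
y = np.random.randint(0, 4, size=100)  # relevance grades 0..3
group = [10] * 10                      # 10 queries, 10 docs each

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(group)

params = {
    "objective": "rank:ndcg",
    "eta": 0.1,
    "max_depth": 6,
    "eval_metric": "ndcg@10",
}
model = xgb.train(params, dtrain, num_boost_round=200)
scores = model.predict(xgb.DMatrix(np.random.rand(10, 20)))  # per-doc scores
```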
  26. Ranking — Open Source — Training: 1. the Solr LTR contrib module, integrated into Solr; 2. Google TF-Ranking; 3. RankLib.
  27. Ranking — feature storage. Some ranking features live in the index; many do not (due to storage limits and performance costs). Features must be retrieved during the ranking stage, for every query, for ~1000+ documents in the second and third ranking stages, with hundreds of features each -> a feature store. Google Feast, CouchDB, Redis, wide-column stores, document stores; whatever is chosen must be optimized for multi-id lookups. (A lookup sketch is below.)
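A minimal feature-store sketch on Redis (the `doc:{id}:features` key layout is illustrative); pipelining turns the per-document lookups into one round trip, which matters when ranking ~1000 candidates per query:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def fetch_features(doc_ids):
    # batch all GETs into a single round trip
    pipe = r.pipeline()
    for doc_id in doc_ids:
        pipe.get(f"doc:{doc_id}:features")
    raw = pipe.execute()
    return {d: json.loads(v) if v else {} for d, v in zip(doc_ids, raw)}
```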
  28. Post-ranking — feature layer. Many search systems require a post-ranking layer to re-arrange or remove results and adjust the ranking based on secondary criteria: the current price (which changes frequently), availability in a store or warehouse (which changes frequently), optimizing cross-category selling, and so on; these are non-persistent features delivered by external systems (warehouse management, sales management). Key-value stores optimized for read-write: Redis, etc.
  29. Caching. In high-load systems, a cache on top of the search keeps the results of frequent queries to reduce load on the system and give low latency for cached queries (no need to recompute the result set). Caches are implemented inside Solr and Elasticsearch. Frequently, a cache is also needed on top of the search engine as a whole, so that everything after the L0 ranking layer (personalization, diversification, etc.) and the merged output of multiple Solr/Elasticsearch engines (aka universal search) gets cached: Redis, Memcached. (A read-through cache sketch is below.)
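A minimal read-through cache over a search call; `search_backend` is a hypothetical stand-in for the full query understanding, retrieval, and ranking pipeline, and the short TTL keeps hot queries cheap while bounding staleness:

```python
import hashlib
import json
import redis

r = redis.Redis()

def search_backend(query: str):
    return [{"doc_id": 1, "score": 1.0}]  # stand-in for the real pipeline

def cached_search(query: str, ttl_s: int = 300):
    key = "serp:" + hashlib.sha1(query.encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)          # cache hit: no recomputation
    results = search_backend(query)
    r.setex(key, ttl_s, json.dumps(results))  # cache miss: store with TTL
    return results
```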
  30. ML and NLP training / rank lab / ML tools. Mostly for ranking and query understanding, but used for many other things too: duplicate resolution, anomaly detection. Spark; platforms gaining maturity and popularity: Kubeflow, MLflow; PyTorch and TensorFlow, with your ML stack built on top of them; XGBoost, CatBoost, etc. (This slide deserves a separate one-hour talk, as do all the others.)
  31. ML and NLP training / rank lab / ML tools. R/CRAN, despite its limits, is quite useful for many search ML tasks at the exploration stage, and some teams still use it for production learning of LeToR models. From my experience: R time-series modules to learn time dependencies for ranking functions (for queries with fast topicality shifts), R stat packages to learn language-affinity features, and other ranking tasks. Analysis of A/B tests, contextual bandits, and other types of experiments is typically Python code built on top of scikit-learn, pandas, and NumPy.
  32. Retrieval. Given a query Q, return an ordered list of <DocID, Score>. The Lucene ecosystem (Solr and Elasticsearch) for text documents: text scoring, multiple fields, filters, sorts, spatial queries; also Yahoo Vespa. Geographic: Solr Spatial. Graphs: Neo4j, Apache Giraph, Titan. Vector search: Facebook FAISS, Microsoft SPTAG. Google S2 geometry and the JTS Topology Suite to implement spatial indexing and retrieval. (A vector-search sketch is below.)
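A minimal vector-retrieval sketch with FAISS on synthetic data; a brute-force inner-product index is shown for clarity, while real deployments use quantized or graph-based indexes and shard them:

```python
import numpy as np
import faiss

dim = 128
docs = np.random.rand(100_000, dim).astype("float32")
faiss.normalize_L2(docs)  # cosine similarity via inner product

index = faiss.IndexFlatIP(dim)
index.add(docs)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, doc_ids = index.search(query, 10)  # top-10 <DocID, Score>
```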
  33. Retrieval. But any real search application will require a lot of optimization on top of the existing system for index sizes, QPS loads, etc., such as sharding (the details depend on the search application).
  34. Retrieval — open source problems. Both Solr and Elasticsearch are good for limited cluster sizes; there are serious problems scaling to very large collections. There has been huge progress recently (2010+) in new index compression techniques that produce a smaller index with fast retrieval, worth dozens of millions of dollars in hardware costs for "big" search, yet none of them are available in the Solr/Elasticsearch engines. The default query parsers are quite limited (in 2015, multi-word synonyms were not supported); building a serious search engine requires a lot of work in the internals of Solr/Elasticsearch.
  35. Indexing and data acquisition. 1. Acquire: get data from external sources (the web, feeds, etc.) into internal systems. Some data arrives in bulk volumes (petabytes per day for a web crawl), some is frequently updated (hundreds of millions of updates per day for prices and availability), and so on. 2. Enrich: merge, resolve duplicates, remove noise, normalize, clean, enrich. Do two pages or two items represent the same object? Are the latitude and longitude consistent with the address? Map full-text item descriptions into a set of attributes with values; add derived signals (probability to sell, demand, similarity to other items) to the item; and so on. 3. Index: map the data into the format understood by the retrieval systems.
  36. Acquisition — open source. Crawling the web: Apache Nutch and Heritrix for scale, many others for smaller scale. Apache Atlas for governance, data registry, and data discovery. A lot of domain-specific data acquisition tools: ACQ4 for neurophysiology, open source GIS systems. Transport: Apache Kafka and Apache Pulsar to bring data from external systems. Document stores and wide-column stores: Cassandra/ScyllaDB, HBase, and many others to hold acquired data before indexing, depending on needs, data types, and update frequency. (A sketch of consuming an update stream from Kafka is below.)
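A minimal sketch of consuming frequent document updates (prices, availability) from Kafka with kafka-python; the topic name and message schema are illustrative, and a real pipeline would batch the writes into the pre-index document store:

```python
import json
from kafka import KafkaConsumer

def apply_update(update: dict) -> None:
    print("would upsert into the pre-index store:", update)  # stand-in

consumer = KafkaConsumer(
    "catalog-updates",
    bootstrap_servers=["localhost:9092"],
    group_id="indexing-pipeline",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for msg in consumer:
    apply_update(msg.value)  # e.g. {"doc_id": "123", "price": 449.99}
```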
  37. Enrich. Duplicate resolution, cleaning, adding derived data: there is no good open source for this. There is a lot of workflow management (see the orchestration slide), and Spark/Spark Streaming and Flink at the higher level for running jobs, but nothing at the task level (and the tasks are too domain-specific).
  38. Indexing. Indexing is specific to search, and search engines provide their own: Solr and Elasticsearch for text search; Vespa; Facebook FAISS and Microsoft SPTAG for similarity search; Neo4j and other graph systems.
  39. Acquisition, enrichment, indexing — orchestration. Typically, data pipelines consist of hundreds of individual components integrating external data, internal data, and derived data, then cleaning and enriching them. Orchestration becomes very important: multiple jobs written in different languages, accessing different data stores. Airflow, Oozie, even Luigi, depending on complexity and environment. (A sketch of an Airflow DAG for such a pipeline is below.)
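A minimal Airflow (2.x style) sketch of an acquire -> enrich -> index pipeline; the task bodies are placeholders for real jobs (crawlers, Spark jobs, indexers):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def acquire():
    print("pull feeds / crawl")       # placeholder for acquisition jobs

def enrich():
    print("dedupe, clean, derive")    # placeholder for enrichment jobs

def index():
    print("build and ship the index") # placeholder for indexing jobs

with DAG(
    dag_id="search_indexing",
    start_date=datetime(2020, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_acquire = PythonOperator(task_id="acquire", python_callable=acquire)
    t_enrich = PythonOperator(task_id="enrich", python_callable=enrich)
    t_index = PythonOperator(task_id="index", python_callable=index)
    t_acquire >> t_enrich >> t_index  # declare the dependency chain
```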
  40. Other useful stuff for building search systems. Low level: gRPC, protobuf, Snappy, gflags, glog, Google Benchmark, etc.; these show up everywhere in a typical search stack. Google S2 geometry or the JTS Topology Suite for location-based search. GoogleTest and xUnit-style frameworks for testing. Jaeger or Apache SkyWalking for distributed tracing.
  41. Ops. Kubernetes; Knative for automated running of serverless containers on k8s with autoscaling and revision tracking; the Bloomberg Solr operator to run Solr on a k8s cluster; Netflix's Chaos Monkey for chaos engineering of a distributed system such as search (it kills instances periodically). A multi-billion search engine typically means hundreds of services running on many nodes, and the system must be fault tolerant.
  42. Ops. SolrCloud: tooling to set up a fault-tolerant, highly available Solr cluster. Elasticsearch Cross-Cluster Replication (CCR): replicating the index across data centers. Terraform: defining and provisioning data centers for search ops. Jenkins: deployment.
  43. Other blocks. Experimentation: Facebook PlanOut, Intuit Wasabi, Wix Petri. Graphs and log reports: Grafana, Kibana. Logs: the ELK stack (Elasticsearch, Logstash, Kibana), Filebeat, Fluentd.
  44. Conversational search. Task-oriented dialog management and NLU specific to conversational tasks (slot extraction, intent classification): Rasa. You can use Rasa to manage dialog state (it is great at that, from learning-based to manual tools for analyzing dialogs) but build your own NLU. There are plenty of research open source projects, outcomes of the DSTC competitions, focused on various problems in building dialog systems: dialog act classification, slot extraction, dialog breakdown detection. NVIDIA NeMo if you need your own automatic speech recognition (tuned for a specific language, domain, or business).
  45. Conversational search: QA over paragraphs. The results of academic research and competitions are frequently available online: passage re-ranking with BERT (dl4marco-bert), TANDA from Alexa, YodaQA, cdQA, DrQA. They require a lot of work to tune for your corpora. (A re-ranking sketch is below.)
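One way to experiment with BERT-style passage re-ranking is a public MS MARCO cross-encoder via the sentence-transformers library; the checkpoint name below is one publicly available model, not the code from the talk, and the query and passages are illustrative:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how to replace a furnace filter"
passages = [
    "Furnace filters should be replaced every 90 days...",
    "A heat pump moves heat rather than generating it...",
]
# Score each (query, passage) pair jointly, then sort descending
scores = model.predict([(query, p) for p in passages])
reranked = sorted(zip(passages, scores), key=lambda x: -x[1])
```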
  46. NL search over structured data: SQLNet, Aqqu from the University of Freiburg, Percy Liang's Dependency-Based Compositional Semantics. Mostly academic code.
  47. Conclusions. Open source technologies are useful for building many parts of a search engine stack, from low-level code using libraries (e.g., gRPC) to whole systems (e.g., Solr/Elasticsearch, Kafka, the TensorFlow runtime). Each part will require a lot of work to tune it for your environment, code built on top of it, and carefully designed ops. Do not be surprised if you have to rewrite certain open source components with your own implementations as your search engine grows and you need "extreme" tuning, performance, and quality.
  48. Conclusions. A search engine is the joint work of many, many people, and there are many people who know XGBoost, CatBoost, TensorFlow, Solr, Kafka, and Kibana: using open source simplifies hiring for expertise. Search systems require continuous, never-ending evolution; they need to be modified and rewritten many times (module by module), and open source is a big advantage here due to access to the source and (sometimes) community help. Very few companies can afford search teams of hundreds or thousands of people as in very big search; even the teams behind 20+ billion dollar search engines are quite small. Open source helps.



Written by Andrei Lopatenko

VP of Engineering at Zillow. Leading Search, Conversational and Voice AI, and ML at Zillow, eBay, Walmart, Apple, Google, Recruit Holdings. Ph.D. in Computer Science.
