<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Pardis Noorzad</title>
 <link href="https://djpardis.com/atom.xml" rel="self"/>
 <link href="https://djpardis.com/"/>
 <updated>2026-06-10T06:33:39+00:00</updated>
 <id>https://djpardis.com</id>
 <author>
   <name>Pardis Noorzad</name>
 </author>

 
 <entry>
   <title>The evolution of software engineering</title>
   <link href="https://djpardis.com/blog/2026/02/20/evolution-software-engineering-fortran-llms/"/>
   <updated>2026-02-20T00:00:00+00:00</updated>
   <id>https://djpardis.com/blog/2026/02/20/evolution-software-engineering-fortran-llms</id>
   <content type="html">&lt;div class=&quot;note-container timeline-link-container post-container&quot;&gt;
&lt;strong&gt;Interactive.&lt;/strong&gt; Explore the evolution of software engineering in a visual, scrollable timeline with eras and milestones from the article.
&lt;br /&gt;
&lt;br /&gt;
Click to open the &lt;a href=&quot;/timeline/&quot;&gt;interactive timeline&lt;/a&gt;.
&lt;/div&gt;

&lt;div class=&quot;toc-container post-container&quot;&gt;
&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#foundations&quot;&gt;Foundations&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#fortran-1957&quot;&gt;1957. FORTRAN eliminates the need for scientists to understand computer hardware&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#structured-1968&quot;&gt;1968. Structured programming makes programs comprehensible by constraining control flow&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#relational-1970&quot;&gt;1970. Relational databases enable declarative data access independent of storage implementation&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#unix-1971&quot;&gt;1971. Unix establishes the operating system as a portable hardware abstraction layer&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#c-1973&quot;&gt;1973. C makes systems software like Unix portable across different computer architectures&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#oop-1970s&quot;&gt;1970s–1980s. Object-oriented programming enforces encapsulation to manage large system complexity&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#internet-and-web&quot;&gt;Internet and Web&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#tcpip-1983&quot;&gt;1983. TCP/IP makes the Internet a universal network layer&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#web-1989&quot;&gt;1989–1993. The World Wide Web enables universal software distribution through browsers&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#python-1991&quot;&gt;1991. Python becomes the default for scripting, automation, and data science&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#stdlib-1994&quot;&gt;1994–1998. Standard algorithm libraries make common algorithms and data structures reusable&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#gc-1995&quot;&gt;1995. Garbage collection makes entire categories of memory errors impossible&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#package-managers-1995&quot;&gt;1995–2010. Package managers make dependency management automatic&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#opensource-1998&quot;&gt;1998. Open source makes collaborative, publicly developed software the default&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#rest-2000&quot;&gt;2000. REST APIs standardize how web services communicate&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ides-2001&quot;&gt;2001. IDEs automate the mechanical scaffolding of programming, an early step toward code generation&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#di-2002&quot;&gt;2002. Dependency injection frees enterprise programmers from framework boilerplate&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#cloud-and-infrastructure&quot;&gt;Cloud and infrastructure&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#mapreduce-2004&quot;&gt;2004–2009. MapReduce and Hadoop make processing massive datasets accessible&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#git-2005&quot;&gt;2005. Git enables distributed collaboration at global scale&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#cloud-2006&quot;&gt;2006. Cloud platforms transform infrastructure into elastic, pay-per-use resources&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#mobile-2007&quot;&gt;2007. Mobile platforms turn the phone into a general-purpose computer with app ecosystems&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#microservices-2008&quot;&gt;2008–2012. Microservices replace monoliths as the architecture for large-scale applications&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#nosql-2009&quot;&gt;2009. NoSQL databases trade consistency for scale and flexibility&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#nodejs-2009&quot;&gt;2009. Node.js makes JavaScript full-stack and enables the npm ecosystem&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#language-tooling-2010&quot;&gt;2010–2015. Modern language features and ecosystem tooling bring structure and safer concurrency to mainstream development&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#containers-2013&quot;&gt;2013–2014. Containers and orchestration make deployment portable and scalable&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#serverless-2014&quot;&gt;2014. Serverless computing shifts the unit of deployment from servers to functions&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ml-frameworks-2015&quot;&gt;2015–2016. ML frameworks democratize machine learning without research-level expertise&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#ai-coding&quot;&gt;AI coding&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#ai-transformers-2017&quot;&gt;2017. Transformers replace recurrence with self-attention&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ai-llm-2020&quot;&gt;2020. Large language models demonstrate in-context learning&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ai-copilot-2021&quot;&gt;2021. Copilot and Codex bring AI code generation to mainstream development&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ai-rlhf-2022&quot;&gt;2022. RLHF aligns code models to programmer intent&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ai-rag-2022&quot;&gt;2023. RAG grounds code generation in the codebase&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ai-agentic-2023&quot;&gt;2023–2024. Long-context and agentic interfaces expand scope&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ai-reasoning-2024&quot;&gt;2024. Extended reasoning and enterprise fine-tuning complete the AI coding assistant stack&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ai-benchmarks-2024&quot;&gt;2024. Code evals establish comparable benchmarks and reveal the gap to real-world tasks&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#discussion&quot;&gt;Discussion&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#where-ai-fits&quot;&gt;The internet, cloud, and mobile eras put AI in context&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#whether-ai-displace-saas&quot;&gt;Verification and maintenance costs determine whether AI displaces SaaS&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#each-past-abstraction&quot;&gt;Each past abstraction eliminated the need to acquire entire areas of knowledge&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#english-not-pl&quot;&gt;English is not a programming language&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#open-source-ai&quot;&gt;Open source creation and maintenance both benefit from AI&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#pl-not-consolidating&quot;&gt;Languages, frameworks, and tools are consolidating, and AI may accelerate the trend&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ai-improve-abstractions&quot;&gt;Can AI improve existing abstraction layers?&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div class=&quot;post-hero-image&quot;&gt;
&lt;img src=&quot;/files/pics/blog/2026/camera%20obscura.jpg&quot; alt=&quot;Camera obscura&quot; /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Built by Floyd Jennings in 1946, a rotating mirror projects the outside world through lenses onto a horizontal viewing table. A decade later the building was modified to look as if it were a camera left behind by visitors. More &lt;a href=&quot;https://noehill.com/sf/landmarks/nat2001000522.asp&quot;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Introduction&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;My first exposure to AI, save a few books from Scholastic, was through several electives in college. The understanding at the time was that if we were to achieve the promises of AI, we would not merely write ordinary software but rather software that would write other software. Of course, code generation was already underway in many forms. Compilers turned high-level code into machine code. Parser generators like yacc turned a grammar into a parser. But that was all purpose-built and deterministic. AI code generation is different.&lt;/p&gt;

&lt;p&gt;The discourse about AI today is rightfully grand. “&lt;a href=&quot;https://www.techradar.com/pro/nvidia-ceo-ai-could-be-the-largest-technological-leap-weve-ever-seen&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;It’s the largest technological leap we’ve seen.&lt;/a&gt;” “&lt;a href=&quot;https://www.businessinsider.com/ben-horowitz-says-ai-is-bigger-than-internet-not-bubble-2026-1&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;It’s bigger than the internet.&lt;/a&gt;” “&lt;a href=&quot;https://www.pcmag.com/news/apple-ceo-ai-is-as-big-or-bigger-than-the-internet-smartphones&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;It’s as big as smartphones.&lt;/a&gt;” “&lt;a href=&quot;https://youtu.be/Gnl833wXRz0?t=3435&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;SaaS is so over.&lt;/a&gt;” This article examines seven decades of software engineering evolution, from FORTRAN to LLMs. We draw on this history to analyze where today’s AI tools fit. What changed when new paradigms arrived? What stayed the same? And what can the economic patterns of previous breakthroughs reveal about this one?&lt;/p&gt;

&lt;p&gt;The article is organized into four eras, each built from milestones presented in roughly chronological order. We begin in 1957.&lt;/p&gt;

&lt;h2 id=&quot;foundations&quot; class=&quot;era-heading&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Foundations&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;This era established the core abstractions that programming would build on for decades.&lt;/p&gt;

&lt;h2 id=&quot;fortran-1957&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1957. FORTRAN eliminates the need for scientists to understand computer hardware&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; In the mid-1950s, scientific computing required programming in assembly language. The IBM 704 had 36-bit words and three 15-bit index registers. A programmer writing code to solve differential equations needed to understand both the mathematical method and the hardware details. These included which registers to use, instruction timing, and minimizing the instruction count in inner loops.&lt;/p&gt;

&lt;p&gt;This dual expertise created a bottleneck. Universities employed small numbers of programmers who understood both scientific problems and machine architecture. A physicist at Los Alamos might wait weeks for a programmer to translate equations into code &lt;a href=&quot;#ref-Met59&quot; id=&quot;ref-Met59-back&quot;&gt;[Met59]&lt;/a&gt;. The programmer might not understand the scientific context, causing errors.&lt;/p&gt;

&lt;p&gt;In addition, programs were non-portable. Code for the IBM 704 would not run on UNIVAC. Each new machine required complete reimplementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; In 1956, John Backus was able to convince IBM executives to fund FORTRAN. He  estimated that in 1954, more than half of operating costs were programming costs, despite computers being enormously expensive. He argued that automating translation would reduce programming costs. The first FORTRAN compiler shipped in 1957 &lt;a href=&quot;#ref-Bac57&quot; id=&quot;ref-Bac57-back&quot;&gt;[Bac57]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;FORTRAN (Formula Translation) let scientists write mathematical expressions in notation close to standard syntax. The statement &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Y = A * X + B&lt;/code&gt; directly expressed the computation without registers or memory addresses. A recurrence like the Fibonacci sequence could be written in a handful of lines:&lt;/p&gt;

&lt;div class=&quot;language-fortran highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;w&quot;&gt;      &lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;INTEGER&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;I&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TMP&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F0&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;READ&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;DO&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;I&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TMP&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F0&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F0&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TMP&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
   &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;CONTINUE&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;PRINT&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The compiler performed register allocation, instruction selection, and optimization. Many believed compilers could never match skilled assembly programmers. But Backus demonstrated the compiler often generated faster code than hand-written assembly by performing tedious optimizations systematically.&lt;/p&gt;

&lt;p&gt;Within five years, most scientific computing moved from assembly to FORTRAN. Scientists became programmers. Computational fluid dynamics, molecular modeling, weather forecasting, and financial modeling all benefited. FORTRAN did not merely accelerate existing work. It made feasible work that had been almost impossible to undertake.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Bac57&quot; href=&quot;#ref-Bac57-back&quot;&gt;[Bac57]&lt;/a&gt; Backus, J. 1957. &quot;The FORTRAN Automatic Coding System.&quot; &lt;em&gt;Western Joint Computer Conference&lt;/em&gt;. ACM. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/1455567.1455599&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Met59&quot; href=&quot;#ref-Met59-back&quot;&gt;[Met59]&lt;/a&gt; Metropolis, N. et al. 1959. &quot;Early Computing at Los Alamos.&quot; &lt;em&gt;Annals of the History of Computing&lt;/em&gt; 1(1):23-34. Available at &lt;a href=&quot;https://ieeexplore.ieee.org/document/4640758&quot; target=&quot;_blank&quot;&gt;ieeexplore.ieee.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;structured-1968&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1968. Structured programming makes programs comprehensible by constraining control flow&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; By 1968, program size had outpaced human comprehension. FORTRAN, COBOL, and assembly relied heavily on GOTO statements that could transfer control to any labeled statement. A program might contain hundreds of GOTOs jumping to labels scattered throughout thousands of lines.&lt;/p&gt;

&lt;p&gt;GOTO statements made local reasoning impossible. Understanding what a program did at any point required tracing all possible execution paths from anywhere. A label on line 500 might be reached by GOTOs from lines 100, 250, 780, and 1200. The number of paths grew combinatorially with program size.&lt;/p&gt;

&lt;p&gt;Dijkstra called this “spaghetti code” where control flow wove like tangled strands &lt;a href=&quot;#ref-Dij68&quot; id=&quot;ref-Dij68-back&quot;&gt;[Dij68]&lt;/a&gt;. By the late 1960s, commercial systems exceeded 50,000 lines and operating systems approached 100,000 lines. NATO convened a conference in 1968 to address “the software crisis” &lt;a href=&quot;#ref-NR69&quot; id=&quot;ref-NR69-back&quot;&gt;[NR69]&lt;/a&gt;. Programs had become too complex to understand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Dijkstra’s “Go To Statement Considered Harmful” argued that GOTOs should be eliminated entirely &lt;a href=&quot;#ref-Dij68&quot; id=&quot;ref-Dij68-back&quot;&gt;[Dij68]&lt;/a&gt;. He proposed restricting control flow to three constructs. These were sequential execution, conditional execution (if-then-else), and iteration (while loops).&lt;/p&gt;

&lt;figure&gt;
&lt;a href=&quot;https://xkcd.com/292/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;img src=&quot;https://imgs.xkcd.com/comics/goto.png&quot; alt=&quot;xkcd: goto&quot; /&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;p class=&quot;image-caption&quot;&gt;goto. Randall Munroe, &lt;a href=&quot;https://xkcd.com/292/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;xkcd&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Böhm and Jacopini proved these three constructs were sufficient to express any algorithm expressible with GOTOs &lt;a href=&quot;#ref-Boh66&quot; id=&quot;ref-Boh66-back&quot;&gt;[Böh66]&lt;/a&gt;. The restriction did not reduce expressive power. Floyd and Hoare had shown that structured constructs admitted formal reasoning (preconditions, postconditions), whereas arbitrary GOTOs did not &lt;a href=&quot;#ref-Flo67&quot; id=&quot;ref-Flo67-back&quot;&gt;[Flo67]&lt;/a&gt; &lt;a href=&quot;#ref-Hoa69&quot; id=&quot;ref-Hoa69-back&quot;&gt;[Hoa69]&lt;/a&gt;. Niklaus Wirth designed Pascal &lt;a href=&quot;#ref-Wir71&quot; id=&quot;ref-Wir71-back&quot;&gt;[Wir71]&lt;/a&gt; to enforce structured programming through syntax. The language had no GOTO statement. Programs could be understood by reading top to bottom, following nested control flow.&lt;/p&gt;

&lt;p&gt;The practical effect was that software could grow. Before structured programming, systems above a certain size simply could not be understood or maintained. After it, teams could build operating systems, banking platforms, and airline reservation systems. The improvement in maintainability was consequential. Programs hundreds of thousands of lines long became possible.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Boh66&quot; href=&quot;#ref-Boh66-back&quot;&gt;[Böh66]&lt;/a&gt; Böhm, C. &amp;amp; Jacopini, G. 1966. &quot;Flow Diagrams, Turing Machines and Languages with Only Two Formation Rules.&quot; &lt;em&gt;Communications of the ACM&lt;/em&gt; 9(5):366-371. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/355592.365646&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Dij68&quot; href=&quot;#ref-Dij68-back&quot;&gt;[Dij68]&lt;/a&gt; Dijkstra, E. W. 1968. &quot;Go To Statement Considered Harmful.&quot; &lt;em&gt;Communications of the ACM&lt;/em&gt; 11(3):147-148. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/362929.362947&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Flo67&quot; href=&quot;#ref-Flo67-back&quot;&gt;[Flo67]&lt;/a&gt; Floyd, R. W. 1967. &quot;Assigning Meanings to Programs.&quot; &lt;em&gt;Proceedings of Symposium in Applied Mathematics&lt;/em&gt; 19:19-32. Available at &lt;a href=&quot;https://people.eecs.berkeley.edu/~necula/Papers/FloydMeaning.pdf&quot; target=&quot;_blank&quot;&gt;berkeley.edu&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Hoa69&quot; href=&quot;#ref-Hoa69-back&quot;&gt;[Hoa69]&lt;/a&gt; Hoare, C. A. R. 1969. &quot;An Axiomatic Basis for Computer Programming.&quot; &lt;em&gt;Communications of the ACM&lt;/em&gt; 12(10):576-580. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/363235.363259&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-NR69&quot; href=&quot;#ref-NR69-back&quot;&gt;[NR69]&lt;/a&gt; Naur, P. &amp;amp; Randell, B. 1969. &lt;em&gt;Software Engineering: Report on NATO Conference&lt;/em&gt;. NATO Science Committee. Available at &lt;a href=&quot;https://eprints.ncl.ac.uk/158767&quot; target=&quot;_blank&quot;&gt;Newcastle ePrints&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Wir71&quot; href=&quot;#ref-Wir71-back&quot;&gt;[Wir71]&lt;/a&gt; Wirth, N. 1971. &quot;The Programming Language Pascal.&quot; &lt;em&gt;Acta Informatica&lt;/em&gt; 1(1):35-63. Available at &lt;a href=&quot;https://link.springer.com/article/10.1007/BF00264291&quot; target=&quot;_blank&quot;&gt;link.springer.com&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;relational-1970&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1970. Relational databases separate logical data organization from physical storage implementation&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Through the 1960s, programs stored data in flat files with application-specific formats. Each program defined its own file structure and wrote custom parsing code. This worked for isolated applications but created problems as organizations accumulated data and needed to share it.&lt;/p&gt;

&lt;p&gt;The fundamental issue was tight coupling. Every program accessing a customer file needed to understand the exact byte layout, and adding a new field required modifying every program that touched that data, even those not using the new field.&lt;/p&gt;

&lt;p&gt;IBM’s Information Management System (IMS), introduced in 1966, provided hierarchical organization. But accessing data required manual navigation. To find all orders for a customer, a program traversed pointers to child records. There was no declarative way to express access patterns. Different applications wrote redundant filtering logic. When business rules changed, organizations faced updating inconsistent code across dozens of programs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Codd’s 1970 paper “A Relational Model of Data for Large Shared Data Banks” &lt;a href=&quot;#ref-Cod70&quot; id=&quot;ref-Cod70-back&quot;&gt;[Cod70]&lt;/a&gt; proposed organizing data as mathematical relations, that is, tables with rows and columns. Each table represented an entity type, each row an instance, each column an attribute.&lt;/p&gt;

&lt;div class=&quot;link-cards&quot; style=&quot;grid-template-columns: 1fr; max-width: 320px; margin-left: auto; margin-right: auto;&quot;&gt;
&lt;a class=&quot;link-card&quot; href=&quot;https://en.wikipedia.org/wiki/Codd%27s_12_rules&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
&lt;div class=&quot;link-card-image&quot; style=&quot;background-color: #f8f9fa; position: relative;&quot;&gt;
&lt;div style=&quot;position: absolute; inset: 0; padding: 10px; overflow: hidden; font-size: 0.7rem; line-height: 1.35; color: #202122; font-family: Georgia, serif;&quot;&gt;
&lt;strong&gt;Rule 0 (Foundation).&lt;/strong&gt; The system must manage databases entirely through its relational capabilities.&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;Rule 1 (Information).&lt;/strong&gt; All information is represented explicitly by values in tables.&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;Rule 2 (Guaranteed access).&lt;/strong&gt; Every datum is accessible by table name, primary key value, and column name.&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;Rule 3 (Nulls).&lt;/strong&gt; Null values are supported for missing or inapplicable information.
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;link-card-body&quot;&gt;
&lt;span class=&quot;link-card-title&quot;&gt;Codd&apos;s 12 rules&lt;/span&gt;
&lt;span class=&quot;link-card-domain&quot;&gt;en.wikipedia.org&lt;/span&gt;
&lt;/div&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;p class=&quot;image-caption&quot;&gt;Codd&apos;s 12 rules, often called the &quot;twelve commandments,&quot; define what makes a database management system fully relational. &lt;a href=&quot;https://en.wikipedia.org/wiki/Codd%27s_12_rules&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;en.wikipedia.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Codd grounded the model in set theory and predicate logic. Relational algebra and calculus were equivalent, so systems could accept declarative queries and automatically generate efficient procedural execution plans.&lt;/p&gt;

&lt;p&gt;SQL &lt;a href=&quot;#ref-Cha74&quot; id=&quot;ref-Cha74-back&quot;&gt;[Cha74]&lt;/a&gt; provided the practical implementation of these ideas. This single statement replaced hundreds of lines of code. The database optimizer analyzed the query and generated an execution plan.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customer_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SUM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order_amount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customers&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;orders&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;orders&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;CA&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;orders&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;2026-01-01&apos;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customer_name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;HAVING&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SUM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order_amount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Later, relational databases made ACID transactions &lt;a href=&quot;#ref-HR83&quot; id=&quot;ref-HR83-back&quot;&gt;[HR83]&lt;/a&gt; (atomicity, consistency, isolation, durability) standard. Applications could focus on business logic rather than implementing concurrency control and crash recovery.&lt;/p&gt;

&lt;p&gt;The crucial innovation was separating logical organization from physical storage. Users worked with tables conceptually. The database system decided how to store them, what indexes to maintain, and how to organize bytes. Changing storage layout did not require modifying applications. That data independence made possible a single, shared source of truth. Multiple applications and users could read and update the same data with consistent results. These are the systems we take for granted today, from banking and reservations to inventory and ERP, where many programs depend on the same records.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Cha74&quot; href=&quot;#ref-Cha74-back&quot;&gt;[Cha74]&lt;/a&gt; Chamberlin, D. D. &amp;amp; Boyce, R. F. 1974. &quot;SEQUEL: A Structured English Query Language.&quot; &lt;em&gt;Proceedings of ACM SIGFIDET Workshop on Data Description, Access and Control&lt;/em&gt;, 249-264. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/800296.811515&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Cod70&quot; href=&quot;#ref-Cod70-back&quot;&gt;[Cod70]&lt;/a&gt; Codd, E. F. 1970. &quot;A Relational Model of Data for Large Shared Data Banks.&quot; &lt;em&gt;Communications of the ACM&lt;/em&gt; 13(6):377-387. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/362384.362685&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-HR83&quot; href=&quot;#ref-HR83-back&quot;&gt;[HR83]&lt;/a&gt; Härder, T. &amp;amp; Reuter, A. 1983. &quot;Principles of Transaction-Oriented Database Recovery.&quot; &lt;em&gt;ACM Computing Surveys&lt;/em&gt; 15(4):287-315. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/289.291&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;unix-1971&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1971. Unix establishes the operating system as a portable hardware abstraction layer&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Before Unix, operating systems were tightly coupled to their hardware. Software written for an IBM mainframe could not run on a DEC minicomputer. The problem was not just different instruction sets. The ways programs interacted with the system (file I/O, process creation, inter-process communication), together with device drivers, memory management, and scheduling, were all machine-specific. Moving software to new hardware meant rewriting it for a new operating environment, not just recompiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Thompson and Ritchie built Unix at Bell Labs between 1969 and 1971 &lt;a href=&quot;#ref-RT74&quot; id=&quot;ref-RT74-back&quot;&gt;[RT74]&lt;/a&gt;. Unix exposed a single interface instead of vendor-specific system calls. Programs ran as processes. The same read and write operations applied to files on disk, terminals, and devices. “Everything is a file” meant that one abstraction covered all I/O. The kernel implemented the interface in privileged mode and mediated access to hardware. Programs invoked it through system calls, so the kernel hid the details of any particular device.&lt;/p&gt;

&lt;p&gt;That interface had two consequences. Software written to it could run on any machine running Unix, so organizations could change hardware without abandoning software. The abstraction also allowed an interpreted shell. Thompson added one that read command sequences from the terminal or from scripts and executed them. The shell turns command strings into system calls the same as any other process. Programmers could orchestrate tools with a short script instead of compiled software. For example, the following runs two compressions at once. A trailing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;&lt;/code&gt; runs a command in the background so the next one starts right away. The shell’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wait&lt;/code&gt; pauses until all background commands finish.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;gzip &lt;/span&gt;file1.log &amp;amp; &lt;span class=&quot;nb&quot;&gt;gzip &lt;/span&gt;file2.log &amp;amp; &lt;span class=&quot;nb&quot;&gt;wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Software portability and orchestration efficiency gave organizations incentive to adopt Unix and to port the kernel to new architectures. Linux, BSD, and macOS implemented the same interface and became the principal Unix-like systems. The shell established scripted tool chains as the standard approach to orchestration. Unix-like systems dominate servers, mobile devices, and cloud infrastructure.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-RT74&quot; href=&quot;#ref-RT74-back&quot;&gt;[RT74]&lt;/a&gt; Ritchie, D. M. &amp;amp; Thompson, K. 1974. &quot;The UNIX Time-Sharing System.&quot; &lt;em&gt;Communications of the ACM&lt;/em&gt; 17(7):365-375. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/361011.361061&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;c-1973&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1973. C makes systems software like Unix portable across different computer architectures&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Unix had made application software portable across machines that ran Unix. However, systems software such as kernels, device drivers, and system utilities was not portable. It was written in assembly for performance via control over memory layout, interrupts, and registers. No high-level language had shown it could match assembly for that workload. Different machines (IBM, DEC, CDC, and others) came with different instruction sets and architectures. Assembly code for one did not run on another. Porting a Unix kernel or systems stack meant a full rewrite in that machine’s assembly. Thus portability and performance seemed to be in conflict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Dennis Ritchie developed C between 1969 and 1973 at Bell Labs to achieve portability without sacrificing performance &lt;a href=&quot;#ref-Rit93&quot; id=&quot;ref-Rit93-back&quot;&gt;[Rit93]&lt;/a&gt;. C provided pointers and low-level operations while abstracting machine-specific details. Data types such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;char&lt;/code&gt; had no fixed size, so compilers could map them to each architecture. Programmers confined machine-dependent code to a small amount of assembly or conditional code. The rest was portable C. The same source could be compiled for different CPUs with minimal changes.&lt;/p&gt;

&lt;p&gt;A program written to the Unix interface could run on any machine that ran Unix. But porting Unix to a new machine meant rewriting the kernel and utilities in that machine’s assembly. C provided a portable language for systems programming. In 1973 Thompson and Ritchie rewrote Unix in C &lt;a href=&quot;#ref-RT74&quot;&gt;[RT74]&lt;/a&gt;. Porting the Unix kernel to a new architecture was as easy as recompiling the C source and adapting a small amount of assembly, not rewriting the entire system.&lt;/p&gt;

&lt;p&gt;Unix had made programs portable across machines that ran Unix. C made Unix portable across architectures, and made systems software written in C portable by recompilation. C performance stayed close to assembly. It became and remains the standard language for operating systems, databases, and network stacks.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Rit93&quot; href=&quot;#ref-Rit93-back&quot;&gt;[Rit93]&lt;/a&gt; Ritchie, D. M. 1993. &quot;The Development of the C Language.&quot; &lt;em&gt;ACM SIGPLAN Conference on History of Programming Languages&lt;/em&gt;, 201-208. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/155360.155580&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;oop-1970s&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1970s–1980s. Object-oriented programming enforces encapsulation to manage complexity&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; By the late 1970s, software systems had grown to hundreds of thousands of lines. Procedural programs organized code as functions operating on global or parameter-passed data. That structure worked for small programs but failed at scale. The central issue was that data structures lived as global variables or parameters, and any function could read or modify their internals. Verifying that an invariant held, such as that account balances never went negative, required checking every function that touched the relevant data. Changing the representation of a type, such as dates, forced updates across every function that used it. With all functions in a single namespace, naming conflicts and unintended coupling were common. Architectural boundaries existed only by convention. Programmers under pressure could bypass them, and large systems tended to degrade.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Object-oriented languages such as C++ &lt;a href=&quot;#ref-Str85&quot; id=&quot;ref-Str85-back&quot;&gt;[Str85]&lt;/a&gt;, Smalltalk, and Java addressed the encapsulation problem by making boundaries enforceable in the type system. Classes bundled data with the operations that could act on that data. Callers could use only the exposed methods. Invariants could be enforced in one place instead of by auditing every function. Inheritance and polymorphism supported reuse and abstraction without breaking encapsulation. Design patterns &lt;a href=&quot;#ref-GHJV94&quot; id=&quot;ref-GHJV94-back&quot;&gt;[GHJV94]&lt;/a&gt; codified recurring designs. Before OOP, keeping a large system coherent depended on every programmer respecting boundaries by discipline. After, the language enforced those boundaries. Teams could own classes, change internals without breaking callers, and build systems that could grow to millions of lines without the same collapse into unmaintainability. OOP became and remains the dominant basis for enterprise and systems software.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-GHJV94&quot; href=&quot;#ref-GHJV94-back&quot;&gt;[GHJV94]&lt;/a&gt; Gamma, E., Helm, R., Johnson, R., &amp;amp; Vlissides, J. 1994. &lt;em&gt;Design Patterns: Elements of Reusable Object-Oriented Software&lt;/em&gt;. Addison-Wesley.&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Str85&quot; href=&quot;#ref-Str85-back&quot;&gt;[Str85]&lt;/a&gt; Stroustrup, B. 1985. &lt;em&gt;The C++ Programming Language&lt;/em&gt;. Addison-Wesley.&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;internet-and-web&quot; class=&quot;era-heading&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Internet and Web&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;This era saw the Internet become a common foundation and the Web the primary way software reached users.&lt;/p&gt;

&lt;h2 id=&quot;tcpip-1983&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1983. TCP/IP makes the Internet a universal network layer&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Through the 1970s, computer networks had proliferated in isolation. ARPANET used its Network Control Protocol (NCP). The Xerox PARC Ethernet had different conventions. Packet radio networks, satellite networks, and local area networks each had distinct protocols for addressing, routing, and reliability. Interconnecting them required understanding each network’s quirks. A program written for one could not simply talk to another. Programmers building distributed systems had to implement compatibility layers or choose a single network and accept its limitations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt; Vint Cerf and Bob Kahn had laid the theoretical foundation in 1974 with “A Protocol for Packet Network Intercommunication” &lt;a href=&quot;#ref-CK74&quot; id=&quot;ref-CK74-back&quot;&gt;[CK74]&lt;/a&gt;, which described how to interconnect dissimilar networks through gateways. The protocol split into two layers. IP (Internet Protocol) handled addressing and routing packets across networks, and TCP (Transmission Control Protocol) handled reliable, ordered delivery on top. The design was deliberately minimal. Networks kept their internal structure, and the Internet layer handled only what was necessary to pass packets between them.&lt;/p&gt;

&lt;p&gt;On January 1, 1983, ARPANET completed the transition from NCP to TCP/IP &lt;a href=&quot;#ref-Pos81&quot; id=&quot;ref-Pos81-back&quot;&gt;[Pos81]&lt;/a&gt;. Every host on the network switched to the new protocol. The “flag day” created a single, interoperable network.&lt;/p&gt;

&lt;p&gt;The abstraction was profound. Programmers no longer needed to understand packet switching, routing algorithms, or the differences between Ethernet and satellite links. They wrote to sockets, a simple API for sending and receiving byte streams, and the network handled the rest. TCP guaranteed delivery and ordering. IP handled addressing across any connected network. Applications could be built once and run anywhere the Internet reached.&lt;/p&gt;

&lt;p&gt;File transfer (FTP/SFTP), email (SMTP), naming (DNS), the Web (HTTP), and every subsequent Internet application built on this foundation. TCP/IP eliminated the need to understand the network layer.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-CK74&quot; href=&quot;#ref-CK74-back&quot;&gt;[CK74]&lt;/a&gt; Cerf, V. G. &amp;amp; Kahn, R. E. 1974. &quot;A Protocol for Packet Network Intercommunication.&quot; &lt;em&gt;IEEE Transactions on Communications&lt;/em&gt; 22(5):637-648. Available at &lt;a href=&quot;https://ieeexplore.ieee.org/document/1092259&quot; target=&quot;_blank&quot;&gt;ieeexplore.ieee.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Pos81&quot; href=&quot;#ref-Pos81-back&quot;&gt;[Pos81]&lt;/a&gt; Postel, J. 1981. &quot;NCP/TCP Transition Plan.&quot; RFC 801. Available at &lt;a href=&quot;https://www.rfc-editor.org/rfc/rfc801&quot; target=&quot;_blank&quot;&gt;rfc-editor.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;web-1989&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1989–1993. The World Wide Web enables universal software distribution through browsers&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; In the 1980s, getting software to people was a logistics problem. Applications like WordPerfect and Lotus 1-2-3 were sold in boxes of floppy disks, each compiled for a specific operating system. A program for MS-DOS would not run on Mac OS or Unix. Updates required mailing new physical media to every user. Business applications accessed by multiple users followed a client-server model that required installing and maintaining software on every client machine independently. Distribution was slow, updates were painful, and every new operating system meant recompiling and repackaging from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Tim Berners-Lee proposed the Web at CERN in 1989 &lt;a href=&quot;#ref-BL89&quot; id=&quot;ref-BL89-back&quot;&gt;[BL89]&lt;/a&gt; as a way to share documents and files across the Internet. Its original motivation had nothing to do with software delivery. CERN’s scientific documentation was scattered across hundreds of incompatible computers. Researchers spent significant time just locating information that existed somewhere on the network. Berners-Lee wanted to link documents through hypertext so people could navigate between them without knowing where they were physically stored. By 1990, he had built the first HTTP server, the first browser, and defined HTML and URLs. Hostnames in URLs were resolved by DNS, the same naming layer the rest of the Internet already used. CERN released the protocol royalty-free in 1993 &lt;a href=&quot;#ref-CERN93&quot; id=&quot;ref-CERN93-back&quot;&gt;[CERN93]&lt;/a&gt;.&lt;/p&gt;

&lt;figure style=&quot;max-width: 400px; margin-left: auto; margin-right: auto;&quot;&gt;
&lt;a href=&quot;https://en.wikipedia.org/wiki/Les_Horribles_Cernettes&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;img src=&quot;https://en.wikipedia.org/wiki/Special:FilePath/Les_Horribles_Cernettes_in_1992.jpg?width=400&quot; alt=&quot;Les Horribles Cernettes, 1992&quot; style=&quot;width: 100%; height: auto;&quot; /&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;p class=&quot;image-caption&quot;&gt;The first image on the Web was a band photo at CERN in 1992 (Les Horribles Cernettes). &lt;a href=&quot;https://en.wikipedia.org/wiki/Les_Horribles_Cernettes&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;en.wikipedia.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The insight that made the Web transformative was universality. Any computer with a browser could access any server, regardless of operating system. Marc Andreessen and Eric Bina built Mosaic &lt;a href=&quot;#ref-AB93&quot; id=&quot;ref-AB93-back&quot;&gt;[AB93]&lt;/a&gt; at the National Center for Supercomputing Applications, the first graphical browser. As adoption grew, programmers saw the implication. Applications could be hosted on a server rather than shipped on floppy disks. A single deployment to the server made the new version available to every user, eliminating the need to mail updated media to customers. Initially that meant downloading software from the Web. Netscape Navigator was distributed that way, as were Winamp, RealPlayer, and countless early desktop applications. The pattern persists today. Zoom, VS Code, and most desktop and mobile installers are still distributed by download from a website or app store.&lt;/p&gt;

&lt;p&gt;Making software run inside the browser, not just be downloaded from it, took two more steps. JavaScript, created by Brendan Eich at Netscape, was the next necessary piece. Static HTML pages couldn’t respond to user input without sending a request back to the server and reloading the entire page. JavaScript ran directly in the browser, so a form could validate input before submission, a button could trigger an action, a page could change without disappearing and reappearing. Web pages started feeling less like documents and more like applications. The shift completed with AJAX. Jesse James Garrett named the pattern &lt;a href=&quot;#ref-Gar05&quot; id=&quot;ref-Gar05-back&quot;&gt;[Gar05]&lt;/a&gt; that programmers had already begun using. Applications sent requests in the background and updated only the changed parts of the page instead of reloading. Gmail (2004) proved this worked at scale. An entire productivity application ran in the browser, feeling as responsive as desktop software. The Web had evolved from a tool for sharing scientific documents into the primary platform for delivering software to users.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-AB93&quot; href=&quot;#ref-AB93-back&quot;&gt;[AB93]&lt;/a&gt; Andreessen, M. &amp;amp; Bina, E. 1993. &quot;NCSA Mosaic: A Global Hypermedia System.&quot; &lt;em&gt;Internet Research&lt;/em&gt; 3(1). Available at &lt;a href=&quot;https://www.emerald.com/insight/content/doi/10.1108/10662249410798803/full/html&quot; target=&quot;_blank&quot;&gt;emerald.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-BL89&quot; href=&quot;#ref-BL89-back&quot;&gt;[BL89]&lt;/a&gt; Berners-Lee, T. 1989. &quot;Information Management: A Proposal.&quot; CERN. Available at &lt;a href=&quot;https://www.w3.org/History/1989/proposal.html&quot; target=&quot;_blank&quot;&gt;w3.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-CERN93&quot; href=&quot;#ref-CERN93-back&quot;&gt;[CERN93]&lt;/a&gt; CERN. 1993. &quot;CERN Puts Web into Public Domain.&quot; Available at &lt;a href=&quot;https://home.cern/science/computing/birth-web/licensing-web&quot; target=&quot;_blank&quot;&gt;cern.ch&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Gar05&quot; href=&quot;#ref-Gar05-back&quot;&gt;[Gar05]&lt;/a&gt; Garrett, J. J. 2005. &quot;Ajax: A New Approach to Web Applications.&quot; Available at &lt;a href=&quot;https://designftw.mit.edu/lectures/apis/ajax_adaptive_path.pdf&quot; target=&quot;_blank&quot;&gt;MIT&lt;/a&gt;. See also &lt;a href=&quot;https://jessejamesgarrett.com/2025/02/18/ajax-at-20/&quot; target=&quot;_blank&quot;&gt;Garrett&apos;s 2025 reflection&lt;/a&gt;.&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;python-1991&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1991. Python becomes the default for scripting, automation, and data science&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; The dominant assumption at the time was that runtime performance mattered most. C and FORTRAN optimized for execution speed. The opposite insight was that programmer time is more expensive than machine time. Most code runs once or rarely (scripts, glue code, prototypes). The cost of writing, debugging, and maintaining it dwarfs execution time. A language that made the common case fast to write, even if slow to run, would win.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Guido van Rossum released Python in 1991 &lt;a href=&quot;#ref-Pyt91&quot; id=&quot;ref-Pyt91-back&quot;&gt;[Pyt91]&lt;/a&gt;. Python prioritized readability and ease of use over raw performance. It required no compile step, used clear syntax, and would become “batteries included” as its standard library grew. The crucial design choice was extensibility. When a hot path needed speed, programmers could drop into C. NumPy (2006) demonstrated the pattern. Python for glue code and control flow, C (via extensions) for the numerical inner loops. Programmers got productivity for the 95% of code that wasn’t performance-critical, and C-level speed where it mattered. pandas, Django, TensorFlow, and PyTorch followed the same model. Python became the default for data science, ML, and glue code because it optimized for the right variable (programmer time).&lt;/p&gt;

&lt;div class=&quot;link-cards&quot; style=&quot;grid-template-columns: 1fr; max-width: 320px; margin-left: auto; margin-right: auto;&quot;&gt;
&lt;a class=&quot;link-card&quot; href=&quot;https://peps.python.org/pep-0020/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
&lt;div class=&quot;link-card-image&quot; style=&quot;background-color: #f8f9fa; position: relative;&quot;&gt;
&lt;div style=&quot;position: absolute; inset: 0; padding: 10px; overflow: hidden; font-size: 0.68rem; line-height: 1.35; color: #202122; font-family: Georgia, serif;&quot;&gt;
Flat is better than nested.&lt;br /&gt;
Sparse is better than dense.&lt;br /&gt;
Readability counts.&lt;br /&gt;
Special cases aren&apos;t special enough to break the rules.&lt;br /&gt;
Although practicality beats purity.&lt;br /&gt;
Errors should never pass silently.&lt;br /&gt;
Unless explicitly silenced.&lt;br /&gt;
In the face of ambiguity, refuse the temptation to guess.&lt;br /&gt;
There should be one obvious way to do it.
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;link-card-body&quot;&gt;
&lt;span class=&quot;link-card-title&quot;&gt;PEP 20, Zen of Python&lt;/span&gt;
&lt;span class=&quot;link-card-domain&quot;&gt;peps.python.org&lt;/span&gt;
&lt;/div&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;p class=&quot;image-caption&quot;&gt;Guiding principles for Python design. &lt;a href=&quot;https://peps.python.org/pep-0020/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;peps.python.org&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Pyt91&quot; href=&quot;#ref-Pyt91-back&quot;&gt;[Pyt91]&lt;/a&gt; Python Software Foundation. &quot;History of Python.&quot; Available at &lt;a href=&quot;https://www.python.org/doc/essays/blurb/&quot; target=&quot;_blank&quot;&gt;python.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;stdlib-1994&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1994–1998. Standard algorithm libraries make common algorithms and data structures reusable&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Programmers implementing a sorted collection had to build their own balanced tree or settle for a slower linked list. Those needing O(log n) lookup implemented a red-black tree. Those needing to sort implemented quicksort or merge sort. These implementations were subtle. Off-by-one errors, edge cases with empty collections, and incorrect handling of equal elements were common. Every team duplicated the same work, and bugs in algorithm implementations were hard to detect because the logic was buried in application code.&lt;/p&gt;

&lt;p&gt;Algorithm theory (Big O notation, complexity analysis) had given programmers a vocabulary for reasoning about performance, but it did not eliminate the need to implement. The gap between theory and practice remained.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Alexander Stepanov and Meng Lee developed the Standard Template Library (STL) for C++ at Hewlett-Packard in 1994 &lt;a href=&quot;#ref-Step94&quot; id=&quot;ref-Step94-back&quot;&gt;[Step94]&lt;/a&gt;. The design separated containers, iterators, algorithms, and functors. The key insight was generic programming. Algorithms were written once in terms of iterators and worked with any container. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::sort&lt;/code&gt; worked on a vector, a queue, or a custom container, as long as it provided random-access iterators. HP released the STL freely. It was incorporated into the C++ standard and shipped with every C++ compiler.&lt;/p&gt;

&lt;p&gt;In 1998, Java followed with the Collections Framework in JDK 1.2 &lt;a href=&quot;#ref-Bloch01&quot; id=&quot;ref-Bloch01-back&quot;&gt;[Bloch01]&lt;/a&gt;. List, Set, Map, and their implementations (ArrayList, HashMap, TreeMap) became part of the standard library, with interfaces for sorting, searching, and bulk operations. Like the STL, it provided complexity guarantees. Donald Knuth had established the theoretical basis in &lt;em&gt;The Art of Computer Programming&lt;/em&gt;, from 1968 onwards &lt;a href=&quot;#ref-Knuth68&quot; id=&quot;ref-Knuth68-back&quot;&gt;[Knuth68]&lt;/a&gt;. The STL and Java Collections made those guarantees practical. Programmers chose by need (sorted, O(1) lookup, ordered iteration) and the library supplied a correct implementation. The pattern spread to Python’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;list&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dict&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set&lt;/code&gt;, C#’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;System.Collections&lt;/code&gt;, and Rust’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::collections&lt;/code&gt;. Algorithms and data structures became part of the standard toolkit. Programmers used them without implementing them.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Step94&quot; href=&quot;#ref-Step94-back&quot;&gt;[Step94]&lt;/a&gt; Stepanov, A. &amp;amp; Lee, M. 1994. &quot;The Standard Template Library.&quot; Hewlett-Packard. Available at &lt;a href=&quot;https://www.stepanovpapers.com/Stepanov-The_Standard_Template_Library-1994.pdf&quot; target=&quot;_blank&quot;&gt;stepanovpapers.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Bloch01&quot; href=&quot;#ref-Bloch01-back&quot;&gt;[Bloch01]&lt;/a&gt; Oracle. 1998. &quot;Java Collections Framework.&quot; JDK 1.2. Design by J. Bloch. See &lt;a href=&quot;https://docs.oracle.com/javase/8/docs/technotes/guides/collections/designfaq.html&quot; target=&quot;_blank&quot;&gt;Java Collections Design FAQ&lt;/a&gt;.&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Knuth68&quot; href=&quot;#ref-Knuth68-back&quot;&gt;[Knuth68]&lt;/a&gt; Knuth, D. E. 1968. &lt;em&gt;The Art of Computer Programming, Volume 1. Fundamental Algorithms&lt;/em&gt;. Addison-Wesley. Subsequent volumes from 1969 onwards.&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;gc-1995&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1995. Garbage collection makes entire categories of memory errors impossible&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Through the early 1990s, most commercial software was written in C and C++ requiring manual memory management. Programmers explicitly allocated memory with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc()&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new&lt;/code&gt; and deallocated with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;free()&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete&lt;/code&gt;. Getting the pairing wrong led to memory leaks, use-after-free, double-free, and corruption. Memory leaks consumed all available memory in long-running programs. Use-after-free errors occurred when code freed memory but later accessed it through a dangling pointer, often causing data corruption. Double-free errors corrupted the allocator’s internal structures. These bugs were insidious because they might not manifest during testing but caused failures only after days of production operation.&lt;/p&gt;

&lt;p&gt;The consequences were real. BlueKeep (2019), a use-after-free in Windows RDP, let attackers execute arbitrary code with kernel privileges over the network without authentication. The NSA issued a rare advisory. Microsoft patched even end-of-life systems. Microsoft estimated 70% of their 2006–2018 security vulnerabilities were memory-safety issues &lt;a href=&quot;#ref-MSRC19&quot; id=&quot;ref-MSRC19-back&quot;&gt;[MSRC19]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Java, released in 1995 &lt;a href=&quot;#ref-Gos96&quot; id=&quot;ref-Gos96-back&quot;&gt;[Gos96]&lt;/a&gt;, popularized garbage collection for mainstream commercial development. Java’s innovation was demonstrating that automatic memory management was practical despite performance overhead.&lt;/p&gt;

&lt;p&gt;Java eliminated manual deallocation. Programmers allocated objects with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new&lt;/code&gt; but never freed them. The garbage collector, a background process that ran as part of the Java runtime during program execution, periodically scanned the heap to identify objects that were no longer reachable from any live variable or reference, and reclaimed their memory. Because this happened automatically at runtime, programmers could not introduce use-after-free or double-free bugs. The entire class of dangling pointer errors disappeared by construction.&lt;/p&gt;

&lt;p&gt;The idea of automatic memory reclamation was not new. John McCarthy’s Lisp in 1960 &lt;a href=&quot;#ref-McC60&quot; id=&quot;ref-McC60-back&quot;&gt;[McC60]&lt;/a&gt; included the first garbage collector. But early collectors were too slow for commercial systems. The breakthrough was generational collection, developed in the 1980s &lt;a href=&quot;#ref-Lie83&quot; id=&quot;ref-Lie83-back&quot;&gt;[Lie83]&lt;/a&gt;. Most objects die young, so collecting younger generations frequently and older ones rarely reduced overhead enough that GC became viable for production. That made Java’s approach credible when it launched.&lt;/p&gt;

&lt;p&gt;The performance overhead was non-zero but acceptable. Early collectors had pause times of seconds. Improvements reduced pauses to milliseconds for typical applications. An entire category of serious bugs disappeared. Garbage collection also simplified concurrent programming. Threads could share object references without complex deallocation coordination.&lt;/p&gt;

&lt;p&gt;Following Java’s success, garbage collection became standard in new languages. C#, Go, JavaScript, Python, and Ruby all adopted it. Rust (2015) took a different approach. Its ownership and borrowing model, a zero-cost abstraction, enforces memory safety at compile time with no runtime overhead or GC pauses &lt;a href=&quot;#ref-Jun18&quot; id=&quot;ref-Jun18-back&quot;&gt;[Jun18]&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Gos96&quot; href=&quot;#ref-Gos96-back&quot;&gt;[Gos96]&lt;/a&gt; Gosling, J., Joy, B., &amp;amp; Steele, G. 1996. &lt;em&gt;The Java Language Specification&lt;/em&gt;. Addison-Wesley. Available at &lt;a href=&quot;https://docs.oracle.com/javase/specs/&quot; target=&quot;_blank&quot;&gt;docs.oracle.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Jun18&quot; href=&quot;#ref-Jun18-back&quot;&gt;[Jun18]&lt;/a&gt; Jung, R., et al. 2018. &quot;RustBelt: Securing the Foundations of the Rust Programming Language.&quot; &lt;em&gt;Proceedings of POPL&lt;/em&gt;, 66:1-66:34. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/3158154&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Lie83&quot; href=&quot;#ref-Lie83-back&quot;&gt;[Lie83]&lt;/a&gt; Lieberman, H. &amp;amp; Hewitt, C. 1983. &quot;A Real-Time Garbage Collector Based on the Lifetimes of Objects.&quot; &lt;em&gt;Communications of the ACM&lt;/em&gt; 26(6):419-429. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/358141.358147&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-McC60&quot; href=&quot;#ref-McC60-back&quot;&gt;[McC60]&lt;/a&gt; McCarthy, J. 1960. &quot;Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I.&quot; &lt;em&gt;Communications of the ACM&lt;/em&gt; 3(4):184-195. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/367177.367199&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-MSRC19&quot; href=&quot;#ref-MSRC19-back&quot;&gt;[MSRC19]&lt;/a&gt; Microsoft Security Response Center. 2019. &quot;A Proactive Approach to More Secure Code.&quot; Available at &lt;a href=&quot;https://msrc-blog.microsoft.com/2019/07/16/a-proactive-approach-to-more-secure-code/&quot; target=&quot;_blank&quot;&gt;msrc-blog.microsoft.com&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;package-managers-1995&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1995–2010. Package managers make dependency management automatic&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Before package managers, reusing code meant finding it, downloading it, manually placing it in your project, and ensuring it worked with other dependencies. Version conflicts were discovered only when builds failed. There was no central registry, no automated resolution. Dependency management was manual and error-prone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; CPAN, launched in 1995 for Perl, established the pattern. It provided a central archive, a standard layout for how modules were packaged, and a tool that installed modules and dependencies with a single command. Maven, released in 2004, brought the same approach to Java with declarative &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pom.xml&lt;/code&gt; files. npm, launched in 2010, did the same for Node and became the largest package ecosystem in history. The abstraction was declarative. Programmers specified what they needed, not how to get it. The same pattern spread to Python (pip), Ruby (RubyGems), and virtually every language. Dependency management became part of the standard toolkit.&lt;/p&gt;

&lt;h2 id=&quot;opensource-1998&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;1998. Open source makes collaborative, publicly developed software the default&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Through the 1990s, most software was proprietary. Companies kept source code secret to protect their competitive advantage. Reusing code meant licensing it or rewriting it. Richard Stallman had founded the free software movement (GNU, GPL) and framed it as “free as in free speech, not free beer.” That established legal and ethical foundations, but “free” carried political baggage that made businesses hesitant. Linux and Apache had proven that open collaboration could produce production-grade software, yet there was no neutral term that invited broad adoption. Programmers who wanted to share code faced a fragmented landscape of licenses and ideologies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; In 1998, Christine Peterson coined the term “open source,” and Bruce Perens and Eric S. Raymond founded the Open Source Initiative (OSI) &lt;a href=&quot;#ref-OSI98&quot; id=&quot;ref-OSI98-back&quot;&gt;[OSI98]&lt;/a&gt;. The shift from “free software” to “open source” was deliberate. It emphasized practical benefits (peer review, faster iteration, no vendor lock-in) over the freedom-first stance that Stallman had championed. The OSI defined criteria for open source licenses and certified them. Apache, MIT, and GPL became mainstream choices rather than ideological statements.&lt;/p&gt;

&lt;p&gt;The model proved itself. Linux, Apache, MySQL, PHP (LAMP) powered the early web. Firefox challenged Internet Explorer. Android was built on Linux. Companies from Google to Microsoft adopted open source as strategy. GitHub (2008) reduced contribution friction (fork, change, pull request). By the 2010s, open source was the default for infrastructure, frameworks, and tools. Programmers no longer needed to build or buy everything. They could adopt, adapt, and contribute back. The abstraction was organizational. The efficiency gain was communal. The best software in the world was built collaboratively, in public, by anyone who cared to participate.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-OSI98&quot; href=&quot;#ref-OSI98-back&quot;&gt;[OSI98]&lt;/a&gt; Open Source Initiative. &quot;History of the OSI.&quot; Available at &lt;a href=&quot;https://opensource.org/history&quot; target=&quot;_blank&quot;&gt;opensource.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;rest-2000&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2000. REST APIs standardize how web services communicate&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Business-to-business commerce and supply-chain integration required one organization’s applications to talk to another’s over the Internet. Existing distributed computing (CORBA, DCOM, Java RMI) was vendor-specific, complex, and did not work across organizational boundaries. HTTP could carry requests and XML could encode data. What was missing was an agreed way to structure a call. How should one program ask another for a record or submit an update? Without a standard, every service invented its own. SOAP (Microsoft, IBM, W3C 2000) proposed one approach. It was heavyweight (XML envelopes, WSDL, WS-*). No lightweight alternative existed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Roy Fielding &lt;a href=&quot;#ref-Fie00&quot; id=&quot;ref-Fie00-back&quot;&gt;[Fie00]&lt;/a&gt;, a co-author of HTTP/1.1, had helped design the Web’s protocol. In his 2000 doctoral dissertation he named and formalized the architectural style already present in the Web. Resources were identified by URLs. The interface was the HTTP methods (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GET&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PUT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt;). Requests were stateless. He called this style REST. He was not inventing a new protocol. He was documenting what had made the Web scale. That gave programmers a clear model for designing application programming interfaces (APIs). To get a customer, use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GET&lt;/code&gt; on a URL. To create one, use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST&lt;/code&gt;. No XML envelope, no WSDL. When JSON replaced XML as the preferred format, REST with JSON was easy to use and to test in a browser. Public APIs that let external developers access a site’s data (Amazon, Google, and others) adopted REST. By the 2010s, REST had become the default way machines talked to each other on the Web.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Fie00&quot; href=&quot;#ref-Fie00-back&quot;&gt;[Fie00]&lt;/a&gt; Fielding, R. T. 2000. &quot;Architectural Styles and the Design of Network-based Software Architectures.&quot; Doctoral dissertation, University of California, Irvine. Available at &lt;a href=&quot;https://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm&quot; target=&quot;_blank&quot;&gt;ics.uci.edu&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;ides-2001&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2001. IDEs automate the mechanical scaffolding of programming, an early step toward code generation&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; As Java and enterprise applications grew into systems of hundreds of thousands of lines, the mechanical overhead of programming became a significant drag on productivity. Adding a new method to a class meant manually hunting through dozens of files to find every call site that needed updating. Renaming a class required running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grep&lt;/code&gt; across the entire codebase and editing each hit by hand. Finding where a method was actually defined meant navigating through directories of source files. These tasks required no deep thought. They were purely mechanical, but they consumed hours each day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; IntelliJ IDEA (2001) &lt;a href=&quot;#ref-Jet01&quot; id=&quot;ref-Jet01-back&quot;&gt;[Jet01]&lt;/a&gt; and Eclipse (2001, first release 2004) &lt;a href=&quot;#ref-Ecl01&quot; id=&quot;ref-Ecl01-back&quot;&gt;[Ecl01]&lt;/a&gt; represented a generational leap in development tools. They parsed entire codebases and built an internal model of every class, method, and reference. This let them provide intelligent code completion. As a programmer typed, the IDE suggested valid method names and parameter types. Automated refactoring made operations like renaming a class or extracting a method into a single action that propagated correctly across the entire project. Integrated debugging let programmers step through code without leaving the editor. Visual Studio provided similar capabilities for C# and .NET development. Design patterns (Gamma et al., 1994) &lt;a href=&quot;#ref-GHJV94&quot;&gt;[GHJV94]&lt;/a&gt; had codified common OOP solutions. Refactoring (1999) &lt;a href=&quot;#ref-Fow99&quot; id=&quot;ref-Fow99-back&quot;&gt;[Fow99]&lt;/a&gt;, JUnit (1997) &lt;a href=&quot;#ref-Jun97&quot; id=&quot;ref-Jun97-back&quot;&gt;[Jun97]&lt;/a&gt;, and test-driven development (TDD) made restructuring and automated testing mainstream practices.&lt;/p&gt;

&lt;p&gt;The productivity gains were substantial. Operations that previously required manual searching and editing across a codebase became instantaneous. By the mid-2010s, IDEs had become so essential that programmers who worked without one felt as disadvantaged as those without version control.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Ecl01&quot; href=&quot;#ref-Ecl01-back&quot;&gt;[Ecl01]&lt;/a&gt; Eclipse. 2001. &quot;Eclipse IDE.&quot; Available at &lt;a href=&quot;https://www.eclipse.org/&quot; target=&quot;_blank&quot;&gt;eclipse.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Fow99&quot; href=&quot;#ref-Fow99-back&quot;&gt;[Fow99]&lt;/a&gt; Fowler, M., Beck, K., Brant, J., Opdyke, W., &amp;amp; Roberts, D. 1999. &lt;em&gt;Refactoring: Improving the Design of Existing Code&lt;/em&gt;. Addison-Wesley. Available at &lt;a href=&quot;https://martinfowler.com/books/refactoring.html&quot; target=&quot;_blank&quot;&gt;martinfowler.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Jet01&quot; href=&quot;#ref-Jet01-back&quot;&gt;[Jet01]&lt;/a&gt; JetBrains. 2001. &quot;IntelliJ IDEA.&quot; Available at &lt;a href=&quot;https://www.jetbrains.com/idea/&quot; target=&quot;_blank&quot;&gt;jetbrains.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Jun97&quot; href=&quot;#ref-Jun97-back&quot;&gt;[Jun97]&lt;/a&gt; Beck, K. &amp;amp; Gamma, E. 1997. JUnit. Available at &lt;a href=&quot;https://junit.org&quot; target=&quot;_blank&quot;&gt;junit.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;di-2002&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2002. Dependency injection frees enterprise programmers from framework boilerplate&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Java 2 Enterprise Edition (J2EE) and Enterprise JavaBeans (EJBs) were the standard platform for enterprise Java in the early 2000s. J2EE was the platform. EJBs were the component model for server-side business logic, objects that ran in a container and handled transactions and persistence. In practice it required extensive XML, deployment descriptors, and boilerplate just to wire objects together. A simple database service might need dozens of config files and hundreds of lines of scaffolding. Objects created their own dependencies. Testing and swapping implementations required rewriting wiring code throughout. Programmer time went to infrastructure, not features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Spring (2003) &lt;a href=&quot;#ref-Spr03&quot; id=&quot;ref-Spr03-back&quot;&gt;[Spr03]&lt;/a&gt; replaced J2EE’s heavy wiring with dependency injection. Rod Johnson’s 2002 book &lt;a href=&quot;#ref-Joh02&quot; id=&quot;ref-Joh02-back&quot;&gt;[Joh02]&lt;/a&gt; had argued that Plain Old Java Objects (POJOs) and a lightweight container could replace EJBs, and Spring implemented that idea. Instead of objects creating their own dependencies, a container created and injected them, so a class that needed a database connection could simply declare the dependency and Spring would provide it. Testing became straightforward because tests could inject mocks. Wiring became explicit and centralized rather than scattered throughout the codebase.&lt;/p&gt;

&lt;p&gt;J2EE had established the enterprise Java market but created the complexity Spring addressed. Practitioner-built frameworks could outcompete committee-designed standards. Spring Boot (2014) took the next step by providing sensible defaults and auto-configuration, so programmers could start a Spring application with minimal or no config files. By the mid-2010s, dependency injection had become standard across languages and frameworks.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Joh02&quot; href=&quot;#ref-Joh02-back&quot;&gt;[Joh02]&lt;/a&gt; Johnson, R. 2002. &lt;em&gt;Expert One-on-One J2EE Design and Development&lt;/em&gt;. Wrox. Available at &lt;a href=&quot;https://www.wiley.com/en-us/Expert+One+on+One+J2EE+Design+and+Development-p-9780764543852&quot; target=&quot;_blank&quot;&gt;wiley.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Spr03&quot; href=&quot;#ref-Spr03-back&quot;&gt;[Spr03]&lt;/a&gt; Spring. 2003. &quot;Spring Framework.&quot; Available at &lt;a href=&quot;https://spring.io&quot; target=&quot;_blank&quot;&gt;spring.io&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;cloud-and-infrastructure&quot; class=&quot;era-heading&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Cloud and infrastructure&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;This era saw cloud computing, mobile, big data, and the commoditization of previously specialized infrastructure.&lt;/p&gt;

&lt;h2 id=&quot;mapreduce-2004&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2004–2009. MapReduce and Hadoop make processing massive datasets accessible&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; By the early 2000s, companies like Google were crawling and indexing billions of web pages. The sheer volume of data dwarfed what any single machine could store or process. Google solved this internally by building the Google File System (GFS) in 2003 &lt;a href=&quot;#ref-GGL03&quot; id=&quot;ref-GGL03-back&quot;&gt;[GGL03]&lt;/a&gt;, a distributed file system that spread data across hundreds or thousands of commodity servers, and MapReduce in 2004 &lt;a href=&quot;#ref-DG04&quot; id=&quot;ref-DG04-back&quot;&gt;[DG04]&lt;/a&gt;. MapReduce was a programming model that let programmers express massively parallel computation in a simple way. A Map function processed individual records and a Reduce function aggregated results. The framework handled distributing work, shuffling data, and recovering from failures. Google published papers but did not release the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Doug Cutting and Mike Cafarella had started Nutch, an open-source web crawler, in 2002. When Google’s GFS and MapReduce papers appeared, they implemented the techniques in Nutch but needed institutional backing. Yahoo hired Cutting in 2006 to build distributed data processing for its search engine. He extracted the distributed file system and MapReduce implementation from Nutch into a new project, Hadoop. Hadoop comprised HDFS (the file system) and MapReduce (the processing framework). The same name, MapReduce, was intentional. It implemented the same model from the Google papers. Yahoo dedicated a large team to developing it. By 2007, Yahoo was running Hadoop on a 1,000-node cluster.&lt;/p&gt;

&lt;p&gt;Yahoo open-sourced its Hadoop work in 2009, ran Hadoop at scale, and adoption followed quickly. Facebook ran Hadoop and built Presto for interactive SQL, Twitter built Scalding (a Scala API on Cascading and Hadoop MapReduce), and LinkedIn built Kafka for event streaming &lt;a href=&quot;#ref-KNR11&quot; id=&quot;ref-KNR11-back&quot;&gt;[KNR11]&lt;/a&gt;. eBay and others adopted the ecosystem, and eventually more than half of the Fortune 500 ran big data pipelines on open-source tools. Spark emerged in 2009 &lt;a href=&quot;#ref-Zah10&quot; id=&quot;ref-Zah10-back&quot;&gt;[Zah10]&lt;/a&gt; as an alternative to MapReduce that kept intermediate data in memory rather than writing it to disk, improving performance for iterative workloads while requiring more memory. Kafka became the de facto standard for event streaming, Flink (2014) &lt;a href=&quot;#ref-Car15&quot; id=&quot;ref-Car15-back&quot;&gt;[Car15]&lt;/a&gt; offered true stream processing at lower latency than Spark’s micro-batch model, and Tez optimized batch DAGs for Hadoop. Big data processing went from out of reach to something any company could run.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Car15&quot; href=&quot;#ref-Car15-back&quot;&gt;[Car15]&lt;/a&gt; Carbone, P., et al. 2015. &quot;Apache Flink: Stream and Batch Processing in a Single Engine.&quot; &lt;em&gt;IEEE Data Engineering Bulletin&lt;/em&gt; 36(4):28-38. Available at &lt;a href=&quot;https://arxiv.org/abs/1506.08603&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-DG04&quot; href=&quot;#ref-DG04-back&quot;&gt;[DG04]&lt;/a&gt; Dean, J. &amp;amp; Ghemawat, S. 2004. &quot;MapReduce: Simplified Data Processing on Large Clusters.&quot; &lt;em&gt;Proceedings of OSDI&lt;/em&gt;, 137-150. Available at &lt;a href=&quot;https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean_html/&quot; target=&quot;_blank&quot;&gt;usenix.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-GGL03&quot; href=&quot;#ref-GGL03-back&quot;&gt;[GGL03]&lt;/a&gt; Ghemawat, S., Gobioff, H., &amp;amp; Leung, S.-T. 2003. &quot;The Google File System.&quot; &lt;em&gt;Proceedings of SOSP&lt;/em&gt;, 29-43. Available at &lt;a href=&quot;https://research.google/pubs/the-google-file-system/&quot; target=&quot;_blank&quot;&gt;research.google&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-KNR11&quot; href=&quot;#ref-KNR11-back&quot;&gt;[KNR11]&lt;/a&gt; Kreps, J., Narkhede, N., &amp;amp; Rao, J. 2011. &quot;Kafka: A Distributed Messaging System for Log Processing.&quot; &lt;em&gt;Proceedings of NetDB&lt;/em&gt;. Available at &lt;a href=&quot;https://engineering.linkedin.com/27/project-kafka-distributed-publish-subscribe-messaging-system-reaches-v06&quot; target=&quot;_blank&quot;&gt;engineering.linkedin.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Zah10&quot; href=&quot;#ref-Zah10-back&quot;&gt;[Zah10]&lt;/a&gt; Zaharia, M., et al. 2010. &quot;Spark: Cluster Computing with Working Sets.&quot; &lt;em&gt;Proceedings of HotCloud&lt;/em&gt;. Available at &lt;a href=&quot;https://www.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf&quot; target=&quot;_blank&quot;&gt;usenix.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;git-2005&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2005. Git enables distributed collaboration at global scale&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; For the first decade of Linux kernel development (1991–2002), there was no formal version control at all. Contributors emailed patches to mailing lists, and Linus Torvalds manually applied them to his own source tree before cutting releases. This worked when the project was small, but Linux had grown into the most important open-source project in the world, with thousands of contributors. The manual process became a serious bottleneck.&lt;/p&gt;

&lt;p&gt;In 2002, Torvalds adopted BitKeeper, a proprietary distributed system that was far ahead of CVS or Subversion. In early 2005, BitMover revoked the free license and the kernel community lost its version control overnight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Torvalds had spent months considering what kernel development required. CVS and Subversion were centralized, which made cheap branching and offline work impossible, and no open-source alternative was mature. He began writing Git on April 3, 2005, and had a working system within roughly 10 days. The design was fully distributed. Every clone contained the complete repository history, which allowed programmers to commit, branch, and merge locally without network access. Branching became a lightweight operation, a pointer to a commit that made it essentially free. The Linux kernel 2.6.12 release in June 2005 was the first managed entirely by Git.&lt;/p&gt;

&lt;p&gt;Git-based workflows later enabled continuous integration and deployment. Jenkins (2011) and Travis CI (2011) automated testing and deployment pipelines. Programmers pushed code to Git repositories, triggering automated builds, tests, and deployments. GitHub launched in 2008 &lt;a href=&quot;#ref-Dab12&quot; id=&quot;ref-Dab12-back&quot;&gt;[Dab12]&lt;/a&gt;, adding pull requests and code review workflows that made open-source collaboration frictionless. The model enabled global collaboration at unprecedented scale. Projects like Linux, with thousands of contributors across continents, could coordinate effectively. This DevOps movement reduced the time between writing code and running it in production from weeks to minutes.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Dab12&quot; href=&quot;#ref-Dab12-back&quot;&gt;[Dab12]&lt;/a&gt; Dabbish, L., et al. 2012. &quot;Social Coding in GitHub.&quot; &lt;em&gt;Proceedings of CSCW&lt;/em&gt;, 1277-1286. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/2145204.2145396&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;cloud-2006&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2006. Cloud platforms transform infrastructure into elastic, pay-per-use resources&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Before 2006, running applications meant purchasing servers, networking equipment, and storage, renting rack space and power in a data center, and hiring system administrators to maintain all of it. For a startup launching a web service, the upfront capital was substantial. Ordering, installing, and configuring new hardware took weeks or months. Capacity planning made this worse. Organizations had to forecast future demand and either overprovision and pay for idle capacity or underprovision and risk outages. Spiky workloads, such as retail at holidays or tax software in filing season, made the tradeoff brutal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Amazon Web Services launched Elastic Compute Cloud (EC2) in August 2006 &lt;a href=&quot;#ref-AWS06&quot; id=&quot;ref-AWS06-back&quot;&gt;[AWS06]&lt;/a&gt;, providing virtual servers provisionable through an API in minutes with pay-per-hour billing. This transformed infrastructure from capital expenditure (CapEx) to operational expense (OpEx) and from static to elastic. EC2 is the foundational example of what the industry came to call Infrastructure as a Service (IaaS). The cloud provider manages physical hardware, networking, and virtualization, while the customer retains responsibility for operating systems, applications, and data. The customer rents compute, storage, and network capacity rather than purchasing it.&lt;/p&gt;

&lt;p&gt;This transformed capacity planning. Organizations could scale elastically to match current demand. The cloud model expanded in layers. PaaS (Heroku, Google App Engine, Elastic Beanstalk) shifted OS and runtime management to the provider, so programmers could deploy applications without configuring servers. SaaS (Salesforce, Gmail, Dropbox) delivered entirely managed applications. Infrastructure as code (Chef, Puppet, Terraform) automated the provisioning of IaaS resources through version-controlled scripts, replacing weeks of manual setup. AWS operated data centers worldwide. Startups could deploy globally with the same API calls. Following AWS’s success, Azure and Google Cloud emerged. Cloud computing became the dominant deployment model.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-AWS06&quot; href=&quot;#ref-AWS06-back&quot;&gt;[AWS06]&lt;/a&gt; Amazon Web Services. 2006. &quot;Announcing Amazon Elastic Compute Cloud (Amazon EC2).&quot; Available at &lt;a href=&quot;https://aws.amazon.com/about-aws/whats-new/2006/08/24/announcing-amazon-elastic-compute-cloud-amazon-ec2---beta/&quot; target=&quot;_blank&quot;&gt;aws.amazon.com&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;mobile-2007&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2007. Mobile platforms turn the phone into a general-purpose computer with app ecosystems&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; At its peak, Nokia controlled over 40% of the global mobile phone market. Within six years, that share had collapsed to under 5%. Hardware was not the issue. Nokia’s model treated phones as closed appliances. SDKs were fragmented (J2ME on some devices, proprietary on others), there was no unified channel for programmers to distribute software to users, and Symbian was not built for a phone as a general-purpose computer running third-party software.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Apple released the iPhone in June 2007. The multitouch screen and full web browser were significant, but the deeper change was conceptual. The iPhone was positioned as a general-purpose computing device that also made calls, with a browser that rendered full web pages rather than a stripped-down mobile experience. The iPhone SDK launched in March 2008 &lt;a href=&quot;#ref-App08&quot; id=&quot;ref-App08-back&quot;&gt;[App08]&lt;/a&gt; and the App Store opened in July 2008. For software distribution, Apple provided a single channel. Programmers could submit apps and reach millions of devices without going through carriers or OEMs. Google followed with Android in September 2008 &lt;a href=&quot;#ref-And08&quot; id=&quot;ref-And08-back&quot;&gt;[And08]&lt;/a&gt; and the Android Market, with a more permissive review process. Both platforms provided high-level APIs and enabled instant global distribution. By the mid-2010s, mobile had created new categories such as ridesharing, mobile payments, and social photography, and changed how software was discovered, distributed, and monetized.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-App08&quot; href=&quot;#ref-App08-back&quot;&gt;[App08]&lt;/a&gt; Apple. 2008. &quot;iPhone SDK Announcement.&quot; Available at &lt;a href=&quot;https://www.apple.com/newsroom/2008/03/06Apple-Announces-iPhone-2-0-Software-Beta/&quot; target=&quot;_blank&quot;&gt;apple.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-And08&quot; href=&quot;#ref-And08-back&quot;&gt;[And08]&lt;/a&gt; Android Open Source Project. 2008. &quot;Android Platform Overview.&quot; Available at &lt;a href=&quot;https://source.android.com&quot; target=&quot;_blank&quot;&gt;source.android.com&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;microservices-2008&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2008–2012. Microservices replace monoliths as the architecture for large-scale applications&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Large web companies in the late 2000s built their platforms as monolithic applications, that is, large codebases deployed as one unit. Early Netflix illustrates the pattern. Its core system was a Java application backed by an Oracle database. In August 2008, a hardware failure took the entire service down for three days. The cause was initially suspected to be database corruption. Every part of the system depended on the same database, so a failure in one place propagated everywhere. Such failures are inherent to monolithic architecture.&lt;/p&gt;

&lt;p&gt;Beyond availability, monoliths created organizational bottlenecks. Because the application was a single deployment unit, teams working on different features, such as recommendations, billing, and streaming playback, had to deploy together. A bug in one component could take down the whole process and break unrelated features. Because the codebase had no service boundaries, adding a feature required understanding and testing the entire application. Because all components ran in the same process, scaling meant adding more copies of the entire application. Provisioning more compute for one component required scaling everything, wasting capacity on components that needed none. As the codebase grew, development slowed and onboarding became increasingly difficult.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Amazon arrived at the architecture first &lt;a href=&quot;#ref-Vog22&quot; id=&quot;ref-Vog22-back&quot;&gt;[Vog22]&lt;/a&gt;. Its monolithic e-commerce platform became unmanageable as it expanded. Architects required every internal capability to be exposed as an independent service. That restructuring produced the infrastructure that became AWS. Netflix began migrating to that model on AWS in 2009, a seven-year process. The core idea was to break the monolith into small, independently deployable services, each owning its own database. A failure in one service no longer took down the whole platform. Netflix eventually decomposed into over 700 microservices. At that scale, services must find each other, handle failures gracefully, and distribute load across instances. Netflix open-sourced its operational tooling as Eureka (service discovery), Hystrix (circuit breaker), and Ribbon (load balancing) &lt;a href=&quot;#ref-Net12&quot; id=&quot;ref-Net12-back&quot;&gt;[Net12]&lt;/a&gt;. Fowler and Lewis gave the pattern its name and a widely cited reference in 2014 &lt;a href=&quot;#ref-Mar14&quot; id=&quot;ref-Mar14-back&quot;&gt;[Mar14]&lt;/a&gt;. By the 2020s microservices had become the dominant architecture for large-scale web applications, adopted by most large enterprises.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Mar14&quot; href=&quot;#ref-Mar14-back&quot;&gt;[Mar14]&lt;/a&gt; Fowler, M. &amp;amp; Lewis, J. 2014. &quot;Microservices.&quot; Available at &lt;a href=&quot;https://martinfowler.com/microservices/&quot; target=&quot;_blank&quot;&gt;martinfowler.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Net12&quot; href=&quot;#ref-Net12-back&quot;&gt;[Net12]&lt;/a&gt; Netflix. 2012. &quot;Netflix Shares Cloud Load Balancing And Failover Tool: Eureka!&quot; Available at &lt;a href=&quot;https://netflixtechblog.com/netflix-shares-cloud-load-balancing-and-failover-tool-eureka-c10647ef95e5&quot; target=&quot;_blank&quot;&gt;netflixtechblog.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Vog22&quot; href=&quot;#ref-Vog22-back&quot;&gt;[Vog22]&lt;/a&gt; Vogels, W. 2022. &quot;The Distributed Computing Manifesto.&quot; &lt;em&gt;All Things Distributed&lt;/em&gt;. Available at &lt;a href=&quot;https://www.allthingsdistributed.com/2022/11/amazon-1998-distributed-computing-manifesto.html&quot; target=&quot;_blank&quot;&gt;allthingsdistributed.com&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;nosql-2009&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2009. NoSQL databases trade consistency for scale and flexibility&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; By the late 2000s, web-scale data and traffic exceeded what relational databases could handle. Horizontal scaling required ACID across partitions and two-phase commit, which did not scale. The CAP theorem &lt;a href=&quot;#ref-Bre00&quot; id=&quot;ref-Bre00-back&quot;&gt;[Bre00]&lt;/a&gt; formalized the tradeoff. Relational databases chose consistency and became unavailable during partitions. Fixed schemas forced sparse tables or many joins for heterogeneous data. Schema changes required migrations that locked tables. Row-oriented storage made analytical scans expensive. Relational full-text search did not scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; NoSQL databases relax relational constraints in exchange for scale. Different designs addressed different constraints.&lt;/p&gt;

&lt;p&gt;Relational databases had required two-phase commit across partitions and went offline when partitions occurred. Two designs from 2006–2007 showed that abandoning ACID across partitions enabled horizontal scaling. They chose opposite sides of the CAP tradeoff.&lt;/p&gt;

&lt;p&gt;BigTable &lt;a href=&quot;#ref-CDG06&quot; id=&quot;ref-CDG06-back&quot;&gt;[CDG+06]&lt;/a&gt; (Google, 2006) chose consistency. It used a sparse, multi-dimensional sorted map. Data was organized in column families, groups of columns stored together within each row, so each row could have different columns. Heterogeneous data such as web crawl records with varying fields per URL no longer required sparse tables or many joins. It offered strong consistency within a row and suited read-heavy workloads such as Google’s search index and Google Maps. That consistency required a central coordinator, at the cost of availability. BigTable’s design was adopted in the open-source Apache HBase.&lt;/p&gt;

&lt;p&gt;Dynamo &lt;a href=&quot;#ref-DHJ07&quot; id=&quot;ref-DHJ07-back&quot;&gt;[DHJ+07]&lt;/a&gt; (Amazon, 2007) chose availability. It stayed writable during partitions when relational systems went offline. Its key-value model had no central coordinator. It suited shopping carts and session data, where availability mattered more than immediate consistency. Dynamo’s design was adopted in Amazon’s DynamoDB (commercial) and Riak (open-source).&lt;/p&gt;

&lt;p&gt;Document stores addressed sparse data and schema evolution. Fixed schemas had forced migrations that locked tables. MongoDB (2009) &lt;a href=&quot;#ref-Mon09&quot; id=&quot;ref-Mon09-back&quot;&gt;[Mon09]&lt;/a&gt; and CouchDB stored JSON-like documents. Applications could add fields without migrations. The model suited user profiles with varying attributes, product catalogs with nested specifications, and content with arbitrary metadata. The tradeoff was eventual consistency and no ACID across documents. Cassandra (2008) &lt;a href=&quot;#ref-LM10&quot; id=&quot;ref-LM10-back&quot;&gt;[LM10]&lt;/a&gt; combined BigTable’s column-family model with Dynamo’s decentralized distribution. It handled sparse data, scaled horizontally, and offered tunable consistency. Use cases included activity feeds and time-series data. The cost was eventual consistency by default.&lt;/p&gt;

&lt;div class=&quot;video-container&quot; style=&quot;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; max-width: 100%;&quot;&gt;
&lt;iframe style=&quot;position: absolute; top: 0; left: 0; width: 100%; height: 100%;&quot; src=&quot;https://www.youtube.com/embed/b2F-DItXtZs&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;
&lt;/iframe&gt;
&lt;/div&gt;
&lt;p class=&quot;image-caption&quot;&gt;MongoDB is web scale (parody). &lt;a href=&quot;https://www.youtube.com/watch?v=b2F-DItXtZs&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;youtube.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Columnar stores and search engines addressed analytical scans and full-text search. Row-oriented storage had made broad aggregations slow. Vertica and ClickHouse stored each column separately, so scans could read only the columns needed for aggregations. They suited analytical dashboards, sales reports, and click analytics (OLAP) but were poor for transactional point updates (OLTP). Elasticsearch (2010) &lt;a href=&quot;#ref-Ban10&quot; id=&quot;ref-Ban10-back&quot;&gt;[Ban10]&lt;/a&gt; and Solr, built on Lucene, provided full-text search over HTTP for product search, log analysis, and site search. NoSQL made web-scale storage reachable without purpose-built hardware or specialist teams.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Bre00&quot; href=&quot;#ref-Bre00-back&quot;&gt;[Bre00]&lt;/a&gt; Brewer, E. 2000. &quot;Towards Robust Distributed Systems.&quot; &lt;em&gt;Proceedings of ACM PODC&lt;/em&gt;. Available at &lt;a href=&quot;https://www.researchgate.net/publication/221343719_Towards_robust_distributed_systems&quot; target=&quot;_blank&quot;&gt;researchgate.net&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-CDG06&quot; href=&quot;#ref-CDG06-back&quot;&gt;[CDG+06]&lt;/a&gt; Chang, F., Dean, J., Ghemawat, S., et al. 2006. &quot;Bigtable: A Distributed Storage System for Structured Data.&quot; &lt;em&gt;Proceedings of OSDI&lt;/em&gt;, 205-218. Available at &lt;a href=&quot;https://research.google/pubs/bigtable-a-distributed-storage-system-for-structured-data/&quot; target=&quot;_blank&quot;&gt;research.google&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-DHJ07&quot; href=&quot;#ref-DHJ07-back&quot;&gt;[DHJ+07]&lt;/a&gt; DeCandia, G., Hastorun, D., Jampani, M., et al. 2007. &quot;Dynamo: Amazon&apos;s Highly Available Key-value Store.&quot; &lt;em&gt;Proceedings of ACM SOSP&lt;/em&gt;, 205-220. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/1294261.1294281&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-LM10&quot; href=&quot;#ref-LM10-back&quot;&gt;[LM10]&lt;/a&gt; Lakshman, A. &amp;amp; Malik, P. 2010. &quot;Cassandra: A Decentralized Structured Storage System.&quot; &lt;em&gt;ACM SIGOPS Operating Systems Review&lt;/em&gt; 44(2):35-40. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/1773912.1773922&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Ban10&quot; href=&quot;#ref-Ban10-back&quot;&gt;[Ban10]&lt;/a&gt; Banon, S. 2010. &quot;You Know, for Search.&quot; Elasticsearch. Available at &lt;a href=&quot;https://www.elastic.co/guide/en/elasticsearch/guide/current/intro.html&quot; target=&quot;_blank&quot;&gt;elastic.co&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Mon09&quot; href=&quot;#ref-Mon09-back&quot;&gt;[Mon09]&lt;/a&gt; Chodorow, K. &amp;amp; Dirolf, M. 2010. &lt;em&gt;MongoDB: The Definitive Guide&lt;/em&gt;. O&apos;Reilly Media. Available at &lt;a href=&quot;https://www.oreilly.com/library/view/mongodb-the-definitive/9781449381578/&quot; target=&quot;_blank&quot;&gt;oreilly.com&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;nodejs-2009&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2009. Node.js makes JavaScript full-stack and enables the npm ecosystem&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; By the late 2000s, web applications had a split identity. The browser ran JavaScript. The server ran Java, PHP, Python, or Ruby. Programmers wrote frontend and backend in different languages, with different runtimes and toolchains. Building a real-time feature meant WebSockets on the server and JavaScript in the browser. Full-stack development meant context-switching between languages, deployment targets, and debugging environments. Programmer time was spent on integration friction, not features. There was no way to share code reliably between client and server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Ryan Dahl released Node.js in 2009 &lt;a href=&quot;#ref-Dah09&quot; id=&quot;ref-Dah09-back&quot;&gt;[Dah09]&lt;/a&gt;. Node.js ran JavaScript on the server using Google’s V8 engine, the same one powering Chrome. The key innovation was non-blocking I/O. Instead of threads, it used an event loop. A single process could handle thousands of concurrent connections. This suited I/O-bound workloads such as APIs, proxies, and real-time applications that had dominated server-side scaling challenges.&lt;/p&gt;

&lt;p&gt;Node.js made JavaScript full-stack. Programmers could write client and server in the same language. npm, launched in 2010, became the package registry for Node and the browser. The same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;require&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;import&lt;/code&gt; worked on both sides. Rails (2004) and Django (2005) had popularized convention-over-configuration, relying on sensible defaults (e.g. a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Post&lt;/code&gt; model maps to a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posts&lt;/code&gt; table) instead of explicit configuration for every detail. Node.js and frameworks like Express (2010) brought the same model to JavaScript. The ecosystem expanded rapidly. By the mid-2010s, Node.js powered Netflix, LinkedIn, Uber, and PayPal. JavaScript went from a browser scripting language to the most widely used language for web development. One language and one ecosystem spanned the stack from end to end.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Dah09&quot; href=&quot;#ref-Dah09-back&quot;&gt;[Dah09]&lt;/a&gt; Node.js. &quot;About Node.js.&quot; Available at &lt;a href=&quot;https://nodejs.org/en/about&quot; target=&quot;_blank&quot;&gt;nodejs.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;language-tooling-2010&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2010–2015. Type safety, component architecture, and safer concurrency reach mainstream development&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Dynamic languages such as JavaScript, Python, and Ruby had become central to web and backend development, but type checking and clear separation of concerns lagged. Dynamic typing deferred errors to runtime that static typing would have caught at compile time. In JavaScript, jQuery-based applications entangled DOM manipulation, business logic, and data fetching with no clear separation.&lt;/p&gt;

&lt;p&gt;Concurrent programming in Java, C++, and similar languages faced a separate set of challenges. Mutable shared state and race conditions produced bugs that were difficult to reproduce and debug. Locks offered a remedy but introduced deadlocks and contention, and correctness depended on precise lock ordering that most programmers found impractical to maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; TypeScript (2012) &lt;a href=&quot;#ref-Mic12&quot; id=&quot;ref-Mic12-back&quot;&gt;[Mic12]&lt;/a&gt; and React (2013) &lt;a href=&quot;#ref-Fac13&quot; id=&quot;ref-Fac13-back&quot;&gt;[Fac13]&lt;/a&gt; added type checking and component architecture to JavaScript, transforming jQuery spaghetti into structured development.&lt;/p&gt;

&lt;p&gt;Functional concepts entered mainstream languages. Scala bridged object-oriented and functional programming on the JVM, making functional ideas accessible to Java programmers. Twitter’s adoption of Scala for high-concurrency systems &lt;a href=&quot;#ref-Eri12&quot; id=&quot;ref-Eri12-back&quot;&gt;[Eri12]&lt;/a&gt; demonstrated that type-safe functional programming could handle production scale. Java 8 (2014) &lt;a href=&quot;#ref-Ora14&quot; id=&quot;ref-Ora14-back&quot;&gt;[Ora14]&lt;/a&gt; followed with lambdas and streams, bringing functional patterns to the mainstream. Immutable data and pure functions addressed concurrency without locks.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Eri12&quot; href=&quot;#ref-Eri12-back&quot;&gt;[Eri12]&lt;/a&gt; Eriksen, M. et al. 2012. &quot;Effective Scala.&quot; Twitter Engineering. Available at &lt;a href=&quot;https://twitter.github.io/effectivescala/&quot; target=&quot;_blank&quot;&gt;twitter.github.io/effectivescala&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Fac13&quot; href=&quot;#ref-Fac13-back&quot;&gt;[Fac13]&lt;/a&gt; Facebook. 2013. &quot;React: A JavaScript Library for Building User Interfaces.&quot; Available at &lt;a href=&quot;https://react.dev&quot; target=&quot;_blank&quot;&gt;react.dev&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Mic12&quot; href=&quot;#ref-Mic12-back&quot;&gt;[Mic12]&lt;/a&gt; Microsoft. 2012. &quot;Introducing TypeScript.&quot; Available at &lt;a href=&quot;https://devblogs.microsoft.com/typescript/announcing-typescript-1-0/&quot; target=&quot;_blank&quot;&gt;devblogs.microsoft.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Ora14&quot; href=&quot;#ref-Ora14-back&quot;&gt;[Ora14]&lt;/a&gt; Oracle. 2014. &quot;What&apos;s New in Java SE 8.&quot; Available at &lt;a href=&quot;https://docs.oracle.com/javase/8/docs/technotes/guides/whats-new/java-se-8.html&quot; target=&quot;_blank&quot;&gt;docs.oracle.com&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;containers-2013&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2013–2014. Containers and orchestration make deployment portable and scalable&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Two constraints slowed deployment. The first was environment inconsistency. Applications ran in development but crashed in production due to different library versions, missing dependencies, or configuration drift. Virtual machines provided isolation but were heavyweight. Each VM required a full OS and consumed gigabytes. Manual configuration and documentation were brittle.&lt;/p&gt;

&lt;p&gt;The second constraint was orchestration at scale. Distributing workloads across clusters required placement decisions, failure handling, traffic routing, and elastic scaling. Manual coordination did not scale. Cluster managers such as Apache Mesos offered resource scheduling but did not provide declarative desired-state configuration, automated rollouts and rollbacks, integrated service discovery, or self-healing of failed workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Docker (2013) &lt;a href=&quot;#ref-Mer14&quot; id=&quot;ref-Mer14-back&quot;&gt;[Mer14]&lt;/a&gt; addressed environment inconsistency. Linux containers packaged applications into images that ran identically anywhere. A Dockerfile replaced manual setup. Containers were lightweight compared to VMs and required no full OS per instance.&lt;/p&gt;

&lt;p&gt;Kubernetes (2014) &lt;a href=&quot;#ref-Bur16&quot; id=&quot;ref-Bur16-back&quot;&gt;[Bur+16]&lt;/a&gt;, based on Google’s Borg, addressed orchestration at scale. Programmers declared desired state in YAML, for example &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;replicas: 10&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;containerPort: 80&lt;/code&gt;, and Kubernetes reconciled the cluster to match. The system restarted crashed instances, scaled in response to load, and exposed infrastructure as declarative YAML stored in version control. Cloud providers offered managed Kubernetes. Within a decade, containers and Kubernetes had become the standard for cloud-native deployment that scales and recovers across many machines.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Mer14&quot; href=&quot;#ref-Mer14-back&quot;&gt;[Mer14]&lt;/a&gt; Merkel, D. 2014. &quot;Docker: Lightweight Linux Containers for Consistent Development and Deployment.&quot; &lt;em&gt;Linux Journal&lt;/em&gt; 2014(239):2. Available at &lt;a href=&quot;https://www.linuxjournal.com/content/docker-lightweight-linux-containers-consistent-development-and-deployment&quot; target=&quot;_blank&quot;&gt;linuxjournal.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Bur16&quot; href=&quot;#ref-Bur16-back&quot;&gt;[Bur+16]&lt;/a&gt; Burns, B., Grant, B., Oppenheimer, D., Brewer, E., &amp;amp; Wilkes, J. 2016. &quot;Borg, Omega, and Kubernetes.&quot; &lt;em&gt;ACM Queue&lt;/em&gt; 14(1):70-93. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/2898442.2898444&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;serverless-2014&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2014. Serverless computing shifts the unit of deployment from servers to functions&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Containers and orchestration made deployment portable and scalable, but containers were always-on. A workload handling one request per hour still required a running container and continuous compute cost. For bursty or infrequent workloads, organizations paid for idle capacity, the same inefficiency that cloud computing had aimed to eliminate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; AWS Lambda, launched in November 2014 &lt;a href=&quot;#ref-AWS14&quot; id=&quot;ref-AWS14-back&quot;&gt;[AWS14]&lt;/a&gt;, introduced Function-as-a-Service, or serverless. The platform invoked functions only when triggered by events such as HTTP requests, file uploads, queue messages, or scheduled jobs. Each execution was ephemeral. The runtime was torn down afterward, so no compute was allocated between runs and pricing was per-invocation and per-duration. Cost aligned with actual usage rather than provisioned capacity, eliminating idle cost for bursty or infrequent workloads. Lambda scaled automatically from zero to thousands of concurrent executions.&lt;/p&gt;

&lt;p&gt;The stateless model imposed constraints. Cold starts introduced latency after inactivity, and execution time was capped, so serverless suited event-driven workloads such as APIs, background processing, data pipelines, and scheduled tasks but was not a universal replacement for containers. Google Cloud Functions and Azure Functions followed. Serverless became a standard deployment option alongside IaaS and containers, selected by workload characteristics.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-AWS14&quot; href=&quot;#ref-AWS14-back&quot;&gt;[AWS14]&lt;/a&gt; Amazon Web Services. 2014. &quot;Announcing AWS Lambda.&quot; Available at &lt;a href=&quot;https://aws.amazon.com/blogs/aws/run-code-cloud/&quot; target=&quot;_blank&quot;&gt;aws.amazon.com&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;ml-frameworks-2015&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2015–2016. ML frameworks democratize machine learning without research-level expertise&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Before 2015, applying machine learning meant implementing algorithms from academic papers and using tools like R and Python that required statistical expertise. Deep learning had emerged as a research direction, but implementing backpropagation, designing architectures, and training at scale demanded deep knowledge of linear algebra, optimization, and distributed systems. Google, Facebook, Twitter, and a few labs built their ow internal frameworks. They either had to hire PhDs or stay out. The gap between “research breakthrough” and “programmer can use it” was enormous. The bottleneck was expertise, not compute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; TensorFlow (2015) &lt;a href=&quot;#ref-Aba15&quot; id=&quot;ref-Aba15-back&quot;&gt;[Aba+15]&lt;/a&gt; and PyTorch (2016) &lt;a href=&quot;#ref-Pas17&quot; id=&quot;ref-Pas17-back&quot;&gt;[Pas+17]&lt;/a&gt; made deep learning tractable for programmers without research-level expertise. Both provided backpropagation, GPU acceleration, and a high-level API. Programmers defined computation as a graph or in imperative code. The frameworks handled the math, learning-rate tuning, and distributed training across GPUs. Both integrated with NumPy, pandas, and Jupyter. Transfer learning allowed fine-tuning of pretrained models with minimal data. Scikit-learn had already made classical ML such as regression, classification, and clustering accessible. TensorFlow and PyTorch did the same for deep learning, and the language models behind AI coding assistants such as Copilot and Codex were trained with these frameworks &lt;a href=&quot;#ref-CKB21-ml&quot; id=&quot;ref-CKB21-back-ml&quot;&gt;[CKB+21]&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Aba15&quot; href=&quot;#ref-Aba15-back&quot;&gt;[Aba+15]&lt;/a&gt; Abadi, M., et al. 2016. &quot;TensorFlow: A System for Large-Scale Machine Learning.&quot; &lt;em&gt;Proceedings of OSDI&lt;/em&gt;, 265-283. (Released 2015.) Available at &lt;a href=&quot;https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi&quot; target=&quot;_blank&quot;&gt;usenix.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Pas17&quot; href=&quot;#ref-Pas17-back&quot;&gt;[Pas+17]&lt;/a&gt; Paszke, A., et al. 2019. &quot;PyTorch: An Imperative Style, High-Performance Deep Learning Library.&quot; &lt;em&gt;Advances in NeurIPS&lt;/em&gt; 32. (Released 2016.) Available at &lt;a href=&quot;https://arxiv.org/abs/1912.01703&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-CKB21-ml&quot; href=&quot;#ref-CKB21-back-ml&quot;&gt;[CKB+21]&lt;/a&gt; Chen, M., Tworek, J., Jun, H., et al. 2021. &quot;Evaluating Large Language Models Trained on Code.&quot; &lt;em&gt;arXiv:2107.03374&lt;/em&gt;. Available at &lt;a href=&quot;https://arxiv.org/abs/2107.03374&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;ai-coding&quot; class=&quot;era-heading&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;AI coding&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;AI coding assistants depend on language models that map input sequences to output sequences. The following milestones trace the architectural and scaling developments that made those models possible.&lt;/p&gt;

&lt;h2 id=&quot;ai-transformers-2017&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2017. Transformers replace recurrence with self-attention&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; By 2017, neural networks had replaced rule-based systems, count-based n-grams, phrase tables, and hand-engineered features for tasks like machine translation, language modeling, and question answering. The dominant architecture was the encoder-decoder, implemented with recurrent neural networks (typically LSTMs). It mapped an input sequence to an output sequence, for example one sentence to another or a natural language description to code. The encoder processed the input token by token and produced a fixed-length vector (the final hidden state). The decoder consumed that vector and generated the output token by token, autoregressively. The limitation was recurrence. At each step $t$, the hidden state $h_t$ depended on the previous hidden state $h_{t-1}$ and the current input $x_t$, so computation was strictly sequential.&lt;/p&gt;

\[h_t = f(h_{t-1}, x_t)\]

&lt;p&gt;Because $h_t$ depends on $h_{t-1}$, the forward pass required $n$ sequential steps and could not be parallelized. Information from position $t$ to $t+k$ propagated through $k$ steps. During backpropagation, the gradient was multiplied by $\partial h_t / \partial h_{t-1}$ at each step. The product of $k$ Jacobians often had spectral norm below one, so the gradient decayed exponentially and long-range dependencies received negligible signal. An architecture that allowed parallel computation and direct flow between arbitrary positions was needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Vaswani et al. &lt;a href=&quot;#ref-VSP17&quot; id=&quot;ref-VSP17-back&quot;&gt;[VSP+17]&lt;/a&gt; introduced the Transformer, an encoder-decoder that dispenses with recurrence. Instead of the recurrent update above, each layer uses self-attention. Let $i$ and $j$ denote sequence indices (positions). At each layer, the representation at position $i$ is computed as a weighted sum over all positions $j$,&lt;/p&gt;

\[\begin{aligned}
h_i &amp;amp;= \sum_j \alpha_{ij} V_j \\\\
\alpha_{ij} &amp;amp;= \mathrm{softmax}\bigl( q_i \cdot k_j \big/ \sqrt{d} \bigr)
\end{aligned}\]

&lt;p&gt;The query $q_i$, key $k_j$, and value $V_j$ are learned linear projections of the input representations. Unlike recurrence, $h_i$ depends on all $h_j$ in one step, with no sequential dependency. All positions are updated in parallel, and information between any two positions flows in one layer regardless of their distance in the sequence. This design yields three effects. Training is fully parallelizable over the sequence. Long-range dependencies avoid vanishing gradients because the gradient between any two positions traverses one layer, not many. The architecture scales to very large models and datasets. Transformers underpin BERT (2018), GPT-2 (2019), GPT-3 (2020), Codex (2021), and all subsequent code-capable language models.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-VSP17&quot; href=&quot;#ref-VSP17-back&quot;&gt;[VSP+17]&lt;/a&gt; Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., &amp;amp; Polosukhin, I. 2017. &quot;Attention Is All You Need.&quot; &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt; 30. Available at &lt;a href=&quot;https://arxiv.org/abs/1706.03762&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;ai-llm-2020&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2020. Large language models demonstrate in-context learning&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; By 2019, Transformer-based language models such as BERT and GPT-2 had been pretrained on large text corpora. The standard way to apply these models to a specific task was supervised fine-tuning. A practitioner took a pretrained model, collected labeled examples for the target task, and trained the model on those examples. Translation required labeled translation pairs. Sentiment analysis required labeled sentences. Code generation required labeled specification-code pairs. Each task demanded its own dataset and its own training run. Deploying a new capability meant fine-tuning, validating, and shipping a new model variant. The cost of data collection and the expertise required for training and deployment limited adoption to organizations with dedicated ML infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Brown et al. &lt;a href=&quot;#ref-BMR20&quot; id=&quot;ref-BMR20-back&quot;&gt;[BMR+20]&lt;/a&gt; showed that fine-tuning could be dropped and demonstrated it at scale. A 175-billion-parameter model, trained only on next-token prediction over text, performed well across many tasks when given a few in-context examples and no gradient update. Smaller models had not shown the same capability, so scale mattered. The pretraining corpus contained many input–output style subsequences, such as translations, Q&amp;amp;A, and code with comments, so the model had already learned to continue them without task labels. At inference the input was a prompt of a few pairs plus the new query. A translation prompt could look like:&lt;/p&gt;

&lt;p&gt;“Hello, world.” → “Bonjour, le monde.”&lt;br /&gt;
“Good morning.” → “Bonjour.”&lt;br /&gt;
“See you tomorrow.” → ?&lt;/p&gt;

&lt;p&gt;The model produced the continuation by computing $P(\text{next token} \mid \text{prefix})$ with the prompt as prefix, with no second phase or parameter update. They called this in-context learning because the task was specified only in the prompt at inference.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-BMR20&quot; href=&quot;#ref-BMR20-back&quot;&gt;[BMR+20]&lt;/a&gt; Brown, T. B., Mann, B., Ryder, N., et al. 2020. &quot;Language Models are Few-Shot Learners.&quot; &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt; 33:1877-1901. Available at &lt;a href=&quot;https://arxiv.org/abs/2005.14165&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;ai-copilot-2021&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2021. Copilot and Codex bring AI code generation to mainstream development&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Software engineering continued to face productivity bottlenecks. Significant time was spent on mechanical tasks, including implementing CRUD endpoints and validation logic, consulting documentation for library and API usage, searching codebases for analogous implementations, translating schemas to types and API specs to stubs, and writing unit tests with conventional arrange-act-assert structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Research had already shown that pretraining on code improved over general-purpose LMs. CodeBERT &lt;a href=&quot;#ref-Fen20&quot; id=&quot;ref-Fen20-back&quot;&gt;[Fen+20]&lt;/a&gt; and related work demonstrated that joint representations of code and natural language supported search, summarization, and completion. Codex &lt;a href=&quot;#ref-CKB21&quot; id=&quot;ref-CKB21-back&quot;&gt;[CKB+21]&lt;/a&gt; was a GPT model fine-tuned on publicly available code from GitHub. It used the same next-token, in-context paradigm as Brown et al., but with a code-heavy training distribution, and outperformed general-purpose models on code. GitHub Copilot &lt;a href=&quot;#ref-Git21&quot; id=&quot;ref-Git21-back&quot;&gt;[Git21]&lt;/a&gt; (June 2021) was the first mainstream assistant, with 55% faster task completion &lt;a href=&quot;#ref-Git22&quot; id=&quot;ref-Git22-back&quot;&gt;[Git22]&lt;/a&gt;. The model completed code as programmers typed. The abstraction was autocomplete at the level of functions and blocks. Verification remained necessary. Output was statistically plausible, not formally correct.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-CKB21&quot; href=&quot;#ref-CKB21-back&quot;&gt;[CKB+21]&lt;/a&gt; Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., et al. 2021. &quot;Evaluating Large Language Models Trained on Code.&quot; &lt;em&gt;arXiv:2107.03374&lt;/em&gt;. Available at &lt;a href=&quot;https://arxiv.org/abs/2107.03374&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Fen20&quot; href=&quot;#ref-Fen20-back&quot;&gt;[Fen+20]&lt;/a&gt; Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., et al. 2020. &quot;CodeBERT: A Pre-Trained Model for Programming and Natural Languages.&quot; &lt;em&gt;Findings of EMNLP&lt;/em&gt;. Available at &lt;a href=&quot;https://arxiv.org/abs/2002.08155&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Git21&quot; href=&quot;#ref-Git21-back&quot;&gt;[Git21]&lt;/a&gt; GitHub. 2021. &quot;Introducing GitHub Copilot: Your AI pair programmer.&quot; Available at &lt;a href=&quot;https://github.blog/2021-06-29-introducing-github-copilot-ai-pair-programmer/&quot; target=&quot;_blank&quot;&gt;github.blog&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Git22&quot; href=&quot;#ref-Git22-back&quot;&gt;[Git22]&lt;/a&gt; GitHub. 2022. &quot;Research: Quantifying GitHub Copilot&apos;s impact on developer productivity and happiness.&quot; Available at &lt;a href=&quot;https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/&quot; target=&quot;_blank&quot;&gt;github.blog&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;ai-rlhf-2022&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2022. RLHF aligns code models to programmer intent&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Models optimized for next-token prediction did not reliably follow instructions or match user preference. A programmer asking to “add error handling” might receive technically valid code that didn’t match their error-handling conventions. Early Copilot and Codex produced code that was statistically plausible but often misaligned with intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; The fix was to add a second training phase that optimized the policy for human preference, not only for next-token likelihood. Here the policy is the code model that, given a prompt $x$, defines a distribution over completions $y$, written $\pi_\theta(y \mid x)$ with parameters $\theta$. The procedure is reinforcement learning from human feedback (RLHF). It has two components. (1) A reward model and (2) an RL phase.&lt;/p&gt;

&lt;p&gt;(1) The reward model is a network that assigns each prompt-completion pair a scalar reward, e.g. $r(x, y) \in \mathbb{R}$ where $x$ is the prompt and $y$ is the completion. It is trained on human pairwise preferences so that preferred completions receive higher reward. That required new human-labeled data, but only preference labels that indicate which of two completions is better, not full target completions. The scale of this preference data is on the order of tens of thousands of comparisons, far less than pretraining data.&lt;/p&gt;

&lt;p&gt;(2) The RL phase has one goal. The policy is adjusted so that its completions get high reward from the reward model (i.e. what humans prefer), without drifting so far from the pretrained reference that outputs degenerate into gibberish or reward-hacking (e.g. repeating phrases the reward model likes). The policy is fine-tuned with PPO (proximal policy optimization), an RL algorithm that updates the policy in constrained steps. In plain language, the training objective is to maximize the average reward on completions the policy produces, then subtract a penalty for how far the policy has drifted from the pretrained reference. So the policy is pushed toward high-reward outputs but kept close to the reference so that outputs stay readable, valid code.&lt;/p&gt;

&lt;p&gt;Formally, the objective is&lt;/p&gt;

\[\mathbb{E}_{y \sim \pi_\theta}[r(y)] - \beta\,\mathrm{KL}(\pi_\theta \| \pi_{\mathrm{ref}})\]

&lt;p&gt;where $\mathbb{E}_{y \sim \pi_\theta}$ is the expectation (average over completions drawn from the policy), $y$ is a completion, and $r(y)$ is its reward. Thus the objective maximizes the first term (expected reward under the policy) and minimizes the second (deviation from the reference policy). The coefficient $\beta$ controls the tradeoff between reward and staying close. KL denotes the Kullback–Leibler divergence,&lt;/p&gt;

\[\mathrm{KL}(\pi_\theta \| \pi_{\mathrm{ref}}) = \mathbb{E}_{y \sim \pi_\theta}\left[\log \pi_\theta(y) - \log \pi_{\mathrm{ref}}(y)\right],\]

&lt;p&gt;Without the KL term, the policy can collapse toward high-reward, low-fluency or reward-hacking outputs. The KL term keeps outputs close to the reference distribution so that they stay readable, valid code. The policy is thus optimized for preference, not only for likelihood on a fixed corpus.&lt;/p&gt;

&lt;p&gt;InstructGPT (March 2022) &lt;a href=&quot;#ref-Ouy22&quot; id=&quot;ref-Ouy22-back&quot;&gt;[Ouy+22]&lt;/a&gt; and ChatGPT (November 2022) established the pipeline. Code assistants adopted it with programmer labelers and code completions.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Ouy22&quot; href=&quot;#ref-Ouy22-back&quot;&gt;[Ouy+22]&lt;/a&gt; Ouyang, L., et al. 2022. &quot;Training language models to follow instructions with human feedback.&quot; &lt;em&gt;Advances in NeurIPS&lt;/em&gt; 35. Available at &lt;a href=&quot;https://arxiv.org/abs/2203.02155&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;ai-rag-2022&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2023. RAG grounds code generation in the codebase&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; A language model’s context window is the maximum number of tokens (roughly, words or subwords) it can take as input in one call. Code-capable models of the Codex and Copilot era (2021–2022) had context windows of 2k–8k tokens. That was enough for a short prompt and a few in-context examples, but not for real codebases. Typical limits remained 4k–8k tokens through 2022. A programmer fixing a bug needed relevant files in context, but those limits could not hold them. Even a modest service spanning dozens of files and tens of thousands of lines exceeded the window, so the model never saw most of the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; RAG (retrieval-augmented generation) &lt;a href=&quot;#ref-Lew20&quot; id=&quot;ref-Lew20-back&quot;&gt;[Lew+20]&lt;/a&gt; was introduced for knowledge-intensive NLP in 2020. Code assistants adopted it for the codebase context problem in 2023. The delay reflected two factors. Code-specific retrieval infrastructure (repository indexing, code-aware embeddings) had to be developed. In addition, context limits became a pressing constraint only once coding assistants were widely adopted. RAG sidesteps the context limit by not sending the whole codebase. A retrieval step (e.g. semantic search over embeddings or a code index) selects a subset of files or snippets relevant to the programmer’s request. Only that subset is concatenated into the prompt, so the model’s fixed context window holds the query plus the retrieved material instead of the entire repo. The model’s output is therefore grounded in actual codebase structure rather than generic patterns. Cursor, GitHub Copilot Chat, and others adopted RAG for codebase search. Programmers could point the assistant at a repo and get answers grounded in its structure and contents.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Lew20&quot; href=&quot;#ref-Lew20-back&quot;&gt;[Lew+20]&lt;/a&gt; Lewis, P., et al. 2020. &quot;Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.&quot; &lt;em&gt;Advances in NeurIPS&lt;/em&gt; 33. Available at &lt;a href=&quot;https://arxiv.org/abs/2005.11401&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;ai-agentic-2023&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2023–2024. Long-context and agentic interfaces expand scope&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; RAG addressed context limits by supplying a retrieved subset of the codebase to the model, but the assistant remained a single-turn completer. It produced output only in response to the current prompt and had no ability to execute tools, query the repository, run tests, or incorporate execution results into the next step. Any task that required multiple steps (for example, fixing failing tests by running the test suite, reading failures, editing code, and re-running until green) therefore had to be orchestrated entirely by the programmer, who ran each step, read the outcome, and re-prompted by hand. The cognitive and manual burden of multi-step tasks stayed with the programmer rather than shifting to the assistant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Two developments unfolded over 2023 and 2024.&lt;/p&gt;

&lt;p&gt;First, context windows grew. Models with 100k-token context (e.g. Claude 2, GPT-4 Turbo) reached production in 2023, and 200k-token windows became available by 2024. Entire moderate-sized repositories could fit in context, so the model could reason about architectural patterns, cross-file dependencies, and project-wide conventions without retrieval.&lt;/p&gt;

&lt;p&gt;Second, agentic interfaces enabled multi-step behaviour. The enabling mechanism is tool use (function calling). The model emits structured tool invocations (e.g. run command, read file, edit file, run tests). The host executes them and appends the results to the model context, so that the model chooses the next action in a repeating plan, act, observe loop. Cursor embedded this pattern in the IDE. Devin (Cognition, March 2024) applied it to autonomous multi-file coding. Claude’s “computer use” capability &lt;a href=&quot;#ref-Ant24cu&quot; id=&quot;ref-Ant24cu-back&quot;&gt;[Ant24cu]&lt;/a&gt; (Anthropic, October 2024) extended it to direct desktop control (cursor, keyboard, screen) in addition to structured tool APIs.&lt;/p&gt;

&lt;p&gt;A single request such as “fix the failing tests” could thus trigger a multi-step workflow (run tests, read failures, locate code, generate fixes, rerun tests, iterate) without the programmer re-prompting at each step. Multi-agent systems (e.g. MetaGPT, Devin) went beyond a single model driving tools by deploying several agents that divide the work. Each agent has a distinct role (e.g. planning, coding, reviewing, testing), and they pass outputs to one another so that planning, implementation, and verification are separated and sequenced rather than performed by one monolithic assistant.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Ant24cu&quot; href=&quot;#ref-Ant24cu-back&quot;&gt;[Ant24cu]&lt;/a&gt; Anthropic. 2024. &quot;Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku.&quot; Available at &lt;a href=&quot;https://www.anthropic.com/news/3-5-models-and-computer-use&quot; target=&quot;_blank&quot;&gt;anthropic.com&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;ai-reasoning-2024&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2024. Extended reasoning and enterprise fine-tuning complete the AI coding assistant stack&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; By 2024, coding assistants combined RLHF, RAG, long-context windows, and agentic tool use. Two gaps remained.&lt;/p&gt;

&lt;p&gt;The first gap was the absence of an explicit reasoning phase before the model produced code. Many tasks benefited from weighing options before committing, such as diagnosing a failing test whose cause might lie in several files or choosing among plausible implementations. The autoregressive model used in these assistants did not. It generated one token at a time, at each step computing the distribution over the next token given the prefix and then emitting it.&lt;/p&gt;

\[P(x_t \mid x_1, \ldots, x_{t-1})\]

&lt;p&gt;That autoregressive model did not weigh alternatives before committing. When asked to fix a bug, it could output the first line of a patch immediately, without having considered other possible causes. A human might consider several alternatives before writing any code; such a model did not, and on those tasks its outputs were often wrong or suboptimal.&lt;/p&gt;

&lt;p&gt;The second gap was a distribution mismatch between model output and each organization’s codebase. RAG and long-context windows both supplied the organization’s code as input in the prompt, so that the model had access to it at inference. The weights, however, had been learned only on public corpora and did not change at inference. The model could reuse names or patterns from the prompt, but when the prompt did not fully determine style, structure, or naming, it fell back on what it had learned in training. Output often looked more like public repos than the organization’s code, so programmers edited heavily or rejected it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; Each gap had a direct technical fix.&lt;/p&gt;

&lt;p&gt;The first gap was the absence of explicit reasoning before committing to code. Extended-reasoning models solved it. Models such as o1 &lt;a href=&quot;#ref-Ope24&quot; id=&quot;ref-Ope24-back&quot;&gt;[Ope24]&lt;/a&gt; add an internal chain-of-thought phase before the final output. Instead of generating code tokens directly from the user prompt, the model first generates a sequence of reasoning tokens $r_1, \ldots, r_k$ and then the answer tokens $y_1, \ldots, y_n$ (the code). The user sees only the answer. The next-token distribution at each step conditions on the full prefix, including the model’s own reasoning, so the model can explore steps or alternatives before committing to code. Formally, the output distribution is&lt;/p&gt;

\[P(y_{1:n} \mid x) = \sum_{r_{1:k}} P(r_{1:k} \mid x)\, P(y_{1:n} \mid x, r_{1:k}).\]

&lt;p&gt;In practice the model is trained to produce $(r_{1:k}, y_{1:n})$ and is given more inference-time compute for the reasoning segment. For complex tasks (e.g. multi-file debugging or choosing among implementations) this yielded substantially better results than direct generation.&lt;/p&gt;

&lt;p&gt;The second gap was a distribution mismatch between model output and each organization’s codebase. Enterprise fine-tuning solved it. The model’s parameters are updated on the organization’s code. Let $\theta$ denote the base parameters (trained on public corpora). Let $\mathcal{D}_{\mathrm{org}}$ denote the organization’s dataset (e.g. proprietary code or prompt–completion pairs). Fine-tuning minimizes the negative log-likelihood on $\mathcal{D}_{\mathrm{org}}$,&lt;/p&gt;

\[\mathcal{L}(\theta) = -\sum_{(x,y) \in \mathcal{D}_{\mathrm{org}}} \log P_\theta(y \mid x),\]

&lt;p&gt;yielding parameters $\theta_{\mathrm{org}}$ that assign higher probability to continuations consistent with the organization’s style, naming, and structure. The model’s default behaviour at inference then reflects the fine-tuning corpus rather than public code. Copilot Enterprise (2024) &lt;a href=&quot;#ref-Git24&quot; id=&quot;ref-Git24-back&quot;&gt;[Git24]&lt;/a&gt; offered such customization on proprietary repositories.&lt;/p&gt;

&lt;p&gt;By 2024 the ecosystem had diversified. Developers could choose among multiple leading models (Claude, GPT-4, Gemini, open code models such as DeepSeek Coder) and AI-native IDEs (Cursor, Windsurf) alongside incumbent tools. The jump from 2021 Copilot to 2025-era assistants came not mainly from larger base models but from adding RAG, long context, tool use, extended reasoning, and enterprise fine-tuning. Those additions changed what the assistant can do and how well it matches an organization’s codebase.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Git24&quot; href=&quot;#ref-Git24-back&quot;&gt;[Git24]&lt;/a&gt; GitHub. 2024. &quot;Fine-tuned models are now in limited public beta for GitHub Copilot Enterprise.&quot; Available at &lt;a href=&quot;https://github.blog/news-insights/product-news/fine-tuned-models-are-now-in-limited-public-beta-for-github-copilot-enterprise&quot; target=&quot;_blank&quot;&gt;github.blog&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Ope24&quot; href=&quot;#ref-Ope24-back&quot;&gt;[Ope24]&lt;/a&gt; OpenAI. 2024. &quot;Introducing OpenAI o1.&quot; Available at &lt;a href=&quot;https://openai.com/o1/&quot; target=&quot;_blank&quot;&gt;openai.com&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;ai-benchmarks-2024&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;2024. Code evals establish comparable benchmarks and reveal the gap to real-world tasks&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem.&lt;/strong&gt; HumanEval was introduced alongside Codex in the 2021 Codex paper &lt;a href=&quot;#ref-CKB21&quot;&gt;[CKB+21]&lt;/a&gt; and gave the field its first standardized benchmark for code generation. However, it only measured function-level generation from docstrings. Modifying a large, unfamiliar codebase from an ambiguous bug report was a different kind of work and still had no shared evaluation. The field could not separate algorithmic performance from real-world codebase capability, so capability claims that mixed the two were not distinguishable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution.&lt;/strong&gt; SWE-bench &lt;a href=&quot;#ref-JYW24&quot; id=&quot;ref-JYW24-back&quot;&gt;[JYW+24]&lt;/a&gt; in 2024 supplied the missing benchmark for codebase-editing evaluation. SWE-bench Verified is the curated subset with validated, solvable tasks used for the results reported here. Each instance is an actual bug from open-source repos such as Django, Flask, Matplotlib, and Scikit-learn. The model gets the GitHub issue and must produce a patch that passes the project’s test suite. Success depends on locating the relevant code, respecting architecture and invariants, and avoiding regressions.&lt;/p&gt;

&lt;p&gt;In June 2024 Claude 3.5 Sonnet reached 93% on HumanEval and 33.5% on SWE-bench Verified &lt;a href=&quot;#ref-Ant24&quot; id=&quot;ref-Ant24-back&quot;&gt;[Ant24]&lt;/a&gt;. The same model thus showed a wide spread between the two benchmarks. On SWE-bench Verified, GPT-4 reached 1.74% in early 2024, OpenAI o1 reached 48.9% in December 2024 &lt;a href=&quot;#ref-Ope24&quot;&gt;[Ope24]&lt;/a&gt;, and Claude 4 reached 72.5% in May 2025 &lt;a href=&quot;#ref-Ant25&quot; id=&quot;ref-Ant25-back&quot;&gt;[Ant25]&lt;/a&gt;. The spread between function-level generation under clear specs and codebase editing under ambiguous, multi-file constraints is therefore substantial. Leaderboards at &lt;a href=&quot;https://www.swebench.com/&quot; target=&quot;_blank&quot;&gt;swebench.com&lt;/a&gt; track current results, and any capability claim must state the benchmark and task class.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Ant24&quot; href=&quot;#ref-Ant24-back&quot;&gt;[Ant24]&lt;/a&gt; Anthropic. 2024. &quot;Claude 3.5 Sonnet.&quot; Available at &lt;a href=&quot;https://www.anthropic.com/claude/sonnet&quot; target=&quot;_blank&quot;&gt;anthropic.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Ant25&quot; href=&quot;#ref-Ant25-back&quot;&gt;[Ant25]&lt;/a&gt; Anthropic. 2025. &quot;Introducing Claude 4.&quot; Available at &lt;a href=&quot;https://www.anthropic.com/news/claude-4&quot; target=&quot;_blank&quot;&gt;anthropic.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Ope24&quot; href=&quot;#ref-Ope24-back&quot;&gt;[Ope24]&lt;/a&gt; OpenAI. 2024. &quot;Introducing OpenAI o1.&quot; Available at &lt;a href=&quot;https://openai.com/o1/&quot; target=&quot;_blank&quot;&gt;openai.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-JYW24&quot; href=&quot;#ref-JYW24-back&quot;&gt;[JYW+24]&lt;/a&gt; Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., &amp;amp; Narasimhan, K. 2024. &quot;SWE-bench: Can Language Models Resolve Real-World GitHub Issues?&quot; &lt;em&gt;ICLR 2024&lt;/em&gt;. Available at &lt;a href=&quot;https://arxiv.org/abs/2310.06770&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;discussion&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Discussion&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;The historical framework above equips us with a lens to understand where AI coding stands today and what its impact may be. The following sections use that lens to further assess the impact of AI coding on software engineering.&lt;/p&gt;

&lt;h3 id=&quot;where-ai-fits&quot;&gt;The internet, cloud, and mobile eras put AI in context&lt;/h3&gt;

&lt;p&gt;The internet (TCP/IP, 1983) became a common foundation for connecting machines and distributing software. Cloud computing (AWS EC2, 2006) turned infrastructure from capital expenditure into operational expense and enabled elastic scaling. Mobile (iPhone and Android, 2007–2008) made the phone a general-purpose computer and established app stores as a dominant distribution channel. All three changed how software reached users. AI coding operates at a different layer. It alters how code is produced, not how it reaches users. Nevertheless, our framework does not settle the magnitude of AI’s economic impact relative to the internet, cloud, or mobile.&lt;/p&gt;

&lt;h3 id=&quot;whether-ai-displace-saas&quot;&gt;Verification and maintenance costs determine whether AI displaces SaaS&lt;/h3&gt;

&lt;p&gt;SaaS prevails where the cost of building, operating, and maintaining software in-house has historically exceeded the cost of subscription. Vendors amortize development, maintenance, security, and compliance across many customers. AI may lower the cost of initial construction and can reduce ongoing maintenance, integration, and compliance. In each use case, subscription is displaced only when AI-assisted in-house development costs less in total than subscribing. Bacchelli and Bird &lt;a href=&quot;#ref-BB13&quot; id=&quot;ref-BB13-back&quot;&gt;[BB13]&lt;/a&gt; find that the expertise to verify code matches the expertise to write it, so verification cannot be offloaded yet and remains a large share of in-house cost. Where that holds, total in-house cost may stay above subscription even when AI lowers the cost of producing code.&lt;/p&gt;

&lt;p&gt;SaaS has other moats that in-house builds do not easily reproduce. Vendors spread the cost of compliance certifications (e.g. SOC 2, HIPAA), availability and SLAs, ongoing R&amp;amp;D, and data that grows with the customer base. A single organization replicating that must bear the full cost of audits, redundancy, feature development, and acquiring equivalent data.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-BB13&quot; href=&quot;#ref-BB13-back&quot;&gt;[BB13]&lt;/a&gt; Bacchelli, A. &amp;amp; Bird, C. 2013. &quot;Expectations, Outcomes, and Challenges of Modern Code Review.&quot; &lt;em&gt;Proceedings of ICSE&lt;/em&gt;, 712-721. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.5555/2486788.2486882&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;each-past-abstraction&quot;&gt;Each past abstraction eliminated the need to acquire entire areas of knowledge&lt;/h3&gt;

&lt;p&gt;Successive abstractions removed entire domains of knowledge from the programmer’s burden. FORTRAN let programmers write in a high-level language instead of coding machine instructions. Unix removed vendor-specific system calls and device interfaces. Relational databases removed the need to understand the physical storage layout. TCP/IP obviated the knowledge of each network’s internals so any machine could talk to any other on a single global network. In each case the abstraction was sound. Programmers could rely on it without verifying the layer below.&lt;/p&gt;

&lt;p&gt;The AI case differs. Coding assistants significantly reduce the effort of producing code, but not the need to verify it. Past abstractions eliminated the need to acquire certain knowledge. AI may accelerate production without removing the expertise required to evaluate and maintain the result. Whether that distinction holds as tools evolve remains an open question.&lt;/p&gt;

&lt;h3 id=&quot;english-not-pl&quot;&gt;English is not a programming language&lt;/h3&gt;

&lt;p&gt;One obstacle persists no matter how capable AI becomes at testing and verification. The artifact that is stored, run, reviewed, and maintained is code, not the natural-language prompts that may have produced it. Meyer puts it directly. Programmers save the source code, not the prompts, because prompts cannot serve as reproducible specification &lt;a href=&quot;#ref-Mey25&quot; id=&quot;ref-Mey25-back&quot;&gt;[Mey25]&lt;/a&gt;. English is not a programming language because the code is not in English.&lt;/p&gt;

&lt;p&gt;Programming languages serve two important purposes. First, they eliminate ambiguity for machines. Natural language is inherently ambiguous. Berry and Kamsties show that ambiguity in requirements is inescapable; different readers take different meanings from the same text &lt;a href=&quot;#ref-BK04&quot; id=&quot;ref-BK04-back&quot;&gt;[BK04]&lt;/a&gt;. “Export recent orders” leaves format, date range, and fields unspecified. “Retry on failure” leaves how many attempts, which exceptions, and whether to back off unspecified. “Delete inactive users” leaves the inactivity threshold and soft-delete versus purge unspecified. The same prompt yields different code from an LLM on different runs.&lt;/p&gt;

&lt;p&gt;Second, they force precision in human thinking. Writing code commits to each choice. In Python you might call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;export_orders(since_days=7, format=&apos;csv&apos;, fields=[&apos;id&apos;, &apos;total&apos;, &apos;created_at&apos;])&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;retry(times=3, on=TimeoutError)&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_inactive_users(inactive_since_days=90, soft=False)&lt;/code&gt;. Each argument answers a question the English left open. The language demands answers. The discipline of expressing logic in code makes the logic itself clearer.&lt;/p&gt;

&lt;p&gt;Intentional Software and generations of research aimed to let humans specify intent without writing code. The idea was that domain experts would edit in their own notation (e.g. tax rules or business logic in domain vocabulary) and the system would maintain a single representation and generate code, like WYSIWYG for documents but for software &lt;a href=&quot;#ref-Sim95&quot; id=&quot;ref-Sim95-back&quot;&gt;[Sim95]&lt;/a&gt;. The vision was influential. Intentional Software was acquired by Microsoft in 2017, but the approach never became mainstream. Nevertheless, the dream persists and may be more accessible than ever.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Mey25&quot; href=&quot;#ref-Mey25-back&quot;&gt;[Mey25]&lt;/a&gt; Meyer, C. 2025. &quot;English Isn&apos;t a Programming Language.&quot; &lt;em&gt;Substack&lt;/em&gt;. Available at &lt;a href=&quot;https://csmeyer.substack.com/p/english-isnt-a-programming-language&quot; target=&quot;_blank&quot;&gt;csmeyer.substack.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-BK04&quot; href=&quot;#ref-BK04-back&quot;&gt;[BK04]&lt;/a&gt; Berry, D. M. &amp;amp; Kamsties, E. 2004. &quot;Ambiguity in Requirements Specification.&quot; In &lt;em&gt;Perspectives on Software Requirements&lt;/em&gt;, Springer, 7-44. Available at &lt;a href=&quot;https://link.springer.com/chapter/10.1007/978-1-4615-0465-8_2&quot; target=&quot;_blank&quot;&gt;link.springer.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Sim95&quot; href=&quot;#ref-Sim95-back&quot;&gt;[Sim95]&lt;/a&gt; Simonyi, C. 1995. &quot;The Death of Computer Languages, The Birth of Intentional Programming.&quot; Microsoft Research Technical Report MSR-TR-95-52. Available at &lt;a href=&quot;https://www.microsoft.com/en-us/research/publication/the-death-of-computer-languages-the-birth-of-intentional-programming/&quot; target=&quot;_blank&quot;&gt;microsoft.com&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;open-source-ai&quot;&gt;Open source creation and maintenance both benefit from AI&lt;/h3&gt;

&lt;p&gt;Empirical work on open source suggests that both creation and maintenance benefit from AI. Hoffmann et al. find that maintainers with access to GitHub Copilot increase coding activity and reduce project management load, and that these effects persist for at least two years &lt;a href=&quot;#ref-HBB24&quot; id=&quot;ref-HBB24-back&quot;&gt;[HBB24]&lt;/a&gt;. Yeverechyahu et al. find that maintenance-related contributions rise more than original contributions &lt;a href=&quot;#ref-YMO24&quot; id=&quot;ref-YMO24-back&quot;&gt;[YMO24]&lt;/a&gt;, so the larger gain appears to be in maintenance.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-HBB24&quot; href=&quot;#ref-HBB24-back&quot;&gt;[HBB24]&lt;/a&gt; Hoffmann, M., Boysel, S., et al. 2024. &quot;Generative AI and the Nature of Work.&quot; &lt;em&gt;SSRN&lt;/em&gt;. Available at &lt;a href=&quot;https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5007084&quot; target=&quot;_blank&quot;&gt;papers.ssrn.com&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-YMO24&quot; href=&quot;#ref-YMO24-back&quot;&gt;[YMO24]&lt;/a&gt; Yeverechyahu, D., Mayya, R., &amp;amp; Oestreicher-Singer, G. 2024. &quot;The Impact of Large Language Models on Open-source Innovation.&quot; &lt;em&gt;SSRN&lt;/em&gt;. Available at &lt;a href=&quot;https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4684662&quot; target=&quot;_blank&quot;&gt;papers.ssrn.com&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;pl-not-consolidating&quot;&gt;Languages, frameworks, and tools are consolidating, and AI may accelerate the trend&lt;/h3&gt;

&lt;p&gt;Consolidation around a small number of languages, frameworks, and ecosystems has long been the norm. Language adoption follows a power law, with a few languages accounting for most use &lt;a href=&quot;#ref-MR13&quot; id=&quot;ref-MR13-back&quot;&gt;[MR13]&lt;/a&gt;. Gu et al. run thousands of algorithmic coding tasks and hundreds of framework selection tasks and find that mainstream languages and frameworks achieve significantly higher success rates in AI-generated code than niche ones &lt;a href=&quot;#ref-Gu25&quot; id=&quot;ref-Gu25-back&quot;&gt;[Gu25]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Twist et al. give eight LLMs coding tasks with expert-written reference solutions that specify which libraries to use. They find that LLMs import dominant libraries like NumPy even when those libraries do not appear in the reference solution, in up to 48% of cases. For language choice, the same models are given project initialization tasks in domains where Python is suboptimal for performance. Python is still chosen in 58% of cases and Rust zero times &lt;a href=&quot;#ref-Twist25&quot; id=&quot;ref-Twist25-back&quot;&gt;[Twist25]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;On the frontend, observers note that models default to React because it dominates training data, even when simpler approaches would serve the task &lt;a href=&quot;#ref-NS25&quot; id=&quot;ref-NS25-back&quot;&gt;[NS25]&lt;/a&gt;. The winner-take-all pattern predates AI, however, current models are accelerating it.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-MR13&quot; href=&quot;#ref-MR13-back&quot;&gt;[MR13]&lt;/a&gt; Meyerovich, L. A. &amp;amp; Rabkin, A. S. 2013. &quot;Empirical analysis of programming language adoption.&quot; &lt;em&gt;Proceedings of OOPSLA&lt;/em&gt;. ACM. Available at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/2509136.2509515&quot; target=&quot;_blank&quot;&gt;dl.acm.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Gu25&quot; href=&quot;#ref-Gu25-back&quot;&gt;[Gu25]&lt;/a&gt; Gu, F., Liang, Z., Ma, J., &amp;amp; Li, H. 2025. &quot;The Matthew Effect of AI Programming Assistants: A Hidden Bias in Software Evolution.&quot; &lt;em&gt;arXiv:2509.23261&lt;/em&gt;. Available at &lt;a href=&quot;https://arxiv.org/abs/2509.23261&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Twist25&quot; href=&quot;#ref-Twist25-back&quot;&gt;[Twist25]&lt;/a&gt; Twist, L., Zhang, J. M., Harman, M., Syme, D., Noppen, J., Yannakoudakis, H., &amp;amp; Nauck, D. 2025. &quot;A Study of LLMs&apos; Preferences for Libraries and Programming Languages.&quot; &lt;em&gt;arXiv:2503.17181&lt;/em&gt;. Available at &lt;a href=&quot;https://arxiv.org/abs/2503.17181&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-NS25&quot; href=&quot;#ref-NS25-back&quot;&gt;[NS25]&lt;/a&gt; Cass, S. 2025. &quot;Web Development in 2025. AI&apos;s React Bias vs. Native Web.&quot; &lt;em&gt;The New Stack&lt;/em&gt;. Available at &lt;a href=&quot;https://thenewstack.io/web-development-in-2025-ais-react-bias-vs-native-web/&quot; target=&quot;_blank&quot;&gt;thenewstack.io&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;ai-improve-abstractions&quot;&gt;Can AI improve existing abstraction layers?&lt;/h3&gt;

&lt;p&gt;Research shows AI improving existing abstraction layers in several domains. Learned query optimizers outperform classical optimizers on some workloads, and GenJoin consistently outperforms PostgreSQL on standard benchmarks &lt;a href=&quot;#ref-Gen24&quot; id=&quot;ref-Gen24-back&quot;&gt;[Gen24]&lt;/a&gt;. In compilers, models trained on LLVM IR and assembly reach a substantial fraction of autotuning search potential &lt;a href=&quot;#ref-Met24&quot; id=&quot;ref-Met24-back&quot;&gt;[Met24]&lt;/a&gt;. In cloud infrastructure, reinforcement learning for dynamic resource allocation has been shown to reduce CPU allocation and improve utilization over rule-based autoscaling &lt;a href=&quot;#ref-Fet23&quot; id=&quot;ref-Fet23-back&quot;&gt;[Fet23]&lt;/a&gt;. Nevertheless, these results show that AI is already delivering real gains within existing abstraction layers.&lt;/p&gt;

&lt;p&gt;The cost framework in this article implies that new abstractions emerge when the cost of the incumbent exceeds the cost of the alternative. Further, past transitions such as relational algebra, garbage collection, and TCP/IP required conceptual shifts. AI may lower the cost of exploring new designs. Whether that yields qualitatively new abstractions, such as new models of concurrency, persistence, or distribution, or meaningfully better cloud, databases, or languages remains to be seen.&lt;/p&gt;

&lt;div class=&quot;section-references&quot;&gt;
&lt;strong&gt;References&lt;/strong&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Gen24&quot; href=&quot;#ref-Gen24-back&quot;&gt;[Gen24]&lt;/a&gt; Sulimov, P., Lehmann, C., &amp;amp; Stockinger, K. 2024. &quot;GenJoin: Conditional Generative Plan-to-Plan Query Optimizer.&quot; &lt;em&gt;arXiv:2411.04525&lt;/em&gt;. Available at &lt;a href=&quot;https://arxiv.org/abs/2411.04525&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Met24&quot; href=&quot;#ref-Met24-back&quot;&gt;[Met24]&lt;/a&gt; Meta. 2024. &quot;LLM Compiler: Foundation Models of Compiler Optimization.&quot; &lt;em&gt;arXiv:2407.02524&lt;/em&gt;. Available at &lt;a href=&quot;https://arxiv.org/abs/2407.02524&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;ref-item&quot;&gt;&lt;a id=&quot;ref-Fet23&quot; href=&quot;#ref-Fet23-back&quot;&gt;[Fet23]&lt;/a&gt; Fettes, Q., Karanth, A., Bunescu, R., Beckwith, B., &amp;amp; Subramoney, S. 2023. &quot;Reclaimer: A Reinforcement Learning Approach to Dynamic Resource Allocation for Cloud Microservices.&quot; &lt;em&gt;arXiv:2304.07941&lt;/em&gt;. Available at &lt;a href=&quot;https://arxiv.org/abs/2304.07941&quot; target=&quot;_blank&quot;&gt;arxiv.org&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Conclusion&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Seven decades of software engineering followed one dynamic. When the cost of manual work exceeded the cost of automation, the abstraction won. What counted as cost varied. Programmer time, portability, errors, capital, and scale each drove different shifts and produced layer upon layer of abstractions that reduced cost and expanded what was possible.&lt;/p&gt;

&lt;p&gt;AI coding fits the same cost logic. Assistants reduce programmer time on mechanical work and accelerate production. Open source evidence suggests both creation and maintenance benefit. Code remains the durable artifact that teams can review, refine, and own. Consolidation around dominant languages and frameworks may deepen, a pattern that has long applied.&lt;/p&gt;

&lt;p&gt;Research already shows AI improving existing layers. Query optimizers, compilers, and cloud resource allocation all see gains. Whether AI will yield qualitatively new abstractions remains open. Better tooling and formal methods may narrow the verification gap further. Software engineering has absorbed every prior shift, and there is good reason to expect it to absorb this one.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Revisiting Moneyball</title>
   <link href="https://djpardis.com/blog/2025/07/24/revisiting-moneyball/"/>
   <updated>2025-07-24T00:00:00+00:00</updated>
   <id>https://djpardis.com/blog/2025/07/24/revisiting-moneyball</id>
   <content type="html">&lt;div class=&quot;update-container post-container&quot;&gt;
&lt;strong&gt;Update, July 25, 2025.&lt;/strong&gt; This post reached the front page of &lt;a href=&quot;https://news.ycombinator.com/item?id=44676348&quot; target=&quot;_blank&quot;&gt;Hacker News&lt;/a&gt;. Thanks to &lt;a href=&quot;https://x.com/matsonj&quot; target=&quot;_blank&quot;&gt;@matsonj&lt;/a&gt; and &lt;a href=&quot;https://x.com/akm&quot; target=&quot;_blank&quot;&gt;@akm&lt;/a&gt; for the heads-up.&lt;br /&gt;
&lt;!-- 2. &lt;a href=&quot;https://x.com/WiLuisE/status/1948550390397763759&quot; target=&quot;_blank&quot;&gt;My friend Luis&lt;/a&gt; used &lt;a href=&quot;https://notebooklm.google.com/notebook/a28919ba-fc09-43e6-8fa6-491365e8525a?artifactId=e9345839-0370-471a-83b1-99cd10227847&quot; target=&quot;_blank&quot;&gt;NotebookLM&lt;/a&gt; to create a podcast discussion of this post.&lt;br&gt;&lt;br&gt;
&lt;div style=&quot;position: relative; width: 100%; height: 140px; max-width: 100%; overflow: hidden;&quot;&gt;
  &lt;iframe src=&quot;https://jumpshare.com/share/chrIh55BD3L7dXfrQDjw&quot; width=&quot;100%&quot; height=&quot;140&quot; frameborder=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;position: absolute; top: 0; left: 0; width: 100%; height: 140px !important; max-height: 140px !important; aspect-ratio: unset !important; overflow: hidden;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt; --&gt;
&lt;/div&gt;

&lt;div class=&quot;toc-container post-container&quot;&gt;
&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#understanding-moneyball&quot;&gt;Understanding Moneyball&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#the-2001-as-lost-significant-talent&quot;&gt;The 2001 A&apos;s lost significant talent.&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#analytics-unlocks-transparent-decision-making&quot;&gt;The A&apos;s used analytics to unlock transparent decision-making.&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#getting-on-base-was-undervalued&quot;&gt;Getting on base was undervalued.&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#scott-hatteberg-was-undervalued&quot;&gt;Scott Hatteberg was undervalued.&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#the-streak-was-historic-and-remarkable&quot;&gt;The streak was historic and remarkable.&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#you-can-build-a-player-in-aggregate&quot;&gt;You can build a player in aggregate.&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#the-playoffs-are-a-crapshoot&quot;&gt;The playoffs are a crapshoot.&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#its-hard-not-to-be-romantic-about-baseball&quot;&gt;It&apos;s hard not to be romantic about baseball.&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#critiquing-moneyball&quot;&gt;Critiquing Moneyball&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#moneyball-overlooks-existing-talent&quot;&gt;Moneyball overlooks existing talent on the 2002 roster.&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#moneyball-promotes-low-payrolls&quot;&gt;Moneyball promotes low payrolls in baseball, thus ruining the game.&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#moneyball-promotes-analytics&quot;&gt;Moneyball promotes the use of analytics in baseball, thus ruining the sport.&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#moneyball-suggests-baseball-is-unfair&quot;&gt;Moneyball suggests baseball is unfair, even though it&apos;s not that unfair, comparatively speaking.&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div class=&quot;post-hero-image&quot;&gt;
&lt;img src=&quot;/files/pics/blog/2025/oakland1.jpg&quot; alt=&quot;Oakland Coliseum&quot; /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Oakland Coliseum, September 2022.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Introduction&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Why are we still talking about Moneyball? Why am I talking about Moneyball?&lt;/p&gt;

&lt;p&gt;For one, I’ve been meaning to write this post since 2019. &lt;a href=&quot;https://x.com/djpardis/status/1089264609305911296&quot; target=&quot;_blank&quot;&gt;2018&lt;/a&gt; even.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/friendsarethebest.png&quot; alt=&quot;My friends are the best&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;My friends are the best. From &lt;a href=&quot;https://twitter.com/djpardis/status/1316095842434134017&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The book was published in 2003, and the movie was released in 2011. It feels silly to rehash, except the whole thing fascinates fans two decades later. Why? On the one hand, it is loved because it tells the classic underdog story while suggesting that nerdy analytics contributed to the wins; on the other, and especially as the A’s leave Oakland, it is blamed for encouraging low payrolls and critiqued for overstating the impact of analytics on the 2002 season and beyond.&lt;/p&gt;

&lt;div class=&quot;image-row&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;
    &lt;img src=&quot;/files/pics/blog/2025/moneytweets1.png&quot; alt=&quot;Moneyball tweet 1&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;image-container&quot;&gt;
    &lt;img src=&quot;/files/pics/blog/2025/moneytweets2.png&quot; alt=&quot;Moneyball tweet 2&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;image-container&quot;&gt;
    &lt;img src=&quot;/files/pics/blog/2025/moneytweets3.png&quot; alt=&quot;Moneyball tweet 3&quot; /&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;From &lt;a href=&quot;https://twitter.com/djpardis/status/1316095842434134017&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/djpardis/status/1316095842434134017&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;, and &lt;a href=&quot;https://twitter.com/djpardis/status/1316095842434134017&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this post, I’ll go over the author’s intentions for writing the book, followed by popular critiques of Moneyball. The goal is to address some of the recurring debates as we cover the main themes and provide historical context for each.&lt;/p&gt;

&lt;h2 id=&quot;understanding-moneyball&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Understanding Moneyball&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Let’s begin by discussing the book’s central themes and what the author found intriguing about the A’s in 2002. Moneyball has been so influential that many of these themes have since become memes.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/oakland2.jpg&quot; alt=&quot;Oakland Coliseum&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;Oakland Coliseum, September 2022.&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;the-2001-as-lost-significant-talent&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;The 2001 A’s lost significant talent.&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;After losing MVP Jason Giambi (7-year, $120M), Johnny Damon (4-year, $31M), and closer Jason Isringhausen (4-year, $27M), totaling $31.6 million in annual value ($17.1M + $7.75M + $6.75M), the cash-strapped A’s faced an impossible challenge with just a $41 million payroll. This is what made the season irresistible to Michael Lewis as he watched the A’s replace these stars not with expensive equivalents, but with undervalued players.&lt;/p&gt;

&lt;h3 id=&quot;analytics-unlocks-transparent-decision-making&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;The A’s used analytics to unlock transparent decision-making.&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;The A’s GM, Billy Beane, working with statistician Paul DePodesta, used sabermetrics to challenge traditional scouting methods that relied on more subjective evaluations.&lt;/p&gt;

&lt;p&gt;This transparency allowed decisions regarding players to be justified through objective metrics. When the A’s signed Scott Hatteberg or traded for David Justice, they could point to specific evidence, like on-base percentage (OBP) or plate discipline, to justify their decision.&lt;/p&gt;

&lt;h3 id=&quot;getting-on-base-was-undervalued&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Getting on base was undervalued.&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;While scouts focused on batting average, home runs, and RBIs, Beane recognized that OBP had a stronger correlation with run production than any of the traditional metrics.&lt;/p&gt;

&lt;p&gt;The table below reproduces r² from a Bucknell University paper on MLB run scoring that covers 146 MLB team seasons from 1996 through 2000 &lt;a href=&quot;#ref1&quot;&gt;[1]&lt;/a&gt;. Each row regresses runs per game on the statistic in the first column.&lt;/p&gt;

&lt;table class=&quot;sortable&quot; data-sort-default-col=&quot;2&quot; data-sort-default-dir=&quot;desc&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th scope=&quot;col&quot; data-sort-type=&quot;text&quot;&gt;
        &lt;button type=&quot;button&quot; class=&quot;sort-table__btn&quot; aria-label=&quot;Sort by statistic name&quot;&gt;
          &lt;span class=&quot;sort-table__text&quot;&gt;Stat&lt;/span&gt;
          &lt;span class=&quot;sort-table__sort-icon&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;
        &lt;/button&gt;
      &lt;/th&gt;
      &lt;th scope=&quot;col&quot; data-sort-type=&quot;number&quot;&gt;
        &lt;button type=&quot;button&quot; class=&quot;sort-table__btn&quot; aria-label=&quot;Sort by r squared&quot;&gt;
          &lt;span class=&quot;sort-table__text&quot;&gt;r²&lt;/span&gt;
          &lt;span class=&quot;sort-table__sort-icon&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;
        &lt;/button&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;On-base plus slugging (OPS)&lt;/td&gt;&lt;td&gt;.900&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;&lt;strong&gt;On-base percentage (OBP)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;.835&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Slugging percentage (SLG)&lt;/td&gt;&lt;td&gt;.804&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Batting average (BA)&lt;/td&gt;&lt;td&gt;.672&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Home runs (HR)&lt;/td&gt;&lt;td&gt;.542&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Walks (BB)&lt;/td&gt;&lt;td&gt;.404&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Strikeouts (SO)&lt;/td&gt;&lt;td&gt;.078&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Stolen bases (SB)&lt;/td&gt;&lt;td&gt;.001&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Separate work on Southeastern Conference team seasons from 2014 through 2017 still regresses team runs on team on-base percentage at the college level, which supports treating the link as more than a one-off MLB artifact &lt;a id=&quot;ref2-back&quot; href=&quot;#ref2&quot;&gt;[2]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The math was simple. Teams that get on base more frequently score more runs, and teams that score more runs win more games. Yet in 2002, players with high OBP were available at below-market prices. Players like Scott Hatteberg (.361 career OBP) and David Justice (.378 career OBP) were affordable because their most valuable skill, i.e., getting on base, wasn’t appreciated sufficiently by the market.&lt;/p&gt;

&lt;p&gt;The strategy worked. The 2002 A’s ranked 4th in baseball in OBP (.349) despite having the 3rd lowest payroll in baseball.&lt;/p&gt;

&lt;div class=&quot;image-row&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;
    &lt;img src=&quot;/files/pics/blog/2025/obp1.png&quot; alt=&quot;2002 A&apos;s OBP stats&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;image-container&quot;&gt;
    &lt;img src=&quot;/files/pics/blog/2025/obp2.png&quot; alt=&quot;2002 A&apos;s payroll stats&quot; /&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;In 2002, the A’s achieved 4th highest OBP (.349) with 3rd lowest payroll ($40M) &lt;a href=&quot;#ref3&quot;&gt;[3]&lt;/a&gt;&lt;a href=&quot;#ref4&quot;&gt;[4]&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;scott-hatteberg-was-undervalued&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Scott Hatteberg was undervalued.&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Scott Hatteberg was available at a below-market price due to perceived limitations. The former Boston Red Sox catcher developed nerve damage in his throwing elbow, making catching nearly impossible and leaving him without a clear defensive position.&lt;/p&gt;

&lt;p&gt;Oakland signed Hatteberg to first base for $950K on a one-year contract after Colorado declined salary arbitration. The A’s saw past the injury to identify exceptional plate discipline and on-base skills. His .367 OBP in 2000 and .410 OBP in 1999 demonstrated an ability to get on base.&lt;/p&gt;

&lt;p&gt;Catcher-to-first-base conversions are common in baseball, and most are injury-driven. But players like Joe Mauer and Buster Posey had some experience at first base before switching permanently. Hatteberg had zero professional innings at the position. He went from injured catcher to opening day first baseman in one spring training, learning the position from infield coach Ron Washington.&lt;/p&gt;

&lt;p&gt;Hatteberg’s 2002 performance validated the analytical approach: .280/.374/.433 slash line, 68 walks versus 56 strikeouts, and 4.15 pitches per plate appearance (3rd in AL). His defining moment came on September 4, 2002, when his walk-off home run against Kansas City secured the A’s 20th consecutive victory and set an AL record.&lt;/p&gt;

&lt;div class=&quot;video-container&quot; style=&quot;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; max-width: 100%;&quot;&gt;
  &lt;iframe style=&quot;position: absolute; top: 0; left: 0; width: 100%; height: 100%;&quot; src=&quot;https://www.youtube.com/embed/qWMwo_qEQW8&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;
  &lt;/iframe&gt;
&lt;/div&gt;
&lt;p class=&quot;image-caption&quot;&gt;Hatteberg&apos;s historic walk-off home run that gave the A&apos;s a 12–11 win and a then-AL record 20-game winning streak.&lt;/p&gt;

&lt;h3 id=&quot;the-streak-was-historic-and-remarkable&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;The streak was historic and remarkable.&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;The 20-game winning streak (August 13 - September 4, 2002) became the dramatic centerpiece of the Moneyball story.&lt;/p&gt;

&lt;p&gt;During the historic run, Billy Koch (acquired December 7, 2001) recorded wins or saves in 12 of the 20 games, while Cory Lidle (acquired January 8, 2001), who was arguably the streak’s MVP, posted a microscopic 0.20 ERA in August with 32 consecutive scoreless innings.&lt;/p&gt;

&lt;p&gt;The streak’s climactic finish, Oakland blowing an 11–0 lead to Kansas City before Hatteberg’s walk-off home run, provided Hollywood-worthy drama that helped make the analytical approach culturally compelling.&lt;/p&gt;

&lt;p&gt;More importantly, the streak occurred during a season where the A’s won 103 games despite having one of baseball’s lowest payrolls.&lt;/p&gt;

&lt;div class=&quot;image-row&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;
    &lt;img src=&quot;/files/pics/blog/2025/wins1.png&quot; alt=&quot;2002 A&apos;s wins stats&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;image-container&quot;&gt;
    &lt;img src=&quot;/files/pics/blog/2025/wins2.png&quot; alt=&quot;2002 MLB payroll stats&quot; /&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;The A’s achieved 103 wins (tied for MLB lead) with the 3rd lowest payroll ($40M) &lt;a href=&quot;#ref4&quot;&gt;[4]&lt;/a&gt;&lt;a href=&quot;#ref5&quot;&gt;[5]&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;you-can-build-a-player-in-aggregate&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;You can build a player in aggregate.&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;The A’s discovered they could construct effective offensive production by combining players with complementary skills rather than seeking complete players. This insight challenged the traditional scouting preference for “five-tool players” who could hit for average, hit for power, run, field, and throw.&lt;/p&gt;

&lt;p&gt;Instead of expensive superstars, the A’s assembled a roster where different players contributed specific, undervalued skills:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Scott Hatteberg (signed Jan 2, 2002): Injured catcher converted to first baseman, valued for exceptional plate discipline&lt;/li&gt;
  &lt;li&gt;David Justice (traded Dec 14, 2001): Aging slugger acquired cheaply after the Yankees gave up on him, provided veteran power and leadership&lt;/li&gt;
  &lt;li&gt;Jeremy Giambi (traded Feb 18, 2000): High-OBP outfielder whose walk rate offset his defensive limitations&lt;/li&gt;
  &lt;li&gt;Chad Bradford (traded Dec 7, 2000): Submarine reliever overlooked by scouts, acquired for a minor league catcher&lt;/li&gt;
  &lt;li&gt;Billy Koch (traded Dec 7, 2001): Proven closer bought low after a rough 2001, rebounded to win AL Reliever of the Year&lt;/li&gt;
  &lt;li&gt;Cory Lidle (traded Jan 8, 2001): Control pitcher with poor surface stats but strong peripherals, became a reliable #4 starter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This aggregate approach allowed Oakland to compete with teams spending three times their payroll. Rather than paying premium prices for complete players, they constructed a competitive roster through strategic combinations to produce runs and wins.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/buildinagg.png&quot; alt=&quot;Building a player in aggregate&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;From &lt;a href=&quot;https://bsky.app/profile/foolishbb.bsky.social/post/3lawoaoavhc2y&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;the-playoffs-are-a-crapshoot&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;The playoffs are a crapshoot.&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Billy Beane famously told Michael Lewis, “My shit doesn’t work in the playoffs,” acknowledging that his analytical approach, while dominant over 162 games, couldn’t overcome October’s inherent randomness. It’s a “crapshoot” because the best regular season team frequently loses due to small sample sizes and variance inherent in short series &lt;a href=&quot;#ref6&quot;&gt;[6]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The statistical evidence supports this theory. Since the Wild Card era began in 1995, the team with the best regular season record has won the World Series only 8 out of 29 times (a 28% success rate).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/wswinners.png&quot; alt=&quot;World Series winners&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;World Series winners with best regular season record since 1995 and their win-loss records.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Oakland A’s themselves became the perfect case study for this phenomenon. From 2000 to 2003, they averaged 98 wins per season, yet lost in the Division Series (first round of the playoffs) each of the four years, with each series going the full five games.&lt;/p&gt;

&lt;p&gt;Academic analysis by Stanford &lt;a href=&quot;#ref6&quot;&gt;[6]&lt;/a&gt; found no correlation between regular season and postseason performance (p = .6201), while a comprehensive Braves Journal study concluded that playoffs are “90% crapshoot, 10% skill” &lt;a href=&quot;#ref7&quot;&gt;[7]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The small sample size problem is a fundamental issue in the playoffs. While batting statistics require 200+ plate appearances to stabilize, playoff series provide players with only 15–30 plate appearances, resulting in massive variance. Even Oakland’s signature OBP, which is the cornerstone of their analytical advantage, declined 11% in the playoffs (.305 vs .341 regular season), demonstrating how short series neutralize statistical edges.&lt;/p&gt;

&lt;p&gt;This randomness explains why sabermetricians often view regular-season performance as a more reliable indicator of a team’s true quality than its playoff results.&lt;/p&gt;

&lt;h3 id=&quot;its-hard-not-to-be-romantic-about-baseball&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;It’s hard not to be romantic about baseball.&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;The A’s story showed ordinary players achieving extraordinary things when given the opportunity. The analytics enhanced rather than detracted from the game with moments like Bradford’s submarine delivery, Hatteberg’s transformation from injured catcher to first baseman, or the electricity of the 20-game streak.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/romantic.png&quot; alt=&quot;Baseball romance&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;From &lt;a href=&quot;https://x.com/MLB/status/1914109400413016515&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now that we’ve gone through the book’s central themes, let’s discuss how they hold up. In particular, let’s examine them through the lens of criticisms of Moneyball and whether those criticisms are warranted.&lt;/p&gt;

&lt;h2 id=&quot;critiquing-moneyball&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Critiquing Moneyball&lt;/a&gt;&lt;/h2&gt;

&lt;h3 id=&quot;moneyball-overlooks-existing-talent&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Moneyball overlooks existing talent on the 2002 roster.&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;The most substantial criticism of Moneyball, both the book and especially the film, is that it ignored the exceptional talent on the 2002 Oakland roster. The narrative focused so heavily on the undervalued acquisitions that it obscured the presence of conventional superstars.&lt;/p&gt;

&lt;p&gt;The A’s were excellent at drafting talent. And their lineup in 2002 was made up of more than just underdogs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Big Three&lt;/strong&gt;. Tim Hudson (6th round, 1997), Mark Mulder (2nd overall, 1998), and Barry Zito (9th overall, 1999) were all top-10 picks. By 2002, this trio anchored Oakland’s rotation with elite performance.&lt;/p&gt;

&lt;p&gt;Miguel Tejada (international signing, 1993) and Eric Chavez (10th overall, 1996) provided MVP-caliber offense and Gold Glove defense. Both were premium draft investments that matured into franchise cornerstones.&lt;/p&gt;

&lt;p&gt;Miguel Tejada won the 2002 AL MVP award, hitting .308 with 34 home runs and 131 RBIs while providing leadership throughout the season. Barry Zito won the AL Cy Young Award with a 23–5 record and 2.75 ERA. These weren’t marginal players elevated by analytics; they were elite performers by any standard.&lt;/p&gt;

&lt;p&gt;Oakland executives Billy Beane, David Forst, and scout Ron Washington later acknowledged that “there’s no way the A’s make the playoffs every year from 2000 through 2003, and no way a best-selling book and Brad Pitt movie ever happen, if not for the efforts of the Big Three” &lt;a href=&quot;#ref8&quot;&gt;[8]&lt;/a&gt;. The Big Three compiled a collective 261–131 record from 1999–2006, providing the foundation that allowed Beane’s analytical approach to flourish.&lt;/p&gt;

&lt;h3 id=&quot;moneyball-promotes-low-payrolls&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Moneyball promotes low payrolls in baseball, thus ruining the game.&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Oakland won 103 games in 2002 with the third-lowest payroll in baseball. Sports economists have argued that &lt;em&gt;Moneyball&lt;/em&gt; became a cultural reference point in arguments about spending restraint &lt;a id=&quot;ref11-back&quot; href=&quot;#ref11&quot;&gt;[11]&lt;/a&gt;. The critique is that the story made low payrolls look like the intelligent default rather than a limitation.&lt;/p&gt;

&lt;p&gt;However, the actual lesson of Moneyball for the league was not “cheap is good.” In fact, the most successful application of Moneyball principles came from the Boston Red Sox, who were anything but frugal.&lt;/p&gt;

&lt;p&gt;After the 2002 season, Red Sox ownership (led by John W. Henry) tried to hire Billy Beane as a GM with a five-year, $12.5M contract that would have made him the highest paid GM in baseball history. The offer was a testament to how highly Henry and then-CEO Larry Lucchino regarded Beane’s analytical approach.&lt;/p&gt;

&lt;p&gt;Although Beane declined, the Red Sox promoted Theo Epstein to GM in November 2002. Under Epstein, they adopted a more data-driven approach, complemented by their substantial financial resources. In 2004, only two years after the attempt to hire Beane, the Red Sox broke their 86-year Curse of the Bambino by winning the World Series, and again in 2007.&lt;/p&gt;

&lt;p&gt;When DePodesta left Oakland in 2004 to become the Dodgers’ GM, he was replaced by Farhan Zaidi. Zaidi carried forward the analytical tradition through the Dodgers (2014–2018) before joining the San Francisco Giants, where his 2021 team achieved a franchise-record 107 wins, demonstrating how Moneyball principles scale effectively when combined with greater financial resources.&lt;/p&gt;

&lt;p&gt;Fast forward to today, the Dodgers won the 2024 World Series while employing one of baseball’s largest analytics departments (with over 47 personnel, compared to 3 in 1988) and maintaining one of MLB’s highest payrolls. Their championship validated that analytics enables more effective spending rather than reduced spending.&lt;/p&gt;

&lt;p&gt;The 2024 season saw a record nine teams exceed the luxury tax threshold, resulting in $311 million in penalties. The Mets, under Steve Cohen, topped the league with a $333 million payroll, while the A’s payroll of $66.5 million represents ownership decisions rather than analytical ones.&lt;/p&gt;

&lt;p&gt;It turns out that Moneyball is about money. Billy Beane’s Oakland A’s succeeded not because they rejected the importance of money, but because they maximized every dollar’s impact when competing against teams with three times their budget. The book’s enduring relevance lies in demonstrating how intelligence and efficiency can overcome, though maybe not eliminate, financial disadvantage.&lt;/p&gt;

&lt;p&gt;Rather than promoting frugality, Moneyball’s lasting impact has been to make teams more analytical. Every MLB franchise now employs statisticians and data scientists. The result has been more informed decision making at all spending levels, not a reduction in overall spending. Moneyball’s impact, short-term or long-term, has not been frugality.&lt;/p&gt;

&lt;p&gt;The criticisms tend to confuse cause and effect. The analytical methods were in response to frugality, not the cause of it. The cost-cutting was already occurring due to economic pressures unrelated to analytics or the impact of Moneyball.&lt;/p&gt;

&lt;h3 id=&quot;moneyball-promotes-analytics&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Moneyball promotes the use of analytics in baseball, thus ruining the sport.&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;The critique is that &lt;em&gt;Moneyball&lt;/em&gt; helped normalize a front-office style tied to more strikeouts, walks, and home runs, fewer balls in play, longer games, and shifts that many fans experience as less action on the field. Figures inside that movement, including Theo Epstein and Bill James, have said publicly that analytics-driven optimization has hurt the game’s aesthetic.&lt;/p&gt;

&lt;p&gt;It’s important to note that the analytics revolution was already underway before Moneyball. The book brought it into mainstream consciousness rather than creating it. Allan Roth served as baseball’s first team statistician with the Brooklyn Dodgers from 1947–1964, tracking advanced metrics like OBP and developing situational statistics. Earnshaw Cook’s 1964 book “Percentage Baseball” was the first full-length sabermetrics work, while Bill James began publishing Baseball Abstracts in 1977, coining “sabermetrics” in 1980.&lt;/p&gt;

&lt;p&gt;To understand baseball’s numerical history, you can go back even further. Henry Chadwick, an English cricket lover enchanted by the new American sport, created the first box score for a Brooklyn Excelsiors game in 1859, establishing the systematic tracking that made all future analytics possible.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/1876boxscore.jpg&quot; alt=&quot;A vintage 1876 baseball box score showing detailed game statistics in handwritten format&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;From &lt;a href=&quot;https://upload.wikimedia.org/wikipedia/commons/c/cc/1876boxscore.jpg&quot; target=&quot;_blank&quot;&gt;1876&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Analytics has improved baseball in measurable ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Player development&lt;/strong&gt;. Teams now use personalized training programs based on biomechanical analysis, injury prevention through wearable technology, and real-time feedback systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategic decision making&lt;/strong&gt;. Real-time analytics enable managers to optimize defensive positioning, make data-driven pitching changes, and predict injury risks before they occur. Front offices can identify undervalued talent and make smarter roster construction decisions.&lt;/p&gt;

&lt;p&gt;However, the use of analytics has not been without downsides:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The “Three True Outcomes” problem&lt;/strong&gt;. Home runs, walks, and strikeouts now dominate baseball, with 35% of plate appearances ending without involving seven defensive players. This has reduced balls in play by 20% since 1980, resulting in longer games with less action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Entertainment decline&lt;/strong&gt;. As Theo Epstein admitted, “executives like me who have spent a lot of time using analytics…have unwittingly hurt the aesthetic value of the game.” Even Bill James acknowledged the game’s aesthetics “went to hell in a dump truck” due to excessive strikeouts and endless pitching changes.&lt;/p&gt;

&lt;p&gt;Analytics improved baseball’s efficiency but arguably damaged its entertainment value. This is a tradeoff the MLB addressed with 2023 rule changes, including pitch clocks and shift bans.&lt;/p&gt;

&lt;h3 id=&quot;moneyball-suggests-baseball-is-unfair&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Moneyball suggests baseball is unfair, even though it’s not that unfair, comparatively speaking.&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;The critique is that baseball rewards money in a way fans experience as unfair, with postseason outcomes and playoff gates increasingly correlated with payroll rank in recent years.&lt;/p&gt;

&lt;p&gt;However, despite conventional wisdom linking salary caps to competitive balance, the data shows that MLB’s luxury tax system has produced championship diversity equal to hard-cap leagues. Since 1995, counting 30 championship years, 16 different winners each in MLB, NFL, and NHL, with only the NBA lagging at 13 unique champions. This challenges the assumption that financial constraints alone determine competitive outcomes.&lt;/p&gt;

&lt;p&gt;Having said that, recent trends suggest MLB’s historical parity may be eroding under financial pressure. From 2015–2024, eight of ten World Series champions ranked in the top 10 for payroll, with only the 2015 Royals and 2017 Astros winning from outside the top half of spending teams &lt;a href=&quot;#ref9&quot;&gt;[9]&lt;/a&gt;. Meanwhile, in 2024, all six highest-spending teams made the playoffs, while the Dodgers’ $327 million payroll advantage over Miami represents the largest spending gap in modern baseball history.&lt;/p&gt;

&lt;p&gt;Despite this, playoff access remains remarkably broad as 28 of 30 MLB teams (93.3%) have made the playoffs since 2015, compared to 28 of 32 NFL teams (87.5%).&lt;/p&gt;

&lt;p&gt;While the NBA currently enjoys unprecedented parity, seven different champions in seven years (2019–2025), this represents a dramatic shift from its historically dynasty-heavy nature where superstars concentrated championships among elite teams.&lt;/p&gt;

&lt;p&gt;The NFL’s “National Parity League” reputation may be an overstatement. Only five AFC teams (Broncos, Ravens, Patriots, Steelers, Colts) represented the conference in 13 consecutive Super Bowls, while the Chiefs have reached more Super Bowls in six years than any MLB team has World Series appearances in the 21st century &lt;a href=&quot;#ref10&quot;&gt;[10]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This suggests that playoff format, season length, and sport-specific factors matter more than financial structures alone.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Conclusion&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Whether the 2002 A’s succeeded primarily through analytics or traditional talent is less important than the broader principle that analytics can reveal value that conventional wisdom misses. They also showed the power of intelligence, creativity, and sheer persistence in overcoming financial disadvantage.&lt;/p&gt;

&lt;p&gt;Ultimately, it’s hard to overlook the attention that Moneyball has brought to baseball, as it continues to capture the imagination of fans, bandwagoners, and Jonah Hill admirers.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/djpardistwete.png&quot; alt=&quot;DJ Pardis tweet&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;From &lt;a href=&quot;https://x.com/djpardis/status/1188636997557993472&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/billybeane.jpg&quot; alt=&quot;Billy Beane&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;Proud to finally meet Brad Pitt back in October 2019. He gave one of the best talks I ever heard.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/datingmoneyball.png&quot; alt=&quot;Dating Moneyball&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;From &lt;a href=&quot;https://x.com/MoneyballMemes/status/1859314869972902336&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;References&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;a id=&quot;ref1&quot; href=&quot;#ref1-back&quot;&gt;[1]&lt;/a&gt; Bucknell University Baseball Statistics Research (1996–2000). “Runs Scored Correlations.” Available at: &lt;a href=&quot;https://www.eg.bucknell.edu/~bvollmay/baseball/runs1.html&quot; target=&quot;_blank&quot;&gt;https://www.eg.bucknell.edu/~bvollmay/baseball/runs1.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref2&quot; href=&quot;#ref2-back&quot;&gt;[2]&lt;/a&gt; FanGraphs Community Research (2014–2017). “Relationship Between OBP and Runs Scored in College Baseball.” Available at: &lt;a href=&quot;https://community.fangraphs.com/relationship-between-obp-and-runs-scored-in-college-baseball/&quot; target=&quot;_blank&quot;&gt;https://community.fangraphs.com/relationship-between-obp-and-runs-scored-in-college-baseball/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref3&quot; href=&quot;#ref3-back&quot;&gt;[3]&lt;/a&gt; Baseball Almanac - 2002 AL OBP Leaders &lt;a href=&quot;https://www.baseball-almanac.com/yearly/top25.php?s=OBP&amp;amp;l=AL&amp;amp;y=2002&quot; target=&quot;_blank&quot;&gt;https://www.baseball-almanac.com/yearly/top25.php?s=OBP&amp;amp;l=AL&amp;amp;y=2002&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref4&quot; href=&quot;#ref4-back&quot;&gt;[4]&lt;/a&gt; The Baseball Cube - 2002 MLB Team Payrolls &lt;a href=&quot;https://www.thebaseballcube.com/content/payroll_year/2002/&quot; target=&quot;_blank&quot;&gt;https://www.thebaseballcube.com/content/payroll_year/2002/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref5&quot; href=&quot;#ref5-back&quot;&gt;[5]&lt;/a&gt; Baseball-Reference - 2002 MLB Standings &lt;a href=&quot;https://www.baseball-reference.com/leagues/majors/2002-standings.shtml&quot; target=&quot;_blank&quot;&gt;https://www.baseball-reference.com/leagues/majors/2002-standings.shtml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref6&quot; href=&quot;#ref6-back&quot;&gt;[6]&lt;/a&gt; Stanford Sports Analytics Club - “Examining MLB Postseason Cluster Luck: or, Why the Playoffs Might Be a Crapshoot” &lt;a href=&quot;https://stanfordsportsanalytics.wordpress.com/2015/03/24/examining-mlb-postseason-cluster-luck-or-why-the-playoffs-might-be-a-crapshoot/&quot; target=&quot;_blank&quot;&gt;https://stanfordsportsanalytics.wordpress.com/2015/03/24/examining-mlb-postseason-cluster-luck-or-why-the-playoffs-might-be-a-crapshoot/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref7&quot; href=&quot;#ref7-back&quot;&gt;[7]&lt;/a&gt; Braves Journal - “The Playoffs are a Crapshoot” &lt;a href=&quot;https://bravesjournal.com/2019/12/30/the-playoffs-are-a-crapshoot-a-5-part-series-introduction/&quot; target=&quot;_blank&quot;&gt;https://bravesjournal.com/2019/12/30/the-playoffs-are-a-crapshoot-a-5-part-series-introduction/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref8&quot; href=&quot;#ref8-back&quot;&gt;[8]&lt;/a&gt; Grantland - “Baseball’s Big Three: A Look Back at Tim Hudson, Mark Mulder, and Barry Zito in Oakland” &lt;a href=&quot;https://grantland.com/the-triangle/mlb-oakland-as-big-three-tim-hudson-barry-zito-mark-mulder-billy-beane-moneyball/&quot; target=&quot;_blank&quot;&gt;https://grantland.com/the-triangle/mlb-oakland-as-big-three-tim-hudson-barry-zito-mark-mulder-billy-beane-moneyball/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref9&quot; href=&quot;#ref9-back&quot;&gt;[9]&lt;/a&gt; Cronkite News - “How big MLB payrolls affect postseason success” &lt;a href=&quot;https://cronkitenews.azpbs.org/2024/11/12/big-mlb-payrolls-affect-postseason-success/&quot; target=&quot;_blank&quot;&gt;https://cronkitenews.azpbs.org/2024/11/12/big-mlb-payrolls-affect-postseason-success/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref10&quot; href=&quot;#ref10-back&quot;&gt;[10]&lt;/a&gt; The Athletic - “NFL is the parity league? MLB would like a word” &lt;a href=&quot;https://www.nytimes.com/athletic/6116536/2025/02/06/mlb-nfl-parity-super-bowl-world-series/&quot; target=&quot;_blank&quot;&gt;https://www.nytimes.com/athletic/6116536/2025/02/06/mlb-nfl-parity-super-bowl-world-series/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref11&quot; href=&quot;#ref11-back&quot;&gt;[11]&lt;/a&gt; Matheson, Victor. Sportico. “Study Table: Blame Moneyball for Major League Labor Strife.” Available at: &lt;a href=&quot;https://www.sportico.com/leagues/baseball/2022/study-table-blame-moneyball-1234668622/&quot; target=&quot;_blank&quot;&gt;https://www.sportico.com/leagues/baseball/2022/study-table-blame-moneyball-1234668622/&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;crosspost-container post-container&quot;&gt;
This post was originally published on &lt;a href=&quot;https://djpardis.medium.com/revisiting-moneyball-074fc2435b07&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Medium&lt;/a&gt; and is cross-posted here.
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Building an auth flow with AI</title>
   <link href="https://djpardis.com/blog/2025/07/20/introducing-the-data-room-app/"/>
   <updated>2025-07-20T00:00:00+00:00</updated>
   <id>https://djpardis.com/blog/2025/07/20/introducing-the-data-room-app</id>
   <content type="html">&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/data-room-post-vintage-stove.png&quot; alt=&quot;Vintage white Standard Electric stove with curved doors in the foreground, dining room with wooden table and chairs visible through a doorway in the background&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href=&quot;/blog/2025/06/20/vibe-coding-data-room-app/&quot;&gt;first part of this series&lt;/a&gt;, I shared my experience with vibe coding to build a markdown-based data room application for General Folders. In particular, I highlighted best practices for working with AI coding assistants.&lt;/p&gt;

&lt;p&gt;Now, in the second part, I’ll focus on one of the most critical aspects of any application: secure authentication. Specifically, I’ll walk through implementing magic link authentication, a passwordless approach that provides a seamless experience for investors accessing confidential documents.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/dataroom.png&quot; alt=&quot;Data room app main interface&quot; /&gt;
&lt;em&gt;The Data Room App (&lt;a href=&quot;https://thedataroom.app&quot; target=&quot;_blank&quot;&gt;thedataroom.app&lt;/a&gt;) provides a clean, intuitive interface for investors to access confidential documents.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;solving-the-authentication-challenge&quot;&gt;Solving the authentication challenge&lt;/h2&gt;

&lt;p&gt;At first, I tried to implement magic links with Windsurf. Given that auth flows are not my area of expertise, I needed more opinionated help. I then tried Replit which provided a quick and working solution.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://replit.com&quot; target=&quot;_blank&quot;&gt;Replit&lt;/a&gt; is a browser-based IDE and hosting platform that started in 2016 as a simple code playground but has evolved into a comprehensive development environment. What drew me to it was its ability to handle full-stack applications with built-in authentication, along with serverless deployments.&lt;/p&gt;

&lt;h2 id=&quot;replit-vs-windsurf-technical-comparison&quot;&gt;Replit vs Windsurf: technical comparison&lt;/h2&gt;

&lt;h3 id=&quot;replit-features&quot;&gt;Replit features&lt;/h3&gt;

&lt;h4 id=&quot;authentication-and-session-management&quot;&gt;Authentication and session management&lt;/h4&gt;

&lt;p&gt;Replit’s integrated authentication blueprints provide pre-configured &lt;a href=&quot;https://www.passportjs.org/&quot; target=&quot;_blank&quot;&gt;Passport.js&lt;/a&gt; setups, session store integration with PostgreSQL, and automatic HTTPS for secure cookie handling. Magic link authentication was implemented in 15 minutes compared to hours of OAuth configuration debugging.&lt;/p&gt;

&lt;h4 id=&quot;database-integration&quot;&gt;Database integration&lt;/h4&gt;

&lt;p&gt;Replit enables instant PostgreSQL provisioning through &lt;a href=&quot;https://neon.tech&quot; target=&quot;_blank&quot;&gt;Neon&lt;/a&gt;. Connection strings, pooling, and SSL certificates are handled automatically. &lt;a href=&quot;https://drizzleorm.com&quot; target=&quot;_blank&quot;&gt;Drizzle ORM&lt;/a&gt; integration works seamlessly with database push deployments without migration file management.&lt;/p&gt;

&lt;p&gt;Update (July 2025): Replit has since launched &lt;a href=&quot;https://blog.replit.com/introducing-a-safer-way-to-vibe-code-with-replit-databases&quot; target=&quot;_blank&quot;&gt;separate development and production databases&lt;/a&gt;, which makes the platform more suitable for developing real-world applications. This feature enables safer iteration by isolating development changes from live customer data.&lt;/p&gt;

&lt;h4 id=&quot;deployment-infrastructure&quot;&gt;Deployment infrastructure&lt;/h4&gt;

&lt;p&gt;The platform handles load balancing and scaling automatically.&lt;/p&gt;

&lt;h3 id=&quot;replit-limitations&quot;&gt;Replit limitations&lt;/h3&gt;

&lt;p&gt;Git operations lack the intuitiveness of terminal workflows. Even basic branching and commit management feel clunky in the interface. Simple git commands in a terminal turn into vague processes with less feedback and control.&lt;/p&gt;

&lt;p&gt;The container-based environment limits access to lower-level system functions and file operations, which can make debugging more challenging when compared to local development.&lt;/p&gt;

&lt;h3 id=&quot;windsurf-comparison&quot;&gt;Windsurf comparison&lt;/h3&gt;

&lt;p&gt;Windsurf provides control with traditional Git workflows, intuitive file system access, and familiar terminal operations. However, the setup overhead for authentication, databases, and deployment significantly slows initial development velocity compared to Replit’s integrated infrastructure.&lt;/p&gt;

&lt;h2 id=&quot;data-room-platform-implementation&quot;&gt;Data room platform implementation&lt;/h2&gt;

&lt;p&gt;Here are the implementation details we narrowed in on with Replit.&lt;/p&gt;

&lt;h3 id=&quot;tech-stack&quot;&gt;Tech stack&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;React frontend with TypeScript and shadcn/ui components&lt;/li&gt;
  &lt;li&gt;Express.js backend with PostgreSQL database&lt;/li&gt;
  &lt;li&gt;Drizzle ORM for type-safe database operations&lt;/li&gt;
  &lt;li&gt;JWT-based authentication with 7-day token expiration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;architecture-decisions&quot;&gt;Architecture decisions&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Simple content management&lt;/strong&gt;: Founders can organize their data room using a simple Markdown file, just like writing documentation. No clunky admin panels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/sections.png&quot; alt=&quot;Document sections in the data room&quot; /&gt;
&lt;em&gt;Changes to sections.md are reflected instantly - no rebuild required.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Flexible document handling&lt;/strong&gt;: The app supports both direct file uploads (protected by authentication) and links to existing documents in Google Drive or Dropbox.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Passwordless authentication&lt;/strong&gt;: Investors access the data room through magic links sent to their email.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Role-based access&lt;/strong&gt;: Different permissions for founders (who pay for the service) and investors (who get free access).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Engagement analytics&lt;/strong&gt;: Founders can see which documents investors have accessed and when.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Frictionless invitations&lt;/strong&gt;: Founders can invite investors with a simple email, which in turn generates secure access tokens.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;implementation-results&quot;&gt;Implementation results&lt;/h2&gt;

&lt;p&gt;The platform is deployed at &lt;a href=&quot;https://thedataroom.app&quot;&gt;thedataroom.app&lt;/a&gt; and operates in waitlist mode for market validation. The implementation includes PostgreSQL-backed secure sessions, reliable email delivery for magic links and invitations, and robust file security with strict validation and secure storage for documents up to 50MB.&lt;/p&gt;

&lt;h2 id=&quot;development-timeline&quot;&gt;Development timeline&lt;/h2&gt;

&lt;p&gt;Here’s a summary of our development timeline.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Week 1: Authentication and data modeling implementation&lt;/li&gt;
  &lt;li&gt;Week 2: Document management and UI development&lt;/li&gt;
  &lt;li&gt;Week 3: Email system integration and production deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;feedback-request&quot;&gt;Feedback request&lt;/h2&gt;

&lt;p&gt;If you have past experience with data rooms, I’d love to hear your thoughts on the platform. If you’re going to be needing a data room soon, I’d love to show you what we’ve built and get your feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live platform:&lt;/strong&gt; &lt;a href=&quot;https://thedataroom.app&quot;&gt;thedataroom.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Access:&lt;/strong&gt; Waitlist signups receive discounted pricing; investor access remains free but you would need an invite.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Best practices in AI coding</title>
   <link href="https://djpardis.com/blog/2025/06/20/vibe-coding-data-room-app/"/>
   <updated>2025-06-20T00:00:00+00:00</updated>
   <id>https://djpardis.com/blog/2025/06/20/vibe-coding-data-room-app</id>
   <content type="html">&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2025/vibe-coding-arcade-machines.png&quot; alt=&quot;Vintage arcade cabinets including a Junior Deputy Sheriff shooting game, Love Test machine, and a bill-to-quarters change machine, with Playland memorabilia on the wall&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is part 1 of a 2-part post. This first is about my experience with &lt;a href=&quot;https://twitter.com/karpathy/status/1886192184808149383&quot; target=&quot;_blank&quot;&gt;vibe coding&lt;/a&gt;, or rather, &lt;a href=&quot;https://simonwillison.net/2025/Mar/19/vibe-coding/&quot; target=&quot;_blank&quot;&gt;AI-assisted programming&lt;/a&gt;. The &lt;a href=&quot;/blog/2025/07/20/introducing-the-data-room-app/&quot;&gt;second part&lt;/a&gt; is about how I set up the magic link authentication flow.&lt;/p&gt;

&lt;p&gt;
&lt;img src=&quot;/files/pics/blog/2025/fancy-pooh.jpg&quot; alt=&quot;A cute bear saying &apos;oh bother&apos;&quot; style=&quot;max-width: 350px&quot; /&gt;
&lt;em&gt;It&apos;s different.&lt;/em&gt;
&lt;/p&gt;

&lt;h2 id=&quot;the-problem&quot;&gt;The problem&lt;/h2&gt;

&lt;p&gt;To put these cool tools to use, I’ve been working on a couple of side projects. The one we’ll discuss here is a custom data room app for &lt;a href=&quot;https://generalfolders.com&quot; target=&quot;_blank&quot;&gt;General Folders&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I evaluated several popular data room solutions to understand their strengths and limitations before deciding to work on a new one.&lt;/p&gt;

&lt;h2 id=&quot;the-solution&quot;&gt;The solution&lt;/h2&gt;

&lt;p&gt;Given that context, our goal is to build a secure platform for startups to quickly set up data rooms and share links to confidential documents with authenticated investors.&lt;/p&gt;

&lt;p&gt;What we’re envisioning is one flavor of a file explorer for the browser. In addition to the file explorer, we can add on a file viewer, like a slide viewer or a doc viewer, and replicate the logic that DocSend or Google Workspace has built. However, apart from enabling granular tracking, it’s hard to justify the effort that goes into building a file viewer, given the affordable price tag on Google Workspace products and the familiarity of the experience.&lt;/p&gt;

&lt;p&gt;As a sidenote, this exercise made me realize that GitHub, Dropbox, and Box are all file explorers at the core, which is quite beautiful.&lt;/p&gt;

&lt;h3 id=&quot;the-tools&quot;&gt;The tools&lt;/h3&gt;

&lt;p&gt;Now we get to the topic of vibe coding. I believe the tools have finally reached that critical threshold where they’re not just fun demos but genuinely practical for everyday use. And it looks like others feel the same!&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Last time I felt as giddy as I do when vibe coding was my first ever visual basic app. Seismic shifts afoot people, seismic.&lt;/p&gt;

  &lt;p&gt;— Harry Brundage (&lt;a href=&quot;https://x.com/harrybrundage/status/1928812963085070585&quot;&gt;@harrybrundage&lt;/a&gt;) • May 31, 2024&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Programming with AI is what you thought programming would be like prior to learning it.&lt;/p&gt;

  &lt;p&gt;— Martin Casado (&lt;a href=&quot;https://x.com/martin_casado/status/1929390376185405743&quot;&gt;@martin_casado&lt;/a&gt;) • June 1, 2024&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To build this data room app, I used &lt;a href=&quot;https://windsurf.com&quot; target=&quot;_blank&quot;&gt;Windsurf&lt;/a&gt;, an AI-assisted development platform founded in June 2021 by Varun Mohan and Douglas Chen. Originally launched as Exafunction (focusing on GPU optimization), the company pivoted to developer tools and rebranded as Codeium in 2022, before becoming Windsurf in April 2025.&lt;/p&gt;

&lt;p&gt;For a comparison of coding agents, see &lt;a href=&quot;https://medium.com/@b.yogesh565/comparison-of-ai-coding-tools-a-developers-perspective-cbde8005a7dd&quot; target=&quot;_blank&quot;&gt;Yogesh’s head-to-head comparison&lt;/a&gt;, &lt;a href=&quot;https://www.c-sharpcorner.com/article/top-7-ai-tools-for-software-developers/&quot; target=&quot;_blank&quot;&gt;C# Corner’s top AI tools&lt;/a&gt;, &lt;a href=&quot;https://kingy.ai/blog/ai-coding-agents-in-2025-cursor-vs-windsurf-vs-copilot-vs-claude-vs-vs-code-ai/&quot; target=&quot;_blank&quot;&gt;Kingy AI’s agent analysis&lt;/a&gt;, and &lt;a href=&quot;https://www.blog.brightcoding.dev/2025/03/22/cursor-vs-windsurf-vs-github-copilot-the-ai-coding-assistant-showdown/&quot; target=&quot;_blank&quot;&gt;BrightCoding’s technical benchmarks&lt;/a&gt;. To understand how they work, see &lt;a href=&quot;https://sourcegraph.com/blog/anatomy-of-a-coding-assistant&quot; target=&quot;_blank&quot;&gt;Sourcegraph’s anatomy of coding assistants&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;observations-and-best-practices&quot;&gt;Observations and best practices&lt;/h2&gt;

&lt;p&gt;Now let’s get to the main point of this post. Below, I share some observations I’ve made while using these tools over the past few months.&lt;/p&gt;

&lt;h3 id=&quot;observations&quot;&gt;Observations&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;IDE interface.&lt;/strong&gt; Interacting with LLMs directly inside an IDE is what makes the programming use case for LLMs so successful. Fast feedback loops while iterating on a project is key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill issue.&lt;/strong&gt; These tools increase the surface area of projects that I would take on. I know I can rely on them where I lack skills. This means I can take on brand new projects with more confidence.&lt;/p&gt;

&lt;h3 id=&quot;best-practices&quot;&gt;Best practices&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Version control.&lt;/strong&gt; Version control is good in any situation, but specifically when working with agents that have write access to your codebase. This is not disimlar to how collaborating with a colleague is made practical via version control.&lt;/p&gt;

&lt;p&gt;The agent sometimes changes already functional files and modules that are not relevant to your prompt. Sometimes it hallucinates and makes updates that are all wrong. Some say to lock a page you’re sure about; but practically, you rarely want to lock a file in an evolving project, so the best bet is to have meaningful commits that you can revert to.&lt;/p&gt;

&lt;p&gt;I would go so far to recommend you manage your git workflow manually. That way you can be sure you have meaningful commits and ways to correct big mistakes and hallucinations. Otherwise you end up resetting a full day’s worth of work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local vs global solutions.&lt;/strong&gt; The agent sometimes does a local hacky solution rather than fixing the root cause. For example, instead of fixing a css issue globally, it might fix it locally for a specific part of the website, or separately for every part. On those occasions it requires nudging about best practices, for example, about &lt;em&gt;separation of concerns&lt;/em&gt;. Otherwise, you’ll end up with unmaintainable code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code review.&lt;/strong&gt; As I went deep into a couple of projects, spending more and more time on Windsurf, it had me wondering, am I getting good at anything? What skills, if any, am I gaining? When I program, absent an agent, I get better at programming. When I write, I get better at writing. When I vibe code, especially in a domain I’m not familiar with, am I actually learning anything? This is where I’d recommend reviewing every commit. This way, you’ll not only have better command over your codebase but also learn from your robot friend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Imprecise details.&lt;/strong&gt; Being imprecise can be the source of a lot of pain. Be as clear as possible. &lt;em&gt;English is code.&lt;/em&gt; I wasted a full half day due to imprecisely describing the folder structure in a project only to realize that it was &lt;em&gt;my&lt;/em&gt; prompt and not the agent’s dumbness that was leading the project astray.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical details.&lt;/strong&gt; I’ve found that the more accurate I can explain what I want and the more technical context I can provide, the faster and smoother things go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planning file (plan.md).&lt;/strong&gt; Aside from technically detailed prompts, it helps to have a well-defined project definition at the outset. The project goes smoother if there is a clear roadmap. Windsurf’s &lt;a href=&quot;https://docs.windsurf.com/windsurf/cascade/planning-mode#planning-mode&quot; target=&quot;_blank&quot;&gt;Planning Mode&lt;/a&gt; not only helps keep track of tasks but can also serve as a reference to past prompts instead of the alternative of “as per my tenth to last prompt.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold starts.&lt;/strong&gt; Starting from a template is a great idea, when available. Starting from scratch can be challenging especially if you don’t know what the project structure should look like. Lack of templates isn’t necessarily a blocker but templates make the experience smoother; they steer the agent towards best practices in domains where you are not as opinionated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging.&lt;/strong&gt; Debugging works easier when you understand the codebase. For example, if I can identify the root cause of a bug, it’s easier to write a prompt and fix the problem. Otherwise, it’s just an endless loop of the blind leading the blind; which can still work but can take forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Work estimates.&lt;/strong&gt; One thing I’ve struggled with is trying to estimate how long something takes. Is it easier or harder to provide work estimates when working with an agent? I’d say it’s harder. Imagine working with someone without having visibility into their strengths and weaknesses. In my experience I’ve found the agent is not great at auth flows, but really good at CSS, for example, but I wouldn’t have known that until I got deep into the project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pure vibes.&lt;/strong&gt; If it’s a small, short-term project, it makes sense to let the agent take the wheel. All you need to do is to nudge it in the right direction once in while but little involvement is necessary. However, for a bigger project, or one that’s meant to be revised later, or one that is shared with other collaborators, it’s different. In this case it’s easier if you know the codebase and review every commit, just as you would if you were working with a colleague who might move to another project at any point. Additionally, if your project is one where performance or security matters then you need to be more alert and involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memories and rules.&lt;/strong&gt; A new chat window doesn’t always remember how we did things before and will find a new way of doing things, that while probably correct, is not consistent with the rest of the codebase. In these settings it makes sense to create a &lt;a href=&quot;https://docs.windsurf.com/windsurf/cascade/memories&quot; target=&quot;_blank&quot;&gt;memory or a rule&lt;/a&gt; to guide the agent.&lt;/p&gt;

&lt;h3 id=&quot;advice-from-windsurf-itself&quot;&gt;Advice from Windsurf itself&lt;/h3&gt;

&lt;p&gt;I asked Windsurf directly what advice it would give to someone starting with AI-assisted programming. Here are some key insights from the tool itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context windows.&lt;/strong&gt; Be mindful of the AI’s context window limitations. For large codebases, focus the AI on specific files or components rather than expecting it to understand the entire system at once. This improves response quality and reduces confusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation-first approach.&lt;/strong&gt; Ask the AI to document its implementation strategy before writing code. This forces clarity of thought and gives you a chance to course-correct before any code is written.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool complementarity.&lt;/strong&gt; AI coding tools work best when complemented with traditional development tools like linters, type checkers, and test suites that can catch issues the AI might miss.&lt;/p&gt;

&lt;h3 id=&quot;elsewhere-on-the-web&quot;&gt;Elsewhere on the web&lt;/h3&gt;

&lt;p&gt;Check out the Windsurf documentation &lt;a href=&quot;#ref7&quot;&gt;[7]&lt;/a&gt; and Windsurf rules at UI Bakery &lt;a href=&quot;#ref8&quot;&gt;[8]&lt;/a&gt; for general Windsurf best practices. Also see David Crawshaw’s blog post &lt;a href=&quot;#ref9&quot;&gt;[9]&lt;/a&gt; for an interesting look at programming with agents. For more on vibe coding, check out 12 Rules to Vibe Code Without Frustration &lt;a href=&quot;#ref10&quot;&gt;[10]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That’s it for now. I’ll add to this list as I learn more.&lt;/p&gt;

&lt;h2 id=&quot;next-steps&quot;&gt;Next steps&lt;/h2&gt;

&lt;p&gt;In the &lt;a href=&quot;/blog/2025/07/20/introducing-the-data-room-app/&quot;&gt;second part&lt;/a&gt; I explain the login flow with magic link authentication. I’ve also published a link for you to try out the data room app.&lt;/p&gt;

&lt;p&gt;Next up, stay tuned as I share some of my &lt;a href=&quot;https://github.com/djpardis/mcp-code-qna&quot; target=&quot;_blank&quot;&gt;explorations&lt;/a&gt; into MCP servers and agents.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you have any feedback or advice, please let me know via &lt;a href=&quot;https://x.com/djpardis&quot; target=&quot;_blank&quot;&gt;X&lt;/a&gt;, &lt;a href=&quot;https://bsky.app/profile/djpardis.com&quot; target=&quot;_blank&quot;&gt;Bluesky&lt;/a&gt;, or &lt;a href=&quot;https://djpardis.medium.com/vibe-coding-a-data-room-app-2858246857e9&quot; target=&quot;_blank&quot;&gt;Medium&lt;/a&gt;. I’m looking forward to hearing from you.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;&lt;a id=&quot;ref1&quot; href=&quot;#ref1-back&quot;&gt;[1]&lt;/a&gt; Papermark. (2023). &lt;a href=&quot;https://www.papermark.com/virtual-data-room-providers&quot;&gt;“Best virtual data room providers comparison”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref1b&quot; href=&quot;#ref1-back&quot;&gt;[1b]&lt;/a&gt; FirmRoom. (2023). &lt;a href=&quot;https://firmroom.com/vdr-providers&quot;&gt;“11 best data room providers you need in 2024: Comparison”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref2&quot; href=&quot;#ref2-back&quot;&gt;[2]&lt;/a&gt; Yogesh, B. (2025). &lt;a href=&quot;https://medium.com/@b.yogesh565/comparison-of-ai-coding-tools-a-developers-perspective-cbde8005a7dd&quot;&gt;“AI coding tools: A developer’s head-to-head comparison”&lt;/a&gt;. A technical comparison of Cursor, GitHub Copilot, Replit, Cline, and Claude Code from an engineer’s perspective.&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref3&quot; href=&quot;#ref3-back&quot;&gt;[3]&lt;/a&gt; C# Corner. (2025). &lt;a href=&quot;https://www.c-sharpcorner.com/article/top-7-ai-tools-for-software-developers/&quot;&gt;“Top 7 AI tools for software developers”&lt;/a&gt;. Comprehensive analysis of GitHub Copilot, ChatGPT, Replit AI, Windsurf, Tabnine, Cursor AI, and V0 by Vercel.&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref4&quot; href=&quot;#ref4-back&quot;&gt;[4]&lt;/a&gt; Kingy AI. (2025). &lt;a href=&quot;https://kingy.ai/blog/ai-coding-agents-in-2025-cursor-vs-windsurf-vs-copilot-vs-claude-vs-vs-code-ai/&quot;&gt;“AI coding agents in 2025: Cursor vs. Windsurf vs. Copilot vs. Claude vs. VS Code AI”&lt;/a&gt;. In-depth analysis of strengths and weaknesses across code generation, debugging, refactoring, and large-codebase support.&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref5&quot; href=&quot;#ref5-back&quot;&gt;[5]&lt;/a&gt; BrightCoding. (2025). &lt;a href=&quot;https://www.blog.brightcoding.dev/2025/03/22/cursor-vs-windsurf-vs-github-copilot-the-ai-coding-assistant-showdown/&quot;&gt;“AI coding assistants compared: technical benchmarks”&lt;/a&gt;. Technical analysis showing Windsurf processes 180 tokens/second with Llama 3.1 405B model, Cursor averages 220 tokens/second with GPT-4o, and GitHub Copilot processes 150 tokens/second with standard models. Engineers prefer Cursor (42%) for refactoring, Windsurf (38%) for speed and privacy, and Copilot (36%) for reliability.&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref6&quot; href=&quot;#ref6-back&quot;&gt;[6]&lt;/a&gt; Sourcegraph Engineering. (2024). &lt;a href=&quot;https://sourcegraph.com/blog/anatomy-of-a-coding-assistant&quot;&gt;“The anatomy of an AI coding assistant”&lt;/a&gt;. Technical blog explaining how modern AI coding assistants work under the hood, detailing the vector embedding techniques (using OpenAI’s text-embedding-ada-002 or custom models), context fetching mechanisms, and how different features (autocomplete, chat, test generation) use specialized retrieval methods optimized for latency (76ms) or accuracy depending on the use case.&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref7&quot; href=&quot;#ref7-back&quot;&gt;[7]&lt;/a&gt; Windsurf. (2025). &lt;a href=&quot;https://docs.windsurf.com/windsurf/getting-started&quot;&gt;“Getting Started with Windsurf”&lt;/a&gt;. Official documentation on getting started with the Windsurf VSCode fork.&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref8&quot; href=&quot;#ref8-back&quot;&gt;[8]&lt;/a&gt; UI Bakery. (2025). &lt;a href=&quot;https://uibakery.io/blog/windsurf-ai-rules&quot;&gt;“Windsurf AI Rules”&lt;/a&gt;. Best practices for working with Windsurf AI and optimizing its performance.&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref9&quot; href=&quot;#ref9-back&quot;&gt;[9]&lt;/a&gt; Crawshaw, D. (2025). &lt;a href=&quot;https://crawshaw.io/blog/programming-with-agents&quot;&gt;“Programming with Agents”&lt;/a&gt;. An in-depth look at the paradigm shift of programming with AI agents.&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref10&quot; href=&quot;#ref10-back&quot;&gt;[10]&lt;/a&gt; Creator Economy. (2025). &lt;a href=&quot;https://creatoreconomy.so/p/12-rules-to-vibe-code-without-frustration&quot;&gt;“12 Rules to Vibe Code Without Frustration”&lt;/a&gt;. A practical guide to effective AI-assisted programming techniques and best practices.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Startup advice</title>
   <link href="https://djpardis.com/blog/2024/08/12/startup-advice/"/>
   <updated>2024-08-12T00:00:00+00:00</updated>
   <id>https://djpardis.com/blog/2024/08/12/startup-advice</id>
   <content type="html">&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*wlkXlVWb8r27CM6ih_EAog.jpeg&quot; alt=&quot;From my time at Techstars SDSU&quot; /&gt;
&lt;em&gt;From my time at Techstars SDSU.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It’s been quite the journey building &lt;a href=&quot;https://generalfolders.com&quot;&gt;General Folders&lt;/a&gt; so far. I learned things I wish I had known before starting the company. I also learned valuable lessons from past jobs that proved helpful in company building. Below is a list of these lessons. I hope that you find them useful on your journey!&lt;/p&gt;

&lt;div class=&quot;text-center&quot;&gt;
    &lt;span&gt;&amp;#10210;&amp;nbsp;&amp;nbsp;&amp;#10209;&amp;nbsp;&amp;nbsp;&amp;#10211;&lt;/span&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Build a team.&lt;/strong&gt; Bring along co-founders, but &lt;a href=&quot;https://medium.com/@mtrajan/price-of-a-great-co-founder-5fe35d62b441&quot;&gt;for the right reasons&lt;/a&gt;. Two can outperform one. Building the first team is your most important task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be genuinely passionate.&lt;/strong&gt; The secret to hiring great people is to be knowledgeable and passionate about the product and business. Passion is contagious, and expertise is hard to fake. The same holds for sales: don’t underestimate the impact of &lt;em&gt;stoke&lt;/em&gt; on selling a product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Define the MVP.&lt;/strong&gt; Determine the minimum amount of &lt;strong&gt;product&lt;/strong&gt; or &lt;strong&gt;visuals&lt;/strong&gt; required to get 1) customer buy-in and 2) relevant feedback.&lt;/p&gt;

&lt;p&gt;Your ability to come up with a successful MVP is tied to being able to describe a very particular use case with a well-defined customer profile. Having a well-defined use case, more than anything, paves the way for successful distribution.&lt;/p&gt;

&lt;p&gt;As it turns out, this is an exercise that a company inevitably repeats again and again. You can easily bankrupt a company as you spend to develop an upcoming version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sell early and often.&lt;/strong&gt; A startup’s success hinges on its sales and the trajectory of those sales. Life is easier if the product can be sold early and often. A startup will struggle to survive with grand plans and lengthy sales cycles unless it secures substantial external funding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Talk to customers.&lt;/strong&gt; Nothing benefits a company more than customer feedback. Never lose sight of it. Feedback saves products and companies from veering off the tracks into irrelevance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embrace change.&lt;/strong&gt; Don’t obsess over a particular decision or plan. If product reception is anything but stellar, update your strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Follow the crowd.&lt;/strong&gt; Tried-and-true practices are popular for a reason. It pays to follow the crowd if you know when and where to apply a technique.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Follow trends.&lt;/strong&gt; Don’t ignore market trends. Following trends makes you more likely to get noticed, talked about, and funded. It’s an easier ride if you want to take it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hire an accountant.&lt;/strong&gt; While it pays to follow trends, plan for the dry season. New trends always give way to newer trends — it’s funny how that works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stand out.&lt;/strong&gt; Likewise, while following trends can be rewarding, sometimes it pays to focus on what you uniquely believe to be true. You can’t expect a significant return without taking on a risk nobody else wants to take.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hustle.&lt;/strong&gt; Aside from being a function of big risk, a big return is also an outcome of hard work. There are no shortcuts. Pick up the phone. Knock on doors. Get the product in front of customers. Put in the work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make the call.&lt;/strong&gt; Running a business involves endless decisions with incomplete information. Indecision is a decision. Make the call. Use every decision as an opportunity to learn more about your business.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Move fast.&lt;/strong&gt; Faster is better, but only when the goal is to serve customers and tend to their needs. Rushing for PR reasons comes at the expense of quality, ultimately tarnishing the brand.&lt;/p&gt;

&lt;p&gt;Moving fast is not just about writing more code, sending more emails, and doing more busy work. Building requires clear thinking and planning. It requires coordinating with customers at every step. Skipping steps creates tech debt, ineffective plans, and underwhelmed customers. Rarely do products fail because there isn’t enough &lt;em&gt;code and artifacts&lt;/em&gt;; rather, it is often due to inadequate &lt;em&gt;thought&lt;/em&gt;. Don’t skip steps. And don’t settle for busy work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep learning.&lt;/strong&gt; In every scenario, assume you don’t have all the answers. Learn and evolve to become the leader the company needs at each stage. Can you &lt;a href=&quot;https://youtu.be/qAr-yl9A0Xc?si=wUVTi-zKmvuWsEiK&amp;amp;t=1978&quot;&gt;hire yourself every day&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask for help.&lt;/strong&gt; Ask for things and get help. The startup community is more generous, welcoming, and helpful than expected. Knowing when to ask, who to ask, and for what is a skill worth perfecting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go after capital.&lt;/strong&gt; Raise capital to accelerate growth. Attract investors who are already sold on the vision and are ready to champion the company. If it’s a hard sell, it’s not a good fit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Believe in yourself.&lt;/strong&gt; Be kind, especially to yourself. Self-confidence is the only thing you have going for you for a long time. Things won’t ever be perfect, and that’s okay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep going.&lt;/strong&gt; People (mainly yourself) will tell you that what you’re doing is not enough and that if it were to work, it would have worked by now. They will tell you to give up. Don’t. Opportunities will come around.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep focused.&lt;/strong&gt; Don’t compare your startup to other startups. Don’t read into startup news. Startup news doesn’t talk about the deal details. Focus on your early customers. Focus on your collaborators. Focus on well-defined local problems for your first few customers. Don’t attempt to solve global problems for imaginary customers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seize the day.&lt;/strong&gt; You might not have achieved all the milestones, but you’ll never be this young ever again. 💁🏻‍♀️&lt;/p&gt;

&lt;div class=&quot;text-center&quot;&gt;
    &lt;span&gt;&amp;#10210;&amp;nbsp;&amp;nbsp;&amp;#10209;&amp;nbsp;&amp;nbsp;&amp;#10211;&lt;/span&gt;
&lt;/div&gt;

&lt;p&gt;What are some of your experiences from working at startups? I’d love to hear from you!&lt;/p&gt;

&lt;div class=&quot;crosspost-container post-container&quot;&gt;
This article is cross-posted from my original publication on &lt;a href=&quot;https://medium.com/@djpardis/startup-advice-e9459d6c1ebb&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Medium&lt;/a&gt;.
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>The state of data exchange</title>
   <link href="https://djpardis.com/blog/2023/04/03/the-state-of-data-exchange/"/>
   <updated>2023-04-03T00:00:00+00:00</updated>
   <id>https://djpardis.com/blog/2023/04/03/the-state-of-data-exchange</id>
   <content type="html">&lt;div class=&quot;toc-container post-container&quot;&gt;
&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#abstract&quot;&gt;The abstract&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#talk&quot;&gt;The talk&lt;/a&gt;
  &lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;#business-partners&quot;&gt;Business partners exchange data for a variety of reasons&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;#companies-build-solutions&quot;&gt;Companies build solutions for data exchange today&lt;/a&gt;
      &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;#excel-csv-sftp&quot;&gt;Send or receive data as an Excel file or CSV by Gmail, Slack, Dropbox, or over SFTP&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#share-credentials&quot;&gt;Transfer data by sharing AWS or database credentials&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#expose-api&quot;&gt;Make data available by exposing an API&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#implement-api&quot;&gt;Pull data by implementing an API&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#send-via-api&quot;&gt;Send data via an API&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#implement-sdk&quot;&gt;Send data by implementing an SDK&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#secure-sharing&quot;&gt;Securely share data using Snowflake, Redshift, Azure Data Share, and GCP Datashare&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#marketplace&quot;&gt;Make data available on a marketplace&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#clean-rooms&quot;&gt;Collaborate on overlapping data with data clean rooms&lt;/a&gt;&lt;/li&gt;
      &lt;/ul&gt;
    &lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;#challenges-remain&quot;&gt;But challenges still remain&lt;/a&gt;
      &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;#type-information&quot;&gt;Data exchange via Excel or CSV loses valuable type information&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#data-validation&quot;&gt;Data validation is manual in most cases&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#not-all-apis&quot;&gt;Not all data providers expose an API&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#vendor-apis&quot;&gt;Not all vendor APIs are implemented by major integration companies&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#staffing-issues&quot;&gt;Some data consumers are not staffed adequately&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#obscure-formats&quot;&gt;Some data consumers ask for an obscure format&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#pricing-complexity&quot;&gt;Pricing an exchange is complex. Who should pay? And what amount?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#speed-challenge&quot;&gt;Speed is a challenge&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#security-compliance&quot;&gt;Security is a challenge. So is compliance&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#pipeline-maintenance&quot;&gt;Monitoring and maintaining so many different pipelines is a challenge&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#auditing-challenge&quot;&gt;Auditing is a challenge&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;#decision-fatigue&quot;&gt;Decision fatigue is real&lt;/a&gt;&lt;/li&gt;
      &lt;/ul&gt;
    &lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;#solution-properties&quot;&gt;What are the properties of a solution?&lt;/a&gt;&lt;/li&gt;
  &lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#reactions&quot;&gt;Reactions to the talk&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;I just got back from &lt;a href=&quot;https://www.datacouncil.ai/austin&quot; target=&quot;_blank&quot;&gt;Data Council&lt;/a&gt;. Thanks to Pete Soderling for an excellent conference where people gather to discuss the future of data infra, AI, and analytics. It was an energizing couple of days with many valuable takeaways as I work on &lt;a href=&quot;https://twitter.com/GeneralFolders&quot; target=&quot;_blank&quot;&gt;General Folders&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;video-container&quot; style=&quot;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; max-width: 100%;&quot;&gt;
  &lt;iframe style=&quot;position: absolute; top: 0; left: 0; width: 100%; height: 100%;&quot; src=&quot;https://www.youtube.com/embed/Np0kTZlbRO4?list=PLAesBe-zAQmF-GpvZ3ba5YpVzoVbgzl8M&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;p class=&quot;image-caption&quot;&gt;A recording of the talk is available on YouTube.&lt;/p&gt;

&lt;p&gt;Below you can find the full transcript. I would love to hear about your experiences with data exchange. Please &lt;a href=&quot;https://twitter.com/GeneralFolders&quot; target=&quot;_blank&quot;&gt;reach out&lt;/a&gt;!&lt;/p&gt;

&lt;h2 id=&quot;abstract&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;The abstract&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Data exchange is integral to every business partnership. Yet data exchange practices are highly manual, prone to data leaks, difficult to validate, inherently impossible to monitor, and costly to audit. In this talk, we present an overview of the variety of methods enterprises use to share and transfer data. We talk about some of the challenges companies continue to face along the vectors of security, simplicity, and speed. We conclude by enumerating the properties of a good solution.&lt;/p&gt;

&lt;h2 id=&quot;talk&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;The talk&lt;/a&gt;&lt;/h2&gt;

&lt;h3 id=&quot;business-partners&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Business partners exchange data for a variety of reasons&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;It might be surprising to many, as it was for me, but they do this quite often.&lt;/p&gt;

&lt;p&gt;One of the scenarios is SaaS vendors sending data to their customers. SaaS vendors offer some sort of service to their customer’s customers. The customers’ data then ends up in the vendor’s warehouse, and so the vendors need to make that data available to their customers.&lt;/p&gt;

&lt;p&gt;Another scenario is when a company is doing market research and needs to incorporate market data into their analysis. They need to talk to data vendors to purchase a particular data set. The procurement process usually involves a data exploration and assessment aspect and the transaction could be a one-time transfer or require recurring updates.&lt;/p&gt;

&lt;p&gt;Yet another scenario is when a healthcare, transportation, or energy company is responsible for sending data to the local government on a regular basis.&lt;/p&gt;

&lt;p&gt;For AI SaaS evaluation, companies usually want to assess whether a new tool would work for their uses cases. Before buying, they usually transfer data to the vendor’s warehouse and ask that the vendor demonstrates the effectiveness of the new tool on real customer data.&lt;/p&gt;

&lt;p&gt;In M&amp;amp;A, the acquirer needs to run their own diligence on the company they want to acquire. They usually ask for certain data sets and if the deal goes through, a full transfer of data assets.&lt;/p&gt;

&lt;p&gt;Sometimes businesses need to collaborate on the intersection of their respective customers. For example, a makeup brand (i.e., the supplier) needs to share data with all retailers that carry the brand to find the marketing channels that work best in each cohort.&lt;/p&gt;

&lt;p&gt;Yet in other cases, businesses don’t even know the overlap of their customers and that’s what they need to find out. We’ll get into more details about these applications later.&lt;/p&gt;

&lt;p&gt;Okay, enough about use cases — we can go on forever.&lt;/p&gt;

&lt;h3 id=&quot;companies-build-solutions&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Companies build solutions for data exchange today&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Let’s go through some of these solutions and how they work. I’ve had a lot of conversations with companies about this topic and this list is a kind of summary of those conversations.&lt;/p&gt;

&lt;p&gt;The goal is not to judge these methods, or even to assess whether a team made the right call to go with a certain approach — a lot of that depends on infrastructure limitations and deadlines. The goal is to enumerate some of the inevitably great many methods companies use to exchange data with their business partners.&lt;/p&gt;

&lt;h4 id=&quot;excel-csv-sftp&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Send or receive data as an Excel file or CSV by Gmail, Slack, Dropbox, or over SFTP&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;One of the most common ways to exchange data is by setting up an SFTP server, uploading a CSV file, then sharing a link. Some use &lt;a href=&quot;https://aws.amazon.com/aws-transfer-family/&quot; target=&quot;_blank&quot;&gt;Amazon Transfer Family&lt;/a&gt; and managed workflows to move data in and out of S3 over SFTP. Others use &lt;a href=&quot;https://airflow.apache.org/&quot; target=&quot;_blank&quot;&gt;Airflow&lt;/a&gt;, &lt;a href=&quot;https://dagster.io/integrations/dagster-ssh-sftp&quot; target=&quot;_blank&quot;&gt;Dagster&lt;/a&gt;, or &lt;a href=&quot;https://en.wikipedia.org/wiki/Cron&quot; target=&quot;_blank&quot;&gt;cron jobs&lt;/a&gt; to set up and manage pipelines.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/sftp-pipeline.png&quot; alt=&quot;SFTP data exchange pipeline&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;The data provider manages only half the pipeline. The rest of the pipeline is up to the data consumer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What happens in this scenario is that the data provider ensures that the data is securely moved from S3 to the SFTP server. It is then up to the data consumer to decide how to handle the downstream data. They need to build their own pipelines.&lt;/p&gt;

&lt;h4 id=&quot;share-credentials&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Transfer data by sharing AWS or database credentials&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;Companies use DB replication tools to implement one-time or recurring syncs to and from customer data stores. You can use &lt;a href=&quot;http://fivetran.com&quot; target=&quot;_blank&quot;&gt;Fivetran&lt;/a&gt; or &lt;a href=&quot;https://airbyte.com/&quot; target=&quot;_blank&quot;&gt;Airbyte&lt;/a&gt;, or even &lt;a href=&quot;https://debezium.io/blog/2018/07/19/advantages-of-log-based-change-data-capture/&quot; target=&quot;_blank&quot;&gt;Debezium&lt;/a&gt; to build your own log-based CDC replication, should a log be available.&lt;/p&gt;

&lt;p&gt;You might say, at this point, that sharing database credentials with business partners is not a secure solution. And this is true. But it happens quite often as businesses need to set up reliable end-to-end cross-company pipelines— particularly when one partner is short on staff.&lt;/p&gt;

&lt;p&gt;It should be noted that AWS lets you create &lt;a href=&quot;https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html&quot; target=&quot;_blank&quot;&gt;IAM roles&lt;/a&gt; to securely provide a business partner read or write access to your S3 buckets. This helps avoid sharing credentials in most cases.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/db-credentials.png&quot; alt=&quot;Database credential sharing&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;expose-api&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Make data available by exposing an API&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;As a SaaS vendor, when you’re relatively sure you know what data a lot of your customers want access to, you can make that data available by exposing an API.&lt;/p&gt;

&lt;p&gt;There are a lot of benefits to this: 1) consumers can pick and choose the data they need and access it when they need it, 2) there are standards for APIs that everyone can implement, like REST or GraphQL, 3) APIs are DB-independent. 4) APIs are open and not tied to a vendor, 5) the data contract is implicit in an API 6) APIs are easily testable, by both sides of the transaction. These are all properties of good data transfer technology.&lt;/p&gt;

&lt;h4 id=&quot;implement-api&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Pull data by implementing an API&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;As a SaaS vendor partner, you can implement an API to pull your data from vendors. It turns out that’s a whole lot of work. APIs and schemas evolve a lot. The good news is that you can also use tools like &lt;a href=&quot;https://www.fivetran.com/&quot; target=&quot;_blank&quot;&gt;Fivetran&lt;/a&gt;, &lt;a href=&quot;https://airbyte.com/&quot; target=&quot;_blank&quot;&gt;Airbyte&lt;/a&gt;, &lt;a href=&quot;https://meltano.com/&quot; target=&quot;_blank&quot;&gt;Meltano&lt;/a&gt;, or &lt;a href=&quot;https://www.rudderstack.com/&quot; target=&quot;_blank&quot;&gt;Rudderstack&lt;/a&gt; that handle this for you.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/api-pulling.png&quot; alt=&quot;API data pulling&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;send-via-api&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Send data via an API&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;Moving data out of the warehouse and into operational systems like &lt;a href=&quot;https://mailchimp.com/en-gb/&quot; target=&quot;_blank&quot;&gt;MailChimp&lt;/a&gt; and &lt;a href=&quot;https://www.braze.com/&quot; target=&quot;_blank&quot;&gt;Braze&lt;/a&gt; is useful. These systems provide some sort of value on top of raw data. This is made possible by implementing the API they expose to send data. Given the work involved in maintaining that code, it’s simpler to use “reverse ETL” or “data activation” tools like &lt;a href=&quot;https://www.getcensus.com/reverse-etl&quot; target=&quot;_blank&quot;&gt;Census&lt;/a&gt;, &lt;a href=&quot;https://hightouch.com/&quot; target=&quot;_blank&quot;&gt;Hightouch&lt;/a&gt;, and &lt;a href=&quot;https://www.rudderstack.com/&quot; target=&quot;_blank&quot;&gt;Rudderstack&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;implement-sdk&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Send data by implementing an SDK&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;You can integrate with a vendor’s existing SDK to send events data. For example, &lt;a href=&quot;https://www.braze.com/docs/user_guide/data_and_analytics/user_data_collection/sdk_data_collection/&quot; target=&quot;_blank&quot;&gt;Braze&lt;/a&gt;’s backend service calculate metrics based on the SDK data it receives. Implementing SDKs is not easy and the integrations take months, even up to an entire year. Adding new data points also means new code to be tested, deployed, and shipped. But the data from implementing SDKs is reliable.&lt;/p&gt;

&lt;h4 id=&quot;secure-sharing&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Securely share data using Snowflake, Redshift, Azure Data Share, and GCP Datashare&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;When transferring data is not necessary, sharing data is way more efficient. However, sharing works in very specific scenarios.&lt;/p&gt;

&lt;p&gt;You can share data for &lt;em&gt;read purposes only&lt;/em&gt; across accounts (which can span regions but of course needs to be on the same cloud), on both Snowflake and Redshift. The great thing about it is that it’s instant access. There’s no data copies or data movement. It’s a view of existing data and so any changes are captured instantaneously by the view. Because data never leaves your servers, you also get straightforward access to usage metrics. You can restrict or revoke access at any time.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/redshift-sharing.png&quot; alt=&quot;AWS Redshift data sharing&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;With &lt;a href=&quot;https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-data-sharing-preview/&quot; target=&quot;_blank&quot;&gt;AWS Redshift&lt;/a&gt;, you can create data shares and share it with internal and external customers.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Companies like Stripe use Snowflake and Redshift data sharing capabilities to send all of their customers’ up-to-date data, avoiding an API integration. Of course, the API is still available for those that need it. Salesforce CDP also uses this “zero-copy” or “zero-ETL” strategy to data sharing to make their data available to Snowflake customers.&lt;/p&gt;

&lt;h4 id=&quot;marketplace&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Make data available on a marketplace&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;With &lt;a href=&quot;https://aws.amazon.com/data-exchange/&quot; target=&quot;_blank&quot;&gt;AWS Data Exchange&lt;/a&gt;, data vendors can provide easy and secure access to their data, with the ability to reach AWS customers. Consumers can get read access via Redshift sharing capabilities or write pipelines to export the data to S3 and use it from there.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/marketplace-bruges.jpeg&quot; alt=&quot;Marketplace at Bruges&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;This is &lt;a href=&quot;https://americanart.si.edu/artwork/marketplace-bruges-20270&quot; target=&quot;_blank&quot;&gt;Marketplace at Bruges&lt;/a&gt; by Samuel Prout.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With &lt;a href=&quot;https://www.snowflake.com/en/data-cloud/marketplace/&quot; target=&quot;_blank&quot;&gt;Snowflake Marketplace&lt;/a&gt;, data vendors can make their data available to Snowflake customers. With Snowflake data sharing, consumers get read access to the data instantly without the need for ELT integrations.&lt;/p&gt;

&lt;h4 id=&quot;clean-rooms&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Collaborate on overlapping data with data clean rooms&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;Sometimes, all that business partners want is to collaborate on overlapping data without the need to fully exchange copies of data. This makes contracts easier and the whole procurement process faster.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/data-clean-room.jpeg&quot; alt=&quot;Data clean room&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;From &lt;a href=&quot;https://www.searchenginejournal.com/data-clean-rooms/417606/&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Clean rooms allow business partners to collaborate on and analyze data in a secure environment, without having to share or reveal user-level data. For example, clean rooms can aid two business partners discover their shared customers. Further, clean rooms sometimes facilitate collaborations even before any sort of partnership contracts are signed.&lt;/p&gt;

&lt;p&gt;With &lt;a href=&quot;https://aws.amazon.com/clean-rooms/&quot; target=&quot;_blank&quot;&gt;AWS Clean Rooms&lt;/a&gt; you can invite any AWS customer to collaborate, select datasets, and configure restrictions. You can analyze data with up to 4 parties in a single collaboration. You can set minimum aggregation thresholds while allowing collaborators to run their queries.&lt;/p&gt;

&lt;p&gt;Similarly, &lt;a href=&quot;https://www.snowflake.com/blog/data-clean-room-explained/&quot; target=&quot;_blank&quot;&gt;Snowflake&lt;/a&gt;, &lt;a href=&quot;https://www.infosum.com/&quot; target=&quot;_blank&quot;&gt;InfoSum&lt;/a&gt;, and &lt;a href=&quot;https://business.pinterest.com/blog/pinterest-liveramp-pilot-data-clean-room/&quot; target=&quot;_blank&quot;&gt;LiveRamp&lt;/a&gt; all offer some flavor of clean rooms. Some companies also &lt;a href=&quot;https://clearcode.cc/blog/data-clean-room/&quot; target=&quot;_blank&quot;&gt;implement their own&lt;/a&gt;. For example, Disney, Unilever, Hershey’s have all built out their own clean rooms to be able to collaborate with marketers and retailers.&lt;/p&gt;

&lt;h3 id=&quot;challenges-remain&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;But challenges still remain&lt;/a&gt;&lt;/h3&gt;

&lt;h4 id=&quot;type-information&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Data exchange via Excel or CSV loses valuable type information&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;CSV suppresses type information that needs to be inferred later on when data is being uploaded back to the database or warehouse.&lt;/p&gt;

&lt;h4 id=&quot;data-validation&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Data validation is manual in most cases&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;This is an issue on both sides of the transaction. This is not an issue for API-based integration. And it’s one of the main areas of strength for APIs.&lt;/p&gt;

&lt;p&gt;Everywhere else, there is always the risk that your business partner might mistakenly send you data they should not be sending. And this is usually PII (personal identifiable information) or PHI (protected health information) where you don’t have the necessary contracts set into place, but also data for other parts of the business you may not need.&lt;/p&gt;

&lt;p&gt;When you talk to data practitioners, you’ll see that they’ve been on the receiving side of a lot of PII data that customers just send by accident.&lt;/p&gt;

&lt;h4 id=&quot;not-all-apis&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Not all data providers expose an API&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;Building and managing APIs is not easy. Keeping the API up-to-date and backwards compatible is also time consuming and requires a major commitment from the API provider. We can’t expect a partner API to be available for every data set we need from a business partner.&lt;/p&gt;

&lt;h4 id=&quot;vendor-apis&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Not all vendor APIs are implemented by major integration companies&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;Connectors for all possible APIs under the sun just doesn’t exist. As a data consumer, implementing APIs and keeping the integration code up-to-date is very time consuming. Hence the need for integration companies in the first place.&lt;/p&gt;

&lt;h4 id=&quot;staffing-issues&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Some data consumers are not staffed adequately&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;Data consumers are sometimes not staffed appropriately to manage the data they receive. This is especially an issue if data is sent via SFTP.&lt;/p&gt;

&lt;p&gt;SFTP data exchange only handles one half of the transaction. Many data consumers are not savvy enough to manage the pipeline on their side. In many cases, you’ll find that data is downloaded onto a laptop before being uploaded to the cloud.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/manual-handling.png&quot; alt=&quot;Manual data handling&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;At a previous company, we needed to send data on a monthly cadence instead of an hourly one, because our partner only had a single staff member to manually download the data. In the meantime, our partner wanted us to provide real time dashboards to show how they were doing before they were able to run their own analytics on the raw data they were going to receive at the end of the month.&lt;/p&gt;

&lt;p&gt;Building SDKs and APIs, but also integrating with them to send and receive data are all time consuming and require expertise.&lt;/p&gt;

&lt;h4 id=&quot;obscure-formats&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Some data consumers ask for an obscure format&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;This adds the additional burden of obscure data transformations when building a transfer pipeline with a new business partner.&lt;/p&gt;

&lt;h4 id=&quot;pricing-complexity&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Pricing an exchange is complex. Who should pay? And what amount?&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;When setting up transfer pipelines, which side of the transaction should be paying for the pipelines, the storage, and the potential egress costs? Accounting for long-running and diverse transactions between two parties can quickly become complex.&lt;/p&gt;

&lt;p&gt;Sharing credentials makes it even more difficult to do cost accounting.&lt;/p&gt;

&lt;h4 id=&quot;speed-challenge&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Speed is a challenge&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;How fresh do we need the data to be? How quickly do we need the data to be made available in each batch? Should speed guarantees be included in partner contracts? It’s a challenge for teams to estimate what is possible right off the bat.&lt;/p&gt;

&lt;h4 id=&quot;security-compliance&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Security is a challenge. So is compliance&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;There is a lot to unroll here. Sharing credentials is not secure. Not knowing the location to which a file from SFTP is downloaded before it makes its way to the warehouse is worrisome. APIs expose backend data and are constantly a source for &lt;a href=&quot;https://nordicapis.com/5-major-modern-api-data-breaches-and-what-we-can-learn-from-them/&quot; target=&quot;_blank&quot;&gt;data breaches&lt;/a&gt;. We also see that without a thorough testing strategy, many partners tend to send or receive PII data that was intended to be sent as part of the contract.&lt;/p&gt;

&lt;h4 id=&quot;pipeline-maintenance&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Monitoring and maintaining so many different pipelines is a challenge&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;Partnership contracts are updated all the time. It’s a huge hassle to have to evolve pipelines alongside the evolving contracts.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/pipeline-maintenance.png&quot; alt=&quot;Pipeline maintenance&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Further, schemas change, APIs change, servers fail, and so maintaining the ever-growing list of highly diverse pipelines quickly becomes a burden on the business. Businesses end up in a place where they have a different type of pipeline for each business partner, and sometimes multiple types of pipelines for the same partner.&lt;/p&gt;

&lt;h4 id=&quot;auditing-challenge&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Auditing is a challenge&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;In a world where a company must maintain all the various disparate pipelines, auditing becomes increasingly complex and time consuming.&lt;/p&gt;

&lt;h4 id=&quot;decision-fatigue&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Decision fatigue is real&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;When you start speaking to a new business partner, or an old business partner about a new use case, you need to go through the list of all possible approaches, depending on the amount of time you have, number of people they have, your infrastructure, their infrastructure, the volume of data being transferred, the direction of the transfer, the cadence, the source, the destination, the required security and speed guarantees, and so on.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/decision-matrix.png&quot; alt=&quot;Decision matrix&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;There’s a lot to consider!&lt;/p&gt;

&lt;h2 id=&quot;solution-properties&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;What are the properties of a solution?&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Now that we talked about all of the challenges, what are the properties of a good solution?&lt;/p&gt;

&lt;p&gt;Here’s my wishlist.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The solution would manage the end-to-end cross-company pipeline, such that it is not relying on either partner to be particularly experienced in data transfer.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The solution would be cloud- and technology agnostic. Business partners shouldn’t need to consider each other’s tech stack before signing a contract.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The solution would make it easy to monitor and test all pipelines and to manage them all in one place. Both partners need to be alerted if a critical pipeline is failing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A good solution would allow an easy way to incorporate the evolving contracts. The nature of partnerships change all the time.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The solution would maintain a ledger of all transactions that happen with a specific partner.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Lastly, yet most important, is that a good solution would make it easy for companies to stay secure and compliant. That’s easy to do when we have the right tools at our disposal.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/solution-properties.png&quot; alt=&quot;Solution properties&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;reactions&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Reactions to the talk&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Below are reactions to the talk.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/tj-murphy-shoutout.png&quot; alt=&quot;TJ Murphy shoutout&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;Thanks TJ Murphy for the &lt;a href=&quot;https://twitter.com/teej_m/status/1643012610433089541?s=20&quot; target=&quot;_blank&quot;&gt;shoutout&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/abhi-sivasailam-support.png&quot; alt=&quot;Abhi Sivasailam support&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;Thank you for the &lt;a href=&quot;https://twitter.com/_abhisivasailam/status/1643024564681863168?s=20&quot; target=&quot;_blank&quot;&gt;support&lt;/a&gt; Abhi Sivasailam!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/matthew-mullins-encouragement.png&quot; alt=&quot;Matthew Mullins encouragement&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;Thank you for the &lt;a href=&quot;https://twitter.com/mullinsms/status/1641474106653650950?s=20&quot; target=&quot;_blank&quot;&gt;words of encouragement&lt;/a&gt; Matthew Mullins!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2023/ananth-packkildurai-shoutout.png&quot; alt=&quot;Ananth Packkildurai shoutout&quot; style=&quot;max-width: 500px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;Thanks for the &lt;a href=&quot;https://twitter.com/GeneralFolders/status/1644394291270422528?s=20&quot; target=&quot;_blank&quot;&gt;shoutout&lt;/a&gt; Ananth Packkildurai! Big fan of the DB comparison and of &lt;a href=&quot;https://twitter.com/data_weekly&quot; target=&quot;_blank&quot;&gt;Data Engineering Weekly&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

&lt;div class=&quot;crosspost-container post-container&quot;&gt;
This post was originally published on &lt;a href=&quot;https://medium.com/@djpardis/the-state-of-data-exchange-31049fa229f0&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Medium&lt;/a&gt; and is cross-posted here.
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Models for integrating data science teams within organizations</title>
   <link href="https://djpardis.com/blog/2019/07/31/models-for-integrating-data-science-teams-within-organizations/"/>
   <updated>2019-07-31T00:00:00+00:00</updated>
   <id>https://djpardis.com/blog/2019/07/31/models-for-integrating-data-science-teams-within-organizations</id>
   <content type="html">&lt;div class=&quot;toc-container post-container&quot;&gt;
&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#center-of-excellence-model&quot;&gt;The center-of-excellence model&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#coe-misconceptions&quot;&gt;Some misconceptions&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#coe-drawbacks&quot;&gt;Drawbacks&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#coe-benefits&quot;&gt;Benefits and success scenarios&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#accounting-model&quot;&gt;Accounting model&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#accounting-drawbacks&quot;&gt;Drawbacks&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#accounting-benefits&quot;&gt;Benefits&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#consultant-model&quot;&gt;The consultant model&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#consultant-benefits&quot;&gt;Benefits&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#consultant-drawbacks&quot;&gt;Drawbacks&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#embedded-model&quot;&gt;The embedded model&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#embedded-benefits&quot;&gt;Benefits&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#embedded-drawbacks&quot;&gt;Drawbacks&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#democratic-model&quot;&gt;The democratic model&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#democratic-benefits&quot;&gt;Benefits&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#democratic-drawbacks&quot;&gt;Drawbacks&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li class=&quot;toc-era&quot;&gt;
  &lt;details class=&quot;collapsible-section&quot;&gt;
    &lt;summary&gt;&lt;a href=&quot;#product-data-science-model&quot;&gt;The product data science model&lt;/a&gt;&lt;/summary&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#pds-benefits&quot;&gt;Benefits&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#pds-drawbacks&quot;&gt;Drawbacks&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#references&quot;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#acknowledgements&quot;&gt;Acknowledgements&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#citations-and-coverage&quot;&gt;Citations and coverage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;h2 id=&quot;introduction&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Introduction&lt;/a&gt;&lt;/h2&gt;

&lt;div class=&quot;image-row&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;
    &lt;img src=&quot;/files/pics/blog/2019/ds-team-models-1.jpeg&quot; alt=&quot;DS Crit meeting photo 1&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;image-container&quot;&gt;
    &lt;img src=&quot;/files/pics/blog/2019/ds-team-models-2.jpeg&quot; alt=&quot;DS Crit meeting photo 2&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;image-container&quot;&gt;
    &lt;img src=&quot;/files/pics/blog/2019/ds-team-models-3.jpeg&quot; alt=&quot;DS Crit meeting photo 3&quot; /&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;At our &lt;a href=&quot;https://twitter.com/djpardis/status/955946036693843969&quot; target=&quot;_blank&quot;&gt;inaugural DS Crit meeting&lt;/a&gt; at Twitter HQ.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Beginning in the &lt;a href=&quot;https://en.wikipedia.org/wiki/Apache_Hadoop&quot; target=&quot;_blank&quot;&gt;first decade of the 21st century&lt;/a&gt;, internet companies were able to gain visibility into the business in ways never possible in the age of spreadsheets and relational database management systems. No longer did they need to wait for end-of-quarter financial results in order to gauge business performance; and no more did they need to rely on extrapolations from samples to get a comprehensive view of what was working for all customers. In addition to improved visibility into the state of the business, the new data storage and aggregation capabilities enabled companies to build &lt;a href=&quot;https://www.oreilly.com/library/view/data-analytics-with/9781491913734/ch01.html&quot; target=&quot;_blank&quot;&gt;data products&lt;/a&gt; like search engines, language processors, and recommender systems.&lt;/p&gt;

&lt;p&gt;What became important was to determine how this work could be achieved efficiently and effectively. Designing and building a data science organization is a complex problem, particularly when determining the nature of data science interactions with stakeholders.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A DS team isn’t just the people, it is the process and the interaction of the team with the rest of the company.&lt;/p&gt;

  &lt;p&gt;— DJ Patil &lt;a href=&quot;#ref2&quot;&gt;[2]&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this post, I compare some of the popular models of integrating data science teams within companies. In determining the best model, I take into account the following factors:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coordination efficiency.&lt;/strong&gt; Every team creates new sources of knowledge. Incorporating that knowledge into the business in a timely and repeatable fashion requires robust organization design. Bad designs lead to failures and inefficiencies in knowledge sharing and coordination; this directly affects the &lt;em&gt;speed&lt;/em&gt; and &lt;em&gt;cost&lt;/em&gt; at which work is done.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The goal of work is some output, a strategy, product, marketing plan, budget, account plan, sale, feature, etc. Communication is a way of incorporating stakeholders into a plan &lt;em&gt;before&lt;/em&gt; it is too far along to change or the cost is too high (or coworkers too angry!)&lt;/p&gt;

  &lt;p&gt;— &lt;a href=&quot;https://medium.com/@stevesi&quot; target=&quot;_blank&quot;&gt;Steven Sinofsky&lt;/a&gt; on Twitter&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Employee happiness.&lt;/strong&gt; No discussion of organizational structure is complete without considering employee happiness, motivation, and growth factors. This is not just about reducing the cost of recruiting in response to employee churn, but also about providing employees with the circumstances to do creative and effective work during their tenure. Designing structures without considering employee happiness is a costly failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Product success.&lt;/strong&gt; Data scientists opportunity size new ideas, design experiments and metrics, and design and tune models. They promote the correct use of data within the company. New products shipped without these considerations usually contain deficiencies in instrumentation and implementation, and are potentially misaligned with company strategy. The customer voice is not accurately represented when experiments are incorrectly assessed and metrics incorrectly crafted. The decision making process is delayed without high quality data and metrics. Machine learning projects either fail or lack in quality without data science involvement.&lt;/p&gt;

&lt;p&gt;I make a number of assumptions. The &lt;em&gt;company&lt;/em&gt; is a single &lt;a href=&quot;https://cio-wiki.org/wiki/Strategic_Business_Unit&quot; target=&quot;_blank&quot;&gt;strategic business unit (SBU)&lt;/a&gt;. The SBU is partitioned in two ways. First, it is partitioned into independent &lt;em&gt;functions,&lt;/em&gt; with respect to specialization and responsibilities. Each function (e.g. design, marketing, or sales) is a group managing the needs of the business within the context of their specialization and responsibilities. Second, the SBU is partitioned with respect to outputs and services, into independent &lt;em&gt;products.&lt;/em&gt; The products are independent in that they have independent launch timelines. Using these definitions, a &lt;em&gt;product team&lt;/em&gt; is a subset of the SBU—with &lt;a href=&quot;https://en.wikipedia.org/wiki/Cross-functional_team&quot; target=&quot;_blank&quot;&gt;cross-functional&lt;/a&gt; membership—responsible for delivering a product or service.&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;data scientist (DS)&lt;/em&gt; is skilled in data engineering, data management, data analysis, and machine learning; and &lt;em&gt;data science&lt;/em&gt; is their work. The &lt;em&gt;data science function&lt;/em&gt; is a group of data scientists and their managers.&lt;/p&gt;

&lt;p&gt;We are now ready to review and compare the myriad ways that the data science function has been integrated within the SBU.&lt;/p&gt;

&lt;h2 id=&quot;center-of-excellence-model&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;The center-of-excellence model&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;We start with the most centralized of all other models. In the center-of-excellence (CoE) model, also known as &lt;em&gt;the research model&lt;/em&gt;, the expectation is that the data science team works independently to identify big bets and build prototypes. Under this model, the data science team is considered to be the company’s innovation arm.&lt;/p&gt;

&lt;h3 id=&quot;coe-misconceptions&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Some misconceptions&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;There are some misconceptions that lead companies to choosing the CoE model for their data science team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. PhD graduates are hired to do research.&lt;/strong&gt; Data science teams hire many PhD and Master’s graduates. The focus of most of these programs is research, and so there is a misconception that graduates of these programs are hired to do research. However, the true motivation for hiring PhD and Master’s graduates into DS roles is different. Data science is an interdisciplinary field with a wide variety of requirements. Data scientists are usually required to handle data engineering, statistical analysis, and machine learning in highly diverse domains (e.g. Finance, Marketing, Logistics, Healthcare, Social Media). The extra years of studies in engineering, mathematics, and statistics are meant to boost their abilities in producing quality analyses, in communicating results, in working with external stakeholders, and in designing of new methodology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Innovation happens in a lab.&lt;/strong&gt; In cases where companies rightly expect innovation from the data science team, there is a misconception that the innovation arm of the company needs to be freed and independent of the day-to-day requirements of the business. When teams do not consider the company’s existing business model and infrastructure, their output does not translate into functioning products. This is why despite some &lt;a href=&quot;https://en.wikipedia.org/wiki/Bell_Labs&quot; target=&quot;_blank&quot;&gt;historical success stories&lt;/a&gt;, many companies refrain from such an investment even when it is affordable. The question then arises, “who is in charge of innovation?” &lt;a href=&quot;https://medium.com/@djpardis/q-a-with-steven-sinofsky-at-twitter-hq-a658ca5db953&quot; target=&quot;_blank&quot;&gt;That will need to be the topic of a future post&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;coe-drawbacks&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Drawbacks&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;There are important drawbacks to having the data science team operate within the CoE model:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Lack of context about the challenges of the business.&lt;/strong&gt; Without visibility into the day-to-day decision-making challenges, purely centralized data science teams find it difficult to identify the most important problems to tackle. They focus on pie in the sky ideas while the business suffers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Difficulty in closing the loop.&lt;/strong&gt; In cases where they are successful at identifying and solving an important problem, centralized research teams find it difficult to get the solution adopted by the product teams. The adoption of the proposed solution would likely disrupt a team’s existing roadmap—as the two teams are out of sync. Resolving this conflict usually requires actions by higher management, leading to unwelcome interruptions to existing teams and their roadmaps. If higher management does not step in, research teams become demotivated.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;My view is everyone is on the same calendar/cadence. That’s a huge thing for me. If you don’t have that then split resources (all of them) by cadence. Teams on difference cadences can’t collaborate.&lt;/p&gt;

  &lt;p&gt;— &lt;a href=&quot;https://medium.com/@stevesi&quot; target=&quot;_blank&quot;&gt;Steven Sinofsky&lt;/a&gt; on Twitter&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;c. High cost associated with building new team to back initiative.&lt;/strong&gt; Rather than disrupting existing roadmaps, an alternate path is to build a new product team to back a proposed solution. This team would have cross-functional membership to work on the proposals by the research team, making it a costly but valid endeavor. Valid, because ideas need to be backed by a complete team in order to be assessed correctly and quickly. It would be useless to measure the success of an idea if any part of the experience is lacking. A new feature requires design, engineering, data, marketing, comms, and sales involvement to realize its potential.&lt;/p&gt;

&lt;p&gt;The cost of building a brand new product team further increases if the new product team does not form &lt;a href=&quot;https://en.wikipedia.org/wiki/Partition_of_a_set&quot; target=&quot;_blank&quot;&gt;a partition&lt;/a&gt; along with existing product teams. Forming a partition with other product teams is important, otherwise roadmaps and responsibilities would be overlapping. This puts the new team’s longevity at jeopardy as they try to figure out their raison d’être—while confusing other teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;d. Non-recurring and nondeterministic output.&lt;/strong&gt; Under the research model, the product teams might be able to adopt and find value in a single output from the data science team, but wonder if there would be follow-through and continued feedback if they were to go ahead and make the proposed changes.&lt;/p&gt;

&lt;h3 id=&quot;coe-benefits&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Benefits and success scenarios&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;It should be noted that the CoE model works for many types of teams. Centralization helps focus and agency. You should fully centralize that which you can clearly encapsulate from the rest of the organization. Full centralization works when coupling is low and joint meetings are few and far between.&lt;/p&gt;

&lt;p&gt;As an example, consider tooling development teams. Once the company decides on a technology or programming language, tooling improvement efforts can happen more or less independently of product launch timelines.&lt;/p&gt;

&lt;p&gt;Another example of a successful CoE team is &lt;a href=&quot;https://en.wikipedia.org/wiki/Microsoft_Research&quot; target=&quot;_blank&quot;&gt;Microsoft Research&lt;/a&gt;, a subsidiary of Microsoft. Formed in 1991, there is no expectation that the institute produce any result that would be applicable to core Microsoft products. It turns out that &lt;a href=&quot;https://www.forbes.com/sites/louiscolumbus/2019/01/06/microsoft-leads-the-ai-patent-race-going-into-2019/#69ce0e6844de&quot; target=&quot;_blank&quot;&gt;Microsoft is leading the patent race in AI&lt;/a&gt; as a result of its investment in a research institute.&lt;/p&gt;

&lt;h2 id=&quot;accounting-model&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Accounting model&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;In the accounting model, also known as &lt;em&gt;the BI model&lt;/em&gt;, the central data science team produces reports and presentations on a recurring basis (usually monthly and quarterly). The data science team would inform the company of notable movements in top-level metrics. Once the team identifies an interesting or worrying trend, they would work with product teams to investigate the root cause. Thus, quite frequently, playing detective becomes a main activity of the data science team under the accounting model.&lt;/p&gt;

&lt;h3 id=&quot;accounting-drawbacks&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Drawbacks&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;There are three main drawbacks to this model:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Difficulty in attribution and closing the loop.&lt;/strong&gt; As mentioned above, it is near impossible to reason based on global trends. This drawback becomes particularly pronounced when there are many product teams and hence many moving parts and levers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Reorganization and the emergence of tiger teams.&lt;/strong&gt; It is important to have analyses and metrics which are tied to levers (product teams) so that they are actionable quickly and with less cost and reorganization needs. Reorganization happens and new &lt;a href=&quot;https://www.lucidchart.com/blog/what-is-a-tiger-team&quot; target=&quot;_blank&quot;&gt;“tiger” teams&lt;/a&gt; emerge when the data science team is unable to identify the culprit and existing product teams are unable to own and prioritize a fix.&lt;/p&gt;

&lt;p&gt;Tiger teams rarely form a partition with existing product teams and thus disrupt the flow of the organization. The emergence of tiger teams is a drawback of all fully centralized models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c. Underutilizing technology.&lt;/strong&gt; Having monthly and quarterly reports be the only function of the data science team is failing to fully gauge product quality before reaching certain calendar milestones. If launches are leading to less usage in a particular market, the drop happens &lt;em&gt;a launch at a time&lt;/em&gt;, not a quarter or a week at a time. A product opens up to misuse &lt;em&gt;a launch at a time&lt;/em&gt;. Data security is breached &lt;em&gt;a launch at a time&lt;/em&gt;. Identifying the launch that led to decreased usage in Japan after many launches is an impossible task; so is determining the launch that created incorrect incentives for abusive behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;d. Low quality and stale data.&lt;/strong&gt; Every launch creates new sources of data that need to be incorporated back into existing metrics, considered in future analyses, and incorporated in existing and future models. Accountant data scientists miss all important updates, and usually rely on stale data for analyses. It is difficult to be involved in instrumentation from the sidelines. This is a drawback of all fully centralized models.&lt;/p&gt;

&lt;h3 id=&quot;accounting-benefits&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Benefits&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Reporting on quarterly trends of company metrics is valuable practice. The centralized aspect of the BI team allows for a holistic view of the SBU, inspiring decisions that lead to global optimizations that can balance and correct local decisions. This work is something that the data science team should be tackling as part of their charter, regardless of the model under practice.&lt;/p&gt;

&lt;h2 id=&quot;consultant-model&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;The consultant model&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;In the consultant model, the central data science team is assigned tickets or emailed with questions. Data science managers then prioritize the tickets and questions and assign them to data scientists.&lt;/p&gt;

&lt;h3 id=&quot;consultant-benefits&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Benefits&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;In this model, the data science manager overrides any existing data science roadmaps to prioritize the questions and needs of stakeholders. Due to the symmetrical treatment of all members of the team, this model makes managing a data science team easy and cheap.&lt;/p&gt;

&lt;h3 id=&quot;consultant-drawbacks&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Drawbacks&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;There are many drawbacks with this model:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Communications overhead.&lt;/strong&gt; Data scientists in a consulting position usually lack the context to resolve questions effectively in a timely manner. There is communications overhead involved in gaining familiarity with data sources and their creation process. Further, if a follow-up to an analysis is needed and the original consultant data scientist has other ongoing commitments, the work will get assigned to another data scientist. This requires yet another onboarding investment—and thus the cycle continues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Unclear deadlines.&lt;/strong&gt; It is difficult for stakeholders to know when work would get prioritized and assigned to a consultant data scientist. Additionally, the processes affecting the volume of incoming requests are not transparent to the data science team and their managers. Even after work gets assigned and prioritized, it is difficult for the data scientist to be able to estimate the amount of time needed to answer questions due to their unfamiliarity with the limitations and nuances of the ever-changing data sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c. Short-term ownership.&lt;/strong&gt; Innovation happens when people plan for years, not days and weeks. Having data scientists act as short-term consultants makes it difficult to incentivize focus on complex or tedious work. This work is needed to ensure quality data, quality experimentation tools, quality data manipulation and visualization tools, and quality results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;d. Unclear ownership.&lt;/strong&gt; A by-product of short-term ownership is unclear ownership. When projects are one-off and seemingly random, people are more likely to step on each other’s toes. This happens inadvertently but is a non-negligible source of inefficiency. It should be noted that this is a drawback of all fully centralized models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;e. Lack of motivation and unfulfilling work.&lt;/strong&gt; Data scientists working under this model usually lack motivation as they are rarely involved in the product decision making process. They also usually find the work unfulfilling as they rarely see the results and impact of their work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;f. Low data quality and recurring emergencies.&lt;/strong&gt; Without maintaining good data practices, products that rely on data as input fall prey to recurring bugs and emergencies. In this model, data scientists are pulled into a project in order to play detective and identify the source of the bug.&lt;/p&gt;

&lt;p&gt;Apart from the unfamiliarity of the data scientist with the data creation process and the product’s evolution, there is also the possibility of missing data due to missing instrumentation. It is impossible to find a needle in a haystack when the needle is not instrumented. It is also painful to look for a needle in a haystack that is extended to the fourth dimension (of time).&lt;/p&gt;

&lt;p&gt;Finding the culprit is nearly impossible in these situations. As discussed in the drawbacks of the accounting model, the organization usually responds by creating a tiger team, thus inducing further distractions and costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;g. Unclear coverage of product areas.&lt;/strong&gt; There are many allocation and prioritization challenges under this model. How does work get prioritized by the data science manager? Which product teams get the most attention—the successful products or the struggling ones? Which decisions are made with data in mind, which are made without, and who would be making these global decisions when the data science team lacks visibility?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;h. No clear sizing and allocation strategy.&lt;/strong&gt; As with any fully centralized model, it is always difficult to determine the number of data scientists needed. Does the size of the team grow with the size of the organization or with the number of requests? If the latter, how does one estimate the number of the requests and total scope? There is no simple strategy for determining the size of a fully centralized consultant team.&lt;/p&gt;

&lt;h2 id=&quot;embedded-model&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;The embedded model&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;In this model, product teams hire their own data scientists. Each engineering manager is in charge of planning for data scientist headcount, hiring, allocation, and roadmap. The data scientist within each product team has the engineering team members as their peers.&lt;/p&gt;

&lt;h3 id=&quot;embedded-benefits&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Benefits&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;This model brings welcome independence to the teams and relieves the SBU of the management requirements of a fully centralized data science team. It solves problems with team sizing and communications by distributing responsibility. It also solves the ownership and motivation issues that exist in fully centralized models.&lt;/p&gt;

&lt;h3 id=&quot;embedded-drawbacks&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Drawbacks&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;While there are reductions in data science management costs, this model has important drawbacks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Management complexity.&lt;/strong&gt; Title and role diversity on the same team lead to management headaches. It is difficult for a single manager to maintain and assess multiple career ladders for different members of the team; managers rarely get it right even with a single ladder. Usually, the engineering manager is inadvertently biased towards assessment against the more common requirements—those of the standard engineering ladder. This incentivizes the data scientist to take on a role symmetrical to the other engineers on the team, undermining the original point of hiring a data scientist. Additionally, hiring data scientists, putting together the right interview panel, and on-boarding data scientists are all important challenges within this model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Mentorship deficit and difficulty in maintaining uniform data standards and best practices.&lt;/strong&gt; Data scientists benefit and learn from working closely with their peers, in particular during analysis reviews. An embedded model does not readily offer a path to a recurring and persistent relationship among data scientists. Further, independent data scientists on each team would design their own processes and standards. It should be noted that weak standards is a drawback of all fully decentralized models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c. Underutilizing technology and data science de-prioritization.&lt;/strong&gt; Some teams might put off hiring a data scientist due to pressing deadlines and costs. In the absence of good data, services are still deployed. This leads to important shortcomings in data quality and data products that becomes cumbersome and oftentimes impossible to fix later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;d. Local rather than global optimization.&lt;/strong&gt; When there is no central ownership over metrics and key results, teams choose metrics and projects that lead to local optimization. Further, in this model, teams are incentivized to compete and ignore cannibalization effects. Local optimization is a drawback of all fully decentralized models.&lt;/p&gt;

&lt;h2 id=&quot;democratic-model&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;The democratic model&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;In this model, it is believed that easy and straightforward access to data by product managers, designers, engineering managers, and engineers would lessen or remove the need for a data science role. Many identify the need for data scientists to be due to the lack of proper infrastructure for fast and easy dashboard creation.&lt;/p&gt;

&lt;h3 id=&quot;democratic-benefits&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Benefits&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;It is valuable to invest in data infrastructure and tooling that makes data access, processing, and visualization simpler everyday. This investment is particularly valuable to data scientists as it frees up time for proactive opportunity sizing, experiment design, metric design, model design, and general improvements in methodology.&lt;/p&gt;

&lt;h3 id=&quot;democratic-drawbacks&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Drawbacks&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;While ensuring everyone has direct and easy access to data is a noble goal, there are some drawbacks to this model:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Difficulty in mastering everything and maintaining data best practices.&lt;/strong&gt; Usually, people are mostly specialized and interested in a particular set of tasks. Being skilled at a company’s engineering stack is already a big feat. It is fine to offload design work and sales work and data work. Data scientists enforce good data practices within the organization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Dashboards are not the goal of data science, they are an intermediary step in the exploration of data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://attackwithnumbers.com/the-laws-of-shitty-dashboard&quot; target=&quot;_blank&quot;&gt;&lt;strong&gt;The laws of shitty dashboards * Attack with Numbers&lt;/strong&gt;&lt;/a&gt; (&lt;a href=&quot;https://web.archive.org/web/20200621172508/http://attackwithnumbers.com/the-laws-of-shitty-dashboard&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Wayback snapshot&lt;/a&gt;)&lt;br /&gt;
&lt;em&gt;Disclosure: I have been responsible for building shitty dashboards. I personally made most of the errors below. I…&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;product-data-science-model&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;The product data science model&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Between the extremes of the fully centralized model (the CoE model) and the fully decentralized model (the embedded model), there exists a spectrum of &lt;em&gt;hybrid&lt;/em&gt; models that take characteristics from each of the aforementioned models. Taking advantage of the strengths of both models, while actively making up for their deficiencies is what makes hybrid models successful.&lt;/p&gt;

&lt;p&gt;The product data science (PDS) model is inspired, only in part, by the &lt;a href=&quot;https://courses.lumenlearning.com/boundless-management/chapter/common-organizational-structures/&quot; target=&quot;_blank&quot;&gt;matrix structure&lt;/a&gt;. Individuals are simultaneously members of the data science function and a product team. Data scientists—although each a member of a product team—report only to a central data science management team. Thus, unlike the matrix structure, there is &lt;a href=&quot;https://www.mindtools.com/pages/article/henri-fayol.htm&quot; target=&quot;_blank&quot;&gt;unity of command&lt;/a&gt; under the PDS model.&lt;/p&gt;

&lt;p&gt;Revisiting the assumptions we enumerated at the beginning, in the PDS model, the cross-functional product team would include data scientists.&lt;/p&gt;

&lt;h3 id=&quot;pds-benefits&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Benefits&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;a. Clear ownership, actionable insights, and speed.&lt;/strong&gt; One important benefit of the PDS model is clear ownership of projects by the data scientists, due to their membership in the various product teams. Membership in each product team gives data scientists a thorough understanding of that product, its limits, and its potential. This in turn allows a straightforward mapping of analysis to proposals for action. It is difficult to move fast if newly available insight does not map into reasonable and informed actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Quality data and quality data products.&lt;/strong&gt; Data scientists close collaborations with a product team improves data quality. Every single launch changes the data, and so it is important to oversee its evolution with careful instrumentation.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2019/data-quality-tweet.png&quot; alt=&quot;Tweet about data quality&quot; style=&quot;max-width: 600px; display: block; margin: 0 auto;&quot; /&gt;
&lt;em&gt;&lt;a href=&quot;https://twitter.com/peteskomoroch/status/1054142127054163969&quot; target=&quot;_blank&quot;&gt;Pete Skomoroch&lt;/a&gt; on Twitter&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c. Standardized data science processes.&lt;/strong&gt; Data science peers, working on different product teams, come together to establish best practices and onboarding flows within the data science team. They review one another’s code and analyses. They collaborate on complex projects. They benefit from a unified career ladder, with managers who can assess their impact and can plan for their growth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;d. Global optimization.&lt;/strong&gt; The direct and recurring collaboration of data science peers from various product teams has other benefits. Due to their collective birds-eye-view of the business, they are able to connect the dots, identify inconsistencies, and optimize globally. This is &lt;a href=&quot;https://medium.com/@djpardis/q-a-with-steven-sinofsky-at-twitter-hq-a658ca5db953&quot; target=&quot;_blank&quot;&gt;similar to the way design teams should operate&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;e. Sizing and allocation clarity.&lt;/strong&gt; Another benefit of the PDS model is that it simplifies the task of determining the size of the data science team. Once you figure out how to partition the SBU into product teams, and you figure out the number of cross-functional stakeholders per product, the allocation of data scientists can be determined as a proportion. More available &lt;a href=&quot;https://medium.com/@djpardis/recommendations-for-data-science-team-sizing-and-allocation-strategy-a38f943638e5&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;pds-drawbacks&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Drawbacks&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;No model is perfect and each have their drawbacks.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Since there is no optimal or perfect organizational structure […] then the most important thing is to know the weaknesses of your structure and to compensate for them.&lt;/p&gt;

  &lt;p&gt;— Steven Sinofsky, &lt;a href=&quot;https://medium.learningbyshipping.com/functional-versus-unit-organizations-6b82bfbaa57&quot; target=&quot;_blank&quot;&gt;Functional versus Unit Organizations&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Below are some drawbacks of the PDS model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Cost.&lt;/strong&gt; One of the main arguments against the PDS model is the cost of hiring at least one data scientist for every product team; and the associated cost of a centralized data science management team. This assessment does not take into account the savings stemming from the increase in product and data quality, and the more effective use of data for the business. Having said that, organizations should do what they can afford. In the beginning everyone is responsible for engineering and data and design needs. As the SBU grows, one can have specialized functions handling each set of responsibilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Recurring conflicts due to lack of power parity.&lt;/strong&gt; For success on cross-functional teams, all functional leads should have similar amounts of negotiating power. Without power parity, the benefits of cross-functional collaboration are lessened due to recurring conflicts, lack of context, late delivery, and thus suboptimal results. Power parity is ensured by parity in reporting structure and compensation. Many companies, to this day, lack data science representation at the executive level of the SBU.&lt;/p&gt;

&lt;p&gt;Note that the importance of power parity is particularly pronounced when the company is behind on Data Science investments—whether it be in data infrastructure or in people. For data, the expectations are high and the stakeholders numerous. Short-term and long-term planning for the function needs to happen by someone who not only understands the requirements and challenges but is empowered to correctly and efficiently steer the data culture of an existing company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c. Information overload and the data science manager.&lt;/strong&gt; Gaining the right amount of knowledge about all products supported by the data science team is not straightforward. However, managers need to be informed and curious about the areas under their purview to be able to build a roadmap and effectively assess contributions, investments, timelines, and tradeoffs. They also need to be able to continually communicate the contract between the data science and stakeholder teams and be mindful of the team’s portfolio. This is a responsibility of the managers of every functional team—not just data science.&lt;/p&gt;

&lt;p&gt;The drawbacks of the PDS model have relatively straightforward solutions, as described above.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Conclusion&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Where an SBU is involved, I recommend the PDS model as the best in effectiveness and efficiency in leveraging data for the business.&lt;/p&gt;

&lt;p&gt;The PDS model is compliant with &lt;a href=&quot;https://www.quora.com/What-is-Groves-Law-and-What-is-the-difference-between-Moores-Law-and-Groves-Law&quot; target=&quot;_blank&quot;&gt;Grove’s Law&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;All large organizations with a common business purpose end up in a hybrid organizational form.&lt;/p&gt;

  &lt;p&gt;— Andy Grove&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is also aligned with &lt;a href=&quot;https://object.cato.org/sites/cato.org/files/articles/hayek-use-knowledge-society.pdf&quot; target=&quot;_blank&quot;&gt;Hayek’s views on the use of knowledge in society&lt;/a&gt;, where he motivates the need for a hybrid approach to organization and decision making. Neither end of the spectrum sufficiently meets the speed and context requirements of decision making in society.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We cannot expect that this problem will be solved by first communicating all this knowledge to a central board which, after integrating &lt;em&gt;all&lt;/em&gt; knowledge, issues its orders. We must solve it by some form of decentralization. But this answers only part of our problem. We need decentralization because only thus can we insure that the knowledge of the particular circumstances of time and place will be promptly used. But the “man on the spot” cannot decide solely on the basis of his limited but intimate knowledge of the facts of his immediate surroundings. There still remains the problem of communicating to him such further information as he needs to fit his decisions into the whole pattern of changes of the larger economic system.&lt;/p&gt;

  &lt;p&gt;— F.A. Hayek&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;P.S. I had an easier time saying all of this in a Tweet,&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Embedded for context, relevance, communication efficiency, and to be in sync. Centralized for hiring and promotion purposes, for peer review, for sharing and maintaining best practices [, for global optimization, and to align on strategy].&lt;/p&gt;

  &lt;p&gt;— @djpardis on &lt;a href=&quot;https://twitter.com/djpard1s/status/999784577441787905?s=20&amp;amp;t=tTdNIhFQpuwmEcSwAujllA&quot; target=&quot;_blank&quot;&gt;Twitter&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;references&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;References&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;a id=&quot;ref1&quot; href=&quot;#ref1-back&quot;&gt;[1]&lt;/a&gt; &lt;a href=&quot;https://medium.learningbyshipping.com/functional-versus-unit-organizations-6b82bfbaa57&quot; target=&quot;_blank&quot;&gt;Functional versus Unit Organizations&lt;/a&gt; by Steven Sinofsky&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref2&quot; href=&quot;#ref2-back&quot;&gt;[2]&lt;/a&gt; &lt;a href=&quot;http://www.datascienceassn.org/sites/default/files/Building%20Data%20Science%20Teams.pdf&quot; target=&quot;_blank&quot;&gt;Building Data Science Teams&lt;/a&gt; by DJ Patil&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref3&quot; href=&quot;#ref3-back&quot;&gt;[3]&lt;/a&gt; &lt;a href=&quot;https://www.oreilly.com/ideas/where-should-you-put-your-data-scientists&quot; target=&quot;_blank&quot;&gt;Where should you put your data scientists&lt;/a&gt; by Daniel Tunkelang&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;ref4&quot; href=&quot;#ref4-back&quot;&gt;[4]&lt;/a&gt; &lt;a href=&quot;https://www.youtube.com/watch?v=rqWnEJXnfiY&quot; target=&quot;_blank&quot;&gt;How to play well with others&lt;/a&gt; by Josh Wills&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Acknowledgements&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Thanks to &lt;a href=&quot;https://twitter.com/rakiwane&quot; target=&quot;_blank&quot;&gt;Raki Wane&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/peteskomoroch&quot; target=&quot;_blank&quot;&gt;Peter Skomoroch&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/_saysan_&quot; target=&quot;_blank&quot;&gt;Sayan Sanyal&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/jrmontag&quot; target=&quot;_blank&quot;&gt;Josh Montague&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/chrisalbon&quot; target=&quot;_blank&quot;&gt;Chris Albon&lt;/a&gt;, Josh Silverman, and &lt;a href=&quot;https://twitter.com/_harish_krishna&quot; target=&quot;_blank&quot;&gt;Harish Krishnan&lt;/a&gt; for reviewing and providing valuable feedback.&lt;/p&gt;

&lt;h2 id=&quot;citations-and-coverage&quot;&gt;&lt;a href=&quot;#table-of-contents&quot;&gt;Citations and coverage&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://onlinedatasciencemasters.virginia.edu/blog/need-for-interdisciplinary-data-science/&quot;&gt;University of Virgina Data Science&lt;/a&gt; (&lt;a href=&quot;/files/wayback/uva-need-for-interdisciplinary-data-science-20210517.html&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Wayback snapshot&lt;/a&gt;), &lt;a href=&quot;https://www.97ways.com/thelist/8-sit-with-your-stakeholders&quot;&gt;97 Ways (Matt Wright)&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/pulse/beyond-poc-how-make-machine-learning-real-enterprise-sam-charrington/&quot;&gt;Beyond the POC. How to Make Machine Learning Real in the Enterprise (Sam Charrington)&lt;/a&gt;, &lt;a href=&quot;https://us20.campaign-archive.com/?e=&amp;amp;u=8974b971ec317d8a98dbbf292&amp;amp;id=05f0f9e91a&quot;&gt;Projects to Know (Amplify Partners, Sarah Catanzaro)&lt;/a&gt;, &lt;a href=&quot;http://roundup.fishtownanalytics.com/issues/survival-analysis-better-presto-pinterest-dagster-data-science-in-organizations-a-two-fer-dsr-194-193857&quot;&gt;The Data Science Roundup (Fishtown Analytics, Tristan Handy)&lt;/a&gt;, &lt;a href=&quot;https://vicki.substack.com/p/selling-data-science&quot;&gt;Normcore Tech (Vicki Boykis)&lt;/a&gt;, &lt;a href=&quot;https://femstreet.substack.com/p/-parenthood-and-entrepreneurship-19-08-04&quot;&gt;Femstreet (Sarah Nöckel)&lt;/a&gt;, &lt;a href=&quot;http://lineardigressions.com/episodes/2019/8/25/organizational-models-for-data-scientists&quot;&gt;Linear Digressions&lt;/a&gt;, &lt;a href=&quot;https://analyticaliq.com/data-science-staffing/&quot;&gt;Analytical IQ (Adam Lorton)&lt;/a&gt;, &lt;a href=&quot;https://hex.tech/blog/data-team-roi&quot;&gt;Hex Blog (Hex, Barry McCardel)&lt;/a&gt;, &lt;a href=&quot;https://fall2019.fullstackdeeplearning.com/course-content/where-to-go-next&quot;&gt;Full Stack Deep Learning&lt;/a&gt;, &lt;a href=&quot;https://www.getrevue.co/profile/shashank/issues/the-ml-times-issue-14-192472&quot;&gt;The ML Times&lt;/a&gt;, &lt;a href=&quot;https://dispatch.nibble.ai/issues/nibble-ai-weekly-issue-23-making-data-science-more-useful-deploying-ai-without-technical-debt-191252&quot;&gt;nibble dispatch&lt;/a&gt;, &lt;a href=&quot;https://leanpub.com/dshiring&quot;&gt;Hiring Data Scientists and Machine Learning Engineers. A Practical Guide (Roy Keyes)&lt;/a&gt;, &lt;a href=&quot;https://anchor.fm/blog-cast/episodes/Ep-9-Pardis-Noorzad-Models-for-integrating-data-science-teams-within-companies-e1529qu&quot;&gt;Blog Cast (Sam Bail)&lt;/a&gt;, &lt;a href=&quot;https://www.getdbt.com/data-teams/centralized-vs-decentralized/&quot;&gt;dbt Blog (Erin Vaughan and Janessa Lantz)&lt;/a&gt;, &lt;a href=&quot;https://pedram.substack.com/p/modern-data-team&quot;&gt;Building The Modern Data Team (Pedram Navid)&lt;/a&gt;, &lt;a href=&quot;https://nirantk.com/writing/data-science-org-design.html&quot;&gt;Data Science Org Design for Startups (Nirant Kasliwal)&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/pulse/search-leadership-daniel-tunkelang&quot;&gt;On Search Leadership (Daniel Tunkelang)&lt;/a&gt;, &lt;a href=&quot;https://blog.collectors.com/building-a-data-platform-from-scratch-at-collectors-part-3-of-3/&quot;&gt;Building A Data Platform From Scratch At Collectors. Part 3 of 3 (Sam Bail)&lt;/a&gt;, &lt;a href=&quot;https://amplifypartners.com/moderndateteamshub/&quot;&gt;Modern Data Teams Hub (Amplify Partners, Emilie Schario)&lt;/a&gt;, &lt;a href=&quot;https://github.com/jm-contreras/data-science-management-resources&quot;&gt;Data Science Management Resources (jm-contreras)&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;crosspost-container post-container&quot;&gt;
This post was originally published on &lt;a href=&quot;https://medium.com/@djpardis/models-for-integrating-data-science-teams-within-organizations-7c5afa032ebd&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Medium&lt;/a&gt; and is cross-posted here.
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Hourly mentions of a word on Twitter</title>
   <link href="https://djpardis.com/blog/2015/05/23/hourly-mentions-of-a-word-on-twitter/"/>
   <updated>2015-05-23T00:00:00+00:00</updated>
   <id>https://djpardis.com/blog/2015/05/23/hourly-mentions-of-a-word-on-twitter</id>
   <content type="html">&lt;p&gt;Some time ago (OK, a month ago—time ✈️s), I saw this tweet:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Need a simple tool to track mentions of a keyword on Twitter by hour. Don’t need a bunch of bells and whistles. Thoughts?&lt;/p&gt;

  &lt;p&gt;— Kaegan Donnelly (&lt;a href=&quot;https://twitter.com/kaequan/status/591359379431104513&quot;&gt;@kaequan&lt;/a&gt;) • April 23, 2015&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I thought, “Should be easy, lmgt.” However, results for the query “hourly mentions of a word on Twitter” didn’t offer clear solutions.&lt;/p&gt;

&lt;p&gt;Days later I came across two relatively simple approaches to tackling the problem. The first is &lt;a href=&quot;https://github.com/tweepy/tweepy&quot;&gt;Tweepy&lt;/a&gt;. The other is &lt;a href=&quot;https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html&quot;&gt;Logstash&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Tweepy is an &lt;a href=&quot;http://www.tweepy.org/&quot;&gt;open source Python library&lt;/a&gt; for accessing the Twitter API, including the Twitter Streaming API.&lt;/p&gt;

&lt;p&gt;Logstash is an open source tool for &lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Logstash&quot;&gt;collecting, processing, and forwarding events&lt;/a&gt;. Logstash can read events from the Twitter Streaming API using &lt;a href=&quot;https://www.elastic.co/guide/en/logstash/current/plugins-inputs-twitter.html&quot;&gt;its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;twitter&lt;/code&gt; plugin&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Having tried both, I recommend Logstash over Tweepy for two main reasons:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;It &lt;a href=&quot;https://github.com/logstash-plugins/logstash-input-twitter/blob/master/lib/logstash/inputs/twitter.rb&quot;&gt;deals&lt;/a&gt; with the Twitter API rate limits by default&lt;/li&gt;
  &lt;li&gt;It offers Elasticsearch and Kibana integration—simplifying the aggregation and visualization steps, respectively, that naturally follow the data (tweet) collection step&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For both Tweepy and Logstash you need access to Twitter’s streaming API. Follow steps 2 and 3 &lt;a href=&quot;https://www.digitalocean.com/community/tutorials/how-to-authenticate-a-python-application-with-twitter-using-tweepy-on-ubuntu-14-04&quot;&gt;here&lt;/a&gt; to create a Twitter app and obtain your &lt;em&gt;Consumer Key&lt;/em&gt;, &lt;em&gt;Consumer Key Secret&lt;/em&gt;, &lt;em&gt;Access Token&lt;/em&gt;, and &lt;em&gt;Access Token Secret&lt;/em&gt;.&lt;/p&gt;

&lt;h3 id=&quot;the-elk-solution&quot;&gt;The ELK solution&lt;/h3&gt;

&lt;p&gt;Download and install &lt;a href=&quot;https://www.elastic.co/downloads/past-releases/elasticsearch-1-4-4&quot;&gt;Elasticsearch&lt;/a&gt;, &lt;a href=&quot;https://www.elastic.co/downloads/logstash&quot;&gt;Logstash&lt;/a&gt;, and &lt;a href=&quot;https://www.elastic.co/downloads/kibana&quot;&gt;Kibana&lt;/a&gt;. If you are on a Mac, you can&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;brew &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;elasticsearch
brew &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;logstash&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Make sure you have Elasticsearch and Kibana running. Before running Logstash, you need to prepare a configuration file. Below is a sample configuration file to collect tweets containing the word &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ireland&lt;/code&gt; (call it &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ireland.conf&lt;/code&gt;):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-apacheconf&quot; data-lang=&quot;apacheconf&quot;&gt;# a logstash config file has three sections:
# input{}, output{}, and (optional) filter{}; add plugins
# to specify how events should be handled in each section

input {
    twitter {
        # set key and token values from the previous step
        consumer_key =&amp;gt; &quot;&quot;
        consumer_secret =&amp;gt; &quot;&quot;
        oauth_token =&amp;gt; &quot;&quot;
        oauth_token_secret =&amp;gt; &quot;&quot;
        # assume we are interested in tracking all
        # mentions of the word &quot;ireland&quot;
        keywords =&amp;gt; [&quot;ireland&quot;]
        # no need for all fields to get hourly counts
        full_tweet =&amp;gt; false
    }
}

output {
	stdout {
		# include this to pretty-print the event&apos;s json to stdout
		codec =&amp;gt; rubydebug
  	}
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;To start streaming tweets, run&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-apacheconf&quot; data-lang=&quot;apacheconf&quot;&gt;logstash -f ireland.conf&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;At this point, tweets are written to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stdout&lt;/code&gt;. In order to visualize tweet counts using Kibana, you need to save the tweets to Elasticsearch.&lt;/p&gt;

&lt;p&gt;Add the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;elasticsearch&lt;/code&gt; plugin to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;output&lt;/code&gt; section of the configuration:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-apacheconf&quot; data-lang=&quot;apacheconf&quot;&gt;output {
    elasticsearch {
        protocol =&amp;gt; &quot;http&quot;
        host =&amp;gt; &quot;localhost&quot;
        index =&amp;gt; &quot;irelandtweets&quot;
    }

	stdout {
		# include this to pretty-print the event&apos;s json to stdout
		codec =&amp;gt; rubydebug
  	}
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Run Logstash again and have a look at:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;http://localhost:9200/irelandtweets/_search/?pretty&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Below is a sample of the output format. You can see, for example, that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;65235&lt;/code&gt; documents (tweets) have been stored in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;irelandtweets&lt;/code&gt; index:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;took&quot;&lt;/span&gt; : 2,
  &lt;span class=&quot;s2&quot;&gt;&quot;timed_out&quot;&lt;/span&gt; : &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;_shards&quot;&lt;/span&gt; : &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;total&quot;&lt;/span&gt; : 5,
    &lt;span class=&quot;s2&quot;&gt;&quot;successful&quot;&lt;/span&gt; : 5,
    &lt;span class=&quot;s2&quot;&gt;&quot;failed&quot;&lt;/span&gt; : 0
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;hits&quot;&lt;/span&gt; : &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;total&quot;&lt;/span&gt; : 65235,
    &lt;span class=&quot;s2&quot;&gt;&quot;max_score&quot;&lt;/span&gt; : 1.0,
    &lt;span class=&quot;s2&quot;&gt;&quot;hits&quot;&lt;/span&gt; : &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;_index&quot;&lt;/span&gt; : &lt;span class=&quot;s2&quot;&gt;&quot;irelandtweets&quot;&lt;/span&gt;,
      &lt;span class=&quot;s2&quot;&gt;&quot;_type&quot;&lt;/span&gt; : &lt;span class=&quot;s2&quot;&gt;&quot;logs&quot;&lt;/span&gt;,
      &lt;span class=&quot;s2&quot;&gt;&quot;_id&quot;&lt;/span&gt; : &lt;span class=&quot;s2&quot;&gt;&quot;AU2B1MGZPj_44djTabLA&quot;&lt;/span&gt;,
      &lt;span class=&quot;s2&quot;&gt;&quot;_score&quot;&lt;/span&gt; : 1.0,
      &lt;span class=&quot;s2&quot;&gt;&quot;_source&quot;&lt;/span&gt;:&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;@timestamp&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;2015-05-23T17:31:51.000Z&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;message&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;Y&apos;all have no idea how happy I am for Ireland 💗 Can my country say yes to equality too 😭&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;user&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;LesbiForLauren&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;client&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;a href=&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;http://twitter.com/download/iphone&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; rel=&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;nofollow&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;gt;Twitter for iPhone&amp;lt;/a&amp;gt;&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;retweeted&quot;&lt;/span&gt;:false,&lt;span class=&quot;s2&quot;&gt;&quot;source&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;http://twitter.com/LesbiForLauren/status/602165054042034176&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;@version&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;, &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;_index&quot;&lt;/span&gt; : &lt;span class=&quot;s2&quot;&gt;&quot;irelandtweets&quot;&lt;/span&gt;,
      &lt;span class=&quot;s2&quot;&gt;&quot;_type&quot;&lt;/span&gt; : &lt;span class=&quot;s2&quot;&gt;&quot;logs&quot;&lt;/span&gt;,
      &lt;span class=&quot;s2&quot;&gt;&quot;_id&quot;&lt;/span&gt; : &lt;span class=&quot;s2&quot;&gt;&quot;AU2B1MGZPj_44djTabLF&quot;&lt;/span&gt;,
      &lt;span class=&quot;s2&quot;&gt;&quot;_score&quot;&lt;/span&gt; : 1.0,
      &lt;span class=&quot;s2&quot;&gt;&quot;_source&quot;&lt;/span&gt;:&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;@timestamp&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;2015-05-23T17:31:51.000Z&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;message&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;RT @muyskerm: @Jack_Septic_Eye Well done Ireland. The U.S. could take a lesson.&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;user&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;SOUTHERNjamespb&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;client&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;a href=&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;http://www.twitter.com&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; rel=&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;nofollow&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;gt;Twitter for BlackBerry&amp;lt;/a&amp;gt;&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;retweeted&quot;&lt;/span&gt;:false,&lt;span class=&quot;s2&quot;&gt;&quot;source&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;http://twitter.com/SOUTHERNjamespb/status/602165054889283584&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;@version&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;, &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
               ...&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;To start using Kibana, visit&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;http://localhost:5601/&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;On the Discover tab, there is a configuration form:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Check off the box: &lt;em&gt;Index contains time-based events&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;Fill the &lt;em&gt;Index name or pattern&lt;/em&gt; field with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;irelandtweets&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Fill the &lt;em&gt;Time-field name&lt;/em&gt; field with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@timestamp&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On the Visualize tab, choose visualization type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Line chart&lt;/code&gt;.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Choose option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;From a saved search&lt;/code&gt; to use the same query you specified on the Discover tab&lt;/li&gt;
  &lt;li&gt;For metric aggregation (Y-Axis): Choose &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Count&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;For bucket aggregation (X-Axis):
i. Aggregation: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Date Histogram&lt;/code&gt;
ii. Field: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@timestamp&lt;/code&gt;
iii. Interval: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Minute&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Click on the Refresh Interval tab at the top. Choose &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;5 seconds&lt;/code&gt; and see your line chart come alive 📈&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;/files/pics/blog/2015/kibana_screenshot.png&quot; alt=&quot;Kibana screenshot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Done. Thank you for starting the conversation Kaegan!&lt;/p&gt;

&lt;h3 id=&quot;more-resources&quot;&gt;More resources&lt;/h3&gt;

&lt;p&gt;For details about Logstash plugins see &lt;a href=&quot;https://www.elastic.co/guide/en/logstash/current/configuration.html&quot; target=&quot;_blank&quot;&gt;this guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Anna Roes has written an excellent overview of Kibana in &lt;a href=&quot;https://www.timroes.de/2015/02/07/kibana-4-tutorial-part-1-introduction/&quot; target=&quot;_blank&quot;&gt;this tutorial&lt;/a&gt;.&lt;/p&gt;
</content>
 </entry>
 

</feed>
