Tag: llm

  • Building AI Chat Apps Across Python, TypeScript & Java: A Hands-on Comparison

    In this post, I explore how to build the same AI-powered chat app in Python, TypeScript, and Java using LangChain, LangChain.js, and LangChain4j. If you’re deciding how to bring AI into your stack, this guide will help you understand trade-offs and developer experience across ecosystems.

    Why This Matters

    AI chat applications are becoming central to digital experiences. They support customer service, internal tools, and user engagement. Whether you’re a Python data scientist, a TypeScript full-stack developer, or a Java enterprise engineer, LLMs are transforming your landscape.

    Fortunately, frameworks like LangChain (Python), LangChain.js (TypeScript), and LangChain4j (Java) now make it easier to integrate LLMs without starting from scratch.

    One Chat App, Three Languages

    I built a basic chat app in each language using their respective LangChain implementation. The goal was to compare developer experience, language fit, and production readiness.

    Python (3.12) + LangChain

    from langchain_openai import ChatOpenAI
    
    chat_model = ChatOpenAI(model="gpt-4o", temperature=0.7, api_key="YOUR_OPENAI_API_KEY")
    
    while True:
        user_input = input("You: ")
        if user_input.lower() in ["exit", "quit"]:
            break
        response = chat_model.invoke(user_input)
        print(f"AI: {response.content}")
    

    Takeaway
    Python offers the most seamless and concise development experience. It is ideal for fast prototyping and experimentation.
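
    The example above is single-turn: each prompt is sent in isolation, so the model does not remember earlier exchanges. Below is a minimal sketch of how conversation memory could be added, assuming the OPENAI_API_KEY environment variable is set (this extension is mine, not part of the original example):

    from langchain_openai import ChatOpenAI
    from langchain_core.messages import HumanMessage, AIMessage

    chat_model = ChatOpenAI(model="gpt-4o", temperature=0.7)  # key read from OPENAI_API_KEY

    history = []  # alternating human/AI messages from earlier turns

    while True:
        user_input = input("You: ")
        if user_input.lower() in ["exit", "quit"]:
            break
        history.append(HumanMessage(content=user_input))
        response = chat_model.invoke(history)  # the model sees the whole conversation
        history.append(AIMessage(content=response.content))
        print(f"AI: {response.content}")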

    TypeScript + LangChain.js

    import { ChatOpenAI } from "@langchain/openai";
    import readline from "readline";
    
    const chatModel = new ChatOpenAI({
      openAIApiKey: process.env.OPENAI_API_KEY,
      model: "gpt-4o",
      temperature: 0.7,
    });
    
    const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
    
    function promptUser() {
      rl.question("You: ", async (input) => {
        if (input.toLowerCase() === "exit" || input.toLowerCase() === "quit") {
          rl.close();
          return;
        }
        const response = await chatModel.invoke(input);
        console.log(`AI: ${response.content}`);
        promptUser();
      });
    }
    
    promptUser();
    

    Takeaway
    TypeScript is a great fit for web-first and full-stack developers. The async structure aligns well with modern web development, and the LangChain.js ecosystem is growing rapidly.

    Java (17) + LangChain4j

    import dev.langchain4j.model.chat.ChatLanguageModel;
    import dev.langchain4j.model.openai.OpenAiChatModel;
    import java.util.Scanner;
    
    public class BasicChatApp {
        public static void main(String[] args) {
            ChatLanguageModel model = OpenAiChatModel.builder()
                .apiKey(System.getenv("OPENAI_API_KEY"))
                .modelName("gpt-4o")
                .temperature(0.7)
                .build();
    
            Scanner scanner = new Scanner(System.in);
            while (true) {
                System.out.print("You: ");
                String input = scanner.nextLine();
                if (input.equalsIgnoreCase("exit") || input.equalsIgnoreCase("quit")) break;
                String response = model.chat(input);
                System.out.println("AI: " + response);
            }
        }
    }
    

    Takeaway
    Java with LangChain4j is designed for enterprise environments. It offers strong typing and structure, making it a solid choice for scalable, production-grade systems.

    Side-by-Side Comparison

    Feature            | Python (LangChain)    | TypeScript (LangChain.js) | Java (LangChain4j)
    Ease of Setup      | Easiest               | Moderate                  | Most complex
    Best Use Case      | Prototyping, research | Web apps, full-stack      | Enterprise backends
    Ecosystem Maturity | Most mature           | Rapidly growing           | Evolving
    Code Verbosity     | Concise               | Concise with async        | Verbose and structured

    Strategic Insights

    • If you are working in a startup or a research lab, Python is the fastest way to test ideas and iterate quickly.
    • For web and cross-platform products, TypeScript provides excellent alignment with frontend and serverless workflows.
    • In regulated or large-scale enterprise systems, Java continues to be a reliable foundation. LangChain4j brings modern AI capabilities into that world.

    All three ecosystems now offer viable paths to LLM integration. Choose the one that aligns with your team’s strengths and your system’s goals.

    What Do You Think?

    Which tech stack do you prefer for building AI applications?
    Have you tried LangChain or LangChain4j in your projects?
    I’d love to hear your thoughts or questions in the comments.

  • Why BERTScore and Cosine Similarity Are Not Enough for Evaluating GenAI Outputs

    As Generative AI becomes integral to modern workflows, evaluating the quality of model outputs has become critical. Metrics like BERTScore and cosine similarity are widely used to compare generated text with reference answers. However, recent experiments in our gen-ai-tests repository show that these metrics often overestimate similarity, even when the content is unrelated or incorrect.

    In this post, we will explore how these metrics work, highlight their key shortcomings, and provide real examples including test failures from GitHub Actions to show why relying on them alone is risky.


    How Do These Metrics Work?

    • BERTScore: Uses pre-trained transformer models to compare the contextual embeddings of tokens. It calculates similarity based on token-level precision, recall, and F1 scores.
    • Cosine Similarity: Measures the cosine of the angle between two high-dimensional sentence embeddings. A score closer to 1 indicates greater similarity.

    While these metrics are fast and easy to implement, they have critical blind spots.
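
    For reference, here is a minimal sketch of how such scores are typically computed. The helper names match the evaluate_bertscore and evaluate_cosine_similarity functions used later; this standalone version relies on the bert-score and sentence-transformers packages and may differ in detail from the code in prompt_eval.py:

    from bert_score import score as bert_score
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

    def evaluate_bertscore(candidate: str, reference: str) -> float:
        # Token-level matching of contextual embeddings, aggregated into an F1 score
        _, _, f1 = bert_score([candidate], [reference], lang="en")
        return f1.item()

    def evaluate_cosine_similarity(candidate: str, reference: str) -> float:
        # One embedding per sentence, compared by the cosine of the angle between them
        embeddings = embedder.encode([candidate, reference], convert_to_tensor=True)
        return util.cos_sim(embeddings[0], embeddings[1]).item()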


    Experimental Results From gen-ai-tests

    We evaluated prompts and responses using prompt_eval.py.

    Example 1: High Similarity for Valid Output

    "prompt": "What is the capital of France?",
    "expected_answer": "Paris is the capital of France.",
    "generated_answer": "The capital of France is Paris."
    

    Results:

    • BERTScore F1: approximately 0.93
    • Cosine Similarity: approximately 0.99

    This is expected. Both metrics perform well when the content is semantically correct and phrased differently.

    Example 2: High Similarity for Unrelated Sentences

    In test_prompt_eval.py, we evaluate unrelated sentences:

    def test_bertscore_unrelated():
        # Two sentences that are unrelated should have a low BERTScore
        # This is a simple example to showcase the limitations of BERTScore
        s1 = "The quick brown fox jumps over the lazy dog."
        s2 = "Quantum mechanics describes the behavior of particles at atomic scales."
        score = evaluate_bertscore(s1, s2)
        print(f"BERTScore between unrelated sentences: {score}")
        assert score < 0.8
    
    
    def test_cosine_similarity_unrelated():
        s1 = "The quick brown fox jumps over the lazy dog."
        s2 = "Quantum mechanics describes the behavior of particles at atomic scales."
        sim = evaluate_cosine_similarity(s1, s2)
        print(f"Cosine similarity between unrelated sentences: {sim}")
        assert sim < 0.8

    However, the BERTScore test fails: despite the sentences being completely unrelated, BERTScore returns a score above 0.8.

    Real Test Failure in CI

    Here is the actual test failure from our GitHub Actions pipeline:

    FAILED prompt_tests/test_prompt_eval.py::test_bertscore_unrelated - assert 0.8400543332099915 < 0.8

    📎 View the GitHub Actions log here

    This demonstrates how BERTScore can be misleading even in automated test pipelines, letting incorrect or irrelevant GenAI outputs pass evaluation.


    Key Limitations Observed

    1. Overestimation of Similarity
      Common linguistic patterns and phrasing can inflate similarity scores, even when the content is semantically different.
    2. No Factual Awareness
      These metrics do not measure whether the generated output is correct or grounded in fact. They only compare vector embeddings.
    3. Insensitive to Word Order or Meaning Shift
      Sentences like “The cat chased the dog” and “The dog chased the cat” may receive similarly high scores, despite the reversed meaning.

    What to Use Instead?

    To evaluate GenAI reliably, especially in production, we recommend integrating context-aware, task-specific, or model-in-the-loop evaluation strategies:

    • LLM-as-a-judge using GPT-4 or Claude for qualitative feedback.
    • BLEURT, G-Eval, or BARTScore for learned scoring aligned with human judgments.
    • Fact-checking modules, citation checkers, or hallucination detectors.
    • Hybrid pipelines that combine automatic similarity scoring with targeted LLM evaluation and manual review.
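
    The LLM-as-a-judge option above can be prototyped in a few lines. This is a minimal sketch using ChatOpenAI from langchain_openai; the rubric and 1-5 scale are illustrative choices, not a fixed standard:

    from langchain_openai import ChatOpenAI

    judge = ChatOpenAI(model="gpt-4o", temperature=0)  # deterministic judging

    JUDGE_PROMPT = """You are grading a generated answer against a reference answer.
    Question: {question}
    Reference answer: {reference}
    Generated answer: {generated}
    Reply with a single integer from 1 (wrong or unrelated) to 5 (fully correct and grounded)."""

    def judge_answer(question: str, reference: str, generated: str) -> int:
        response = judge.invoke(JUDGE_PROMPT.format(
            question=question, reference=reference, generated=generated))
        return int(response.content.strip())

    # The unrelated content that fooled BERTScore should receive a low grade here
    print(judge_answer(
        "What is the capital of France?",
        "Paris is the capital of France.",
        "Quantum mechanics describes the behavior of particles at atomic scales."))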

    Final Takeaway

    If you are evaluating GenAI outputs for tasks like question answering, summarization, or decision support, do not rely solely on BERTScore or cosine similarity. These metrics can lead to overconfident assessments of poor-quality outputs.

    You can find all code and examples here:
    📁 gen-ai-tests
    📄 prompt_eval.py, test_prompt_eval.py

  • The Future Ecosystem of Renting AI Coding Agents

    Introduction

    The rapid advancement of AI agents, particularly in software development, is paving the way for a transformative ecosystem where businesses can rent or hire specialised AI agents tailored to specific coding tasks. This article explores a future where one or more companies provide a marketplace of AI agents with varying capabilities – such as front-end development, security analysis, or backend optimisation – powered by Small Language Models (SLMs) or Large Language Models (LLMs). The pricing of these agents is tiered based on their computational backing and expertise, creating a dynamic and accessible solution for companies of all sizes.

    The Agent Rental Ecosystem

    Concept Overview

    Imagine a platform operated by an AI agent provider company, functioning as a marketplace for renting AI coding agents. These agents are pre-trained for specialised roles, such as:

    • Front-End Specialist: Designs and implements user interfaces using frameworks like React or Vue.js, ensuring responsive and accessible designs.
    • Security Specialist: Performs vulnerability assessments, penetration testing, and secure code reviews to safeguard applications.
    • Backend Specialist: Optimizes server-side logic, database management, and API development using technologies like Node.js or Django.
    • DevOps Specialist: Automates CI/CD pipelines, manages cloud infrastructure, and ensures scalability with tools like Docker and Kubernetes.
    • Full-Stack Generalist: Handles end-to-end development for smaller projects requiring versatility.

    Each agent is backed by either an SLM for lightweight, cost-effective tasks or an LLM for complex, context-heavy projects. The provider company maintains a robust infrastructure to deploy these agents on-demand, integrating seamlessly with clients’ development environments.

    Technical Architecture

    The ecosystem operates on a cloud-based platform with the following components:

    1. Agent Catalog: A user-friendly interface where clients browse agents by role, expertise, and model type (SLM or LLM).
    2. Model Management: A backend system that dynamically allocates SLMs or LLMs based on task requirements, optimizing for cost and performance.
    3. Integration Layer: APIs and SDKs that allow agents to plug into existing IDEs, version control systems (e.g., Git), and cloud platforms (e.g., AWS, Azure).
    4. Monitoring and Feedback: Real-time dashboards to track agent performance, code quality, and task completion, with feedback loops to improve agent training.
    5. Billing System: A usage-based pricing model that charges clients based on agent runtime, model type, and task complexity.

    Pricing Model

    The cost of renting an AI agent is determined by:

    • Model Type: SLM-backed agents are cheaper, suitable for routine tasks like UI component design or basic debugging. LLM-backed agents, with their superior reasoning and context awareness, are priced higher for tasks like architectural design or advanced security audits.
    • Task Duration: Short-term tasks (e.g., a one-hour code review) are billed hourly, while long-term projects (e.g., building an entire application) offer subscription-based discounts.
    • Specialization Level: Highly specialized agents, such as those trained for niche domains like blockchain or IoT security, command premium rates.
    • Resource Usage: Computational resources (e.g., GPU usage for LLMs) and data storage needs influence the final cost.

    For example:

    • A front-end SLM agent for designing a landing page might cost $10/hour.
    • A security-specialist LLM agent for a comprehensive penetration test could cost $100/hour.
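
    To make these factors concrete, here is a purely hypothetical sketch of how a marketplace might quote a rental. The rates, multipliers, and field names are invented for illustration and do not describe any real platform:

    from dataclasses import dataclass

    BASE_RATE = {"SLM": 10.0, "LLM": 60.0}  # assumed hourly base rates
    SPECIALIZATION_MULTIPLIER = {"general": 1.0, "security": 1.5, "blockchain": 1.8}

    @dataclass
    class AgentRental:
        model_type: str         # "SLM" or "LLM"
        specialization: str     # e.g. "general", "security"
        hours: float
        gpu_hours: float = 0.0  # extra compute typically needed by LLM-backed agents

        def quote(self) -> float:
            rate = BASE_RATE[self.model_type] * SPECIALIZATION_MULTIPLIER[self.specialization]
            return rate * self.hours + 2.0 * self.gpu_hours  # flat GPU surcharge, assumed

    # A security-specialist LLM agent hired for a two-hour audit
    print(AgentRental("LLM", "security", hours=2, gpu_hours=2).quote())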

    Benefits of the Ecosystem

    1. Accessibility: Small startups and individual developers can access high-quality AI expertise without hiring full-time specialists.
    2. Scalability: Enterprises can scale development teams instantly by renting multiple agents for parallel tasks.
    3. Cost Efficiency: Clients pay only for the specific skills and duration needed, avoiding the overhead of traditional hiring.
    4. Quality Assurance: The provider company ensures agents are trained on the latest frameworks, standards, and best practices.
    5. Flexibility: Clients can mix and match agents (e.g., a front-end SLM agent with a backend LLM agent) to suit project needs.

    Challenges and Considerations

    1. Ethical Concerns: Ensuring agents do not produce biased or insecure code, requiring rigorous auditing and transparency.
    2. Integration Complexity: Seamlessly embedding agents into diverse development environments may require significant upfront configuration.
    3. Skill Gaps: SLM-backed agents may struggle with highly creative or ambiguous tasks, necessitating LLM intervention.
    4. Data Privacy: Safeguarding client code and proprietary data processed by agents is critical, demanding robust encryption and compliance with regulations like GDPR.
    5. Market Competition: The provider must differentiate itself in a crowded AI market by offering superior agent performance and customer support.

    Future Outlook

    As AI models become more efficient and specialized, the agent rental ecosystem could expand beyond coding to domains like design, marketing, or legal analysis. The provider company could introduce features like:

    • Agent Customization: Allowing clients to fine-tune agents with proprietary data or specific workflows.
    • Collaborative Agents: Enabling teams of agents to work together on complex projects, mimicking human development teams.
    • Global Accessibility: Offering multilingual agents to cater to diverse markets, powered by localized SLMs or LLMs.

    Conclusion

    The ecosystem of renting AI coding agents represents a paradigm shift in software development, democratising access to specialised expertise while optimising costs. By offering a range of SLM- and LLM-backed agents, the provider company can cater to diverse needs, from startups building MVPs to enterprises securing mission-critical systems. While challenges like data privacy and integration remain, the potential for innovation and efficiency makes this a compelling vision for the future of work.

  • Comparison of Gen AI providers

    Generative AI agents are transforming how we interact with technology, offering powerful tools for creativity, productivity, and research. Let us explore the free tier offerings of four leading AI agents – ChatGPT, Google Gemini, Grok, and Claude – highlighting their core features and recent updates available to users without a paid subscription.

    ChatGPT (OpenAI)

    What It Offers: ChatGPT, powered by the GPT-4o model, is a versatile conversational AI accessible for free with a registered account. It excels in tasks like casual conversation, creative writing, coding assistance, and answering complex questions. Its clean interface and conversational memory (when enabled) allow for personalized, context-aware interactions, making it ideal for writers, students, and casual users. The free tier supports text generation, basic reasoning, and limited image description capabilities.

    Recent Updates: As of April 2025, free users can access GPT-4o, which offers improved speed and reasoning compared to GPT-3.5. However, usage is capped at approximately 15 messages every three hours, reverting to GPT-3.5 during peak times or after limits are reached. OpenAI has also introduced limited access to “Operators,” AI agents that can perform tasks like booking or shopping, though these are more restricted in the free tier.

    Why It Stands Out: ChatGPT’s user-friendly design and broad task versatility make it a go-to for general-purpose AI needs, with a proven track record of refinement based on millions of users’ feedback.

    Google Gemini

    What It Offers: Gemini, Google’s multimodal AI, is deeply integrated with Google’s ecosystem (Search, Gmail, Docs) and shines in real-time web access, research, and creative tasks. The free tier, capped at around 500 interactions per month, supports text generation, image analysis, and basic image generation via Imagen 3. Gemini’s ability to provide multiple response drafts and its conversational tone make it great for brainstorming and research.

    Recent Updates: In March 2025, Google made Gemini 2.5 Pro experimental available to free users, boosting performance in reasoning and coding tasks. The Deep Research feature, offering comprehensive, citation-rich reports, is now free with a limit of 10 queries per month. Additionally, free users can create limited “Gems” (custom AI personas) for tasks like fitness coaching or resume editing, enhancing personalisation.

    Why It Stands Out: Gemini’s seamless Google integration and free access to advanced features like Deep Research give it an edge for users already in the Google ecosystem or those needing robust research tools.

    Grok (xAI)

    What It Offers: Grok, developed by xAI, is designed for witty, less-filtered conversations and integrates with the X platform for real-time insights. The free tier, available temporarily as of February 2025, supports text generation, image analysis, and basic image generation. Grok’s “workspaces” feature allows users to organize thoughts, share related material, and collaborate, making it ideal for dynamic, social-media-driven workflows.

    Recent Updates: Launched on February 18, 2025, Grok 3 has shown strong performance in benchmarks, excelling in reasoning, coding, and creative writing. The recent introduction of Grok Studio (April 2025) enables free users to generate websites, papers, and games with real-time editing, similar to OpenAI’s Canvas. Integration with Google Drive further enhances its utility for collaborative projects.

    Why It Stands Out: Grok’s workspaces and Studio features offer a unique, interactive approach to organising and creating content, appealing to users who value humour and real-time social context.

    Claude (Anthropic)

    What It Offers: Claude, powered by Claude 3.5 Sonnet, is a text-focused AI emphasizing ethical responses and strong contextual understanding. The free tier supports basic text generation, long-document processing (up to 100K tokens), and image analysis (up to 5 images per prompt). Its “Projects” space, similar to Grok’s workspaces, allows users to organize documents and prompts for focused tasks, making it suitable for researchers and writers.

    Recent Updates: In late 2024, Claude added vision capabilities to its free tier, enabling image analysis for tasks like chart interpretation or text extraction. The Projects feature has been enhanced to support better document management, offering a structured environment for summarising or comparing large texts.

    Why It Stands Out: Claude’s ability to handle lengthy documents and its Projects space make it a top choice for users needing deep text analysis or organized workflows, with a focus on safe, moderated responses.

    Below is a tabular comparison

    Criteria           | ChatGPT (OpenAI)                                   | Gemini (Google)                                    | Grok (xAI)                               | Claude (Anthropic)
    Natural Language   | Conversational, creative, great for writing & Q&A | Advanced research & brainstorming, nuanced drafts  | Witty, creative dialogue, less filtered  | Ethical, contextual, excels in text analysis
    Languages          | ~100 (English, Spanish, Mandarin, etc.)           | 150+ with Google Translate                         | ~50, English-focused, expanding          | ~30, mainly English
    Tone & Personality | Friendly, neutral, adaptable                      | Approachable, customizable via Gems                | Humorous, edgy, JARVIS-like              | Safe, formal, ethical
    Real-Time Info     | Limited, no web access                            | Strong, Google Search integration                  | Strong, X platform news & social         | None, internal knowledge only
    Chat Organization  | Basic history with search                         | Google account, no workspaces                      | Workspaces for collaboration             | Projects for structured docs
    Context Window     | ~128K tokens                                      | ~1M tokens                                         | ~128K tokens                             | ~200K tokens
    Deep Search/Think  | Deep Research                                     | Deep Research (10/mo)                              | Think mode via UI                        | None in free tier
    Coding Support     | Strong (Python, JS, debugging)                    | Excellent (multi-language)                         | Strong (Grok Studio for websites/games)  | Moderate, basic coding
    Custom Models      | Limited GPTs (e.g., tutor)                        | Gems (1-2, e.g., chef)                             | None, default personality                | None, Project-based workflows
    Daily Limits       | ~15 msgs/3hr (GPT-4o), then GPT-3.5               | ~500/mo, throttled at peak                         | Temporarily unlimited (Feb 2025)         | ~50 msgs/day, varies
    Top Model          | GPT-4o (text, image)                              | Gemini 2.5 Pro (text, image)                       | Grok 3 (text, image)                     | Claude 3.7 Sonnet (text, image)
    Response Speed     | Fast (1-2s), slows at peak                        | Very fast (0.5-1s)                                 | Fast (1-2s), varies with X               | Moderate (2-3s), some delays
    Recent Highlights  | Lookout app, Operators for tasks                  | Imagen 3, Spotify extension                        | Grok Studio, Google Drive                | Vision for images, enhanced Projects
    Daily Active Users | 122.5M                                            | 35M                                                | 16.5M                                    | 3.3M

    Key Takeaways

    • ChatGPT: Versatile, great for general tasks, limited by message caps.
    • Gemini: Research powerhouse with Google integration.
    • Grok: Creative, social-media-driven with workspaces.
    • Claude: Ethical, text-heavy tasks with Projects.

    Which AI fits your workflow? Share your thoughts! #AI #Tech #GenAI

  • Ollama – The power of local LLMs

    Ollama – What is it?

    Ollama is a tool for running, managing, and interacting with large language models (LLMs) on local machines. It provides an easy way to download, run, and fine-tune open-source models like Llama, Mistral, and Gemma without requiring cloud-based APIs.

    Key Features:

    • Runs Locally: No need for cloud services—everything runs on your computer.
    • Supports Multiple Models: Works with models like Meta’s Llama, Mistral, and others.
    • Simple Interface: You can interact with models via a CLI or programmatically in Python/Node.js.
    • Fine-tuning & Customization: Allows you to fine-tune models on your own data.
    • Efficient Execution: Optimized for fast performance on local hardware.

    How to get started?

    Download the Ollama tool by navigating to the official website, ollama.com. The installation is straightforward, just like any other software tool.

    Once installed, you are ready to run LLMs locally.

    Download and run a model using the command below:

    ollama run <Model:Parameter>
    
    Ex: ollama run gemma3:1b

    You can find the list of available models and their memory requirements in the Ollama model library.

    How to use?

    Once Ollama is running, you can interact with it through the command line or through its REST API.

    In the command line, you can interact with the LLM by typing prompts directly. Sample prompt: “What is the capital of Australia?”

    You can also set a system message and view the current settings. The available options can be listed by typing “/?”.


    The other way to interact is through the REST API. Ollama listens on port 11434 by default. You can test the APIs using a tool like Postman.

    Below are some of the APIs to try

    Generate API:

    POST http://localhost:11434/api/generate
    Content-Type: application/json
    
    {
        "model": "gemma3:1b",
        "prompt": "What is the capital of France?",
        "stream": false
    }
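
    The same request can be issued from code. A minimal sketch using Python’s requests package against the default local endpoint, assuming the gemma3:1b model has already been pulled:

    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:1b",
            "prompt": "What is the capital of France?",
            "stream": False,  # return one JSON object instead of a stream of chunks
        },
    )
    resp.raise_for_status()
    print(resp.json()["response"])  # the generated text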

    Chat completion API:

    POST http://localhost:11434/api/chat
    Content-Type: application/json
    
    {
      "model": "gemma3:1b",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke"}
      ]
    }

    Let’s take a quick look at the differences between the two APIs:

    /api/generate                     | /api/chat
    Used for a single prompt          | Used for multi-turn interactions
    Request has only one “prompt”     | Request has an array of “messages”
    Doesn’t hold context              | Previous messages can be added to maintain context in subsequent requests
    Response contains one “response”  | Response contains an assistant “message” along with token counts and timing details
    Use case: one-off text generation | Use case: chatbot
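
    To see the context handling difference in practice, here is a small sketch of a chat loop over /api/chat that keeps appending messages, so every request carries the whole conversation (again using Python’s requests against the default local port):

    import requests

    messages = [{"role": "system", "content": "You are a helpful assistant."}]

    for user_input in ["Tell me a joke", "Explain why it is funny"]:
        messages.append({"role": "user", "content": user_input})
        resp = requests.post(
            "http://localhost:11434/api/chat",
            json={"model": "gemma3:1b", "messages": messages, "stream": False},
        )
        reply = resp.json()["message"]  # {"role": "assistant", "content": "..."}
        messages.append(reply)          # keep the assistant turn so the next request has context
        print(reply["content"])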

    Other APIs to try include the following:

    GET /api/tags: Lists the installed models
    
    
    POST /api/pull: Pulls and installs the model
    { "model": "gemma3:1b" }
    
    
    POST /api/create: Create a custom model
    {
      "name": "custom-mistral",
      "modelfile": "FROM mistral\nPARAMETER temperature=0.7\n"
    }
    
    
    POST /api/embeddings: Generate embeddings
    {
      "model": "mistral",
      "prompt": "Generate embeddings for this text"
    }

    The Postman collection can be found in my GitHub repo at https://github.com/dcurioustech/ollama-local