Java Streams

Today we will look at Streams in Java.

Here is an example that uses Java Streams to print the even numbers in a list:

import java.util.Arrays;
import java.util.List;

public class StreamsSamples {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
        numbers.stream()
                .filter(a -> a % 2 == 0)
                .forEach(a -> System.out.println("Even number: " + a));
    }
}

Java Streams come in two flavours: .stream() and .parallelStream(). Below is a quick comparison of the two.

| Feature | Stream | ParallelStream |
|---|---|---|
| Execution | Sequential (one element at a time) | Parallel (multiple elements simultaneously) |
| Threading | Single-threaded | Multi-threaded (uses ForkJoinPool) |
| Performance | May be slower for large datasets | Can be faster for large datasets with CPU cores |
| Order Preservation | Maintains encounter order | May not preserve order (unless explicitly stated) |
| Use Case | Small to medium datasets, order-sensitive ops | Large datasets, CPU-intensive operations |
| Determinism | More predictable and deterministic | May have non-deterministic results |
| Side Effects | Easier to manage | Harder to control due to concurrent execution |
| Overhead | Low | Higher due to thread management overhead |
| Custom Thread Pool | Not required | Uses common ForkJoinPool (customization is tricky) |
| Examples | list.stream() | list.parallelStream() |
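
To make the "Order Preservation" row concrete, here is a minimal sketch. The class and method names (EncounterOrderDemo, evensInOrder) are made up for illustration. It assumes an ordered source such as a List: forEach on a parallel stream gives no ordering guarantee, while forEachOrdered and collect respect the encounter order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class EncounterOrderDemo {
    // collect() on an ordered source keeps encounter order, even in parallel
    static List<Integer> evensInOrder(List<Integer> numbers) {
        return numbers.parallelStream()
                .filter(n -> n % 2 == 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

        // forEach on a parallel stream: the printed order may differ between runs
        numbers.parallelStream().forEach(n -> System.out.print(n + " "));
        System.out.println();

        // forEachOrdered: elements are printed in encounter order
        numbers.parallelStream().forEachOrdered(n -> System.out.print(n + " "));
        System.out.println();

        System.out.println(evensInOrder(numbers)); // [2, 4, 6, 8]
    }
}
```

Enforcing order with forEachOrdered can give up part of the benefit of running in parallel, so it is worth using only when the order actually matters.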

As the table above highlights, a parallel stream is not useful when the dataset is small to medium in size. Splitting the work across multiple threads and managing them adds overhead that outweighs any gain.

Let's look at an example that counts the prime numbers in a sample of 1,000 random numbers.

package com.dcurioustech.streams;

import java.util.List;
import java.util.stream.Collectors;

public class StreamsSamples {
    public static void main(String[] args) {
        System.out.println("================================");
        // Inefficient use of parallel streams
        List<Integer> largeNumbers = new java.util.Random().ints(1_000, 1, 1000).boxed().collect(Collectors.toList());
        System.out.println("Sample count:" + largeNumbers.size());

        // Using sequential streams
        long startTime = System.nanoTime();
        largeNumbers.stream().filter(StreamsSamples::isPrime).count();
        long endTime = System.nanoTime();
        float sequentialTime = endTime - startTime;
        System.out.println("Sequential stream time (milli seconds): " + (sequentialTime)/1_000_000);

        // Using parallel streams
        startTime = System.nanoTime();
        largeNumbers.parallelStream().filter(StreamsSamples::isPrime).count();
        endTime = System.nanoTime();
        float parallelTime = endTime - startTime;
        System.out.println("Parallel stream time (milli seconds): " + (parallelTime)/1_000_000);
        System.out.println("Speedup: " + sequentialTime/parallelTime);

    }

    // Intentionally inefficient CPU intensive method
    public static boolean isPrime(int number) {
        if (number <= 1) {
            return false;
        }
        for (int i = 2; i < number; i++) {
            if (number % i == 0) {
                return false;
            }
        }
        return true;
    }
}

Output as below:
================================
Sample count:1000
Sequential stream time (milli seconds): 1.867237
Parallel stream time (milli seconds): 5.67832
Speedup: 0.32883617

As can be seen, the parallel stream takes longer than the sequential stream. For a dataset this small, the overhead of splitting the work and managing threads outweighs the benefit of running in parallel.

Let's now look at the same example with a sample of 10 million numbers.
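
A single timed run right after startup also includes one-time costs such as class loading and JIT compilation, so the exact figures can vary from run to run. The sketch below is one rough way to reduce that noise, not a rigorous benchmark: it runs each pipeline a few times untimed and then reports the fastest timed run. The helper bestMillis, the class name, and the warm-up and run counts are assumptions made for this illustration.

```java
import java.util.List;
import java.util.Random;
import java.util.function.LongSupplier;
import java.util.stream.Collectors;

class WarmedUpTiming {
    // Same intentionally naive primality check as in the article
    static boolean isPrime(int number) {
        if (number <= 1) {
            return false;
        }
        for (int i = 2; i < number; i++) {
            if (number % i == 0) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: runs the task untimed a few times, then
    // returns the fastest of the timed runs in milliseconds
    static double bestMillis(LongSupplier task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.getAsLong();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.getAsLong();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best / 1_000_000.0;
    }

    public static void main(String[] args) {
        List<Integer> numbers = new Random().ints(1_000, 1, 1000)
                .boxed().collect(Collectors.toList());

        double sequential = bestMillis(
                () -> numbers.stream().filter(WarmedUpTiming::isPrime).count(), 20, 10);
        double parallel = bestMillis(
                () -> numbers.parallelStream().filter(WarmedUpTiming::isPrime).count(), 20, 10);

        System.out.println("Sequential best (ms): " + sequential);
        System.out.println("Parallel best (ms): " + parallel);
    }
}
```

For measurements that need to be trusted, a dedicated benchmarking harness is the safer choice. This sketch is only meant to show the idea of warming up before timing.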

package com.dcurioustech.streams;

import java.util.List;
import java.util.stream.Collectors;

public class StreamsSamples {
    public static void main(String[] args) {
        System.out.println("================================");
        // Efficient use of parallel streams
        List<Integer> largeNumbers = new java.util.Random().ints(10_000_000, 1, 1000).boxed().collect(Collectors.toList());
        System.out.println("Sample count:" + largeNumbers.size());

        // Using sequential streams
        long startTime = System.nanoTime();
        largeNumbers.stream().filter(StreamsSamples::isPrime).count();
        long endTime = System.nanoTime();
        float sequentialTime = endTime - startTime;
        System.out.println("Sequential stream time (milli seconds): " + (sequentialTime)/1_000_000);

        // Using parallel streams
        startTime = System.nanoTime();
        largeNumbers.parallelStream().filter(StreamsSamples::isPrime).count();
        endTime = System.nanoTime();
        float parallelTime = endTime - startTime;
        System.out.println("Parallel stream time (milli seconds): " + (parallelTime)/1_000_000);
        System.out.println("Speedup: " + sequentialTime/parallelTime);
    }

    // Intentionally inefficient CPU intensive method
    public static boolean isPrime(int number) {
        if (number <= 1) {
            return false;
        }
        for (int i = 2; i < number; i++) {
            if (number % i == 0) {
                return false;
            }
        }
        return true;
    }
}

Output as below:

================================
Sample count:10000000
Sequential stream time (milli seconds): 1978.1862
Parallel stream time (milli seconds): 589.46625
Speedup: 3.3558939

As the results show, the parallel stream is about 3.36 times faster than the sequential stream for this larger sample.
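
How much speedup is achievable depends on how many cores the machine has, since by default a parallel stream runs its work on the common ForkJoinPool. A small sketch for checking both numbers on your own machine is shown below; the class and method names are made up for illustration.

```java
import java.util.concurrent.ForkJoinPool;

class ParallelismInfo {
    // Number of processors the JVM reports as available
    static int availableCores() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Available processors: " + availableCores());
        // Target parallelism of the common pool used by parallel streams
        System.out.println("Common pool parallelism: "
                + ForkJoinPool.getCommonPoolParallelism());
    }
}
```

If the reported parallelism is low, a parallel stream has little room to beat a sequential one, whatever the dataset size.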

Summary

Stick to sequential streams when:
> Sample size is small to medium
> Order of execution matters in the stream

Use parallel streams when:
> Sample size is large
> Order of execution doesn't matter

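The size rule of thumb in this summary can be sketched as a small helper that picks the stream type from the list size. The class name, the method streamFor, and the cut-off of 100,000 elements are all hypothetical. The right threshold depends on the per-element work and the hardware, so it should be measured rather than assumed.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class StreamChooser {
    // Hypothetical cut-off; measure on your own workload before relying on it
    static final int PARALLEL_THRESHOLD = 100_000;

    // Returns a parallel stream only for lists at or above the threshold
    static <T> Stream<T> streamFor(List<T> items) {
        return items.size() >= PARALLEL_THRESHOLD
                ? items.parallelStream()
                : items.stream();
    }

    public static void main(String[] args) {
        List<Integer> small = IntStream.rangeClosed(1, 1_000)
                .boxed().collect(Collectors.toList());
        List<Integer> large = IntStream.rangeClosed(1, 200_000)
                .boxed().collect(Collectors.toList());

        System.out.println(streamFor(small).isParallel()); // false
        System.out.println(streamFor(large).isParallel()); // true
    }
}
```

Size alone is not the whole story: if the pipeline depends on the order of execution, a sequential stream remains the simpler choice.
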
Java streams are powerful and can improve the performance significantly for certain operations and large datasets, while also improving code readability over normal iterative constructs.

You can refer to the code in here
