Repeated Examples: Some examples appear multiple times
- Increases the weight of certain patterns
- Creates emphasis on specific cases
Sampling with Replacement is a statistical technique used to create diverse training sets for building tree ensembles.
Setup: Four colored tokens (red, yellow, green, blue) in a bag
Process:
- Draw one token at random and note its color
- Return the token to the bag before the next draw
- Repeat for as many draws as there are tokens (four)
Key Observations:
- The same token can be drawn more than once
- Some tokens may never be drawn at all
- Each run of four draws typically produces a different mix of colors
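The token demonstration can be sketched in a few lines of Python (the seed is an arbitrary choice for reproducibility):

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed so the illustration is reproducible

tokens = ["red", "yellow", "green", "blue"]

# Draw 4 tokens WITH replacement: each draw conceptually returns the
# token to the bag, so the same color can appear more than once.
draws = [random.choice(tokens) for _ in range(len(tokens))]

print(draws)
print(Counter(draws))  # how often each color was drawn
```

Running this repeatedly (with different seeds) shows repeated colors in some runs and missing colors in others.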
Original Training Set: 10 examples of cats and dogs
Sampling Process:
Original Set: [Example 1, Example 2, …, Example 10]
Sampled Set: [Example 3, Example 7, Example 3, Example 1, Example 9, Example 2, Example 7, Example 5, Example 8, Example 7]
Characteristics:
- Repeated Examples: Some examples appear multiple times
- Missing Examples: Some examples don't appear
- Different Emphasis: Each sampled set has a unique focus
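These characteristics can be checked directly on the sampled set shown above (using example indices 1–10):

```python
from collections import Counter

original = list(range(1, 11))               # Examples 1..10
sampled = [3, 7, 3, 1, 9, 2, 7, 5, 8, 7]    # the sampled set shown above

counts = Counter(sampled)
repeated = sorted(ex for ex, c in counts.items() if c > 1)  # appear multiple times
missing = sorted(set(original) - set(sampled))              # never drawn

print("repeated:", repeated)  # [3, 7]
print("missing:", missing)    # [4, 6, 10]
```

Examples 3 and 7 carry extra weight in this sample, while 4, 6, and 10 contribute nothing to it.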
Tree 1: Trained on Sample Set A
Tree 2: Trained on Sample Set B
Tree 3: Trained on Sample Set C
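A minimal sketch of how the three sample sets could be produced, each with its own random seed so every tree sees a different view of the data (the `bootstrap` helper and the seeds are illustrative, not from the source):

```python
import random

data = list(range(1, 11))  # Examples 1..10

def bootstrap(data, seed):
    """Draw len(data) examples with replacement, using a per-tree seed."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in data]

sample_a = bootstrap(data, 1)  # training set for Tree 1
sample_b = bootstrap(data, 2)  # training set for Tree 2
sample_c = bootstrap(data, 3)  # training set for Tree 3

# The samples differ, so trees trained on them learn different splits.
print(sample_a, sample_b, sample_c, sep="\n")
```

Each tree would then be fit on its own sample; the ensemble's prediction combines their votes.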
- For each draw: every example has equal probability (1/n) of being selected
- Across multiple draws: some examples are selected more often, some less, some not at all
- Overall effect: creates natural variation in training-set composition
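The equal 1/n probability per draw is easy to confirm empirically; a quick simulation (trial count and seed are arbitrary choices):

```python
import random
from collections import Counter

random.seed(42)  # arbitrary seed for reproducibility
n = 10
trials = 100_000

# Count how often each index is picked across many independent single draws
counts = Counter(random.randrange(n) for _ in range(trials))

for i in range(n):
    freq = counts[i] / trials
    print(i, round(freq, 3))  # each frequency should be close to 1/n = 0.1
```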
Probability an example appears:
- In one draw, the chance an example is NOT picked is 1 - 1/n
- Across n draws, the chance it is never picked is (1 - 1/n)^n ≈ 1/e ≈ 0.37 for large n
- So each example appears at least once with probability ≈ 0.63
Practical Result: Each sampled training set is missing about 1/3 of the original examples, with a different 1/3 missing each time.
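The "about 1/3 missing" figure can be verified with a short Monte Carlo simulation (the training-set size, trial count, and seed are assumed for illustration):

```python
import random

random.seed(0)   # arbitrary seed for reproducibility
n = 100          # assumed training-set size
trials = 10_000

missing_total = 0
for _ in range(trials):
    drawn = {random.randrange(n) for _ in range(n)}  # one bootstrap sample
    missing_total += n - len(drawn)                  # examples never drawn

fraction_missing = missing_total / (trials * n)
print(fraction_missing)  # should be close to (1 - 1/n)**n ≈ 0.366
```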
Common practice: Sample same number of examples as original training set
Without replacement: always yields the identical training set
With replacement: essential for creating diversity
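This contrast can be demonstrated directly: drawing n items without replacement from n items must return every item exactly once (just reordered), while drawing with replacement generally does not (the seed is an arbitrary choice):

```python
import random

random.seed(0)  # arbitrary seed for reproducibility
data = list(range(1, 11))

# Without replacement, sample size == n: always the same set, only reordered
without = random.sample(data, len(data))
print(sorted(without) == data)  # True, no diversity in membership

# With replacement: repeats and omissions are possible, so membership varies
with_repl = [random.choice(data) for _ in data]
print(sorted(with_repl))
```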
```python
import random

# Conceptual sampling with replacement
def sample_with_replacement(original_data, sample_size):
    new_sample = []
    for _ in range(sample_size):
        # Randomly select an index from the original data
        random_index = random.randrange(len(original_data))
        # Add the selected example to the new sample
        new_sample.append(original_data[random_index])
    return new_sample
```

Sampling with replacement provides the foundation for creating diverse training sets that enable robust tree ensembles, turning the sensitivity of individual decision trees from a weakness into the strength of ensemble robustness.