Semgrep Static Analysis

When to Use Semgrep

Ideal scenarios:

Quick security scans (minutes, not hours)
Pattern-based bug detection
Enforcing coding standards and best practices
Finding known vulnerability patterns
Single-file analysis without complex data flow
First-pass analysis before deeper tools

Consider CodeQL instead when:

You need interprocedural taint tracking across files
Complex data flow analysis is required
You are analyzing custom proprietary frameworks

When NOT to Use

Do NOT use this skill for:

Complex interprocedural data flow analysis (use CodeQL instead)
Binary analysis or compiled code without source
Custom deep semantic analysis requiring AST/CFG traversal
Taint tracking across many function boundaries

Installation

# pip
python3 -m pip install semgrep

# Homebrew
brew install semgrep

# Docker
docker run --rm -v "${PWD}:/src" semgrep/semgrep semgrep --config auto /src

# Update
python3 -m pip install --upgrade semgrep

Core Workflow

1. Quick Scan

semgrep --config auto .                    # Auto-detect rules
semgrep --config p/default --metrics=off . # Disable telemetry for proprietary code (--config auto requires metrics, so name a ruleset)

2. Use Rulesets

semgrep --config p/<RULESET> .             # Single ruleset
semgrep --config p/security-audit --config p/trailofbits .  # Multiple

| Ruleset | Description |
| --- | --- |
| p/default | General security and code quality |
| p/security-audit | Comprehensive security rules |
| p/owasp-top-ten | OWASP Top 10 vulnerabilities |
| p/cwe-top-25 | CWE Top 25 vulnerabilities |
| p/r2c-security-audit | r2c security audit rules |
| p/trailofbits | Trail of Bits security rules |
| p/python | Python-specific |
| p/javascript | JavaScript-specific |
| p/golang | Go-specific |
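
When several rulesets are combined, the command line can be assembled programmatically. The following is a minimal sketch: `build_semgrep_command` is a hypothetical helper name, not part of Semgrep, and it only builds the argument list without running anything.

```python
from typing import List, Sequence


def build_semgrep_command(rulesets: Sequence[str], target: str = ".",
                          metrics_off: bool = False) -> List[str]:
    """Build an argv list for `semgrep` with one --config flag per ruleset.

    Hypothetical helper for illustration; it does not execute Semgrep.
    """
    if not rulesets:
        raise ValueError("at least one ruleset is required")
    cmd = ["semgrep"]
    for ruleset in rulesets:
        # Each ruleset gets its own --config flag, as in the multi-ruleset example.
        cmd += ["--config", ruleset]
    if metrics_off:
        cmd.append("--metrics=off")
    cmd.append(target)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_semgrep_command(["p/security-audit", "p/trailofbits"])))
```

The returned list can be handed to `subprocess.run` as-is, which avoids shell quoting problems when a target path contains spaces.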

3. Output Formats

semgrep --config p/security-audit --sarif -o results.sarif .   # SARIF
semgrep --config p/security-audit --json -o results.json .     # JSON
semgrep --config p/security-audit --dataflow-traces .          # Show data flow
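
A JSON report can be post-processed with a few lines of Python. This is a minimal sketch, assuming the report has a top-level `results` list whose entries carry `check_id`, `path`, `start.line`, and `extra.severity`; verify that layout against the Semgrep version in use. The sample report is made up for illustration.

```python
import json
from collections import Counter

# Made-up sample in the assumed shape of a Semgrep JSON report.
SAMPLE_REPORT = """
{
  "results": [
    {"check_id": "rules.sql-injection", "path": "app.py",
     "start": {"line": 12},
     "extra": {"severity": "ERROR", "message": "Potential SQL injection"}},
    {"check_id": "rules.hardcoded-password", "path": "settings.py",
     "start": {"line": 3},
     "extra": {"severity": "WARNING", "message": "Hardcoded password detected"}}
  ],
  "errors": []
}
"""


def summarize(report_text: str) -> Counter:
    """Count findings per severity in a JSON report string."""
    report = json.loads(report_text)
    return Counter(r["extra"]["severity"] for r in report.get("results", []))


def locations(report_text: str) -> list:
    """Return 'path:line check_id' strings, one per finding."""
    report = json.loads(report_text)
    return [f'{r["path"]}:{r["start"]["line"]} {r["check_id"]}'
            for r in report.get("results", [])]


if __name__ == "__main__":
    print(dict(summarize(SAMPLE_REPORT)))
    for entry in locations(SAMPLE_REPORT):
        print(entry)
```

In practice the text would come from `results.json` rather than an inline string.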

4. Scan Specific Paths

semgrep --config p/python app.py           # Single file
semgrep --config p/javascript src/         # Directory
semgrep --config auto --include='**/test/**' .  # Scan only paths matching the glob

Writing Custom Rules

Basic Structure

rules:
  - id: hardcoded-password
    languages: [python]
    message: "Hardcoded password detected: $PASSWORD"
    severity: ERROR
    pattern: password = "$PASSWORD"

Pattern Syntax

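| Syntax | Description | Example |
| --- | --- | --- |
| `...` | Match any sequence of arguments or statements (zero or more) | `func(...)` |
| `$VAR` | Capture metavariable | `$FUNC($INPUT)` |
| `<... ...>` | Deep expression match | `<... user_input ...>` |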

Pattern Operators

| Operator | Description |
| --- | --- |
| pattern | Match exact pattern |
| patterns | All must match (AND) |
| pattern-either | Any matches (OR) |
| pattern-not | Exclude matches |
| pattern-inside | Match only inside context |
| pattern-not-inside | Match only outside context |
| pattern-regex | Regex matching |
| metavariable-regex | Regex on captured value |
| metavariable-comparison | Compare values |

Combining Patterns

rules:
  - id: sql-injection
    languages: [python]
    message: "Potential SQL injection"
    severity: ERROR
    patterns:
      - pattern-either:
          - pattern: cursor.execute($QUERY)
          - pattern: db.execute($QUERY)
      # pattern-not takes a pattern string directly, not a nested list
      - pattern-not: cursor.execute("...", (...))
      - metavariable-regex:
          metavariable: $QUERY
          regex: .*\+.*|.*\.format\(.*|.*%.*
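
To get a feel for what the `metavariable-regex` filter in this rule lets through, the same regex can be tried against sample expressions. This is a rough sketch: Python's `re.match` stands in for Semgrep's own regex engine (both anchor at the start of the text), and the candidate strings are hypothetical source text that `$QUERY` might bind to.

```python
import re

# Same regex as the rule's metavariable-regex entry.
QUERY_REGEX = re.compile(r".*\+.*|.*\.format\(.*|.*%.*")

# Hypothetical source text for $QUERY.
CANDIDATES = [
    '"SELECT * FROM users WHERE id = " + user_id',      # concatenation
    '"SELECT * FROM users WHERE id = {}".format(uid)',  # str.format
    '"SELECT * FROM users WHERE id = %s" % uid',        # %-formatting
    '"SELECT * FROM users WHERE name LIKE \'%a\'"',     # constant, but contains %
    'query',                                            # plain variable
]


def flagged(expr: str) -> bool:
    """Approximate the filter: re.match anchors at the start of the text."""
    return QUERY_REGEX.match(expr) is not None


if __name__ == "__main__":
    for expr in CANDIDATES:
        print(flagged(expr), expr)
```

The last two candidates show the limits of a textual filter: a constant query containing `%` is flagged, while a variable that was built by concatenation on an earlier line is not. That gap is what taint mode is meant to address.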

Taint Mode (Data Flow)

Simple pattern matching finds obvious cases:

# Pattern `os.system($CMD)` catches this:
os.system(user_input)  # Found

But misses indirect flows:

# Same pattern misses this:
cmd = user_input
processed = cmd.strip()
os.system(processed)  # Missed - no direct match

Taint mode tracks data through assignments and transformations:

Source: Where untrusted data enters (user_input)
Propagators: How it flows (cmd = ..., processed = ...)
Sanitizers: What makes it safe (shlex.quote())
Sink: Where it becomes dangerous (os.system())
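
The four roles above can be illustrated with a toy model. This sketch is not how Semgrep is implemented: it only walks a straight-line list of statements, and the `Assign` and `Call` classes are hypothetical stand-ins for parsed code.

```python
from dataclasses import dataclass
from typing import List, Union

SOURCES = {"user_input"}      # names treated as untrusted
SANITIZERS = {"shlex.quote"}  # calls that clear taint
SINKS = {"os.system"}         # calls that must not receive tainted data


@dataclass
class Assign:
    target: str    # variable being written
    value: str     # variable being read
    via: str = ""  # optional call wrapping the value, e.g. "shlex.quote"


@dataclass
class Call:
    func: str
    arg: str
    line: int


def find_tainted_sinks(program: List[Union[Assign, Call]]) -> List[int]:
    """Return the line numbers of sink calls that receive tainted data."""
    tainted = set(SOURCES)
    hits = []
    for stmt in program:
        if isinstance(stmt, Assign):
            # Propagator: taint follows the assignment unless a sanitizer wraps it.
            if stmt.value in tainted and stmt.via not in SANITIZERS:
                tainted.add(stmt.target)
            else:
                tainted.discard(stmt.target)
        elif stmt.func in SINKS and stmt.arg in tainted:
            hits.append(stmt.line)
    return hits


# cmd = user_input; processed = cmd.strip(); os.system(processed)
INDIRECT = [Assign("cmd", "user_input"),
            Assign("processed", "cmd", via="str.strip"),
            Call("os.system", "processed", line=3)]

# safe = shlex.quote(user_input); os.system(safe)
SANITIZED = [Assign("safe", "user_input", via="shlex.quote"),
             Call("os.system", "safe", line=2)]

if __name__ == "__main__":
    print(find_tainted_sinks(INDIRECT))
    print(find_tainted_sinks(SANITIZED))
```

The indirect flow that plain pattern matching missed is reported here because taint is carried through `cmd` and `processed`, while the sanitized variant produces no finding.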

rules:
  - id: command-injection
    languages: [python]
    message: "User input flows to command execution"
    severity: ERROR
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: os.system($SINK)
      - pattern: subprocess.call($SINK, shell=True)
      - pattern: subprocess.run($SINK, shell=True, ...)
    pattern-sanitizers:
      - pattern: shlex.quote(...)
      - pattern: int(...)

Full Rule with Metadata

rules:
  - id: flask-sql-injection
    languages: [python]
    message: "SQL injection: user input flows to query without parameterization"
    severity: ERROR
    metadata:
      cwe: "CWE-89: SQL Injection"
      owasp: "A03:2021 - Injection"
      confidence: HIGH
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: cursor.execute($QUERY)
      - pattern: db.execute($QUERY)
    pattern-sanitizers:
      - pattern: int(...)
    # Draft autofix: `params` is a placeholder, so review each suggested change by hand
    fix: cursor.execute($QUERY, (params,))
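
The difference between the flagged call shape and the parameterized one can be reproduced with the standard library. This is a minimal sketch using an in-memory `sqlite3` database; the table, the helper names, and the payload are hypothetical, and a plain string stands in for the Flask request value the rule treats as a source.

```python
import sqlite3

# Hypothetical attacker-controlled value.
PAYLOAD = "' OR '1'='1"


def setup() -> sqlite3.Connection:
    """Create an in-memory database with two sample users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    return conn


def lookup_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: the input becomes part of the SQL text (single-argument execute).
    cursor = conn.cursor()
    cursor.execute("SELECT id FROM users WHERE name = '" + name + "'")
    return cursor.fetchall()


def lookup_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized: the driver passes the input as data, never as SQL.
    cursor = conn.cursor()
    cursor.execute("SELECT id FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


if __name__ == "__main__":
    conn = setup()
    print(len(lookup_unsafe(conn, PAYLOAD)), "rows from the concatenated query")
    print(len(lookup_safe(conn, PAYLOAD)), "rows from the parameterized query")
```

With the payload, the concatenated query returns every row, while the parameterized query treats the whole payload as a name and returns nothing.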

Testing Rules

Test File Format

# flask-sql-injection.py (must share its basename with the rule file, e.g. flask-sql-injection.yaml)
def test_vulnerable():
    user_input = request.args.get("id")
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    user_input = request.args.get("id")
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))

semgrep --test rules/
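
The annotation convention can be checked mechanically before running the test command. This is a rough sketch, assuming each `ruleid:` or `ok:` comment sits on its own line directly above the code it describes; `parse_annotations` is a hypothetical helper, not part of Semgrep.

```python
import re
from typing import List, Tuple

ANNOTATION = re.compile(r"#\s*(ruleid|ok):\s*([\w.-]+)")

# The test file from this section, inlined for illustration.
TEST_FILE = '''\
def test_vulnerable():
    user_input = request.args.get("id")
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    user_input = request.args.get("id")
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))
'''


def parse_annotations(source: str) -> List[Tuple[str, str, int]]:
    """Return (kind, rule_id, target_line) for each annotation comment.

    target_line is the line after the annotation, where the finding
    is expected (ruleid) or must not appear (ok).
    """
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = ANNOTATION.search(line)
        if match:
            found.append((match.group(1), match.group(2), lineno + 1))
    return found


if __name__ == "__main__":
    for kind, rule_id, line in parse_annotations(TEST_FILE):
        print(f"{kind:7} {rule_id} expected at line {line}")
```

A rule whose test file contains no `ruleid:` annotation at all is a sign that only the safe case is being exercised.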

CI/CD Integration (GitHub Actions)

name: Semgrep

on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: '0 0 1 * *'  # Monthly

jobs:
  semgrep:
    runs-on: ubuntu-latest
    container:
      image: semgrep/semgrep

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Required for diff-aware scanning

      - name: Run Semgrep
        run: |
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            semgrep ci --baseline-commit ${{ github.event.pull_request.base.sha }}
          else
            semgrep ci
          fi
        env:
          SEMGREP_RULES: >-
            p/security-audit
            p/owasp-top-ten
            p/trailofbits

Configuration

.semgrepignore

tests/fixtures/
**/testdata/
generated/
vendor/
node_modules/

Suppress False Positives

password = get_from_vault()  # nosemgrep: hardcoded-password
dangerous_but_safe()  # nosemgrep
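
A bare `nosemgrep` silences every rule on that line, so it is worth auditing for suppressions that do not name a rule. The following is a small sketch of such an audit; `blanket_suppressions` is a hypothetical helper and the regex only covers `#`-style comments.

```python
import re
from typing import List

# Matches "# nosemgrep" with an optional ": rule-id" suffix.
NOSEMGREP = re.compile(r"#\s*nosemgrep\b(:\s*\S+)?")

SAMPLE = '''\
password = get_from_vault()  # nosemgrep: hardcoded-password
dangerous_but_safe()  # nosemgrep
value = compute()
'''


def blanket_suppressions(source: str) -> List[int]:
    """Return line numbers whose nosemgrep comment names no rule."""
    blanket = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = NOSEMGREP.search(line)
        if match and match.group(1) is None:
            blanket.append(lineno)
    return blanket


if __name__ == "__main__":
    print(blanket_suppressions(SAMPLE))
```

Lines reported by the audit are candidates for a rule-specific suppression like the first line of the sample.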

Performance

semgrep --config rules/ --time .    # Check rule performance
ulimit -n 4096                       # Increase file descriptors for large codebases

Path Filtering in Rules

rules:
  - id: my-rule
    paths:
      include: [src/]
      exclude: [src/generated/]

Third-Party Rules

pip install semgrep-rules-manager
semgrep-rules-manager --dir ~/semgrep-rules download
semgrep -f ~/semgrep-rules .

Rationalizations to Reject

| Shortcut | Why It's Wrong |
| --- | --- |
| "Semgrep found nothing, code is clean" | Semgrep is pattern-based; it can't track complex data flow across functions |
| "I wrote a rule, so we're covered" | Rules need testing with `semgrep --test`; false negatives are silent |
| "Taint mode catches injection" | Only if you defined all sources, sinks, AND sanitizers correctly |
| "Pro rules are comprehensive" | Pro rules are good but not exhaustive; supplement with custom rules for your codebase |
| "Too many findings = noisy tool" | High finding count often means real problems; tune rules, don't disable them |

Resources

Registry: https://semgrep.dev/explore
Playground: https://semgrep.dev/playground
Docs: https://semgrep.dev/docs/
Trail of Bits Rules: https://github.com/trailofbits/semgrep-rules
Blog: https://semgrep.dev/blog/