# Ads Ingestion & Scoring (majidapi + Normalize/Dedup/Median/Score)

## Goals
- Pull ads from Bama and Divar via majidapi, normalize them to a canonical schema, deduplicate, compute city medians, score opportunities, and persist the results with reasons.

## majidapi Endpoints (Bama + Divar)

### Bama (authoritative per majidapi docs)
- List latest:
  - `GET https://api.majidapi.ir/bama?action=latest&page=1`
- Search:
  - `GET https://api.majidapi.ir/bama?action=search&s=<brand_or_model>&page=1`
- Details:
  - `GET https://api.majidapi.ir/bama?action=details&code=<CODE>`
- Car price (optional):
  - `GET https://api.majidapi.ir/bama?action=carPrice&brand=<BRAND>`
- Motorcycle price (optional):
  - `GET https://api.majidapi.ir/bama?action=motorcyclePrice&brand=<BRAND>`

**Important Limitations:**
- **No City Filtering**: Bama API does not support city-specific search parameters
- **Location in Response**: City information is available in `detail.location` field (e.g., "تهران / کلاهدوز")
- **Smart Search Required**: We implement multi-page search with city filtering on our end

**Notes:**
- Auth: when required, send `Authorization: Bearer $MAJIDAPI_TOKEN` header (keep token in `.env`).
- Params: `action` mandatory; `page` 1-based; `s` is brand/model text; `code` is ad identifier.
- Timeout/Retry: 10s timeout, up to 2 retries with exponential backoff on 429/5xx.
- Normalization: map fields to our canonical schema (see below).
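
The endpoint and auth notes above can be sketched as a small request builder. This is only an illustration: `bamaRequest` and `isRetryableStatus` are hypothetical helper names, and the returned array shape is an assumption rather than the project's actual HTTP client code.

```php
<?php
// Hypothetical helper: describes a Bama request (URL, headers, timeout) for majidapi.
function bamaRequest(string $action, array $params = [], ?string $token = null): array
{
    $base = 'https://api.majidapi.ir/bama';

    // `action` is mandatory and goes first; the remaining params keep their given order.
    $query = http_build_query(['action' => $action] + $params, '', '&', PHP_QUERY_RFC3986);

    // Send the bearer token only when one is configured.
    $headers = [];
    if ($token !== null && $token !== '') {
        $headers[] = 'Authorization: Bearer ' . $token;
    }

    return ['url' => $base . '?' . $query, 'headers' => $headers, 'timeout' => 10];
}

// Retry only on 429 and 5xx responses, as stated in the notes.
function isRetryableStatus(int $status): bool
{
    return $status === 429 || ($status >= 500 && $status <= 599);
}

// Usage: read the token from the environment (loaded from `.env`).
$request = bamaRequest('details', ['code' => 'CODE'], getenv('MAJIDAPI_TOKEN') ?: null);
```

Keeping URL construction separate from the actual HTTP call makes the parameter handling easy to unit-test without network access.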

**Response mapping (typical):**
- Input fields (examples, may vary): `code`, `title`, `brand`, `model`, `year`, `km`, `city`, `price`, `body_status`, `link`.
- Normalize to: `{source:'bama', code, brand, model, trim?, year:int, km:int, city_id?, price:int, body_status, link, contact_ref?}`.

### Divar Scraper via majidapi (City-Specific URLs)
- Endpoint: `GET https://api.majidapi.ir/tools/scraper` with query params:
  - `url` (required): percent-encoded target page (e.g., `https%3A%2F%2Fdivar.ir%2Fs%2Ftehran%2Fcar`).
  - `className` (required for list extraction): CSS class of item containers, e.g., `kt-post-card`.
  - `id` (optional): element id when you target a specific node on a page.
  - `token` (required by service): token obtained from majidapi's Telegram bot.
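
As a rough sketch of how the scraper query string could be assembled, the snippet below percent-encodes the Divar list URL and attaches the documented parameters. `divarScraperUrl` is a hypothetical helper name; the optional `id` parameter is left out for brevity.

```php
<?php
// Hypothetical helper: builds the majidapi scraper URL for a Divar city list page.
function divarScraperUrl(string $citySlug, string $token, string $className = 'kt-post-card'): string
{
    // Target page on Divar, e.g. https://divar.ir/s/tehran/car
    $target = 'https://divar.ir/s/' . rawurlencode($citySlug) . '/car';

    // PHP_QUERY_RFC3986 percent-encodes ':' and '/' in the `url` value.
    $query = http_build_query([
        'url'       => $target,
        'className' => $className,
        'token'     => $token,
    ], '', '&', PHP_QUERY_RFC3986);

    return 'https://api.majidapi.ir/tools/scraper?' . $query;
}

// Usage with a placeholder token:
// divarScraperUrl('tehran', 'TOKEN')
// → https://api.majidapi.ir/tools/scraper?url=https%3A%2F%2Fdivar.ir%2Fs%2Ftehran%2Fcar&className=kt-post-card&token=TOKEN
```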

**Key Advantages over Bama:**
- **City Filtering Support**: Divar URLs support city-specific filtering (`/s/tehran/car`)
- **No Multi-Page Search Needed**: Single page per city provides sufficient results
- **Stable HTML Structure**: Consistent CSS selectors for parsing
- **Direct City Mapping**: URL slug directly maps to city_id

**Implementation Flow:**
1) **City-Specific URL Construction**:
   - Build Divar list URL: `https://divar.ir/s/<city>/car` (e.g., `tehran`, `mashhad`)
   - City slug mapping: `tehran` → `تهران`, `isfahan` → `اصفهان`, etc.
   - Support 50+ cities with comprehensive slug-to-name mapping

2) **HTML Parsing & Data Extraction**:
   - Parse returned HTML using stable CSS selectors (`kt-post-card__title`, `kt-post-card__description`)
   - Extract: title, price, mileage, location, and ad links
   - Filter out navigation items and non-car content
   - Convert Persian digits to English for processing

3) **Data Normalization**:
   - Extract `external_code` from ad URL slug (`/v/<slug>` → `<slug>`)
   - Parse price from title text (handle both Rials and Tomans)
   - Extract year from title (4-digit numbers 1300-1450)
   - Extract location from title text
   - Map to normalized item with `source:'divar'`, `city_id`, `external_code`, `link`, `price` as integer
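
The extraction steps above might look roughly like the following. All function names here are hypothetical, and two assumptions are made that the text does not state outright: prices are stored in Rials (so Toman amounts are multiplied by 10), and the `external_code` is the first path segment after `/v/`.

```php
<?php
// Convert Persian (۰-۹) and Arabic-Indic (٠-٩) digits to ASCII digits.
function toAsciiDigits(string $text): string
{
    static $map = [
        '۰' => '0', '۱' => '1', '۲' => '2', '۳' => '3', '۴' => '4',
        '۵' => '5', '۶' => '6', '۷' => '7', '۸' => '8', '۹' => '9',
        '٠' => '0', '١' => '1', '٢' => '2', '٣' => '3', '٤' => '4',
        '٥' => '5', '٦' => '6', '٧' => '7', '٨' => '8', '٩' => '9',
    ];
    return strtr($text, $map);
}

// `/v/<slug>` → `<slug>`; returns null when the URL has no `/v/` segment.
function extractExternalCode(string $url): ?string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || !preg_match('#/v/([^/]+)#', $path, $m)) {
        return null;
    }
    return $m[1];
}

// First standalone 4-digit number in the range 1300-1450, or null.
function extractYear(string $title): ?int
{
    if (preg_match_all('/(?<!\d)(\d{4})(?!\d)/', toAsciiDigits($title), $m)) {
        foreach ($m[1] as $candidate) {
            $year = (int) $candidate;
            if ($year >= 1300 && $year <= 1450) {
                return $year;
            }
        }
    }
    return null;
}

// Parse "<number> تومان" or "<number> ریال" into an integer amount in Rials.
function parsePriceToRial(string $text): ?int
{
    if (!preg_match('/(\d[\d,٬]*)\s*(تومان|ریال)/u', toAsciiDigits($text), $m)) {
        return null;
    }
    $amount = (int) preg_replace('/\D/', '', $m[1]); // drop thousands separators
    return $m[2] === 'تومان' ? $amount * 10 : $amount; // 1 Toman = 10 Rials
}
```

Converting digits before any regex matching keeps the patterns ASCII-only and avoids a separate code path for Persian numerals.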

**Operational Notes:**
- **City Detection**: Automatic city_id extraction from URL slug
- **Content Filtering**: Skip navigation, menu, and non-car items
- **Data Quality**: Handle missing fields gracefully, log parsing issues
- **Rate Limiting**: Respect API limits, no additional delays needed
- **Session Management**: Add user and session info for Telegram filtering
- **Error Handling**: Comprehensive logging for debugging and monitoring

**Deep-dive checklist (end-to-end, resilient scraping):**
- Profiles (admin-driven): define city, source URL, CSS selector(s), optional keywords; toggle enable/disable.
- Fetch policy: `timeout=10s`, retries=2 with backoff (1s, 3s), jitter; global circuit breaker after N consecutive failures.
- Parse flow:
  1) Extract cards via primary `className`; if empty → search `a[href^="/v/"]`.
  2) Build absolute links; compute `external_code` from slug.
  3) For details, parse `script#__NEXT_DATA__` JSON; fallback to robust CSS.
  4) Normalize city via list URL or alias table.
  5) Sanitize price/year/km; drop invalids with `ads.normalize.rejected`.
- Dedup: key = `sha1(source|external_code)`; fallback composite hash otherwise.
- Throttling: per-host sleep (100–300ms); respect 429 `Retry-After`.
- Observability:
  - `ads.pull.ok/err {source,count|code}` for fetch outcome
  - `ads.normalize.rejected {reason, code?, city?}` for drops
  - Daily counters per city/source for coverage
- Security/legal: use MajidAPI only; keep data minimal per PRD; never log tokens.
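
One way to express the fetch policy above (two retries with 1s and 3s backoff, jitter, and respect for `Retry-After` on 429) is a pure delay calculator. `retryDelayMs` is a hypothetical name, the 250 ms jitter ceiling is an assumption, and only the delta-seconds form of `Retry-After` is handled in this sketch.

```php
<?php
// Returns the wait before the given retry attempt in milliseconds,
// or null when no retries are left.
function retryDelayMs(int $attempt, ?string $retryAfter = null, int $maxJitterMs = 250): ?int
{
    $schedule = [1000, 3000]; // backoff for retry 1 and retry 2

    if ($attempt < 1 || $attempt > count($schedule)) {
        return null; // retry budget exhausted
    }

    $base = $schedule[$attempt - 1];

    // Honor a numeric Retry-After header when it asks for a longer wait.
    if ($retryAfter !== null && ctype_digit($retryAfter)) {
        $base = max($base, (int) $retryAfter * 1000);
    }

    // Add random jitter so parallel workers do not retry in lockstep.
    return $base + random_int(0, $maxJitterMs);
}

// Usage inside a fetch loop (sketch):
// $delay = retryDelayMs($attempt, $retryAfterHeader);
// if ($delay === null) { /* give up, count towards the circuit breaker */ }
// else { usleep($delay * 1000); }
```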

**Security & Legal:**
- Use the majidapi-provided scraper interface (do not scrape divar directly with bots).
- Respect terms of service and robots; throttle responsibly; store only fields needed per PRD.

## Normalization Rules (App\Services\Ads\Normalizer)
- Input → fields: source, code, brand, model, trim, year (int), km (int), city_id, price (int), body_status, link, contact_ref
- City name/alias → `cities` lookup; if not found, attempt fuzzy match; else null
- Brand/model normalization via mapping table (future); fallback to raw
- Price sanity: 10M < price < 10B IRR (configurable); else drop
- Year sanity: 1380 ≤ year ≤ current year (Solar Hijri calendar)
- KM sanity: 0 ≤ km ≤ 500,000
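
A minimal sketch of the sanity checks, assuming a normalized ad is a plain array. `rejectionReason` and the reason strings are hypothetical, and treating a missing `year` or `km` as acceptable is an assumption; the rules above only state the valid ranges.

```php
<?php
// Returns a rejection reason for an invalid ad, or null when the ad passes.
function rejectionReason(
    array $ad,
    int $currentYear,
    int $minPrice = 10_000_000,      // 10M IRR (configurable)
    int $maxPrice = 10_000_000_000   // 10B IRR (configurable)
): ?string {
    $price = $ad['price'] ?? null;
    $year  = $ad['year'] ?? null;
    $km    = $ad['km'] ?? null;

    // Price is required and must lie strictly inside the configured bounds.
    if (!is_int($price) || $price <= $minPrice || $price >= $maxPrice) {
        return 'price_out_of_range';
    }
    // Year and km are checked only when present (assumption).
    if ($year !== null && (!is_int($year) || $year < 1380 || $year > $currentYear)) {
        return 'year_out_of_range';
    }
    if ($km !== null && (!is_int($km) || $km < 0 || $km > 500_000)) {
        return 'km_out_of_range';
    }
    return null;
}
```

The returned reason could be attached to the `ads.normalize.rejected` event so that drops remain traceable.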

## Deduplication (App\Services\Ads\Deduper)
- Key: `sha1(source|external_code)`
- Unique constraint on `opportunities.dedup_key` ensures no duplicates
- If code missing: combine brand|model|year|km|city_id|price; beware collisions; prefer source code when present
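
The key derivation can be sketched as below. `dedupKey` is a hypothetical function name, and hashing the composite fallback with `sha1` is an assumption made so that both paths produce keys of the same length.

```php
<?php
// Builds the value stored in `opportunities.dedup_key`.
function dedupKey(array $ad): string
{
    $code = $ad['external_code'] ?? null;

    // Preferred path: the source-provided code is stable and collision-free.
    if ($code !== null && $code !== '') {
        return sha1($ad['source'] . '|' . $code);
    }

    // Fallback: composite of descriptive fields; collisions are possible
    // when two different ads share all six values.
    $parts = [
        $ad['brand'] ?? '',
        $ad['model'] ?? '',
        $ad['year'] ?? '',
        $ad['km'] ?? '',
        $ad['city_id'] ?? '',
        $ad['price'] ?? '',
    ];
    return sha1(implode('|', $parts));
}
```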

## City Median (App\Services\Ads\CityMedianService)
- Query latest N (≤100) `opportunities` with same city_id, brand, model, year; compute median
- Threshold `median_min_samples` (default 20):
  - If the sample count `n < median_min_samples`, treat the median as unavailable and use the fallback scoring rules
- TTL/Cache (optional) to reduce repeated queries
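
A minimal sketch of the median step, assuming the prices have already been queried newest-first for one `(city_id, brand, model, year)` group. `cityMedian` is a hypothetical name; the database query itself is out of scope here.

```php
<?php
// Median of up to 100 most recent prices, or null when there are too few samples.
function cityMedian(array $prices, int $minSamples): ?float
{
    // Keep at most the latest 100 samples (input assumed newest-first).
    $prices = array_slice(array_values($prices), 0, 100);
    $n = count($prices);

    if ($n === 0 || $n < $minSamples) {
        return null; // median unavailable, caller should use fallback scoring
    }

    sort($prices);
    $mid = intdiv($n, 2);

    // Odd count: middle element. Even count: mean of the two middle elements.
    return $n % 2 === 1
        ? (float) $prices[$mid]
        : ($prices[$mid - 1] + $prices[$mid]) / 2;
}
```

Returning `null` rather than a median from a tiny sample keeps the "unavailable" case explicit for the scorer.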

## Scoring (App\Services\Ads\OpportunityScorer)
- Primary:
  - If price ≤ median × 0.85 → score A (0.9), reason `under_median_15pct`
  - Else if price ≤ median × 0.90 → score B (0.7), reason `under_median_10pct`
- Fallback (no reliable median):
  - Boost for year ≥ (this year − 4) → `recent_year`
  - Boost for km ≤ 80k → `low_km`
  - Score baseline 0.5 if any boost; else 0.0
- Keep at most 2 reasons per opportunity; store them in `reasons_json`
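
The rules above translate into a short function like the one below. `scoreOpportunity` and the result shape are hypothetical. One case is not specified in the rules: an ad with a reliable median whose price is above 90% of it. This sketch assumes such ads score 0.0 with no reasons.

```php
<?php
// Scores one ad; $median is null when no reliable city median exists.
function scoreOpportunity(int $price, ?float $median, ?int $year, ?int $km, int $currentYear): array
{
    if ($median !== null) {
        if ($price <= $median * 0.85) {
            return ['grade' => 'A', 'score' => 0.9, 'reasons' => ['under_median_15pct']];
        }
        if ($price <= $median * 0.90) {
            return ['grade' => 'B', 'score' => 0.7, 'reasons' => ['under_median_10pct']];
        }
        // Assumption: not an opportunity when priced above 90% of the median.
        return ['grade' => null, 'score' => 0.0, 'reasons' => []];
    }

    // Fallback rules: collect boosts, baseline 0.5 if any boost applies.
    $reasons = [];
    if ($year !== null && $year >= $currentYear - 4) {
        $reasons[] = 'recent_year';
    }
    if ($km !== null && $km <= 80_000) {
        $reasons[] = 'low_km';
    }

    return [
        'grade'   => null,
        'score'   => $reasons !== [] ? 0.5 : 0.0,
        'reasons' => array_slice($reasons, 0, 2), // at most 2 reasons
    ];
}
```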

## Configuration (config/ads.php)

### Bama Smart Search Settings
```php
'bama' => [
    'smart_search' => [
        'min_results' => 5,           // Minimum city-specific ads before stopping
        'max_pages' => 15,            // Maximum pages to search in smart phase
        'delay_between_requests' => 500000, // Microseconds (0.5s)
        'fallback_enabled' => true,   // Enable fallback search
        'fallback_max_pages' => 5,    // Maximum pages in fallback phase
    ],
    'city_detection' => [
        'enabled' => true,            // Enable smart city detection
        'fallback_to_constructor' => true, // Use constructor city_id if detection fails
    ]
]
```

### Scoring Thresholds
- `threshold_primary_percent`: -15 (score A for price ≤ median × 0.85)
- `threshold_fallback_percent`: -10 (score B for price ≤ median × 0.90)
- `ttl_hours`: 72 (archive ads older than 72 hours)
- `median_min_samples`: 5 (minimum samples for reliable median)

## TTL/Archive Policy
- Ads/opportunities older than 72h (config `ads.ttl_hours`) → archive/mark
- Job runs daily to mark/archive

## Jobs & Flow

### BamaSearchJob (Smart Multi-Page Search)
- **Smart Search Phase**: 
  - Searches up to 15 pages (configurable)
  - Filters ads by `detail.location` field to match target city
  - Stops when minimum 5 city-specific ads found
  - Uses 0.5s delay between requests to avoid rate limiting
- **Fallback Search Phase**:
  - Activates if smart search finds < 3 city-specific ads
  - Searches up to 5 additional pages
  - Stores all ads with requested `city_id` (overrides actual location)
- **City Detection**: 
  - Extracts city name from `detail.location` (e.g., "تهران / کلاهدوز" → "تهران")
  - Matches against `cities` table (exact, alias, partial match)
  - Supports 20+ major cities with fallback patterns
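
The smart search phase could be sketched as the loop below. The page fetcher is injected as a callable so the loop can be exercised without network access. `smartSearch`, `cityFromLocation`, the `$ad['detail']['location']` array shape, and stopping on an empty page are all assumptions for illustration; the fallback phase and the `cities` table lookup are omitted.

```php
<?php
// "تهران / کلاهدوز" → "تهران": keep the part before the first slash.
function cityFromLocation(string $location): string
{
    return trim(explode('/', $location, 2)[0]);
}

// $fetchPage: callable(int $page): array of ads (hypothetical signature).
function smartSearch(
    callable $fetchPage,
    string $targetCity,
    int $minResults = 5,
    int $maxPages = 15,
    int $delayUs = 500000
): array {
    $matches = [];
    $pages = 0;
    $processed = 0;

    for ($page = 1; $page <= $maxPages; $page++) {
        $ads = $fetchPage($page);
        $pages++;
        if ($ads === []) {
            break; // assumption: an empty page means no more results
        }
        foreach ($ads as $ad) {
            $processed++;
            if (cityFromLocation($ad['detail']['location'] ?? '') === $targetCity) {
                $matches[] = $ad;
            }
        }
        if (count($matches) >= $minResults) {
            break; // reached minimum results, stop early
        }
        if ($delayUs > 0) {
            usleep($delayUs); // pause between requests to avoid rate limiting
        }
    }

    return [
        'ads' => $matches,
        'total_pages_searched' => $pages,
        'total_ads_processed' => $processed,
        'city_specific_ads_found' => count($matches),
        'success' => count($matches) >= $minResults,
    ];
}
```

The returned counters use the same names as the performance metrics listed under Observability, so the result can be logged directly.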

### DivarScrapeJob (City-Specific Scraping)
- **URL-Based City Filtering**: Uses Divar city URLs (`/s/tehran/car`, `/s/isfahan/car`)
- **HTML Parsing**: Parses Divar HTML cards using stable CSS selectors
- **Data Extraction**: Extracts title, price, mileage, location, and links
- **City Detection**: Automatically detects city from URL slug
- **Content Filtering**: Filters out navigation and non-car items
- **Session Management**: Adds user and session info for Telegram filtering

### ScoreOpportunitiesJob
- Normalize → Dedup → Median → Score → Persist to `opportunities`
- Prioritizes `ads_raw.city_id` over normalized city detection
- Adds session info (`search_user_id`, `search_session_id`) for Telegram filtering

### PushSchedulerJob
- Uses `opportunities` to prepare messages
- Filters by `city_id` and `source` for user queries

### Scheduler
- Bama smart search on user request (not scheduled)
- Divar scraping on user request
- Scoring after ingestion
- Weekly city report generation

## Admin & Ops
- Filament resources: ScrapeProfiles (enable/disable/test), Opportunities (approve/push/archive), Settings (thresholds)
- Test hook: `Admin\ScrapeProfilesController@test` returns `ok`, `sample_count`, `notes`

## Observability & Logging

### BamaSearchJob Logs
- `BamaSearchJob: Starting smart multi-page search` - Search initiation
- `BamaSearchJob: Found city-specific ads` - City matches found per page
- `BamaSearchJob: Reached minimum results, stopping` - Early termination
- `BamaSearchJob: Smart search insufficient, trying fallback` - Fallback activation
- `BamaSearchJob: City match found` - Individual city detection success
- `BamaSearchJob: No city match found` - City detection failure

### DivarScrapeJob Logs
- `DivarScrapeJob: Processed API data` - Successful data processing
- `DivarScrapeJob: No valid items found` - No car ads found in response
- `DivarScrapeJob: Adding session info to ad` - Session info added to ad
- `DivarScrapeJob: No session info available` - Missing session info
- `DivarScrapeJob: Stored raw ads` - Database storage summary

### Performance Metrics
- `total_pages_searched`: Number of pages processed
- `total_ads_processed`: Total ads examined
- `city_specific_ads_found`: Ads matching target city
- `success`: Whether minimum results achieved

### Error Handling
- HTTP 429/5xx: retry with backoff; circuit-breaker if consecutive failures exceed threshold
- Schema drift: parse tolerantly and log any unrecognized fields
- MajidAPI outages: degrade gracefully; keep last known opportunities
- Rate limiting: 0.5s delay between requests, configurable
- City detection failures: fallback to constructor city_id

