
Daily Scholar Papers Report — 2026-04-28


Window covered: 2026-04-27 → 2026-04-28 (Google Scholar alerts + user-curated self-emails, last 24 h). Backfilled on 2026-04-29 — the original scheduled run was skipped.


Executive Summary

The Scholar-alert side of this window was empty: the next batch from Google Scholar didn't arrive until late on 04-28 CST (and was caught by the 04-29 report). The user-curated self-email queue, however, contributed exactly one paper — flagged via the new STEP 2b pipeline that picks up self-addressed emails whose subject contains "paper" or "read".

That paper is FIKA (KTH + Université de Montréal), a pipeline that augments static dependency reachability with dynamic executability proofs for Java/Maven projects. Static analysis tools can show that a third-party library call site is reachable from a project's public methods, but cannot prove there is an actual input that reaches it at runtime. FIKA closes this gap: for each statically-reachable call site not covered by existing tests, it prompts an LLM to generate a self-contained "reachability scenario" — a runnable unit-test harness method that initialises the project, walks the call path, and triggers the target. The scenario is compiled, run, and validated by JaCoCo coverage tooling — success is defined by the coverage tool, never by the LLM's self-report.

Across 8 Java projects covering 3,219 third-party call sites, developer-written tests cover 1,754 (54%); FIKA adds 609 new executability proofs (42% of the not-covered sites), bringing the total dynamic guarantee to 73% — a 19-pp lift. On six of eight projects, FIKA crosses the 75%-guaranteed-coverage line. Applied to vulnerability triage, FIKA confirms strong reachability + executability for 31 of 59 CVEs that Semgrep had marked "undetermined" — converting ambiguous static results into prioritise-this-fix evidence. Cost: $0.0074 per successful scenario (DeepSeek V3.2), $4.52 total for the eight-project evaluation.

The architectural pattern echoes the rest of this fortnight's reading: LLM produces, deterministic oracle confirms.

Outstanding: 0 · Keep: 1 · Borderline High-Priority: 0

The full analysis follows.


Highlighted Papers

| # | Title | Authors | Venue | Link |
|---|-------|---------|-------|------|
| 5.1 | FIKA: Expanding Dependency Reachability with Executability Guarantees | Yogya Gamage, Meriem Ben Chaaben, Martin Monperrus, Benoit Baudry | arXiv 2604.20015 [cs.SE] (preprint, ICSE/FSE/ISSTA-style empirical) | arXiv |

Keep Papers (Deep-Read)

5.1 · DEP-REACHABILITY · [USER-PICK] LLM-generated reachability scenarios + JaCoCo oracle lift dynamic dependency-coverage from 54% to 73% on 8 Java projects, $0.0074 per proof

5.1 FIKA: Expanding Dependency Reachability with Executability Guarantees

arXiv:2604.20015

Paper

  • Title: FIKA: Expanding Dependency Reachability with Executability Guarantees
  • Authors: Yogya Gamage, Meriem Ben Chaaben (Université de Montréal), Martin Monperrus (KTH Royal Institute of Technology), Benoit Baudry (Université de Montréal)
  • Venue / Source: arXiv:2604.20015 [cs.SE] — preprint, submitted 2026-04-21, ~12 pp + appendix. Empirical-track formatting (4 RQs, ablation, cost section, threats to validity, public artefact); plausible target ICSE / FSE / ISSTA / TSE.
  • Year: 2026
  • Link: https://arxiv.org/abs/2604.20015
  • License: arXiv non-exclusive distribution (figures not embedded; pipeline recreated in Mermaid below).

Objective Summary

  • Problem. Static dependency-reachability tools determine that a third-party library call site is statically reachable from a project's public methods, but cannot prove it is executable — that there exists an actual input exercising that call path at runtime. This drives notification fatigue across vulnerability scanners (Dependabot, Semgrep, etc.): thousands of "potentially reachable" CVEs that are never actually reached. Developers ignore most of them.
  • Approach. FIKA adds a runtime-grounded layer. For every statically-reachable call site that is not covered by the project's existing test suite, FIKA prompts an LLM (DeepSeek V3.2) to generate a reachability scenario — a self-contained unit-test harness method that initialises the project state, invokes a public entry point, and provably triggers the target third-party call site. The LLM is given (i) the static call path produced by CHA + BFS shortest-path, (ii) the source code of every method along that path, (iii) the public entry point's constructors / factory methods / class-level state-setters, and (iv) feedback from previous failed attempts. Each generated scenario is compiled and run; success is defined by JaCoCo coverage confirming the target line was executed — never by the LLM's self-report.
  • Workflow split.
  • Static phase: CHA call graph → BFS one shortest path per (m_e, m_d^p, m_t^{tpl}) tuple → context extraction (method bodies + entry-point boilerplate).
  • Dynamic phase 1: run the existing test suite under JaCoCo to collect call sites already covered by developer-written tests.
  • Dynamic phase 2: for each not-covered call site, LLM generates a reachability scenario; compile, run, validate via JaCoCo; on failure, feed compiler / runtime / coverage errors back to the LLM for up to 5 iterations.
  • Headline numbers (verbatim §IV, Table II):
  • 8 Java/Maven projects (flink, graphhopper, jooby, mybatis-3, pdfbox, tablesaw, tika, poi-tl) covering 1,363 unique third-party methods invoked across 3,219 distinct call sites.
  • RQ1: developer-written tests cover 1,754 / 3,219 = 54% of third-party call sites.
  • RQ2: FIKA generates a successful reachability scenario for 609 / 1,465 = 42% of the not-covered sites. Total dynamic guarantee climbs to 2,363 / 3,219 = 73%, a +19 pp lift. On 6 of 8 projects, FIKA crosses the 75% guaranteed-coverage line.
  • RQ3 (ablation): removing static-analysis context (BL1 = path only) collapses successes from 609 → 221 (-64%); removing entry-point boilerplate (BL2 = path + method bodies) drops to 387 (-37%); the five-iteration feedback loop adds 167 (503 → 609).
  • RQ4 (vulnerability triage vs. Semgrep): of 59 Semgrep-undetermined CVEs across 13 vulnerable modules, FIKA confirms strong reachability + executability for 31 (Table V).
  • Cost: $0.0074 per successful scenario (DeepSeek V3.2), $4.52 total for the eight-project evaluation. Generation latency, however, is hours to a day per project, which limits CI integration.
  • Datasets: 8 mature Java open-source projects curated from prior work by Soto-Valero et al. [26]; 13 vulnerable modules drawn from the same dataset for the Semgrep comparison.
  • Backbone: DeepSeek V3.2 via the LangChain / LangGraph orchestration framework. CHA call graph via Sootup; AST manipulation via Spoon.
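The dynamic-phase loop described above can be sketched in a few lines (a minimal Python illustration, not FIKA's implementation; the `generate`, `run`, and `covered` callables are hypothetical stand-ins for the LLM call, the JUnit harness runner, and the JaCoCo check):

```python
def prove_executability(call_site, context, generate, run, covered, max_iters=5):
    """Iterate scenario generation until the coverage oracle confirms the
    target call site, or give up after max_iters attempts (FIKA uses 5)."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        scenario = generate(call_site, context, feedback)  # LLM proposal
        ok, coverage, errors = run(scenario)               # compile + execute
        if ok and covered(coverage, call_site):
            return scenario, attempt   # success decided by the oracle only
        feedback = errors              # structured errors feed the next try
    return None, max_iters             # site stays "statically reachable only"
```

The property worth copying is that `covered` is the only success predicate; nothing the LLM emits can short-circuit the check.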

Formal Definitions Quoted (§II.A)

The paper specifies its terminology explicitly. Reproduced verbatim under fair-use:

"Reachability of a third-party library method. We define library reachability as the possibility of a given entry point method (\(m_e\)) in a project (\(p\)) to invoke a target method (\(m_t\)) in a third-party library (\(tpl\))."

"Reachability scenario. We call reachability scenario a code snippet that invokes a \(m_e\) and initializes the project state in order to trigger a call path reaching the target \(m_d^p\) and invoke a \(m^{tpl}\). To execute a reachability scenario in isolation, we implement, and run it within a unit testing framework, and use test coverage tools to confirm executability. This resembles a unit test case, except that its purpose is to collect evidence of third-party library call site executability, and hence a reachability scenario does not include any assertion."

The four-tuple (m_e, m_d^p, m_t^{tpl}, path) is the unit of work; FIKA selects exactly one shortest static path per tuple via BFS to bound the search space.
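The one-shortest-path-per-tuple selection is plain breadth-first search over the static call graph; a toy sketch (hypothetical dict-based graph, not the paper's Sootup-based implementation):

```python
from collections import deque

def shortest_call_path(call_graph, entry, target):
    """BFS from an entry method to a target third-party call, returning one
    shortest path, mirroring FIKA's one-path-per-tuple bound on the search.
    call_graph: dict mapping a method to the methods it may invoke (CHA edges)."""
    queue = deque([[entry]])
    seen = {entry}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path                    # first hit in BFS = shortest path
        for callee in call_graph.get(path[-1], ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(path + [callee])
    return None                            # target not statically reachable
```

Each returned path is then the "path" element of the four-tuple handed to the LLM, together with the source of every method on it.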

Pipeline Diagram (Mermaid recreation; original figures not embedded)

flowchart LR
    A[Project source + pom.xml] --> B[Static analysis<br/>CHA call graph<br/>BFS shortest paths]
    B --> C["Set of (me, md^p, mt^tpl, path)<br/>tuples"]
    C --> D[Dynamic phase 1<br/>run existing test suite<br/>JaCoCo coverage]
    D --> E{Call site<br/>covered?}
    E -->|yes| F[Mark executable<br/>via dev-written tests]
    E -->|no| G[Extract LLM context:<br/>path source + entry-point<br/>ctors/factories/setters]
    G --> H[DeepSeek V3.2 generates<br/>reachability scenario]
    H --> I[Compile + run as<br/>JUnit harness]
    I --> J{JaCoCo confirms<br/>target line hit?}
    J -->|yes| K[Mark executable<br/>via FIKA scenario]
    J -->|no| L[Feedback loop<br/>compiler/runtime errors<br/>back to LLM, ≤5 iters]
    L --> H

Methodological Reusable Ideas

  1. JaCoCo-as-oracle. The success predicate ("did the generated scenario actually hit the target line?") is delegated to a deterministic coverage tool, not to LLM self-report. Reusable any time the LLM's task can be reduced to "produce code; coverage/diff tool verifies."
  2. Static analysis as scaffolding, not as verifier. CHA + BFS gives the LLM one short, complete path's worth of context (method bodies + entry-point boilerplate) rather than expecting it to find its way through the codebase. The 64% success drop in BL1 (path-only) is the strongest evidence that context shape dominates LLM scale.
  3. Iterative feedback loop with structured error signals. Compiler errors, runtime exceptions, and missing-coverage signals are fed back to the LLM for up to 5 iterations, lifting success from 503 → 609. A clean ablation of the "give the LLM another chance with the error message" pattern that is becoming standard in agentic SE tools.
  4. Scenarios-as-tests-without-assertions is a clarifying framing: developer-written tests verify correctness, FIKA scenarios verify reachability. The two artifact types share infrastructure (JUnit, JaCoCo) but have different success predicates.
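Idea 1 is cheap to reuse because JaCoCo's XML report is trivially machine-checkable. A minimal sketch of such an oracle (assumes JaCoCo's standard report schema, where each `<line>` element carries a line number `nr` and a covered-instruction count `ci`; this is not FIKA's actual code):

```python
import xml.etree.ElementTree as ET

def line_covered(jacoco_xml, source_file, line_nr):
    """Deterministic success predicate: did execution hit this source line?
    Scans a JaCoCo XML report for the matching <sourcefile>/<line> element
    and treats ci > 0 (covered instructions) as proof of execution."""
    root = ET.fromstring(jacoco_xml)
    for sf in root.iter("sourcefile"):
        if sf.get("name") != source_file:
            continue
        for line in sf.iter("line"):
            if int(line.get("nr")) == line_nr:
                return int(line.get("ci", "0")) > 0
    return False                 # line absent from the report: not covered
```

A predicate like this replaces "the LLM says the scenario worked" with a check any CI job can run.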

Limitations Honestly Reported

  • Java/Maven only; Gradle / Kotlin / other ecosystems flagged as open transferability questions.
  • LLM choice (DeepSeek V3.2) not ablated against other backbones — model-dependence unknown.
  • Generation latency dominates: "several hours, sometimes more than a day" per project. The authors propose running FIKA pre-release rather than per-commit.
  • 8-project evaluation is small relative to the SE empirical-evaluation norm.

Closing Quote (one verbatim line allowed)

"By making the third-party library interactions explicit, FIKA converts ambiguous results into verifiable evidence that developers can act upon."


Cross-Paper Synthesis (with the surrounding fortnight)

FIKA fits squarely into the dominant architectural pattern emerging across this fortnight's papers — LLM produces, deterministic oracle confirms. Four papers in two weeks now land on this recipe in four different domains.

| Pattern | FIKA (5.1, 04-28) | TraceScope (4.1, 04-29) | AeroReq2LTL (4.2, 04-29) | LLMVD.js (3.1, 04-27) |
|---|---|---|---|---|
| Production artifact | reachability scenario (Java unit test) | evidence bundle + checklist adjudication | NL→TNL→LTL formula | exploit driver (Node.js) |
| External oracle | JaCoCo line-coverage check | MITRE checklist + Evidence Citation Protocol | deterministic NL→LTL rules + expert ground truth | class-specific execution oracle |
| Ban on self-grading | LLM never reports success — JaCoCo does | adjudicator must cite resource ID | deterministic translation stage | side-effect verification |
| Cost lever | $0.0074/scenario, ≤5 iters bound | $0.04/URL median, 60s exec cap | (cost not reported) | $0.05/valid exploit |

A few specific cross-cutting ideas worth pulling out from FIKA:

  • The static-analysis-as-scaffolding insight is the same shape as TraceScope's evidence-bundle scaffolding and AeroReq2LTL's SpaceRDL templates. In all three cases, a non-LLM component constrains the search space before the LLM runs, and the LLM operates over a small, well-typed input. The pattern is more general than "give the LLM tools" — it's "give the LLM a discrete, finite menu of structured choices."
  • Coverage-as-oracle is broadly transferable. Any LLM task that produces code (test generation, refactoring, bug repair, exploit synthesis) can use the same scaffold: existing coverage tooling already reports execution traces, so verifying "did the LLM's code achieve the structural goal" is essentially free. FIKA exploits this for reachability; the same idea applies to mutation testing, fault injection, and most LLM-driven code synthesis tasks.
  • Cost reporting at the per-artefact level is unusually clean here. $0.0074 / valid scenario lets a project owner reason directly about a budget — "I'll spend $50 to reach 75% guaranteed coverage." Other papers should adopt this framing.
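That budget reasoning is a one-liner given the paper's figures; a quick check reproducing the headline arithmetic (numbers from §IV; the 75% target is illustrative):

```python
total_sites = 3219       # distinct third-party call sites across 8 projects
dev_covered = 1754       # RQ1: covered by developer-written tests (54%)
fika_proved = 609        # RQ2: new executability proofs from scenarios
cost_per_proof = 0.0074  # USD per successful scenario (DeepSeek V3.2)

# Reported totals fall out directly: 73% dynamic guarantee, ~$4.5 spend
assert round(100 * (dev_covered + fika_proved) / total_sites) == 73
assert round(fika_proved * cost_per_proof, 2) == 4.51  # close to the $4.52 total

# Owner's budget question: proofs needed for a 75% guarantee, and their cost
needed = int(0.75 * total_sites) - dev_covered
print(f"{needed} more proofs, ~${needed * cost_per_proof:.2f}")  # 660, ~$4.88
```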

Writing & Rationale Insights

  • The "Answer to RQx" boxed callout after each RQ section lets readers skim the answer first and decide whether to read the methodology. Strong template — costs nothing in space, dramatically improves skim-readability. Worth copying.
  • Four-symbol cross-tool matrix notation in Table V is more compact than the usual yes/no/partial/N-A grid. Useful for any paper comparing N tools across M techniques.
  • Frank cost / latency reporting. "Several hours, sometimes more than a day" of generation time is a real limitation; many papers bury this in a final paragraph. FIKA puts it in §V.Discussion (Time budget) and again in Threats to Validity, which makes it discoverable for readers who actually want to deploy the tool.
  • Pre-release-not-per-commit framing is a useful operational frame for any tool whose latency rules out CI integration. Acknowledge the limitation and reposition the use-case.
  • Citing your own series. The team has multiple recent papers in the same area (lockfiles [11], breaking-update benchmark [5], Java-bytecode debloating [18]); FIKA leverages this lineage for related-work coverage and dataset reuse. Useful template for building a paper-series strategy.