API Reference
causers - High-performance statistical operations for Polars DataFrames
A Python package with Rust backend for fast statistical computations on Polars DataFrames.
- class causers.LinearRegressionResult
Bases:
objectResult of a linear regression computation.
This struct contains the results of fitting a linear regression model, including coefficients, standard errors, goodness-of-fit metrics, and optional information about clustering and fixed effects.
# Fields
coefficients - Regression coefficients for each covariate
intercept - Intercept term (None if include_intercept=False)
r_squared - Coefficient of determination (R²)
n_samples - Number of observations used in the regression
slope - Deprecated: Same as coefficients[0] for single covariate (backward compatibility)
standard_errors - HC3 robust or clustered standard errors for coefficients
intercept_se - Standard error for intercept (None if no intercept)
n_clusters - Number of unique clusters (None if not clustered)
cluster_se_type - Type of clustered SE: “analytical” or “bootstrap” (None if not clustered)
bootstrap_iterations_used - Number of bootstrap iterations (None if not bootstrap)
fixed_effects_absorbed - Number of groups absorbed per FE variable
fixed_effects_names - Column names used for fixed effects
within_r_squared - R² computed on demeaned data (within-R²)
- __new__(**kwargs)
- bootstrap_iterations_used
Number of bootstrap iterations used (None if not bootstrap).
- cluster_se_type
“analytical” or “bootstrap” (None if not clustered).
- Type:
Type of clustered SE
- coefficients
Regression coefficients for each covariate.
- fixed_effects_absorbed
Number of groups absorbed per FE variable (e.g., [100, 10] for firm+year).
- fixed_effects_names
Column names used for fixed effects (e.g., [“firm_id”, “year”]).
- intercept
Intercept term (None if include_intercept=False).
- intercept_se
HC3 robust standard error for intercept (None if include_intercept=False).
- n_clusters
Number of unique clusters (None if not clustered).
- n_samples
Number of observations used in the regression.
- r_squared
Coefficient of determination (R²).
- slope
Use coefficients[0] instead. Kept for backward compatibility.
- Type:
Deprecated
- standard_errors
HC3 robust standard errors for each coefficient (or clustered SE if cluster specified).
- within_r_squared
R² computed on demeaned data (within-R²).
- class causers.LogisticRegressionResult
Bases:
objectResult of a logistic regression computation.
Contains coefficient estimates, robust standard errors, and diagnostic information about the optimization process.
- __new__(**kwargs)
- bootstrap_iterations_used
Number of bootstrap iterations used (None if not bootstrap)
- cluster_se_type
“analytical” or “bootstrap” (None if not clustered)
- Type:
Type of clustered SE
- coefficients
Coefficient estimates for x variables (log-odds scale)
- converged
Whether the optimization converged
- fixed_effects_absorbed
Number of unique groups per FE dimension (e.g., [100, 20] for 100 entities, 20 time periods) None when fixed_effects is not used.
- fixed_effects_names
Column names of absorbed fixed effects (e.g., [“entity”, “time”]) None when fixed_effects is not used.
- intercept
Intercept term (None if include_intercept=False)
- intercept_se
Robust standard error for intercept (None if include_intercept=False)
- iterations
Number of iterations used by the optimizer
- log_likelihood
Log-likelihood at the MLE solution
- n_clusters
Number of unique clusters (None if not clustered)
- n_samples
Number of observations used
- pseudo_r_squared
McFadden’s pseudo R² = 1 - (LL_model / LL_null)
- standard_errors
Robust standard errors for each coefficient
- within_pseudo_r_squared
Pseudo R² computed on the Mundlak-augmented model (within-pseudo R²) None when fixed_effects is not used.
- class causers.SyntheticDIDResult
Bases:
objectPython-accessible result class for Synthetic DID estimation.
This class wraps the internal SdidEstimate struct and provides Python access to all estimation results including the ATT, standard error, weights, and diagnostic information.
# Attributes
att: Average Treatment Effect on the Treated
standard_error: Bootstrap standard error
unit_weights: Weights for control units
time_weights: Weights for pre-treatment periods
n_units_control: Number of control units
n_units_treated: Number of treated units
n_periods_pre: Number of pre-treatment periods
n_periods_post: Number of post-treatment periods
solver_iterations: Tuple of (unit_iterations, time_iterations)
solver_converged: Whether the solver converged successfully
pre_treatment_fit: RMSE of pre-treatment synthetic control fit
bootstrap_iterations_used: Number of successful bootstrap iterations
# Example
`python result = synthetic_did(...) print(f"ATT: {result.att} ± {result.standard_error}") print(f"Unit weights: {result.unit_weights}") `- __new__(**kwargs)
- att
Average Treatment Effect on the Treated.
- bootstrap_iterations_used
Number of bootstrap iterations actually used (excluding failures).
- n_periods_post
Number of post-treatment periods.
- n_periods_pre
Number of pre-treatment periods.
- n_units_control
Number of control units in the panel.
- n_units_treated
Number of treated units in the panel.
- pre_treatment_fit
Pre-treatment fit RMSE between synthetic and actual treated outcomes.
- solver_converged
Whether the solver converged successfully.
- solver_iterations
Solver iterations as (unit_weight_iterations, time_weight_iterations).
- standard_error
Standard error estimated via placebo bootstrap.
- time_weights
Time weights (λ) for pre-treatment periods.
- unit_weights
Unit weights (ω) for control units.
- class causers.SyntheticControlResult
Bases:
objectResult of Synthetic Control estimation, exposed to Python via PyO3.
Contains the ATT estimate, optional standard error, control unit weights, and various diagnostic information about the estimation.
- __new__(**kwargs)
- att
Average Treatment Effect on Treated
- lambda_used
Lambda used for penalized method (None for others)
- method
“traditional”, “penalized”, “robust”, “augmented”
- Type:
SC method used
- n_failed_iterations
Number of failed placebo iterations (None if SE not computed)
- n_periods_post
Number of post-treatment periods
- n_periods_pre
Number of pre-treatment periods
- n_placebo_used
Number of successful placebo iterations (None if SE not computed)
- n_units_control
Number of control units
- pre_treatment_mse
Pre-treatment MSE
- pre_treatment_rmse
Pre-treatment RMSE (fit quality)
- solver_converged
Whether the optimizer converged
- solver_iterations
Number of optimizer iterations
- standard_error
Standard error via in-space placebo (None if not computed)
- unit_weights
Control unit weights (sum to 1, non-negative)
- class causers.DMLResult
Bases:
objectResult of DML estimation.
Contains the treatment effect estimate, standard error, confidence interval, and various diagnostics.
- __new__(**kwargs)
- cate_coefficients
CATE coefficients keyed by covariate name (if estimate_cate=True)
- cate_standard_errors
CATE coefficient standard errors (if estimate_cate=True)
- confidence_interval
Confidence interval bounds (lower, upper)
- n_folds
Number of cross-fitting folds used
- n_propensity_clipped
Count of clipped propensity scores
- n_samples
Number of observations
- outcome_r_squared
Average R² of outcome nuisance model across folds
- outcome_residual_var
Variance of outcome residuals Var(Ỹ)
- p_value
θ = 0
- Type:
Two-sided p-value for H₀
- propensity_r_squared
Average R² (or pseudo-R²) of propensity nuisance model across folds
- propensity_residual_var
Variance of treatment residuals Var(D̃)
- standard_error
Robust standard error (Neyman-orthogonal)
- summary()
Return formatted summary string
- theta
Average Treatment Effect (ATE) point estimate
- class causers.TwoStageLSResult
Bases:
objectResult of Two-Stage Least Squares estimation.
- __new__(**kwargs)
- coefficients
Structural coefficients (endogenous + exogenous, excluding intercept)
- confidence_intervals(alpha=0.05)
Compute confidence intervals for all coefficients.
Confidence intervals are computed as: coef ± t_crit × se where t_crit is the critical value from the t-distribution.
# Arguments * alpha - Significance level (default: 0.05 for 95% CI).
Must be in (0, 1).
# Returns List of (lower, upper) bounds for each coefficient.
- cragg_donald
Cragg-Donald statistic (multiple endogenous only)
- first_stage_coefficients
First-stage coefficients for instruments only (per endogenous variable)
- first_stage_f
F-statistics per endogenous variable
- intercept
Intercept term if included
- intercept_se
Standard error for intercept
- n_clusters
Number of clusters (if clustered)
- n_endogenous
Number of endogenous regressors
- n_instruments
Number of excluded instruments
- n_samples
Number of observations
- p_values
Compute two-tailed p-values for all coefficients.
P-values are computed using the t-distribution with (n_samples - n_params) degrees of freedom. Tests H₀: β = 0 against H₁: β ≠ 0.
# Returns Vec of p-values, one for each coefficient.
- r_squared
R² from structural equation (using original D)
- se_type
“conventional”, “hc3”, or “clustered”
- Type:
SE type
- standard_errors
Standard errors for all coefficients
- stock_yogo_critical
Stock-Yogo 10% critical value
- summary()
Return a formatted summary table similar to statsmodels IV2SLS output.
The summary includes: - Model information (N, R², SE type) - First-stage diagnostics (F-statistics, Cragg-Donald) - Coefficient table with estimates, SEs, t-stats, p-values, and CIs
# Returns Formatted string suitable for printing.
- t_statistics
Compute t-statistics for all coefficients.
T-statistics are computed as coefficients / standard_errors (element-wise). Returns t-statistics for slope coefficients only (intercept excluded).
# Returns Vec of t-statistics, one for each coefficient.
- causers.linear_regression(df, x_cols, y_col, include_intercept=True, cluster=None, bootstrap=False, bootstrap_iterations=1000, seed=None, bootstrap_method='rademacher', fixed_effects=None)[source]
Perform linear regression on Polars or pandas DataFrame columns.
Supports both single and multiple covariate regression using ordinary least squares (OLS). For multiple covariates, uses matrix operations: β = (X’X)^-1 X’y
Optionally absorbs fixed effects (entity/time) via within-transformation (demeaning) before regression. When fixed effects are absorbed, the intercept is implicitly absorbed and not returned.
- Parameters:
df (pl.DataFrame or pd.DataFrame) – The DataFrame containing the data. Accepts both Polars and pandas. For pandas DataFrames with Arrow-backed columns (pd.ArrowDtype), data is extracted via zero-copy where possible.
x_cols (str or List[str]) – Name(s) of the independent variable column(s). Can be: - A single column name as a string - A list of column names for multiple covariates
y_col (str) – Name of the dependent variable column
include_intercept (bool, default=True) – Whether to include an intercept term in the regression. If False, forces the regression line through the origin. Note: When fixed_effects are specified, intercept is implicitly absorbed regardless of this setting.
cluster (str, optional) – Column name for cluster identifiers. When specified, computes cluster-robust standard errors instead of HC3. Supports integer, string, or categorical columns.
bootstrap (bool, default=False) – If True and cluster is specified, use wild cluster bootstrap for standard error computation. Requires cluster to be specified. Recommended when number of clusters is less than 42.
bootstrap_iterations (int, default=1000) – Number of bootstrap replications when bootstrap=True.
seed (int, optional) – Random seed for reproducibility when using bootstrap. When None, uses a random seed which may produce different results each call.
bootstrap_method (str, default "rademacher") – Weight distribution for wild bootstrap. Options: - “rademacher”: Standard Rademacher weights (±1 with equal probability) - “webb”: Webb’s 6-point distribution for improved small-sample performance Only used when bootstrap=True and cluster is specified.
fixed_effects (str or List[str], optional) – Column name(s) for fixed effects to absorb. Supports 1 or 2 columns. When specified: - One column: One-way fixed effects (e.g., entity or time) - Two columns: Two-way fixed effects (e.g., entity + time) Fixed effect columns must not overlap with x_cols or y_col. Columns can be integer, string, or categorical type.
- Returns:
Result object with the following attributes: - coefficients : List[float]
Regression coefficients for each x variable
- interceptfloat or None
Intercept term (None if include_intercept=False or fixed_effects used)
- r_squaredfloat
Coefficient of determination (R²) using original y
- n_samplesint
Number of samples used in the regression
- slopefloat or None
For single covariate only, same as coefficients[0]
- standard_errorsList[float]
Robust standard errors for each coefficient. Uses HC3 by default, or cluster-robust SE if cluster is specified.
- intercept_sefloat or None
Robust standard error for intercept (None if include_intercept=False or fixed_effects used)
- n_clustersint or None
Number of unique clusters (None if cluster not specified)
- cluster_se_typestr or None
Type of clustered SE: “analytical”, “bootstrap_rademacher”, or “bootstrap_webb” (None if not clustered)
- bootstrap_iterations_usedint or None
Number of bootstrap iterations (None if not bootstrap)
- fixed_effects_absorbedList[int] or None
Number of groups absorbed for each fixed effect (None if no FE)
- fixed_effects_namesList[str] or None
Names of the fixed effect columns absorbed (None if no FE)
- within_r_squaredfloat or None
Within R² computed on demeaned data (None if no FE)
- Return type:
- Raises:
If x_cols is empty or columns don’t exist - If cluster column contains null values - If bootstrap=True without cluster specified - If fewer than 2 clusters detected - If single-observation clusters exist (analytical mode only) - If numerical instability detected (condition number > 1e10) - If bootstrap_iterations < 1 - If fixed_effects has more than 2 columns - If fixed_effects column overlaps with x_cols or y_col - If fixed_effects column contains null values - If fixed_effects column has only one unique value - If covariate becomes collinear after FE demeaning
- Warns:
UserWarning –
When fewer than 42 clusters with bootstrap=False: recommends using wild cluster bootstrap for more accurate inference.
When cluster column has float dtype: implicit cast to string.
When fixed_effects column has float dtype: implicit cast warning.
Examples
Single covariate regression:
>>> import polars as pl >>> import causers >>> df = pl.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]}) >>> result = causers.linear_regression(df, "x", "y") >>> print(f"y = {result.slope:.2f}x + {result.intercept:.2f}") y = 2.00x + 0.00
Accessing standard errors:
>>> df = pl.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2.1, 3.9, 6.2, 7.8, 10.1]}) >>> result = causers.linear_regression(df, "x", "y") >>> print(f"Coefficient: {result.coefficients[0]:.4f} ± {result.standard_errors[0]:.4f}") Coefficient: 1.9900 ± 0.0682 >>> print(f"Intercept: {result.intercept:.4f} ± {result.intercept_se:.4f}") Intercept: 0.0500 ± 0.1896
Multiple covariate regression:
>>> df = pl.DataFrame({ ... "x1": [1, 2, 3, 4, 5], ... "x2": [1, 1, 2, 2, 3], ... "y": [6, 8, 13, 15, 20] ... }) >>> result = causers.linear_regression(df, ["x1", "x2"], "y") >>> print(f"Coefficients: {result.coefficients}") Coefficients: [2.0, 3.0]
Clustered standard errors (analytical):
>>> df = pl.DataFrame({ ... "x": [1, 2, 3, 4, 5, 6], ... "y": [2, 4, 5, 8, 9, 12], ... "firm_id": [1, 1, 2, 2, 3, 3] ... }) >>> result = causers.linear_regression(df, "x", "y", cluster="firm_id") >>> print(f"Clustered SE: {result.standard_errors[0]:.4f} (G={result.n_clusters})") Clustered SE: ... (G=3)
Wild cluster bootstrap (recommended for <42 clusters):
>>> result = causers.linear_regression( ... df, "x", "y", ... cluster="firm_id", bootstrap=True, seed=42 ... ) >>> print(f"Bootstrap SE: {result.standard_errors[0]:.4f}") Bootstrap SE: ...
Notes
Standard errors are computed using:
HC3 (default): Heteroskedasticity-consistent standard errors when no cluster is specified. Provides robust inference when error variance may not be constant (MacKinnon & White, 1985).
Analytical clustered SE: When cluster is specified and bootstrap=False. Uses the sandwich estimator with small-sample adjustment (G/(G-1) × (N-1)/(N-k)). Accounts for within-cluster correlation.
Wild cluster bootstrap SE: When cluster and bootstrap=True. Uses Rademacher weights (±1 with equal probability) and is recommended when the number of clusters is small (G < 42).
The 42-cluster threshold is based on asymptotic theory and simulation evidence that analytical clustered SE can be unreliable with few clusters.
References
Cameron, A. C., & Miller, D. L. (2015). A Practitioner’s Guide to Cluster-Robust Inference. Journal of Human Resources, 50(2), 317-372.
MacKinnon, J. G., & Webb, M. D. (2018). The wild bootstrap for few (treated) clusters. The Econometrics Journal, 21(2), 114-135.
See also
LinearRegressionResultResult class with coefficient estimates and diagnostics.
- causers.logistic_regression(df, x_cols, y_col, include_intercept=True, cluster=None, bootstrap=False, bootstrap_iterations=1000, seed=None, bootstrap_method='rademacher', fixed_effects=None)[source]
Perform logistic regression on binary outcome with robust standard errors.
Fits a logistic regression model using Maximum Likelihood Estimation (MLE) with Newton-Raphson optimization. Returns coefficient estimates (log-odds), robust standard errors, and diagnostic information.
Optionally absorbs fixed effects using the Mundlak (1978) approach: adds group means of covariates as additional regressors. This is suitable for nonlinear models where standard within-transformation is not valid.
- Parameters:
df (pl.DataFrame or pd.DataFrame) – The DataFrame containing the data. Accepts both Polars and pandas. For pandas DataFrames with Arrow-backed columns (pd.ArrowDtype), data is extracted via zero-copy where possible.
x_cols (str or List[str]) – Name(s) of the independent variable column(s). Can be: - A single column name as a string - A list of column names for multiple covariates
y_col (str) – Name of the binary outcome column (must contain only 0 and 1)
include_intercept (bool, default=True) – Whether to include an intercept term in the regression.
cluster (str, optional) – Column name for cluster identifiers. When specified, computes cluster-robust standard errors using the score-based approach. Supports integer, string, or categorical columns.
bootstrap (bool, default=False) – If True and cluster is specified, use score bootstrap for standard error computation. Requires cluster to be specified. Recommended when number of clusters is less than 42.
bootstrap_iterations (int, default=1000) – Number of bootstrap replications when bootstrap=True.
seed (int, optional) – Random seed for reproducibility when using bootstrap. When None, uses a random seed which may produce different results each call.
bootstrap_method (str, default "rademacher") – Weight distribution for score bootstrap. Options: - “rademacher”: Standard Rademacher weights (±1 with equal probability) - “webb”: Webb’s 6-point distribution for improved small-sample performance Only used when bootstrap=True and cluster is specified.
fixed_effects (str or List[str], optional) – Column name(s) for fixed effects to absorb using the Mundlak approach. Supports 1 or 2 columns. When specified: - One column: One-way fixed effects (e.g., entity or time) - Two columns: Two-way fixed effects (e.g., entity + time) The Mundlak approach adds group means of covariates as additional regressors, which is appropriate for nonlinear models like logistic regression. Fixed effect columns must not overlap with x_cols or y_col. Columns can be integer, string, or categorical type.
- Returns:
Result object with the following attributes: - coefficients : List[float]
Coefficient estimates for x variables (log-odds scale). When fixed_effects is used, only returns coefficients for the original K covariates (Mundlak terms are filtered out).
- interceptfloat or None
Intercept term (None if include_intercept=False)
- standard_errorsList[float]
Robust standard errors for each coefficient. Uses HC3 by default, or clustered SE if cluster is specified.
- intercept_sefloat or None
Robust standard error for intercept (None if include_intercept=False)
- n_samplesint
Number of observations used
- n_clustersint or None
Number of unique clusters (None if cluster not specified)
- cluster_se_typestr or None
Type of clustered SE: “analytical”, “bootstrap_rademacher”, or “bootstrap_webb” (None if not clustered)
- bootstrap_iterations_usedint or None
Number of bootstrap iterations (None if not bootstrap)
- convergedbool
Whether the MLE optimizer converged
- iterationsint
Number of Newton-Raphson iterations used
- log_likelihoodfloat
Log-likelihood at the MLE solution
- pseudo_r_squaredfloat
McFadden’s pseudo R² = 1 - (LL_model / LL_null)
- fixed_effects_absorbedList[int] or None
Number of groups absorbed for each fixed effect (None if no FE)
- fixed_effects_namesList[str] or None
Names of the fixed effect columns absorbed (None if no FE)
- within_pseudo_r_squaredfloat or None
Within pseudo R² for FE model (None if no FE)
- Return type:
- Raises:
If y_col contains values other than 0 and 1 - If y_col contains only 0s or only 1s - If x_cols is empty or columns don’t exist - If cluster column contains null values - If bootstrap=True without cluster specified - If fewer than 2 clusters detected - If perfect separation is detected - If Hessian is singular (collinearity) - If convergence fails after max iterations - If numerical instability detected (condition number > 1e10) - If bootstrap_iterations < 1 - If fixed_effects has more than 2 columns - If fixed_effects column overlaps with x_cols or y_col - If fixed_effects column contains null values - If fixed_effects column has only one unique value - If augmented design matrix is collinear after adding Mundlak terms
- Warns:
UserWarning –
When fewer than 42 clusters with bootstrap=False: recommends using score bootstrap for more accurate inference.
When cluster column has float dtype: implicit cast to string.
When fixed_effects column has float dtype: implicit cast warning.
Examples
Simple logistic regression:
>>> import polars as pl >>> import causers >>> df = pl.DataFrame({ ... "x": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0], ... "y": [0, 0, 0, 1, 1, 1] ... }) >>> result = causers.logistic_regression(df, "x", "y") >>> print(f"Coefficient: {result.coefficients[0]:.4f}") Coefficient: ...
Accessing convergence information:
>>> if result.converged: ... print(f"Converged in {result.iterations} iterations") ... print(f"Log-likelihood: {result.log_likelihood:.2f}") ... print(f"McFadden R²: {result.pseudo_r_squared:.3f}") Converged in ... iterations Log-likelihood: ... McFadden R²: ...
Multiple covariates:
>>> df = pl.DataFrame({ ... "x1": [1, 2, 3, 4, 5, 6], ... "x2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6], ... "y": [0, 0, 0, 1, 1, 1] ... }) >>> result = causers.logistic_regression(df, ["x1", "x2"], "y") >>> print(f"Coefficients: {result.coefficients}") Coefficients: [...]
Clustered standard errors:
>>> df = pl.DataFrame({ ... "x": [1, 2, 3, 4, 5, 6], ... "y": [0, 0, 1, 0, 1, 1], ... "firm_id": [1, 1, 2, 2, 3, 3] ... }) >>> result = causers.logistic_regression(df, "x", "y", cluster="firm_id") >>> print(f"Clustered SE: {result.standard_errors[0]:.4f} (G={result.n_clusters})") Clustered SE: ... (G=3)
Score bootstrap (recommended for <42 clusters):
>>> result = causers.logistic_regression( ... df, "x", "y", ... cluster="firm_id", bootstrap=True, seed=42 ... ) >>> print(f"Bootstrap SE: {result.standard_errors[0]:.4f}") Bootstrap SE: ...
Fixed effects (entity FE using Mundlak approach):
>>> df = pl.DataFrame({ ... "x": [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5], ... "y": [0, 0, 1, 1, 0, 1, 0, 1], ... "entity_id": [1, 1, 1, 1, 2, 2, 2, 2] ... }) >>> result = causers.logistic_regression(df, "x", "y", fixed_effects="entity_id") >>> print(f"Coefficient (FE): {result.coefficients[0]:.4f}") Coefficient (FE): ... >>> print(f"FE absorbed: {result.fixed_effects_absorbed}") FE absorbed: [2]
Notes
The logistic regression model is:
P(y=1|x) = 1 / (1 + exp(-x’β))
Coefficients are on the log-odds scale. To convert to odds ratios, use exp(coefficient).
Standard errors are computed using:
HC3 (default): Heteroskedasticity-consistent standard errors adapted for logistic regression, using weighted leverages.
Analytical clustered SE: When cluster is specified and bootstrap=False. Uses the sandwich estimator with cluster-level scores.
Score bootstrap SE: When cluster and bootstrap=True. Uses Rademacher weights (±1 with equal probability) following Kline & Santos (2012). Recommended for small cluster counts (G < 42).
The optimizer uses Newton-Raphson with step halving for stability, converging when gradient norm < 1e-8 or after 35 iterations.
Fixed Effects (Mundlak Approach)
For nonlinear models like logistic regression, standard within-transformation (demeaning) is not valid. The Mundlak (1978) approach provides an alternative:
Compute group means of all covariates: X̄_g for each FE group g
Augment the design matrix: [X | X̄_g1 | X̄_g2 | …]
Run standard logistic regression on the augmented model
Return only the coefficients for the original X variables
This approach is equivalent to correlated random effects (CRE) and produces consistent estimates under the same assumptions as fixed effects.
References
Kline, P., & Santos, A. (2012). “A Score Based Approach to Wild Bootstrap Inference.” Journal of Econometric Methods, 1(1), 23-41. https://doi.org/10.1515/2156-6674.1006
MacKinnon, J. G., & White, H. (1985). “Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties.” Journal of Econometrics, 29(3), 305-325.
Mundlak, Y. (1978). “On the Pooling of Time Series and Cross Section Data.” Econometrica, 46(1), 69-85.
See also
LogisticRegressionResultResult class with coefficient estimates and diagnostics.
linear_regressionFor continuous outcome regression with FE support.
- causers.synthetic_did(df, unit_col, time_col, outcome_col, treatment_col, bootstrap_iterations=200, seed=None)[source]
Compute Synthetic Difference-in-Differences (SDID) estimator.
Implements the SDID estimator from Arkhangelsky et al. (2021), which combines synthetic control weighting with difference-in-differences to estimate the Average Treatment Effect on the Treated (ATT).
The estimator uses two-stage optimization: 1. Unit weights: Find control unit weights that match pre-treatment trends 2. Time weights: Find pre-period weights that predict post-period outcomes
Standard errors are computed via placebo bootstrap.
- Parameters:
df (pl.DataFrame or pd.DataFrame) – Panel data in long format with one row per unit-time observation. Must be a balanced panel (all units observed in all time periods). Accepts both Polars and pandas DataFrames.
unit_col (str) – Column name identifying unique units (e.g., “state”, “firm_id”). Must be integer or string type.
time_col (str) – Column name identifying time periods (e.g., “year”, “quarter”). Must be integer or string type.
outcome_col (str) – Column name for the outcome variable. Must be numeric.
treatment_col (str) – Column name for treatment indicator. Must contain only 0 and 1 values. Value of 1 indicates the unit is treated in that period.
bootstrap_iterations (int, default=200) – Number of placebo bootstrap iterations for standard error estimation. Must be at least 1. Values < 100 will emit a warning.
seed (int, optional) – Random seed for reproducibility. If None, uses system time.
- Returns:
Result object with the following attributes:
- attfloat
Average Treatment Effect on the Treated
- standard_errorfloat
Bootstrap standard error of the ATT
- unit_weightsList[float]
Weights assigned to each control unit (sums to 1)
- time_weightsList[float]
Weights assigned to each pre-treatment period (sums to 1)
- n_units_controlint
Number of control units
- n_units_treatedint
Number of treated units
- n_periods_preint
Number of pre-treatment periods
- n_periods_postint
Number of post-treatment periods
- solver_iterationsTuple[int, int]
Number of iterations for (unit_weights, time_weights) optimization
- solver_convergedbool
Whether the Frank-Wolfe solver converged
- pre_treatment_fitfloat
RMSE of pre-treatment fit (lower is better)
- bootstrap_iterations_usedint
Number of successful bootstrap iterations
- Return type:
- Raises:
If DataFrame is empty - If any specified column doesn’t exist - If unit_col or time_col is float type - If outcome_col is not numeric - If outcome_col contains null values - If treatment_col contains values other than 0 and 1 - If bootstrap_iterations < 1 - If fewer than 2 control units found - If fewer than 2 pre-treatment periods found - If no treated units found - If no post-treatment periods found - If panel is not balanced
- Warns:
UserWarning –
If any unit weight > 0.5 (weight concentration on single unit)
If any time weight > 0.5 (weight concentration on single period)
If bootstrap_iterations < 100 (may be unreliable)
Examples
Basic usage with panel data:
>>> import polars as pl >>> import causers >>> df = pl.DataFrame({ ... 'unit': [1, 1, 1, 2, 2, 2, 3, 3, 3], ... 'time': [1, 2, 3, 1, 2, 3, 1, 2, 3], ... 'y': [1.0, 2.0, 5.0, 1.5, 2.5, 3.0, 1.2, 2.2, 2.8], ... 'treated': [0, 0, 1, 0, 0, 0, 0, 0, 0] ... }) >>> result = causers.synthetic_did(df, 'unit', 'time', 'y', 'treated', seed=42) >>> print(f"ATT: {result.att:.4f} ± {result.standard_error:.4f}") ATT: ... ± ...
Accessing weights and diagnostics:
>>> print(f"Control unit weights: {result.unit_weights}") Control unit weights: [...] >>> print(f"Pre-treatment fit RMSE: {result.pre_treatment_fit:.4f}") Pre-treatment fit RMSE: ...
Notes
Panel Structure Detection
The function automatically detects: - Control units: Units where treatment=0 in all periods - Treated units: Units where treatment=1 in at least one period - Pre-periods: Periods where all observations have treatment=0 - Post-periods: Periods where at least one treated unit has treatment=1
Algorithm
The SDID estimator is:
\[\hat{\tau}_{sdid} = (\bar{Y}_{tr,post} - \bar{Y}_{synth,post}) - \sum_t \lambda_t (\bar{Y}_{tr,t} - \bar{Y}_{synth,t})\]where \(\bar{Y}_{synth,t} = \sum_i \omega_i Y_{i,t}\) uses optimized unit weights \(\omega\) on control units.
Standard Errors
Standard errors are computed via placebo bootstrap: 1. Randomly select a control unit as “placebo treated” 2. Re-run SDID with this unit treated 3. Repeat for bootstrap_iterations 4. SE = standard deviation of placebo ATTs
References
Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic difference-in-differences. American Economic Review, 111(12), 4088-4118.
See also
SyntheticDIDResultResult class with ATT and diagnostics.
linear_regressionFor standard regression analysis.
- causers.synthetic_control(df, unit_col, time_col, outcome_col, treatment_col, method='traditional', lambda_param=None, compute_se=True, n_placebo=None, max_iter=1000, tol=1e-06, seed=None)[source]
Compute Synthetic Control (SC) estimator.
Implements the Synthetic Control method from Abadie et al. (2010, 2015), which constructs a weighted combination of control units to create a synthetic control that matches the treated unit’s pre-treatment outcomes.
Supports four method variants: - Traditional: Classic SC with simplex-constrained weights (Abadie et al., 2010) - Penalized: L2 regularization for more uniform weights - Robust: De-meaned data for matching dynamics instead of levels - Augmented: Bias correction via ridge outcome model (Ben-Michael et al., 2021)
- Parameters:
df (pl.DataFrame or pd.DataFrame) – Panel data in long format with one row per unit-time observation. Must be a balanced panel (all units observed in all time periods). Accepts both Polars and pandas DataFrames.
unit_col (str) – Column name identifying unique units (e.g., “state”, “firm_id”). Must be integer or string type.
time_col (str) – Column name identifying time periods (e.g., “year”, “quarter”). Must be integer or string type.
outcome_col (str) – Column name for the outcome variable. Must be numeric.
treatment_col (str) – Column name for treatment indicator. Must contain only 0 and 1 values. Exactly one unit must be treated (have treatment=1 in post-period).
method (str, default="traditional") – Synthetic control method to use. Options: - “traditional”: Classic SC minimizing pre-treatment MSE - “penalized”: L2 regularized SC for more uniform weights - “robust”: De-meaned SC for matching dynamics - “augmented”: Bias-corrected SC with ridge adjustment
lambda_param (float, optional) – Regularization parameter for penalized/augmented methods. If None, auto-selected via LOOCV for penalized method. Must be >= 0 when specified.
compute_se (bool, default=True) – Whether to compute standard errors via in-space placebo.
n_placebo (int, optional) – Number of placebo iterations for SE. If None, uses all control units.
max_iter (int, default=1000) – Maximum iterations for Frank-Wolfe optimizer.
tol (float, default=1e-6) – Convergence tolerance for optimizer.
seed (int, optional) – Random seed for reproducibility. If None, uses system time.
- Returns:
Result object with the following attributes:
- attfloat
Average Treatment Effect on the Treated
- standard_errorfloat or None
In-space placebo standard error (None if compute_se=False)
- unit_weightsList[float]
Weights assigned to each control unit (sums to 1)
- pre_treatment_rmsefloat
Root Mean Squared Error of pre-treatment fit
- pre_treatment_msefloat
Mean Squared Error of pre-treatment fit
- methodstr
Method used (“traditional”, “penalized”, “robust”, “augmented”)
- lambda_usedfloat or None
Lambda parameter used (for penalized/augmented methods)
- n_units_controlint
Number of control units
- n_periods_preint
Number of pre-treatment periods
- n_periods_postint
Number of post-treatment periods
- solver_convergedbool
Whether the Frank-Wolfe solver converged
- solver_iterationsint
Number of optimizer iterations
- n_placebo_usedint or None
Number of successful placebo iterations (if SE computed)
- Return type:
- Raises:
If DataFrame is empty - If any specified column doesn’t exist - If unit_col or time_col is float type - If outcome_col is not numeric - If outcome_col contains null values - If treatment_col contains values other than 0 and 1 - If not exactly one treated unit found - If fewer than 1 control unit found - If fewer than 1 pre-treatment period found - If no post-treatment periods found - If panel is not balanced - If method is not recognized - If lambda_param < 0
- Warns:
UserWarning –
If any unit weight > 0.5 (weight concentration on single unit)
If pre_treatment_rmse > 0.1 × outcome std (poor pre-treatment fit)
Examples
Basic usage with panel data:
>>> import polars as pl >>> import causers >>> df = pl.DataFrame({ ... 'unit': [1, 1, 1, 2, 2, 2, 3, 3, 3], ... 'time': [1, 2, 3, 1, 2, 3, 1, 2, 3], ... 'y': [1.0, 2.0, 8.0, 1.5, 2.5, 3.0, 1.2, 2.2, 2.8], ... 'treated': [0, 0, 1, 0, 0, 0, 0, 0, 0] ... }) >>> result = causers.synthetic_control(df, 'unit', 'time', 'y', 'treated', seed=42) >>> print(f"ATT: {result.att:.4f}") ATT: ...
Using penalized method with auto lambda:
>>> result = causers.synthetic_control( ... df, 'unit', 'time', 'y', 'treated', ... method="penalized", seed=42 ... ) >>> print(f"Lambda used: {result.lambda_used}") Lambda used: ...
Accessing weights and diagnostics:
>>> print(f"Control unit weights: {result.unit_weights}") Control unit weights: [...] >>> print(f"Pre-treatment RMSE: {result.pre_treatment_rmse:.4f}") Pre-treatment RMSE: ...
Without standard errors (faster):
>>> result = causers.synthetic_control( ... df, 'unit', 'time', 'y', 'treated', ... compute_se=False ... ) >>> print(f"ATT: {result.att:.4f} (SE not computed)") ATT: ... (SE not computed)
Notes
Key Difference from Synthetic DID
Synthetic Control requires exactly ONE treated unit, while SDID supports multiple treated units. If you have multiple treated units, use synthetic_did() instead.
Panel Structure Detection
The function automatically detects: - Control units: Units where treatment=0 in all periods - Treated unit: The single unit where treatment=1 in post-period - Pre-periods: Periods where treatment=0 for all units - Post-periods: Periods where the treated unit has treatment=1
Algorithm
The SC estimator finds weights ω such that:
\[\hat{\omega} = \arg\min_{\omega \geq 0, \sum \omega = 1} \sum_{t \in \text{pre}} (Y_{1t} - \sum_j \omega_j Y_{jt})^2\]Then the ATT is:
\[\hat{\tau}_{SC} = \frac{1}{|\text{post}|} \sum_{t \in \text{post}} (Y_{1t} - \sum_j \hat{\omega}_j Y_{jt})\]Standard Errors
Standard errors are computed via in-space placebo: 1. For each control unit, treat it as the “placebo treated” unit 2. Compute SC weights and ATT using remaining controls 3. SE = standard deviation of placebo ATTs
References
Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies. Journal of the American Statistical Association.
Abadie, A., Diamond, A., & Hainmueller, J. (2015). Comparative Politics and the Synthetic Control Method. American Journal of Political Science.
Ben-Michael, E., Feller, A., & Rothstein, J. (2021). The Augmented Synthetic Control Method. Journal of the American Statistical Association.
See also
SyntheticControlResultResult class with ATT and diagnostics.
synthetic_didFor multiple treated units with DID adjustment.
- causers.dml(df, y_col, d_col, x_cols, n_folds=5, treatment_type='binary', estimate_cate=False, alpha=0.05, propensity_clip=(0.01, 0.99), cluster=None, seed=None)[source]
Estimate causal treatment effects using Double Machine Learning (DML).
Implements the DML estimator from Chernozhukov et al. (2018) with cross-fitting for debiased inference. Uses linear regression for outcome model and logistic (binary) or linear (continuous) regression for propensity model.
- Parameters:
df (pl.DataFrame or pd.DataFrame) – The DataFrame containing the data. Accepts both Polars and pandas.
y_col (str) – Name of the outcome variable column
d_col (str) – Name of the treatment variable column
x_cols (str or List[str]) – Name(s) of the covariate columns to control for
n_folds (int, default=5) – Number of cross-fitting folds. Must be >= 2.
treatment_type (str, default="binary") – Type of treatment variable: - “binary”: 0/1 treatment (uses logistic propensity model) - “continuous”: Continuous treatment (uses linear propensity model)
estimate_cate (bool, default=False) – Whether to estimate Conditional Average Treatment Effect (CATE) coefficients. If True, returns CATE as linear function of X.
alpha (float, default=0.05) – Significance level for confidence intervals. 0.05 = 95% CI.
propensity_clip (tuple, default=(0.01, 0.99)) – Bounds for propensity score clipping (binary treatment only). Extreme propensity scores are clipped to these bounds.
cluster (str, optional) – Column name for cluster identifiers for cluster-robust standard errors. When specified, uses cluster-robust Neyman-orthogonal variance estimation with G/(G-1) small-sample adjustment. Requires at least 2 clusters.
seed (int, optional) – Random seed for reproducible fold assignment. When None, uses deterministic row-order assignment.
- Returns:
Result object with the following attributes:
- thetafloat
Average Treatment Effect (ATE) point estimate
- standard_errorfloat
Neyman-orthogonal robust standard error
- confidence_intervalTuple[float, float]
(1-alpha) confidence interval bounds (lower, upper)
- p_valuefloat
Two-sided p-value for H₀: θ = 0
- n_samplesint
Number of observations used
- n_foldsint
Number of cross-fitting folds used
- propensity_residual_varfloat
Variance of treatment residuals Var(D̃)
- outcome_residual_varfloat
Variance of outcome residuals Var(Ỹ)
- outcome_r_squaredfloat
Average R² of outcome nuisance model across folds
- propensity_r_squaredfloat
Average R² (or pseudo-R²) of propensity model across folds
- n_propensity_clippedint
Count of propensity scores clipped to bounds
- cate_coefficientsDict[str, float] or None
CATE coefficients keyed by covariate name (if estimate_cate=True)
- cate_standard_errorsDict[str, float] or None
CATE coefficient standard errors (if estimate_cate=True)
- Return type:
- Raises:
If x_cols is empty - If n_folds is not 2, 5, or 10 - If n_folds >= n_samples - If treatment_type is not “binary” or “continuous” - If binary treatment doesn’t contain both 0 and 1 - If binary treatment contains non-0/1 values - If alpha is not in (0, 1) - If propensity_clip bounds are invalid - If treatment has no variation - If treatment is fully explained by covariates - If nuisance model fails to converge - If covariate matrix is singular in any fold
Examples
Basic ATE estimation with binary treatment:
>>> import polars as pl >>> import causers >>> df = pl.DataFrame({ ... "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], ... "d": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1], ... "x1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], ... "x2": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9] ... }) >>> result = causers.dml(df, "y", "d", ["x1", "x2"], n_folds=2, seed=42) >>> print(f"ATE: {result.theta:.4f} ± {result.standard_error:.4f}") ATE: ... ± ...
With CATE estimation:
>>> result = causers.dml( ... df, "y", "d", ["x1", "x2"], ... estimate_cate=True, seed=42 ... ) >>> print(f"CATE coefficients: {result.cate_coefficients}") CATE coefficients: {...}
Continuous treatment:
>>> df["d_cont"] = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] >>> result = causers.dml( ... df, "y", "d_cont", ["x1", "x2"], ... treatment_type="continuous", seed=42 ... ) >>> print(f"ATE: {result.theta:.4f}") ATE: ...
Accessing diagnostics:
>>> print(f"Outcome R²: {result.outcome_r_squared:.3f}") Outcome R²: ... >>> print(f"Propensity R²: {result.propensity_r_squared:.3f}") Propensity R²: ... >>> print(result.summary()) # Formatted summary Double Machine Learning Results ...
Notes
Algorithm Overview
The DML estimator uses cross-fitting (sample splitting) to avoid overfitting bias in the nuisance models:
Partition data into K folds for cross-fitting
For each fold k: - Train outcome model ℓ(X) on observations NOT in fold k - Train propensity model m(X) on observations NOT in fold k - Predict for observations IN fold k (out-of-fold predictions)
Compute residuals: Ỹ = Y - ℓ̂(X), D̃ = D - m̂(X)
Final-stage regression: θ̂ = (D̃’D̃)⁻¹ D̃’Ỹ
Neyman-orthogonal variance: V̂ = (1/N) × J⁻² × Σψ²ᵢ
CATE Estimation
When estimate_cate=True, estimates heterogeneous treatment effects as:
τ(X) = θ₀ + X’γ
where γ coefficients capture how treatment effect varies with covariates.
Standard Errors
Uses Neyman-orthogonal variance estimation which provides: - Robustness to first-stage nuisance estimation error - Valid inference even with moderate sample sizes - Automatic bias correction from cross-fitting
References
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.
See also
DMLResultResult class with ATE, CATE, and diagnostics.
linear_regressionFor standard regression analysis.
- causers.two_stage_least_squares(df, y_col, d_cols, z_cols, x_cols=None, include_intercept=True, robust=False, cluster=None)[source]
Estimate causal effects using Two-Stage Least Squares (2SLS) instrumental variables.
The 2SLS estimator addresses endogeneity problems where treatment variables are correlated with the error term. It uses instrumental variables (Z) that affect the outcome (Y) only through their effect on the endogenous treatment (D).
Algorithm:
First Stage: Regress each endogenous variable on instruments and exogenous controls:
\[D = Z\pi + X\delta + \nu\]Second Stage: Regress outcome on predicted endogenous values and controls:
\[Y = \hat{D}\beta + X\gamma + \epsilon\]CRITICAL: Standard errors are computed using residuals from the original D, not the predicted D̂. Using D̂ would understate variance.
- Parameters:
df (pl.DataFrame or pd.DataFrame) – The DataFrame containing the data. Accepts both Polars and pandas.
y_col (str) – Name of the outcome variable column
d_cols (str or List[str]) – Name(s) of the endogenous treatment variable column(s). These are the variables suspected to be correlated with the error term.
z_cols (str or List[str]) – Name(s) of the excluded instrument column(s). Instruments must: - Affect Y only through D (exclusion restriction) - Be correlated with D (relevance) - Be uncorrelated with the error (exogeneity)
x_cols (str or List[str], optional) – Name(s) of exogenous control variable column(s). These variables are included in both stages.
include_intercept (bool, default=True) – Whether to include an intercept term in both stages.
robust (bool, default=False) – If True, compute HC3 heteroskedasticity-robust standard errors. If False, compute conventional (homoskedastic) standard errors.
cluster (str, optional) – Column name for cluster identifiers for cluster-robust SE. When specified, computes cluster-robust standard errors.
- Returns:
Result object with the following attributes:
- coefficientsList[float]
Structural coefficients for endogenous + exogenous variables (excluding intercept). Order: d_cols first, then x_cols.
- standard_errorsList[float]
Standard errors for all coefficients
- interceptfloat or None
Intercept term (None if include_intercept=False)
- intercept_sefloat or None
Standard error for intercept
- n_samplesint
Number of observations used
- n_endogenousint
Number of endogenous regressors
- n_instrumentsint
Number of excluded instruments
- first_stage_fList[float]
F-statistics for each endogenous variable. Rule of thumb: F < 10 suggests weak instruments. F < 4 raises an error (too weak for reliable inference).
- first_stage_coefficientsList[List[float]]
First-stage coefficients for instruments only (per endogenous variable)
- cragg_donaldfloat or None
Cragg-Donald statistic for multiple endogenous variables. None for single endogenous (use first_stage_f instead).
- stock_yogo_criticalfloat or None
Stock-Yogo 10% maximal bias critical value for comparison with Cragg-Donald statistic
- r_squaredfloat
R² from structural equation (using original D)
- se_typestr
Type of standard errors: “conventional”, “hc3”, or “clustered”
- n_clustersint or None
Number of clusters if clustered SE used
- Return type:
- Raises:
If d_cols is empty - If z_cols is empty - If number of instruments < number of endogenous variables (under-identified) - If first-stage F-statistic < 4 for any endogenous variable - If first-stage or second-stage design matrix is singular - If not enough observations for the number of parameters - If any column contains null values - If any column has zero variance - If cluster column not found when cluster specified
- Warns:
UserWarning –
If first-stage F-statistic < 10 (weak instruments warning)
If Cragg-Donald < Stock-Yogo critical value
If number of instruments > n_samples/10
Examples
Basic IV regression with one endogenous variable:
>>> import polars as pl >>> import causers >>> # Angrist-Krueger style: quarter of birth as instrument for education >>> df = pl.DataFrame({ ... "wage": [2.5, 3.0, 2.8, 3.5, 3.2, 4.0, 3.8, 4.5, 4.2, 5.0], ... "educ": [10, 11, 10, 12, 11, 14, 13, 15, 14, 16], ... "quarter_born": [1, 4, 2, 3, 1, 4, 2, 4, 3, 4], ... "age": [30, 32, 31, 33, 34, 35, 36, 38, 40, 42] ... }) >>> result = causers.two_stage_least_squares( ... df, ... y_col="wage", ... d_cols="educ", ... z_cols="quarter_born", ... x_cols="age" ... ) >>> print(f"Returns to education: {result.coefficients[0]:.4f}") Returns to education: ... >>> print(f"First-stage F: {result.first_stage_f[0]:.1f}") First-stage F: ...
Multiple endogenous variables:
>>> # Two endogenous vars with two instruments >>> result = causers.two_stage_least_squares( ... df, ... y_col="wage", ... d_cols=["educ", "age"], ... z_cols=["quarter_born", "year_born"] ... ) >>> print(f"Cragg-Donald: {result.cragg_donald}") Cragg-Donald: ...
Robust standard errors:
>>> result = causers.two_stage_least_squares( ... df, "wage", "educ", "quarter_born", ... robust=True ... ) >>> print(f"SE type: {result.se_type}") SE type: hc3
Clustered standard errors:
>>> df = df.with_columns(_pl.lit([1, 1, 2, 2, 3, 3, 4, 4, 5, 5]).alias("state")) >>> result = causers.two_stage_least_squares( ... df, "wage", "educ", "quarter_born", ... cluster="state" ... ) >>> print(f"N clusters: {result.n_clusters}") N clusters: 5
Notes
Identification Requirements
The model is identified when: - m ≥ k₁ (number of instruments ≥ number of endogenous variables) - m = k₁: exactly identified (use all instruments) - m > k₁: over-identified (more instruments than needed)
Weak Instrument Detection
Weak instruments (those poorly correlated with D) lead to: - Biased 2SLS estimates (toward OLS bias) - Unreliable inference (undersized confidence intervals)
This function uses two diagnostics:
First-stage F-statistic: For single endogenous variable. Rule of thumb: F > 10 is generally considered acceptable. F < 4 raises an error as inference is unreliable.
Cragg-Donald statistic: For multiple endogenous variables. Compare to Stock-Yogo critical values for desired bias/size control.
Standard Errors
Three options are available:
Conventional: Assumes homoskedasticity. Use when error variance is believed constant across observations.
HC3 Robust: Heteroskedasticity-consistent standard errors. Recommended when error variance may vary with X.
Clustered: For data with within-cluster correlation (e.g., students within schools, observations over time).
References
Angrist, J. D., & Pischke, J. S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Stock, J. H., & Yogo, M. (2005). Testing for Weak Instruments in Linear IV Regression. In D. W. K. Andrews & J. H. Stock (Eds.), Identification and Inference for Econometric Models (pp. 80-108). Cambridge University Press.
Staiger, D., & Stock, J. H. (1997). Instrumental Variables Regression with Weak Instruments. Econometrica, 65(3), 557-586.
See also
TwoStageLSResultResult class with coefficients and diagnostics.
linear_regressionFor standard OLS regression without instruments.
dmlFor causal inference using machine learning methods.
- causers.balance_check(df, treatment_col, covariate_cols, weights=None, treatment_value=None, control_value=None, max_categorical_levels=1000)[source]
Check covariate balance between treatment and control groups.
Computes standardized mean differences (SMD), variance ratios, and group-level summary statistics for each covariate. Supports weighted analysis (e.g. inverse-propensity weights) and automatic treatment / control value detection for binary indicators.
- Parameters:
df (pl.DataFrame or pd.DataFrame) – The DataFrame containing the data. Accepts both Polars and pandas.
treatment_col (str) – Column that identifies the treatment assignment.
covariate_cols (str or list[str]) – Column name(s) of the covariates to check for balance.
weights (str, optional) – Column name containing observation weights (e.g. IPW weights). When provided, weighted means / variances are computed.
treatment_value (int, float, str, or None) – Value in
treatment_colthat denotes the treated group. IfNone, auto-detected from the column: when exactly two unique non-null values exist the larger value is used.control_value (int, float, str, or None) – Value in
treatment_colthat denotes the control group. IfNone, auto-detected (the value that is nottreatment_value).max_categorical_levels (int, default 1000) – Maximum number of unique levels allowed for categorical columns before raising an error.
- Returns:
Result object exposing all per-covariate statistics plus the convenience methods
summary(),imbalanced(), andto_dataframe().- Return type:
- Raises:
If
treatment_color any covariate column is missing. - If auto-detection finds != 2 unique non-null treatment values. - If the treatment column contains only one unique value.
- Warns:
UserWarning –
Large imbalance: |SMD| > 0.25 for any covariate.
Extreme variance ratio: VR < 0.5 or VR > 2.0.
Small treatment or control group (n < 10).
Low effective sample size (ESS < 10) in weighted analysis.
Examples
Basic balance check with a binary treatment:
>>> import polars as pl >>> import causers >>> df = pl.DataFrame({ ... "treated": [1, 1, 1, 0, 0, 0], ... "age": [25, 30, 35, 40, 45, 50], ... "income": [50000, 60000, 55000, 48000, 52000, 47000], ... }) >>> result = causers.balance_check(df, "treated", ["age", "income"]) >>> print(result.smd) {...}
Using the convenience methods:
>>> summary_df = result.summary() >>> imbalanced_covs = result.imbalanced(threshold=0.1) >>> full_df = result.to_dataframe()
Weighted balance check:
>>> df = df.with_columns(pl.lit([1.0, 1.0, 1.0, 0.5, 0.8, 0.7]).alias("w")) >>> result = causers.balance_check(df, "treated", ["age", "income"], weights="w") >>> print(result.is_weighted) True
See also
BalanceCheckResultResult class with balance statistics and helpers.
- class causers.BalanceCheckResult[source]
Bases:
objectPython wrapper around the Rust BalanceResult with convenience methods.
This class wraps the native
BalanceResultreturned by Rust and addssummary(),imbalanced(), andto_dataframe()helper methods. All attributes of the underlying Rust object are accessible directly (e.g.result.smd,result.n_treated).- ess_treated
Effective sample size for the treatment group (weighted analysis only).
- Type:
float or None
- ess_control
Effective sample size for the control group (weighted analysis only).
- Type:
float or None
- summary()[source]
Return a Polars DataFrame summarizing balance statistics.
Columns: covariate, mean_treated, mean_control, sd_treated, sd_control, smd, variance_ratio.
- Returns:
One row per covariate with key balance statistics.
- Return type:
pl.DataFrame
Functions
- causers.linear_regression(df, x_cols, y_col, include_intercept=True, cluster=None, bootstrap=False, bootstrap_iterations=1000, seed=None, bootstrap_method='rademacher', fixed_effects=None)[source]
Perform linear regression on Polars or pandas DataFrame columns.
Supports both single and multiple covariate regression using ordinary least squares (OLS). For multiple covariates, uses matrix operations: β = (X’X)^-1 X’y
Optionally absorbs fixed effects (entity/time) via within-transformation (demeaning) before regression. When fixed effects are absorbed, the intercept is implicitly absorbed and not returned.
- Parameters:
df (pl.DataFrame or pd.DataFrame) – The DataFrame containing the data. Accepts both Polars and pandas. For pandas DataFrames with Arrow-backed columns (pd.ArrowDtype), data is extracted via zero-copy where possible.
x_cols (str or List[str]) – Name(s) of the independent variable column(s). Can be: - A single column name as a string - A list of column names for multiple covariates
y_col (str) – Name of the dependent variable column
include_intercept (bool, default=True) – Whether to include an intercept term in the regression. If False, forces the regression line through the origin. Note: When fixed_effects are specified, intercept is implicitly absorbed regardless of this setting.
cluster (str, optional) – Column name for cluster identifiers. When specified, computes cluster-robust standard errors instead of HC3. Supports integer, string, or categorical columns.
bootstrap (bool, default=False) – If True and cluster is specified, use wild cluster bootstrap for standard error computation. Requires cluster to be specified. Recommended when number of clusters is less than 42.
bootstrap_iterations (int, default=1000) – Number of bootstrap replications when bootstrap=True.
seed (int, optional) – Random seed for reproducibility when using bootstrap. When None, uses a random seed which may produce different results each call.
bootstrap_method (str, default "rademacher") – Weight distribution for wild bootstrap. Options: - “rademacher”: Standard Rademacher weights (±1 with equal probability) - “webb”: Webb’s 6-point distribution for improved small-sample performance Only used when bootstrap=True and cluster is specified.
fixed_effects (str or List[str], optional) – Column name(s) for fixed effects to absorb. Supports 1 or 2 columns. When specified: - One column: One-way fixed effects (e.g., entity or time) - Two columns: Two-way fixed effects (e.g., entity + time) Fixed effect columns must not overlap with x_cols or y_col. Columns can be integer, string, or categorical type.
- Returns:
Result object with the following attributes: - coefficients : List[float]
Regression coefficients for each x variable
- interceptfloat or None
Intercept term (None if include_intercept=False or fixed_effects used)
- r_squaredfloat
Coefficient of determination (R²) using original y
- n_samplesint
Number of samples used in the regression
- slopefloat or None
For single covariate only, same as coefficients[0]
- standard_errorsList[float]
Robust standard errors for each coefficient. Uses HC3 by default, or cluster-robust SE if cluster is specified.
- intercept_sefloat or None
Robust standard error for intercept (None if include_intercept=False or fixed_effects used)
- n_clustersint or None
Number of unique clusters (None if cluster not specified)
- cluster_se_typestr or None
Type of clustered SE: “analytical”, “bootstrap_rademacher”, or “bootstrap_webb” (None if not clustered)
- bootstrap_iterations_usedint or None
Number of bootstrap iterations (None if not bootstrap)
- fixed_effects_absorbedList[int] or None
Number of groups absorbed for each fixed effect (None if no FE)
- fixed_effects_namesList[str] or None
Names of the fixed effect columns absorbed (None if no FE)
- within_r_squaredfloat or None
Within R² computed on demeaned data (None if no FE)
- Return type:
- Raises:
If x_cols is empty or columns don’t exist - If cluster column contains null values - If bootstrap=True without cluster specified - If fewer than 2 clusters detected - If single-observation clusters exist (analytical mode only) - If numerical instability detected (condition number > 1e10) - If bootstrap_iterations < 1 - If fixed_effects has more than 2 columns - If fixed_effects column overlaps with x_cols or y_col - If fixed_effects column contains null values - If fixed_effects column has only one unique value - If covariate becomes collinear after FE demeaning
- Warns:
UserWarning –
When fewer than 42 clusters with bootstrap=False: recommends using wild cluster bootstrap for more accurate inference.
When cluster column has float dtype: implicit cast to string.
When fixed_effects column has float dtype: implicit cast warning.
Examples
Single covariate regression:
>>> import polars as pl >>> import causers >>> df = pl.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]}) >>> result = causers.linear_regression(df, "x", "y") >>> print(f"y = {result.slope:.2f}x + {result.intercept:.2f}") y = 2.00x + 0.00
Accessing standard errors:
>>> df = pl.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2.1, 3.9, 6.2, 7.8, 10.1]}) >>> result = causers.linear_regression(df, "x", "y") >>> print(f"Coefficient: {result.coefficients[0]:.4f} ± {result.standard_errors[0]:.4f}") Coefficient: 1.9900 ± 0.0682 >>> print(f"Intercept: {result.intercept:.4f} ± {result.intercept_se:.4f}") Intercept: 0.0500 ± 0.1896
Multiple covariate regression:
>>> df = pl.DataFrame({ ... "x1": [1, 2, 3, 4, 5], ... "x2": [1, 1, 2, 2, 3], ... "y": [6, 8, 13, 15, 20] ... }) >>> result = causers.linear_regression(df, ["x1", "x2"], "y") >>> print(f"Coefficients: {result.coefficients}") Coefficients: [2.0, 3.0]
Clustered standard errors (analytical):
>>> df = pl.DataFrame({ ... "x": [1, 2, 3, 4, 5, 6], ... "y": [2, 4, 5, 8, 9, 12], ... "firm_id": [1, 1, 2, 2, 3, 3] ... }) >>> result = causers.linear_regression(df, "x", "y", cluster="firm_id") >>> print(f"Clustered SE: {result.standard_errors[0]:.4f} (G={result.n_clusters})") Clustered SE: ... (G=3)
Wild cluster bootstrap (recommended for <42 clusters):
>>> result = causers.linear_regression( ... df, "x", "y", ... cluster="firm_id", bootstrap=True, seed=42 ... ) >>> print(f"Bootstrap SE: {result.standard_errors[0]:.4f}") Bootstrap SE: ...
Notes
Standard errors are computed using:
HC3 (default): Heteroskedasticity-consistent standard errors when no cluster is specified. Provides robust inference when error variance may not be constant (MacKinnon & White, 1985).
Analytical clustered SE: When cluster is specified and bootstrap=False. Uses the sandwich estimator with small-sample adjustment (G/(G-1) × (N-1)/(N-k)). Accounts for within-cluster correlation.
Wild cluster bootstrap SE: When cluster and bootstrap=True. Uses Rademacher weights (±1 with equal probability) and is recommended when the number of clusters is small (G < 42).
The 42-cluster threshold is based on asymptotic theory and simulation evidence that analytical clustered SE can be unreliable with few clusters.
References
Cameron, A. C., & Miller, D. L. (2015). A Practitioner’s Guide to Cluster-Robust Inference. Journal of Human Resources, 50(2), 317-372.
MacKinnon, J. G., & Webb, M. D. (2018). The wild bootstrap for few (treated) clusters. The Econometrics Journal, 21(2), 114-135.
See also
LinearRegressionResultResult class with coefficient estimates and diagnostics.
- causers.logistic_regression(df, x_cols, y_col, include_intercept=True, cluster=None, bootstrap=False, bootstrap_iterations=1000, seed=None, bootstrap_method='rademacher', fixed_effects=None)[source]
Perform logistic regression on binary outcome with robust standard errors.
Fits a logistic regression model using Maximum Likelihood Estimation (MLE) with Newton-Raphson optimization. Returns coefficient estimates (log-odds), robust standard errors, and diagnostic information.
Optionally absorbs fixed effects using the Mundlak (1978) approach: adds group means of covariates as additional regressors. This is suitable for nonlinear models where standard within-transformation is not valid.
- Parameters:
df (pl.DataFrame or pd.DataFrame) – The DataFrame containing the data. Accepts both Polars and pandas. For pandas DataFrames with Arrow-backed columns (pd.ArrowDtype), data is extracted via zero-copy where possible.
x_cols (str or List[str]) – Name(s) of the independent variable column(s). Can be: - A single column name as a string - A list of column names for multiple covariates
y_col (str) – Name of the binary outcome column (must contain only 0 and 1)
include_intercept (bool, default=True) – Whether to include an intercept term in the regression.
cluster (str, optional) – Column name for cluster identifiers. When specified, computes cluster-robust standard errors using the score-based approach. Supports integer, string, or categorical columns.
bootstrap (bool, default=False) – If True and cluster is specified, use score bootstrap for standard error computation. Requires cluster to be specified. Recommended when number of clusters is less than 42.
bootstrap_iterations (int, default=1000) – Number of bootstrap replications when bootstrap=True.
seed (int, optional) – Random seed for reproducibility when using bootstrap. When None, uses a random seed which may produce different results each call.
bootstrap_method (str, default "rademacher") – Weight distribution for score bootstrap. Options: - “rademacher”: Standard Rademacher weights (±1 with equal probability) - “webb”: Webb’s 6-point distribution for improved small-sample performance Only used when bootstrap=True and cluster is specified.
fixed_effects (str or List[str], optional) – Column name(s) for fixed effects to absorb using the Mundlak approach. Supports 1 or 2 columns. When specified: - One column: One-way fixed effects (e.g., entity or time) - Two columns: Two-way fixed effects (e.g., entity + time) The Mundlak approach adds group means of covariates as additional regressors, which is appropriate for nonlinear models like logistic regression. Fixed effect columns must not overlap with x_cols or y_col. Columns can be integer, string, or categorical type.
- Returns:
Result object with the following attributes: - coefficients : List[float]
Coefficient estimates for x variables (log-odds scale). When fixed_effects is used, only returns coefficients for the original K covariates (Mundlak terms are filtered out).
- interceptfloat or None
Intercept term (None if include_intercept=False)
- standard_errorsList[float]
Robust standard errors for each coefficient. Uses HC3 by default, or clustered SE if cluster is specified.
- intercept_sefloat or None
Robust standard error for intercept (None if include_intercept=False)
- n_samplesint
Number of observations used
- n_clustersint or None
Number of unique clusters (None if cluster not specified)
- cluster_se_typestr or None
Type of clustered SE: “analytical”, “bootstrap_rademacher”, or “bootstrap_webb” (None if not clustered)
- bootstrap_iterations_usedint or None
Number of bootstrap iterations (None if not bootstrap)
- convergedbool
Whether the MLE optimizer converged
- iterationsint
Number of Newton-Raphson iterations used
- log_likelihoodfloat
Log-likelihood at the MLE solution
- pseudo_r_squaredfloat
McFadden’s pseudo R² = 1 - (LL_model / LL_null)
- fixed_effects_absorbedList[int] or None
Number of groups absorbed for each fixed effect (None if no FE)
- fixed_effects_namesList[str] or None
Names of the fixed effect columns absorbed (None if no FE)
- within_pseudo_r_squaredfloat or None
Within pseudo R² for FE model (None if no FE)
- Return type:
- Raises:
If y_col contains values other than 0 and 1 - If y_col contains only 0s or only 1s - If x_cols is empty or columns don’t exist - If cluster column contains null values - If bootstrap=True without cluster specified - If fewer than 2 clusters detected - If perfect separation is detected - If Hessian is singular (collinearity) - If convergence fails after max iterations - If numerical instability detected (condition number > 1e10) - If bootstrap_iterations < 1 - If fixed_effects has more than 2 columns - If fixed_effects column overlaps with x_cols or y_col - If fixed_effects column contains null values - If fixed_effects column has only one unique value - If augmented design matrix is collinear after adding Mundlak terms
- Warns:
UserWarning –
When fewer than 42 clusters with bootstrap=False: recommends using score bootstrap for more accurate inference.
When cluster column has float dtype: implicit cast to string.
When fixed_effects column has float dtype: implicit cast warning.
Examples
Simple logistic regression:
>>> import polars as pl >>> import causers >>> df = pl.DataFrame({ ... "x": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0], ... "y": [0, 0, 0, 1, 1, 1] ... }) >>> result = causers.logistic_regression(df, "x", "y") >>> print(f"Coefficient: {result.coefficients[0]:.4f}") Coefficient: ...
Accessing convergence information:
>>> if result.converged: ... print(f"Converged in {result.iterations} iterations") ... print(f"Log-likelihood: {result.log_likelihood:.2f}") ... print(f"McFadden R²: {result.pseudo_r_squared:.3f}") Converged in ... iterations Log-likelihood: ... McFadden R²: ...
Multiple covariates:
>>> df = pl.DataFrame({ ... "x1": [1, 2, 3, 4, 5, 6], ... "x2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6], ... "y": [0, 0, 0, 1, 1, 1] ... }) >>> result = causers.logistic_regression(df, ["x1", "x2"], "y") >>> print(f"Coefficients: {result.coefficients}") Coefficients: [...]
Clustered standard errors:
>>> df = pl.DataFrame({ ... "x": [1, 2, 3, 4, 5, 6], ... "y": [0, 0, 1, 0, 1, 1], ... "firm_id": [1, 1, 2, 2, 3, 3] ... }) >>> result = causers.logistic_regression(df, "x", "y", cluster="firm_id") >>> print(f"Clustered SE: {result.standard_errors[0]:.4f} (G={result.n_clusters})") Clustered SE: ... (G=3)
Score bootstrap (recommended for <42 clusters):
>>> result = causers.logistic_regression( ... df, "x", "y", ... cluster="firm_id", bootstrap=True, seed=42 ... ) >>> print(f"Bootstrap SE: {result.standard_errors[0]:.4f}") Bootstrap SE: ...
Fixed effects (entity FE using Mundlak approach):
>>> df = pl.DataFrame({ ... "x": [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5], ... "y": [0, 0, 1, 1, 0, 1, 0, 1], ... "entity_id": [1, 1, 1, 1, 2, 2, 2, 2] ... }) >>> result = causers.logistic_regression(df, "x", "y", fixed_effects="entity_id") >>> print(f"Coefficient (FE): {result.coefficients[0]:.4f}") Coefficient (FE): ... >>> print(f"FE absorbed: {result.fixed_effects_absorbed}") FE absorbed: [2]
Notes
The logistic regression model is:
P(y=1|x) = 1 / (1 + exp(-x’β))
Coefficients are on the log-odds scale. To convert to odds ratios, use exp(coefficient).
Standard errors are computed using:
HC3 (default): Heteroskedasticity-consistent standard errors adapted for logistic regression, using weighted leverages.
Analytical clustered SE: When cluster is specified and bootstrap=False. Uses the sandwich estimator with cluster-level scores.
Score bootstrap SE: When cluster and bootstrap=True. Uses Rademacher weights (±1 with equal probability) following Kline & Santos (2012). Recommended for small cluster counts (G < 42).
The optimizer uses Newton-Raphson with step halving for stability, converging when gradient norm < 1e-8 or after 35 iterations.
Fixed Effects (Mundlak Approach)
For nonlinear models like logistic regression, standard within-transformation (demeaning) is not valid. The Mundlak (1978) approach provides an alternative:
Compute group means of all covariates: X̄_g for each FE group g
Augment the design matrix: [X | X̄_g1 | X̄_g2 | …]
Run standard logistic regression on the augmented model
Return only the coefficients for the original X variables
This approach is equivalent to correlated random effects (CRE) and produces consistent estimates under the same assumptions as fixed effects.
References
Kline, P., & Santos, A. (2012). “A Score Based Approach to Wild Bootstrap Inference.” Journal of Econometric Methods, 1(1), 23-41. https://doi.org/10.1515/2156-6674.1006
MacKinnon, J. G., & White, H. (1985). “Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties.” Journal of Econometrics, 29(3), 305-325.
Mundlak, Y. (1978). “On the Pooling of Time Series and Cross Section Data.” Econometrica, 46(1), 69-85.
See also
LogisticRegressionResultResult class with coefficient estimates and diagnostics.
linear_regressionFor continuous outcome regression with FE support.
- causers.synthetic_did(df, unit_col, time_col, outcome_col, treatment_col, bootstrap_iterations=200, seed=None)[source]
Compute Synthetic Difference-in-Differences (SDID) estimator.
Implements the SDID estimator from Arkhangelsky et al. (2021), which combines synthetic control weighting with difference-in-differences to estimate the Average Treatment Effect on the Treated (ATT).
The estimator uses two-stage optimization: 1. Unit weights: Find control unit weights that match pre-treatment trends 2. Time weights: Find pre-period weights that predict post-period outcomes
Standard errors are computed via placebo bootstrap.
- Parameters:
df (pl.DataFrame or pd.DataFrame) – Panel data in long format with one row per unit-time observation. Must be a balanced panel (all units observed in all time periods). Accepts both Polars and pandas DataFrames.
unit_col (str) – Column name identifying unique units (e.g., “state”, “firm_id”). Must be integer or string type.
time_col (str) – Column name identifying time periods (e.g., “year”, “quarter”). Must be integer or string type.
outcome_col (str) – Column name for the outcome variable. Must be numeric.
treatment_col (str) – Column name for treatment indicator. Must contain only 0 and 1 values. Value of 1 indicates the unit is treated in that period.
bootstrap_iterations (int, default=200) – Number of placebo bootstrap iterations for standard error estimation. Must be at least 1. Values < 100 will emit a warning.
seed (int, optional) – Random seed for reproducibility. If None, uses system time.
- Returns:
Result object with the following attributes:
- attfloat
Average Treatment Effect on the Treated
- standard_errorfloat
Bootstrap standard error of the ATT
- unit_weightsList[float]
Weights assigned to each control unit (sums to 1)
- time_weightsList[float]
Weights assigned to each pre-treatment period (sums to 1)
- n_units_controlint
Number of control units
- n_units_treatedint
Number of treated units
- n_periods_preint
Number of pre-treatment periods
- n_periods_postint
Number of post-treatment periods
- solver_iterationsTuple[int, int]
Number of iterations for (unit_weights, time_weights) optimization
- solver_convergedbool
Whether the Frank-Wolfe solver converged
- pre_treatment_fitfloat
RMSE of pre-treatment fit (lower is better)
- bootstrap_iterations_usedint
Number of successful bootstrap iterations
- Return type:
- Raises:
If DataFrame is empty - If any specified column doesn’t exist - If unit_col or time_col is float type - If outcome_col is not numeric - If outcome_col contains null values - If treatment_col contains values other than 0 and 1 - If bootstrap_iterations < 1 - If fewer than 2 control units found - If fewer than 2 pre-treatment periods found - If no treated units found - If no post-treatment periods found - If panel is not balanced
- Warns:
UserWarning –
If any unit weight > 0.5 (weight concentration on single unit)
If any time weight > 0.5 (weight concentration on single period)
If bootstrap_iterations < 100 (may be unreliable)
Examples
Basic usage with panel data:
>>> import polars as pl >>> import causers >>> df = pl.DataFrame({ ... 'unit': [1, 1, 1, 2, 2, 2, 3, 3, 3], ... 'time': [1, 2, 3, 1, 2, 3, 1, 2, 3], ... 'y': [1.0, 2.0, 5.0, 1.5, 2.5, 3.0, 1.2, 2.2, 2.8], ... 'treated': [0, 0, 1, 0, 0, 0, 0, 0, 0] ... }) >>> result = causers.synthetic_did(df, 'unit', 'time', 'y', 'treated', seed=42) >>> print(f"ATT: {result.att:.4f} ± {result.standard_error:.4f}") ATT: ... ± ...
Accessing weights and diagnostics:
>>> print(f"Control unit weights: {result.unit_weights}") Control unit weights: [...] >>> print(f"Pre-treatment fit RMSE: {result.pre_treatment_fit:.4f}") Pre-treatment fit RMSE: ...
Notes
Panel Structure Detection
The function automatically detects: - Control units: Units where treatment=0 in all periods - Treated units: Units where treatment=1 in at least one period - Pre-periods: Periods where all observations have treatment=0 - Post-periods: Periods where at least one treated unit has treatment=1
Algorithm
The SDID estimator is:
\[\hat{\tau}_{sdid} = (\bar{Y}_{tr,post} - \bar{Y}_{synth,post}) - \sum_t \lambda_t (\bar{Y}_{tr,t} - \bar{Y}_{synth,t})\]where \(\bar{Y}_{synth,t} = \sum_i \omega_i Y_{i,t}\) uses optimized unit weights \(\omega\) on control units.
Standard Errors
Standard errors are computed via placebo bootstrap: 1. Randomly select a control unit as “placebo treated” 2. Re-run SDID with this unit treated 3. Repeat for bootstrap_iterations 4. SE = standard deviation of placebo ATTs
References
Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic difference-in-differences. American Economic Review, 111(12), 4088-4118.
See also
SyntheticDIDResultResult class with ATT and diagnostics.
linear_regressionFor standard regression analysis.
- causers.synthetic_control(df, unit_col, time_col, outcome_col, treatment_col, method='traditional', lambda_param=None, compute_se=True, n_placebo=None, max_iter=1000, tol=1e-06, seed=None)[source]
Compute Synthetic Control (SC) estimator.
Implements the Synthetic Control method from Abadie et al. (2010, 2015), which constructs a weighted combination of control units to create a synthetic control that matches the treated unit’s pre-treatment outcomes.
Supports four method variants: - Traditional: Classic SC with simplex-constrained weights (Abadie et al., 2010) - Penalized: L2 regularization for more uniform weights - Robust: De-meaned data for matching dynamics instead of levels - Augmented: Bias correction via ridge outcome model (Ben-Michael et al., 2021)
- Parameters:
df (pl.DataFrame or pd.DataFrame) – Panel data in long format with one row per unit-time observation. Must be a balanced panel (all units observed in all time periods). Accepts both Polars and pandas DataFrames.
unit_col (str) – Column name identifying unique units (e.g., “state”, “firm_id”). Must be integer or string type.
time_col (str) – Column name identifying time periods (e.g., “year”, “quarter”). Must be integer or string type.
outcome_col (str) – Column name for the outcome variable. Must be numeric.
treatment_col (str) – Column name for treatment indicator. Must contain only 0 and 1 values. Exactly one unit must be treated (have treatment=1 in post-period).
method (str, default="traditional") – Synthetic control method to use. Options: - “traditional”: Classic SC minimizing pre-treatment MSE - “penalized”: L2 regularized SC for more uniform weights - “robust”: De-meaned SC for matching dynamics - “augmented”: Bias-corrected SC with ridge adjustment
lambda_param (float, optional) – Regularization parameter for penalized/augmented methods. If None, auto-selected via LOOCV for penalized method. Must be >= 0 when specified.
compute_se (bool, default=True) – Whether to compute standard errors via in-space placebo.
n_placebo (int, optional) – Number of placebo iterations for SE. If None, uses all control units.
max_iter (int, default=1000) – Maximum iterations for Frank-Wolfe optimizer.
tol (float, default=1e-6) – Convergence tolerance for optimizer.
seed (int, optional) – Random seed for reproducibility. If None, uses system time.
- Returns:
Result object with the following attributes:
- attfloat
Average Treatment Effect on the Treated
- standard_errorfloat or None
In-space placebo standard error (None if compute_se=False)
- unit_weightsList[float]
Weights assigned to each control unit (sums to 1)
- pre_treatment_rmsefloat
Root Mean Squared Error of pre-treatment fit
- pre_treatment_msefloat
Mean Squared Error of pre-treatment fit
- methodstr
Method used (“traditional”, “penalized”, “robust”, “augmented”)
- lambda_usedfloat or None
Lambda parameter used (for penalized/augmented methods)
- n_units_controlint
Number of control units
- n_periods_preint
Number of pre-treatment periods
- n_periods_postint
Number of post-treatment periods
- solver_convergedbool
Whether the Frank-Wolfe solver converged
- solver_iterationsint
Number of optimizer iterations
- n_placebo_usedint or None
Number of successful placebo iterations (if SE computed)
- Return type:
- Raises:
If DataFrame is empty - If any specified column doesn’t exist - If unit_col or time_col is float type - If outcome_col is not numeric - If outcome_col contains null values - If treatment_col contains values other than 0 and 1 - If not exactly one treated unit found - If fewer than 1 control unit found - If fewer than 1 pre-treatment period found - If no post-treatment periods found - If panel is not balanced - If method is not recognized - If lambda_param < 0
- Warns:
UserWarning –
If any unit weight > 0.5 (weight concentration on single unit)
If pre_treatment_rmse > 0.1 × outcome std (poor pre-treatment fit)
Examples
Basic usage with panel data:
>>> import polars as pl >>> import causers >>> df = pl.DataFrame({ ... 'unit': [1, 1, 1, 2, 2, 2, 3, 3, 3], ... 'time': [1, 2, 3, 1, 2, 3, 1, 2, 3], ... 'y': [1.0, 2.0, 8.0, 1.5, 2.5, 3.0, 1.2, 2.2, 2.8], ... 'treated': [0, 0, 1, 0, 0, 0, 0, 0, 0] ... }) >>> result = causers.synthetic_control(df, 'unit', 'time', 'y', 'treated', seed=42) >>> print(f"ATT: {result.att:.4f}") ATT: ...
Using penalized method with auto lambda:
>>> result = causers.synthetic_control( ... df, 'unit', 'time', 'y', 'treated', ... method="penalized", seed=42 ... ) >>> print(f"Lambda used: {result.lambda_used}") Lambda used: ...
Accessing weights and diagnostics:
>>> print(f"Control unit weights: {result.unit_weights}") Control unit weights: [...] >>> print(f"Pre-treatment RMSE: {result.pre_treatment_rmse:.4f}") Pre-treatment RMSE: ...
Without standard errors (faster):
>>> result = causers.synthetic_control( ... df, 'unit', 'time', 'y', 'treated', ... compute_se=False ... ) >>> print(f"ATT: {result.att:.4f} (SE not computed)") ATT: ... (SE not computed)
Notes
Key Difference from Synthetic DID
Synthetic Control requires exactly ONE treated unit, while SDID supports multiple treated units. If you have multiple treated units, use synthetic_did() instead.
Panel Structure Detection
The function automatically detects: - Control units: Units where treatment=0 in all periods - Treated unit: The single unit where treatment=1 in post-period - Pre-periods: Periods where treatment=0 for all units - Post-periods: Periods where the treated unit has treatment=1
Algorithm
The SC estimator finds weights ω such that:
\[\hat{\omega} = \arg\min_{\omega \geq 0, \sum \omega = 1} \sum_{t \in \text{pre}} (Y_{1t} - \sum_j \omega_j Y_{jt})^2\]Then the ATT is:
\[\hat{\tau}_{SC} = \frac{1}{|\text{post}|} \sum_{t \in \text{post}} (Y_{1t} - \sum_j \hat{\omega}_j Y_{jt})\]Standard Errors
Standard errors are computed via in-space placebo: 1. For each control unit, treat it as the “placebo treated” unit 2. Compute SC weights and ATT using remaining controls 3. SE = standard deviation of placebo ATTs
References
Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies. Journal of the American Statistical Association.
Abadie, A., Diamond, A., & Hainmueller, J. (2015). Comparative Politics and the Synthetic Control Method. American Journal of Political Science.
Ben-Michael, E., Feller, A., & Rothstein, J. (2021). The Augmented Synthetic Control Method. Journal of the American Statistical Association.
See also
SyntheticControlResultResult class with ATT and diagnostics.
synthetic_didFor multiple treated units with DID adjustment.
- causers.dml(df, y_col, d_col, x_cols, n_folds=5, treatment_type='binary', estimate_cate=False, alpha=0.05, propensity_clip=(0.01, 0.99), cluster=None, seed=None)[source]
Estimate causal treatment effects using Double Machine Learning (DML).
Implements the DML estimator from Chernozhukov et al. (2018) with cross-fitting for debiased inference. Uses linear regression for outcome model and logistic (binary) or linear (continuous) regression for propensity model.
- Parameters:
df (pl.DataFrame or pd.DataFrame) – The DataFrame containing the data. Accepts both Polars and pandas.
y_col (str) – Name of the outcome variable column
d_col (str) – Name of the treatment variable column
x_cols (str or List[str]) – Name(s) of the covariate columns to control for
n_folds (int, default=5) – Number of cross-fitting folds. Must be >= 2.
treatment_type (str, default="binary") – Type of treatment variable: - “binary”: 0/1 treatment (uses logistic propensity model) - “continuous”: Continuous treatment (uses linear propensity model)
estimate_cate (bool, default=False) – Whether to estimate Conditional Average Treatment Effect (CATE) coefficients. If True, returns CATE as linear function of X.
alpha (float, default=0.05) – Significance level for confidence intervals. 0.05 = 95% CI.
propensity_clip (tuple, default=(0.01, 0.99)) – Bounds for propensity score clipping (binary treatment only). Extreme propensity scores are clipped to these bounds.
cluster (str, optional) – Column name for cluster identifiers for cluster-robust standard errors. When specified, uses cluster-robust Neyman-orthogonal variance estimation with G/(G-1) small-sample adjustment. Requires at least 2 clusters.
seed (int, optional) – Random seed for reproducible fold assignment. When None, uses deterministic row-order assignment.
- Returns:
Result object with the following attributes:
- thetafloat
Average Treatment Effect (ATE) point estimate
- standard_errorfloat
Neyman-orthogonal robust standard error
- confidence_intervalTuple[float, float]
(1-alpha) confidence interval bounds (lower, upper)
- p_valuefloat
Two-sided p-value for H₀: θ = 0
- n_samplesint
Number of observations used
- n_foldsint
Number of cross-fitting folds used
- propensity_residual_varfloat
Variance of treatment residuals Var(D̃)
- outcome_residual_varfloat
Variance of outcome residuals Var(Ỹ)
- outcome_r_squaredfloat
Average R² of outcome nuisance model across folds
- propensity_r_squaredfloat
Average R² (or pseudo-R²) of propensity model across folds
- n_propensity_clippedint
Count of propensity scores clipped to bounds
- cate_coefficientsDict[str, float] or None
CATE coefficients keyed by covariate name (if estimate_cate=True)
- cate_standard_errorsDict[str, float] or None
CATE coefficient standard errors (if estimate_cate=True)
- Return type:
- Raises:
If x_cols is empty - If n_folds is not 2, 5, or 10 - If n_folds >= n_samples - If treatment_type is not “binary” or “continuous” - If binary treatment doesn’t contain both 0 and 1 - If binary treatment contains non-0/1 values - If alpha is not in (0, 1) - If propensity_clip bounds are invalid - If treatment has no variation - If treatment is fully explained by covariates - If nuisance model fails to converge - If covariate matrix is singular in any fold
Examples
Basic ATE estimation with binary treatment:
>>> import polars as pl >>> import causers >>> df = pl.DataFrame({ ... "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], ... "d": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1], ... "x1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], ... "x2": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9] ... }) >>> result = causers.dml(df, "y", "d", ["x1", "x2"], n_folds=2, seed=42) >>> print(f"ATE: {result.theta:.4f} ± {result.standard_error:.4f}") ATE: ... ± ...
With CATE estimation:
>>> result = causers.dml( ... df, "y", "d", ["x1", "x2"], ... estimate_cate=True, seed=42 ... ) >>> print(f"CATE coefficients: {result.cate_coefficients}") CATE coefficients: {...}
Continuous treatment:
>>> df["d_cont"] = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] >>> result = causers.dml( ... df, "y", "d_cont", ["x1", "x2"], ... treatment_type="continuous", seed=42 ... ) >>> print(f"ATE: {result.theta:.4f}") ATE: ...
Accessing diagnostics:
>>> print(f"Outcome R²: {result.outcome_r_squared:.3f}") Outcome R²: ... >>> print(f"Propensity R²: {result.propensity_r_squared:.3f}") Propensity R²: ... >>> print(result.summary()) # Formatted summary Double Machine Learning Results ...
Notes
Algorithm Overview
The DML estimator uses cross-fitting (sample splitting) to avoid overfitting bias in the nuisance models:
Partition data into K folds for cross-fitting
For each fold k: - Train outcome model ℓ(X) on observations NOT in fold k - Train propensity model m(X) on observations NOT in fold k - Predict for observations IN fold k (out-of-fold predictions)
Compute residuals: Ỹ = Y - ℓ̂(X), D̃ = D - m̂(X)
Final-stage regression: θ̂ = (D̃’D̃)⁻¹ D̃’Ỹ
Neyman-orthogonal variance: V̂ = (1/N) × J⁻² × Σψ²ᵢ
CATE Estimation
When estimate_cate=True, estimates heterogeneous treatment effects as:
τ(X) = θ₀ + X’γ
where γ coefficients capture how treatment effect varies with covariates.
Standard Errors
Uses Neyman-orthogonal variance estimation which provides: - Robustness to first-stage nuisance estimation error - Valid inference even with moderate sample sizes - Automatic bias correction from cross-fitting
References
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.
See also
DMLResultResult class with ATE, CATE, and diagnostics.
linear_regressionFor standard regression analysis.
- causers.two_stage_least_squares(df, y_col, d_cols, z_cols, x_cols=None, include_intercept=True, robust=False, cluster=None)[source]
Estimate causal effects using Two-Stage Least Squares (2SLS) instrumental variables.
The 2SLS estimator addresses endogeneity problems where treatment variables are correlated with the error term. It uses instrumental variables (Z) that affect the outcome (Y) only through their effect on the endogenous treatment (D).
Algorithm:
First Stage: Regress each endogenous variable on instruments and exogenous controls:
\[D = Z\pi + X\delta + \nu\]Second Stage: Regress outcome on predicted endogenous values and controls:
\[Y = \hat{D}\beta + X\gamma + \epsilon\]CRITICAL: Standard errors are computed using residuals from the original D, not the predicted D̂. Using D̂ would understate variance.
- Parameters:
df (pl.DataFrame or pd.DataFrame) – The DataFrame containing the data. Accepts both Polars and pandas.
y_col (str) – Name of the outcome variable column
d_cols (str or List[str]) – Name(s) of the endogenous treatment variable column(s). These are the variables suspected to be correlated with the error term.
z_cols (str or List[str]) – Name(s) of the excluded instrument column(s). Instruments must: - Affect Y only through D (exclusion restriction) - Be correlated with D (relevance) - Be uncorrelated with the error (exogeneity)
x_cols (str or List[str], optional) – Name(s) of exogenous control variable column(s). These variables are included in both stages.
include_intercept (bool, default=True) – Whether to include an intercept term in both stages.
robust (bool, default=False) – If True, compute HC3 heteroskedasticity-robust standard errors. If False, compute conventional (homoskedastic) standard errors.
cluster (str, optional) – Column name for cluster identifiers for cluster-robust SE. When specified, computes cluster-robust standard errors.
- Returns:
Result object with the following attributes:
- coefficientsList[float]
Structural coefficients for endogenous + exogenous variables (excluding intercept). Order: d_cols first, then x_cols.
- standard_errorsList[float]
Standard errors for all coefficients
- interceptfloat or None
Intercept term (None if include_intercept=False)
- intercept_sefloat or None
Standard error for intercept
- n_samplesint
Number of observations used
- n_endogenousint
Number of endogenous regressors
- n_instrumentsint
Number of excluded instruments
- first_stage_fList[float]
F-statistics for each endogenous variable. Rule of thumb: F < 10 suggests weak instruments. F < 4 raises an error (too weak for reliable inference).
- first_stage_coefficientsList[List[float]]
First-stage coefficients for instruments only (per endogenous variable)
- cragg_donaldfloat or None
Cragg-Donald statistic for multiple endogenous variables. None for single endogenous (use first_stage_f instead).
- stock_yogo_criticalfloat or None
Stock-Yogo 10% maximal bias critical value for comparison with Cragg-Donald statistic
- r_squaredfloat
R² from structural equation (using original D)
- se_typestr
Type of standard errors: “conventional”, “hc3”, or “clustered”
- n_clustersint or None
Number of clusters if clustered SE used
- Return type:
- Raises:
If d_cols is empty - If z_cols is empty - If number of instruments < number of endogenous variables (under-identified) - If first-stage F-statistic < 4 for any endogenous variable - If first-stage or second-stage design matrix is singular - If not enough observations for the number of parameters - If any column contains null values - If any column has zero variance - If cluster column not found when cluster specified
- Warns:
UserWarning –
If first-stage F-statistic < 10 (weak instruments warning)
If Cragg-Donald < Stock-Yogo critical value
If number of instruments > n_samples/10
Examples
Basic IV regression with one endogenous variable:
>>> import polars as pl >>> import causers >>> # Angrist-Krueger style: quarter of birth as instrument for education >>> df = pl.DataFrame({ ... "wage": [2.5, 3.0, 2.8, 3.5, 3.2, 4.0, 3.8, 4.5, 4.2, 5.0], ... "educ": [10, 11, 10, 12, 11, 14, 13, 15, 14, 16], ... "quarter_born": [1, 4, 2, 3, 1, 4, 2, 4, 3, 4], ... "age": [30, 32, 31, 33, 34, 35, 36, 38, 40, 42] ... }) >>> result = causers.two_stage_least_squares( ... df, ... y_col="wage", ... d_cols="educ", ... z_cols="quarter_born", ... x_cols="age" ... ) >>> print(f"Returns to education: {result.coefficients[0]:.4f}") Returns to education: ... >>> print(f"First-stage F: {result.first_stage_f[0]:.1f}") First-stage F: ...
Multiple endogenous variables:
>>> # Two endogenous vars with two instruments >>> result = causers.two_stage_least_squares( ... df, ... y_col="wage", ... d_cols=["educ", "age"], ... z_cols=["quarter_born", "year_born"] ... ) >>> print(f"Cragg-Donald: {result.cragg_donald}") Cragg-Donald: ...
Robust standard errors:
>>> result = causers.two_stage_least_squares( ... df, "wage", "educ", "quarter_born", ... robust=True ... ) >>> print(f"SE type: {result.se_type}") SE type: hc3
Clustered standard errors:
>>> df = df.with_columns(_pl.lit([1, 1, 2, 2, 3, 3, 4, 4, 5, 5]).alias("state")) >>> result = causers.two_stage_least_squares( ... df, "wage", "educ", "quarter_born", ... cluster="state" ... ) >>> print(f"N clusters: {result.n_clusters}") N clusters: 5
Notes
Identification Requirements
The model is identified when: - m ≥ k₁ (number of instruments ≥ number of endogenous variables) - m = k₁: exactly identified (use all instruments) - m > k₁: over-identified (more instruments than needed)
Weak Instrument Detection
Weak instruments (those poorly correlated with D) lead to: - Biased 2SLS estimates (toward OLS bias) - Unreliable inference (undersized confidence intervals)
This function uses two diagnostics:
First-stage F-statistic: For single endogenous variable. Rule of thumb: F > 10 is generally considered acceptable. F < 4 raises an error as inference is unreliable.
Cragg-Donald statistic: For multiple endogenous variables. Compare to Stock-Yogo critical values for desired bias/size control.
Standard Errors
Three options are available:
Conventional: Assumes homoskedasticity. Use when error variance is believed constant across observations.
HC3 Robust: Heteroskedasticity-consistent standard errors. Recommended when error variance may vary with X.
Clustered: For data with within-cluster correlation (e.g., students within schools, observations over time).
References
Angrist, J. D., & Pischke, J. S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Stock, J. H., & Yogo, M. (2005). Testing for Weak Instruments in Linear IV Regression. In D. W. K. Andrews & J. H. Stock (Eds.), Identification and Inference for Econometric Models (pp. 80-108). Cambridge University Press.
Staiger, D., & Stock, J. H. (1997). Instrumental Variables Regression with Weak Instruments. Econometrica, 65(3), 557-586.
See also
TwoStageLSResultResult class with coefficients and diagnostics.
linear_regressionFor standard OLS regression without instruments.
dmlFor causal inference using machine learning methods.
- causers.balance_check(df, treatment_col, covariate_cols, weights=None, treatment_value=None, control_value=None, max_categorical_levels=1000)[source]
Check covariate balance between treatment and control groups.
Computes standardized mean differences (SMD), variance ratios, and group-level summary statistics for each covariate. Supports weighted analysis (e.g. inverse-propensity weights) and automatic treatment / control value detection for binary indicators.
- Parameters:
df (pl.DataFrame or pd.DataFrame) – The DataFrame containing the data. Accepts both Polars and pandas.
treatment_col (str) – Column that identifies the treatment assignment.
covariate_cols (str or list[str]) – Column name(s) of the covariates to check for balance.
weights (str, optional) – Column name containing observation weights (e.g. IPW weights). When provided, weighted means / variances are computed.
treatment_value (int, float, str, or None) – Value in
treatment_colthat denotes the treated group. IfNone, auto-detected from the column: when exactly two unique non-null values exist the larger value is used.control_value (int, float, str, or None) – Value in
treatment_colthat denotes the control group. IfNone, auto-detected (the value that is nottreatment_value).max_categorical_levels (int, default 1000) – Maximum number of unique levels allowed for categorical columns before raising an error.
- Returns:
Result object exposing all per-covariate statistics plus the convenience methods
summary(),imbalanced(), andto_dataframe().- Return type:
- Raises:
If
treatment_color any covariate column is missing. - If auto-detection finds != 2 unique non-null treatment values. - If the treatment column contains only one unique value.
- Warns:
UserWarning –
Large imbalance: |SMD| > 0.25 for any covariate.
Extreme variance ratio: VR < 0.5 or VR > 2.0.
Small treatment or control group (n < 10).
Low effective sample size (ESS < 10) in weighted analysis.
Examples
Basic balance check with a binary treatment:
>>> import polars as pl >>> import causers >>> df = pl.DataFrame({ ... "treated": [1, 1, 1, 0, 0, 0], ... "age": [25, 30, 35, 40, 45, 50], ... "income": [50000, 60000, 55000, 48000, 52000, 47000], ... }) >>> result = causers.balance_check(df, "treated", ["age", "income"]) >>> print(result.smd) {...}
Using the convenience methods:
>>> summary_df = result.summary() >>> imbalanced_covs = result.imbalanced(threshold=0.1) >>> full_df = result.to_dataframe()
Weighted balance check:
>>> df = df.with_columns(pl.lit([1.0, 1.0, 1.0, 0.5, 0.8, 0.7]).alias("w")) >>> result = causers.balance_check(df, "treated", ["age", "income"], weights="w") >>> print(result.is_weighted) True
See also
BalanceCheckResultResult class with balance statistics and helpers.
Result Classes
- class causers.LinearRegressionResult
Result of a linear regression computation.
This struct contains the results of fitting a linear regression model, including coefficients, standard errors, goodness-of-fit metrics, and optional information about clustering and fixed effects.
# Fields
coefficients - Regression coefficients for each covariate
intercept - Intercept term (None if include_intercept=False)
r_squared - Coefficient of determination (R²)
n_samples - Number of observations used in the regression
slope - Deprecated: Same as coefficients[0] for single covariate (backward compatibility)
standard_errors - HC3 robust or clustered standard errors for coefficients
intercept_se - Standard error for intercept (None if no intercept)
n_clusters - Number of unique clusters (None if not clustered)
cluster_se_type - Type of clustered SE: “analytical” or “bootstrap” (None if not clustered)
bootstrap_iterations_used - Number of bootstrap iterations (None if not bootstrap)
fixed_effects_absorbed - Number of groups absorbed per FE variable
fixed_effects_names - Column names used for fixed effects
within_r_squared - R² computed on demeaned data (within-R²)
- __new__(**kwargs)
- bootstrap_iterations_used
Number of bootstrap iterations used (None if not bootstrap).
- cluster_se_type
“analytical” or “bootstrap” (None if not clustered).
- Type:
Type of clustered SE
- coefficients
Regression coefficients for each covariate.
- fixed_effects_absorbed
Number of groups absorbed per FE variable (e.g., [100, 10] for firm+year).
- fixed_effects_names
Column names used for fixed effects (e.g., [“firm_id”, “year”]).
- intercept
Intercept term (None if include_intercept=False).
- intercept_se
HC3 robust standard error for intercept (None if include_intercept=False).
- n_clusters
Number of unique clusters (None if not clustered).
- n_samples
Number of observations used in the regression.
- r_squared
Coefficient of determination (R²).
- slope
Use coefficients[0] instead. Kept for backward compatibility.
- Type:
Deprecated
- standard_errors
HC3 robust standard errors for each coefficient (or clustered SE if cluster specified).
- within_r_squared
R² computed on demeaned data (within-R²).
- class causers.LogisticRegressionResult
Result of a logistic regression computation.
Contains coefficient estimates, robust standard errors, and diagnostic information about the optimization process.
- __new__(**kwargs)
- bootstrap_iterations_used
Number of bootstrap iterations used (None if not bootstrap)
- cluster_se_type
“analytical” or “bootstrap” (None if not clustered)
- Type:
Type of clustered SE
- coefficients
Coefficient estimates for x variables (log-odds scale)
- converged
Whether the optimization converged
- fixed_effects_absorbed
Number of unique groups per FE dimension (e.g., [100, 20] for 100 entities, 20 time periods) None when fixed_effects is not used.
- fixed_effects_names
Column names of absorbed fixed effects (e.g., [“entity”, “time”]) None when fixed_effects is not used.
- intercept
Intercept term (None if include_intercept=False)
- intercept_se
Robust standard error for intercept (None if include_intercept=False)
- iterations
Number of iterations used by the optimizer
- log_likelihood
Log-likelihood at the MLE solution
- n_clusters
Number of unique clusters (None if not clustered)
- n_samples
Number of observations used
- pseudo_r_squared
McFadden’s pseudo R² = 1 - (LL_model / LL_null)
- standard_errors
Robust standard errors for each coefficient
- within_pseudo_r_squared
Pseudo R² computed on the Mundlak-augmented model (within-pseudo R²) None when fixed_effects is not used.
- class causers.SyntheticDIDResult
Python-accessible result class for Synthetic DID estimation.
This class wraps the internal SdidEstimate struct and provides Python access to all estimation results including the ATT, standard error, weights, and diagnostic information.
# Attributes
att: Average Treatment Effect on the Treated
standard_error: Bootstrap standard error
unit_weights: Weights for control units
time_weights: Weights for pre-treatment periods
n_units_control: Number of control units
n_units_treated: Number of treated units
n_periods_pre: Number of pre-treatment periods
n_periods_post: Number of post-treatment periods
solver_iterations: Tuple of (unit_iterations, time_iterations)
solver_converged: Whether the solver converged successfully
pre_treatment_fit: RMSE of pre-treatment synthetic control fit
bootstrap_iterations_used: Number of successful bootstrap iterations
# Example
`python result = synthetic_did(...) print(f"ATT: {result.att} ± {result.standard_error}") print(f"Unit weights: {result.unit_weights}") `- __new__(**kwargs)
- att
Average Treatment Effect on the Treated.
- bootstrap_iterations_used
Number of bootstrap iterations actually used (excluding failures).
- n_periods_post
Number of post-treatment periods.
- n_periods_pre
Number of pre-treatment periods.
- n_units_control
Number of control units in the panel.
- n_units_treated
Number of treated units in the panel.
- pre_treatment_fit
Pre-treatment fit RMSE between synthetic and actual treated outcomes.
- solver_converged
Whether the solver converged successfully.
- solver_iterations
Solver iterations as (unit_weight_iterations, time_weight_iterations).
- standard_error
Standard error estimated via placebo bootstrap.
- time_weights
Time weights (λ) for pre-treatment periods.
- unit_weights
Unit weights (ω) for control units.
- class causers.SyntheticControlResult
Result of Synthetic Control estimation, exposed to Python via PyO3.
Contains the ATT estimate, optional standard error, control unit weights, and various diagnostic information about the estimation.
- __new__(**kwargs)
- att
Average Treatment Effect on Treated
- lambda_used
Lambda used for penalized method (None for others)
- method
“traditional”, “penalized”, “robust”, “augmented”
- Type:
SC method used
- n_failed_iterations
Number of failed placebo iterations (None if SE not computed)
- n_periods_post
Number of post-treatment periods
- n_periods_pre
Number of pre-treatment periods
- n_placebo_used
Number of successful placebo iterations (None if SE not computed)
- n_units_control
Number of control units
- pre_treatment_mse
Pre-treatment MSE
- pre_treatment_rmse
Pre-treatment RMSE (fit quality)
- solver_converged
Whether the optimizer converged
- solver_iterations
Number of optimizer iterations
- standard_error
Standard error via in-space placebo (None if not computed)
- unit_weights
Control unit weights (sum to 1, non-negative)
- class causers.DMLResult
Result of DML estimation.
Contains the treatment effect estimate, standard error, confidence interval, and various diagnostics.
- __new__(**kwargs)
- cate_coefficients
CATE coefficients keyed by covariate name (if estimate_cate=True)
- cate_standard_errors
CATE coefficient standard errors (if estimate_cate=True)
- confidence_interval
Confidence interval bounds (lower, upper)
- n_folds
Number of cross-fitting folds used
- n_propensity_clipped
Count of clipped propensity scores
- n_samples
Number of observations
- outcome_r_squared
Average R² of outcome nuisance model across folds
- outcome_residual_var
Variance of outcome residuals Var(Ỹ)
- p_value
θ = 0
- Type:
Two-sided p-value for H₀
- propensity_r_squared
Average R² (or pseudo-R²) of propensity nuisance model across folds
- propensity_residual_var
Variance of treatment residuals Var(D̃)
- standard_error
Robust standard error (Neyman-orthogonal)
- summary()
Return formatted summary string
- theta
Average Treatment Effect (ATE) point estimate
- class causers.TwoStageLSResult
Result of Two-Stage Least Squares estimation.
- __new__(**kwargs)
- coefficients
Structural coefficients (endogenous + exogenous, excluding intercept)
- confidence_intervals(alpha=0.05)
Compute confidence intervals for all coefficients.
Confidence intervals are computed as: coef ± t_crit × se where t_crit is the critical value from the t-distribution.
# Arguments * alpha - Significance level (default: 0.05 for 95% CI).
Must be in (0, 1).
# Returns List of (lower, upper) bounds for each coefficient.
- cragg_donald
Cragg-Donald statistic (multiple endogenous only)
- first_stage_coefficients
First-stage coefficients for instruments only (per endogenous variable)
- first_stage_f
F-statistics per endogenous variable
- intercept
Intercept term if included
- intercept_se
Standard error for intercept
- n_clusters
Number of clusters (if clustered)
- n_endogenous
Number of endogenous regressors
- n_instruments
Number of excluded instruments
- n_samples
Number of observations
- p_values
Compute two-tailed p-values for all coefficients.
P-values are computed using the t-distribution with (n_samples - n_params) degrees of freedom. Tests H₀: β = 0 against H₁: β ≠ 0.
# Returns Vec of p-values, one for each coefficient.
- r_squared
R² from structural equation (using original D)
- se_type
“conventional”, “hc3”, or “clustered”
- Type:
SE type
- standard_errors
Standard errors for all coefficients
- stock_yogo_critical
Stock-Yogo 10% critical value
- summary()
Return a formatted summary table similar to statsmodels IV2SLS output.
The summary includes: - Model information (N, R², SE type) - First-stage diagnostics (F-statistics, Cragg-Donald) - Coefficient table with estimates, SEs, t-stats, p-values, and CIs
# Returns Formatted string suitable for printing.
- t_statistics
Compute t-statistics for all coefficients.
T-statistics are computed as coefficients / standard_errors (element-wise). Returns t-statistics for slope coefficients only (intercept excluded).
# Returns Vec of t-statistics, one for each coefficient.
- class causers.BalanceCheckResult[source]
Python wrapper around the Rust BalanceResult with convenience methods.
This class wraps the native
BalanceResultreturned by Rust and addssummary(),imbalanced(), andto_dataframe()helper methods. All attributes of the underlying Rust object are accessible directly (e.g.result.smd,result.n_treated).- ess_treated
Effective sample size for the treatment group (weighted analysis only).
- Type:
float or None
- ess_control
Effective sample size for the control group (weighted analysis only).
- Type:
float or None
- summary()[source]
Return a Polars DataFrame summarizing balance statistics.
Columns: covariate, mean_treated, mean_control, sd_treated, sd_control, smd, variance_ratio.
- Returns:
One row per covariate with key balance statistics.
- Return type:
pl.DataFrame