Welcome to Crunchers’ documentation!

Contents:

Overview

Crunchers


A library that provides a set of helper functions, classes, etc. that I tend to use a lot when crunching data with scikit-learn, pandas, et al.

  • Free software: BSD license

Installation

pip install crunchers

Development

To run all the tests, run:

tox

Installation

At the command line:

pip install crunchers

Usage

To use Crunchers in a project:

import crunchers

Reference

crunchers package

Subpackages

crunchers.pandas_helpers package
Submodules
crunchers.pandas_helpers.transformations module

Provide functions for performing non-standard-ish column-wise transformations.

crunchers.pandas_helpers.transformations.apply_ignore_null(func, s, fillwith=None)[source]

Perform func on the values of s that are not ‘nan’ or equivalent.

func is applied to s after filling the ‘nan’ values with fillwith. If fillwith is None, min(s) is used.

You may prefer to use the mean or median like this:

apply_ignore_null(func, s, fillwith=np.mean(s))

Returns a reconstituted pandas.Series with ‘nan’ everywhere there was an original ‘nan’, but with the transformed values everywhere else.
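
The documented behavior can be illustrated with a minimal, hypothetical re-implementation (apply_ignore_null_sketch is not part of the package; it only mirrors the semantics described above):

```python
import numpy as np
import pandas as pd

def apply_ignore_null_sketch(func, s, fillwith=None):
    """Mirror the documented behavior: fill NaNs, apply func, restore NaNs."""
    if fillwith is None:
        fillwith = s.min()
    mask = s.isnull()
    transformed = func(s.fillna(fillwith))
    # Put NaN back wherever the original series had NaN
    return transformed.where(~mask)

s = pd.Series([1.0, np.nan, 4.0])
result = apply_ignore_null_sketch(np.log2, s)  # [0.0, NaN, 2.0]
```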

crunchers.pandas_helpers.transformations.apply_pairwise(series, func)[source]

Apply func to items in series pairwise: return dataframe.
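
A sketch of the pairwise pattern described (hypothetical names; the actual implementation may differ):

```python
import pandas as pd

def apply_pairwise_sketch(series, func):
    """Build a square DataFrame where cell (i, j) = func(series[i], series[j])."""
    return pd.DataFrame(
        [[func(a, b) for b in series] for a in series],
        index=series.index,
        columns=series.index,
    )

s = pd.Series([1, 2, 3])
diffs = apply_pairwise_sketch(s, lambda a, b: a - b)  # 3x3 table of differences
```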

crunchers.pandas_helpers.transformations.robust_scale(df)[source]

Return copy of df scaled by (df - df.median()) / MAD(df) where MAD is a function returning the median absolute deviation.
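
The stated formula is straightforward to sketch (a hypothetical illustration of the documented scaling, not the package's source):

```python
import pandas as pd

def robust_scale_sketch(df):
    # Median absolute deviation (MAD) per column
    mad = (df - df.median()).abs().median()
    return (df - df.median()) / mad

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0]})
scaled = robust_scale_sketch(df)  # column 'a' becomes [-2, -1, 0, 1, 2]
```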

crunchers.pandas_helpers.transformations.std_scale(df)[source]

Return scaled copy of df tolerating columns where stdev == 0.

crunchers.pandas_helpers.transformations.zero_stdv_columns(df)[source]

Return list of column names where standard deviation == 0.
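
A hedged sketch of how these two helpers might interact, with std_scale skipping the constant columns that zero_stdv_columns identifies (hypothetical names and implementation):

```python
import pandas as pd

def zero_stdv_columns_sketch(df):
    """Names of constant columns (standard deviation == 0)."""
    stds = df.std()
    return list(stds[stds == 0].index)

def std_scale_sketch(df):
    """Standard-scale each column, leaving zero-stdev columns untouched."""
    out = df.copy()
    keep = [c for c in df.columns if c not in zero_stdv_columns_sketch(df)]
    out[keep] = (df[keep] - df[keep].mean()) / df[keep].std()
    return out

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [5.0, 5.0, 5.0]})
```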

Module contents
crunchers.sklearn_helpers package
Submodules
crunchers.sklearn_helpers.assessment module

Provide helper functions for working with scikit-learn based objects.

crunchers.sklearn_helpers.assessment.confusion_matrix_to_pandas(cm, labels)[source]

Return the confusion matrix as a pandas dataframe.

It is created from the confusion matrix stored in cm with rows and columns labeled with labels.

crunchers.sklearn_helpers.assessment.normalize_confusion_matrix(cm)[source]

Return confusion matrix with values as fractions of outcomes instead of specific cases.
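
A minimal sketch of both assessment helpers, assuming row-wise normalization over true outcomes (hypothetical re-implementations for illustration only):

```python
import numpy as np
import pandas as pd

def confusion_matrix_to_pandas_sketch(cm, labels):
    """Label rows (true class) and columns (predicted class) of a confusion matrix."""
    return pd.DataFrame(cm, index=labels, columns=labels)

def normalize_confusion_matrix_sketch(cm):
    """Convert counts to fractions of each true class (row-wise)."""
    return cm / cm.sum(axis=1, keepdims=True)

cm = np.array([[5, 5], [2, 8]])
frac = normalize_confusion_matrix_sketch(cm)  # [[0.5, 0.5], [0.2, 0.8]]
df = confusion_matrix_to_pandas_sketch(cm, ['cat', 'dog'])
```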

crunchers.sklearn_helpers.assessment.plot_confusion_matrix(cm, labels=None, cmap='Blues', title=None, norm=False, context=None, annot=True)[source]

Plot and return the confusion matrix heatmap figure.

crunchers.sklearn_helpers.exploration module

Provide functions that help quickly explore datasets with sklearn.

class crunchers.sklearn_helpers.exploration.KMeansReport(data, n_clusters, seed=None, n_jobs=-1, palette='deep')[source]

Bases: object

Manage KMeans Clustering and exploration of results.

cluster()[source]

Fit each estimator.

eval_silhouette(verbose=True)[source]

Evaluate each estimator via silhouette score.

init_estimators()[source]

Set up and return dictionary of estimators with key = n_clusters.
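
A sketch of what such a dictionary of estimators might look like (hypothetical helper name; the actual constructor arguments may differ):

```python
from sklearn.cluster import KMeans

def init_estimators_sketch(cluster_range, seed=None):
    """One KMeans estimator per cluster count, keyed by n_clusters."""
    return {k: KMeans(n_clusters=k, random_state=seed) for k in cluster_range}

estimators = init_estimators_sketch(range(2, 5), seed=42)
```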

plot_silhouette_results(feature_names=None, feature_space=None)[source]

Perform plotting similar to that from sklearn link below.

http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

class crunchers.sklearn_helpers.exploration.PCAReport(data, pca=None, n_components=None, data_labels=None, color_palette=None, label_colors=None, name=None)[source]

Bases: object

Manage PCA and exploration of results.

filter_by_loadings(kind, column, hi_thresh, lo_thresh)[source]

Return index of row names.

kind (str): one of [‘pearsonr’, ‘spearmanr’]
column (str): which PC column to filter
hi_thresh (float): retain rows with values >= hi_thresh
lo_thresh (float): retain rows with values <= lo_thresh

get_loading_corr(kind='pearsonr')[source]

Return dataframe of correlation-based “loadings” with respect to kind.

get_pcs(rerun=True)[source]

Fit and Transform via our local PCA object; store results in self.pcs.

n_components

Provide access to the number of PCs.

plot_pcs(components=None, label_colors=None, diag='kde', diag_kws=None, **kwargs)[source]

Plot scatter-plots below the diagonal and density plots on the diagonal.

components (list): list of components to plot

label_colors (dict): mapping of labels to colors, e.g. {'label1': 'g', 'label2': 'r', 'label3': 'b'}
plot_variance_accumulation(thresh=6, verbose=False)[source]

Plot variance accumulation over PCs.

plot_variance_decay(thresh=6, verbose=False)[source]

Plot variance decay over PCs.

crunchers.sklearn_helpers.misc module

Collect misc sklearn helpers here.

crunchers.sklearn_helpers.misc.repandasify(array, y_names, X_names=None)[source]

Convert numpy array into pandas dataframe using provided index and column names.
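
The conversion described can be sketched in one step (a hypothetical illustration of the documented behavior):

```python
import numpy as np
import pandas as pd

def repandasify_sketch(array, y_names, X_names=None):
    """Wrap a numpy array in a DataFrame with the given index and column names."""
    return pd.DataFrame(array, index=y_names, columns=X_names)

arr = np.array([[1, 2], [3, 4]])
df = repandasify_sketch(arr, ['row1', 'row2'], X_names=['col1', 'col2'])
```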

Module contents
crunchers.statsmodels_helpers package
Submodules
crunchers.statsmodels_helpers.lazy_stats module

Functions for streamlining analysis.

crunchers.statsmodels_helpers.lazy_stats.build_regression_models_grid(X_hyps_dicts, ctrl_coefs_dicts, outcomes_dicts)[source]
crunchers.statsmodels_helpers.lazy_stats.compare_coefs(row, value, results)[source]
crunchers.statsmodels_helpers.lazy_stats.do_regression(data, y_var, X_ctrls=None, X_hyp=None, kind='OLS', **kwargs)[source]

Provide a further abstracted way to build and run multiple types of regressions.

data (pd.DataFrame): data table to use when retrieving the column headers
y_var (str): column header of the outcome variable
X_ctrls (str): formula specification of the “boring” variables, e.g. “column_header_1 + column_header_2”
X_hyp (str): formula specification of the “interesting” variables, e.g. “column_header_1 + column_header_2”
kind (str): the type of regression to run; one of [‘GLM’, ‘OLS’, ‘RLM’]
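
The formula-style parameters suggest something like the following under the hood (a hypothetical sketch using statsmodels OLS only; the real function also supports GLM and RLM):

```python
import pandas as pd
import statsmodels.formula.api as smf

def do_regression_sketch(data, y_var, X_ctrls=None, X_hyp=None):
    """Assemble a patsy formula from the pieces and fit an OLS model."""
    rhs = ' + '.join(part for part in (X_ctrls, X_hyp) if part)
    return smf.ols(f'{y_var} ~ {rhs}', data=data).fit()

data = pd.DataFrame({'y': [1.0, 2.1, 2.9, 4.2], 'x': [1.0, 2.0, 3.0, 4.0]})
fit = do_regression_sketch(data, 'y', X_hyp='x')
```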

crunchers.statsmodels_helpers.lazy_stats.format_all_regression_models(regs, total)[source]

Return tuple of string-formatted versions of all regression tables in the regs object.

Parameters:
  • regs (dict-like) – tree-like dict containing the regression results objects as leaves and descriptors as nodes.
  • total (int) – total number of results tables to format.
Returns:

tuple

crunchers.statsmodels_helpers.lazy_stats.get_diff(a, b)[source]
crunchers.statsmodels_helpers.lazy_stats.get_log2_fold(a, b)[source]
crunchers.statsmodels_helpers.lazy_stats.identify_full_ctrl_names(X_vars, orig_ctrl_names)[source]

Return set of variable names actually used in regression, tolerating mangling of categoricals.
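
patsy expands categorical terms into mangled names like 'sex[T.male]'; matching those back to the original control names might look like this (hypothetical sketch, not the package's source):

```python
def identify_full_ctrl_names_sketch(X_vars, orig_ctrl_names):
    """Keep fitted variable names that start with an original control name."""
    return {v for v in X_vars for ctrl in orig_ctrl_names if v.startswith(ctrl)}

used = identify_full_ctrl_names_sketch(
    ['age', 'sex[T.male]', 'treatment'],
    ['age', 'sex'],
)  # {'age', 'sex[T.male]'}
```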

crunchers.statsmodels_helpers.lazy_stats.regression_grid_single(grid_item, data, kind, **kwargs)[source]
crunchers.statsmodels_helpers.lazy_stats.report_glm(formula, data, verbose=True, **kwargs)[source]

Fit GLM, print a report, and return the fit object.

crunchers.statsmodels_helpers.lazy_stats.report_logitreg(formula, data, verbose=True, disp=1)[source]

Fit logistic regression, print a report, and return the fit object.

crunchers.statsmodels_helpers.lazy_stats.report_ols(formula, data, fit_regularized=False, L1_wt=1, refit=False, **kwargs)[source]

Fit OLS regression, print a report, and return the fit object.

crunchers.statsmodels_helpers.lazy_stats.report_rlm(formula, data, verbose=True, **kwargs)[source]

Fit RLM, print a report, and return the fit object.

crunchers.statsmodels_helpers.lazy_stats.run_regressions_grid(grid, data, kind, max_workers=None, **kwargs)[source]
crunchers.statsmodels_helpers.lazy_stats.summarize_X_vars(results, sig_thresh=0.05, X_ctrls=None, X_ignore=None)[source]
crunchers.statsmodels_helpers.lazy_stats.summarize_grid_OLS(regs, reg_grid)[source]
crunchers.statsmodels_helpers.lazy_stats.summarize_grid_X_vars_OLS(regs, reg_grid, sig_thresh=0.05)[source]
crunchers.statsmodels_helpers.lazy_stats.summarize_multi_LOGIT(results)[source]

Return dataframe aggregating overall stats from a dictionary-like object containing LOGIT result objects.

crunchers.statsmodels_helpers.lazy_stats.summarize_multi_OLS(results)[source]

Return dataframe aggregating overall stats from a dictionary-like object containing OLS result objects.

crunchers.statsmodels_helpers.lazy_stats.summarize_single_OLS(regression, col_dict, name, is_regularized=False)[source]

Return dataframe aggregating overall stats from a single OLS result object.

crunchers.statsmodels_helpers.lazy_stats.tree()[source]
Module contents

Module contents

crunchers.ipython_info()[source]

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

Bug reports

When reporting a bug please include:

  • Your operating system name and version.
  • Any details about your local setup that might be helpful in troubleshooting.
  • Detailed steps to reproduce the bug.

Documentation improvements

Crunchers could always use more documentation, whether as part of the official Crunchers docs, in docstrings, or even on the web in blog posts, articles, and such.

Feature requests and feedback

The best way to send feedback is to file an issue at https://github.com/xguse/crunchers/issues.

If you are proposing a feature:

  • Explain in detail how it would work.
  • Keep the scope as narrow as possible, to make it easier to implement.
  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Development

To set up crunchers for local development:

  1. Fork crunchers on GitHub.

  2. Clone your fork locally:

    git clone git@github.com:your_name_here/crunchers.git
    
  3. Create a branch for local development:

    git checkout -b name-of-your-bugfix-or-feature
    

    Now you can make your changes locally.

  4. When you’re done making changes, run all the checks, the doc builder, and the spell checker with one tox command:

    tox
    
  5. Commit your changes and push your branch to GitHub:

    git add .
    git commit -m "Your detailed description of your changes."
    git push origin name-of-your-bugfix-or-feature
    
  6. Submit a pull request through the GitHub website.

Pull Request Guidelines

If you need some code review or feedback while you’re developing the code, just make the pull request.

For merging, you should:

  1. Include passing tests (run tox) [1].
  2. Update documentation when there’s new API, functionality etc.
  3. Add a note to CHANGELOG.rst about the changes.
  4. Add yourself to AUTHORS.rst.
[1]

If you don’t have all the necessary Python versions available locally, you can rely on Travis: it will run the tests for each change you add in the pull request. It will be slower, though.

Tips

To run a subset of tests:

tox -e envname -- py.test -k test_myfeature

To run all the test environments in parallel (you need to pip install detox):

detox

Authors

Changelog

0.0.1 (2016-05-13)

  • First release on GitHub.
