analysis.process

bigbang.analysis.process.ai(m, parts, i)
bigbang.analysis.process.bi(m, parts, i)
bigbang.analysis.process.consolidate_senders_activity(activity_df, to_consolidate)

Takes a DataFrame in the format returned by activity, and a list of tuples of the form (‘from 1’, ‘from 2’) to consolidate. Returns the consolidated DataFrame (a copy, not modified in place).

bigbang.analysis.process.containment_distance(a, b)

A case-insensitive distance measure on strings.

Returns

  • 0 if strings are identical

  • positive infinity if neither string contains the other

  • 1 / (minimum string length) if one string contains the other.

Good for organizations. E.g., “cisco”, “Cisco”, and “Cisco Systems” are all ‘close’ (distance ≤ 0.2).
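The three rules above can be sketched in a few lines. This is an illustrative re-implementation written from the description only, not the library source; the name `containment_distance_sketch` is hypothetical, and the sketch assumes both strings are non-empty.

```python
import math


def containment_distance_sketch(a: str, b: str) -> float:
    """Illustrative version of the documented rules (assumes non-empty strings)."""
    a, b = a.lower(), b.lower()  # case-insensitive comparison
    if a == b:
        return 0.0
    if a in b or b in a:
        # One string contains the other: the shorter the contained string,
        # the larger the distance.
        return 1.0 / min(len(a), len(b))
    return math.inf


print(containment_distance_sketch("cisco", "Cisco"))          # 0.0
print(containment_distance_sketch("cisco", "Cisco Systems"))  # 0.2
print(containment_distance_sketch("cisco", "Juniper"))        # inf
```

Because the distance is the reciprocal of the shorter length, very short substrings (for example a two-letter acronym) score as far apart even when contained in a longer name.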

bigbang.analysis.process.domain_name_from_email(name)
bigbang.analysis.process.eij(m, parts, i, j)
bigbang.analysis.process.from_header_distance(a, b, verbose=False)

A distance measure specifically for the ‘From’ header of emails. Normalizes based on common differences in client handling of email, then computes Levenshtein distance between components of the field.
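The general idea (normalize, then compare components by edit distance) might look like the sketch below. The normalization shown here (splitting into display name and address with the standard library's `parseaddr`, lowercasing, stripping whitespace) and the choice to sum the two component distances are assumptions for illustration; the library's actual normalization rules are not described here. `levenshtein` and `header_distance_sketch` are hypothetical names.

```python
from email.utils import parseaddr


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def header_distance_sketch(a: str, b: str) -> int:
    """Assumed normalization: split name/address, lowercase, strip whitespace."""
    name_a, addr_a = parseaddr(a)
    name_b, addr_b = parseaddr(b)
    name_a, name_b = name_a.strip().lower(), name_b.strip().lower()
    addr_a, addr_b = addr_a.lower(), addr_b.lower()
    # Assumption: the component distances are simply added together.
    return levenshtein(name_a, name_b) + levenshtein(addr_a, addr_b)


print(header_distance_sketch("Jane Doe <jane@example.com>",
                             '"jane doe" <JANE@example.com>'))  # 0
print(header_distance_sketch("Jane Doe <jane@example.com>",
                             "Jane Do <jane@example.com>"))     # 1
```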

bigbang.analysis.process.matricize(series, func)

Create a matrix by applying func to pairwise combinations of elements in a Series. Returns a square matrix as a DataFrame. The result should be symmetric if func(a, b) == func(b, a), and should be the identity matrix if func is ‘==’.
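The pairwise construction and its two stated properties can be checked with a small stand-in. To stay dependency-free, this sketch uses a plain list and nested lists in place of the Series and DataFrame; `pairwise_matrix` is a hypothetical name, not part of the library.

```python
import operator


def pairwise_matrix(items, func):
    """Apply func to every ordered pair; rows and columns follow the order of items."""
    return [[func(a, b) for b in items] for a in items]


names = ["alice", "bob", "carol"]

# With equality as func and distinct elements, only the diagonal is True,
# which is the identity pattern.
equal = pairwise_matrix(names, operator.eq)
print(equal)  # [[True, False, False], [False, True, False], [False, False, True]]

# A commutative func yields a symmetric matrix.
length_gap = pairwise_matrix(names, lambda a, b: abs(len(a) - len(b)))
print(length_gap)  # [[0, 2, 0], [2, 0, 2], [0, 2, 0]]
```

Note that the identity pattern only holds when the elements are distinct; duplicated elements would also produce True entries off the diagonal.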

bigbang.analysis.process.minimum_but_not_self(column, dataframe)
bigbang.analysis.process.modularity(m, parts)

Compute the modularity of an adjacency matrix, using the metric from:

Zanetti, M. and Schweitzer, F. 2012. “A Network Perspective on Software Modularity” ARCS Workshops 2012, pp. 175-186.

bigbang.analysis.process.resolve_entities(significance, distance_function, threshold=0)

Takes a Series mapping entities (index) to significance (values, numerical).

Resolves the entities based on a lexical distance function.

Returns a dictionary of labeled (keys) entity lists (values). Key is the most significant member of the entity list.
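One way to obtain groups keyed by their most significant member is a greedy pass in descending order of significance. The sketch below is an assumption about how such a resolution could work, not the library's algorithm: it uses a plain dict in place of the Series, treats two entities as the same when their distance is at most the threshold, and the names `resolve_entities_sketch` and `case_distance` are hypothetical.

```python
def resolve_entities_sketch(significance, distance_function, threshold=0):
    """Greedy grouping (illustrative assumption, not the library's implementation)."""
    groups = {}
    # Visit entities from most to least significant, so the first member of
    # each group (its key) is always the most significant one.
    for entity in sorted(significance, key=significance.get, reverse=True):
        for label in groups:
            if distance_function(label, entity) <= threshold:
                groups[label].append(entity)
                break
        else:
            # No existing group is close enough: start a new one.
            groups[entity] = [entity]
    return groups


def case_distance(a, b):
    """Toy distance for the example: 0 if equal ignoring case, else 1."""
    return 0 if a.lower() == b.lower() else 1


significance = {"Cisco": 10, "cisco": 3, "IBM": 7, "ibm": 1}
resolved = resolve_entities_sketch(significance, case_distance)
print(resolved)  # {'Cisco': ['Cisco', 'cisco'], 'IBM': ['IBM', 'ibm']}
```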

bigbang.analysis.process.resolve_sender_entities(act, lexical_distance=0)

Given an Archive’s activity matrix, return a dict of lists, each containing message senders (‘From’ fields) that have been grouped as probably the same entity.

bigbang.analysis.process.sorted_matrix(from_dataframe, limit=None, sort_key=None)

Takes a DataFrame with ‘From’ fields as column headers.

Returns a sorted distance matrix for the column headers, computed with from_header_distance.