analysis.listserv

class bigbang.analysis.listserv.ListservMailList(name: str, filepath: str, msgs: pandas.core.frame.DataFrame)

Bases: object

Note

There are known issues loading 3GPP_TSG_RAN_WG1, which is 3.3 GB in size.

__iter__()

Iterate over each message within the mailing list.

__len__() → int

Get the number of messages within the mailing list.

add_thread_info()

Edit the pd.DataFrame to include an extra column that identifies which thread a message belongs to.

add_weight_to_edge(dic: dict, key1: str, key2: str) → dict
Parameters
  • dic

  • key1

  • key2

static contract(count: numpy.array, label: list, contract: float) → Dict[str, int]

This function contracts all domain names that contributed to a mailing list below the contract threshold into one entity called Others. For example, if contract=3 and nokia.com and t-mobile.at each wrote fewer than three emails to the mailing list in question, their contributions are summed into one entity denoted as Others.

Parameters
  • count – Number of emails sent to the mailing list.

  • label – Names of contributors to the mailing list.

  • contract – Threshold below which all contributions will be summed.
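
The contraction described above can be sketched without the library. This is a minimal illustration, not the library's implementation: the function name contract_counts is hypothetical, plain lists stand in for numpy.array, and omitting the Others entry when nothing falls below the threshold is an assumption.

```python
from typing import Dict, List


def contract_counts(count: List[int], label: List[str], contract: float) -> Dict[str, int]:
    """Sum every contribution below `contract` into a single 'Others' entry."""
    contracted: Dict[str, int] = {}
    others = 0
    for n, name in zip(count, label):
        if n < contract:
            others += n  # below the threshold: fold into 'Others'
        else:
            contracted[name] = contracted.get(name, 0) + n
    if others:  # assumption: no 'Others' key when nothing was contracted
        contracted["Others"] = others
    return contracted


# example.org is a made-up domain used only for illustration.
result = contract_counts(
    [10, 2, 1, 5],
    ["example.org", "nokia.com", "t-mobile.at", "etsi.org"],
    contract=3,
)
print(result)
```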

create_sender_receiver_digraph(nw: Optional[dict] = None, entity_in_focus: Optional[list] = None, node_attributes: Optional[Dict[str, list]] = None)

Create directed graph from messaging network created with ListservMailList.get_sender_receiver_dict().

Parameters
  • nw – Dictionary created with self.get_sender_receiver_dict().

  • entity_in_focus – This can be a list of domain names or localparts. If such a list is provided, the created di-graph will only focus on their relations.
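
As a rough sketch of how a nested sender/receiver dictionary maps onto directed, weighted edges, the helper below flattens it into (sender, receiver, weight) tuples. The name nested_dict_to_edges is hypothetical, and interpreting "focus on their relations" as "keep edges that touch an entity in focus" is an assumption.

```python
from typing import Dict, List, Optional, Tuple


def nested_dict_to_edges(
    nw: Dict[str, Dict[str, int]],
    entity_in_focus: Optional[List[str]] = None,
) -> List[Tuple[str, str, int]]:
    """Flatten {sender: {receiver: weight}} into directed weighted edges."""
    edges = []
    for sender, receivers in nw.items():
        for receiver, weight in receivers.items():
            # Assumption: an edge is kept if either endpoint is in focus.
            if entity_in_focus is not None and not (
                sender in entity_in_focus or receiver in entity_in_focus
            ):
                continue
            edges.append((sender, receiver, weight))
    return edges


nw = {
    "etsi.org": {"nokia.com": 2},
    "nokia.com": {"etsi.org": 1, "t-mobile.at": 4},
}
edges = nested_dict_to_edges(nw, entity_in_focus=["etsi.org"])
print(edges)
```

An edge list of this shape is what networkx.DiGraph.add_weighted_edges_from() accepts, should a graph object be needed.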

crop_by_address(header_field: str, per_address_field: Dict[str, List[str]]) → bigbang.analysis.listserv.ListservMailList
Parameters
  • header_field – For a Listserv mailing list, the most representative header fields for senders and receivers are ‘from’ and ‘comments-to’ respectively.

  • per_address_field – Filter by the 'local-part' or 'domain' part of an address.

    The data structure of the argument should be, e.g.:

    {‘localpart’: [string-1, string-2, …]}

Returns

ListservMailList object cropped to specification.
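
To illustrate the per_address_field structure, the following sketch tests a single address against such a dictionary. It is only an illustration: matches_address is a hypothetical helper, and the 'domain' key is assumed by analogy with the 'localpart' key shown above.

```python
from typing import Dict, List


def matches_address(address: str, per_address_field: Dict[str, List[str]]) -> bool:
    """Return True if the address matches any listed localpart or domain."""
    localpart, _, domain = address.lower().rpartition("@")
    parts = {"localpart": localpart, "domain": domain}
    return any(
        parts.get(field) in wanted for field, wanted in per_address_field.items()
    )


senders = ["joern.krause@etsi.org", "juergen.hofmann@nokia.com"]
kept_by_domain = [a for a in senders if matches_address(a, {"domain": ["nokia.com"]})]
kept_by_localpart = [
    a for a in senders if matches_address(a, {"localpart": ["joern.krause"]})
]
print(kept_by_domain, kept_by_localpart)
```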

crop_by_subject(match=<class 'str'>, place: int = 2) → bigbang.analysis.listserv.ListservMailList
Parameters
  • match – Only keep messages with subject lines containing the match string.

  • place – Define how to filter for match. Use one of the following methods: 0 = Using a regular expression; 1 = String ends with match; 2 =

Returns

ListservMailList object cropped to message subject.
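
The two fully documented place modes can be sketched as follows. This is a hedged illustration: subject_matches is a hypothetical helper, the use of re.search (rather than re.match) is an assumption, and mode 2 is left out because its description is cut off in the text above.

```python
import re


def subject_matches(subject: str, match: str, place: int) -> bool:
    """Test one subject line against `match` using the given mode."""
    if place == 0:
        # Mode 0: treat `match` as a regular expression.
        return re.search(match, subject) is not None
    if place == 1:
        # Mode 1: the subject must end with `match`.
        return subject.endswith(match)
    raise ValueError("only modes 0 and 1 are sketched here")


subjects = ["Draft agenda", "Re: Draft agenda", "Meeting report"]
by_regex = [s for s in subjects if subject_matches(s, r"^Re:", place=0)]
by_suffix = [s for s in subjects if subject_matches(s, "agenda", place=1)]
print(by_regex, by_suffix)
```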

crop_by_year(yrs: Union[int, list]) → bigbang.analysis.listserv.ListservMailList

Filter self.df DataFrame by year in message date.

Parameters

yrs – Specify a single year, such as 2021, or a range of years, such as [2011, 2021].

Returns

ListservMailList object cropped to specification.
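
A minimal sketch of the year filter, assuming that a list such as [2011, 2021] denotes a range with both bounds included (the text does not state this). The helper in_years is hypothetical and works on plain datetime objects rather than on self.df.

```python
from datetime import datetime
from typing import List, Union


def in_years(date: datetime, yrs: Union[int, List[int]]) -> bool:
    """Check whether a message date falls in a year or a range of years."""
    if isinstance(yrs, int):
        return date.year == yrs
    start, end = min(yrs), max(yrs)
    return start <= date.year <= end  # assumption: both bounds are inclusive


dates = [datetime(2010, 5, 1), datetime(2015, 7, 9), datetime(2021, 1, 2)]
in_range = [d.year for d in dates if in_years(d, [2011, 2021])]
single = [d.year for d in dates if in_years(d, 2021)]
print(in_range, single)
```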

crop_dic_to_entity_in_focus(dic: dict, entity_in_focus: list) → dict
Parameters

entity_in_focus – This can be a list of domain names or localparts.

classmethod from_mbox(name: str, filepath: str, include_body: bool = True) → bigbang.analysis.listserv.ListservMailList

classmethod from_pandas_dataframe(df: pandas.core.frame.DataFrame, name: Optional[str] = None, filepath: Optional[str] = None) → bigbang.analysis.listserv.ListservMailList

get_domains(header_fields: List[str], return_msg_counts: bool = False, df: Optional[pandas.core.frame.DataFrame] = None) → dict

Get contribution of members per affiliation.

Note

For a Listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.

Parameters
  • header_fields – Indicate which email header field to process (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a Listserv mailing list, the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.

  • return_msg_counts – If 'True', return the number of messages per domain.
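
The effect of return_msg_counts can be pictured with a small stand-alone sketch. The shape of the returned dictionary (one entry per header field, holding either a list of domains or a domain-to-count mapping) is an assumption, and domains_per_field is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List


def domains_per_field(
    headers: Dict[str, List[str]], return_msg_counts: bool = False
) -> dict:
    """Collect the domains seen in each header field, optionally with counts."""
    result = {}
    for field, addresses in headers.items():
        counts = Counter(a.lower().rpartition("@")[2] for a in addresses)
        # Assumed shapes: {domain: count} with counts, sorted list without.
        result[field] = dict(counts) if return_msg_counts else sorted(counts)
    return result


headers = {
    "from": [
        "joern.krause@etsi.org",
        "juergen.hofmann@nokia.com",
        "joern.krause@etsi.org",
    ]
}
names_only = domains_per_field(headers)
with_counts = domains_per_field(headers, return_msg_counts=True)
print(names_only, with_counts)
```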

get_domainscount(header_fields: List[str], per_year: bool = False) → dict
Parameters
  • header_fields – Indicate which email header field to process (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a Listserv mailing list, the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.

  • per_year – Aggregate results for each year.

get_graph_prop_per_domain_per_year(years: Optional[tuple] = None, func=<function betweenness_centrality>, **args) → dict
Parameters
  • years

  • func

get_localparts(header_fields: List[str], per_domain: bool = False, return_msg_counts: bool = False, df: Optional[pandas.core.frame.DataFrame] = None) → dict

Get contribution of members per affiliation.

Parameters
  • header_fields – Indicate which email header field to process (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a Listserv mailing list, the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.

  • per_domain

  • return_msg_counts – If 'True', return the number of messages per localpart.

get_localpartscount(header_fields: List[str], per_domain: bool = False, per_year: bool = False) → dict
Parameters
  • header_fields – Indicate which email header field to process (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a Listserv mailing list, the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.

  • per_domain – Aggregate results for each domain.

  • per_year – Aggregate results for each year.

get_messagescount(header_fields: Optional[List[str]] = None, per_address_field: Optional[str] = None, per_year: bool = False) → dict
Parameters
  • header_fields – Indicate which email header field to process (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a Listserv mailing list, the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.

  • per_year – Aggregate results for each year.

get_messagescount_per_timezone(percentage: bool = False) → Dict[str, int]

Get contribution of messages per time zone.

Parameters

percentage – Whether to return the message counts as percentages of the total.
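
A possible way to count messages per time zone from raw date strings is sketched below. The date format is the default of period_of_activity(); keying the result by UTC offset strings such as '+0000' is an assumption, and messages_per_timezone is a hypothetical name.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List

DATE_FORMAT = "%a, %d %b %Y %H:%M:%S %z"  # default format of period_of_activity()


def messages_per_timezone(dates: List[str], percentage: bool = False) -> Dict[str, float]:
    """Count messages per UTC offset, optionally as a share of the total."""
    offsets = Counter(
        datetime.strptime(d, DATE_FORMAT).strftime("%z") for d in dates
    )
    if not percentage:
        return dict(offsets)
    total = sum(offsets.values())
    return {tz: 100 * n / total for tz, n in offsets.items()}


dates = [
    "Wed, 01 Apr 2020 10:08:58 +0000",
    "Wed, 01 Apr 2020 11:00:00 +0000",
    "Thu, 26 Mar 2020 21:41:27 +0000",
    "Thu, 26 Mar 2020 23:00:08 +0200",
]
counts = messages_per_timezone(dates)
shares = messages_per_timezone(dates, percentage=True)
print(counts, shares)
```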

static get_name_localpart_domain(string: str) → tuple

Split an address field which has (ideally) a format as ‘Heinrich von Kleist <Heinrich.vonKleist@SELBST.org>’ into name, local-part, and domain. All strings are returned in lower case only to avoid duplicates.

Note

Test whether the incorporation of email.utils.parseaddr() can improve this function.
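
As a starting point for the test suggested in the note, the sketch below splits an address field with the standard library's email.utils.parseaddr(). It is not the library's implementation; split_address is a hypothetical name.

```python
from email.utils import parseaddr
from typing import Tuple


def split_address(field: str) -> Tuple[str, str, str]:
    """Split 'Name <local@domain>' into lower-case name, local-part, domain."""
    name, address = parseaddr(field)
    # rpartition splits on the last '@', which separates local-part and domain.
    localpart, _, domain = address.rpartition("@")
    return name.lower(), localpart.lower(), domain.lower()


parts = split_address("Heinrich von Kleist <Heinrich.vonKleist@SELBST.org>")
print(parts)
```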

get_sender_receiver_dict(address_field: str = 'domain', entity_in_focus: Optional[list] = None, df: Optional[pandas.core.frame.DataFrame] = None) → Dict
Parameters
  • address_field

  • entity_in_focus – This can be a list of domain names or localparts. If such a list is provided, the created dictionary will only contain their information.

Returns

Nested dictionary with the ‘from’ domain keys as the first layer and the ‘comments-to’ domain keys as the second layer, with the integer indicating the number of messages between them.
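
The nested structure can be built up one message at a time. The sketch below is an illustration under assumptions: add_weight mirrors the signature of add_weight_to_edge() but its behaviour (incrementing by one per message) is assumed, and sender_receiver_dict is a hypothetical stand-in that takes (sender, receiver) pairs instead of a DataFrame.

```python
from typing import Dict, List, Tuple


def add_weight(dic: Dict[str, Dict[str, int]], key1: str, key2: str) -> dict:
    """Increase the message count from key1 to key2 by one."""
    dic.setdefault(key1, {})
    dic[key1][key2] = dic[key1].get(key2, 0) + 1
    return dic


def sender_receiver_dict(pairs: List[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Build {from_domain: {comments_to_domain: number of messages}}."""
    nw: Dict[str, Dict[str, int]] = {}
    for sender, receiver in pairs:
        add_weight(nw, sender, receiver)
    return nw


pairs = [
    ("etsi.org", "nokia.com"),
    ("etsi.org", "nokia.com"),
    ("nokia.com", "etsi.org"),
]
network = sender_receiver_dict(pairs)
print(network)
```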

get_threads(return_length: bool = False) → dict

Collect all messages that belong to the same thread.

Note

Computationally very intensive.

Parameters

return_length – If ‘True’, the returned dictionary will be of the form {‘subject1’: # of messages, ‘subject2’: # of messages, …}. If ‘False’, the returned dictionary will be of the form {‘subject1’: list of indices, ‘subject2’: list of indices, …}.
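
The two return forms can be compared in a small sketch. Grouping by subject after stripping a leading 'Re: ' is a simplification chosen for illustration, and group_threads is a hypothetical name.

```python
from typing import Dict, List, Union


def group_threads(
    subjects: List[str], return_length: bool = False
) -> Dict[str, Union[int, List[int]]]:
    """Group message indices by subject, ignoring leading 'Re: ' labels."""
    threads: Dict[str, List[int]] = {}
    for index, subject in enumerate(subjects):
        root = subject
        while root.startswith("Re: "):  # simplification: only 'Re: ' is stripped
            root = root[4:]
        threads.setdefault(root, []).append(index)
    if return_length:
        return {root: len(indices) for root, indices in threads.items()}
    return threads


subjects = ["Agenda", "Re: Agenda", "Minutes", "Re: Re: Agenda"]
by_index = group_threads(subjects)
by_length = group_threads(subjects, return_length=True)
print(by_index, by_length)
```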

get_threadsroot(per_address_field: Optional[str] = None, df: Optional[pandas.core.frame.DataFrame] = None) → dict

Find all unique message subjects. Replies are not treated as new subjects.

Note

The most reliable way to find the beginning of a thread is to check whether the subject line of a message starts with an element of reply_labels. Checking whether the header field ‘comments-to’ is empty is not reliable, as ‘reply-all’ is often chosen by mistake, as seen here:

2020-04-01 10:08:58+00:00 joern.krause@etsi.org, juergen.hofmann@nokia.com
2020-03-26 21:41:27+00:00 joern.krause@etsi.org NaN
2020-03-26 21:00:08+00:00 joern.krause@etsi.org juergen.hofmann@nokia.com

  1. Some emails start with ‘AW:’, which comes from German and has the same meaning as ‘Re:’.

  2. Some Emails start with ‘=?utf-8?b?J+WbnuWkjTo=?=’ or ‘=?utf-8?b?J+etlOWkjTo=?=’, which are UTF-8 encodings of the Chinese characters ‘回复’ and ‘答复’ both of which have the same meaning as ‘Re:’.

  3. Leading strings such as ‘FW:’ are treated as new subjects.
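
The prefix check described in this note can be sketched as follows. The tuple REPLY_LABELS only contains the labels named above; the actual contents of reply_labels in the library may differ, and is_thread_root is a hypothetical helper.

```python
# Labels taken from the note above; the library's reply_labels may hold more.
REPLY_LABELS = (
    "Re:",
    "AW:",
    "=?utf-8?b?J+WbnuWkjTo=?=",
    "=?utf-8?b?J+etlOWkjTo=?=",
)


def is_thread_root(subject: str) -> bool:
    """Return True if the subject does not start with a reply label."""
    # str.startswith accepts a tuple and checks each prefix in turn.
    return not subject.lstrip().startswith(REPLY_LABELS)


subjects = [
    "Draft agenda",
    "Re: Draft agenda",
    "AW: Draft agenda",
    "FW: Draft agenda",
    "=?utf-8?b?J+WbnuWkjTo=?= Draft agenda",
]
roots = [s for s in subjects if is_thread_root(s)]
print(roots)
```

As stated in item 3, a subject starting with ‘FW:’ counts as a new thread root in this sketch.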

Parameters

per_address_field

Returns

A dictionary of the form {‘subject1’: index of message, ‘subject2’: …} is returned. If per_address_field is specified, the subjects are sorted into the domain or localpart from which they originate.

get_threadsrootcount(per_address_field: Optional[str] = None, per_year: bool = False) → Union[int, dict]

Identify the number of conversation threads in the mailing list.

Parameters
  • per_address_field – Aggregate results for each address field, which can be, e.g., from, send-to, received-by.

  • per_year – Aggregate results for each year.

static iterator_name_localpart_domain(li: list) → tuple

Generator for the self.get_name_localpart_domain() function.

period_of_activity(format: str = '%a, %d %b %Y %H:%M:%S %z') → list

Return a list containing the datetime of the first and last message written in the mailing list.

static to_percentage(arr: numpy.array) → numpy.array

class bigbang.analysis.listserv.ListservMailListDomain(name: str, filedsc: str, lists: pandas.core.frame.DataFrame)

Bases: object

Parameters
  • name – The name of the organisation to which the mail list domain belongs (e.g. 3GPP, IEEE, …)

  • filedsc – The file description of the mail list domain

  • lists – A list containing the mailing lists as ListservMailList types

classmethod from_mbox(name: str, directorypath: str, filedsc: str = '*.mbox') → bigbang.analysis.listserv.ListservMailListDomain

get_mlistscount_per_institution() → Dict[str, int]

Get a dictionary that lists the mailing lists/working groups in which an institute/company is active.
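
Given the Dict[str, int] return type, one plausible reading is a count of mailing lists per institution. The sketch below follows that reading as an assumption; mlists_per_institution is a hypothetical name, and LIST_B is a made-up mailing list.

```python
from collections import Counter
from typing import Dict, Iterable


def mlists_per_institution(domains_per_list: Dict[str, Iterable[str]]) -> Dict[str, int]:
    """Count in how many mailing lists each domain appears at least once."""
    counts: Counter = Counter()
    for domains in domains_per_list.values():
        counts.update(set(domains))  # each list counts once per domain
    return dict(counts)


domains_per_list = {
    "3GPP_TSG_RAN_WG1": ["nokia.com", "etsi.org", "nokia.com"],
    "LIST_B": ["nokia.com"],  # hypothetical second mailing list
}
activity = mlists_per_institution(domains_per_list)
print(activity)
```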

exception bigbang.analysis.listserv.ListservMailListDomainWarning

Bases: BaseException

Base class for ListservMailListDomain class specific exceptions

exception bigbang.analysis.listserv.ListservMailListWarning

Bases: BaseException

Base class for ListservMailList class specific exceptions