Mailing lists

Below we describe how the public mailing lists of each of the Internet standards developing organisations can be scraped from the web. Some mailing lists reach back to 1998 and are multiple GBs in size, so scraping an entire mailing list can take a considerable amount of time. This process can’t be sped up, since sending requests any faster would effectively amount to a denial-of-service (DoS) attack on the archive server. So be prepared to leave your machine running overnight, possibly for several nights.
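The pacing constraint described above can be illustrated with a small sketch. This is not BigBang's own scraper: `fetch_politely`, its parameters, and the one-second default delay are assumptions made purely for illustration. The downloader is passed in as a function so that the pacing logic stays independent of any particular HTTP library.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is any function that downloads one URL and returns its body.
    `sleep` is injectable so the pacing can be checked without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Passing a real downloader as `fetch`, for example a thin wrapper around `urllib.request.urlopen`, turns this into a working loop. A larger `delay_seconds` is gentler on the server at the cost of a longer run.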

IETF

To scrape the public mailing lists of the Internet Engineering Task Force (IETF), there are two options, outlined below.

Public Mailman Web Archive

BigBang comes with a script for collecting files from public Mailman web archives. An example of this is the scipy-dev mailing list page. To collect the archives of the scipy-dev mailing list, run the following command from the root directory of this repository:

python3 bin/collect_mail.py -u http://mail.python.org/pipermail/scipy-dev/

You can also give this command a file with several URLs, one per line. One such file is provided in the examples/ directory.

python3 bin/collect_mail.py -f examples/urls.txt
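As a rough illustration of the one-URL-per-line format, the sketch below writes and reads such a file with the standard library. The helper names `write_url_file` and `read_url_file` are hypothetical, and skipping blank lines is an assumption; `bin/collect_mail.py` may parse its input differently.

```python
from pathlib import Path
from typing import List


def write_url_file(path: str, urls: List[str]) -> None:
    """Write the given URLs to `path`, one per line."""
    Path(path).write_text("\n".join(urls) + "\n", encoding="utf-8")


def read_url_file(path: str) -> List[str]:
    """Return the non-empty lines of a URL file, stripped of whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

For instance, calling `write_url_file` with a list containing `http://mail.python.org/pipermail/scipy-dev/` would produce a file in this format that can be passed to the `-f` option.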

Once the data has been collected, BigBang has functions to support analysis.

Datatracker

BigBang can also be used to analyze data about IETF RFC drafts.

It does this using the Glasgow IPL group’s ietfdata tool.

The collection script takes one argument, the working group acronym:

python3 bin/collect_draft_metadata.py -w httpbis

W3C

The World Wide Web Consortium (W3C) mailing archive is managed using the Hypermail software and is hosted at:

https://lists.w3.org/Archives/Public/

There are two ways to scrape the public mailing lists from that domain. First, you can write your own Python script containing a variation of:

from bigbang.ingress import W3CMailList

mlist = W3CMailList.from_url(
    name="public-testtwf",
    url="https://lists.w3.org/Archives/Public/public-testtwf/",
    select={"years": 2014, "fields": "header"},
)
mlist.to_mbox(path_to_file)

Or you can use the command-line script with a file containing all the mailing-list URLs you want to scrape:

python3 bin/collect_mail.py -f examples/url_collections/W3C.txt
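Once a list has been saved with `to_mbox`, the result can be given a quick sanity check before any deeper analysis. The sketch below uses only Python's standard `mailbox` module and assumes the output is a standard mbox file. The function name `count_senders` is hypothetical and not part of BigBang.

```python
import mailbox
from collections import Counter


def count_senders(mbox_path: str) -> Counter:
    """Count the messages per From header in an mbox file."""
    counts = Counter()
    # create=False raises an error instead of silently creating an empty file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            counts[str(message.get("From", "unknown"))] += 1
    finally:
        box.close()
    return counts
```

Calling `count_senders(path_to_file).most_common(5)` would then list the five most frequent senders, a cheap way to confirm that the scrape produced plausible data.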

3GPP

The 3rd Generation Partnership Project (3GPP) mailing archive is managed using the LISTSERV software and is hosted at:

https://list.etsi.org/scripts/wa.exe?HOME

In order to successfully scrape all public mailing lists, one needs to create an account here: https://list.etsi.org/scripts/wa.exe?GETPW1=&X=&Y=

There are two ways to scrape the public mailing lists from that domain. First, you can write your own Python script containing a variation of:

from bigbang.ingress import ListservMailList

mlist = ListservMailList.from_url(
    name="3GPP_TSG_SA_WG2_EMEET",
    url="https://list.etsi.org/scripts/wa.exe?A0=3GPP_TSG_SA_WG2_EMEET",
    select={"fields": "header",},
    url_login="https://list.etsi.org/scripts/wa.exe?LOGON=INDEX",
    url_pref="https://list.etsi.org/scripts/wa.exe?PREF",
    login=auth_key,
)
mlist.to_mbox(path_to_file)

Or you can use the command-line script with a file containing all the mailing-list URLs you want to scrape:

python3 bin/collect_mail.py -f examples/url_collections/listserv.3GPP.txt

IEEE

The Institute of Electrical and Electronics Engineers (IEEE) mailing archive is managed using the LISTSERV software and is hosted at:

https://listserv.ieee.org/cgi-bin/wa?INDEX

There are two ways to scrape the public mailing lists from that domain. First, you can write your own Python script containing a variation of:

from bigbang.ingress import ListservMailList

mlist = ListservMailList.from_url(
    name="IEEE-TEST",
    url="https://listserv.ieee.org/cgi-bin/wa?A0=IEEE-TEST",
    select={"fields": "header",},
    url_login="https://listserv.ieee.org/cgi-bin/wa?LOGON",
    url_pref="https://listserv.ieee.org/cgi-bin/wa?PREF",
    login=auth_key,
)
mlist.to_mbox(path_to_file)

Or you can use the command-line script with a file containing all the mailing-list URLs you want to scrape:

python3 bin/collect_mail.py -f examples/url_collections/listserv.IEEE.txt