Replication Package: Unboxing Default Argument Breaking Changes in 1 + 2 Data Science Libraries in Python

  • João Eduardo Montandon (Creator)
  • Luciana Lourdes Silva (Creator)
  • Cristiano Politowski (Creator)
  • Daniel Prates (Creator)
  • Arthur de Brito Bonifácio (Creator)
  • Ghizlane El boussaidi (Creator)

Dataset

Description

Replication Package

This repository contains the data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".

Requirements

We recommend the following requirements to replicate our study:

• Internet access
• At least 100 GB of free disk space
• Docker installed
• Git installed

Package Structure

We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configured the following containers:

• data-analysis, an R-based container we used to run our data analysis.
• data-collection, a Python container we used to collect Scikit Learn's default arguments and to detect them in client applications.
• database, a Postgres container we used to store clients' data, obtained from Grotov et al.
• storage, a directory used to store the data processed in data-analysis and data-collection. This directory is shared by both containers.
• docker-compose.yml, the Docker Compose file that configures all the containers used in the package.

In the remainder of this document, we describe how to set up each container properly.

Using VSCode to Set Up the Package

We selected VSCode as the IDE of choice because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both the data-analysis and data-collection containers, so you can access and run each container without any further configuration. First, set up the containers:

```
$ cd /replication/package/folder
$ docker-compose build
$ docker-compose up   # wait for Docker to create and run all containers
```

Then, open them in Visual Studio Code:

1. Open VSCode in the project root folder.
2. Open the command palette and select "Dev Containers: Reopen in Container".
3. Select either Data Collection or Data Analysis.
4. Start working.

If you want or need a more customized setup, the remainder of this file describes it in detail.

Longest Road: Manual Package Setup

Database Setup

The database container automatically restores the dump in dump_matroskin.tar on its first launch. To set up and run the container:

Build the image:

```
$ cd ./database
$ docker build --tag 'dabc-database' .
$ docker image ls
REPOSITORY      TAG      IMAGE ID       CREATED          SIZE
dabc-database   latest   b6f8af99c90d   50 minutes ago   18.5GB
```

Create and enter the container:

```
$ docker run -it --name dabc-database-1 dabc-database
$ docker exec -it dabc-database-1 /bin/bash
root# psql -U postgres -h localhost -d jupyter-notebooks
jupyter-notebooks=# \dt
             List of relations
 Schema |       Name        | Type  | Owner
--------+-------------------+-------+-------
 public | Cell              | table | root
 public | Code_cell         | table | root
 public | Md_cell           | table | root
 public | Notebook          | table | root
 public | Notebook_features | table | root
 public | Notebook_metadata | table | root
 public | repository        | table | root
```

If you get the list of tables above, your database is properly set up. It is important to mention that this database extends the one provided by Grotov et al.: we added three columns to the Notebook_features table (API_functions_calls, defined_functions_calls, and other_functions_calls) containing the function calls performed by each client in the database.
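To illustrate how these added columns can be inspected, here is a minimal Python sketch that queries them. This is a sketch only, not part of the package: psycopg2 and the connection parameters (host, user, password) are assumptions that you must adjust to how you exposed the database container.

```python
# Minimal sketch (not part of the package): read the three columns added to
# Notebook_features. Connection parameters are assumptions; adjust them to
# match how the database container is exposed on your machine.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="jupyter-notebooks", user="postgres")
with conn, conn.cursor() as cur:
    # Mixed-case identifiers must be double-quoted in Postgres.
    cur.execute(
        'SELECT "API_functions_calls", "defined_functions_calls",'
        ' "other_functions_calls" FROM "Notebook_features" LIMIT 5'
    )
    for row in cur.fetchall():
        print(row)
conn.close()
```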
Data Collection Setup

This container is responsible for collecting the data to answer our research questions. It has the following structure:

• dabcs.py, extracts DABCs from the Scikit Learn source code and exports them to a CSV file.
• dabcs-clients.py, extracts function calls from clients and exports them to a CSV file. We rely on a modified version of Matroskin to leverage the function calls; you can find the tool's source code in the `matroskin` directory.
• Makefile, commands to set up and run both dabcs.py and dabcs-clients.py.
• matroskin, the directory containing the modified version of the matroskin tool. We extended the library to collect the function calls performed in the client notebooks of Grotov's dataset (a minimal sketch of this analysis appears at the end of this README).
• storage, a Docker volume where data-collection saves the exported data. This data is used later in Data Analysis.
• requirements.txt, the Python dependencies adopted in this module.

Note that the container automatically configures this module for you, e.g., installs dependencies, configures matroskin, and downloads the Scikit Learn source code. For this, you must run the following commands:

```
$ cd ./data-collection
$ docker build --tag "data-collection" .
$ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection
$ docker exec -it data-collection-1 /bin/bash
$ ls
Dockerfile  Makefile  config.yml  dabcs-clients.py  dabcs.py  matroskin  storage  requirements.txt  utils.py
```

If you see the project files, the container is configured accordingly.

Data Analysis Setup

We use this container to conduct the analysis of the data produced by the Data Collection container. It has the following structure:

• dependencies.R, an R script containing the dependencies used in our data analysis.
• data-analysis.Rmd, the R notebook we used to perform our data analysis.
• datasets, a Docker volume pointing to the storage directory.

Execute the following commands to run this container:

```
$ cd ./data-analysis
$ docker build --tag "data-analysis" .
$ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-analysis/datasets/ data-analysis
$ docker exec -it data-analysis-1 /bin/bash
$ ls
data-analysis.Rmd  datasets  dependencies.R  Dockerfile  figures  Makefile
```

If you see the project files, the container is configured accordingly.

A Note on the storage Shared Folder

As mentioned, the storage folder is mounted as a volume and shared between the data-collection and data-analysis containers. We compressed the content of this folder due to space constraints. Therefore, before starting to work on Data Collection or Data Analysis, make sure you have extracted the compressed files. You can do this by running the Makefile inside the storage folder:

```
$ make unzip  # extract files
$ ls
clients-dabcs.csv  clients-validation.csv  dabcs.csv  Makefile  scikit-learn-versions.csv  versions.csv
$ make zip    # compress files
$ ls
csv-files.tar.gz  Makefile
```
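To give a feel for what the modified matroskin does when it fills the API_functions_calls, defined_functions_calls, and other_functions_calls columns, the sketch below extracts function calls from a code cell using Python's ast module. It is illustrative only: extract_calls and the sample cell are hypothetical names, and the actual implementation lives in the matroskin directory.

```python
# Illustrative sketch of AST-based call extraction, in the spirit of our
# matroskin extension; extract_calls is a hypothetical helper, not the
# package's actual API.
import ast

def extract_calls(source: str) -> list[str]:
    """Return the (possibly dotted) names of all function calls in source."""
    calls = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func, parts = node.func, []
            while isinstance(func, ast.Attribute):  # unwrap calls like a.b.c(...)
                parts.append(func.attr)
                func = func.value
            if isinstance(func, ast.Name):
                parts.append(func.id)
            calls.append(".".join(reversed(parts)))
    return calls

cell = """
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)
"""
print(extract_calls(cell))  # ['LogisticRegression', 'clf.fit']
```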
Date made available: 24 June 2024
Publisher: Zenodo
