API Reference

FACT_data_project.download_dataset (Method)
download_dataset(project_filename::String, output_filename::String)

Downloads and processes a dataset defined in a project file, then saves the resulting dataset to an output file in the format implied by the output file's extension.

Arguments:

  • project_filename::String: The name of the project file (with extension) located in the projects directory. This file defines the dataset to be downloaded.
  • output_filename::String: The name of the output file (with extension) where the processed dataset will be saved. This file will be created in the data directory.

Workflow:

  1. Constructs full paths for the input project file and the output file.
  2. Logs into the registry and retrieves an authentication token.
  3. Starts a results container and establishes a connection to retrieve data.
  4. Processes the dataset defined in the project file using the connection.
  5. Writes the processed dataset to the specified output file in the desired format.
  6. Stops the results container after the operation is complete.

Supported Output Formats:

  • .json: Saves the dataset as a JSON file.
  • .csv: Saves the dataset as a CSV file.
  • .dta: Saves the dataset in Stata format.
  • .parquet: Saves the dataset as a Parquet file.
  • .sqlite: Saves the dataset to an SQLite database.

Example Usage: This example downloads a dataset defined in example_project.toml and saves it as output.parquet:

download_dataset("example_project.toml", "output.parquet")
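
Because the file extension selects the format (see the list above), the same project file can be exported to multiple formats; for instance:

download_dataset("example_project.toml", "output.csv")      # CSV export
download_dataset("example_project.toml", "output.sqlite")   # SQLite database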

FACT_data_project.getResultsFromTask (Method)
getResultsFromTask(task_definition::Dict, connection_info::Dict) -> DataFrame

Retrieves the results of a specific task from a database, based on the task's definition and connection information. This function establishes a database connection, queries the database for the task's results using details specified in task_definition, and then closes the database connection.

Arguments

  • task_definition::Dict: A dictionary containing the task's definition, including the function name, arguments, and a parameters dictionary used to query the database.
  • connection_info::Dict: A dictionary containing the database connection information required to establish the connection.

Returns

DataFrame: A DataFrame containing the results of the database query based on the task's definition. Returns an empty DataFrame if no results are found or if any required information is missing in task_definition.

Example

To retrieve results for a task, given its function name, arguments, and parameters, along with database connection info:

task_definition = Dict("function" => "computeStats", "arguments" => "arg1", "parameters_dict" => Dict("param1" => "value1"))
connection_info = Dict("host" => "localhost", "db" => "mydatabase")
result = getResultsFromTask(task_definition, connection_info)

This example queries the database for the results of the computeStats function with specified arguments and parameters, returning the results as a DataFrame.
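
Because an empty DataFrame signals either missing results or an incomplete task_definition, callers may want to check for that case; a minimal sketch:

using DataFrames

result = getResultsFromTask(task_definition, connection_info)
if isempty(result)
    @warn "No results found, or task_definition was missing required fields"
end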

FACT_data_project.get_joined_results_dataset (Method)
get_joined_results_dataset(filename::String, connection_info::Dict) -> Dict{String, DataFrame}

Reads task and output configurations from a TOML file specified by filename and retrieves the results for each task, joining them as specified in the output configuration. The function constructs a dictionary mapping each task or join name to its respective DataFrame. Single tasks are added directly, while joins are processed according to the join configurations defined in the TOML file.

Arguments

  • filename::String: The path to the TOML file containing the workflow definition, including tasks and output configurations.
  • connection_info::Dict: A dictionary containing the database connection information needed to retrieve task results.

Returns

A dictionary where keys are task or join names and values are DataFrames representing the results of individual tasks or the results of specified joins among tasks.

Workflow

  • Parses the TOML file to extract tasks and output configurations.
  • Retrieves results for each task defined in the tasks section, using the provided connection_info.
  • Processes joins as defined in the output section, creating a new DataFrame for each join and adding it to the output dictionary with the specified join name.
  • Adds single task results directly to the output dictionary using the task name as the key.

Example

Assuming project.toml defines tasks and their dependencies, and connection_info provides the necessary database connection details:

requested_df = get_joined_results_dataset("project.toml", connection_info)

This will return a dictionary containing DataFrames for each task and join specified in project.toml, ready for further analysis or export.
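
Since the returned dictionary maps task and join names to DataFrames, the results can be inspected by iterating over it; a minimal sketch (nrow is from DataFrames.jl):

using DataFrames

requested_df = get_joined_results_dataset("project.toml", connection_info)
for (name, df) in requested_df
    println(name, ": ", nrow(df), " rows")   # one entry per task or join
end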

FACT_data_project.join_on_matching_columns (Method)
join_on_matching_columns(df_left::DataFrame, df_right::DataFrame; join_type::Symbol=:outer) -> DataFrame

Performs a join operation between two DataFrames (df_left and df_right) based on columns with matching names. The type of join performed can be specified using the join_type parameter, which defaults to an outer join if not provided.

Arguments

  • df_left::DataFrame: The left DataFrame in the join operation.
  • df_right::DataFrame: The right DataFrame in the join operation.
  • join_type::Symbol: (Optional) The type of join to perform, specified as a Symbol. Accepted values include :inner, :left, :right, :outer, :semi, and :anti, mirroring the join types supported by the DataFrames.jl package. Defaults to :outer.

Returns

DataFrame: The result of joining df_left and df_right based on columns with matching names, according to the specified join_type.

Example

Given two DataFrames df1 and df2 with some columns having matching names:

joined_df = join_on_matching_columns(df1, df2, join_type=:inner)

This will return a new DataFrame resulting from an inner join of df1 and df2 on all columns that exist in both DataFrames.
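
The description implies that the join keys are simply the column names present in both inputs. A sketch of that logic in terms of DataFrames.jl's join functions (an illustration of the documented behavior, not the package's actual implementation):

using DataFrames

function join_on_matching_columns_sketch(df_left::DataFrame, df_right::DataFrame;
                                         join_type::Symbol = :outer)
    shared = intersect(names(df_left), names(df_right))   # columns present in both
    join_fn = Dict(:inner => innerjoin, :left => leftjoin, :right => rightjoin,
                   :outer => outerjoin, :semi => semijoin, :anti => antijoin)[join_type]
    return join_fn(df_left, df_right; on = shared)
end

df1 = DataFrame(geo_id = ["a", "b"], population = [100, 200])
df2 = DataFrame(geo_id = ["b", "c"], area = [2.5, 3.0])
join_on_matching_columns_sketch(df1, df2)   # outer join on "geo_id"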

FACT_data_project.parseOutputflow (Method)
parseOutputflow(filePath::String) -> (Dict, Dict)

Parses a workflow file specified by filePath, extracting configurations for tasks and output definitions. This function utilizes the TOML format for the workflow file, providing a structured way to define and retrieve task and output configurations.

Arguments

  • filePath::String: The path to the workflow definition file in TOML format.

Returns

  • (Dict, Dict): A tuple containing two dictionaries. The first dictionary contains the tasks configuration, and the second dictionary contains the output configuration as defined in the workflow file.

Example

To parse a workflow definition from 'workflow.toml' and retrieve task and output configurations:

tasks, output = parseOutputflow("workflow.toml")

This will return the tasks and output configurations as two separate dictionaries, based on the content of 'workflow.toml'.
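
Since the file is plain TOML, the extraction step can be pictured with Julia's TOML standard library (a sketch; the top-level table names follow the tasks/output split described above):

using TOML

config = TOML.parsefile("workflow.toml")
tasks  = get(config, "tasks", Dict{String,Any}())    # task definitions
output = get(config, "output", Dict{String,Any}())   # output/join definitions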

FACT_data_project.preprocess_dataframe_and_remove_shape_obj! (Method)
preprocess_dataframe_and_remove_shape_obj!(df::DataFrame)

Transforms columns of a DataFrame to numerical values, specifically to Float64, except for columns designated to remain as strings. This function is particularly useful for preparing dataset columns for statistical analysis or machine learning models that require numerical input. It attempts to parse each string in the DataFrame to a Float64 and replaces non-convertible strings with NaN to maintain numerical operations' integrity.

Arguments

  • df::DataFrame: The DataFrame to be processed. This function modifies the DataFrame in place.

Behavior

  • Columns named "geo_id" are skipped to preserve any geographic identifiers or other string-based identifiers that should not be converted to numbers.
  • For other columns, the function attempts to convert string representations of numbers into Float64. If a value cannot be parsed as a Float64, it is replaced with NaN (Not a Number), a standard IEEE floating-point representation for undefined or unrepresentable values.

Example

Given a DataFrame df with string-encoded numeric values and identifier columns:

df = DataFrame(geo_id = ["id1", "id2"], measurement = ["1.23", "4.56"], category = ["A", "B"])

Applying preprocess_dataframe_and_remove_shape_obj!(df) converts the measurement column to Float64, replaces the category values with NaN (they cannot be parsed as numbers), and leaves geo_id unchanged.
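
The conversion rule can be pictured with tryparse (a minimal sketch of the documented behavior, leaving aside the shape_obj removal mentioned in the Notes below):

using DataFrames

to_float(x) = something(tryparse(Float64, x), NaN)   # NaN for non-numeric strings

for col in names(df)
    col == "geo_id" && continue          # identifier columns stay as strings
    df[!, col] = to_float.(df[!, col])
end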

Notes

  • This preprocessing removes the "shape_obj" column from the DataFrame if present.
  • It's important to backup your DataFrame before applying this function if you need to preserve the original data.
  • Handling of NaN values should be considered in subsequent data processing or analysis stages.
FACT_data_project.runWorkflow (Method)
runWorkflow(filename::String)

Executes a workflow defined in a file, utilizing the FACTWorkflowManager to handle the execution process. This function wraps the workflow execution in a try-catch block to gracefully handle any errors that may occur during the workflow run.

Arguments

  • filename::String: The path to the workflow definition file to be executed.

Behavior

  • The function attempts to run the workflow specified in filename using the FACTWorkflowManager.
  • If an error occurs during the workflow execution, it catches the exception and prints an error message indicating that the workflow run encountered an issue.

Example

To execute a workflow defined in 'workflow.toml':

runWorkflow("workflow.toml")

This will initiate the execution of the workflow, with error handling in case of issues during the process.
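
The wrapper described above has roughly this shape (a sketch only; the call into FACTWorkflowManager is a hypothetical entry point, and the error message is illustrative):

function runWorkflow_sketch(filename::String)
    try
        FACTWorkflowManager.run(filename)   # hypothetical entry point
    catch e
        println("Workflow run encountered an issue: ", e)
    end
end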

FACT_data_project.write_results_dataset (Method)
write_results_dataset(result::Dict, filename::String)

Writes the contents of the result dictionary, which maps names to DataFrames, to disk in various formats based on the file extension specified in filename. The function supports exporting data to JSON, CSV, Stata, Parquet, and SQLite formats, creating one file per DataFrame in the dictionary, except for SQLite where all DataFrames are written to a single database file.

Arguments

  • result::Dict: A dictionary mapping unique names (as keys) to DataFrames (as values).
  • filename::String: The base name for output files, including the desired extension which determines the format of the output. The actual filenames will be constructed by appending the key from result to the base name.

Supported Formats

  • .json: Exports each DataFrame as a separate JSON file.
  • .csv: Exports each DataFrame as a separate CSV file.
  • .dta: Exports each DataFrame as a separate Stata file.
  • .parquet: Exports each DataFrame as a separate Parquet file.
  • .sqlite: Writes all DataFrames to a single SQLite database file, with each DataFrame in its own table named after its key in result.

Errors

  • Raises an error if the file extension is not supported, indicating the format is unrecognized.

Example

Given a result dictionary resultDict with DataFrames and a filename 'output.json':

write_results_dataset(resultDict, "output.json")

This will export each DataFrame in resultDict to a separate JSON file, named with the corresponding key in resultDict and prefixed with 'output_'.
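
For instance, exporting two DataFrames under one base name (the exact output filenames are inferred from the naming rule above):

using DataFrames

resultDict = Dict(
    "stats"  => DataFrame(a = [1, 2]),
    "joined" => DataFrame(b = [3, 4]),
)
write_results_dataset(resultDict, "output.json")   # presumably writes output_stats.json and output_joined.json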

FACT_data_project.FACT_data_project (Module)

Module FACT_data_project

The FACTDatasetIO module serves as a comprehensive toolkit for managing, processing, and exporting datasets within the FACT framework. It integrates functionality for executing data processing workflows, retrieving results from various data sources, and exporting data to multiple formats. This module acts as a bridge between high-level workflow definitions and the detailed, task-specific operations needed to handle data effectively in the FACT ecosystem.

Dependencies

  • FACTWorkflowManager: Utilized for orchestrating and executing defined workflows.
  • FACTResultsIO: Handles the input/output operations related to data retrieval and storage.
  • External Libraries: Uses TOML for configuration parsing, DataFrames for data manipulation, JSON, CSV, StatFiles, Parquet, and SQLite for data serialization and storage.

Components

  • WorkflowRunner.jl: Implements the logic for running and managing data processing workflows.
  • DataFrameManager.jl: Provides utilities for manipulating DataFrames, including transformations and aggregations.
  • ResultsReader.jl: Contains functionality for reading and interpreting results from data processing tasks.
  • FileWriter.jl: Offers methods for writing data to files in various formats, facilitating data export and sharing.

Functionality

  • runWorkflow: Executes a data processing workflow defined in a TOML file, orchestrating the various tasks according to their dependencies and the specified execution order.
  • get_joined_results_dataset: Retrieves and joins results from multiple data processing tasks, forming a comprehensive dataset ready for analysis or export.

Usage

This module is designed to be used as part of larger data processing and analysis projects within the FACT framework. It can be invoked programmatically from Julia scripts or applications that require robust data handling capabilities.

Example: To execute a workflow and retrieve the results, use the runWorkflow function from the FACTDatasetIO module with the path to your workflow TOML file as the argument. Similarly, to get a comprehensive dataset by joining results from multiple tasks, use the get_joined_results_dataset function, specifying the necessary arguments as documented. Both steps are sketched below.
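
A hedged end-to-end sketch (the connection_info fields are taken from the examples above and are illustrative, not a documented schema):

using FACT_data_project

runWorkflow("workflow.toml")                                   # execute the workflow
connection_info = Dict("host" => "localhost", "db" => "mydatabase")
results = get_joined_results_dataset("workflow.toml", connection_info)
write_results_dataset(results, "results.csv")                  # one CSV per task or join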

This module encapsulates the core functionalities needed for effective data management in the FACT framework, streamlining the process from workflow execution to data export.
