API Reference
FACT_data_project.download_dataset — Method

```julia
download_dataset(project_filename::String, output_filename::String)
```
Downloads and processes a dataset defined in a project file and saves the resulting dataset to an output file in the specified format.
Arguments:
- `project_filename::String`: The name of the project file (with extension) located in the `projects` directory. This file defines the dataset to be downloaded.
- `output_filename::String`: The name of the output file (with extension) where the processed dataset will be saved. This file will be created in the `data` directory.
Workflow:
- Constructs full paths for the input project file and the output file.
- Logs into the registry and retrieves an authentication token.
- Starts a results container and establishes a connection to retrieve data.
- Processes the dataset defined in the project file using the connection.
- Writes the processed dataset to the specified output file in the desired format.
- Stops the results container after the operation is complete.
Supported Output Formats:
- `.json`: Saves the dataset as a JSON file.
- `.csv`: Saves the dataset as a CSV file.
- `.dta`: Saves the dataset in Stata format.
- `.parquet`: Saves the dataset as a Parquet file.
- `.sqlite`: Saves the dataset to an SQLite database.
Example Usage: This example downloads a dataset defined in `example_project.toml` and saves it as `output.parquet`:

```julia
download_dataset("example_project.toml", "output.parquet")
```
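The output format follows the file extension, so the same project file can be written to other formats simply by changing the extension; a usage sketch based on the formats listed above:

```julia
# Same project definition, different output formats (selected by extension).
download_dataset("example_project.toml", "output.csv")
download_dataset("example_project.toml", "output.sqlite")
```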
FACT_data_project.getResultsFromTask — Method

```julia
getResultsFromTask(task_definition::Dict, connection_info::Dict) -> DataFrame
```
Retrieves the results of a specific task from a database, based on the task's definition and connection information. This function establishes a database connection, queries the database for the task's results using the details specified in `task_definition`, and then closes the database connection.
Arguments
- `task_definition::Dict`: A dictionary containing the task's definition, including the function name, arguments, and a parameters dictionary used to query the database.
- `connection_info::Dict`: A dictionary containing the database connection information required to establish the connection.
Returns
- `DataFrame`: A DataFrame containing the results of the database query based on the task's definition. Returns an empty DataFrame if no results are found or if any required information is missing in `task_definition`.
Example
To retrieve results for a task with a specific function name, arguments, and parameters, given database connection info:

```julia
task_definition = Dict(
    "function" => "computeStats",
    "arguments" => "arg1",
    "parameters_dict" => Dict("param1" => "value1"),
)
connection_info = Dict("host" => "localhost", "db" => "mydatabase")
result = getResultsFromTask(task_definition, connection_info)
```

This example queries the database for the results of the `computeStats` function with the specified arguments and parameters, returning them as a DataFrame.
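Since an empty DataFrame is returned when nothing matches, callers may want to guard downstream steps; a small usage sketch:

```julia
using DataFrames

result = getResultsFromTask(task_definition, connection_info)
if isempty(result)
    @warn "No results returned for task $(task_definition["function"])"
else
    show(first(result, 5))  # peek at the first few rows
end
```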
FACT_data_project.get_joined_results_dataset — Method

```julia
get_joined_results_dataset(filename::String, connection_info::Dict) -> Dict{String, DataFrame}
```
Reads task and output configurations from a TOML file specified by `filename` and retrieves the results for each task, joining them as specified in the output configuration. The function constructs a dictionary mapping each task or join name to its respective DataFrame. Single tasks are added directly, while joins are processed according to the join configurations defined in the TOML file.
Arguments
- `filename::String`: The path to the TOML file containing the workflow definition, including tasks and output configurations.
- `connection_info::Dict`: A dictionary containing the database connection information needed to retrieve task results.
Returns
A dictionary where keys are task or join names and values are DataFrames representing the results of individual tasks or the results of specified joins among tasks.
Workflow
- Parses the TOML file to extract tasks and output configurations.
- Retrieves results for each task defined in the tasks section, using the provided `connection_info`.
- Processes joins as defined in the output section, creating a new DataFrame for each join and adding it to the output dictionary under the specified join name.
- Adds single task results directly to the output dictionary using the task name as the key.
Example
Assuming `project.toml` defines tasks and their dependencies, and `connection_info` provides the necessary database connection details:

```julia
requested_df = get_joined_results_dataset("project.toml", connection_info)
```

This will return a dictionary containing DataFrames for each task and join specified in `project.toml`, ready for further analysis or export.
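Because the return value maps names to DataFrames, each entry can be inspected or exported individually; a brief usage sketch:

```julia
using DataFrames

requested_df = get_joined_results_dataset("project.toml", connection_info)
for (name, df) in requested_df
    println("$(name): $(nrow(df)) rows, $(ncol(df)) columns")
end
```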
FACT_data_project.join_on_matching_columns — Method

```julia
join_on_matching_columns(df_left::DataFrame, df_right::DataFrame; join_type::Symbol=:outer) -> DataFrame
```
Performs a join operation between two DataFrames (`df_left` and `df_right`) based on columns with matching names. The type of join can be specified with the `join_type` parameter, which defaults to an outer join if not provided.
Arguments
- `df_left::DataFrame`: The left DataFrame in the join operation.
- `df_right::DataFrame`: The right DataFrame in the join operation.
- `join_type::Symbol`: (Optional) The type of join to perform, specified as a Symbol. Accepted values include `:inner`, `:left`, `:right`, `:outer`, `:semi`, and `:anti`, mirroring the join types supported by the DataFrames.jl package. Defaults to `:outer`.
Returns
- `DataFrame`: The result of joining `df_left` and `df_right` on their matching column names, according to the specified `join_type`.
Example
Given two DataFrames `df1` and `df2` with some columns having matching names:

```julia
joined_df = join_on_matching_columns(df1, df2, join_type=:inner)
```

This returns a new DataFrame resulting from an inner join of `df1` and `df2` on all columns that exist in both DataFrames.
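For illustration, a minimal sketch of the documented behavior built on DataFrames.jl; `join_on_matching_columns_sketch` is a hypothetical stand-in, and the package's actual implementation may differ:

```julia
using DataFrames

# Hypothetical sketch: join two DataFrames on every column name they share.
function join_on_matching_columns_sketch(df_left::DataFrame, df_right::DataFrame;
                                         join_type::Symbol = :outer)
    shared = intersect(names(df_left), names(df_right))  # columns present in both
    join_fn = Dict(:inner => innerjoin, :left => leftjoin, :right => rightjoin,
                   :outer => outerjoin, :semi => semijoin, :anti => antijoin)[join_type]
    return join_fn(df_left, df_right; on = shared)
end

df1 = DataFrame(geo_id = ["id1", "id2"], population = [100, 200])
df2 = DataFrame(geo_id = ["id2", "id3"], area = [5.0, 7.5])
joined_df = join_on_matching_columns_sketch(df1, df2; join_type = :inner)
# -> 1×3 DataFrame: geo_id = "id2", population = 200, area = 5.0
```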
FACT_data_project.parseOutputflow — Method

```julia
parseOutputflow(filePath::String) -> (Dict, Dict)
```
Parses a workflow file specified by `filePath`, extracting configurations for tasks and output definitions. The workflow file uses the TOML format, providing a structured way to define and retrieve task and output configurations.
Arguments
- `filePath::String`: The path to the workflow definition file in TOML format.
Returns
- `(Dict, Dict)`: A tuple of two dictionaries. The first contains the tasks configuration; the second contains the output configuration as defined in the workflow file.
Example
To parse a workflow definition from `workflow.toml` and retrieve the task and output configurations:

```julia
tasks, output = parseOutputflow("workflow.toml")
```

This returns the tasks and output configurations as two separate dictionaries, based on the content of `workflow.toml`.
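For orientation, a minimal sketch of what this parsing might look like with Julia's TOML standard library, assuming the workflow file uses top-level tasks and output tables (the exact schema is an assumption):

```julia
using TOML

# Hypothetical sketch: the real parseOutputflow may validate more strictly.
config = TOML.parsefile("workflow.toml")
tasks  = get(config, "tasks", Dict{String,Any}())   # assumed [tasks] table
output = get(config, "output", Dict{String,Any}())  # assumed [output] table
```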
FACT_data_project.preprocess_dataframe_and_remove_shape_obj! — Method

```julia
preprocess_dataframe_and_remove_shape_obj!(df::DataFrame)
```
Transforms columns of a DataFrame to numerical values, specifically to `Float64`, except for columns designated to remain as strings. This function is particularly useful for preparing dataset columns for statistical analysis or machine learning models that require numerical input. It attempts to parse each string in the DataFrame to a `Float64` and replaces non-convertible strings with `NaN` to preserve the integrity of numerical operations.
Arguments
- `df::DataFrame`: The DataFrame to be processed. This function modifies the DataFrame in place.
Behavior
- Columns named `"geo_id"` are skipped to preserve geographic identifiers or other string-based identifiers that should not be converted to numbers.
- For other columns, the function attempts to convert string representations of numbers into `Float64`. If a value cannot be parsed as a `Float64`, it is replaced with `NaN` (Not a Number), the standard IEEE floating-point representation for undefined or unrepresentable values.
Example
Given a DataFrame `df` with a mix of string-encoded numbers and identifiers:

```julia
df = DataFrame(geo_id = ["id1", "id2"], measurement = ["1.23", "4.56"], category = ["A", "B"])
```

Applying `preprocess_dataframe_and_remove_shape_obj!(df)` results in `df` having its `measurement` column converted to `Float64`, its `category` values replaced by `NaN` (since they cannot be parsed as numbers), and `geo_id` left unchanged.
Notes
- This preprocessing removes the "shape_obj" column from the DataFrame if present.
- Back up your DataFrame before applying this function if you need to preserve the original data.
- Handling of `NaN` values should be considered in subsequent data processing or analysis stages.
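For illustration, a minimal sketch of the documented conversion logic; `preprocess_sketch!` is a hypothetical stand-in, and the actual function may handle column types differently:

```julia
using DataFrames

# Hypothetical sketch of the documented behavior: drop "shape_obj" if present,
# skip "geo_id", and parse everything else to Float64 with NaN as the fallback.
function preprocess_sketch!(df::DataFrame)
    "shape_obj" in names(df) && select!(df, Not("shape_obj"))
    for col in names(df)
        col == "geo_id" && continue
        df[!, col] = [something(tryparse(Float64, string(v)), NaN) for v in df[!, col]]
    end
    return df
end

df = DataFrame(geo_id = ["id1", "id2"], measurement = ["1.23", "4.56"], category = ["A", "B"])
preprocess_sketch!(df)  # measurement -> [1.23, 4.56], category -> [NaN, NaN]
```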
FACT_data_project.runWorkflow — Method

```julia
runWorkflow(filename::String)
```
Executes a workflow defined in a file, utilizing the FACTWorkflowManager to handle the execution process. This function wraps the workflow execution in a try-catch block to gracefully handle any errors that may occur during the workflow run.
Arguments
- `filename::String`: The path to the workflow definition file to be executed.
Behavior
- The function attempts to run the workflow specified in `filename` using the FACTWorkflowManager.
- If an error occurs during the workflow execution, it catches the exception and prints an error message indicating that the workflow run encountered an issue.
Example
To execute a workflow defined in `workflow.toml`:

```julia
runWorkflow("workflow.toml")
```

This initiates the execution of the workflow, with error handling in case of issues during the process.
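The described error handling amounts to a try-catch wrapper; a sketch of its shape (the FACTWorkflowManager call shown is an assumption, not the actual API):

```julia
# Hypothetical sketch; the real entry point into FACTWorkflowManager may differ.
function run_workflow_sketch(filename::String)
    try
        FACTWorkflowManager.runWorkflow(filename)   # assumed manager call
    catch err
        println("Workflow run failed for $(filename): $(err)")
    end
end
```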
FACT_data_project.write_results_dataset — Method

```julia
write_results_dataset(result::Dict, filename::String)
```
Writes the contents of the `result` dictionary, which maps names to DataFrames, to disk in a format determined by the file extension in `filename`. The function supports exporting data to JSON, CSV, Stata, Parquet, and SQLite formats, creating one file per DataFrame in the dictionary, except for SQLite, where all DataFrames are written to a single database file.
Arguments
- `result::Dict`: A dictionary mapping unique names (as keys) to DataFrames (as values).
- `filename::String`: The base name for output files, including the desired extension, which determines the output format. The actual filenames are constructed by appending each key from `result` to the base name.
Supported Formats
- `.json`: Exports each DataFrame as a separate JSON file.
- `.csv`: Exports each DataFrame as a separate CSV file.
- `.dta`: Exports each DataFrame as a separate Stata file.
- `.parquet`: Exports each DataFrame as a separate Parquet file.
- `.sqlite`: Writes all DataFrames to a single SQLite database file, with each DataFrame in its own table named after its key in `result`.
Errors
- Raises an error if the file extension is not supported, indicating the format is unrecognized.
Example
Given a result dictionary `resultDict` of DataFrames and the filename `output.json`:

```julia
write_results_dataset(resultDict, "output.json")
```

This exports each DataFrame in `resultDict` to a separate JSON file, named with the corresponding key in `resultDict` and prefixed with `output_`.
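A usage sketch with multiple DataFrames, following the naming convention described above (the exact separator in generated filenames is an assumption):

```julia
using DataFrames

resultDict = Dict(
    "stats"  => DataFrame(geo_id = ["id1", "id2"], mean = [1.2, 3.4]),
    "counts" => DataFrame(geo_id = ["id1", "id2"], n = [42, 7]),
)
write_results_dataset(resultDict, "output.csv")
# Expected to produce one CSV per key, e.g. output_stats.csv and output_counts.csv.
```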
FACT_data_project.FACT_data_project — Module

Module `FACT_data_project`

The FACT_data_project module serves as a comprehensive toolkit for managing, processing, and exporting datasets within the FACT framework. It integrates functionality for executing data processing workflows, retrieving results from various data sources, and exporting data to multiple formats. This module acts as a bridge between high-level workflow definitions and the detailed, task-specific operations needed to handle data effectively in the FACT ecosystem.
Dependencies
- FACTWorkflowManager: Utilized for orchestrating and executing defined workflows.
- FACTResultsIO: Handles the input/output operations related to data retrieval and storage.
- External Libraries: Uses TOML for configuration parsing, DataFrames for data manipulation, and JSON, CSV, StatFiles, Parquet, and SQLite for data serialization and storage.
Components
- WorkflowRunner.jl: Implements the logic for running and managing data processing workflows.
- DataFrameManager.jl: Provides utilities for manipulating DataFrames, including transformations and aggregations.
- ResultsReader.jl: Contains functionality for reading and interpreting results from data processing tasks.
- FileWriter.jl: Offers methods for writing data to files in various formats, facilitating data export and sharing.
Functionality
- runWorkflow: Executes a data processing workflow defined in a TOML file, orchestrating the various tasks according to their dependencies and the specified execution order.
- get_joined_results_dataset: Retrieves and joins results from multiple data processing tasks, forming a comprehensive dataset ready for analysis or export.
Usage
This module is designed to be used as part of larger data processing and analysis projects within the FACT framework. It can be invoked programmatically from Julia scripts or applications that require robust data handling capabilities.
Example: To execute a workflow and retrieve the results, use the runWorkflow function from this module with the path to your workflow TOML file as the argument. Similarly, to get a comprehensive dataset by joining results from multiple tasks, use the get_joined_results_dataset function, specifying the necessary arguments as documented.
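Putting the documented pieces together, an end-to-end usage sketch (connection details are placeholders):

```julia
using FACT_data_project

runWorkflow("workflow.toml")                       # execute the defined tasks
connection_info = Dict("host" => "localhost", "db" => "mydatabase")  # placeholder details
results = get_joined_results_dataset("workflow.toml", connection_info)
write_results_dataset(results, "output.parquet")   # one Parquet file per entry
```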
This module encapsulates the core functionalities needed for effective data management in the FACT framework, streamlining the process from workflow execution to data export.