A Java-based application that uses LLMs (via LangChain4j and Ollama) to analyze data files (.csv and .tab) and identify variables containing location and time/date information.
The application scans a specified directory (recursively), reads each data file, and uses an LLM to determine if any of the columns represent:
- Location: Cities, countries, coordinates, addresses, latitude, longitude, etc.
- Time/Date: Years, months, timestamps, dates, durations, etc.
It outputs whether each file meets the requirements (contains both a location and a time variable).
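The file-level rule above can be illustrated with a minimal sketch. The names ComplianceSketch, ColumnResult, and meetsRequirements are hypothetical and are not taken from the project's source; the sketch only assumes that each column has already been classified as location and/or time.

```java
import java.util.List;

// Hypothetical sketch: the class, record, and method names are illustrative only.
final class ComplianceSketch {

    // One classified column: its header label and the two per-column verdicts.
    record ColumnResult(String label, boolean location, boolean time) {}

    // A file meets the requirements only if at least one column is a location
    // AND at least one column is a time/date.
    static boolean meetsRequirements(List<ColumnResult> columns) {
        boolean hasLocation = columns.stream().anyMatch(ColumnResult::location);
        boolean hasTime = columns.stream().anyMatch(ColumnResult::time);
        return hasLocation && hasTime;
    }
}
```

Under this sketch, a file with only a city column and no time column would not meet the requirements.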
- Java 17 or higher.
- Maven for building the project.
- Ollama installed and running locally (or accessible via network).
  - Ensure you have the llama3.2 model (or your preferred model) pulled in Ollama: ollama pull llama3.2
The application is configured via src/main/resources/application.properties:
- ollama.url: The base URL for the Ollama API (default: http://localhost:11434).
- ollama.model: The LLM model to use (default: llama3.2).
- analyzer.search-root: The root directory to scan for data files (default: data).
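As a rough illustration of how these three keys and their documented defaults could be read, here is a minimal sketch using java.util.Properties. The AnalyzerConfigSketch class and its Config record are hypothetical; the project may load its configuration differently.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

// Hypothetical sketch: the project's real configuration loading may differ.
final class AnalyzerConfigSketch {

    // Holds the three documented settings.
    record Config(String ollamaUrl, String ollamaModel, String searchRoot) {}

    // Reads the properties from any Reader and falls back to the documented
    // defaults for keys that are missing.
    static Config load(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        return new Config(
                props.getProperty("ollama.url", "http://localhost:11434"),
                props.getProperty("ollama.model", "llama3.2"),
                props.getProperty("analyzer.search-root", "data"));
    }
}
```

Passing a reader over a file that sets only ollama.model would leave the URL and search root at their defaults.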
    mvn clean compile
    mvn exec:java

Alternatively, if you've already compiled:

    mvn exec:java -Dexec.mainClass="edu.harvard.iq.datacommons.analyzer.Application"

- Scanning: The AnalyzerService walks the directory tree starting from analyzer.search-root.
- Parsing: For each .csv or .tab file, it reads the header and the first 5 rows of data.
- LLM Analysis: For each column, it sends a prompt to the Ollama model (using LangChain4j) containing the column label and sample values.
- Classification: The LLM responds with YES or NO to classify if the column represents a location or time/date.
- Results: The application prints the analysis results for each file to the console.
- Copying: If a file is identified as having both a location and a time/date variable, it is copied to a new directory named DataCommonsReady.
  - The original directory structure is preserved within DataCommonsReady.
  - For example, if data/subdir/file.csv meets the requirements, it will be copied to DataCommonsReady/subdir/file.csv.
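The parsing and classification steps above could be sketched roughly as follows. This is not the project's AnalyzerService: the class and method names, the prompt wording, and the naive delimiter split (which ignores quoted fields) are all assumptions. The LLM call is represented by a plain Function<String, String> stand-in rather than a real LangChain4j model, so the sketch runs without Ollama.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical sketch of the Parsing, LLM Analysis, and Classification steps.
final class ColumnSamplerSketch {

    static final int SAMPLE_ROWS = 5; // the README states the first 5 data rows are read

    // A column label plus the sample values collected for it.
    record Column(String label, List<String> samples) {}

    // Reads the header line and up to SAMPLE_ROWS data rows.
    // Assumption: a simple split on the delimiter ("," for .csv, "\t" for .tab).
    static List<Column> sample(Reader source, String delimiter) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        List<Column> columns = new ArrayList<>();
        String headerLine = reader.readLine();
        if (headerLine == null) {
            return columns; // empty file: no columns
        }
        String splitter = Pattern.quote(delimiter);
        for (String label : headerLine.split(splitter, -1)) {
            columns.add(new Column(label.trim(), new ArrayList<>()));
        }
        String line;
        int rows = 0;
        while (rows < SAMPLE_ROWS && (line = reader.readLine()) != null) {
            String[] cells = line.split(splitter, -1);
            for (int i = 0; i < columns.size() && i < cells.length; i++) {
                columns.get(i).samples().add(cells[i].trim());
            }
            rows++;
        }
        return columns;
    }

    // Builds a prompt from the label and samples; the wording is illustrative only.
    static String buildPrompt(Column column, String concept) {
        return "Column label: " + column.label() + "\n"
                + "Sample values: " + String.join(", ", column.samples()) + "\n"
                + "Does this column represent " + concept + "? Answer YES or NO.";
    }

    // Treats any reply that starts with YES (ignoring case and whitespace) as positive.
    static boolean isYes(String reply) {
        return reply != null && reply.trim().toUpperCase(Locale.ROOT).startsWith("YES");
    }

    // 'model' stands in for the real LLM call: it maps a prompt to the reply text.
    static boolean classify(Column column, String concept, Function<String, String> model) {
        return isYes(model.apply(buildPrompt(column, concept)));
    }
}
```

In a real run, the function passed to classify would wrap the configured Ollama model. A lenient check such as isYes is one way to tolerate replies like "Yes." that do not match the requested format exactly.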
The DataCommonsReady directory will contain all files that are deemed "compliant" (containing both Location and Time data). This is useful for downstream processing that requires these specific dimensions.
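One way to copy a compliant file while preserving its path relative to the search root is sketched below with java.nio.file. The ReadyCopySketch class and its method names are hypothetical, and overwriting an existing copy (REPLACE_EXISTING) is an assumption rather than documented behavior.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the copy step; not taken from the project's source.
final class ReadyCopySketch {

    // Maps a file under searchRoot to the same relative location under outputRoot,
    // e.g. data/subdir/file.csv -> DataCommonsReady/subdir/file.csv.
    static Path targetFor(Path searchRoot, Path outputRoot, Path file) {
        return outputRoot.resolve(searchRoot.relativize(file));
    }

    // Creates any missing parent directories, then copies the file.
    // Assumption: an existing copy at the target is overwritten.
    static Path copyPreservingStructure(Path searchRoot, Path outputRoot, Path file)
            throws IOException {
        Path target = targetFor(searchRoot, outputRoot, file);
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Relativizing against the search root is what keeps subdirectories such as subdir/ intact in the output tree. Both paths passed to relativize should be of the same kind (both relative or both absolute).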
- src/main/java/edu/harvard/iq/datacommons/analyzer/Application.java: Main entry point.
- src/main/java/edu/harvard/iq/datacommons/analyzer/AnalyzerService.java: Core analysis logic.
- src/main/resources/application.properties: Configuration settings.
- pom.xml: Maven dependencies and build configuration.
- data/: Sample data directory.