ARFF (Attribute-Relation File Format) is the most common data format used in Weka. Each ARFF file must begin with a header that describes the attributes of each data instance. The following attribute types are supported:
Numeric: real or integer numbers.
Nominal: the declaration must list the set of possible values. For example:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
String: arbitrary string values, usually processed later with the StringToWordVector filter.
Date: date values. The format can be given as a pattern, as with Java's SimpleDateFormat; if no pattern is given, it defaults to ISO-8601.
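Because date patterns follow SimpleDateFormat, they can be tried out in plain Java before being written into a header. The sketch below is only an illustration and does not use Weka: the class name ArffDateSketch, the sample values and the second pattern are made up for this example, and the default pattern is assumed to be yyyy-MM-dd'T'HH:mm:ss, the ISO-8601 combined form.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

final class ArffDateSketch {
    // Assumed default pattern for a date attribute declared without a format.
    static final String DEFAULT_PATTERN = "yyyy-MM-dd'T'HH:mm:ss";

    // Parses a date value with the given SimpleDateFormat pattern.
    // UTC is fixed here only to make the result independent of the local time zone.
    static Date parse(String value, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setLenient(false); // reject impossible dates such as month 13
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException(
                "Value '" + value + "' does not match pattern " + pattern, e);
        }
    }

    public static void main(String[] args) {
        // A value in the default format, as it could appear in the data section.
        System.out.println(parse("2001-04-03T12:12:12", DEFAULT_PATTERN));

        // A value for a hypothetical attribute declared with its own pattern, e.g.
        // @ATTRIBUTE timestamp DATE "dd.MM.yyyy HH:mm"
        System.out.println(parse("03.04.2001 12:12", "dd.MM.yyyy HH:mm"));
    }
}
```

If a value does not match its pattern, parse throws an IllegalArgumentException, which is a convenient way to catch a mismatch between a header and its data early.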
An example header can be seen as follows:
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Following the header, each instance must be listed with the correct number of attribute values; if an attribute's value for an instance is not known, a ? can be used instead. The following shows an example of the set of instances in an ARFF file:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
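To make the relationship between header and data concrete, here is a minimal plain-Java sketch that checks one data row against the number of attributes declared in the header and turns ? into a missing value. It is not how Weka parses ARFF files: the class name ArffRowSketch is made up for this example, and the naive comma split ignores quoted values and the sparse ARFF format.

```java
import java.util.ArrayList;
import java.util.List;

final class ArffRowSketch {
    // Splits one data line into its values; "?" is returned as null (missing).
    // Throws if the row does not have exactly one value per declared attribute.
    static List<String> parseRow(String line, int expectedAttributes) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if (fields.length != expectedAttributes) {
            throw new IllegalArgumentException("Expected " + expectedAttributes
                + " values but found " + fields.length + ": " + line);
        }
        List<String> values = new ArrayList<>();
        for (String field : fields) {
            String trimmed = field.trim();
            values.add(trimmed.equals("?") ? null : trimmed);
        }
        return values;
    }

    public static void main(String[] args) {
        // The iris header above declares five attributes.
        System.out.println(parseRow("5.1,3.5,1.4,0.2,Iris-setosa", 5));
        // The same kind of row with an unknown sepal width.
        System.out.println(parseRow("4.9,?,1.4,0.2,Iris-setosa", 5));
    }
}
```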
Depending on the version of Weka being used, different methods for loading ARFF files should be used.
The following sample code shows how to load an ARFF file:
import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;
...
BufferedReader reader = new BufferedReader(new FileReader("data.arff"));
Instances data = new Instances(reader);
reader.close();
data.setClassIndex(data.numAttributes() - 1);
The class index indicates which attribute should be used as the class for classification. In most ARFF files this is the last attribute, which is why it is set to data.numAttributes() - 1. You must set the class index before using a Weka function such as buildClassifier.
In more recent versions of Weka, the DataSource class makes loading an ARFF file simpler. It can also load CSV files and any other file format that Weka can read.
import weka.core.converters.ConverterUtils.DataSource;
...
DataSource source = new DataSource("data.arff");
Instances data = source.getDataSet();
// If no class index has been set yet, use the last attribute as the class.
if (data.classIndex() == -1) {
    data.setClassIndex(data.numAttributes() - 1);
}
Weka can read data from many databases. First, edit the DatabaseUtils.props file to match your database; specifically, you must provide the database's name, location, port and the correct JDBC driver.
jdbcDriver=org.gjt.mm.mysql.Driver
jdbcURL=jdbc:mysql://localhost:3306/my_database
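Since a missing or misspelled entry here tends to surface only when the query is run, it can help to sanity-check the two settings first. The following plain-Java sketch does so with java.util.Properties; it is not part of Weka, the class name DatabasePropsSketch is made up for this example, and it takes the file content as a string so that it stays self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class DatabasePropsSketch {
    // Parses text in the standard Java properties syntax.
    static Properties parse(String propsText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the trimmed value of a setting, or throws if it is absent or blank.
    static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing setting: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        // Same two entries as shown above.
        String text = "jdbcDriver=org.gjt.mm.mysql.Driver\n"
                    + "jdbcURL=jdbc:mysql://localhost:3306/my_database\n";
        Properties props = parse(text);
        System.out.println("Driver: " + require(props, "jdbcDriver"));
        System.out.println("URL:    " + require(props, "jdbcURL"));
    }
}
```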
The data can then be loaded from the database with a few lines of code.
import weka.core.Instances;
import weka.experiment.InstanceQuery;
...
InstanceQuery query = new InstanceQuery();
query.setUsername("user");
query.setPassword("pass");
query.setQuery("select * from mytable");
Instances data = query.retrieveInstances();
Some notes about loading from a database:
InstanceQuery converts VARCHAR columns to nominal attributes and TEXT columns to string attributes. A filter, such as NominalToString or StringToNominal, can convert the attributes to the required type.