Package libai.classifiers.dataset
Interface DataSet
-
- All Superinterfaces:
java.lang.Iterable<java.util.List<Attribute>>
- All Known Implementing Classes:
MySQLDataSet, TextFileDataSet
public interface DataSet extends java.lang.Iterable<java.util.List<Attribute>>
DataSet is a container composed of examples. Each example (or DataRecord) is a set of named attributes, either discrete or continuous, that represents a particular case of some type of class object. A DataSet contains methods to manipulate those DataRecords and to calculate information gain and entropy over the data set.
TODO:
- methods to read from .names/.data files
- methods to export in .names/.data format
- methods to read from database tables?
- optimizations in the sorting to avoid multiple sorts.
-
-
Method Summary
Modifier and Type Method
- Description
Attribute allTheSame()
- Returns the most common output attribute if the rest of the attributes are exactly the same over the whole data set.
boolean allTheSameOutput()
void close()
- Closes the underlying data source if possible.
java.util.HashMap<Attribute,java.lang.Integer> getFrequencies(int lo, int hi, int fieldIndex)
- Gets a map between the different values of the attribute at the fieldIndex and their respective frequencies.
int getItemsCount()
MetaData getMetaData()
int getOutputIndex()
DataSet getSubset(int lo, int hi)
- Create a slice of this data set as a new data set from [lo, hi).
java.lang.Iterable<java.util.List<Attribute>> sortOver(int fieldIndex)
- Sort the data set over the fieldIndex as primary key and the output index to break any ties.
java.lang.Iterable<java.util.List<Attribute>> sortOver(int lo, int hi, int fieldIndex)
- Sort the data set over the fieldIndex as primary key and the output index to break any ties, and limit the elements to [lo, hi).
DataSet[] splitKeepingRelation(double proportion)
- Splits this data set into two new dataset where the proportion between the output classes is kept.
-
-
-
Method Detail
-
getOutputIndex
int getOutputIndex()
- Returns:
- The index of the field used as output.
-
getItemsCount
int getItemsCount()
- Returns:
- The total count of elements in this data set.
-
getMetaData
MetaData getMetaData()
- Returns:
- A MetaData object containing information about the attributes of the data set.
-
sortOver
java.lang.Iterable<java.util.List<Attribute>> sortOver(int fieldIndex)
Sort the data set over the fieldIndex as primary key and the output index to break any ties. This index is remembered for future internal references.
- Parameters:
fieldIndex
- The field to sort over.
- Returns:
- An iterable representation of this data set sorted over the field index.
-
sortOver
java.lang.Iterable<java.util.List<Attribute>> sortOver(int lo, int hi, int fieldIndex)
Sort the data set over the fieldIndex as primary key and the output index to break any ties, and limit the elements to [lo, hi). This index is remembered for future internal references.
- Parameters:
lo
- The lower bound (inclusive) of the data set to be returned.
hi
- The upper bound (exclusive) of the data set to be returned.
fieldIndex
- The field to sort over.
- Returns:
- An iterable representation of this data set sorted over the field index, limited to [lo, hi).
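The ordering described for both sortOver overloads can be sketched with a plain comparator. This is only an illustration under stated assumptions: records are modeled as List<String> instead of List<Attribute>, SortOverSketch and its method are hypothetical names that are not part of libai, and [lo, hi) is assumed to select positions in the sorted order, which the description above does not spell out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in, not part of libai: each record is a list of strings.
final class SortOverSketch {

    // Sorts by the value at fieldIndex, breaks ties with the value at outputIndex,
    // then returns the records at positions [lo, hi) of the sorted order.
    // Assumption: the range is applied after sorting.
    static List<List<String>> sortOver(List<List<String>> records,
                                       int lo, int hi,
                                       int fieldIndex, int outputIndex) {
        Comparator<List<String>> order = Comparator
                .comparing((List<String> r) -> r.get(fieldIndex))
                .thenComparing((List<String> r) -> r.get(outputIndex));
        List<List<String>> sorted = new ArrayList<>(records); // leave the input untouched
        sorted.sort(order);
        return new ArrayList<>(sorted.subList(lo, hi));
    }
}
```

Calling the sketch with lo = 0 and hi equal to the number of records corresponds to the single-argument overload.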
-
getSubset
DataSet getSubset(int lo, int hi)
Create a slice of this data set as a new data set from [lo, hi).
- Parameters:
lo
- The lower bound (inclusive) of the data set to be returned.
hi
- The upper bound (exclusive) of the data set to be returned.
- Returns:
- A new data set that is a copy of this one over [lo, hi).
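The half-open range [lo, hi) means that the record at position lo is included and the record at position hi is not, so the slice holds hi - lo records. A minimal sketch of that convention follows, using a plain list as a stand-in for a DataSet; SubsetSketch is a hypothetical name and not part of libai.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not part of libai.
final class SubsetSketch {

    // Copies the records at positions lo, lo + 1, ..., hi - 1 into a new list.
    static <T> List<T> getSubset(List<T> records, int lo, int hi) {
        return new ArrayList<>(records.subList(lo, hi));
    }
}
```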
-
allTheSameOutput
boolean allTheSameOutput()
- Returns:
- True if all the classes (value of output index) are the same.
-
allTheSame
Attribute allTheSame()
Returns the most common output attribute if the rest of the attributes are exactly the same over the whole data set. If any record differs from the rest in even a single non-output attribute, this method returns null.
- Returns:
- The most common output attribute, or null if any record differs from the rest.
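The rule above can be sketched as follows. This is an illustration only: records are modeled as equal-length List<String> values, AllTheSameSketch is a hypothetical name that is not part of libai, and the handling of an empty data set and of ties between output values is an assumption, since the description does not cover either case.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not part of libai: each record is a list of strings.
final class AllTheSameSketch {

    // Returns the most frequent value at outputIndex when every other field is
    // identical across all records; returns null as soon as one field differs.
    static String allTheSame(List<List<String>> records, int outputIndex) {
        if (records.isEmpty()) {
            return null; // assumption: nothing to report for an empty data set
        }
        List<String> first = records.get(0);
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> record : records) {
            for (int i = 0; i < record.size(); i++) {
                if (i != outputIndex && !record.get(i).equals(first.get(i))) {
                    return null; // one differing non-output attribute is enough
                }
            }
            counts.merge(record.get(outputIndex), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) { // ties are resolved arbitrarily
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```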
-
splitKeepingRelation
DataSet[] splitKeepingRelation(double proportion)
Splits this data set into two new data sets in which the proportion between the output classes is kept. The first data set contains the given proportion of the original data set. For instance, if the data set has 100 elements distributed between 2 classes in a 60/40 proportion, the first data set will contain (60*proportion + 40*proportion) elements and the second data set will contain the rest. This method is useful for generating training/test sets from one large data set.
- Parameters:
proportion
- The proportion of the elements of each class to keep in the first data set.
- Returns:
- An array of 2 positions with the data sets described above.
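As a worked instance of the formula above, a proportion of 0.5 applied to the 60/40 example puts 30 + 20 = 50 elements in the first data set and leaves 50 for the second. The sketch below shows one way such a class-preserving split could be computed. It is not the libai implementation: StratifiedSplitSketch is a hypothetical name, records are modeled as List<String>, and both the rounding rule and the choice of which records of a class go first are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not part of libai: each record is a list of strings.
final class StratifiedSplitSketch {

    // Returns two lists: the first holds about `proportion` of the records of
    // each output class, the second holds the remaining records.
    // Assumption: proportion lies in [0, 1].
    static List<List<List<String>>> split(List<List<String>> records,
                                          int outputIndex, double proportion) {
        // Group the records by their output class, keeping encounter order.
        Map<String, List<List<String>>> byClass = new LinkedHashMap<>();
        for (List<String> record : records) {
            byClass.computeIfAbsent(record.get(outputIndex), k -> new ArrayList<>())
                   .add(record);
        }
        List<List<String>> first = new ArrayList<>();
        List<List<String>> second = new ArrayList<>();
        for (List<List<String>> group : byClass.values()) {
            // Assumption: round to the nearest whole record per class.
            int keep = (int) Math.round(group.size() * proportion);
            first.addAll(group.subList(0, keep));
            second.addAll(group.subList(keep, group.size()));
        }
        return List.of(first, second);
    }
}
```

A real training/test split would usually shuffle each class before cutting it; the sketch omits that step to stay deterministic.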
-
getFrequencies
java.util.HashMap<Attribute,java.lang.Integer> getFrequencies(int lo, int hi, int fieldIndex)
Gets a map between the different values of the attribute at the fieldIndex and their respective frequencies. It limits the count space to [lo, hi).
- Parameters:
lo
- The lower bound (inclusive) of the range to count.
hi
- The upper bound (exclusive) of the range to count.
fieldIndex
- The field to count.
- Returns:
- A map with the different values and their respective frequencies.
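A frequency map over [lo, hi) is also the usual input for the entropy calculation mentioned in the interface description. The sketch below counts values over the half-open range and then computes Shannon entropy in bits from the counts. FrequenciesSketch and its entropy helper are hypothetical names: the interface documented here declares getFrequencies but no entropy method, and records are modeled as List<String> instead of List<Attribute>.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not part of libai: each record is a list of strings.
final class FrequenciesSketch {

    // Counts how often each value of the field appears in records[lo .. hi - 1].
    static Map<String, Integer> getFrequencies(List<List<String>> records,
                                               int lo, int hi, int fieldIndex) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (int i = lo; i < hi; i++) {
            frequencies.merge(records.get(i).get(fieldIndex), 1, Integer::sum);
        }
        return frequencies;
    }

    // Shannon entropy in bits: H = -sum(p * log2(p)) over the observed values.
    static double entropy(Map<String, Integer> frequencies) {
        int total = 0;
        for (int count : frequencies.values()) {
            total += count;
        }
        double h = 0.0;
        for (int count : frequencies.values()) {
            double p = (double) count / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h; // 0.0 for an empty map or a single distinct value
    }
}
```

For four records whose field values are a, a, b, b, counting over [0, 4) gives two values with frequency 2 each and an entropy of 1 bit, while counting over [0, 2) gives a single value and an entropy of 0.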
-
close
void close()
Closes the underlying data source if possible.
-
-