String utilities

Utilities for working with strings

Encoding

encode_sequence(sequence, encoding_map, …) Extract the amino acid properties of the given sequence
encode_all_sequences(sequences, …) Extract the amino acid feature vectors for each peptide sequence

Length manipulation

pad_sequence(seq, max_seq_len, pad_value, align) Pad seq to max_seq_len with value based on the align strategy
pad_trim_sequences(seq_vec, pad_value, …) Pad and/or trim a list of sequences to have common length
trim_sequence(seq, maxlen, align) Trim seq to at most maxlen characters using align strategy

Other operations

simple_fill(text, width) Split text into equal-sized chunks of length width
split(s, delimiters, maxsplit) Split s on any of the given delimiters

Human-readable data type helpers

bytes2human(n, format) Convert n bytes to a human-readable format
human2bytes(s) Convert a human-readable byte string to an integer
try_parse_float(s) Convert s to a float, if possible
try_parse_int(s) Convert s to an integer, if possible
str2bool(s) Convert s to a boolean value, if possible

Definitions

pyllars.string_utils.bytes2human(n: int, format: str = '%(value)i%(symbol)s') → str

Convert n bytes to a human-readable format

This code is adapted from: http://goo.gl/zeJZl

Parameters:
  • n (int) – The number of bytes
  • format (string) – The format string
Returns:

human_str – A human-readable version of the number of bytes

Return type:

string

Examples

>>> bytes2human(10000)
'9K'
>>> bytes2human(100001221)
'95M'
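The doctests above are consistent with binary (1024-based) prefixes and a format string that receives `value` and `symbol` keys. The block below is a minimal sketch of that idea, not the pyllars source; the name `bytes2human_sketch` and the symbol table are assumptions made for illustration.

```python
def bytes2human_sketch(n: int, format: str = '%(value)i%(symbol)s') -> str:
    """Illustrative re-implementation; not the pyllars source."""
    # Assumed symbol table: binary prefixes, each a factor of 1024 apart.
    symbols = ('B', 'K', 'M', 'G', 'T', 'P', 'E', 'Z', 'Y')
    # Try the largest unit first, so the first match is the best fit.
    for power in range(len(symbols) - 1, 0, -1):
        unit = 1 << (10 * power)
        if n >= unit:
            return format % {'value': n / unit, 'symbol': symbols[power]}
    return format % {'value': n, 'symbol': symbols[0]}


print(bytes2human_sketch(10000))                            # 9K
print(bytes2human_sketch(10000, '%(value).1f %(symbol)s'))  # 9.8 K
```

The default format uses `%i`, which truncates the value toward zero; this is why 10000 bytes (about 9.77 K) is reported as '9K'. Passing a format such as '%(value).1f %(symbol)s' keeps one decimal place.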
pyllars.string_utils.encode_all_sequences(sequences: Iterable[str], encoding_map: Mapping[str, numpy.ndarray], maxlen: Optional[int] = None, align: str = 'start', pad_value: str = 'J', same_length: bool = False, flatten: bool = False, return_as_numpy: bool = True, swap_axes: bool = False, progress_bar: bool = True) → Union[numpy.ndarray, List]

Extract the amino acid feature vectors for each peptide sequence

See get_peptide_aa_features for more details.

Parameters:
  • sequences (typing.Iterable[str]) – The sequences
  • encoding_map (typing.Mapping[str, numpy.ndarray]) – The features for each character
  • maxlen (typing.Optional[int]) –
  • align (str) –
  • pad_value (str) –
  • same_length (bool) –
  • flatten (bool) – Whether to (attempt to) convert the features of each peptide into a single long vector (True) or leave as a (presumably) 2d position-feature vector.
  • return_as_numpy (bool) – Whether to return as a 2d or 3d numpy array (True) or a list containing 1d or 2d numpy arrays. (The dimensionality depends upon flatten.)
  • swap_axes (bool) –

    If the values are returned as a numpy tensor, swap axes 1 and 2.

    N.B. This flag is only compatible with return_as_numpy=True and flatten=False.

  • progress_bar (bool) – Whether to show a progress bar for collecting the features.
Returns:

all_encoded_peptides – The resulting features. See the flatten and return_as_numpy parameters for the expected output.

Return type:

typing.Union[numpy.ndarray, typing.List]
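To make the output shapes concrete, the sketch below builds the tensors that flatten and swap_axes describe, using plain numpy rather than the pyllars function itself. The two-feature `encoding_map`, the example sequences, and the right-padding with 'J' are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical encoding: 2 features per character, 'J' as the padding character.
encoding_map = {
    'A': np.array([1.0, 0.0]),
    'C': np.array([0.0, 1.0]),
    'J': np.array([0.0, 0.0]),
}

sequences = ['AC', 'CAA', 'A']
maxlen = max(len(s) for s in sequences)

# Assumption: shorter sequences are padded on the right to a common length.
padded = [s.ljust(maxlen, 'J') for s in sequences]

# Position-feature tensor with shape (n_sequences, maxlen, n_features).
tensor = np.stack([np.stack([encoding_map[c] for c in s]) for s in padded])

# Swapping axes 1 and 2 gives (n_sequences, n_features, maxlen).
swapped = np.swapaxes(tensor, 1, 2)

# Flattening gives one long vector per sequence: (n_sequences, maxlen * n_features).
flattened = tensor.reshape(len(padded), -1)

print(tensor.shape, swapped.shape, flattened.shape)  # (3, 3, 2) (3, 2, 3) (3, 6)
```

Once each sequence has been flattened there is no separate position axis left to swap, which is in line with the note that swap_axes requires flatten=False.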

pyllars.string_utils.encode_sequence(sequence: str, encoding_map: Mapping[str, numpy.ndarray], flatten: bool = False) → numpy.ndarray

Extract the amino acid properties of the given sequence

This function is designed with the idea of mapping from a sequence to numeric features (such as chemical properties or BLOSUM features for amino acid sequences). It may fail if other features are included in encoding_map.

Parameters:
  • sequence (str) – The sequence
  • encoding_map (typing.Mapping[str, numpy.ndarray]) – A mapping from each character to a set of features. Presumably, the features are numpy-like arrays, though they need not be.
  • flatten (bool) – Whether to flatten the encoded sequence into a single, 1d array or leave them as-is.
Returns:

encoded_sequence – A 1d or 2d np.array, depending on flatten. By default (flatten=False), this is a 1d array of objects, in which the outer dimension indexes the position in the epitope. If flatten is True, then the function attempts to reshape the features into a single long feature vector. This will likely fail if the encoding_map values are not numpy-like arrays.

Return type:

numpy.ndarray
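The following sketch shows the two layouts that flatten chooses between. It is an illustration under stated assumptions, not the pyllars source: `encode_sequence_sketch` and the two-feature `encoding_map` are hypothetical, and the sketch stacks the per-position vectors into a regular 2d array instead of the 1d object array described above, purely to make the layout easy to inspect.

```python
import numpy as np

# Hypothetical two-feature encoding for a three-letter alphabet.
encoding_map = {
    'A': np.array([1.0, 0.0]),
    'C': np.array([0.0, 1.0]),
    'J': np.array([0.0, 0.0]),
}


def encode_sequence_sketch(sequence, encoding_map, flatten=False):
    """Illustrative only; not the pyllars source."""
    # One feature vector per position: shape (len(sequence), n_features).
    encoded = np.stack([encoding_map[c] for c in sequence])
    # Flattening concatenates the per-position vectors end to end.
    return encoded.reshape(-1) if flatten else encoded


per_position = encode_sequence_sketch('ACJ', encoding_map)
flat = encode_sequence_sketch('ACJ', encoding_map, flatten=True)
print(per_position.shape)  # (3, 2)
print(flat.shape)          # (6,)
```

Flattening only works here because every value in the map has the same length; a map with ragged or non-numeric values would make `np.stack` raise an error, which matches the caveat above.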

pyllars.string_utils.human2bytes(s: str) → int

Convert a human-readable byte string to an integer

This code is adapted from: http://goo.gl/zeJZl

Parameters:
  • s (string) – The human-readable byte string
Returns:

num_bytes – The number of bytes

Return type:

int

Examples

>>> human2bytes('1M')
1048576
>>> human2bytes('1G')
1073741824
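The doctests above imply binary prefixes ('1M' is 1024 * 1024 bytes). A minimal sketch of such a parser follows; it is not the pyllars source, and the name `human2bytes_sketch`, the symbol table, and the handling of fractional values are assumptions.

```python
def human2bytes_sketch(s: str) -> int:
    """Illustrative re-implementation; not the pyllars source."""
    # Assumed symbol table: binary prefixes, each a factor of 1024 apart.
    symbols = ('B', 'K', 'M', 'G', 'T', 'P', 'E', 'Z', 'Y')
    s = s.strip()
    # The last character is the unit symbol; everything before it is the number.
    number, symbol = s[:-1], s[-1].upper()
    if symbol not in symbols:
        raise ValueError(f"cannot interpret {s!r} as a byte count")
    return int(float(number) * (1 << (10 * symbols.index(symbol))))


print(human2bytes_sketch('1M'))    # 1048576
print(human2bytes_sketch('1.5K'))  # 1536
```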
pyllars.string_utils.pad_sequence(seq: str, max_seq_len: int, pad_value: str = 'J', align: str = 'end') → str

Pad seq to max_seq_len with value based on the align strategy

If seq is already of length max_seq_len or longer it will not be changed.

Parameters:
  • seq (str) – The character sequence
  • max_seq_len (int) – The maximum length for a sequence
  • pad_value (str) – The value for padding. This should be a single character
  • align (str) – The strategy for padding the string. Valid options are start, end, and center
Returns:

padded_seq – The padded string. In case seq was already long enough or longer, it will not be changed. So padded_seq could be longer than max_seq_len.

Return type:

str
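The sketch below illustrates one plausible reading of the three align strategies. It is not the pyllars source. In particular, it assumes that align names the side on which the original characters stay (so align='end' puts the padding in front), and that for 'center' an odd leftover pad character goes on the left; neither detail is stated above.

```python
def pad_sequence_sketch(seq: str, max_seq_len: int,
                        pad_value: str = 'J', align: str = 'end') -> str:
    """Illustrative only; not the pyllars source."""
    n_pad = max_seq_len - len(seq)
    if n_pad <= 0:
        # Already long enough: returned unchanged, as documented.
        return seq
    if align == 'end':
        # Assumption: the sequence sits at the end, so padding goes in front.
        return pad_value * n_pad + seq
    if align == 'start':
        # Assumption: the sequence sits at the start, so padding goes behind.
        return seq + pad_value * n_pad
    if align == 'center':
        # Assumption: an odd leftover pad character goes on the left.
        n_left = n_pad // 2 + n_pad % 2
        return pad_value * n_left + seq + pad_value * (n_pad - n_left)
    raise ValueError(f"unknown align strategy: {align!r}")


print(pad_sequence_sketch('ACD', 5))                   # JJACD
print(pad_sequence_sketch('ACD', 5, align='start'))    # ACDJJ
print(pad_sequence_sketch('ACDEFG', 3))                # ACDEFG
```

The last call shows the documented caveat: a sequence longer than max_seq_len comes back unchanged, so callers that need a fixed length still have to trim.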

pyllars.string_utils.pad_trim_sequences(seq_vec: Sequence[str], pad_value: str = 'J', maxlen: Optional[int] = None, align: str = 'start') → List[str]

Pad and/or trim a list of sequences to have common length

The procedure is as follows:
  1. Pad the sequence with pad_value
  2. Trim the sequence
Parameters:
  • seq_vec (typing.Sequence[str]) – List of sequences that can have various lengths
  • pad_value (str) – Neutral element with which to pad the sequence. This should be a single character.
  • maxlen (typing.Optional[int]) – Length of padded/trimmed sequences. If None, maxlen is set to the longest sequence length.
  • align (str) – To which end to align the sequences when trimming/padding. Valid options are start, end, and center
Returns:

padded_sequences – The padded and/or trimmed sequences

Return type:

typing.List[str]
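The pad-then-trim procedure can be sketched in a few lines. This is an illustration rather than the pyllars source: `pad_trim_sketch` is a hypothetical name, it covers only the default align='start' case, and it assumes that 'start' means the beginning of each sequence is kept while padding and trimming happen on the right.

```python
from typing import List, Optional, Sequence


def pad_trim_sketch(seq_vec: Sequence[str], pad_value: str = 'J',
                    maxlen: Optional[int] = None) -> List[str]:
    """Illustrative only; handles just the assumed align='start' behaviour."""
    if maxlen is None:
        # As documented: default to the length of the longest sequence.
        maxlen = max(len(seq) for seq in seq_vec)
    # Step 1: pad on the right. Step 2: trim to maxlen.
    return [seq.ljust(maxlen, pad_value)[:maxlen] for seq in seq_vec]


print(pad_trim_sketch(['AC', 'ACDEF', 'ACD']))            # ['ACJJJ', 'ACDEF', 'ACDJJ']
print(pad_trim_sketch(['AC', 'ACDEF', 'ACD'], maxlen=3))  # ['ACJ', 'ACD', 'ACD']
```

With maxlen=None nothing is ever trimmed, because the target length is already the longest input; trimming only takes effect when an explicit, smaller maxlen is given.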

pyllars.string_utils.simple_fill(text: str, width: int = 60) → str

Split text into equal-sized chunks of length width

This is a simplified version of textwrap.fill.

The code is adapted from: http://stackoverflow.com/questions/11781261

Parameters:
  • text (string) – The text to split
  • width (int) – The (exact) length of each line after splitting
Returns:

split_str – A single string with lines of length width (except possibly the last line)

Return type:

string
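A fixed-width split of this kind can be written as a single slice-and-join expression. The sketch below is an assumption about the approach, not the pyllars source, and `simple_fill_sketch` is a hypothetical name. It also contrasts the result with `textwrap.fill`, which breaks at whitespace instead of at exact offsets.

```python
import textwrap


def simple_fill_sketch(text: str, width: int = 60) -> str:
    """Illustrative only; not the pyllars source."""
    # Slice every `width` characters and join the pieces with newlines.
    return '\n'.join(text[i:i + width] for i in range(0, len(text), width))


print(simple_fill_sketch('ACDEFGHIKL', 4))
# ACDE
# FGHI
# KL

# textwrap.fill prefers to break at spaces, so its lines can be shorter than width.
print(textwrap.fill('ab cd ef', 4))
# ab
# cd
# ef
```

Exact-width chunks are convenient for sequence data such as FASTA records, where the text has no whitespace to break on.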

pyllars.string_utils.split(s: str, delimiters: Iterable[str], maxsplit: int = 0) → List[str]

Split s on any of the given delimiters

This code is adapted from: http://stackoverflow.com/questions/4998629/

Parameters:
  • s (string) – The string to split
  • delimiters (list of strings) – The strings to use as delimiters
  • maxsplit (int) – The maximum number of splits (or 0 for no limit)
Returns:

splits – the split strings

Return type:

list of strings
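One common way to split on several delimiters at once is to join the escaped delimiters into a single regular expression. The sketch below takes that approach; it is an assumption about the technique, not the pyllars source, and `split_sketch` is a hypothetical name.

```python
import re
from typing import Iterable, List


def split_sketch(s: str, delimiters: Iterable[str], maxsplit: int = 0) -> List[str]:
    """Illustrative only; not the pyllars source."""
    # Escape each delimiter so characters such as '|' or '.' match literally.
    pattern = '|'.join(map(re.escape, delimiters))
    # For re.split, maxsplit=0 means "no limit", matching the parameter above.
    return re.split(pattern, s, maxsplit=maxsplit)


print(split_sketch('a,b;c d', [',', ';', ' ']))         # ['a', 'b', 'c', 'd']
print(split_sketch('a,b;c d', [',', ';'], maxsplit=1))  # ['a', 'b;c d']
```

Note that adjacent delimiters produce empty strings in the result, exactly as `str.split` does when given an explicit separator.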

pyllars.string_utils.str2bool(s: str) → bool

Convert s to a boolean value, if possible

Parameters:
  • s (string) – A string which may represent a boolean value
Returns:

bool_s – True if s is in _TRUE_STRING, and False otherwise

Return type:

boolean
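The membership test described above can be sketched as follows. The contents of _TRUE_STRING are not listed on this page, so `ASSUMED_TRUE_STRINGS` is a hypothetical stand-in, and the case-insensitive comparison is likewise an assumption.

```python
# Hypothetical stand-in; the real contents of _TRUE_STRING are not listed here.
ASSUMED_TRUE_STRINGS = {'true', 't', 'yes', 'y', '1'}


def str2bool_sketch(s: str) -> bool:
    """Illustrative only; not the pyllars source."""
    # Assumption: surrounding whitespace and letter case are ignored.
    return s.strip().lower() in ASSUMED_TRUE_STRINGS


print(str2bool_sketch('True'))   # True
print(str2bool_sketch('no'))     # False
print(str2bool_sketch('maybe'))  # False
```

Because anything outside the set maps to False, an unrecognized value such as 'maybe' is indistinguishable from an explicit 'no'; callers that need to detect malformed input should validate the string separately.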
pyllars.string_utils.trim_sequence(seq: str, maxlen: int, align: str = 'end') → str

Trim seq to at most maxlen characters using align strategy

Parameters:
  • seq (str) – The (amino acid) sequence
  • maxlen (int) – The maximum length
  • align (str) – The strategy for trimming the string. Valid options are start, end, and center
Returns:

trimmed_seq – The trimmed string. In case seq was already an appropriate length, it will not be changed. So trimmed_seq could be shorter than maxlen.

Return type:

str
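The sketch below illustrates one plausible reading of the three align strategies for trimming. It is not the pyllars source. It assumes that align names the side of the sequence that is kept (so align='end' keeps the last maxlen characters), and that for 'center' an odd surplus character is dropped from the left.

```python
def trim_sequence_sketch(seq: str, maxlen: int, align: str = 'end') -> str:
    """Illustrative only; not the pyllars source."""
    n_drop = len(seq) - maxlen
    if n_drop <= 0:
        # Already short enough: returned unchanged, as documented.
        return seq
    if align == 'end':
        # Assumption: keep the end of the sequence.
        return seq[n_drop:]
    if align == 'start':
        # Assumption: keep the start of the sequence.
        return seq[:maxlen]
    if align == 'center':
        # Assumption: an odd surplus character is dropped from the left.
        n_left = n_drop // 2 + n_drop % 2
        return seq[n_left:n_left + maxlen]
    raise ValueError(f"unknown align strategy: {align!r}")


print(trim_sequence_sketch('ACDEFG', 3))                  # EFG
print(trim_sequence_sketch('ACDEFG', 3, align='start'))   # ACD
print(trim_sequence_sketch('ACDEFG', 3, align='center'))  # DEF
```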

pyllars.string_utils.try_parse_float(s: str) → Optional[float]

Convert s to a float, if possible

Parameters:
  • s (string) – A string which may represent a float
Returns:
  • float_s (float) – A float
  • — OR —
  • None – If s cannot be parsed into a float.
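A parse-or-None helper of this kind is usually a thin wrapper around the built-in conversion. The sketch below assumes that is the case here; `try_parse_float_sketch` is a hypothetical name, not the pyllars source.

```python
from typing import List, Optional


def try_parse_float_sketch(s: str) -> Optional[float]:
    """Illustrative only; not the pyllars source."""
    try:
        return float(s)
    except ValueError:
        # Unparseable input is reported as None instead of raising.
        return None


tokens = ['1.5', 'n/a', '0', '2e3']
parsed: List[float] = []
for token in tokens:
    value = try_parse_float_sketch(token)
    # Compare against None explicitly: 0.0 is a valid result but is falsy.
    if value is not None:
        parsed.append(value)

print(parsed)  # [1.5, 0.0, 2000.0]
```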
pyllars.string_utils.try_parse_int(s: str) → Optional[int]

Convert s to an integer, if possible

Parameters:
  • s (string) – A string which may represent an integer
Returns:
  • int_s (int) – An integer
  • — OR —
  • None – If s cannot be parsed into an int.
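The None return value makes it easy to fall back to a default when input is malformed. The sketch below assumes the helper wraps the built-in `int`, under which a decimal string such as '4.2' does not parse; `try_parse_int_sketch` and the port example are hypothetical, not the pyllars source.

```python
from typing import Optional


def try_parse_int_sketch(s: str) -> Optional[int]:
    """Illustrative only; not the pyllars source."""
    try:
        return int(s)
    except ValueError:
        # Unparseable input is reported as None instead of raising.
        return None


def port_or_default(raw: str, default: int = 8080) -> int:
    # Hypothetical caller: use the default when the setting is not an integer.
    port = try_parse_int_sketch(raw)
    return default if port is None else port


print(try_parse_int_sketch('42'))   # 42
print(try_parse_int_sketch('4.2'))  # None
print(port_or_default('abc'))       # 8080
```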